Ability to skip OCR for huge PDF documents

jacotec · February 19, 2020, 2:46pm

Hi,

Is there any way to limit the OCR of PDFs with Tesseract to a certain number of pages?

I have a few manuals/books in PDF format in my cloud, some with >500 pages, and indexing them with Tesseract takes forever. I don’t really need full-text search in these books. Is there a way to set a limit (max pages) for OCRing these PDFs?

I know about .noindex for excluding whole folders, but I can’t just put all books into one subfolder. Also, I have no control over what other users store.

It would be great if Tesseract could ignore all PDFs with more than, e.g., 50 pages.
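
To illustrate the rule I have in mind, here is a rough sketch in Python. It is only an assumption of how such a check could look: `should_ocr` and `MAX_OCR_PAGES` are made-up names, not existing options of Tesseract or of the indexer, and the page count would have to come from whatever PDF library the indexer already uses.

```python
# Hypothetical limit; not an existing Tesseract or indexer setting.
MAX_OCR_PAGES = 50


def should_ocr(page_count: int, max_pages: int = MAX_OCR_PAGES) -> bool:
    """Return True if a PDF with `page_count` pages should be sent to OCR.

    PDFs with more than `max_pages` pages are skipped entirely.
    """
    return page_count <= max_pages


if __name__ == "__main__":
    # Example page counts: a short scan, a PDF exactly at the limit, a large manual.
    for pages in (12, 50, 500):
        action = "OCR" if should_ocr(pages) else "skip"
        print(f"{pages} pages: {action}")
```

With a check like this the limit applies per file, so it would also cover documents uploaded by other users without needing a .noindex in every folder.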