Index OCRed PDF (using tesseract-ocr) with Elasticsearch

#1

Hello.

In a production Nextcloud deployment (v14.0.3.0) I have recently installed:

I used the basic installation tutorial, plus some other guides, to install Elasticsearch and Tesseract-OCR as services on the server.

Live indexing is working fine, and text files (pdf, docx, xlsx, etc.) are added to the index and seem to work fine, but OCR is not working as I expected: scanned documents saved as PDF are not indexed.

I think this is a limitation in Tesseract or Elasticsearch, because in a test I scanned documents as JPG and they were indexed correctly. But I cannot find out how to do the same with PDF documents. I have reviewed the Nextcloud and plugin documentation. In the Tesseract documentation I haven't found any mention of PDF as an input format (maybe that's the point?).

My company has a lot of scanned documents in PDF, and it would be great to index them without converting them or doing any other processing. Just indexing.

Has anyone achieved indexing of PDFs (scanned documents) with Full text search + Elasticsearch + Tesseract OCR? Is that even possible?

Thank you very much.

#2

Hi, I have the same problem. My own PDFs are not indexed.

#3

Same issue: apparently no OCR is done on PDFs.