Index OCRed PDF (using tesseract-ocr) with elasticsearch

josemaria · November 13, 2018, 11:43am

Hello.

In a production Nextcloud deployment (v14.0.3.0) I have recently installed:

I used the basic installation tutorial, and some other guides, to install Elasticsearch and Tesseract-OCR as services on the server.

Live indexing is working fine and text files (pdf, docx, xlsx, etc.) are added to the index and seem to work fine, but OCR is not working as I expected. Scanned documents in PDF format are not indexed.

I think it is a limitation in Tesseract or Elasticsearch, because I scanned documents as jpg (in a test) and they were indexed correctly. But I cannot find how to do it with PDF documents. I have reviewed the Nextcloud and plugin documentation. In the Tesseract documentation I haven’t found any mention of PDF as input (maybe that’s the point?).

My company has a lot of scanned documents in PDF and it would be great to index them without converting or any other processing. Just indexing.

Has anyone achieved PDF (scanned documents) indexing with Full text search + Elasticsearch + Tesseract OCR? Is that even possible?

Thank you very much.

hermann1514 · November 13, 2018, 3:25pm

Hi, I have the same problem. Own PDFs are not indexed.

Lox · April 20, 2019, 4:39am

Same issue, no OCR on PDF is done apparently

N_M · March 19, 2020, 2:41am

Same issue. Do you guys have a workaround for this?

xcojonny · February 25, 2021, 12:57pm

Same issue, any solutions?

chrissi55 · March 1, 2021, 11:46am

I followed these instructions:

https://www.c-rieger.de/volltextsuche-mit-nextcloud-20-elasticsearch-und-tessaract/

and there the author gives a hint to install the plugin for PDF support using:

/usr/share/elasticsearch/bin/elasticsearch-plugin install ingest-attachment
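
To check afterwards whether the plugin is really there, something like the following might help. It is only a rough sketch: the path is the one from the command above and may differ on your system, has_ingest_attachment is a made-up helper name, and it assumes "elasticsearch-plugin list" prints one plugin name per line. Elasticsearch loads plugins at startup, so it needs a restart after the install before the plugin becomes active.

```shell
#!/bin/sh
# Assumption: default install path taken from the command above; override if yours differs.
ES_PLUGIN_BIN="${ES_PLUGIN_BIN:-/usr/share/elasticsearch/bin/elasticsearch-plugin}"

# Hypothetical helper: succeeds if a line on stdin is exactly "ingest-attachment".
has_ingest_attachment() {
    grep -qx 'ingest-attachment'
}

# Only query the plugin tool when it is actually present and executable.
if [ -x "$ES_PLUGIN_BIN" ]; then
    if "$ES_PLUGIN_BIN" list | has_ingest_attachment; then
        echo "ingest-attachment is installed"
    else
        echo "ingest-attachment is missing"
    fi
fi
```

If it reports the plugin as missing, run the install command again (as root) and restart the Elasticsearch service before re-indexing.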