In a production Nextcloud deployment (v18.104.22.168) I have recently installed:
- Full text search.
- Full text search - Elasticsearch Platform.
- Full text search - Files.
- Full text search - Files - Tesseract OCR.
- Full text search - Bookmarks.
I followed the basic installation tutorial, plus some other guides, to install Elasticsearch and Tesseract OCR as services on the server.
Live indexing works, and text-based files (pdf, docx, xlsx, etc.) are added to the index and seem to be searchable, but OCR is not working as I expected: scanned documents saved as PDF are not indexed.
I think this is a limitation of Tesseract or Elasticsearch, because in a test I uploaded scanned documents as JPG and they were indexed correctly. But I cannot find out how to do the same with PDF documents. I have reviewed the Nextcloud and plugin documentation. In the Tesseract documentation I haven't found any mention of PDF as an input format (maybe that's the point?).
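One way to test that hypothesis outside Nextcloud is to rasterize a scanned PDF into page images and run Tesseract on those. The following is only a minimal sketch of that manual check, not what the Full text search apps do internally. It assumes the `pdftoppm` tool (from poppler-utils) and the `tesseract` command-line program are installed and on the PATH; the function names and the `page` file prefix are hypothetical.

```python
import glob
import subprocess


def rasterize_cmd(pdf_path, prefix, dpi=300):
    # pdftoppm writes one PNG per page, named <prefix>-<page number>.png
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]


def ocr_cmd(image_path, lang="eng"):
    # Using "stdout" as the output base makes tesseract print the recognized text
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_pdf(pdf_path, prefix="page", lang="eng"):
    # Step 1: convert every PDF page to an image (assumes pdftoppm is installed)
    subprocess.run(rasterize_cmd(pdf_path, prefix), check=True)
    # Step 2: OCR each page image, in filename order, and collect the text
    texts = []
    for image in sorted(glob.glob(glob.escape(prefix) + "-*.png")):
        result = subprocess.run(
            ocr_cmd(image, lang), check=True, capture_output=True, text=True
        )
        texts.append(result.stdout)
    return "\n".join(texts)
```

Calling `ocr_pdf("scan.pdf")` on a scanned document (the file name is a placeholder) and getting readable text back would suggest that the missing piece is the PDF-to-image step rather than Tesseract's recognition itself.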
My company has a lot of scanned documents in PDF, and it would be great to index them without converting them or doing any other processing. Just indexing.
Has anyone managed to index PDFs (scanned documents) with Full text search + Elasticsearch + Tesseract OCR? Is that even possible?
Thank you very much.