[ocr] Optical character recognition for your image and pdf files

Nextcloud OCR (optical character recognition) brings OCR for images and PDFs to Nextcloud 10 and 11, using tesseract-ocr and OCRmyPDF.


The app uses tesseract-ocr, OCRmyPDF and PHP's internal message queueing service to process images (png, jpeg, tiff) and PDFs asynchronously. Not all PDF types are currently supported; for more information see here. The output file is saved to the same folder in Nextcloud, so you can search it and copy and paste the text. The source data is not lost. Instead:

  • for a PDF, a copy is saved with an extra text layer.
  • for an image, the result of the OCR processing is saved in a .txt file next to the image (same folder).
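
The two output modes above can be sketched in shell as follows. This is only an illustration: the `_ocr` suffix, the `eng` language and the function names `ocr_target` and `run_ocr` are assumptions made for the example, not the app's actual naming or code. It relies on the standard command forms `ocrmypdf input.pdf output.pdf` and `tesseract image outputbase`.

```shell
#!/bin/sh
# Sketch only: the "_ocr" suffix and function names are assumptions,
# not the naming scheme the app itself uses.

# Print the output path for a given input file.
ocr_target() {
    src=$1
    case "$src" in
        *.pdf|*.PDF) printf '%s_ocr.pdf\n' "${src%.*}" ;;  # PDF: copy with text layer
        *)           printf '%s.txt\n' "${src%.*}" ;;      # image: .txt next to it
    esac
}

# Run the matching tool; the source file is never overwritten.
run_ocr() {
    src=$1
    out=$(ocr_target "$src")
    case "$src" in
        *.pdf|*.PDF) ocrmypdf "$src" "$out" ;;
        # tesseract appends ".txt" to the output base on its own
        *)           tesseract "$src" "${out%.txt}" -l eng ;;
    esac
}
```

For example, `run_ocr /data/scan.pdf` would write `/data/scan_ocr.pdf`, and `run_ocr /data/photo.png` would write `/data/photo.txt`, leaving both originals in place.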

One big feature is the asynchronous OCR processing provided by PHP's internal message queueing system (Semaphore functions), which lets workers handle tasks asynchronously, in parallel with the rest of Nextcloud.

Prerequisites, Requirements and Dependencies

The OCR app has some prerequisites:

  • Nextcloud 10 or higher
  • Linux server as environment (tested with Debian 8, Raspbian and Ubuntu 14.04 (Trusty))
  • OCRmyPDF >v2.x (tested with v4.1.3 (v4 is recommended))
  • tesseract-ocr >v3.02.02 with corresponding language files (e.g. tesseract-ocr-eng)
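
A quick way to check these prerequisites from a shell might look like the sketch below. The helper name `missing_tools` is made up for this example. The version and language queries in the comments use the tools' standard options.

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Typical use on the server:
#   missing_tools ocrmypdf tesseract php
#   ocrmypdf --version
#   tesseract --version
#   tesseract --list-langs    # installed language files, e.g. eng
```

If `missing_tools ocrmypdf tesseract php` prints nothing, all three commands are available, and the version commands can then be compared against the list above.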

Please note: the app does not and will not work with encryption enabled, and the OCRWorker.php script has to be running for the app to work.

For further information see the appstore page.

Will this work with several instances of Nextcloud on the same server, but different php-users (php5-fpm)?

I had made a similar setup, but did the ocrmypdf conversion in a separate directory (out of reach of the php5-fpm), and then moved the converted pdf to the respective Nextcloud folder and updated it with the corresponding username.

Your app would have the advantage of simplicity (covering all files) and automation (I have to place the files to convert into a specific directory).

The way this app works should allow multiple instances (I think only with different web-server users), because it uses an external PHP worker process that runs the ocrmypdf or tesseract command:

sudo -u www-data nohup php /var/www/nextcloud/apps/ocr/worker/OCRWorker.php >/dev/null 2>&1 &

You should be able to change the user for this process by changing the name for the -u option.
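
As a rough sketch of what that could look like with several instances, the helper below only prints the start command for a given user and Nextcloud root, so it can be reviewed before running. The function names and the second user and path in the usage note are hypothetical. I have not tested this with several instances.

```shell
#!/bin/sh
# Sketch: print the worker start command for one instance.
# $1 = web-server user, $2 = Nextcloud root directory (both placeholders).
worker_cmd() {
    user=$1
    nc_root=$2
    printf 'sudo -u %s nohup php %s/apps/ocr/worker/OCRWorker.php >/dev/null 2>&1 &\n' \
        "$user" "$nc_root"
}

# Return success if a worker owned by the given user is already running.
# pgrep -f matches against the full command line, -u restricts by user.
worker_running() {
    pgrep -u "$1" -f OCRWorker.php >/dev/null 2>&1
}
```

`worker_cmd www-data /var/www/nextcloud` prints the command shown above. For a second instance you would pass its own user and path, e.g. `worker_cmd nc-two /var/www/nextcloud2` (made-up names), and `worker_running nc-two` tells you whether that worker is already up.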

The processing uses the /tmp folder for the conversion result. The respective temp file should already exist, since the respective Nextcloud server instance should have created it beforehand. Once processing has completed, the worker starts an ./occ command to copy the temp file to the right directory.

I assume there is nothing that should conflict with several instances.

New release candidate available. Please check it out: https://apps.nextcloud.com/apps/ocr

New approach for cloud environment and many improvements.

Do you also have a version for NC13/14?

Hello,
I’m trying to get OCR working. It is a new Nextcloud installation on a dedicated Linux 18.04 VM.
OCRmyPDF and tesseract are installed and tested. The OCRWorker.php script is not running. It’s not available on my machine. Where can I find it?
Best regards,
Paul

I’m interested in this as well. I’ve finally got Fulltextsearch working with Elasticsearch. I’m using Elasticsearch docker and Nextcloud 18.6 docker respectively, with nextcloud connecting directly to Elasticsearch.

What isn’t clear is how to use Tesseract. I can install it as a Docker container, but it’s not clear how to link it into Nextcloud. Does it need to be installed inside the Nextcloud container?

Currently, I’m running OCRmyPDF on the Nextcloud data directory externally and recopying the output folder structure back into Nextcloud.

Don’t know if that helps the previous poster’s question.

I’ve also had to use ElasticSearch 6.9.1, as 7.0 and later have issues with Fulltextsearch, which is a separate issue.

I would LOVE some help getting this working on v23. I am running NC in a Docker container. OCRmyPDF is working with Tesseract and I installed the Full Text OCR Files app in NC. My objective is to have an NC workflow OCR PDF files that are placed in a specific folder. It would also be great if I could OCR a specific file from the menu. I also have Elastic Search installed (doesn’t seem to work) but I don’t believe that is essential to do what I have described.
Can anyone help?