How to OCR all files at once

xenophil901 · June 23, 2022, 6:16am

Hello everyone,

I have a Nextcloud instance running with several thousand files uploaded. I have set up a workflow that automatically performs OCR upon file upload using the OCR file workflow.

My problem is that I have lots of files that were uploaded before the workflow was installed and I would like to OCR those.

I have already set up an OCR workflow that performs OCR upon tag assignment, which works. With this approach, though, a user must assign the OCR tag to every file that should be OCR’d.

Is there a way to OCR all the old files at once?

Thanks and have a good day!

Phil

sxge · June 23, 2022, 6:27am

As far as I understand it, you’ll need to index them once:

sudo -u www-data php /var/www/nextcloud/occ fulltextsearch:index

I’m currently trying to set it up too, but I ran into a few installation issues, so I’m still troubleshooting at the moment.

j-ed · June 24, 2022, 10:04am

The README of the OCR workflow app explains that the workflow calls an external command-line tool to do the OCR processing. Check how this tool needs to be called on the console to process files. Once you’ve clarified that, you could create a batch script that, for example, uses the find command to locate all .pdf files and pass them to the external OCR tool.
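
A rough sketch of such a batch script, under a few assumptions that are not confirmed anywhere in this thread: the external tool is ocrmypdf (check the README for the tool your version actually uses), it is called as "tool [options] input output", and the data directory is /var/www/nextcloud/data. The function name ocr_all_pdfs and the .ocr-tmp.pdf suffix are made up for illustration. Try it on a copy of a few files first.

```shell
#!/bin/sh
# Sketch: run an OCR command over every PDF below a directory.
# Usage: ocr_all_pdfs <directory> <ocr-command> [options...]
# The OCR command is called as: <ocr-command> [options...] <input> <output>

ocr_all_pdfs() {
    data_dir="$1"
    shift
    # Find PDFs (case-insensitive), skipping our own temporary outputs.
    # Each file is handled by a small inline sh script; the remaining
    # arguments ("$@") are forwarded to it as the OCR command.
    find "$data_dir" -type f -iname '*.pdf' ! -name '*.ocr-tmp.pdf' \
        -exec sh -c '
            pdf="$1"
            shift
            tmp="${pdf%.*}.ocr-tmp.pdf"
            # Write to a temporary file and replace the original only
            # if the OCR command succeeded.
            if "$@" "$pdf" "$tmp"; then
                mv -- "$tmp" "$pdf"
            else
                rm -f -- "$tmp"
                echo "OCR failed: $pdf" >&2
            fi
        ' sh {} "$@" \;
}

# Example call (assumed tool and path, run as the web server user):
#   ocr_all_pdfs /var/www/nextcloud/data ocrmypdf --skip-text
# Afterwards, let Nextcloud pick up the changed files:
#   sudo -u www-data php /var/www/nextcloud/occ files:scan --all
```

With ocrmypdf, the --skip-text option leaves pages that already contain text untouched, so files that went through the upload workflow should not be processed a second time. Because the files are changed outside of Nextcloud, the files:scan step is needed so that sizes and modification times in the file cache match again.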