Hi, I have recently been playing around with Nextcloud and moved our law firm onto a self-hosted instance to replace SugarSync. We have over 60k files, including docs, PDFs, images, etc., and we deal with multiple languages. The first problem was making sure we have a reliable way to access files and sync them easily, which I believe has been accomplished. I am now thinking about setting up a federated server at our office instead of on AWS, or else a second AWS instance, that does more work on the files. The work would be:
ocrmypdf (or tika against the PDF files)
tesseract language detection and text file output of any images (including deskewing, rotating, etc.)
and identification of any PDFs or images in other languages, outputting the English translation to a similarly named .txt file
I don’t need this to be configurable through the admin interface, and I could run these on the same NC box, but I was wondering if there are any thoughts out there on how to accomplish this. I would even be willing to go into maintenance mode, process the local files on a schedule or as a constant background process, and then run an occ files:scan to clean up and reindex, if it could all be done this way. I could also do it via some search interface, but so far it seems that if I did this type of work, the built-in simple search would pick up these types of files.
Ideally, I think it would be a Python script that I could run alongside OCRmyPDF, building on its sample batch script (run in parallel) and adding these other functional pieces.
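A minimal sketch of what such a script could look like, assuming the `ocrmypdf` and `tesseract` command-line tools are installed and on the PATH. The helper names, the suffix lists, the web user and the path to `occ` are illustrative assumptions, not anything from Nextcloud or OCRmyPDF. Language detection and translation are left out because no tool for them has been settled on in this thread.

```python
import subprocess
from pathlib import Path
from typing import List, Optional

PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

# Assumption: web user and occ path vary per install; adjust before use.
RESCAN_COMMAND = ["sudo", "-u", "www-data", "php",
                  "/var/www/nextcloud/occ", "files:scan", "--all"]


def sidecar_path(src: Path) -> Path:
    """Map 'scan.pdf' to 'scan.pdf.txt' so the text sits next to its source."""
    return src.with_name(src.name + ".txt")


def build_command(src: Path, languages: str = "eng") -> Optional[List[str]]:
    """Return the OCR command for one file, or None for unsupported types."""
    suffix = src.suffix.lower()
    if suffix in PDF_SUFFIXES:
        # OCR in place, skip pages that already have text, write a sidecar.
        return ["ocrmypdf", "--skip-text", "--deskew", "--rotate-pages",
                "-l", languages, "--sidecar", str(sidecar_path(src)),
                str(src), str(src)]
    if suffix in IMAGE_SUFFIXES:
        # tesseract appends '.txt' to the output base on its own.
        return ["tesseract", str(src), str(src), "-l", languages, "txt"]
    return None


def pending_files(root: Path) -> List[Path]:
    """List supported files that do not have a sidecar text file yet."""
    wanted = PDF_SUFFIXES | IMAGE_SUFFIXES
    return [p for p in sorted(root.rglob("*"))
            if p.is_file()
            and p.suffix.lower() in wanted
            and not sidecar_path(p).exists()]


def run(root: Path, languages: str = "eng", dry_run: bool = True) -> None:
    """Process every pending file, then ask Nextcloud to rescan."""
    for src in pending_files(root):
        cmd = build_command(src, languages)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False: one bad file should not stop the whole batch.
            subprocess.run(cmd, check=False)
    if not dry_run:
        subprocess.run(RESCAN_COMMAND, check=False)
```

Calling `run(Path("/path/to/nextcloud/data"), "eng+spa")` (the path is a placeholder) only prints the commands; pass `dry_run=False` to execute them. Because a file is skipped once its sidecar exists, a changed document would need its sidecar deleted to be processed again.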
Any thoughts out there? Or a thread to point me to?
There are Full Text Search (via Elasticsearch) and Tesseract OCR apps available in Nextcloud itself to do the extraction part for you. Then you can pipe the result to whatever you want.
So Elasticsearch is super computationally and resource intensive. I’ve been using the trial cloud and it’s a beast. Paying for it is $25/month, and even if you build an AWS instance you need 16GB for it to run, making it a $10 a month expense for searching. The existing Nextcloud I built is an AWS instance with 1GB RAM and a 1GHz processor, making it free for a year and $2/month with reserved resources.
As for the Tesseract OCR app, it is installed but doesn’t seem to have much documentation on when it runs or what it does with the indexed search data. I’d like to learn more. Is there an in-depth discussion of it?
I rebuilt version 3.05 for Mac OS X, which gives me 102 supported languages with training data versus 2 languages for 4.0, along with custom code that watches folders. It’s a resource hog, but I don’t mind running it locally if I get full image and PDF OCR and sidecar .txt files that are useful for metadata.
When you say pipe the result, do you mean that I run Tesseract or the search from the command line and pipe the result elsewhere? What happens after I configure the settings on the admin pages?
And big thanks for your time to answer.
One thing to remember is that it is all proportional. If you have a million or 10 million files, then yeah, you need a beast. But for 60 000+ files, I would think a quad core with 8 GB of memory should suffice for building your indices. The trade-off for saving money is speed, so if you are willing to wait a little longer during the initial build, it does not have to be a beast. As I said, the initial build might take a bit longer, but after that the machine would sit idle unless you upload 1000+ files each day.
How it works in Nextcloud, from what I have seen in my PoC environment, is that you build the initial index with Elasticsearch, which can take a long time or not, depending on the number of files. I had 5000 files in my PoC, of different types: PDF, doc, odt, Excel, etc. The indexing took roughly a minute to complete. This was with an Elasticsearch machine with 4 cores and 8GB memory. After this, the normal Nextcloud cron job, which runs every 15 minutes, updates the index with all the files added in the last 15 minutes. See this small thread for more info.
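To illustrate the incremental idea only: the sketch below is not how Nextcloud or the Full Text Search app is actually implemented, and the function name is made up for this example. It just shows how a periodic job can limit itself to files modified since its previous run, which is why the machine stays mostly idle after the initial build.

```python
from pathlib import Path
from typing import List


def changed_since(root: Path, last_run: float) -> List[Path]:
    """Return files under root modified after the last_run Unix timestamp."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_mtime > last_run
    )
```

A scheduled job would record the current time before each pass and hand the previous value to `changed_since` on the next one, so each pass only touches what arrived in between.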
I can only presume Tesseract works the same way, but I am not sure.
I hope my explanation above covered this. Please note that the Elasticsearch index only provides full-text search, nothing more. So if you want to do more with it, I think you will need to run a third-party (middleware) application outside of Nextcloud. HTH.
Thanks, your information helped a lot.
If your questions are answered to your satisfaction, please mark them as solved to help future newcomers to this forum.
Glad I could help.