Hi, I have recently been playing around with Nextcloud and moved our law firm onto a self-hosted instance to replace SugarSync. We have over 60k files (docs, PDFs, images, etc.) and we deal with multiple languages. The first problem was making sure we have a reliable way to access files and keep them easy to sync, which I believe has been accomplished. I am now thinking about setting up a federated server at our office (rather than on AWS), or else another AWS instance, that does more work on the files. The work would be:
OCRmyPDF (or Tika) against the PDF files
Tesseract language detection and text file output for any images (including deskewing, rotating, etc.)
identification of any PDFs or images in other languages, outputting the English translation to a similarly named .txt file
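To make the three steps above concrete, here is a minimal sketch of how a script might decide what to run for each file. The `ocrmypdf` and `tesseract` flags shown are real options, but the language list `eng+spa`, the `.en.txt` naming scheme, and the helper names `ocr_command` and `translation_path` are assumptions for illustration only. The translation step itself is deliberately left out, since the post does not name a translation tool.

```python
from pathlib import Path
from typing import List, Optional

PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def ocr_command(path: Path, languages: str = "eng+spa") -> Optional[List[str]]:
    """Return the command line to OCR one file, or None if it is not a PDF or image.

    The language list is an assumption; replace it with the packs you have installed.
    """
    suffix = path.suffix.lower()
    if suffix in PDF_SUFFIXES:
        # Same input and output path: ocrmypdf rewrites the file in place.
        # --skip-text leaves pages that already have a text layer untouched.
        return ["ocrmypdf", "--skip-text", "--deskew", "--rotate-pages",
                "-l", languages, str(path), str(path)]
    if suffix in IMAGE_SUFFIXES:
        # tesseract appends ".txt" to the output base name.
        # --psm 1 enables orientation and script detection (needs the osd data file).
        return ["tesseract", str(path), str(path.with_suffix("")),
                "-l", languages, "--psm", "1"]
    return None


def translation_path(path: Path, target: str = "en") -> Path:
    """Hypothetical naming scheme for the translated text: scan.jpg -> scan.en.txt."""
    return path.with_name(f"{path.stem}.{target}.txt")
```

Each returned list could be handed to `subprocess.run`. Keeping the command building separate from the execution makes it easy to dry-run against a copy of the data directory first.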
I don’t need this to be configurable through the admin interface, and I could run these on the same NC box, but I was wondering if there are any thoughts out there on how to accomplish this. I would even be willing to go into maintenance mode and process the local files on a schedule, or run a constant background process, followed by a files:scan, cleanup and reindex, if it could all be done that way. I could also do it via some search interface, but so far it seems that if I did this type of work, the built-in simple search would pick up these types of files.
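The rescan step mentioned above could be scripted roughly as follows, so that Nextcloud notices files written directly to the data directory. `occ files:scan` with `--all` or `--path` is the real command; the install path `/var/www/nextcloud/occ`, the `www-data` user, and the helper names are assumptions that will differ between installs.

```python
import subprocess
from typing import List, Optional


def scan_command(path: Optional[str] = None,
                 occ: str = "/var/www/nextcloud/occ",  # assumed install location
                 web_user: str = "www-data") -> List[str]:  # assumed web server user
    """Build the occ call that re-indexes files changed outside Nextcloud."""
    cmd = ["sudo", "-u", web_user, "php", occ, "files:scan"]
    # --path takes the form /<user>/files/<folder>; --all scans every user.
    cmd.append(f"--path={path}" if path else "--all")
    return cmd


def rescan(path: Optional[str] = None) -> None:
    """Run the scan and raise CalledProcessError if occ reports a failure."""
    subprocess.run(scan_command(path), check=True)
```

With 60k files, scanning only the folder that was just processed (for example `rescan("/alice/files/Scans")`, where the user and folder are placeholders) is likely to be much quicker than `--all`.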
Ideally I think it would be some Python script that I could run in parallel, along the lines of the OCRmyPDF sample batch script, with these other functional pieces added.
Any thoughts out there? Or a thread to point me to?
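A parallel batch run might look something like the sketch below. It is not the OCRmyPDF sample script itself, just one way to get the same effect with a thread pool from the standard library. The worker count, the `ocr_one` and `ocr_tree` names, and the choice to OCR in place are assumptions. `--jobs 1` keeps each `ocrmypdf` process to a single core so the pool does not oversubscribe the machine.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def ocr_one(pdf, run=subprocess.run):
    """OCR a single PDF in place and return (path, exit code)."""
    cmd = ["ocrmypdf", "--skip-text", "--jobs", "1", str(pdf), str(pdf)]
    result = run(cmd, capture_output=True, text=True)
    return pdf, result.returncode


def ocr_tree(root, workers=4, run=subprocess.run):
    """OCR every PDF under root in parallel; return the paths that failed."""
    pdfs = sorted(p for p in Path(root).rglob("*") if p.suffix.lower() == ".pdf")
    # Threads are enough here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: ocr_one(p, run), pdfs))
    return [pdf for pdf, code in results if code != 0]
```

The `run` parameter exists only so the function can be exercised with a stand-in before pointing it at real files. The returned list of failures can be logged and retried on the next scheduled run.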