Fulltextsearch/OCR: strange problem with some PDFs and subfolders

p3pp0 · January 19, 2019, 5:48pm

Hey Guys,

I have successfully set up Nextcloud with Elasticsearch, OCR, etc. However, there are still two strange problems:

1) Lots of PDFs cause an error
I was surprised that some PDFs were not found when using fulltextsearch with keywords that I know they contain. This does not happen with just a single faulty PDF, but with lots of PDF files (roughly half of them, by my estimate). So I ran the following tests:
a) Create a PDF file on my own using Word --> working & recognized
b) Save the problematic PDF as a JPG file --> working & recognized
c) Run sudo -u www-data php /var/www/nextcloud/occ fulltextsearch:live to observe what happens. I got the following error:
┌─ Errors ────
│ Error: 6/6
│ Index: files:60301
│ Exception: Elasticsearch\Common\Exceptions\ServerErrorResponseException
│ Message: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [content]]; nested: TikaException[Unable to extract PDF content]; nested: IOException[java.util.zip.DataFormatException: invalid
│ bit length repeat]; nested: DataFormatException[invalid bit length repeat];

Any ideas on how to fix that?
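
In case it helps with narrowing this down: "invalid bit length repeat" is a zlib/deflate decoding error, so one thing that could be checked is whether the affected PDFs contain compressed streams that fail to inflate. Below is a minimal sketch using only the Python standard library. The function name `count_bad_flate_streams` and the naive regex scan are my own assumptions for illustration; this is not a real PDF parser and it ignores encrypted or otherwise filtered streams.

```python
import re
import zlib
from typing import Tuple

# Naive scan for "stream ... endstream" bodies (assumption: not a full PDF parser).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def count_bad_flate_streams(pdf_bytes: bytes) -> Tuple[int, int]:
    """Return (checked, failed) for streams that look zlib-compressed."""
    checked = failed = 0
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        # zlib-wrapped data normally starts with byte 0x78; skip everything else.
        if not data.startswith(b"\x78"):
            continue
        checked += 1
        try:
            # decompressobj tolerates trailing bytes after the end of the stream.
            zlib.decompressobj().decompress(data)
        except zlib.error:
            failed += 1
    return checked, failed
```

It could be called with the raw bytes of a suspect file, for example `count_bad_flate_streams(open(path, "rb").read())`, where `path` points to one of the PDFs that fails to index. A non-zero failure count would suggest the file itself is damaged rather than the indexing setup.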

2) I use “external storage” of type “local”. A script collects attachments from an IMAP server and saves them in this path. Recognition works perfectly fine as long as the file is not in a subfolder (filesystem permissions 0777)… Any ideas?
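
To rule out a permissions problem further down the tree, a check like the following could be run as the web server user (www-data in my setup). It is only a sketch under the assumption that the subfolder issue is permission-related, which I have not confirmed; the function name `find_unreadable` is made up for illustration.

```python
import os
from typing import List


def find_unreadable(root: str) -> List[str]:
    """List directories and files under root that the current user cannot read."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            # Directories need both read and execute (traverse) permission.
            if not os.access(path, os.R_OK | os.X_OK):
                problems.append(path)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.access(path, os.R_OK):
                problems.append(path)
    return problems
```

An empty result for the external storage path would suggest that permissions are not the cause and that the problem lies elsewhere, for example in how the subfolders are scanned or indexed.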

I really appreciate your help!
Thank you!