Search TIFF Files

How can I search in TIFF files?
I installed Tesseract (apt-get install tesseract-ocr) and it installed without errors.
After that I did a full rescan of the files.

Can I see all the indexed words of a document?

You also have to enable the Image filter in the admin interface.

You can search within a document with nextant:pick, but I do not offer a way to see the extracted content of a document. Do you see a real use for this feature?

Note that nextant:pick might display some info about Tesseract in the nextant_attr_x_parsed_by attribute.

I just activated it and did a rescan.
Now I see more in the pick output, but nothing about Tesseract:

nextant_attr_x_parsed_by -> org.apache.tika.parser.DefaultParser, org.apache.tika.parser.image.TiffParser

Do I need to activate or configure Tesseract somewhere for Nextant? I only installed it.

This should be:

nextant_attr_x_parsed_by -> org.apache.tika.parser.DefaultParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.jpeg.JpegParser
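
A quick way to check this can be sketched as a small shell helper. The function name has_ocr_parser is hypothetical (it is not part of Nextant or Tika); it only looks for the TesseractOCRParser class name in a nextant_attr_x_parsed_by line copied from the nextant:pick output.

```shell
#!/bin/sh
# has_ocr_parser LINE: succeeds if LINE mentions Tika's Tesseract OCR parser.
# Hypothetical helper for illustration, not a Nextant command.
has_ocr_parser() {
    printf '%s\n' "$1" | grep -q 'org\.apache\.tika\.parser\.ocr\.TesseractOCRParser'
}

# Line reported earlier in this thread, before OCR was working:
without_ocr='nextant_attr_x_parsed_by -> org.apache.tika.parser.DefaultParser, org.apache.tika.parser.image.TiffParser'

if has_ocr_parser "$without_ocr"; then
    echo "OCR parser was used"
else
    echo "OCR parser was NOT used"
fi
```

For the line above this prints "OCR parser was NOT used", which matches the symptom described here.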

You should first test with a JPEG, as TIFF might need an additional library. If I remember correctly, I installed tesseract-ocr from GitHub, not from the package.

There is some documentation on the internet about Solr, Tika and Tesseract, but if you want to write up your experience with Tesseract on Nextant’s wiki, be my guest!

There are a few test files at the root of Nextant’s app (test.jpg, test.tiff, test.pdf); the keyword in those images is ‘success’.

Hmm, I didn’t get it working …

nextant_attr_x_parsed_by -> org.apache.tika.parser.DefaultParser, org.apache.tika.parser.jpeg.JpegParser
This is the output for the JPG after running the index again.

I compiled Tesseract fresh from git and downloaded the language files.
A manual run,
/usr/local/bin/tesseract test.jpg test
works fine for both TIFF and JPG; only the PDF file can’t be OCR’d: I get some warnings and the result is empty.
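
The manual check can also be scripted. This is only a sketch: check_ocr is a hypothetical helper name, and it assumes the script is run from the directory holding Nextant's test images, which are said in this thread to contain the keyword ‘success’. It passes "stdout" as the output base so Tesseract prints the recognised text instead of writing a .txt file.

```shell
#!/bin/sh
# check_ocr IMAGE KEYWORD: run Tesseract on IMAGE and look for KEYWORD
# (case-insensitive) in the recognised text. Hypothetical helper.
check_ocr() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found in PATH"
        return 2
    fi
    if tesseract "$1" stdout 2>/dev/null | grep -qi "$2"; then
        echo "$1: found '$2'"
    else
        echo "$1: '$2' not found"
        return 1
    fi
}

# Assumption: test.jpg and test.tiff are in the current directory.
for f in test.jpg test.tiff; do
    if [ -f "$f" ]; then
        check_ocr "$f" success || echo "OCR check failed for $f"
    fi
done
```

If this succeeds on the command line but Nextant still shows no OCR parser, the problem is more likely on the Solr/Tika side than in Tesseract itself.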

But in Nextcloud/Nextant the OCR isn’t used … Is there a setting somewhere, or something else, so that Nextant can use the OCR?

And a second question: is it possible to specify the language, I mean the -l parameter for Tesseract?

Oh, it needs a reboot (I don’t know if it was really needed, but after a reboot I found ‘success’!) :smiley:

The only problem is that I need to switch the language for searching (because of the German umlauts ae, ue, oe …).

Where can I do this?

It might not need a full reboot, just a restart of Solr :smiley:

There is no option right now to select the language, but what happens right now when you search?

The search seems fine now.
But the OCR doesn’t recognise the umlauts (ö, ä, ü),
and “Hütte” is OCR’d as “Hiutte”.

If I run Tesseract manually with the parameter -l deu, it works fine :smiley:
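
To compare the two results side by side, something like the following sketch could be used. The helper name ocr_lang and the file name scan.tif are placeholders made up for illustration; -l is Tesseract's language option, and the deu language data has to be installed for the second run to work.

```shell
#!/bin/sh
# ocr_lang IMAGE LANG: print the text Tesseract recognises in IMAGE using
# the LANG language data (for example eng or deu). Hypothetical helper.
ocr_lang() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found in PATH" >&2
        return 2
    fi
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 1
    fi
    tesseract "$1" stdout -l "$2" 2>/dev/null
}

# scan.tif is a placeholder; use any image that contains umlauts.
if [ -f scan.tif ]; then
    echo "--- eng ---"
    ocr_lang scan.tif eng || echo "(eng run failed)"
    echo "--- deu ---"
    ocr_lang scan.tif deu || echo "(deu run failed)"
fi
```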

I will try to switch the two language files; maybe it works …

Switching the language files works … the index is now generated with the German language file.

But there is another problem:
the search itself doesn’t work with German characters.

If I manually start a search
sudo -u www-data php occ nextant:pick --search hütte 27814
I get an OK,

but if I search on the web page … nothing is found.

Should I start a new thread for this?

Create an issue, and add a text file as an example.

Did you install the German language pack for Tesseract?

apt-get install tesseract-ocr-deu
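
After installing a language pack, one way to confirm that Tesseract actually sees it is to look at the output of tesseract --list-langs. The sketch below uses a hypothetical helper, lang_in_list, which simply checks whether a language code appears as a whole line in that output.

```shell
#!/bin/sh
# lang_in_list LANG: succeeds if LANG appears as a whole line on stdin,
# which is expected to be the output of `tesseract --list-langs`.
# Hypothetical helper for illustration.
lang_in_list() {
    grep -qxF "$1"
}

if command -v tesseract >/dev/null 2>&1; then
    # 2>&1 because some Tesseract versions print the list to stderr
    if tesseract --list-langs 2>&1 | lang_in_list deu; then
        echo "deu is available"
    else
        echo "deu is missing"
    fi
else
    echo "tesseract not found in PATH"
fi
```

Note that a Tesseract compiled from source (for example under /usr/local) may look for its language data in a different directory than the packaged version, so the check should be run with the same binary that Solr/Tika uses.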
