Search TIFF Files

Andreas_Steibl · December 22, 2016, 6:50pm

How can I search in TIFF files?
I installed Tesseract (apt-get install tesseract-ocr) and it installed without errors.
After that I did a full rescan of the files.

Can I see all the indexed words of a document?

Cult · December 22, 2016, 7:01pm

You also have to enable the Image filter in the Admin Interface.

You can search within a document using nextant:pick, but I do not offer a way to see the extracted content of a document. Do you see a real use for this feature?

Note that nextant:pick might display some information about Tesseract in the nextant_attr_x_parsed_by attribute.

Andreas_Steibl · December 22, 2016, 7:11pm

I just activated it and did a rescan.
Now I see more in the pick output, but nothing about Tesseract:

nextant_attr_x_parsed_by -> org.apache.tika.parser.DefaultParser, org.apache.tika.parser.image.TiffParser

Do I need to activate or configure Tesseract somewhere for Nextant? I only installed it.

Cult · December 22, 2016, 7:25pm

This should be:

nextant_attr_x_parsed_by -> org.apache.tika.parser.DefaultParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.jpeg.JpegParser

You should first test with a JPEG, as TIFF might need an additional library. If I remember correctly, I installed tesseract-ocr from GitHub, not from the package.

There is some documentation on the internet about Solr, Tika and Tesseract, but if you want to write up your experience with Tesseract on Nextant’s wiki, be my guest!
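
To check quickly whether OCR took part in parsing a file, the value of nextant_attr_x_parsed_by can be matched against the Tesseract parser class name quoted above. This is only a minimal shell sketch: the function name uses_tesseract is hypothetical (it is not part of Nextant), and the value is assumed to be pasted by hand from the nextant:pick output.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nextant): succeeds when a
# nextant_attr_x_parsed_by value lists Tika's Tesseract parser.
uses_tesseract() {
  case "$1" in
    *org.apache.tika.parser.ocr.TesseractOCRParser*) return 0 ;;
    *) return 1 ;;
  esac
}

# Value pasted by hand from the nextant:pick output of the TIFF file.
parsed_by='org.apache.tika.parser.DefaultParser, org.apache.tika.parser.image.TiffParser'

if uses_tesseract "$parsed_by"; then
  echo "OCR was used"
else
  echo "OCR was not used"
fi
```

With the TIFF value shown here, the parser list contains no TesseractOCRParser entry, so the script reports that OCR was not used.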

Cult · December 22, 2016, 7:27pm

There are a few test files at the root of Nextant’s app (test.jpg, test.tiff, test.pdf); the keyword in those images is ‘success’.

Andreas_Steibl · December 23, 2016, 7:25am

Hmm, I didn’t get it working …

nextant_attr_x_parsed_by -> org.apache.tika.parser.DefaultParser, org.apache.tika.parser.jpeg.JpegParser
This is the output for the JPG after running the index again.

I compiled Tesseract fresh from git and downloaded the language files. A manual run of
/usr/local/bin/tesseract test.jpg test
works fine for both TIF and JPG. Only the PDF file can’t be OCRed: I get some warnings and the result is empty.

But in Nextcloud/Nextant the OCR isn’t used … Is there a setting somewhere, or something else, so that Nextant can use the OCR?
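
The manual check can be wrapped in a small function that runs tesseract on an image and looks for a keyword in the recognised text. This is a rough sketch rather than anything Nextant does itself: the name ocr_contains is hypothetical, and eng as the default language is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: OCR an image with tesseract and check whether the
# recognised text contains a keyword (case-insensitive).
# Usage: ocr_contains IMAGE KEYWORD [LANG]
ocr_contains() {
  image=$1
  keyword=$2
  lang=${3:-eng}   # assumption: fall back to English when no language is given
  tmpdir=$(mktemp -d) || return 2
  # tesseract writes its result to <outputbase>.txt
  if tesseract "$image" "$tmpdir/out" -l "$lang" >/dev/null 2>&1 &&
     grep -qi -- "$keyword" "$tmpdir/out.txt"; then
    status=0
  else
    status=1
  fi
  rm -rf "$tmpdir"
  return "$status"
}
```

Run from the directory holding the test files, `ocr_contains test.jpg success && echo found` checks the JPEG, and a third argument such as deu is passed through as the -l parameter.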

Andreas_Steibl · December 23, 2016, 8:13am

And a second question: is it possible to specify the language? I mean the -l parameter for tesseract.

Andreas_Steibl · December 23, 2016, 2:57pm

Oh, it needed a reboot (I don’t know if it was really necessary, but after a reboot I found the ‘success’!!!)

The only problem is that I need to switch the language for searching (because of the German umlauts ae, ue, oe …).

Where can I do this?

Cult · December 23, 2016, 3:09pm

It might not need a full reboot, just a restart of Solr.

There is no option right now to select the language, but what happens at the moment when you search?

Andreas_Steibl · December 23, 2016, 3:11pm

The search seems fine now.
But the OCR doesn’t recognise the umlauts (ö, ä, ü),
and “Hütte” is recognised as “Hiutte”.

If I run tesseract manually with the parameter -l deu, it works fine.

I will try to swap the two language files - maybe that works …

Andreas_Steibl · December 23, 2016, 3:20pm

OK,
swapping the language files works … the index is now generated with the German language file.

But there is another problem:
the search itself doesn’t work with German characters.

If I start a search manually with
sudo -u www-data php occ nextant:pick --search hütte 27814
I get an OK,

but if I search on the web page … nothing is found.

Should I start a new thread for this?

Cult · December 23, 2016, 3:37pm

Create an issue and attach a text file as an example.

Sanook · January 24, 2017, 12:25am

Did you install the German language data for Tesseract?

apt-get install tesseract-ocr-deu
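
Whether the German data is actually visible to Tesseract can be checked with tesseract --list-langs, which prints the installed language codes one per line. A minimal sketch, with a hypothetical function name:

```shell
#!/bin/sh
# Hypothetical helper: succeeds when tesseract lists the given language code.
# Usage: has_tesseract_lang deu
has_tesseract_lang() {
  # Some tesseract versions print the list on stderr, so merge both streams.
  # grep -x only matches a line that consists of exactly the language code.
  tesseract --list-langs 2>&1 | grep -qx -- "$1"
}
```

For example, `has_tesseract_lang deu && echo "German data found"` confirms that -l deu can be used. A Tesseract built from git looks for its language files in its own tessdata directory, so a package installed through apt-get may not show up there.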