How can I search in TIFF files?
I installed Tesseract (apt-get install tesseract-ocr).
It installed without errors.
After that I did a full rescan of the files.
Can I see all indexed words of a document?
You also have to enable the Image filter in the admin interface.
You can search within a document with nextant:pick, but I do not offer a way to see the extracted content of a document. Do you see a real use for this feature?
Note that nextant:pick might display some information about Tesseract in the nextant_attr_x_parsed_by attribute.
I just activated it and did a rescan.
Now I see more in the pick output, but nothing about Tesseract:
nextant_attr_x_parsed_by -> org.apache.tika.parser.DefaultParser, org.apache.tika.parser.image.TiffParser
Do I need to activate or configure Tesseract somewhere for Nextant? I only installed it.
This should be:
nextant_attr_x_parsed_by -> org.apache.tika.parser.DefaultParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.jpeg.JpegParser
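A quick way to check this from the command line is to filter the nextant:pick output for the Tesseract parser class. This is only a sketch: it assumes nextant:pick prints the attribute on a single line in the form shown above, and the helper name `ocr_was_used` and the file ID are placeholders, not part of Nextant.

```shell
#!/bin/sh
# Hypothetical helper: reads nextant:pick output on stdin and succeeds only
# if the Tesseract parser is listed in the nextant_attr_x_parsed_by line.
ocr_was_used() {
    grep 'nextant_attr_x_parsed_by' | grep -q 'TesseractOCRParser'
}

# Usage (replace <fileid> with a real file ID; run from the Nextcloud root):
#   sudo -u www-data php occ nextant:pick <fileid> | ocr_was_used && echo "OCR used"
```

If the helper fails for a JPEG that clearly contains text, Tika most likely did not find a usable Tesseract binary.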
You should first test with a JPEG, as TIFF might need an additional library. If I remember correctly, I installed tesseract-ocr from GitHub, not from the package.
There is some documentation on the internet about Solr, Tika and Tesseract, but if you want to write up your experience with Tesseract on Nextant's wiki, be my guest!
There are a few test files at the root of Nextant's app (test.jpg, test.tiff, test.pdf); the keyword in those images is "success".
Hmm, I didn't get it working…
nextant_attr_x_parsed_by -> org.apache.tika.parser.DefaultParser, org.apache.tika.parser.jpeg.JpegParser
This is the output for the JPG after running the index again.
I compiled Tesseract fresh from git and downloaded the language files, and a manual run of
/usr/local/bin/tesseract test.jpg test
works fine for both TIFF and JPG. Only the PDF file can't be OCRed: I get some warnings and the result is empty.
But in Nextcloud/Nextant the OCR isn't used… Is there a setting or something else somewhere so that Nextant can use the OCR?
And a second question: is it possible to specify the language, i.e. the -l parameter for Tesseract?
Oh, it needs a reboot (I don't know if it was really needed, but after a reboot I found the "success"!).
The only problem is that I need to switch the language for searching (because of the German umlauts ae, ue, oe…).
Where can I do this?
It might not need a full reboot, just a restart of Solr.
There is no option right now to select the language, but what happens right now when you search?
The search seems fine now.
But the OCR doesn't recognise the umlauts ö, ä, ü,
and "Hütte" is OCRed as "Hiutte".
If I run Tesseract manually with the parameter -l deu, it works fine.
I will try to switch the two language files; maybe that works…
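For anyone repeating the manual test, here is a minimal sketch of checking that the German data is visible to Tesseract before running it with -l deu. The helper name `has_lang` and the file names are placeholders; `--list-langs` and `-l` are standard Tesseract options, and older versions print the language list on stderr, hence the `2>&1`.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of `tesseract --list-langs` on stdin
# and succeeds if the given language code appears on a line of its own.
has_lang() {
    grep -qxF -- "$1"
}

# Usage (scan.tif and out are placeholder names; the text lands in out.txt):
#   tesseract --list-langs 2>&1 | has_lang deu && tesseract scan.tif out -l deu
#   cat out.txt
```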
Okay,
switching the language files works… the index is now generated with the German language file.
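A minimal sketch of what such a switch could look like, assuming it means copying the German data over the default English file so that callers which cannot pass -l still get German. This is a workaround, not a Nextant feature; the helper name `use_lang_as_default` is hypothetical, and the tessdata directory depends on how Tesseract was installed.

```shell
#!/bin/sh
# Hypothetical helper: make Tesseract's default "eng" data file contain
# another language, keeping a one-time backup of the original file.
# $1 = tessdata directory (varies by install), $2 = language code, e.g. deu
use_lang_as_default() {
    dir=$1
    lang=$2
    # Refuse to do anything if the requested language file is missing.
    [ -f "$dir/$lang.traineddata" ] || return 1
    # Back up the original English data once; never overwrite the backup.
    [ -f "$dir/eng.traineddata.orig" ] || cp "$dir/eng.traineddata" "$dir/eng.traineddata.orig" || return 1
    cp "$dir/$lang.traineddata" "$dir/eng.traineddata"
}

# Usage (the path is an assumption for a source build; check your system):
#   sudo sh -c '. ./swap.sh; use_lang_as_default /usr/local/share/tessdata deu'
```

To undo the switch, copy eng.traineddata.orig back over eng.traineddata, then restart Solr and reindex.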
But there is another problem:
the search itself doesn't work with German characters.
If I start a search manually with
sudo -u www-data php occ nextant:pick --search hütte 27814
I get an OK,
but if I search on the web page… nothing is found.
Should I start a new thread for this?
Did you install the German language pack for Tesseract?
apt-get install tesseract-ocr-deu