Nextant: How to change the OCR language - config file not found

Hi,

The GitHub page explains how to change the OCR language.

Unfortunately, I cannot work out how to change the "TesseractOCRConfig.properties" file. Where is this config file located? The instructions only say:

vim /opt/solr/contrib/extraction/lib/tika-parsers-1.13.jar

open the file TesseractOCRConfig.properties

change

language=eng

Any ideas?

Thanks a lot.
Tombar

A friend of mine did not change the language to German and it worked for him too. :slight_smile:

have you tried looking here?

I think you can specify the language with the -l lang option on the command line.

Hm, this does not help me (or I do not understand it). Yes, I already saw that the language can be passed as a parameter on the command line. But how does this help me? I am not starting Tesseract directly. The app or Solr is calling it.

Thanks
Tombar

This is what I was able to find.

You can find out which languages are installed by running
tesseract --list-langs. Your output will be something like this:

List of available languages (4):
eng
deu
equ
osd
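To check for one specific language from a script rather than by eye, the list can be piped through a small filter. This is only a sketch: `has_tess_lang` is a made-up helper name, and the `2>&1` in the usage line assumes that some Tesseract versions print the list on stderr instead of stdout.

```shell
# Hypothetical helper (not part of Tesseract): reads the output of
# `tesseract --list-langs` on stdin and succeeds if the given language
# code appears as a line of its own.
has_tess_lang() {
  # -x: match the whole line, -F: treat the code as a fixed string, -q: quiet
  grep -qxF -- "$1"
}

# Usage:
#   tesseract --list-langs 2>&1 | has_tess_lang deu && echo "deu is installed"
```

Matching the whole line avoids false hits on the "List of available languages" header.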

In the Nextcloud VM the language data is located in /usr/share/tesseract-ocr/tessdata

Otherwise, try this:

find / -name 'eng.traineddata'

Find the folder and download the traineddata file for whatever language you want into it.

For example, for German you need this file:

https://github.com/tesseract-ocr/tessdata/blob/master/deu.traineddata
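One thing to watch out for: that link is the GitHub page view of the file, so fetching it directly with wget or curl would most likely save an HTML page rather than the language data. A rough sketch of a download using the usual GitHub "raw" form of the same path follows. The helper names are made up, the default folder is the Nextcloud VM path mentioned above, and whether the master branch matches your installed Tesseract version is an assumption worth checking in the repository README.

```shell
# Hypothetical helpers, not part of Tesseract or Nextant.

# Print the raw-download URL for a language code such as "deu".
tessdata_url() {
  printf 'https://github.com/tesseract-ocr/tessdata/raw/master/%s.traineddata\n' "$1"
}

# Download the language file into the tessdata folder.
#   $1 = language code
#   $2 = tessdata folder (optional, defaults to the Nextcloud VM path)
fetch_traineddata() {
  lang=$1
  dir=${2:-/usr/share/tesseract-ocr/tessdata}
  tmp=$(mktemp) || return 1
  # Download to a temporary file first so a failed download
  # does not leave a broken .traineddata file in the folder.
  if wget -q -O "$tmp" "$(tessdata_url "$lang")"; then
    mv "$tmp" "$dir/$lang.traineddata"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Usage (probably needs root for the system folder):
#   fetch_traineddata deu
```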

Now run tesseract --list-langs and you'll see German enabled. Worst case, you may want to restart Nextcloud and force a re-index by running

./occ nextant:index

All answers so far only cover the installation and availability of the different language files. But a language being available does not mean that it is actually used by the OCR engine.
The OCR engine only uses the language that is specified when it is called. It does not automatically select the correct one.

I now understand the explanation on GitHub regarding the *.properties file.

I have to extract (unzip) the original JAR file (/opt/solr/contrib/extraction/lib/tika-parsers-1.13.jar), modify the “TesseractOCRConfig.properties” file, zip the JAR again and then replace the original with the modified JAR.

That file specifies the language used for OCR. This is not a dynamic process: every document that nextant processes is run through OCR with the language set in the properties file inside the JAR.
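For anyone who wants to script the unzip/modify/zip steps, here is a rough sketch. It assumes the Info-ZIP `unzip` and `zip` tools are installed; the function names are made up for illustration, and the location of the properties file inside the JAR is looked up rather than hard-coded, since I have not verified the exact internal path.

```shell
# Hypothetical helpers, not part of Solr, Tika or Nextant.

# Rewrite the "language=" line of a properties file.
#   $1 = path to TesseractOCRConfig.properties, $2 = language code (e.g. deu)
set_ocr_language() {
  # Write to a temporary file and move it into place (portable, no sed -i).
  sed "s/^language=.*/language=$2/" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Patch the properties file inside the JAR in place, keeping a backup.
#   $1 = path to the tika-parsers JAR, $2 = language code
patch_tika_jar() {
  jar=$1
  lang=$2
  # zip is run from inside the work directory, so the JAR path must be absolute.
  jar_abs=$(cd "$(dirname "$jar")" && pwd)/$(basename "$jar")
  # Find where the properties file lives inside the archive.
  entry=$(unzip -Z1 "$jar_abs" | grep 'TesseractOCRConfig\.properties$' | head -n 1)
  if [ -z "$entry" ]; then
    echo "TesseractOCRConfig.properties not found in $jar" >&2
    return 1
  fi
  cp "$jar_abs" "$jar_abs.bak" || return 1
  workdir=$(mktemp -d) || return 1
  # Extract only that entry, edit it, and write it back into the JAR.
  unzip -q "$jar_abs" "$entry" -d "$workdir" &&
    set_ocr_language "$workdir/$entry" "$lang" &&
    (cd "$workdir" && zip -q "$jar_abs" "$entry")
  status=$?
  rm -rf "$workdir"
  return $status
}

# Usage (as a user allowed to write to the Solr folder):
#   patch_tika_jar /opt/solr/contrib/extraction/lib/tika-parsers-1.13.jar deu
```

Since the JAR is loaded by the running Java process, Solr presumably has to be restarted afterwards, and already indexed documents would need a re-index to pick up the new language.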

Thanks
Tombar