Nextant: How to change the OCR language - config file not found

Tombar · June 5, 2017, 7:28am

Hi,

GitHub explains how to change the OCR language.

Unfortunately, I cannot find the “TesseractOCRConfig.properties” file or work out how to change it. Where is this config file located? The instructions only say:

vim /opt/solr/contrib/extraction/lib/tika-parsers-1.13.jar

open the file TesseractOCRConfig.properties

change

language=eng

Any ideas?

Thanks a lot.
Tombar

hustenfrei · June 5, 2017, 6:41pm

A friend of mine did not change the language to German, and it worked for him too.

YodaPhone · June 5, 2017, 7:34pm

Have you tried looking here?

https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages

TESSERACT(1)
============
:doctype: manpage

NAME
----
tesseract - command-line OCR engine

SYNOPSIS
--------
*tesseract* 'imagename'|'stdin' 'outputbase'|'stdout' [options...] [configfile...]

DESCRIPTION
-----------
tesseract(1) is a commercial quality OCR engine originally developed at HP
between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed
at Google since then.

I think you can specify the language with the -l lang option on the command line.
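
For reference, a minimal sketch of passing the language on the command line. The helper name ocr_image and the file name scan.png are made up for illustration; -l is the option described in the man page linked above, and several languages can be combined with + (for example eng+deu).

```shell
#!/bin/sh
# Hypothetical helper: OCR one image with an explicit language and print the text.
# Usage: ocr_image IMAGE [LANG]   (LANG defaults to eng; e.g. deu or eng+deu)
ocr_image() {
    image="$1"
    lang="${2:-eng}"
    # Using "stdout" as the output base makes tesseract print the recognized text.
    tesseract "$image" stdout -l "$lang"
}
```

Calling ocr_image scan.png deu would then run the image through the German language data, provided deu is installed.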

Tombar · June 5, 2017, 8:50pm

Hm, this does not help me (or I do not understand it). Yes, I already saw that the language can be passed as a parameter on the command line. But how does this help me? I am not starting Tesseract directly; the app or Solr is calling it.

Thanks
Tombar

YodaPhone · June 6, 2017, 11:28am

This is what I was able to find.

You can find out which languages are installed by running
tesseract --list-langs. Your output will look something like this:

List of available languages (4):
eng
deu
equ
osd

In the Nextcloud VM the language data is located in /usr/share/tesseract-ocr/tessdata

Otherwise, try this:

find / -name 'eng.traineddata'

Find the folder and download whatever language you want into it. For example, for German you need this file:

https://github.com/tesseract-ocr/tessdata/blob/master/deu.traineddata

Now run tesseract --list-langs and you’ll see German enabled. Worst case, you may want to restart Nextcloud and force a re-index by running

./occ nextant:index
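
The steps above could be scripted roughly as follows. This is only a sketch: the helper names has_language and install_language are invented, the tessdata path is the Nextcloud VM location mentioned above, and the raw download URL is an assumption derived from the repository link (the branch name may differ today). Older tesseract versions print the language list to stderr, hence the 2>&1.

```shell
#!/bin/sh
# Assumption: tessdata location as reported for the Nextcloud VM.
TESSDATA_DIR="${TESSDATA_DIR:-/usr/share/tesseract-ocr/tessdata}"

# Succeeds if the language code in $1 appears as a whole line on stdin.
has_language() {
    grep -qx -- "$1"
}

# Downloads <lang>.traineddata unless tesseract already lists that language.
install_language() {
    lang="$1"
    if tesseract --list-langs 2>&1 | has_language "$lang"; then
        echo "$lang is already installed"
        return 0
    fi
    # Assumption: raw-file URL built from the tessdata repository linked above.
    url="https://github.com/tesseract-ocr/tessdata/raw/master/$lang.traineddata"
    curl -fL -o "$TESSDATA_DIR/$lang.traineddata" "$url"
}
```

Running install_language deu (as a user allowed to write to the tessdata folder) should leave deu in the output of tesseract --list-langs.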

Tombar · June 6, 2017, 4:55pm

All answers so far only refer to the installation and availability of the different language files. But being available does not mean that the different languages are actually used by the OCR engine.
The OCR engine only uses the language that is specified when it is called; it does not automatically select the correct one.

I now understand the explanation on GitHub regarding the *.properties file.

I have to extract (unzip) the original JAR file (/opt/solr/contrib/extraction/lib/tika-parsers-1.13.jar), modify the “TesseractOCRConfig.properties” file, zip the JAR again, and then replace the original with the modified JAR.

The language used for OCR is taken from that file. This is not a dynamic process: all documents processed by Nextant are run through OCR with the language specified in the properties file inside the JAR file.
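
A rough sketch of that procedure, assuming the unzip and zip tools are installed. The function names are invented for illustration, and the script looks up the location of the properties file inside the JAR instead of assuming it. It keeps a copy of the original JAR next to it with an .orig suffix.

```shell
#!/bin/sh
# Rewrites the "language=" line of a properties file in place.
set_ocr_language() {
    props="$1"
    lang="$2"
    sed "s/^language=.*/language=$lang/" "$props" > "$props.tmp" &&
        mv "$props.tmp" "$props"
}

# Extracts the properties file from the JAR, edits it, and writes it back.
# $1 must be an absolute path to the JAR, $2 the language code (e.g. deu).
patch_tika_jar() {
    jar="$1"
    lang="$2"
    workdir=$(mktemp -d) || return 1
    cp "$jar" "$jar.orig" || return 1    # keep a backup of the original JAR
    # Look up the entry name inside the archive rather than hard-coding it.
    entry=$(unzip -Z1 "$jar" | grep 'TesseractOCRConfig\.properties$' | head -n 1)
    if [ -z "$entry" ]; then
        echo "TesseractOCRConfig.properties not found in $jar" >&2
        return 1
    fi
    unzip -q "$jar" "$entry" -d "$workdir" || return 1
    set_ocr_language "$workdir/$entry" "$lang" || return 1
    # zip replaces an entry of the same name in an existing archive.
    (cd "$workdir" && zip -q "$jar" "$entry")
    status=$?
    rm -rf "$workdir"
    return $status
}
```

For the path from this thread the call would be patch_tika_jar /opt/solr/contrib/extraction/lib/tika-parsers-1.13.jar deu. Solr presumably has to be restarted afterwards so that it loads the modified JAR.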

Thanks
Tombar