Live example of a Tesseract OCR

MUT-TOUR · September 19, 2022, 7:51am

Hi there!
Am I blind, or is there really no live example of how Tesseract turns pixel content into text via OCR when it is integrated into an NC? Or the other way round: does anybody know of an example where we could just test a few files before investing many hours and resources in installing all that?

Thanks, Sebastian

n0plan · September 19, 2022, 11:15am

follow this

for your quest you need

apt-get install tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng
service elasticsearch restart

and of course the apps from the store (all 4 of em)

then new index
sudo -u www-data php /var/www/nextcloud/occ fulltextsearch:index

works here like a charm
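A quick way to check that the language packs from the apt-get line are actually visible to Tesseract is to look at its language list. This is only a sketch: `has_lang` is a hypothetical helper name, and the codes `deu` and `eng` are taken from the packages named above.

```shell
# has_lang LANG: read a language list (one code per line) on stdin and
# succeed if LANG appears as a complete line. Hypothetical helper name.
has_lang() {
    grep -qx -- "$1"
}

# Only query tesseract if it is installed; otherwise do nothing.
if command -v tesseract >/dev/null 2>&1; then
    for lang in deu eng; do
        # 2>&1 because some older versions print the list on stderr
        if tesseract --list-langs 2>&1 | has_lang "$lang"; then
            echo "$lang: installed"
        else
            echo "$lang: missing"
        fi
    done
fi
```

If a language shows up as missing, the matching tesseract-ocr-<lang> package is probably not installed.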

brNP

MUT-TOUR · September 20, 2022, 9:20am

Thanks, but as I wrote, I wanted to see how it works on someone else’s example. Usually there are live examples for add-ons, plug-ins, scripts etc., or am I mistaken? Is my idea of quickly testing something so stupid? Wouldn’t it be a cool thing, in the spirit of open source, to have a free, open OCR service somewhere?

Sebastian

devnull · September 20, 2022, 9:32am

Sorry, I do not use OCR. But a few years ago there was a JavaScript implementation (tesseract-js) for this feature.

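The app (available until Nextcloud 18) was: Optical character recognition
Video (sorry, in German): Nextcloud OCR - Kostenlos PDFs in Text umwandeln und markierbar machen - YouTube (in English: “Nextcloud OCR: convert PDFs to text for free and make them selectable”)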

Unfortunately, I don’t know why the app is no longer supported.

awelzel · September 21, 2022, 7:56am

Tesseract is free. You just need a server where you can install it.

Edit: also see “Full text search in Nextcloud | Arno Welzel” for how to set up full text search in Nextcloud, including Tesseract.

MUT-TOUR · September 21, 2022, 8:24am

Maybe my English is not good enough? I want to save HOURS of installation and configuration. That’s all. Usually time costs more than licenses. I know that Tesseract is free and open source.
Maybe there is a freely accessible example of Tesseract somewhere on the internet?

awelzel · September 21, 2022, 8:26am

What is your native language? My article is also available in German:

https://arnowelzel.de/volltextsuche-in-nextcloud-einrichten

Tesseract itself just needs to be installed along with one or more language packs (depending on which languages you want to use). After that, just enable it in the full text search setup and let Nextcloud build the search index. That’s it; there is nothing else to configure. If enabled, Tesseract will automatically be used by the full text search to scan images as well.

Edit: Tesseract itself is just a command line tool which gets an image as parameter and outputs text as a result.
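To try a few files without any Nextcloud setup, the command line tool can be called directly once Tesseract is installed on any machine. A minimal sketch, assuming the `tesseract` binary and the `deu` and `eng` language packs are present; `ocr_file` and the file names in the usage comments are hypothetical.

```shell
# ocr_file IMAGE [LANGS]: print the text Tesseract recognizes in IMAGE.
# LANGS defaults to deu+eng (an assumption; use the packs you installed).
ocr_file() {
    image="$1"
    langs="${2:-deu+eng}"
    if [ ! -f "$image" ]; then
        echo "no such file: $image" >&2
        return 2
    fi
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract is not installed" >&2
        return 127
    fi
    # "stdout" as the output base prints the text instead of writing a .txt file
    tesseract "$image" stdout -l "$langs"
}

# Usage (hypothetical file names):
#   ocr_file scan1.png
#   ocr_file invoice.jpg eng
```

The quality of the output on a handful of sample scans should give a fair impression of what the full text search will later be able to find.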

n0plan · September 21, 2022, 10:20am

Don’t get me wrong, but following the instructions on the web
(about 16 steps or fewer) on a default Nextcloud machine will take less than half an hour, right @awelzel?

The part that consumes time is building the index
(it depends on how many files you have and whether you excluded folders from the index).

The part about excluding folders from the index is missing from @awelzel’s how-tos.

It is quite simple to exclude a folder from the index (see link):
Howto exclude folder from elastic Search

Spoiler:
Put a file “.noindex” in a folder which shall be excluded.
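As a sketch of that tip: the marker is just an empty file named “.noindex”. The helper name `exclude_from_index` and the path in the usage comment are hypothetical, and the behaviour itself is reported from forum posts, not from the official documentation.

```shell
# exclude_from_index DIR: create an empty ".noindex" marker file in DIR.
# Hypothetical helper; the marker name comes from the forum tip above.
exclude_from_index() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    touch "$dir/.noindex"
}

# Usage (hypothetical path to a locally synced Nextcloud folder):
#   exclude_from_index "$HOME/Nextcloud/Archive"
```

Creating the file through a sync client or the web interface is probably the easiest route. A file placed directly in the server’s data directory would likely need an occ files:scan before Nextcloud notices it.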

so give it a try

brNP

awelzel · September 21, 2022, 2:38pm

Where does this information originally come from? I don’t like putting things into a tutorial that are only mentioned in some forum posts. https://github.com/nextcloud/fulltextsearch/wiki/Basic-Installation does not mention anything about this, and I could not find anything about using “.noindex” in the Nextcloud Administration Manual or the Nextcloud User Manual either.

n0plan · September 21, 2022, 2:43pm

True, I found it by searching the forums,
and yes, it works; I tested it.

On larger installations it’s nice to know that some folders are excluded,
to save some time when creating the index and keeping it updated.