Tesseract on Nextcloud AIO docker

OCRmyPDF succeeded with warning(s): 2 [tesseract] Error opening data file /usr/share/tessdata/eng.traineddata 1 [tesseract] Error opening data file /usr/share/tessdata/eng.traineddata SubprocessOutputError,

Problem is solved

How did you solve it?

I could now say sarcastically that you should read the documentation, but unfortunately it is not in the documentation. In the beginning we had the Nextcloud snap, but the snap config only works halfway. Then I switched to Docker, but this also has its limits.
In recent days we have been looking for a blurring of Nextcloud and in particular the photo data. We moved to FileRun, and so far all the options we want are working. Okay, FileRun is not free, but it saved us a lot of time…

Is there a way to check if the binary is properly present in the container? I am in the same situation, trying to figure out how to make tesseract-ocr work. I had no issues installing the Nextcloud app (no errors like the author's).
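
For anyone with the same question, here is a minimal sketch of one way to check. It assumes the container is named nextcloud-aio-nextcloud (the name used in the occ command later in this thread); the function name check_tesseract is made up for illustration.

```shell
# Sketch: check that tesseract is on PATH inside the Nextcloud container.
# Assumption: the container name defaults to nextcloud-aio-nextcloud.
check_tesseract() {
  container="${1:-nextcloud-aio-nextcloud}"
  # command -v prints the path of the binary or fails if it is missing;
  # --list-langs prints the language data tesseract can actually find.
  docker exec "$container" sh -c 'command -v tesseract && tesseract --version && tesseract --list-langs'
}
```

Call it as check_tesseract (prefix with sudo if your docker needs it). If the language list does not contain eng, the binary is there but the language data is not.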

Edit: I was able to install tesseract-ocr properly. But it still does not function.

              now
-------------------------------
 2023-11-16 23:28:15.871033+00
(1 row)

+ '[' -f /dev-dri-group-was-added ']'
++ find /dev -maxdepth 1 -mindepth 1 -name dri
+ '[' -n '' ']'
+ set +x
Installing imagemagick via apk...
Installing tesseract-ocr via apk...
Enabling Imagick...
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/community: No such file or directory
Configuring Redis as session handler...
Applying one-click-instance settings...
System config value one-click-instance set to boolean true
System config value one-click-instance.user-limit set to integer 100
System config value one-click-instance.link set to string https://nextcloud.com/all-in-one/
support already enabled
Adjusting log files...
System config value upgrade.cli-upgrade-link set to string https://github.com/nextcloud/all-in-one/discussions/2726
System config value logfile set to string /var/www/html/data/nextcloud.log
Config value logfile for app admin_audit set to /var/www/html/data/audit.log
System config value updatedirectory set to string /nc-updater
Applying network settings...
System config value davstorage.request_timeout set to integer 3600
System config value trusted_domains => 1 set to string (domain)
System config value overwrite.cli.url set to string (domain)
System config value htaccess.RewriteBase set to string /
.htaccess has been updated
System config value dbpersistent set to boolean false
System config value files_external_allow_create_new_local set to boolean false
System config value trusted_proxies => 0 set to string 127.0.0.1
System config value trusted_proxies => 1 set to string ::1
Config value base_endpoint for app notify_push set to (domain)/push
Config value wopi_url for app richdocuments set to (domain)
System config value allow_local_remote_servers set to boolean true
No ipv6-address found for test.sadawarte.dev.
Config value wopi_allowlist for app richdocuments set to (my ip)
System config value enabledPreviewProviders => 0 set to string OC\Preview\Imaginary
System config value preview_imaginary_url set to string http://nextcloud-aio-imaginary:9000
waiting for Fulltextsearch to become available...
{
    "search_platform": "OCA\\FullTextSearch_Elasticsearch\\Platform\\ElasticSearchPlatform",
    "app_navigation": "0",
    "provider_indexed": "",
    "cron_err_reset": "1700168632",
    "tick_ttl": "1800",
    "collection_indexing_list": "50",
    "migration_24": "1",
    "collection_internal": "local"
}
{
    "elastic_host": "http:\/\/elastic:669221203cec34ebf6a08b5f848c29be2bfdd2010431f2d9@nextcloud-aio-fulltextsearch:9200",
    "elastic_index": "nextcloud-aio",
    "fields_limit": "10000",
    "es_ver_below66": "0",
    "elastic_logger_enabled": "1",
    "analyzer_tokenizer": "standard",
    "allow_self_signed_cert": "false"
}
{
    "files_local": "1",
    "files_external": "0",
    "files_group_folders": "0",
    "files_encrypted": "0",
    "files_federated": "0",
    "files_size": "20",
    "files_pdf": "1",
    "files_office": "1",
    "files_image": "0",
    "files_audio": "0",
    "files_chunk_size": "2"
}
[16-Nov-2023 23:28:53] NOTICE: fpm is running, pid 379
[16-Nov-2023 23:28:53] NOTICE: ready to handle connections
Activating Collabora config...
Activated any config changes

Based on these logs, you forgot to add a language pack, e.g. tesseract-ocr-data-eng, on top of it.
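
A small sketch for checking whether the language file is actually in place. The default directory is taken from the error message at the top of this thread (/usr/share/tessdata); the function name check_tessdata is hypothetical, and the snippet is meant to be run in a shell inside the container.

```shell
# Sketch: report whether <lang>.traineddata exists in the tessdata directory.
# Assumption: /usr/share/tessdata is the directory tesseract looks in,
# as suggested by the "Error opening data file" message.
check_tessdata() {
  lang="${1:-eng}"
  dir="${2:-/usr/share/tessdata}"
  if [ -f "$dir/$lang.traineddata" ]; then
    echo "found: $dir/$lang.traineddata"
  else
    echo "missing: $dir/$lang.traineddata"
    return 1
  fi
}
```

Running check_tessdata eng should print a "found" line once the language pack is installed; a "missing" line matches the error reported by OCRmyPDF.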

On Ubuntu/Debian the eng data is included with the package, so I thought that wasn’t needed. Thanks for letting me know, I’ll try the config and update you soon.

@szaimen I tried doing that but it still does not work. Here’s the log:

              now
-------------------------------
 2023-11-18 19:02:58.341188+00
(1 row)

+ '[' -f /dev-dri-group-was-added ']'
++ find /dev -maxdepth 1 -mindepth 1 -name dri
+ '[' -n '' ']'
+ set +x
Installing imagemagick via apk...
Installing tesseract-ocr via apk...
Installing tesseract-ocr-data-eng via apk...
Enabling Imagick...
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/community: No such file or directory
Configuring Redis as session handler...
Applying one-click-instance settings...
System config value one-click-instance set to boolean true
System config value one-click-instance.user-limit set to integer 100
System config value one-click-instance.link set to string https://nextcloud.com/all-in-one/
support already enabled
Adjusting log files...
System config value upgrade.cli-upgrade-link set to string https://github.com/nextcloud/all-in-one/discussions/2726
System config value logfile set to string /var/www/html/data/nextcloud.log
Config value logfile for app admin_audit set to /var/www/html/data/audit.log
System config value updatedirectory set to string /nc-updater
Applying network settings...
System config value davstorage.request_timeout set to integer 3600
System config value trusted_domains => 1 set to string (domain)
System config value overwrite.cli.url set to string https://(domain)/
System config value htaccess.RewriteBase set to string /
.htaccess has been updated
System config value dbpersistent set to boolean false
System config value files_external_allow_create_new_local set to boolean false
System config value trusted_proxies => 0 set to string 127.0.0.1
System config value trusted_proxies => 1 set to string ::1
Config value base_endpoint for app notify_push set to https://(domain)/push
Config value wopi_url for app richdocuments set to https://(domain)/
System config value allow_local_remote_servers set to boolean true
No ipv6-address found for (domain).
Config value wopi_allowlist for app richdocuments set to 88.198.130.165,127.0.0.1/8,192.168.0.0/16,172.16.0.0/12,10.0.0.0/8,fd00::/8,::1
System config value enabledPreviewProviders => 0 set to string OC\Preview\Imaginary
System config value preview_imaginary_url set to string http://nextcloud-aio-imaginary:9000
waiting for Fulltextsearch to become available...
{
    "search_platform": "OCA\\FullTextSearch_Elasticsearch\\Platform\\ElasticSearchPlatform",
    "app_navigation": "0",
    "provider_indexed": "",
    "cron_err_reset": "1700168632",
    "tick_ttl": "1800",
    "collection_indexing_list": "50",
    "migration_24": "1",
    "collection_internal": "local"
}
{
    "elastic_host": "http:\/\/elastic:669221203cec34ebf6a08b5f848c29be2bfdd2010431f2d9@nextcloud-aio-fulltextsearch:9200",
    "elastic_index": "nextcloud-aio",
    "fields_limit": "10000",
    "es_ver_below66": "0",
    "elastic_logger_enabled": "1",
    "analyzer_tokenizer": "standard",
    "allow_self_signed_cert": "false"
}
{
    "files_local": "1",
    "files_external": "0",
    "files_group_folders": "0",
    "files_encrypted": "0",
    "files_federated": "0",
    "files_size": "20",
    "files_pdf": "1",
    "files_office": "1",
    "files_image": "0",
    "files_audio": "0",
    "files_chunk_size": "2"
}
[18-Nov-2023 19:03:36] NOTICE: fpm is running, pid 380
[18-Nov-2023 19:03:36] NOTICE: ready to handle connections
Activating Collabora config...
Activated any config changes

I reindexed everything with sudo docker exec --user www-data -it nextcloud-aio-nextcloud php occ fulltextsearch:index
Also, here is the config:

Well, you need to install LibreOffice to get this working.

I already have Collabora working in AIO. Also, I’m mainly looking to OCR images. The PDFs are already scanned by the default full-text search container. See the unchecked PDF option in the image.

Collabora is different from LibreOffice. OCR in Nextcloud depends on LibreOffice.

It does not work in AIO. If you make a new installation without AIO and install LibreOffice etc., then it will work.

@user2358 have you tried adding libreoffice as an additional package on top of the other ones?

[files_fulltextsearch] Warning: Exception while improving searchresult: - trace: [{"file":"/var/www/html/custom_apps/files_fulltextsearch/lib/Service/SearchService.php","line":272,"function":"getFileFromId","class":"OCA\Files_FullTextSearch\Service\FilesService","type":"->","args":["admin",159248]},{"file":"/var/www/html/custom_apps/files_fulltextsearch/lib/Service/SearchService.php","line":232,"function":"setDocumentInfo","class":"OCA\Files_FullTextSearch\Service\SearchService","type":"->","args":[{"id":"159248","providerId":"files","access":

Installing Nextcloud Server with ElasticSearch, Collabora Office, Docker Compose and Traefik | goNeuland

@FarisZR

FarisZR (Owner), 11 days ago:

This shouldn’t be too hard to fix.
You need to add libreoffice to the command sub-element for the container in compose.
They don’t do this by default because installing LibreOffice requires a lot of dependencies.

more details here:

You can actually add libreoffice to the container using NEXTCLOUD_ADDITIONAL_APKS…
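
As a rough sketch of what that could look like: the package list below is the one visible in the logs in this thread (imagemagick, tesseract-ocr, tesseract-ocr-data-eng) plus libreoffice. Where exactly the option goes depends on how you start the AIO mastercontainer, so treat the docker run line in the comment as an illustration, not a complete command.

```shell
# Sketch: packages to install on top of the Nextcloud container.
# The first three names appear in the startup logs above; libreoffice is the addition.
NEXTCLOUD_ADDITIONAL_APKS="imagemagick tesseract-ocr tesseract-ocr-data-eng libreoffice"

# Pass it to the mastercontainer together with your existing options, e.g.:
#   docker run ... --env NEXTCLOUD_ADDITIONAL_APKS="$NEXTCLOUD_ADDITIONAL_APKS" ...
```

Judging by the "Installing ... via apk..." lines in the logs, the packages are installed when the Nextcloud container starts, so the change only shows up after the containers have been restarted.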

Maybe but will it work then?

Or will it just mess up the config?

@szaimen I’m looking to use it just for images if possible, so I don’t think I need libreoffice. Nevertheless, I’ve added libreoffice via apk, but neither the images nor the PDF files are indexed by tesseract. Same config as in my last comment.
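
One way to narrow this down might be to run tesseract by hand on a single image inside the container. If that prints recognized text, the binary and the language data are fine, and the problem is more likely on the full-text search side. This is only a sketch: the function name ocr_smoke_test is made up, the image path is a placeholder you have to supply, and the container name and www-data user are taken from the occ command earlier in this thread.

```shell
# Sketch: run tesseract directly on one image and print the text to the terminal.
# Arguments: 1 = path to an image inside the container (placeholder, required)
#            2 = container name (defaults to nextcloud-aio-nextcloud)
ocr_smoke_test() {
  image="${1:-}"
  container="${2:-nextcloud-aio-nextcloud}"
  if [ -z "$image" ]; then
    echo "usage: ocr_smoke_test <image-path-inside-container> [container]" >&2
    return 2
  fi
  # "stdout" as the output base makes tesseract print the result; -l selects the language.
  docker exec --user www-data "$container" tesseract "$image" stdout -l eng
}
```

If this already fails with "Error opening data file", the language pack is still not where tesseract expects it.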