I am on Nextcloud AIO v9.1.0 with Nextcloud Hub 8 (29.0.3) on AWS (after fiddling around a bit).
As we have a lot of scanned PDF files to process, a functioning OCR solution is a requirement.
I was a bit surprised to learn that the provided fulltextsearch container comes without OCR capabilities.
Any hints (ideally a recipe to install it), please?
My script for now is:
# IDs of all running containers (every container on the host, not only the AIO ones)
aio=$(sudo docker ps -q)
sudo docker stop $aio
sudo docker rm $aio
sudo docker pull linuxserver/libreoffice
sudo docker pull jbarlow83/ocrmypdf-alpine
echo "Will start AIO master docker now"
sudo docker run \
--init \
--sig-proxy=false \
--name nextcloud-aio-mastercontainer \
--restart always \
--publish 80:80 \
--publish 8080:8080 \
--publish 8443:8443 \
--volume nextcloud_aio_mastercontainer:/mnt/docker-aio-config \
--volume /var/run/docker.sock:/var/run/docker.sock:ro \
--env NEXTCLOUD_ADDITIONAL_APKS="libreoffice ocrmypdf" \
-d \
nextcloud/all-in-one:latest
echo "Go to the AIO console and start the containers"
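Note that the first lines of the script stop and remove every running container on the host, not just the AIO ones. A minimal sketch of a narrower selection, assuming the AIO containers all have names beginning with `nextcloud-aio-` (the master container in the script does; the helper name `aio_ids` is made up for this example):

```shell
#!/bin/sh
# Hypothetical helper: read "ID NAME" pairs on stdin and print only the IDs
# whose container name starts with "nextcloud-aio-" (assumed naming scheme).
aio_ids() {
    awk '$2 ~ /^nextcloud-aio-/ { print $1 }'
}

# Intended use on the Docker host (not run here):
#   ids=$(sudo docker ps --format '{{.ID}} {{.Names}}' | aio_ids)
#   [ -n "$ids" ] && sudo docker stop $ids && sudo docker rm $ids
```

Filtering on the name keeps unrelated containers on the same host running while the AIO ones are recreated.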
Next step, until I find a way to add ocrmypdf to the PATH:
sudo docker exec -it <nextcloud container> bash
vi custom_apps/workflow_ocr/lib/OcrProcessors/OcrMyPdfBasedProcessor.php
and change "ocrmypdf" to "/usr/bin/ocrmypdf".
Still missing: how to add languages.
The OCRmyPDF documentation mentions:
By default the Docker image includes English, German, Simplified Chinese, French, Portuguese and Spanish, the most popular languages for OCRmyPDF users based on feedback. You may add other languages by creating a new Dockerfile based on the public one.
So I do not know why I get the following message, or what I should do to avoid it:
OCRmyPDF succeeded with warning(s): OCR engine does not have language data for the following requested languages: eng Please install the appropriate language data for your OCR engine. See the online documentation for instructions: Installing additional language packs — ocrmypdf 16.4.2.dev2+gd544342 documentation Note: most languages are identified by a 3-letter ISO 639-2 Code. For example, English is ‘eng’, German is ‘deu’, and Spanish is ‘spa’. Simplified Chinese is ‘chi_sim’ and Traditional Chinese is ‘chi_tra’.
I can not select a language here
after adding the language packs
--env NEXTCLOUD_ADDITIONAL_APKS="libreoffice ocrmypdf tesseract-ocr-data-eng tesseract-ocr-data-deu tesseract-ocr-data-fra" \
I can select languages in the flow app, but get another error now
cURL error 7: Failed to connect to xxx.xxx.xx port 443 after 1 ms: Couldn’t connect to server (see libcurl - Error Codes) for https://xxx.xxx.xx/hosting/capabilities
Failed to fetch capabilities: cURL error 7: Failed to connect to aws.chricar.at port 443 after 1 ms: Couldn’t connect to server (see libcurl - Error Codes) for https://xxx.xxx.xx/hosting/capabilities
It seems to me that ocrmypdf is not fully "integrated" in the NC container.
AIO works for me now. To be honest, I do not know why OCR works now; probably some container restart helped.
sudo docker run \
--init \
--sig-proxy=false \
--name nextcloud-aio-mastercontainer \
--restart always \
--publish 80:80 \
--publish 8080:8080 \
--publish 8443:8443 \
--volume nextcloud_aio_mastercontainer:/mnt/docker-aio-config \
--volume /var/run/docker.sock:/var/run/docker.sock:ro \
--env NEXTCLOUD_ADDITIONAL_APKS="libreoffice ocrmypdf tesseract-ocr-data-eng tesseract-ocr-data-deu tesseract-ocr-data-fra" \
nextcloud/all-in-one:latest
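The package list in the command above follows a simple pattern: one tesseract-ocr-data-<code> package per Tesseract language code. A small sketch that assembles the value for NEXTCLOUD_ADDITIONAL_APKS from a list of codes; the helper name `build_apks` is made up, and it is an assumption that a package with that name exists for every code you pass in:

```shell
#!/bin/sh
# Hypothetical helper: turn Tesseract language codes (eng, deu, fra, ...)
# into a value for NEXTCLOUD_ADDITIONAL_APKS, using the
# tesseract-ocr-data-<code> naming seen in the command above.
build_apks() {
    apks="libreoffice ocrmypdf"
    for lang in "$@"; do
        apks="$apks tesseract-ocr-data-$lang"
    done
    printf '%s\n' "$apks"
}

# Prints the same package list as used in the docker run command above.
build_apks eng deu fra
```

The result can then be passed as --env NEXTCLOUD_ADDITIONAL_APKS="$(build_apks eng deu fra)".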
change “ocrmypdf” to “/usr/bin/ocrmypdf”
How do you do this? In the nextcloud container?
I abandoned AIO for other reasons, so I can not tell you which container it was.
sudo docker exec -it <nextcloud container> bash
vi custom_apps/workflow_ocr/lib/OcrProcessors/OcrMyPdfBasedProcessor.php
You probably need to install an editor in the container first, and it gets lost again after a restart:
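# the packages added via NEXTCLOUD_ADDITIONAL_APKS are Alpine packages, so apk rather than apt is probably needed here
apk add vim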
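Since an editor installed this way does not survive a restart, the same change could also be made non-interactively with sed. This is only a sketch: the helper name `patch_ocrmypdf_path` is made up, and it assumes the command name appears in the file as a bare word ocrmypdf, which I have not checked against the actual source of OcrMyPdfBasedProcessor.php. Review the result before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: prefix every bare "ocrmypdf" in FILE with /usr/bin/,
# keeping a copy of the previous content in FILE.bak. Occurrences directly
# preceded by "/", a letter, a digit or "_" are left alone, so a second run
# leaves FILE unchanged (it does overwrite the backup, though).
patch_ocrmypdf_path() {
    sed -E -i.bak 's#(^|[^/[:alnum:]_])ocrmypdf#\1/usr/bin/ocrmypdf#g' "$1"
}

# Intended use inside the Nextcloud container (not run here):
#   patch_ocrmypdf_path custom_apps/workflow_ocr/lib/OcrProcessors/OcrMyPdfBasedProcessor.php
```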
What are you using now? I am not very happy with AIO.
Hi @ferdiga and @servicenet, would you elaborate on why you are dissatisfied with AIO?
AIO is great for new installations.
I have not found an (easy) way to migrate existing NC installations (Hetzner Storage Share) while retaining the internal and external links that cross-reference directories and documents, as these links use the Hetzner Storage Share domain and the file ID of that installation.
Hi,
I just tried to use the federation feature sharing the group folders as a workaround.
- I would need to keep the "old" server forever, or at least as long as the links are in use, which I as an administrator cannot decide.
- performance is bad (as the process seems to fetch all files of the source to determine the size)
- it’s not feasible to share a group folder to each user (of the same group) on the target server
- from a user perspective, the only way this could work is to share the source group folder with the target group folder (and hence with the target group), so that the user does not know where the files are stored. Having two identical structures (with old and new files) is not workable. I see 2 possibilities:
- the files from the source are copied/moved to the target server and the (existing) source link points internally to the target file (“reverse federation”)
- the old files remain on the source server and the federation shows these on the target server
Obviously it’s possible to use the new target server only as cloud of nextclouds and keep the various old servers as full featured and store no data on the target server.
Nevertheless I am not sure how collabora / onlyoffice would offer simultaneous editing if the same document is opened on different servers.
just my 2c
On top of that, there is a conflict if the shared group folder has the same name as the local group folder.
- Federated group data are not displayed, but instead the directory of the local group folder is displayed
Well, a lot of options don't work, for example OCR, IA etc.
It's a struggle to get things working.
What is IA?
AI, you know what I mean.
Tesseract list languages succeeded with warning(s): [DS] Profile read from file (tesseract_opencl_profile_devices.dat). [DS] Device[1] 0:(null) score is 0.726116 [DS] Selected Device[1]: “(null)” (Native),
Exception
API request error: rpc error: code = Unavailable desc = error reading from server: EOF
OpenAI/LocalAI’s text to image generation failed with: API request error: rpc error: code = Unavailable desc = error reading from server: EOF
I am pretty sure this is unrelated to AIO but instead a bug in the AI apps of Nextcloud or LocalAI…
This sounds like a bug in the ocr app
So why does it work in an NC installation without Docker/AIO?
This warning comes from workflow_ocr/lib/Service/OcrBackendInfoService.php at ad3c1d56ad992500c366bea65179da20876009e4 · R0Wi-DEV/workflow_ocr · GitHub
The only (almost reliable) way to determine the installed tesseract languages I found was to execute the command tesseract --list-langs, which I implemented into the workflow_ocr app. The result of this command will determine the list of languages which is shown in the UI mentioned by @ferdiga.
To troubleshoot this, you could execute tesseract --list-langs manually in your environment with the user running the NC server and compare the results.
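The comparison suggested above could be scripted roughly like this. It is only a sketch: the helper name `missing_langs` is made up, and the container name nextcloud-aio-nextcloud and the www-data user in the usage comment are assumptions about a typical AIO setup.

```shell
#!/bin/sh
# Hypothetical helper: read the output of `tesseract --list-langs` on stdin
# and print each requested language code (given as arguments) that does not
# appear as a line of its own in that output.
missing_langs() {
    installed=$(cat)
    for lang in "$@"; do
        printf '%s\n' "$installed" | grep -qxF "$lang" || printf '%s\n' "$lang"
    done
}

# Intended use on the Docker host (not run here):
#   sudo docker exec --user www-data nextcloud-aio-nextcloud \
#       tesseract --list-langs 2>&1 | missing_langs eng deu fra
```

An empty result means every requested language is listed for that user; any code that is printed is one Tesseract does not report as installed.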
Looks like this is the bug: https://github.com/ocrmypdf/OCRmyPDF/issues/1395#issuecomment-2351836729