Tesseract OCR for images

user2358 · November 16, 2023, 10:55pm

Hi, I am trying to get Tesseract OCR working with fulltextsearch. So far, I think I’m able to index PDF files (since I can search them, lmk if there is a way to check for sure). But I am not able to search for text in images. I ran fulltextsearch:index and also uploaded new photos to check, but I still can’t search them. Does the OCR addon not support image OCR? I thought it did, because Tesseract can OCR images just fine through the CLI.

Edit: It doesn’t seem to be working. Apparently, those were the default features of the fulltextsearch docker. After I disabled those and tried it, OCR does not work at all.

codejp3 · April 9, 2024, 4:54pm

If you run the occ command php occ fulltextsearch:check, you’ll see

- Content Providers:
Files 28.0.0
{
    "files_local": "1",
    "files_external": "1",
    "files_group_folders": "1",
    "files_encrypted": "0",
    "files_federated": "0",
    "files_size": "102488",
    "files_pdf": "1",
    "files_office": "1",
    "files_image": "0",
    "files_audio": "0",
    "files_chunk_size": "2",
    "files_fulltextsearch_tesseract": {
        "version": "27.0.0",
        "enabled": "1",
        "psm": "",
        "lang": "eng",
        "pdf": "1",
        "pdf_limit": ""
    }
}

While I haven’t tested it, I would think changing the value for “files_image” to 1 may do it.
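
To make it easier to spot which content types are switched off, here is a minimal Python sketch. It assumes you copy the JSON part of the check output into a string by hand; `CHECK_OUTPUT` and `disabled_flags` are hypothetical names for illustration and are not part of Nextcloud or fulltextsearch. It only reads the pasted text and does not change any setting.

```python
import json

# JSON copied from the "Files" content provider section of the check output above.
CHECK_OUTPUT = """
{
    "files_local": "1",
    "files_external": "1",
    "files_group_folders": "1",
    "files_encrypted": "0",
    "files_federated": "0",
    "files_size": "102488",
    "files_pdf": "1",
    "files_office": "1",
    "files_image": "0",
    "files_audio": "0",
    "files_chunk_size": "2",
    "files_fulltextsearch_tesseract": {
        "version": "27.0.0",
        "enabled": "1",
        "psm": "",
        "lang": "eng",
        "pdf": "1",
        "pdf_limit": ""
    }
}
"""


def disabled_flags(config_text):
    """Return the top-level option names whose value is the string "0"."""
    config = json.loads(config_text)
    # Nested sections (such as the tesseract block) are dicts, so they are skipped.
    return [name for name, value in config.items() if value == "0"]


print(disabled_flags(CHECK_OUTPUT))
# prints ['files_encrypted', 'files_federated', 'files_image', 'files_audio']
```

With the output above, "files_image" shows up in the disabled list, which is the value I am suggesting to try flipping.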

That setting has been intentionally left out of the NC admin settings page, which makes me think that feature currently has bugs/issues.

Certainly not a confirmed solution for you, but hopefully that puts you on the right path.

codejp3 · April 10, 2024, 1:01am

I’ll try to confirm the content actually gets indexed, but when I run the php occ fulltextsearch:index command, I do see image files showing up in the documents being scanned (even with the “files_image” value mentioned above set to 0).