Cannot complete initial FullTextSearch index

Hello community!

I’ve been trying to get through the initial FullTextSearch index for about two weeks now.

Every time I run php occ fulltextsearch:index, it stops/freezes at the same file:

Memory: 49 MB
┌─ Indexing  ────
│ Action: fillDocument
│ Provider: Files                Account: my_username
│ Document: 3062935
│ Info: application/pdf
│ Title: Shop Manuals/Isuzu Rodeo TF 1988 - 2002/Holden Rodeo TF 6VD1 99-02 SM.pdf
│ Content size: 
│ Chunk:    255/1277
│ Progress:      5/37
└──
┌─ Results ────
│ Result:      0/0
│ Index: 
│ Status: 
│ Message: 
│ 
│ 
└──
┌─ Errors ────
│ Error:      1/1
│ Index: files:3033570
│ Exception: Elastic\Elasticsearch\Exception\ClientResponseException
│ Message: unknown error
│ 
│ 
└──
## x:first result ## c/v:prec/next result ## b:last result
## f:first error ## h/j:prec/next error ## d:delete error ## l:last error
## q:quit ## p:pause

It’s an auto shop manual PDF, and it’s quite large (about 400 pages). It’s only one of a few dozen I have, most of which are similarly large. The others seem to index just fine, but this particular file causes the indexing to freeze.

I’ve let it sit for several days to see if it will progress any further, and it never does.

Overall, I have about 600 GB of files to index, and it only gets about a fifth of the way through, stopping at this same file every time.

I do get the one error shown above, but it’s for a different file. There are no other errors to speak of, but I do get the following notices when I try to run the index:

openjpeg warning: unspec CS. 1 component so assuming gray.
Dereference of free object 3, next object number as offset failed (code = -18), returning NULL object.
openjpeg warning: unspec CS. 3 components. Assuming data RGB.

Those messages change every time I run the initial index: sometimes I get none, sometimes a lot. They seem informational rather than hard-fault errors, though.

Every time it gets to this file and freezes, I try to end the run with php occ fulltextsearch:stop, but that fails to actually stop it: if I try to run the index again, it errors out saying an index is already running. I have to kill and restart PHP before I can start a new initial index.
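
For reference, the manual recovery looks roughly like this (a sketch only; the process pattern and PHP service name are assumptions that will vary per setup):

pkill -f 'fulltextsearch:index'               # kill the stuck occ indexer process
sudo -u www-data php occ fulltextsearch:stop  # then ask FTS to mark the run as stopped
sudo systemctl restart php8.2-fpm             # restart PHP before a fresh attempt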

I’ve considered just changing the file extension on this one file to see if I can get through the initial index, but I’d prefer for it to complete successfully without manipulating the file.

I’ve tried various settings related to PDF indexing in the NC admin page, but I hit the same issue each time. Here’s the current setup:

php occ config:list | less

"fulltextsearch": {
	"app_navigation": "1",
	"cron_err_reset": "1712500204",
	"enabled": "yes",
	"installed_version": "28.0.1",
	"search_platform": "OCA\\FullTextSearch_Elasticsearch\\Platform\\ElasticSearchPlatform",
	"types": ""
},
"fulltextsearch_elasticsearch": {
	"analyzer_tokenizer": "standard",
	"elastic_host": "http:\/\/INTERNAL_IP_ADDRESS:9200",
	"elastic_index": "nc_indexnextcloud",
	"enabled": "yes",
	"installed_version": "28.0.1",
	"types": ""
},
"files_fulltextsearch": {
	"enabled": "yes",
	"files_audio": "0",
	"files_encrypted": "0",
	"files_external": "1",
	"files_federated": "0",
	"files_group_folders": "1",
	"files_image": "0",
	"files_local": "1",
	"files_office": "1",
	"files_pdf": "1",
	"files_size": "1024",
	"installed_version": "28.0.0",
	"types": "filesystem"
},
"files_fulltextsearch_tesseract": {
	"enabled": "yes",
	"installed_version": "27.0.0",
	"tesseract_enabled": "1",
	"tesseract_lang": "eng",
	"tesseract_pdf": "1",
	"tesseract_pdf_limit": "",
	"tesseract_psm": "",
	"types": ""
},
php occ fulltextsearch:check

Full text search 28.0.1
{
    "search_platform": "OCA\\FullTextSearch_Elasticsearch\\Platform\\ElasticSearchPlatform",
    "app_navigation": "1",
    "provider_indexed": "",
    "cron_err_reset": "1712500204",
    "tick_ttl": "1800",
    "collection_indexing_list": "50",
    "migration_24": "1",
    "collection_internal": "local"
}

- Search Platform:
Elasticsearch 28.0.1 (Selected)
{
    "elastic_host": [
        "http://INTERNAL_IP_ADDRESS:9200"
    ],
    "elastic_index": "nc_indexnextcloud",
    "fields_limit": "10000",
    "es_ver_below66": "0",
    "elastic_logger_enabled": "1",
    "analyzer_tokenizer": "standard",
    "allow_self_signed_cert": "false"
} 

- Content Providers:
Files 28.0.0
{
    "files_local": "1",
    "files_external": "1",
    "files_group_folders": "1",
    "files_encrypted": "0",
    "files_federated": "0",
    "files_size": "1024",
    "files_pdf": "1",
    "files_office": "1",
    "files_image": "0",
    "files_audio": "0",
    "files_chunk_size": "2",
    "files_fulltextsearch_tesseract": {
        "version": "27.0.0",
        "enabled": "1",
        "psm": "",
        "lang": "eng",
        "pdf": "1",
        "pdf_limit": ""
    }
}
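
(Side note: the same provider settings can be changed from the CLI instead of the admin page with occ config:app:set. The example below just re-applies the current files_size value and is not a recommendation:)

# set a files_fulltextsearch key directly; the sudo/php prefix depends on your setup
sudo -u www-data php occ config:app:set files_fulltextsearch files_size --value="1024"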

Thank you in advance for any assistance with this!

You may be able to see what error is seeping through that FTS can’t parse by adding the following two lines:

$error = json_last_error();   // error code from the most recent JSON encode/decode call
var_dump($arr, $error);       // dump the decoded value ($arr) together with that error code

…just above the return line that is here:

Thank you @jtr!

I ran the command occ fulltextsearch:document:provider MY_USERNAME files 3062935 --content to try to debug that single file. It’s been a couple of hours since I ran it, and the command still has not completed. Even the debugging command freezes on the very same file the initial index does. I may try your suggestion next if I strike out with my latest findings.
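
(In hindsight, wrapping a debug run like that in coreutils timeout avoids waiting on it for hours; a sketch using the same IDs as above, with the sudo/php prefix depending on your setup:)

# give up after 5 minutes instead of hanging forever (timeout exits 124 if it kills the command)
timeout 300 sudo -u www-data php occ fulltextsearch:document:provider MY_USERNAME files 3062935 --content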

Update:

I may have been slightly wrong about the number of pages, though. It’s not several hundred pages; it’s 23,403 pages long :grimacing:

I’ve verified that the file is not corrupt and opens as expected with zero issues in the file content. I’ve also verified that a similarly large PDF was indexed successfully, to rule out file size as the cause.
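
For anyone repeating these checks: poppler’s pdfinfo reports the page count and the encryption flag in one pass (a sketch; assumes poppler-utils is installed):

# the output includes "Pages:" and "Encrypted:" lines for the file
pdfinfo 'Holden Rodeo TF 6VD1 99-02 SM.pdf'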

While there’s no way to test Tesseract directly on a PDF, I did run the command ocrmypdf Holden\ Rodeo\ TF\ 6VD1\ 99-02\ SM.pdf ./test.pdf --output-type pdf --redo-ocr. It says the PDF is encrypted, but it isn’t:

EncryptedPdfError: Input PDF is encrypted. The encryption must be removed to
perform OCR.

I verified that the file reports itself as encrypted (even though it opens fine without a password) with qpdf --show-encryption Holden\ Rodeo\ TF\ 6VD1\ 99-02\ SM.pdf:

R = 2
P = -28
User password = 
Supplied password is owner password
Supplied password is user password
extract for accessibility: not allowed
extract for any purpose: not allowed
print low resolution: allowed
print high resolution: allowed
modify document assembly: not allowed
modify forms: allowed
modify annotations: allowed
modify other: not allowed
modify anything: not allowed

I removed the encryption with qpdf --decrypt Holden\ Rodeo\ TF\ 6VD1\ 99-02\ SM.pdf ./Holden\ Rodeo\ TF\ 6VD1\ 99-02\ SM\ decrypted.pdf.

Then I retested the encryption with qpdf --show-encryption Holden\ Rodeo\ TF\ 6VD1\ 99-02\ SM\ decrypted.pdf:

File is not encrypted

Deleted the original “encrypted” file and moved the decrypted one into place under the original filename.

For good measure, I’m re-running a file scan (occ files:scan) now. Once that’s done, I’m going to try the index again. If that fails, your suggestion is next on the list.

If this indeed turns out to be the issue, I’ll open a GitHub issue suggesting a workaround to skip password-protected/encrypted files.
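
In the meantime, something like this could flag any other affected PDFs up front (a sketch; the data path is a placeholder, and qpdf --is-encrypted exits 0 for encrypted files and 2 for unencrypted ones):

# list every PDF under the data directory that qpdf considers encrypted
find /path/to/nextcloud/data -iname '*.pdf' -print0 |
while IFS= read -r -d '' f; do
    qpdf --is-encrypted "$f" && echo "ENCRYPTED: $f"
done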

Still no luck with that same file. I let it sit on that file for about 24 hours to see if it would progress, and it didn’t.

I added the var_dump as @jtr suggested, but no errors get spit out.

I thought it might be a resource limitation, so I bumped up the CPU cores, RAM, and swap for both the NC container and the Elasticsearch container before that last attempt.

Still a no-go.

I don’t know what the issue is. Perhaps it’s the sheer file size (148 MB) and number of pages, but like I said, a different PDF that’s a little over 100 MB and almost as many pages indexed fine.

I’ve resorted to zipping that file up so that it’s no longer a PDF to be indexed. I’ve reset the index yet again and am running it fresh to see if it will finally complete. I’m counting it as a loss that I can’t get this file to work as-is, but if the index does finally complete, I’ll count it as a win overall.

UPDATE / RESOLUTION:

The Ghostscript 10.00 release that ships with most newer distros hangs when extracting text from complex PDF files. Nextcloud FTS then waits seemingly indefinitely for GS to extract the text, which never happens, and everything comes to a complete stop.
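
The hang can be reproduced outside of Nextcloud by pointing gs at the file directly (a sketch; I’m assuming the extraction path goes through gs’s txtwrite device, and the 120-second timeout is arbitrary):

# if gs hangs on the file, timeout kills it after 120s (exit code 124) instead of waiting forever
timeout 120 gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=/dev/null 'Holden Rodeo TF 6VD1 99-02 SM.pdf'

gs --version   # confirm which gs build is actually on the PATH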

I found a topic on GitHub that seems to have the resolution: building the latest version of GS and replacing the gs binary with it seems to be working so far.
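
For reference, the upgrade was roughly the following (a sketch, not a definitive guide; the version number and download URL are assumptions, so check the Ghostscript releases page for the current tarball):

# build a newer Ghostscript from source and let /usr/local/bin/gs shadow the distro binary
wget https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs10030/ghostscript-10.03.0.tar.gz
tar xzf ghostscript-10.03.0.tar.gz && cd ghostscript-10.03.0
./configure && make -j"$(nproc)"
sudo make install        # installs to /usr/local/bin/gs by default
hash -r && gs --version  # confirm the shell now resolves the new build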

Several files (like the one this topic was about) that were “hanging” have now been successfully indexed. A few more days of indexing and I might actually get through the initial index after all! :joy: