Cannot complete initial FullTextSearch index

Hello community!

I’ve been trying to get through the initial FullTextSearch index for about two weeks now.

Every time I run php occ fulltextsearch:index, it stops/freezes at the same file:

Memory: 49 MB
┌─ Indexing  ────
│ Action: fillDocument
│ Provider: Files                Account: my_username
│ Document: 3062935
│ Info: application/pdf
│ Title: Shop Manuals/Isuzu Rodeo TF 1988 - 2002/Holden Rodeo TF 6VD1 99-02 SM.pdf
│ Content size: 
│ Chunk:    255/1277
│ Progress:      5/37
└──
┌─ Results ────
│ Result:      0/0
│ Index: 
│ Status: 
│ Message: 
│ 
│ 
└──
┌─ Errors ────
│ Error:      1/1
│ Index: files:3033570
│ Exception: Elastic\Elasticsearch\Exception\ClientResponseException
│ Message: unknown error
│ 
│ 
└──
## x:first result ## c/v:prec/next result ## b:last result
## f:first error ## h/j:prec/next error ## d:delete error ## l:last error
## q:quit ## p:pause

It’s an auto shop manual PDF, and it’s quite large (about 400 pages). It’s only one of a few dozen I have, most of which are similarly large. The others seem to index just fine, but this particular file causes the indexing to freeze.

I’ve let it sit for several days to see if it will progress any further, and it never does.

Overall, I have about 600 GB of files to index, and it only gets about a fifth of the way through, stopping at this same file every time.

I do get the one error shown above, but it’s for a different file. There are no other errors to speak of, but I do get the following notices when I try to run the index:

openjpeg warning: unspec CS. 1 component so assuming gray.
Dereference of free object 3, next object number as offset failed (code = -18), returning NULL object.
openjpeg warning: unspec CS. 3 components. Assuming data RGB.

Those messages change every time I run the initial index: sometimes I get none, sometimes a lot. They seem informational rather than hard-fault errors, though.

Every time it gets to this file and freezes, I try to end the run with php occ fulltextsearch:stop, but that fails to actually stop it: if I try to run the index again, it errors out saying an index is already running. I have to kill and restart PHP before I can start a new initial index.
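
For reference, the manual recovery looks roughly like this (a sketch only; the process pattern and PHP service name are assumptions that will vary per setup):

pkill -f 'fulltextsearch:index'               # kill the stuck occ indexer process
sudo -u www-data php occ fulltextsearch:stop  # then ask FTS to mark the run as stopped
sudo systemctl restart php8.2-fpm             # restart PHP before a fresh attempt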

I’ve considered just changing the file extension on this one file to see if I can get through the initial index, but I’d prefer for it to complete successfully without manipulating the file.

I’ve tried various settings related to PDF indexing in the NC admin page, but I hit the same issue each time. Here’s the current setup:

php occ config:list | less

"fulltextsearch": {
	"app_navigation": "1",
	"cron_err_reset": "1712500204",
	"enabled": "yes",
	"installed_version": "28.0.1",
	"search_platform": "OCA\\FullTextSearch_Elasticsearch\\Platform\\ElasticSearchPlatform",
	"types": ""
},
"fulltextsearch_elasticsearch": {
	"analyzer_tokenizer": "standard",
	"elastic_host": "http:\/\/INTERNAL_IP_ADDRESS:9200",
	"elastic_index": "nc_indexnextcloud",
	"enabled": "yes",
	"installed_version": "28.0.1",
	"types": ""
},
"files_fulltextsearch": {
	"enabled": "yes",
	"files_audio": "0",
	"files_encrypted": "0",
	"files_external": "1",
	"files_federated": "0",
	"files_group_folders": "1",
	"files_image": "0",
	"files_local": "1",
	"files_office": "1",
	"files_pdf": "1",
	"files_size": "1024",
	"installed_version": "28.0.0",
	"types": "filesystem"
},
"files_fulltextsearch_tesseract": {
	"enabled": "yes",
	"installed_version": "27.0.0",
	"tesseract_enabled": "1",
	"tesseract_lang": "eng",
	"tesseract_pdf": "1",
	"tesseract_pdf_limit": "",
	"tesseract_psm": "",
	"types": ""
},
php occ fulltextsearch:check

Full text search 28.0.1
{
    "search_platform": "OCA\\FullTextSearch_Elasticsearch\\Platform\\ElasticSearchPlatform",
    "app_navigation": "1",
    "provider_indexed": "",
    "cron_err_reset": "1712500204",
    "tick_ttl": "1800",
    "collection_indexing_list": "50",
    "migration_24": "1",
    "collection_internal": "local"
}

- Search Platform:
Elasticsearch 28.0.1 (Selected)
{
    "elastic_host": [
        "http://INTERNAL_IP_ADDRESS:9200"
    ],
    "elastic_index": "nc_indexnextcloud",
    "fields_limit": "10000",
    "es_ver_below66": "0",
    "elastic_logger_enabled": "1",
    "analyzer_tokenizer": "standard",
    "allow_self_signed_cert": "false"
} 

- Content Providers:
Files 28.0.0
{
    "files_local": "1",
    "files_external": "1",
    "files_group_folders": "1",
    "files_encrypted": "0",
    "files_federated": "0",
    "files_size": "1024",
    "files_pdf": "1",
    "files_office": "1",
    "files_image": "0",
    "files_audio": "0",
    "files_chunk_size": "2",
    "files_fulltextsearch_tesseract": {
        "version": "27.0.0",
        "enabled": "1",
        "psm": "",
        "lang": "eng",
        "pdf": "1",
        "pdf_limit": ""
    }
}
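
(Side note: the same provider settings can be changed from the CLI instead of the admin page with occ config:app:set. The example below just re-applies the current files_size value and is not a recommendation:)

# set a files_fulltextsearch key directly; the sudo/php prefix depends on your setup
sudo -u www-data php occ config:app:set files_fulltextsearch files_size --value="1024"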

Thank you in advance for any assistance with this!

You may be able to see what error is seeping through that FTS can’t parse by adding the following two lines:

$error = json_last_error();   // error code from the most recent JSON encode/decode call
var_dump($arr, $error);       // dump the decoded value ($arr) together with that error code

…just above the return line that is here:

Thank you @jtr!

I ran the command occ fulltextsearch:document:provider MY_USERNAME files 3062935 --content to try to debug that single file. It’s been a couple of hours since I ran it, and the command still has not completed. Even the debugging command freezes on the very same file the initial index does. I may try your suggestion next if I strike out with my latest findings.
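
(In hindsight, wrapping a debug run like that in coreutils timeout avoids waiting on it for hours; a sketch using the same IDs as above, with the sudo/php prefix depending on your setup:)

# give up after 5 minutes instead of hanging forever (timeout exits 124 if it kills the command)
timeout 300 sudo -u www-data php occ fulltextsearch:document:provider MY_USERNAME files 3062935 --content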

Update:

I may have been slightly wrong about the number of pages, though. It’s not several hundred pages; it’s 23,403 pages long :grimacing:

I’ve verified that the file is not corrupt and opens as expected with zero issues in the file content. I’ve also verified that a similarly large PDF was indexed successfully, to rule out file size as the cause.
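
For anyone repeating these checks: poppler’s pdfinfo reports the page count and the encryption flag in one pass (a sketch; assumes poppler-utils is installed):

# the output includes "Pages:" and "Encrypted:" lines for the file
pdfinfo 'Holden Rodeo TF 6VD1 99-02 SM.pdf'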

While there’s no way to test Tesseract directly on a PDF, I did run the command ocrmypdf Holden\ Rodeo\ TF\ 6VD1\ 99-02\ SM.pdf ./test.pdf --output-type pdf --redo-ocr. It says the PDF is encrypted, but it isn’t:

EncryptedPdfError: Input PDF is encrypted. The encryption must be removed to
perform OCR.

I verified that the file reports itself as encrypted (even though it opens fine without a password) with qpdf --show-encryption Holden\ Rodeo\ TF\ 6VD1\ 99-02\ SM.pdf:

R = 2
P = -28
User password = 
Supplied password is owner password
Supplied password is user password
extract for accessibility: not allowed
extract for any purpose: not allowed
print low resolution: allowed
print high resolution: allowed
modify document assembly: not allowed
modify forms: allowed
modify annotations: allowed
modify other: not allowed
modify anything: not allowed

I removed the encryption with qpdf --decrypt Holden\ Rodeo\ TF\ 6VD1\ 99-02\ SM.pdf ./Holden\ Rodeo\ TF\ 6VD1\ 99-02\ SM\ decrypted.pdf.

Then I retested the encryption with qpdf --show-encryption Holden\ Rodeo\ TF\ 6VD1\ 99-02\ SM\ decrypted.pdf:

File is not encrypted

Deleted the original “encrypted” file and moved the decrypted one into place under the original filename.

For good measure, I’m re-running a file scan (occ files:scan) now. Once that’s done, I’m going to try the index again. If that fails, your suggestion is next on the list.

If this indeed turns out to be the issue, I’ll open a GitHub issue suggesting a workaround to skip password-protected/encrypted files.
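
In the meantime, something like this could flag any other affected PDFs up front (a sketch; the data path is a placeholder, and qpdf --is-encrypted exits 0 for encrypted files and 2 for unencrypted ones):

# list every PDF under the data directory that qpdf considers encrypted
find /path/to/nextcloud/data -iname '*.pdf' -print0 |
while IFS= read -r -d '' f; do
    qpdf --is-encrypted "$f" && echo "ENCRYPTED: $f"
done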

Still no luck with that same file. I let it sit on that file for about 24 hours to see if it would progress, and it didn’t.

I added the var_dump as @jtr suggested, but no errors get spit out.

I thought it might be a resource limitation, so I bumped up the CPU cores, RAM, and swap for both the NC container and the Elasticsearch container before that last attempt.

Still a no-go.

I don’t know what the issue is. Perhaps it’s the sheer file size (148 MB) and number of pages, but like I said, a different PDF that’s a little over 100 MB and almost as many pages indexed fine.

I’ve resorted to zipping that file up so that it’s no longer a PDF to be indexed. I’ve reset the index yet again and am running it fresh to see if it will finally complete. I’m counting it as a loss that I can’t get this file to work as-is, but if the index does finally complete, I’ll count it as a win overall.

UPDATE / RESOLUTION:

The Ghostscript 10.00 release that ships with most newer distros hangs when extracting text from complex PDF files. Nextcloud FTS then waits seemingly indefinitely for GS to extract the text, which never happens, and everything comes to a complete stop.
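
The hang can be reproduced outside of Nextcloud by pointing gs at the file directly (a sketch; I’m assuming the extraction path goes through gs’s txtwrite device, and the 120-second timeout is arbitrary):

# if gs hangs on the file, timeout kills it after 120s (exit code 124) instead of waiting forever
timeout 120 gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=/dev/null 'Holden Rodeo TF 6VD1 99-02 SM.pdf'

gs --version   # confirm which gs build is actually on the PATH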

I found a topic on GitHub that seems to have the resolution: building the latest version of GS and replacing the gs binary with it seems to be working so far.
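
For reference, the upgrade was roughly the following (a sketch, not a definitive guide; the version number and download URL are assumptions, so check the Ghostscript releases page for the current tarball):

# build a newer Ghostscript from source and let /usr/local/bin/gs shadow the distro binary
wget https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs10030/ghostscript-10.03.0.tar.gz
tar xzf ghostscript-10.03.0.tar.gz && cd ghostscript-10.03.0
./configure && make -j"$(nproc)"
sudo make install        # installs to /usr/local/bin/gs by default
hash -r && gs --version  # confirm the shell now resolves the new build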

Several files (like the one this topic was about) that were “hanging” have now been successfully indexed. A few more days of indexing and I might actually get through the initial index after all! :joy: