Fulltextsearch index with groupfolders terribly slow

Nextcloud 26.0.5, ElasticSearch 8.10.0 plus groupfolders.

EDIT: nginx with php-fpm 8.2

Rebuilding the Fulltextsearch index takes ages. On a comparable instance without group folders, a full rebuild for 15 users and approx. 50,000 files takes half an hour. On the instance with ~20 group folders, it takes ~5 s per file, so I have to wait about three days.
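
A rough back-of-the-envelope check of those numbers, assuming the slow instance holds roughly the same 50,000 files as the comparable one (the exact file count there is an assumption):

```python
# Figures taken from the post; FILES on the slow instance is an assumption.
FILES = 50_000            # approx. number of files
SECONDS_PER_FILE = 5      # observed with ~20 group folders
BASELINE_SECONDS = 30 * 60  # full rebuild without group folders

slow_seconds = FILES * SECONDS_PER_FILE
slow_days = slow_seconds / 86_400          # seconds per day
slowdown = slow_seconds / BASELINE_SECONDS

print(f"{slow_days:.1f} days, about {slowdown:.0f}x slower")
# prints "2.9 days, about 139x slower"
```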

The machine (a VM) has 12 GB of RAM and 4 vCPUs. Even while indexing, the load average does not rise above 1.5. I’ve observed that occ fulltextsearch:index runs as a single thread that spends about 1/3 of its time in MariaDB and 2/3 in PHP. The system is not waiting for I/O: the I/O wait share stays mostly below 20%.

Is there something I can do to tune this process?

Nobody answers here? The observed behavior makes this app really unusable in production. Imagine some thousands of users, hundreds of groups and group folders, and millions of files! They would wait years for the index to be built.

Really, does nobody care?

You are asking in a community forum, thus, I assume most users here don’t manage thousands of users.
Maybe you can provide more detailed steps how to reproduce this issue and how to measure the execution time of the search. Also the filetype would be important to know.

You are asking in a community forum, thus, I assume most users here don’t manage thousands of users.

You’re right, but the community is regarded as the testbed for bigger installations, and the same code is used in production. Therefore I think it’s my responsibility to raise the problem. Of course, for a private installation a wait time of three days isn’t that problematic. But look at it on a bigger scale: the slowdown is more than a factor of 100, and it could easily mean that reindexing takes years for somebody using this software in production.

Maybe you can provide more detailed steps how to reproduce this issue and how to measure the execution time of the search.

It is not about the execution time of the search. It is about upgrading your Elasticsearch platform, where you are required to rebuild the index (for completeness: there can be other situations where you want to rebuild the index). How to reproduce the problem?

  1. You have a Nextcloud instance (versions as given at the start of the thread). Fulltextsearch is installed and working.
  2. For some reason, you are required to rebuild the Fulltextsearch index. So you do that according to the docs: occ fulltextsearch:reset (to recreate the index), occ fulltextsearch:test (to check that it basically works) and occ fulltextsearch:index (to rebuild the index).
  3. For some time, you watch the index status display, until you notice that the file name being processed changes very slowly. So you start watching the Nextcloud log, and you see: for EVERY file there are as many log entries as you have users, and the interval between these messages (checking if file X is accessible by user Y) is approx. one second. Uuuh, let’s make another coffee, and wait…
  4. You ask your favourite search engine about slow indexing, and some results say ‘using group folders terribly slows down indexing’. You remember that you enabled group folders some time ago, just to see what they are, and forgot to deactivate them when you didn’t have any use for them.
  5. You deactivate group folders and repeat the indexing: instead of three days, it now takes half an hour to index all your files.
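
The commands from step 2 can be collected in a small Python sketch that also offers a helper for timing each run. The install path /var/www/nextcloud, the web server user www-data and calling occ through sudo are assumptions about a typical setup, not taken from this instance. Because a reset is destructive, the sketch only prints the commands; timed_run is a hypothetical helper to be called deliberately.

```python
import shlex
import subprocess
import time


def build_reindex_commands(occ="/var/www/nextcloud/occ", web_user="www-data"):
    """Return the three occ calls from step 2 as argument lists.

    The occ path and the web server user are assumptions; adjust them.
    """
    prefix = ["sudo", "-u", web_user, "php", occ]
    steps = ["fulltextsearch:reset", "fulltextsearch:test", "fulltextsearch:index"]
    return [prefix + [step] for step in steps]


def timed_run(command):
    """Run one command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


# Only print the commands here; nothing is executed.
for command in build_reindex_commands():
    print(shlex.join(command))
```

Running the index step once with group folders enabled and once without, each through timed_run, would give the two durations being compared in this thread.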

So here I am, asking for people sharing their experience, and hoping for the devs to listen if there’s a serious problem.

My personal experience is that indexing itself is quite fast if you don’t use OCR. Did you have tesseract enabled?

Did you have tesseract enabled?

No. I wrote about group folders, which make a difference of a factor of 100 or more.

I cannot reproduce your issue.

Do I understand you right:

  • for you, indexing is fast because you don’t use OCR.
  • Enabling the groupfolders app, adding 20 group folders and repeating the indexing does not make it any slower?

Indexing is slow when using files_fulltextsearch_tesseract because of several issues. One of them being High CPU usage on multicore · Issue #61 · nextcloud/files_fulltextsearch_tesseract · GitHub.

Initial indexing seems to run user by user (not storage by storage). Thus adding more group folders will add some overhead in any case. But you write that indexing takes 5 seconds for each file. Typical PDF/text files are indexed within milliseconds on my instances, and only if I use OCR does it take a couple of seconds per file. So the question is what kind of files you have that take so long.

On the other hand, there seems to be some improvement in the latest release to prevent double indexing.
The following issue may be related: fulltextsearch:index always scans all files even if index is already available · Issue #767 · nextcloud/fulltextsearch · GitHub
PRs which might be related:
set internal collection by ArtificialOwl · Pull Request #776 · nextcloud/fulltextsearch · GitHub
collections and reset by ArtificialOwl · Pull Request #786 · nextcloud/fulltextsearch · GitHub

Not sure if this improves your situation with regard to group folders. Maybe you want to try it out and report back.
If your issue is unrelated, I assume you’ll have to create a new issue to let the developers know!

@paule58 Did you check the latest fulltextsearch releases? I assume your issue was fixed. I suggest resetting the index and starting from scratch.

Now that the 27.x.x beta carousel is maybe over, I upgraded to 27.1.3. No change so far. I can still observe the following logic (which makes me shiver):

for user in all_users:
    for file in files_visible_to_this_user:
        for xuser in all_users:
            retrieve_pathname_of_file_in_xusers_store  # getPathFromRoot
        index_the_file

This did not change. Even adding the newly recommended memory caches does not change anything in terms of speed.
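
To illustrate why that loop structure hurts with shared folders, here is a minimal Python sketch that only counts the path lookups such a loop would perform. It is a toy model of the observed pattern, not Nextcloud's actual code; the function names and the sample data are hypothetical, and the second function is just a comparison point, not something the app is known to do.

```python
def count_path_lookups(visible_files):
    """Count lookups for the observed pattern.

    visible_files maps each user to the set of file ids that user can see.
    """
    users = list(visible_files)
    lookups = 0
    for user in users:
        for _file in visible_files[user]:
            for _xuser in users:
                lookups += 1  # one path lookup per (user, file, xuser)
    return lookups


def count_path_lookups_deduplicated(visible_files):
    """Hypothetical comparison: each distinct file is handled only once."""
    users = list(visible_files)
    distinct_files = set().union(*visible_files.values())
    return len(distinct_files) * len(users)


# Toy data: three users who all see the same four files in a shared folder.
shared = {"alice": {1, 2, 3, 4}, "bob": {1, 2, 3, 4}, "carol": {1, 2, 3, 4}}

print(count_path_lookups(shared))               # 3 users * 4 files * 3 users = 36
print(count_path_lookups_deduplicated(shared))  # 4 files * 3 users = 12
```

In this model the lookup count grows with the square of the number of users whenever files are visible to everyone, which is exactly the situation group folders create.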

@paule58 If you still think there’s an issue, you’ll have to create a bug report on GitHub. Your issue will not resolve itself just because you are posting it in a community forum ;).

No need to clarify again that this is a community forum. I had hoped that a forum is a place to exchange thoughts about the software, and that it is normal for developers to stop by from time to time, just to see if their code is behaving well in the field. Or, failing that, that there is a community manager who would at least make them aware.

The community person is definitely here. Let me take the liberty of assuming that you likely get paid to resolve this issue, in which case a support contract might be appropriate. I think we all agree that the features you depend upon are enterprise-level, and we ask you to respect the time and mood of the unpaid volunteers on this forum, who help you in their free time with everyone’s best interest in mind.

I’m closing this issue because I think this specific request for support is inappropriate and the tone of the thread is deteriorating.