Problem in the design of the file index and full-text index, and in communication between client and server

chonta · August 17, 2023, 11:27am

Nextcloud is a great product, and the company’s philosophy is exemplary as well. However, as a long-time user, it pains me to see how new features are constantly introduced while the foundation upon which everything is built is entirely overlooked, and deficiencies are left unaddressed.

So, what do I mean by this?

At its core, Nextcloud is a tool for storing files and exchanging them. And here is where, from my perspective, the foundation is “broken by design.”

The focus is on the user, rather than the data that needs to be shared and stored! This becomes very apparent through two problems, both of which stem from the architecture and usage of the database. A detailed description of the problem can be found here: GitHub Issue Link

Brief Description of Problem 1:
Users and groups from the Active Directory.
Integration of shares from the Windows file server with the option to store logins in the database. This essentially turns Nextcloud into a web frontend for the MS file server. Permissions for actions on files are determined by ACL on the file server, which is not an issue, as external shares in Nextcloud can be mounted as read-only or writable.

However, problems arise when the Nextcloud client is used to sync files to a client. If User A modifies a file and User B is allowed to access the same file, User B will not receive the changes made by User A.

Brief Description of Problem 2 (actually two problems):
Here again, the data is stored on an external storage, and users are coming from the Active Directory.

There are 53 users in Nextcloud, of which 24 are from the AD, and the rest are local Nextcloud users. The problem is that the Nextcloud database is HUGE! It’s a whopping 28.2 GB according to the web interface. The externally mounted share encompasses over 3TB of data, with more than 3,000,000 files and 300,000 directories. Another problem arises when individual directories are shared from AD users to Nextcloud-local users to exchange data. While this works, the displayed data isn’t truly up-to-date.

Normally, accessing data through a web browser would update as changes occur on the external storage, even if those changes weren’t made through Nextcloud itself. This is because the browser triggers a refresh of the directory, detecting new and modified data. However, in this case, the refresh isn’t effective, taking too long or not being limited to the directory being viewed.

So, why is the database so massive?
Unfortunately, a separate file index in the oc_filecache is created for each user! An entry is generated for each of the 3,000,000 files for every user. For 24 AD users, that’s 72,000,000 (seventy-two million entries) in the oc_filecache. Each new user adds another 3,000,000 rows.
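The arithmetic above can be sketched in a few lines. This is only an illustration of the two counting schemes being contrasted (one entry per user and file versus one entry per file); the function names are made up for this example and say nothing about how Nextcloud actually populates oc_filecache.

```python
def rows_per_user_index(files: int, users: int) -> int:
    """Rows when every user gets an index entry for every file."""
    return files * users


def rows_single_index(files: int, users: int) -> int:
    """Rows when each file is stored once, regardless of the number of users."""
    return files


FILES = 3_000_000  # file count quoted in the post

# Compare both schemes for 1 user, the 24 AD users, and one more user.
for users in (1, 24, 25):
    print(users, rows_per_user_index(FILES, users), rows_single_index(FILES, users))
```

The per-user scheme grows by one full copy of the index for each added user, while the single-index scheme stays constant.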

This is highly inefficient and problematic. Nextcloud should behave more like a file system at this point, requiring just one entry per file.

The problem lies in how paths are constructed in Nextcloud: “user/files/share.”
Whether it’s external or internal storage, the structure and fundamental problem are the same. Running “files:scan --path="user/files/share/" user” takes three hours to complete. This seems to be the reason the browser display isn’t current. The browser-initiated refresh doesn’t seem to be restricted to the current directory or limited to just F5; it starts at the share level and never reaches the correct directory.

For an updated display, it would be helpful if there were a continuous refresh parallel to the scan, specifically for the directory being viewed, thus enabling faster access to new data.

Optimization could be achieved by implementing a Windows service for the file server that scans the MFT and relays changes to Nextcloud. This way, it would only need to compare which data is more current, that from the service or what Nextcloud already has.
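A minimal sketch of the "compare which data is more current" step, assuming the hypothetical file-server service reports a path together with its modification time and size. The `Entry` type and `apply_reported_change` function are invented for this illustration; no such Nextcloud API is implied.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    mtime: float  # last-modified time, seconds since the epoch
    size: int     # file size in bytes


def apply_reported_change(cache: dict[str, Entry], path: str, reported: Entry) -> bool:
    """Update the cached entry only if the reported state is newer.

    Returns True if the cache was updated, False if the cache was already current.
    """
    known = cache.get(path)
    if known is None or reported.mtime > known.mtime:
        cache[path] = reported
        return True
    return False
```

With this approach the server would never need to walk the whole share; it would only touch the entries the service reports as changed.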

The oc_filecache is also the issue for Problem 1. There are no columns indicating who changed what and when, nor is there information about who has access to the file. If Nextcloud behaved more like a file system, these problems wouldn’t exist. The oc_filecache should include details about when a file was changed by whom, and whether a push signal to reload the file is necessary for all clients. One entry per file should be sufficient. The oc_filecache should only indicate which groups are allowed to read/write/delete/share.

This could even be managed through system groups, whose membership is determined by other tables. Access rights through groups are already a part of Nextcloud; they just need to extend down to the necessary level.

Unfortunately, the same issue affecting the file index applies to the full-text index, where each user has their own full-text index. However, it should exist only once per file, with the same access rights as the file it’s for.

I can only hope that these fundamental design issues are addressed before the next AI integration is introduced. Currently, the behavior of Filecache and the Full-Text Index is compounding an issue that grows exponentially with the number of users and files.

Kind Regards

Chonta

PS: Don’t get me wrong, I really love Nextcloud. I would just love to see it use its full potential.
I will gladly answer questions.

christianlupus · August 17, 2023, 3:28pm

OK, may I ask what is your point here?

Are you just moaning and grousing about a sub-optimal situation (well that needs to be done from time to time, I understand perfectly myself)?
Are you trying to get people’s opinion on the topic?
Are you trying to find fellow devs to file PRs against the problem?

In fact, you are here in the dev forum. So, I suspect you have a more or less clear agenda that you want to achieve. Would you please share this so that others might (or might not) acknowledge and support your efforts?

chonta · August 17, 2023, 4:16pm

My hope is that the developers look into how the Nextcloud database works and hopefully see the problem.
Sometimes, when you work on something, you don’t see possible design mistakes.
Nextcloud is a fork of ownCloud, so ownCloud will have this problem as well.

But of course it is a problem for me: because of these structures I am running into problems while using Nextcloud.
It may sound like moaning and grousing, but it isn’t.

I want to understand why Nextcloud works the way it does and not the way I suggested.
Do the devs see the problems I described as problems, or do they think they are not a problem, and if so, why?

Example: 3,000,000 files and 1 user = 3,000,000 entries in the index, but every additional user who has access to the same data generates 3,000,000 more entries.
Is this considered unproblematic and fast? Think of 100 users or more.

Maybe there are reasons for how it is, other than that it was always like that.
Maybe it has to be like that because it is faster; if so, how?
Why does the Nextcloud server not push file changes automatically to all users who have access to a file when one user changes it, given that changes always go through the server?

And of course I would like to hear others’ opinions; maybe my view of things is wrong.
Sadly I don’t have programming knowledge, so I can’t contribute to the project.

kind regards chonta

chonta · September 18, 2023, 7:54am

Nobody has any position on the topic?
Is it a design flaw?
Is it not considered a problem?
No one thought it was a problem until now?
Nextcloud should not be used with this many files/users?
Is it worth looking into, to make Nextcloud an even greater product and usable on a large scale?
It is a problem but can’t be changed because of…

Some response would be great, to hear from someone with the technical knowledge what their point of view is.

Kind regards

Chonta

christianlupus · September 18, 2023, 11:19am

I will try to give you my personal experience and understanding. However, this is not a response from the core devs as you intended. I can ping someone but cannot promise any useful output.

If you were close to Berlin, you could have come these days and asked yourself…

Sort of. If you use Nextcloud only as a front end for the MS server, then yes, this is in fact just another way to access your files. Nextcloud is no black magic. What do you expect it to do? Maybe you need additional apps to satisfy your requirements?

I highly doubt that. It is true that the changes are not transmitted instantaneously. Once user A has updated the local file, the client will upload the file to the NC server. This will take a few seconds to minutes, depending on the size of the data to be transmitted.
Once the upload has completed, you can see the file in the web frontend. The client on user B’s machine checks the server for new files at regular intervals. It will again take some seconds to minutes to detect the changed file and download it.

If for some reason the sync is not working, there might be a problem with your configuration or your server, or a bug. It might be worth investigating.

Well, this is not a problem of Nextcloud; your setup simply has a certain size. So, it is the baseline for any further analysis.

What do you mean here? What data is not in sync?

I guess that you changed the data externally using other means on the MS server and expect NC to deliver the latest data as well, right?

The problem is the database and your usage of the system. You use the external storage. This is not really intended to allow for exchanging data in a live manner but to extend the storage to other partitions that are still completely under the control of Nextcloud.

The database in Nextcloud exists in order to speed up access to the files enormously. I have one example of an app that did not use the database but instead crawled the file system. I had complaints about long loading times and modified it to use a cache in the database. This allows scanning the filesystem at regular intervals and using the cached data to answer the simple calls by the users. The result: before the change, a request literally took minutes to load the app with huge data sets. After the caching, it went down to merely 1-2 seconds.
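The trade-off can be sketched as follows. This is a toy, in-memory stand-in for a database cache, assuming a periodic full crawl; the `DirectoryCache` class is invented for this illustration and is not the app's or Nextcloud's actual code.

```python
import os


class DirectoryCache:
    """Answer directory listings from memory; touch the file system only on refresh()."""

    def __init__(self, root: str) -> None:
        self.root = root
        self.listing: dict[str, list[str]] = {}

    def refresh(self) -> None:
        """Full crawl of the tree. Expensive; meant to run periodically in the background."""
        fresh: dict[str, list[str]] = {}
        for dirpath, dirnames, filenames in os.walk(self.root):
            fresh[dirpath] = sorted(dirnames + filenames)
        self.listing = fresh

    def list_dir(self, path: str) -> list[str]:
        """Cheap lookup. May be stale if files changed since the last refresh()."""
        return self.listing.get(path, [])
```

Reads become fast because `list_dir` never touches the disk, but anything changed on disk after the last `refresh()` stays invisible until the next crawl, which is the staleness discussed in this thread.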

You see the database (as a caching instrument) is useful in general. By the way, the common file systems have also some sort of database included to store the files. As these are much more efficiently implemented in the kernel, the speed is just higher but the problem is the same.

By changing the external storage outside the Nextcloud, you break caching. Depending on your settings, Nextcloud will scan regularly for changed files as a fallback in case something was changed. This is for sure not in real-time as the scanning takes a significant amount of time.

Again, this is your setup. As I understand it, you have one external storage location that is shared with all users in the instance, thus adding the 3,000,000 file cache entries for each new user. Nextcloud was originally developed as a way to store personal data. So, each user has his/her individual set of files, resulting in only a few table rows per new user (the welcome file etc).

There was an architecture established to register custom storage backends. In order to simplify this sort of integration, the amount of data to be exchanged and cached was minimized. One could think (I admit) that it might be useful to allow the individual external storages to handle their own cache. That way, the external 3,000,000 files would only generate one line each, independent of the number of users accessing the files. As this puts some burden on the (extension) devs, I suspect this was decided against or it was not thought through in detail at the time. Changing that is rather hard, as multiple integration apps will need updating in order to create a Storage with its own cache (maybe there is one, but I do not know of it).

External data should not be in that folder but just in the corresponding external folder. Otherwise it would be duplicated, wouldn’t it?

This command will enforce invalidation of the complete cache of the folder share. This will rescan the complete DB cache, as you have found, and fetch all data from the file system. You can restrict it further by adding more path entries, BTW.
It will terminate at some point, but this might take some time. For sure this is no solution for a live system, only a repair step.

I don’t know if I get your point here. There is already one entry in the table. It is the etag that defines the latest version known by the server. If a client asks for an entry, it passes the etags of the files it knows and only downloads what has been updated. If the NC server does not know about any changes, this is obviously impossible.

A file system does not even have such a thing. So going closer to a FS would force a download every time, if I get that correctly…
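The etag comparison described above can be sketched roughly like this. It is a simplified model, assuming the server and the client each hold a map from path to etag; the function name and the etag values are made up, and the real sync protocol is more involved.

```python
def files_to_download(server_etags: dict[str, str], client_etags: dict[str, str]) -> list[str]:
    """Return the paths whose etag on the server differs from what the client last saw."""
    return sorted(
        path
        for path, etag in server_etags.items()
        if client_etags.get(path) != etag
    )


# Hypothetical state: report.txt changed on the server, tool.exe did not.
server = {"report.txt": "etag-2", "tool.exe": "etag-7"}
client = {"report.txt": "etag-1", "tool.exe": "etag-7"}
print(files_to_download(server, client))  # ['report.txt']
```

If a file is changed behind the server's back and its etag in the cache is never updated, the two maps still agree and the client has nothing to download.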

Well, it is formally in O(n*m) with n the number of users and m the number of (shared) files. So it grows bilinearly, not exponentially.
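Spelled out, assuming every one of the n users gets a cache row for each of the m shared files (the symbol R for the row count is introduced here only for illustration):

```latex
R(n, m) = n \cdot m, \qquad R(n + 1, m) - R(n, m) = m, \qquad R(2n, m) = 2\,R(n, m)
```

With m = 3,000,000 and n = 24 this gives the 72,000,000 rows from the original post. Exponential growth would instead multiply the row count by a constant factor for every added user.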

All in all, I would summarize it as follows:

Your usage of external storage is not optimal; maybe an integration could help here (like inotify on Linux).
The data structure for external storages (and group storages) might be enhanced in terms of cache handling.
The missing pushes seem strange and need further investigation (on your side) to track down the problem.

chonta · September 18, 2023, 12:29pm

We are talking about text files or EXE files from 100 KB to 5 MB, so nothing really big.
Please, before doubting me and others who have the same problem and filed requests on GitHub that never got solved, try it for yourself in a test environment and prove me wrong.
I also offer a TeamViewer session so you can see the problem first hand.
For some reason, it always takes a PC restart to bring A and B to the same…

Yes, it is a problem for me and for everyone who has a high number of either users or files in Nextcloud and does not have the resources to solve it with hardware instead of design.
That is what I am trying to do. I am trying to raise awareness of design flaws (flaws in my opinion): it is not a good idea to have a separate file index for every user and also a separate search index for every user.
Or are there specific reasons for that?
Will a user automatically lose the full-text information about a file once he loses the right to read it?

Yes, the file was changed on the MS file server. Normally, if you use the web browser to access the share through Nextcloud, the changes should be refreshed, and the data and date should show the right timestamp.
On this client, the file index is sadly broken for some users and needs a reindex from time to time, and I mean a full reindex. Normally, accessing the directory with the web browser should be enough.

There is no data on the Linux file system! But all share paths are built like this.
“share” is the name of the external share that is shown in the user’s root after logging in to Nextcloud.

I know it will rescan all files the user has access to, and for me it is the only way to repair the file index without using a username and password, because here I don’t need a username and password and the stored login is used.
The problem: it takes 6 hours per user! And I need a script that validates all LDAP users and runs the job for them…

christianlupus:

chonta:

The oc_filecache is also the issue for Problem 1. There are no columns indicating who changed what and when, nor is there information about who has access to the file. If Nextcloud behaved more like a file system, these problems wouldn’t exist. The oc_filecache should include details about when a file was changed by whom, and whether a push signal to reload the file is necessary for all clients. One entry per file should be sufficient. The oc_filecache should only indicate which groups are allowed to read/write/delete/share.

I don’t know if I get your point here. There is already one entry in the table. It is the etag that defines the latest version known by the server. If a client asks for an entry, it passes the etags of the files it knows and only downloads what has been updated. If the NC server does not know about any changes, this is obviously impossible.

A file system does not even have such a thing. So going closer to a FS would force a download every time, if I get that correctly…

Where?
And even then, it is not enough.
The database needs information about who changed a file, when, and who needs to be informed.

Not working for my setup.

I hope there will be an investigation.
Thank you for your answer.

Kind regards Chonta

marcelklehr · September 18, 2023, 12:53pm

Without having been around at the time the file cache was introduced and without having much knowledge about its history or the rationale for it either, I would wager a guess that the goal was to have a flat table structure with all files in it that is fast to query (but potentially slower to update).

As the name says, it’s basically a cache. While having only one entry per physical file would be possible, this would involve resolving mounts and shares at query time, which the designers apparently deemed less performant. My guess is that the goal of the filecache table was to make a tradeoff to have really fast reads while pushing the burden to write operations.

chonta · September 18, 2023, 2:13pm

Why is one index for all slower than one index for every user?
In theory it is always the same index.
Whether or not the user has the right to read/write the file always needs to be resolved when the file is accessed.
The cache database only needs more tables to resolve access faster.
Looking at how file systems build their index and make access fast could help, I guess.

kind regards
Chonta

marcelklehr · September 18, 2023, 4:41pm

Ah, never mind, I just checked and some of my assumptions were wrong. I guess other people are better equipped to answer this.

anon75456558 · September 18, 2023, 6:27pm

Which assumptions? Curious what you found.

marcelklehr · October 15, 2023, 7:17pm

I was assuming that mounted storages are mapped in the file cache for each user, which they aren’t. As far as I can see, each storage is present only once in the file cache table. I guess the reason why @chonta has filecache entries per storage and user has something to do with the setup of the external storage. If you try to use the same credentials for all users, it might only create one storage for all users.
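A toy model of that guess: if cache rows are keyed by storage and path, and the storage identity depends on the login used for the mount, then per-user credentials yield one full set of rows per user, while shared credentials yield a single set. The identifier format, host name, and function names below are assumptions made up for this sketch, not Nextcloud's actual scheme.

```python
def storage_id(host: str, share: str, login: str) -> str:
    """Hypothetical storage identifier that includes the login used for the mount."""
    return f"{login}@{host}/{share}"


def cache_rows(mount_logins: list[str], files: list[str]) -> int:
    """Count distinct (storage, path) rows for several mounts of the same share."""
    rows = {
        (storage_id("fileserver", "share", login), path)
        for login in mount_logins
        for path in files
    }
    return len(rows)


files = ["a.txt", "b.txt", "c.txt"]
print(cache_rows(["alice", "bob"], files))  # 6: each login creates its own storage
print(cache_rows(["svc", "svc"], files))    # 3: shared login, one storage
```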

chonta · November 6, 2023, 9:59am

@marcelklehr

Thank you for your reply.
It is not possible to set up the external storage with one set of login data for everyone.
If I did so, I would have to recreate all the access rules from the file server in Nextcloud, and every change there would need to be reproduced in Nextcloud manually.

That is not manageable.
And every change to the files made through Nextcloud would also show the user of the share mount and not the user who made the change.

The only solution would be to have, no matter what, only one file cache for all users.
Access to the file itself is always checked on access, so that should not be a problem.

Kind Regards

Chonta