I’ve been running NC just fine for several years now, backed by MariaDB and using Redis for distributed memcache and locking. The data and www-root directories are mounted over NFS. So far I have run a single instance of each, though on a Docker Swarm.
Today I tried to add some redundancy, so I went ahead and added a second replica of the NC server. At first, everything appeared to work perfectly, as expected. But then I noticed that the Contacts and Calendar apps (and maybe others) fail to load their contents: I only get the header and sidebar, and a countdown starts to reload the browser tab, saying it failed to load the page.
This suggests to me that, despite the shared storage, DB and Redis cache, some essential piece of state is not properly shared among the instances, so when the round-robin balancing within Docker inevitably routes subsequent requests to a different instance, they fail to be validated and served.
When I reduce the instance count back to 1, everything works again.
Before debugging the error itself, I would first like to check whether, based on what I wrote about my setup, I fulfill all formal requirements of a clustered NC installation. I’m sure a lot of people have successfully configured such a system and would be able to pinpoint whatever obvious thing I forgot.
Is there anything in the nextcloud.log in the data directory?
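If the log is large, a small filter helps to see only the serious entries. This is a minimal sketch, assuming the default setup where nextcloud.log contains one JSON object per line with `level`, `time`, `app` and `message` fields, and where level 3 means error and 4 means fatal. The function name `errors_from_log` is made up for this example.

```python
import json


def errors_from_log(lines, min_level=3):
    """Yield (time, app, message) for log entries at or above min_level.

    Assumes one JSON object per line; lines that do not parse are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line, ignore it
        if not isinstance(entry, dict):
            continue
        if entry.get("level", 0) >= min_level:
            yield entry.get("time"), entry.get("app"), entry.get("message")


def print_errors(path):
    """Print the filtered entries of the log file at `path`."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for time, app, message in errors_from_log(fh):
            print(time, app, message)
```

Call `print_errors()` with the path to nextcloud.log in your data directory right after reproducing the failure with two replicas, so the relevant entries are at the end of the output.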
And do you have MariaDB/Redis replicas or just Nextcloud replicas (the latter would be a lot easier)?
Can all Nextcloud replicas reliably reach MariaDB and Redis?
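One rough way to verify reachability is a plain TCP connect test run from inside each replica's container (or from each worker node). The sketch below uses only the Python standard library. The host names `mariadb` and `redis` are placeholders for whatever your services are called, and 3306 and 6379 are just the default ports. A successful connect only proves the port is open, not that authentication works.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


# Hypothetical service names; adjust to your stack. Ports are the defaults.
TARGETS = {
    "mariadb": ("mariadb", 3306),
    "redis": ("redis", 6379),
}


def check_all(targets=TARGETS):
    """Map each target name to whether it is currently reachable."""
    return {name: can_connect(host, port) for name, (host, port) in targets.items()}
```

Running `check_all()` on every node that hosts a replica should return `True` for both entries. If one node reports `False`, that replica is the one to look at.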
Have you configured Redis to be used for Transactional File Locking (you should)?
Have you looked into forcing your load balancer to route all requests to the faulty instance temporarily so you can debug it more reliably?
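If pinning the load balancer is awkward, another option is to bypass it and talk to each replica directly. The following is only a sketch under several assumptions: it runs in a container attached to the same overlay network, the service is named `nextcloud`, the replicas serve plain HTTP on port 80, and `cloud.example.com` stands in for one of your trusted domains. It relies on Docker Swarm's `tasks.<service>` DNS name, which resolves to the individual task IPs instead of the virtual IP, and on Nextcloud's `status.php` endpoint.

```python
import json
import socket
import urllib.request


def replica_ips(service, resolve=socket.getaddrinfo):
    """Return the sorted, de-duplicated task IPs behind tasks.<service>."""
    infos = resolve(f"tasks.{service}", 80, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def fetch_status(ip, host_header, timeout=5.0):
    """Fetch and decode /status.php from one replica, bypassing the balancer."""
    request = urllib.request.Request(
        f"http://{ip}/status.php",
        headers={"Host": host_header},  # hypothetical trusted domain
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def compare_replicas(service="nextcloud", host_header="cloud.example.com"):
    """Print what every replica reports about itself."""
    for ip in replica_ips(service):
        print(ip, fetch_status(ip, host_header))
```

If every replica answers consistently on its own, the problem is more likely in state that spans requests than in any single instance.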
Actually I haven’t checked the nextcloud log yet, as first I wanted to make sure my setup is in theory correct, to prevent putting unnecessary effort into something that isn’t even supposed to work in its current form.
I only replicated the NC server itself, as I assumed it to be a properly written and synchronized web app, with all the locking and whatnot that comes with it. All the other components currently only run as a single instance on my storage node.
Sure, NC connects to both properly no matter which worker node it runs on, provided there’s only one instance. In fact, most of the NC functionality appears to work as intended even with multiple instances, except for the apps I mentioned in the original post, and possibly some others I haven’t tested yet (again, to prevent unnecessary effort until it’s configured well in theory).
I’ve set Redis as my distributed cache and as file locking back-end.
Like I said, there’s no faulty instance. I suspect that the errors stem from subsequent requests going to the other instance after the page itself has been loaded: some state fails to be properly shared among the nodes, so an access token fails to be verified, or something like that.
FWIW, I’d be happy with a simple link to an article which includes a guide and check-list for configuring NC to run distributed. I don’t recall this being discussed in-depth in the installation manual.