Some apps break when running multiple instances

I’ve been running NC just fine for several years now, backed by MariaDB and using Redis for distributed memcache and locking. The data and www-root directories are mounted over NFS. So far I have run a single instance of each, albeit on a Docker Swarm.

Today I tried to add some redundancy by adding a second replica of the NC server. At first everything appeared to work perfectly, as expected. But then I noticed that the Contacts and Calendar apps (and maybe others) fail to load their contents: I only get the header and sidebar, and a countdown starts to reload the browser tab, saying it failed to load the page.

This suggests to me that, despite the shared storage, DB and Redis cache, some essential piece of state is not properly shared among the instances. So when the round-robin balancing within Docker inevitably routes subsequent requests to a different instance, they fail to be validated and served.

When I reduce the instance count back to 1, everything works again.

What may I be missing in configuring my cluster?

With this information alone it is hard to debug. It could be the code or config of Nextcloud, or it could be something in your cluster setup.

Actually, the approach I would take is to first check whether, based on what I wrote about my setup, I fulfill all formal requirements of a clustered NC installation. I’m sure a lot of people have successfully configured such a system and would be able to pinpoint the obvious thing I forgot, before going into actually debugging the error itself.

Is there anything in the nextcloud.log in the data directory?
And do you have MariaDB/Redis replicas or just Nextcloud replicas (the latter would be a lot easier)?
Can all Nextcloud replicas reliably reach MariaDB and Redis?
Have you configured Redis to be used for Transactional File Locking (you should)?
Have you looked into temporarily forcing your load balancer to route all requests to the faulty instance, so you can debug it more reliably?
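For the reachability question, a quick check can be run from inside each Nextcloud container. This is only a sketch: the service names `mariadb` and `redis` and the default ports 3306 and 6379 in the usage note are assumptions, so substitute whatever your stack file uses. It only tests that a TCP connection opens, not that authentication works.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_backends(backends: dict) -> dict:
    """Map each backend name to whether its (host, port) is reachable."""
    return {name: can_connect(host, port) for name, (host, port) in backends.items()}
```

Calling `check_backends({"mariadb": ("mariadb", 3306), "redis": ("redis", 6379)})` in every replica should give `True` for both entries; a replica where either is `False` would be the one to look at first.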

Actually I haven’t checked the nextcloud log yet, as first I wanted to make sure my setup is in theory correct, to prevent putting unnecessary effort into something that isn’t even supposed to work in its current form.

I only replicated the NC server itself, as I assumed it to be a properly written and synchronized web app, with all the locking and whatnot that comes with it. All the other components currently only run as a single instance on my storage node.

Sure, NC integrates properly with the other services no matter which worker node it runs on, provided there’s only one instance. In fact, most of the NC functionality appears to work as intended even with multiple instances, except the apps I mentioned in the original post, and possibly some others I haven’t tested yet (again, to prevent unnecessary effort until it’s configured well in theory).

I’ve set Redis as my distributed cache and as file locking back-end.
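To double-check this from the running instance rather than from memory, the system config can be inspected programmatically. The sketch below assumes the JSON comes from `occ config:list system` and has a top-level `system` object; the helper names are hypothetical, and the localhost warning is a heuristic of mine, not an official requirement.

```python
import json

# Class name as it appears in config.php for the Redis backend.
REDIS_CLASS = "\\OC\\Memcache\\Redis"


def system_from_occ_json(text: str) -> dict:
    """Extract the 'system' section from JSON config output (assumed layout)."""
    return json.loads(text)["system"]


def check_cluster_config(system: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in ("memcache.distributed", "memcache.locking"):
        if system.get(key) != REDIS_CLASS:
            problems.append(f"{key} is {system.get(key)!r}, expected {REDIS_CLASS!r}")
    redis = system.get("redis")
    if not isinstance(redis, dict) or not redis.get("host"):
        problems.append("redis.host is not set")
    elif redis["host"] in ("localhost", "127.0.0.1"):
        # localhost resolves inside each container, so replicas would not share it.
        problems.append("redis.host points at localhost, which differs per replica")
    return problems
```

An empty result only confirms that the cache and locking settings are in place; it says nothing about other state that might be kept outside Redis.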

Like I said, there’s no faulty instance. I suspect that the errors stem from subsequent requests going to the other instance after the page itself has been loaded, with some state not being properly shared among the nodes, so that some access token fails to be verified, or something like that.
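One way to test this suspicion would be to repeat the same authenticated request many times and tally the outcomes: if the failures come from requests landing on the other replica, the results should split between success and failure instead of being uniform. A rough sketch, assuming you copy a valid session cookie from the browser; `make_fetch` and `tally_outcomes` are names I made up for illustration.

```python
from collections import Counter
from typing import Callable, Hashable
import urllib.error
import urllib.request


def tally_outcomes(fetch: Callable[[], Hashable], attempts: int) -> Counter:
    """Call fetch() the given number of times and count each distinct outcome."""
    return Counter(fetch() for _ in range(attempts))


def make_fetch(url: str, cookie: str) -> Callable[[], Hashable]:
    """Build a fetch function that returns (HTTP status, final URL) for one request."""
    def fetch() -> Hashable:
        req = urllib.request.Request(url, headers={"Cookie": cookie})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                # Redirects are followed, so the final URL shows e.g. a login page.
                return (resp.status, resp.url)
        except urllib.error.HTTPError as err:
            return (err.code, url)
    return fetch
```

With two replicas behind round-robin, a roughly even mix of two different outcomes would support the theory, while uniform results would point elsewhere.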

FWIW, I’d be happy with a simple link to an article that includes a guide and checklist for configuring NC to run distributed. I don’t recall this being discussed in depth in the installation manual.

There is in my opinion nothing conceptually wrong with your distributed setup, based on the information you provided.

Yes, it is. The php-fpm processes rely on nothing but the database (with proper transactions and locking), Redis and the data filesystem, and they don’t mind running on different nodes.

There are numerous results just a single internet search away.

It should be in the Nextcloud Enterprise Manuals, but those are not public, so I have not seen them.