Nextcloud AIO becomes unresponsive even though all containers are healthy

Description:
I have recently started experiencing a strange issue with my Nextcloud AIO installation. Roughly every 6 to 8 hours, the entire instance becomes completely unresponsive — the web interface doesn’t load, the desktop client cannot connect, and the Talk client stops working as well.

However, in Portainer all AIO containers still show the status “Healthy”, as if everything was running normally.

At first, I thought the issue was caused by the Nextcloud Mail app, so I disabled it, but the problem continues to occur even with Mail disabled.

At the moment, the only way I can temporarily restore functionality is by restarting the Docker container nextcloud-aio-nextcloud. After that, everything works again for several hours before the issue reappears.

I have not been able to determine what exactly causes this situation or where to find logs that would explain it.

I would like to ask:

  1. Are there any known reasons why the entire AIO instance could appear “healthy” but be completely dead from the user’s perspective?
  2. What is the best way to diagnose this kind of issue?
  3. Which logs should I check, and where exactly can I find them within the AIO setup?
  4. Is there any recommended way to fix or recover the instance permanently, instead of having to restart the Nextcloud container manually?
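For reference, this is how I have been checking things so far, using the default AIO container names; I am not sure whether these are the right places to look:

```shell
# List the AIO containers and their reported status
sudo docker ps --filter "name=nextcloud-aio" --format 'table {{.Names}}\t{{.Status}}'

# Tail the main Nextcloud container's stdout/stderr with timestamps
sudo docker logs --tail 200 --timestamps nextcloud-aio-nextcloud

# Ask Nextcloud itself where its application log lives
sudo docker exec --user www-data nextcloud-aio-nextcloud php occ log:file
```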

Environment:

  • Nextcloud AIO: 11.9.0 and also 11.10.0
  • Nextcloud Server: 31.0.9
  • OS: Ubuntu Server 24.04 LTS (latest updates)
  • Virtualized on: Proxmox VE 9
  • Reverse proxy: Nginx Proxy Manager
  • VM resources: 12 vCPUs and 16 GB RAM

Original topic here - Nextcloud AIO becomes unresponsive even though all containers are healthy · nextcloud/all-in-one · Discussion #6996 · GitHub

Hi, have you checked the server resource usage with htop for example once this happens?

@szaimen, thanks for your response.

I have checked with btop:

  • CPU usage: 3–6%
  • RAM: over 8 GB free

I also checked docker stats.

The VM runs on an NVMe drive, so there are no disk I/O issues.
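For completeness, the snapshot I take when it freezes looks roughly like this (a sketch; the PHP-FPM check assumes the default AIO layout, where nextcloud-aio-nextcloud runs the PHP-FPM workers):

```shell
# One-shot per-container resource snapshot while the instance is frozen
sudo docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}'

# If the web UI hangs while containers stay "healthy", exhausted PHP-FPM
# workers are a common culprit; count them inside the Nextcloud container
# (only works if ps is available in the image)
sudo docker exec nextcloud-aio-nextcloud ps -ef | grep -c php-fpm
```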

Hm… Can you post the output of sudo docker info here?

Here you are:

Client: Docker Engine - Community
 Version:    28.5.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.29.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.40.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 25
  Running: 25
  Paused: 0
  Stopped: 0
 Images: 35
 Server Version: 28.5.1
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 CDI spec directories:
  /etc/cdi
  /var/run/cdi
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: b98a3aace656320842a23f4a392a33f46af97866
 runc version: v1.3.0-0-g4ca628d1
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-85-generic
 Operating System: Ubuntu 24.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 14.64GiB
 Name: nextcloud-aio
 ID: 4a4be920-ff83-4d48-bdbc-f4cd6ac04a2f
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Live Restore Enabled: false

I’m not really sure what the cause is, and this is just a wild guess, but I would try uninstalling ClamAV and see if the problem still occurs.
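If you want to rule it out without removing the container right away, you could also try disabling the antivirus integration on the Nextcloud side first; a sketch, assuming the app id is files_antivirus (the app that AIO's ClamAV option installs):

```shell
# Temporarily disable the antivirus integration
# (re-enable later with occ app:enable files_antivirus)
sudo docker exec --user www-data nextcloud-aio-nextcloud php occ app:disable files_antivirus
```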


During the time when Nextcloud was down, I checked the logs of the individual Nextcloud containers through Portainer. Unfortunately, I didn’t find anything that would point to an actual issue.
I was worried it could be caused by a sudden RAM spike, but both the Proxmox graphs and Grafana show that RAM usage stays constant without any sudden jumps.

I’ll wait for the next crash and might temporarily disable the ClamAV container.
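For the next crash, it might also help to grep the raw application log for warnings and errors instead of scrolling through Portainer. A sketch, assuming the log sits at the default path (adjust to whatever `occ log:file` prints; in Nextcloud's JSON log format, level 2 = warning, 3 = error, 4 = fatal):

```shell
# Print the actual log location first
sudo docker exec --user www-data nextcloud-aio-nextcloud php occ log:file

# Then filter the log for warnings, errors, and fatals
sudo docker exec nextcloud-aio-nextcloud sh -c \
  'grep -E "\"level\":[234]" /var/www/html/data/nextcloud.log | tail -n 50'
```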


Just a quick addition as to why I think it could be ClamAV.

While I didn’t experience the exact same issue, I can confirm that ClamAV can impact performance and cause lock-ups. I’m currently testing AIO and experienced a similar issue when uploading large files, as the chunks were being reassembled. The reassembly took forever, and everything stopped responding. Without ClamAV, however, I was able to upload the same 5 GB file with no issues, and the reassembly took only a few seconds.

By the way, the next thing I’d look at, if it wasn’t ClamAV, would be the Fulltext Search.


It could just be a coincidence causing all this. I often upload large files myself, but I haven’t noticed any issues so far.
Maybe it’s because my Nextcloud server has enough performance and hardware resources, so the problem doesn’t show up that clearly.

Ironically, this issue happens kind of “randomly.” It doesn’t matter whether it’s during synchronization or when the system is “idle”.

I’m definitely curious to see what you find out about the Fulltext Search.

I want to add a larger context to this issue, because the instability started months ago, already on Nextcloud AIO 31, long before the recent changes.

At that time, @scubamuc and I created a federated cloud connection between our instances.
My instance runs Nextcloud AIO, his instance runs Nextcloud SNAP.
After that, we also created a federated Talk connection between the two servers.

Since that moment the problems described above started to appear — the AIO interface and the entire Nextcloud became unresponsive, even though all containers stayed healthy. It happened repeatedly, sometimes twice per hour, sometimes 5+ times per day, always without any logs indicating the root cause.

As a test, we removed the federated connection between our instances.
Right after that, the situation improved significantly — my AIO instance suddenly ran 24 hours without a single freeze.

Later I upgraded to AIO 32, but the symptoms still persist.
After the upgrade we re-connected our instances through federation again, and the freezes started to appear the same way as before — no logs, all containers healthy, but the entire AIO stack becoming unresponsive.

Because of this pattern, my current suspicion is that the instability might be caused by some incompatibility between AIO and SNAP in the federated stack, either:

  • the standard federated cloud connection

  • or the federated Talk integration

  • or some background process related to federation between two different distributions (AIO vs SNAP)

This is only a hypothesis, because no logs reveal anything useful — same as previously noted above.
I would appreciate it if someone could comment on whether this scenario is realistically possible, or whether the federation layer between AIO and SNAP works differently and this connection should not be able to cause such behaviour.

I would not expect any incompatibility based on the installation method, but from your writing, it seems to be related to federation at least. Are you still able to reproduce the issue on your instance?

Can you share some details here? What exactly did you do?

Experience from a good while ago, just in case it helps: I had two bare-metal Nextcloud installations, both on the same version. I connected one to the other via the federation feature. What I experienced is: as soon as one of the two went offline or had connection issues, the other instance stopped working as well. It looked like long-running directory-listing queries. That's why I disconnected the two instances again and now use them separately. I was looking for a synchronisation solution to implement a hot-standby backup, so the federation feature didn't help me anyway.


The problem still persists.

This is where I had added the federated server before, and where I enabled Talk federation.

Both of these functions are disabled now.

I didn’t experience any problems before turning on Federation for the first time. Turning it on changed something that affected the rest of the functionality.
Disabling both features only alleviated the symptoms but didn’t eliminate them.

Note: Nextcloud version 32 (currently 32.0.2) was a fresh install, not an upgrade from version 31

confirmed, tested federation between various snap instances (fresh install & running instances) without issues. but adding trusted servers takes ages to turn green… once green, things seem to be okay. but when yellow, servers going away on either side causes high resource usage sometimes leading to system freeze. nc-logs remain empty, syslogs remain empty.

talk federation seems to be working fine.

Note: Nextcloud version 32 (currently 32.0.2snap1) snap auto-update, running since version 9/stable


I still think there is some leftover problem from the federation that didn't get turned off even after federation was disabled; its state was never erased or reset.

Because before, it worked for 2 years without a single crash or malfunction. Everything started when the federation was turned on.

But it is only my suspicion.

I was wondering if there’s some kind of background job added, but I haven’t had time to check.
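One low-risk thing to try for leftover federation state would be cleaning up orphaned remote storages; a sketch, assuming the standard occ command from the files_sharing app (it deletes remote storage entries that no longer have a matching federated share, so take a backup first):

```shell
# Remove remote storage entries without a matching federated share
sudo docker exec --user www-data nextcloud-aio-nextcloud php occ sharing:cleanup-remote-storages
```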


It would be great if you could find the time to take a look. :wink:

Not sure if I find the time :slight_smile: But since it is still reproducible for you, just some ideas:

  • Check oc_jobs if there are jobs with high execution_duration
  • Try to disable app cloud_federation_api
  • Check table oc_trusted_servers for any old entries

Just off the top of my head.
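For the oc_jobs and oc_trusted_servers checks, read-only queries against the AIO database container could look like this (database name, user, and container name are assumptions from a default AIO setup; verify the database name with `occ config:system:get dbname` first):

```shell
# Top 10 background jobs by recorded execution duration (SELECT only)
sudo docker exec nextcloud-aio-database psql -U nextcloud -d nextcloud_database \
  -c "SELECT id, class, last_run, execution_duration FROM oc_jobs ORDER BY execution_duration DESC LIMIT 10;"

# Any stale trusted servers left over from the removed federation?
sudo docker exec nextcloud-aio-database psql -U nextcloud -d nextcloud_database \
  -c "SELECT id, url, status FROM oc_trusted_servers;"
```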

Thanks for the suggestions.
I have now disabled the cloud_federation_api app and will observe if the freeze happens again.

The other checks you mentioned (oc_jobs execution_duration and oc_trusted_servers) are a bit outside my comfort zone, as I’m not an IT professional and I’m trying to avoid touching the database directly.

If there is a simple or safe way to check those two points (read-only or via occ, without manual SQL), I’d really appreciate some guidance.

I’ll report back with results from disabling federation.
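In the meantime, the occ-only checks I found that stay away from the database look like this (treat this as a sketch; command availability depends on the Nextcloud version):

```shell
OCC="sudo docker exec --user www-data nextcloud-aio-nextcloud php occ"

# Which federation-related apps are still enabled?
$OCC app:list | grep -iE 'federat|cloud_federation'

# Background jobs whose class mentions federation
# (needs a Nextcloud version that ships background-job:list)
$OCC background-job:list | grep -i federat
```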

I am not sure; the jobs one might be possible (occ background-job:list or something), though I'm not sure it shows the execution time.
Would you be comfortable running SELECTs on the database if they are provided?