Updating from 28.0.0 to 28.0.1 crashloops

I installed Nextcloud on my Kubernetes cluster via the Helm chart, version 4.5.9. I have an external Postgres cluster and Redis cluster, and it was all running very well.

Then I wanted to upgrade from 28.0.0 to 28.0.1. As per the Docker upgrade instructions, I swapped out the old image and pulled the new one. I did this by editing the chart values, changing image.tag to 28.0.1-apache, and redeploying.
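For reference, the tag change can be applied in a single helm upgrade without touching the rest of the deployed values. The release name nextcloud and the repo alias nextcloud/ are assumptions here; substitute your own:

```shell
# Bump only the image tag, keeping all other deployed values.
# Release name and repo alias are placeholders for this example.
helm upgrade nextcloud nextcloud/nextcloud \
  --reuse-values \
  --set image.tag=28.0.1-apache
```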

Now the pod never comes up, and in the nextcloud container logs, all I see is that it hangs on this:

Initializing nextcloud ...
2023-12-23T10:22:32.170048346+10:30 Upgrading nextcloud from ...

The pod will sit like this for a few minutes before Kubernetes kills it and it starts crashlooping. I examined nextcloud.log and nothing of interest is there, apart from a few random failed login attempts.

I’m not sure what to do here. The lack of information from the logs has left me stuck and help would be appreciated.

Same here when pulling the apache:latest tag. The update gets stuck while creating the container (K8s). Maybe it's related to Docker or a broken image?


Maybe you need to increase both of your initialDelaySeconds parameters?

[1] Can't upgrade Nextcloud from chart version 4.5.3 to 4.5.5 · Issue #493 · nextcloud/helm · GitHub
[2] https://github.com/nextcloud/helm/blob/main/charts/nextcloud/README.md#configuration
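If it helps, the chart exposes the probe settings at the top level of its values (per the README linked above; key names assumed unchanged in your chart version), so raising the delays would look roughly like this:

```yaml
# values.yaml fragment -- probe keys as documented in the chart README.
livenessProbe:
  enabled: true
  initialDelaySeconds: 600   # give the entrypoint upgrade time to finish
readinessProbe:
  enabled: true
  initialDelaySeconds: 600
```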

Thanks for replying, I will take a closer look at both of those links.

After looking around for a bit to find more clues, I noticed the pod(s) return exit code 137, which appears to be due to memory consumption. However, no memory limits are set and the assigned node doesn't seem to be under any memory pressure. The pod also only seems to peak at around 137 MB before getting killed.

I will try raising the initial delay too.
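Worth noting: exit code 137 is 128 + 9, i.e. the container's main process was killed with SIGKILL. You get that both from a kernel OOM kill and from the kubelet killing the container after a failed liveness probe, so 137 alone doesn't prove memory pressure. The pod events and last container state distinguish the two (pod name below is a placeholder):

```shell
# Exit codes above 128 encode a fatal signal: code = 128 + signal number,
# so 137 = 128 + 9 (SIGKILL).
# Check whether the kill was OOM or probe-related:
kubectl describe pod nextcloud-0 | grep -E "OOMKilled|Liveness|Reason"
kubectl get events --field-selector involvedObject.name=nextcloud-0
```

An OOM kill shows Reason: OOMKilled in the last state; a probe kill shows "Liveness probe failed" events instead.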

I tried setting an initial delay of 10 minutes on the liveness and readiness probes, which should be plenty of time.

The pod just continues to hang on Upgrading nextcloud from ... seemingly forever. More verbose output would be helpful.

It sounds like it’s either stalling or taking a very long time at the rsync step in the image entrypoint.sh.

What’s the underlying persistent storage? Is it NFS by chance?
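One way to check the rsync theory is to exec into the still-initializing container and look for a running rsync process. Pod and container names here are assumptions, and ps may not be installed in the image, in which case /proc is a fallback:

```shell
# Look for the rsync the entrypoint runs during an upgrade.
# Pod/container names are placeholders; adjust to your release.
kubectl exec nextcloud-0 -c nextcloud -- ps aux | grep '[r]sync' \
  || kubectl exec nextcloud-0 -c nextcloud -- sh -c \
       'grep -l rsync /proc/[0-9]*/cmdline 2>/dev/null'
```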

No, it's Longhorn. However, there are NFS volumes for external storage mounted to that pod. I have an NFS server that I would've liked to use for the data volume, but it never ended up working, so I ditched that idea a while ago.

I think I fixed the problem. I was under the impression that, after pulling the new image, the entrypoint script would manage the upgrade by itself without user intervention, but it looks like that wasn't the case.

After setting the initialDelaySeconds to a long value so Kubernetes wouldn't kill the pod, I tried opening the web interface in the browser and found that the instance was in maintenance mode.

So I shelled into the nextcloud container, set maintenance mode to off, and reloaded the page; it was awaiting further upgrade steps. After completing those steps and letting the upgrade finish, everything was up and running again.
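For anyone following along, the manual steps inside the container look roughly like this. The pod name is a placeholder, and occ must run as the web server user (www-data in the official image); the browser-based steps can also be done entirely from the CLI with occ upgrade:

```shell
# Exec into the nextcloud container (pod name is a placeholder):
kubectl exec -it nextcloud-0 -c nextcloud -- bash

# Inside the container, as the web server user:
su -s /bin/bash www-data -c "php occ maintenance:mode --off"
# Or run the remaining upgrade steps from the CLI instead of the browser:
su -s /bin/bash www-data -c "php occ upgrade"
```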

The reason for the crashlooping was probably that, while in maintenance mode, Nextcloud returns a 503 status code at the base URL, which caused the liveness/readiness probes to fail.
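This is easy to verify by curling the base URL while maintenance mode is on; anything outside 2xx/3xx fails an HTTP probe. The hostname below is a placeholder:

```shell
# Print only the HTTP status code for the base URL.
# Returns 503 while maintenance mode is enabled, 2xx/3xx otherwise.
curl -s -o /dev/null -w '%{http_code}\n' http://nextcloud.example.com/
```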