Help me diagnose major performance issues with official helm chart

I’m running the official Helm chart, version 3.5.19 (Nextcloud 27.0.0), and I’m having some major performance issues. Here is my config:

NC: 27.0.0 - apache image

Kubernetes:

3-node Talos cluster running on Dell OptiPlex 7060s.
Kubernetes version 1.27.4
Talos Version 1.4.6

Each node has 32 GB of RAM and at least 6 CPU threads.

Networking is a 2.5 Gb NIC in each node, connected to a 10 Gb switch.

Storage is handled by Ceph, also running on the nodes, with NVMe drives as the backing disks. A PV for the Nextcloud data is mounted from Ceph.

The Problem:
Using either the official desktop client or the web UI is incredibly slow, with transfer rates of around 1-2 MB/s.

What I’ve done to troubleshoot:

Benchmarked Ceph: not the problem. The cluster can do 500 MB/s read and write (~2.5-3 Gbps).

Ran dd from within the Nextcloud container to the PV: not the problem. It can do 400+ MB/s read/write to the underlying PV.

I am using Traefik as a proxy, so I tried bypassing it and going straight to the container; that didn’t change anything. I tried increasing the PHP memory limit to 1 GB, but that made no difference. OPcache is enabled, as is Redis, and neither shows any problems.

Performance is absolutely terrible; it feels like I’m running it on a calculator. I’m leaning towards an Apache tuning issue, but I’m not sure how to adjust the Apache mpm_prefork settings via the official Helm chart. My best guess is sketched below.
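
To be clear, I haven’t verified this: I imagine the override would mean mounting a custom mpm_prefork.conf over the default one via the chart’s extraVolumes/extraVolumeMounts values (key names assumed from the chart’s values.yaml, and the worker numbers are placeholders):

# ConfigMap holding the prefork settings to try (numbers are placeholders):
apiVersion: v1
kind: ConfigMap
metadata:
  name: apache-mpm-prefork
data:
  mpm_prefork.conf: |
    <IfModule mpm_prefork_module>
        StartServers              10
        MinSpareServers           10
        MaxSpareServers           20
        MaxRequestWorkers        150
        MaxConnectionsPerChild  2000
    </IfModule>

# values.yaml: mount it over the file that mods-enabled symlinks to.
nextcloud:
  extraVolumes:
    - name: apache-mpm
      configMap:
        name: apache-mpm-prefork
  extraVolumeMounts:
    - name: apache-mpm
      mountPath: /etc/apache2/mods-available/mpm_prefork.conf
      subPath: mpm_prefork.conf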

I also have no apps enabled except for raw previews, preview generation, and brute-force protection. I can’t figure out what else to tune, as OPcache and PHP have already been enabled and adjusted. I am using HTTPS in front and can see that HTTP/2 is being used as well.

There is nothing in the logs to indicate an issue; it’s just SLOOOWWWWW. (I tested uploading both a single 100 MB zip file and syncing ~26 GB of JPG files using the desktop client.)

It is difficult to diagnose this properly when you have given no information about your Redis caching, PHP memory configuration, or database setup.

Without knowing any of these things, I would take a look at your database tuning and setup.
Every file uploaded, changed, moved, downloaded, or deleted through the Nextcloud clients (web, mobile apps, desktop apps, WebDAV, etc.) triggers database operations:
fingerprinting, metadata, tagging and indexing, plus Redis (if you are using Redis) indexing the file. If this part of your setup is not fully tuned and optimized, you will not be able to put all of those massive resources to proper use.
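
As a first sanity check on the database side, Nextcloud ships occ commands for the most common schema problems. Assuming occ works from inside your pod, something like this (standard occ maintenance commands, run as the web user; harmless if everything is already in place):

su -s /bin/bash www-data -c "php /var/www/html/occ db:add-missing-indices"
su -s /bin/bash www-data -c "php /var/www/html/occ db:add-missing-columns"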

And if you are uploading a huge number of files in a one-time operation, you might have a much better experience uploading them with rsync, scp, or cp, and then running occ files:scan --all (assuming you have configured your PHP CLI ini file accordingly and are using caching properly).
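
Roughly like this (run inside the Nextcloud container; the paths and the user name are placeholders):

# Copy the files straight into the user's data directory, fix ownership,
# then let Nextcloud index them into the database.
rsync -a /mnt/import/ /var/www/html/data/<user>/files/
chown -R www-data:www-data /var/www/html/data/<user>/files
su -s /bin/bash www-data -c "php /var/www/html/occ files:scan --all"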

You say HTTP/2 is in use; that means you are running PHP-FPM.

Is the FPM process manager tuned to create enough child processes?

For PHP 8.2, you can find where the PHP-FPM configuration file is located with php-fpm8.2 --test. Example:

php-fpm8.2 --test
[21-Jul-2023 09:00:00] NOTICE: configuration file /etc/php/8.2/fpm/php-fpm.conf test is successful

Now look for the configuration settings beginning with pm and tune them if necessary.
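
For example, a dynamic pool looks like this. The numbers are only illustrative; pm.max_children should be sized to your available RAM divided by the per-process PHP memory:

; /etc/php/8.2/fpm/pool.d/www.conf (illustrative values)
pm = dynamic
pm.max_children = 120
pm.start_servers = 12
pm.min_spare_servers = 6
pm.max_spare_servers = 24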

I hope this helps a little bit.

Good luck!

The default Helm chart for Nextcloud deploys a 3-node Redis cluster. The full details of what it deploys are:

Web server:
Apache

Redis:
3 node redis cluster

Database:
None (using my own PostgreSQL server hosted on bare metal)

Nextcloud:
Single pod with a PV mounted from Ceph storage.

Ceph storage has been tested at 400+ MB/s read/write from the Nextcloud pod as well as from the Ceph pods themselves.

Redis performance (from master node):

PING_INLINE: 78492.93 requests per second, p50=0.303 msec
PING_MBULK: 81433.22 requests per second, p50=0.303 msec
SET: 78492.93 requests per second, p50=0.351 msec
GET: 79936.05 requests per second, p50=0.303 msec
INCR: 75471.70 requests per second, p50=0.359 msec
LPUSH: 76745.97 requests per second, p50=0.367 msec
RPUSH: 75131.48 requests per second, p50=0.367 msec
LPOP: 77220.08 requests per second, p50=0.359 msec
RPOP: 74738.41 requests per second, p50=0.359 msec
SADD: 74294.21 requests per second, p50=0.311 msec
HSET: 76511.09 requests per second, p50=0.367 msec
SPOP: 78864.35 requests per second, p50=0.311 msec
ZADD: 78616.35 requests per second, p50=0.303 msec
ZPOPMIN: 80000.00 requests per second, p50=0.303 msec
LPUSH (needed to benchmark LRANGE): 77760.50 requests per second, p50=0.359 msec
LRANGE_100 (first 100 elements): 52301.26 requests per second, p50=0.463 msec
LRANGE_300 (first 300 elements): 25987.53 requests per second, p50=0.951 msec
LRANGE_500 (first 500 elements): 18271.51 requests per second, p50=1.351 msec
LRANGE_600 (first 600 elements): 15898.25 requests per second, p50=1.551 msec
MSET (10 keys): 62034.74 requests per second, p50=0.615 msec

dd from within Nextcloud to the PV it’s on:

root@nextcloud-7595cd594d-fjlr6:/var/www/html/data# dd if=/dev/zero of=./test.data bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 36.5622 s, 287 MB/s

I don’t think the Apache image uses fpm. I tried running that command:

/var/www/html/data# rm test.data
root@nextcloud-7595cd594d-fjlr6:/var/www/html/data# php-fpm8.2 --test
bash: php-fpm8.2: command not found

If you have HTTP/2 on Apache, you must have FPM: mod_http2 does not work with the prefork MPM, and mod_php requires prefork. So with no FPM, Apache itself cannot be serving HTTP/2.
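
You can verify which MPM and which modules Apache is actually running with the standard tooling:

# Which MPM is in use:
apache2ctl -V | grep -i -- MPM
# Whether mod_http2 or mod_php is loaded:
apache2ctl -M | grep -Ei 'http2|php'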

I highly doubt that the database is the problem, as this is the exact same database setup I used with a generic Nextcloud container. The only differences so far are:

1.) using an object storage PV instead of an NFS PV
2.) using the Helm chart with the Apache image vs. the generic image with nginx

I doubt 1.) is the cause, since benchmarking shows 400+ MB/s read and write speeds (I’ve even swapped out the drives Ceph uses for drives with PLP for better IOPS). I’m not entirely ruling it out, but given that Ceph benchmarks fine, I’m not convinced this is the issue.

As for 2.), could Apache be the issue? I did try the nginx+FPM image but didn’t see any major performance changes.

When I had a simple Nextcloud image on top of NFS-based storage, I was able to get 20-40 MB/s from the desktop client or a web upload, so I’m not entirely sure what else to check here (short of additional Apache settings). I can also try going back to the nginx+FPM image and tuning that to see if it yields better performance.

Sorry,

HTTP/2 terminates at Traefik, which then forwards plain HTTP to the container. Apache itself is not serving HTTP/2 (or TLS); only the Traefik front end is.
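
For what it’s worth, that’s easy to confirm with curl (the hostname and pod IP are placeholders):

# To Traefik: the first response line shows HTTP/2
curl -sI --http2 https://nextcloud.example.com/status.php | head -1
# Straight to the pod: plain HTTP/1.1
curl -sI http://<pod-ip>/status.php | head -1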

The only mods enabled in Apache in the Helm chart are:

lrwxrwxrwx 1 root root 36 Jul  4 13:43 access_compat.load -> ../mods-available/access_compat.load
lrwxrwxrwx 1 root root 28 Jul  4 13:43 alias.conf -> ../mods-available/alias.conf
lrwxrwxrwx 1 root root 28 Jul  4 13:43 alias.load -> ../mods-available/alias.load
lrwxrwxrwx 1 root root 33 Jul  4 13:43 auth_basic.load -> ../mods-available/auth_basic.load
lrwxrwxrwx 1 root root 33 Jul  4 13:43 authn_core.load -> ../mods-available/authn_core.load
lrwxrwxrwx 1 root root 33 Jul  4 13:43 authn_file.load -> ../mods-available/authn_file.load
lrwxrwxrwx 1 root root 33 Jul  4 13:43 authz_core.load -> ../mods-available/authz_core.load
lrwxrwxrwx 1 root root 33 Jul  4 13:43 authz_host.load -> ../mods-available/authz_host.load
lrwxrwxrwx 1 root root 33 Jul  4 13:43 authz_user.load -> ../mods-available/authz_user.load
lrwxrwxrwx 1 root root 32 Jul  4 13:43 autoindex.conf -> ../mods-available/autoindex.conf
lrwxrwxrwx 1 root root 32 Jul  4 13:43 autoindex.load -> ../mods-available/autoindex.load
lrwxrwxrwx 1 root root 30 Jul  4 13:43 deflate.conf -> ../mods-available/deflate.conf
lrwxrwxrwx 1 root root 30 Jul  4 13:43 deflate.load -> ../mods-available/deflate.load
lrwxrwxrwx 1 root root 26 Jul  4 13:43 dir.conf -> ../mods-available/dir.conf
lrwxrwxrwx 1 root root 26 Jul  4 13:43 dir.load -> ../mods-available/dir.load
lrwxrwxrwx 1 root root 26 Jul  4 13:43 env.load -> ../mods-available/env.load
lrwxrwxrwx 1 root root 29 Jul  4 13:43 filter.load -> ../mods-available/filter.load
lrwxrwxrwx 1 root root 30 Jul 11 03:24 headers.load -> ../mods-available/headers.load
lrwxrwxrwx 1 root root 27 Jul  4 13:43 mime.conf -> ../mods-available/mime.conf
lrwxrwxrwx 1 root root 27 Jul  4 13:43 mime.load -> ../mods-available/mime.load
lrwxrwxrwx 1 root root 34 Jul  4 13:43 mpm_prefork.conf -> ../mods-available/mpm_prefork.conf
lrwxrwxrwx 1 root root 34 Jul  4 13:43 mpm_prefork.load -> ../mods-available/mpm_prefork.load
lrwxrwxrwx 1 root root 34 Jul  4 13:43 negotiation.conf -> ../mods-available/negotiation.conf
lrwxrwxrwx 1 root root 34 Jul  4 13:43 negotiation.load -> ../mods-available/negotiation.load
lrwxrwxrwx 1 root root 26 Jul 10 22:11 php.load -> ../mods-available/php.load
lrwxrwxrwx 1 root root 31 Jul 11 03:24 remoteip.load -> ../mods-available/remoteip.load
lrwxrwxrwx 1 root root 33 Jul  4 13:43 reqtimeout.conf -> ../mods-available/reqtimeout.conf
lrwxrwxrwx 1 root root 33 Jul  4 13:43 reqtimeout.load -> ../mods-available/reqtimeout.load
lrwxrwxrwx 1 root root 30 Jul 11 03:24 rewrite.load -> ../mods-available/rewrite.load
lrwxrwxrwx 1 root root 31 Jul  4 13:43 setenvif.conf -> ../mods-available/setenvif.conf
lrwxrwxrwx 1 root root 31 Jul  4 13:43 setenvif.load -> ../mods-available/setenvif.load
lrwxrwxrwx 1 root root 29 Jul  4 13:43 status.conf -> ../mods-available/status.conf
lrwxrwxrwx 1 root root 29 Jul  4 13:43 status.load -> ../mods-available/status.load

dpkg -l shows no “fpm” packages (php8.0-fpm for example)

sites-enabled/000-default.conf contents:

root@nextcloud-7595cd594d-fjlr6:/etc/apache2/sites-enabled# cat 000-default.conf  |grep -v "#"
<VirtualHost *:80>

        ServerAdmin webmaster@localhost
        DocumentRoot /var/www/html


        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined

</VirtualHost>

I may have figured it out. I have 2 Gbps internet (2.3 Gbps, actually), and I noticed when I ran a speed test that I was only getting 800 Mbps down and 60 Mbps up from my desktop. I tried from my firewall and got 2.3 Gbps both ways. I ran some iperf tests between k8s pods as well as between physical hosts and got abysmal results there too. I decided to reboot my core switch, and now I’m suddenly getting ~18-30 MB/s, about what I expect.
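
For anyone hitting something similar, the tests that exposed it were along these lines (iperf3 assumed installed on both ends; the host is a placeholder):

# On one node/pod:
iperf3 -s
# From the other end; throughput was far below line rate before the
# switch reboot, and back to normal after:
iperf3 -c <other-host> -t 10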

As a test, I had also redeployed my previous Nextcloud install, and that had suddenly become abysmally slow as well.

Thanks for all the replies. I’m not sure what the issue was with the switch, but iperf is a lot better now, as is the speed test, as is Nextcloud, so I’m marking this resolved.