High CPU usage every 20 minutes

Hello,

Environment:
VM
Operating System (64-bit): Ubuntu 22.04.4 LTS
Database: PostgreSQL 12.14
Webserver: Apache 2.4 with php-fpm, plus a Redis server
PHP Runtime: 8.2
Version: 28.0.5 (updated from 27.1.9 to 28.0.5)
vCPU: 4 cores
RAM: 8 GB
HD: 40 GB (OS and apps) (58% used)
HD: 1 TB (data) (56% used)
HTTPS: CA GoDaddy
Users created: 85 (many accounts exist, but not all of them are used; actual access is low, often fewer than 1-3 users per day).

Hypervisor
VMware ESXi, 6.7.0, 17700523
Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz

Issue
Approximately every 20 minutes, the virtual machine consumes 100% of the CPU (all 4 vCPU cores), and for 2-3 minutes the machine stops responding (because it is pegged at 100%). After that, it works normally again (it's accessible, and I can download, upload, etc.).

Security & setup warnings

It’s important for the security and performance of your instance that everything is configured correctly. To help you with that we are doing some automatic checks. Please see the linked documentation for more information.

There are some warnings regarding your setup.
13 warnings in the logs since May 14, 2024
The PHP module “imagick” is not enabled although the theming app is. For favicon generation to work correctly, you need to install and enable this module. For more details see the documentation.

Log Reader
Message
User F11662B1-4ABA-4D5E-9F9B-D940665FC77F still has unscanned files after running background scan, background scan might be stopped prematurely

I ran this SQL to delete stale file cache entries:

-- remove filecache rows marked as unscanned (size < 0) that belong to a mounted storage
DELETE FROM oc_filecache
WHERE fileid IN (
    SELECT DISTINCT f.fileid
    FROM oc_filecache AS f
    INNER JOIN oc_mounts AS m ON m.storage_id = f.storage
    WHERE f.size < 0 AND f.parent > 1
);

Now, when I run the corresponding SELECT, it returns 0 rows.

And I ran this to scan (I got 5 errors due to locked files) (is it possible to unlock them and scan again?):

sudo -u www-data php occ files:scan --all
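
From what I have read, stale locks can be cleared manually; a sketch of that approach, assuming transactional file locking is stored in the database (with memcache.locking set to Redis, the locks would live in Redis instead) and a database named nextcloud:

# stop new locks from being taken while the old ones are cleared
sudo -u www-data php occ maintenance:mode --on
sudo -u postgres psql -d nextcloud -c "DELETE FROM oc_file_locks;"
sudo -u www-data php occ maintenance:mode --off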

I have stopped the cron job (commented out the line via crontab -u www-data -e).

And I tried changing Apache settings:

MaxRequestWorkers 10
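
A quick way to check which MPM that directive actually applies to (a sketch assuming the default Ubuntu packages; with php-fpm, the number of PHP workers is capped separately):

# which MPM is loaded? MaxRequestWorkers must be set in that MPM's config
apache2ctl -M | grep -i mpm

# PHP workers are limited by php-fpm itself, e.g. pm.max_children in
# /etc/php/8.2/fpm/pool.d/www.conf (path assumes the default Ubuntu layout)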

But I still have the same problem: after about 20 minutes the CPU usage rises to 100% and stays there for 2-3 minutes.

Any idea or something to check?

Regards.

What is actually creating this load, and why every 20 minutes?

To cleanup the filecache, I’d rather use the dedicated occ command:
https://docs.nextcloud.com/server/stable/admin_manual/configuration_server/occ_command.html#file-operations
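
For example, a minimal sketch (files:cleanup removes filecache entries that no longer have a matching storage):

sudo -u www-data php occ files:cleanup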

If you run the cronjob by hand, does it take very long and/or create errors? (sudo -u www-data php -f cron.php)
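
For example, a sketch assuming a default installation under /var/www/nextcloud:

# run the pending background jobs once, in the foreground, and time them
time sudo -u www-data php -f /var/www/nextcloud/cron.php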

Do you use the Redis cache? With that number of users you probably should; if not, please set it up, as it reduces the DB load a lot.
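
A sketch of the relevant settings, assuming the php-apcu and php-redis modules are installed and Redis listens locally on the default port (you can also edit config.php directly):

# local cache in APCu, distributed cache and file locking in Redis
sudo -u www-data php occ config:system:set memcache.local --value '\OC\Memcache\APCu'
sudo -u www-data php occ config:system:set memcache.distributed --value '\OC\Memcache\Redis'
sudo -u www-data php occ config:system:set memcache.locking --value '\OC\Memcache\Redis'
sudo -u www-data php occ config:system:set redis host --value 'localhost'
sudo -u www-data php occ config:system:set redis port --value '6379' --type integer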

Did you do anything about it?

You can use the occ background-job commands to see what jobs are loaded, scheduled, last ran.
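
For example (a sketch; background-job:list shows each job's class and when it last ran, and should be available on your release):

sudo -u www-data php occ background-job:list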

I would also lower your loglevel a bit and monitor your nextcloud.log during these periods of high CPU.
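
For example, a sketch (the log path assumes the default data directory):

# loglevel 1 = info (0 = debug, which is very noisy)
sudo -u www-data php occ log:manage --level=1

# follow the log while the CPU spike happens
sudo tail -f /var/www/nextcloud/data/nextcloud.log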

That said, it could be a million things. It also depends on which apps you have installed.

What does htop look like during these periods?

Hi @d2rkm4n

First, you should find out exactly which process is involved each time. You can of course use atop for that, or you can find out automatically with a small script that monitors the whole thing and logs it.

This is how I would do that:

Create this little monitoring tool; let's name it “cpumon”:

cpumon

#!/bin/bash

# Path to the log file
LOGFILE=/var/log/cpumon.log    # adapt to your needs
THRESHOLD=95                   # percent of total capacity; should be < 100
INTERVAL=10                    # seconds between each loop
CPU_MAX=$(( $(nproc) * 100 ))  # e.g. 400 on a 4-core machine

# Main monitoring loop
while true; do
    # Sum the %CPU column over all processes; printf "%d" truncates the sum
    # to an integer so the bash arithmetic below does not choke on decimals
    CPU_USAGE=$(top -b -n 1 | awk 'NR>7{s+=$9} END {printf "%d", s}')

    # If total CPU usage exceeds the threshold
    if (( CPU_USAGE > CPU_MAX * THRESHOLD / 100 )); then
        # Identify the process with the highest CPU usage
        TOP_PROCESS=$(ps -eo pid,%cpu,cmd --sort=-%cpu | head -n 2 | tail -n 1)
        # Write information to the log file
        echo "$(date +'%F_%T_%Z') - Total CPU usage is at $CPU_USAGE% caused by process: $TOP_PROCESS" >> "$LOGFILE"
    fi
    # A short pause to conserve resources
    sleep $INTERVAL
done

make it executable:

chmod +x "/path/to/cpumon"

… and start it as a daemon with the highest priority, so that it keeps logging, even if all other processes are frozen (as user root):

nohup nice -n -20 "/path/to/cpumon" &>/dev/null &

Now it will start logging the processes with the highest cpu usage every time the cpu usage exceeds the threshold.

And this is how you stop the little debugging logger again (as user root):

kill $(pgrep -f "cpumon");fg

PS: You can check that this tool works with stress (apt-get install stress). Generate the highest possible CPU load for 15 seconds:

stress --cpu $(nproc) --timeout 15s

I hope this inspires you.


Much and good luck,
ernolf

Thanks for your time @ernolf @jtr @tflidd. I restored the environment from the latest backup and then upgraded step by step to the latest version (27 → 28 → 29).
After that, the performance issue stopped.
