Background Job ScanFiles triggered via cron.php breaks my system

h3rb3rt · October 23, 2020, 6:00am

Hi,

I have realized that my cron is taking quite some time to finish. After some investigation I found that the job OCA\Files\BackgroundJob\ScanFiles is triggered via the oc_jobs table in my database. The job does not run every time cron is started (which would be every 5 minutes), but it runs at least twice a day.

Why? I mean, why is it scanning all my files for new or changed files? My server hosts around 1.5 TB of data, so this takes forever, and I do not see the point of it.
I would understand it if I had external storage activated (I had previously, btw), or if I uploaded files without using any NC interface (web, client, WebDAV), but this is not the case.

Biggest concern or unanswered question
WHY is OCA\Files\BackgroundJob\ScanFiles even necessary, and why does it scan through all files, even the previews?

Thanks for any help and advice. Perhaps it is a leftover from the times I used the external storage app.

Update 03.11.2020
Some links, somehow related

Update 20.11.2020
strace -p PID helped me to figure out what the cron is doing in the background. For me there were two issues I could pinpoint.

One was the problem of scanner.php not being able to follow symlinks, which happens if you put a Windows backup into an NC directory.
- https://github.com/nextcloud/server/issues/22846
- The bug is not yet resolved, but you can easily fix it yourself by applying https://github.com/nextcloud/server/pull/21723
The second problem was the scan of my appdata/preview folder. I have many pictures, and I have been using the NC instance for many years already. Long story short: over 3 500 000 entries in the oc_filecache table for the previews alone.
- I deleted the preview folder and the entries in the database based on “Remove preview files without occ and without posix”
- A tip from my side: if you do so and have a huge preview folder, rename it before you start deleting it, because the deletion takes forever. The process would be:
  1. Put your instance in maintenance mode
  2. Rename the appdata/preview folder to preview_old
  3. Start deleting the folder preview_old
  4. Delete the table entries using DELETE FROM oc_filecache WHERE path LIKE '%appdata_%/preview/%'
  5. After the table is clean, you can switch off maintenance mode. A new preview folder will be created while the old one is still being deleted in parallel.
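
The steps above can be sketched as a small shell script. This is only a sketch under assumptions: the install path, the data directory, the appdata_xxx folder name, the database name and the helper names run and cleanup_previews are placeholders that do not come from this thread, and mysql is assumed to find its credentials on its own (for example in ~/.my.cnf). With DRY_RUN=1, the default here, it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: every path and name below is an assumption, adjust to your setup.
NC_DIR="${NC_DIR:-/var/www/nextcloud}"                     # assumed install path
APPDATA="${APPDATA:-/var/www/nextcloud/data/appdata_xxx}"  # appdata_xxx is a placeholder
DB_NAME="${DB_NAME:-nextcloud}"                            # assumed database name
DRY_RUN="${DRY_RUN:-1}"                                    # 1 = only print the commands

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper that walks through the five steps.
cleanup_previews() {
    # 1. Put the instance in maintenance mode.
    run sudo -u www-data php "$NC_DIR/occ" maintenance:mode --on
    # 2. Rename the preview folder; a rename is fast even for millions of files.
    run mv "$APPDATA/preview" "$APPDATA/preview_old"
    # 4. Remove the matching rows (oc_ is the default table prefix).
    run mysql "$DB_NAME" -e "DELETE FROM oc_filecache WHERE path LIKE '%appdata_%/preview/%'"
    # 5. Leave maintenance mode; Nextcloud creates a fresh preview folder.
    run sudo -u www-data php "$NC_DIR/occ" maintenance:mode --off
    # 3. The slow part: deleting the renamed folder. It is placed last here
    #    because, once renamed, it no longer blocks the instance.
    run rm -rf "$APPDATA/preview_old"
}
```

Calling cleanup_previews with the defaults prints the five commands for review; setting DRY_RUN=0 would execute them, which is destructive and should only be done with a backup at hand.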

h3rb3rt · October 27, 2020, 11:26am

Can someone tell me why this job is necessary? Even without the external storage app in use?

h3rb3rt · November 2, 2020, 6:59am

I need to push this again. Someone must know whether this is normal, whether this job should be in the database for a standard installation, or whether it is a leftover in the database from an uninstalled app.
With around 2 TB of files, this takes ages to finish and results in weird NC behavior.

h3rb3rt · November 3, 2020, 7:01am

Running occ files:scan --all finishes within 30 minutes and without any error. The oc_jobs table tells me that the cron, even after I fixed some issues I had with occ files:scan, still takes several hours to finish.

Is there a way to figure out what cron.php is doing for hours, even days?
I also tried to enforce that only one cron job runs at a time, for example using flock, but this stops crons from running for days.
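
A minimal sketch of one way to keep flock from blocking cron for days: pair it with a time limit, so a stuck run is stopped and releases the lock. Combining flock with timeout is an assumption of this sketch, not something that was tried in this thread, and run_locked, the lock file path and the 50m limit are hypothetical.

```shell
#!/bin/sh
# Sketch only: run_locked is a hypothetical helper, not part of Nextcloud.
# Usage: run_locked LOCKFILE LIMIT COMMAND [ARGS...]
run_locked() {
    lockfile=$1
    limit=$2
    shift 2
    # flock -n: give up immediately (exit status 1) if another run holds the lock.
    # timeout: stop the command after LIMIT (exit status 124), which frees the lock.
    flock -n "$lockfile" timeout "$limit" "$@"
}

# Example call from a cron wrapper script (paths are assumptions):
# run_locked /tmp/nextcloud-cron.lock 50m php -f /var/www/nextcloud/cron.php
```

With this, an overlapping run exits at once instead of queueing up, and a run that exceeds the limit is terminated rather than holding the lock indefinitely. It does not explain why the job is slow; it only bounds the damage.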

The big question is, what is cron.php doing while it reserves CPU cycles and starts messing with mysqld?

Paka · November 7, 2020, 12:14pm

I’ve been reading your posts from various threads because I’ve had the same issue since upgrading to v19.x.

Twice a day CPU usage becomes excessive, resulting in swapping which, at least once a day, grinds the VPS to a halt, and I get a notification of excessive swapping.

I’m rather surprised this issue hasn’t been resolved by now.

To be honest, I’ve spent far too much time on this cron.php problem with no movement toward fixing it.

I’m posting here mostly to bump this … again.

h3rb3rt · November 9, 2020, 8:17am

Thanks, at least someone else is having the same issue. For now I have set up a quite radical workaround: killing all “php” processes every hour, 3 minutes after the full hour. This is not nice, but it is all I could think of to keep my system running. I still do not understand why the ScanFiles job takes so long, and I have no idea why it is even necessary.

This is what I added to the cron of www-data

sudo -u www-data crontab -e

# WorkAround
# used to kill the php job which gets stuck and eats up CPU
3 */1 * * * killall php
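
A slightly more targeted variant of this workaround, as a sketch: instead of killing every php process, select only cron.php processes that have been running longer than a limit. The helper name stale_cron_pids and the 3600 second limit are hypothetical, and the usage line assumes a procps-style ps that supports the etimes column and GNU xargs for -r.

```shell
#!/bin/sh
# Sketch only: stale_cron_pids is a hypothetical helper.
# Reads "PID ELAPSED_SECONDS COMMAND..." lines on stdin and prints the PIDs
# of cron.php processes that have been running longer than MAX_AGE seconds.
stale_cron_pids() {
    max_age=$1
    awk -v max="$max_age" '$2 + 0 > max + 0 && /cron\.php/ { print $1 }'
}

# Example use, killing only cron.php runs older than one hour:
# ps -eo pid=,etimes=,args= | stale_cron_pids 3600 | xargs -r kill
```

This leaves php-fpm workers and short-lived cron runs alone, which killall php does not.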

Paka · November 16, 2020, 11:16am

Wow. That’s a rather severe (if sadly necessary) solution to keep Nextcloud useful!

Just FWIW, here’s a report generated using Netdata to monitor the server which is running Nextcloud:

Chart: system.load
Alarm: load average 5 = 53.3 load (five-minute load average)
Family: load
Severity: CRITICAL
Time: Mon Nov 16 01:30:50 GMT 2020

To be honest, I’m rather puzzled why a solution or explanation of this major issue hasn’t surfaced.

Perhaps v20 will resolve this.

h3rb3rt · November 16, 2020, 11:51am

Upgrading to NC 20 is not possible for me yet; there is a bug with “NC Mail” which breaks my setup.

Paka · November 17, 2020, 12:32pm

I’m looking for some way to do that with monit when the server load starts increasing. Will post back here when I’ve got it functioning properly.

h3rb3rt · November 20, 2020, 7:06am

I have fixed my issue with the cron, at least for now. Using strace -p PID on the process that never finished, I realised that it was going over my appdata_xxx/preview folder and was never able to finish this.
Based on the oc_filecache table, I had more than 3 158 958 entries linking to the previews. I assume I had even more files in there. I followed a “not so recommended” procedure to get rid of the folder and the database entries and it worked (but took forever).
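
To see at a glance which folder a long-running cron.php keeps touching, the strace output can be condensed into a per-directory count. This is a minimal sketch; the helper name summarize_paths is hypothetical, and the usage line assumes pgrep, timeout and a strace that accepts -e trace=file.

```shell
#!/bin/sh
# Sketch only: summarize_paths is a hypothetical helper.
# Reads strace output on stdin, takes the first quoted absolute path on each
# line, strips the file name and counts how often each directory appears.
summarize_paths() {
    sed -n 's|^[^"]*"\(/[^"]*\)/[^"/]*".*|\1|p' | sort | uniq -c | sort -rn
}

# Example use, sampling the oldest cron.php process for 60 seconds
# (strace writes to stderr, hence 2>&1):
# timeout 60 strace -f -p "$(pgrep -o -f cron.php)" -e trace=file 2>&1 | summarize_paths | head
```

If one directory, such as a preview subfolder, dominates the top of the list, that is a strong hint about where the job spends its time.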

Still, there are open questions:

- Why the heck do I have millions of entries for previews in there? I have lots of pictures, but if this is an “overall” bottleneck, it should be solved differently.
- Why the heck are all files, including previews, scanned twice a day using the jobs database? Or is it even more often? This is my biggest concern and question in all of this: WHY? Bigger systems must suffer from this too, I just don’t get it.
- How does the jobs database work, and what do the entries mean? Any documentation is appreciated.
- And why is nobody from the DEVs jumping in to help out? I thought this is a support forum where even DEVs from the Nextcloud team help out.

Paka · November 26, 2020, 5:30pm

I’ve ended up using:

- Process Resource Manager (PRM)
- Monit, editing the existing config files

PRM:

I edited the PRM config files as needed and added one in the …/rules directory for the user that runs the Nextcloud cron job. Thus the file usr123.user contained:

IGNORE=""
MAX_CPU="20"
MAX_MEM="20"
MAX_PROC="50"
# we dont care about the process run time, set value 0 to disable check
MAX_ETIME="0"
IGNORE_ROOT="0"
KILL_TRIG="1"
KILL_WAIT="1"
KILL_PARENT="1"
KILL_SIG="9"
# KILL_RESTART_CMD="service php7.4-fpm restart"
KILL_RESTART_CMD="/sbin/reboot"

The PRM log showed this when the Nextcloud cron job ran away:

Nov 26 15:03:30 Tweedledee prm[25005]: HARD FAIL MAX_MEM use:74 limit:20 mode:killall ppid:23031 pidlist:23036 user:usr123 cmd:php7.4 -f /var/www/domain.com/html/cron.php restart-cmd:/sbin/reboot
Nov 26 15:03:29 Tweedledee prm[25005]: soft fail #1 MAX_MEM use:74 limit:20 pid:23036 user:usr123 cmd:php7.4 -f /var/www/domain.com/html/cron.php

I’d tried restarting php7.4, as you can see from the commented-out line, but the server still ground to a halt. So I resorted to rebooting, which worked.

Monit:

/etc/monit/monitrc

check system $HOST
[..]
    if loadavg (5min) > 10 then restart
    if swap usage > 50% for 2 cycles then restart
    if cpu usage (system) > 90% for 1 cycles then restart
    stop program = "/sbin/reboot"
[..]

I haven’t yet gone through the process of clearing out appdata_xxx/preview.

Paka · November 27, 2020, 12:11pm

FYI: Bug report on this issue has been filed at:

Query on oc_filecache uses wrong index - Cron job runs very long #24401