Sudden explosive growth of PHP child processes

AndrewSkull · November 20, 2024, 7:05am

The Basics

Nextcloud Server version (e.g., 29.x.x):
- 28.0.12
Operating system and version (e.g., Ubuntu 24.04):
- Ubuntu 24.04 (but it started on Ubuntu 22.04)
Web server and version (e.g, Apache 2.4.25):
- Apache/2.4.58
PHP version (e.g, 8.3):
- 8.3.13 (but it started on 8.2)
Is this the first time you’ve seen this error? (Yes / No):
- No

Summary of the issue you are facing:

Hello! I have a strange problem that randomly stops our Nextcloud server. First of all, this is a VM (30 vCPU, 64 GB RAM, and 15 TB storage). The problem started on NC27 and still persists in NC28. At random times, the PHP processes start to grow until they reach the limit set in the configuration (5000):

[19-Nov-2024 20:31:18] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 95 idle, and 206 total children
[19-Nov-2024 20:31:19] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 97 idle, and 211 total children
[19-Nov-2024 20:31:20] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 99 idle, and 214 total children
[19-Nov-2024 20:31:21] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 92 idle, and 215 total children
[19-Nov-2024 20:31:22] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 98 idle, and 223 total children
[19-Nov-2024 20:31:28] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 77 idle, and 232 total children
[19-Nov-2024 20:31:29] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 64 idle, and 240 total children

and so on.
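
To follow how fast the pool grows, the warnings above can be turned into numbers. This is a minimal sketch that assumes the log lines have exactly the format quoted above; the function names are made up for illustration and are not part of PHP-FPM.

```python
import re
from datetime import datetime

# Matches the PHP-FPM "seems busy" warnings in the format quoted above.
BUSY_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] WARNING: \[pool (?P<pool>[^\]]+)\] seems busy .*?"
    r"spawning (?P<spawn>\d+) children, there are (?P<idle>\d+) idle, "
    r"and (?P<total>\d+) total children"
)


def parse_busy_line(line):
    """Return (timestamp, pool, spawning, idle, total), or None if the line does not match."""
    m = BUSY_RE.search(line)
    if m is None:
        return None
    # %b expects English month abbreviations ("Nov"); assumes a C/English locale.
    ts = datetime.strptime(m["ts"], "%d-%b-%Y %H:%M:%S")
    return ts, m["pool"], int(m["spawn"]), int(m["idle"]), int(m["total"])


def summarize(lines):
    """Print total, idle and active (total - idle) workers for each warning."""
    for line in lines:
        parsed = parse_busy_line(line)
        if parsed is None:
            continue
        ts, pool, spawn, idle, total = parsed
        print(f"{ts:%H:%M:%S} pool={pool} total={total} idle={idle} "
              f"active={total - idle} spawning={spawn}")
```

For the first warning quoted above this gives total=206, idle=95, so 111 workers were actually busy at that moment.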

The problem is that I can’t find who or what is responsible for this. The only thing that was added some time before is the CODE server app, but I think it’s a coincidence.

When php stops responding it is enough

Configuration

"system": {
        "instanceid": "***REMOVED SENSITIVE VALUE***",
        "passwordsalt": "***REMOVED SENSITIVE VALUE***",
        "secret": "***REMOVED SENSITIVE VALUE***",
        "trusted_domains": [
            "cloud.***REMOVED***"
        ],
        "datadirectory": "***REMOVED SENSITIVE VALUE***",
        "dbtype": "pgsql",
        "version": "28.0.12.2",
        "overwrite.cli.url": "https:\/\/cloud.***REMOVED***\/",
        "overwriteprotocol": "https",
        "dbname": "***REMOVED SENSITIVE VALUE***",
        "dbhost": "***REMOVED SENSITIVE VALUE***",
        "dbport": "",
        "dbtableprefix": "oc_",
        "mysql.utf8mb4": true,
        "dbuser": "***REMOVED SENSITIVE VALUE***",
        "dbpassword": "***REMOVED SENSITIVE VALUE***",
        "installed": true,
        "memcache.local": "\\OC\\Memcache\\APCu",
        "filelocking.enabled": true,
        "memcache.locking": "\\OC\\Memcache\\Redis",
        "redis": {
            "host": "***REMOVED SENSITIVE VALUE***",
            "port": 0,
            "timeout": 0
        },
        "overwritehost": "cloud.***REMOVED***",
        "htaccess.RewriteBase": "\/",
        "ldapIgnoreNamingRules": false,
        "ldapProviderFactory": "OCA\\User_LDAP\\LDAPProviderFactory",
        "ldapUserCleanupInterval": 60,
        "skeletondirectory": "",
        "mail_from_address": "***REMOVED SENSITIVE VALUE***",
        "mail_smtpmode": "smtp",
        "mail_sendmailmode": "smtp",
        "mail_domain": "***REMOVED SENSITIVE VALUE***",
        "mail_smtphost": "***REMOVED SENSITIVE VALUE***",
        "mail_smtpport": "1025",
        "mail_send_plaintext_only": true,
        "versions_retention_obligation": "disabled",
        "tempdirectory": "\/var\/www\/nextcloud\/data\/tmp",
        "hashingThreads": 4,
        "default_locale": "ru_RU",
        "default_language": "ru",
        "simpleSignUpLink.shown": false,
        "trashbin_retention_obligation": "auto, 15",
        "lost_password_link": "disabled",
        "knowledgebaseenabled": true,
        "enable_previews": true,
        "preview_ffmpeg_path": "\/usr\/bin\/ffmpeg",
        "enabledPreviewProviders": [
            "OC\\Preview\\PNG",
            "OC\\Preview\\JPEG",
            "OC\\Preview\\GIF",
            "OC\\Preview\\HEIC",
            "OC\\Preview\\BMP",
            "OC\\Preview\\XBitmap",
            "OC\\Preview\\MP3",
            "OC\\Preview\\TXT",
            "OC\\Preview\\MarkDown",
            "OC\\Preview\\OpenDocument",
            "OC\\Preview\\Krita",
            "OC\\Preview\\PDF",
            "OC\\Preview\\Movie"
        ],
        "maintenance": false,
        "default_phone_region": "RU",
        "app_install_overwrite": [
            "files_retention",
            "sharelisting"
        ],
        "theme": "",
        "loglevel": 3,
        "updater.release.channel": "stable",
        "enforce_theme": "",
        "trusted_proxies": "***REMOVED SENSITIVE VALUE***",
        "maintenance_window_start": 17,
        "activity_expire_days": 30
    }

Any advice would be appreciated. Thanks in advance.

jtr · November 20, 2024, 3:37pm

Suggestions:

“loglevel”: 3,

Set this to the default (2) and see if you get a bit more verbose hints. You can further bump it down to 1 if needed to get more verbose logging.

Also, check your web server logs to see what HTTP transactions are occurring during these problematic events.

What are your PHP-FPM pool settings? How many simultaneous users?
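
For the web server log suggestion above, one way to see which HTTP transactions dominate during an incident is to count requests per path inside a time window. This is only a sketch: it assumes Apache writes the common or combined log format, and `count_requests` is a hypothetical helper, not an existing tool.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Prefix of the common/combined log format:
# host ident user [time] "METHOD path PROTO" status ...
ACCESS_RE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def count_requests(lines, start, end):
    """Count (method, path without query string) for requests with start <= time <= end.

    start and end must be timezone-aware datetimes.
    """
    counts = Counter()
    for line in lines:
        m = ACCESS_RE.match(line)
        if m is None:
            continue  # skip lines in another format
        # %b expects English month abbreviations; assumes a C/English locale.
        ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        if start <= ts <= end:
            counts[(m["method"], m["path"].split("?", 1)[0])] += 1
    return counts


# Example with a made-up log line (address and sizes are placeholders).
sample = ['203.0.113.5 - - [19/Nov/2024:20:31:18 +0000] '
          '"GET /apps/files/api/v1/stats HTTP/1.1" 200 512 "-" "ua"']
window_start = datetime(2024, 11, 19, 20, 30, tzinfo=timezone.utc)
window_end = datetime(2024, 11, 19, 20, 35, tzinfo=timezone.utc)
for key, n in count_requests(sample, window_start, window_end).most_common(10):
    print(n, *key)
```

Comparing the top paths in a window around the incident with a quiet window of the same length may show which endpoint starts piling up.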

AndrewSkull · November 21, 2024, 6:46am

Hello! I’ve changed it to “2”. The problem occurs once every few weeks, so we’ll have to wait.

Apache log shows nothing interesting.

pm = dynamic
pm.max_children = 5000
pm.start_servers = 150
pm.min_spare_servers = 100
pm.max_spare_servers = 200
pm.max_requests = 300

About 200-400 simultaneous users.
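
As a rough sanity check on pm.max_children, the worker limit can be compared against available memory. The sketch below uses the 64 GB of RAM from the first post; the reserved amount and the average memory per worker are placeholders that would have to be measured on the real system.

```python
def max_children_budget(total_ram_mib, reserved_mib, avg_worker_mib):
    """Largest worker count whose combined memory fits in the RAM left for PHP-FPM."""
    if avg_worker_mib <= 0:
        raise ValueError("avg_worker_mib must be positive")
    return max(0, (total_ram_mib - reserved_mib) // avg_worker_mib)


# 64 GiB VM as described in the first post.
total_mib = 64 * 1024
# Placeholders, not measured values: RAM kept for the OS, database, Redis, etc.,
# and the average resident memory of one PHP-FPM worker.
reserved_mib = 16 * 1024
avg_worker_mib = 100

print(max_children_budget(total_mib, reserved_mib, avg_worker_mib))  # 491
```

With these placeholder figures the budget is 491 workers, an order of magnitude below 5000, so a limit that high mostly postpones the moment the server runs out of memory rather than preventing it.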

AndrewSkull · November 21, 2024, 7:07am

I’ve looked at the log, and for now it seems that the problem is in the CODE server:

{
  "reqId": "hWVg5ZPvZdLLbuYtt3Ac",
  "level": 3,
  "time": "2024-11-19T16:53:10+00:00",
  "remoteAddr": "",
  "user": "--",
  "app": "richdocuments",
  "method": "",
  "url": "--",
  "message": "Failed to fetch the Collabora capabilities endpoint: cURL error 28: Operation timed out after 45002 milliseconds with 0 bytes received"
}

and the next records are from when the PHP pool started to grow:

{
  "reqId": "8aNBYt21Ti6zJnury2Tv",
  "level": 3,
  "time": "2024-11-19T17:31:49+00:00",
  "remoteAddr": "***REMOVED***",
  "user": "***REMOVED***",
  "app": "richdocuments",
  "method": "GET",
  "url": "/apps/files/api/v1/stats",
  "message": "Failed to fetch the Collabora capabilities endpoint: cURL error 28: Operation timed out after 45002 milliseconds with 0 bytes received"
}
{
  "reqId": "EqiZk34alCzaERREXmmt",
  "level": 3,
  "time": "2024-11-19T17:31:49+00:00",
  "remoteAddr": "***REMOVED***",
  "user": "I***REMOVED***",
  "app": "richdocuments",
  "method": "GET",
  "url": "/apps/files/api/v1/stats",
  "message": "Failed to fetch the Collabora capabilities endpoint: cURL error 28: Operation timed out after 45002 milliseconds with 0 bytes received"
}

Can CODE server be a problem?
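
One way to reason about this: every request that waits for the 45-second timeout shown in the log keeps a PHP worker occupied for that whole time. By Little's law, the average number of occupied workers is the arrival rate multiplied by the time each request holds a worker. The sketch below uses the 45 s from the log; the request rate and the number of free workers are made-up numbers for illustration only.

```python
def workers_held(requests_per_second, seconds_blocked):
    """Little's law: average workers occupied = arrival rate x hold time."""
    return requests_per_second * seconds_blocked


def seconds_until_exhausted(free_workers, requests_per_second, seconds_blocked):
    """Time until the free workers run out if every request blocks.

    Returns None when the steady-state demand fits within free_workers.
    """
    if workers_held(requests_per_second, seconds_blocked) <= free_workers:
        return None
    # No worker is released before seconds_blocked has passed,
    # so occupancy grows linearly with the arrival rate.
    return free_workers / requests_per_second


timeout_s = 45   # from "Operation timed out after 45002 milliseconds"
rate = 5         # hypothetical: requests per second that hit the blocking call

print(workers_held(rate, timeout_s))                  # 225
print(seconds_until_exhausted(200, rate, timeout_s))  # 40.0
```

So even a modest rate of requests that block on an unresponsive backend could tie up hundreds of workers, which would match the pool growing while CPU load stays unremarkable.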

anon81630622 · November 21, 2024, 2:36pm

It seems PHP-FPM is overwhelmed. Increase pm.max_children, pm.start_servers, and pm.min/max_spare_servers in your PHP-FPM config. Check server resources (CPU, memory) and optimize Redis/APCu caching. Investigate the CODE server for excessive resource usage, and ensure PostgreSQL is tuned for concurrency. Update Nextcloud and check logs for any specific errors.

AndrewSkull · November 21, 2024, 2:55pm

I’ve tried increasing pm.max_children to 10K, but then it uses all resources and the server stops responding.
Right now I think the problem is between the Nextcloud Office and CODE apps. Something like this: the CODE server hangs, further requests fall into a loop, and that overwhelms the PHP pool.
Anyway, I’ve increased logging and am now waiting for the issue to recur.
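
While waiting, the CODE server could also be probed from outside PHP, so that a hang is noticed before the pool fills up. This is a sketch only: it assumes the server answers on the Collabora Online `/hosting/capabilities` path under a base URL you supply, and the address the richdocuments app actually calls may differ in your setup. `capabilities_url` and `probe` are hypothetical helper names.

```python
import time
import urllib.request


def capabilities_url(base_url):
    """Build the capabilities URL from a base URL (assumed layout)."""
    return base_url.rstrip("/") + "/hosting/capabilities"


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Do one GET and return (ok, seconds, detail); network errors do not raise."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read()
            return True, time.monotonic() - started, f"HTTP {resp.status}"
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        return False, time.monotonic() - started, repr(exc)
```

Calling `probe(capabilities_url("https://code.example.com"))` from cron every minute (the hostname is a placeholder) and logging the result would give a timeline of when the server stopped answering, independent of the Nextcloud log.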

AndrewSkull · December 2, 2024, 12:36pm

Hello everyone! This happened again. Now I’m on NC29.0.9. I managed to catch it at the beginning, but there was nothing in the log. The only thing I saw was that the CODE server is not responding:

jtr · December 2, 2024, 6:09pm

About 200-400 simultaneous users.

I see you’re using the Built-in CODE rather than a dedicated deployment.

Built-in CODE runs as an embedded AppImage mounted via FUSE (if your environment supports it), or it extracts itself into your system /tmp. Among other things, this makes it slower and susceptible to things like /tmp getting cleared.

Is CODE used by more than a couple of these users?

I would not use the Built-in CODE with more than a handful of users; that’s all it is intended for. Instead, install CODE directly and integrate it with Nextcloud per the docs.

AndrewSkull · December 4, 2024, 7:00am

Hi! Thank you.
I’ve removed the built-in CODE and moved to CODE in Docker. I hope this will fix the problem.