Crash when uploading many small files

Basanites · March 2, 2018, 6:01pm

Hello guys,

so I am running NextCloud on an Odroid HC2 using this NextCloudPi image for the Odroid XU4 and HC1.
The data location is configured to be on an HDD.

Running the system everything seems to work fine, but whenever i try to upload a bigger amount of files (~300 files equating to ~150MB total size) the system completely freezes up.
At that point connecting via ssh becomes terribly slow and executing a command does nothing anymore.
The only fix is powercycling.

Nextcloud version (eg, 12.0.2): 13.0.0.14
Operating system and version (eg, Ubuntu 17.04): NextCloudPi v0.46.34
PHP version (eg, 7.1): 7.0.19-1

The issue you are facing:

Complete server crash

Is this the first time you’ve seen this error? (Y/N): Yes

Steps to replicate it:

create lots files using for ((i = 1; i < 1000; i++)) ; do touch $i.foo ; done
upload using webinterface

The output of your Nextcloud log in Admin > Logging:

https://pastebin.com/ErnnyYpn

The output of your config.php file in /path/to/nextcloud (make sure you remove any identifiable information!):

<?php
$CONFIG = array (
  'passwordsalt' => 'xxx',
  'secret' => 'xxx',
  'trusted_domains' => 
  array (
    0 => 'localhost',
    5 => 'nextcloudpi.local',
    1 => '192.168.7.40',
    2 => 'nices-woelkchen.spdns.de',
    4 => 'nices-woelkchen.spdns.de',
  ),
  'datadirectory' => '/media/USBdrive/ncdata',
  'overwrite.cli.url' => 'https://nices-woelkchen.spdns.de',
  'dbtype' => 'mysql',
  'version' => '13.0.0.14',
  'dbname' => 'nextcloud',
  'dbhost' => 'localhost',
  'dbport' => '',
  'dbtableprefix' => 'oc_',
  'mysql.utf8mb4' => true,
  'dbuser' => 'ncadmin',
  'dbpassword' => 'xxx',
  'installed' => true,
  'instanceid' => 'ocuh990ujuea',
  'memcache.local' => '\\OC\\Memcache\\Redis',
  'memcache.locking' => '\\OC\\Memcache\\Redis',
  'redis' => 
  array (
    'host' => '/var/run/redis/redis.sock',
    'port' => 0,
    'timeout' => 0.0,
  ),
  'mail_smtpmode' => 'php',
  'mail_smtpauthtype' => 'LOGIN',
  'mail_from_address' => 'admin',
  'mail_domain' => 'ownyourbits.com',
  'overwriteprotocol' => 'https',
  'maintenance' => false,
  'loglevel' => '2',
  'log_type' => 'file',
  'updater.secret' => 'xxx',
  'theme' => '',
);

The output of your Apache/nginx/system log in /var/log/syslog:

https://pastebin.com/GZNdqp7J

At first I thought the issue was related to unsupported characters, but the problem persists using the supplied loop for generating test files.
The issue might be related to the webinterface, but I didn’t have time to check that yet.
Also from my understanding the logs don’t seem to contain anything useful, are there any other logs I could check or additional logs I can activate?

I hope this is the right place to ask, otherwise I will open an issue over at the NextCloudPi repo

Edit: Checked if it might be related to flowupload, still happens when using the regular webinterface
Edit2: Using bigger files the problem does not occur at all

nachoparker · March 3, 2018, 9:55am

can you check top and iotop to see where the system is getting stuck at?

Basanites · March 3, 2018, 1:56pm

Thanks for the quick answer
I tried, but when it is stuck I am not able to execute any command at all.
The console does not respond anymore when logged in via ssh and establishing a new connection takes quite long aswell.

budy · March 3, 2018, 4:04pm

This is, why one would run the commands in a ssh shell before trying to reproduce this issue.

Especially for iotop I’d suggest

iotop -d 5 -a -o

Basanites · March 3, 2018, 9:29pm

Thanks for the tip, I will keep it in mind for the future.

For now I have built my own ncp image using Armbian 5.41 and Debian stretch 4.9.85 and have yet to encounter any stability issues.
I will see what happens over the next few days, but at the moment the issue seems to be resolved.

nachoparker · March 4, 2018, 7:16am

Do you mean that you used this?

# curl -sSL https://raw.githubusercontent.com/nextcloud/nextcloudpi/master/install.sh | bash

Basanites · March 4, 2018, 9:04am

No, I set up a virtual machine with Ubuntu 16.04 according to the Armbian Wiki and did this

git clone https://github.com/armbian/build ncpbian
cd ncpbian
wget https://raw.githubusercontent.com/nextcloud/nextcloudpi/master/armbian.sh -O userpatches/customize-image.sh
./compile.sh docker  BOARD=odroidxu4 BRANCH=next KERNEL_ONLY=no KERNEL_CONFIGURE=no RELEASE=stretch BUILD_DESKTOP=no

nachoparker · March 4, 2018, 9:11am

Ok, so you built your own NCP image.

Cool, that’s the best way because the old one is still on ‘testing’ and it is a bit old.

What kernel do you have?

I still have some things to fix. Any issues with SSH or postfix?

edit: you can mark as solved

Basanites · March 4, 2018, 9:29am

Yeah, thats what I figured aswell.

The kernel version is 4.9.71-odroidxu4
I did not get any exceptions regarding the postfix installation so I am guessing it is in working order.
Also SSH seems to work fine as of now.

If you have anything else you want me to test, just hit me up. I’m more then happy to give back for the continuous support you guys do.

nachoparker · March 4, 2018, 9:38am

awesome!

can you run sudo ncp-report?

I have been wanting to upload another image, but my hands are so full… maybe you can upload the one you generated so other people can get it already built? (after a bit of testing)

Also, can you check what is the value of pm.max_children in /etc/php/7.0/fpm/pool.d/www.conf

Basanites · March 4, 2018, 5:59pm

This is the output of ncp-report except for the nextcloud log, since it was empty anyways.

NextCloudPi diagnostics


NextCloudPi version  v0.46.37
NextCloudPi image    NextCloudPi_03-03-18
distribution         Debian GNU/Linux 9 \n \l
automount            yes
USB devices          sda 
datadir              /media/USBdrive/ncdata
data in SD           no
data filesystem      btrfs
data disk usage      688M/299G
rootfs usage         1.7G/15G
swapfile             /var/swap
Nextcloud check      ok
Nextcloud version    13.0.0.14
HTTPD service        up
PHP service          up
MariaDB service      up
Redis service        up
Postfix service      up
internet check       ok
port check 80        open
port check 443       open
IP                   192.168.7.40
gateway              192.168.7.1
interface            enx001e0631fbfe
certificates         nices-woelkchen.spdns.de 
certs due            90 days 
NAT loopback         no
uptime               12min

Nextcloud configuration


{
    "system": {
        "passwordsalt": "***REMOVED SENSITIVE VALUE***",
        "secret": "***REMOVED SENSITIVE VALUE***",
        "trusted_domains": {
            "0": "localhost",
            "5": "nextcloudpi.local",
            "1": "192.168.7.40",
            "4": "nices-woelkchen.spdns.de"
        },
        "datadirectory": "***REMOVED SENSITIVE VALUE***",
        "overwrite.cli.url": "https:\/\/nices-woelkchen.spdns.de",
        "dbtype": "mysql",
        "version": "13.0.0.14",
        "dbname": "***REMOVED SENSITIVE VALUE***",
        "dbhost": "***REMOVED SENSITIVE VALUE***",
        "dbport": "",
        "dbtableprefix": "oc_",
        "mysql.utf8mb4": true,
        "dbuser": "***REMOVED SENSITIVE VALUE***",
        "dbpassword": "***REMOVED SENSITIVE VALUE***",
        "installed": true,
        "instanceid": "***REMOVED SENSITIVE VALUE***",
        "memcache.local": "\\OC\\Memcache\\Redis",
        "memcache.locking": "\\OC\\Memcache\\Redis",
        "redis": {
            "host": "***REMOVED SENSITIVE VALUE***",
            "port": 0,
            "timeout": 0,
            "password": "***REMOVED SENSITIVE VALUE***"
        },
        "mail_smtpmode": "php",
        "mail_smtpauthtype": "LOGIN",
        "mail_from_address": "***REMOVED SENSITIVE VALUE***",
        "mail_domain": "***REMOVED SENSITIVE VALUE***",
        "overwriteprotocol": "https",
        "maintenance": false,
        "logfile": "\/media\/USBdrive\/ncdata\/nextcloud.log",
        "loglevel": "2",
        "log_type": "file"
    }
}

HTTPd logs


[Sat Mar 03 14:51:41.760509 2018] [proxy_fcgi:error] [pid 1102:tid 2859029552] [client 192.168.7.48:37436] AH01067: Failed to read FastCGI header
[Sat Mar 03 14:51:41.760596 2018] [proxy_fcgi:error] [pid 1102:tid 2859029552] (104)Connection reset by peer: [client 192.168.7.48:37436] AH01075: Error dispatching request to :4443: 
[Sat Mar 03 14:51:43.225373 2018] [mpm_event:notice] [pid 892:tid 3069480960] AH00491: caught SIGTERM, shutting down
[Sat Mar 03 14:51:47.907940 2018] [ssl:warn] [pid 750:tid 3069116416] AH01909: localhost:4443:0 server certificate does NOT include an ID which matches the server name
[Sat Mar 03 14:51:48.057153 2018] [ssl:warn] [pid 834:tid 3069116416] AH01909: localhost:4443:0 server certificate does NOT include an ID which matches the server name
[Sat Mar 03 14:51:49.004444 2018] [mpm_event:notice] [pid 834:tid 3069116416] AH00489: Apache/2.4.25 (Debian) OpenSSL/1.0.2l configured -- resuming normal operations
[Sat Mar 03 14:51:49.004596 2018] [core:notice] [pid 834:tid 3069116416] AH00094: Command line: '/usr/sbin/apache2'
[Sat Mar 03 14:51:53.290661 2018] [ssl:error] [pid 1085:tid 2840589360] AH01936: stapling_check_response: response times invalid
[Sat Mar 03 14:51:53.290731 2018] [ssl:error] [pid 1085:tid 2840589360] AH01943: stapling_renew_response: error in retrieved response!
[Sat Mar 03 14:51:54.311176 2018] [proxy_fcgi:error] [pid 1084:tid 2858665008] [client 192.168.7.48:37470] AH01071: Got error 'PHP message: PHP Warning:  touch(): Utime failed: Permission denied in /var/www/ncp-web/ncp-output.php on line 81\n'
[Sat Mar 03 14:53:18.458131 2018] [proxy_fcgi:error] [pid 1084:tid 2858665008] [client 192.168.7.48:37470] AH01067: Failed to read FastCGI header
[Sat Mar 03 14:53:18.699332 2018] [mpm_event:notice] [pid 834:tid 3069116416] AH00491: caught SIGTERM, shutting down
[Sat Mar 03 14:53:25.410001 2018] [ssl:warn] [pid 769:tid 3069599744] AH01909: localhost:4443:0 server certificate does NOT include an ID which matches the server name
[Sat Mar 03 14:53:26.049150 2018] [ssl:warn] [pid 1030:tid 3069599744] AH01909: localhost:4443:0 server certificate does NOT include an ID which matches the server name
[Sat Mar 03 14:53:27.005614 2018] [mpm_event:notice] [pid 1030:tid 3069599744] AH00489: Apache/2.4.25 (Debian) OpenSSL/1.0.2l configured -- resuming normal operations
[Sat Mar 03 14:53:27.005832 2018] [core:notice] [pid 1030:tid 3069599744] AH00094: Command line: '/usr/sbin/apache2'
[Sat Mar 03 15:02:03.374606 2018] [ssl:error] [pid 1202:tid 2840589360] AH01936: stapling_check_response: response times invalid
[Sat Mar 03 15:02:03.374745 2018] [ssl:error] [pid 1202:tid 2840589360] AH01943: stapling_renew_response: error in retrieved response!
[Sat Mar 03 15:02:03.607011 2018] [proxy_fcgi:error] [pid 1203:tid 2859148336] [client 192.168.7.48:45976] AH01071: Got error 'PHP message: PHP Warning:  fclose(): supplied resource is not a valid stream resource in /var/www/ncp-web/L10N.php on line 121\n'
[Sat Mar 03 15:02:04.633848 2018] [proxy_fcgi:error] [pid 1203:tid 2976588848] [client 192.168.7.48:45976] AH01071: Got error 'PHP message: PHP Warning:  touch(): Utime failed: Permission denied in /var/www/ncp-web/ncp-output.php on line 81\n'

Database logs


2018-03-03 14:53:36 3063422976 [Note] InnoDB: innodb_empty_free_list_algorithm has been changed to legacy because of small buffer pool size. In order to use backoff, increase buffer pool at least up to 20MB.

2018-03-03 14:53:36 3063422976 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2018-03-03 14:53:36 3063422976 [Note] InnoDB: The InnoDB memory heap is disabled
2018-03-03 14:53:36 3063422976 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2018-03-03 14:53:36 3063422976 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2018-03-03 14:53:36 3063422976 [Note] InnoDB: Compressed tables use zlib 1.2.8
2018-03-03 14:53:36 3063422976 [Note] InnoDB: Using Linux native AIO
2018-03-03 14:53:36 3063422976 [Note] InnoDB: Using generic crc32 instructions
2018-03-03 14:53:36 3063422976 [Note] InnoDB: Initializing buffer pool, size = 128.0M
2018-03-03 14:53:36 3063422976 [Note] InnoDB: Completed initialization of buffer pool
2018-03-03 14:53:36 3063422976 [Note] InnoDB: Highest supported file format is Barracuda.
2018-03-03 14:53:37 3063422976 [Note] InnoDB: 128 rollback segment(s) are active.
2018-03-03 14:53:37 3063422976 [Note] InnoDB: Waiting for purge to start
2018-03-03 14:53:37 3063422976 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.36-82.1 started; log sequence number 21871674
2018-03-03 14:53:37 2483024704 [Note] InnoDB: Dumping buffer pool(s) not yet started
2018-03-03 14:53:37 3063422976 [Note] Plugin 'FEEDBACK' is disabled.
2018-03-03 14:53:37 3063422976 [Note] Server socket created on IP: '127.0.0.1'.
2018-03-03 14:53:37 3063422976 [Note] /usr/sbin/mysqld: ready for connections.
Version: '10.1.26-MariaDB-0+deb9u1'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  Debian 9.1

and pm.max_children is set to 8.

However the problem seems to occur again, only less consistently.

Following @budy’s advice I startet top and iotop before the crash and made a before and after screenshot.

top

before
top_before

after
top_after

iotop

before

after

Edit: I also ran a long stress test to check, if the power supply can’t keep up, but that does not seem to be the problem.
I am going to try using a different SD card next, to see if that helps

Edit2: corrected order for iotop screenshots

budy · March 4, 2018, 6:18pm

Wait I/O of 25 (top, 2nd screenshot) really is massive. Once the system has such a high count of wait I/O, anything will have issues running. This could indicate an issue with the SD card, which you could further test using iostat…

iostat -xm 5

Also, if the system crashes, there must be something in /var/log/messages at that time - can you checke that as well, please?

nachoparker · March 4, 2018, 6:18pm

did you get the outputs of top (before, after) the other way around?

there could be a bottleneck in mysqld? did you move the database out of the SD card? ( nc-database )

that is great. NCP automatically sets as many threads as cores you have. You can also try lowering this number to 7 to see if the system becomes more responsive during load. ( in nc-limits )

budy · March 4, 2018, 6:21pm

I guess, as he wrote… before, after. I also thought about MySQL a bit, but if MySQL is the culprit, than this should be manageable, using some optimizations. However, 300 files doesn’t seem to be that much of a burden, does it?

nachoparker · March 4, 2018, 6:23pm

indeed… doesn’t make much sense, because even the raspberry pi can handle that

Basanites · March 4, 2018, 7:34pm

I did not find anything in /var/log/messages except for the startup messages.

iostat output

The %iowait for cpu time jumps up seemingly without a reason to my eye.

Basanites · March 4, 2018, 7:35pm

Having tested another SD card now the problem seems to be resolved by that.

Guess my “good” Samsung card wasn’t that great after all.
The database was on the SD card in both cases.

Basanites · March 4, 2018, 7:42pm

Oh, yes you are correct, the ones for iotop were flipped, it’s fixed now.

thanks, I will try that tomorrow.