I/O Errors on SAN during heavy Nextcloud/Redis commands?

Hi all, thanks in advance. Brief background:

Running Nextcloud for 6 users, hosting ~400GB of data in ~150,000 files. The server is virtualized in Xen, has 8GB RAM and connects to its storage via SAN. Plenty of free disk space. The server is accessible within the LAN and over the internet, and the problem occurs in either environment. Externally, the server is reached via HAProxy, though I don’t think this is related.


Nextcloud version: 11.0.1 (stable)
Operating system and version: Ubuntu 16.04
Apache version: Apache 2.4.18
PHP version (eg, 5.6): 7.0.13
Is this the first time you’ve seen this error and can you replicate it?: Ongoing

The issue you are facing:

During heavy operations in the Web GUI, such as deleting a folder with 1GB+ of data in it, the Web GUI freezes for 1+ minute and the VM Host’s console displays I/O errors such as:

Buffer I/O error on device xvda1, logical block (numbers)
blk_update_request: I/O error, dev xvda, sector (numbers)
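
To put numbers on how often each message shows up, something like the following could tally them per device. This is only a rough sketch: the two regular expressions are modeled on the console messages quoted above, the function name `count_io_errors` is my own, and the block and sector numbers in the sample are made-up placeholders, not values from my system.

```python
import re
from collections import Counter

# Patterns modeled on the two console messages quoted in this post.
BUFFER_ERR = re.compile(r"Buffer I/O error on device (\w+), logical block (\d+)")
BLK_ERR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")


def count_io_errors(lines):
    """Return a Counter mapping (kind, device) to the number of matching lines."""
    counts = Counter()
    for line in lines:
        match = BUFFER_ERR.search(line)
        if match:
            counts[("buffer", match.group(1))] += 1
            continue
        match = BLK_ERR.search(line)
        if match:
            counts[("blk", match.group(1))] += 1
    return counts


# Placeholder input: the numbers are invented for illustration only.
sample = [
    "Buffer I/O error on device xvda1, logical block 1000",
    "blk_update_request: I/O error, dev xvda, sector 8000",
    "some unrelated kernel message",
]

counts = count_io_errors(sample)
for (kind, device), n in sorted(counts.items()):
    print(kind, device, n)
```

In practice the input would be the saved console or kernel log lines rather than the inline sample.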

These errors occur solely when initiating commands from within Nextcloud. If I initiate the (effectively) same command within the VM Host’s CLI there are no I/O errors. Even a more burdensome command from the CLI will execute successfully.

The SAN is currently operating on a 1G link, soon to be upgraded to 10G. There are 12 other VMs (none of them running Nextcloud) and none of them are having I/O errors. The SAN reports as healthy.

To troubleshoot I originally moved the Nextcloud data from ‘external storage’ mounted as local to the same disk - same issue. I then merged the data with the primary partition - same issue. I have also migrated the entire disk to a separate pool of disks on the SAN - same issue.

I’m not sure exactly what’s causing this. My guess is that Nextcloud is sending too many commands for my 1G SAN connection to handle, but it’s odd that I can perform as many effectively equal and more I/O intensive commands from the CLI without issue.

Edit: A new development is that whenever clients are connected via the Nextcloud sync agent, I get a pair of I/O disk errors (as shown above) every 20 seconds or so. When the clients are powered off, the I/O errors disappear. Perhaps an indexing/comparison command is sending many requests?

Edit #2: Correction to the above. An I/O error can still occur when clients are not syncing, though it is far less frequent. It appears to occur when Redis becomes active - perhaps Redis is too “busy” for my SAN?

Edit #3: The Redis log is the only log displaying notable errors - see http://pastebin.com/nmjaJy5t

The output of your Nextcloud log in Admin > Logging:

I don’t think the entry below is related, since Nextcloud probably couldn’t write the actual error to disk.

Fatal    webdav    2017-02-22T01:43:48-0500

Sabre\DAV\Exception\ServiceUnavailable: HTTP/1.1 503 Doctrine\DBAL\Exception\DriverException: An exception occurred while executing 'UPDATE `oc_mounts` SET `storage_id` = ?, `mount_point` = ?, `mount_id` = ? WHERE (`user_id` = ?) AND (`root_id` = ?)' with params ["4", "\/username\/files\/anotherfolder\/", null, "username", 123570]: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
/var/www/nextcloud/3rdparty/sabre/dav/lib/DAV/Auth/Plugin.php - line 199: OCA\DAV\Connector\Sabre\Auth->check(Object(Sabre\HTTP\Request), Object(Sabre\HTTP\Response))
/var/www/nextcloud/3rdparty/sabre/dav/lib/DAV/Auth/Plugin.php - line 150: Sabre\DAV\Auth\Plugin->check(Object(Sabre\HTTP\Request), Object(Sabre\HTTP\Response))
[internal function] Sabre\DAV\Auth\Plugin->beforeMethod(Object(Sabre\HTTP\Request), Object(Sabre\HTTP\Response))
/var/www/nextcloud/3rdparty/sabre/event/lib/EventEmitterTrait.php - line 105: call_user_func_array(Array, Array)
/var/www/nextcloud/3rdparty/sabre/dav/lib/DAV/Server.php - line 466: Sabre\Event\EventEmitter->emit('beforeMethod', Array)
/var/www/nextcloud/3rdparty/sabre/dav/lib/DAV/Server.php - line 254: Sabre\DAV\Server->invokeMethod(Object(Sabre\HTTP\Request), Object(Sabre\HTTP\Response))
/var/www/nextcloud/apps/dav/appinfo/v1/webdav.php - line 60: Sabre\DAV\Server->exec()
/var/www/nextcloud/remote.php - line 165: require_once('/var/www/nextcl...')
{main}

The output of your config.php file in /path/to/nextcloud:

<?php
$CONFIG = array (
  'instanceid' => 'REDACTED',
  'passwordsalt' => 'REDACTED',
  'secret' => 'REDACTED',
  'trusted_domains' =>
  array (
    0 => 'REDACTED_IPADDRESS',
    1 => 'REDACTED_FQDN',
  ),
  'datadirectory' => '/var/www/nextcloud/data',
  'overwrite.cli.url' => 'REDACTED_FQDN',
  'dbtype' => 'mysql',
  'version' => '11.0.1.2',
  'dbname' => 'nextcloud',
  'dbhost' => 'localhost',
  'dbport' => '',
  'dbtableprefix' => 'oc_',
  'dbuser' => 'REDACTED',
  'dbpassword' => 'REDACTED',
  'logtimezone' => 'UTC',
  'installed' => true,
  'memcache.local' => '\\OC\\Memcache\\Redis',
  'filelocking.enabled' => 'true',
  'memcache.distributed' => '\\OC\\Memcache\\Redis',
  'memcache.locking' => '\\OC\\Memcache\\Redis',
  'redis' =>
  array (
    'host' => 'localhost',
    'port' => 6379,
    'timeout' => 0,
    'dbindex' => 0,
  ),
  'mail_smtpmode' => 'REDACTED',
  'mail_from_address' => 'REDACTED',
  'mail_domain' => 'REDACTED',
  'mail_smtpauthtype' => 'REDACTED',
  'mail_smtpauth' => REDACTED,
  'mail_smtphost' => 'REDACTED',
  'mail_smtpport' => 'REDACTED',
  'mail_smtpname' => 'REDACTED',
  'mail_smtppassword' => 'REDACTED',
  'mail_smtpsecure' => 'REDACTED',
);

The output of your Apache/nginx/system log in /var/log/____:

REDACTED_CLIENT_IP - REDACTED_USER1 [22/Feb/2017:07:14:58 -0500] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 960 "-" "Mozilla/5.0 (Macintosh) mirall/2.2.4 (build 1) (Nextcloud)"
REDACTED_CLIENT_IP - REDACTED_USER2 [22/Feb/2017:07:15:00 -0500] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 960 "-" "Mozilla/5.0 (Windows) mirall/2.2.4 (build 2) (Nextcloud)"
REDACTED_CLIENT_IP - REDACTED_USER1 [22/Feb/2017:07:15:14 -0500] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 1016 "-" "Mozilla/5.0 (Windows) mirall/2.2.4 (build 2) (Nextcloud)"
REDACTED_CLIENT_IP - REDACTED_USER1 [22/Feb/2017:07:15:26 -0500] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 1065 "-" "Mozilla/5.0 (Windows) mirall/2.2.4 (build 2) (Nextcloud)"
REDACTED_CLIENT_IP - REDACTED_USER1 [22/Feb/2017:07:15:28 -0500] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 960 "-" "Mozilla/5.0 (Macintosh) mirall/2.2.4 (build 1) (Nextcloud)"
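
Since the I/O errors seem to track the sync clients, one way to compare their timing is to measure the gaps between the PROPFIND polls in the access log and set them against the times the errors appear on the console. Below is a minimal sketch, assuming the log keeps the format shown in the excerpt above. The function name `propfind_gaps` is my own, and the sample lines are the five entries above with the user-agent field shortened to "-".

```python
import re
from datetime import datetime

# Captures the bracketed timestamp and the HTTP method of each access log line.
LINE = re.compile(r'\[([^\]]+)\] "(\S+) ')


def propfind_gaps(lines):
    """Return the gaps in seconds between consecutive PROPFIND requests."""
    times = []
    for line in lines:
        match = LINE.search(line)
        if match and match.group(2) == "PROPFIND":
            times.append(datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z"))
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


# The five requests from the excerpt, user-agent field shortened.
sample = [
    'REDACTED_CLIENT_IP - REDACTED_USER1 [22/Feb/2017:07:14:58 -0500] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 960 "-" "-"',
    'REDACTED_CLIENT_IP - REDACTED_USER2 [22/Feb/2017:07:15:00 -0500] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 960 "-" "-"',
    'REDACTED_CLIENT_IP - REDACTED_USER1 [22/Feb/2017:07:15:14 -0500] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 1016 "-" "-"',
    'REDACTED_CLIENT_IP - REDACTED_USER1 [22/Feb/2017:07:15:26 -0500] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 1065 "-" "-"',
    'REDACTED_CLIENT_IP - REDACTED_USER1 [22/Feb/2017:07:15:28 -0500] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 960 "-" "-"',
]

gaps = propfind_gaps(sample)
print(gaps)  # [2.0, 14.0, 12.0, 2.0]
```

For this short excerpt the polls arrive 2 to 14 seconds apart across the clients. A longer stretch of the log would be needed before reading anything into how that lines up with the roughly 20-second error interval.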