File uploads >64k fail on LXD container

brmiller · April 22, 2018, 10:08pm

Host: Ubuntu 16.04.4
LXD: 2.21

Guest:
Nextcloud version: 13.0.0
Operating system and version: Ubuntu 16.04.4
Apache or nginx version: Apache 2.4.18
PHP version: 7.0.28

I have set up NC 13 as an LXD container, starting from @JasonBayton’s NC 13 image and upgrading along the way. I also have a pristine NC 11 container using the same image that has not been upgraded or altered in any way except for obvious things like passwords.

Recently, I discovered that when uploading files > 64k, the server seems to hang for a long time before the upload ultimately fails. I can watch the temporary file in /tmp grow to ~64k (not always 65536 bytes, but never more) and then stop. The message in the log (viewed from within Nextcloud) is:

Fatal	webdav	Sabre\DAV\Exception\BadRequest: expected filesize 127173 got 64601

    /var/www/html/nextcloud/apps/dav/lib/Connector/Sabre/Directory.php - line 151: OCA\DAV\Connector\Sabre\File->put(Resource id #10)
    /var/www/html/nextcloud/3rdparty/sabre/dav/lib/DAV/Server.php - line 1096: OCA\DAV\Connector\Sabre\Directory->createFile('ColeEdmonson.xc...', Resource id #10)
    /var/www/html/nextcloud/3rdparty/sabre/dav/lib/DAV/CorePlugin.php - line 525: Sabre\DAV\Server->createFile('ColeEdmonson.xc...', Resource id #10, NULL)
    [internal function] Sabre\DAV\CorePlugin->httpPut(Object(Sabre\HTTP\Request), Object(Sabre\HTTP\Response))
    /var/www/html/nextcloud/3rdparty/sabre/event/lib/EventEmitterTrait.php - line 105: call_user_func_array(Array, Array)
    /var/www/html/nextcloud/3rdparty/sabre/dav/lib/DAV/Server.php - line 479: Sabre\Event\EventEmitter->emit('method PUT', Array)
    /var/www/html/nextcloud/3rdparty/sabre/dav/lib/DAV/Server.php - line 254: Sabre\DAV\Server->invokeMethod(Object(Sabre\HTTP\Request), Object(Sabre\HTTP\Response))
    /var/www/html/nextcloud/apps/dav/appinfo/v1/webdav.php - line 80: Sabre\DAV\Server->exec()
    /var/www/html/nextcloud/remote.php - line 164: require_once('/var/www/html/n...')
    {main}
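
To put numbers on that message, here is a minimal Python sketch that pulls the two byte counts out of the error text and checks the received size against the 64k mark. The regex and the function name `parse_short_upload` are made up for illustration; they are not part of Nextcloud or Sabre.

```python
import re

# Matches the wording of the Sabre BadRequest message quoted above.
# The pattern is illustrative only; it is not taken from Nextcloud's code.
SHORT_UPLOAD = re.compile(r"expected filesize (\d+) got (\d+)")


def parse_short_upload(message):
    """Return (expected, received, missing) byte counts, or None if no match."""
    match = SHORT_UPLOAD.search(message)
    if match is None:
        return None
    expected = int(match.group(1))
    received = int(match.group(2))
    return expected, received, expected - received


line = "Sabre\\DAV\\Exception\\BadRequest: expected filesize 127173 got 64601"
expected, received, missing = parse_short_upload(line)
print(expected, received, missing)  # 127173 64601 62572
print(received <= 64 * 1024)        # True: the body stopped short of 65536 bytes
```

So roughly half of the file (62572 of 127173 bytes) never reached PHP, which fits the temporary file stalling just under 64k.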

In /etc/php/7.0/apache/php.ini I have

upload_max_filesize = 2048M
post_max_size = 2058M

and also in /var/www/html/.htaccess:

php_value upload_max_filesize 2G
php_value post_max_size 2G

and in /var/www/html/nextcloud/.user.ini:

upload_max_filesize=2G
post_max_size=2G

as I could not quite work out which file applies in which environment. (In the image from @JasonBayton, the .htaccess and .user.ini values are 511M - I don’t think any of this matters, though, because my issues are at 64k.)
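
As a sanity check on that reasoning, here is a small Python sketch that converts the values listed above into bytes and compares them with 64k. It assumes PHP's shorthand notation, where K, M and G are powers of 1024; the helper name `php_size_to_bytes` is invented for this example and is not a PHP or Nextcloud function.

```python
# PHP shorthand byte suffixes are powers of 1024 and case-insensitive.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def php_size_to_bytes(value):
    """Convert a PHP shorthand size such as '2G' or '2048M' to bytes."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)  # no suffix: the value is already in bytes


# Values copied from the three files quoted in the post.
settings = {
    "php.ini upload_max_filesize": "2048M",
    "php.ini post_max_size": "2058M",
    ".htaccess upload_max_filesize": "2G",
    ".htaccess post_max_size": "2G",
    ".user.ini upload_max_filesize": "2G",
    ".user.ini post_max_size": "2G",
}

smallest = min(php_size_to_bytes(v) for v in settings.values())
print(smallest)                       # 2147483648
print(smallest // (64 * 1024))        # 32768
print(php_size_to_bytes("511M"))      # 535822336 (the image's default)
```

Whichever file wins, the smallest limit in play is 32768 times larger than 64k, and even the image's 511M default is far above it, so none of these settings would explain a cut-off at that size.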

This occurs both via the web-client and the Linux desktop client. Thinking it was potentially a network error/issue between my desktop client (which is on a different network than the nextcloud container), I attempted a tcpdump capture per this post, but I confess, I did not make much sense of the capture - at least not enough to understand what causes it. I think I can see the obvious packets up to the 64k mark where it quits, but not who or which end is doing the “quitting”. Any help would be appreciated; especially if someone has seen this before.
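
One way to narrow this down without reading packet captures is to bisect the upload size from a script. The following Python sketch is only an illustration: `largest_working_size` and `make_put_uploader` are hypothetical helpers, the URL you would pass in is a placeholder for your own WebDAV path, and the search assumes every size below the cut-off succeeds, which may hold only approximately since the stall point is not always the same byte count.

```python
import base64
import urllib.request


def largest_working_size(try_upload, low, high):
    """Binary-search the largest size in [low, high] that uploads successfully.

    Assumes try_upload(low) is True, try_upload(high) is False, and that
    every size below the cut-off succeeds.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if try_upload(mid):
            low = mid    # mid worked, so the cut-off is at or above mid
        else:
            high = mid   # mid failed, so the cut-off is below mid
    return low


def make_put_uploader(url, user, password, timeout=30):
    """Build a try_upload(size) callable that PUTs `size` zero bytes to `url`.

    `url` is a placeholder for a file path on your own WebDAV endpoint.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()

    def try_upload(size):
        request = urllib.request.Request(url, data=bytes(size), method="PUT")
        request.add_header("Authorization", "Basic " + token)
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return 200 <= response.status < 300
        except OSError:
            # Covers HTTP errors, refused connections and timeouts alike.
            return False

    return try_upload


# Dry run against a fake uploader that fails above 64601 bytes,
# the size reported in the log message earlier in this post.
fake_upload = lambda size: size <= 64601
print(largest_working_size(fake_upload, 1024, 127173))  # 64601
```

Running the real uploader once from the LXD host and once from a machine on the other network would show whether the cut-off depends on the path the packets take.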

Thanks,
Brendan

JasonBayton · April 22, 2018, 10:46pm

If you can’t replicate this on my demo servers, which use the same LXD containers (upgraded and native), I can’t imagine it’ll be anything to do with the image specifically. As you know, I run all my instances via LXD, even the one managing my 42TB array at home, and haven’t come across this.

If you mount the webDAV address on both the host and the LXD container as local mount points do you get the same errors?

brmiller · May 26, 2018, 3:42am

Finally getting back to testing this. The http://demo.nextcloud.bayton.org/ sites do not exhibit the problem. Which means it’s something in my environment (duh). I had too many troubles with davfs to successfully test this route. Not sure where to go next.

brmiller · September 29, 2018, 11:31pm

Finally got to the bottom of this. For whatever reason, the LXD network layer seems to be “eating” packets or not being protocol compliant. Previously, my Nextcloud LXD container was using an LXD-assigned IP on the lxdbr0 device. I added the IP of the Nextcloud instance to /etc/hosts or to my router’s static mapping (it didn’t matter which) and also configured my router (pfSense) to route requests for the lxdbr0 network to my LXD server. Then, from a laptop on the same network as the LXD host, I’d make requests to my server by hostname. pfSense’s DNS would resolve the LXD-assigned IP, route through to my LXD host, and (presumably) the Linux networking layer on the LXD host would route the requests to the Nextcloud LXD instance. I say “presumably” because although the Nextcloud web interface worked, I’d get these “short” file uploads after 64k and the server would time out.

On a whim last week, I changed my Nextcloud LXD instance’s profile to one that specifies the br0 network device as the parent, which is on the same network as my internal LAN. Voila! File uploads of all sizes work fine now.

So something in the LXD bridged networking stack on lxdbr0 seems to be the culprit. I’d still like to let LXD assign IPs on the private LXD network subnet as I already have a pretty cluttered main DHCP pool for my LAN (br0 on the LXD server). Any ideas?