A couple of months back I looked into options for server-side data deduplication as I have Nextcloud at a school which is short on disc space and on budget to expand it.
I looked at btrfs (slow), ZFS (dedupe too memory-intensive to be worth it) and FSLint.
FSLint operates at the file level and is pretty simple: it finds duplicate files and hard-links them together.
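To make the idea concrete, here is a rough Python sketch of file-level dedupe by hard-linking. It is not FSLint's actual code, just the same general technique under my own assumptions: the function names `file_digest` and `hardlink_duplicates` are made up, files are grouped by device and size first, then by SHA-256, and every duplicate is replaced with a hard link to the first copy found.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Hard-link identical files under root; return how many files were replaced."""
    # Hard links cannot cross filesystems, so group by (device, size) first.
    by_dev_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            if st.st_size > 0:  # skip empty files, nothing to save
                by_dev_size[(st.st_dev, st.st_size)].append(path)

    replaced = 0
    for paths in by_dev_size.values():
        if len(paths) < 2:
            continue
        # Only hash files that share a size with at least one other file.
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        for group in by_hash.values():
            keeper = group[0]
            for dup in group[1:]:
                if os.path.samefile(keeper, dup):
                    continue  # already hard-linked together
                # Link to a temporary name, then rename over the duplicate,
                # so the duplicate's path never disappears.
                # (Hypothetical suffix; os.link fails if the name exists.)
                tmp = dup + ".dedup-tmp"
                os.link(keeper, tmp)
                os.replace(tmp, dup)
                replaced += 1
    return replaced
```

One thing to keep in mind: all hard-linked names share a single inode, so the owner, permissions and modification time of the kept copy apply to every name afterwards.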
I’ve been trying this on ownCloud (and asked about this in the ownCloud community here) to see if it works and so far testing hasn’t revealed any problems - initially for my own content, and stage two testing with three other users (with a lot of the same files between them). Since starting the trial, I’ve moved this site from ownCloud to Nextcloud so thought I should repost here.
It does depend on Nextcloud’s copy-on-write method of storing files server-side (which has the effect of breaking any hard link if a file is modified, leaving the others alone) but from what I can tell, there are no plans to change this, right? It seems like an easy win…
Has anyone else tried/tested this? Are there any risks I’ve failed to consider?
As stated on the ownCloud forums, directly modifying data in the datadir is not supported. The risk is that Nextcloud assumes nothing else is modifying that data. Eventually there is going to be an incompatibility and potential breakage. I would not take that risk, especially not with somebody else’s data.
Another possible option is to use external storage to a local folder/drive with fslint. This might make it safer. See https://docs.nextcloud.com/server/11/admin_manual/configuration_files/external_storage_configuration_gui.html
If you really need data deduplication, then ZFS is the way to go. Though honestly, I’d go with lz4 compression over data deduplication. Deduplication has more issues than just the memory usage.
I’d avoid btrfs entirely.
Ok so it’s not safe to assume Nextcloud will do copy-on-write for file modifications in the future? That’s the only thing this really depends on (as copy-on-write splits the hardlinks if one copy of a previously-deduplicated-by-hardlink file is modified - exactly what I want).
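For anyone wondering why the write method matters, here is a small Python sketch of the filesystem behaviour this relies on. It says nothing about what Nextcloud actually does internally; the helper names (`replace_via_rename`, `overwrite_in_place`, `demo`) and the file names are invented for illustration. Writing a new file and renaming it over the old name splits the hard link, while writing in place changes every linked name at once.

```python
import os
import tempfile


def replace_via_rename(path, data):
    """Write data to a new file in the same directory, then rename it over path."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)  # path now points at a brand-new inode


def overwrite_in_place(path, data):
    """Truncate and rewrite the existing inode behind path."""
    with open(path, "wb") as f:
        f.write(data)


def demo():
    with tempfile.TemporaryDirectory() as d:
        alice = os.path.join(d, "alice.doc")
        bob = os.path.join(d, "bob.doc")
        with open(alice, "wb") as f:
            f.write(b"shared")
        os.link(alice, bob)  # the two names now share one inode

        # Case 1: modify Bob's copy by write-new-then-rename.
        replace_via_rename(bob, b"bob's edit")
        with open(alice, "rb") as f:
            alice_after_rename = f.read()
        linked_after_rename = os.path.samefile(alice, bob)

        # Case 2: re-link, then modify Bob's copy in place.
        os.remove(bob)
        os.link(alice, bob)
        overwrite_in_place(bob, b"clobbered")
        with open(alice, "rb") as f:
            alice_after_inplace = f.read()

    return alice_after_rename, linked_after_rename, alice_after_inplace


print(demo())
```

In the first case Alice's file is untouched and the link is gone; in the second case Alice's file is overwritten too. So the hard-link approach is only safe for as long as every writer uses the first pattern.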
You mention external storage - is it possible to put all user storage on a local external folder by default? I can’t see how to do that.
Compression doesn’t help much as the content duplicated between user folders tends to be reference material (Word docs with lots of images), as well as lots of photos and videos which are not really further compressible.
ZFS’s in-band dedupe needs huge amounts of RAM to hold the dedupe table, which hits exactly the same affordability problem I’m trying to solve here.
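As a back-of-envelope illustration of the RAM problem, here is a tiny Python estimate. The numbers are assumptions, not measurements: roughly 320 bytes of RAM per dedupe table entry is a commonly quoted rule of thumb, and 128 KiB is the default ZFS recordsize. Real pools with smaller average block sizes need proportionally more. The function name `ddt_ram_bytes` is made up.

```python
def ddt_ram_bytes(pool_bytes, avg_block_bytes=128 * 1024, bytes_per_entry=320):
    """Rough dedupe-table size: one entry per unique block (assumed figures)."""
    blocks = pool_bytes // avg_block_bytes
    return blocks * bytes_per_entry


one_tib = 2 ** 40
# Estimated table size in GiB per TiB of unique data at 128 KiB blocks.
print(ddt_ram_bytes(one_tib) / 2 ** 30)
# Same estimate if the average block is only 16 KiB.
print(ddt_ram_bytes(one_tib, avg_block_bytes=16 * 1024) / 2 ** 30)
```

Under those assumptions that works out to about 2.5 GiB of table per TiB of data at 128 KiB blocks, and eight times that at 16 KiB blocks.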
Has anyone come up with any other solutions to having a lot of duplicated content between users?
There ought to be a simple solution to this by now. Several filesystems support the notion of a lightweight copy.
See the description of a lightweight copy in the cp man page, which implies that the redundant disk space is spared until one side gets modified:
--reflink[=WHEN]  control clone/CoW copies. See below
When --reflink[=always] is specified, perform a lightweight copy, where the
data blocks are copied only when modified. If this is not possible the
copy fails, or if --reflink=auto is specified, fall back to a standard copy.
So perhaps we could have a cron job that periodically dedupes using a --reflink-type copy and keeps track of any issues arising from there.
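A minimal sketch of what one step of such a job might look like in Python, assuming GNU cp and a filesystem that supports reflinks (btrfs or XFS with reflink enabled, for example). The function name `reflink_dedupe` and the temporary-file suffix are made up, and finding the candidate pairs is left out. It compares the two files byte for byte, clones the kept copy with `cp --reflink=always`, carries over the duplicate's permissions and timestamps, and renames the clone into place. If cloning is not possible it leaves the duplicate untouched.

```python
import filecmp
import os
import shutil
import subprocess


def reflink_dedupe(keeper, dup):
    """Replace dup with a reflink clone of keeper; return True if replaced."""
    # Never trust a hash or size alone: compare the full contents first.
    if not filecmp.cmp(keeper, dup, shallow=False):
        return False

    tmp = dup + ".reflink-tmp"  # hypothetical temporary name
    try:
        result = subprocess.run(
            ["cp", "--reflink=always", "--", keeper, tmp],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        cloned = result.returncode == 0
    except FileNotFoundError:
        cloned = False  # no cp binary on this system

    if not cloned:
        # Filesystem or cp does not support reflinks; clean up and give up.
        if os.path.exists(tmp):
            os.remove(tmp)
        return False

    # Unlike a hard link, the clone is its own inode, so it can keep the
    # duplicate's mode and timestamps (ownership is not handled here).
    shutil.copystat(dup, tmp)
    os.replace(tmp, dup)
    return True
```

This does not guard against a file changing between the comparison and the rename, so a real job would need to avoid running while users are writing, and it still modifies the datadir behind Nextcloud's back.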