Well, if the FS is taking care of deduplication, NC doesn’t need to be aware of the FS layout.
UserA and UserB upload the same file to their spaces; the FS detects the duplicate and creates something like a soft link. Slightly simplified: only one copy of the file actually consumes disk space.
Both users (and NC) access their files without knowing that some disk space has been saved. If one user modifies their file, then (depending on the deduplication mode) the FS either creates a copy of the single existing file, so that all of a sudden two files exist, or it copies only the modified blocks of that file to disk, and disk space is still saved.
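To make that concrete, here is a tiny Python sketch of the behaviour described above (purely illustrative, not Nextcloud or real filesystem code; all names are made up): both uploads reference one stored copy, and a modification creates a new copy so the other user’s file stays untouched.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: one physical copy per unique content,
    copy-on-write when a user modifies their file. Purely illustrative,
    not how any real filesystem is implemented."""

    def __init__(self):
        self.blobs = {}   # content hash -> bytes (the single physical copy)
        self.files = {}   # (user, name) -> content hash

    def upload(self, user, name, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)   # stored only once
        self.files[(user, name)] = digest

    def modify(self, user, name, new_data):
        # Copy-on-write: the other user's reference keeps pointing at the
        # old blob; only the changed file gets a new one.
        self.upload(user, name, new_data)

store = DedupStore()
store.upload("UserA", "report.pdf", b"same bytes")
store.upload("UserB", "report.pdf", b"same bytes")
print(len(store.blobs))   # 1 -> only one physical copy
store.modify("UserB", "report.pdf", b"changed bytes")
print(len(store.blobs))   # 2 -> copy created on modification
```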
Deduplication
We have another way to save disk in conjunction with compression, and that is deduplication. Now, there are three main types of deduplication: file, block, and byte. File deduplication is the most performant and least costly on system resources. Each file is hashed with a cryptographic hashing algorithm, such as SHA-256. If the hash matches for multiple files, rather than storing the new file on disk, we reference the original file in the metadata. This can have significant savings, but has a serious drawback. If a single byte changes in the file, the hashes will no longer match. This means we can no longer reference the whole file in the filesystem metadata. As such, we must make a copy of all the blocks to disk. For large files, this has massive performance impacts.
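A minimal sketch of that file-level approach, assuming a hypothetical data directory /srv/data: hash every file with SHA-256 and group the paths that yield the same digest. A real filesystem would keep such references in its metadata rather than in a Python dict.

```python
import hashlib, os

def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files don't have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Map each SHA-256 digest to the list of files that share it."""
    index = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index.setdefault(file_sha256(path), []).append(path)
    return {h: paths for h, paths in index.items() if len(paths) > 1}

# Example: print groups of byte-identical files under /srv/data
for digest, paths in find_duplicates("/srv/data").items():
    print(digest[:12], paths)
```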
On the extreme other side of the spectrum, we have byte deduplication. This deduplication method is the most expensive, because you must keep “anchor points” to determine where regions of deduplicated and unique bytes start and end. After all, bytes are bytes, and without knowing which files need them, it’s nothing more than a sea of data. This sort of deduplication works well for storage where a file may be stored multiple times, even if it’s not aligned under the same blocks, such as mail attachments.
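For illustration only, a toy version of such anchor points: the boundary test below uses a simplistic gear-style rolling value, where a real system would use Rabin fingerprints or FastCDC. The point is just that two “mails” carrying the same attachment at different offsets still produce common chunks.

```python
import hashlib, random

def chunks(data, mask=0x3FF):
    """Split data at content-defined boundaries ("anchor points").
    A boundary is declared whenever a simple rolling value matches the
    mask, so identical byte runs produce identical chunks even when they
    sit at different offsets in different files."""
    out, start, fp = [], 0, 0
    for i, byte in enumerate(data):
        fp = ((fp << 1) + byte) & 0xFFFFFFFF
        if (fp & mask) == 0:
            out.append(data[start:i + 1])
            start, fp = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out

# Two "mails" carrying the same attachment at different offsets:
random.seed(0)
attachment = bytes(random.getrandbits(8) for _ in range(16 * 1024))
mail_a = b"Hello Bob,\n" + attachment
mail_b = b"Hi Alice, see the attachment.\n" + attachment

hashes_a = {hashlib.sha256(c).hexdigest() for c in chunks(mail_a)}
hashes_b = {hashlib.sha256(c).hexdigest() for c in chunks(mail_b)}
print(len(hashes_a & hashes_b), "chunks could be stored just once")
```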
In the middle, we have block deduplication. ZFS uses block deduplication only. Block deduplication shares all the blocks that are identical, minus the blocks that are different. This allows us to store only the unique blocks on disk, and reference the shared blocks in RAM. It’s more efficient than byte deduplication, and more flexible than file deduplication. However, it has a drawback: it requires a great deal of memory to keep track of which blocks are shared and which are not. Still, because filesystems read and write data in block segments, it makes the most sense to use block deduplication for a modern filesystem.
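A rough sketch of the block-level idea, using fixed 4 KiB blocks for brevity (ZFS records are usually much larger): each unique block is stored once, files become lists of block checksums, and a one-byte change only costs one additional block.

```python
import hashlib, random

BLOCK = 4096   # illustrative; ZFS records are typically larger (e.g. 128 KiB)

def dedup_blocks(data, table):
    """Store each unique fixed-size block once; `table` plays the role of
    the deduplication table (checksum -> block). A file is then just a
    list of block checksums."""
    refs = []
    for off in range(0, len(data), BLOCK):
        block = data[off:off + BLOCK]
        digest = hashlib.sha256(block).hexdigest()
        table.setdefault(digest, block)   # physical storage only for new blocks
        refs.append(digest)
    return refs

random.seed(1)
table = {}
original = bytes(random.getrandbits(8) for _ in range(4 * BLOCK))
edited = bytearray(original)
edited[5000] ^= 0xFF                      # flip a single byte in the second block
refs_a = dedup_blocks(original, table)
refs_b = dedup_blocks(bytes(edited), table)
print(len(set(refs_a) & set(refs_b)), "of", len(refs_b), "blocks are shared;",
      len(table), "unique blocks stored in total")
```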
References to the shared blocks are stored in what’s called a “deduplication table”. The more deduplicated blocks on the filesystem, the larger this table will grow. Every time data is written or read, the deduplication table is referenced. This means you want to keep the ENTIRE deduplication table in fast RAM. If you do not have enough RAM, then the table will spill over to disk. This can have massive performance impacts on your storage, both for reading and writing data.
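To get a feel for the numbers, a back-of-the-envelope calculation, assuming the often-quoted ballpark of roughly 320 bytes of RAM per DDT entry (the real figure depends on the pool layout and ZFS version):

```python
def ddt_ram_estimate(data_bytes, avg_block=64 * 1024, entry_bytes=320):
    """Rough estimate of dedup-table RAM usage: one entry per unique block.
    entry_bytes=320 is an often-quoted ballpark for an in-core ZFS DDT
    entry; the real value depends on pool layout and ZFS version."""
    return (data_bytes / avg_block) * entry_bytes

tib = 1024 ** 4
print(f"{ddt_ram_estimate(10 * tib) / 1024 ** 3:.0f} GiB of RAM for 10 TiB of data")
# -> roughly 50 GiB for the DDT alone under these assumptions
```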
Sure, for backups to another disk (and FS) you need the “real” amount of storage, since the files are saved multiple times there.
But what I’m actually saying is: implementing deduplication in Nextcloud will take quite some time and will probably not land within the next two major releases. And if deduplication is urgently required, the FS feature is a quick solution and maybe a good compromise.