Many users have uploaded duplicate files. Is there any way to save space?

Unfortunately, no mechanism exists to save storage space if the files are not shared between users but uploaded individually.

Oh, but wouldn’t this be a great feature: keep track of hashes for individual files and maintain an access list per file. The file would only need to be duplicated again if a user modified it and the hash changed. I’d say that feature would actually be awesome.

Hi,

I think that would complicate things way too much.
But you can leave that task to the file system if you like. I’m not sure whether other filesystems support this, but ZFS at least does; it’s called deduplication. The feature/flag can possibly even be activated at runtime.

Btrfs, ZFS and NTFS support deduplication. The requirement is that the files are on the same volume. However, Nextcloud is most probably not aware of the underlying filesystem’s layout and properties, and files in Nextcloud could be stored in many different locations.

However, storing hashes would be a good thing. It would allow background jobs or scripts to deduplicate on supported filesystems without having to rescan all the data.

Keep in mind that files can be changed outside of Nextcloud, for example by changing EXIF tags with external tools.

Currently I use fdupes and duperemove; the latter has special support for the Btrfs filesystem and can store hashes in an SQLite DB so it doesn’t have to rescan every single file.
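A typical invocation could look something like this (the data path and hashfile location are only examples, adjust them to your setup):

duperemove -d -r --hashfile=/var/cache/duperemove.db /path/to/nextcloud/data

Here -d submits the duplicate extents to the kernel for deduplication, -r scans recursively, and --hashfile keeps the checksums between runs so only new or changed files are hashed again.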


Well, if the FS is taking care of deduplication, NC doesn’t need to be aware of the FS layout.
UserA and UserB upload the same file to their spaces, the FS detects the duplicate and creates a kind of soft link. A bit simplified: only one file actually consumes disk space.
Both users (and NC) access their files without knowing that disk space has been saved. If one user modifies the file, then (depending on the deduplication mode) the FS either creates a copy of the existing file, so that all of a sudden two separate files exist, or it copies only the modified blocks of that file to disk, in which case disk space is still saved.
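On Btrfs (and XFS) you can see this copy-on-write sharing in action with a reflink copy; the file names here are just placeholders:

cp --reflink=always original.iso copy.iso

Both files report their full size, but the data blocks exist only once on disk until one of the copies is modified.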

Deduplication

We have another way to save disk in conjunction with compression, and that is deduplication. Now, there are three main types of deduplication: file, block, and byte. File deduplication is the most performant and least costly on system resources. Each file is hashed with a cryptographic hashing algorithm, such as SHA-256. If the hash matches for multiple files, rather than storing the new file on disk, we reference the original file in the metadata. This can have significant savings, but has a serious drawback. If a single byte changes in the file, the hashes will no longer match. This means we can no longer reference the whole file in the filesystem metadata. As such, we must make a copy of all the blocks to disk. For large files, this has massive performance impacts.
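As a rough illustration of the file-level approach, byte-identical files can be found with nothing more than coreutils (the path is just an example):

find /path/to/data -type f -exec sha256sum {} + | sort | uniq -w64 --all-repeated=separate

sha256sum prints a 64-character hash per file, so uniq -w64 groups files whose entire content matches; each such group could in principle be reduced to a single stored copy plus references.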

On the extreme other side of the spectrum, we have byte deduplication. This deduplication method is the most expensive, because you must keep “anchor points” to determine where regions of deduplicated and unique bytes start and end. After all, bytes are bytes, and without knowing which files need them, it’s nothing more than a sea of data. This sort of deduplication works well for storage where a file may be stored multiple times, even if it’s not aligned under the same blocks, such as mail attachments.

In the middle, we have block deduplication. ZFS uses block deduplication only. Block deduplication shares all the identical blocks in a file and stores only the blocks that differ. This allows us to keep just the unique blocks on disk and reference the shared blocks in RAM. It’s more efficient than byte deduplication and more flexible than file deduplication, but it has a drawback: it requires a great deal of memory to keep track of which blocks are shared and which are not. Still, because filesystems read and write data in block segments, it makes the most sense to use block deduplication in a modern filesystem.

The shared blocks are stored in what’s called a “deduplication table”. The more duplicated blocks on the filesystem, the larger this table will grow. Every time data is written or read, the deduplication table is referenced. This means you want to keep the ENTIRE deduplication table in fast RAM. If you do not have enough RAM, then the table will spill over to disk. This can have massive performance impacts on your storage, both for reading and writing data.
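As a rough back-of-the-envelope estimate, assuming the commonly quoted figure of about 320 bytes per deduplication-table entry and the default 128 KiB recordsize:

1 TiB / 128 KiB ≈ 8 million blocks
8 million blocks × 320 bytes ≈ 2.5 GiB of table per TiB of unique data

which is roughly where the often-cited rule of thumb of around 5 GB of RAM per TB of pool data comes from, since the table has to share memory with the normal ARC cache.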

Sure, for backups to another disk (and FS) you still need the “real” amount of storage, because the files are saved there multiple times.

But what I’m actually saying is: implementing deduplication in Nextcloud will take quite some time and probably won’t arrive within the next two major releases. So if deduplication is urgently required, the FS feature is a quick solution and maybe a good compromise.

Hey, thanks for the information regarding deduplication. Very interesting. I’m running Nextcloud in a FreeBSD jail within FreeNAS, so I wasn’t aware I was utilizing the advantages of the underlying ZFS filesystem. Neat!

I’m glad if it helps :slight_smile:

But check if this feature is enabled.

With the following command you can check whether deduplication is enabled for your datasets / ZFS pools:
zfs get dedup

With
zfs set dedup=on <zpool-name>
you can enable that.
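Since dedup is a per-dataset property, you can also enable it only for the dataset that holds the Nextcloud data instead of for the whole pool (the dataset name here is just an example):

zfs set dedup=on tank/nextcloud
zfs get dedup tank/nextcloud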

Check the other configuration as well and keep an eye on dedupratio:
zpool get all

By the way, ZFS has a pretty good compression feature, which doesn’t use much CPU with lz4. So if disk space matters a lot and you have enough RAM to profit from the ZFS features, that might help you.
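If you want to try that, something along these lines should work (again, the dataset name is only a placeholder):

zfs set compression=lz4 tank/nextcloud
zfs get compression,compressratio tank/nextcloud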

Here are my zpool settings on my main pool:

tank          size                           43.5T                          -
tank          capacity                       15%                            -
tank          altroot                        /mnt                           local
tank          health                         ONLINE                         -
tank          guid                           13698474688684120618           default
tank          version                        -                              default
tank          bootfs                         -                              default
tank          delegation                     on                             default
tank          autoreplace                    off                            default
tank          cachefile                      /data/zfs/zpool.cache          local
tank          failmode                       continue                       local
tank          listsnapshots                  off                            default
tank          autoexpand                     on                             local
tank          dedupditto                     0                              default
tank          dedupratio                     1.00x                          -
tank          free                           36.9T                          -
tank          allocated                      6.61T                          -
tank          readonly                       off                            -
tank          comment                        -                              default
tank          expandsize                     -                              -
tank          freeing                        0                              default
tank          fragmentation                  6%                             -
tank          leaked                         0                              default
tank          bootsize                       -                              default
tank          checkpoint                     -                              -
tank          feature@async_destroy          enabled                        local
tank          feature@empty_bpobj            active                         local
tank          feature@lz4_compress           active                         local
tank          feature@multi_vdev_crash_dump  enabled                        local
tank          feature@spacemap_histogram     active                         local
tank          feature@enabled_txg            active                         local
tank          feature@hole_birth             active                         local
tank          feature@extensible_dataset     enabled                        local
tank          feature@embedded_data          active                         local
tank          feature@bookmarks              enabled                        local
tank          feature@filesystem_limits      enabled                        local
tank          feature@large_blocks           enabled                        local
tank          feature@sha512                 disabled                       local
tank          feature@skein                  disabled                       local
tank          feature@device_removal         disabled                       local
tank          feature@obsolete_counts        disabled                       local
tank          feature@zpool_checkpoint       disabled                       local

And here is the dedup:
# zpool get dedup
NAME          PROPERTY    VALUE  SOURCE
freenas-boot  dedupratio  1.00x  -
tank          dedupratio  1.00x  -

It doesn’t seem like dedup is activated, I’m guessing.

That was zpool get all, but the command you want is zfs get dedup :wink:
zfs get dedup

Outputs something like:

NAME     PROPERTY  VALUE
bigdata  dedup     on

I’ve got a bunch of datasets under tank. It looks like you can turn on dedup for individual datasets; however, for the main zpool it shows:

tank dedup off local

Thank you very much for your answers. I am not prepared to take this further as long as there is no solution within Nextcloud itself.

I hope that future versions will have this feature.

@Schmu

So I’m running Nextcloud on FreeNAS, which is based on FreeBSD and uses ZFS. I enquired on the FreeNAS forums about the implications of turning on dedup. Basically they said that for roughly every 1 TB of storage, dedup is going to require 5-6 GB of RAM once a lot of duplicates start appearing. Their ZFS setup uses compression as a minimum, but they stated that unless you have a lot of RAM, they’d suggest not turning this feature on. Once the feature is turned on, it can’t be undone.
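For anyone curious, the actual size of the dedup table can apparently be checked with zpool status -D (pool name as in the output above):

zpool status -D tank

which reports the number of DDT entries and how much space they take on disk and in core, so you can judge whether that RAM estimate holds for your own data.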


@gao A bit harsh, don’t you think? Do you use Linux? If so, look at the tools fdupes and duperemove, and consider switching to the Btrfs filesystem.

Here’s an example of some files on a server I have. They would normally use up 178 GiB of disk space, but with some compression and the use of duperemove the total space used is only 91 GiB.

# compsize .
Processed 36436 files, 53433 regular extents (265250 refs), 4588 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       99%       91G          91G         178G
none       100%       91G          91G         178G
zstd        75%      1.5M         2.0M         2.0M
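In case anyone wants to reproduce this, the compression part is just a Btrfs mount option (device and mount point below are placeholders); new writes are then compressed transparently, and a duperemove run as described above takes care of the duplicate extents:

mount -o compress=zstd /dev/sdb1 /srv/data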

We always want better functionality. Thank you for your approach.