Deduplication over the network for client sync (Delta sync)

We have a directory tree to sync that contains large files that are easy to deduplicate at a block level (over 99% duplication version to version).

Files are about 200 MB each, and new versions are created all the time, with different file names but identical content except for the last few bytes. With the Nextcloud client, sync takes a few minutes per file, which does not work well since new versions can be created several times a minute.

Syncthing handles this very well when sending a new file that is a copy, or a new version, of a previous one.

For now we built a hacky setup where the Nextcloud client ignores the large files, Syncthing ignores everything but the large files, and both clients sync into the same folder on the client. The Syncthing folder is mounted as external storage on the server.

The use case is to have quick sync for CAD files in the same directory structure as the other files related to the same project:

  • projecta/
    files/ (docs and other stuff)
    models/ (large .rvt files)
  • projectb/
    files/ (docs and other stuff)
    models/ (large .rvt files)

It is not a big deal to have a separate folder structure on the server / webapp side but it is important to the current workflow to have all files related to a project in the same folder on the client.

I spent quite a bit of time researching this issue and found the huge thread about this feature on GitHub. It seems unlikely that it will be implemented soon.

I’d like feedback from the community on how to solve this; there is likely a solution I did not think of.

For now my options are:

  • Keep the hacky setup. I’m worried it will eventually fail, and it is much more complicated to set up. Architects won’t be able to set it up by themselves.
  • Fix Nextcloud. That is probably a few weeks of my time, which I could spend right now if somebody is willing to pay me; otherwise I have to work on other things that put bread on the table.
  • Move to something else, ownCloud being the most obvious option. I looked into Seafile and it looks good as a file sync service, but I would lose too many Nextcloud features and apps.

Do you have any indication that ownCloud has this feature?

My profile says that I’ve visited this forum on 795 different days and spent a total of 9 days reading it. I haven’t asked to be paid for my contributions yet…

It was merged in ownCloud back in 2019.

I wonder if the same code could be adapted, since Nextcloud was forked from ownCloud. Maybe there’s no need to reinvent the wheel. I don’t know how similar the clients are.

So, from this discussion, it seems that the use case was determined to be uncommon, and it was also mentioned that ownCloud’s delta sync code base ended up being huge and buggy. I suppose that’s why it was never implemented in Nextcloud.

First, I didn’t mean to be rude or disrespectful to the community by stating that I would need to be sponsored to work on this. I just can’t afford to do it for free right now, but I would gladly do so in other circumstances.

I looked a bit into the ownCloud implementation, and backporting the changes looked doable. It was something like 2k lines of code in the backend, about 1300 lines of new files and 200 of documentation; the rest looked a bit more tricky.

I don’t know how much the sync algorithm has diverged since the fork, though, so that might make things more difficult. In any case, the sync part is the hard part of syncing; deduplication is simply splitting a file into chunks and syncing those chunks instead. That doesn’t mean it is trivial to implement, but the concept is pretty simple.
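To make the "split into chunks and sync the chunks" idea concrete, here is a minimal sketch in Python. It is not how the Nextcloud, ownCloud or Syncthing clients actually implement it; the block size, the function names and the choice of fixed-size blocks are all my own assumptions for illustration. The receiver is assumed to already know the hashes of the blocks it holds from previous versions.

```python
import hashlib

# Assumed block size for illustration only, not taken from any real client.
BLOCK_SIZE = 128 * 1024


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Split data into fixed-size blocks and return the SHA-256 of each one."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def blocks_to_send(new: bytes, known_hashes: set[str],
                   block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indexes of blocks in `new` that the receiver does not hold yet."""
    return [
        index
        for index, digest in enumerate(block_hashes(new, block_size))
        if digest not in known_hashes
    ]


if __name__ == "__main__":
    # Ten distinct blocks stand in for the previous version of a model file.
    old = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
    # The new version differs only in its last few bytes.
    new = old[:-4] + b"\xff" * 4

    known = set(block_hashes(old))
    missing = blocks_to_send(new, known)
    print(f"{len(missing)} of {len(block_hashes(new))} blocks need to be sent")
```

With files that only differ in their last few bytes, only the final block has to cross the network. Fixed-size blocks stop helping when bytes are inserted or removed in the middle of a file, because every later block shifts; content-defined chunking exists for that case, but our files would not need it.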

Another potential solution I thought of is to entirely replace the sync engine with something that already does deduplication, like Syncthing. It would mean:

  • Working on the client and server to bundle Syncthing inside (Syncthing is a standalone binary that runs on many platforms, probably all the ones supported by Nextcloud)
  • Finding a place in Syncthing where we can detect any incoming change on the server
  • Finding a place in Nextcloud where we can tell it that a file has been updated

All those features exist and “only” have to be plumbed together. There would likely be major drawbacks to this solution, such as Syncthing always scanning the entire content of a huge file on every update, so it might need to involve handling specific file types, which is more or less what I already did with the exclusion/inclusion. It works, but it is definitely more of a hacky proof of concept than a clean solution.
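As a rough idea of what the plumbing could look like, one possible hook on the Syncthing side is its REST event stream (`/rest/events`), which reports `ItemFinished` events, and one possible hook on the Nextcloud side is `occ files:scan --path=...`. The sketch below only covers the mapping between the two. The folder ID `cad`, the path prefix `/alice/files/cad` and the function names are hypothetical, the polling loop and the actual `occ` invocation are left out, and I have not tested this against a real server.

```python
import posixpath


def paths_to_rescan(events, folder_id, nextcloud_prefix):
    """Map Syncthing ItemFinished events to Nextcloud paths that need a rescan.

    `events` is a list of dicts shaped like the entries returned by Syncthing's
    /rest/events endpoint. `nextcloud_prefix` is the (hypothetical) path under
    which the Syncthing folder is visible to Nextcloud. Item paths are assumed
    to use forward slashes, as on a Linux server.
    """
    paths = []
    for event in events:
        if event.get("type") != "ItemFinished":
            continue
        data = event.get("data") or {}
        # Skip other folders and items that failed to sync.
        if data.get("folder") != folder_id or data.get("error"):
            continue
        # Rescan the parent directory so new and deleted files are both picked up.
        parent = posixpath.dirname(data.get("item", ""))
        path = posixpath.join(nextcloud_prefix, parent) if parent else nextcloud_prefix
        if path not in paths:
            paths.append(path)
    return paths


def occ_command(path):
    """Build the argument list for an `occ files:scan` call limited to one path."""
    return ["php", "occ", "files:scan", "--path=" + path]


if __name__ == "__main__":
    sample = [
        {"id": 1, "type": "ItemFinished",
         "data": {"folder": "cad", "item": "projecta/models/tower_v12.rvt",
                  "error": None, "type": "file", "action": "update"}},
        {"id": 2, "type": "StateChanged", "data": {"folder": "cad"}},
    ]
    for path in paths_to_rescan(sample, "cad", "/alice/files/cad"):
        print(" ".join(occ_command(path)))
```

Rescanning the parent directory instead of the single file keeps the command list short when many versions land in the same `models/` folder. It does nothing about the cost of Syncthing rescanning huge files, which is the drawback mentioned above.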