Why use S3 over file system?

Paradox55 · October 26, 2019, 3:52am

tl;dr I have a local storage cluster that can support S3. Why should I use this over a file system? Especially as primary storage.

Wouldn’t the MySQL database act as the metadata server? Wouldn’t this be slower than regular metadata, especially on a popular Nextcloud install?

Reiner_Nippes · October 26, 2019, 10:00am

imho. there should be no reason to use s3 in your case. it’s just another protocol to access the same physical layer of storage, unless your storage cluster has some special feature only available to s3 objects, like remote site replication, encryption or versioning.

in the cloud you would use it because of the price. on aws, s3 is 3-4 times cheaper than ebs.
i don’t know about the smaller providers, but s3 on aws can be configured to replicate your data automatically to another region. and “normal” s3 is always replicated across at least three AZs. aws s3 is designed to give you 99.999999999% durability of objects.
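
To put the eleven nines in perspective, here is a small back-of-the-envelope sketch. It assumes the figure is an annual per-object durability and that object losses are independent of each other, which is a simplification. The function names are made up for illustration.

```python
from fractions import Fraction

# Annual durability quoted for S3: 99.999999999% ("eleven nines").
# Fractions are used so the tiny loss probability is not distorted by float rounding.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year (assumes independent losses)."""
    return object_count * (1 - DURABILITY)


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


if __name__ == "__main__":
    print(years_per_lost_object(10_000_000))  # prints 10000
```

So with ten million objects stored you would expect to lose roughly one object every 10,000 years on average. That says nothing about deletions, attacks or account problems, only about the storage layer itself.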

encryption and versioning are already built into nextcloud. and I don’t know whether nextcloud versioning uses s3 versioning, or whether you end up with everything versioned twice and pay double for it.

since s3 is network storage you can put your server in the us and your data in the eu. (whether that makes sense depends on your use case.)

for 38tb of medical data in the cloud, imho s3 is the only way to go. see: My experience deploying a HIPAA Compliant Nextcloud

yes. and what i don’t know is whether the database is the single point of failure here. if you lose the database in a normal setup you might be able to restore it from backup and rescan the file folders. in a worst case scenario you still have the files in folders named after the users. (app data, links and so forth are lost.)

but i don’t know how to back up and restore an s3 based nextcloud. ok. s3 needs no backup. pg_dump is one of the tools to back up your database. but i have no clue how that fits together. e.g. your database backup is 3h old. how do you recover the changes made to files in the meantime? is there any additional metadata in the s3 bucket?

since up to now i didn’t have the need to set up an s3 nextcloud i didn’t dive into that matter. if you want to know you may contact the author of the hipaa article or just test it in a safe environment. i would be interested to know if my playbook runs with your s3 backend. feedback would be welcome. support would be limited.

https://github.com/ReinerNippes/nextcloud_on_docker/blob/7a3e9329ed5b79015c9118fc79572b4e1efdd56a/inventory#L50

          nextcloud_mail_smtpauthtype = LOGIN
          nextcloud_mail_domain       =
          nextcloud_mail_smtpname     =
          nextcloud_mail_smtpsecure   = tls
          nextcloud_mail_smtpauth     = 1
          nextcloud_mail_smtphost     =
          nextcloud_mail_smtpport     = 587
          nextcloud_mail_smtpname     =
          nextcloud_mail_smtppwd      = 
          
          # Use S3 Bucket as primary storage
          aws_s3_key            = ''
          aws_s3_secret         = ''
          # aws_s3_bucket_name    = ''
          # aws_s3_hostname       = 's3.amazonaws.com'
          # aws_s3_port           = '443'
          # aws_s3_use_ssl        = 'true'
          # aws_s3_region         = 'us-east-1'
          # aws_s3_use_path_style = 'true'
          
          # Install restic backup tool if backup_folder is not empty

Paradox55 · October 27, 2019, 2:43am

Looks like I can mount the pool and back up that way, but it’s not an ideal situation. Otherwise I’d be looking at copying the nextcloud database and somehow mounting it with fuse or something.

Alternatively I could use s3ql, which uses S3 as the backend, as a drop-in replacement.

> imho. there should be no reason to use s3 in your case.

Well, nextcloud already technically acts as the metadata server even on a regular filesystem. All of the files and file paths are stored in the database iirc and it has to go through that database constantly.

Eliminating the MDS & CephFS (which uses fuse) from the equation should provide a significant performance improvement.

At least s3ql supports deduplication. But it’s still a fuse based file system which I’d prefer to avoid.

> i don’t know about the smaller providers but s3 on aws can be configured to replicate your data automatically to another region. and “normal” s3 is always replicated to three AZs. aws s3 is designed to give you 99.999999999% durability of objects.

> s3 needs no backup.

I don’t mean to make assumptions, but based on your post it sounds like you think you don’t need to back up your data on AWS. This is not the case. Replication is not a backup solution. Even if the data is spread out across three datacenters, that doesn’t guarantee something won’t happen on Amazon’s end that causes data loss. They can physically lose a datacenter and still be operational, but that doesn’t stop malicious attacks or technical errors.

And that doesn’t even include payment issues or Amazon suspending or terminating your account for any reason, no matter how unlikely that is.

Reiner_Nippes · October 27, 2019, 10:39am

maybe rclone (rclone.org) is useful to back up/sync S3 objects. and in combination with restic it should be easy to get a scripted backup of the database and the S3 objects.

that line should be replaced with an rclone command.

but still I would try this in a safe environment before releasing it to production, because I have no idea how to recover a single file, or whether recovering a whole site works at all.
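
A scripted backup of the database plus the S3 objects, pushed into restic, could be sketched roughly like this. It is only an outline under assumptions: PostgreSQL as the database, an rclone remote and a restic repository that are already configured and initialised, credentials supplied through the environment, and a staging directory that already exists. The database name, remote name and paths are placeholders, not values from any real setup.

```python
import subprocess
from pathlib import PurePosixPath


def backup_commands(db_name, s3_remote, staging_dir, restic_repo):
    """Build the three commands for one backup run (nothing is executed here)."""
    staging = PurePosixPath(staging_dir)
    dump_file = staging / f"{db_name}.dump"
    return [
        # 1. Dump the PostgreSQL database in pg_dump's custom format.
        ["pg_dump", "--format=custom", f"--file={dump_file}", db_name],
        # 2. Mirror the bucket into the staging directory.
        ["rclone", "sync", s3_remote, str(staging / "objects")],
        # 3. Take a restic snapshot of the dump and the mirrored objects.
        ["restic", "-r", restic_repo, "backup", str(staging)],
    ]


def run_backup(commands):
    """Run the commands in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder values; printing instead of running keeps the demo harmless.
    for cmd in backup_commands(
        "nextcloud", "s3remote:nextcloud-bucket",
        "/var/backups/nextcloud", "/mnt/backup/restic-repo",
    ):
        print(" ".join(cmd))
```

This does not answer the consistency question: the dump and the bucket copy are taken at different moments, so it may be worth putting Nextcloud into maintenance mode for the duration of the run. Restoring from such a backup is exactly the part that needs testing first.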

I even have a playbook for this. but not yet released. because this would be a tricky setup. (that is to say, if something breaks I would have no idea how to recover.)

imho there are too many layers between you and the files on the physical storage device. and you have to rely on a lot of people fixing bugs.

same with “normal” filesystems (hdd, ebs, nfs, you-name-it), right?

yeah. sorry. I had only the Armageddon scenario in mind, where one aws region goes down. users deleting files, malware attacks, or aws taking down its S3 service are of course not covered. (and except for armageddon I have seen all of this. you are right.)

so “Why should I use this over a file system?”: I wouldn’t use S3 as primary storage. too many layers and unclear how to recover. but as a backup target (restic/rclone) it should be ideal. (replicate/back up everything to aws, digitalocean and scaleway and you should be more than safe.)

p.s.: i’m not speaking on behalf of nextcloud professional services. if you bought a subscription and nextcloud is happy to support your setup, ignore my comments.

Paradox55 · October 27, 2019, 7:11pm

Okay, as it turns out backing up S3 is not as hard as I thought.

I still need some way to export the database to a file and sort through it by last modified time and creation time, so that only files changed within the last x days get backed up, since backing up the entire S3 instance daily would be really stupid. Not sure if restic does this - will be checking it out later.

Also @Reiner_Nippes my point regarding AWS and backup wasn’t so much to do with how unlikely it would be. I just wanted to point out that I’ve had services like dropbox randomly delete my data over the years.

I guess it really depends on how important your data is. If it’s medical data or other sensitive data you’d want more than one storage location and probably cold archival. If it’s Linux ISOs, who cares.

Reiner_Nippes · October 27, 2019, 7:40pm

restic does incremental backups. but i guess with restic you can configure S3 only as a backup target, not as a source. i would check if rclone can sync two S3 buckets.

so that you can run rclone sync remote1:bucket1 remote2:bucket2

in both cases it’s not necessary to query the database for modified files. imho that is done by restic/rclone during the sync.
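
The reason no database query is needed can be shown with a small sketch. It mimics, in a very simplified way, the default kind of comparison a sync tool like rclone makes: an object is copied when it is missing on the destination or when its size or modification time differs. The `ObjectInfo` type and the function names are invented for this illustration and are not part of rclone or restic.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class ObjectInfo(NamedTuple):
    size: int           # object size in bytes
    modified: datetime  # last-modified timestamp


def objects_to_copy(source, destination, tolerance=timedelta(seconds=1)):
    """Keys that are missing on the destination or differ in size or mtime."""
    changed = []
    for key, src in source.items():
        dst = destination.get(key)
        if (
            dst is None
            or dst.size != src.size
            or abs(dst.modified - src.modified) > tolerance
        ):
            changed.append(key)
    return sorted(changed)


def objects_to_delete(source, destination):
    """Keys that exist only on the destination (a sync would remove them)."""
    return sorted(key for key in destination if key not in source)


if __name__ == "__main__":
    now = datetime(2019, 10, 27, tzinfo=timezone.utc)
    source = {
        "urn:oid:1": ObjectInfo(10, now),
        "urn:oid:2": ObjectInfo(20, now),
        "urn:oid:3": ObjectInfo(30, now),
    }
    destination = {
        "urn:oid:1": ObjectInfo(10, now),
        "urn:oid:2": ObjectInfo(20, now - timedelta(hours=3)),  # stale copy
    }
    print(objects_to_copy(source, destination))  # ['urn:oid:2', 'urn:oid:3']
```

Only the two listings are compared, so a daily run transfers just what changed since the previous one.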

Paradox55 · October 27, 2019, 8:08pm

You’re right, it looks like it does store the date and timestamps in the S3 storage. I did not see this until running a specific rclone command.

Thank you!

IIPoliII · November 2, 2020, 8:35pm

Hey, can you explain to me how you retrieved a file from an S3 stored object?

Like, how can I download a file manually?
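
One possible approach, as a sketch rather than a tested recipe: with S3 as primary storage, Nextcloud stores each file as an object named `urn:oid:<fileid>`, and the mapping from path to fileid lives in the database (`oc_filecache`, joined to `oc_storages`). The example below uses an in-memory SQLite database with made-up rows as a stand-in for the real one. The default `oc_` table prefix, the `object::user:alice` storage id and the file path are assumptions for illustration, and a MySQL or PostgreSQL driver would use a different placeholder style than `?`.

```python
import sqlite3


def object_key(fileid: int) -> str:
    """Name of the S3 object that holds the content of a given fileid."""
    return f"urn:oid:{fileid}"


def find_object_key(conn, storage_id, path):
    """Look up a file by storage id and path; return its object key or None."""
    row = conn.execute(
        "SELECT fc.fileid FROM oc_filecache AS fc "
        "JOIN oc_storages AS s ON s.numeric_id = fc.storage "
        "WHERE s.id = ? AND fc.path = ?",
        (storage_id, path),
    ).fetchone()
    return object_key(row[0]) if row else None


def demo_connection():
    """In-memory stand-in for the Nextcloud database, with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE oc_storages (numeric_id INTEGER PRIMARY KEY, id TEXT);
        CREATE TABLE oc_filecache (
            fileid INTEGER PRIMARY KEY, storage INTEGER, path TEXT
        );
        INSERT INTO oc_storages VALUES (1, 'object::user:alice');
        INSERT INTO oc_filecache VALUES (1234, 1, 'files/Documents/report.pdf');
        """
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    key = find_object_key(conn, "object::user:alice", "files/Documents/report.pdf")
    print(key)  # prints urn:oid:1234
```

With the key in hand, any S3 client should be able to fetch the object from the bucket and save it under its original file name. If Nextcloud’s server-side encryption is enabled, the downloaded object will still be encrypted.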