S3 external storage excessive network traffic

I am running S3 as external storage, and recently I’ve noticed a huge spike in API requests.

I have disabled previews.

I have check changes set to never.

I have followed this guide to secure my S3 bucket

https://aws.amazon.com/blogs/opensource/scale-your-nextcloud-with-storage-on-amazon-simple-storage-service-amazon-s3/

I have approximately 120 GB in the bucket.

I have versioning enabled.

I have set all sync clients to ignore hidden files.

I am running NC28

Once per day, I see a spike in traffic that results in over 50 to 100 GB being downloaded over the course of several hours.

I don’t have any cron jobs running that align with this.

I have enabled logging to AWS CloudTrail. All requests are coming from the server IP. Sometimes there are 404 errors because a GET request for some reason has a / appended to the path, so it might try to get .jpg/ and fail. But it will also request the .jpg itself.

Otherwise I don’t see anything out of the ordinary, but I am not an expert. Basically all the requests are coming from my server, through my IAM role, with user agent aws-sdk-php.

This is a lot of data being downloaded, and I don’t know where it would be going. I checked during a download event and all of the sync clients are quiet.

I am not sure why the web UI needs to stream such a large quantity of data, if that’s where it’s going.

The only shares in the system tab are from within Talk.

Has anyone else seen this behaviour? What can I do to figure this out? Previously costs were very reasonable, but at this rate my monthly bill is going up significantly, and I am concerned because it seems quite strange.

Looking at the application logs should give you better insight into what happens at that time. You may have to increase the log level… See "Logging" in the Nextcloud Administration Manual.
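A minimal sketch of raising the log level from the command line, assuming shell access to the server. It uses the occ command config:system:set to set loglevel to 0 (debug) and back to 2 (the default, warnings). The OCC variable and the two function names are placeholders made up for this example; how occ has to be invoked (which user, which directory) depends on the installation, and on managed hosting it may not be available at all.

```shell
# Assumption: occ can be started as "php occ" from the Nextcloud directory.
# Override OCC if your setup differs, e.g. OCC="sudo -u www-data php occ".
OCC="${OCC:-php occ}"

# Switch to debug logging (level 0) so background job activity is logged.
enable_debug_logging() {
    $OCC config:system:set loglevel --value=0 --type=integer
}

# Return to the default level (2 = warnings) once the investigation is done.
restore_default_logging() {
    $OCC config:system:set loglevel --value=2 --type=integer
}
```

Debug logging is verbose, so it is probably best enabled shortly before the expected traffic spike and switched back afterwards.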

I think this might be related to a background job which explains why it takes place at about the same time every day.

When I deactivated the access key to my S3 bucket during an event, I noticed this error immediately appeared in my log.

[core] Error: Error while running background job (class: OC\Core\BackgroundJobs\GenerateMetadataJob, arguments: )
from ? by – at Dec 26, 2023 at 11:44:54 AM

Right before there was

[PHP] Error: fopen(httpseek://): Failed to open stream: “OC\Files\Stream\SeekableHttpStream::stream_open” call failed at /mydomain/public_html/lib/private/Files/Stream/SeekableHttpStream.php#67
from ? by – at Dec 26, 2023 at 11:44:54 AM

And

[PHP] Error: fopen(https://s3imagepath.jpg): Failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden
at /mydomain/public_html/lib/private/Files/ObjectStore/S3ObjectTrait.php#90
from ? by – at Dec 26, 2023 at 11:44:54 AM

So it seems like this background job is perhaps checking the metadata on all of the objects in external storage, which is causing a huge volume of requests.

I am on 28.0 awaiting my provider to make the latest version available.

Does this possibly explain it?

This link mentions possible heavy data usage relating to this job, but why is it trying to rescan the whole storage every day ? https://github.com/nextcloud/photos/issues/2185#issue-2033895634

I stumbled upon the same issue. I’m running 28.0.1 with JuiceFS on Backblaze B2 for storage and was wondering why a lot of egress traffic appears every evening. Some digging in the logs showed:

{"reqId":"LF9xJW9in00pBvwc4Tw9","level":0,"time":"2023-12-26T18:12:41+00:00","remoteAddr":"","user":"--","app":"cron","method":"","url":"--","message":"Run OC\\Core\\BackgroundJobs\\GenerateMetadataJob job with ID 119439","userAgent":"--","version":"28.0.1.1","data":{"app":"cron"}}

The corresponding reqId in audit.log shows access to a lot of files for hours. Shouldn’t the metadata update job run for only an hour, if I understand correctly?
And does it now scan every file every day?
Why isn’t the metadata updated when running occ files:scan --all?

But all in all it’s bad when you pay for egress traffic and NC seems to download every file every day. B2 includes free egress of up to 3x the amount stored, but even that is not enough here.
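To check when the job starts on other installations, entries like the GenerateMetadataJob log line quoted in this post can be filtered out of the log. The following is only a sketch: it assumes the JSON-lines log format shown in that entry and debug-level logging, and the function name metadata_job_runs is made up for this example.

```shell
# Print the "time" field of every log line that mentions GenerateMetadataJob
# together with a job ID.
# $1: path to the Nextcloud log file (location depends on the installation)
metadata_job_runs() {
    logfile="$1"
    # -F treats the pattern as a literal string, not a regular expression.
    grep -F 'GenerateMetadataJob job with ID' "$logfile" |
        sed -n 's/.*"time":"\([^"]*\)".*/\1/p'
}
```

Called with the path to nextcloud.log, this prints one timestamp per matching entry, which can then be compared with the traffic spikes seen on the storage side.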

Anything is possible. This is very new code and I’m not at all familiar with it.

You might find something useful in the Dev Manual section for it:

https://docs.nextcloud.com/server/latest/developer_manual/digging_deeper/files-metadata.html

I also suggest checking for or filing a GH issue in the server repo. If nothing else, maybe the docs need to be extended. But it does sound like something more is going on here that may be unintended.

OK, I’ve opened an issue here https://github.com/nextcloud/server/issues/42489#issue-2056859055

If anyone would like to contribute to the ticket, please do.

I wonder if this can be temporarily mitigated by setting ‘enable_file_metadata’ to false in config.

I can also confirm that the job lasted for more than an hour. Even within an hour, with a large external storage, the amount of data can be quite high and thus the charge. I am not sure why it’s necessary to download the entire file for metadata purposes, but I am a hobbyist not a developer.

I also tried a few occ metadata:get commands and saw that the few files I checked were indexed, but only with the creation date and not all of the metadata detected. Yet the job has run for the past 10 days in a row, so by now it should have indexed everything required.

Any suggestions to mitigate this while the issue is being investigated? For example, if I know what time the job runs every day, could I set up a cron job to run at that time with a command like pgrep -f cron.php | xargs kill?

Is there a risk of corruption by doing this?

Or is there a modification I can make to a PHP file directly?

I assume it will probably take some time for any kind of fix to get pushed to production.

Did you try setting 'enable_file_metadata' => false?

I did not, because ChatGPT said, after I provided some of the relevant code, that it would not prevent the job from running.

You can prolong the time between cronjob runs by editing … /core/BackgroundJobs/GenerateMetadataJob.php. You can change the time given in line 51

$this->setInterval(24 * 3600);

to e.g.

$this->setInterval(7 * 24 * 3600);

for only getting into this mess every 7th day. Obviously, I have no clue what effect that might have on new files’ metadata.
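The value passed to setInterval() in the lines above is written as a product that works out to a number of seconds. As a quick check of what the two variants evaluate to, here is a small shell sketch; the helper name interval_seconds is made up for illustration and is not part of Nextcloud.

```shell
# Convert a number of days into the seconds value used in setInterval().
interval_seconds() {
    echo $(( $1 * 24 * 3600 ))
}

interval_seconds 1    # prints 86400  (the original 24 * 3600)
interval_seconds 7    # prints 604800 (7 * 24 * 3600)
```

Note that a change made directly in a core file like this is likely to be overwritten by the next update.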

OK, good idea, I have done this. Thank you

Seems like ChatGPT is right: I’ve set 'enable_file_metadata' => false and it has no effect on the background job. I’ve moved my files to another storage now where I don’t pay for traffic.

Changed it to
$this->setInterval(30 * 24 * 3600);

Now there is 100 GB less traffic per day.
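For a rough sense of what the longer interval saves, here is a back-of-the-envelope sketch. It assumes that every run of the job downloads about the same amount (roughly 100 GB, going by the numbers in this thread), which has not been confirmed; the function name monthly_egress_gb is made up for this example.

```shell
# Estimate the egress caused by the job over a 30-day period.
# $1: GB downloaded per job run (assumed constant)
# $2: days between job runs
monthly_egress_gb() {
    runs=$(( 30 / $2 ))    # integer division: completed runs in 30 days
    echo $(( runs * $1 ))
}

monthly_egress_gb 100 1     # prints 3000 (daily runs)
monthly_egress_gb 100 30    # prints 100  (one run per 30 days)
```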