How to prevent bots scraping public data?

A few days ago I installed Nextcloud on a public OCI server using Docker.
This morning I received a warning from the Chrome browser that the site was deceptive and that I should be careful using my password.

Upon investigating the Apache access log, I saw lots of strange IPs and user agents accessing my Nextcloud instance:

198.x.x.x - - [15/Jan/2022:17:21:40 +0000] "GET /apps/files_videoplayer/js/files_videoplayer-main.js?v=d81f9bfa-15 HTTP/1.1" 200 20211 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
216.x.x.x - - [15/Jan/2022:17:21:40 +0000] "GET /apps/files_rightclick/js/script.js?v=d81f9bfa-15 HTTP/1.1" 200 3870 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
216.x.x.x - - [15/Jan/2022:17:21:41 +0000] "GET /apps/files_rightclick/js/files.js?v=d81f9bfa-15 HTTP/1.1" 200 1899 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
92.x.x.x - - [15/Jan/2022:17:21:41 +0000] "GET /apps/theming/js/theming.js?v=d81f9bfa-15 HTTP/1.1" 200 633 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
216.x.x.x - - [15/Jan/2022:17:22:49 +0000] "GET / HTTP/1.1" 301 565 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
168.x.x.x - - [15/Jan/2022:17:22:50 +0000] "GET / HTTP/1.1" 302 6881 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
168.x.x.x - - [15/Jan/2022:17:22:51 +0000] "GET /login HTTP/1.1" 200 7115 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
168.x.x.x - - [15/Jan/2022:17:22:52 +0000] "GET /apps/files_rightclick/css/app.css?v=62abc69f-15 HTTP/1.1" 200 812 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
181.x.x.x - - [15/Jan/2022:17:22:52 +0000] "GET /core/css/guest.css?v=d81f9bfa-15 HTTP/1.1" 200 11482 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
181.x.x.x - - [15/Jan/2022:17:22:52 +0000] "GET /core/js/oc.js?v=d81f9bfa HTTP/1.1" 200 2483 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
168.x.x.x - - [15/Jan/2022:17:22:52 +0000] "GET /apps/theming/styles?v=15 HTTP/1.1" 200 2007 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
168.x.x.x - - [15/Jan/2022:17:22:53 +0000] "GET /core/js/dist/files_client.js?v=d81f9bfa-15 HTTP/1.1" 200 49004 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
168.x.x.x - - [15/Jan/2022:17:22:53 +0000] "GET /core/js/dist/files_fileinfo.js?v=d81f9bfa-15 HTTP/1.1" 200 10154 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
181.x.x.x - - [15/Jan/2022:17:22:52 +0000] "GET /core/js/dist/main.js?v=d81f9bfa-15 HTTP/1.1" 200 492921 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
181.x.x.x - - [15/Jan/2022:17:22:53 +0000] "GET /apps/files_sharing/js/dist/main.js?v=d81f9bfa-15 HTTP/1.1" 200 6940 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
168.x.x.x - - [15/Jan/2022:17:22:53 +0000] "GET /js/core/merged-template-prepend.js?v=d81f9bfa-15 HTTP/1.1" 200 4040 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
181.x.x.x - - [15/Jan/2022:17:22:53 +0000] "GET /apps/theming/image/background?v=15 HTTP/1.1" 200 5132556 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
181.x.x.x - - [15/Jan/2022:17:22:53 +0000] "GET /apps/theming/image/logo?v=15 HTTP/1.1" 200 92578 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
168.x.x.x - - [15/Jan/2022:17:22:53 +0000] "GET /apps/theming/image/logo?useSvg=1&v=15 HTTP/1.1" 200 92578 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
150.x.x.x - - [15/Jan/2022:17:22:53 +0000] "GET /apps/accessibility/js/accessibilityoca.js?v=d81f9bfa-15 HTTP/1.1" 200 11475 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
168.x.x.x - - [15/Jan/2022:17:22:54 +0000] "GET /apps/files_videoplayer/js/files_videoplayer-main.js?v=d81f9bfa-15 HTTP/1.1" 200 25450 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"
191.x.x.x - - [15/Jan/2022:17:22:54 +0000] "GET /apps/files_rightclick/js/script.js?v=d81f9bfa-15 HTTP/1.1" 200 9109 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 Mobile/15E148 Safari/604.1"

Given that the IPs and user agents are changing all the time, I suspect that a bot is scraping all the data from my cloud. When accessing the same URLs as the bot (e.g. https://mycloud.com/apps/theming/image/logo?v=15), I notice that this information is accessible without logging in.

So I’m wondering whether I could use an Apache RewriteRule or redirect so that any attempt to access such URLs is redirected to the login page. Or would this break the functionality of my Nextcloud instance?

I also checked my account and my wife’s and did not notice any strange login attempts or devices (settings → security → devices & sessions). Does this mean that nobody has accessed our data? Also, why did Chrome give this warning? It worries me that after 2 days my site is already under threat (FYI, I applied the HTTPS hardening directives and got an A+ security rating from the Nextcloud scan).

I don’t know exactly how this works or how your site got on that list. Maybe the previous owner of your server’s IP address did something malicious with it…? But I suppose, like with all Google services, there is some algorithm and automation behind it, which works anything but perfectly :wink: I found this on the topic… https://support.google.com/chrome/answer/99020?hl=en&co=GENIE.Platform%3DDesktop

Welcome to the internet! :wink: There is not much you can do about it if you host a public server. You can protect it from DDoS attacks by putting it behind something like a Cloudflare proxy, or lock it down by requiring a VPN or some kind of overlay network to access it. But I’m no expert when it comes to these topics.

I tested a few of the requests in your log on my instance and they do not reveal any sensitive information, as far as I can tell.
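If you want to repeat that kind of check yourself, here is a minimal sketch in Python. It requests a URL without cookies or credentials, does not follow redirects, and reports whether the server answered directly, redirected to the login page, or refused. The names `NoRedirect`, `classify` and `probe` are made up for this example, and treating any redirect whose target contains `/login` as "redirects to login" is my assumption, based on the `GET /` → `/login` sequence in your log.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the first response is visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx response.
        return None


def classify(status, location=""):
    """Describe what an anonymous client got back for one request."""
    if status in (301, 302, 303, 307, 308):
        # Assumption: a redirect target containing "/login" is the login page.
        if "/login" in (location or ""):
            return "redirects to login"
        return "redirects elsewhere"
    if status in (401, 403):
        return "denied"
    if 200 <= status < 300:
        return "served without login"
    return "other"


def probe(url, timeout=10):
    """Fetch url without cookies or credentials and classify the response."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return classify(resp.status, resp.headers.get("Location", ""))
    except urllib.error.HTTPError as err:
        # 3xx (not followed), 4xx and 5xx responses all end up here.
        return classify(err.code, err.headers.get("Location", ""))


# Example call (needs network access to your own server):
# print(probe("https://mycloud.com/apps/theming/image/logo?v=15"))
```

Keep in mind that "served without login" is expected for the login page itself and for the CSS, JavaScript and theming images it loads, so that result on its own does not mean anything private is exposed.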

Some amount of “background noise” is perfectly normal when you host things on a public server. If you use a well-known cloud provider like OCI, there is probably a bit more of it, because the operators of the bots know which IP ranges they have to scan first in order to find potentially insecure web applications. But if you have secured everything properly, you should be fine.
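To get a feel for how much of this noise is hitting the server, you could summarize the access log instead of reading it line by line. The following is a rough sketch in Python, assuming the log uses the Apache "combined" format shown in the excerpt above. `LOG_PATTERN`, `parse_line`, `summarize` and `SAMPLE` are names invented for this example, and the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def summarize(lines):
    """Count requests per client address, per status code and per user agent."""
    hosts, statuses, agents = Counter(), Counter(), Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines in another format
        hosts[entry["host"]] += 1
        statuses[entry["status"]] += 1
        agents[entry["agent"]] += 1
    return hosts, statuses, agents


# User agent and requests taken from the log excerpt in the question.
UA = ("Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) "
      "AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/95.0.4638.50 "
      "Mobile/15E148 Safari/604.1")
SAMPLE = [
    f'216.x.x.x - - [15/Jan/2022:17:22:49 +0000] "GET / HTTP/1.1" 301 565 "-" "{UA}"',
    f'168.x.x.x - - [15/Jan/2022:17:22:50 +0000] "GET / HTTP/1.1" 302 6881 "-" "{UA}"',
    f'168.x.x.x - - [15/Jan/2022:17:22:51 +0000] "GET /login HTTP/1.1" 200 7115 "-" "{UA}"',
    f'181.x.x.x - - [15/Jan/2022:17:22:52 +0000] "GET /core/css/guest.css?v=d81f9bfa-15 HTTP/1.1" 200 11482 "-" "{UA}"',
]

if __name__ == "__main__":
    hosts, statuses, agents = summarize(SAMPLE)
    print("requests per address:", hosts.most_common())
    print("requests per status:", statuses.most_common())
    print("distinct user agents:", len(agents))
```

To run it on a real log, pass the lines of the file to `summarize` in place of `SAMPLE`. Many addresses fetching only the login page and its static assets would fit the "background noise" picture; the script does not decide that for you, it only makes the pattern easier to see.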

@bb77

Thanks for your insights.

Indeed, the Google algorithm was probably wrong, but I was a bit worried given the foreign IPs in my access log. I’m using a domain from Freenom (.ga); maybe this doesn’t help to gain trust either :grinning_face_with_smiling_eyes:

This could indeed help, but this NC instance will only be used by a handful of people, so there is no need to put it behind an (expensive) proxy, I guess :slight_smile:

Indeed, I also tested these URLs on my other NC instance at home and got similar output, without sensitive or personal information.

I didn’t think of that, but indeed the IP addresses of OCI are well known, so I guess they are being scanned all the time. I did not encounter this behaviour on the NC I’m hosting at home, but that IP is less well known.

To summarize: nothing to worry about at the moment; I’ll just have to make sure I update regularly and keep everything secure.

Thanks for your explanation, I am more at ease now :slight_smile:
