High Availability Server - My latest Journey experience

Hi all.

I have been using Nextcloud since 2016, and I rely on it for my self-hosted services. Historically, I have run my Nextcloud instance in an LXD (LXC) container, and it’s operated well. One of the things I had on my ‘to-do’ list was to create a truly seamless fail-over experience, because downtime was sometimes more than just inconvenient. I suspect my journey is not over, but recently I have set up a new installation that addresses many of my needs:

  1. I have three instances of Nextcloud running, each on a different physical server.
  2. Using keepalived, one of them is always hot and faces the internet (this is very easy to configure).
  3. My files are stored on glusterfs replicated storage (which is hosted on each of the three servers inside dedicated storage VMs).
    ==> I mount the glusterfs storage in my containers in two different directories, one for the Nextcloud installation with all the configs/settings/apps (typically at /var/www/nextcloud) and one for the much more voluminous user data (I personally mount mine at /var/www/nextcloud-data, yours may be different). You could use e.g. ceph instead of gluster, but it does my head in trying to get it to work reliably, so I went with glusterfs. Your mileage may vary. There are other solutions here too, but basically you need some storage you can sync very fast.
  4. I synchronize the databases using Galera. This is the magic: this is what keeps everything properly in sync. It’s actually easy to configure.
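
To give a feel for how keepalived can decide whether a node is fit to be the "hot" one: it can run an external check script (its `vrrp_script` option) and treats a non-zero exit status as a failed check. Below is a minimal sketch of such a check in Python. It is not taken from the setup described above; the host, port and timeout are assumptions you would adjust.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def exit_code(host: str = "127.0.0.1", port: int = 443) -> int:
    """Map the check to an exit status: 0 = healthy, 1 = unhealthy.

    The default host and port are assumptions (local web server on HTTPS).
    """
    return 0 if port_is_open(host, port) else 1


# When used as a standalone check script, finish with:
#     import sys; sys.exit(exit_code())
```

This only proves that something is listening on the port, not that Nextcloud itself is answering sensibly, so treat it as a starting point.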

I have to say, to date, I am very pleased with this setup. It’s my best ever “failover” service. I can stop any one server and another one picks up where I was. It’s not 100% fool-proof, but it’s about 98+% there so far. I may be able to improve my reverse proxy or something to fix the brief issues I do see, but they are very “livable” and are similar to glitches on commercial services, where you have to refresh a page occasionally.

A key result of my setup is that all my files, settings, apps, app-data, shares, contacts etc. are simultaneously updated and made immediately available on every node. Of course, you double (or in my case, triple) your storage requirements, since three redundant copies mean, well, three lots of storage.
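
The storage cost is simple arithmetic, sketched below for anyone sizing disks. The 500 GB figure is purely hypothetical, and the calculation ignores filesystem overhead.

```python
def raw_storage_needed(usable_gb: float, replicas: int) -> float:
    """Raw capacity needed across all nodes for a fully replicated volume.

    Every replica holds a complete copy, so the total is usable * replicas.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return usable_gb * replicas


# Hypothetical example: 500 GB of Nextcloud data on three replicas.
print(raw_storage_needed(500, 3))  # 1500 (GB in total, 500 GB per node)
```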

I can still get downtime (power cut, ISP), but things like me rebooting a server (or, more likely, me accidentally breaking one (again)) no longer impact my Nextcloud uptime at all.

I just wanted to post this to let people know this is possible, as I have seen many articles that make this harder work than it now has to be. I stress this is still not 100% perfect - when I have deliberately tried to break this, I can get momentary disconnections, but nothing that a page refresh doesn’t fix - something I think we all deal with on just about any service.

A downside of clustering is that it is slower: in my use-case, it’s about 25% slower, which I don’t really notice. If you want absolute best speed, a single server using flash storage on bare metal with super fast cpu/memory is unbeatable. But I wanted substantial redundancy as I store a lot of personally valuable data.

I am not suggesting this is the only way to do it, it’s just my (new) way. If anyone is interested to know more about this, feel free to pm me or better yet message me (likely faster) on Twitter (@ogselfhosting) or on Mastodon (@OGSH). I don’t claim to be an expert, but after 8 years or so, I finally managed to get enough of this right that it seems to work well for me.

Happy Nextclouding, y’all :slight_smile:

Thanks for sharing your “Journey”.

Isn’t this the old misconception that assumes that redundancy equals security?

It’s like RAID 1 or even (Next)cloud: data is mirrored, so it’s available twice, but you only have to make one mistake, e.g. an accidental deletion of files or a misconfiguration, and it’s replicated immediately as well. There is no filter that knows whether a change is something bad that should not be replicated or just business as usual. So high availability when individual hosts receive a kernel update one after the other and need a restart, yes, but extra security against self-inflicted damage, no.

Much luck,
ernolf

Well, I guess I didn’t state it explicitly, but your post signals that I should have done: “fail-over” is absolutely NOT a backup and should not be thought of as such.

As you correctly state, if I delete a file (or even the whole data directory) in the current live instance then yes, all the others in that cluster kindly mirror that in blindingly fast time. My backup strategy is of course very different from this: backups are not stored in my ‘gluster-cluster’, but in completely different (and indeed multiple) locations.

This post (and my high-availability strategy) is about trying to maximize uptime in a homelab environment to mirror a professional service. Thanks for showing me I needed to make that clear, as I’d hate for someone to trip over the old “I thought RAID was a backup” train-wreck! :slight_smile:

This is a cool idea, and good to hear how you did it. Thanks for sharing.

So, taking your comments a bit further…I have been further experimenting by using a zfs zvol as my gluster backend. It turns out this is not an uncommon setup - get the massive benefits of zfs whilst also taking advantage of gluster’s clustering capabilities.

I’ll post an update on this after some extensive testing, BUT it offers the potential/opportunity of doing a complete restore of a prior state and/or re-creating the entire setup from a zfs backup. The front end does seem to work as advertised, but I have a lot to do to see if this can be an effective addition to a backup strategy (I still don’t think this alone would be good enough).

THANKS for getting me thinking on this some more. :slight_smile:

It would be really nice if this interesting topic were cleaned up and expanded into more of a #howto document that others could follow, even if that means mostly linking to documentation on other sites, etc. Cheers.

This is exactly what I was looking for. I saw the video at https://www.youtube.com/watch?v=ARsqxUw1ONc&t=477s but that shows how to do a fresh install for the master and backup server. Did you deploy your backup on an existing NC instance?

Yes, that video set me off on that path too. Galera is fine, but it really needs ProxySQL in front of it, especially if you have more than 2 nodes (I play with three).

For the install, I tried several ways. The easiest for me was to set up a glusterfs mount for the Nextcloud files/data in a clone of an existing Nextcloud instance. Then rsync -a all the existing Nextcloud files/data to the gluster mount(s), change the mountpoint names so they match your Nextcloud config.php, and restart the instance. All the Nextcloud files are now on the gluster cluster and, once you delete the old copies of the Nextcloud directories, the instance is MUCH smaller and thus easier to copy/backup etc. Very cool.

Once “Nextcloud” fires up, it doesn’t know anything has changed: data and files are exactly “where they were before”, but now operating on gluster mounts with the resilience that brings against HD failure. Remember to change permissions on the gluster mounts for www-data access.
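
The copy-and-fix-ownership steps above can be sketched as a small Python helper that only builds and prints the commands, so you can review them before running anything. The /mnt/gluster/... paths are hypothetical temporary mount points, and using `chown -R www-data:www-data` is my assumption about what "change permissions" needs on a typical Debian/Ubuntu web server; adjust both to your own layout.

```python
import shlex


def migration_commands(dir_pairs, web_user="www-data"):
    """Build (but do not run) the commands to copy data onto gluster mounts.

    dir_pairs is a list of (existing_dir, gluster_mount) tuples.
    The trailing slashes make rsync copy the directory *contents*
    into the mount rather than creating a nested sub-directory.
    """
    commands = []
    for src, dst in dir_pairs:
        src = src.rstrip("/") + "/"
        commands.append(["rsync", "-a", src, dst.rstrip("/") + "/"])
        commands.append(["chown", "-R", f"{web_user}:{web_user}", dst.rstrip("/")])
    return commands


# Hypothetical layout: existing directories and temporary gluster mount points.
pairs = [
    ("/var/www/nextcloud", "/mnt/gluster/nextcloud"),
    ("/var/www/nextcloud-data", "/mnt/gluster/nextcloud-data"),
]
for cmd in migration_commands(pairs):
    print(shlex.join(cmd))
```

Do this on a clone with Nextcloud stopped (or in maintenance mode), so nothing is written to the old directories mid-copy.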

The ISSUE is always MariaDB. Even with ProxySQL, I found it can get itself into a mess with conflicting writes, especially with three nodes. I find all of the problems it brings to be “unsatisfying” and arguably not worth the hassle thus far. That said, I haven’t quite given up on this yet, and I may try a do-over using a different MariaDB clustering approach.

One thing’s for sure: true Nextcloud “high-availability with no data-loss” to clients is MUCH harder to set up than a simpler “fail-over system that’s always available” (even if not always 100.000% up to date).

So what I am hearing is that you really need to do a fresh install of both in order to avoid a mess with databases…

To be honest, I only have maybe 5 users so I would be thrilled to have a failover system that’s almost always available. I don’t think I need to mess with HA setups.

Would you have any advice on how I can simply have a 2-node setup with a good (maybe even off-site) failover system?

I must not have been clear:

  1. You do NOT have to do a full install
  2. But I do recommend reconfiguring a full working COPY of your instance in case things don’t work as expected
  3. There are varying degrees of fail-over service, from “none” all the way to “a 99.9999% replicated hot instance”. It’s actually quite easy to get, say, “99.5%” failover without having to complicate life with clustering MariaDB. It’s only if you want that last extra fractional percent (basically not wanting to lose a single scrap of data) that this gets challenging.
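
To put those percentages in perspective, here is a rough conversion into a yearly downtime budget. It reads the figures as uptime percentages over a 365-day year, which is an assumption on my part (the numbers above are used loosely, partly about data freshness rather than pure uptime).

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service may be down at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.5, 99.9999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.4f} hours/year ({hours * 3600:.0f} seconds)")
# 99.5% works out to roughly 43.8 hours a year,
# 99.9999% to roughly half a minute a year.
```

That gap is why the last fraction of a percent costs so much more effort than the first 99.5.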

I’d be happy to try to explain further and/or help if I can.

Andrew

I would truly be grateful if you could help me sound out a number of n00b questions. 99.5 is good enough for me!

Maybe you can use the simple backup and restore procedure from the Nextcloud docs.

Keep your production system as it is. On the second system, restore the configs, data and database. You may want to give the second Nextcloud a different name while it is not acting as the failover. Regularly test that the backup Nextcloud can actually run. Make sure that new Nextcloud versions also reach it, by restoring the software along with the configurations, data and database.

Create regular automatic backups of the software, configs, data and database. Ideally, these should be imported automatically into the backup system. If not, check how long a restore from the backup server onto the second system takes, and whether that time is acceptable. In the event of a failure, just change the name of the backup Nextcloud. You may also need to change the IP or DNS.
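
One way to "regularly test that the backup Nextcloud can run" is to poll its `status.php` endpoint, which returns a small JSON document with fields such as `installed`, `maintenance` and `needsDbUpgrade`. Here is a minimal sketch; the hostname in the usage comment is hypothetical, and which fields count as "ready" is my own choice rather than an official rule.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url: str, timeout: float = 10.0) -> str:
    """Download the raw status.php response from a Nextcloud instance."""
    with urlopen(f"{base_url.rstrip('/')}/status.php", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def standby_is_ready(status_json: str) -> bool:
    """Return True if the status document looks like a usable instance.

    "Ready" here means: installed, not in maintenance mode and no
    pending database upgrade (an assumption, adjust to taste).
    """
    try:
        status = json.loads(status_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(status, dict):
        return False
    return (
        status.get("installed") is True
        and status.get("maintenance") is False
        and status.get("needsDbUpgrade") is False
    )


# Usage (hypothetical hostname):
#     ready = standby_is_ready(fetch_status("https://backup-cloud.example.com"))
```

Run from cron on a third machine, this at least tells you when a restore left the standby stuck in maintenance mode or waiting for an upgrade.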