High Availability Server - My latest Journey experience

Hi all.

I have been using Nextcloud since 2016, and I rely on it for my self-hosted services. Historically, I have run my Nextcloud instance in an LXD (LXC) container, and it has operated well. One of the things on my ‘to-do’ list was to create a truly seamless fail-over experience, because downtime was sometimes more than just inconvenient. I suspect my journey is not over, but I have recently set up a new installation that addresses many of my needs:

  1. I have three instances of Nextcloud running, each on a different physical server.
  2. Using keepalived, one of them is always hot and faces the internet (this is very easy to configure).
  3. My files are stored on replicated glusterfs storage (hosted on each of the three servers inside dedicated storage VMs).
    ==> I mount the glusterfs storage in my containers in two different directories, one for the Nextcloud installation with all the configs/settings/apps (typically at /var/www/nextcloud) and one for the much more voluminous user data (I personally mount mine at /var/www/nextcloud-data, yours may be different). You could use e.g. ceph instead of gluster, but it does my head in trying to get it to work reliably, so I went with glusterfs. Your mileage may vary. There are other solutions here too; basically you need some storage you can sync very fast.
  4. I synchronize the databases using Galera. This is the magic: this is what keeps everything properly in sync. It’s actually easy to configure.
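
For anyone trying to picture what "one of them is always hot" means in item 2, here is a minimal sketch of priority-based fail-over in Python. It is a simplified model of the idea, not keepalived's actual implementation; the node names (nc1, nc2, nc3) and priority values are hypothetical and not taken from the setup above.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str             # hypothetical host name
    priority: int         # higher wins, similar in spirit to a VRRP priority
    healthy: bool = True  # whether the node currently passes its health check


def active_node(nodes):
    """Return the healthy node with the highest priority, or None if all are down."""
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda n: n.priority)


if __name__ == "__main__":
    cluster = [Node("nc1", 150), Node("nc2", 100), Node("nc3", 50)]
    print(active_node(cluster).name)  # the highest-priority node faces the internet
    cluster[0].healthy = False        # simulate stopping the first server
    print(active_node(cluster).name)  # the next-highest healthy node takes over
```

In a real deployment the "hot" node is simply whichever one currently holds the shared virtual IP address, so clients never need to know which physical server they are talking to.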

I have to say, to date, I am very pleased with this setup. It’s my best ever “failover” service. I can stop any one server and another one picks up where I was. It’s not 100% fool-proof, but it’s about 98+% there so far. I may be able to improve my reverse proxy or something to fix the brief issues I do see, but they are very “livable” and are similar to glitches on commercial services, where you have to refresh a page occasionally.
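
As background on why three nodes is a comfortable number for this, here is a small Python sketch of the majority rule that Galera-style clusters rely on. It is a simplification that assumes every node has equal weight and that the failure is unplanned; the function names are made up for illustration.

```python
def has_quorum(nodes_up: int, cluster_size: int) -> bool:
    """True when the surviving nodes form a strict majority of the cluster."""
    return nodes_up * 2 > cluster_size


def tolerated_failures(cluster_size: int) -> int:
    """Largest number of nodes that can fail while a strict majority survives."""
    return (cluster_size - 1) // 2


if __name__ == "__main__":
    for size in (2, 3, 5):
        print(size, "nodes tolerate", tolerated_failures(size), "unplanned failure(s)")
```

Under these assumptions a three-node cluster keeps a majority when any single node drops out, while a two-node cluster does not, which is one reason three is a popular minimum.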

A key result of my setup is that all my files, settings, apps, app-data, shares, contacts etc. etc. are simultaneously updated and made immediately available on every node. Of course, you double (or in my case, triple) storage requirements since three redundant copies means, well, three lots of storage.

I can still get downtime (power cut, ISP), but things like me rebooting a server (or more likely, me accidentally breaking one (again)) no longer impact my Nextcloud uptime at all.

I just wanted to post this to let people know this is possible, as I have seen many articles that make this harder work than it now has to be. I stress this is still not 100% perfect - when I have deliberately tried to break this, I can get momentary disconnections, but nothing that a page refresh doesn’t fix - something I think we all deal with on just about any service.
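
If you want to measure those momentary disconnections while deliberately breaking things, a small probe script can help. The sketch below is one possible approach in Python, not something from the setup described above: it polls Nextcloud's status.php endpoint and treats a response as healthy when it reports `installed` as true and `maintenance` as false. The URL used in the usage note is a placeholder, and `watch` is a hypothetical helper name.

```python
import http.client
import json
import time
import urllib.request


def is_healthy(body: str) -> bool:
    """Interpret a status.php response: installed and not in maintenance mode."""
    try:
        status = json.loads(body)
    except json.JSONDecodeError:
        return False  # e.g. an HTML error page from the reverse proxy
    if not isinstance(status, dict):
        return False
    return status.get("installed") is True and status.get("maintenance") is False


def probe(url: str, timeout: float = 2.0) -> bool:
    """Fetch the status URL once; any network or HTTP error counts as a failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.read().decode("utf-8", errors="replace"))
    except (OSError, http.client.HTTPException):
        return False


def watch(url: str, probes: int, interval: float = 1.0) -> int:
    """Probe the URL a fixed number of times and return how many probes failed."""
    failures = 0
    for _ in range(probes):
        ok = probe(url)
        print(time.strftime("%H:%M:%S"), "up" if ok else "DOWN")
        if not ok:
            failures += 1
        time.sleep(interval)
    return failures
```

For example, calling `watch("https://cloud.example.com/status.php", probes=120)` while stopping one server would give a rough count of how many seconds the service was unreachable during the switch-over.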

A downside of clustering is that it is slightly slower, but in my use-case, it’s literally about 25% slower - to where I don’t really notice it. If you want absolute best speed, a single server using flash storage on bare metal with super fast cpu/memory is unbeatable. But I wanted substantial redundancy as I store a lot of personally valuable data.

I am not suggesting this is the only way to do it; it’s just my (new) way. If anyone is interested to know more about this, feel free to PM me or, better yet, message me (likely faster) on Twitter (@ogselfhosting) or on Mastodon (@OGSH). I don’t claim to be an expert, but after 8 years or so, I have finally managed to get enough of this right that it seems to work well for me.

Happy Nextclouding, y’all :slight_smile:

Thanks for sharing your “Journey”.

Isn’t this the old misconception that assumes that redundancy equals security?

It’s like RAID 1 or even (Next)cloud: data is mirrored so it’s available twice, but you only have to make one mistake, e.g. an accidental deletion of files or a misconfiguration, and it’s replicated immediately as well. There is no filter that knows whether a change is something bad that should not be replicated or business as usual. So high availability when individual hosts receive a kernel update one after the other and need a restart, yes, but extra security against self-inflicted damage, no.

Much luck,
ernolf

Well, I guess I didn’t state it explicitly, but your post signals that I should have done: “fail-over” is absolutely NOT a backup and should not be thought of as such.

As you correctly state, if I delete a file (or even the whole data directory) in the current live instance then yes, all the others in that cluster kindly mirror that in blindingly fast time. My backup strategy is of course very different from this: backups are not stored in my ‘gluster-cluster’, but in completely different (and indeed multiple) locations.

This post (and my high-availability strategy) is about trying to maximize uptime in a homelab environment to mirror a professional service. Thanks for showing me I needed to make that clear, as I’d hate for someone to trip over the old “I thought RAID was a backup” train-wreck! :slight_smile:

This is a cool idea, and good to hear how you did it. Thanks for sharing.

So, taking your comments a bit further…I have been further experimenting by using a zfs zvol as my gluster backend. It turns out this is not an uncommon setup - get the massive benefits of zfs whilst also taking advantage of gluster’s clustering capabilities.

I’ll post an update on this after some extensive testing, BUT it offers the potential/opportunity of doing a complete restore of a prior state and/or re-creating the entire setup from a zfs backup. The front end does seem to work as advertised, but I have a lot to do to see if this can be an effective addition to a backup strategy (I still don’t think this alone would be good enough).

THANKS for getting me thinking on this some more. :slight_smile:

It would be really nice if this interesting topic were cleaned up and expanded into more of a #howto document that others could follow, even if that means mostly linking to documentation on other sites, etc. Cheers.