High Availability Overview

Staring high availability straight in the eyes without turning to stone takes real courage, just like looking Medusa in the face. The safe, lazy choice? Just rent it from today's "IBM": AWS, Azure, or Google Cloud. After all, as our IT grandfathers loved to say: "Nobody ever got fired for buying IBM." But is it actually the smart move?

Medusa by Bernini (c. 1638–1640)

Following up on our initial post, we want to share a bit about how we do high availability at Totus. It is a mixture of software, business processes and relationships.

We are robust because we plan for plans not to go as planned 🙃🥁🎶 ... And yes, we look at the abyss for you, so you do not have to do it.

At each layer of the Totus platform, we asked the simple question "what could go wrong?" We deal with each failure on its own first, and then with each of them triggering a cascade of further failures:

Physical

Hard disk RAID failures
Memory, CPU, motherboard failures
Network failure
Split networks
Providers closing accounts ¹

Logical

Database replicas out of sync
Complete node data-loss
Security compromise
Generic Service Failure
Usage peak

Business

Account closure without notice ¹
GDPR/Data Protection
Spike in demand

We will not go into the full detail of the operational plan, mostly for operational risk management reasons. Overall we believe we are pretty secure, but exposing our whole platform in minute detail could be dangerous, so we will only share highlights.

Where are we positioned?

A bit like you, we are not a corporation with an infinite budget for lawyers and hard disks. We buy smart: we rent commodity servers that are 3+ years old. Instead of spending money on HA by buying one very expensive server, we work hard on it: "Smart Start-up".

Data

There is nothing very new in here; we just do it right. It needs saying: we do it parsimoniously, correctly and precisely. There are two types of nodes: transient nodes (they do not store data on a long-term basis and do not "own" the data) and data nodes.

Transient nodes might have whole database copies for performance, but ultimately the data is transacted, edited and stored in data nodes.

There are three levels of data security:

Locally at data nodes, data is duplicated across disks using RAID.
At the logical level, databases and other state (e.g. Patroni for PostgreSQL, etcd, distributed filesystems like GlusterFS, etc.) are replicated at least three times (usually more), both for state consensus and to avoid split-brain scenarios. Copies are kept in different datacenters belonging to different companies, to secure availability under physical failure but also under business duress ¹. We use physically close datacenters for performance.
Finally, backup copies. All data is backed up at least daily (sometimes more often), in three different locations, over three different time horizons (daily copy, weekly copy, monthly copy), on two different media: our offices and cloud copies (plus a third copy on the actual servers).
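Why "at least three" copies at the logical level? Consensus stores such as etcd only accept writes from the side of a partition that still sees a strict majority of members, and that rule is what prevents split-brain. Here is a minimal sketch of the arithmetic. The function names are hypothetical and this is an illustration only, not the actual Totus configuration.

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many members can be lost while a majority still exists."""
    return replicas - quorum_size(replicas)


def can_accept_writes(replicas: int, reachable: int) -> bool:
    """A partition may keep writing only if it still sees a majority."""
    return reachable >= quorum_size(replicas)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, "replicas: quorum", quorum_size(n),
              "- survives", tolerated_failures(n), "failure(s)")
```

With two replicas, losing either one stops writes, so two copies give no extra tolerance over one. Three is the smallest cluster that survives a single failure, and an even count adds no tolerance over the odd count just below it.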

Host Based Intrusion Detection System

Each Totus server runs a "root" service specifically dedicated to monitoring many security aspects, including network, processes, system patterns, containers, etc. This root service has been built using enhanced security techniques: among other things, it keeps encrypting itself in memory, hiding itself, moving itself and communicating with root services on remote servers ... All roots keep tabs on the other roots and can raise alarms and act proactively and autonomously. They hold the keys to other services.

We cannot share more than this.

It is likely the only way you will know about this system is by this post, and only this post, even if you have root access to a Totus server.

Monitoring

We do plenty of monitoring: we run multi-node Prometheus and Alertmanager. We also have the "Eye of Totus", a custom service monitoring many Totus APIs, services and networks. We trigger an operator alarm (a real person receiving an alarm that cannot be silenced) not only on failures but also on trends, e.g. at the current rate the disk will run out of space in a few hours.
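To make the "alarm on trends" idea concrete: Prometheus offers `predict_linear` for this kind of rule, and the same idea can be sketched in plain Python with a least-squares fit over recent disk usage samples. The function names, the sample format and the alert horizon below are assumptions made for illustration, not the rules Totus actually runs.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix_time_seconds, used_bytes)


def seconds_until_full(samples: Sequence[Sample],
                       capacity: float) -> Optional[float]:
    """Fit a straight line through the samples and extrapolate.

    Returns the seconds between the last sample and the moment the
    fitted line reaches `capacity`, or None if usage is not growing
    or there is too little data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking usage never fills the disk
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def should_alert(samples: Sequence[Sample], capacity: float,
                 horizon_seconds: float = 4 * 3600) -> bool:
    """Alert when the disk is predicted to fill within the horizon."""
    remaining = seconds_until_full(samples, capacity)
    return remaining is not None and remaining <= horizon_seconds
```

Fitting a line over several samples, rather than comparing the last two, keeps a single noisy reading from triggering an alarm, which matters when the alarm wakes up a real person.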

We work very hard to keep alarms accurate and precise, with no false positives, so we avoid the typical company "alarm fatigue": operators simply ignoring most alarms because they are typically "not important".

Alarms rarely trigger at Totus, and when they do, they are important, and they get looked at.

We have all the standard alarms you would expect, both hardware-related (e.g. RAID, SMART) and software-related (e.g. memory, processes). We also have alarms for all internal core services, like database replication and cluster management services, and for public services, like API endpoints, DNS services, days left on endpoint SSL certificates, SSL termination for all public nodes, etc.
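As an example of one of these checks, the "certificate days left" alarm boils down to comparing a certificate's `notAfter` date with the current time. Here is a minimal sketch using only Python's standard `ssl` module. The function names and the 14-day threshold are assumptions for illustration.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def cert_days_left(not_after: str, now: Optional[float] = None) -> float:
    """Days until expiry, given a notAfter string.

    `not_after` uses the format found in certificate dictionaries,
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / SECONDS_PER_DAY


def needs_renewal(not_after: str, threshold_days: float = 14.0,
                  now: Optional[float] = None) -> bool:
    """True when the certificate expires within the threshold."""
    return cert_days_left(not_after, now) <= threshold_days
```

In practice the `notAfter` string would come from the dictionary returned by `SSLSocket.getpeercert()` after connecting to the endpoint. Passing `now` explicitly keeps the check testable without a network connection.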

Because we take security very seriously at Totus, the Host Based Intrusion Detection System triggers alarms too. We also constantly scan the public and private networks at Totus for open ports or unexpected network connections, and we trigger alarms on those.
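The port scanning part can be pictured as a comparison between what a scan observes and what each host is supposed to expose. The sketch below assumes the scan results are already available as a mapping from host to open ports. The host names, the allowlist and the function name are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping


def unexpected_ports(observed: Mapping[str, Iterable[int]],
                     allowed: Mapping[str, Iterable[int]]) -> Dict[str, List[int]]:
    """Return, per host, the open ports that are not on its allowlist.

    Hosts missing from `allowed` are treated as having an empty
    allowlist, so every open port on an unknown host is reported.
    """
    findings: Dict[str, List[int]] = {}
    for host, ports in observed.items():
        extra = sorted(set(ports) - set(allowed.get(host, ())))
        if extra:
            findings[host] = extra
    return findings


if __name__ == "__main__":
    observed = {"node-a": [22, 443, 8080], "node-b": [443]}
    allowed = {"node-a": [22, 443], "node-b": [443]}
    print(unexpected_ports(observed, allowed))
```

Treating unknown hosts as "nothing allowed" is a deliberate choice here: a machine nobody declared should raise an alarm rather than pass silently.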

Data Center and Hardware providers

We use OVH, we use Hetzner and we use a few others. We are happy, and we have a good relationship with all of them. Because we know what a "compliance process" can do to an account, we take provider unavailability very seriously. ¹

We know that at any point in time a whole account can be closed and Totus will continue to work seamlessly. Well, to be precise, DNS health checks and time-to-live might make Totus look down to some users ... for about 30 seconds. Reload the page or retry the API call, and that would be all of it. That is six nines, or about 30 seconds of downtime a year (again, we do not promise it, but we aim for it).
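For reference, the "nines" arithmetic is simple enough to check by hand: six nines (99.9999% availability) leaves roughly 31.6 seconds of downtime per year, so a single 30 second failover fits inside it, but only just. Here is a small sketch with hypothetical function names, assuming a 365.25-day year.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 seconds


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines.

    Three nines means 99.9%, six nines means 99.9999%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


def fits_budget(outage_seconds: float, outages_per_year: int,
                nines: int) -> bool:
    """True if the given outages stay within the yearly budget."""
    return outage_seconds * outages_per_year <= allowed_downtime_seconds(nines)


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(f"{n} nines: {allowed_downtime_seconds(n):.1f} s/year")
```

The same arithmetic shows why the target is an aim rather than a promise: a second 30 second failover in the same year would already exceed the six-nines budget.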

We have worked very hard to achieve this level of provider independence.

We believe it is something you can rely on for your peace of mind.

Perseus with the Head of Medusa, Benvenuto Cellini (c. 1545-1554)

¹ Companies exercise their right to choose their customers, usually at the least expected moment.