Whether they come from simple hardware failures, cyberattacks, natural disasters or even human error, threats to business continuity are a reality for all businesses. A sound business continuity plan does not ignore such threats; it faces them head-on, just like our Disaster Recovery plan. Our Senior Systems Operations Engineer, Ryan Calleja, explains.
While Ixaris’ systems are robust and resilient, disaster recovery planning helps us prepare for unpredictable “corner cases” and unforeseen chains of events. Such disruptions to our services can have cost, revenue and reputational impacts for our customers. The longer the disruption, the greater the impact. In a disaster, every second counts, in both senses of the word.
At Ixaris, we take disaster recovery very seriously, drilling to effectively and efficiently resume our operations as fast as possible in the event of an unexpected interruption — or even a whole series of interruptions. Here’s how we do it.
Under the (hardware) hood
As with every other technology company, our product needs to run on some hardware. This hardware needs to be powered on, cooled and connected to the Internet at all times. In our case, Ixaris owns its hardware but not the premises where it is located (a data centre). For this, we chose the best hosting providers in the UK and Malta.
In simple terms, a disaster recovery plan allows Ixaris to “switch” communications (to/from our clients and partners) from one site to another. The active site becomes passive and vice versa. In short, there are two scenarios: 1) the UK is the active site and we want to switch all operations to Malta, or 2) the other way round.
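The active/passive relationship described above can be pictured with a minimal Python sketch. The `SiteRoles` class and its `switch` method are hypothetical names used only for illustration; they are not part of Ixaris' tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteRoles:
    """Illustrative model only: which site is active and which is passive."""
    active: str
    passive: str

    def switch(self) -> "SiteRoles":
        # The active site becomes passive and vice versa.
        return SiteRoles(active=self.passive, passive=self.active)


# Scenario 1: the UK is active and we switch all operations to Malta.
roles = SiteRoles(active="UK", passive="Malta")
failed_over = roles.switch()

# Scenario 2 is the same operation in the other direction (the failback).
failed_back = failed_over.switch()
```

Because the switch is symmetric, applying it twice returns the original arrangement, which is the property that lets a failback work the same way as a failover.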
At our data centres, everything is replicated, and both sites are interconnected over a secured Internet link (VPN) to keep all our latest data constantly synchronized. At both of our data centres:
- All equipment is powered by two independent electricity feeds and backed up by different UPS systems and generators
- Internet connectivity is provided by two separate ISPs connected via different network routes and internet carriers
- Each piece of hardware is independently replicated, whether it is used to run our software, store our data or provide network connectivity
In other words, you can picture both sites as an exact replica of each other, but with only one active at any given time. That means, if a disaster were to occur at one site, we can “flip the switch” to its sister site, and vice versa.
It is critically important that this “switching” operation is as seamless as possible for our clients, with zero loss of data. But it also needs to be simple enough that a failback (going back to the original site once the disaster situation is resolved) is as easy as the failover (switching to the backup site). While there are several ways to achieve this, Ixaris chose the open-source tool Ansible to automate the procedure.
Flipping the switch
Ansible is an automation tool that is easy to understand and gives the administrator the flexibility to deal with all the infrastructure layers that make up our product. The whole procedure is consolidated in one central file, known as the Ansible playbook. This is always stored in a centralized code repository (git) so that any changes we make over time are always tracked, documented and versioned. Git also gives us the flexibility to choose and compare different versions of the procedure, as we sometimes need to update it to reflect infrastructural changes.
Ansible works by using modules, each of which is responsible for interacting with a specific layer of our infrastructure (for example, a storage array, a network switch, an operating system or even a specific piece of software). The administrator is only required to populate the module parameters and Ansible does all the magic!
In our case, we ask Ansible to:
- Change our public DNS records to reflect the IP addresses of the site we are switching to.
- Reverse the way data is being replicated (UK-to-Malta becomes Malta-to-UK).
- Switch off all the software components at the site where there is the disaster (for example, the UK site). This step can be skipped if the site is unreachable.
- Switch on all the software components at the failover site (for example, the Malta site).
The exact same procedure is applied for failback. We simply change some parameters determining which site needs to become active or passive. This can be executed from a specific machine on both sites and is templated into a job that only specific, audited persons are able to run. We use another open-source tool for this, called Rundeck.
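To make the idea of one parameterised procedure concrete, here is a minimal Python sketch of how the four steps listed above could be derived from a single "target site" parameter. It is not our Ansible playbook: the function name `plan_switch`, the `source_reachable` flag and the step wording are assumptions made for illustration, and the steps simply follow the order in which they are listed here.

```python
from typing import List

SITES = ("UK", "Malta")


def plan_switch(target: str, source_reachable: bool = True) -> List[str]:
    """Return the ordered steps for making `target` the active site.

    Hypothetical helper: it only describes the steps, it does not run them.
    """
    if target not in SITES:
        raise ValueError(f"unknown site: {target!r}")
    # The site we are moving away from is simply the other one.
    source = SITES[1] if target == SITES[0] else SITES[0]

    steps = [
        f"point public DNS records at {target}",
        f"reverse replication: {source}-to-{target} becomes {target}-to-{source}",
    ]
    # Stopping components can be skipped when the failed site is unreachable.
    if source_reachable:
        steps.append(f"stop software components at {source}")
    steps.append(f"start software components at {target}")
    return steps


failover = plan_switch("Malta")                       # UK -> Malta
failover_site_down = plan_switch("Malta", source_reachable=False)
failback = plan_switch("UK")                          # Malta -> UK
```

The failback plan comes from the same function with a different argument, mirroring how the same procedure is reused with only the parameters changed.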
While this “switching” procedure involves some downtime, we have fine-tuned it to just under 4 minutes. Luckily, to date, we have never needed to actually implement our disaster plan (touch wood!). However, we still perform an annual disaster recovery drill to simulate disaster situations and responses. Following this, we keep our operations running at our secondary site for more than a week to test the process further and more rigorously.
How else do we prepare for the unexpected?
We have designed Ixaris’ software to be performant, scalable, secure and more reliable than most. It is flexible enough to run on multiple machines at the same time, forming part of a highly available cluster. We can also quickly provision and expand our software setup through specialized configuration tools. We constantly back up all the important data both locally and at an offsite location, giving us the ability to restore it if needed. And we’re protected against cyberattacks via third-party CDN providers.
Apart from our technology, knowledge and processes within our teams are also "replicated." In other words, there are no knowledge gaps at any level of the company. Support Engineers have clearly defined processes they follow in case of incidents, and they can make use of various tools to help the business recover quickly should our intelligent monitoring system detect any issues.
Prevention and preparedness pay off
Our team plans for business continuity by building infrastructure that is extremely robust, to minimise the chances of ever needing to carry out a disaster recovery. At the same time, we plan and drill for just such disasters.
In my opinion, our ability to restore operations with a zero-cost, simple, maintainable solution without side effects (apart from a very brief service interruption) is quite an achievement — and one that perfectly meets our business continuity goals.
Ryan Calleja is a Senior Systems Operations Engineer at Ixaris.