24th September 2020

Effective monitoring is proactive — and predictive


What is behind our 99.95% uptime guarantee? An experienced team — and a proactive, predictive approach to monitoring.

If you want to understand what monitoring means, and why it’s important at a financial technology company like Ixaris, consider aeroplane maintenance. Like an aeroplane, Ixaris’ platform can show signs of fatigue or failure before its performance is impacted. Following this analogy, it is clear why infrastructure monitoring is a critical part of our overall operations, and why (apart from being reactive) it must be proactive and predictive. And while lives are not at stake should our platform experience a failure, we hold ourselves, like aeroplane engineers, to extremely high performance and reliability standards.

Ixaris confidently offers a 99.95% uptime guarantee. This is, in part, because of our approach to infrastructure monitoring, which has been rigorously honed over 16 years of operations. Our Technology Director, Andrew Calafato, explains.

The evolution of monitoring at Ixaris

Ixaris’ approach to monitoring has evolved into a system that is centralised, manageable, in-depth, proactive, reactive and predictive — and is at the core of our exemplary uptime and performance records.

As with most technology providers, our systems and services are bound by different levels of Service Level Agreements (SLAs), which give our customers peace of mind that Ixaris Payments will run smoothly and reliably 24/7/365. In general, SLAs require infrastructure investment proportional to the quality of service offered, and infrastructure means many things: data centres, hardware (servers and networking), virtualisation, middleware, databases, applications, various tools (authentication, configuration, orchestration, logging) and monitoring.

In Ixaris’ early days, as our product strategy evolved, we made a decision to create an entirely new platform that would be scalable and stable enough to meet tight performance and uptime SLAs. We implemented every part of the platform’s setup in a highly available configuration, with the capability to switch all services to run from the other side of the continent in under 10 minutes. Monitoring, which means continuously keeping eyes on every part of the infrastructure, was a crucial part of this setup and, like all aspects of Ixaris’ infrastructure, it needed to become more fit for purpose, centralised, manageable and in-depth. Our team also needed to drive a shift from a retroactive monitoring system to a proactive, reactive and predictive one.

So, how does Ixaris approach monitoring today?

Centralised and manageable

During beta testing, the hardware, networking, virtualisation, application, and all other parts of our hosted services had their own basic, reactive monitoring. However, having a siloed 24/7 team in every area, each maintaining its own tools, was not scalable. A core support team needed to have a bird’s eye view of the whole technical landscape, and the ability to escalate to the right people. With this in mind, Ixaris consolidated its monitoring tools in a few key services:

  • A new core monitoring and alerting tool that monitors all layers of our infrastructure, sending different types of notifications depending on the criticality
  • A new external monitoring tool that monitors the external-facing applications from around the world, mimicking customer experience
  • A new incident management tool that immediately notifies the right agents of critical alerts and supports the management of the resulting critical incidents
  • The use of an existing graphing tool to extend the dashboards offered by the core monitoring tool
  • The use of an existing infrastructure automation and delivery tool to configure the monitoring agents where possible (e.g. filesystem capacity and CPU usage for each operating system, database-related monitoring for all databases, etc.)


In-depth

Today, we monitor each layer and feed the results into a single system, from the data centre or hardware level up to business logic (e.g. a customer stopped transacting, or an SLA not being met), in line with the defence-in-depth concept. This centralised set of information lets us correlate alerts and determine root causes faster. For example, external service providers are monitored from all sites and at different layers (VPN, connectivity to a remote firewall, connectivity to remote applications). Depending on the alerts that trigger, an agent can easily differentiate between:

  • a local network problem – remote firewall and application alerts from one of our sites
  • a remote network problem – remote firewall and application alerts from all our sites
  • a remote application problem – remote application alert from all our sites.
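The three cases above can be sketched as a small correlation rule. This is only an illustration, not Ixaris’ actual tooling: the `Alert` type, the `classify` function and the site names are hypothetical, and it assumes each alert records the site it fired from and the layer it checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    site: str   # hypothetical name of the site the check ran from
    layer: str  # "firewall" (remote firewall) or "application" (remote application)


def classify(alerts, all_sites):
    """Map the set of firing alerts to the most likely fault location."""
    all_sites = set(all_sites)
    firewall_sites = {a.site for a in alerts if a.layer == "firewall"}
    app_sites = {a.site for a in alerts if a.layer == "application"}

    if not firewall_sites and not app_sites:
        return "no problem detected"
    # Firewall and application alerts from every site: the remote network.
    if firewall_sites == all_sites and app_sites == all_sites:
        return "remote network problem"
    # Only application alerts, but from every site: the remote application.
    if not firewall_sites and app_sites == all_sites:
        return "remote application problem"
    # Firewall and application alerts from some sites only: a local network.
    if firewall_sites and firewall_sites == app_sites and firewall_sites < all_sites:
        return "local network problem"
    return "unclassified"


sites = ["site-a", "site-b"]  # hypothetical site names
print(classify([Alert("site-a", "firewall"), Alert("site-a", "application")], sites))
```

Any combination that does not match one of the three patterns is returned as "unclassified" rather than guessed, leaving the decision to the agent.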

Apart from directly monitoring databases, filesystems, and all external services used by the applications, we also monitor connectivity to these services through the applications themselves. This type of monitoring is internally known as “deep-ping.” If a “deep-ping” fails only from specific application nodes, the issue quite likely resides at those local application nodes. If “deep-ping” failures come from all nodes and the direct monitoring alerts also trigger, the problem is more likely at the particular remote service.
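A minimal sketch of this “deep-ping” reasoning follows. The function name `locate_fault`, the node names and the input shape are assumptions made for illustration; the source does not describe how the results are actually represented.

```python
def locate_fault(deep_ping_failed, direct_check_failed):
    """Suggest where a fault most likely sits.

    deep_ping_failed: dict of node name -> True if the deep-ping from that
                      application node failed (hypothetical input shape).
    direct_check_failed: True if the direct monitor of the service alerted.
    """
    failed = sorted(node for node, bad in deep_ping_failed.items() if bad)

    if not failed:
        return "direct check only" if direct_check_failed else "healthy"
    # Every node fails and the direct monitor agrees: suspect the service.
    if len(failed) == len(deep_ping_failed) and direct_check_failed:
        return "remote service"
    # Only some nodes fail: suspect those nodes rather than the service.
    if len(failed) < len(deep_ping_failed):
        return "local application node(s): " + ", ".join(failed)
    # Every node fails but the direct monitor is quiet: needs a human look.
    return "inconclusive"


print(locate_fault({"app-1": True, "app-2": False}, direct_check_failed=False))
```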

Proactive

Proactive monitoring is about preventing problems. At Ixaris, alerts are configured at different levels with different criticalities. For example, an alert could signal a warning at 70% disk full and a critical alert at 90% disk full. We also monitor each element of a cluster individually. Combined, these give us the ability to react before the overall service is impacted, and the initial resource usage warnings feed into our capacity planning. This is highly effective since all components in our infrastructure are clustered, from storage and power supplies up to databases, filesystems, and applications.
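As an illustration of tiered thresholds applied to each cluster member individually, here is a short sketch using the 70% and 90% disk figures from the example above. The function names and member names are hypothetical, and real monitoring tools usually express such rules as configuration rather than code.

```python
WARNING_PCT = 70.0   # warning level from the disk example above
CRITICAL_PCT = 90.0  # critical level from the disk example above


def severity(disk_used_pct):
    """Return the alert level for one disk usage reading (in percent)."""
    if disk_used_pct >= CRITICAL_PCT:
        return "critical"
    if disk_used_pct >= WARNING_PCT:
        return "warning"
    return "ok"


def cluster_report(usage_by_member):
    """Evaluate every cluster member on its own, not as an average."""
    return {member: severity(pct) for member, pct in usage_by_member.items()}


# Hypothetical two-node cluster: one member warns while the service is still up.
print(cluster_report({"db-1": 45.0, "db-2": 72.5}))
```

Evaluating members separately means a single node crossing the warning level is visible even while the cluster as a whole is still serving traffic normally.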

Reactive

Apart from monitoring all layers, in-depth monitoring means combining proactive and reactive monitoring to identify problems as fast as possible and to give engineers as much information as possible, so they can restore service quickly. Reactive monitoring is the last line of defence. It detects issues that have already occurred, so its checks are typically polled frequently and aggressively.

Predictive

Our monitoring system issues alerts when a particular threshold is exceeded. But sometimes a threshold can be fluid, and to be effective the trigger must be based on trends. Whenever possible, we use existing BI data with logic to predict problems, and sometimes we query operational data. In the latter case, we ensure that the database queries used for monitoring are lightweight and run on a routine schedule, so they do not place a heavy load on the database.
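One simple way to base a trigger on a trend rather than a fixed level is to fit a straight line to recent readings and estimate when it will cross a limit. The sketch below uses an ordinary least-squares fit as a stand-in technique; the source does not say which prediction logic Ixaris uses, and the function name and sample data are hypothetical.

```python
def hours_until_threshold(samples, threshold):
    """Estimate hours from the last sample until a linear trend hits threshold.

    samples: list of (hour, value) pairs in time order.
    Returns None when there is too little data or no upward trend.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope and intercept of value against time.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing_time = (threshold - intercept) / slope
    # Clamp at zero if the fitted line has already passed the threshold.
    return max(0.0, crossing_time - samples[-1][0])


# Hypothetical hourly disk usage readings, in percent.
readings = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
print(hours_until_threshold(readings, threshold=90.0))
```

An alert could then fire when the estimate drops below a chosen lead time, which gives earlier notice than waiting for the level itself to be reached.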

Monitoring: the bigger picture

Monitoring and alerting evolve continuously at Ixaris. As the environment changes and new services and business logic are released, alerts are tuned and augmented. Operations, including monitoring, have become an intrinsic part of Ixaris’ new features and other development work. With our toolset, we can create monitors, alerts and dashboards effortlessly and within minutes. Monitoring is so important that, like all aspects of the infrastructure, the monitoring tools themselves are highly available.

But what does this structured, rigorous approach to monitoring mean for our customers? During the first 8 months of 2020, only a single critical incident occurred, in which our system did not transition smoothly from communicating with one ‘stuck’ service to communicating with another in the cluster. Despite this incident, we still met our monthly Service Uptime SLA. This is because our reactive monitors quickly alerted our technical support team, who were able to restore the service and then continue with root cause analysis and incident management. In other words, while critical incidents remain exceedingly rare, our in-depth consolidated monitoring allows us to fulfil our responsibility to protect our customers from disruptions, to confidently meet uptime and performance SLAs month on month, and to plan and invest accordingly. As it should be.
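To put a monthly uptime figure into perspective, the arithmetic below converts an uptime percentage into a downtime budget. It assumes the guarantee is measured over a calendar month of a given length; the actual SLA terms are not stated here, and the function names are hypothetical.

```python
def downtime_budget_minutes(uptime_pct, days_in_month):
    """Minutes of downtime allowed in a month at the given uptime percentage."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - uptime_pct) / 100.0


def meets_sla(downtime_minutes, uptime_pct, days_in_month):
    """True if the observed downtime fits within the monthly budget."""
    return downtime_minutes <= downtime_budget_minutes(uptime_pct, days_in_month)


# A 30-day month at 99.95% uptime allows roughly 21.6 minutes of downtime.
print(round(downtime_budget_minutes(99.95, 30), 1))
```

Under these assumptions, a single incident can be absorbed within the monthly target only if service is restored within minutes, which is why fast detection by reactive monitors matters.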

Andrew Calafato, Technology Director
