Building Highly Available Systems

Niraj Mehta
Aug 17, 2020

What is a Highly Available System?

High Availability (HA) is a characteristic of a system that ensures it adheres to an agreed level of service or performance, usually defined in a Service Level Agreement (SLA). Highly available systems are designed with redundancy built in: if one instance fails or one data center goes out of commission, the system should neither degrade in performance nor miss its SLA. Highly available systems run in an Active-Active configuration. Before we go further with the design of the system, let's understand some of the concepts.

Active-Passive Systems

Active-Passive systems are the traditional systems that have a dedicated Production and Disaster Recovery (DR) configuration. They are deployed across 2 or more data centers, with one acting as Production and the other as DR. In the event of a failure in Production, the system is failed over to DR, either automatically or through a manual process. Either way, some recovery time is required before the system is back to full capability. However small that recovery time is, it causes a loss of performance or data, or a missed SLA (depending on the Recovery Time Objective defined in the SLA).
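To make the cost of recovery time concrete, here is a minimal sketch of the arithmetic, assuming the SLA is expressed as an availability percentage over a 30-day period. The function names and the sample percentages are hypothetical and chosen only for illustration; a real SLA may define its targets differently.

```python
def downtime_budget_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period for a given availability SLA."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


def breaches_sla(failover_minutes: float, failovers: int, sla_percent: float) -> bool:
    """True if the total failover downtime exceeds the 30-day downtime budget."""
    return failover_minutes * failovers > downtime_budget_minutes(sla_percent)


if __name__ == "__main__":
    # Hypothetical SLA levels, for illustration only.
    for sla in (99.0, 99.9, 99.99):
        print(f"{sla}% -> {downtime_budget_minutes(sla):.2f} min per 30 days")
    # A single 15-minute failover fits a 99.9% budget but not a 99.99% one.
    print(breaches_sla(15, 1, 99.9), breaches_sla(15, 1, 99.99))
```

Under these assumptions a 99.9% target leaves roughly 43 minutes of downtime per 30 days, so even one slow failover can consume most of the budget.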

Typical example of an Active-Passive System Design

In the above architecture, a virtual IP (VIP) with a primary and a secondary IP address is normally used for failover. If the application, server, or database goes down in one of the data centers, the networking team fails the VIP over to the other site, and users start accessing the application in that data center.
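The VIP failover described above can be sketched as a toy model. This is not how any particular network device implements it; the Site and VirtualIP classes, the site names, and the IP addresses are all hypothetical, and the health flag stands in for whatever check the networking team actually relies on.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    ip: str
    healthy: bool = True


class VirtualIP:
    """A VIP that points at exactly one site at a time (active-passive)."""

    def __init__(self, primary: Site, secondary: Site) -> None:
        self.primary = primary
        self.secondary = secondary
        self.active = primary  # all traffic starts on Production

    def check_and_failover(self) -> bool:
        """Move the VIP to the standby site if the active site is unhealthy.

        Returns True when a failover happened.
        """
        if self.active.healthy:
            return False
        standby = self.secondary if self.active is self.primary else self.primary
        if not standby.healthy:
            raise RuntimeError("both sites are down; nothing to fail over to")
        self.active = standby
        return True

    def route(self) -> str:
        """Return the IP that user traffic currently reaches."""
        return self.active.ip


if __name__ == "__main__":
    prod = Site("production", "10.0.1.10")  # hypothetical addresses
    dr = Site("dr", "10.0.2.10")
    vip = VirtualIP(prod, dr)
    prod.healthy = False  # simulate a Production outage
    vip.check_and_failover()
    print(vip.route())
```

Until check_and_failover runs, every request still lands on the failed site, which is the recovery window the Active-Passive design cannot avoid.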

Active-Active Systems

Active-Active systems are what we term Highly Available systems: they can direct traffic and functionality to the part of the system that is up and running, without losing performance or data. There is no concept of a DR data center in an Active-Active configuration. The system is deployed across 2 or more data centers, just like its Active-Passive counterpart, but all of those data centers take in traffic.
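As a rough illustration of the difference, the sketch below keeps every data center in rotation and simply skips the ones marked unhealthy. The ActiveActiveRouter class and the data center names are hypothetical, and the health flags are assumed to be set by some external monitoring process.

```python
from itertools import count
from typing import Dict, Iterable


class ActiveActiveRouter:
    """Spread requests over every healthy data center; there is no standby."""

    def __init__(self, data_centers: Iterable[str]) -> None:
        self.health: Dict[str, bool] = {dc: True for dc in data_centers}
        self._requests = count()  # running request counter

    def mark(self, data_center: str, healthy: bool) -> None:
        """Record the latest health status reported for a data center."""
        self.health[data_center] = healthy

    def route(self) -> str:
        """Pick the next healthy data center in rotation."""
        live = [dc for dc, ok in self.health.items() if ok]
        if not live:
            raise RuntimeError("no healthy data center available")
        return live[next(self._requests) % len(live)]


if __name__ == "__main__":
    router = ActiveActiveRouter(["dc-east", "dc-west"])  # hypothetical names
    print([router.route() for _ in range(4)])  # both sites take traffic
    router.mark("dc-east", False)               # one site fails
    print([router.route() for _ in range(4)])  # traffic continues on dc-west
```

There is no failover step here: the surviving data center was already serving traffic, so it only has to absorb the extra load.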

Designing an HA System

Let's build a system that hosts a software application with a web-based user interface taking in traffic, an API layer performing some business functions, and a database that stores the data. For this introductory article we will assume the web and API layers are bundled into one big archive. This is a very simplistic depiction of real systems, but it helps in understanding how HA systems are designed to take in continuous incoming traffic.

Components

  • Software- The application that performs some business function. It has a web-based interface that users interact with and an API layer that applies business logic before storing data in the database.
  • Database- The datastore that holds the data. This data is created, modified, or consumed by the user via the application.
  • Servers- The hardware where the application and database are hosted. The servers can run any operating system: Windows, Linux, or Solaris. They are housed in different data centers and can be either physical servers or virtual servers for multi-tenancy.
  • Load Balancers- The load balancers play a very important role in building an HA system. They direct traffic to the different data centers based on an algorithm such as round robin, weighted average load, or other custom logic.
  • Monitoring- One of the most critical parts of building an HA system is ensuring that sufficient monitoring and alerting are in place. Monitoring and alerting provide feedback when any component fails and needs attention. Even though the system is designed so that other parts can start taking in traffic, responding to issues on time ensures that the complete ecosystem does not collapse.
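The load-balancing algorithms mentioned in the list can be sketched in a few lines. The first function is plain round robin; the second is a simple weighted round robin, used here as a stand-in for weight-based policies in general rather than as the exact algorithm any particular load balancer ships. The target names and weights are hypothetical.

```python
from itertools import cycle, islice
from typing import Dict, Iterator, List


def round_robin(targets: List[str]) -> Iterator[str]:
    """Yield targets in order, forever, one per incoming request."""
    if not targets:
        raise ValueError("at least one target is required")
    return cycle(targets)


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Yield each target in proportion to its integer weight."""
    # Expand {"a": 3, "b": 1} into the repeating schedule a, a, a, b.
    schedule = [target for target, weight in weights.items() for _ in range(weight)]
    if not schedule:
        raise ValueError("at least one target needs a positive weight")
    return cycle(schedule)


if __name__ == "__main__":
    rr = round_robin(["dc-east", "dc-west"])
    print(list(islice(rr, 4)))
    # Assumed weights: dc-east has three times the capacity of dc-west.
    wrr = weighted_round_robin({"dc-east": 3, "dc-west": 1})
    print(list(islice(wrr, 8)))
```

Round robin suits data centers of equal capacity; weights become useful when one site can absorb more traffic than the other.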

Example 1-

In this example we use a load balancer with a round robin algorithm. The load balancer routes traffic to both data centers. However, both application instances connect to the same database, and the other database is kept in sync by replication. This approach therefore still has the problem of switchover and downtime if the database in use fails.

Example 2-

In this example we use real-time replication of the data between the two database nodes. Oracle provides this with GoldenGate, and Microsoft SQL Server has Always On Availability Groups. The concept is simple: a load balancer of sorts sits in front of the two nodes, and the applications can read from either node. Writes, however, are performed only on the primary node. The other node is kept in sync by replicating the data in real time, before the commit completes. If the primary node fails, the other node assumes the primary role and starts serving writes. We are still left with the problem of the load balancers being in one data center. Hardware load balancers such as F5 support active-active or active-passive configurations, which provides redundancy for the load balancers themselves and helps in achieving true HA.
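The read-anywhere, write-on-primary pattern can be sketched with an in-memory toy. This does not reflect how GoldenGate or Availability Groups work internally; the Node and ReplicatedCluster classes are hypothetical, and resynchronizing a failed node after it comes back is left out.

```python
from typing import Dict, Optional


class Node:
    """An in-memory stand-in for one database node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}
        self.up = True


class ReplicatedCluster:
    """Writes go to the primary; reads may be served by any node that is up."""

    def __init__(self, primary: Node, replica: Node) -> None:
        self.primary = primary
        self.replica = replica
        self._reads = 0

    def failover(self) -> None:
        """Promote the replica when the primary is down."""
        if not self.replica.up:
            raise RuntimeError("no node available to promote")
        self.primary, self.replica = self.replica, self.primary

    def write(self, key: str, value: str) -> None:
        """Replicate to the other node first, then commit on the primary."""
        if not self.primary.up:
            self.failover()
        if self.replica.up:
            self.replica.data[key] = value  # replicated before the commit
        self.primary.data[key] = value      # commit on the primary

    def read(self, key: str) -> Optional[str]:
        """Alternate reads across the nodes that are currently up."""
        nodes = [n for n in (self.primary, self.replica) if n.up]
        if not nodes:
            raise RuntimeError("no node available for reads")
        node = nodes[self._reads % len(nodes)]
        self._reads += 1
        return node.data.get(key)


if __name__ == "__main__":
    a, b = Node("node-a"), Node("node-b")  # hypothetical node names
    cluster = ReplicatedCluster(a, b)
    cluster.write("order-1", "paid")
    a.up = False                           # the primary fails
    cluster.write("order-2", "new")        # node-b is promoted and takes the write
    print(cluster.primary.name, cluster.read("order-1"), cluster.read("order-2"))
```

Because the replica received each write before the commit, promoting it loses no acknowledged data in this model.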
