Today’s IT systems are expected to run continuously, and there is nothing strange about that. IT now reaches virtually every facet of our everyday lives. Systems that control and manage manufacturing plants, power lines, or crisis reporting can simply never be offline. Our lives and well-being depend on some of those systems. Others are required for handling everyday bank or government office services. Even such simple tasks as going shopping, checking mail, issuing an invoice, or making a phone call are all at the mercy of IT systems.
All the examples mentioned above have one thing in common: for such systems, High Availability is an absolute requirement. They must work round the clock, even when failures or other unexpected circumstances occur.
System availability is the ability of a system to properly fulfill its assigned tasks over a specified time period (a month, a year, or any other).
Systems with the highest availability are available 99.999 percent of the time. The following table shows “the number of nines” and the corresponding monthly unavailability period:
Availability Unavailability time for a monthly period
Availability parameters are defined by the system’s SLA parameters (a detailed article on SLAs is available in the “What is an SLA, and what is it composed of” section).
The table presented above clearly shows that for e-commerce systems availability is not only of utmost importance, but also measurable. An availability of 99 percent may appear quite high, yet it means that our store could be down for roughly 7 hours every month. Unfortunately, the availability figure does not tell us whether those 7 hours of downtime would occur at the peak of a marketing campaign or, for example, just before Christmas. In such cases the losses incurred due to downtime could be severe.
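The relationship between an availability percentage and monthly downtime is simple arithmetic, and a minimal sketch of it follows. It assumes a 30-day month (720 hours); a table built on a different month length would differ slightly. For 99 percent it yields about 7.2 hours, which matches the figure quoted above.

```python
# Monthly downtime implied by an availability percentage.
# Assumption: a 30-day month (720 hours).
HOURS_PER_MONTH = 30 * 24


def monthly_downtime_minutes(availability_percent: float) -> float:
    """Return how many minutes per month the system may be unavailable."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_MONTH * 60


if __name__ == "__main__":
    for availability in (99.0, 99.9, 99.99, 99.999):
        minutes = monthly_downtime_minutes(availability)
        print(f"{availability}% available -> {minutes:.2f} minutes of downtime per month")
```

Each additional nine divides the permitted downtime by ten, which is why every further step tends to be noticeably harder to achieve than the previous one.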
The percentage value for availability is based on multiple determinants:
- How reliable is the utilized equipment (how prone to hardware failures).
- Whether our system has been suitably tested, or whether it is prone to failures that disrupt operation.
- What the reaction time is, and how skilled the team responsible for system maintenance (administrators, developers) is, that is, how quickly a potential problem would be noticed and eliminated.
- Whether our system utilizes High Availability features as a means of protection against failures (more on such features follows).
Developing High Availability systems is cost efficient, as long as the cost of their deployment does not exceed the cost of potential losses incurred due to failures.
As best practice, High Availability solutions should be considered for the following IT environment components:
- Networking hardware: routers, firewalls, switches, network connections (including Internet connections)
- Database servers
- Application servers (hosting applications, CRM, ERP, finance and accounting systems etc.)
- Terminal services
- Mail and workgroup servers
- Web servers
- File servers.
High Availability may be provided on a hardware level by implementing redundancy in every single infrastructure layer. This protects the system against a very hazardous phenomenon called the Single Point of Failure (SPoF).
A Single Point of Failure is any environment component whose failure may cause loss of access to data or applications. Single Points of Failure may occur in hardware and software components alike, as well as in external dependencies, such as third-party energy supply.
To start minimizing Single Points of Failure, one should begin by carefully selecting a hosting service for the project at hand. Does the server room have redundant power supplies, cooling systems, and links to multiple ISPs? Most hosting providers offer such solutions, but a customer should verify that they are actually in place. Continuing the analysis of the Data Center capabilities, one should check whether the network infrastructure used by the servers hosting the project (routers, switches, adapters) is redundant and capable of switching over from one connection point to another in case of failure.
Proper server infrastructure selection protects us from factors that are usually entirely beyond our control. However, hosting the application at our own Data Center might not be the best idea in terms of availability-related costs.
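The effect of redundancy on a Single Point of Failure can be illustrated with a short calculation. The sketch below assumes that components fail independently of one another, which real systems only approximate, and the availability figures used are hypothetical. Components that must all work are combined by multiplication, while a group of redundant copies is unavailable only when every copy is down.

```python
from math import prod


def series_availability(components):
    """Availability of a chain in which every component must work."""
    return prod(components)


def redundant_availability(single, copies):
    """Availability of a group in which at least one of `copies`
    identical, independent components must work."""
    return 1 - (1 - single) ** copies


if __name__ == "__main__":
    # Hypothetical per-component availabilities.
    web, database, network = 0.99, 0.99, 0.999

    single_chain = series_availability([web, database, network])
    redundant_chain = series_availability([
        redundant_availability(web, 2),
        redundant_availability(database, 2),
        network,  # still a Single Point of Failure in this example
    ])
    print(f"no redundancy:        {single_chain:.5f}")
    print(f"web and db redundant: {redundant_chain:.5f}")
```

In this example the network link is left without a spare, so it caps the availability of the whole chain no matter how many Web or database servers are added.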
Apart from hardware, the application should utilize redundancy as well to provide availability. The application should be made to run on multiple servers. An optimized solution would also detect unavailability of the database server, network services, and other connected systems. It may either implement failover by itself, or rely on network-level failover.
In order to provide High Availability, the application’s IT environment should be properly planned at the stage of technical analysis, when collecting hosting requirements. Questions to ask oneself at that time should include: will we be able to react to a database failure (e.g. by switching the application over to a replicated database) or a Web server failure (e.g. by removing the server from a load balancer, or by reassigning its IP-Failover address)? In practice, those are the most common failure locations for Web applications.
The second question to ask oneself is: how soon will we be able to react, and how quickly will we be able to eliminate the issue? What would happen if our administrator is away on holiday or calls in sick: can anyone serve as a replacement? Are night shifts assigned, and is the night shift personnel trained to react to such failures? Are backups tested?
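Switching the application over to a replicated database can be sketched as follows. This is an illustration only: `primary_db` and `replica_db` are hypothetical stand-ins for real database clients, and the sketch assumes that an unreachable server shows up as a `ConnectionError`.

```python
from typing import Callable, Sequence


class AllBackendsDown(Exception):
    """Raised when no database backend could serve the query."""


def run_with_failover(backends: Sequence[Callable[[str], str]], query: str) -> str:
    """Try each backend in order and return the first successful result."""
    errors = []
    for backend in backends:
        try:
            return backend(query)
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure and try the next backend
    raise AllBackendsDown(f"{len(errors)} backend(s) failed")


# Hypothetical stand-ins for real database clients.
def primary_db(query: str) -> str:
    raise ConnectionError("primary database is unreachable")


def replica_db(query: str) -> str:
    return f"replica: {query}"


if __name__ == "__main__":
    print(run_with_failover([primary_db, replica_db], "SELECT 1"))
```

A sketch like this suits read queries. Redirecting writes usually also requires promoting the replica first, which is typically handled by the database or cluster tooling rather than by application code.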
Apart from working in shifts, employees on administrator duty must also be well trained and have access to current system documentation. Such documentation must be kept up to date and include developer descriptions of ways to handle anomalous situations and recurring issues. Everyone involved should have unrestricted access to it. This is of utmost importance, especially when software maintenance is provided by a different team than the one responsible for the original development. Such team separation also brings additional benefits. It provides extra motivation for quick and efficient sharing of information and knowledge (in writing), which enables the development of a knowledge base. A knowledge base is in turn invaluable in case of personnel changes, or if system development is outsourced.
In order to attain High Availability, one more factor should be considered: quality assurance in the software development process. It is no surprise that software bugs appear during development. System unavailability time is in turn directly affected by how quickly the team is able to find the most recent working configuration and perform a rollback. If a suitable version control system is used (e.g. Git or SVN), updates may be delivered using a transactional model, which limits the repercussions of errors introduced during development.
The final significant factor has to do with external system contact points. Contact points between internal and external systems are very prone to various failures, for two primary reasons. One: “that which is known meets that which is unknown”, so not every situation can be handled properly, due to insufficient communication and documentation. Two: we have no influence on the availability and reliability of foreign systems. There are solutions that help with many integration-related availability problems. One of them is implementing simple message queues (in order to avoid losing messages in case of failure). Communication should usually be asynchronous, so that connection problems do not shut the user out, for example, at the last stage of an order. Consequences would be dire!
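One possible way to obtain all-or-nothing updates with a quick rollback is to keep each release in its own directory and switch a `current` symbolic link between them. The sketch below is an assumption made for illustration, not a method prescribed by the text: the directory layout and the `activate` helper are hypothetical, the release directories could be checkouts of tagged versions from the version control system, and the atomic switch relies on POSIX rename semantics.

```python
import os
import tempfile


def activate(releases_dir: str, version: str, link_name: str = "current") -> None:
    """Atomically point the `current` symlink at the given release directory."""
    target = os.path.join(releases_dir, version)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    link_path = os.path.join(releases_dir, link_name)
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clean up a leftover from an interrupted switch
    os.symlink(target, tmp_link)
    # Renaming over the old link is atomic on POSIX: readers see either
    # the old release or the new one, never a half-updated state.
    os.replace(tmp_link, link_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for version in ("v1", "v2"):
            os.mkdir(os.path.join(root, version))
        activate(root, "v1")  # initial deployment
        activate(root, "v2")  # update
        activate(root, "v1")  # rollback to the last known working release
        print(os.path.basename(os.readlink(os.path.join(root, "current"))))
```

With this layout a rollback is the same operation as a deployment, so its duration does not depend on how large the faulty update was.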
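The message queue idea can be sketched as a small outbox that holds messages for an external system until delivery succeeds. This is a simplified illustration: the `Outbox` class and the `send` callable are hypothetical, failures are assumed to surface as `ConnectionError`, and the queue lives in memory, whereas a production queue would normally be persistent.

```python
from collections import deque


class Outbox:
    """Holds messages for an external system until delivery succeeds."""

    def __init__(self, send):
        self._send = send          # callable that delivers one message
        self._pending = deque()

    def enqueue(self, message):
        """Accept a message immediately, without contacting the external system."""
        self._pending.append(message)

    def flush(self):
        """Deliver pending messages in order; stop at the first failure.

        Returns the number of messages delivered in this attempt.
        """
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # keep the message and retry on the next flush
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)
```

The user-facing step, such as placing an order, only calls `enqueue` and therefore completes even while the external system is down. A background job can then call `flush` periodically until the queue is empty.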
The following chapters discuss basic high availability techniques for primary IT system components.
In Practice – Is My System Available Enough?
Below you will find a list of questions to ask yourself in order to evaluate the level of system availability. If you are not able to answer some of those questions, make sure to ask your IT department, or the third party responsible for deployment. If some questions have no answer, or the answer is no, problems with availability or with restoring it may appear in case of a failure. Make sure to take adequate steps to minimize them while still “at peace”, when everything still works, and not “at war”.
- Does your hosting provider use connections to multiple ISPs, and is it able to reroute traffic dynamically (e.g. using BGP) in order to ensure continuous network access?
- Does the server room have an alternative energy source and a redundant cooling system?
- Are backups made on a regular basis? How often? Do those backups cover all user data and databases?
- What is the backup retention time?
- Is the network equipment utilized in the server room redundant (N+1 or 2N)?
- Are the servers equipped with dual network adapters, and connected to two separate switches?
- Does the server room support IP-Failover (switching IPs between servers in case of a failure)? Is the service reliable?
- How long does it take to restore the application from backup, and what is the maximum data loss expected (the time between the last backup and the failure)?
- Is the application able to run on several Web servers? Is our environment redundant (N+1 or 2N) — is failover to a secondary server possible?
- Is the current application source code managed using a version control system, and is it possible to restore a working (stable) version at any time as required?
- Are external systems monitored? Is communication with those systems asynchronous and covered by error handling? Are messages queued, so that they are not lost in case of an external system failure?
- Are administrators working 24/7 in shifts, and do they know how to restore the application?
- What are the reaction and repair times as defined by the system warranty? What are the reaction times as defined by the hosting provider contract?
- Is the application able to detect a Web server failure (e.g. using a load balancer), and perform a failover to a secondary server?
- Is the database replicated in real time (master-slave or master-master) to a secondary server? Or are real-time copies perhaps not required in your case?
These are basic questions that can be answered without meticulously analyzing the system and the application. An answer of “yes” to most of them solves approximately 80 percent of the availability issues encountered in standard e-commerce systems. Solving and automating the remaining 20 percent may, however, be quite complex and time-consuming.
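The checklist question about detecting a Web server failure with a load balancer can be illustrated with a minimal sketch. It shows only the selection logic: the `ServerPool` class and the server names are hypothetical, and the `is_healthy` callable stands in for a real health check, such as an HTTP probe performed by the load balancer.

```python
class ServerPool:
    """Round-robin selection that skips servers failing the health check."""

    def __init__(self, servers, is_healthy):
        self._servers = list(servers)
        self._is_healthy = is_healthy  # callable: server name -> bool
        self._next = 0

    def pick(self):
        """Return the next healthy server, or raise if none is available."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if self._is_healthy(server):
                return server
        raise RuntimeError("no healthy Web server available")


if __name__ == "__main__":
    down = {"web-2"}  # hypothetical: this server is currently failing
    pool = ServerPool(["web-1", "web-2", "web-3"], lambda s: s not in down)
    print([pool.pick() for _ in range(4)])
```

Because health is checked on every pick, a server that recovers is used again automatically, and one that fails is skipped without any manual intervention.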