In the last entry I wrote about the DevOps methods we use at Divante. DevOps is one of the most important factors in delivering demanding service level agreements (SLAs). Now I would like to share another piece of the high-availability puzzle that we’ve implemented at Divante.
We challenged ourselves to provide very high availability (about 99.95% a month) and to guarantee the reaction and recovery times defined in the SLA. Sometimes the reaction times that we guarantee to our customers are as low as 0.5h, with recovery times of 1-2h, and this includes weekends and night hours. This is not an easy task. Even dedicated maintenance centers with administrators working in three shifts are sometimes unable to provide the required SLA.
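To put the 99.95% figure in perspective, here is a minimal sketch of the arithmetic in Python. It assumes a 30-day month, and the function name is ours, chosen for illustration only.

```python
def downtime_budget_minutes(availability_percent: float, days_in_month: int = 30) -> float:
    """Return the downtime allowed per month at a given availability level.

    Assumption: the month has `days_in_month` days (30 by default).
    """
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100 - availability_percent) / 100


print(round(downtime_budget_minutes(99.95), 1))  # 21.6
```

In other words, 99.95% availability leaves roughly 21.6 minutes of downtime in a 30-day month.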
We discovered that the key is motivating and appreciating IT teams. So we created Divante S.W.A.T (Special Weapons And Tactics)!
The idea was to create a fast-reaction special task force called “Divante S.W.A.T” (for details: http://en.wikipedia.org/wiki/SWAT).
The team consists of three members, and membership is voluntary. The main task of the group is to uphold, using all the methods at hand, the SLA for the websites (mostly e-commerce) serviced by Divante.
Each member has to gain special knowledge. We provide a one-month training, which includes:
- Linux administration workshops, led by trainers from an external company specializing in server administration,
- a Bash course,
- exams and tests in a staging environment, including simulated disasters. A S.W.A.T. member doesn’t know what exactly is going to happen (as in the real world) and has to recover the system.
Membership is voluntary, but to join, each candidate has to pass an official exam, as in real special task forces. To keep motivation constant, we sign an agreement with each new member:
- Members get a special bonus added to their monthly salary. It’s a considerable amount of money. Also, to help them feel part of the group, each person gets a special emblem, a Swiss knife, etc.
- We agree on minimum SLA values: 1h reaction time and 2h recovery time. These values are the minimum the group must deliver and the maximum we offer to our customers. This means we cannot guarantee a stricter SLA. But, to be fair, this is already a very demanding one!
- S.W.A.T members have to ensure the SLA using all possible methods. They can organize on-call duties, divide websites among themselves, install monitoring systems, write documentation, or motivate programmers. They organize their work themselves.
- The first two months were the time for organizing the work, which is why the monthly bonus was granted regardless of whether the group maintained the SLA.
- After that period, group responsibility applies, as in the army: if the SLA is violated for any website, nobody gets their bonus.
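The bonus rule from this agreement can be sketched in Python. This is only an illustration under stated assumptions, not the way we actually track it: the `Incident` record, the function names, and the choice to count months from 1 are all hypothetical. The thresholds (1h reaction, 2h recovery) and the two-month grace period come from the agreement itself.

```python
from dataclasses import dataclass

REACTION_LIMIT_MIN = 60    # 1h reaction time, from the agreement
RECOVERY_LIMIT_MIN = 120   # 2h recovery time, from the agreement
GRACE_MONTHS = 2           # bonus is paid unconditionally in the first two months


@dataclass
class Incident:
    """Hypothetical record of one intervention."""
    website: str
    reaction_min: float
    recovery_min: float


def violates_sla(incident: Incident) -> bool:
    """True if either the reaction or the recovery limit was exceeded."""
    return (incident.reaction_min > REACTION_LIMIT_MIN
            or incident.recovery_min > RECOVERY_LIMIT_MIN)


def team_gets_bonus(month_number: int, incidents: list[Incident]) -> bool:
    """Group responsibility: one violation on any website cancels everyone's bonus.

    Assumption: month_number counts from 1, starting when the team was formed.
    """
    if month_number <= GRACE_MONTHS:
        return True
    return not any(violates_sla(i) for i in incidents)
```

For example, `team_gets_bonus(3, [Incident("shop-a", 30, 100), Incident("shop-b", 61, 100)])` returns `False`, because the second (made-up) incident exceeded the reaction limit by one minute.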
The SLA is monitored, and feedback is given to the group weekly and monthly. We have planned to set up a special dashboard in the room showing the SLA status for the current week.
For the group members we provided a month-long training. During this period we covered infrastructure and system administration. After the training there was an exam (60 minutes; very hard).
The S.W.A.T team has access to SysOps and an external consulting agency (specialized in Linux administration and networks). The team is the first line for intercepting issues. Its members do what is possible, and when they run out of ideas, they turn to the second line for help.
The team organized its own work. A special e-mail group was created. It works like CB radio: each member sends information about current interventions. After each action a special report is created. Summaries are collected in Redmine, under a special project called “Interventions”, for use in case of a similar incident in the future.
Team members have motivated Divante developers to create special documents called “Trouble shootings” (TS) for each website that we maintain. A TS describes the configuration details, the architecture, and possible errors. You can use it even if you know nothing about the application that needs help right now.
The group receives information about planned deployments and maintenance work, so its members know when they should be prepared. We created a special phone number that connects directly to Divante S.W.A.T (calling it is like calling 911!).
Does it work?
Since we created the team, we have violated the response time only once. The cause, in fact, was a faulty check in the monitoring system.
We are recruiting and training new members all the time. If you are thinking about improving reaction times, a variation on this idea could work for you!