In the eCommerce world, a system breakdown is always a measurable loss. Even a couple of minutes of downtime can result in fewer transactions and lower revenue. That’s why we challenged ourselves to provide very high availability (up to 99.95% a month) and to guarantee the reaction and recovery times defined in the SLA.
We also discovered that the key is motivating and appreciating IT teams. We created Divante S.W.A.T (Special Weapons And Tactics), a team of our most experienced employees that is responsible for reacting quickly during a crisis. Membership is voluntary and comes with special bonuses added to the salary. The group’s main task is to uphold, using all the methods at hand, the SLA for the websites (mostly eCommerce) serviced by Divante.
Here’s an interview with one of the S.W.A.T team members, our Hosting Administration Director, Paweł Szreder. After reading it you’ll know more about online security and crisis procedures.
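To put the availability figure in perspective, here is a minimal sketch of the arithmetic behind it. The function name and the 30-day month are assumptions made for illustration; an actual SLA may define the measurement window differently.

```python
def downtime_budget_minutes(availability_percent: float, days_in_month: int = 30) -> float:
    """Return the maximum downtime, in minutes, that a monthly availability target allows."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month (assumed window)
    return total_minutes * (100.0 - availability_percent) / 100.0


budget = downtime_budget_minutes(99.95)
print(f"{budget:.1f} minutes of downtime per 30-day month")  # prints "21.6 minutes of downtime per 30-day month"
```

In other words, a 99.95% monthly target leaves a little over twenty minutes for all incidents combined.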
First members of Divante S.W.A.T – Michal, Maciek, Marcin – with their exam diplomas :)
When and where can we face errors that can paralyze an online shop?
Paweł: Actually, there’s no strict rule. We should divide such errors into two groups: those related to the server infrastructure (hardware, networking, OS and related services) and those connected with the application itself, its integrations with external systems, and its business logic. Application errors are generally easier to identify. They usually happen during source code updates or where the application integrates with external services. Sometimes we only need to revert the source code to the previous version to remove the bug. It’s funny, because when you don’t touch the application and don’t update the system, the risk that something will happen in this area is minimal.
Infrastructure issues are the bigger problem. These errors are usually unpredictable: for example, hardware breakdowns or application failures that overload the server. In addition, infrastructure working under heavy traffic requires continuous monitoring and improvement.
Where does the idea of S.W.A.T come from?
Paweł: As always, we tried to meet the requirements of our Clients. As an eBusiness Software House we focus on delivering comprehensive services, and an SLA covering both infrastructure and application is one of them. At the beginning, there was no such structure inside our company, but we felt a strong need for one. So we created a dedicated quick-reaction group that takes care of the uninterrupted functioning of the online platforms serviced by Divante.
It’s difficult to always be on alert, isn’t it?
Paweł: Yes, it is, but we have day and night shifts as well as dedicated coordinators. There is always at least one person responsible. Apart from that, if any member notices that something has happened, they check what is wrong and inform the other members that action has been taken. The other members can follow the process live.
Do you have any procedures?
Paweł: Yes, of course. We have organizational procedures as well as repair procedures. Most of our services have troubleshooting procedures written by programmers: a set of the pivotal points of each system, with hints on what should be checked first. If a problem is neither an infrastructure error nor one of those described in the troubleshooting procedures, it is sent to the programmers. It’s worth noting that we can sometimes bypass some steps of the procedure when we know from the beginning that the problem is connected with the application. During the action we stay in constant communication with our Clients: we inform them what’s happening and help them make decisions throughout the whole process.
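The routing Paweł describes can be sketched as a small decision function. This is only an illustration of the order of checks mentioned in the answer, not Divante's actual procedure: the names `Route` and `triage` are hypothetical, and treating the "bypass" as going straight to the programmers is an assumed reading of the answer.

```python
from enum import Enum


class Route(Enum):
    INFRASTRUCTURE = "handled as an infrastructure error"
    TROUBLESHOOTING = "handled with the service's troubleshooting procedure"
    PROGRAMMERS = "sent to the programmers"


def triage(is_infrastructure_error: bool,
           matches_troubleshooting_procedure: bool,
           known_application_problem: bool = False) -> Route:
    """Decide who handles an incident, following the order of checks from the interview."""
    # Assumed shortcut: a problem already known to be in the application skips the earlier steps.
    if known_application_problem:
        return Route.PROGRAMMERS
    if is_infrastructure_error:
        return Route.INFRASTRUCTURE
    if matches_troubleshooting_procedure:
        return Route.TROUBLESHOOTING
    # Neither an infrastructure error nor a documented case: escalate.
    return Route.PROGRAMMERS


print(triage(is_infrastructure_error=False, matches_troubleshooting_procedure=False).value)
```

Communication with the Client runs alongside every branch, so it is deliberately left out of the routing itself.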
What tools do you use in your work?
Paweł: The first tool we use is Zabbix. It monitors the whole infrastructure, checking the technical parameters. We also use it to monitor business checks that are specific to a system and delivered by the programmers (for integrations, for example). We also use Monit for proactive monitoring: if something is wrong with the system, this tool tries to recover it. The third tool worth mentioning is Graylog, for log management and real-time analysis.
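The idea behind proactive monitoring, check first and attempt recovery before involving a person, can be illustrated with a few lines of plain Python. This is a conceptual sketch only: it is not Monit's configuration syntax or API, and `check_and_recover`, `FakeService`, and the retry limit are hypothetical names and values chosen for the example.

```python
from typing import Callable


def check_and_recover(is_healthy: Callable[[], bool],
                      recover: Callable[[], None],
                      max_attempts: int = 3) -> bool:
    """Run a health check; on failure, retry the recovery action up to max_attempts times.

    Returns True if the service is healthy at the end, False if a human needs to step in.
    """
    if is_healthy():
        return True
    for _ in range(max_attempts):
        recover()
        if is_healthy():
            return True
    return False


class FakeService:
    """Stand-in for a real service, used only to demonstrate the loop."""

    def __init__(self) -> None:
        self.running = False
        self.restarts = 0

    def is_running(self) -> bool:
        return self.running

    def restart(self) -> None:
        self.restarts += 1
        self.running = True


service = FakeService()
recovered = check_and_recover(service.is_running, service.restart)
print(recovered, service.restarts)  # prints "True 1"
```

When the function returns False, automatic recovery has failed and the incident would be handed to the person on duty.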
What is the general rule of every crisis reaction?
Paweł: I’d say that first we work out how to overcome the crisis, and second we try to understand what happened and how to avoid such situations in the future. As a result, we constantly implement new proactive mechanisms. Their goal is to detect and resolve problems that have occurred in the past and are at risk of recurring.
Does it work, the whole S.W.A.T idea?
Paweł: Yes, it actually does. Firstly, we have tested the whole reaction procedure many times and can now say that it works. Secondly, we keep gaining experience and making our security mechanisms even more effective.