21/02/2022

Chaos engineering: improving software by generating errors

The term chaos engineering refers to a method of testing IT systems, aimed at identifying any bugs or weak points. It differs from the usual stress tests in that it is based on principles of chaos theory and aims to generate events considered very unlikely to happen.

It consists of controlled experiments that improve IT security by locating anomalies that are not immediately visible and bugs that occur only intermittently. Its main objective is to prevent hackers from exploiting these vulnerabilities to cause damage to the infrastructure.

Improving the system via controlled tests

The main principle of chaos engineering is to break a system on purpose in order to gather information that will help to improve its resilience. This method is an effective way to test software and applications, as well as distributed systems, such as a network of interconnected computers sharing resources locally or remotely.

The wider and more complex a network is, the more susceptible it becomes to ‘breakdowns’. There are numerous points where it could fail, and failures often appear to be triggered by random events; this impression of randomness tends to grow the more elaborately structured the entire system is.
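A rough way to see why larger systems fail more often is to count the possibilities. The sketch below is illustrative only: it assumes that components fail independently and that each has the same availability, which real networks rarely satisfy, and the function names are hypothetical.

```python
def failure_combinations(n_components: int) -> int:
    """Number of distinct non-empty sets of components that could fail together."""
    return 2 ** n_components - 1


def all_up_probability(n_components: int, availability: float) -> float:
    """Chance that every component is up at the same moment.

    Assumption: failures are independent and every component
    has the same availability (a simplification for illustration).
    """
    return availability ** n_components


if __name__ == "__main__":
    # Print the growth in possible failure sets and the shrinking chance
    # that everything works at once, for systems of increasing size.
    for n in (5, 20, 100):
        print(n, failure_combinations(n), all_up_probability(n, 0.999))
```

Even with very reliable individual components, the number of possible failure sets grows exponentially with the size of the system, which is one reason failures in large networks can look random.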

Tests which follow the principles of chaos theory are specifically designed to simulate unpredictable situations, in order to locate previously unsuspected weak points. Examples of issues potentially identifiable via this testing method include:

Blind spots: areas where software monitoring is not able to gather sufficient data;
Hidden bugs: glitches or other problems which can cause the software or network to malfunction;
Performance bottlenecks: situations in which efficiency and performance could be improved.

It is very important to identify any malfunction, especially if we consider that network infrastructures and programming methods are constantly becoming more complex and intricate.

Any company wishing to adequately protect its IT department must, therefore, use strategies like the ones described above in order to adapt to, and control as well as possible, the inevitable chaos caused by unexpected events.

How chaos engineering works

Like ordinary stress tests, chaos engineering aims to reveal and correct any malfunctions in a system, network or program. The main difference is that, instead of concentrating resources on a single component at a time, chaos engineering considers a far wider, almost unlimited range of possible failure scenarios.

It looks beyond the obvious bugs and tests distributed systems for issues, or combinations of issues, which are less likely to occur. In other words, the idea is to obtain new knowledge of the system via a process which is typically divided into the following stages:

Establish a baseline: testers must understand how the system should function in optimal conditions and specify what constitutes a normal working state;
Create a hypothesis: one or more potential weaknesses are considered and hypotheses are formulated as to their effects. For example, the software testers might want to know what would happen during a peak in traffic;
Carry out the testing phase: experiments are conducted to test the hypothesis. The tests carried out could reveal an error in a critical process or an unexpected cause-and-effect relationship. In the previous example, simulating a peak in traffic could reveal problems with storage performance;
Evaluate the accuracy of the hypothesis: the results are compared with the hypothesis and the key issues to be resolved are established.
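The four stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real chaos engineering tool: `toy_service` is a hypothetical stand-in for a real request, the "fault" is simply a higher failure probability, and the tolerated error rate is an arbitrary threshold chosen by the tester.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline_error_rate: float
    chaos_error_rate: float
    hypothesis_held: bool


def toy_service(fail_probability: float, rng: random.Random) -> bool:
    """Hypothetical stand-in for a real request; returns True on success."""
    return rng.random() >= fail_probability


def error_rate(call: Callable[[], bool], requests: int) -> float:
    """Fraction of calls that failed."""
    failures = sum(1 for _ in range(requests) if not call())
    return failures / requests


def run_experiment(
    baseline_fail_p: float,
    injected_fail_p: float,
    tolerated_error_rate: float,
    requests: int = 1000,
    seed: int = 0,
) -> ExperimentResult:
    rng = random.Random(seed)  # seeded so the experiment is repeatable

    # 1. Establish a baseline: measure the normal working state.
    baseline = error_rate(lambda: toy_service(baseline_fail_p, rng), requests)

    # 2. Create a hypothesis: "even with the fault injected, the error
    #    rate stays at or below tolerated_error_rate".

    # 3. Carry out the test: repeat the measurement with the fault injected.
    chaos = error_rate(lambda: toy_service(injected_fail_p, rng), requests)

    # 4. Evaluate the hypothesis against what was observed.
    return ExperimentResult(baseline, chaos, chaos <= tolerated_error_rate)


if __name__ == "__main__":
    print(run_experiment(baseline_fail_p=0.01,
                         injected_fail_p=0.30,
                         tolerated_error_rate=0.05))
```

If `hypothesis_held` comes back false, the experiment has exposed a weakness worth investigating; in a real system the injected fault would be something like added latency or a stopped instance rather than a probability.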

Teams practising chaos engineering therefore mainly use ‘what if’ scenarios to provoke random errors and malfunctions. This allows them to identify the most important interventions required to improve the system’s performance and ensure its integrity.

Translated by Joanne Beckwith