
Implementing chaos engineering and disaster recovery testing to improve system resiliency in DevOps

Chaos engineering is a testing discipline that deliberately introduces failures into a system in order to observe how it behaves under realistic, turbulent conditions. It was originally developed at Netflix as a way to test the company’s infrastructure and applications, and it has since become common practice among DevOps teams that run distributed systems. Any organization that wants to improve resilience by better understanding how failures affect its systems can use it.

What is chaos engineering?

Chaos engineering is an approach to building resilient systems that treats failures as normal and expected, rather than as rare surprises.

Chaos engineers experiment on distributed systems in production by injecting faults and monitoring the results. This process helps us gain confidence in our ability to withstand turbulent conditions in production.

How to define your goals with chaos engineering.

In order to effectively implement chaos engineering, you need to set goals for yourself. The first step is defining the problem you’re trying to solve and identifying its root cause. This can help you understand how much work it’s going to take for your team and organization to get where they want to go.

Once you’ve identified what needs fixing and why, set measurable goals that are ambitious but realistic. Make sure these goals align with business goals so that they’re relevant at both an individual level (e.g., “I want my codebase well-tested”) and a team level (e.g., “We need more people on our engineering team”).

Set up your first experiment.

Step 1: Define the experiment
Before you can get started, it’s important to define your first experiment. This will help you determine what kind of failure you want to test and how it will affect your application. You should also think about when and where this failure could occur. For example, a database server going down during peak hours on Monday morning, when everyone is trying to log in at once, is a scenario well worth simulating.

Once you’ve defined your experiment and its expected impact on users or systems, the next step is deciding how long each test should run and what stops it: a person stepping in, an automated process, or a fixed time limit with no human intervention at all.

Finally, think about who needs access during each phase of testing so that they can make decisions based on the results. Advance knowledge of an experiment might skew how people respond, so decide deliberately who is told beforehand.
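
The decisions in this step can be written down before anything is broken. The following is a minimal Python sketch, not a prescribed format: the `Experiment` class, its field names, and the example values are all hypothetical, chosen only to show the failure, the time limit, the abort condition, and the people with access kept in one place.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """A written-down experiment definition (all fields are illustrative)."""
    name: str
    failure: str                 # what will be broken
    when: str                    # when and where the failure is simulated
    max_duration_s: int          # hard time limit for the test
    abort_error_rate: float      # stop early if errors exceed this fraction
    observers: list = field(default_factory=list)  # who has access to results

    def should_abort(self, elapsed_s: float, error_rate: float) -> bool:
        """Stop when the time limit is reached or the error rate is too high."""
        return elapsed_s >= self.max_duration_s or error_rate >= self.abort_error_rate


# Hypothetical example based on the database scenario described above.
experiment = Experiment(
    name="primary-db-outage",
    failure="database server goes down",
    when="Monday morning peak login traffic",
    max_duration_s=300,
    abort_error_rate=0.05,
    observers=["on-call engineer", "database owner"],
)

print(experiment.should_abort(elapsed_s=60, error_rate=0.01))
```

Keeping the stop conditions next to the definition means an automated process and a human operator can both check the same rule during the test.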

Devise a hypothesis and inject failure into your system in a controlled way.

  • Define the problem before starting on a solution.
  • Start by defining your goals, even if they are ambitious and far away.
  • Don’t worry about what other people’s goals are; focus on your own.
  • Be ambitious, but stay realistic in terms of what you can achieve within 3-6 months.
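
As an illustration of a hypothesis paired with controlled failure injection, here is a small Python sketch. It is an assumption-laden toy, not a production tool: `with_fault`, `call_with_retries`, `success_rate`, and `fetch_profile` are hypothetical names, the 20% failure rate and 99% target are made-up numbers, and the "failure" is just an exception raised inside the process.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def with_fault(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, retries):
    """Call func, retrying up to `retries` extra times on an injected fault."""
    for attempt in range(retries + 1):
        try:
            return func()
        except InjectedFault:
            if attempt == retries:
                raise


def success_rate(func, attempts):
    """Fraction of calls that complete without an injected fault."""
    ok = 0
    for _ in range(attempts):
        try:
            func()
            ok += 1
        except InjectedFault:
            pass
    return ok / attempts


def fetch_profile():
    """Stand-in for a real dependency call (hypothetical)."""
    return {"user": "example"}


# Hypothesis (illustrative): with 3 retries, at least 99% of requests
# succeed even when 20% of calls to the dependency fail.
rng = random.Random(7)  # seeded so the experiment is repeatable
flaky = with_fault(fetch_profile, failure_rate=0.2, rng=rng)
observed = success_rate(lambda: call_with_retries(flaky, retries=3), attempts=1000)
hypothesis_holds = observed >= 0.99
print(f"success rate with retries: {observed:.3f}, hypothesis holds: {hypothesis_holds}")
```

The failure rate is an explicit parameter, so the experiment can start small and be turned off by setting it to zero.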

Measure the impact of failure and communicate any incidents you identify.

Measure the impact of failure

As you’re measuring your system’s resiliency, it’s important to understand how much impact each incident has on your business. You can measure the impact by:

  • Counting how many users were affected by an incident
  • Calculating the increase in response time for all users, or just for those who experienced an outage (e.g., if there are 100 users and 50 of them experienced an outage, then half of your users were impacted)
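
Both measurements can be computed directly from numbers you already collect. The sketch below is a hedged illustration: `impact_summary` is a hypothetical helper, and the response times are invented sample values rather than real measurements.

```python
from statistics import mean


def impact_summary(total_users, affected_users, baseline_ms, incident_ms):
    """Summarize an incident as affected-user share and latency increase."""
    baseline = mean(baseline_ms)
    during = mean(incident_ms)
    return {
        "affected_fraction": affected_users / total_users,
        "latency_increase_pct": (during - baseline) / baseline * 100,
    }


# 100 users, 50 of whom experienced the outage (the example from the list above);
# response times in milliseconds are made-up sample data.
summary = impact_summary(
    total_users=100,
    affected_users=50,
    baseline_ms=[200, 200, 200],
    incident_ms=[300, 300, 300],
)
print(summary)
```

For this sample data, half of the users are affected and the mean response time rises by 50%.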

Communicate any incidents you identify

Apply what you learn.

  • You’re not done until you’ve applied what you learn.
  • If a new failure mode is discovered, it should be incorporated into your existing set of failure scenarios and tested again. The goal is for your system to handle known failure modes without downtime or service degradation when they happen in production. Repeat this feedback loop over time: as more people become involved in testing, iteration gets faster and your organization’s infrastructure becomes more adaptable.
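
One way to keep that set of failure scenarios in a single place is a small registry that is re-run whenever a scenario is added. This is only a sketch under assumptions: the `scenario` decorator, the scenario names, and the checks are hypothetical, and each check simply returns `True` as a placeholder for a real injection plus a health check.

```python
scenarios = {}


def scenario(name):
    """Register a failure scenario check under a readable name."""
    def register(check):
        scenarios[name] = check
        return check
    return register


@scenario("database unavailable")
def database_unavailable():
    # Placeholder: a real check would inject the failure and verify recovery.
    return True


def run_all():
    """Run every registered scenario and report whether each one passed."""
    return {name: check() for name, check in scenarios.items()}


# A newly discovered failure mode joins the same set and is tested with the rest.
@scenario("cache returns stale data")
def cache_returns_stale_data():
    # Placeholder, as above.
    return True


results = run_all()
failed = [name for name, passed in results.items() if not passed]
print(f"{len(results)} scenarios run, failed: {failed}")
```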

With chaos engineering, you can do more than understand how failures affect your systems; you can build systems that are resilient and stay operational even when failures occur.

Chaos engineering is a way to test the resiliency of your systems. It involves injecting failures into your system in a controlled way so that you can observe how it responds and learn from those observations. By introducing failures on purpose, you can build systems that are resilient and stay operational even when failures occur.

Chaos engineering addresses a common problem for organizations: they have good uptime goals but no way to measure whether those goals are being met. Without deliberate testing, there is no easy way to tell whether their systems would survive real-world disasters such as outages or attacks until those events actually happen (and sometimes not even then).

With chaos engineering, you can act on what you observe and design your systems for resilience from the ground up.


I hope you’re excited about the possibilities that chaos engineering and disaster recovery testing can bring to your organization. The next time you design a system, consider whether chaos engineering may be able to help you build one that is more resilient and capable of overcoming failure.
