Intentionally injecting failures into systems to test resiliency.
We define the steady state of a system to set a baseline for normal behaviour. We hypothesize that the steady state will continue in both the control group and the experimental group. We then introduce variables into the experimental group that reflect real-world events such as crashed servers, malfunctioning hard drives, or severed network connections.
We then try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
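The comparison between the two groups can be sketched in a few lines of Python. This is only an illustration under assumptions: the steady-state metric is taken to be the request success rate, and the `tolerance` threshold, the function names, and the sample data are all hypothetical rather than taken from any real tool.

```python
from statistics import mean

def success_rate(responses):
    """Fraction of requests in the sample that succeeded (True)."""
    return mean(1.0 if ok else 0.0 for ok in responses)

def hypothesis_holds(control, experiment, tolerance=0.01):
    """Steady state holds if the two success rates differ by at most `tolerance`."""
    return abs(success_rate(control) - success_rate(experiment)) <= tolerance

# Hypothetical samples: True means the request succeeded.
control = [True] * 995 + [False] * 5       # no fault injected: 99.5% success
experiment = [True] * 950 + [False] * 50   # fault injected: 95.0% success

# A gap larger than the tolerance disproves the hypothesis and points at a weakness.
print(hypothesis_holds(control, experiment))  # False
```

In practice the metric would be whatever output best indicates normal behaviour for the system under test, and the tolerance would be chosen from its normal variation.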
Chaos Monkey
Randomly disables production instances to make sure that a system is resilient enough to survive this common type of failure.
We can run Chaos Monkey in the middle of a business day, with on-call engineers ready to address issues, to learn about the weaknesses of a system.
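The idea of disabling a random instance only while engineers are at work can be sketched as follows. This is not Chaos Monkey's actual implementation: the business-hours window, the instance IDs, and the `terminate` callback are hypothetical stand-ins for whatever scheduling rules and cloud API a real setup would use.

```python
import random
from datetime import datetime

def in_business_hours(now, start_hour=9, end_hour=17):
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour

def pick_victim(instances, now, rng=random):
    """Return one random instance to disable, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)

def run_once(instances, terminate, now=None, rng=random):
    """Pick a victim and pass it to `terminate`, a caller-supplied callback."""
    victim = pick_victim(instances, now or datetime.now(), rng)
    if victim is not None:
        terminate(victim)
    return victim

# Hypothetical fleet; list.append stands in for a real termination call.
fleet = ["i-a", "i-b", "i-c"]
terminated = []
run_once(fleet, terminated.append, now=datetime(2024, 1, 3, 11, 0))  # a Wednesday
print(terminated)  # one randomly chosen instance ID
```

Passing `now` and `rng` as parameters keeps the sketch testable: a fixed timestamp and a seeded `random.Random` make each run reproducible.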
Types of Chaos
These come from Netflix’s Simian Army tooling.
- Chaos monkey: Brings down random production instances
- Latency monkey: Induces artificial delays in the communication layer
  - Simulates service degradation and checks that upstream services respond appropriately
  - Large delays can simulate nodes or services being down
  - This is good for testing the fault tolerance of a new service
- Doctor monkey: Taps into health checks to assess system health
  - The goal is to detect unhealthy instances
  - Unhealthy instances are removed from service so end users can’t reach them, but the devs can still investigate them
  - Once the root cause is found, the instance is terminated
- Janitor monkey: Finds unused resources and gets rid of them
  - Ensures that the environment is running free of clutter
- Conformity monkey: Finds instances that don’t adhere to best practices
- Security monkey: Finds security violations or vulnerabilities
  - Terminates offending instances
  - Makes sure TLS certificates are valid
- 10-18 monkey: Localization and internationalization (l10n-i18n)
  - Detects configuration and run-time problems in instances serving customers in multiple geographic regions, using different languages and character sets
- Chaos gorilla: Simulates an outage of an entire availability zone
  - We want to verify that services automatically re-balance to the functional availability zones
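As one illustration from the list above, the latency monkey idea can be sketched as a wrapper that delays a dependency, paired with a caller that gives up after a timeout and falls back. This is a minimal sketch and not Netflix's tooling: `with_latency`, `call_with_fallback`, and `get_recommendations` are hypothetical names, and the delay and timeout values are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def with_latency(func, delay_s):
    """Wrap `func` so that every call sleeps for `delay_s` seconds first."""
    def delayed(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return delayed

def call_with_fallback(func, timeout_s, fallback):
    """Call `func`; if no result arrives within `timeout_s`, return `fallback`."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the slow call

def get_recommendations():
    """Hypothetical downstream dependency."""
    return ["title-1", "title-2"]

# Healthy dependency: the caller should get the real answer.
print(call_with_fallback(get_recommendations, timeout_s=0.1, fallback=[]))

# Degraded dependency: the injected delay exceeds the timeout, so the
# caller should degrade gracefully to the empty fallback.
slow = with_latency(get_recommendations, delay_s=0.3)
print(call_with_fallback(slow, timeout_s=0.1, fallback=[]))
```

A delay much larger than the caller's timeout behaves, from the caller's point of view, like the dependency being down, which is why large delays can stand in for failed nodes.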