
Core Concepts

Understanding these concepts will help you design effective chaos experiments.

ChaosExperiment

A ChaosExperiment is the fundamental unit of work. It describes what to break, how to break it, for how long, and what "healthy" looks like before and after.

apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: my-experiment
  namespace: production
spec:
  target:          # what to target
  action:          # what chaos to inject
  duration:        # how long to run
  steadyState:     # health checks before/after
  abortConditions: # auto-abort triggers
  rollback:        # whether to undo changes

Experiment phases

An experiment moves through an 8-state machine:

Phase                  Description
Pending                Created, waiting to start
SteadyStateChecking    Running pre-chaos health probes
Running                Chaos is actively injected
Completing             Duration elapsed, wrapping up
Recovering             Running post-chaos health probes
Completed              All probes passed, experiment done
Failed                 A probe failed or an error occurred
Aborted                Manually aborted or abort condition triggered
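
The phase table can be read as a small state machine. The sketch below is one way to model it in Python, not ChaosPlane's implementation. The phase names come from the table; the transition rules are assumptions inferred from the descriptions (a linear happy path, with Failed and Aborted reachable from any non-terminal phase), and `allowed_next` is a hypothetical helper.

```python
from enum import Enum


class Phase(str, Enum):
    PENDING = "Pending"
    STEADY_STATE_CHECKING = "SteadyStateChecking"
    RUNNING = "Running"
    COMPLETING = "Completing"
    RECOVERING = "Recovering"
    COMPLETED = "Completed"
    FAILED = "Failed"
    ABORTED = "Aborted"


# Assumed happy path, in the order the table lists the phases.
HAPPY_PATH = [
    Phase.PENDING,
    Phase.STEADY_STATE_CHECKING,
    Phase.RUNNING,
    Phase.COMPLETING,
    Phase.RECOVERING,
    Phase.COMPLETED,
]
TERMINAL = {Phase.COMPLETED, Phase.FAILED, Phase.ABORTED}


def allowed_next(phase: Phase) -> set:
    """Return the phases an experiment may move to from `phase` (assumed rules)."""
    if phase in TERMINAL:
        return set()
    following = HAPPY_PATH[HAPPY_PATH.index(phase) + 1]
    # Assumption: any non-terminal phase can also end in Failed or Aborted.
    return {following, Phase.FAILED, Phase.ABORTED}
```

Under these assumptions, `allowed_next(Phase.RUNNING)` contains Completing, Failed, and Aborted, while every terminal phase returns an empty set.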

Targets

The target spec tells ChaosPlane which resources to affect:

target:
  kind: Pod            # Pod or Node
  namespace: default
  labelSelector:
    matchLabels:
      app: my-app
  # OR target by name:
  names:
    - my-pod-abc123

You can target by label selector or by explicit names. For pod actions, namespace is required.

Actions

An action specifies the type of chaos and its parameters:

action:
  type: pod-kill
  parameters:
    gracePeriodSeconds: "0"

Parameters are always strings (they're passed as a map[string]string to the executor). See the Action Reference for each action's parameters.
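
Because every parameter arrives as a string, numeric values have to be quoted in YAML (an unquoted 0 is parsed as an integer) and converted by whatever consumes them. The Python sketch below illustrates that conversion. It is not the executor's code: `int_param` and the default of 30 are hypothetical, and only `gracePeriodSeconds` comes from the example above.

```python
def int_param(parameters: dict, key: str, default: int) -> int:
    """Read an integer from a string-to-string parameter map."""
    raw = parameters.get(key)
    if raw is None:
        return default
    if not isinstance(raw, str):
        # An unquoted YAML number would land here: the map must hold strings.
        raise TypeError(f"parameter {key!r} must be a string, got {type(raw).__name__}")
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"parameter {key!r} must be an integer string, got {raw!r}") from exc


# Hypothetical usage with the parameters from the pod-kill example.
grace = int_param({"gracePeriodSeconds": "0"}, "gracePeriodSeconds", default=30)
```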

Action categories

  • Pod actions (8): kill, container-kill, cpu-stress, memory-stress, io-stress, dns-error, http-abort, http-delay
  • Network actions (6): delay, loss, corrupt, duplicate, partition, bandwidth
  • Node actions (4): drain, taint, restart, cpu-stress
  • Stress actions (2): stress-cpu, stress-memory (generic, not pod-scoped)

Steady-State Probes

Probes verify that your system is healthy before and after chaos. If a before probe fails, the experiment does not run. If an after probe has not passed by the time the recovery timeout expires, the experiment is marked Failed.

steadyState:
  before:
    - name: check-replicas
      type: k8s
      k8s:
        resource: pods
        namespace: default
        labelSelector: app=my-app
        condition:
          minReady: 3
  after:
    - name: check-replicas-recovered
      type: k8s
      k8s:
        resource: pods
        namespace: default
        labelSelector: app=my-app
        condition:
          minReady: 3
  recoveryTimeout: 5m
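
One plausible reading of the recovery timeout is that after probes are retried until they all pass or the timeout runs out. The Python sketch below shows that polling loop. It is an assumption about the behavior, not ChaosPlane's code: `wait_for_recovery`, the one-second interval, and the injectable `clock` and `sleep` arguments are all hypothetical.

```python
import time


def wait_for_recovery(probes, timeout_s, interval_s=1.0,
                      clock=time.monotonic, sleep=time.sleep) -> bool:
    """Retry `probes` (a list of zero-argument callables returning bool)
    until all pass or `timeout_s` seconds have elapsed."""
    deadline = clock() + timeout_s
    while True:
        if all(probe() for probe in probes):
            return True   # system recovered: experiment can complete
        if clock() >= deadline:
            return False  # still unhealthy at the deadline: mark Failed
        sleep(interval_s)
```

Passing `clock` and `sleep` in as arguments keeps the loop testable without real waiting.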

Probe types

Kubernetes probe (k8s): checks that a minimum number of matching resources are in a ready state.

HTTP probe (http): makes an HTTP request and checks the response status or body.

- name: api-healthy
  type: http
  http:
    url: https://my-api.example.com/health
    method: GET
    expectedStatus: 200

Prometheus probe (prometheus): queries a Prometheus endpoint and evaluates the result against a threshold.

- name: error-rate-low
  type: prometheus
  prometheus:
    url: http://prometheus:9090
    query: 'rate(http_requests_total{status=~"5.."}[1m])'
    condition:
      operator: "<"
      threshold: 0.01
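
The condition block boils down to comparing the query result with the threshold using the given operator. A minimal Python sketch of that comparison follows. `condition_holds` is a hypothetical helper, and only the "<" and ">" operators are included because those are the ones this page shows; whether other operators are accepted is not stated here.

```python
import operator

# Only the operators that appear on this page; any others would be assumptions.
OPERATORS = {
    "<": operator.lt,
    ">": operator.gt,
}


def condition_holds(value: float, op: str, threshold: float) -> bool:
    """Return True when `value <op> threshold` is satisfied."""
    try:
        compare = OPERATORS[op]
    except KeyError:
        raise ValueError(f"unsupported operator: {op!r}") from None
    return compare(value, threshold)


# An error rate of 0.002 satisfies the "< 0.01" condition of error-rate-low.
healthy = condition_holds(0.002, "<", 0.01)
```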

Abort Conditions

Abort conditions are probes that run continuously during the Running phase. If one triggers, the experiment stops immediately.

abortConditions:
  - name: error-rate-spike
    type: prometheus
    prometheus:
      url: http://prometheus:9090
      query: 'rate(http_requests_total{status=~"5.."}[1m])'
      condition:
        operator: ">"
        threshold: 0.05
    action: abort # abort | pause | rollback

Rollback

Some actions support rollback, which reverses the chaos when the experiment ends. Network chaos actions always roll back (tc rules are removed). Pod kill has no rollback (pods are already restarted by Kubernetes). Node drain rolls back by uncordoning the node.

rollback:
  enabled: true
  timeout: 5m

ChaosWorkflow

A ChaosWorkflow chains multiple experiments into a directed acyclic graph (DAG). Each template can be an experiment, a delay, a condition, a parallel group, or a suspend point, and lists the templates it must wait for under dependencies.

apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: resilience-suite
spec:
  templates:
    - name: kill-pods
      type: experiment
      experimentRef:
        name: nginx-pod-kill
        namespace: demo
    - name: wait
      type: delay
      delay:
        duration: 30s
      dependencies: [kill-pods]
    - name: network-test
      type: experiment
      experimentRef:
        name: nginx-network-delay
        namespace: demo
      dependencies: [wait]
  errorHandling:
    strategy: abort # abort | continue | retry | rollback
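
The dependencies entries define the edges of the DAG, so a valid run order is any topological ordering of the templates. The Python sketch below derives one with the standard library's `graphlib`. It only illustrates the idea and says nothing about how ChaosPlane schedules templates internally; `execution_order` is a hypothetical helper, and the template names are taken from the example above.

```python
from graphlib import TopologicalSorter

# Template names and dependencies from the resilience-suite example.
templates = [
    {"name": "kill-pods", "dependencies": []},
    {"name": "wait", "dependencies": ["kill-pods"]},
    {"name": "network-test", "dependencies": ["wait"]},
]


def execution_order(templates):
    """Return template names in an order that respects `dependencies`."""
    graph = {t["name"]: set(t.get("dependencies", [])) for t in templates}
    for name, deps in graph.items():
        unknown = deps - graph.keys()
        if unknown:
            raise ValueError(f"{name!r} depends on undefined templates: {sorted(unknown)}")
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(templates)
```

For this three-step chain the only valid order is kill-pods, wait, network-test. Templates with no path between them could run in parallel.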

BlastRadiusPolicy

A BlastRadiusPolicy is a cluster-scoped guardrail that limits what experiments can do. It is evaluated in 7 steps: namespace scope, label scope, action type, max targets, max concurrent, time windows, and audit mode.

apiVersion: chaos.chaosplane.io/v1alpha1
kind: BlastRadiusPolicy
metadata:
  name: production-policy
spec:
  enforcement: Enforce # Enforce | Audit
  scope:
    namespaces: [production]
  targetLimits:
    maxTargets: 1
    maxPercentage: 10
  protectedResources:
    namespaces: [kube-system]
    labels:
      chaosplane.io/protected: "true"
  actionLimits:
    allowedActions: [pod-kill, network-delay]
    maxDuration: 5m
  timeWindows:
    allowed:
      - name: business-hours
        schedule: "0 9 * * 1-5"
        duration: 8h
        timezone: UTC

In Audit mode, violations are logged but not blocked. In Enforce mode, experiments that violate the policy are rejected.
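
The difference between the two modes can be sketched in a few lines of Python. This is a simplified illustration that covers only two of the checks (protected namespaces and allowed actions), not the full 7-step evaluation, and it is not ChaosPlane's code: `find_violations` and `admit` are hypothetical helpers.

```python
import logging

log = logging.getLogger("blast-radius-sketch")


def find_violations(action_type, namespace, allowed_actions, protected_namespaces):
    """Collect human-readable policy violations for one experiment."""
    violations = []
    if namespace in protected_namespaces:
        violations.append(f"namespace {namespace!r} is protected")
    if action_type not in allowed_actions:
        violations.append(f"action {action_type!r} is not in allowedActions")
    return violations


def admit(violations, enforcement):
    """Return True if the experiment may run under the given enforcement mode."""
    if enforcement not in ("Enforce", "Audit"):
        raise ValueError(f"unknown enforcement mode: {enforcement!r}")
    if not violations:
        return True
    if enforcement == "Audit":
        for violation in violations:
            log.warning("policy violation (audit only): %s", violation)
        return True   # logged, not blocked
    return False      # Enforce: rejected


# A node-drain against the example policy, which allows pod-kill and network-delay.
found = find_violations("node-drain", "production",
                        allowed_actions=["pod-kill", "network-delay"],
                        protected_namespaces=["kube-system"])
```

With the violation found above, `admit(found, "Enforce")` rejects the experiment and `admit(found, "Audit")` lets it through after logging.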

The Daemon

The ChaosPlane daemon runs as a DaemonSet on every node. It handles chaos actions that require node-level access: network manipulation (via tc/iptables), stress testing (via stress-ng), container kills (via the container runtime), and node restarts. The operator communicates with the daemon over gRPC.

Actions that only need the Kubernetes API (pod-kill, node-drain, node-taint) run directly from the operator without touching the daemon.