Core Concepts
Understanding these concepts will help you design effective chaos experiments.
ChaosExperiment
A ChaosExperiment is the fundamental unit of work. It describes what to break, how to break it, for how long, and what "healthy" looks like before and after.
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: my-experiment
  namespace: production
spec:
  target: # what to target
  action: # what chaos to inject
  duration: # how long to run
  steadyState: # health checks before/after
  abortConditions: # auto-abort triggers
  rollback: # whether to undo changes
Experiment phases
An experiment moves through an 8-state machine:
| Phase | Description |
|---|---|
| Pending | Created, waiting to start |
| SteadyStateChecking | Running pre-chaos health probes |
| Running | Chaos is actively injected |
| Completing | Duration elapsed, wrapping up |
| Recovering | Running post-chaos health probes |
| Completed | All probes passed, experiment done |
| Failed | A probe failed or an error occurred |
| Aborted | Manually aborted or abort condition triggered |
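The phase table can be read as a small state machine. The following is a minimal sketch of that reading, not ChaosPlane's controller code. The happy path follows the order in which the table lists the phases; the rule that any non-terminal phase may move to Failed or Aborted is an assumption made for illustration.

```python
from enum import Enum


class Phase(str, Enum):
    PENDING = "Pending"
    STEADY_STATE_CHECKING = "SteadyStateChecking"
    RUNNING = "Running"
    COMPLETING = "Completing"
    RECOVERING = "Recovering"
    COMPLETED = "Completed"
    FAILED = "Failed"
    ABORTED = "Aborted"


# Assumed happy path, in the order the phase table lists the phases.
HAPPY_PATH = [
    Phase.PENDING,
    Phase.STEADY_STATE_CHECKING,
    Phase.RUNNING,
    Phase.COMPLETING,
    Phase.RECOVERING,
    Phase.COMPLETED,
]

# Phases from which no further transition is possible.
TERMINAL = {Phase.COMPLETED, Phase.FAILED, Phase.ABORTED}


def allowed_next(phase: Phase) -> set:
    """Return the phases reachable from `phase` under the assumed rules."""
    if phase in TERMINAL:
        return set()
    # Assumption: every non-terminal phase can fail or be aborted.
    reachable = {Phase.FAILED, Phase.ABORTED}
    reachable.add(HAPPY_PATH[HAPPY_PATH.index(phase) + 1])
    return reachable


def transition(current: Phase, target: Phase) -> Phase:
    """Validate a phase change and return the new phase."""
    if target not in allowed_next(current):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

For example, `transition(Phase.RUNNING, Phase.COMPLETING)` succeeds, while `transition(Phase.PENDING, Phase.RUNNING)` raises because it would skip the pre-chaos probes.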
Targets
The target spec tells ChaosPlane which resources to affect:
target:
  kind: Pod # Pod or Node
  namespace: default
  labelSelector:
    matchLabels:
      app: my-app
  # OR target by name:
  names:
    - my-pod-abc123
You can target by label selector or by explicit names. For pod actions, namespace is required.
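To make the two selection modes concrete, here is a rough sketch of how a target spec could be resolved against an in-memory list of resources. The `Resource` class and `select_targets` function are hypothetical names for this illustration, not ChaosPlane APIs. The `matchLabels` check follows standard Kubernetes semantics: every listed key/value pair must match.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    kind: str
    name: str
    namespace: str = ""  # empty for cluster-scoped resources such as nodes
    labels: dict = field(default_factory=dict)


def select_targets(resources, kind, namespace=None, match_labels=None, names=None):
    """Return resources of `kind` chosen by explicit names or by labels."""
    if kind == "Pod" and not namespace:
        raise ValueError("namespace is required for Pod targets")
    selected = []
    for res in resources:
        if res.kind != kind:
            continue
        if kind == "Pod" and res.namespace != namespace:
            continue
        if names is not None:
            # Explicit names: keep only resources listed by name.
            if res.name in names:
                selected.append(res)
        elif match_labels is not None:
            # matchLabels: all key/value pairs must be present on the resource.
            if all(res.labels.get(k) == v for k, v in match_labels.items()):
                selected.append(res)
        # With neither selector set, nothing is selected.
    return selected
```

In this sketch, explicit names take precedence when both selectors are supplied; whether ChaosPlane combines or rejects them is not stated here.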
Actions
An action specifies the type of chaos and its parameters:
action:
  type: pod-kill
  parameters:
    gracePeriodSeconds: "0"
Parameters are always strings: they are passed to the executor as a map[string]string, so quote numeric and boolean values in YAML (write "0", not 0). See the Action Reference for each action's parameters.
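Because every parameter travels as a string, whatever generates the manifest has to stringify values, and whatever consumes them has to parse them back. The sketch below illustrates that round trip with two hypothetical helpers, `to_string_params` and `int_param`. The `force` key is invented for the example and is not a documented parameter.

```python
def to_string_params(params: dict) -> dict:
    """Convert arbitrary parameter values to the string form the spec expects."""
    out = {}
    for key, value in params.items():
        if isinstance(value, bool):
            # Booleans are checked first because bool is a subclass of int.
            out[key] = "true" if value else "false"
        else:
            out[key] = str(value)
    return out


def int_param(params: dict, key: str, default: int) -> int:
    """Read an integer parameter back out of a string-valued map."""
    raw = params.get(key)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"parameter {key!r} must be an integer, got {raw!r}") from exc


# "force" is a made-up key used only to show boolean handling.
params = to_string_params({"gracePeriodSeconds": 0, "force": True})
grace = int_param(params, "gracePeriodSeconds", default=30)
```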
Action categories
- Pod actions (8): kill, container-kill, cpu-stress, memory-stress, io-stress, dns-error, http-abort, http-delay
- Network actions (6): delay, loss, corrupt, duplicate, partition, bandwidth
- Node actions (4): drain, taint, restart, cpu-stress
- Stress actions (2): stress-cpu, stress-memory (generic, not pod-scoped)
Steady-State Probes
Probes verify that your system is healthy before and after chaos. If a before probe fails, the experiment won't run. If an after probe has not passed by the time the recovery timeout expires, the experiment is marked Failed.
steadyState:
  before:
    - name: check-replicas
      type: k8s
      k8s:
        resource: pods
        namespace: default
        labelSelector: app=my-app
        condition:
          minReady: 3
  after:
    - name: check-replicas-recovered
      type: k8s
      k8s:
        resource: pods
        namespace: default
        labelSelector: app=my-app
        condition:
          minReady: 3
  recoveryTimeout: 5m
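One way to picture the recovery timeout is as a retry loop around the after probes. This sketch assumes the probe is re-run at a fixed interval until it passes or the timeout expires; the polling interval and the `wait_for_recovery` name are assumptions, and the real controller may schedule probes differently.

```python
import time


def wait_for_recovery(probe, timeout_s, interval_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Re-run `probe` until it returns True or `timeout_s` seconds elapse.

    `probe` is any zero-argument callable returning True when healthy.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True   # system recovered in time
        if clock() >= deadline:
            return False  # still unhealthy at the deadline
        sleep(interval_s)
```

With `recoveryTimeout: 5m`, a caller would pass `timeout_s=300` and treat a `False` result as the experiment ending in Failed.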
Probe types
Kubernetes probe (k8s): checks that a minimum number of matching resources are in a ready state.
HTTP probe (http): makes an HTTP request and checks the response status or body.
- name: api-healthy
  type: http
  http:
    url: https://my-api.example.com/health
    method: GET
    expectedStatus: 200
Prometheus probe (prometheus): queries a Prometheus endpoint and evaluates the result against a threshold.
- name: error-rate-low
  type: prometheus
  prometheus:
    url: http://prometheus:9090
    query: 'rate(http_requests_total{status=~"5.."}[1m])'
    condition:
      operator: "<"
      threshold: 0.01
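The operator/threshold pair in a Prometheus probe amounts to a single comparison against the query result. A minimal sketch of that evaluation follows. Only "<" and ">" appear in this guide; the other operators in the table are assumptions added for completeness.

```python
import operator

# Only "<" and ">" are shown in this guide; the rest are assumed.
_OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def condition_holds(value: float, op: str, threshold: float) -> bool:
    """Return True when `value <op> threshold` is satisfied."""
    try:
        compare = _OPERATORS[op]
    except KeyError:
        raise ValueError(f"unsupported operator: {op!r}") from None
    return compare(value, threshold)
```

For the `error-rate-low` probe, a query result of 0.002 satisfies `condition_holds(0.002, "<", 0.01)`, while a result of 0.02 does not.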
Abort Conditions
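Abort conditions are probes that run continuously during the Running phase. If one triggers, ChaosPlane reacts immediately with that condition's configured action (abort, pause, or rollback) instead of waiting for the duration to elapse.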
abortConditions:
  - name: error-rate-spike
    type: prometheus
    prometheus:
      url: http://prometheus:9090
      query: 'rate(http_requests_total{status=~"5.."}[1m])'
      condition:
        operator: ">"
        threshold: 0.05
    action: abort # abort | pause | rollback
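Continuous evaluation during the Running phase can be pictured as a polling loop that ends early as soon as any condition fires. The sketch below is illustrative only: `AbortCondition`, `monitor`, and the 5-second polling interval are assumptions, and each condition is reduced to a callable that reports whether it has triggered.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AbortCondition:
    name: str
    triggered: Callable[[], bool]  # returns True when the condition fires
    action: str = "abort"          # abort | pause | rollback


def monitor(conditions, duration_s, interval_s=5.0,
            clock=time.monotonic, sleep=time.sleep) -> Optional[AbortCondition]:
    """Poll conditions until one fires or the experiment duration elapses.

    Returns the first condition that fired, or None if the run completed.
    """
    deadline = clock() + duration_s
    while clock() < deadline:
        for cond in conditions:
            if cond.triggered():
                return cond  # caller applies cond.action
        sleep(interval_s)
    return None
```

A caller would inspect the returned condition's `action` field to decide whether to abort, pause, or roll back.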
Rollback
Some actions support rollback, which reverses the chaos when the experiment ends. Network chaos actions always roll back (tc rules are removed). Pod kill has no rollback (pods are already restarted by Kubernetes). Node drain rolls back by uncordoning the node.
rollback:
  enabled: true
  timeout: 5m
ChaosWorkflow
A ChaosWorkflow chains multiple experiments into a DAG. Templates can be experiments, delays, conditions, parallel groups, or suspend points.
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: resilience-suite
spec:
  templates:
    - name: kill-pods
      type: experiment
      experimentRef:
        name: nginx-pod-kill
        namespace: demo
    - name: wait
      type: delay
      delay:
        duration: 30s
      dependencies: [kill-pods]
    - name: network-test
      type: experiment
      experimentRef:
        name: nginx-network-delay
        namespace: demo
      dependencies: [wait]
  errorHandling:
    strategy: abort # abort | continue | retry | rollback
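Since templates form a DAG through their `dependencies` lists, a valid execution order is any topological ordering of that graph. The sketch below derives one with Python's standard-library `graphlib` (Python 3.9+). It shows the ordering idea only and makes no claim about how ChaosPlane schedules templates; `execution_order` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter

# The three templates from the resilience-suite workflow, reduced to
# the fields that matter for ordering.
templates = [
    {"name": "kill-pods", "type": "experiment"},
    {"name": "wait", "type": "delay", "dependencies": ["kill-pods"]},
    {"name": "network-test", "type": "experiment", "dependencies": ["wait"]},
]


def execution_order(templates):
    """Return template names in an order that respects their dependencies."""
    names = {t["name"] for t in templates}
    graph = {}
    for t in templates:
        deps = t.get("dependencies", [])
        unknown = [d for d in deps if d not in names]
        if unknown:
            raise ValueError(f"{t['name']} depends on unknown templates: {unknown}")
        # TopologicalSorter expects a mapping of node -> predecessors.
        graph[t["name"]] = set(deps)
    # static_order() raises CycleError if the dependencies contain a cycle.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(templates)
```

Templates with no path between them could run in parallel; `TopologicalSorter.get_ready()` exposes that grouping if needed.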
BlastRadiusPolicy
A BlastRadiusPolicy is a cluster-scoped guardrail that limits what experiments can do. It is evaluated in 7 steps: namespace scope, label scope, action type, max targets, max concurrent, time windows, and audit mode.
apiVersion: chaos.chaosplane.io/v1alpha1
kind: BlastRadiusPolicy
metadata:
  name: production-policy
spec:
  enforcement: Enforce # Enforce | Audit
  scope:
    namespaces: [production]
  targetLimits:
    maxTargets: 1
    maxPercentage: 10
  protectedResources:
    namespaces: [kube-system]
    labels:
      chaosplane.io/protected: "true"
  actionLimits:
    allowedActions: [pod-kill, network-delay]
    maxDuration: 5m
  timeWindows:
    allowed:
      - name: business-hours
        schedule: "0 9 * * 1-5"
        duration: 8h
        timezone: UTC
In Audit mode, violations are logged but not blocked. In Enforce mode, experiments that violate the policy are rejected.
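The difference between the two enforcement modes is easiest to see in code. The sketch below covers only a subset of the evaluation (namespace scope, action type, max targets, and max duration) and leaves out label scope, concurrency, percentages, and time windows. `Policy`, `Request`, and `admit` are hypothetical names, and treating out-of-scope namespaces as "policy does not apply" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    enforcement: str = "Enforce"  # Enforce | Audit
    scope_namespaces: list = field(default_factory=list)
    allowed_actions: list = field(default_factory=list)
    max_targets: int = 1
    max_duration_s: int = 300


@dataclass
class Request:
    namespace: str
    action: str
    target_count: int
    duration_s: int


def find_violations(policy: Policy, req: Request) -> list:
    """Collect human-readable violations for the checks this sketch covers."""
    violations = []
    if policy.allowed_actions and req.action not in policy.allowed_actions:
        violations.append(f"action {req.action!r} is not allowed")
    if req.target_count > policy.max_targets:
        violations.append(f"{req.target_count} targets exceeds maxTargets={policy.max_targets}")
    if req.duration_s > policy.max_duration_s:
        violations.append(f"duration {req.duration_s}s exceeds maxDuration={policy.max_duration_s}s")
    return violations


def admit(policy: Policy, req: Request):
    """Return (allowed, violations) for an experiment request."""
    # Assumption: the policy only applies to namespaces in its scope.
    if req.namespace not in policy.scope_namespaces:
        return True, []
    violations = find_violations(policy, req)
    # Audit mode reports violations but never blocks the experiment.
    allowed = policy.enforcement == "Audit" or not violations
    return allowed, violations
```

Switching `enforcement` from `"Enforce"` to `"Audit"` leaves the violation list unchanged and flips only the `allowed` flag, which matches the logged-but-not-blocked behavior described above.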
The Daemon
The ChaosPlane daemon runs as a DaemonSet on every node. It handles chaos actions that require node-level access: network manipulation (via tc/iptables), stress testing (via stress-ng), container kills (via the container runtime), and node restarts. The operator communicates with the daemon over gRPC.
Actions that only need the Kubernetes API (pod-kill, node-drain, node-taint) run directly from the operator without touching the daemon.
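The operator/daemon split boils down to a lookup on the action type. In this sketch, the API-only set contains exactly the three actions named above, and everything else is routed to the daemon. The `executor_for` function is a hypothetical name, and routing unknown action types to the daemon is a simplification.

```python
# Actions named in this guide as needing only the Kubernetes API.
API_ONLY_ACTIONS = frozenset({"pod-kill", "node-drain", "node-taint"})


def executor_for(action_type: str) -> str:
    """Return which component would carry out the given action type."""
    if action_type in API_ONLY_ACTIONS:
        return "operator"  # runs directly against the Kubernetes API
    return "daemon"        # needs node-level access, reached over gRPC
```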