Core Concepts

Understanding these concepts will help you design effective chaos experiments.

ChaosExperiment

A ChaosExperiment is the fundamental unit of work. It describes what to break, how to break it, for how long, and what "healthy" looks like before and after.

apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: my-experiment
  namespace: production
spec:
  target:        # what to target
  action:        # what chaos to inject
  duration:      # how long to run
  steadyState:   # health checks before/after
  abortConditions: # auto-abort triggers
  rollback:      # whether to undo changes
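
As a rough illustration, the skeleton above might be filled in as follows. The field values are borrowed from the examples later on this page; the experiment name, the 2m duration, and the choice of pod-kill are placeholders, and the duration format is assumed to match the one used for recoveryTimeout.

apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: my-app-pod-kill          # hypothetical name
  namespace: production
spec:
  target:
    kind: Pod
    namespace: production
    labelSelector:
      matchLabels:
        app: my-app
  action:
    type: pod-kill
    parameters:
      gracePeriodSeconds: "0"    # parameters are strings, so the value is quoted
  duration: 2m                   # assumed format, same as recoveryTimeout
  steadyState:
    before:
      - name: check-replicas
        type: k8s
        k8s:
          resource: pods
          namespace: production
          labelSelector: app=my-app
          condition:
            minReady: 3
    after:
      - name: check-replicas-recovered
        type: k8s
        k8s:
          resource: pods
          namespace: production
          labelSelector: app=my-app
          condition:
            minReady: 3
    recoveryTimeout: 5m
  rollback:
    enabled: false               # pod-kill has nothing to roll back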

Experiment phases

An experiment moves through an 8-state machine:

Phase	Description
`Pending`	Created, waiting to start
`SteadyStateChecking`	Running pre-chaos health probes
`Running`	Chaos is actively injected
`Completing`	Duration elapsed, wrapping up
`Recovering`	Running post-chaos health probes
`Completed`	All probes passed, experiment done
`Failed`	A probe failed or an error occurred
`Aborted`	Manually aborted or abort condition triggered
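
One way to read the table is sketched below as plain YAML. This is only an illustration of the ordering implied by the descriptions, not a ChaosPlane resource, and the key names are made up for the sketch.

# Illustrative only: not a ChaosPlane resource
happyPath:
  - Pending
  - SteadyStateChecking
  - Running
  - Completing
  - Recovering
  - Completed
terminalPhases:
  - Completed   # all probes passed
  - Failed      # a probe failed or an error occurred
  - Aborted     # manual abort or abort condition triggered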

Targets

The target spec tells ChaosPlane which resources to affect:

target:
  kind: Pod          # Pod or Node
  namespace: default
  labelSelector:
    matchLabels:
      app: my-app
  # OR target by name (instead of labelSelector):
  # names:
  #   - my-pod-abc123

You can target by label selector or by explicit names. For pod actions, namespace is required.
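
For comparison, a node-scoped target selected by name might look like the sketch below. The node name is hypothetical, and leaving out namespace is an assumption based on nodes being cluster-scoped in Kubernetes.

target:
  kind: Node
  names:
    - worker-node-1    # hypothetical node name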

Actions

An action specifies the type of chaos and its parameters:

action:
  type: pod-kill
  parameters:
    gracePeriodSeconds: "0"

Parameters are always strings (they're passed to the executor as a map[string]string), so quote numeric and boolean values in YAML. See the Action Reference for each action's parameters.
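
The difference is easy to miss in YAML, where an unquoted number is parsed as an integer. A minimal sketch reusing gracePeriodSeconds from the example above; the value 30 is arbitrary, and how an unquoted value is handled at admission time is not covered here.

action:
  type: pod-kill
  parameters:
    # gracePeriodSeconds: 30     # unquoted: YAML reads this as an integer
    gracePeriodSeconds: "30"     # quoted: a string, which fits map[string]string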

Action categories

Pod actions (8): kill, container-kill, cpu-stress, memory-stress, io-stress, dns-error, http-abort, http-delay
Network actions (6): delay, loss, corrupt, duplicate, partition, bandwidth
Node actions (4): drain, taint, restart, cpu-stress
Stress actions (2): stress-cpu, stress-memory (generic, not pod-scoped)

Elsewhere on this page, pod, network, and node actions appear with a category prefix in the type field (for example pod-kill, network-delay, and node-drain). Check the Action Reference for the exact type value of each action.

Steady-State Probes

Probes verify that your system is healthy before and after chaos. If a before probe fails, the experiment does not run. If an after probe is still failing when the recovery timeout expires, the experiment is marked Failed.

steadyState:
  before:
    - name: check-replicas
      type: k8s
      k8s:
        resource: pods
        namespace: default
        labelSelector: app=my-app
        condition:
          minReady: 3
  after:
    - name: check-replicas-recovered
      type: k8s
      k8s:
        resource: pods
        namespace: default
        labelSelector: app=my-app
        condition:
          minReady: 3
  recoveryTimeout: 5m

Probe types

Kubernetes probe (k8s): checks that a minimum number of matching resources are in a ready state.

HTTP probe (http): makes an HTTP request and checks the response status or body.

- name: api-healthy
  type: http
  http:
    url: https://my-api.example.com/health
    method: GET
    expectedStatus: 200

Prometheus probe (prometheus): queries a Prometheus endpoint and evaluates the result against a threshold.

- name: error-rate-low
  type: prometheus
  prometheus:
    url: http://prometheus:9090
    query: 'rate(http_requests_total{status=~"5.."}[1m])'
    condition:
      operator: "<"
      threshold: 0.01
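
Because before and after are both lists, probe types can presumably be mixed within one steadyState block. The sketch below reuses the three probes shown above; combining them this way is an assumption rather than something this page states explicitly.

steadyState:
  before:
    # Kubernetes probe: enough replicas are ready
    - name: check-replicas
      type: k8s
      k8s:
        resource: pods
        namespace: default
        labelSelector: app=my-app
        condition:
          minReady: 3
    # HTTP probe: the health endpoint answers 200
    - name: api-healthy
      type: http
      http:
        url: https://my-api.example.com/health
        method: GET
        expectedStatus: 200
  after:
    # Prometheus probe: the 5xx rate is back under the threshold
    - name: error-rate-low
      type: prometheus
      prometheus:
        url: http://prometheus:9090
        query: 'rate(http_requests_total{status=~"5.."}[1m])'
        condition:
          operator: "<"
          threshold: 0.01
  recoveryTimeout: 5m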

Abort Conditions

Abort conditions are probes that run continuously during the Running phase. When one triggers, ChaosPlane immediately applies that condition's action: abort, pause, or rollback.

abortConditions:
  - name: error-rate-spike
    type: prometheus
    prometheus:
      url: http://prometheus:9090
      query: 'rate(http_requests_total{status=~"5.."}[1m])'
      condition:
        operator: ">"
        threshold: 0.05
    action: abort   # abort | pause | rollback
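
Since abortConditions is a list and each entry carries its own action, thresholds can be tiered. The sketch below pauses at a lower error rate and aborts at a higher one. The condition names and the 0.02 threshold are illustrative, and how ChaosPlane resolves two conditions triggering at once is not specified on this page.

abortConditions:
  # Lower tier: pause the experiment when errors start to climb
  - name: error-rate-elevated
    type: prometheus
    prometheus:
      url: http://prometheus:9090
      query: 'rate(http_requests_total{status=~"5.."}[1m])'
      condition:
        operator: ">"
        threshold: 0.02        # illustrative value
    action: pause
  # Upper tier: abort outright on a larger spike
  - name: error-rate-spike
    type: prometheus
    prometheus:
      url: http://prometheus:9090
      query: 'rate(http_requests_total{status=~"5.."}[1m])'
      condition:
        operator: ">"
        threshold: 0.05
    action: abort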

Rollback

Some actions support rollback, which reverses the chaos when the experiment ends. Network chaos actions always roll back (the tc rules are removed). Node drain rolls back by uncordoning the node. Pod kill has no rollback, because Kubernetes has already recreated the killed pods.

rollback:
  enabled: true
  timeout: 5m
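
As an example of a rollback-capable action, a node-drain experiment might enable rollback so that the node is uncordoned when the experiment ends. This is a sketch: the experiment name, node name, and 10m duration are hypothetical, and it assumes node-drain can run without parameters.

apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: drain-worker           # hypothetical name
  namespace: default
spec:
  target:
    kind: Node
    names:
      - worker-node-1          # hypothetical node name
  action:
    type: node-drain
  duration: 10m                # assumed format, same as timeout below
  rollback:
    enabled: true              # uncordon the node when the experiment ends
    timeout: 5m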

ChaosWorkflow

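A ChaosWorkflow chains multiple experiments into a directed acyclic graph (DAG). Each node in the graph is a template, which can be an experiment, a delay, a condition, a parallel group, or a suspend point. The dependencies field on a template lists the templates that must finish before it starts.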

apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: resilience-suite
spec:
  templates:
    - name: kill-pods
      type: experiment
      experimentRef:
        name: nginx-pod-kill
        namespace: demo
    - name: wait
      type: delay
      delay:
        duration: 30s
      dependencies: [kill-pods]
    - name: network-test
      type: experiment
      experimentRef:
        name: nginx-network-delay
        namespace: demo
      dependencies: [wait]
  errorHandling:
    strategy: abort   # abort | continue | retry | rollback
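
The dependencies field is a list, so a template can wait on more than one predecessor. The sketch below fans out to two experiments after a shared delay and joins them again before a final delay. It assumes that templates with no dependency between them may run concurrently; the workflow and template names are hypothetical, and the referenced experiments are the ones from the example above.

apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: fan-out-suite          # hypothetical name
spec:
  templates:
    - name: warm-up
      type: delay
      delay:
        duration: 30s
    # Both experiments depend only on warm-up
    - name: kill-pods
      type: experiment
      experimentRef:
        name: nginx-pod-kill
        namespace: demo
      dependencies: [warm-up]
    - name: network-test
      type: experiment
      experimentRef:
        name: nginx-network-delay
        namespace: demo
      dependencies: [warm-up]
    # Join: starts only after both experiments have finished
    - name: cool-down
      type: delay
      delay:
        duration: 30s
      dependencies: [kill-pods, network-test]
  errorHandling:
    strategy: continue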

BlastRadiusPolicy

A BlastRadiusPolicy is a cluster-scoped guardrail that limits what experiments can do. Each experiment is evaluated against the policy in 7 steps: namespace scope, label scope, action type, max targets, max concurrent, time windows, and audit mode.

apiVersion: chaos.chaosplane.io/v1alpha1
kind: BlastRadiusPolicy
metadata:
  name: production-policy
spec:
  enforcement: Enforce   # Enforce | Audit
  scope:
    namespaces: [production]
  targetLimits:
    maxTargets: 1
    maxPercentage: 10
  protectedResources:
    namespaces: [kube-system]
    labels:
      chaosplane.io/protected: "true"
  actionLimits:
    allowedActions: [pod-kill, network-delay]
    maxDuration: 5m
  timeWindows:
    allowed:
      - name: business-hours
        schedule: "0 9 * * 1-5"
        duration: 8h
        timezone: UTC

In Audit mode, violations are logged but not blocked. In Enforce mode, experiments that violate the policy are rejected.
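
One way to introduce a policy is to start in Audit mode, review the logged violations, and switch to Enforce afterwards. A minimal sketch using only fields from the example above; the policy name, the staging namespace, and the limit values are hypothetical, and it assumes that omitted sections such as timeWindows are optional.

apiVersion: chaos.chaosplane.io/v1alpha1
kind: BlastRadiusPolicy
metadata:
  name: staging-audit-policy   # hypothetical name
spec:
  enforcement: Audit           # log violations without blocking experiments
  scope:
    namespaces: [staging]      # hypothetical namespace
  targetLimits:
    maxTargets: 3
  actionLimits:
    allowedActions: [pod-kill]
    maxDuration: 5m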

The Daemon

The ChaosPlane daemon runs as a DaemonSet on every node. It handles chaos actions that require node-level access: network manipulation (via tc/iptables), stress testing (via stress-ng), container kills (via the container runtime), and node restarts. The operator communicates with the daemon over gRPC.

Actions that only need the Kubernetes API (pod-kill, node-drain, node-taint) run directly from the operator without touching the daemon.
