Workflows
A ChaosWorkflow lets you chain multiple experiments into a directed acyclic graph (DAG). Templates run in dependency order, with independent templates executing in parallel.
How the DAG engine works
The workflow controller uses Kahn's algorithm to order templates. It starts with the templates that have no dependencies, executes them, then unlocks every template whose dependencies have all finished, which forms the next layer. Templates in the same layer run concurrently, up to `maxParallelism`.
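The layering described above can be sketched in a few lines of Python. This is a minimal illustration of Kahn's algorithm, not the controller's actual code: the function name `execution_layers`, the dictionary input shape, and the template names in the demo are all assumptions made for the example.

```python
from collections import defaultdict


def execution_layers(dependencies):
    """Group templates into layers using Kahn's algorithm.

    `dependencies` maps each template name to the names it depends on
    (a hypothetical input shape chosen for this sketch). Every template
    in a returned layer has all of its dependencies in earlier layers.
    """
    in_degree = {name: len(deps) for name, deps in dependencies.items()}
    dependents = defaultdict(list)
    for name, deps in dependencies.items():
        for dep in deps:
            if dep not in dependencies:
                raise ValueError(f"{name!r} depends on unknown template {dep!r}")
            dependents[dep].append(name)

    # First layer: templates with no dependencies at all.
    ready = sorted(name for name, degree in in_degree.items() if degree == 0)
    layers = []
    while ready:
        layers.append(ready)
        unlocked = []
        for name in ready:
            for child in dependents[name]:
                in_degree[child] -= 1
                if in_degree[child] == 0:
                    unlocked.append(child)
        ready = sorted(unlocked)

    # Templates that never became ready are part of a cycle.
    if sum(len(layer) for layer in layers) != len(dependencies):
        raise ValueError("dependency cycle detected; the workflow is not a DAG")
    return layers


# Hypothetical diamond-shaped workflow.
workflow = {
    "baseline": [],
    "cpu-stress": ["baseline"],
    "memory-stress": ["baseline"],
    "verify": ["cpu-stress", "memory-stress"],
}
print(execution_layers(workflow))
# [['baseline'], ['cpu-stress', 'memory-stress'], ['verify']]
```

A useful side effect of Kahn's algorithm is cycle detection: if some templates never reach an in-degree of zero, the graph is not acyclic and the sketch raises an error instead of returning a partial order.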
Template types
There are five template types:
| Type | Description |
|---|---|
| `experiment` | References a ChaosExperiment resource |
| `delay` | Waits for a fixed duration |
| `condition` | Evaluates an expression and branches |
| `parallel` | Runs a group of templates concurrently |
| `suspend` | Pauses the workflow until manually resumed |
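
As a rough illustration of how these types constrain a workflow, the following Python sketch checks a parsed `templates` list for unknown types and for references to templates that are not defined. The function `validate_templates` and its messages are hypothetical; it is not part of `chaosctl` or the controller, and it assumes the YAML has already been loaded into plain dictionaries.

```python
TEMPLATE_TYPES = {"experiment", "delay", "condition", "parallel", "suspend"}


def validate_templates(templates):
    """Return a list of human-readable problems found in a templates list."""
    problems = []
    names = [t.get("name") for t in templates]
    known = {n for n in names if n is not None}

    for name in sorted(n for n in known if names.count(n) > 1):
        problems.append(f"duplicate template name: {name}")

    for t in templates:
        name = t.get("name")
        if name is None:
            problems.append("template without a name")
        if t.get("type") not in TEMPLATE_TYPES:
            problems.append(f"{name}: unknown type {t.get('type')!r}")

        # Collect every template name this template points at.
        referenced = list(t.get("dependencies", []))
        if t.get("type") == "parallel":
            referenced += t.get("parallel", {}).get("templates", [])
        for ref in referenced:
            if ref not in known:
                problems.append(f"{name}: references undefined template {ref!r}")
    return problems


# Hypothetical input with two deliberate mistakes.
templates = [
    {"name": "cpu-stress", "type": "experiment"},
    {"name": "both", "type": "parallel",
     "parallel": {"templates": ["cpu-stress", "memory-stress"]}},
    {"name": "pause", "type": "wait", "dependencies": ["both"]},
]
for problem in validate_templates(templates):
    print(problem)
# both: references undefined template 'memory-stress'
# pause: unknown type 'wait'
```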
Basic workflow
This workflow kills pods, waits 30 seconds, then injects network delay:
```yaml
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: resilience-suite
  namespace: production
spec:
  templates:
    - name: kill-pods
      type: experiment
      experimentRef:
        name: nginx-pod-kill
        namespace: production
    - name: recovery-wait
      type: delay
      delay:
        duration: 30s
      dependencies: [kill-pods]
    - name: network-degradation
      type: experiment
      experimentRef:
        name: nginx-network-delay
        namespace: production
      dependencies: [recovery-wait]
  errorHandling:
    strategy: abort
```
Parallel execution
Run multiple experiments at the same time using a parallel template:
```yaml
spec:
  templates:
    - name: parallel-chaos
      type: parallel
      parallel:
        templates: [cpu-stress, memory-stress]
    - name: cpu-stress
      type: experiment
      experimentRef:
        name: pod-cpu-stress
        namespace: production
    - name: memory-stress
      type: experiment
      experimentRef:
        name: pod-memory-stress
        namespace: production
    - name: verify-recovery
      type: experiment
      experimentRef:
        name: health-check
        namespace: production
      dependencies: [parallel-chaos]
```
Suspend and resume
A suspend template pauses the workflow until you resume it manually or, if a `timeout` is set, until that timeout elapses. This makes it useful for human-in-the-loop approval gates:
```yaml
spec:
  templates:
    - name: pre-chaos-check
      type: experiment
      experimentRef:
        name: baseline-check
        namespace: production
    - name: approval-gate
      type: suspend
      suspend:
        timeout: 1h  # auto-resume after 1 hour if not manually resumed
      dependencies: [pre-chaos-check]
    - name: destructive-chaos
      type: experiment
      experimentRef:
        name: node-drain
        namespace: production
      dependencies: [approval-gate]
```
Resume a suspended workflow:
```bash
chaosctl resume workflow resilience-suite -n production
```
Error handling strategies
| Strategy | Behavior |
|---|---|
| `abort` | Stop the workflow immediately on any failure |
| `continue` | Keep running remaining templates despite failures |
| `retry` | Retry the failed template (not yet configurable per-template) |
| `rollback` | Abort and attempt to roll back completed experiments |
```yaml
spec:
  errorHandling:
    strategy: rollback
```
Workflow parameters
Define parameters at the workflow level and reference them in templates:
```yaml
spec:
  parameters:
    - name: target-namespace
      default: staging
    - name: experiment-duration
      default: 60s
  templates:
    - name: pod-kill
      type: experiment
      experimentRef:
        name: pod-kill-experiment
        namespace: "{{parameters.target-namespace}}"
```
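To make the `{{parameters.<name>}}` placeholder syntax concrete, here is a minimal Python sketch of how such a reference could be resolved against the declared defaults. The function `resolve_parameters`, the optional `overrides` argument, and the error raised for undefined parameters are assumptions for illustration; the source does not specify how the controller performs substitution or whether defaults can be overridden at run time.

```python
import re

# Matches {{parameters.some-name}}, allowing optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*parameters\.([A-Za-z0-9_-]+)\s*\}\}")


def resolve_parameters(value, defaults, overrides=None):
    """Replace every {{parameters.<name>}} placeholder in a string.

    `defaults` mirrors the `default` fields under spec.parameters;
    `overrides` is a hypothetical way to supply run-time values.
    """
    values = {**defaults, **(overrides or {})}

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"undefined workflow parameter: {name}")
        return values[name]

    return PLACEHOLDER.sub(substitute, value)


defaults = {"target-namespace": "staging", "experiment-duration": "60s"}

print(resolve_parameters("{{parameters.target-namespace}}", defaults))
# staging
print(resolve_parameters("{{parameters.target-namespace}}", defaults,
                         {"target-namespace": "production"}))
# production
```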
Controlling parallelism
Limit how many templates run concurrently:
```yaml
spec:
  execution:
    maxParallelism: 2
```
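The effect of such a cap can be illustrated with a bounded worker pool. The Python sketch below assumes the limit behaves like a fixed pool size applied to one layer of independent templates; `run_layer` and `fake_template` are hypothetical names, and the controller's real scheduling may differ.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_layer(templates, run_template, max_parallelism):
    """Run independent templates, at most max_parallelism at a time."""
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        # map() returns results in the same order as the input names.
        return list(pool.map(run_template, templates))


# Demo bookkeeping: track how many templates run at the same moment.
lock = threading.Lock()
stats = {"running": 0, "peak": 0}


def fake_template(name):
    """Stand-in for a real experiment; sleeps briefly and reports success."""
    with lock:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
    time.sleep(0.05)
    with lock:
        stats["running"] -= 1
    return f"{name}: done"


results = run_layer(
    ["cpu-stress", "memory-stress", "network-delay", "pod-kill"],
    fake_template,
    max_parallelism=2,
)
print(results)
print("peak concurrency:", stats["peak"])  # never exceeds 2
```

With four templates and a limit of 2, the remaining templates wait until a worker frees up, so a lower limit trades total run time for a smaller blast radius at any one moment.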
Workflow phases
| Phase | Description |
|---|---|
| Pending | Created, not yet started |
| Running | At least one template is executing |
| Paused | Suspended at a suspend template |
| Completed | All templates finished successfully |
| Failed | One or more templates failed |
| Aborted | Manually aborted |
Monitoring workflows
```bash
# List all workflows
chaosctl list workflows -n production

# Watch a workflow
chaosctl get workflow resilience-suite -n production -w

# Get detailed status including template statuses
chaosctl describe workflow resilience-suite -n production

# Stream logs
chaosctl logs workflow resilience-suite -n production
```
Full example: multi-stage resilience test
```yaml
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: full-resilience-test
  namespace: production
spec:
  parameters:
    - name: app-namespace
      default: production
  templates:
    # Stage 1: Verify baseline
    - name: baseline
      type: experiment
      experimentRef:
        name: baseline-health-check
        namespace: production
    # Stage 2: Pod-level chaos
    - name: pod-kill-test
      type: experiment
      experimentRef:
        name: pod-kill-experiment
        namespace: production
      dependencies: [baseline]
    # Stage 3: Wait for recovery
    - name: wait-recovery
      type: delay
      delay:
        duration: 2m
      dependencies: [pod-kill-test]
    # Stage 4: Network chaos in parallel with CPU stress
    - name: combined-chaos
      type: parallel
      parallel:
        templates: [network-delay-test, cpu-stress-test]
      dependencies: [wait-recovery]
    - name: network-delay-test
      type: experiment
      experimentRef:
        name: network-delay-experiment
        namespace: production
    - name: cpu-stress-test
      type: experiment
      experimentRef:
        name: cpu-stress-experiment
        namespace: production
    # Stage 5: Final verification
    - name: final-check
      type: experiment
      experimentRef:
        name: final-health-check
        namespace: production
      dependencies: [combined-chaos]
  errorHandling:
    strategy: rollback
  execution:
    maxParallelism: 3
```