Workflows

A ChaosWorkflow lets you chain multiple experiments into a directed acyclic graph (DAG). Templates run in dependency order, with independent templates executing in parallel.

How the DAG engine works

The workflow controller uses Kahn's algorithm to schedule templates in topological order. It starts with the templates that have no dependencies, executes them, then unlocks the templates whose dependencies are now satisfied. Templates in the same layer run concurrently, up to maxParallelism.
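
The layering described above can be sketched in a few lines of Python. This is only an illustration of Kahn's algorithm, not the controller's actual code: the function name `dag_layers`, the dictionary input format, and the template names in the example are all assumptions made for the sketch.

```python
def dag_layers(dependencies):
    """Group templates into layers using Kahn's algorithm.

    `dependencies` maps each template name to the names it depends on
    (a hypothetical input format chosen for this sketch).
    Raises ValueError for unknown dependencies or cycles.
    """
    # Count unmet dependencies per template and index the reverse edges.
    indegree = {name: len(deps) for name, deps in dependencies.items()}
    dependents = {name: [] for name in dependencies}
    for name, deps in dependencies.items():
        for dep in deps:
            if dep not in dependencies:
                raise ValueError(f"{name} depends on unknown template {dep}")
            dependents[dep].append(name)

    layers = []
    # The first layer holds every template with no dependencies.
    ready = sorted(n for n, count in indegree.items() if count == 0)
    while ready:
        layers.append(ready)
        unlocked = []
        for name in ready:
            for child in dependents[name]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    unlocked.append(child)
        ready = sorted(unlocked)

    # Templates that were never unlocked must be part of a cycle.
    if sum(len(layer) for layer in layers) != len(dependencies):
        raise ValueError("dependency cycle detected")
    return layers


# Hypothetical diamond-shaped workflow: two experiments share one parent.
example = {
    "baseline": [],
    "pod-kill": ["baseline"],
    "cpu-stress": ["baseline"],
    "final-check": ["pod-kill", "cpu-stress"],
}
print(dag_layers(example))
# [['baseline'], ['cpu-stress', 'pod-kill'], ['final-check']]
```

Each inner list is one layer: everything in it can start as soon as the previous layer has finished, which is where the concurrency comes from.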

Template types

There are 5 template types:

| Type | Description |
| --- | --- |
| experiment | References a ChaosExperiment resource |
| delay | Waits for a fixed duration |
| condition | Evaluates an expression and branches |
| parallel | Runs a group of templates concurrently |
| suspend | Pauses the workflow until manually resumed |

Basic workflow

This workflow kills pods, waits 30 seconds, then injects network delay:

apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: resilience-suite
  namespace: production
spec:
  templates:
    - name: kill-pods
      type: experiment
      experimentRef:
        name: nginx-pod-kill
        namespace: production

    - name: recovery-wait
      type: delay
      delay:
        duration: 30s
      dependencies: [kill-pods]

    - name: network-degradation
      type: experiment
      experimentRef:
        name: nginx-network-delay
        namespace: production
      dependencies: [recovery-wait]

  errorHandling:
    strategy: abort

Parallel execution

Run multiple experiments at the same time using a parallel template:

spec:
  templates:
    - name: parallel-chaos
      type: parallel
      parallel:
        templates: [cpu-stress, memory-stress]

    - name: cpu-stress
      type: experiment
      experimentRef:
        name: pod-cpu-stress
        namespace: production

    - name: memory-stress
      type: experiment
      experimentRef:
        name: pod-memory-stress
        namespace: production

    - name: verify-recovery
      type: experiment
      experimentRef:
        name: health-check
        namespace: production
      dependencies: [parallel-chaos]

Suspend and resume

A suspend template pauses the workflow until you manually resume it. This is useful for human-in-the-loop approval gates:

spec:
  templates:
    - name: pre-chaos-check
      type: experiment
      experimentRef:
        name: baseline-check
        namespace: production

    - name: approval-gate
      type: suspend
      suspend:
        timeout: 1h  # auto-resume after 1 hour if not manually resumed
      dependencies: [pre-chaos-check]

    - name: destructive-chaos
      type: experiment
      experimentRef:
        name: node-drain
        namespace: production
      dependencies: [approval-gate]

Resume a suspended workflow:

chaosctl resume workflow resilience-suite -n production
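
To make the gate semantics concrete, here is a minimal Python sketch of "wait until resumed, or auto-resume when the timeout expires". It assumes the gate can be modelled as a blocking wait on a signal; the function name `wait_at_gate` and the returned strings are hypothetical and do not come from the controller or from chaosctl.

```python
import threading


def wait_at_gate(resume_signal, timeout_seconds):
    """Block until the signal is set or the timeout elapses.

    Event.wait returns True when the flag was set and False on timeout,
    which maps onto manual resume versus auto-resume.
    """
    resumed = resume_signal.wait(timeout=timeout_seconds)
    return "resumed-manually" if resumed else "auto-resumed-after-timeout"


# Simulate an operator resuming the workflow shortly after it pauses.
gate = threading.Event()
threading.Timer(0.05, gate.set).start()
print(wait_at_gate(gate, timeout_seconds=1.0))   # resumed-manually

# Nobody resumes this one, so the (shortened) timeout takes over.
print(wait_at_gate(threading.Event(), timeout_seconds=0.05))  # auto-resumed-after-timeout
```

In the workflow above the timeout would be one hour rather than a fraction of a second; the short values only keep the example quick to run.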

Error handling strategies

| Strategy | Behavior |
| --- | --- |
| abort | Stop the workflow immediately on any failure |
| continue | Keep running remaining templates despite failures |
| retry | Retry the failed template (not yet configurable per-template) |
| rollback | Abort and attempt to roll back completed experiments |

spec:
  errorHandling:
    strategy: rollback
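
The difference between the strategies is easiest to see on a small run where one template fails. The Python sketch below is a simplified model under stated assumptions: templates run strictly in sequence, rollback undoes completed templates newest first, and retry is left out because the number of attempts is not documented here. The function name `run_sequence` and the template names are hypothetical.

```python
def run_sequence(templates, outcomes, strategy):
    """Run templates in order and apply a workflow-level error strategy.

    `outcomes` maps a template name to True (success) or False (failure);
    templates missing from it are treated as successful.
    Returns (executed, rolled_back).
    """
    if strategy not in ("abort", "continue", "rollback"):
        raise ValueError(f"unsupported strategy in this sketch: {strategy}")

    executed, rolled_back = [], []
    for name in templates:
        executed.append(name)
        if outcomes.get(name, True):
            continue  # success: move on to the next template
        if strategy == "continue":
            continue  # failure is tolerated: keep running the rest
        if strategy == "rollback":
            # Assumption: undo what completed before the failure, newest first.
            rolled_back = list(reversed(executed[:-1]))
        break  # abort and rollback both stop the workflow here
    return executed, rolled_back


steps = ["baseline", "pod-kill", "final-check"]
failing = {"pod-kill": False}
for strategy in ("abort", "continue", "rollback"):
    print(strategy, run_sequence(steps, failing, strategy))
# abort (['baseline', 'pod-kill'], [])
# continue (['baseline', 'pod-kill', 'final-check'], [])
# rollback (['baseline', 'pod-kill'], ['baseline'])
```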

Workflow parameters

Define parameters at the workflow level and reference them in templates:

spec:
  parameters:
    - name: target-namespace
      default: staging
    - name: experiment-duration
      default: 60s
  templates:
    - name: pod-kill
      type: experiment
      experimentRef:
        name: pod-kill-experiment
        namespace: "{{parameters.target-namespace}}"
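
As a rough illustration of how a `{{parameters.<name>}}` reference could be resolved, the Python sketch below substitutes placeholders from a dictionary of parameter values. The regular expression, the `render` function, and the error raised for an undefined parameter are assumptions for this example; the controller's real templating rules may differ.

```python
import re

# Matches {{parameters.some-name}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*parameters\.([A-Za-z0-9_-]+)\s*\}\}")


def render(value, parameters):
    """Replace every {{parameters.<name>}} in `value` with its value."""

    def substitute(match):
        name = match.group(1)
        if name not in parameters:
            raise KeyError(f"undefined workflow parameter: {name}")
        return parameters[name]

    return PLACEHOLDER.sub(substitute, value)


# Defaults taken from the workflow above.
defaults = {"target-namespace": "staging", "experiment-duration": "60s"}
print(render("{{parameters.target-namespace}}", defaults))  # staging
```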

Controlling parallelism

Limit how many templates run concurrently:

spec:
  execution:
    maxParallelism: 2
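
The effect of the cap can be sketched with a bounded worker pool in Python. This is only a model of the behaviour, assuming the limit works like a pool with `maxParallelism` workers; `run_layer` and `fake_experiment` are hypothetical names, and the sleep stands in for a real experiment.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_layer(templates, max_parallelism, action):
    """Run one layer's templates with at most `max_parallelism` in flight."""
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        # map() returns results in input order, whatever the completion order.
        return list(pool.map(action, templates))


lock = threading.Lock()
running = 0
peak = 0


def fake_experiment(name):
    """Track how many experiments overlap while this one 'runs'."""
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    time.sleep(0.05)  # stand-in for the real experiment
    with lock:
        running -= 1
    return name


done = run_layer(["cpu-stress", "memory-stress", "network-delay"], 2, fake_experiment)
print(done, "peak concurrency:", peak)  # peak never exceeds 2
```

With three ready templates and a limit of 2, the third one only starts once one of the first two has finished.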

Workflow phases

| Phase | Description |
| --- | --- |
| Pending | Created, not yet started |
| Running | At least one template is executing |
| Paused | Suspended at a suspend template |
| Completed | All templates finished successfully |
| Failed | One or more templates failed |
| Aborted | Manually aborted |

Monitoring workflows

# List all workflows
chaosctl list workflows -n production

# Watch a workflow
chaosctl get workflow resilience-suite -n production -w

# Get detailed status including template statuses
chaosctl describe workflow resilience-suite -n production

# Stream logs
chaosctl logs workflow resilience-suite -n production

Full example: multi-stage resilience test

apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: full-resilience-test
  namespace: production
spec:
  parameters:
    - name: app-namespace
      default: production

  templates:
    # Stage 1: Verify baseline
    - name: baseline
      type: experiment
      experimentRef:
        name: baseline-health-check
        namespace: production

    # Stage 2: Pod-level chaos
    - name: pod-kill-test
      type: experiment
      experimentRef:
        name: pod-kill-experiment
        namespace: production
      dependencies: [baseline]

    # Stage 3: Wait for recovery
    - name: wait-recovery
      type: delay
      delay:
        duration: 2m
      dependencies: [pod-kill-test]

    # Stage 4: Network chaos in parallel with CPU stress
    - name: combined-chaos
      type: parallel
      parallel:
        templates: [network-delay-test, cpu-stress-test]
      dependencies: [wait-recovery]

    - name: network-delay-test
      type: experiment
      experimentRef:
        name: network-delay-experiment
        namespace: production

    - name: cpu-stress-test
      type: experiment
      experimentRef:
        name: cpu-stress-experiment
        namespace: production

    # Stage 5: Final verification
    - name: final-check
      type: experiment
      experimentRef:
        name: final-health-check
        namespace: production
      dependencies: [combined-chaos]

  errorHandling:
    strategy: rollback

  execution:
    maxParallelism: 3