Workflows

A ChaosWorkflow lets you chain multiple experiments into a directed acyclic graph (DAG). Templates run in dependency order, with independent templates executing in parallel.

How the DAG engine works

The workflow controller uses Kahn's algorithm to order templates topologically. It starts with the templates that have no dependencies, executes them, then unlocks the next layer: the templates whose dependencies have now all finished. Templates in the same layer run concurrently, up to `maxParallelism`.
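The layering described above can be sketched in a few lines of Python. This is only an illustration of Kahn's algorithm under simple assumptions, not the controller's actual code: the function name `execution_layers` and the plain dictionary input (template name mapped to its dependency list) are hypothetical.

```python
from collections import defaultdict


def execution_layers(dependencies):
    """Group templates into layers using Kahn's algorithm.

    `dependencies` maps each template name to the names it depends on.
    Every template in a returned layer has all of its dependencies in
    earlier layers, so a layer's members could run concurrently.
    """
    indegree = {name: len(deps) for name, deps in dependencies.items()}
    dependents = defaultdict(list)
    for name, deps in dependencies.items():
        for dep in deps:
            if dep not in dependencies:
                raise ValueError(f"{name} depends on unknown template {dep}")
            dependents[dep].append(name)

    # Start with the templates that have no dependencies.
    layer = sorted(name for name, degree in indegree.items() if degree == 0)
    layers = []
    while layer:
        layers.append(layer)
        next_layer = []
        for name in layer:
            # Finishing `name` satisfies one dependency of each dependent.
            for child in dependents[name]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_layer.append(child)
        layer = sorted(next_layer)

    # Templates that never reached indegree 0 are part of a cycle.
    if sum(len(group) for group in layers) != len(dependencies):
        raise ValueError("dependency cycle detected")
    return layers


workflow = {
    "kill-pods": [],
    "recovery-wait": ["kill-pods"],
    "network-degradation": ["recovery-wait"],
}
print(execution_layers(workflow))
# [['kill-pods'], ['recovery-wait'], ['network-degradation']]
```

Two templates that share no dependency path land in the same layer, which is what makes them eligible to run in parallel.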

Template types

There are five template types:

Type	Description
`experiment`	References a `ChaosExperiment` resource
`delay`	Waits for a fixed duration
`condition`	Evaluates an expression and branches
`parallel`	Runs a group of templates concurrently
`suspend`	Pauses the workflow until manually resumed

Basic workflow

This workflow kills pods, waits 30 seconds, then injects network delay:

apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: resilience-suite
  namespace: production
spec:
  templates:
    - name: kill-pods
      type: experiment
      experimentRef:
        name: nginx-pod-kill
        namespace: production

    - name: recovery-wait
      type: delay
      delay:
        duration: 30s
      dependencies: [kill-pods]

    - name: network-degradation
      type: experiment
      experimentRef:
        name: nginx-network-delay
        namespace: production
      dependencies: [recovery-wait]

  errorHandling:
    strategy: abort

Parallel execution

Run multiple experiments at the same time using a parallel template:

spec:
  templates:
    - name: parallel-chaos
      type: parallel
      parallel:
        templates: [cpu-stress, memory-stress]

    - name: cpu-stress
      type: experiment
      experimentRef:
        name: pod-cpu-stress
        namespace: production

    - name: memory-stress
      type: experiment
      experimentRef:
        name: pod-memory-stress
        namespace: production

    - name: verify-recovery
      type: experiment
      experimentRef:
        name: health-check
        namespace: production
      dependencies: [parallel-chaos]
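Because `dependencies` and `parallel.templates` refer to other templates by name, a typo in either list is easy to miss. The following pre-flight check is a minimal sketch, assuming the spec has already been loaded into plain Python dictionaries; the helper `unknown_references` is hypothetical and not part of chaosctl, and this page does not say what validation the controller itself performs.

```python
def unknown_references(templates):
    """Return (template, field, missing_name) for every dangling reference."""
    defined = {template["name"] for template in templates}
    problems = []
    for template in templates:
        for dep in template.get("dependencies", []):
            if dep not in defined:
                problems.append((template["name"], "dependencies", dep))
        for member in template.get("parallel", {}).get("templates", []):
            if member not in defined:
                problems.append((template["name"], "parallel.templates", member))
    return problems


# The templates from the example above, reduced to the fields being checked.
templates = [
    {"name": "parallel-chaos", "type": "parallel",
     "parallel": {"templates": ["cpu-stress", "memory-stress"]}},
    {"name": "cpu-stress", "type": "experiment"},
    {"name": "memory-stress", "type": "experiment"},
    {"name": "verify-recovery", "type": "experiment",
     "dependencies": ["parallel-chaos"]},
]
print(unknown_references(templates))  # [] because every reference resolves
```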

Suspend and resume

A `suspend` template pauses the workflow until you manually resume it. This is useful for human-in-the-loop approval gates:

spec:
  templates:
    - name: pre-chaos-check
      type: experiment
      experimentRef:
        name: baseline-check
        namespace: production

    - name: approval-gate
      type: suspend
      suspend:
        timeout: 1h    # auto-resume after 1 hour if not manually resumed
      dependencies: [pre-chaos-check]

    - name: destructive-chaos
      type: experiment
      experimentRef:
        name: node-drain
        namespace: production
      dependencies: [approval-gate]

Resume a suspended workflow:

chaosctl resume workflow resilience-suite -n production

Error handling strategies

Strategy	Behavior
`abort`	Stop the workflow immediately on any failure
`continue`	Keep running remaining templates despite failures
`retry`	Retry the failed template (retry behavior is not yet configurable per template)
`rollback`	Abort and attempt to roll back completed experiments

spec:
  errorHandling:
    strategy: rollback
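To make the difference between `abort` and `continue` concrete, here is a small simulation. It is a sketch under stated assumptions, not the controller's logic: templates are treated as a simple ordered list, `run_workflow` and the status strings are hypothetical, and `retry` and `rollback` are left out because this page does not specify how many retries occur or how a rollback is performed.

```python
def run_workflow(ordered_templates, run_template, strategy="abort"):
    """Run templates in order and apply a simplified error-handling strategy."""
    if strategy not in ("abort", "continue"):
        raise ValueError(f"strategy not covered by this sketch: {strategy}")
    results = {}
    for name in ordered_templates:
        try:
            run_template(name)
            results[name] = "Succeeded"
        except Exception as exc:
            results[name] = f"Failed: {exc}"
            if strategy == "abort":
                break  # stop immediately; later templates never start
    return results


def flaky_runner(name):
    # Stand-in for running an experiment; one template always fails.
    if name == "pod-kill-test":
        raise RuntimeError("injected failure")


order = ["baseline", "pod-kill-test", "final-check"]
print(run_workflow(order, flaky_runner, "abort"))     # final-check is never run
print(run_workflow(order, flaky_runner, "continue"))  # final-check still runs
```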

Workflow parameters

Define parameters at the workflow level and reference them in templates using the `{{parameters.<name>}}` syntax:

spec:
  parameters:
    - name: target-namespace
      default: staging
    - name: experiment-duration
      default: 60s
  templates:
    - name: pod-kill
      type: experiment
      experimentRef:
        name: pod-kill-experiment
        namespace: "{{parameters.target-namespace}}"
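The substitution can be pictured as a simple string replacement. The sketch below is an assumption about how `{{parameters.<name>}}` placeholders might be resolved, not the controller's implementation: the `render` helper, its `overrides` argument, and the tolerance for whitespace inside the braces are all hypothetical.

```python
import re

# Matches {{parameters.some-name}}, optionally with spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*parameters\.([A-Za-z0-9_-]+)\s*\}\}")


def render(value, parameters, overrides=None):
    """Replace parameter placeholders in `value` with their resolved values."""
    values = {p["name"]: p.get("default") for p in parameters}
    values.update(overrides or {})

    def substitute(match):
        name = match.group(1)
        if values.get(name) is None:
            raise KeyError(f"no value for parameter {name!r}")
        return str(values[name])

    return PLACEHOLDER.sub(substitute, value)


parameters = [
    {"name": "target-namespace", "default": "staging"},
    {"name": "experiment-duration", "default": "60s"},
]
print(render("{{parameters.target-namespace}}", parameters))
# staging
print(render("{{parameters.target-namespace}}", parameters,
             {"target-namespace": "production"}))
# production
```

Failing loudly on an unresolved placeholder is a deliberate choice in this sketch, since silently leaving `{{...}}` in a namespace field would target a namespace that does not exist.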

Controlling parallelism

Limit how many templates run concurrently:

spec:
  execution:
    maxParallelism: 2
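The effect of a concurrency cap can be illustrated with a bounded thread pool. This is a rough analogy, assuming each template can be modeled as a blocking function call; `run_layer` and `fake_template` are hypothetical names and say nothing about how the controller schedules work internally.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_layer(names, run_template, max_parallelism):
    """Run every template in one layer, at most `max_parallelism` at a time."""
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        return dict(zip(names, pool.map(run_template, names)))


active = 0   # templates running right now
peak = 0     # highest value `active` ever reached
lock = threading.Lock()


def fake_template(name):
    """Stand-in for an experiment: sleeps briefly and records concurrency."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.05)
    with lock:
        active -= 1
    return "Completed"


layer = ["cpu-stress", "memory-stress", "network-delay"]
results = run_layer(layer, fake_template, max_parallelism=2)
print(results)
print("peak concurrency:", peak)  # never exceeds the cap of 2
```

With three ready templates and a cap of 2, the third template only starts once one of the first two has finished.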

Workflow phases

Phase	Description
`Pending`	Created, not yet started
`Running`	At least one template is executing
`Paused`	Suspended at a suspend template
`Completed`	All templates finished successfully
`Failed`	One or more templates failed
`Aborted`	Manually aborted

Monitoring workflows

# List all workflows
chaosctl list workflows -n production

# Watch a workflow
chaosctl get workflow resilience-suite -n production -w

# Get detailed status including template statuses
chaosctl describe workflow resilience-suite -n production

# Stream logs
chaosctl logs workflow resilience-suite -n production

Full example: multi-stage resilience test

apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: full-resilience-test
  namespace: production
spec:
  parameters:
    - name: app-namespace
      default: production

  templates:
    # Stage 1: Verify baseline
    - name: baseline
      type: experiment
      experimentRef:
        name: baseline-health-check
        namespace: production

    # Stage 2: Pod-level chaos
    - name: pod-kill-test
      type: experiment
      experimentRef:
        name: pod-kill-experiment
        namespace: production
      dependencies: [baseline]

    # Stage 3: Wait for recovery
    - name: wait-recovery
      type: delay
      delay:
        duration: 2m
      dependencies: [pod-kill-test]

    # Stage 4: Network chaos in parallel with CPU stress
    - name: combined-chaos
      type: parallel
      parallel:
        templates: [network-delay-test, cpu-stress-test]
      dependencies: [wait-recovery]

    - name: network-delay-test
      type: experiment
      experimentRef:
        name: network-delay-experiment
        namespace: production

    - name: cpu-stress-test
      type: experiment
      experimentRef:
        name: cpu-stress-experiment
        namespace: production

    # Stage 5: Final verification
    - name: final-check
      type: experiment
      experimentRef:
        name: final-health-check
        namespace: production
      dependencies: [combined-chaos]

  errorHandling:
    strategy: rollback

  execution:
    maxParallelism: 3
