Operator
The ChaosPlane operator is a Kubernetes controller built with controller-runtime. It manages the lifecycle of ChaosExperiment, ChaosWorkflow, and BlastRadiusPolicy resources.
Controllers
ChaosExperiment controller
The experiment controller implements the 8-phase state machine. It reconciles on every phase transition and uses finalizers to ensure rollback runs even if the experiment is deleted mid-run.
Reconcile loop:
    Observe current phase
    │
    ├─ Pending             → run before probes   → SteadyStateChecking
    ├─ SteadyStateChecking → probes pass         → Running
    │                      → probes fail         → Failed
    ├─ Running             → duration elapsed    → Completing
    │                      → abort condition     → Aborted
    ├─ Completing          → rollback if enabled → Recovering
    │                      → no rollback         → Recovering (skip to after probes)
    ├─ Recovering          → after probes pass   → Completed
    │                      → timeout             → Failed
    └─ Terminal (Completed/Failed/Aborted) → no-op
During the Running phase the controller requeues itself with RequeueAfter so that it can poll abort conditions and track the elapsed duration.
ChaosWorkflow controller
The workflow controller builds a DAG from the spec.templates list and uses Kahn's algorithm to determine execution order.
DAG execution:
- Build adjacency list from dependencies fields
- Find all templates with in-degree 0 (no dependencies)
- Execute them concurrently (up to maxParallelism)
- As each template completes, decrement in-degree of dependent templates
- Templates with in-degree 0 become ready to execute
- Repeat until all templates are done or error handling triggers
Template execution creates or references ChaosExperiment resources and watches their status. Delay templates use time.After. Suspend templates set the workflow phase to Paused and wait for a manual resume annotation.
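The DAG steps above can be sketched with Kahn's algorithm as follows. This is an illustration under simplifying assumptions: the Template struct and the Waves function are hypothetical, templates are grouped into batches of at most maxParallelism instead of being started as soon as a slot frees up, and template names are assumed to be unique.

```go
package main

import (
	"errors"
	"sort"
)

// Template is a hypothetical, trimmed-down workflow template.
type Template struct {
	Name         string
	Dependencies []string
}

// Waves groups templates into batches whose members have no unmet
// dependencies. Each batch holds at most maxParallelism templates.
func Waves(templates []Template, maxParallelism int) ([][]string, error) {
	if maxParallelism < 1 {
		maxParallelism = 1
	}

	// Build the adjacency list (dependency -> dependents) and in-degrees.
	inDegree := make(map[string]int, len(templates))
	dependents := make(map[string][]string)
	for _, t := range templates {
		inDegree[t.Name] = 0
	}
	for _, t := range templates {
		for _, dep := range t.Dependencies {
			if _, ok := inDegree[dep]; !ok {
				return nil, errors.New("unknown dependency: " + dep)
			}
			dependents[dep] = append(dependents[dep], t.Name)
			inDegree[t.Name]++
		}
	}

	// Templates with in-degree 0 are ready immediately.
	var ready []string
	for name, d := range inDegree {
		if d == 0 {
			ready = append(ready, name)
		}
	}

	var waves [][]string
	done := 0
	for len(ready) > 0 {
		sort.Strings(ready) // deterministic order for the sketch
		n := len(ready)
		if n > maxParallelism {
			n = maxParallelism
		}
		wave := append([]string(nil), ready[:n]...)
		ready = ready[n:]

		// "Completing" a template unblocks its dependents.
		for _, name := range wave {
			for _, child := range dependents[name] {
				inDegree[child]--
				if inDegree[child] == 0 {
					ready = append(ready, child)
				}
			}
		}
		waves = append(waves, wave)
		done += len(wave)
	}

	// Anything left over is part of a cycle.
	if done != len(templates) {
		return nil, errors.New("dependency cycle detected")
	}
	return waves, nil
}
```

For a diamond-shaped workflow (b and c depend on a, d depends on both) with a parallelism limit of 2, this yields three batches: a, then b and c together, then d.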
Probe runner
The probe runner is shared between the experiment controller and the abort condition monitor. It supports three probe types:
- K8s probe: lists resources matching the selector and counts ready ones
- HTTP probe: makes an HTTP request and checks status/body
- Prometheus probe: queries the Prometheus HTTP API and evaluates the result
Probes run with a configurable timeout and retry logic.
Rollback manager
The rollback manager calls executor.Rollback() for each affected resource. It tracks which executors were called during Execute and calls them in reverse order during rollback.
Executor registry
The executor registry maps action type strings to Executor implementations:
    type Executor interface {
        Execute(ctx context.Context, exp *v1alpha1.ChaosExperiment) error
        Rollback(ctx context.Context, exp *v1alpha1.ChaosExperiment) error
        Validate(exp *v1alpha1.ChaosExperiment) error
    }
Registered action types:
| Action type | Executor | Package |
|---|---|---|
| pod-kill | KillExecutor | internal/executor/pod |
| container-kill | ContainerKillExecutor | internal/executor/pod |
| pod-cpu-stress | CPUStressExecutor | internal/executor/pod |
| pod-memory-stress | MemoryStressExecutor | internal/executor/pod |
| pod-io-stress | IOStressExecutor | internal/executor/pod |
| pod-dns-error | DNSErrorExecutor | internal/executor/pod |
| pod-http-abort | HTTPAbortExecutor | internal/executor/pod |
| pod-http-delay | HTTPDelayExecutor | internal/executor/pod |
| network-delay | DelayExecutor | internal/executor/network |
| network-loss | LossExecutor | internal/executor/network |
| network-corrupt | CorruptExecutor | internal/executor/network |
| network-duplicate | DuplicateExecutor | internal/executor/network |
| network-partition | PartitionExecutor | internal/executor/network |
| network-bandwidth | BandwidthExecutor | internal/executor/network |
| node-drain | DrainExecutor | internal/executor/node |
| node-taint | TaintExecutor | internal/executor/node |
| node-restart | RestartExecutor | internal/executor/node |
| node-cpu-stress | CPUStressExecutor | internal/executor/node |
| stress-cpu | CPUStressExecutor | internal/executor/node |
| stress-memory | MemoryStressExecutor | internal/executor/node |
Admission webhook
The validating admission webhook intercepts ChaosExperiment creates and updates. It:
- Lists all BlastRadiusPolicy resources
- Filters to policies whose scope matches the experiment's target
- Evaluates each policy's 7-step chain
- Rejects the request if any Enforce policy is violated
- Logs violations for Audit policies
The webhook is registered as a ValidatingWebhookConfiguration and requires TLS. The Helm chart handles certificate generation via cert-manager or a self-signed certificate.
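The difference between Enforce and Audit policies in the steps listed above can be sketched as follows. The example is deliberately simplified and every name in it is hypothetical: scope matching is reduced to a namespace comparison, and the 7-step chain is replaced by a single maximum-target check.

```go
package main

import "fmt"

// Mode is the policy mode.
type Mode string

const (
	Enforce Mode = "Enforce"
	Audit   Mode = "Audit"
)

// Policy is a hypothetical, heavily simplified BlastRadiusPolicy.
type Policy struct {
	Name       string
	Mode       Mode
	Namespace  string // empty matches every namespace
	MaxTargets int    // stand-in for the full evaluation chain
}

// Request describes the experiment under review.
type Request struct {
	Namespace   string
	TargetCount int
}

// Decision is what the webhook would turn into an admission response.
type Decision struct {
	Allowed bool
	Denials []string // violations of Enforce policies
	Audits  []string // violations of Audit policies, logged only
}

// Evaluate filters policies by scope, checks each one, and separates
// blocking violations from audit-only ones.
func Evaluate(policies []Policy, req Request) Decision {
	d := Decision{Allowed: true}
	for _, p := range policies {
		if p.Namespace != "" && p.Namespace != req.Namespace {
			continue // policy scope does not match the target
		}
		if req.TargetCount <= p.MaxTargets {
			continue // no violation
		}
		msg := fmt.Sprintf("policy %s: %d targets exceeds limit of %d",
			p.Name, req.TargetCount, p.MaxTargets)
		switch p.Mode {
		case Enforce:
			d.Allowed = false
			d.Denials = append(d.Denials, msg)
		case Audit:
			d.Audits = append(d.Audits, msg)
		}
	}
	return d
}
```

All policies are evaluated even after the first Enforce violation, so the rejection message can list every violated policy at once.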
RBAC
The operator service account needs:
- get, list, watch, create, update, patch, delete on ChaosExperiment, ChaosWorkflow, BlastRadiusPolicy
- get, list, watch on Pods, Nodes
- delete on Pods (for pod-kill)
- update on Nodes (for node-drain, node-taint)
- create on pods/eviction (for node-drain)
- get, list, watch, create, update, patch on Events
Leader election
The operator uses controller-runtime's built-in leader election via a Lease resource in the chaosplane namespace. Only the leader reconciles; followers stand by.
Metrics
The operator exposes Prometheus metrics on :8080/metrics:
| Metric | Type | Description |
|---|---|---|
| chaosplane_experiments_total | Counter | Total experiments by phase |
| chaosplane_experiment_duration_seconds | Histogram | Experiment duration |
| chaosplane_probe_duration_seconds | Histogram | Probe execution duration |
| chaosplane_rollback_total | Counter | Total rollbacks by result |