Operator

The ChaosPlane operator is a Kubernetes controller built with controller-runtime. It manages the lifecycle of ChaosExperiment, ChaosWorkflow, and BlastRadiusPolicy resources.

Controllers

ChaosExperiment controller

The experiment controller implements the 8-phase state machine. It reconciles on every phase transition and uses finalizers to ensure rollback runs even if the experiment is deleted mid-run.

Reconcile loop:

Observe current phase

├─ Pending             → run before probes   → SteadyStateChecking
├─ SteadyStateChecking → probes pass         → Running
│                      → probes fail         → Failed
├─ Running             → duration elapsed    → Completing
│                      → abort condition     → Aborted
├─ Completing          → rollback if enabled → Recovering
│                      → no rollback         → Recovering (skip to after probes)
├─ Recovering          → after probes pass   → Completed
│                      → timeout             → Failed
└─ Terminal (Completed/Failed/Aborted) → no-op

The controller uses requeueAfter to poll during the Running phase for abort conditions and duration tracking.
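
The transitions above can be sketched as a pure function that maps the current phase and what a reconcile pass observed to the next phase and a requeue delay. This is a minimal sketch, not the operator's code: the `Observation` struct, the `Next` function, the 5-second `pollInterval`, the choice to check abort conditions before the duration, and polling during Recovering are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Phase names mirror the reconcile diagram.
type Phase string

const (
	Pending             Phase = "Pending"
	SteadyStateChecking Phase = "SteadyStateChecking"
	Running             Phase = "Running"
	Completing          Phase = "Completing"
	Recovering          Phase = "Recovering"
	Completed           Phase = "Completed"
	Failed              Phase = "Failed"
	Aborted             Phase = "Aborted"
)

// pollInterval is an assumed requeue delay, not a documented value.
const pollInterval = 5 * time.Second

// Observation is a hypothetical snapshot of one reconcile pass.
type Observation struct {
	ProbesPassed bool          // result of the probes relevant to the current phase
	AbortHit     bool          // an abort condition fired while Running
	Elapsed      time.Duration // time spent in the current phase
	Duration     time.Duration // configured experiment duration
	Timeout      time.Duration // recovery timeout
}

// Next returns the phase to move to and how long to wait before the
// next reconcile. A zero delay means no polling is needed.
func Next(p Phase, o Observation) (Phase, time.Duration) {
	switch p {
	case Pending:
		return SteadyStateChecking, 0
	case SteadyStateChecking:
		if o.ProbesPassed {
			return Running, 0
		}
		return Failed, 0
	case Running:
		if o.AbortHit { // assumption: abort wins over duration expiry
			return Aborted, 0
		}
		if o.Elapsed >= o.Duration {
			return Completing, 0
		}
		return Running, pollInterval
	case Completing:
		// Rollback (if enabled) happens here; both branches lead to Recovering.
		return Recovering, 0
	case Recovering:
		if o.ProbesPassed {
			return Completed, 0
		}
		if o.Elapsed >= o.Timeout {
			return Failed, 0
		}
		return Recovering, pollInterval // assumption: keep polling until timeout
	default:
		return p, 0 // terminal phases are a no-op
	}
}

func main() {
	phase := Pending
	obs := Observation{ProbesPassed: true, Elapsed: time.Minute, Duration: time.Minute}
	for i := 0; i < 6; i++ {
		next, requeue := Next(phase, obs)
		fmt.Printf("%s -> %s (requeue after %s)\n", phase, next, requeue)
		phase = next
	}
}
```

Keeping the transition logic free of API calls makes each edge of the state machine testable without a cluster; a real reconciler would wrap a function like this and translate the delay into its requeue result.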

ChaosWorkflow controller

The workflow controller builds a DAG from the spec.templates list and uses Kahn's algorithm to determine execution order.

DAG execution:

  1. Build an adjacency list from each template's dependencies field
  2. Find all templates with in-degree 0 (no dependencies)
  3. Execute them concurrently (up to maxParallelism)
  4. As each template completes, decrement the in-degree of its dependent templates
  5. Templates whose in-degree reaches 0 become ready to execute
  6. Repeat until all templates are done or error handling triggers
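
The ordering part of these steps can be sketched with Kahn's algorithm. The example below is a simplified illustration: the `Template` struct and `Waves` function are hypothetical names, and templates are grouped into batches ("waves") rather than scheduled one by one as each completes. It does not model maxParallelism or error handling.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Template is a hypothetical, trimmed-down stand-in for one entry of spec.templates.
type Template struct {
	Name         string
	Dependencies []string
}

// Waves groups templates into batches. Every template in a batch has all of
// its dependencies in earlier batches, so a batch can run concurrently.
func Waves(templates []Template) ([][]string, error) {
	inDegree := make(map[string]int, len(templates))
	dependents := make(map[string][]string)
	for _, t := range templates {
		inDegree[t.Name] = 0
	}
	// Step 1: build the adjacency list and count incoming edges.
	for _, t := range templates {
		for _, dep := range t.Dependencies {
			if _, ok := inDegree[dep]; !ok {
				return nil, fmt.Errorf("template %q depends on unknown template %q", t.Name, dep)
			}
			dependents[dep] = append(dependents[dep], t.Name)
			inDegree[t.Name]++
		}
	}
	// Step 2: templates with in-degree 0 are ready immediately.
	var ready []string
	for name, d := range inDegree {
		if d == 0 {
			ready = append(ready, name)
		}
	}
	var waves [][]string
	done := 0
	for len(ready) > 0 {
		sort.Strings(ready) // deterministic order within a wave
		waves = append(waves, ready)
		done += len(ready)
		// Steps 4-5: completing a template unlocks its dependents.
		var next []string
		for _, name := range ready {
			for _, child := range dependents[name] {
				inDegree[child]--
				if inDegree[child] == 0 {
					next = append(next, child)
				}
			}
		}
		ready = next
	}
	// Any template never reaching in-degree 0 is part of a cycle.
	if done != len(inDegree) {
		return nil, errors.New("dependency cycle detected")
	}
	return waves, nil
}

func main() {
	waves, err := Waves([]Template{
		{Name: "a"},
		{Name: "b", Dependencies: []string{"a"}},
		{Name: "c", Dependencies: []string{"a"}},
		{Name: "d", Dependencies: []string{"b", "c"}},
	})
	fmt.Println(waves, err) // prints "[[a] [b c] [d]] <nil>"
}
```

A useful side effect of Kahn's algorithm is cycle detection: if fewer templates are emitted than exist, the remaining ones depend on each other.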

Template execution creates or references ChaosExperiment resources and watches their status. Delay templates use time.After. Suspend templates set the workflow phase to Paused and wait for a manual resume annotation.

Probe runner

The probe runner is shared between the experiment controller and the abort condition monitor. It supports three probe types:

  • K8s probe: lists resources matching the selector and counts ready ones
  • HTTP probe: makes an HTTP request and checks status/body
  • Prometheus probe: queries the Prometheus HTTP API and evaluates the result

Probes run with a configurable timeout and retry logic.

Rollback manager

The rollback manager calls executor.Rollback() for each affected resource. It tracks which executors were called during Execute and calls them in reverse order during rollback.

Executor registry

The executor registry maps action type strings to Executor implementations:

type Executor interface {
    Execute(ctx context.Context, exp *v1alpha1.ChaosExperiment) error
    Rollback(ctx context.Context, exp *v1alpha1.ChaosExperiment) error
    Validate(exp *v1alpha1.ChaosExperiment) error
}

Registered action types:

| Action type | Executor | Package |
|---|---|---|
| pod-kill | KillExecutor | internal/executor/pod |
| container-kill | ContainerKillExecutor | internal/executor/pod |
| pod-cpu-stress | CPUStressExecutor | internal/executor/pod |
| pod-memory-stress | MemoryStressExecutor | internal/executor/pod |
| pod-io-stress | IOStressExecutor | internal/executor/pod |
| pod-dns-error | DNSErrorExecutor | internal/executor/pod |
| pod-http-abort | HTTPAbortExecutor | internal/executor/pod |
| pod-http-delay | HTTPDelayExecutor | internal/executor/pod |
| network-delay | DelayExecutor | internal/executor/network |
| network-loss | LossExecutor | internal/executor/network |
| network-corrupt | CorruptExecutor | internal/executor/network |
| network-duplicate | DuplicateExecutor | internal/executor/network |
| network-partition | PartitionExecutor | internal/executor/network |
| network-bandwidth | BandwidthExecutor | internal/executor/network |
| node-drain | DrainExecutor | internal/executor/node |
| node-taint | TaintExecutor | internal/executor/node |
| node-restart | RestartExecutor | internal/executor/node |
| node-cpu-stress | CPUStressExecutor | internal/executor/node |
| stress-cpu | CPUStressExecutor | internal/executor/node |
| stress-memory | MemoryStressExecutor | internal/executor/node |
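
A registry of this kind is typically a map keyed by the action type string. The sketch below is one possible shape, not the operator's implementation: `Registry`, `Register`, `Lookup`, and `noopExecutor` are hypothetical names, and `ChaosExperiment` is a local stand-in for `v1alpha1.ChaosExperiment` so the example compiles on its own.

```go
package main

import (
	"context"
	"fmt"
)

// ChaosExperiment is a local stand-in for v1alpha1.ChaosExperiment.
type ChaosExperiment struct {
	Action string
}

// Executor mirrors the interface shown earlier, using the stand-in type.
type Executor interface {
	Execute(ctx context.Context, exp *ChaosExperiment) error
	Rollback(ctx context.Context, exp *ChaosExperiment) error
	Validate(exp *ChaosExperiment) error
}

// Registry maps action type strings to Executor implementations.
type Registry struct {
	executors map[string]Executor
}

func NewRegistry() *Registry {
	return &Registry{executors: make(map[string]Executor)}
}

// Register rejects duplicates so two executors cannot silently claim one action.
func (r *Registry) Register(action string, e Executor) error {
	if _, exists := r.executors[action]; exists {
		return fmt.Errorf("action type %q is already registered", action)
	}
	r.executors[action] = e
	return nil
}

// Lookup returns the executor for an action type, or an error if none exists.
func (r *Registry) Lookup(action string) (Executor, error) {
	e, ok := r.executors[action]
	if !ok {
		return nil, fmt.Errorf("no executor registered for action type %q", action)
	}
	return e, nil
}

// noopExecutor is a placeholder implementation used only for this demo.
type noopExecutor struct{}

func (noopExecutor) Execute(context.Context, *ChaosExperiment) error  { return nil }
func (noopExecutor) Rollback(context.Context, *ChaosExperiment) error { return nil }
func (noopExecutor) Validate(*ChaosExperiment) error                  { return nil }

func main() {
	reg := NewRegistry()
	if err := reg.Register("pod-kill", noopExecutor{}); err != nil {
		fmt.Println("register:", err)
		return
	}
	exp := &ChaosExperiment{Action: "pod-kill"}
	exec, err := reg.Lookup(exp.Action)
	if err != nil {
		fmt.Println("lookup:", err)
		return
	}
	// Validate before Execute so a malformed spec never reaches the cluster.
	if err := exec.Validate(exp); err == nil {
		fmt.Println("executed:", exec.Execute(context.Background(), exp) == nil)
	}
}
```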

Admission webhook

The validating admission webhook intercepts ChaosExperiment creates and updates. It:

  1. Lists all BlastRadiusPolicy resources
  2. Filters to policies whose scope matches the experiment's target
  3. Evaluates each policy's 7-step chain
  4. Rejects the request if any Enforce policy is violated
  5. Logs violations for Audit policies

The webhook is registered as a ValidatingWebhookConfiguration and requires TLS. The Helm chart handles certificate generation via cert-manager or a self-signed certificate.
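
The five steps above reduce to a filter followed by a mode-dependent decision. The sketch below shows that control flow only. `Target`, `Policy`, `Decision`, and `Evaluate` are hypothetical types, namespace equality stands in for scope matching, and the `Check` function is a placeholder for the 7-step chain, whose steps are not described here.

```go
package main

import (
	"fmt"
	"log"
)

type Mode string

const (
	Enforce Mode = "Enforce"
	Audit   Mode = "Audit"
)

// Target is a simplified stand-in for an experiment's target.
type Target struct {
	Namespace string
}

// Policy is a simplified stand-in for a BlastRadiusPolicy.
type Policy struct {
	Name  string
	Mode  Mode
	Scope string                  // assumed scoping rule: "" matches every namespace
	Check func(t Target) []string // placeholder for the policy's evaluation chain
}

// Decision is what the webhook would turn into an admission response.
type Decision struct {
	Allowed bool
	Denied  []string // violations of Enforce policies
	Audited []string // violations of Audit policies (logged, not blocking)
}

// Evaluate filters policies by scope, runs each check, and rejects only
// when an Enforce policy reports a violation.
func Evaluate(policies []Policy, t Target) Decision {
	d := Decision{Allowed: true}
	for _, p := range policies {
		if p.Scope != "" && p.Scope != t.Namespace {
			continue // policy does not apply to this target
		}
		for _, v := range p.Check(t) {
			msg := fmt.Sprintf("%s: %s", p.Name, v)
			switch p.Mode {
			case Enforce:
				d.Allowed = false
				d.Denied = append(d.Denied, msg)
			case Audit:
				log.Printf("audit violation: %s", msg)
				d.Audited = append(d.Audited, msg)
			}
		}
	}
	return d
}

func main() {
	alwaysViolates := func(Target) []string { return []string{"example violation"} }
	policies := []Policy{
		{Name: "prod-guard", Mode: Enforce, Scope: "prod", Check: alwaysViolates},
		{Name: "observe-all", Mode: Audit, Check: alwaysViolates},
	}
	fmt.Printf("%+v\n", Evaluate(policies, Target{Namespace: "staging"}))
	fmt.Printf("%+v\n", Evaluate(policies, Target{Namespace: "prod"}))
}
```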

RBAC

The operator service account needs:

  • get, list, watch, create, update, patch, delete on ChaosExperiment, ChaosWorkflow, BlastRadiusPolicy
  • get, list, watch on Pods, Nodes
  • delete on Pods (for pod-kill)
  • update on Nodes (for node-drain, node-taint)
  • create on pods/eviction (for node-drain)
  • get, list, watch, create, update, patch on Events

Leader election

The operator uses controller-runtime's built-in leader election via a Lease resource in the chaosplane namespace. Only the leader reconciles; followers stand by.

Metrics

The operator exposes Prometheus metrics on :8080/metrics:

| Metric | Type | Description |
|---|---|---|
| chaosplane_experiments_total | Counter | Total experiments by phase |
| chaosplane_experiment_duration_seconds | Histogram | Experiment duration |
| chaosplane_probe_duration_seconds | Histogram | Probe execution duration |
| chaosplane_rollback_total | Counter | Total rollbacks by result |