
Architecture Overview

ChaosPlane is built around four main components: the operator, the daemon, the platform API, and the web UI. Each has a distinct responsibility.

┌─────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                    │
│                                                         │
│  ┌──────────────┐    ┌──────────────────────────────┐   │
│  │   chaosctl   │    │     ChaosPlane Operator      │   │
│  │    (CLI)     │───▶│     (controller-manager)     │   │
│  └──────────────┘    │ - ChaosExperiment controller │   │
│                      │ - ChaosWorkflow controller   │   │
│  ┌──────────────┐    │ - Probe runner               │   │
│  │   Platform   │───▶│ - Rollback manager           │   │
│  │  API (Gin)   │    └──────────┬───────────────────┘   │
│  └──────────────┘               │ gRPC                  │
│                                 ▼                       │
│  ┌──────────────┐    ┌──────────────────────────────┐   │
│  │    Web UI    │    │      ChaosPlane Daemon       │   │
│  │  (Next.js)   │    │   (DaemonSet, 1 per node)    │   │
│  └──────────────┘    │ - stress-ng                  │   │
│                      │ - tc / iptables              │   │
│                      │ - container runtime          │   │
│                      └──────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘

Components

Operator

The operator is the brain. It runs as a Deployment (2 replicas for HA) and watches ChaosExperiment, ChaosWorkflow, and BlastRadiusPolicy resources via controller-runtime.

When an experiment is created, the operator:

  1. Validates it against applicable BlastRadiusPolicy resources (via the admission webhook)
  2. Runs the "before" steady-state probes to confirm the system is healthy
  3. Calls the appropriate executor (Kubernetes API or daemon gRPC)
  4. Monitors abort conditions during the running phase
  5. Waits for the duration to elapse
  6. Runs the "after" steady-state probes to confirm the system has recovered
  7. Executes rollback if configured

See Operator for details.

Daemon

The daemon runs as a DaemonSet — one pod per node. It handles chaos actions that require node-level access:

  • Network chaos: tc netem for delay/loss/corrupt/duplicate, tc tbf for bandwidth, iptables for partition
  • Stress chaos: stress-ng for CPU and memory stress
  • HTTP chaos: transparent HTTP proxy for delay and abort
  • DNS chaos: DNS intercept rules
  • Container kill: container runtime socket access
  • Node restart: system reboot

The operator communicates with the daemon over gRPC. The daemon exposes six RPC methods: ExecStressChaos, ExecNetworkChaos, ExecHTTPChaos, ExecDNSChaos, ExecNodeChaos, and CancelChaos.

See Daemon for details.

Platform API

A Gin-based REST API that wraps the Kubernetes API. It provides:

  • Experiment and workflow CRUD
  • Policy management
  • Real-time updates via WebSocket
  • A unified interface for the web UI and external integrations

The API is optional — you can use chaosctl or kubectl directly without it.

Web UI

A Next.js 15 application using the Carbon Design System. It provides:

  • Dashboard with experiment status overview
  • Experiment list, detail, and creation views
  • Workflow visualization
  • Policy management

The web UI talks to the platform API.

Admission Webhook

A validating admission webhook that intercepts ChaosExperiment creates and updates. It evaluates the experiment against all applicable BlastRadiusPolicy resources and rejects it if any policy in Enforce mode is violated.

Data flow

Experiment lifecycle

kubectl apply / chaosctl create / API POST
        │
        ▼
Admission Webhook (BlastRadiusPolicy evaluation)
        │
        ▼
ChaosExperiment created (phase: Pending)
        │
        ▼
Operator reconciler picks it up
        │
        ├─▶ Run before probes (phase: SteadyStateChecking)
        │       │ fail → phase: Failed
        │       │ pass ↓
        ├─▶ Execute action (phase: Running)
        │       │ abort condition triggers → phase: Aborted
        │       │ duration elapses ↓
        ├─▶ Completing (phase: Completing)
        │       │
        ├─▶ Rollback if enabled (phase: Recovering)
        │       │
        ├─▶ Run after probes (phase: Recovering)
        │       │ timeout → phase: Failed
        │       │ pass ↓
        └─▶ phase: Completed
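
The lifecycle above reads naturally as a state machine. The sketch below encodes it as a transition table. The phase names come from the diagram; the event names, the `Next` function, and folding probe timeouts into a single failure event are assumptions made for illustration.

```go
package main

import "fmt"

// Phase values are taken from the lifecycle diagram.
type Phase string

const (
	Pending             Phase = "Pending"
	SteadyStateChecking Phase = "SteadyStateChecking"
	Running             Phase = "Running"
	Completing          Phase = "Completing"
	Recovering          Phase = "Recovering"
	Completed           Phase = "Completed"
	Failed              Phase = "Failed"
	Aborted             Phase = "Aborted"
)

// Event names are hypothetical labels for the arrows in the diagram.
type Event string

const (
	Reconciled      Event = "Reconciled"
	ProbesPassed    Event = "ProbesPassed"
	ProbesFailed    Event = "ProbesFailed" // assumed to cover probe timeouts too
	AbortTriggered  Event = "AbortTriggered"
	DurationElapsed Event = "DurationElapsed"
	ActionStopped   Event = "ActionStopped"
)

type step struct {
	from Phase
	on   Event
}

// transitions lists every edge shown in the diagram.
var transitions = map[step]Phase{
	{Pending, Reconciled}:                SteadyStateChecking,
	{SteadyStateChecking, ProbesPassed}:  Running,
	{SteadyStateChecking, ProbesFailed}:  Failed,
	{Running, AbortTriggered}:            Aborted,
	{Running, DurationElapsed}:           Completing,
	{Completing, ActionStopped}:          Recovering, // rollback and after probes both run here
	{Recovering, ProbesPassed}:           Completed,
	{Recovering, ProbesFailed}:           Failed,
}

// Next returns the phase that follows p when event e occurs.
func Next(p Phase, e Event) (Phase, error) {
	next, ok := transitions[step{p, e}]
	if !ok {
		return p, fmt.Errorf("no transition from %s on %s", p, e)
	}
	return next, nil
}

func main() {
	phase := Pending
	happyPath := []Event{Reconciled, ProbesPassed, DurationElapsed, ActionStopped, ProbesPassed}
	for _, e := range happyPath {
		next, err := Next(phase, e)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%s -> %s\n", phase, next)
		phase = next
	}
}
```

Because Completed, Failed, and Aborted have no outgoing edges in the table, any event delivered to a finished experiment is reported as an error and the phase stays unchanged.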

Workflow lifecycle

ChaosWorkflow created (phase: Pending)
        │
        ▼
Workflow controller builds DAG
        │
        ▼
Kahn's algorithm: find templates with no dependencies
        │
        ▼
Execute ready templates (up to maxParallelism)
        │
        ▼
As templates complete, unlock dependent templates
        │
        ▼
Repeat until all templates done or error handling triggers

Technology choices

Component     Technology               Why
------------  -----------------------  ------------------------------------------------
Operator      Go + controller-runtime  Standard Kubernetes operator pattern
Daemon        Go + gRPC                Low-latency node-level control
Platform API  Go + Gin                 Lightweight, fast HTTP framework
Web UI        Next.js 15 + Carbon      React SSR + IBM's battle-tested design system
CLI           Go + Cobra               Standard Go CLI framework
CRDs          kubebuilder              Code generation, validation, status subresources