Architecture Overview
ChaosPlane is built around four main components: the operator, the daemon, the platform API, and the web UI. Each has a distinct responsibility.
┌────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                   │
│                                                        │
│  ┌──────────────┐    ┌──────────────────────────────┐  │
│  │   chaosctl   │    │     ChaosPlane Operator      │  │
│  │    (CLI)     │───▶│     (controller-manager)     │  │
│  └──────────────┘    │ - ChaosExperiment controller │  │
│                      │ - ChaosWorkflow controller   │  │
│  ┌──────────────┐    │ - Probe runner               │  │
│  │   Platform   │───▶│ - Rollback manager           │  │
│  │  API (Gin)   │    └──────────┬───────────────────┘  │
│  └──────────────┘               │ gRPC                 │
│                                 ▼                      │
│  ┌──────────────┐    ┌──────────────────────────────┐  │
│  │    Web UI    │    │      ChaosPlane Daemon       │  │
│  │  (Next.js)   │    │   (DaemonSet, 1 per node)    │  │
│  └──────────────┘    │ - stress-ng                  │  │
│                      │ - tc / iptables              │  │
│                      │ - container runtime          │  │
│                      └──────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
Components
Operator
The operator is the brain. It runs as a Deployment (2 replicas for HA) and watches ChaosExperiment, ChaosWorkflow, and BlastRadiusPolicy resources via controller-runtime.
When an experiment is created, the operator:
- Validates it against applicable BlastRadiusPolicy resources (via the admission webhook)
- Runs before steady-state probes
- Calls the appropriate executor (Kubernetes API or daemon gRPC)
- Monitors abort conditions during the running phase
- Waits for the duration to elapse
- Runs after steady-state probes
- Executes rollback if configured
See Operator for details.
Daemon
The daemon runs as a DaemonSet — one pod per node. It handles chaos actions that require node-level access:
- Network chaos: tc netem for delay/loss/corrupt/duplicate, tc tbf for bandwidth, iptables for partition
- Stress chaos: stress-ng for CPU and memory stress
- HTTP chaos: transparent HTTP proxy for delay and abort
- DNS chaos: DNS intercept rules
- Container kill: container runtime socket access
- Node restart: system reboot
The operator communicates with the daemon over gRPC. The daemon exposes six RPC methods: ExecStressChaos, ExecNetworkChaos, ExecHTTPChaos, ExecDNSChaos, ExecNodeChaos, and CancelChaos.
See Daemon for details.
Platform API
A Gin-based REST API that wraps the Kubernetes API. It provides:
- Experiment and workflow CRUD
- Policy management
- Real-time updates via WebSocket
- A unified interface for the web UI and external integrations
The API is optional — you can use chaosctl or kubectl directly without it.
Web UI
A Next.js 15 application using the Carbon Design System. It provides:
- Dashboard with experiment status overview
- Experiment list, detail, and creation views
- Workflow visualization
- Policy management
The web UI talks to the platform API.
Admission Webhook
A validating admission webhook that intercepts ChaosExperiment creates and updates. It evaluates the experiment against all applicable BlastRadiusPolicy resources and rejects it if any policy in Enforce mode is violated.
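The rejection rule can be illustrated with a minimal Go sketch. Only the Enforce mode and the "reject if any enforced policy is violated" rule come from the text above. The Policy and Experiment fields (Namespace, MaxTargets, Targets), the namespace-based applicability check, and the treatment of other modes as warnings are hypothetical.

```go
package main

import "fmt"

// ModeEnforce is the only policy mode named on this page.
const ModeEnforce = "Enforce"

// Policy is a hypothetical, simplified stand-in for BlastRadiusPolicy.
type Policy struct {
	Name       string
	Mode       string // "Enforce" rejects; other values only warn (assumption)
	Namespace  string // assumed scope: applies to experiments in this namespace
	MaxTargets int    // assumed limit on the number of affected targets
}

// Experiment is a hypothetical, simplified stand-in for ChaosExperiment.
type Experiment struct {
	Name      string
	Namespace string
	Targets   int
}

// Evaluate checks an experiment against every applicable policy.
// It is allowed only if no Enforce-mode policy is violated; violations of
// policies in any other mode are reported as warnings.
func Evaluate(exp Experiment, policies []Policy) (allowed bool, denied, warnings []string) {
	for _, p := range policies {
		if p.Namespace != exp.Namespace {
			continue // policy does not apply to this experiment
		}
		if exp.Targets <= p.MaxTargets {
			continue // within the policy's limit
		}
		if p.Mode == ModeEnforce {
			denied = append(denied, p.Name)
		} else {
			warnings = append(warnings, p.Name)
		}
	}
	return len(denied) == 0, denied, warnings
}

func main() {
	policies := []Policy{
		{Name: "prod-limit", Mode: ModeEnforce, Namespace: "prod", MaxTargets: 2},
		{Name: "prod-advisory", Mode: "Audit", Namespace: "prod", MaxTargets: 1},
	}
	exp := Experiment{Name: "kill-pods", Namespace: "prod", Targets: 3}
	allowed, denied, warnings := Evaluate(exp, policies)
	fmt.Println("allowed:", allowed, "denied by:", denied, "warnings:", warnings)
}
```

Collecting every violated policy instead of stopping at the first one lets the rejection message name all of them, which tends to be more useful to whoever submitted the experiment.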
Data flow
Experiment lifecycle
kubectl apply / chaosctl create / API POST
│
▼
Admission Webhook (BlastRadiusPolicy evaluation)
│
▼
ChaosExperiment created (phase: Pending)
│
▼
Operator reconciler picks it up
│
├─▶ Run before probes (phase: SteadyStateChecking)
│     │ fail → phase: Failed
│     │ pass ↓
├─▶ Execute action (phase: Running)
│     │ abort condition triggers → phase: Aborted
│     │ duration elapses ↓
├─▶ Completing (phase: Completing)
│     │
├─▶ Rollback if enabled (phase: Recovering)
│     │
├─▶ Run after probes (phase: Recovering)
│     │ timeout → phase: Failed
│     │ pass ↓
└─▶ phase: Completed
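The phase transitions in the diagram above can be written down as a small lookup table. This Go sketch uses the phase names from the diagram; the Phase type, the transitions map, and the helper functions are illustrative, not the operator's actual code. It also assumes Failed, Aborted, and Completed are terminal, which the diagram suggests but does not state.

```go
package main

import "fmt"

// Phase is an experiment lifecycle phase, named as in the diagram.
type Phase string

const (
	Pending             Phase = "Pending"
	SteadyStateChecking Phase = "SteadyStateChecking"
	Running             Phase = "Running"
	Completing          Phase = "Completing"
	Recovering          Phase = "Recovering"
	Completed           Phase = "Completed"
	Failed              Phase = "Failed"
	Aborted             Phase = "Aborted"
)

// transitions lists the phases reachable from each non-terminal phase.
var transitions = map[Phase][]Phase{
	Pending:             {SteadyStateChecking},
	SteadyStateChecking: {Running, Failed},   // before probes pass or fail
	Running:             {Completing, Aborted}, // duration elapses or abort triggers
	Completing:          {Recovering},
	Recovering:          {Completed, Failed}, // after probes pass or time out
}

// CanTransition reports whether moving from one phase to another is allowed.
func CanTransition(from, to Phase) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// IsTerminal reports whether a phase has no outgoing transitions.
func IsTerminal(p Phase) bool {
	return len(transitions[p]) == 0
}

func main() {
	path := []Phase{Pending, SteadyStateChecking, Running, Completing, Recovering, Completed}
	for i := 1; i < len(path); i++ {
		fmt.Printf("%s -> %s allowed: %v\n", path[i-1], path[i], CanTransition(path[i-1], path[i]))
	}
	fmt.Println("Running -> Completed allowed:", CanTransition(Running, Completed))
}
```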
Workflow lifecycle
ChaosWorkflow created (phase: Pending)
│
▼
Workflow controller builds DAG
│
▼
Kahn's algorithm: find templates with no dependencies
│
▼
Execute ready templates (up to maxParallelism)
│
▼
As templates complete, unlock dependent templates
│
▼
Repeat until all templates done or error handling triggers
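The scheduling loop above can be sketched with Kahn's algorithm in Go. This version is a simplification: it groups templates into batches of at most maxParallelism and treats each batch as finishing together, whereas the flow above unlocks dependents as individual templates complete. The Schedule function and the shape of its deps input are hypothetical, and error handling is reduced to detecting unknown dependencies and cycles.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Schedule groups templates into ordered batches using Kahn's algorithm.
// deps maps each template name to the names it depends on. Every batch holds
// at most maxParallelism templates whose dependencies ran in earlier batches.
func Schedule(deps map[string][]string, maxParallelism int) ([][]string, error) {
	if maxParallelism < 1 {
		return nil, errors.New("maxParallelism must be at least 1")
	}
	indegree := make(map[string]int, len(deps)) // unmet dependencies per template
	dependents := make(map[string][]string)     // reverse edges
	for name, ds := range deps {
		indegree[name] = len(ds)
		for _, d := range ds {
			if _, ok := deps[d]; !ok {
				return nil, fmt.Errorf("template %q depends on unknown template %q", name, d)
			}
			dependents[d] = append(dependents[d], name)
		}
	}

	// Start with the templates that have no dependencies.
	var ready []string
	for name, n := range indegree {
		if n == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready) // deterministic order

	var batches [][]string
	done := 0
	for len(ready) > 0 {
		n := len(ready)
		if n > maxParallelism {
			n = maxParallelism
		}
		batch := append([]string(nil), ready[:n]...)
		ready = ready[n:]
		batches = append(batches, batch)
		done += n
		// Completing a template unlocks dependents whose last dependency it was.
		for _, name := range batch {
			for _, dep := range dependents[name] {
				indegree[dep]--
				if indegree[dep] == 0 {
					ready = append(ready, dep)
				}
			}
		}
		sort.Strings(ready)
	}
	if done != len(deps) {
		return nil, errors.New("dependency cycle detected")
	}
	return batches, nil
}

func main() {
	deps := map[string][]string{
		"a": nil,
		"b": {"a"},
		"c": {"a"},
		"d": {"b", "c"},
	}
	batches, err := Schedule(deps, 2)
	fmt.Println(batches, err)
}
```

If some templates never reach an in-degree of zero, the processed count falls short of the total, which is how this sketch detects a dependency cycle.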
Technology choices
| Component | Technology | Why |
|---|---|---|
| Operator | Go + controller-runtime | Standard Kubernetes operator pattern |
| Daemon | Go + gRPC | Low-latency node-level control |
| Platform API | Go + Gin | Lightweight, fast HTTP framework |
| Web UI | Next.js 15 + Carbon | React SSR + IBM's battle-tested design system |
| CLI | Go + Cobra | Standard Go CLI framework |
| CRDs | kubebuilder | Code generation, validation, status subresources |