Daemon

The ChaosPlane daemon runs as a DaemonSet, with one pod per node. It handles chaos actions that require node-level access: network manipulation, stress testing, container kills, HTTP interception, DNS injection, and node restarts.

Why a daemon?

Some chaos actions can't be performed from the operator pod:

  • Network chaos requires entering the target pod's network namespace and running tc or iptables commands
  • Stress chaos requires running stress-ng inside the pod's cgroup or on the host
  • Container kill requires access to the container runtime socket
  • Node restart requires executing a system reboot on the host

The daemon runs with elevated privileges on each node and exposes a gRPC server that the operator calls.

gRPC API

The daemon exposes these RPC methods:

ExecStressChaos

Runs stress-ng for CPU, memory, or I/O stress. The same RPC also handles container-kill actions, which go through the container runtime rather than stress-ng.

rpc ExecStressChaos(StressChaosRequest) returns (ChaosResponse);

message StressChaosRequest {
  string experiment_id = 1;
  string stressor_type = 2; // "cpu", "memory", "io", "container-kill"
  map<string, string> parameters = 3;
}

ExecNetworkChaos

Applies network chaos using tc or iptables.

rpc ExecNetworkChaos(NetworkChaosRequest) returns (ChaosResponse);

message NetworkChaosRequest {
  string experiment_id = 1;
  string action = 2; // "delay", "loss", "corrupt", "duplicate", "partition", "bandwidth"
  map<string, string> parameters = 3;
}

ExecHTTPChaos

Sets up HTTP interception for delay or abort.

rpc ExecHTTPChaos(HTTPChaosRequest) returns (ChaosResponse);

message HTTPChaosRequest {
  string experiment_id = 1;
  string action = 2; // "delay", "abort"
  int32 port = 3;
  map<string, string> parameters = 4;
}

ExecDNSChaos

Injects DNS errors.

rpc ExecDNSChaos(DNSChaosRequest) returns (ChaosResponse);

message DNSChaosRequest {
  string experiment_id = 1;
  string action = 2; // "error"
  map<string, string> parameters = 3;
}

ExecNodeChaos

Executes node-level actions like restart.

rpc ExecNodeChaos(NodeChaosRequest) returns (ChaosResponse);

message NodeChaosRequest {
  string experiment_id = 1;
  string action = 2; // "restart"
  map<string, string> parameters = 3;
}

CancelChaos

Cancels a running chaos action by execution ID.

rpc CancelChaos(CancelRequest) returns (CancelResponse);

message CancelRequest {
  string execution_id = 1;
}

Network namespace access

For pod-scoped network chaos, the daemon needs to enter the pod's network namespace. It does this by:

  1. Looking up the pod's container ID from the podName and podNamespace parameters
  2. Finding the container's network namespace path via the container runtime (e.g. /proc/<pid>/ns/net)
  3. Using nsenter or Go's unix.Setns to enter the namespace
  4. Running tc or iptables commands within that namespace

Execution tracking

The daemon assigns a unique execution_id to each chaos action and tracks what that action started. When CancelChaos is called, it looks up the execution ID and stops the action: long-running processes such as stress-ng are terminated, and applied tc or iptables rules are reverted.

Security

The daemon requires elevated privileges:

securityContext:
  privileged: true
  capabilities:
    add:
      - NET_ADMIN   # tc, iptables
      - SYS_ADMIN   # nsenter, cgroup access
      - SYS_PTRACE  # process inspection

It also needs host PID namespace access (hostPID: true) to find container processes.

The daemon only accepts connections from the operator's service account. mTLS is used for gRPC communication.

Daemon endpoint resolution

The operator resolves the daemon endpoint for a pod using the pod's spec.nodeName:

func ResolveDaemonEndpoint(nodeName string) string {
    return fmt.Sprintf("%s.chaosplane-daemon.chaosplane.svc.cluster.local:50051", nodeName)
}

Each daemon pod is addressable by its node name via a headless service.

Health

The daemon exposes:

  • gRPC health check (grpc.health.v1.Health)
  • HTTP /healthz on port 8081

The DaemonSet uses these for liveness and readiness probes.