Node Chaos
Node chaos actions target Kubernetes nodes directly. They test cluster-level resilience: how your workloads survive node failures, maintenance events, and resource exhaustion at the host level.
node-drain
Cordons a node (marks it unschedulable) and evicts all eligible pods. Simulates a planned maintenance event or node replacement. The operator calls the Kubernetes Eviction API directly; no daemon is required.
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: drain-worker-node
  namespace: chaosplane
spec:
  target:
    kind: Node
    labelSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
  action:
    type: node-drain
    parameters:
      timeout: "5m"
      ignoreDaemonSets: "true"
      deleteEmptyDirData: "true"
  duration: 5m
  rollback:
    enabled: true
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| timeout | string | No | "5m" | Max time to wait for pod eviction |
| ignoreDaemonSets | string | No | "true" | Skip DaemonSet pods during eviction |
| deleteEmptyDirData | string | No | "true" | Evict pods with emptyDir volumes |
Rollback: Uncordons the node by setting spec.unschedulable: false. Pods are not automatically rescheduled back to the node.
Implementation: Uses the Kubernetes API directly (no daemon). Cordons via client.Update, evicts via PolicyV1().Evictions().Evict(). Static pods (mirror pods) are always skipped.
Draining a production node evicts every eligible pod on it, and those pods must be rescheduled elsewhere. Make sure the remaining nodes have enough spare capacity before running this experiment.
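The eviction rules described for node-drain can be modeled without a cluster. The following is a minimal Python sketch, not the operator's code: the `Pod` class, `parse_bool`, and `pods_to_evict` are hypothetical names introduced here. It assumes pods with emptyDir volumes are left in place when deleteEmptyDirData is "false", and that DaemonSet pods stay eligible when ignoreDaemonSets is "false"; the source does not state either behavior.

```python
from dataclasses import dataclass, field

# Annotation the kubelet sets on the mirror pod of a static pod.
MIRROR_ANNOTATION = "kubernetes.io/config.mirror"


@dataclass
class Pod:
    """Hypothetical, trimmed-down stand-in for a Kubernetes Pod object."""
    name: str
    annotations: dict = field(default_factory=dict)
    owner_kinds: list = field(default_factory=list)   # kinds from metadata.ownerReferences
    volume_types: list = field(default_factory=list)  # e.g. ["emptyDir", "configMap"]


def parse_bool(value: str) -> bool:
    """Parameters arrive as strings ("true"/"false"), so convert explicitly."""
    return value.strip().lower() == "true"


def pods_to_evict(pods, ignore_daemon_sets="true", delete_empty_dir_data="true"):
    """Return the names of the pods a drain would evict under the rules above."""
    skip_daemon_sets = parse_bool(ignore_daemon_sets)
    allow_empty_dir = parse_bool(delete_empty_dir_data)
    selected = []
    for pod in pods:
        if MIRROR_ANNOTATION in pod.annotations:
            continue  # static (mirror) pods are always skipped
        if skip_daemon_sets and "DaemonSet" in pod.owner_kinds:
            continue  # DaemonSet pods are skipped when ignoreDaemonSets is "true"
        if not allow_empty_dir and "emptyDir" in pod.volume_types:
            continue  # assumption: left alone when deleteEmptyDirData is "false"
        selected.append(pod.name)
    return selected


pods = [
    Pod("web-1", owner_kinds=["ReplicaSet"]),
    Pod("cache-1", owner_kinds=["StatefulSet"], volume_types=["emptyDir"]),
    Pod("node-exporter-x", owner_kinds=["DaemonSet"]),
    Pod("kube-apiserver-n1", annotations={MIRROR_ANNOTATION: "abc123"}),
]
print(pods_to_evict(pods))
print(pods_to_evict(pods, delete_empty_dir_data="false"))
```

With the defaults, only `web-1` and `cache-1` are selected; setting deleteEmptyDirData to "false" narrows the list to `web-1`.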
node-taint
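Adds a taint to the target nodes. Pods without a matching toleration are evicted (NoExecute), kept off the node by the scheduler while already-running pods stay put (NoSchedule), or placed there only when no other node fits (PreferNoSchedule). Tests pod disruption budgets, toleration configuration, and rescheduling behavior.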
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: taint-node
  namespace: chaosplane
spec:
  target:
    kind: Node
    names:
      - worker-node-1
  action:
    type: node-taint
    parameters:
      key: "chaos.chaosplane.io/test"
      value: "true"
      effect: "NoSchedule"
  duration: 2m
  rollback:
    enabled: true
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| key | string | Yes | — | Taint key |
| value | string | No | "" | Taint value |
| effect | string | Yes | — | NoSchedule, NoExecute, or PreferNoSchedule |
Rollback: Removes the taint from the node by filtering it out of spec.taints.
Implementation: Uses the Kubernetes API directly. Idempotent: if the taint already exists with the same key and effect, it's skipped.
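The idempotent apply and the filter-based rollback can be illustrated on plain lists of taints. This is a minimal Python sketch under the stated rule that a taint is identified by its key and effect; the function names `same_taint`, `add_taint`, and `remove_taint` are hypothetical and do not come from the operator.

```python
def same_taint(a: dict, b: dict) -> bool:
    """Two taints are treated as the same when key and effect both match."""
    return a["key"] == b["key"] and a["effect"] == b["effect"]


def add_taint(taints: list, new: dict) -> list:
    """Return the taints plus `new`; unchanged if an equivalent taint exists."""
    if any(same_taint(t, new) for t in taints):
        return list(taints)  # idempotent: already present, nothing to do
    return list(taints) + [new]


def remove_taint(taints: list, target: dict) -> list:
    """Rollback: filter the chaos taint out, leaving every other taint in place."""
    return [t for t in taints if not same_taint(t, target)]


chaos = {"key": "chaos.chaosplane.io/test", "value": "true", "effect": "NoSchedule"}
existing = [{"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}]

tainted = add_taint(existing, chaos)
tainted_again = add_taint(tainted, chaos)  # second apply is a no-op
restored = remove_taint(tainted_again, chaos)
print(len(tainted), len(tainted_again), restored == existing)
```

Applying the taint twice leaves two taints on the node, and rollback restores the original list, so any taints that existed before the experiment survive it.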
node-restart
Reboots the node's host through the daemon. Tests cluster recovery from unexpected node failures.
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: restart-node
  namespace: chaosplane
spec:
  target:
    kind: Node
    names:
      - worker-node-2
  action:
    type: node-restart
    parameters:
      grace_period: "30s"
  duration: 10m
  rollback:
    enabled: true
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| grace_period | string | No | "0s" | Grace period before reboot |
Rollback: Waits for the node to return to Ready state (polls every 10 seconds, up to 5 minutes).
Implementation: Uses the daemon's ExecNodeChaos RPC with action: restart. The daemon triggers a system reboot.
Node restart is destructive. All pods on the node will be terminated. Only use this in clusters where you can afford a node going offline.
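The node-restart rollback is a bounded polling loop: check every 10 seconds, give up after 5 minutes. A minimal Python sketch of that pattern follows. `wait_for_ready` and `FakeClock` are hypothetical names, and the readiness check is passed in as a callable so the loop can be exercised without a cluster; how the operator actually queries node status is not shown here.

```python
import time


def wait_for_ready(is_ready, interval=10.0, timeout=300.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() every `interval` seconds until it returns True or
    `timeout` seconds have elapsed. Returns True if the node became Ready."""
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() + interval > deadline:
            return False  # another wait would overshoot the deadline
        sleep(interval)


class FakeClock:
    """Simulated time source so the example finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
# Hypothetical readiness check: the node reports Ready 40 s after the reboot.
came_back = wait_for_ready(lambda: fake.now >= 40, sleep=fake.sleep, clock=fake.time)
print(came_back, fake.now)
```

In the simulated run the node is detected as Ready on the fifth poll, at 40 seconds. A node that never comes back makes the function return False once the 300-second budget is spent.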
node-cpu-stress
Runs CPU stress workers on the node host (not inside a pod cgroup). Tests how node-level CPU saturation affects all workloads running on that node.
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: node-cpu-stress
  namespace: chaosplane
spec:
  target:
    kind: Node
    labelSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
  action:
    type: node-cpu-stress
    parameters:
      workers: "4"
      load: "90"
      duration: "60s"
  duration: 60s
  rollback:
    enabled: true
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| workers | string | No | "1" | Number of CPU stress workers |
| load | string | No | "100" | CPU load percentage (0-100) |
| duration | string | No | experiment duration | How long to stress |
Rollback: Sends CancelChaos to the daemon, which terminates the stress-ng process.
Implementation: Uses the daemon's ExecStressChaos RPC. Unlike pod-cpu-stress, this runs at the node level and affects all pods on the node.
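Because every node-cpu-stress parameter is a string, it has to be parsed and range-checked before use. The Python sketch below shows one way to apply the documented defaults and turn the result into a stress-ng command line. The mapping to `--cpu`, `--cpu-load`, and `--timeout` is an assumption made for illustration (those are real stress-ng options, but the daemon's actual invocation is not documented here), and `parse_duration` and `stress_args` are hypothetical helpers that accept only simple durations such as "60s" or "5m".

```python
import re

_DURATION = re.compile(r"^(\d+)(s|m|h)$")
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a simple duration such as "60s" or "5m" to whole seconds."""
    match = _DURATION.match(text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def stress_args(params: dict, experiment_duration: str) -> list:
    """Apply the documented defaults, validate, and build a stress-ng argv."""
    workers = int(params.get("workers", "1"))
    load = int(params.get("load", "100"))
    # The duration parameter falls back to the experiment duration.
    seconds = parse_duration(params.get("duration", experiment_duration))
    if not 0 <= load <= 100:
        raise ValueError("load must be between 0 and 100")
    # Assumed flag mapping; the daemon's real command line may differ.
    return ["stress-ng", "--cpu", str(workers),
            "--cpu-load", str(load), "--timeout", f"{seconds}s"]


print(stress_args({"workers": "4", "load": "90", "duration": "60s"}, "60s"))
print(stress_args({}, "2m"))
```

The second call passes no parameters, so it falls back to one worker at 100% load for the experiment duration of 120 seconds.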
Targeting nodes
By label selector
target:
  kind: Node
  labelSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
By name
target:
  kind: Node
  names:
    - worker-node-1
    - worker-node-2
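Note: node targets take no namespace field, because nodes are cluster-scoped. The namespace under metadata in the examples above is that of the ChaosExperiment object itself.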
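The two targeting modes resolve to a set of node names in different ways. A minimal Python sketch, assuming explicit names take precedence when both are given and that an empty target is rejected rather than matching every node; neither rule is stated in the source, and `select_nodes` is a hypothetical helper.

```python
def select_nodes(nodes: dict, target: dict) -> list:
    """Resolve a `target` block to node names.

    `nodes` maps node name -> labels; `target` mirrors the YAML structure.
    """
    if target.get("kind") != "Node":
        raise ValueError("this sketch only handles kind: Node")
    names = target.get("names")
    if names:
        # By name: keep the requested order, drop names that do not exist.
        return [name for name in names if name in nodes]
    selector = target.get("labelSelector", {}).get("matchLabels")
    if not selector:
        raise ValueError("target needs names or labelSelector.matchLabels")
    # By label: every key/value pair in matchLabels must match exactly.
    return sorted(
        name for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )


nodes = {
    "control-plane-1": {"node-role.kubernetes.io/control-plane": ""},
    "worker-node-1": {"node-role.kubernetes.io/worker": ""},
    "worker-node-2": {"node-role.kubernetes.io/worker": "", "zone": "b"},
}
by_label = select_nodes(nodes, {
    "kind": "Node",
    "labelSelector": {"matchLabels": {"node-role.kubernetes.io/worker": ""}},
})
by_name = select_nodes(nodes, {"kind": "Node", "names": ["worker-node-2"]})
print(by_label, by_name)
```

Note that a label with an empty value, as in the worker role label, still has to be present on the node to match; a node without the key is not selected.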
Recommended steady-state probes
For node chaos, check that your workloads are still running across the cluster:
steadyState:
  before:
    - name: deployments-healthy
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 2
  after:
    - name: deployments-recovered
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 2
  recoveryTimeout: 10m
Use a longer recoveryTimeout for node chaos since rescheduling takes more time than pod restarts.
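What a minReady condition checks can be sketched in a few lines of Python: count the pods that match the label selector and report a Ready condition of "True", then compare against the threshold. This is an illustration only, not the probe's implementation. The helper names are hypothetical, the selector parser handles only equality terms such as app=my-app, and pods are represented as plain dicts.

```python
def parse_selector(selector: str) -> dict:
    """Parse an equality-only selector such as "app=my-app,tier=web"."""
    pairs = (part.split("=", 1) for part in selector.split(",") if part)
    return {key.strip(): value.strip() for key, value in pairs}


def is_ready(pod: dict) -> bool:
    """A pod counts as ready when its Ready condition has status "True"."""
    return any(cond["type"] == "Ready" and cond["status"] == "True"
               for cond in pod.get("conditions", []))


def probe_passes(pods: list, label_selector: str, min_ready: int) -> bool:
    """True when at least `min_ready` matching pods are ready."""
    wanted = parse_selector(label_selector)
    ready = sum(
        1 for pod in pods
        if all(pod["labels"].get(k) == v for k, v in wanted.items())
        and is_ready(pod)
    )
    return ready >= min_ready


pods = [
    {"labels": {"app": "my-app"}, "conditions": [{"type": "Ready", "status": "True"}]},
    {"labels": {"app": "my-app"}, "conditions": [{"type": "Ready", "status": "False"}]},
    {"labels": {"app": "my-app"}, "conditions": [{"type": "Ready", "status": "True"}]},
    {"labels": {"app": "other"}, "conditions": [{"type": "Ready", "status": "True"}]},
]
print(probe_passes(pods, "app=my-app", 2), probe_passes(pods, "app=my-app", 3))
```

Two of the three my-app pods are ready, so a threshold of 2 passes and a threshold of 3 fails. Choosing minReady below the replica count leaves room for the pods that are mid-reschedule when the after probe runs.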