
Node Chaos

Node chaos actions target Kubernetes nodes directly. They test cluster-level resilience: how your workloads survive node failures, maintenance events, and resource exhaustion at the host level.

node-drain

Cordons a node (marks it unschedulable) and evicts all eligible pods. Simulates a planned maintenance event or node replacement. The operator uses the Kubernetes Eviction API directly; no daemon is required.

```yaml
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: drain-worker-node
  namespace: chaosplane
spec:
  target:
    kind: Node
    labelSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
  action:
    type: node-drain
    parameters:
      timeout: "5m"
      ignoreDaemonSets: "true"
      deleteEmptyDirData: "true"
  duration: 5m
  rollback:
    enabled: true
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| timeout | string | No | "5m" | Max time to wait for pod eviction |
| ignoreDaemonSets | string | No | "true" | Skip DaemonSet pods during eviction |
| deleteEmptyDirData | string | No | "true" | Evict pods with emptyDir volumes |

Rollback: Uncordons the node by setting spec.unschedulable: false. Pods are not automatically rescheduled back to the node.

Implementation: Uses the Kubernetes API directly (no daemon). Cordons via client.Update, evicts via PolicyV1().Evictions().Evict(). Static pods (mirror pods) are always skipped.

> **Warning:** Draining a node in production will reschedule all its pods. Make sure your cluster has enough capacity on other nodes before running this experiment.
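The eviction eligibility rules above (DaemonSet pods skipped when `ignoreDaemonSets` is true, mirror pods always skipped) can be sketched as a small filter. This is an illustrative sketch, not the operator's actual code: the `pod` struct and `eligibleForEviction` helper are simplified stand-ins for the `corev1.Pod` objects the real implementation would inspect.

```go
package main

import "fmt"

// pod is a minimal stand-in for corev1.Pod, carrying only the fields the
// drain rules consult (hypothetical shape for illustration).
type pod struct {
	Name        string
	OwnerKind   string            // e.g. "DaemonSet", "ReplicaSet"
	Annotations map[string]string
}

// eligibleForEviction mirrors the documented drain rules: static (mirror)
// pods are always skipped, and DaemonSet-owned pods are skipped when
// ignoreDaemonSets is set.
func eligibleForEviction(p pod, ignoreDaemonSets bool) bool {
	// Static pods carry the kubernetes.io/config.mirror annotation.
	if _, isMirror := p.Annotations["kubernetes.io/config.mirror"]; isMirror {
		return false
	}
	if ignoreDaemonSets && p.OwnerKind == "DaemonSet" {
		return false
	}
	return true
}

func main() {
	pods := []pod{
		{Name: "app-1", OwnerKind: "ReplicaSet"},
		{Name: "log-agent", OwnerKind: "DaemonSet"},
		{Name: "etcd-node1", Annotations: map[string]string{"kubernetes.io/config.mirror": "abc"}},
	}
	for _, p := range pods {
		fmt.Printf("%s eligible=%v\n", p.Name, eligibleForEviction(p, true))
	}
}
```

Note that with `ignoreDaemonSets: "false"` DaemonSet pods become eligible, but mirror pods are still excluded, since the kubelet would recreate them immediately anyway.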

node-taint

Adds a taint to target nodes. Pods without a matching toleration will be evicted (for NoExecute) or not scheduled (for NoSchedule). Tests pod disruption budgets, toleration configuration, and rescheduling behavior.

```yaml
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: taint-node
  namespace: chaosplane
spec:
  target:
    kind: Node
    names:
      - worker-node-1
  action:
    type: node-taint
    parameters:
      key: "chaos.chaosplane.io/test"
      value: "true"
      effect: "NoSchedule"
  duration: 2m
  rollback:
    enabled: true
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| key | string | Yes | | Taint key |
| value | string | No | "" | Taint value |
| effect | string | Yes | | NoSchedule, NoExecute, or PreferNoSchedule |

Rollback: Removes the taint from the node by filtering it out of spec.taints.

Implementation: Uses the Kubernetes API directly. Idempotent: if the taint already exists with the same key and effect, it's skipped.
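The idempotent add and rollback filter described above amount to slice operations on `spec.taints`. As a sketch (the `taint` struct is a simplified stand-in for `corev1.Taint`; the helper names are not the operator's):

```go
package main

import "fmt"

// taint is a minimal stand-in for corev1.Taint.
type taint struct {
	Key, Value, Effect string
}

// addTaint appends t unless a taint with the same key and effect already
// exists — the documented idempotency rule.
func addTaint(taints []taint, t taint) []taint {
	for _, existing := range taints {
		if existing.Key == t.Key && existing.Effect == t.Effect {
			return taints // already present; skip
		}
	}
	return append(taints, t)
}

// removeTaint filters the chaos taint back out of spec.taints on rollback,
// leaving unrelated taints untouched.
func removeTaint(taints []taint, key, effect string) []taint {
	out := taints[:0:0] // new backing array; don't mutate the input
	for _, existing := range taints {
		if existing.Key == key && existing.Effect == effect {
			continue
		}
		out = append(out, existing)
	}
	return out
}

func main() {
	taints := []taint{{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"}}
	chaos := taint{Key: "chaos.chaosplane.io/test", Value: "true", Effect: "NoSchedule"}
	taints = addTaint(taints, chaos)
	taints = addTaint(taints, chaos) // second add is a no-op
	fmt.Println(len(taints))
	taints = removeTaint(taints, chaos.Key, chaos.Effect)
	fmt.Println(len(taints))
}
```

Matching on key and effect (rather than key alone) means rollback cannot accidentally strip a pre-existing taint that happens to share the chaos key under a different effect.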

node-restart

Triggers a node restart via the daemon. The daemon executes a system reboot on the host. Tests cluster recovery from unexpected node failures.

```yaml
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: restart-node
  namespace: chaosplane
spec:
  target:
    kind: Node
    names:
      - worker-node-2
  action:
    type: node-restart
    parameters:
      grace_period: "30s"
  duration: 10m
  rollback:
    enabled: true
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| grace_period | string | No | "0s" | Grace period before reboot |

Rollback: Waits for the node to return to Ready state (polls every 10 seconds, up to 5 minutes).

Implementation: Uses the daemon's ExecNodeChaos RPC with action: restart. The daemon triggers a system reboot.

> **Danger:** Node restart is destructive. All pods on the node will be terminated. Only use this in clusters where you can afford a node going offline.

node-cpu-stress

Runs CPU stress workers on the node host (not inside a pod cgroup). Tests how node-level CPU saturation affects all workloads running on that node.

```yaml
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: node-cpu-stress
  namespace: chaosplane
spec:
  target:
    kind: Node
    labelSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
  action:
    type: node-cpu-stress
    parameters:
      workers: "4"
      load: "90"
      duration: "60s"
  duration: 60s
  rollback:
    enabled: true
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| workers | string | No | "1" | Number of CPU stress workers |
| load | string | No | "100" | CPU load percentage (0-100) |
| duration | string | No | experiment duration | How long to stress |

Rollback: Sends CancelChaos to the daemon, which terminates the stress-ng process.

Implementation: Uses the daemon's ExecStressChaos RPC. Unlike pod-cpu-stress, this runs at the node level and affects all pods on the node.
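Since the daemon runs stress-ng, the three parameters map naturally onto stress-ng's `--cpu`, `--cpu-load`, and `--timeout` flags. A hypothetical sketch of that mapping (the `stressArgs` helper is illustrative, not the daemon's actual code; the flags themselves are real stress-ng flags):

```go
package main

import "fmt"

// stressArgs builds a stress-ng invocation from the documented parameters.
func stressArgs(workers, load, duration string) []string {
	return []string{
		"stress-ng",
		"--cpu", workers, // number of CPU stress workers
		"--cpu-load", load, // target load percentage per worker
		"--timeout", duration, // stop after this long
	}
}

func main() {
	// workers: "4", load: "90", duration: "60s" from the example above.
	fmt.Println(stressArgs("4", "90", "60s"))
}
```

Passing `--timeout` means the stress workers self-terminate even if the rollback `CancelChaos` never arrives, which is a useful safety property at the node level.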

Targeting nodes

By label selector

```yaml
target:
  kind: Node
  labelSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
```

By name

```yaml
target:
  kind: Node
  names:
    - worker-node-1
    - worker-node-2
```

Note: node targets do not use namespace (nodes are cluster-scoped).
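`matchLabels` selection is a subset check: every key in the selector must appear on the node with an equal value, where an empty string matches the empty-valued role labels shown above. A sketch of that rule (the helper is illustrative; the real operator delegates to Kubernetes label-selector machinery):

```go
package main

import "fmt"

// matchesSelector reports whether a node's labels satisfy a matchLabels
// selector: every selector key must be present with an equal value.
func matchesSelector(nodeLabels, matchLabels map[string]string) bool {
	for k, v := range matchLabels {
		if got, ok := nodeLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	worker := map[string]string{"node-role.kubernetes.io/worker": ""}
	control := map[string]string{"node-role.kubernetes.io/control-plane": ""}
	sel := map[string]string{"node-role.kubernetes.io/worker": ""}
	fmt.Println(matchesSelector(worker, sel), matchesSelector(control, sel))
}
```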

Steady-state checks

For node chaos, check that your workloads are still running across the cluster:

```yaml
steadyState:
  before:
    - name: deployments-healthy
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 2
  after:
    - name: deployments-recovered
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 2
      recoveryTimeout: 10m
```

Use a longer recoveryTimeout for node chaos since rescheduling takes more time than pod restarts.