
Node Chaos

Node chaos actions target Kubernetes nodes directly. They test cluster-level resilience: how your workloads survive node failures, maintenance events, and resource exhaustion at the host level.

node-drain

Cordons a node (marks it unschedulable) and evicts all eligible pods. Simulates a planned maintenance event or node replacement. The operator uses the Kubernetes Eviction API directly; no daemon is required.

```yaml
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: drain-worker-node
  namespace: chaosplane
spec:
  target:
    kind: Node
    labelSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
  action:
    type: node-drain
    parameters:
      timeout: "5m"
      ignoreDaemonSets: "true"
      deleteEmptyDirData: "true"
  duration: 5m
  rollback:
    enabled: true
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| timeout | string | No | "5m" | Max time to wait for pod eviction |
| ignoreDaemonSets | string | No | "true" | Skip DaemonSet pods during eviction |
| deleteEmptyDirData | string | No | "true" | Evict pods with emptyDir volumes |

Rollback: Uncordons the node by setting spec.unschedulable: false. Pods are not automatically rescheduled back to the node.

Implementation: Uses the Kubernetes API directly (no daemon). Cordons via client.Update, evicts via PolicyV1().Evictions().Evict(). Static pods (mirror pods) are always skipped.

> **Warning:** Draining a node in production will reschedule all its pods. Make sure your cluster has enough capacity on other nodes before running this experiment.
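The eviction eligibility rules above (DaemonSet pods skipped when `ignoreDaemonSets` is true, mirror pods always skipped) can be sketched as a small filter. This is an illustrative sketch, not the operator's actual code: the `pod` struct and `eligibleForEviction` helper are simplified stand-ins for the `corev1.Pod` objects the real implementation would inspect.

```go
package main

import "fmt"

// pod is a minimal stand-in for corev1.Pod, carrying only the fields the
// drain rules consult (hypothetical shape for illustration).
type pod struct {
	Name        string
	OwnerKind   string            // e.g. "DaemonSet", "ReplicaSet"
	Annotations map[string]string
}

// eligibleForEviction mirrors the documented drain rules: static (mirror)
// pods are always skipped, and DaemonSet-owned pods are skipped when
// ignoreDaemonSets is set.
func eligibleForEviction(p pod, ignoreDaemonSets bool) bool {
	// Static pods carry the kubernetes.io/config.mirror annotation.
	if _, isMirror := p.Annotations["kubernetes.io/config.mirror"]; isMirror {
		return false
	}
	if ignoreDaemonSets && p.OwnerKind == "DaemonSet" {
		return false
	}
	return true
}

func main() {
	pods := []pod{
		{Name: "app-1", OwnerKind: "ReplicaSet"},
		{Name: "log-agent", OwnerKind: "DaemonSet"},
		{Name: "etcd-node1", Annotations: map[string]string{"kubernetes.io/config.mirror": "abc"}},
	}
	for _, p := range pods {
		fmt.Printf("%s eligible=%v\n", p.Name, eligibleForEviction(p, true))
	}
}
```

Note that with `ignoreDaemonSets: "false"` DaemonSet pods become eligible, but mirror pods are still excluded, since the kubelet would recreate them immediately anyway.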

node-taint

Adds a taint to target nodes. Pods without a matching toleration will be evicted (for NoExecute) or not scheduled (for NoSchedule). Tests pod disruption budgets, toleration configuration, and rescheduling behavior.

```yaml
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: taint-node
  namespace: chaosplane
spec:
  target:
    kind: Node
    names:
      - worker-node-1
  action:
    type: node-taint
    parameters:
      key: "chaos.chaosplane.io/test"
      value: "true"
      effect: "NoSchedule"
  duration: 2m
  rollback:
    enabled: true
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| key | string | Yes | | Taint key |
| value | string | No | "" | Taint value |
| effect | string | Yes | | NoSchedule, NoExecute, or PreferNoSchedule |

Rollback: Removes the taint from the node by filtering it out of spec.taints.

Implementation: Uses the Kubernetes API directly. Idempotent: if the taint already exists with the same key and effect, it's skipped.
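The idempotent add and rollback filter described above amount to slice operations on `spec.taints`. As a sketch (the `taint` struct is a simplified stand-in for `corev1.Taint`; the helper names are not the operator's):

```go
package main

import "fmt"

// taint is a minimal stand-in for corev1.Taint.
type taint struct {
	Key, Value, Effect string
}

// addTaint appends t unless a taint with the same key and effect already
// exists — the documented idempotency rule.
func addTaint(taints []taint, t taint) []taint {
	for _, existing := range taints {
		if existing.Key == t.Key && existing.Effect == t.Effect {
			return taints // already present; skip
		}
	}
	return append(taints, t)
}

// removeTaint filters the chaos taint back out of spec.taints on rollback,
// leaving unrelated taints untouched.
func removeTaint(taints []taint, key, effect string) []taint {
	out := taints[:0:0] // new backing array; don't mutate the input
	for _, existing := range taints {
		if existing.Key == key && existing.Effect == effect {
			continue
		}
		out = append(out, existing)
	}
	return out
}

func main() {
	taints := []taint{{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"}}
	chaos := taint{Key: "chaos.chaosplane.io/test", Value: "true", Effect: "NoSchedule"}
	taints = addTaint(taints, chaos)
	taints = addTaint(taints, chaos) // second add is a no-op
	fmt.Println(len(taints))
	taints = removeTaint(taints, chaos.Key, chaos.Effect)
	fmt.Println(len(taints))
}
```

Matching on key and effect (rather than key alone) means rollback cannot accidentally strip a pre-existing taint that happens to share the chaos key under a different effect.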

node-restart

Triggers a node restart via the daemon. The daemon executes a system reboot on the host. Tests cluster recovery from unexpected node failures.

```yaml
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: restart-node
  namespace: chaosplane
spec:
  target:
    kind: Node
    names:
      - worker-node-2
  action:
    type: node-restart
    parameters:
      grace_period: "30s"
  duration: 10m
  rollback:
    enabled: true
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| grace_period | string | No | "0s" | Grace period before reboot |

Rollback: Waits for the node to return to Ready state (polls every 10 seconds, up to 5 minutes).

Implementation: Uses the daemon's ExecNodeChaos RPC with action: restart. The daemon triggers a system reboot.

> **Danger:** Node restart is destructive. All pods on the node will be terminated. Only use this in clusters where you can afford a node going offline.

node-cpu-stress

Runs CPU stress workers on the node host (not inside a pod cgroup). Tests how node-level CPU saturation affects all workloads running on that node.

```yaml
apiVersion: chaos.chaosplane.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: node-cpu-stress
  namespace: chaosplane
spec:
  target:
    kind: Node
    labelSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
  action:
    type: node-cpu-stress
    parameters:
      workers: "4"
      load: "90"
      duration: "60s"
  duration: 60s
  rollback:
    enabled: true
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| workers | string | No | "1" | Number of CPU stress workers |
| load | string | No | "100" | CPU load percentage (0-100) |
| duration | string | No | experiment duration | How long to stress |

Rollback: Sends CancelChaos to the daemon, which terminates the stress-ng process.

Implementation: Uses the daemon's ExecStressChaos RPC. Unlike pod-cpu-stress, this runs at the node level and affects all pods on the node.
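Since the daemon runs stress-ng, the three parameters map naturally onto stress-ng's `--cpu`, `--cpu-load`, and `--timeout` flags. A hypothetical sketch of that mapping (the `stressArgs` helper is illustrative, not the daemon's actual code; the flags themselves are real stress-ng flags):

```go
package main

import "fmt"

// stressArgs builds a stress-ng invocation from the documented parameters.
func stressArgs(workers, load, duration string) []string {
	return []string{
		"stress-ng",
		"--cpu", workers, // number of CPU stress workers
		"--cpu-load", load, // target load percentage per worker
		"--timeout", duration, // stop after this long
	}
}

func main() {
	// workers: "4", load: "90", duration: "60s" from the example above.
	fmt.Println(stressArgs("4", "90", "60s"))
}
```

Passing `--timeout` means the stress workers self-terminate even if the rollback `CancelChaos` never arrives, which is a useful safety property at the node level.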

Targeting nodes

By label selector

```yaml
target:
  kind: Node
  labelSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
```

By name

```yaml
target:
  kind: Node
  names:
    - worker-node-1
    - worker-node-2
```

Note: node targets do not use namespace (nodes are cluster-scoped).
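`matchLabels` selection is a subset check: every key in the selector must appear on the node with an equal value, where an empty string matches the empty-valued role labels shown above. A sketch of that rule (the helper is illustrative; the real operator delegates to Kubernetes label-selector machinery):

```go
package main

import "fmt"

// matchesSelector reports whether a node's labels satisfy a matchLabels
// selector: every selector key must be present with an equal value.
func matchesSelector(nodeLabels, matchLabels map[string]string) bool {
	for k, v := range matchLabels {
		if got, ok := nodeLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	worker := map[string]string{"node-role.kubernetes.io/worker": ""}
	control := map[string]string{"node-role.kubernetes.io/control-plane": ""}
	sel := map[string]string{"node-role.kubernetes.io/worker": ""}
	fmt.Println(matchesSelector(worker, sel), matchesSelector(control, sel))
}
```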

Steady-state checks

For node chaos, check that your workloads are still running across the cluster:

```yaml
steadyState:
  before:
    - name: deployments-healthy
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 2
  after:
    - name: deployments-recovered
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 2
      recoveryTimeout: 10m
```

Use a longer recoveryTimeout for node chaos since rescheduling takes more time than pod restarts.