Steady-State Probes

Steady-state probes are health checks that run before and after chaos injection. They answer the question: "Is my system behaving normally?" If the system isn't healthy before chaos, the experiment fails without injecting anything. If it doesn't recover after chaos within the recovery timeout, the experiment is marked Failed.

Probe types

ChaosPlane supports three probe types: k8s, http, and prometheus.

Kubernetes probes

Checks that a minimum number of matching Kubernetes resources are in a ready state.

steadyState:
  before:
    - name: pods-ready
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 3

Fields:

| Field | Description |
| --- | --- |
| resource | Resource type: pods, deployments, nodes |
| namespace | Namespace to query (omit for cluster-scoped resources) |
| labelSelector | Label selector string (e.g. app=my-app,tier=backend) |
| fieldSelector | Field selector string |
| condition.minReady | Minimum number of ready resources required |
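
For pods, the Kubernetes API reports readiness as a condition of type Ready with status "True" under status.conditions. The Python sketch below shows what a minReady check amounts to for a list of pods that has already been fetched. It is an illustration only, not ChaosPlane's implementation, and the names is_pod_ready and k8s_probe_passes are hypothetical.

```python
from typing import Any


def is_pod_ready(pod: dict[str, Any]) -> bool:
    """Return True if the pod carries a Ready condition with status "True"."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def k8s_probe_passes(pods: list[dict[str, Any]], min_ready: int) -> bool:
    """Hypothetical helper: pass when at least min_ready matched pods are ready."""
    ready = sum(1 for pod in pods if is_pod_ready(pod))
    return ready >= min_ready


# Three pods matched by the label selector; only two of them are ready.
pods = [
    {"status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"status": {"conditions": [{"type": "Ready", "status": "False"}]}},
]
```

With this sample, a probe with minReady: 2 would pass and one with minReady: 3 would fail, which is why minReady should normally sit at or below the replica count you expect in steady state.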

HTTP probes

Makes an HTTP request and checks the response.

steadyState:
  before:
    - name: api-healthy
      type: http
      http:
        url: https://api.production.svc.cluster.local/health
        method: GET
        expectedStatus: 200
        expectedBody: '"status":"ok"'

Fields:

| Field | Description |
| --- | --- |
| url | Full URL to request |
| method | HTTP method (default: GET) |
| expectedStatus | Expected HTTP status code (default: 200) |
| expectedBody | String that must appear in the response body |
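
The check itself can be pictured as a status comparison followed by a substring test. The sketch below assumes expectedBody is matched as a literal substring (the table only says the string "must appear" in the body, so regex or JSON-aware matching is not assumed). The function http_probe_passes is a hypothetical name used for illustration.

```python
from typing import Optional


def http_probe_passes(
    status: int,
    body: str,
    expected_status: int = 200,
    expected_body: Optional[str] = None,
) -> bool:
    """Hypothetical helper: pass when the status matches and the body contains the string."""
    if status != expected_status:
        return False
    # Assumption: expectedBody is a literal substring, not a pattern.
    if expected_body is not None and expected_body not in body:
        return False
    return True


compact = '{"status":"ok"}'
spaced = '{"status": "ok"}'  # same JSON, different whitespace
```

Under that assumption, a literal match is sensitive to whitespace: '"status":"ok"' appears in the compact body but not in the spaced one. If your service pretty-prints JSON, pick an expectedBody string that survives formatting changes.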

Prometheus probes

Queries a Prometheus endpoint and evaluates the result against a threshold.

steadyState:
  before:
    - name: error-rate-baseline
      type: prometheus
      prometheus:
        url: http://prometheus.monitoring.svc.cluster.local:9090
        # sum() collapses the per-series rates into a single value
        query: 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        condition:
          operator: "<"
          threshold: 0.01

Fields:

| Field | Description |
| --- | --- |
| url | Prometheus base URL |
| query | PromQL query (must return a scalar or single-value vector) |
| condition.operator | Comparison operator: <, >, <=, >=, ==, != |
| condition.threshold | Numeric threshold to compare against |
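
To make the "scalar or single-value vector" requirement concrete, the sketch below extracts one number from a Prometheus instant-query response (the JSON returned by /api/v1/query) and compares it to the threshold. It assumes the comparison reads as "value operator threshold"; single_value and prometheus_probe_passes are hypothetical names, not part of ChaosPlane.

```python
import operator
from typing import Any, Callable

# The six operators listed in the table, mapped to Python comparisons.
OPERATORS: dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def single_value(response: dict[str, Any]) -> float:
    """Extract one number from a Prometheus instant-query response."""
    data = response["data"]
    if data["resultType"] == "scalar":
        # Scalar results have the shape [timestamp, "value"].
        return float(data["result"][1])
    if data["resultType"] == "vector" and len(data["result"]) == 1:
        # Each vector sample carries "value": [timestamp, "value"].
        return float(data["result"][0]["value"][1])
    raise ValueError("query must return a scalar or a single-value vector")


def prometheus_probe_passes(response: dict[str, Any], op: str, threshold: float) -> bool:
    """Assumption: the probe passes when `value <op> threshold` holds."""
    return OPERATORS[op](single_value(response), threshold)


response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000, "0.003"]}],
    },
}
```

A query that returns one series per label combination yields several samples and would be rejected by a check like this, so aggregate with sum(), avg(), or similar until exactly one value remains.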

Before and after probes

steadyState:
  before:
    - name: check-before
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 3
  after:
    - name: check-after
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 3
  recoveryTimeout: 5m

before probes run during SteadyStateChecking. If any fail, the experiment moves to Failed without injecting chaos.

after probes run during Recovering. ChaosPlane polls them until they all pass or recoveryTimeout elapses.
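
The polling behavior during Recovering can be sketched as a loop with a deadline. This is a minimal illustration rather than ChaosPlane's actual code: the function name wait_for_recovery, the 5-second polling interval, and the injectable clock and sleep parameters are all assumptions made for the example.

```python
import time
from typing import Callable


def wait_for_recovery(
    probes: list[Callable[[], bool]],
    timeout_s: float,
    interval_s: float = 5.0,  # assumed interval; the real value is not documented here
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll all probes until every one passes or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if all(probe() for probe in probes):
            return True   # recovered: every after probe passed in the same round
        if clock() >= deadline:
            return False  # timed out: the experiment would be marked Failed
        sleep(interval_s)
```

Note that all probes must pass in the same polling round. A probe that passed earlier and fails again later keeps the loop going until the deadline.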

Recovery timeout

recoveryTimeout controls how long ChaosPlane waits for after probes to pass. Set it based on how long your system typically takes to recover:

steadyState:
  recoveryTimeout: 10m  # wait up to 10 minutes for recovery

If the timeout elapses before all probes pass, the experiment is marked Failed.

Abort conditions

Abort conditions are probes that run continuously during the Running phase. They're defined separately from steady-state probes:

abortConditions:
  - name: error-rate-spike
    type: prometheus
    prometheus:
      url: http://prometheus.monitoring.svc.cluster.local:9090
      # sum() collapses the per-series rates into a single value
      query: 'sum(rate(http_requests_total{status=~"5.."}[1m]))'
      condition:
        operator: ">"
        threshold: 0.05
    action: abort

  - name: pods-too-low
    type: k8s
    k8s:
      resource: pods
      namespace: production
      labelSelector: app=my-app
      condition:
        minReady: 1
    action: rollback

Abort actions:

| Action | Behavior |
| --- | --- |
| abort | Stop the experiment immediately, skip rollback |
| pause | Pause the experiment (can be resumed) |
| rollback | Stop and execute rollback |
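
The difference between the three actions comes down to whether the experiment stops, whether rollback runs, and whether it can continue later. The sketch below encodes the table as a lookup; AbortAction and plan_for are hypothetical names for illustration, and the dictionary keys are not ChaosPlane fields.

```python
from enum import Enum


class AbortAction(Enum):
    ABORT = "abort"
    PAUSE = "pause"
    ROLLBACK = "rollback"


def plan_for(action: AbortAction) -> dict[str, bool]:
    """Translate an action into the consequences listed in the table."""
    return {
        "stop_experiment": action in (AbortAction.ABORT, AbortAction.ROLLBACK),
        "run_rollback": action is AbortAction.ROLLBACK,
        "can_resume": action is AbortAction.PAUSE,
    }
```

Because abort skips rollback, it leaves whatever chaos was injected in place. Reserve it for cases where stopping immediately matters more than cleaning up, and prefer rollback otherwise.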

Multiple probes

You can define multiple probes. All must pass for the phase to succeed:

steadyState:
  before:
    - name: pods-ready
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 3
    - name: api-healthy
      type: http
      http:
        url: http://my-app.production.svc.cluster.local/health
        expectedStatus: 200
    - name: error-rate-low
      type: prometheus
      prometheus:
        url: http://prometheus.monitoring.svc.cluster.local:9090
        # sum() collapses the per-series rates into a single value
        query: 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        condition:
          operator: "<"
          threshold: 0.01
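
The "all must pass" rule is a logical AND over the probe results. The sketch below evaluates every probe and returns the names of those that failed, so a phase succeeds exactly when the list is empty. Whether ChaosPlane evaluates every probe or stops at the first failure is not stated here, so treat that detail, along with the name failed_probes, as an assumption.

```python
from typing import Callable


def failed_probes(probes: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every probe and return the names of the ones that did not pass."""
    return [name for name, check in probes.items() if not check()]


# Stand-in checks keyed by probe name; real probes would query the cluster.
probes = {
    "pods-ready": lambda: True,
    "api-healthy": lambda: False,
    "error-rate-low": lambda: True,
}
```

Here the phase would fail because of api-healthy alone, even though the other two probes pass.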

Probe naming

Give probes descriptive names. They appear in experiment status and events, making it easy to understand why an experiment failed:

# Good
- name: payment-service-pods-ready
- name: checkout-api-200-ok
- name: p99-latency-under-500ms

# Less helpful
- name: probe1
- name: check

Common patterns

Minimum replica check

- name: min-replicas
  type: k8s
  k8s:
    resource: pods
    namespace: production
    labelSelector: app=my-app
    condition:
      minReady: 2

Service health endpoint

- name: service-health
  type: http
  http:
    url: http://my-service.production.svc.cluster.local/healthz
    expectedStatus: 200

Error rate threshold

- name: error-rate
  type: prometheus
  prometheus:
    url: http://prometheus:9090
    query: 'sum(rate(http_requests_total{job="my-app",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="my-app"}[5m]))'
    condition:
      operator: "<"
      threshold: 0.05

P99 latency

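- name: p99-latency
  type: prometheus
  prometheus:
    url: http://prometheus:9090
    # sum by (le) merges the buckets across series so the query returns one value
    query: 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="my-app"}[5m])))'
    condition:
      operator: "<"
      threshold: 0.5  # seconds, i.e. 500 ms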