Steady-State Probes
Steady-state probes are health checks that run before and after chaos injection. They answer the question: "Is my system behaving normally?" If the system isn't healthy before chaos, the experiment won't run. If it doesn't recover after chaos, the experiment is marked failed.
Probe types
ChaosPlane supports three probe types: k8s, http, and prometheus.
Kubernetes probes
Checks that a minimum number of matching Kubernetes resources are in a ready state.
steadyState:
  before:
    - name: pods-ready
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 3
Fields:
| Field | Description |
|---|---|
| resource | Resource type: pods, deployments, nodes |
| namespace | Namespace to query (omit for cluster-scoped resources) |
| labelSelector | Label selector string (e.g. app=my-app,tier=backend) |
| fieldSelector | Field selector string |
| condition.minReady | Minimum number of ready resources required |
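As a rough illustration of the two fields the example above does not use, the sketch below omits namespace for a cluster-scoped resource and adds a fieldSelector. It assumes ChaosPlane passes selector strings to the Kubernetes API unchanged and that condition nests under the k8s block, as the field table suggests; the probe names and replica counts are hypothetical.

```yaml
steadyState:
  before:
    # Cluster-scoped resource: namespace is omitted.
    - name: schedulable-nodes-ready
      type: k8s
      k8s:
        resource: nodes
        # Standard Kubernetes node field selector (assumed to be passed through as-is).
        fieldSelector: spec.unschedulable=false
        condition:
          minReady: 3
    # Namespaced resource narrowed by both selector kinds.
    - name: backend-pods-running
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app,tier=backend
        fieldSelector: status.phase=Running
        condition:
          minReady: 2
```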
HTTP probes
Makes an HTTP request and checks the response.
steadyState:
  before:
    - name: api-healthy
      type: http
      http:
        url: https://api.production.svc.cluster.local/health
        method: GET
        expectedStatus: 200
        expectedBody: '"status":"ok"'
Fields:
| Field | Description |
|---|---|
| url | Full URL to request |
| method | HTTP method (default: GET) |
| expectedStatus | Expected HTTP status code (default: 200) |
| expectedBody | String that must appear in the response body |
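Because method and expectedStatus have defaults, a probe can likely be shortened to just the URL and an optional body check. The sketch below relies on those documented defaults; the service hostname and probe name are hypothetical.

```yaml
steadyState:
  after:
    - name: checkout-health-defaults
      type: http
      http:
        # Hypothetical in-cluster service URL.
        url: http://checkout.production.svc.cluster.local/healthz
        # method is omitted, so GET is used; expectedStatus is omitted, so 200 is expected.
        expectedBody: ok
```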
Prometheus probes
Queries a Prometheus endpoint and evaluates the result against a threshold.
steadyState:
  before:
    - name: error-rate-baseline
      type: prometheus
      prometheus:
        url: http://prometheus.monitoring.svc.cluster.local:9090
        # sum() collapses the per-series rates into the single value the probe requires
        query: 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        condition:
          operator: "<"
          threshold: 0.01
Fields:
| Field | Description |
|---|---|
| url | Prometheus base URL |
| query | PromQL query (must return a scalar or single-value vector) |
| condition.operator | Comparison operator: <, >, <=, >=, ==, != |
| condition.threshold | Numeric threshold to compare against |
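Since the query must yield a single value, a selector that matches several series generally needs an aggregation such as sum() around it. The sketch below shows this with a different operator, using Prometheus's built-in up metric; the job label, probe name, and threshold are assumptions for illustration.

```yaml
steadyState:
  before:
    - name: targets-up
      type: prometheus
      prometheus:
        url: http://prometheus.monitoring.svc.cluster.local:9090
        # up is 1 for each healthy scrape target; sum() reduces it to one number.
        query: 'sum(up{job="my-app"})'
        condition:
          operator: ">="
          threshold: 3
```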
Before and after probes
steadyState:
  before:
    - name: check-before
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 3
  after:
    - name: check-after
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 3
  recoveryTimeout: 5m
before probes run during SteadyStateChecking. If any fail, the experiment moves to Failed without injecting chaos.
after probes run during Recovering. ChaosPlane polls them until they all pass or recoveryTimeout elapses.
Recovery timeout
recoveryTimeout controls how long ChaosPlane waits for after probes to pass. Set it based on how long your system typically takes to recover:
steadyState:
  recoveryTimeout: 10m # wait up to 10 minutes for recovery
If the timeout elapses before all probes pass, the experiment is marked Failed.
Abort conditions
Abort conditions are probes that run continuously during the Running phase. They're defined separately from steady-state probes:
abortConditions:
  - name: error-rate-spike
    type: prometheus
    prometheus:
      url: http://prometheus.monitoring.svc.cluster.local:9090
      # sum() collapses the per-series rates into a single value
      query: 'sum(rate(http_requests_total{status=~"5.."}[1m]))'
      condition:
        operator: ">"
        threshold: 0.05
    action: abort
  - name: pods-too-low
    type: k8s
    k8s:
      resource: pods
      namespace: production
      labelSelector: app=my-app
      condition:
        minReady: 1
    action: rollback
Abort actions:
| Action | Behavior |
|---|---|
| abort | Stop the experiment immediately, skip rollback |
| pause | Pause the experiment (can be resumed) |
| rollback | Stop and execute rollback |
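The pause action is not shown in the example above, so here is a minimal sketch of it. It assumes action sits at the same level as name and type, and reuses the latency histogram metric from elsewhere on this page; the condition name and the one-second threshold are hypothetical.

```yaml
abortConditions:
  - name: p99-latency-degraded
    type: prometheus
    prometheus:
      url: http://prometheus.monitoring.svc.cluster.local:9090
      # sum by (le) keeps only the bucket label, so the quantile is a single value.
      query: 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="my-app"}[1m])))'
      condition:
        operator: ">"
        threshold: 1
    # Pause rather than abort, so the experiment can be resumed.
    action: pause
```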
Multiple probes
You can define multiple probes. All must pass for the phase to succeed:
steadyState:
  before:
    - name: pods-ready
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 3
    - name: api-healthy
      type: http
      http:
        url: http://my-app.production.svc.cluster.local/health
        expectedStatus: 200
    - name: error-rate-low
      type: prometheus
      prometheus:
        url: http://prometheus.monitoring.svc.cluster.local:9090
        # sum() collapses the per-series rates into a single value
        query: 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        condition:
          operator: "<"
          threshold: 0.01
Probe naming
Give probes descriptive names. They appear in experiment status and events, making it easy to understand why an experiment failed:
# Good
- name: payment-service-pods-ready
- name: checkout-api-200-ok
- name: p99-latency-under-500ms
# Less helpful
- name: probe1
- name: check
Common patterns
Minimum replica check
- name: min-replicas
  type: k8s
  k8s:
    resource: pods
    namespace: production
    labelSelector: app=my-app
    condition:
      minReady: 2
Service health endpoint
- name: service-health
  type: http
  http:
    url: http://my-service.production.svc.cluster.local/healthz
    expectedStatus: 200
Error rate threshold
- name: error-rate
  type: prometheus
  prometheus:
    url: http://prometheus:9090
    query: 'sum(rate(http_requests_total{job="my-app",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="my-app"}[5m]))'
    condition:
      operator: "<"
      threshold: 0.05
P99 latency
- name: p99-latency
  type: prometheus
  prometheus:
    url: http://prometheus:9090
    # sum by (le) aggregates buckets across instances so the quantile is a single value
    query: 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="my-app"}[5m])))'
    condition:
      operator: "<"
      threshold: 0.5