Steady-State Probes

Steady-state probes are health checks that run before and after chaos injection. They answer the question: "Is my system behaving normally?" If the system isn't healthy before chaos, the experiment fails without injecting anything. If it doesn't recover within the recovery timeout after chaos, the experiment is also marked failed.

Probe types

ChaosPlane supports three probe types: `k8s`, `http`, and `prometheus`.

Kubernetes probes

Checks that a minimum number of matching Kubernetes resources are in a ready state.

steadyState:
  before:
    - name: pods-ready
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 3

Fields:

Field	Description
`resource`	Resource type: `pods`, `deployments`, `nodes`
`namespace`	Namespace to query (omit for cluster-scoped resources)
`labelSelector`	Label selector string (e.g. `app=my-app,tier=backend`)
`fieldSelector`	Field selector string
`condition.minReady`	Minimum number of ready resources required
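
The `minReady` check can be pictured as a count over the resources matched by the selectors. The following is a minimal sketch, not ChaosPlane's implementation: it assumes readiness means a `Ready` condition with status `"True"`, which is how pods and nodes report it in the Kubernetes API. How ChaosPlane decides readiness for `deployments` is not stated here, so the sketch does not cover them. The function names are hypothetical.

```python
from typing import Iterable, Mapping


def is_ready(resource: Mapping) -> bool:
    """True if the resource has a condition of type Ready with status "True"."""
    conditions = resource.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def min_ready_passes(resources: Iterable[Mapping], min_ready: int) -> bool:
    """Pass when at least min_ready of the matched resources are ready."""
    return sum(1 for r in resources if is_ready(r)) >= min_ready


# Three matched pods, two of which report Ready=True.
pods = [
    {"status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"status": {"conditions": [{"type": "Ready", "status": "False"}]}},
]
print(min_ready_passes(pods, 2))  # True
print(min_ready_passes(pods, 3))  # False
```

With `minReady: 3` and three replicas, a single unready pod fails the probe, so pick a value that reflects the minimum you consider healthy rather than the full replica count.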

HTTP probes

Makes an HTTP request and checks the response.

steadyState:
  before:
    - name: api-healthy
      type: http
      http:
        url: https://api.production.svc.cluster.local/health
        method: GET
        expectedStatus: 200
        expectedBody: '"status":"ok"'

Fields:

Field	Description
`url`	Full URL to request
`method`	HTTP method (default: `GET`)
`expectedStatus`	Expected HTTP status code (default: `200`)
`expectedBody`	String that must appear in the response body
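
To make the two checks concrete, here is a minimal sketch of how a response could be evaluated. It assumes `expectedBody` is a plain, case-sensitive substring match (the table only says the string "must appear" in the body), and `http_probe_passes` is a hypothetical helper, not part of ChaosPlane.

```python
from typing import Optional


def http_probe_passes(
    status: int,
    body: str,
    expected_status: int = 200,
    expected_body: Optional[str] = None,
) -> bool:
    """Pass when the status matches and, if given, expected_body occurs in body."""
    if status != expected_status:
        return False
    return expected_body is None or expected_body in body


# Compact JSON contains the expected fragment.
print(http_probe_passes(200, '{"status":"ok","uptime":12}', expected_body='"status":"ok"'))  # True
# Pretty-printed JSON has a space after the colon, so a substring match misses it.
print(http_probe_passes(200, '{"status": "ok"}', expected_body='"status":"ok"'))  # False
```

If matching does work this way, an `expectedBody` such as `'"status":"ok"'` depends on the exact serialization of the health endpoint, so it is worth checking the raw response before relying on it.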

Prometheus probes

Queries a Prometheus endpoint and evaluates the result against a threshold.

steadyState:
  before:
    - name: error-rate-baseline
      type: prometheus
      prometheus:
        url: http://prometheus.monitoring.svc.cluster.local:9090
        query: 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        condition:
          operator: "<"
          threshold: 0.01

Fields:

Field	Description
`url`	Prometheus base URL
`query`	PromQL query (must return a scalar or single-value vector)
`condition.operator`	Comparison operator: `<`, `>`, `<=`, `>=`, `==`, `!=`
`condition.threshold`	Numeric threshold to compare against
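
The sketch below shows one way the condition could be evaluated against a Prometheus instant-query response. The JSON shape is the one returned by the Prometheus HTTP API (`/api/v1/query`). Whether ChaosPlane calls that endpoint, and how it treats empty or multi-series results, is an assumption here; the helper names are hypothetical.

```python
import operator
from typing import Any, Callable, Dict

# The comparison operators listed in the table above.
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def extract_value(response: Dict[str, Any]) -> float:
    """Pull the single sample out of a scalar or one-element vector result."""
    data = response["data"]
    if data["resultType"] == "scalar":
        return float(data["result"][1])  # [timestamp, "value"]
    if data["resultType"] == "vector" and len(data["result"]) == 1:
        return float(data["result"][0]["value"][1])
    raise ValueError("query must return a scalar or a single-value vector")


def prometheus_probe_passes(response: Dict[str, Any], op: str, threshold: float) -> bool:
    """Compare the query result to the threshold using the named operator."""
    return OPERATORS[op](extract_value(response), threshold)


sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000, "0.003"]}],
    },
}
print(prometheus_probe_passes(sample, "<", 0.01))  # True
```

A query without an aggregation such as `sum(...)` usually returns one series per label combination, which this sketch rejects. That is why the single-value requirement matters when writing the `query`.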

Before and after probes

steadyState:
  before:
    - name: check-before
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 3
  after:
    - name: check-after
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 3
  recoveryTimeout: 5m

`before` probes run during `SteadyStateChecking`. If any fail, the experiment moves to `Failed` without injecting chaos.

`after` probes run during `Recovering`. ChaosPlane polls them until they all pass or `recoveryTimeout` elapses.
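
The polling behavior described for `after` probes can be sketched as a loop with a deadline. This is an illustration only: the polling interval, the function name `wait_for_recovery`, and the injectable `clock` and `sleep` parameters are assumptions made so the example is easy to test, not ChaosPlane settings.

```python
import time
from typing import Callable, Dict

Probe = Callable[[], bool]


def wait_for_recovery(
    probes: Dict[str, Probe],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll every probe until all pass (True) or the timeout elapses (False)."""
    deadline = clock() + timeout_s
    while True:
        if all(probe() for probe in probes.values()):
            return True
        if clock() >= deadline:
            return False
        # Never sleep past the deadline.
        sleep(min(interval_s, max(0.0, deadline - clock())))


# A probe that starts passing on its third poll, as a recovering service might.
calls = {"count": 0}


def pods_ready() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


print(wait_for_recovery({"pods-ready": pods_ready}, timeout_s=1.0, interval_s=0.01))  # True
```

A `True` result corresponds to the experiment leaving `Recovering` successfully, and `False` corresponds to it being marked `Failed`.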

Recovery timeout

`recoveryTimeout` controls how long ChaosPlane waits for `after` probes to pass. Set it based on how long your system typically takes to recover:

steadyState:
  recoveryTimeout: 10m   # wait up to 10 minutes for recovery

If the timeout elapses before all probes pass, the experiment is marked `Failed`.
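
The values `5m` and `10m` look like Go-style duration strings. Assuming that is the format (the accepted units are not listed on this page), a duration could be converted to seconds as follows. `parse_duration` is a hypothetical helper and handles only whole `h`, `m`, and `s` components.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_duration(text: str) -> int:
    """Convert strings such as "5m" or "1h30m" to a number of seconds."""
    position = 0
    total = 0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    # Reject empty input and trailing characters that matched nothing.
    if position == 0 or position != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


print(parse_duration("5m"))   # 300
print(parse_duration("10m"))  # 600
```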

Abort conditions

Abort conditions are probes that run continuously during the `Running` phase. They're defined separately from steady-state probes:

abortConditions:
  - name: error-rate-spike
    type: prometheus
    prometheus:
      url: http://prometheus.monitoring.svc.cluster.local:9090
      query: 'sum(rate(http_requests_total{status=~"5.."}[1m]))'
      condition:
        operator: ">"
        threshold: 0.05
    action: abort

  - name: pods-too-low
    type: k8s
    k8s:
      resource: pods
      namespace: production
      labelSelector: app=my-app
      condition:
        minReady: 1
    action: rollback

Abort actions:

Action	Behavior
`abort`	Stop the experiment immediately, skip rollback
`pause`	Pause the experiment (can be resumed)
`rollback`	Stop and execute rollback
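
The three actions differ in whether the experiment stops, whether rollback runs, and whether the experiment can continue afterwards. The table can be restated as a small lookup. The `AbortOutcome` type and its field names are hypothetical, and the sketch assumes `pause` does not trigger rollback, which the table does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortOutcome:
    stops_experiment: bool
    runs_rollback: bool
    resumable: bool


# One entry per row of the abort actions table.
ABORT_ACTIONS = {
    "abort": AbortOutcome(stops_experiment=True, runs_rollback=False, resumable=False),
    "pause": AbortOutcome(stops_experiment=False, runs_rollback=False, resumable=True),
    "rollback": AbortOutcome(stops_experiment=True, runs_rollback=True, resumable=False),
}


def outcome_for(action: str) -> AbortOutcome:
    """Look up what a triggered abort condition does, rejecting unknown actions."""
    try:
        return ABORT_ACTIONS[action]
    except KeyError:
        raise ValueError(f"unknown abort action: {action!r}") from None


print(outcome_for("rollback").runs_rollback)  # True
print(outcome_for("abort").runs_rollback)     # False
```

The practical difference between `abort` and `rollback` is whether injected chaos is cleaned up automatically, so `abort` leaves that cleanup to you.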

Multiple probes

You can define multiple probes. All must pass for the phase to succeed:

steadyState:
  before:
    - name: pods-ready
      type: k8s
      k8s:
        resource: pods
        namespace: production
        labelSelector: app=my-app
        condition:
          minReady: 3
    - name: api-healthy
      type: http
      http:
        url: http://my-app.production.svc.cluster.local/health
        expectedStatus: 200
    - name: error-rate-low
      type: prometheus
      prometheus:
        url: http://prometheus.monitoring.svc.cluster.local:9090
        query: 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        condition:
          operator: "<"
          threshold: 0.01
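
The "all must pass" rule amounts to a logical AND over the probe results. The sketch below also collects the names of failing probes. It assumes every probe is evaluated even after one fails; whether ChaosPlane stops at the first failure is not specified here. `evaluate_phase` is a hypothetical name.

```python
from typing import Callable, Dict, List, Tuple


def evaluate_phase(probes: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every probe and return (all_passed, names_of_failed_probes)."""
    failed = [name for name, probe in probes.items() if not probe()]
    return (not failed, failed)


# Stand-ins for the three probes in the example above.
results = evaluate_phase(
    {
        "pods-ready": lambda: True,
        "api-healthy": lambda: False,
        "error-rate-low": lambda: True,
    }
)
print(results)  # (False, ['api-healthy'])
```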

Probe naming

Give probes descriptive names. They appear in experiment status and events, making it easy to understand why an experiment failed:

# Good
- name: payment-service-pods-ready
- name: checkout-api-200-ok
- name: p99-latency-under-500ms

# Less helpful
- name: probe1
- name: check

Common patterns

Minimum replica check

- name: min-replicas
  type: k8s
  k8s:
    resource: pods
    namespace: production
    labelSelector: app=my-app
    condition:
      minReady: 2

Service health endpoint

- name: service-health
  type: http
  http:
    url: http://my-service.production.svc.cluster.local/healthz
    expectedStatus: 200

Error rate threshold

- name: error-rate
  type: prometheus
  prometheus:
    url: http://prometheus:9090
    query: 'sum(rate(http_requests_total{job="my-app",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="my-app"}[5m]))'
    condition:
      operator: "<"
      threshold: 0.05
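
The query above divides the 5xx request rate by the total request rate, so the threshold `0.05` is a fraction of traffic (5%) rather than errors per second. The arithmetic is sketched below with made-up rates. `error_ratio` is a hypothetical helper, and it returns NaN when there is no traffic because Prometheus floating-point division of zero by zero also yields NaN.

```python
import math


def error_ratio(error_rps: float, total_rps: float) -> float:
    """Fraction of requests that failed; NaN when there is no traffic at all."""
    if total_rps == 0:
        return math.nan
    return error_rps / total_rps


# 2 failing requests per second out of 100 is 2% of traffic.
ratio = error_ratio(2.0, 100.0)
print(ratio < 0.05)  # True

# With no traffic the ratio is NaN, and every ordering comparison with NaN is false.
idle = error_ratio(0.0, 0.0)
print(idle < 0.05)  # False
```

Under these assumptions, a service receiving no traffic would not satisfy a `<` condition, so it is worth confirming that the target has steady load before the experiment starts.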

P99 latency

- name: p99-latency
  type: prometheus
  prometheus:
    url: http://prometheus:9090
    query: 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="my-app"}[5m])))'
    condition:
      operator: "<"
      threshold: 0.5
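
The threshold `0.5` is in seconds, because the metric is `http_request_duration_seconds`. The value that `histogram_quantile` returns is an estimate interpolated within a histogram bucket, which the simplified sketch below reproduces for classic cumulative buckets. It is an illustration of the general technique, not Prometheus source code, and it omits edge cases such as quantiles outside the range 0 to 1. The bucket counts are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket boundary.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


buckets = [(0.1, 50), (0.25, 90), (0.5, 99), (1.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, buckets)
print(round(p99, 3))
```

Because the estimate can never be more precise than the bucket layout, a threshold works best when it coincides with a bucket boundary of the histogram being queried.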
