Overview
Circuit breakers prevent cascade failures by tracking provider health and automatically stopping traffic to failing providers. Lasso implements per-provider, per-transport circuit breakers with automatic recovery and exponential backoff.
State Machine
Circuit breakers operate in three states: :closed, :open, and :half_open.
State Descriptions
:closed (Healthy)
- Provider is operating normally
- All requests are allowed
- Failures increment counter but don’t block traffic
- Transitions to :open after failure_threshold consecutive failures
:open
- Provider has exceeded failure threshold
- All requests are rejected immediately
- No traffic sent to provider
- Transitions to :half_open after recovery_timeout has elapsed
:half_open
- Provider is testing recovery
- Limited concurrent requests allowed (half_open_max_inflight)
- Success increments recovery counter
- Any failure immediately reopens circuit
- Transitions to :closed after success_threshold consecutive successes
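The state descriptions above can be sketched as a small set of pure functions. This is an illustration only, not Lasso's implementation: the module name, the struct fields, and the success_threshold default of 2 are assumptions made for the example.

```elixir
defmodule BreakerSketch do
  # Hypothetical struct; threshold names mirror the parameters in this document.
  # The success_threshold default of 2 is an assumption for illustration.
  defstruct state: :closed,
            failures: 0,
            successes: 0,
            failure_threshold: 5,
            success_threshold: 2

  # A success while closed resets the consecutive-failure counter.
  def record(%__MODULE__{state: :closed} = b, :ok), do: %{b | failures: 0}

  # A failure while closed increments the counter and opens at the threshold.
  def record(%__MODULE__{state: :closed} = b, :error) do
    failures = b.failures + 1

    if failures >= b.failure_threshold do
      %{b | state: :open, failures: failures, successes: 0}
    else
      %{b | failures: failures}
    end
  end

  # An open circuit rejects traffic, so results are ignored here.
  def record(%__MODULE__{state: :open} = b, _result), do: b

  # A success while half-open counts toward closing the circuit.
  def record(%__MODULE__{state: :half_open} = b, :ok) do
    successes = b.successes + 1

    if successes >= b.success_threshold do
      %{b | state: :closed, failures: 0, successes: 0}
    else
      %{b | successes: successes}
    end
  end

  # Any failure while half-open reopens the circuit immediately.
  def record(%__MODULE__{state: :half_open} = b, :error) do
    %{b | state: :open, successes: 0}
  end

  # Called once recovery_timeout has elapsed; only an open circuit moves.
  def recovery_timeout_elapsed(%__MODULE__{state: :open} = b) do
    %{b | state: :half_open, successes: 0}
  end

  def recovery_timeout_elapsed(%__MODULE__{} = b), do: b
end
```

Keeping the transitions pure makes them easy to test in isolation; a process wrapping this struct would only need to add timers and inflight tracking.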
Circuit Breaker Keying
Circuit breakers are keyed by {instance_id, transport}, so that:
- Same provider instance shared across profiles
- Independent circuit breakers for HTTP and WebSocket
- Deduplication prevents redundant circuit state
Configuration
Circuit breaker behavior is configured via application config.
Configuration Parameters
failure_threshold (default: 5)
- Consecutive failures required to open circuit
- Lower values = more aggressive protection
- Higher values = more tolerance for transient failures
success_threshold
- Consecutive successes required to close from half-open
- Lower values = faster recovery
- Higher values = more conservative recovery
recovery_timeout
- Base timeout before attempting recovery
- Applies to first open episode
- Subsequent reopens use exponential backoff
Maximum recovery timeout
- Maximum timeout after exponential backoff
- Prevents unbounded backoff
- Caps at 10 minutes by default
half_open_max_inflight
- Maximum concurrent requests in half-open state
- Limits blast radius during recovery testing
- Excess requests rejected with :half_open_busy
Per-category thresholds
- Per-error-category failure thresholds
- Overrides failure_threshold for specific error types
- Example: Open faster on network errors (3) than server errors (5)
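To make the override rule concrete, here is a minimal sketch of a settings list and a per-category lookup. The key names :max_recovery_timeout and :category_thresholds, the millisecond units, and the values for success_threshold and half_open_max_inflight are assumptions for illustration; the actual Lasso config shape may differ.

```elixir
defmodule ThresholdSketch do
  # Illustrative settings. :max_recovery_timeout and :category_thresholds are
  # assumed key names, and all timeouts are assumed to be in milliseconds.
  def example_config do
    [
      failure_threshold: 5,
      success_threshold: 2,
      recovery_timeout: 60_000,
      max_recovery_timeout: 600_000,
      half_open_max_inflight: 3,
      category_thresholds: %{network_error: 3, server_error: 5}
    ]
  end

  # Returns the per-category threshold if one is set, else failure_threshold.
  def threshold_for(config, category) do
    default = Keyword.fetch!(config, :failure_threshold)

    config
    |> Keyword.get(:category_thresholds, %{})
    |> Map.get(category, default)
  end
end
```

With these example values, network errors would open the circuit after 3 consecutive failures, while categories without an override fall back to failure_threshold.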
State Transitions
Closed → Open
Triggered when consecutive failures reach the threshold.
Open → Half-Open
Triggered by the recovery timeout (proactive, timer-based recovery) or by traffic-triggered recovery.
Half-Open → Closed
Triggered when consecutive successes reach the threshold.
Half-Open → Open (Reopen)
Triggered by any failure in the half-open state.
Exponential Backoff
On consecutive reopens, the recovery timeout increases exponentially:
| Consecutive Reopens | Multiplier | Recovery Timeout |
|---|---|---|
| 0 | 1 | 60 seconds |
| 1 | 2 | 120 seconds |
| 2 | 4 | 240 seconds |
| 3 | 8 | 480 seconds |
| 4+ | 16 | 600 seconds (capped) |
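The table above is consistent with doubling a 60-second base timeout per consecutive reopen and capping the result at 600 seconds. A minimal sketch of that arithmetic follows; the module and function names are hypothetical, and the millisecond units are an assumption.

```elixir
defmodule BackoffSketch do
  @base_ms 60_000
  @max_ms 600_000

  # Doubles the base timeout for each consecutive reopen, capped at max_ms.
  def recovery_timeout_ms(consecutive_reopens, base_ms \\ @base_ms, max_ms \\ @max_ms)
      when is_integer(consecutive_reopens) and consecutive_reopens >= 0 do
    min(base_ms * Integer.pow(2, consecutive_reopens), max_ms)
  end
end
```

For 4 or more reopens the uncapped value (960 seconds and up) exceeds the maximum, so the cap alone yields the 600-second rows in the table.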
Rate Limit Handling
Rate limit errors receive special treatment:
Retry-After Headers
If the error includes retry_after_ms, that value is used instead of exponential backoff.
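As a sketch of that rule, the function below prefers a provider-supplied retry_after_ms and otherwise falls back to a backoff value computed elsewhere. Representing the error as a map with a :retry_after_ms key is an assumption for the example.

```elixir
defmodule RateLimitSketch do
  # Prefers the provider's retry_after_ms hint when it is a positive integer.
  def recovery_delay_ms(%{retry_after_ms: ms}, _backoff_ms)
      when is_integer(ms) and ms > 0,
      do: ms

  # Otherwise falls back to the exponential backoff value passed in.
  def recovery_delay_ms(_error, backoff_ms), do: backoff_ms
end
```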
Fast Recovery
Rate limit circuits use success_threshold=1 for faster recovery.
No Breaker Penalty
Rate limit errors don’t count toward circuit breaker failures in shared mode.
Health Probe Integration
Health probes signal recovery to circuit breakers:
- :open → Transitions to :half_open if recovery deadline passed
- :half_open → Counts toward success threshold
- :closed → No-op (doesn’t need recovery signals)
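The three cases above can be sketched as one function over a circuit map. The map fields (:successes, :success_threshold, :recover_at) and the use of a plain integer timestamp are assumptions for illustration, not Lasso's data model.

```elixir
defmodule ProbeSketch do
  # `circuit` is a hypothetical map with :state, :successes, :success_threshold
  # and :recover_at (the time, in ms, at which the recovery deadline passes).

  # Open and past the deadline: start testing recovery.
  def probe_success(%{state: :open, recover_at: at} = circuit, now) when now >= at do
    %{circuit | state: :half_open, successes: 0}
  end

  # Half-open: the probe counts toward the success threshold.
  def probe_success(%{state: :half_open} = circuit, _now) do
    successes = circuit.successes + 1

    if successes >= circuit.success_threshold do
      %{circuit | state: :closed, successes: 0}
    else
      %{circuit | successes: successes}
    end
  end

  # Closed, or open before the deadline: nothing to do.
  def probe_success(circuit, _now), do: circuit
end
```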
ETS State Management
Circuit breaker state is written to ETS on every transition:
- Survives GenServer restarts
- Fast reads for provider selection (no GenServer calls)
- Shared across profiles for consistent state
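A minimal sketch of this pattern with the standard :ets module is shown below. The table name, the row shape, and treating an unknown circuit as :closed are assumptions for the example. For state to survive a breaker restart, the table would need to be owned by a separate long-lived process, which this sketch does not show.

```elixir
defmodule CircuitTableSketch do
  # Hypothetical table name.
  @table :circuit_state_sketch

  # Creates a public named table owned by the calling process.
  def init do
    :ets.new(@table, [:named_table, :public, :set, read_concurrency: true])
  end

  # Written on every transition, keyed by {instance_id, transport}.
  def put(instance_id, transport, state) do
    :ets.insert(@table, {{instance_id, transport}, state})
  end

  # Direct table read: no GenServer call on the selection path.
  # Unknown circuits are treated as :closed (an assumption of this sketch).
  def get(instance_id, transport) do
    case :ets.lookup(@table, {instance_id, transport}) do
      [{_key, state}] -> state
      [] -> :closed
    end
  end
end
```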
PubSub Fan-Out
Circuit events are broadcast to all profiles using the instance. Consumers include:
- Dashboard LiveViews (real-time UI updates)
- EventStream (metrics aggregation)
- Telemetry handlers (logging, alerting)
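The fan-out idea can be illustrated with Elixir's built-in Registry used as a local publish/subscribe mechanism. This is a stand-in chosen so the example has no dependencies; it is not a claim about which PubSub library Lasso uses, and the topic shape and message tuple are hypothetical.

```elixir
defmodule FanOutSketch do
  # Hypothetical registry name.
  @registry FanOutSketch.Registry

  # A :duplicate registry lets many processes register under one topic.
  def start, do: Registry.start_link(keys: :duplicate, name: @registry)

  # The calling process will receive events for this instance.
  def subscribe(instance_id) do
    Registry.register(@registry, {:circuit, instance_id}, [])
  end

  # Sends the event to every process subscribed to the instance.
  def broadcast(instance_id, event) do
    Registry.dispatch(@registry, {:circuit, instance_id}, fn entries ->
      for {pid, _value} <- entries do
        send(pid, {:circuit_event, instance_id, event})
      end
    end)
  end
end
```

Because the topic is derived from the instance rather than a profile, every subscriber sees the same event once per transition.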
Admission Control
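The circuit breaker guards requests with admission control.
Admission Logic
- :closed - Allow all requests
- :open - Reject all requests (:circuit_open)
- :half_open - Allow up to half_open_max_inflight concurrent requests; reject the rest (:half_open_busy)
Rejection Reasons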
| Reason | Description |
|---|---|
| :circuit_open | Circuit is open due to failures |
| :half_open_busy | Circuit is half-open but at max inflight |
| :admission_timeout | Admission check timed out (500ms) |
| :not_found | Circuit breaker process not found |
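The rejection reasons above can be sketched as follows. The circuit map shape, the return tuples, and the function names are assumptions for illustration. The guarded/1 wrapper relies on standard GenServer.call/3 behavior, which exits with {:timeout, _} or {:noproc, _}.

```elixir
defmodule AdmissionSketch do
  # `circuit` is a hypothetical map with :state, :inflight and
  # :half_open_max_inflight.
  def admit(%{state: :closed}), do: :allow
  def admit(%{state: :open}), do: {:reject, :circuit_open}

  def admit(%{state: :half_open, inflight: n, half_open_max_inflight: max})
      when n < max,
      do: :allow

  def admit(%{state: :half_open}), do: {:reject, :half_open_busy}

  # Wraps a call to the breaker process and maps call failures to the
  # :admission_timeout and :not_found reasons.
  def guarded(fun) when is_function(fun, 0) do
    fun.()
  catch
    :exit, {:timeout, _} -> {:reject, :admission_timeout}
    :exit, {:noproc, _} -> {:reject, :not_found}
  end
end
```

A caller might use it as `AdmissionSketch.guarded(fn -> GenServer.call(breaker, :admit, 500) end)`, where `breaker` and the `:admit` message are hypothetical.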
Error Classification
Circuit breaker penalties depend on error category:
Retriable Errors (Breaker Penalty)
- :server_error - 5xx status, upstream failure
- :network_error - Connection refused, timeout
- :timeout - Request timeout (except in shared mode)
Non-Retriable Errors (No Penalty)
- :invalid_params - User error, not provider fault
- :user_error - Client mistake
- :client_error - 4xx status
Special Categories
:rate_limit (Retriable, No Penalty in Shared Mode):
- Temporary backpressure
- Known recovery (retry-after headers)
- Fast recovery (success_threshold=1)
:capability_violation (Retriable, No Penalty):
- Permanent constraint, not transient failure
- Provider doesn’t support method/params
- Should failover to different provider
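The classification above can be summarized in two predicates. This sketch reads "No Penalty in Shared Mode" as implying that :rate_limit does carry a penalty outside shared mode, which is an inference from the headings rather than something stated outright; the module name and the :shared_mode option are hypothetical.

```elixir
defmodule PenaltySketch do
  @retriable [:server_error, :network_error, :timeout, :rate_limit, :capability_violation]
  @penalized [:server_error, :network_error, :timeout, :rate_limit]
  @shared_mode_exempt [:timeout, :rate_limit]

  # Whether a failed request may be retried or failed over.
  def retriable?(category), do: category in @retriable

  # Whether the failure counts toward opening the circuit.
  def penalize?(category, opts \\ []) do
    shared_mode = Keyword.get(opts, :shared_mode, false)

    cond do
      shared_mode and category in @shared_mode_exempt -> false
      category in @penalized -> true
      true -> false
    end
  end
end
```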
Telemetry Events
All circuit breaker events emit telemetry.
Event Schema
| Event | Metadata |
|---|---|
| [:lasso, :circuit_breaker, :open] | instance_id, transport, from_state, to_state, reason, error_category, failure_count, recovery_timeout_ms, consecutive_open_count |
| [:lasso, :circuit_breaker, :close] | instance_id, transport, from_state, to_state, reason |
| [:lasso, :circuit_breaker, :half_open] | instance_id, transport, from_state, to_state, reason, consecutive_open_count |
| [:lasso, :circuit_breaker, :proactive_recovery] | instance_id, transport, from_state, to_state, reason, consecutive_open_count |
| [:lasso, :circuit_breaker, :failure] | instance_id, transport, error_category, circuit_state |
| [:lasso, :circuit_breaker, :admit] | instance_id, transport, decision, admit_call_ms |
| [:lasso, :circuit_breaker, :timeout] | instance_id, transport, timeout_ms |
Example Telemetry Handler
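One way a handler for the state-transition events might look is sketched below. It uses `:telemetry.attach_many/4` from the telemetry package, which must be available as a dependency for attach/0 to work. The module name, handler id, and log format are hypothetical; the event names and metadata keys come from the table above.

```elixir
defmodule CircuitTelemetrySketch do
  require Logger

  @events [
    [:lasso, :circuit_breaker, :open],
    [:lasso, :circuit_breaker, :close],
    [:lasso, :circuit_breaker, :half_open]
  ]

  # Requires the :telemetry package; the handler id is arbitrary but unique.
  def attach do
    :telemetry.attach_many(
      "circuit-breaker-logger-sketch",
      @events,
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Standard telemetry handler signature: event, measurements, metadata, config.
  def handle_event(event, _measurements, metadata, _config) do
    Logger.warning(format(event, metadata))
  end

  # Builds a single log line from the metadata keys listed in the schema.
  def format([:lasso, :circuit_breaker, kind], metadata) do
    "circuit #{kind}: instance=#{metadata[:instance_id]} " <>
      "transport=#{metadata[:transport]} " <>
      "#{metadata[:from_state]} -> #{metadata[:to_state]} " <>
      "reason=#{inspect(metadata[:reason])}"
  end
end
```

Passing the handler as a remote capture (`&__MODULE__.handle_event/4`) rather than an anonymous function follows the telemetry library's recommendation for handler performance.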
Best Practices
Tuning Thresholds
Low Traffic (<10 req/s):
Category Thresholds
Half-Open Inflight
Next Steps
Provider Selection
Understand how circuit state affects selection
Routing Strategies
Learn about health-based tiering
Profiles
Configure circuit breaker thresholds
Architecture
Explore shared circuit breaker infrastructure