
Overview

Circuit breakers prevent cascade failures by tracking provider health and automatically stopping traffic to failing providers. Lasso implements per-provider, per-transport circuit breakers with automatic recovery and exponential backoff.

State Machine

Circuit breakers operate in three states:

State Descriptions

:closed (Healthy)
  • Provider is operating normally
  • All requests are allowed
  • Failures increment counter but don’t block traffic
  • Transitions to :open after failure_threshold consecutive failures
:open (Failing)
  • Provider has exceeded failure threshold
  • All requests are rejected immediately
  • No traffic sent to provider
  • Transitions to :half_open after recovery_timeout elapsed
:half_open (Recovering)
  • Provider is testing recovery
  • Limited concurrent requests allowed (half_open_max_inflight)
  • Success increments recovery counter
  • Any failure immediately reopens circuit
  • Transitions to :closed after success_threshold consecutive successes
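For orientation, a minimal sketch of the breaker state is shown below. The module name and field layout are assumptions inferred from the transition snippets later on this page, not the actual struct definition:
defmodule Lasso.CircuitBreaker.State do
  # Illustrative sketch only; the real module and defaults may differ.
  defstruct state: :closed,              # :closed | :open | :half_open
            failure_count: 0,
            success_count: 0,
            inflight_count: 0,
            consecutive_open_count: 0,
            recovery_timer_ref: nil,
            recovery_deadline_ms: nil,
            last_failure_time: nil,
            last_open_error: nil,
            opened_by_category: nil
end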

Circuit Breaker Keying

Circuit breakers are keyed by {instance_id, transport} where:
instance_id = :crypto.hash(:sha256, "#{chain}:#{url}:#{auth_hash}")
             |> Base.encode16(case: :lower)
transport = :http | :ws
Key Properties:
  • Same provider instance shared across profiles
  • Independent circuit breakers for HTTP and WebSocket
  • Deduplication prevents redundant circuit state
Example:
Profile A: uses https://eth.llamarpc.com → instance_id: "abc123"
Profile B: uses https://eth.llamarpc.com → instance_id: "abc123" (same)

Circuit Breakers:
- {:circuit_breaker, "abc123:http"} (shared)
- {:circuit_breaker, "abc123:ws"} (shared)

Configuration

Circuit breaker behavior is configured via application config:
config :lasso, :circuit_breaker,
  failure_threshold: 5,        # Consecutive failures before opening
  success_threshold: 2,        # Consecutive successes to close
  recovery_timeout: 60_000,    # Base recovery timeout in ms (1 minute)
  max_recovery_timeout: 600_000,  # Maximum timeout (10 minutes)
  half_open_max_inflight: 3,   # Max concurrent requests in half-open
  category_thresholds: %{
    server_error: 5,
    network_error: 3,
    timeout: 2,
    auth_error: 2
  }

Configuration Parameters

failure_threshold (default: 5)
  • Consecutive failures required to open circuit
  • Lower values = more aggressive protection
  • Higher values = more tolerance for transient failures
success_threshold (default: 2)
  • Consecutive successes required to close from half-open
  • Lower values = faster recovery
  • Higher values = more conservative recovery
recovery_timeout (default: 60,000ms)
  • Base timeout before attempting recovery
  • Applies to first open episode
  • Subsequent reopens use exponential backoff
max_recovery_timeout (default: 600,000ms)
  • Maximum timeout after exponential backoff
  • Prevents unbounded backoff
  • Caps at 10 minutes by default
half_open_max_inflight (default: 3)
  • Maximum concurrent requests in half-open state
  • Limits blast radius during recovery testing
  • Excess requests rejected with :half_open_busy
category_thresholds (optional)
  • Per-error-category failure thresholds
  • Overrides failure_threshold for specific error types
  • Example: Open faster on network errors (3) than server errors (5)
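At runtime these values come from application config; a minimal read sketch (the defaults shown here are illustrative):
cb_config = Application.get_env(:lasso, :circuit_breaker, [])
failure_threshold = Keyword.get(cb_config, :failure_threshold, 5)
recovery_timeout = Keyword.get(cb_config, :recovery_timeout, 60_000)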

State Transitions

Closed → Open

Triggered when consecutive failures reach threshold:
new_failure_count = state.failure_count + 1
threshold = Map.get(state.category_thresholds, error_category, state.failure_threshold)

if new_failure_count >= threshold do
  # Compute recovery deadline with jitter
  delay = state.base_recovery_timeout
  delay_with_jitter = add_jitter(delay)
  
  # Schedule proactive recovery timer
  timer_ref = Process.send_after(self(), {:attempt_proactive_recovery, gen}, delay_with_jitter)
  
  # Update state
  %{state |
    state: :open,
    failure_count: new_failure_count,
    last_failure_time: now,
    recovery_timer_ref: timer_ref,
    recovery_deadline_ms: now + delay_with_jitter,
    effective_recovery_delay: delay,
    last_open_error: extract_error_info(error)
  }
end
Telemetry Event:
:telemetry.execute([:lasso, :circuit_breaker, :open], %{count: 1}, %{
  instance_id: instance_id,
  transport: transport,
  from_state: :closed,
  to_state: :open,
  reason: :failure_threshold_exceeded,
  error_category: :server_error,
  failure_count: 5,
  recovery_timeout_ms: 60_000
})

Open → Half-Open

Triggered either proactively when the recovery timeout elapses, or lazily when an admission check arrives after the recovery deadline has passed.
Proactive Recovery (timer-based):
def handle_info({:attempt_proactive_recovery, _gen}, state) do
  new_state =
    case state.state do
      :open ->
        %{state |
          state: :half_open,
          success_count: 0,
          inflight_count: 0,
          recovery_timer_ref: nil
        }

      _ ->
        state
    end

  {:noreply, new_state}
end
Traffic-Triggered Recovery (admission check):
def handle_call({:admit, now_ms}, _from, state) do
  case state.state do
    :open ->
      if should_attempt_recovery?(state) do
        # Recovery deadline has passed
        {:reply, {:allow, :half_open}, %{state | state: :half_open}}
      else
        {:reply, {:deny, :open}, state}
      end
    _ ->
      # ...
  end
end

def should_attempt_recovery?(state) do
  case state.recovery_deadline_ms do
    nil -> true
    deadline -> System.monotonic_time(:millisecond) >= deadline
  end
end
Telemetry Event:
:telemetry.execute([:lasso, :circuit_breaker, :half_open], %{count: 1}, %{
  instance_id: instance_id,
  transport: transport,
  from_state: :open,
  to_state: :half_open,
  reason: :proactive_recovery,
  consecutive_open_count: 0
})

Half-Open → Closed

Triggered when consecutive successes reach threshold:
new_success_count = state.success_count + 1

if new_success_count >= effective_success_threshold(state) do
  %{state |
    state: :closed,
    failure_count: 0,
    success_count: 0,
    last_failure_time: nil,
    opened_by_category: nil,
    recovery_timer_ref: nil,
    last_open_error: nil,
    consecutive_open_count: 0,
    recovery_deadline_ms: nil,
    effective_recovery_delay: nil
  }
end

# Rate limit errors use success_threshold=1 for faster recovery
def effective_success_threshold(%{opened_by_category: :rate_limit}), do: 1
def effective_success_threshold(state), do: state.success_threshold
Telemetry Event:
:telemetry.execute([:lasso, :circuit_breaker, :close], %{count: 1}, %{
  instance_id: instance_id,
  transport: transport,
  from_state: :half_open,
  to_state: :closed,
  reason: :recovered
})

Half-Open → Open (Reopen)

Triggered by any failure in half-open state:
# Exponential backoff on consecutive reopens
new_failure_count = state.failure_count + 1
new_consecutive_open_count = state.consecutive_open_count + 1

delay = compute_reopen_delay(state, error, error_category)
# Base: base_recovery_timeout × 2^min(consecutive_open_count, 4)
# Cap: max_recovery_timeout

delay_with_jitter = add_jitter(delay)
timer_ref = Process.send_after(self(), {:attempt_proactive_recovery, gen}, delay_with_jitter)

%{state |
  state: :open,
  failure_count: new_failure_count,
  success_count: 0,
  recovery_timer_ref: timer_ref,
  recovery_deadline_ms: now + delay_with_jitter,
  effective_recovery_delay: delay,
  last_open_error: extract_error_info(error),
  consecutive_open_count: new_consecutive_open_count,
  opened_by_category: error_category
}
Telemetry Event:
:telemetry.execute([:lasso, :circuit_breaker, :open], %{count: 1}, %{
  instance_id: instance_id,
  transport: transport,
  from_state: :half_open,
  to_state: :open,
  reason: :reopen_due_to_failure,
  error_category: :server_error,
  failure_count: 1,
  recovery_timeout_ms: 120_000,
  consecutive_open_count: 1
})

Exponential Backoff

On consecutive reopens, recovery timeout increases exponentially:
base = state.base_recovery_timeout  # 60,000ms
multiplier = trunc(:math.pow(2, min(state.consecutive_open_count, 4)))
delay = min(base * multiplier, state.max_recovery_timeout)
Backoff Schedule:
Consecutive Reopens | Multiplier | Recovery Timeout
0  | 1  | 60 seconds
1  | 2  | 120 seconds
2  | 4  | 240 seconds
3  | 8  | 480 seconds
4+ | 16 | 600 seconds (capped)
Jitter: up to 5% random jitter is added to prevent synchronized recovery storms:
jitter_ms = :rand.uniform(max(1, div(delay_ms, 20)))
delay_ms + jitter_ms
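Putting the two together, the effective reopen delay can be computed roughly as below. The function name is illustrative, not necessarily the real compute_reopen_delay:
def recovery_delay_with_jitter(state) do
  # 2^n growth capped at a 16x multiplier, then capped at max_recovery_timeout
  multiplier = trunc(:math.pow(2, min(state.consecutive_open_count, 4)))
  delay = min(state.base_recovery_timeout * multiplier, state.max_recovery_timeout)
  # add up to ~5% jitter to avoid synchronized recovery attempts
  delay + :rand.uniform(max(1, div(delay, 20)))
end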

Rate Limit Handling

Rate limit errors receive special treatment:

Retry-After Headers

If error includes retry_after_ms, use it instead of exponential backoff:
adjusted_recovery_timeout =
  if error_category == :rate_limit and is_struct(error, JError) do
    extract_retry_after(error) || state.base_recovery_timeout
  else
    state.base_recovery_timeout
  end

def extract_retry_after(%JError{data: data}) when is_map(data) do
  Map.get(data, :retry_after_ms) || Map.get(data, "retry_after_ms")
end

# Fall back to nil (and thus base_recovery_timeout) when no retry-after data is present
def extract_retry_after(_), do: nil

Fast Recovery

Rate limit circuits use success_threshold=1 for faster recovery:
def effective_success_threshold(%{opened_by_category: :rate_limit}), do: 1
def effective_success_threshold(state), do: state.success_threshold

No Breaker Penalty

Rate limit errors don’t count toward circuit breaker failures in shared mode:
def shared_breaker_penalty?(%JError{category: :rate_limit}, %{shared_mode: true}), do: false
This prevents one profile’s rate limit from affecting other profiles sharing the provider.

Health Probe Integration

Health probes signal recovery to circuit breakers:
# ProbeCoordinator signals recovery on successful probe
case execute_health_probe(instance_id) do
  {:ok, _result} ->
    CircuitBreaker.signal_recovery_cast({instance_id, :http})
  {:error, _reason} ->
    CircuitBreaker.record_failure({instance_id, :http}, reason)
end
Signal Recovery:
def signal_recovery_cast(cb_id) do
  GenServer.cast(via_name(cb_id), {:report_external, {:ok, :success}})
  :ok
end
Behavior by State:
  • :open → Transitions to :half_open if recovery deadline passed
  • :half_open → Counts toward success threshold
  • :closed → No-op (doesn’t need recovery signals)
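A rough sketch of how the breaker could apply that external signal is shown below; the clause shape and the record_success/1 helper are assumptions, not the actual implementation:
def handle_cast({:report_external, {:ok, :success}}, state) do
  new_state =
    case state.state do
      :open ->
        if should_attempt_recovery?(state),
          do: %{state | state: :half_open, success_count: 0},
          else: state

      :half_open ->
        # hypothetical helper: counts toward success_threshold and may close the circuit
        record_success(state)

      :closed ->
        state
    end

  {:noreply, new_state}
end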

ETS State Management

Circuit breaker state is written to ETS on every transition:
def write_ets_state(state) do
  :ets.insert(:lasso_instance_state, {
    {:circuit, state.instance_id, state.transport},
    %{
      state: state.state,
      error: state.last_open_error,
      recovery_deadline_ms: state.recovery_deadline_ms
    }
  })
end
State Shape:
{:circuit, instance_id, transport} => %{
  state: :closed | :half_open | :open,
  error: %{code: -32000, category: :server_error, message: "..."} | nil,
  recovery_deadline_ms: 1736894871234 | nil
}
Benefits:
  • Survives GenServer restarts
  • Fast reads for provider selection (no GenServer calls)
  • Shared across profiles for consistent state
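Because the state lives in ETS, provider selection can read it without a GenServer call. A minimal read-path sketch (the helper name and the :closed default for a missing entry are assumptions):
def circuit_state(instance_id, transport) do
  case :ets.lookup(:lasso_instance_state, {:circuit, instance_id, transport}) do
    [{_key, %{state: state}}] -> state
    [] -> :closed
  end
end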

PubSub Fan-Out

Circuit events are broadcast to all profiles using the instance:
refs = Catalog.get_instance_refs(state.instance_id)
# => ["default", "production", "analytics"]

Enum.each(refs, fn profile ->
  provider_id = Catalog.reverse_lookup_provider_id(profile, chain, state.instance_id)
  
  event = {:circuit_breaker_event, %{
    ts: System.system_time(:millisecond),
    profile: profile,
    chain: chain,
    provider_id: provider_id,
    instance_id: state.instance_id,
    transport: state.transport,
    from: from_state,
    to: to_state,
    reason: reason,
    error: error_info,
    source_node_id: Lasso.Cluster.Topology.get_self_node_id(),
    source_node: node()
  }}
  
  Phoenix.PubSub.broadcast(Lasso.PubSub, "circuit:events:#{profile}:#{chain}", event)
end)
Subscribers:
  • Dashboard LiveViews (real-time UI updates)
  • EventStream (metrics aggregation)
  • Telemetry handlers (logging, alerting)
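A subscriber sketch follows; the "default" profile and "ethereum" chain in the topic are placeholder values matching the broadcast format above:
# Subscribe (e.g. in a LiveView mount/3 or a GenServer init/1):
Phoenix.PubSub.subscribe(Lasso.PubSub, "circuit:events:default:ethereum")

# Handle broadcast events:
def handle_info({:circuit_breaker_event, event}, state) do
  Logger.info(
    "circuit #{event.provider_id}/#{event.transport}: #{event.from} -> #{event.to} (#{event.reason})"
  )

  {:noreply, state}
end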

Admission Control

Circuit breaker guards requests with admission control:
case CircuitBreaker.call({instance_id, transport}, fn ->
  execute_request(...)
end) do
  {:executed, result} -> result
  {:rejected, :circuit_open} -> {:error, "Provider unavailable"}
  {:rejected, :half_open_busy} -> {:error, "Provider recovering"}
end

Admission Logic

:closed - Allow all requests:
{:reply, {:allow, :closed}, state}
:open - Check recovery deadline:
if should_attempt_recovery?(state) do
  {:reply, {:allow, :half_open}, %{state | state: :half_open}}
else
  {:reply, {:deny, :open}, state}
end
:half_open - Check inflight capacity:
if state.inflight_count < state.half_open_max_inflight do
  {:reply, {:allow, :half_open}, %{state | inflight_count: state.inflight_count + 1}}
else
  {:reply, {:deny, :half_open_busy}, state}
end

Rejection Reasons

Reason | Description
:circuit_open | Circuit is open due to failures
:half_open_busy | Circuit is half-open but at max inflight
:admission_timeout | Admission check timed out (500ms)
:not_found | Circuit breaker process not found

Error Classification

Circuit breaker penalties depend on error category:

Retriable Errors (Breaker Penalty)

  • :server_error - 5xx status, upstream failure
  • :network_error - Connection refused, timeout
  • :timeout - Request timeout (except in shared mode)

Non-Retriable Errors (No Penalty)

  • :invalid_params - User error, not provider fault
  • :user_error - Client mistake
  • :client_error - 4xx status

Special Categories

:rate_limit (Retriable, No Penalty in Shared Mode):
  • Temporary backpressure
  • Known recovery (retry-after headers)
  • Fast recovery (success_threshold=1)
:capability_violation (Retriable, No Penalty):
  • Permanent constraint, not transient failure
  • Provider doesn’t support method/params
  • Should fail over to a different provider
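Summarizing the rules above, a penalty check might look like the sketch below; the helper name and signature are hypothetical, and the shared-mode rate-limit exception mirrors shared_breaker_penalty?/2 shown earlier:
def breaker_penalty?(error_category, shared_mode?) do
  case error_category do
    cat when cat in [:server_error, :network_error] -> true
    :timeout -> not shared_mode?              # exempt in shared mode
    :rate_limit -> not shared_mode?           # no penalty in shared mode
    :capability_violation -> false            # permanent constraint; fail over instead
    _ -> false                                # user/client errors never penalize the breaker
  end
end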

Telemetry Events

All circuit breaker events emit telemetry:

Event Schema

Event | Metadata
[:lasso, :circuit_breaker, :open] | instance_id, transport, from_state, to_state, reason, error_category, failure_count, recovery_timeout_ms, consecutive_open_count
[:lasso, :circuit_breaker, :close] | instance_id, transport, from_state, to_state, reason
[:lasso, :circuit_breaker, :half_open] | instance_id, transport, from_state, to_state, reason, consecutive_open_count
[:lasso, :circuit_breaker, :proactive_recovery] | instance_id, transport, from_state, to_state, reason, consecutive_open_count
[:lasso, :circuit_breaker, :failure] | instance_id, transport, error_category, circuit_state
[:lasso, :circuit_breaker, :admit] | instance_id, transport, decision, admit_call_ms
[:lasso, :circuit_breaker, :timeout] | instance_id, transport, timeout_ms

Example Telemetry Handler

:telemetry.attach(
  "circuit-breaker-logger",
  [:lasso, :circuit_breaker, :open],
  &handle_circuit_open/4,
  nil
)

def handle_circuit_open(_event_name, _measurements, metadata, _config) do
  Logger.warning(
    "Circuit breaker opened",
    instance_id: metadata.instance_id,
    transport: metadata.transport,
    error_category: metadata.error_category,
    failure_count: metadata.failure_count,
    recovery_timeout_ms: metadata.recovery_timeout_ms
  )
end

Best Practices

Tuning Thresholds

Low Traffic (<10 req/s):
failure_threshold: 3  # Open faster
success_threshold: 1  # Recover faster
recovery_timeout: 30_000  # Shorter timeout
High Traffic (>100 req/s):
failure_threshold: 10  # More tolerance
success_threshold: 3  # More conservative recovery
recovery_timeout: 60_000  # Longer timeout

Category Thresholds

category_thresholds: %{
  server_error: 5,      # Provider-side issues
  network_error: 3,     # Connection problems (open faster)
  timeout: 2,           # Timeout issues (open fastest)
  auth_error: 2         # Auth failures (open fast)
}

Half-Open Inflight

half_open_max_inflight: 3  # Conservative (limits blast radius)
half_open_max_inflight: 10  # Aggressive (faster recovery verification)

Next Steps

  • Provider Selection: Understand how circuit state affects selection
  • Routing Strategies: Learn about health-based tiering
  • Profiles: Configure circuit breaker thresholds
  • Architecture: Explore shared circuit breaker infrastructure