Overview
Lasso’s benchmarking system measures RPC performance passively by recording metrics from production traffic. The system tracks latency, success rates, and error patterns per provider, per method, and per transport to enable intelligent routing decisions.
Architecture
BenchmarkStore
GenServer managing performance benchmarking data using ETS tables.
Location: Lasso.Benchmarking.BenchmarkStore
Key Features:
- Per-(profile, chain) ETS tables for isolated metrics
- Dual timestamp tracking (monotonic + system)
- Automatic cleanup of stale data (24-hour retention)
- Method-specific latency percentiles (P50, P90, P95, P99)
- Profile-scoped isolation for multi-tenancy
ETS Table Structure
RPC Metrics Table (bag table):
# Table name: :rpc_metrics_{profile}_{chain}
# Entry format:
{monotonic_ts, system_ts, provider_id, method, duration_ms, result}
# Example:
{1736894871234, 1736894871000, "infura", "eth_getLogs", 150, :success}
Score Table (set table):
# Table name: :provider_scores_{profile}_{chain}
# Key: {provider_id, method, :rpc}
# Value: {successes, total, avg_duration, recent_latencies, monotonic_ts, system_ts}
# Example:
{{"alchemy", "eth_getBlockByNumber", :rpc},
245, 250, 125.5, [120, 130, 125, ...], 1736894871234, 1736894871000}
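As a sketch of how these shapes fit together (table names follow the convention above; the table creation options are assumptions, not Lasso's actual code):

```elixir
# Sketch: create the two tables and record one entry in each, using the
# entry shapes documented above. Table options are illustrative.
metrics = :ets.new(:rpc_metrics_default_ethereum, [:bag, :named_table, :public])
scores = :ets.new(:provider_scores_default_ethereum, [:set, :named_table, :public])

monotonic_ts = System.monotonic_time(:millisecond)
system_ts = System.system_time(:millisecond)

:ets.insert(metrics, {monotonic_ts, system_ts, "infura", "eth_getLogs", 150, :success})

:ets.insert(scores, {
  {"alchemy", "eth_getBlockByNumber", :rpc},
  {245, 250, 125.5, [120, 130, 125], monotonic_ts, system_ts}
})
```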
Dual Timestamp Strategy
Monotonic Timestamp (System.monotonic_time(:millisecond)):
- For internal calculations (cleanup, time windows)
- Not affected by NTP adjustments or clock skew
- Guarantees monotonically increasing values
System Timestamp (System.system_time(:millisecond)):
- For display, logging, and external APIs
- Human-readable (maps to wall-clock time)
- May jump forward/backward due to NTP
Rationale: Cleanup and time-window queries use monotonic timestamps to avoid bugs from clock adjustments, while system timestamps provide human-readable context.
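A minimal sketch of the two-clock capture and a retention cutoff built from the monotonic clock only (24-hour retention, per the cleanup settings documented below):

```elixir
# Capture both clocks at the same point in the request path.
monotonic_ts = System.monotonic_time(:millisecond)
system_ts = System.system_time(:millisecond)

# Retention cutoffs compare monotonic values only, so NTP jumps in
# system_ts can never cause premature or missed cleanup.
retention_ms = 24 * 60 * 60 * 1000
cutoff = System.monotonic_time(:millisecond) - retention_ms
stale? = monotonic_ts < cutoff
```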
Passive Measurement
Benchmarking occurs automatically on every RPC request:
# After request completes (success or error)
BenchmarkStore.record_rpc_call(
  profile,
  chain_name,
  provider_id,
  method,
  duration_ms,
  result  # :success | :error | :timeout | :rate_limit | ...
)
Captured at call site (timestamps synchronized):
monotonic_ts = System.monotonic_time(:millisecond)
system_ts = System.system_time(:millisecond)
Benefits:
- Zero test traffic overhead
- Reflects real-world usage patterns
- Captures method-specific performance
- Includes all error categories
Per-Method Tracking
Each (provider_id, method) combination is tracked independently:
Metrics Tracked
Latency Statistics:
- Average duration (moving average)
- Percentiles: P50, P90, P95, P99 (from 100 most recent samples)
Success Metrics:
- Total calls
- Successful calls
- Success rate (ratio)
Error Breakdown:
- By category:
:timeout, :rate_limit, :network_error, etc.
- Error counts per category
Temporal Data:
- Last updated timestamp
- Sample count
- Hourly statistics
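To illustrate how the error breakdown can be derived from raw metrics entries (the entry shape is as documented above; the grouping code itself is an illustrative sketch):

```elixir
# Hypothetical raw entries in the documented shape:
# {monotonic_ts, system_ts, provider_id, method, duration_ms, result}
entries = [
  {1, 1, "infura", "eth_getLogs", 150, :success},
  {2, 2, "infura", "eth_getLogs", 900, :timeout},
  {3, 3, "infura", "eth_getLogs", 120, :success},
  {4, 4, "infura", "eth_getLogs", 80, :rate_limit}
]

# Count non-success results per error category.
error_counts =
  entries
  |> Enum.map(fn {_, _, _, _, _, result} -> result end)
  |> Enum.reject(&(&1 == :success))
  |> Enum.frequencies()
# => %{timeout: 1, rate_limit: 1}
```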
Score Calculation
def calculate_rpc_provider_score(success_rate, avg_latency_ms, total_calls) do
  confidence_factor = :math.log10(max(total_calls, 1))
  latency_factor = if avg_latency_ms > 0, do: 1000 / (1000 + avg_latency_ms), else: 1.0
  success_rate * latency_factor * confidence_factor
end
Components:
- Success Rate (0.0 - 1.0): Higher is better
- Latency Factor (0.0 - 1.0): Asymptotic function where 0ms → 1.0, 1000ms → 0.5
- Confidence Factor (log10): More calls → higher confidence (10 calls → 1.0, 100 calls → 2.0, 1000 calls → 3.0)
Result: Composite score balancing speed, reliability, and statistical confidence.
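A worked example of the formula, using figures of the same order as the leaderboard sample later on this page (98% success rate, 125 ms average latency, 250 calls):

```elixir
success_rate = 0.98
avg_latency_ms = 125
total_calls = 250

confidence_factor = :math.log10(max(total_calls, 1))  # log10(250) ≈ 2.398
latency_factor = 1000 / (1000 + avg_latency_ms)       # 1000/1125 ≈ 0.889
score = success_rate * latency_factor * confidence_factor
# => ≈ 2.09
```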
Percentile Calculation
def calculate_percentiles(latencies) when is_list(latencies) do
  sorted = Enum.sort(latencies)
  count = length(sorted)

  p50_index = max(0, round(count * 0.5) - 1)
  p90_index = max(0, round(count * 0.9) - 1)
  p95_index = max(0, round(count * 0.95) - 1)
  p99_index = max(0, round(count * 0.99) - 1)

  %{
    p50: Enum.at(sorted, p50_index, 0),
    p90: Enum.at(sorted, p90_index, 0),
    p95: Enum.at(sorted, p95_index, 0),
    p99: Enum.at(sorted, p99_index, 0)
  }
end
Sample Window: 100 most recent successful requests per (provider, method) combination.
Usage: Provider selection strategies (:fastest, :latency_weighted) use percentiles for robust latency estimates.
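A worked example of the nearest-rank indexing above, on a 10-sample window for readability (the real window holds up to 100 samples):

```elixir
latencies = [180, 120, 150, 300, 130, 140, 450, 160, 170, 250]
sorted = Enum.sort(latencies)
# => [120, 130, 140, 150, 160, 170, 180, 250, 300, 450]
count = length(sorted)  # 10

p50 = Enum.at(sorted, max(0, round(count * 0.5) - 1))   # index 4 => 160
p99 = Enum.at(sorted, max(0, round(count * 0.99) - 1))  # index 9 => 450
```

With only 10 samples, P99 collapses onto the single worst sample; the larger window smooths this out.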
Provider Selection Integration
Strategy: :fastest
Selects provider with lowest average latency for the specific method:
# Lookup method performance for all candidates
performance_data = BenchmarkStore.get_all_method_performance(profile, chain)
# Filter by method
method_data = Enum.filter(performance_data, fn entry ->
  entry.method == method
end)
# Sort by average latency (ascending)
ranked = Enum.sort_by(method_data, & &1.avg_duration_ms)
# Select top provider
best_provider = List.first(ranked)
Strategy: :latency_weighted
Weighted random selection using inverse latency as probability:
# Calculate weights (inverse latency)
weights = Enum.map(providers, fn p ->
  latency = get_avg_latency(p, method) || 1000
  weight = 1000 / (1000 + latency)
  {p, weight}
end)
# Weighted random selection
selected = weighted_random_choice(weights)
Benefits:
- Distributes load across fast providers
- Avoids overloading single “fastest” provider
- Naturally adapts to performance changes
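The weighted_random_choice/1 helper is not shown in the snippet above; one plausible implementation (an assumption, not Lasso's actual code) walks the cumulative weights:

```elixir
defmodule WeightedChoice do
  @moduledoc "Sketch of weighted random selection over {provider, weight} pairs."

  def weighted_random_choice(weights) do
    total = weights |> Enum.map(fn {_provider, w} -> w end) |> Enum.sum()
    pick = :rand.uniform() * total

    # Walk the list, subtracting each weight until the pick falls inside one.
    Enum.reduce_while(weights, pick, fn {provider, w}, remaining ->
      if remaining <= w, do: {:halt, provider}, else: {:cont, remaining - w}
    end)
  end
end
```

Heavier weights (lower latency) occupy a larger slice of the cumulative range, so faster providers are picked more often without starving the rest.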
Strategy: :load_balanced
Distributes requests across healthy providers with health-aware tiering (benchmarking used for health assessment).
Automatic Cleanup
Time-Based Cleanup
Periodic cleanup removes stale entries:
Interval: 1 hour (3,600,000ms)
Retention: 24 hours (86,400,000ms)
Implementation:
def cleanup_rpc_table_by_monotonic_timestamp(table_name, cutoff_time) do
  match_spec = [
    {{:"$1", :_, :_, :_, :_, :_}, [{:<, :"$1", cutoff_time}], [true]}
  ]

  deleted = :ets.select_delete(table_name, match_spec)
  Logger.debug("Cleaned up #{deleted} old RPC entries from #{table_name}")
end
Uses monotonic timestamp for reliable cleanup (immune to clock adjustments).
Size-Based Cleanup
Enforces maximum table size to prevent unbounded growth:
Max Entries: 86,400 per chain (~1 entry per second for 24 hours)
Check Interval: 10 seconds
Overflow Strategy:
if current_size >= @max_entries_per_chain do
  entries_to_remove = div(@max_entries_per_chain, 10)  # Remove 10%

  # Remove oldest entries
  oldest_entries =
    table_name
    |> :ets.tab2list()
    |> Enum.sort_by(fn {monotonic_ts, _, _, _, _, _} -> monotonic_ts end)
    |> Enum.take(entries_to_remove)

  Enum.each(oldest_entries, &:ets.delete_object(table_name, &1))
end
Rationale: FIFO eviction preserves recent data while preventing memory exhaustion.
Cluster-Wide Aggregation
When BEAM clustering is enabled, metrics are aggregated across nodes.
Location: LassoWeb.Dashboard.MetricsStore
Aggregation Strategy
RPC Calls:
# Query all responding nodes; :rpc.multicall/5 returns {results, bad_nodes}
{node_results, _bad_nodes} = :rpc.multicall(
  responding_nodes,
  BenchmarkStore,
  :get_calls_in_window,
  [profile, chain, 60],
  5_000  # 5s timeout
)

# Sum call counts across nodes
aggregated = Enum.reduce(node_results, %{}, fn node_data, acc ->
  Map.merge(acc, node_data, fn _k, v1, v2 -> v1 + v2 end)
end)
Latency Percentiles:
# Weighted average by call volume
weighted_avg_latency =
(node1_calls * node1_avg + node2_calls * node2_avg) /
(node1_calls + node2_calls)
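With illustrative numbers (200 calls averaging 120 ms on one node, 100 calls averaging 180 ms on another), the weighting works out as:

```elixir
node1_calls = 200
node1_avg = 120.0
node2_calls = 100
node2_avg = 180.0

weighted_avg_latency =
  (node1_calls * node1_avg + node2_calls * node2_avg) /
    (node1_calls + node2_calls)
# (24_000.0 + 18_000.0) / 300 = 140.0
```

The busier node pulls the cluster-wide average toward its own figure, which is why a plain mean of per-node averages would be misleading.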
Percentiles: Cannot be aggregated accurately. Dashboard shows per-node percentiles.
Stale-While-Revalidate Caching
Cache TTL: 15 seconds
RPC Timeout: 5 seconds
Invalidation: Automatic on node connect/disconnect
case cache_lookup(key) do
  {:hit, data, age} when age < 15_000 ->
    # Fresh cache
    {:ok, data, stale: false}

  {:hit, stale_data, _age} ->
    # Serve stale while refreshing in background
    Task.start(fn -> refresh_cache(key) end)
    {:ok, stale_data, stale: true}

  :miss ->
    # Blocking fetch
    data = fetch_from_cluster()
    cache_put(key, data)
    {:ok, data, stale: false}
end
API Reference
Recording Metrics
BenchmarkStore.record_rpc_call(
  profile,      # "default"
  chain_name,   # "ethereum"
  provider_id,  # "alchemy"
  method,       # "eth_getLogs"
  duration_ms,  # 150
  result        # :success | :error | :timeout | :rate_limit
)
Querying Metrics
# Provider leaderboard
BenchmarkStore.get_provider_leaderboard("default", "ethereum")
# => [
# %{provider_id: "alchemy", avg_latency_ms: 125, success_rate: 0.98, score: 2.85},
# %{provider_id: "infura", avg_latency_ms: 150, success_rate: 0.95, score: 2.42}
# ]
# Method-specific performance
BenchmarkStore.get_rpc_method_performance("default", "ethereum", "eth_getLogs")
# => %{
# method: "eth_getLogs",
# providers: [
# %{provider_id: "alchemy", avg_duration_ms: 180, success_rate: 0.97},
# %{provider_id: "infura", avg_duration_ms: 210, success_rate: 0.94}
# ]
# }
# Percentiles for provider + method
BenchmarkStore.get_rpc_method_performance_with_percentiles(
  "default", "ethereum", "alchemy", "eth_getLogs"
)
# => %{
# provider_id: "alchemy",
# method: "eth_getLogs",
# avg_duration_ms: 180,
# success_rate: 0.97,
# percentiles: %{p50: 150, p90: 250, p95: 300, p99: 450},
# total_calls: 1523
# }
Cluster Metrics
# RPS calculation (60-second window)
BenchmarkStore.get_calls_in_window("default", "ethereum", 60)
# => %{"infura" => 150, "alchemy" => 200}
# Cluster-wide aggregated leaderboard
MetricsStore.get_provider_leaderboard("default", "ethereum")
# => %{
# data: [...],
# coverage: %{responding: 3, total: 3},
# stale: false,
# timestamp: 1736894871234
# }
Telemetry Events
Benchmarking does not emit telemetry directly (passive recording). Dashboard and selection logic query BenchmarkStore synchronously.
Memory Management
Estimation:
# Per entry size
entry_size = 8 (ts) + 8 (ts) + 50 (provider_id) + 30 (method) + 8 (duration) + 8 (result)
≈ 112 bytes
# Max entries per chain
max_memory = 86,400 entries * 112 bytes ≈ 9.7 MB per chain
Multi-chain/profile:
# 5 profiles, 10 chains each
total_memory ≈ 5 * 10 * 9.7 MB ≈ 485 MB (worst case)
Actual usage (compressed ETS tables):
- Compression ratio: ~2-3x
- Realistic memory: 150-250 MB for large deployments
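The sizing arithmetic above can be reproduced directly (all figures are this page's estimates, not measured values; MB here means 10^6 bytes):

```elixir
# Per-entry estimate: two timestamps, provider_id, method, duration, result.
entry_bytes = 8 + 8 + 50 + 30 + 8 + 8  # 112

max_entries = 86_400
per_chain_mb = max_entries * entry_bytes / 1_000_000  # ≈ 9.7 MB

# 5 profiles x 10 chains, worst case:
total_mb = 5 * 10 * per_chain_mb  # ≈ 484 MB
```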
Recording Overhead:
- ETS insert: <0.1ms
- Score update: <0.5ms
- Total per request: <1ms
Query Performance:
- Single provider lookup: <0.5ms
- Leaderboard (all providers): 2-5ms
- Method performance (all providers): 3-8ms
- Percentile calculation: 5-10ms (sorts 100 samples)
Cleanup Performance:
- Time-based cleanup: 10-50ms per chain
- Size-based cleanup: 20-100ms (sorts full table)
Best Practices
For High-Traffic Deployments
- Monitor table sizes: Alert if approaching 86,400 entries before cleanup
- Use cluster aggregation: Share metrics across geo-distributed nodes
- Cache leaderboard queries: Refresh every 5-10 seconds, not per request
For Multi-Tenant Deployments
- Profile isolation: Each profile has independent metrics (no cross-contamination)
- Per-profile cleanup: Failed profile doesn’t affect others
- Resource limits: Consider per-profile memory quotas
For Performance Analysis
- Check percentiles: P99 latency reveals tail behavior
- Compare methods: Some methods naturally slower (eth_getLogs vs eth_blockNumber)
- Track error rates: High error rate may indicate provider issues, not latency
- Monitor sample counts: Low counts (<10) mean insufficient data for reliable selection