Overview
Lasso’s benchmarking system measures RPC performance passively by recording metrics from production traffic. The system tracks latency, success rates, and error patterns per provider, per method, and per transport to enable intelligent routing decisions.
Architecture
BenchmarkStore
GenServer managing performance benchmarking data using ETS tables.
Location: Lasso.Benchmarking.BenchmarkStore
Key Features:
- Per-(profile, chain) ETS tables for isolated metrics
- Dual timestamp tracking (monotonic + system)
- Automatic cleanup of stale data (24-hour retention)
- Method-specific latency percentiles (P50, P90, P95, P99)
- Profile-scoped isolation for multi-tenancy
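A minimal sketch of what one of these per-(profile, chain) bag tables and its entries might look like. The key layout and entry shape here are assumptions for illustration, not the actual schema:

```elixir
# Illustrative only: a bag table keyed by {provider_id, method}, so each
# key accumulates many metric entries rather than overwriting one row.
table = :ets.new(:rpc_metrics_demo, [:bag, :public])

entry = {
  {"provider_a", "eth_blockNumber"},        # key: {provider_id, method}
  %{
    duration_ms: 42,
    success?: true,
    monotonic_ts: System.monotonic_time(:millisecond),
    system_ts: System.system_time(:millisecond)
  }
}

:ets.insert(table, entry)

# A bag lookup returns every entry recorded under the key:
[{_key, metric}] = :ets.lookup(table, {"provider_a", "eth_blockNumber"})
```

A `bag` table (rather than a `set`) is what allows many samples per key, which is what the percentile calculations below need.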
ETS Table Structure
RPC Metrics Table (bag table)
Dual Timestamp Strategy
Monotonic Timestamp (System.monotonic_time(:millisecond)):
- For internal calculations (cleanup, time windows)
- Not affected by NTP adjustments or clock skew
- Guarantees monotonically increasing values
System Timestamp (System.system_time(:millisecond)):
- For display, logging, and external APIs
- Human-readable (maps to wall-clock time)
- May jump forward/backward due to NTP
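A small sketch of how the two clocks divide responsibilities: the monotonic clock times the request, while the system clock only stamps the record for display (field names are illustrative):

```elixir
# Durations come from the monotonic clock: it never goes backwards, so an
# NTP correction can never produce a negative or wildly wrong latency.
start_mono = System.monotonic_time(:millisecond)
Process.sleep(25)                                 # stand-in for the RPC call
duration_ms = System.monotonic_time(:millisecond) - start_mono

# The wall-clock stamp is only for display and logging; it may jump.
record = %{duration_ms: duration_ms, recorded_at: System.system_time(:millisecond)}
```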
Passive Measurement
Benchmarking occurs automatically on every RPC request; both timestamps are captured at recording time:

monotonic_ts = System.monotonic_time(:millisecond)
system_ts = System.system_time(:millisecond)
- Zero test traffic overhead
- Reflects real-world usage patterns
- Captures method-specific performance
- Includes all error categories
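A hypothetical wrapper showing the passive pattern: every call is timed and recorded, including failures, so no synthetic test traffic is needed. `PassiveBench` and `record/4` are illustrative stand-ins, not the real BenchmarkStore API:

```elixir
defmodule PassiveBench do
  # Times `fun`, classifies its result (success or an error category),
  # records the sample, and returns the original result unchanged.
  def measure(provider_id, method, fun) do
    start = System.monotonic_time(:millisecond)
    result = fun.()
    duration = System.monotonic_time(:millisecond) - start

    outcome =
      case result do
        {:ok, _} -> :success
        {:error, :timeout} -> :timeout
        {:error, _} -> :network_error
      end

    record(provider_id, method, duration, outcome)
    result
  end

  # Stand-in for the real recording call into the benchmark store.
  defp record(_provider, _method, _duration, _outcome), do: :ok
end
```

Because the wrapper is transparent (it returns `result` as-is), callers see no behavioral difference, only the sub-millisecond recording overhead described below.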
Per-Method Tracking
Each (provider_id, method) combination is tracked independently:
Metrics Tracked
Latency Statistics:
- Average duration (moving average)
- Percentiles: P50, P90, P95, P99 (from 100 most recent samples)
Call Counts:
- Total calls
- Successful calls
- Success rate (ratio)
Error Tracking:
- Error counts per category (:timeout, :rate_limit, :network_error, etc.)
Metadata:
- Last updated timestamp
- Sample count
- Hourly statistics
Score Calculation
- Success Rate (0.0 - 1.0): Higher is better
- Latency Factor (0.0 - 1.0): Asymptotic function where 0ms → 1.0, 1000ms → 0.5
- Confidence Factor (log10): More calls → higher confidence (10 calls → 1.0, 100 calls → 2.0, 1000 calls → 3.0)
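The three factors above can be sketched directly from their stated anchor points. The latency curve here (1000 / (1000 + ms)) is an assumption that matches the stated values (0ms → 1.0, 1000ms → 0.5), and combining the factors by multiplication is also an assumption; the document does not specify the composition:

```elixir
defmodule ScoreSketch do
  # Success rate in [0.0, 1.0]; higher is better.
  def success_rate(successes, total) when total > 0, do: successes / total

  # Asymptotic latency factor: 0ms -> 1.0, 1000ms -> 0.5.
  def latency_factor(avg_ms), do: 1000 / (1000 + avg_ms)

  # log10 confidence: 10 calls -> 1.0, 100 -> 2.0, 1000 -> 3.0.
  def confidence(total_calls) when total_calls > 0, do: :math.log10(total_calls)

  # Illustrative composition only.
  def score(successes, total, avg_ms) do
    success_rate(successes, total) * latency_factor(avg_ms) * confidence(total)
  end
end
```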
Percentile Calculation
Percentiles are computed from the 100 most recent latency samples for each (provider, method) combination.
Usage: Provider selection strategies (:fastest, :latency_weighted) use percentiles for robust latency estimates.
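A sketch of percentile calculation over the recent-sample window. The nearest-rank method shown here is an assumption; the exact interpolation BenchmarkStore uses isn't specified:

```elixir
defmodule PercentileSketch do
  def percentile([], _p), do: nil

  # Nearest-rank percentile: sort the window, then index into it.
  def percentile(samples, p) do
    sorted = Enum.sort(samples)
    n = length(sorted)
    index = max(ceil(p / 100 * n) - 1, 0)
    Enum.at(sorted, index)
  end
end

samples = Enum.to_list(1..100)
PercentileSketch.percentile(samples, 50)
PercentileSketch.percentile(samples, 99)
```

Sorting 100 samples per query is what drives the 5-10ms percentile-calculation cost noted under Performance Characteristics.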
Provider Selection Integration
Strategy: :fastest
Selects provider with lowest average latency for the specific method:
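A minimal sketch of this selection, assuming per-provider, per-method average latencies are available (the stats shape is illustrative):

```elixir
# Hypothetical per-method average latencies (ms) per provider.
stats = %{
  "provider_a" => %{"eth_call" => 120, "eth_getLogs" => 900},
  "provider_b" => %{"eth_call" => 95,  "eth_getLogs" => 1100}
}

# Pick the provider with the lowest average latency for this method;
# providers with no data sort last via the :infinity default
# (atoms compare greater than numbers in Erlang term order).
fastest_for = fn method ->
  stats
  |> Enum.min_by(fn {_provider, by_method} -> Map.get(by_method, method, :infinity) end)
  |> elem(0)
end
```

Note how the choice is method-specific: the fastest provider for eth_call need not be the fastest for eth_getLogs.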
Strategy: :latency_weighted
Weighted random selection using inverse latency as probability:
- Distributes load across fast providers
- Avoids overloading single “fastest” provider
- Naturally adapts to performance changes
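A sketch of inverse-latency weighted random choice (the roulette-wheel walk below is one common way to implement it; the actual implementation isn't shown):

```elixir
defmodule LatencyWeighted do
  # Each provider's weight is the inverse of its average latency, so a
  # 50ms provider is picked ~10x as often as a 500ms one, but the slow
  # provider still receives some traffic.
  def pick(latencies) do
    weights = Enum.map(latencies, fn {provider, ms} -> {provider, 1.0 / ms} end)
    total = weights |> Enum.map(&elem(&1, 1)) |> Enum.sum()
    target = :rand.uniform() * total

    # Walk the cumulative weights until the random target falls inside one.
    Enum.reduce_while(weights, target, fn {provider, w}, acc ->
      if acc <= w, do: {:halt, provider}, else: {:cont, acc - w}
    end)
  end
end
```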
Strategy: :load_balanced
Distributes requests across healthy providers with health-aware tiering (benchmarking used for health assessment).
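A hypothetical sketch of health-aware tiering: providers are bucketed by a health signal derived from benchmark success rates, and requests rotate within the best non-empty tier. The thresholds and tier names are illustrative assumptions:

```elixir
defmodule LoadBalancedSketch do
  # providers: [{provider_id, success_rate}]; rr_index: a round-robin counter.
  def pick(providers, rr_index) do
    tiers =
      Enum.group_by(providers, fn {_id, success_rate} ->
        cond do
          success_rate >= 0.95 -> :healthy
          success_rate >= 0.80 -> :degraded
          true -> :unhealthy
        end
      end)

    # Prefer the healthiest non-empty tier, then round-robin inside it.
    best_tier = tiers[:healthy] || tiers[:degraded] || tiers[:unhealthy]
    {id, _rate} = Enum.at(best_tier, rem(rr_index, length(best_tier)))
    id
  end
end
```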
Automatic Cleanup
Time-Based Cleanup
Periodic cleanup removes stale entries:
- Interval: 1 hour (3,600,000ms)
- Retention: 24 hours (86,400,000ms)
Size-Based Cleanup
Enforces a maximum table size to prevent unbounded growth:
- Max Entries: 86,400 per chain (~1 entry per second for 24 hours)
- Check Interval: 10 seconds
Cluster-Wide Aggregation
When BEAM clustering is enabled, metrics are aggregated across nodes.
Location: LassoWeb.Dashboard.MetricsStore
Aggregation Strategy
RPC Calls:
Stale-While-Revalidate Caching
- Cache TTL: 15 seconds
- RPC Timeout: 5 seconds
- Invalidation: Automatic on node connect/disconnect
API Reference
Recording Metrics
Querying Performance Data
Cluster Metrics
Telemetry Events
Benchmarking does not emit telemetry directly (passive recording). Dashboard and selection logic query BenchmarkStore synchronously.
Memory Management
Estimation:
- Compression ratio: ~2-3x
- Realistic memory: 150-250 MB for large deployments
Performance Characteristics
Recording Overhead:
- ETS insert: <0.1ms
- Score update: <0.5ms
- Total per request: <1ms
Query Overhead:
- Single provider lookup: <0.5ms
- Leaderboard (all providers): 2-5ms
- Method performance (all providers): 3-8ms
- Percentile calculation: 5-10ms (sorts 100 samples)
Cleanup Overhead:
- Time-based cleanup: 10-50ms per chain
- Size-based cleanup: 20-100ms (sorts full table)
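The time-based cleanup cost above is dominated by scanning the table. One way such a sweep can be done in a single pass is an ETS select_delete with a match spec; the 3-tuple entry shape here is illustrative, not the actual schema:

```elixir
# Illustrative entries: {key, duration_ms, monotonic_ts}.
table = :ets.new(:cleanup_demo, [:bag, :public])
now = System.monotonic_time(:millisecond)
retention_ms = 24 * 60 * 60 * 1000

:ets.insert(table, {{"p1", "eth_call"}, 40, now})                        # fresh
:ets.insert(table, {{"p1", "eth_call"}, 55, now - retention_ms - 1})     # stale

cutoff = now - retention_ms

# Delete every entry whose monotonic timestamp is older than the cutoff;
# returns the number of deleted entries.
deleted =
  :ets.select_delete(table, [
    {{:_, :_, :"$1"}, [{:<, :"$1", cutoff}], [true]}
  ])
```

Using the monotonic timestamp for the cutoff (rather than wall-clock time) is what keeps retention windows correct across NTP adjustments.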
Best Practices
For High-Traffic Deployments
- Monitor table sizes: Alert if approaching 86,400 entries before cleanup
- Use cluster aggregation: Share metrics across geo-distributed nodes
- Cache leaderboard queries: Refresh every 5-10 seconds, not per request
For Multi-Tenant Deployments
- Profile isolation: Each profile has independent metrics (no cross-contamination)
- Per-profile cleanup: Failed profile doesn’t affect others
- Resource limits: Consider per-profile memory quotas
For Debugging Performance Issues
- Check percentiles: P99 latency reveals tail behavior
- Compare methods: Some methods naturally slower (eth_getLogs vs eth_blockNumber)
- Track error rates: High error rate may indicate provider issues, not latency
- Monitor sample counts: Low counts (<10) mean insufficient data for reliable selection
Related Documentation
- Block Sync - Lag metrics used alongside latency for selection
- Error Classification - Error categories tracked in benchmarking
- WebSocket Subscriptions - Provider selection for failover uses benchmarking