Overview
What Clustering Provides
- Dashboard aggregation: View metrics across all nodes in a single interface
- Per-region drill-down: Compare provider performance by geographic region
- Cluster health monitoring: Node status, region discovery, and topology visualization
- Circuit breaker visibility: See breaker states across all nodes and regions
What Clustering Does NOT Affect
- Routing decisions: Each node routes independently based on local latency
- Request hot path: No cross-node coordination during request handling
- Circuit breakers: Per-node state, no shared breaker coordination
- Provider selection: Based on local measurements only
Architecture
Lasso uses libcluster with DNS-based node discovery.
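In outline, the topology is supervised alongside the rest of the application. A sketch using libcluster's Cluster.Strategy.DNSPoll strategy (the supervisor name and child ordering are assumptions about Lasso's tree):

```elixir
# Sketch: wiring libcluster's DNS polling strategy into the supervision tree.
topologies = [
  lasso: [
    strategy: Cluster.Strategy.DNSPoll,
    config: [
      polling_interval: 5_000,
      query: System.get_env("CLUSTER_DNS_QUERY"),
      node_basename: System.get_env("CLUSTER_NODE_BASENAME")
    ]
  ]
]

children = [
  {Cluster.Supervisor, [topologies, [name: Lasso.ClusterSupervisor]]}
  # ...remaining Lasso children
]
```

DNSPoll periodically re-resolves the query and connects to any newly discovered peers, which is why a low DNS TTL matters.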
Configuration
Required Environment Variables
CLUSTER_DNS_QUERY and CLUSTER_NODE_BASENAME must both be set for clustering to activate:

| Variable | Description | Example |
|---|---|---|
| CLUSTER_DNS_QUERY | DNS name resolving to all node IPs | lasso.internal |
| CLUSTER_NODE_BASENAME | Erlang node basename for distribution | lasso |
| LASSO_NODE_ID | Unique node identifier (typically region name) | us-east-1 |
If CLUSTER_DNS_QUERY or CLUSTER_NODE_BASENAME is missing, the node runs standalone.
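The standalone fallback can be expressed as a conditional in runtime.exs. A sketch, where the :cluster_topologies key is an assumption about Lasso's config layout:

```elixir
# config/runtime.exs (sketch)
dns_query = System.get_env("CLUSTER_DNS_QUERY")
basename = System.get_env("CLUSTER_NODE_BASENAME")

# Only build a topology when both variables are present;
# otherwise the node boots standalone.
if dns_query && basename do
  config :lasso, :cluster_topologies,
    lasso: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [query: dns_query, node_basename: basename]
    ]
end
```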
Configuration in runtime.exs
The clustering configuration is loaded from environment variables at boot.

DNS Service Discovery
Clustering requires a DNS name that resolves to all node IPs. This is typically provided by:

- Kubernetes: Headless service (returns all pod IPs)
- Consul: Service discovery with DNS interface
- Internal DNS: Custom DNS server resolving to node IPs
- Cloud DNS: AWS Route 53, GCP Cloud DNS, etc.
DNS Requirements
- Multiple A records: DNS query must return all node IPs
- Internal network: Nodes must reach each other on EPMD port (4369) and distribution ports
- TTL: Low TTL for fast node discovery (recommended: 5-30 seconds)
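You can confirm the multiple-A-record requirement from an attached IEx session (the hostname is illustrative):

```elixir
# Expect one tuple per node; a single result means peers will not be discovered.
:inet_res.lookup(~c"lasso.internal", :in, :a)
# e.g. [{10, 0, 1, 5}, {10, 0, 2, 7}, {10, 0, 3, 9}]
```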
Port Requirements
Erlang distribution requires open ports between nodes:

| Port | Protocol | Description |
|---|---|---|
| 4369 | TCP | EPMD (Erlang Port Mapper Daemon) |
| Dynamic | TCP | Distribution ports (typically 9000-9999) |
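To keep firewall rules manageable, the dynamic range can be pinned with Erlang kernel flags, for example in a release's rel/vm.args.eex (the 9000-9999 range matches the table above):

```
## Pin Erlang distribution to a fixed, firewall-friendly port range
-kernel inet_dist_listen_min 9000
-kernel inet_dist_listen_max 9999
```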
Example Configurations
Kubernetes
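A headless Service (clusterIP: None) makes the service DNS name resolve to every pod IP. A minimal sketch; the name, namespace, and labels are assumptions:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: lasso
  namespace: lasso
spec:
  clusterIP: None        # headless: DNS returns all pod IPs, not a single VIP
  selector:
    app: lasso
  ports:
    - name: epmd
      port: 4369
```

Pods would then set CLUSTER_DNS_QUERY=lasso.lasso.svc.cluster.local.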
Docker Compose
For local testing, run multiple containers on a shared network with CLUSTER_DNS_QUERY pointing at a name that resolves to all of them.

VM/Bare Metal with Consul
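On VMs, each host can register a Consul service and rely on Consul's DNS interface. A sketch of the agent service definition (names and the health check are assumptions):

```json
{
  "service": {
    "name": "lasso",
    "port": 4369,
    "check": { "tcp": "localhost:4369", "interval": "10s" }
  }
}
```

Nodes then set CLUSTER_DNS_QUERY=lasso.service.consul (Consul's default service-DNS naming).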
Node Identity
Each node requires a unique LASSO_NODE_ID. Convention: use geographic region names for geo-distributed deployments.
Recommended Naming
| Deployment Pattern | Naming Convention | Examples |
|---|---|---|
| Multi-region | Cloud region codes | us-east-1, eu-west-1, ap-southeast-1 |
| Multi-datacenter | Datacenter abbreviations | iad, lhr, sin |
| Multi-AZ | Availability zones | us-east-1a, us-east-1b |
| Development | Descriptive names | dev-local, staging-1 |
Why Node ID Matters
- Metrics partitioning: State is keyed by {provider_id, node_id}
- Regional comparison: Dashboard groups metrics by region
- Circuit breaker visibility: See which regions have open breakers
- Traffic analysis: Understand request distribution across nodes
Cluster Topology
The Lasso.Cluster.Topology module manages cluster membership.
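The module's public API is not reproduced here; conceptually it exposes the membership map, roughly like this (the function name and return shape are hypothetical):

```elixir
# Hypothetical inspection call; the real function name may differ.
Lasso.Cluster.Topology.list_nodes()
# => [%{node: :"lasso@10.0.1.5", node_id: "us-east-1", state: :responding}, ...]
```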
Node States
| State | Description |
|---|---|
| :connected | Erlang distribution connection established |
| :discovering | Region identification via RPC in progress |
| :responding | Passes health checks, region known |
| :unresponsive | Connected but failing health checks (3+ failures) |
| :disconnected | Previously connected, now offline |
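As a typespec, the lifecycle above reads (a sketch):

```elixir
@type node_state ::
        :connected       # distribution link established
        | :discovering   # region RPC in flight
        | :responding    # healthy, region known
        | :unresponsive  # 3+ failed health checks
        | :disconnected  # previously seen, now offline
```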
Health Checks
- Interval: 15 seconds
- Timeout: 5 seconds
- Failure threshold: 3 consecutive failures → :unresponsive
- Method: :rpc.multicall/4 to all connected nodes
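A sweep along these lines would implement the check (Lasso.Cluster.Health.ping/0 is a hypothetical callback; the five-argument :rpc.multicall variant is used here to carry the 5-second timeout explicitly):

```elixir
# Returns {replies, bad_nodes}; nodes in bad_nodes count toward the
# consecutive-failure threshold.
{replies, bad_nodes} =
  :rpc.multicall(Node.list(), Lasso.Cluster.Health, :ping, [], 5_000)
```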
Topology Events
The topology module broadcasts events via Phoenix PubSub on the cluster:topology topic.
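Any process can subscribe to these events (the PubSub server name and message shapes below are assumptions):

```elixir
Phoenix.PubSub.subscribe(Lasso.PubSub, "cluster:topology")

# Then handle messages in handle_info/2, e.g. (shapes are hypothetical):
#   {:node_up, node}
#   {:node_state_changed, node, :unresponsive}
```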
Dashboard Integration
The dashboard aggregates metrics from all responding nodes.

MetricsStore
LassoWeb.Dashboard.MetricsStore provides cluster-wide metrics with stale-while-revalidate caching:
- TTL: 15 seconds
- RPC timeout: 5 seconds
- Invalidation: Automatic on node connect/disconnect
- Aggregation: Weighted averages by call volume
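Weighted averaging means a node's latency contributes in proportion to its call volume, so a low-traffic outlier cannot skew the cluster-wide number. A self-contained sketch (the avg_ms/calls entry shape is an assumption):

```elixir
defmodule WeightedAvg do
  @doc "Combines per-node averages into one average weighted by call count."
  def combine(entries) do
    total_calls = entries |> Enum.map(& &1.calls) |> Enum.sum()

    if total_calls == 0 do
      0.0
    else
      weighted_sum =
        entries
        |> Enum.map(fn %{avg_ms: avg, calls: calls} -> avg * calls end)
        |> Enum.sum()

      weighted_sum / total_calls
    end
  end
end

# WeightedAvg.combine([%{avg_ms: 100.0, calls: 900}, %{avg_ms: 500.0, calls: 100}])
# => 140.0
```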
Regional Drill-Down
The dashboard groups metrics by node_id for regional comparison:

- View aggregate performance across all regions
- Drill into specific regions to identify geographic issues
- Compare provider performance region-by-region
- See which regions have circuit breakers open
Troubleshooting
Nodes Not Connecting
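First confirm EPMD is reachable (epmd -names lists registered nodes) and that the DNS query returns every node IP; then check distribution directly from an attached IEx session (the peer name is illustrative):

```elixir
Node.self()                   # should match CLUSTER_NODE_BASENAME@<ip-or-host>
Node.list()                   # connected peers; [] means discovery is failing
Node.ping(:"lasso@10.0.1.5")  # :pong on success, :pang if unreachable
```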
Nodes Becoming Unresponsive
Check node health. Common causes:

- Network partitions
- High CPU/memory usage preventing health check responses
- Firewall blocking distribution ports
Best Practices
- Use stable node IDs: Don't change LASSO_NODE_ID after deployment
- Monitor cluster health: Watch for nodes in :unresponsive state
- Plan for network partitions: Nodes gracefully degrade to standalone mode
- Use internal DNS: Don't expose Erlang distribution to the public internet
- Test failover: Verify the dashboard still works when nodes disconnect
Next Steps
- Geo-Distributed Deployment - Multi-region architecture
- Production Checklist - Pre-launch verification
- Architecture - Understand cluster design