Building a Cluster
Building production clusters with Ergo technologies
Ergo provides a complete technology stack for building distributed systems. Service discovery, load balancing, failover, observability - all integrated and working together. No external dependencies except the registrar. No API gateways, service meshes, or orchestration layers between your services.
This chapter shows how to use Ergo technologies to build production clusters. You'll see how service discovery enables automatic load balancing, how the leader actor provides failover, how metrics and Observer give you visibility into cluster state. Each technology solves a specific problem; together they cover the full spectrum of distributed system requirements.
The Integration Cost Problem
Traditional microservice architectures pay a heavy integration tax. Each service needs:
HTTP/gRPC endpoints for communication
Client libraries with retry logic and circuit breakers
Service mesh sidecars for traffic management
API gateways for routing and load balancing
Health check endpoints and probes
Metrics exporters and tracing spans
Configuration management and secret injection
Each layer adds latency, complexity, and failure modes. A simple call between two services traverses client library, sidecar proxy, load balancer, another sidecar, server library. Each hop serializes, deserializes, and can fail independently.
Ergo eliminates these layers. Processes communicate directly through message passing. The framework handles serialization, routing, load balancing, and failure detection. No sidecars, no API gateways, no client libraries.
One network hop. One serialization. Built-in load balancing and failover. This isn't a philosophical difference - it's orders of magnitude less infrastructure to deploy, maintain, and debug.
Service Discovery with Registrars
Service discovery is the foundation of clustering. How does node A find node B? How does a process locate the right service instance? Ergo provides three registrar options, each suited for different scales and requirements.
Embedded Registrar
The embedded registrar requires no external infrastructure. The first node on a host becomes the registrar server; others connect as clients.
Cross-host discovery uses UDP queries. When node 2 needs to reach node 4, it asks its local registrar server (node 1), which queries node 4's host via UDP.
Use for: Development, testing, single-host deployments, simple multi-host setups without firewalls blocking UDP.
Limitations: No application discovery, no configuration management, no event notifications.
etcd Registrar
etcd provides centralized discovery with application routing, configuration management, and event notifications. Nodes register with etcd and maintain leases for automatic cleanup.
etcd registrar capabilities:
Node discovery: find all nodes in the cluster
Application discovery: find which nodes run specific applications
Weighted routing: load balance based on application weights
Configuration: hierarchical config with type conversion
Events: real-time notifications of cluster changes
Use for: Teams already running etcd, clusters up to 50-70 nodes, deployments needing application discovery.
Saturn Registrar
Saturn is purpose-built for Ergo. Instead of polling (like etcd), it maintains persistent connections and pushes updates immediately. Topology changes propagate in milliseconds.
Saturn vs etcd:
Update propagation: etcd polls (seconds); Saturn pushes (milliseconds)
Connection model: etcd uses HTTP requests; Saturn holds persistent TCP connections
Scalability: etcd handles 50-70 nodes; Saturn scales to thousands of nodes
Event latency: etcd delivers at the next poll cycle; Saturn delivers immediately
Infrastructure: etcd is a general-purpose KV store; Saturn is purpose-built for Ergo
Use for: Large clusters, real-time topology awareness, production systems where discovery latency matters.
Application Discovery and Load Balancing
Applications are the unit of deployment in Ergo. A node can load multiple applications, start them with different modes, and register them with the registrar. Other nodes discover applications and route requests based on weights.
Registering Applications
When you start an application, it automatically registers with the registrar (if using etcd or Saturn):
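A minimal sketch of defining, loading, and starting a weighted application (registrar configuration in gen.NodeOptions is omitted; the Weight field on gen.ApplicationSpec and the exact ApplicationStart signature should be verified against the current framework API):

```go
package main

import (
	"ergo.services/ergo"
	"ergo.services/ergo/act"
	"ergo.services/ergo/gen"
)

// apiWorker is a placeholder member process of the application.
type apiWorker struct{ act.Actor }

func factoryAPIWorker() gen.ProcessBehavior { return &apiWorker{} }

// apiApp describes the "api" application.
type apiApp struct{}

func (a *apiApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
	return gen.ApplicationSpec{
		Name:        "api",
		Description: "public API application",
		Weight:      100, // advertised to the registrar for weighted routing
		Group: []gen.ApplicationMemberSpec{
			{Name: "api_worker", Factory: factoryAPIWorker},
		},
	}, nil
}

func (a *apiApp) Start(mode gen.ApplicationMode) {}
func (a *apiApp) Terminate(reason error)         {}

func main() {
	node, err := ergo.StartNode("api1@host1", gen.NodeOptions{})
	if err != nil {
		panic(err)
	}
	name, err := node.ApplicationLoad(&apiApp{})
	if err != nil {
		panic(err)
	}
	// starting the application publishes (name, node, weight) to the registrar
	if err := node.ApplicationStart(name, gen.ApplicationOptions{}); err != nil {
		panic(err)
	}
	node.Wait()
}
```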
The registrar now knows: application "api" is running on this node with weight 100.
Discovering Applications
Other nodes can discover where applications run:
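A sketch of resolving application routes through the registrar (the Resolver, ResolveApplication, and ApplicationRoute names follow the gen interfaces but are worth verifying):

```go
// resolve every node that runs the "api" application
registrar, err := node.Network().Registrar()
if err != nil {
	panic(err)
}
routes, err := registrar.Resolver().ResolveApplication("api")
if err != nil {
	panic(err)
}
for _, route := range routes {
	fmt.Printf("%s runs on %s (weight %d)\n", route.Name, route.Node, route.Weight)
}
```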
Output might show (hypothetical nodes and weights):
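```
api runs on api1@host1 (weight 100)
api runs on api2@host2 (weight 100)
api runs on api3@host3 (weight 50)
```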
Weighted Load Balancing
Weights enable traffic distribution. A node with weight 100 receives twice as much traffic as a node with weight 50. Use this for:
Canary deployments: New version with weight 10, stable with weight 90
Capacity matching: Powerful nodes get higher weights
Graceful draining: Set weight to 0 before maintenance
Routing Requests
Once you know where applications run, route requests using weighted selection:
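A sketch of weighted random selection over the resolved routes (pickRoute is an illustrative helper, not a framework function):

```go
// pickRoute selects a route with probability proportional to its weight.
// Assumes routes is non-empty; weight-0 (drained) instances are never chosen.
func pickRoute(routes []gen.ApplicationRoute) gen.ApplicationRoute {
	total := 0
	for _, r := range routes {
		total += r.Weight
	}
	if total == 0 {
		return routes[0] // all instances drained; treat as "no capacity" in real code
	}
	n := rand.Intn(total)
	for _, r := range routes {
		n -= r.Weight
		if n < 0 {
			return r
		}
	}
	return routes[len(routes)-1]
}

// route a request to the chosen instance by registered name and node
route := pickRoute(routes)
err := process.Send(gen.ProcessID{Name: "api_worker", Node: route.Node}, request)
```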
This is application-level load balancing without external infrastructure. No load balancer service, no sidecar proxies.
Running Multiple Instances for Load Balancing
Horizontal scaling means running the same application on multiple nodes. Each instance handles a portion of traffic. Add nodes to increase capacity; remove nodes to reduce costs.
Deployment Pattern
Each client discovers all api instances and distributes requests based on weights.
Implementation
On each worker node:
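This is just the registration flow from earlier, sketched as a fragment:

```go
// worker node: start the node, then load and start the "api" application
node, err := ergo.StartNode("worker1@host1", gen.NodeOptions{
	// Network/registrar configuration (etcd or Saturn) goes here
})
if err != nil {
	panic(err)
}
name, err := node.ApplicationLoad(&apiApp{}) // apiApp as defined in the registration example
if err != nil {
	panic(err)
}
if err := node.ApplicationStart(name, gen.ApplicationOptions{}); err != nil {
	panic(err)
}
node.Wait()
```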
On coordinator/client nodes:
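On the client side, resolve and dispatch (a sketch reusing the pickRoute helper from above):

```go
// coordinator/client: discover all "api" instances and spread requests
registrar, err := node.Network().Registrar()
if err != nil {
	panic(err)
}
routes, err := registrar.Resolver().ResolveApplication("api")
if err != nil || len(routes) == 0 {
	panic("no api instances registered")
}
route := pickRoute(routes) // weighted selection
process.Send(gen.ProcessID{Name: "api_worker", Node: route.Node}, request)
```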
Scaling Operations
Scale up: Start new node with the same application. It registers with the registrar. Other nodes discover it through events or next resolution.
Scale down: Set weight to 0 (drain), wait for in-flight work, stop the node. Registrar removes the registration when the lease expires.
Reacting to Topology Changes
Subscribe to registrar events to react when instances join or leave:
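A sketch of monitoring registrar events from an actor. How the event is obtained and the event message type are illustrative here, since etcd and Saturn each document their own event messages:

```go
// in the actor's Init: subscribe to registrar events
func (w *watcher) Init(args ...any) error {
	registrar, err := w.Node().Network().Registrar()
	if err != nil {
		return err
	}
	event, err := registrar.Event() // illustrative: see your registrar's docs
	if err != nil {
		return err
	}
	if _, err := w.MonitorEvent(event); err != nil {
		return err
	}
	return nil
}

// in HandleMessage: react to topology changes
func (w *watcher) HandleMessage(from gen.PID, message any) error {
	switch message.(type) {
	case ApplicationRoutesChanged: // hypothetical event message type
		w.refreshRoutes() // re-resolve and rebuild the local routing table
	}
	return nil
}
```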
No polling. No service mesh. Events arrive within milliseconds (Saturn) or at the next poll cycle (etcd).
Running Multiple Instances for Failover
Failover means having standby instances ready to take over when the primary fails. The leader actor implements distributed leader election - exactly one instance is active (leader) while others wait (followers).
The Leader Actor
The leader.Actor from ergo.services/actor/leader implements Raft-based leader election. Embed it in your actor to participate in elections:
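A sketch of the embedding pattern. The callback names (HandleElected, HandleFollower) are assumptions for illustration; check the package for the actual behavior interface:

```go
import (
	"ergo.services/actor/leader" // import path as named above
	"ergo.services/ergo/gen"
)

type coordinator struct {
	leader.Actor // participates in Raft-based elections
}

func factoryCoordinator() gen.ProcessBehavior { return &coordinator{} }

// assumed callback: this instance won the election
func (c *coordinator) HandleElected() error {
	c.Log().Info("became leader; taking over scheduled work")
	return nil
}

// assumed callback: another instance is the leader
func (c *coordinator) HandleFollower(leaderPID gen.PID) error {
	c.Log().Info("following leader %s", leaderPID)
	return nil
}
```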
Election Mechanics
1. All instances start as followers
2. If no heartbeats arrive, a follower becomes a candidate
3. The candidate requests votes from peers
4. Majority vote wins; the candidate becomes leader
5. The leader sends periodic heartbeats
6. If the leader fails, followers detect the timeout and elect a new leader
Failover Scenario
Failover happens automatically. No manual intervention. The surviving nodes elect a new leader within the election timeout (150-300ms by default).
Use Cases
Single-writer coordination: Only the leader writes to prevent conflicts.
Task scheduling: Only the leader runs periodic tasks.
Distributed locks: Leader grants exclusive access.
Quorum and Split-Brain
Leader election requires a majority (quorum) to prevent split-brain:
3 nodes: quorum of 2, tolerates 1 failure
5 nodes: quorum of 3, tolerates 2 failures
7 nodes: quorum of 4, tolerates 3 failures
If a network partition splits 5 nodes into groups of 3 and 2:
The group of 3 can elect a leader (has quorum)
The group of 2 cannot (no quorum)
This prevents both sides from having leaders and making conflicting decisions.
Observability with Metrics
The metrics actor from ergo.services/actor/metrics exposes Prometheus-format metrics. Base metrics are collected automatically; you add custom metrics for application-specific telemetry.
Basic Setup
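A sketch of spawning the metrics actor; the factory and option names are illustrative, so see ergo.services/actor/metrics for the real ones:

```go
import metrics "ergo.services/actor/metrics"

// spawn the metrics actor; it serves Prometheus-format metrics over HTTP
_, err := node.Spawn(metrics.Factory, gen.ProcessOptions{}, metrics.Options{
	Port: 9090, // exposes http://<host>:9090/metrics
})
if err != nil {
	panic(err)
}
```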
This starts an HTTP server at :9090/metrics with base metrics:
ergo_node_uptime_seconds: node uptime
ergo_processes_total: total process count
ergo_processes_running: actively processing
ergo_memory_used_bytes: memory from the OS
ergo_memory_alloc_bytes: heap allocation
ergo_connected_nodes_total: remote connections
ergo_remote_messages_in_total: messages received per node
ergo_remote_messages_out_total: messages sent per node
ergo_remote_bytes_in_total: bytes received per node
ergo_remote_bytes_out_total: bytes sent per node
Custom Metrics
Extend the metrics actor for application-specific telemetry:
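A sketch of the extension pattern; RegisterGauge and SetGauge are illustrative method names:

```go
// custom metrics actor embedding the base one
type appMetrics struct {
	metrics.Actor
}

func factoryAppMetrics() gen.ProcessBehavior { return &appMetrics{} }

func (m *appMetrics) Init(args ...any) error {
	// register application-specific metrics next to the base set
	m.RegisterGauge("jobs_in_flight", "jobs currently being processed") // illustrative API
	return nil
}

// handle updates sent by application processes
func (m *appMetrics) HandleMessage(from gen.PID, message any) error {
	if u, ok := message.(MetricUpdate); ok {
		m.SetGauge(u.Name, u.Value) // illustrative API
	}
	return nil
}
```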
Update metrics from your application:
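Application processes report values by messaging the metrics process (MetricUpdate is the illustrative message type from the sketch above):

```go
// MetricUpdate is a hypothetical message carrying one metric sample
type MetricUpdate struct {
	Name  string
	Value float64
}

// from anywhere in the application:
process.Send("app_metrics", MetricUpdate{Name: "jobs_in_flight", Value: float64(inFlight)})
```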
Prometheus Integration
Now you have cluster-wide visibility: process counts, memory usage, network traffic, custom business metrics - all in Prometheus/Grafana.
Inspecting with Observer
Observer is a web UI for cluster inspection. Run it as an application within your node or as a standalone tool.
Embedding Observer
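A sketch following the observer application's published pattern (verify CreateApp and Options against ergo.services/application/observer):

```go
import observer "ergo.services/application/observer"

// load and start the Observer web application inside the node
name, err := node.ApplicationLoad(observer.CreateApp(observer.Options{Port: 9911}))
if err != nil {
	panic(err)
}
if err := node.ApplicationStart(name, gen.ApplicationOptions{}); err != nil {
	panic(err)
}
```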
Open http://localhost:9911 to see:
Node info: Uptime, memory, CPU, process counts
Network: Connected nodes, acceptors, traffic graphs
Process list: All processes with state, mailbox depth, runtime
Process details: Links, monitors, aliases, environment
Logs: Real-time log stream with filtering
Standalone Observer Tool
For inspecting remote nodes without embedding:
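Assuming the standard Go tooling flow (check the tools/observer README for the exact command and flags):

```
go install ergo.services/tools/observer@latest
observer
```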
Connect to any node in your cluster and inspect its state remotely.
Process Inspection
Observer calls HandleInspect on processes to get internal state:
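A sketch of the callback; the HandleInspect signature below follows the gen behavior interface:

```go
// HandleInspect returns name/value pairs describing internal state;
// Observer polls it to render the process-details view
func (w *worker) HandleInspect(from gen.PID, item ...string) map[string]string {
	return map[string]string{
		"jobs_processed": strconv.Itoa(w.processed),
		"queue_length":   strconv.Itoa(len(w.queue)),
	}
}
```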
This data appears in the Observer UI, updated every second.
Debugging Production Issues
Observer helps diagnose:
Memory leaks: Watch ergo_memory_alloc_bytes; find processes with growing mailboxes
Stuck processes: Check the "Top Running" sort to find processes consuming CPU
Message backlogs: "Top Mailbox" shows processes falling behind
Network issues: Traffic graphs show bytes/messages per remote node
Process relationships: Links and monitors show supervision structure
Remote Operations
Ergo supports starting processes and applications on remote nodes. This enables dynamic workload distribution and orchestration.
Remote Spawn
Start a process on a remote node:
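A sketch of both sides of a remote spawn (EnableSpawn and the RemoteNode.Spawn signature should be verified against the framework docs):

```go
// on the remote node: allow others to spawn "worker" processes here
if err := node.Network().EnableSpawn("worker", factoryWorker); err != nil {
	panic(err)
}

// on the local node: get the remote node and spawn by enabled name
remote, err := node.Network().GetNode("worker1@host1")
if err != nil {
	panic(err)
}
pid, err := remote.Spawn("worker", gen.ProcessOptions{})
if err != nil {
	panic(err)
}
process.Send(pid, task) // talk to it like any other process
```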
The spawned process runs on the remote node but can communicate with any process in the cluster.
Remote Application Start
Start an application on a remote node:
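A sketch of the same pattern for applications (the remote side is assumed to permit it, e.g. via an EnableApplicationStart-style call; signatures are illustrative):

```go
// start the "api" application on another node
remote, err := node.Network().GetNode("worker2@host2")
if err != nil {
	panic(err)
}
if err := remote.ApplicationStart("api"); err != nil {
	panic(err)
}
```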
Use this for:
Dynamic orchestration: Coordinator decides which apps run where
Staged deployment: Start apps in order, waiting for health checks
Capacity management: Start/stop apps based on load
Configuration Management
etcd and Saturn registrars provide cluster-wide configuration with hierarchical overrides.
Configuration Hierarchy
Node-specific overrides cluster-wide, which overrides global.
Typed Configuration
Values are stored as strings with type prefixes:
Configuration Events
React to config changes in real-time:
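A sketch of handling a config-change event inside an actor; ConfigChanged is an illustrative message type, since each registrar documents its own event messages:

```go
func (a *myActor) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case ConfigChanged: // hypothetical registrar event message
		a.Log().Info("config %q changed to %v; applying live", m.Key, m.Value)
		a.applySetting(m.Key, m.Value) // no restart needed
	}
	return nil
}
```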
No restart required. Configuration propagates to all nodes automatically.
Putting It Together
Here's a complete example: a job processing cluster with load balancing, failover, metrics, and observability.
Architecture
Coordinator Node
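A coordinator sketch combining leader election with weighted dispatch (leader.Actor callbacks and route fields as assumed earlier):

```go
// coordinator: only the elected leader dispatches jobs to workers
type coordinator struct {
	leader.Actor
	routes []gen.ApplicationRoute // discovered "jobs" instances
}

func factoryCoordinator() gen.ProcessBehavior { return &coordinator{} }

// assumed callback: fires when this instance wins the election
func (c *coordinator) HandleElected() error {
	registrar, err := c.Node().Network().Registrar()
	if err != nil {
		return err
	}
	c.routes, err = registrar.Resolver().ResolveApplication("jobs")
	return err
}

func (c *coordinator) HandleMessage(from gen.PID, message any) error {
	job, ok := message.(Job)
	if !ok {
		return nil
	}
	route := pickRoute(c.routes) // weighted selection helper from earlier
	return c.Send(gen.ProcessID{Name: "job_worker", Node: route.Node}, job)
}
```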
Worker Node
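A worker sketch: a member process of the "jobs" application that counts work and exposes state to Observer:

```go
// jobWorker processes jobs and reports state via HandleInspect
type jobWorker struct {
	act.Actor
	processed int
}

func factoryJobWorker() gen.ProcessBehavior { return &jobWorker{} }

func (w *jobWorker) HandleMessage(from gen.PID, message any) error {
	job, ok := message.(Job)
	if !ok {
		return nil
	}
	w.handle(job) // application-specific work (not shown)
	w.processed++
	return nil
}

// surfaced in the Observer UI's process-details view
func (w *jobWorker) HandleInspect(from gen.PID, item ...string) map[string]string {
	return map[string]string{"jobs_processed": strconv.Itoa(w.processed)}
}
```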
What You Get
Load balancing: Jobs distribute across workers based on weights
Failover: If coordinator leader fails, another takes over in <300ms
Discovery: Workers auto-register; coordinators discover them via events
Metrics: Prometheus scrapes all nodes for cluster-wide visibility
Inspection: Observer UI shows processes, mailboxes, network traffic
Configuration: Update settings via etcd; changes propagate immediately
All of this with:
No API gateways
No service mesh
No load balancer services
No orchestration layers
No client libraries with retry logic
Just Ergo nodes communicating directly through message passing.
Summary
Ergo provides integrated technologies for building production clusters:
Registrars: service discovery (ergo.services/registrar/etcd, registrar/saturn)
Applications: deployment units with weights (core framework)
Leader Actor: failover via leader election (ergo.services/actor/leader)
Metrics Actor: Prometheus observability (ergo.services/actor/metrics)
Observer: web UI for inspection (ergo.services/application/observer, tools/observer)
Remote Spawn: dynamic process creation (core framework)
Remote App Start: dynamic application deployment (core framework)
Configuration: hierarchical config management (registrar feature)
These components eliminate the integration layers that dominate traditional microservice architectures. Instead of building infrastructure, you build applications.
For implementation details, see the documentation for each package listed above.