Radar
Running an Ergo node in production typically requires two things: health probes for Kubernetes and a Prometheus metrics endpoint. Setting them up separately means two HTTP servers on two ports, two actor packages to import, and the same wiring code repeated on every node.
Radar bundles both into a single application on one HTTP port. Internally it runs a Health actor for probe endpoints, a Metrics actor for base Ergo telemetry, and a pool of metrics workers for custom metric updates, all behind a shared mux served by one HTTP server. Actors interact with Radar through helper functions in the radar package without importing the underlying packages or knowing the internal actor names.
Adding to Your Node
```go
import (
	"ergo.services/application/radar"
	"ergo.services/ergo"
	"ergo.services/ergo/gen"
)

func main() {
	node, _ := ergo.StartNode("mynode@localhost", gen.NodeOptions{
		Applications: []gen.ApplicationBehavior{
			radar.CreateApp(radar.Options{Port: 9090}),
		},
	})
	// Health:  http://localhost:9090/health/live
	//          http://localhost:9090/health/ready
	//          http://localhost:9090/health/startup
	// Metrics: http://localhost:9090/metrics
	node.Wait()
}
```
With no signals registered, all three health endpoints return 200 with {"status":"healthy"}. The metrics endpoint immediately serves base Ergo metrics. No additional configuration is required for a working production setup.
Configuration
Host determines which network interface the HTTP server binds to. Default is "localhost". Use "0.0.0.0" for containerized environments where probes and scraping come from outside the pod.
Port sets the single HTTP port for all endpoints. Default is 9090. Choose a port that does not conflict with your application's own listeners.
HealthPath sets the URL prefix for health probe endpoints. Default is "/health". The actual endpoints become HealthPath+"/live", HealthPath+"/ready", HealthPath+"/startup". Change this when deploying behind a reverse proxy that expects a different path prefix.
MetricsPath sets the URL path for the Prometheus scrape target. Default is "/metrics".
HealthCheckInterval controls how often the health actor checks for expired heartbeats. Default is 1 second. Shorter intervals detect failures faster but increase internal message traffic. For most applications, 1-2 seconds provides a good balance.
MetricsCollectInterval sets how often base Ergo metrics are collected (processes, memory, CPU, network, events). Default is 10 seconds. Align this with your Prometheus scrape interval; collecting more frequently than Prometheus scrapes wastes CPU; collecting less frequently means Prometheus may see stale values.
MetricsTopN limits the number of entries in per-process and per-event top-N metrics tables. Default is 50. Increase this for large nodes with thousands of processes where you need broader visibility into the tail. The collection cost scales linearly with TopN.
MetricsPoolSize sets the number of worker actors in the custom metrics pool. Default is 3. Under normal load, a single worker is sufficient. Increase this if many actors send frequent metric updates and you observe the metrics mailbox growing.
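Taken together, a fully spelled-out options literal might look like the following sketch. The field names mirror the options described above; representing the interval fields as time.Duration is an assumption.

```go
package main

import (
	"time"

	"ergo.services/application/radar"
	"ergo.services/ergo"
	"ergo.services/ergo/gen"
)

func main() {
	node, _ := ergo.StartNode("mynode@localhost", gen.NodeOptions{
		Applications: []gen.ApplicationBehavior{
			radar.CreateApp(radar.Options{
				Host:                   "0.0.0.0",        // expose outside the pod
				Port:                   9090,             // single port for all endpoints
				HealthPath:             "/health",        // -> /health/live|ready|startup
				MetricsPath:            "/metrics",       // Prometheus scrape target
				HealthCheckInterval:    time.Second,      // heartbeat expiry checks
				MetricsCollectInterval: 10 * time.Second, // align with scrape_interval
				MetricsTopN:            50,               // entries per top-N table
				MetricsPoolSize:        3,                // custom-metrics workers
			}),
		},
	})
	node.Wait()
}
```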
Health Probes
Actors register signals with Radar, specifying which probes the signal affects and an optional heartbeat timeout. The health actor monitors the registering process; if it terminates, all its signals are automatically marked as down.
Registering a Signal
Suppose an actor registers a signal named "postgres" that participates in both the liveness and readiness probes. If its heartbeat stops arriving (the timeout expires) or the registering process terminates, Kubernetes receives a 503 on both /health/live and /health/ready.
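A minimal sketch of that registration from inside an actor's Init callback. The exact RegisterService argument order (process, signal name, probe mask, heartbeat timeout) is an assumption based on the helper descriptions on this page.

```go
// Sketch only: the RegisterService argument order is assumed.
type pg struct {
	act.Actor
}

func (p *pg) Init(args ...any) error {
	// "postgres" affects both /health/live and /health/ready. If no
	// heartbeat arrives within 5s, or this process terminates, both
	// endpoints start returning 503.
	return radar.RegisterService(p, "postgres",
		radar.ProbeLiveness|radar.ProbeReadiness, 5*time.Second)
}
```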
Probe Types
| Probe type | Endpoint |
| --- | --- |
| radar.ProbeLiveness | /health/live |
| radar.ProbeReadiness | /health/ready |
| radar.ProbeStartup | /health/startup |
Combine with bitwise OR. A signal registered for ProbeLiveness|ProbeReadiness affects both endpoints independently.
Manual Signal Control
When you can detect failures immediately without waiting for a timeout:
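For instance, an actor that notices a broken connection can flip the signal at once. The ServiceUp and ServiceDown names come from the helper list below; the (process, signal) argument shape is an assumption.

```go
// Mark the dependency unhealthy immediately (fire-and-forget send);
// no need to wait for the heartbeat timeout to expire.
radar.ServiceDown(p, "postgres")

// ...and restore it once the dependency recovers.
radar.ServiceUp(p, "postgres")
```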
Helper Functions
RegisterService and UnregisterService are synchronous calls that return an error on failure. Heartbeat, ServiceUp, and ServiceDown are asynchronous sends (fire-and-forget).
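In code, the distinction looks roughly like this; the argument shapes are assumptions.

```go
// Synchronous calls: check the returned error.
if err := radar.RegisterService(p, "cache", radar.ProbeReadiness, 3*time.Second); err != nil {
	return err
}
defer radar.UnregisterService(p, "cache") // also synchronous; error ignored here

// Asynchronous sends: fire-and-forget, nothing to check.
radar.Heartbeat(p, "cache")
radar.ServiceUp(p, "cache")
radar.ServiceDown(p, "cache")
```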
For a detailed explanation of the heartbeat model, failure detection mechanisms, and the HTTP response format, see the Health actor documentation.
Custom Metrics
Actors register Prometheus metric collectors and update them through Radar's helper functions. The underlying metrics actor manages the Prometheus registry and HTTP exposition. Registration is synchronous, updates are asynchronous.
All custom metrics automatically receive a node const label set to the node name. Do not include "node" in your variable label names; it will cause a "duplicate label names" registration error.
Registering Metrics
The labels parameter defines the label names for the metric. When updating, you provide label values in the same order. Pass nil for metrics without labels. The buckets parameter in RegisterHistogram defines histogram bucket boundaries; pass nil for Prometheus default buckets.
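For example (RegisterCounter and RegisterGauge are assumed counterparts of the RegisterHistogram helper mentioned above):

```go
// Counter with two variable labels; values are supplied in the same
// order on update. A "node" const label is added automatically.
if err := radar.RegisterCounter(p, "db_queries_total",
	"Total DB queries", []string{"table", "op"}); err != nil {
	return err
}

// Gauge without labels: pass nil.
if err := radar.RegisterGauge(p, "pool_in_use", "Connections in use", nil); err != nil {
	return err
}

// Histogram with explicit buckets; pass nil buckets for Prometheus defaults.
if err := radar.RegisterHistogram(p, "query_seconds", "Query latency in seconds",
	[]string{"table"}, []float64{0.005, 0.05, 0.5, 5}); err != nil {
	return err
}
```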
Updating Metrics
Updates are distributed across the worker pool. Under high throughput, multiple actors can send updates concurrently without contending on a single actor's mailbox.
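Update helpers might be used like this; the names below (IncCounter, SetGauge, ObserveHistogram) are illustrative, not confirmed API.

```go
// All of these are fire-and-forget sends, load-balanced across the pool.
radar.IncCounter(p, "db_queries_total", "users", "select") // label values in order
radar.SetGauge(p, "pool_in_use", 12)
radar.ObserveHistogram(p, "query_seconds", elapsed.Seconds(), "users")
```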
Automatic Cleanup
When a process that registered metrics terminates, all its metrics are automatically unregistered from the Prometheus registry. No explicit cleanup is needed. To remove a metric while the process is still running, use radar.UnregisterMetric(process, name).
Helper Functions
For a detailed explanation of metric types, the Grafana dashboard, and advanced usage (embedding, shared mode), see the Metrics actor documentation.
Top-N Metrics
Top-N metrics track the N highest (or lowest) values observed during each collection cycle and flush them to Prometheus as a GaugeVec. This is useful when you want to identify outliers (slowest queries, busiest workers, largest payloads) without creating a time series per item.
Registering and Observing
Registration is synchronous (returns error). Observations are asynchronous (fire-and-forget). Each top-N metric is managed by a dedicated actor that accumulates observations and flushes the top entries to Prometheus on the same interval as base metrics collection.
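A sketch with assumed helper names (RegisterTopN, ObserveTopN); the ordering-mode constants are described below.

```go
// Keep the 10 slowest queries seen during each collection cycle.
if err := radar.RegisterTopN(p, "slowest_queries",
	"Slowest queries this cycle", 10, radar.TopNMax); err != nil {
	return err
}

// Fire-and-forget observation: an item label plus its observed value.
radar.ObserveTopN(p, "slowest_queries", "SELECT * FROM users", elapsed.Seconds())
```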
Ordering Modes
radar.TopNMax: keeps the N largest values (e.g., slowest queries, busiest actors, highest memory)
radar.TopNMin: keeps the N smallest values (e.g., lowest latency, least active processes)
Automatic Cleanup
When the process that registered a top-N metric terminates, the metric actor cleans up and unregisters from Prometheus. No explicit teardown needed.
Helper Functions
Common Patterns
Database Connection Pool
An actor that manages a connection pool reports both health and metrics through Radar:
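A sketch of the pattern, using the assumed helper shapes from the earlier examples and a self-scheduled "check" message; the SendAfter scheduling call is also an assumption, so adapt it to your setup.

```go
type pool struct {
	act.Actor
	db *sql.DB
}

func (p *pool) HandleMessage(from gen.PID, message any) error {
	if message == "check" {
		if p.db.Ping() == nil {
			radar.Heartbeat(p, "postgres") // keeps live+ready healthy
		} // on failure, stop heartbeating: the timeout marks the signal down

		stats := p.db.Stats()
		radar.SetGauge(p, "pool_in_use", float64(stats.InUse))
		radar.SetGauge(p, "pool_idle", float64(stats.Idle))

		p.SendAfter(p.PID(), "check", 2*time.Second) // schedule next check
	}
	return nil
}
```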
A single periodic check updates both the health signal and connection pool metrics. If the database becomes unreachable, the heartbeat stops and Kubernetes removes the pod from service. The metrics endpoint continues to show the last known pool state until the pod restarts.
Startup Gate with Progress
An actor that runs migrations uses the startup probe to prevent premature traffic, and reports progress via a gauge:
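A sketch of this gate, assuming that a signal registered with a zero heartbeat timeout is not expired automatically and stays down until released with ServiceUp:

```go
func (m *migrator) Init(args ...any) error {
	// Startup-only signal; hold it down until migrations finish.
	if err := radar.RegisterService(m, "migrations", radar.ProbeStartup, 0); err != nil {
		return err
	}
	radar.ServiceDown(m, "migrations") // explicit: the gate starts closed

	if err := radar.RegisterGauge(m, "migrations_remaining", "Pending migrations", nil); err != nil {
		return err
	}

	go func() {
		for i, step := range m.steps {
			step.Run()
			radar.SetGauge(m, "migrations_remaining", float64(len(m.steps)-i-1))
		}
		radar.ServiceUp(m, "migrations") // release the startup gate
	}()
	return nil
}
```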
While migrations run, the startup probe returns 503, Kubernetes waits, and Prometheus shows the remaining migration count. Once complete, the startup signal is released and liveness/readiness probes take over.
Kubernetes Configuration
Configure Kubernetes probes and Prometheus scraping to point at the same port:
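An illustrative pod spec fragment; the container name, periods, and thresholds are placeholders to adapt.

```yaml
containers:
  - name: mynode
    ports:
      - containerPort: 9090
    livenessProbe:
      httpGet: { path: /health/live, port: 9090 }
      periodSeconds: 5
    readinessProbe:
      httpGet: { path: /health/ready, port: 9090 }
      periodSeconds: 5
    startupProbe:
      httpGet: { path: /health/startup, port: 9090 }
      periodSeconds: 2
      failureThreshold: 60   # allow up to ~2 minutes of startup work
```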
Prometheus scrape configuration:
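An illustrative prometheus.yml fragment; the job name and target are placeholders.

```yaml
scrape_configs:
  - job_name: "ergo-node"
    scrape_interval: 10s        # match MetricsCollectInterval
    metrics_path: /metrics
    static_configs:
      - targets: ["mynode:9090"]
```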
Align scrape_interval with MetricsCollectInterval in Radar options. The default collect interval is 10 seconds; scraping more frequently than the collect interval returns identical data.
Relationship to Health and Metrics Actors
Radar uses Health and Metrics actors internally. The helper functions in the radar package delegate to these actors by their internal registered names. If you need capabilities beyond what the helpers expose (embedding the metrics actor for direct Prometheus registry access, custom health actor behavior with HandleSignalDown callbacks, or shared mux with additional HTTP handlers), use the underlying actors directly.
Radar is designed for the common case: production nodes that need standard health probes and Prometheus metrics with minimal setup. For advanced scenarios, the building blocks are available as separate packages.