Metrics

The metrics actor provides observability for Ergo applications by collecting and exposing runtime statistics in Prometheus format. Instead of manually instrumenting your code with counters and gauges scattered throughout, the metrics actor centralizes telemetry into a single process that exposes an HTTP endpoint for Prometheus to scrape.

This approach separates monitoring concerns from application logic. Your actors focus on business functionality while the metrics actor handles collection, aggregation, and exposure of operational data. Prometheus or compatible monitoring systems poll the /metrics endpoint periodically, building time-series data for alerting and visualization.

Why Monitor Actors

Actor systems present unique monitoring challenges. Traditional thread-based applications have predictable resource usage patterns - you monitor thread pools, request queues, and database connections. Actor systems are more dynamic - processes spawn and terminate constantly, messages flow asynchronously through mailboxes, and work distribution depends on supervision trees and message routing.

The metrics actor addresses this by tracking:

Process metrics - How many processes exist, how many are running vs. idle vs. zombie. This reveals whether your node is under load or experiencing process leaks.

Mailbox metrics - Queue depth and latency for every process on the node. Depth shows how many messages are waiting in each mailbox; latency shows how long the oldest message has been waiting. Together they answer whether actors are keeping up with their workload and which specific processes are falling behind.

Utilization and throughput metrics - How much time each process spends executing callbacks relative to its lifetime, and how many messages flow through the node per second. These reveal compute-bound actors, idle capacity, and overall system throughput.

Memory metrics - Heap allocation and actual memory used. Actor systems can accumulate small allocations across thousands of processes. Memory metrics help identify whether garbage collection keeps pace with allocation.

Network metrics - For distributed Ergo clusters, tracking bytes and messages flowing between nodes reveals network bottlenecks, routing inefficiencies, or failing connections.

Application metrics - How many applications are loaded and running. Applications failing to start or terminating unexpectedly appear in these counts.

These base metrics provide system-level visibility. For application-specific metrics (request rates, business transactions, custom counters), you extend the metrics actor with your own Prometheus collectors.

ActorBehavior Interface

The metrics actor extends gen.ProcessBehavior with a specialized interface:

type ActorBehavior interface {
    gen.ProcessBehavior

    Init(args ...any) (Options, error)

    HandleMessage(from gen.PID, message any) error
    HandleCall(from gen.PID, ref gen.Ref, message any) (any, error)
    HandleEvent(event gen.MessageEvent) error
    HandleInspect(from gen.PID, item ...string) map[string]string

    CollectMetrics() error
    Terminate(reason error)
}

Only Init() is required - register your custom metrics and return options; all other callbacks have default implementations you can override as needed.

You have two main patterns:

Periodic collection - Implement CollectMetrics() to query state at intervals. Use when metrics reflect current state from other actors or external sources.

Event-driven updates - Implement HandleMessage() or HandleEvent() to update metrics when events occur. Use when your application produces natural event streams or publishes events.

How It Works

When you spawn the metrics actor:

  1. HTTP endpoint starts at the configured host and port. The /metrics endpoint immediately serves Prometheus-formatted data.

  2. Base metrics collect automatically. Node information (processes, memory, CPU), network statistics (connected nodes, message rates), and per-process metrics (mailbox depth, utilization, latency, aggregates) update at the configured interval.

  3. Custom metrics update via CollectMetrics() callback or HandleMessage() processing, depending on your implementation.

  4. Prometheus scrapes the /metrics endpoint and receives current values for all registered collectors (base + custom).

The actor handles HTTP serving and registry management. You focus on defining metrics and updating their values.

Basic Usage

Spawn the metrics actor like any other process:
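A minimal sketch of spawning it, assuming the package is imported as metrics and the node is started with ergo.StartNode; the import path, factory wiring, and Spawn signature follow common Ergo v3 conventions and may differ in your setup:

```go
package main

import (
	"ergo.services/ergo"
	"ergo.services/ergo/gen"

	// assumed import path for the metrics actor package
	"ergo.services/metrics"
)

// MetricsActor embeds the base actor and keeps all default behavior.
type MetricsActor struct {
	metrics.Actor
}

// Init is the only required callback: register custom metrics (none here)
// and return the options. A zero-value Options falls back to the defaults.
func (m *MetricsActor) Init(args ...any) (metrics.Options, error) {
	return metrics.Options{}, nil
}

func factory() gen.ProcessBehavior { return &MetricsActor{} }

func main() {
	node, err := ergo.StartNode("demo@localhost", gen.NodeOptions{})
	if err != nil {
		panic(err)
	}
	// /metrics starts serving as soon as the actor initializes
	if _, err := node.Spawn(factory, gen.ProcessOptions{}); err != nil {
		panic(err)
	}
	node.Wait()
}
```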

Default configuration:

  • Host: localhost

  • Port: 3000

  • CollectInterval: 10 seconds

  • TopN: 50

The HTTP endpoint starts automatically during initialization. The first metrics collection happens immediately, and subsequent collections run at the configured interval.

Configuration

Customize the HTTP endpoint and collection frequency:
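Assuming the Options struct exposes the fields named below (Host, Port, TopN, and CollectInterval as a time.Duration), a customized Init might look like:

```go
func (m *MetricsActor) Init(args ...any) (metrics.Options, error) {
	return metrics.Options{
		Host:            "0.0.0.0",        // accept scrapes from any interface
		Port:            9100,             // avoid app servers and Observer (9911)
		CollectInterval: 15 * time.Second, // match the Prometheus scrape interval
		TopN:            25,               // fewer top-N series, lower cardinality
	}, nil
}
```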

Host determines which network interface the HTTP server binds to. Use "localhost" to restrict access to local connections only (development, testing). Use "0.0.0.0" to accept connections from any interface (production, containerized environments).

Port should not conflict with other services. Prometheus conventionally uses 9090, but many Ergo applications use that for other purposes. Choose a port that doesn't collide with your application's HTTP servers, Observer UI (default 9911), or other metrics exporters.

TopN sets how many top processes are tracked for each per-process metric group -- mailbox depth, utilization, and latency (default: 50). Higher values provide more visibility but increase Prometheus cardinality. Setting it to 0 is not supported; the minimum effective value is 1.

CollectInterval controls how frequently the actor queries node statistics. Shorter intervals provide more granular time-series data but increase CPU usage for collection. Longer intervals reduce overhead but miss short-lived spikes. For most applications, 10-15 seconds balances responsiveness with resource usage. Prometheus typically scrapes every 15-60 seconds, so collecting more frequently than your scrape interval wastes resources.

Base Metrics

The metrics actor automatically exposes these Prometheus metrics without any configuration:

Node Metrics

ergo_node_uptime_seconds (Gauge) - Time since the node started. Useful for detecting node restarts and calculating availability.

ergo_processes_total (Gauge) - Total number of processes, including running, idle, and zombie. High counts suggest process leaks or inefficient cleanup.

ergo_processes_running (Gauge) - Processes actively handling messages. A low value relative to total suggests most processes are idle (good) or blocked (bad - investigate what they're waiting for).

ergo_processes_zombie (Gauge) - Processes terminated but not yet fully cleaned up. These should be transient; persistent zombies indicate bugs in termination handling.

ergo_processes_spawned_total (Gauge) - Cumulative number of successfully spawned processes since node start. Monotonically increasing; useful for tracking spawn rate over time.

ergo_processes_spawn_failed_total (Gauge) - Cumulative number of failed spawn attempts. Non-zero values indicate initialization errors or resource constraints preventing process creation.

ergo_processes_terminated_total (Gauge) - Cumulative number of terminated processes. Compare to the spawned count to understand process lifecycle patterns.

ergo_memory_used_bytes (Gauge) - Total memory obtained from the OS (uses runtime.MemStats.Sys).

ergo_memory_alloc_bytes (Gauge) - Bytes of allocated heap objects (uses runtime.MemStats.Alloc).

ergo_cpu_user_seconds (Gauge) - CPU time spent executing user code. Increases as the node does work; the rate of change indicates CPU utilization.

ergo_cpu_system_seconds (Gauge) - CPU time spent in the kernel (system calls). High system time relative to user time suggests I/O bottlenecks or excessive syscalls.

ergo_cpu_cores (Gauge) - Number of CPU cores available to the process. Useful for normalizing CPU utilization metrics.

ergo_applications_total (Gauge) - Number of applications loaded. Should match your expected count; unexpected changes indicate applications starting or stopping.

ergo_applications_running (Gauge) - Applications currently active. Compare to total to identify stopped or failed applications.

ergo_registered_names_total (Gauge) - Processes registered with atom names. High counts suggest heavy use of named processes for routing.

ergo_registered_aliases_total (Gauge) - Total number of registered aliases, including aliases created by processes via CreateAlias() and aliases identifying meta-processes.

ergo_registered_events_total (Gauge) - Event subscriptions active on the node. High counts indicate extensive pub/sub usage.

Network Metrics

ergo_connected_nodes_total (Gauge) - Number of remote nodes connected. For distributed systems, this should match your expected cluster size.

ergo_remote_node_uptime_seconds (Gauge, label: remote_node) - Uptime of each connected remote node. Resets when the remote node restarts.

ergo_remote_messages_in_total (Gauge, label: remote_node) - Messages received from each remote node. The rate indicates traffic volume.

ergo_remote_messages_out_total (Gauge, label: remote_node) - Messages sent to each remote node. Asymmetric in/out rates may reveal routing issues.

ergo_remote_bytes_in_total (Gauge, label: remote_node) - Bytes received from each remote node. A disproportionate bytes-to-messages ratio suggests large messages or inefficient serialization.

ergo_remote_bytes_out_total (Gauge, label: remote_node) - Bytes sent to each remote node. Monitors network bandwidth usage per peer.

Network metrics use labels (remote_node="...") to separate per-node data. This creates multiple time series - one per connected node. Prometheus queries can aggregate across labels or filter to specific nodes.

Mailbox Latency Metrics

When built with -tags=latency, the metrics actor automatically collects per-process mailbox latency data. This enables detection of stressed processes whose mailboxes are growing.

Without the tag, latency measurement is disabled and no additional metrics are registered. There is zero overhead.

ergo_mailbox_latency_distribution (Gauge, label: range) - Number of processes in each latency range. Snapshot per collect cycle -- values reflect the current state, not cumulative history.

ergo_mailbox_latency_max_seconds (Gauge) - Maximum mailbox latency across all processes on this node. When this exceeds 1 second, at least one process is significantly behind.

ergo_mailbox_latency_processes (Gauge) - Number of processes with a non-empty mailbox (latency > 0). A high count relative to total processes indicates widespread backpressure.

ergo_mailbox_latency_top_seconds (Gauge, labels: pid, name, application, behavior) - Top-N processes by mailbox latency. Directly identifies which processes are the bottlenecks.

The distribution metric uses gauge-based snapshots rather than a Prometheus histogram. Each collect cycle iterates over all processes, counts how many fall into each latency range, and sets the gauge values from scratch. This approach is a better fit for periodic state observation than cumulative histograms, which are designed for discrete events like HTTP requests. The ranges are: 1ms, 5ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s, 10s, 30s, 60s, and 60s+. Each range represents an upper boundary -- for example, "5ms" counts processes with latency between 1ms and 5ms.

The TopN option (default: 50) controls how many processes appear in the top-N metric. The same setting applies to all per-process top-N metrics (latency, depth, utilization).

Mailbox Depth Metrics

The metrics actor collects per-process mailbox queue depth -- the number of messages waiting in the mailbox at the moment of collection. While latency measures how long the oldest message has been waiting, depth measures how many messages are queued. The two metrics are complementary: a process may have high depth with low latency if it processes messages quickly but receives many at once, or low depth with high latency if a single message is taking a long time to process.

No build tags required. Depth metrics are always active.

ergo_mailbox_depth_distribution (Gauge, label: range) - Number of processes in each depth range. Snapshot per collect cycle.

ergo_mailbox_depth_max (Gauge) - Maximum mailbox depth across all processes on this node.

ergo_mailbox_depth_top (Gauge, labels: pid, name, application, behavior) - Top-N processes by mailbox depth.

Distribution ranges: 1, 5, 10, 50, 100, 500, 1K, 5K, 10K, 10K+. Each range represents an upper boundary. Processes with empty mailboxes are not counted.

Process Utilization Metrics

The metrics actor collects per-process utilization -- the ratio of callback running time to process uptime. A process that has been alive for 100 seconds and spent 30 of those seconds inside callbacks has a utilization of 0.30 (30%). This is a lifetime average computed from cumulative counters that the framework maintains for each process. It answers the question "which actors have been busiest over their entire lifetime?"

Utilization is not the same as current CPU load. A process that was heavily loaded an hour ago but is idle now will still show high lifetime utilization. For current load, the dashboard provides rate(ergo_process_running_time_seconds) which shows how much callback time is happening right now per second.
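As a self-contained illustration (not the package's code), the lifetime ratio is computed roughly like this:

```go
package main

import "fmt"

// utilization returns the lifetime ratio of callback running time to
// process uptime, capped at 1.0 as described above.
func utilization(runningSec, uptimeSec float64) float64 {
	if runningSec <= 0 || uptimeSec <= 0 {
		return 0 // processes with zero running time or uptime are excluded
	}
	u := runningSec / uptimeSec
	if u > 1.0 {
		u = 1.0
	}
	return u
}

func main() {
	fmt.Println(utilization(30, 100)) // 0.3 -- alive 100s, 30s in callbacks
	fmt.Println(utilization(5, 0))    // 0   -- zero uptime is excluded
}
```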

No build tags required. Utilization metrics are always active.

ergo_process_utilization_distribution (Gauge, label: range) - Number of processes in each utilization range. Snapshot per collect cycle.

ergo_process_utilization_max (Gauge) - Maximum process utilization on this node.

ergo_process_utilization_top (Gauge, labels: pid, name, application, behavior) - Top-N processes by utilization.

Distribution ranges: 1%, 5%, 10%, 25%, 50%, 75%, 90%, 90%+. Processes with zero running time or zero uptime are excluded. Utilization is capped at 1.0 (100%).

Process Aggregate Metrics

The metrics actor computes node-level aggregate counters by summing per-process values across all processes on the node. These provide a high-level view of how much work the node is doing without the cardinality cost of per-process series.

ergo_process_messages_in (Gauge) - Sum of messages received by all processes on this node.

ergo_process_messages_out (Gauge) - Sum of messages sent by all processes on this node.

ergo_process_running_time_seconds (Gauge) - Sum of callback running time across all processes on this node, in seconds.

These are cumulative values -- apply rate() in Prometheus to get per-second rates. When a process terminates, its contribution is removed from the sum, which may cause the aggregate to decrease momentarily. This is expected and rate() handles it correctly in most cases.

rate(ergo_process_messages_in) and rate(ergo_process_messages_out) give the node-level message throughput in messages per second. rate(ergo_process_running_time_seconds) gives the node-level actor CPU utilization in seconds of callback execution per second -- when this value approaches the number of available CPU cores, the node is compute-saturated.
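Typical queries over these aggregates might look as follows (the 5-minute window is a common choice, and the label matching in the last query depends on your scrape labels):

```promql
# node-level message throughput, messages per second
rate(ergo_process_messages_in[5m])
rate(ergo_process_messages_out[5m])

# actor CPU saturation: callback-seconds per second vs. available cores
rate(ergo_process_running_time_seconds[5m]) / on(instance) ergo_cpu_cores
```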

Per-Process Metrics Collection

All per-process metrics (latency, depth, utilization, aggregates) are collected in a single pass using Node.ProcessRangeShortInfo(). The iterator visits each process once, and each observation is dispatched to the latency, depth, and utilization collectors simultaneously. Top-N selection uses a min-heap, so picking the top N of M processes costs O(M log N) rather than a full sort. This design ensures that adding more metric types does not multiply the number of iterations over the process table.
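The min-heap trick can be sketched in isolation (an illustration of the technique, not the actual implementation): keep a heap of at most N values whose root is the smallest candidate, and evict the root whenever a larger observation arrives.

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

// procHeap is a min-heap of observed values; the root is the smallest
// of the current top-N candidates, so it can be evicted cheaply.
type procHeap []float64

func (h procHeap) Len() int           { return len(h) }
func (h procHeap) Less(i, j int) bool { return h[i] < h[j] }
func (h procHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *procHeap) Push(x any)        { *h = append(*h, x.(float64)) }
func (h *procHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// topN keeps at most n values during a single scan; for M observations
// the cost is O(M log n) instead of sorting the whole set.
func topN(values []float64, n int) []float64 {
	h := &procHeap{}
	for _, v := range values {
		if h.Len() < n {
			heap.Push(h, v)
		} else if v > (*h)[0] {
			heap.Pop(h)  // evict the smallest candidate
			heap.Push(h, v)
		}
	}
	out := append([]float64(nil), *h...)
	sort.Sort(sort.Reverse(sort.Float64Slice(out)))
	return out
}

func main() {
	fmt.Println(topN([]float64{3, 9, 1, 7, 5, 8}, 3)) // [9 8 7]
}
```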

Cardinality

For a cluster of 500 nodes with TopN=50:

  • Depth distribution + max + top-N: 500 x (10 + 1 + 50) = 30,500

  • Utilization distribution + max + top-N: 500 x (8 + 1 + 50) = 29,500

  • Aggregates: 500 x 3 = 1,500

  • Latency distribution + max + count + top-N: 500 x (12 + 2 + 50) = 32,000 (with -tags=latency)

  • Total without latency: ~61,500 series

  • Total with latency: ~93,500 series

For a typical cluster of 30 nodes, the total is approximately 6,000 series (or 10,000 with latency). At a 15-second Prometheus scrape interval and default 15-day retention, this amounts to roughly 1.3 GB of disk space -- negligible for any modern monitoring setup.

Custom Metrics

Extend the metrics actor by embedding metrics.Actor. You register custom Prometheus collectors in Init() and update them via CollectMetrics() or HandleMessage().

Approach 1: Periodic Collection

Implement CollectMetrics() to poll state at regular intervals:
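A sketch under stated assumptions: the Register helper, the job_manager process name, and its "queued" request are hypothetical, while the collector types come from the standard Prometheus Go client (client_golang).

```go
type AppMetrics struct {
	metrics.Actor
	queued prometheus.Gauge
}

func (a *AppMetrics) Init(args ...any) (metrics.Options, error) {
	a.queued = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "app_jobs_queued",
		Help: "Jobs currently waiting in the job manager.",
	})
	// register with the actor's registry; the exact helper name may differ
	if err := a.Register(a.queued); err != nil {
		return metrics.Options{}, err
	}
	return metrics.Options{}, nil
}

// CollectMetrics runs at every CollectInterval tick.
func (a *AppMetrics) CollectMetrics() error {
	// synchronous call to another actor for its current state
	v, err := a.Call(gen.Atom("job_manager"), "queued")
	if err != nil {
		return err
	}
	a.queued.Set(float64(v.(int)))
	return nil
}
```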

Use this when metrics reflect state you need to query - current values from other actors, computed aggregates, external API calls.

Approach 2: Event-Driven Updates

Update metrics immediately when events occur:
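A sketch assuming a hypothetical OrderPlaced event type and an orders counter registered in Init():

```go
type OrderPlaced struct {
	Amount float64
}

func (a *AppMetrics) HandleMessage(from gen.PID, message any) error {
	switch message.(type) {
	case OrderPlaced:
		// no polling: the counter moves the moment the event arrives
		a.orders.Inc()
	}
	return nil
}
```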

Application actors send events to the metrics actor:
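From the sending side (the "metrics" registered name and the Worker actor are assumptions), a plain asynchronous Send is enough:

```go
// inside any application actor
func (w *Worker) placeOrder(amount float64) error {
	// fire-and-forget; the metrics actor updates counters in HandleMessage
	return w.Send(gen.Atom("metrics"), OrderPlaced{Amount: amount})
}
```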

Use this when your application naturally produces events. Metrics update in real-time without polling.

Metric Types

Prometheus defines four metric types, each suited for different use cases:

Counter - Monotonically increasing value. Use for events that accumulate (requests processed, errors occurred, bytes sent). Counters never decrease except on process restart. Prometheus queries typically use rate() to calculate per-second rates or increase() for total change over a time window.

Gauge - Value that can go up or down. Use for current state (active connections, queue depth, memory usage, CPU utilization). Gauges represent snapshots. Prometheus queries can graph them directly or use functions like avg_over_time() to smooth spikes.

Histogram - Observations bucketed into configurable ranges. Use for latency or size distributions. Histograms let you calculate percentiles (p50, p95, p99) in Prometheus queries. They're more resource-intensive than gauges because they maintain multiple buckets per metric.

Summary - Similar to histogram but calculates quantiles client-side. Use when you need precise quantiles but can't predict bucket boundaries. Summaries are more expensive than histograms because they track exact quantiles, not approximations.

For most use cases, counters and gauges suffice. Use histograms when you need latency percentiles. Avoid summaries unless you have specific reasons - histograms are more flexible for Prometheus queries.

Integration with Prometheus

Configure Prometheus to scrape the metrics endpoint:
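A minimal static scrape job (hostnames and port are placeholders for your deployment):

```yaml
scrape_configs:
  - job_name: "ergo"
    scrape_interval: 15s
    static_configs:
      - targets:
          - "app-node-1:9100"  # host:port of each metrics actor
          - "app-node-2:9100"
```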

Prometheus fetches /metrics every 15 seconds, parses the text format, and stores time-series data. You can then query, alert, and visualize metrics using Prometheus queries or Grafana dashboards.

For dynamic discovery in Kubernetes or cloud environments, use Prometheus service discovery instead of static targets. The metrics actor itself doesn't need to know about Prometheus - it just exposes an HTTP endpoint.

Grafana Dashboard

The metrics package includes a pre-built Grafana dashboard (ergo-cluster.json) designed for monitoring Ergo clusters. The dashboard provides a comprehensive view of cluster health with automatic refresh every 10 seconds.

Importing the Dashboard

  1. Open Grafana and navigate to Dashboards

  2. Click "Import"

  3. Upload the ergo-cluster.json file from the metrics package or paste its contents

  4. Select your Prometheus data source

The dashboard includes a $node variable dropdown that filters all panels by selected nodes. By default, all nodes are displayed.

Understanding the Panels

The dashboard organizes metrics into logical groups arranged from high-level overview at the top to detailed breakdowns below. Rows marked "collapsed" are hidden by default -- click the row header to expand them.

Summary Row (expanded) - Six stat panels showing aggregated values: total processes, running processes, zombie count (red when non-zero), memory used, memory allocated, and node count. These provide immediate cluster health at a glance. A gap between total and running processes indicates idle capacity or blocked processes. Non-zero zombies require investigation.

Mailbox Latency (expanded, requires -tags=latency) - Six panels for latency analysis described in detail in the next section. When the latency tag is not used, these panels show "No data".

Mailbox Depth (expanded) - Three panels showing mailbox queue depth. Max Depth per Node tracks the largest mailbox on each node over time. Depth Distribution is a stacked area chart with a flame color gradient (green for 1-10 messages, yellow for 50-100, orange for 500-1K, red for 5K-10K+) showing how many processes fall into each depth range. Top Processes by Depth is a table listing the processes with the deepest queues across the cluster. Depth is complementary to latency: depth tells you "how many messages are queued," while latency tells you "how long the oldest one has been waiting."

Process Activity (collapsed) - Four panels covering utilization and throughput. Utilization Distribution is a stacked area chart showing how many processes fall into each utilization range (1% through 90%+), using a flame gradient from green to red. Message Throughput per Node shows rate(messages_in) and rate(messages_out) per node in messages per second. Top Processes by Utilization is a table showing the busiest actors by lifetime utilization. Actor Running Time per Node shows rate(running_time_seconds) per node -- effectively the node-level actor CPU utilization. When this value approaches the available CPU core count, the node is compute-saturated.

Processes (collapsed) - Four timeseries panels showing per-node process counts (total and running) and lifecycle rates (spawn rate with failures in red, termination rate). Steady growth in total without plateau suggests process leaks. Spawn failures indicate resource exhaustion. When termination rate exceeds spawn rate, the node is draining.

Resources (collapsed) - Four panels covering CPU and memory. CPU User Time and CPU System Time are normalized by core count and displayed as percentages. High user CPU means compute-bound workload; high system CPU relative to user suggests excessive I/O or syscalls. Memory (OS:used) and Memory (Runtime:alloc) show memory usage over time. Monotonic growth signals memory leaks. Sawtooth pattern in runtime allocation is normal (GC cycles). Rising baseline between GC cycles indicates uncollected objects.

Network (collapsed) - Six panels covering cluster totals, per-node breakdowns, and node-pair detail for both message rates and byte rates. Sudden drops may indicate partitions. Disproportionate bytes-to-messages ratio reveals large message sizes. The detail panels show traffic between specific node pairs, useful for tracing inter-node communication paths and identifying saturated links.

Nodes Overview - A table listing all nodes with uptime, process counts, and memory. Sorted by process count. Quickly identifies recently restarted nodes (low uptime), overloaded nodes (high process count), or unhealthy nodes (non-zero zombies).

Working with the Dashboard

The dashboard is designed around a top-down investigation pattern. You start with high-level signals that tell you whether something is wrong, then drill into progressively more specific panels to understand what, where, and why. This section describes the investigation flow.

Routine check

Open the dashboard and look at the Summary row. Six stat panels at the top answer the first question: is the cluster intact?

  • Zombie count should be zero. Non-zero means processes have terminated abnormally and were not cleaned up. This requires immediate investigation.

  • Node count should match your expectation. A missing node means it has left the cluster or lost connectivity.

  • Memory used should be within expected bounds. A sharp increase since the last check suggests a leak or a load spike.

If the Summary looks normal and you have latency enabled, glance at the Latency row directly below. If Max Latency is under 100ms and the Stressed Processes panel is mostly empty or light-blue -- the system is healthy. Routine check complete.

If you are not using -tags=latency, check the Mailbox Depth row below. Max Depth per Node is the closest equivalent to Max Latency as a backpressure signal. If all nodes show zero or low depth, the system is healthy. For a deeper routine check, expand the Process Activity row and glance at Message Throughput -- a sudden drop compared to the previous period may indicate stalled processes even when depth looks normal.

Something is wrong: start with latency

When a problem is detected -- alerts fire, users report slowness, or the Summary shows unusual values -- the latency panels are the first place to investigate. They answer "are my actors keeping up with their workload?"

If you are not using -tags=latency, skip to the Mailbox Depth row. Rising depth on a node means processes are receiving messages faster than they can handle them. Use the Depth Distribution panel to assess severity and the Top Processes by Depth table to identify the specific actors. Then continue with the "Understand severity" and "Correlate across panels" sections below -- they apply regardless of whether latency is enabled.

The Max Latency panel shows the highest mailbox latency across all selected nodes as a red timeseries. Under normal conditions this stays under 100ms. Values above 1 second mean at least one process is significantly behind -- it is either overloaded, stuck in a long-running callback, or waiting for an external resource.

The Stressed Processes panel shows a stacked area chart with two layers. Light-blue represents processes with latency under 1ms (negligible, normal operation). Orange represents processes with latency of 1ms or above (worth investigating). A growing orange area means more processes are falling behind over time.

Read these two panels together:

  • Max Latency spikes above 1 second -- at least one process is severely behind. Scroll down to the Top Stressed Processes table to identify it by application, behavior, name, and PID.

  • Orange area growing in Stressed Processes -- multiple processes are accumulating latency. The problem is broader than a single actor. Check the Latency Distribution to understand severity, and the CPU panels to see if the system is compute-bound.

  • Max Latency spike followed by quick return to normal -- a temporary burst. Compare timing with Process Spawn Rate in the Processes row to check for lifecycle event correlation.

  • Max Latency persistently elevated (minutes, not seconds) -- a stuck process. Unlike overload (which fluctuates with traffic), a stuck process shows a steadily increasing or flat high value. Find it in the Top Stressed Processes table.

Narrow down: node or cluster?

The Max Latency per Node and Stressed Processes per Node panels break the cluster-wide picture into individual nodes.

If one node stands out while others are calm, the problem is localized. Cross-reference with Resources and Network panels for that node. A node with high max latency but low stressed count has one problematic process. A node with moderate latency but high stressed count is generally overloaded.

If multiple nodes show similar patterns, the problem is systemic -- a shared external dependency, a cluster-wide traffic pattern, or a deployment issue.

Understand severity: the distribution panels

The Latency Distribution panel shows a stacked area chart where each layer represents a latency range. Colors run from green (1ms-10ms, normal) through yellow (50ms-100ms, elevated) to orange (500ms-1s, concerning) and red (5s-60s+, critical). The legend is sorted from highest to lowest range.

This panel distinguishes two scenarios that look similar in Max Latency: one stuck process (a single red sliver at the top of an otherwise green chart) versus widespread degradation (the entire chart shifting from green toward orange). The first requires investigating a specific process. The second requires scaling or load shedding.

The Depth Distribution panel (in the Mailbox Depth row) provides a complementary view. Where latency distribution shows how long messages wait, depth distribution shows how many messages are queued. The color gradient follows the same flame convention: green for 1-10 messages, yellow for 50-100, red for 5K-10K+. A process with high depth but low latency processes messages quickly but receives many at once. A process with low depth but high latency is slow but not overwhelmed.

The Utilization Distribution panel (expand the Process Activity row) shows the fraction of lifetime each process spends inside callbacks. Most processes should be in the low ranges (1%-10%). A shift toward higher ranges (50%+) across many processes means the cluster is running compute-heavy workloads and may need scaling.

Find the specific process

Each metric group has a top-N table at the bottom of its row:

  • Top Stressed Processes (Latency row) -- processes with the highest mailbox latency. Answers "which process is the bottleneck?"

  • Top Processes by Depth (Mailbox Depth row) -- processes with the deepest queues. Answers "which process has the most messages waiting?"

  • Top Processes by Utilization (Process Activity row) -- processes that spend the most time in callbacks. Answers "which process is doing the most work?"

All tables show Application, Behavior, Name, PID, Node, and the metric value (plus Kubernetes labels when available). Multiple entries from the same application suggest that application is under pressure as a whole. A single entry with extreme values points to a specific process that needs investigation -- see the Debugging section for techniques.

Correlate across panels

Individual metrics become most powerful when combined:

  • High latency + high depth -- the process is both slow and receiving more messages than it can handle. The mailbox is growing. This is the clearest sign of overload.

  • High latency + low depth -- a single message is being processed slowly (long-running callback) or the process is blocked waiting for something. The mailbox is not growing because new messages are not arriving.

  • High depth + low latency -- the process receives bursts but processes them quickly. The depth spikes are transient. Usually not a problem unless the bursts grow over time.

  • High latency + high CPU -- processes are compute-bound. Actors are doing heavy work in callbacks and cannot keep up with the message rate. Consider distributing work across more processes or offloading expensive computation.

  • High latency + low CPU -- processes are blocked on something other than computation. Common causes: waiting for external I/O, waiting for responses from other actors via synchronous calls, or contention on shared resources.

  • High latency + growing memory -- mailboxes are accumulating messages faster than processes can handle them. The unprocessed messages consume memory. If this continues, the node will run out of memory.

  • High latency + network traffic spike -- a burst of remote messages is overwhelming the receiving processes. Check Network per Node and Network Detail panels to identify the responsible link.

  • Running time per node approaching CPU core count -- the node is compute-saturated. All available CPU time is spent inside actor callbacks. Either reduce the workload or add capacity.

  • Message throughput drop with stable process count -- processes are alive but not doing work. They may be blocked on external calls or waiting for messages that are no longer arriving (upstream failure).

Observer Integration

The metrics actor includes built-in Observer support via HandleInspect(). When you inspect it in Observer UI (http://localhost:9911), you see:

  • Total number of registered metrics

  • HTTP endpoint URL for Prometheus scraping

  • Collection interval

  • Current values for all metrics (base + custom)

This works automatically for custom metrics - register them in Init() and they appear in Observer alongside base metrics.

If you need custom inspection behavior, override HandleInspect() in your implementation:
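A sketch of an override that appends a custom field; delegating to the embedded actor's HandleInspect to keep the base fields is an assumption about the default implementation:

```go
func (a *AppMetrics) HandleInspect(from gen.PID, item ...string) map[string]string {
	// base info: endpoint URL, collect interval, registered metrics
	info := a.Actor.HandleInspect(from, item...)
	info["orders_total"] = fmt.Sprintf("%d", a.ordersSeen)
	return info
}
```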

For detailed configuration options, see the metrics.Options struct and ActorBehavior interface in the package. For examples of custom metrics, see the metrics actor repository.
