Debugging

Debugging distributed actor systems presents unique challenges. Traditional debugging tools struggle with concurrent message passing, process isolation, and distributed state. This article covers the debugging capabilities built into Ergo Framework and demonstrates practical techniques for troubleshooting common issues.

Build Tags

Ergo Framework uses Go build tags to enable debugging features without affecting production performance. These tags control compile-time behavior, ensuring zero overhead when disabled.

The pprof Tag

The pprof tag enables the built-in profiler and goroutine labeling:

go run --tags pprof ./cmd

This activates:

  • pprof HTTP endpoint at http://localhost:9009/debug/pprof/

  • PID labels on actor goroutines and Alias labels on meta process goroutines for identification in profiler output

The endpoint address can be customized via environment variables:

  • PPROF_HOST - host to bind (default: localhost)

  • PPROF_PORT - port to listen on (default: 9009)
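
For example, to expose the profiler on all interfaces and a custom port:

PPROF_HOST=0.0.0.0 PPROF_PORT=9191 go run --tags pprof ./cmd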

The profiler endpoint exposes standard Go profiling data:

Endpoint                   Description
/debug/pprof/goroutine     Stack traces of all goroutines
/debug/pprof/heap          Heap memory allocations
/debug/pprof/profile       CPU profile (30-second sample)
/debug/pprof/block         Goroutine blocking events
/debug/pprof/mutex         Mutex contention

The norecover Tag

By default, Ergo Framework recovers from panics in actor callbacks to prevent a single misbehaving actor from crashing the entire node. While this improves resilience in production, it can hide bugs during development.
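
To disable recovery during debugging, build with the norecover tag:

go run --tags norecover ./cmd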

With norecover, panics propagate normally, providing full stack traces and allowing debuggers to catch the exact failure point. This is particularly useful when:

  • Investigating nil pointer dereferences in message handlers

  • Tracking down type assertion failures

  • Understanding the call sequence leading to a panic

The trace Tag

The trace tag enables verbose logging of framework internals:
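
go run --tags trace ./cmd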

This produces detailed output about:

  • Process lifecycle events (spawn, terminate, state changes)

  • Message routing decisions

  • Network connection establishment and teardown

  • Supervision tree operations

To see trace output, also set the node's log level:
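
A minimal sketch, assuming the v3 API's gen.NodeOptions.Log.Level field and the gen.LogLevelTrace constant:

package main

import (
	"ergo.services/ergo"
	"ergo.services/ergo/gen"
)

func main() {
	node, err := ergo.StartNode("demo@localhost", gen.NodeOptions{
		Log: gen.NodeOptionsLog{
			Level: gen.LogLevelTrace, // trace-level output also requires the trace build tag
		},
	})
	if err != nil {
		panic(err)
	}
	node.Wait()
}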

Combining Tags

Tags can be combined for comprehensive debugging:
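
go run --tags pprof,norecover,trace ./cmd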

This enables all debugging features simultaneously. Use this combination when investigating complex issues that span multiple subsystems.

Profiler Integration

The Go profiler is a powerful tool for understanding runtime behavior. Ergo Framework enhances its usefulness by labeling goroutines with their identifiers.

Identifying Actor and Meta Process Goroutines

When built with the pprof tag, each actor's goroutine carries a label containing its PID, and each meta process goroutine carries a label with its Alias. This creates a direct link between the logical identity and the runtime goroutine.

To find labeled goroutines:
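
curl "http://localhost:9009/debug/pprof/goroutine?debug=1"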

Example output for actors:
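
An abridged, illustrative fragment of the debug=1 output (the addresses, file paths, and the exact label key are placeholders, not real framework output):

1 @ 0x43a1b6 0x406c3c 0x44d9a1
# labels: {"pid":"<ABC123.0.1005>"}
#	0x44d9a0	main.(*worker).HandleMessage+0x120	/app/worker.go:42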

Example output for meta processes:
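
Again illustrative; the alias label key and value format are assumptions, while the role values match the list below:

1 @ 0x43a1b6 0x4d91e5 0x52ba44
# labels: {"alias":"<ABC123.101.1001.2000>", "role":"reader"}
#	0x52ba43	main.(*tcpConnection).Start+0x54	/app/tcp.go:27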

Meta processes have two goroutines with different roles:

  • "role":"reader" - External Reader goroutine running the Start() method (blocking I/O)

  • "role":"handler" - Actor Handler goroutine processing messages (HandleMessage/HandleCall)

The output shows:

  • The goroutine's stack trace

  • The identifier label (PID for actors, Alias for meta processes)

  • The exact location in your code where the goroutine is currently executing

Debugging Stuck Processes

During graceful shutdown, Ergo Framework logs processes that are taking too long to terminate. These logs include PIDs that can be matched against profiler output.

Consider a shutdown scenario where the node reports:

To investigate why <ABC123.0.1005> is stuck:

  1. Capture the goroutine profile:
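
     curl "http://localhost:9009/debug/pprof/goroutine?debug=2" > goroutines.txt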

  2. Search for the specific PID:
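
     grep -B 2 -A 30 "ABC123.0.1005" goroutines.txt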

  3. Analyze the stack trace to understand what the actor is waiting on.

The debug=2 parameter provides full stack traces with argument values, which is more verbose than debug=1 but contains more diagnostic information.

Common Patterns in Stack Traces

Different types of blocking have characteristic stack traces:
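
The snippets below are abridged illustrations: goroutine numbers, arguments, and the application-level function names (such as main.(*worker).HandleMessage) are placeholders, not actual framework output.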

Blocked on channel receive:
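
goroutine 105 [chan receive, 4 minutes]:
runtime.gopark(...)
runtime.chanrecv(...)
runtime.chanrecv1(...)
main.(*worker).HandleMessage(...)
	/app/worker.go:42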

Blocked on mutex:
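
goroutine 106 [semacquire]:
sync.runtime_SemacquireMutex(...)
sync.(*Mutex).lockSlow(...)
sync.(*Mutex).Lock(...)
main.(*worker).HandleCall(...)
	/app/worker.go:57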

Blocked on network I/O:
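
goroutine 107 [IO wait]:
internal/poll.runtime_pollWait(...)
internal/poll.(*pollDesc).wait(...)
net.(*netFD).Read(...)
net.(*conn).Read(...)
main.(*worker).HandleMessage(...)
	/app/worker.go:63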

Blocked on synchronous call (waiting for response):
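
goroutine 108 [chan receive]:
runtime.gopark(...)
<framework internals>.waitResponse(...)
<framework internals>.Call(...)
main.(*worker).HandleMessage(...)
	/app/worker.go:71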

Understanding these patterns helps quickly identify the root cause of stuck processes.

Shutdown Diagnostics

Ergo Framework provides built-in diagnostics during graceful shutdown. When ShutdownTimeout is configured (default: 3 minutes), the framework logs pending processes every 5 seconds.

The shutdown log includes:

  • PID: Process identifier for correlation with profiler

  • State: Current process state (running, sleep, etc.)

  • Queue: Number of messages waiting in the mailbox

A process with state=running and queue=0 is actively processing something (likely stuck in a callback). A process with state=running and queue>0 is stuck while new messages continue to arrive. A process with state=sleep and queue=0 is idle - during shutdown this typically means the process is waiting for its children to terminate first (normal supervision tree behavior).

Practical Debugging Scenarios

Scenario: Message Handler Never Returns

Symptoms:

  • Process stops responding to messages

  • Synchronous Call requests from other processes time out

  • Shutdown hangs on specific process

Investigation:

  1. Note the PID from shutdown logs or observer

  2. Capture goroutine profile with debug=2

  3. Find the goroutine by PID label

  4. Examine the stack trace

Common causes:

  • Infinite loop in message handler

  • Blocking channel operation

  • Deadlock with another process via synchronous calls

  • External service call without timeout

Solution approach:

  • Never use blocking operations (channels, mutexes) in actor callbacks

  • Always use timeouts for external calls

  • Use asynchronous messaging patterns where possible
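
On the timeout point above, a minimal Go sketch of bounding an external HTTP call with a deadline (the helper name and URL handling are illustrative, not framework API):

package main

import (
	"context"
	"io"
	"net/http"
	"time"
)

// fetchWithDeadline performs an HTTP GET that gives up after 5 seconds,
// so a slow dependency cannot block an actor callback indefinitely.
func fetchWithDeadline(url string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// context.DeadlineExceeded surfaces here when the 5-second budget is exhausted
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}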

Scenario: Memory Growth

Symptoms:

  • Heap size increases over time

  • Process eventually killed by OOM

Investigation:

  1. Capture heap profile:
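
     go tool pprof http://localhost:9009/debug/pprof/heap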

  2. In pprof, use top to see largest allocators:
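
     (pprof) top 10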

  3. Use list to examine specific functions:
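
     (pprof) list HandleMessage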

Common causes:

  • Messages accumulating in mailbox faster than processing

  • Actor state holding references to large data

  • Unbounded caches or buffers in actor state

Scenario: Distributed Deadlock

Symptoms:

  • Two or more processes stop responding

  • Circular dependency in synchronous calls

Investigation:

  1. Identify stuck processes from shutdown logs

  2. For each process, capture its goroutine stack

  3. Look for waitResponse in stack traces (indicates waiting for synchronous call response)

  4. Map the call targets to build a dependency graph

Prevention:

  • Prefer asynchronous messaging over synchronous calls

  • Design clear hierarchies where calls flow in one direction

  • Use timeouts on all synchronous operations

  • Consider using request-response patterns with explicit message types

Scenario: Process Crash Investigation

Symptoms:

  • Process terminates unexpectedly

  • TerminateReasonPanic in logs

Investigation:

  1. Build with --tags norecover to get full panic stack

  2. Run the scenario that triggers the crash

  3. Examine the complete stack trace

With norecover, the panic propagates with full context:
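
The output below is abridged and illustrative (paths, addresses, and the function name are placeholders):

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x64f2a1]

goroutine 88 [running]:
main.(*worker).HandleMessage(...)
	/app/worker.go:42 +0x1d
...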

This shows exactly which line in your code triggered the panic.

Observer Integration

The Observer tool provides a web interface for inspecting running nodes. While not strictly a debugging tool, it complements profiler-based debugging by providing:

  • Real-time process list with state and mailbox sizes

  • Application and supervision tree visualization

  • Network topology view

  • Message inspection capabilities

Observer runs at http://localhost:9911 by default when included in your node.

Best Practices

  1. Always use build tags in development: Run with --tags pprof during development to have profiler and goroutine labels available when needed.

  2. Configure reasonable shutdown timeout: A shorter timeout (30-60 seconds) in development helps identify stuck processes quickly.

  3. Use framework logging: The framework's Log() method automatically includes PID/Alias in log output, enabling correlation with profiler data.

  4. Use structured logging: The framework's logging system supports log levels and structured fields. Add context with AddFields() for correlation:
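
    A hedged sketch inside an actor callback, where w is the actor receiver (the exact AddFields signature is an assumption; key/value pairs are shown here):

    w.Log().AddFields("request_id", requestID, "customer", customerID)
    w.Log().Info("processing order %s", orderID)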

    For scoped logging, use PushFields()/PopFields() to save and restore field sets.

  5. Profile regularly: Periodic profiling during development helps catch performance regressions before production.

  6. Test shutdown paths: Explicitly test graceful shutdown to verify all actors terminate cleanly.

Summary

Debugging actor systems requires tools that bridge the gap between logical actors and runtime goroutines. Ergo Framework provides this bridge through:

  • Build tags that enable profiling and diagnostics without production overhead

  • Goroutine labels that link runtime goroutines to their actor (PID) and meta process (Alias) identities

  • Shutdown diagnostics that identify processes preventing clean termination

  • Observer integration for visual inspection of running systems

Combined with Go's standard profiling tools, these capabilities enable effective debugging of even complex distributed systems.
