Debugging
Debugging distributed actor systems presents unique challenges. Traditional debugging tools struggle with concurrent message passing, process isolation, and distributed state. This article covers the debugging capabilities built into Ergo Framework and demonstrates practical techniques for troubleshooting common issues.
Build Tags
Ergo Framework uses Go build tags to enable debugging features without affecting production performance. These tags control compile-time behavior, ensuring zero overhead when disabled.
The pprof Tag
The pprof tag enables the built-in profiler and goroutine labeling:

go run --tags pprof ./cmd

This activates:
pprof HTTP endpoint at http://localhost:9009/debug/pprof/
PID labels on actor goroutines and Alias labels on meta process goroutines for identification in profiler output
The endpoint address can be customized via environment variables:
PPROF_HOST - host to bind (default: localhost)
PPROF_PORT - port to listen on (default: 9009)
The profiler endpoint exposes standard Go profiling data:
/debug/pprof/goroutine - stack traces of all goroutines
/debug/pprof/heap - heap memory allocations
/debug/pprof/profile - CPU profile (30-second sample)
/debug/pprof/block - goroutine blocking events
/debug/pprof/mutex - mutex contention
The norecover Tag
By default, Ergo Framework recovers from panics in actor callbacks to prevent a single misbehaving actor from crashing the entire node. While this improves resilience in production, it can hide bugs during development.
With norecover, panics propagate normally, providing full stack traces and allowing debuggers to catch the exact failure point. This is particularly useful when:
Investigating nil pointer dereferences in message handlers
Tracking down type assertion failures
Understanding the call sequence leading to a panic
The trace Tag
The trace tag enables verbose logging of framework internals:

go run --tags trace ./cmd

This produces detailed output about:
Process lifecycle events (spawn, terminate, state changes)
Message routing decisions
Network connection establishment and teardown
Supervision tree operations
To see trace output, also set the node's log level to trace.
Combining Tags
Tags can be combined for comprehensive debugging, for example go run --tags pprof,norecover,trace ./cmd.
This enables all debugging features simultaneously. Use this combination when investigating complex issues that span multiple subsystems.
Profiler Integration
The Go profiler is a powerful tool for understanding runtime behavior. Ergo Framework enhances its usefulness by labeling goroutines with their identifiers.
Identifying Actor and Meta Process Goroutines
When built with the pprof tag, each actor's goroutine carries a label containing its PID, and each meta process goroutine carries a label with its Alias. This creates a direct link between the logical identity and the runtime goroutine.
To find labeled goroutines, request the goroutine profile (http://localhost:9009/debug/pprof/goroutine?debug=1) and search the output for the identifier.
Example output for actors:
Example output for meta processes:
Meta processes have two goroutines with different roles:
"role":"reader"- External Reader goroutine running theStart()method (blocking I/O)"role":"handler"- Actor Handler goroutine processing messages (HandleMessage/HandleCall)
The output shows:
The goroutine's stack trace
The identifier label (PID for actors, Alias for meta processes)
The exact location in your code where the goroutine is currently executing
Debugging Stuck Processes
During graceful shutdown, Ergo Framework logs processes that are taking too long to terminate. These logs include PIDs that can be matched against profiler output.
Consider a shutdown scenario where the node reports:
To investigate why <ABC123.0.1005> is stuck:
Capture the goroutine profile from http://localhost:9009/debug/pprof/goroutine?debug=2
Search the output for the specific PID
Analyze the stack trace to understand what the actor is waiting on.
The debug=2 parameter provides full stack traces with argument values, which is more verbose than debug=1 but contains more diagnostic information.
Common Patterns in Stack Traces
Different types of blocking have characteristic stack traces:
Blocked on channel receive: the goroutine state is chan receive
Blocked on mutex: the state is semacquire (recent Go versions show sync.Mutex.Lock), with the lock call in the stack
Blocked on network I/O: the state is IO wait
Blocked on synchronous call (waiting for response): waitResponse appears in the stack
Understanding these patterns helps quickly identify the root cause of stuck processes.
Shutdown Diagnostics
Ergo Framework provides built-in diagnostics during graceful shutdown. When ShutdownTimeout is configured (default: 3 minutes), the framework logs pending processes every 5 seconds.
The shutdown log includes:
PID: Process identifier for correlation with profiler
State: Current process state (running, sleep, etc.)
Queue: Number of messages waiting in the mailbox
A process with state=running and queue=0 is actively processing something (likely stuck in a callback). A process with state=running and queue>0 is stuck while new messages continue to arrive. A process with state=sleep and queue=0 is idle - during shutdown this typically means the process is waiting for its children to terminate first (normal supervision tree behavior).
Practical Debugging Scenarios
Scenario: Message Handler Never Returns
Symptoms:
Process stops responding to messages
Other processes' Call requests time out
Shutdown hangs on the specific process
Investigation:
Note the PID from shutdown logs or observer
Capture goroutine profile with debug=2
Find the goroutine by PID label
Examine the stack trace
Common causes:
Infinite loop in message handler
Blocking channel operation
Deadlock with another process via synchronous calls
External service call without timeout
Solution approach:
Never use blocking operations (channels, mutexes) in actor callbacks
Always use timeouts for external calls
Use asynchronous messaging patterns where possible
Scenario: Memory Growth
Symptoms:
Heap size increases over time
Process eventually killed by OOM
Investigation:
Capture the heap profile with go tool pprof http://localhost:9009/debug/pprof/heap
In pprof, use top to see the largest allocators
Use list to examine specific functions
Common causes:
Messages accumulating in mailbox faster than processing
Actor state holding references to large data
Unbounded caches or buffers in actor state
Scenario: Distributed Deadlock
Symptoms:
Two or more processes stop responding
Circular dependency in synchronous calls
Investigation:
Identify stuck processes from shutdown logs
For each process, capture its goroutine stack
Look for waitResponse in stack traces (indicates waiting for a synchronous call response)
Map the call targets to build a dependency graph
Prevention:
Prefer asynchronous messaging over synchronous calls
Design clear hierarchies where calls flow in one direction
Use timeouts on all synchronous operations
Consider using request-response patterns with explicit message types
Scenario: Process Crash Investigation
Symptoms:
Process terminates unexpectedly
TerminateReasonPanic in logs
Investigation:
Build with --tags norecover to get the full panic stack
Run the scenario that triggers the crash
Examine the complete stack trace
With norecover, the panic propagates with full context, and the resulting stack trace shows exactly which line in your code triggered the panic.
Observer Integration
The Observer tool provides a web interface for inspecting running nodes. While not strictly a debugging tool, it complements profiler-based debugging by providing:
Real-time process list with state and mailbox sizes
Application and supervision tree visualization
Network topology view
Message inspection capabilities
Observer runs at http://localhost:9911 by default when included in your node.
Best Practices
Always use build tags in development: Run with --tags pprof during development to have profiler and goroutine labels available when needed.
Configure a reasonable shutdown timeout: A shorter timeout (30-60 seconds) in development helps identify stuck processes quickly.
Use framework logging: The framework's Log() method automatically includes PID/Alias in log output, enabling correlation with profiler data.
Use structured logging: The framework's logging system supports log levels and structured fields. Add context with AddFields() for correlation. For scoped logging, use PushFields()/PopFields() to save and restore field sets.
Profile regularly: Periodic profiling during development helps catch performance regressions before production.
Test shutdown paths: Explicitly test graceful shutdown to verify all actors terminate cleanly.
Summary
Debugging actor systems requires tools that bridge the gap between logical actors and runtime goroutines. Ergo Framework provides this bridge through:
Build tags that enable profiling and diagnostics without production overhead
Goroutine labels that link runtime goroutines to their actor (PID) and meta process (Alias) identities
Shutdown diagnostics that identify processes preventing clean termination
Observer integration for visual inspection of running systems
Combined with Go's standard profiling tools, these capabilities enable effective debugging of even complex distributed systems.