Debugging

Debugging distributed actor systems presents unique challenges. Traditional debugging tools struggle with concurrent message passing, process isolation, and distributed state. This article covers the debugging capabilities built into Ergo Framework and demonstrates practical techniques for troubleshooting common issues.

Build Tags

Ergo Framework uses Go build tags to enable debugging features without affecting production performance. These tags control compile-time behavior, ensuring zero overhead when disabled.

The pprof Tag

The pprof tag enables the built-in profiler and goroutine labeling:

go run --tags pprof ./cmd

This activates:

  • pprof HTTP endpoint at http://localhost:9009/debug/pprof/

  • PID labels on actor goroutines and Alias labels on meta process goroutines for identification in profiler output

The endpoint address can be customized via environment variables:

  • PPROF_HOST - host to bind (default: localhost)

  • PPROF_PORT - port to listen on (default: 9009)
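
For example, to expose the profiler on all interfaces and a custom port:

PPROF_HOST=0.0.0.0 PPROF_PORT=9191 go run --tags pprof ./cmd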

The profiler endpoint exposes standard Go profiling data:

Endpoint                   Description
/debug/pprof/goroutine     Stack traces of all goroutines
/debug/pprof/heap          Heap memory allocations
/debug/pprof/profile       CPU profile (30-second sample)
/debug/pprof/block         Goroutine blocking events
/debug/pprof/mutex         Mutex contention

The norecover Tag

By default, Ergo Framework recovers from panics in actor callbacks to prevent a single misbehaving actor from crashing the entire node. While this improves resilience in production, it can hide bugs during development.
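
To disable recovery during debugging, build with the norecover tag:

go run --tags norecover ./cmd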

With norecover, panics propagate normally, providing full stack traces and allowing debuggers to catch the exact failure point. This is particularly useful when:

  • Investigating nil pointer dereferences in message handlers

  • Tracking down type assertion failures

  • Understanding the call sequence leading to a panic

The trace Tag

The trace tag enables verbose logging of framework internals:
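
go run --tags trace ./cmd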

This produces detailed output about:

  • Process lifecycle events (spawn, terminate, state changes)

  • Message routing decisions

  • Network connection establishment and teardown

  • Supervision tree operations

To see trace output, also set the node's log level:
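
A minimal sketch, assuming the v3 API's gen.NodeOptions.Log.Level field and the gen.LogLevelTrace constant:

package main

import (
	"ergo.services/ergo"
	"ergo.services/ergo/gen"
)

func main() {
	node, err := ergo.StartNode("demo@localhost", gen.NodeOptions{
		Log: gen.NodeOptionsLog{
			Level: gen.LogLevelTrace, // trace-level output also requires the trace build tag
		},
	})
	if err != nil {
		panic(err)
	}
	node.Wait()
}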

Combining Tags

Tags can be combined for comprehensive debugging:
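
go run --tags pprof,norecover,trace ./cmd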

This enables all debugging features simultaneously. Use this combination when investigating complex issues that span multiple subsystems.

Profiler Integration

The Go profiler is a powerful tool for understanding runtime behavior. Ergo Framework enhances its usefulness by labeling goroutines with their identifiers.

Identifying Actor and Meta Process Goroutines

When built with the pprof tag, each actor's goroutine carries a label containing its PID, and each meta process goroutine carries a label with its Alias. This creates a direct link between the logical identity and the runtime goroutine.

To find labeled goroutines:
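
curl "http://localhost:9009/debug/pprof/goroutine?debug=1"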

Example output for actors:
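
An abridged, illustrative fragment of the debug=1 output (the addresses, file paths, and the exact label key are placeholders, not real framework output):

1 @ 0x43a1b6 0x406c3c 0x44d9a1
# labels: {"pid":"<ABC123.0.1005>"}
#	0x44d9a0	main.(*worker).HandleMessage+0x120	/app/worker.go:42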

Example output for meta processes:
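
Again illustrative; the alias label key and value format are assumptions, while the role values match the list below:

1 @ 0x43a1b6 0x4d91e5 0x52ba44
# labels: {"alias":"<ABC123.101.1001.2000>", "role":"reader"}
#	0x52ba43	main.(*tcpConnection).Start+0x54	/app/tcp.go:27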

Meta processes have two goroutines with different roles:

  • "role":"reader" - External Reader goroutine running the Start() method (blocking I/O)

  • "role":"handler" - Actor Handler goroutine processing messages (HandleMessage/HandleCall)

The output shows:

  • The goroutine's stack trace

  • The identifier label (PID for actors, Alias for meta processes)

  • The exact location in your code where the goroutine is currently executing

Debugging Stuck Processes

During graceful shutdown, Ergo Framework logs processes that are taking too long to terminate. These logs include PIDs that can be matched against profiler output.

Consider a shutdown scenario where the node reports:

To investigate why <ABC123.0.1005> is stuck:

  1. Capture the goroutine profile:
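
     curl "http://localhost:9009/debug/pprof/goroutine?debug=2" > goroutines.txt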

  2. Search for the specific PID:
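
     grep -B 2 -A 30 "ABC123.0.1005" goroutines.txt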

  3. Analyze the stack trace to understand what the actor is waiting on.

The debug=2 parameter provides full stack traces with argument values, which is more verbose than debug=1 but contains more diagnostic information.

Common Patterns in Stack Traces

Different types of blocking have characteristic stack traces:
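
The snippets below are abridged illustrations: goroutine numbers, arguments, and the application-level function names (such as main.(*worker).HandleMessage) are placeholders, not actual framework output.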

Blocked on channel receive:
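
goroutine 105 [chan receive, 4 minutes]:
runtime.gopark(...)
runtime.chanrecv(...)
runtime.chanrecv1(...)
main.(*worker).HandleMessage(...)
	/app/worker.go:42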

Blocked on mutex:
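
goroutine 106 [semacquire]:
sync.runtime_SemacquireMutex(...)
sync.(*Mutex).lockSlow(...)
sync.(*Mutex).Lock(...)
main.(*worker).HandleCall(...)
	/app/worker.go:57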

Blocked on network I/O:
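
goroutine 107 [IO wait]:
internal/poll.runtime_pollWait(...)
internal/poll.(*pollDesc).wait(...)
net.(*netFD).Read(...)
net.(*conn).Read(...)
main.(*worker).HandleMessage(...)
	/app/worker.go:63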

Blocked on synchronous call (waiting for response):
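
goroutine 108 [chan receive]:
runtime.gopark(...)
<framework internals>.waitResponse(...)
<framework internals>.Call(...)
main.(*worker).HandleMessage(...)
	/app/worker.go:71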

Understanding these patterns helps quickly identify the root cause of stuck processes.

Shutdown Diagnostics

Ergo Framework provides built-in diagnostics during graceful shutdown. When ShutdownTimeout is configured (default: 3 minutes), the framework logs pending processes every 5 seconds.

The shutdown log includes:

  • PID: Process identifier for correlation with profiler

  • State: Current process state (running, sleep, etc.)

  • Queue: Number of messages waiting in the mailbox

A process with state=running and queue=0 is actively processing something (likely stuck in a callback). A process with state=running and queue>0 is stuck while new messages continue to arrive. A process with state=sleep and queue=0 is idle - during shutdown this typically means the process is waiting for its children to terminate first (normal supervision tree behavior).

Practical Debugging Scenarios

Scenario: Message Handler Never Returns

Symptoms:

  • Process stops responding to messages

  • Synchronous Call requests from other processes time out

  • Shutdown hangs on specific process

Investigation:

  1. Note the PID from shutdown logs or observer

  2. Capture goroutine profile with debug=2

  3. Find the goroutine by PID label

  4. Examine the stack trace

Common causes:

  • Infinite loop in message handler

  • Blocking channel operation

  • Deadlock with another process via synchronous calls

  • External service call without timeout

Solution approach:

  • Never use blocking operations (channels, mutexes) in actor callbacks

  • Always use timeouts for external calls

  • Use asynchronous messaging patterns where possible
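
On the timeout point above, a minimal Go sketch of bounding an external HTTP call with a deadline (the helper name and URL handling are illustrative, not framework API):

package main

import (
	"context"
	"io"
	"net/http"
	"time"
)

// fetchWithDeadline performs an HTTP GET that gives up after 5 seconds,
// so a slow dependency cannot block an actor callback indefinitely.
func fetchWithDeadline(url string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// context.DeadlineExceeded surfaces here when the 5-second budget is exhausted
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}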

Scenario: Memory Growth

Symptoms:

  • Heap size increases over time

  • Process eventually killed by OOM

Investigation:

  1. Capture heap profile:
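
     go tool pprof http://localhost:9009/debug/pprof/heap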

  2. In pprof, use top to see largest allocators:
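
     (pprof) top 10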

  3. Use list to examine specific functions:
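
     (pprof) list HandleMessage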

Common causes:

  • Messages accumulating in mailbox faster than processing

  • Actor state holding references to large data

  • Unbounded caches or buffers in actor state

Scenario: Distributed Deadlock

Symptoms:

  • Two or more processes stop responding

  • Circular dependency in synchronous calls

Investigation:

  1. Identify stuck processes from shutdown logs

  2. For each process, capture its goroutine stack

  3. Look for waitResponse in stack traces (indicates waiting for synchronous call response)

  4. Map the call targets to build a dependency graph

Prevention:

  • Prefer asynchronous messaging over synchronous calls

  • Design clear hierarchies where calls flow in one direction

  • Use timeouts on all synchronous operations

  • Consider using request-response patterns with explicit message types

Scenario: Process Crash Investigation

Symptoms:

  • Process terminates unexpectedly

  • TerminateReasonPanic in logs

Investigation:

  1. Build with --tags norecover to get full panic stack

  2. Run the scenario that triggers the crash

  3. Examine the complete stack trace

With norecover, the panic propagates with full context:
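
The output below is abridged and illustrative (paths, addresses, and the function name are placeholders):

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x64f2a1]

goroutine 88 [running]:
main.(*worker).HandleMessage(...)
	/app/worker.go:42 +0x1d
...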

This shows exactly which line in your code triggered the panic.

Observer Integration

The Observer tool provides a web interface for inspecting running nodes. While not strictly a debugging tool, it complements profiler-based debugging by providing:

  • Real-time process list with state and mailbox sizes

  • Application and supervision tree visualization

  • Network topology view

  • Message inspection capabilities

Observer runs at http://localhost:9911 by default when included in your node.

Best Practices

  1. Always use build tags in development: Run with --tags pprof during development to have profiler and goroutine labels available when needed.

  2. Configure reasonable shutdown timeout: A shorter timeout (30-60 seconds) in development helps identify stuck processes quickly.

  3. Use framework logging: The framework's Log() method automatically includes PID/Alias in log output, enabling correlation with profiler data.

  4. Use structured logging: The framework's logging system supports log levels and structured fields. Add context with AddFields() for correlation:
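
    A hedged sketch inside an actor callback, where w is the actor receiver (the exact AddFields signature is an assumption; key/value pairs are shown here):

    w.Log().AddFields("request_id", requestID, "customer", customerID)
    w.Log().Info("processing order %s", orderID)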

    For scoped logging, use PushFields()/PopFields() to save and restore field sets.

  5. Profile regularly: Periodic profiling during development helps catch performance regressions before production.

  6. Test shutdown paths: Explicitly test graceful shutdown to verify all actors terminate cleanly.

Summary

Debugging actor systems requires tools that bridge the gap between logical actors and runtime goroutines. Ergo Framework provides this bridge through:

  • Build tags that enable profiling and diagnostics without production overhead

  • Goroutine labels that link runtime goroutines to their actor (PID) and meta process (Alias) identities

  • Shutdown diagnostics that identify processes preventing clean termination

  • Observer integration for visual inspection of running systems

Combined with Go's standard profiling tools, these capabilities enable effective debugging of even complex distributed systems.
