Supervisor
Actors fail. They panic, encounter errors, or lose external resources. In traditional systems, you add defensive code: catch exceptions, retry operations, validate state. This spreads failure handling throughout your codebase, mixing recovery logic with business logic.
The actor model takes a different approach: let it crash. When an actor fails, terminate it cleanly and restart it in a known-good state. This requires something watching the actor and managing its lifecycle - a supervisor.
act.Supervisor is an actor that manages child processes. It starts them during initialization, monitors them for failures, and applies restart strategies when they terminate. Supervisors can manage other supervisors, creating hierarchical fault tolerance trees where failures are isolated and recovered automatically.
Like act.Actor, the act.Supervisor struct implements the low-level gen.ProcessBehavior interface and embeds the gen.Process interface. To create a supervisor, you embed act.Supervisor in your struct and implement the act.SupervisorBehavior interface:
type SupervisorBehavior interface {
gen.ProcessBehavior
// Init invoked on supervisor spawn - MANDATORY
Init(args ...any) (SupervisorSpec, error)
// HandleChildStart invoked when a child starts (if EnableHandleChild is true)
HandleChildStart(name gen.Atom, pid gen.PID) error
// HandleChildTerminate invoked when a child terminates (if EnableHandleChild is true)
HandleChildTerminate(name gen.Atom, pid gen.PID, reason error) error
// HandleMessage invoked for regular messages
HandleMessage(from gen.PID, message any) error
// HandleCall invoked for synchronous requests
HandleCall(from gen.PID, ref gen.Ref, request any) (any, error)
// HandleEvent invoked for subscribed events
HandleEvent(message gen.MessageEvent) error
// HandleInspect invoked for inspection requests
HandleInspect(from gen.PID, item ...string) map[string]string
// Terminate invoked on supervisor termination
Terminate(reason error)
}
Only Init is mandatory. All other methods are optional - act.Supervisor provides default implementations that log warnings. The Init method returns SupervisorSpec, which defines the supervisor's behavior, children, and restart strategy.
Creating a Supervisor
Embed act.Supervisor and implement Init to define the supervision spec:
type AppSupervisor struct {
act.Supervisor
}
func (s *AppSupervisor) Init(args ...any) (act.SupervisorSpec, error) {
return act.SupervisorSpec{
Type: act.SupervisorTypeOneForOne,
Children: []act.SupervisorChildSpec{
{
Name: "database",
Factory: createDBWorker,
Args: []any{"postgres://..."},
},
{
Name: "api",
Factory: createAPIServer,
Args: []any{8080},
},
},
Restart: act.SupervisorRestart{
Strategy: act.SupervisorStrategyTransient,
Intensity: 5,
Period: 5,
},
}, nil
}
func createSupervisorFactory() gen.ProcessBehavior {
return &AppSupervisor{}
}
// Spawn the supervisor
pid, err := node.Spawn(createSupervisorFactory, gen.ProcessOptions{})
The supervisor spawns all children during Init (except Simple One For One, which starts with zero children). Each child is linked bidirectionally to the supervisor (LinkChild and LinkParent set automatically). If a child terminates, the supervisor receives an exit signal and applies the restart strategy.
Children are started sequentially in declaration order. If any child's spawn fails (the factory's ProcessInit returns an error), the supervisor terminates immediately with that error. This ensures the supervision tree is fully initialized or not at all - no partial states.
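The factories referenced in the spec are ordinary process factories - functions returning a gen.ProcessBehavior. A minimal sketch of what createDBWorker might look like, assuming act.Actor's optional Init(args ...any) error callback (the worker itself is hypothetical):
type DBWorker struct {
    act.Actor
    conn string
}

// Init receives the Args from the child spec ("postgres://..." above).
func (w *DBWorker) Init(args ...any) error {
    if len(args) > 0 {
        w.conn, _ = args[0].(string)
    }
    w.Log().Info("database worker starting with %s", w.conn)
    return nil // returning an error here aborts the whole supervisor startup
}

func createDBWorker() gen.ProcessBehavior {
    return &DBWorker{}
}
A child's Factory can just as well return another act.Supervisor-based behavior - that is how the hierarchical supervision trees mentioned above are assembled.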
Supervision Types
The Type field in SupervisorSpec determines what happens when a child fails.
One For One
Each child is independent. When one child terminates, only that child is restarted. Other children continue running unaffected.
Type: act.SupervisorTypeOneForOne,
Children: []act.SupervisorChildSpec{
{Name: "worker1", Factory: createWorker},
{Name: "worker2", Factory: createWorker},
{Name: "worker3", Factory: createWorker},
},
If worker2 crashes, the supervisor restarts only worker2. worker1 and worker3 keep running. Use this when children are independent - databases, caches, API handlers that don't depend on each other.
Each child runs with a registered name (the Name from the spec). This means only one instance per child spec. To run multiple instances of the same worker, use Simple One For One instead.
All For One
Children are tightly coupled. When any child terminates, all children are stopped and restarted together.
Type: act.SupervisorTypeAllForOne,
Children: []act.SupervisorChildSpec{
{Name: "cache", Factory: createCache},
{Name: "processor", Factory: createProcessor}, // Depends on cache
{Name: "api", Factory: createAPI}, // Depends on both
},
If cache crashes, the supervisor stops processor and api (in reverse order if KeepOrder is true, simultaneously otherwise), then restarts all three in declaration order. Use this when children share state or dependencies that can't survive partial failures.
Rest For One
When a child terminates, only children started after it are affected. Children started before it continue running.
Type: act.SupervisorTypeRestForOne,
Children: []act.SupervisorChildSpec{
{Name: "database", Factory: createDB}, // Independent
{Name: "cache", Factory: createCache}, // Depends on database
{Name: "api", Factory: createAPI}, // Depends on cache
},
If cache crashes, the supervisor stops api, then restarts cache and api in order. database is unaffected. Use this for dependency chains where later children depend on earlier ones, but earlier ones don't depend on later ones.
With KeepOrder: true, children are stopped sequentially (last to first). With KeepOrder: false, they stop simultaneously. Either way, restart happens in declaration order after all affected children have stopped.
Simple One For One
All children run the same code, spawned dynamically instead of at supervisor startup.
Type: act.SupervisorTypeSimpleOneForOne,
Children: []act.SupervisorChildSpec{
{
Name: "worker",
Factory: createWorker,
Args: []any{"default-config"},
},
},
The supervisor starts with zero children. Call supervisor.StartChild("worker", "custom-args") to spawn instances:
// Start 5 worker instances
for i := 0; i < 5; i++ {
supervisor.StartChild("worker", fmt.Sprintf("worker-%d", i))
}
Each instance is independent. They're not registered by name (no SpawnRegister), so you track them by PID. When an instance terminates, only that instance is restarted (if the restart strategy allows). Other instances continue running.
Use Simple One For One for worker pools where you dynamically scale the number of identical workers based on load. The child spec is a template - each StartChild creates a new instance from that template.
Restart Strategies
The Restart.Strategy field determines when children are restarted.
Transient (Default)
Restart only on abnormal termination. If a child returns gen.TerminateReasonNormal or gen.TerminateReasonShutdown, it's not restarted:
Restart: act.SupervisorRestart{
Strategy: act.SupervisorStrategyTransient, // Default
}
Use this for workers that can gracefully stop - maybe they finished their work, or received a shutdown command. Crashes (panics, errors, kills) trigger restarts. Normal termination doesn't.
Temporary
Never restart, regardless of termination reason:
Restart: act.SupervisorRestart{
Strategy: act.SupervisorStrategyTemporary,
}
The child runs once. If it terminates (normal or crash), it stays terminated. Use this for initialization tasks or processes that shouldn't be restarted automatically.
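A sketch of that use - a supervisor whose only purpose is to run one-shot startup tasks (SetupSupervisor and createMigrator are hypothetical names):
type SetupSupervisor struct {
    act.Supervisor
}

func (s *SetupSupervisor) Init(args ...any) (act.SupervisorSpec, error) {
    return act.SupervisorSpec{
        Type: act.SupervisorTypeOneForOne,
        Children: []act.SupervisorChildSpec{
            // Runs the migrations once and terminates normally when done.
            {Name: "migrations", Factory: createMigrator},
        },
        Restart: act.SupervisorRestart{
            // Never restarted - not even after a crash.
            Strategy: act.SupervisorStrategyTemporary,
        },
    }, nil
}
With auto shutdown left enabled (the default), the supervisor itself terminates normally once its one-shot children are gone.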
Permanent
Always restart, regardless of termination reason:
Restart: act.SupervisorRestart{
Strategy: act.SupervisorStrategyPermanent,
}
Even gen.TerminateReasonNormal triggers restart. Use this for critical processes that must always be running - maybe a health monitor or connection manager that should never stop.
With Permanent strategy, DisableAutoShutdown is ignored, and the Significant flag has no effect - every child termination triggers restart.
Restart Intensity
Restarts aren't free. If a child crashes repeatedly, restarting it repeatedly just wastes resources. The Intensity and Period options limit restart frequency:
Restart: act.SupervisorRestart{
Strategy: act.SupervisorStrategyTransient,
Intensity: 5, // Maximum 5 restarts
Period: 10, // Within 10 seconds
}
The supervisor tracks restart timestamps (in milliseconds). When a child terminates and needs restart, the supervisor checks: have there been more than Intensity restarts in the last Period seconds? If yes, the restart intensity is exceeded. The supervisor stops all children and terminates itself with act.ErrSupervisorRestartsExceeded.
Old restarts outside the period window are discarded from tracking. This is a sliding window: if your child crashes 5 times in 10 seconds, then runs stable for 11 seconds, then crashes again - the counter resets. It's 1 restart in the window, not 6 total.
Default values are Intensity: 5 and Period: 5 if you don't specify them.
Significant Children
In All For One and Rest For One supervisors, the Significant flag marks children whose termination can trigger supervisor shutdown:
Children: []act.SupervisorChildSpec{
{
Name: "critical_service",
Factory: createCriticalService,
Significant: true, // If this stops cleanly, supervisor stops
},
{
Name: "helper",
Factory: createHelper,
// Significant: false (default)
},
},
With SupervisorStrategyTransient:
Significant child terminates normally → supervisor stops all children and terminates
Significant child crashes → restart strategy applies
Non-significant child → restart strategy applies regardless of termination reason
With SupervisorStrategyTemporary:
Significant child terminates (any reason) → supervisor stops all children and terminates
Non-significant child → no restart, child stays terminated
With SupervisorStrategyPermanent:
Significant flag is ignored
All terminations trigger restart
For One For One and Simple One For One, Significant is always ignored.
Use significant children when a specific child's clean termination means "mission accomplished, shut down the subtree." Example: a batch processor that finishes its work and terminates normally should stop the entire supervision tree, not get restarted.
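A sketch of that batch pattern, using hypothetical names (BatchSupervisor, createFetcher, createBatchProcessor):
type BatchSupervisor struct {
    act.Supervisor
}

func (s *BatchSupervisor) Init(args ...any) (act.SupervisorSpec, error) {
    return act.SupervisorSpec{
        Type: act.SupervisorTypeRestForOne,
        Children: []act.SupervisorChildSpec{
            {Name: "fetcher", Factory: createFetcher},
            {
                Name:        "processor",
                Factory:     createBatchProcessor,
                Significant: true, // clean exit means the batch is done - stop the subtree
            },
        },
        Restart: act.SupervisorRestart{
            Strategy: act.SupervisorStrategyTransient,
        },
    }, nil
}
If the processor finishes and returns gen.TerminateReasonNormal, the supervisor stops the fetcher and terminates; if it crashes, the Transient strategy restarts it as usual.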
Auto Shutdown
By default, if all children terminate normally (not crashes) and none are significant, the supervisor stops itself with gen.TerminateReasonNormal. This is auto shutdown.
DisableAutoShutdown: false, // Default - supervisor stops when children stop
Enable DisableAutoShutdown to keep the supervisor running even with zero children:
DisableAutoShutdown: true, // Supervisor stays alive with zero children
Auto shutdown is ignored for Simple One For One supervisors (they're designed for dynamic children) and ignored when using Permanent strategy.
Use auto shutdown when your supervisor's purpose is managing those specific children. When they're all gone, the supervisor has no purpose. Disable it when the supervisor manages dynamically added children or should stay alive to accept management commands.
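A sketch of a supervisor meant purely for dynamically added children (DynamicSupervisor is hypothetical; the sketch assumes DisableAutoShutdown sits directly in SupervisorSpec, like EnableHandleChild, and that an empty Children list is acceptable when children arrive later via AddChild):
type DynamicSupervisor struct {
    act.Supervisor
}

func (s *DynamicSupervisor) Init(args ...any) (act.SupervisorSpec, error) {
    return act.SupervisorSpec{
        Type:     act.SupervisorTypeOneForOne,
        Children: []act.SupervisorChildSpec{}, // nothing static - children are added via AddChild
        Restart: act.SupervisorRestart{
            Strategy: act.SupervisorStrategyTransient,
        },
        // Stay alive even with zero children.
        DisableAutoShutdown: true,
    }, nil
}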
Keep Order
For All For One and Rest For One, the KeepOrder flag controls how children are stopped:
Restart: act.SupervisorRestart{
KeepOrder: true, // Stop sequentially in reverse order
}
With KeepOrder: true:
Children stop one at a time, last to first
Supervisor waits for each child to fully terminate before stopping the next
Slow but orderly - useful when children have shutdown dependencies
With KeepOrder: false (default):
All affected children receive SendExit simultaneously
They terminate in parallel
Fast but unordered - use when children can shut down independently
After stopping (either way), children restart sequentially in declaration order. KeepOrder only affects stopping, not starting.
For One For One and Simple One For One, KeepOrder is ignored (only one child is affected).
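Putting the restart options together, a Restart block for a dependency chain could look like this (the values are illustrative):
Restart: act.SupervisorRestart{
    Strategy:  act.SupervisorStrategyTransient,
    Intensity: 10,   // allow up to 10 restarts...
    Period:    30,   // ...within a 30-second window
    KeepOrder: true, // stop affected children one at a time, last to first
},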
Dynamic Management
Supervisors provide methods for runtime adjustments:
// Start a child from the spec (if not already running)
err := supervisor.StartChild("worker")
// Start with different args (overrides spec)
err := supervisor.StartChild("worker", "new-config")
// Add a new child spec and start it
err := supervisor.AddChild(act.SupervisorChildSpec{
Name: "new_worker",
Factory: createWorker,
})
// Disable a child (stops it, won't restart on crash)
err := supervisor.DisableChild("worker")
// Re-enable a disabled child (starts it again)
err := supervisor.EnableChild("worker")
// Get list of children
children := supervisor.Children()
for _, child := range children {
fmt.Printf("Spec: %s, PID: %s, Disabled: %v\n",
child.Spec, child.PID, child.Disabled)
}
Critical: These methods fail with act.ErrSupervisorStrategyActive if called while the supervisor is executing a restart strategy. The supervisor is in supStateStrategy mode - it's stopping children, waiting for exit signals, or starting replacements. You must wait for it to return to supStateNormal before making management calls.
When the supervisor is applying a strategy, it processes only the Urgent queue (where exit signals arrive) and ignores System and Main queues. This ensures exit signals are handled promptly without interference from management commands or regular messages.
StartChild with args updates the spec's args for future restarts. This lets you reconfigure children dynamically - call StartChild("worker", newConfig) and subsequent restarts use newConfig.
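Since a restart strategy may be in flight when a management call arrives, it's worth handling that error explicitly. A hypothetical sketch (WorkerSupervisor and startWorkerRequest are made-up names; it assumes act.ErrSupervisorStrategyActive is a comparable sentinel error and that the supervisor can message itself via Send(s.PID(), ...)):
type startWorkerRequest struct{ config string }

type WorkerSupervisor struct {
    act.Supervisor
}

func (s *WorkerSupervisor) startWorker(config string) error {
    err := s.StartChild("worker", config)
    if err == act.ErrSupervisorStrategyActive {
        // The supervisor is mid-restart. Re-queue the request as a message to
        // ourselves; it sits in the Main queue and is processed only after the
        // supervisor returns to its normal state.
        return s.Send(s.PID(), startWorkerRequest{config})
    }
    return err
}

func (s *WorkerSupervisor) HandleMessage(from gen.PID, message any) error {
    if req, ok := message.(startWorkerRequest); ok {
        return s.startWorker(req.config)
    }
    return nil
}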
Child Callbacks
Enable EnableHandleChild: true to receive notifications when children start or stop:
func (s *AppSupervisor) Init(args ...any) (act.SupervisorSpec, error) {
return act.SupervisorSpec{
EnableHandleChild: true,
// ... rest of spec
}, nil
}
func (s *AppSupervisor) HandleChildStart(name gen.Atom, pid gen.PID) error {
s.Log().Info("child %s started with PID %s", name, pid)
// Maybe register in service discovery, send init message
return nil
}
func (s *AppSupervisor) HandleChildTerminate(name gen.Atom, pid gen.PID, reason error) error {
s.Log().Info("child %s (PID %s) terminated: %s", name, pid, reason)
// Maybe deregister from service discovery, clean up resources
return nil
}
These callbacks run after the restart strategy completes. For example:
Child crashes
Supervisor applies restart strategy (stops affected children if needed)
Supervisor starts replacement children
Then HandleChildTerminate is called for the terminated child
Then HandleChildStart is called for the replacement
The callbacks are invoked as regular messages sent by the supervisor to itself. They arrive in the Main queue, so they're processed after the restart logic (which happens in the exit signal handler).
If HandleChildStart or HandleChildTerminate returns an error, the supervisor terminates with that error. Use these callbacks for integration with external systems, not for restart decisions - restart logic is handled by the supervisor type and strategy.
Supervisor as a Regular Actor
Supervisors are actors. They have mailboxes, handle messages, and can communicate with other processes:
func (s *AppSupervisor) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case ScaleCommand:
if msg.Up {
s.AddWorkers(msg.Count)
} else {
s.RemoveWorkers(msg.Count)
}
case HealthCheckRequest:
children := s.Children()
s.Send(from, HealthResponse{
Running: len(children),
Healthy: s.countHealthy(children),
})
}
return nil
}
func (s *AppSupervisor) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
switch request.(type) {
case GetChildrenRequest:
return s.Children(), nil
}
return nil, nil
}
This lets you build management APIs: query supervisor state, scale children dynamically, reconfigure at runtime. The supervisor processes these messages between handling exit signals.
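The message types in this example are application-defined, not part of the framework. A matching sketch of what they might look like:
// Application-defined messages for the management API above (hypothetical).
type ScaleCommand struct {
    Up    bool
    Count int
}

type HealthCheckRequest struct{}

type HealthResponse struct {
    Running int
    Healthy int
}

type GetChildrenRequest struct{}
AddWorkers, RemoveWorkers, and countHealthy would likewise be helpers you write on AppSupervisor, not supervisor methods.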
Restart Intensity Behavior
Understanding restart intensity is critical for reliable systems. Here's exactly how it works:
The supervisor maintains a list of restart timestamps in milliseconds. When a child terminates and restart is needed:
Append current timestamp to the list
Remove timestamps older than Period seconds
If list length > Intensity, intensity is exceeded
If exceeded: stop all children, terminate supervisor with act.ErrSupervisorRestartsExceeded
If not exceeded: proceed with restart
Example with Intensity: 3, Period: 5:
Time 0s: Child crashes → restart (count: 1)
Time 1s: Child crashes → restart (count: 2)
Time 2s: Child crashes → restart (count: 3)
Time 3s: Child crashes → EXCEEDED (count: 4 within 5s window)
→ Stop all children, supervisor terminates
But if the child runs stable between crashes:
Time 0s: Child crashes → restart (count: 1)
Time 6s: Child crashes → restart (count: 1, previous outside window)
Time 12s: Child crashes → restart (count: 1, previous outside window)
The sliding window means intermittent failures don't accumulate. Only rapid repeated failures exceed intensity.
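The bookkeeping behind this is small. A standalone sketch of the sliding-window check as described above (not the framework's actual implementation):
// exceedsIntensity reports whether adding a restart at "now" pushes the number
// of restarts within the last period over the allowed intensity. restarts
// holds timestamps in milliseconds; the pruned slice is returned for reuse.
func exceedsIntensity(restarts []int64, now int64, intensity int, periodSeconds int64) ([]int64, bool) {
    restarts = append(restarts, now)

    // Drop timestamps that fell outside the window.
    cutoff := now - periodSeconds*1000
    kept := restarts[:0]
    for _, ts := range restarts {
        if ts >= cutoff {
            kept = append(kept, ts)
        }
    }
    return kept, len(kept) > intensity
}
With Intensity: 3 and Period: 5, this returns true on the fourth crash at time 3s in the first timeline above.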
Shutdown Behavior
When a supervisor terminates (receives an exit signal, returns an error from a callback such as HandleMessage, or crashes), it stops all children first:
Send gen.TerminateReasonShutdown via SendExit to all running children
Wait for all children to terminate
Call the Terminate callback
Remove the supervisor from the node
With KeepOrder: true (All For One / Rest For One), children stop sequentially. With KeepOrder: false, they stop in parallel. Either way, the supervisor waits for all to finish before terminating itself.
If a non-child process sends the supervisor an exit signal (via Link or SendExit), the supervisor initiates shutdown. This is how parent supervisors stop child supervisors - send an exit signal, and the entire subtree shuts down cleanly.
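A sketch of stopping a whole subtree from the outside, assuming gen.Process exposes SendExit(pid, reason) as used throughout this chapter (p is any process; subtreePID is the hypothetical PID of the subtree's root supervisor):
// One exit signal to the root supervisor cascades down: it stops its children
// (nested supervisors stop theirs), runs Terminate, and leaves the node.
if err := p.SendExit(subtreePID, gen.TerminateReasonShutdown); err != nil {
    // handle the error (e.g. the supervisor is already gone)
}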
Dynamic Children (Simple One For One)
Simple One For One supervisors start with empty children and spawn them on demand:
Type: act.SupervisorTypeSimpleOneForOne,
Children: []act.SupervisorChildSpec{
{
Name: "worker", // Template name
Factory: createWorker,
Args: []any{"default-config"},
},
},
Start instances with StartChild:
// Start 10 workers with different args
for i := 0; i < 10; i++ {
supervisor.StartChild("worker", fmt.Sprintf("worker-%d", i))
}
Each call spawns a new worker. The args passed to StartChild are used for that specific instance. Future restarts of that instance use those args, not the template args from the spec.
Workers are not registered by name (no SpawnRegister). You track them by PID from the return value or via supervisor.Children().
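One way to keep track of those PIDs is through the child callbacks described earlier - a sketch assuming EnableHandleChild: true is set in the spec and that the callbacks report the template name for dynamically spawned instances (PoolSupervisor is a hypothetical name):
type PoolSupervisor struct {
    act.Supervisor
    workers map[gen.PID]bool // live instances spawned from the "worker" template
}

func (s *PoolSupervisor) HandleChildStart(name gen.Atom, pid gen.PID) error {
    if s.workers == nil {
        s.workers = make(map[gen.PID]bool)
    }
    if name == "worker" {
        s.workers[pid] = true
    }
    return nil
}

func (s *PoolSupervisor) HandleChildTerminate(name gen.Atom, pid gen.PID, reason error) error {
    delete(s.workers, pid)
    return nil
}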
Disabling a child spec stops all running instances with that spec name:
// Stops all "worker" instances
supervisor.DisableChild("worker")Simple One For One ignores DisableAutoShutdown - the supervisor never auto-shuts down, even with zero children. It's designed for dynamic workloads where zero children is a valid state.
Patterns and Pitfalls
Set restart intensity carefully. Too low and transient failures kill your supervisor. Too high and crash loops consume resources. Start with defaults (Intensity: 5, Period: 5) and tune based on observed behavior.
Use Significant sparingly. Marking a child significant couples its lifecycle to the entire supervision tree. This is powerful but reduces isolation. Prefer non-significant children and handle critical failures at a higher supervision level.
Don't call management methods during restart. StartChild, AddChild, EnableChild, DisableChild fail with ErrSupervisorStrategyActive if the supervisor is mid-restart. Wait for the restart to complete (check via Inspect or wait for HandleChildStart callback).
Disable auto shutdown for dynamic supervisors. If your supervisor uses AddChild to add children at runtime, enable DisableAutoShutdown. Otherwise, it terminates when it starts with zero children or when all dynamically added children eventually stop.
Use HandleChildStart for integration, not validation. By the time HandleChildStart is called, the child is already spawned and linked. Returning an error terminates the supervisor, but doesn't prevent the child from running. Use the child's Init for validation instead.
KeepOrder is only for stopping. Children always start sequentially in declaration order. KeepOrder controls only the stopping phase of All For One and Rest For One restarts.
Simple One For One args override the template. Args passed to StartChild are used for that instance's restarts, not the template args from the child spec. Update the template args with AddChild if you need to change the default.