Supervision
In traditional programming, errors propagate upward through the call stack via exceptions. In the actor model, errors propagate to the supervisor. This is the "let it crash" philosophy pioneered by Erlang/OTP: instead of writing defensive code within the actor to handle every possible failure, let the actor fail fast and let its supervisor decide the recovery strategy.
A supervisor is any actor that has spawned children. The supervision strategy defines what happens when a child fails:
| Directive | Effect |
|---|---|
| Directive.restart | Restart the actor with its initial behavior, resetting state |
| Directive.stop | Stop the actor permanently |
| Directive.escalate | Propagate the failure to the next supervisor up the hierarchy |
OneForOneStrategy supervises each child independently. It counts restarts within a configurable time window, and a child that exceeds the limit is stopped instead of restarted:
```python
import asyncio

# ActorContext, ActorSystem, Behavior, Behaviors, Directive, and
# OneForOneStrategy are assumed to be imported from the library,
# as in the earlier sections.


def unreliable_worker() -> Behavior[str]:
    async def receive(ctx: ActorContext[str], msg: str) -> Behavior[str]:
        if msg == "crash":
            # Fail fast: no local error handling, the supervisor decides.
            raise RuntimeError("something went wrong")
        print(f"Processed: {msg}")
        return Behaviors.same()

    return Behaviors.receive(receive)


async def main() -> None:
    # Allow at most 3 restarts per window; always choose restart on failure.
    strategy = OneForOneStrategy(
        max_restarts=3,
        within=60.0,
        decider=lambda exc: Directive.restart,
    )
    async with ActorSystem() as system:
        ref = system.spawn(
            Behaviors.supervise(unreliable_worker(), strategy),
            "worker",
        )
        ref.tell("crash")  # fails, supervisor restarts
        await asyncio.sleep(0.2)
        ref.tell("hello")  # succeeds: actor recovered
        await asyncio.sleep(0.1)


asyncio.run(main())
```
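The restart limit described above can be pictured as a small piece of bookkeeping. The following is a minimal sketch, not the library's implementation: `RestartWindow` and `allow_restart` are hypothetical names, and a sliding window is assumed (the library may count restarts differently). Timestamps are passed in explicitly so the behavior is easy to follow.

```python
import time
from collections import deque
from typing import Optional


class RestartWindow:
    """Hypothetical helper: permits at most max_restarts per sliding window."""

    def __init__(self, max_restarts: int, within: float) -> None:
        self.max_restarts = max_restarts
        self.within = within
        self._stamps = deque()  # times of the restarts still inside the window

    def allow_restart(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Forget restarts that have fallen out of the window.
        while self._stamps and now - self._stamps[0] > self.within:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False  # limit exceeded: a supervisor would stop the child
        self._stamps.append(now)
        return True


window = RestartWindow(max_restarts=3, within=60.0)
# Four failures in quick succession, then one after the window has passed.
decisions = [window.allow_restart(now=t) for t in (0.0, 1.0, 2.0, 3.0, 70.0)]
print(decisions)  # [True, True, True, False, True]
```

Under these assumptions the fourth failure is refused because three restarts already sit inside the window, while the failure at `t=70.0` is allowed again once the earlier restarts have expired.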
The decider function receives the exception and returns a directive. This allows fine-grained control: restart on transient errors, stop on fatal ones, escalate on unknown failures.
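As an illustration of such a decider, here is a minimal sketch. The `Directive` enum below is a local stand-in that only mirrors the three directive names from the table so the snippet runs on its own; the function name `decide` and the grouping of exception types into transient and fatal are assumptions chosen for the example.

```python
from enum import Enum, auto


class Directive(Enum):
    """Local stand-in for the library's Directive (illustration only)."""

    restart = auto()
    stop = auto()
    escalate = auto()


def decide(exc: Exception) -> Directive:
    # Transient failures: a fresh start is likely to succeed.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Directive.restart
    # Bad input or programming errors: a restart would fail the same way.
    if isinstance(exc, (ValueError, TypeError)):
        return Directive.stop
    # Anything unrecognized is handed to the next supervisor up.
    return Directive.escalate


print(decide(TimeoutError()))        # Directive.restart
print(decide(ValueError("bad")))     # Directive.stop
print(decide(RuntimeError("boom")))  # Directive.escalate
```

With the real library, a function of this shape would presumably be passed as `decider=decide` in place of the lambda used in the example above.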
An important interaction to note: when a supervised actor without event sourcing is restarted, its state is reset to the initial behavior. This means the bank account from previous sections would lose its balance on restart. Event sourcing (covered in Persistence) solves this by replaying persisted events to reconstruct state after a restart.
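The idea of rebuilding state by replay can be sketched without any library support. The event types `Deposited` and `Withdrawn`, the `apply_event` function, and the in-memory `journal` list are hypothetical names for this illustration; the actual persistence API is the one described in Persistence.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Deposited:
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    amount: int


def apply_event(balance: int, event: object) -> int:
    """Pure state transition: current balance plus one event gives the next balance."""
    if isinstance(event, Deposited):
        return balance + event.amount
    if isinstance(event, Withdrawn):
        return balance - event.amount
    raise TypeError(f"unknown event: {event!r}")


# Events persisted before the crash (kept in memory here for simplicity).
journal = [Deposited(100), Withdrawn(30), Deposited(5)]

# After a restart the actor begins at its initial state (0) and replays the journal.
recovered_balance = reduce(apply_event, journal, 0)
print(recovered_balance)  # 75
```

Because the balance is derived entirely from the journal, resetting the actor to its initial behavior loses nothing that was persisted.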