Recently I’ve been pondering an interesting book called Release It!, by Michael Nygard. It’s full of anecdotes from someone who has spent a major portion of his career troubleshooting high-profile crashes of some of the most complex production software systems in the world: airline reservations, financial institutions, leading online retailers, and so forth.
One design pattern that Nygard recommends was new to me, but it rang true as soon as I saw its description. As with many classic patterns, I’ve implemented variations on it without knowing the terminology. I like Nygard’s formulation, so I thought I’d summarize it here; as I’ve said before, good code plans for problems.
The pattern is called circuit breaker, and its purpose is to prevent runaway failures.
In systems without circuit breakers, failures in an external call may cause an exception on the caller’s side; this can cause the caller to log, retry, and/or execute other specialized logic. Since errors are supposed to be the corner case, the blocks of code that handle them are often expensive to execute. The very slowness of the error-handling codepath can be the source of further failures, because locks are held longer than normal, or because we poll until a connection is restored, overwhelming a system that’s already limping.
Or, to borrow an old idiom, “it never rains but it pours.”
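To make that amplification concrete, here is a minimal sketch of the kind of naive error handling described above. The names `call_with_retries` and `FlakyService` are hypothetical, invented for illustration; the point is only that immediate retries multiply the load on a dependency that is already down.

```python
def call_with_retries(remote_call, max_attempts=5):
    """Naive error handling: retry immediately on every failure."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return remote_call()
        except ConnectionError as exc:
            # A real handler might also log, hold locks, or sleep here.
            last_error = exc
    raise last_error


class FlakyService:
    """Hypothetical stand-in for an external system that is currently down."""

    def __init__(self):
        self.hits = 0

    def fetch(self):
        self.hits += 1
        raise ConnectionError("service unavailable")


service = FlakyService()
for _ in range(100):  # 100 incoming requests
    try:
        call_with_retries(service.fetch)
    except ConnectionError:
        pass

print(service.hits)  # 500: every request hit the failing service 5 times
```

One hundred requests turn into five hundred calls against a system that cannot answer any of them, and each caller spends its time in the slow error path.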
In the circuit breaker pattern, on the other hand, the caller assigns each “circuit” (a codepath that invokes an external entity) to one of three possible states: closed, open, or half-open.
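The three states can be sketched roughly as follows. This is my own minimal illustration rather than Nygard’s code, and it assumes the usual reading of the pattern: a closed circuit lets calls through, an open circuit fails fast without touching the external entity, and a half-open circuit allows a trial call once a timeout has elapsed. The class name, the `failure_threshold` and `reset_timeout` parameters, and their defaults are all assumptions chosen for the example. It is also single-threaded; a production version would need locking.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not touch the struggling external system.
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: let one trial call through.
            self.state = "half-open"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures, opens the circuit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        # Any success closes the circuit and clears the failure count.
        self.state = "closed"
        self.failures = 0
```

A caller would wrap each external invocation, for example `breaker.call(fetch_inventory, sku)` with a hypothetical `fetch_inventory` function, and treat `CircuitOpenError` as a cheap, immediate signal to degrade gracefully instead of running the expensive error path again.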