Codecraft

software = science + art + people

Recently I’ve been pondering an interesting book called Release It!, by Michael Nygard. It’s full of anecdotes from someone who has spent a major portion of his career troubleshooting high-profile crashes of some of the most complex production software systems in the world — airline reservations, financial institutions, leading online retailers, and so forth.

A circuit breaker. Photo credit: Wikimedia Commons.

One pattern from the book illustrates how good code plans for problems.

The pattern is called circuit breaker, and its purpose is to prevent runaway failures.

In systems without circuit breakers, failures in an external call may cause an exception on the caller’s side; this can cause the caller to log, retry, and/or execute other specialized logic. Since errors are supposed to be the corner case, the blocks of code that handle them are often expensive to execute. The very slowness of the error-handling codepath can be the source of further failures, because locks are held longer than normal, or because we poll until a connection is restored, overwhelming a system that’s already limping.

Or, to borrow an old idiom, “it never rains but it pours.”
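To put rough numbers on that pile-up, here is a minimal sketch of the breaker-less behavior. Everything in it is my own illustration rather than anything from the book: `DownService`, `call_with_naive_retry`, and the choice of 5 attempts and 100 callers are all hypothetical.

```python
class DownService:
    """Hypothetical stand-in for an external system that is currently failing."""

    def __init__(self):
        self.hits = 0

    def request(self):
        self.hits += 1  # every attempt still costs the struggling service something
        raise ConnectionError("service is down")


def call_with_naive_retry(func, max_attempts=5):
    """Retry immediately on every failure, with no pause and no memory of past failures."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc  # real code would typically log here, adding more cost
    raise last_error


def simulate(callers=100):
    """Count how many requests reach the failing service when every caller retries."""
    service = DownService()
    for _ in range(callers):
        try:
            call_with_naive_retry(service.request)
        except ConnectionError:
            pass  # the caller gives up only after exhausting its retries
    return service.hits
```

With these assumed numbers, 100 callers turn into 500 requests against a service that cannot answer any of them, and each caller stays busy for the whole retry loop.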

In the circuit breaker pattern, on the other hand, the caller assigns each “circuit” (a codepath that invokes an external entity) to one of three possible states: closed, open, or half-open. In the closed state, all is copacetic; calls succeed quickly. However, if the caller starts seeing failures or brownouts, and if these failures eventually create enough resistance on the circuit, the circuit’s state is considered open — all traffic on the circuit is suspended for a while, allowing backlogs to clear and former equilibrium to return. While in the open state, code that attempts to use the circuit gets an immediate and cheap failure. After enough time has passed, the circuit breaker assumes a half-open state, where it is willing to try again to see if things are now better. With success, the circuit transitions back to closed (normal); with failure, it reverts to open for more waiting.
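The three states described above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not Nygard's implementation: the names `CircuitBreaker` and `CircuitOpenError`, the `failure_threshold` and `reset_timeout` parameters, and the idea of counting consecutive failures as the "resistance" are all choices I made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open (hypothetical name)."""


class CircuitBreaker:
    """Minimal single-threaded sketch; thresholds and names are illustrative assumptions."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Immediate, cheap failure: the external entity is never touched.
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = "half-open"  # enough time has passed, allow one trial call

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise

        # Success in either closed or half-open state restores normal operation.
        self.state = "closed"
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call reopens at once; otherwise open once the threshold is hit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A caller would wrap each external invocation, for example `breaker.call(fetch_inventory, sku)` where `fetch_inventory` is whatever function talks to the remote system, and treat `CircuitOpenError` as a signal to fall back or give up quickly. A production version would also need locking for concurrent callers and probably a notion of slow calls, which this sketch leaves out.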

Nygard’s war stories are an excellent argument for building circuit breakers. I see eloquent support in other contexts as well.

Consider biology. Life steers incredible complexity toward equilibrium, at both micro and macro scales, in ways that software barely begins to contemplate. In fact, homeostasis is often counted among life’s 8 key characteristics. I find it interesting that in many cases, life achieves this balance using feedback loops that temporarily suspend or throttle complex processes in much the same way as the circuit breaker pattern we’re discussing here. A cell regulates its internal pH and salinity based on signals exchanged with the external environment; predator populations grow, shrink, and migrate based upon the abundance of prey; our bodies develop hunger when they need energy, and fatigue when they need time to repair.

If biology isn’t your thing, what about finances? Remember the cascading failures in the financial system that led to the “flash crash” of 2010? Remember how Bear Stearns and Lehman Brothers and AIG fell like dominoes before that? The NYSE has instituted trading curbs that temporarily suspend normal activity when danger signals are observed. Smart.

We need software built with this same “expect failure and plan to handle it” mindset.

Action Item

Next time you are designing an interaction with an external component or subsystem, consider implementing a circuit breaker to make the interaction less fragile and better able to rebalance itself.