Don’t forget the circuit breakers

Recently I’ve been pondering an interesting book called Release It!, by Michael Nygard. It’s full of anecdotes from someone who has spent a major portion of his career troubleshooting high-profile crashes of some of the most complex production software systems in the world–airline reservations, financial institutions, leading online retailers, and so forth.


A circuit breaker. Photo credit: Wikimedia Commons.

One design pattern that Nygard recommends was new to me, but it rang true as soon as I saw its description. Like many classic patterns, I’ve implemented variations on it without knowing the terminology. I like Nygard’s formulation, so I thought I’d summarize it here; as I’ve said before, good code plans for problems.

The pattern is called circuit breaker, and its purpose is to prevent runaway failures.

In systems without circuit breakers, failures in an external call may cause an exception on the caller’s side; this can cause the caller to log, retry, and/or execute other specialized logic. Since errors are supposed to be the corner case, the blocks of code that handle them are often expensive to execute. The very slowness of the error-handling codepath can be the source of further failures, because locks are held longer than normal, or because we poll until a connection is restored, overwhelming a system that’s already limping.

Or, to borrow an old idiom, “it never rains but it pours.”

In the circuit breaker pattern, on the other hand, the caller assigns each “circuit” (a codepath that invokes an external entity) to one of three possible states: closed, open, or half-open. In the closed state, all is copacetic; calls succeed quickly. However, if the caller starts seeing failures or brownouts, and if these failures eventually create enough resistance on the circuit, the circuit’s state is considered open–all traffic on the circuit is suspended for a while, allowing backlogs to clear and former equilibrium to return. While in the open state, code that attempts to use the circuit gets an immediate and cheap failure. After enough time has passed, the circuit breaker assumes a half-open state, where it is willing to try again to see if things are now better. With success, the circuit transitions back to closed (normal); with failure, it reverts to open for more waiting.
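The three states described above can be sketched in a few lines of Python. This is only a minimal illustration, not Nygard's implementation: the class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions of mine, and a simple count of consecutive failures stands in for "enough resistance on the circuit."

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical sketch: closed -> open -> half-open -> closed/open."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self.clock = clock                          # injectable so tests can fake time
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Immediate, cheap failure: the external entity is not touched.
                raise CircuitOpenError("circuit is open; failing fast")
            # Enough time has passed: allow one trial call.
            self.state = "half-open"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Success: back to normal.
        self.state = "closed"
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call reopens at once; otherwise trip at the threshold.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A caller would wrap each external invocation, for example `breaker.call(fetch_quote, symbol)` where `fetch_quote` is whatever function talks to the remote system, and treat `CircuitOpenError` as a signal to degrade gracefully rather than retry. A production version would also need thread safety and probably a smarter failure measure (error rate, latency) than a raw count.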

Nygard’s war stories are an excellent argument for building circuit breakers. I see eloquent support in other contexts as well.

Consider biology. Life holds incredible complexity in equilibrium, at both micro and macro scales, in ways that software barely begins to contemplate. In fact, homeostasis is one of life’s 8 key characteristics. I find it interesting that in many cases, life achieves this balance using feedback loops that temporarily suspend or throttle complex processes in much the same way as the circuit breaker pattern we’re discussing here. A cell regulates its internal pH and salinity based on signals exchanged with the external environment; predator populations grow, shrink, and migrate based upon the abundance of prey; our bodies develop hunger when they need energy, and fatigue when they need time to repair.

If biology isn’t your thing, what about finances? Remember the cascading failures in the financial system that led to the “flash crash” of 2010? Remember how Bear Stearns and Lehman Brothers and AIG fell like dominoes before that? The NYSE has instituted trading curbs that temporarily suspend normal activity when danger signals are observed. Smart.

We need software built with this same “expect failure and plan to handle it” mindset.

Action Item

Next time you are designing an interaction with an external component or subsystem, consider implementing a circuit breaker to make the interaction less fragile and more prone to auto-balancing.

14 thoughts on “Don’t forget the circuit breakers”

  1. Rob Stutton says:

    I’m loving coding a fat client UI in JavaScript – and learning to worry less about comms errors; I never change my state unless the call succeeds and just report any failures to the user with no retries or logging. Since all I/O is tied directly to user actions, they are free to retry whatever they did … it’s very relaxing :-)

    • Interesting. I hadn’t considered the benefits of running inside a robust and well implemented browser–but you’ve certainly put your finger on one of them. +1 for not solving problems when you don’t have to!

  2. Don says:

    I call this subject defensive architecture. It’s common to write code within an architecture and design based upon a story working. In architecture I think that there is an opportunity to use “circuit breaker” techniques, but I also think there is an opportunity to build in sensors and diagnostics that can run inline or, better yet, in “white space”.
    Enterprise code should be written with sensors that pass information to another system whose purpose is to monitor performance and stability and to analyze potential failure. This system could also decide on routing the code another way in case of failure. In this case the routing switches would replace circuit breakers, and the “service” essentially reroutes around a failed or poorly functioning element.
    Bottom line is that we have lots of sensors in most hardware, few in software.
    I see this need for monitoring from within code becoming more critical as the use of open source increases and enterprises provision their code inside the likes of AWS. I bet that Netflix would agree with this notion after the holiday fiasco.

    • Don: Your background with hardware gives you a valuable angle on this that those of us who are pure software folks miss. Thanks for chiming in. I agree with your prediction about this becoming more and more important as software gets more complex. In fact, I’ve been meaning for quite a while to write a post about how living systems (in biology) profit from the ability that all living things share, to react to stimuli. Your observation about sensors points to the same truth.

  3. I have often found it interesting that over half the code is devoted to checking for error conditions and dealing with errors, yet we rarely test those error conditions and test the “golden path” instead. Perhaps the “circuit breaker” would push us to test errors more often.

    • One of my favorite job interview questions (for potential dev or QA hires) is to ask someone how they’d test a simple program. Some people stare at me blankly. Some just regurgitate the golden path. The ones I like to hire are the ones that immediately reel off half a dozen ways that they could imagine the code being broken. That kind of thinker is not only better at writing error-resistant code and better at writing comprehensive tests, but is also more creative and fun to work with overall.
