Add some more extra redundancy again

It’s the season for coughs and sniffles, and last week I took my turn. I went to bed one night with a stuffy nose, and it got me thinking about software.

What’s the connection between sniffles and software, you ask?

Let’s talk redundancy. It’s a familiar technique in software design, but I believe we compartmentalize it too much under the special topic of “high availability”–as if only when that’s an explicit requirement do we need to pay any attention.

Redundancy can be a big deal. Image credit: ydant (Flickr)

Redundancy in nature

Mother Nature’s use of redundancy is so pervasive that we may not even realize it’s at work. We could learn a thing or two from how she weaves it–seamlessly, consistently, tenaciously–into the tapestry of life.

Redundancy had everything to do with the fact that I didn’t asphyxiate as I slept with a cold. People have more than one sinus, so sleeping with a few of them plugged up isn’t life-threatening. If nose breathing isn’t an option, we can always open our mouths. We have two lungs, not one–and each consists of huge numbers of alveoli, each of which does part of the work of exchanging oxygen and carbon dioxide. Continue reading

Why we need try…finally, not just RAII

The claim has been made that because C++ supports RAII (resource acquisition is initialization), it doesn’t need try…finally. I think this is wrong. There is at least one use case for try…finally that’s very awkward to model with RAII, and one that’s so ugly it ought to be outlawed.

The awkward use case

What if what you want to do during a …finally block has nothing to do with freeing resources? For example, suppose you’re writing the next version of Guitar Hero, and you want to guarantee that when your avatar leaves the stage, the last thing she does is take a bow–even if the player interrupts the performance or an error occurs.

…finally, take a bow. Photo credit: gavinzac (Flickr)

Of course you can ensure this behavior with an RAII pattern, but it’s silly and artificial. Which of the following two snippets is cleaner and better expresses intent? Continue reading
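
The original snippets sit behind the link above, so what follows is only my own minimal sketch of what the RAII side of that comparison could look like. Every name in it (ScopeGuard, perform, g_actions) is hypothetical and invented for illustration. The point is that the “bow” has to be dressed up as an object whose destructor happens to do the real work.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of what the avatar did, so the order of events is visible.
std::vector<std::string> g_actions;

// A generic scope guard: runs the stored callable when the guard goes out of
// scope, whether the scope exits normally or because an exception propagates.
// (Sketch only: a production guard would need to keep the callable from
// throwing inside the destructor.)
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> onExit) : onExit_(std::move(onExit)) {}
    ~ScopeGuard() { onExit_(); }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> onExit_;
};

// Hypothetical performance routine. It throws if the player interrupts.
void perform(bool interrupted) {
    // No resource is acquired here; the guard exists only for its destructor.
    ScopeGuard bow([] { g_actions.push_back("bow"); });

    g_actions.push_back("play");
    if (interrupted) {
        throw std::runtime_error("player interrupted the performance");
    }
    g_actions.push_back("encore");
}
```

Calling perform(false) records play, encore, bow; calling perform(true) records play, bow and then lets the exception continue to the caller. It works, but the intent (“finally, take a bow”) is buried in the declaration at the top of the function rather than stated where it happens.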

Don’t forget the circuit breakers

Recently I’ve been pondering an interesting book called Release It!, by Michael Nygard. It’s full of anecdotes from someone who has spent a major portion of his career troubleshooting high-profile crashes of some of the most complex production software systems in the world–airline reservations, financial institutions, leading online retailers, and so forth.

A circuit breaker. Photo credit: Wikimedia Commons.

One design pattern that Nygard recommends was new to me, but it rang true as soon as I saw its description. As with many classic patterns, I’ve implemented variations on it without knowing the terminology. I like Nygard’s formulation, so I thought I’d summarize it here; as I’ve said before, good code plans for problems.

The pattern is called circuit breaker, and its purpose is to prevent runaway failures.

In systems without circuit breakers, failures in an external call may cause an exception on the caller’s side; this can cause the caller to log, retry, and/or execute other specialized logic. Since errors are supposed to be the corner case, the blocks of code that handle them are often expensive to execute. The very slowness of the error-handling codepath can be the source of further failures, because locks are held longer than normal, or because we poll until a connection is restored, overwhelming a system that’s already limping.

Or, to borrow an old idiom, “it never rains but it pours.”

In the circuit breaker pattern, on the other hand, the caller assigns each “circuit” (a codepath that invokes an external entity) to one of three possible states: closed, open, or half-open. Continue reading
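
To make the three states concrete, here is a minimal sketch of how such a breaker is commonly built. It is not Nygard’s code, and the failure threshold, the cool-down period, and every name (CircuitBreaker, CircuitOpenError, invoke) are my own assumptions for illustration. A closed circuit passes calls through, an open circuit fails fast without touching the external system, and a half-open circuit lets one trial call through to see whether things have recovered.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>

// Thrown when the breaker refuses a call without attempting it.
struct CircuitOpenError : std::runtime_error {
    CircuitOpenError() : std::runtime_error("circuit open: failing fast") {}
};

// Illustrative, single-threaded circuit breaker (no locking).
class CircuitBreaker {
public:
    enum class State { Closed, Open, HalfOpen };
    using Clock = std::chrono::steady_clock;

    CircuitBreaker(int failureThreshold, Clock::duration coolDown)
        : failureThreshold_(failureThreshold), coolDown_(coolDown) {}

    State state() const { return state_; }

    // Runs `call` if the circuit allows it; rethrows whatever `call` throws.
    void invoke(const std::function<void()>& call) {
        if (state_ == State::Open) {
            if (Clock::now() - openedAt_ < coolDown_) {
                throw CircuitOpenError();  // skip the expensive failure path
            }
            state_ = State::HalfOpen;      // cool-down elapsed: allow one trial
        }
        try {
            call();
        } catch (...) {
            onFailure();
            throw;
        }
        failures_ = 0;                     // success closes the circuit
        state_ = State::Closed;
    }

private:
    void onFailure() {
        ++failures_;
        // A failed trial call, or too many failures in a row, opens the circuit.
        if (state_ == State::HalfOpen || failures_ >= failureThreshold_) {
            state_ = State::Open;
            openedAt_ = Clock::now();
        }
    }

    int failureThreshold_;
    Clock::duration coolDown_;
    int failures_ = 0;
    State state_ = State::Closed;
    Clock::time_point openedAt_{};
};
```

With a threshold of 2, the third call to a dead service never reaches it: the breaker throws CircuitOpenError immediately, which is exactly the relief a limping system needs.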

Why Exceptions Aren’t Enough

(This post is a logical sequel to my earlier musings about having a coherent strategy to handle problems.)

Back in the dark ages, programmers wrote functions that returned numeric errors:

if (prepare() == SUCCESS) {
  doIt();
}

This methodology has the virtue of being simple and fast. We could switch based on the error code. A “feature” of our apps was that our users could google an error code to see if they had company:

Image credit: xkcd.com

However, as we wrote code, we sometimes forgot to check errors, or tell users about them:

prepare();
doIt();

Admit it; you’ve written code like this. So have I. The mechanism lets a caller be irresponsible and ignore the signal the called function sends. Not good. Even if you are being responsible, the set of possible return values is nearly unbounded, and you get subtle downstream bugs if a called function adds a new return value when a caller is switching on return values.
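
Here is a small sketch of that last bug. The status codes and both functions are hypothetical; imagine the caller was written when SUCCESS and NOT_READY were the only values, and DISK_FULL arrived in a later release of the callee.

```cpp
// Hypothetical status codes. DISK_FULL was added after the caller was written.
enum Status { SUCCESS = 0, NOT_READY = 1, DISK_FULL = 2 };

// Hypothetical callee: newer versions can report a full disk.
Status prepare(bool diskIsFull) {
    return diskIsFull ? DISK_FULL : SUCCESS;
}

// Caller logic written against the old set of codes. At the time, "anything
// that isn't NOT_READY" could only mean SUCCESS, so the default looked safe.
bool shouldProceed(Status status) {
    switch (status) {
        case NOT_READY:
            return false;
        default:
            return true;  // silently treats DISK_FULL as success
    }
}
```

Nothing fails to compile and nothing crashes: shouldProceed(prepare(true)) simply returns true, and doIt() runs against a full disk. Leaving out the default lets GCC and Clang warn about unhandled enumerators (-Wswitch, which -Wall enables), but that only helps when the codes are a real enum and the caller gets recompiled.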

Another problem with this approach to errors Continue reading

Good Code Plans for Problems

(Another post in my “What is ‘Good Code’?” series…)

What should you do when software doesn’t work the way you expect?

Surprising behavior. Photo credit: epicfail.com.

You have to have a plan. The plan could bring one (or several) of the following strategies to bear:

  • Reject bad input as early as possible using preconditions.
  • Get garbage in, put garbage out.
  • Throw an exception and make it someone else’s problem.
  • Return a meaningful error.
  • Log the problem.

These choices are not equally good, IMO, and there are times when each is more or less appropriate. Perhaps I’ll blog about that in another post…
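
For concreteness, here is a rough sketch of what three of those strategies can look like in code. The functions (average, parsePercent, percentOrDefault) and the ParseResult type are hypothetical examples of mine, not part of any real library, and which strategy belongs where is a judgment call.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Strategy: reject bad input early with a precondition, and throw so that
// the problem becomes the caller's to handle.
double average(const std::vector<double>& samples) {
    if (samples.empty()) {
        throw std::invalid_argument("average: samples must not be empty");
    }
    double sum = 0.0;
    for (double s : samples) sum += s;
    return sum / static_cast<double>(samples.size());
}

// Strategy: return a meaningful error instead of a bare numeric code.
struct ParseResult {
    bool ok;
    int value;
    std::string error;  // empty when ok is true
};

ParseResult parsePercent(const std::string& text) {
    if (text.empty()) return {false, 0, "empty input"};
    int value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return {false, 0, "not a number: " + text};
        value = value * 10 + (c - '0');
        if (value > 100) return {false, 0, "out of range (0-100): " + text};
    }
    return {true, value, ""};
}

// Strategy: log the problem and carry on with a documented fallback.
int percentOrDefault(const std::string& text, int fallback) {
    ParseResult result = parsePercent(text);
    if (!result.ok) {
        std::cerr << "warning: " << result.error
                  << "; using " << fallback << " instead\n";
        return fallback;
    }
    return result.value;
}
```

Note that each function announces its plan in its signature or its comment, which is most of the battle.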

Regardless of which strategy or strategies you pick, the overriding rule is: develop, announce, and execute a specific plan for handling problems.

This rule applies at all levels of code — low-level algorithms, modules, applications, entire software ecosystems (see my post about how software is like biology). The plan can be (perhaps should be) different in different places. But just as the actions of dozens of squads or platoons might need to be coordinated to take a hill, the individual choices you make about micro problem-handling should contribute to a cohesive whole at the macro level.

Notice that I’ve talked about “problems,” not “exceptions” or “errors.” Problems certainly include exceptions or errors, but using either of those narrower terms tends to confine your thinking. Exceptions are a nifty mechanism, but they don’t propagate across process boundaries (at least, not without some careful planning). Sometimes a glut of warnings is a serious problem, even if no individual event rises to the level of an “error.”

Action Item

Evaluate the plan(s) for problem-handling in one corner of your code. Is the plan explicit and understood by all users and maintainers? Is it implemented consistently? Is it tested? Pick one thing you can do to improve the plan.