It’s the season for coughs and sniffles, and last week I took my turn. I went to bed one night with a stuffy nose, and it got me thinking about software.
What’s the connection between sniffles and software, you ask?
Let’s talk redundancy. It’s a familiar technique in software design, but I believe we compartmentalize it too much under the special topic of “high availability”–as if redundancy deserves our attention only when high availability is an explicit requirement.
Redundancy in nature
Mother Nature’s use of redundancy is so pervasive that we may not even realize it’s at work. We could learn a thing or two from how she weaves it–seamlessly, consistently, tenaciously–into the tapestry of life.
Redundancy had everything to do with the fact that I didn’t asphyxiate as I slept with a cold. People have more than one sinus, so sleeping with a few of them plugged up isn’t life-threatening. If nose breathing isn’t an option, we can always open our mouths. We have two lungs, not one–and each consists of huge numbers of alveoli, each doing part of the work of exchanging oxygen and carbon dioxide. Even when one lung is damaged and the other is clogged, we keep on oxygenating our blood. Amazing, when you think about it.
One example of redundancy in software
It’s not hard to find examples of redundancy in software. Consider streaming video on demand. You are probably well aware of the redundancy involved in TCP’s reliable and ordered delivery guarantees. You can imagine the redundancy in Netflix’s data centers, and maybe you know all about CDNs and adaptive routing on the internet. But giving you a good viewing experience goes well beyond that. The video you select is often encoded at multiple quality levels–high, medium, and low–and your browser selects among them based on auto-detected bandwidth constraints. There’s a fallback mechanism to gracefully degrade. The browser downloads video in chunks and tries to stay ahead of the point of playback, which means it manages both a live and a cached source of data. Codecs are designed to interpolate missed frames if bandwidth gets scarce. Audio is separable from video, so video frames can be dropped while preserving continuous sound. If you’re watching on your iPad, the device may have both a Wi-Fi and a cellular network it can use to stream content. The CPU that renders video can offload chunks of work to a GPU when that’s more efficient–but it can also decompress and rasterize and texturize and all the rest of it, all on its own. The list goes on and on.
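The adaptive-quality part of that pipeline can be sketched in a few lines. This is a minimal illustration, not any real player’s logic–the quality names and bandwidth thresholds here are made up:

```python
# A toy quality ladder: (name, minimum sustainable kbps).
# These numbers are illustrative assumptions, not real player settings.
QUALITY_LADDER = [
    ("high", 5000),
    ("medium", 2500),
    ("low", 800),
]

def pick_quality(measured_kbps):
    """Choose the best quality the measured bandwidth can sustain,
    stepping down the ladder rather than failing outright."""
    for name, min_kbps in QUALITY_LADDER:
        if measured_kbps >= min_kbps:
            return name
    # Last-resort fallback: degrade to the lowest rung, don't stop playback.
    return "low"
```

The point isn’t the thresholds; it’s that every rung of the ladder is a redundant alternative to the one above it.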
Not always good
Of course, not all redundancy is useful. You can get carried away with it :-)
In coding tasks, redundancy is often your enemy. Lots of antipatterns are undesirable precisely because they create redundancy that’s difficult to understand and maintain. Foolish coding standards and dumb comments are notorious for creating busywork this way.
Food for thought
With caveats acknowledged, here are a few ways that redundancy might be under-represented in our design and coding:
- Do we create user experiences that convey the same information in more than one way?
For example, do we highlight selected items by color, size, and a visual “wiggle,” instead of only by drawing a rectangle? When we finish an operation, do we reset the focus AND play a sound AND display a message? This is best practice for busy users who pay imperfect attention. It’s also a big help for folks who are color blind or visually impaired, which is why it’s a requirement for Section 508 compliance.
- Do we have multiple ways to get someone’s attention when something goes wrong?
It’s all well and good to log errors–but what if the error we’re logging is that we just ran out of disk space on the volume where the log file lives? Whoops…
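One cheap form of redundancy here is a second logging channel. A minimal sketch, with a hypothetical log path: try the file first, and fall back to stderr so the message isn’t silently lost when the file write itself fails:

```python
import sys

def log_error(message, logfile_path="/var/log/app.log"):
    """Write an error to the log file; if that fails (disk full,
    missing volume, permissions), fall back to stderr."""
    try:
        with open(logfile_path, "a") as f:
            f.write(message + "\n")
    except OSError:
        # The redundant channel: stderr almost always still works.
        print("LOG FALLBACK: " + message, file=sys.stderr)
```

A real system might fall back to syslog or a network collector instead; the shape is the same.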
- Do our architectures distribute responsibilities, instead of assuming a single, centralized point of failure?
This is more than just recognizing the dangers in having one box on a diagram labeled “manager” or “database” or “authentication master”, and compensating in “HA mode” by having a backup. It means thinking about network pipes and protocols and firewalls with a redundancy mindset. It means figuring out how clients can be temporarily autonomous to tolerate hiccups, how to use caching intelligently, and so forth.
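The “temporarily autonomous client” idea can be sketched as a cache that serves the last known-good value when the backend is unreachable. A rough illustration–`fetch` and the TTL policy are placeholders, not a prescription:

```python
import time

class FallbackCache:
    """Serve fresh data when the backend answers; serve the last
    known-good value when it doesn't, so clients can ride out hiccups."""

    def __init__(self, fetch, ttl=30.0):
        self.fetch = fetch  # callable that talks to the backend
        self.ttl = ttl
        self.value = None
        self.stamp = 0.0

    def get(self):
        if self.value is not None and time.time() - self.stamp < self.ttl:
            return self.value  # fresh enough
        try:
            self.value = self.fetch()
            self.stamp = time.time()
        except Exception:
            if self.value is None:
                raise  # no fallback available yet
            # Stale, but better than an outage from the user's perspective.
        return self.value
```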
- Do we sanity check in enough places?
Of course we need to sanitize user input to avoid SQL injection. We need transactions and CRCs. But do we build byte order detection and version stamp checks and sizeof and alignment tests in the right places? Are we using design-by-contract to prove that consumers and creators of a chunk of code share the same mental models? Do we catch exceptions or test for errors in enough places?
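As one illustration of checks in the right places, here’s a sketch of validating a hypothetical binary header before trusting it–the magic number (which also catches byte-order mix-ups), a version stamp, and a size-consistency test are all redundant with “correct” input, and that’s exactly the point:

```python
import struct

# Hypothetical file-format constants, for illustration only.
MAGIC = 0x1A2B3C4D
VERSION = 3

def read_header(data):
    """Sanity-check a little-endian header: magic, version, payload size."""
    if len(data) < 12:
        raise ValueError("header truncated")
    magic, version, payload_len = struct.unpack("<IIi", data[:12])
    if magic != MAGIC:
        # A byte-swapped magic also fails here, catching endianness bugs.
        raise ValueError("bad magic: wrong file or wrong byte order")
    if version != VERSION:
        raise ValueError("unsupported version %d" % version)
    if payload_len < 0 or payload_len > len(data) - 12:
        raise ValueError("payload length inconsistent with data size")
    return version, data[12:12 + payload_len]
```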
- Do we avoid depending on a single person to make our system work or to troubleshoot?
If we say, “the admin will solve that problem if the user runs into it,” we’re probably letting ourselves off too easy. The same goes for “the admin will call tech support.”
- Do we carefully plan, code, test, and document for graceful degradation?
Less-than-ideal paths through code shouldn’t be an afterthought. Even if they’re less likely on average, the chances that their behavior will matter to our users sooner or later are virtually 100%. We need circuit breakers, helpful error messages, documentation about what to do when things aren’t perfect, APIs that protect themselves from unwise callers, retries, timeouts, and alternate solutions.
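Retries with a fallback–one of the simplest graceful-degradation tools in that list–might look like this sketch (the attempt count and delay are arbitrary):

```python
import time

def call_with_retries(primary, fallback, attempts=3, delay=0.1):
    """Try the primary operation a few times; if it keeps failing,
    degrade gracefully to the fallback instead of giving up."""
    last_error = None
    for _ in range(attempts):
        try:
            return primary()
        except Exception as e:
            last_error = e
            time.sleep(delay)
    # All retries exhausted: hand the last error to the fallback,
    # which might return cached data or a reduced-fidelity answer.
    return fallback(last_error)
```

Production versions add exponential backoff and a circuit breaker so a struggling dependency isn’t hammered; the redundancy principle is the same.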
Redundancy has a cost. We need to use it judiciously; good code balances many considerations. Nonetheless, I think pondering the issues above is likely to improve the robustness and aptness of our software in many cases. Our users will love us for it.
Now go out and add some extra redundancy again. And don’t forget to not neglect to build in some secondary alternate failsafes, while you’re at it. :-)