A while back, I wrote a post on why software should feel pain. Since then, I’ve had that lesson reinforced in my mind, and I’ve also understood some nuances that weren’t obvious to me before, so I’m revisiting the topic.
What brought this topic back to my mind was a root cause analysis I did to diagnose a recent system failure. I’ll spare you the gory details, but here’s what happened in a nutshell: a daemon got bad data files and began behaving strangely as a result. The replication process for its data files had been impaired because the app producing the data files finished much later than normal. That app in turn was impacted by anomalous network brownouts which began with a partly damaged network cable.
The Obvious but Naive Lesson
The final step in my root cause analysis was to make recommendations, and I was quick to offer some: the daemon should double-check the integrity of its data file; the originating app should monitor its timing and complain about anomalies.
The more I thought about it, however, the more unhappy I became. Surely, such monitoring is a good idea. So why did I not believe my recommendations would really make things better?