I just finished the nastiest debugging experience of my career: nearly 3 weeks on a single bug. After days and days of staring at code, swearing at core dumps, tailing logs, connecting to gdbserver via a multi-hop ssh tunnel from inside a secure environment, and general programmer misery, I felt like doing cartwheels when I finally figured it out, tweaked a few lines of code, and observed stability again.
Hindsight teaches me this lesson: undocumented, unhandled constraints waste enormous amounts of time and energy. If you’re interested in writing good code, you must know your limits, and you must communicate them. This especially matters when the constraints are obscure or surprising.
My bug seemed simple enough at first blush: I was seeing crashes in a daemon that uses machine learning to classify internet traffic. I’d just rewritten the app’s DNS subsystem to increase throughput, and the crashes often happened in my new event-processing loop. The loop fired off callbacks to service sockets as they became unblocked; I assumed I had some kind of off-by-one error in my sockets array, or maybe I wasn’t cleaning up sockets consistently.
No such luck.
The more I dug, the deeper the mystery grew. Sometimes I got core dumps, sometimes I didn’t. Sometimes I got a `SIGSEGV`, sometimes not. Sometimes my core dumps were usable, sometimes they were corrupted. The values displayed in the debugger when I analyzed a core dump didn’t make sense. The exact site of the crash seemed to vary. I began to suspect that a heap or stack corruption at an unrelated, innocent-looking, “stable” place in the code might be crashing DNS because it’s so heavily threaded and high-performance.
I ran valgrind. I ran a specially instrumented Clang address sanitizer build. They yielded lots of noise, but no consistent smoking gun.
After a few days of semi-informed guesswork, I finally got methodical. I started tracking my theories and disproving each of them with a series of experiments. This revealed layers of undocumented, lurking problems that cloaked the real issue. Each is a place where an engineer failed to communicate or plan around a constraint:
- I suspected that the BIND server I was calling to do DNS resolution might be experiencing brown-outs, and that the resulting backlog of pending DNS requests might be a trigger for my crash. I had to dig around for a while to see what kind of timeouts DNS used, because most DNS documentation doesn’t consider this to be worth mentioning! Wanna guess? 5 seconds. And you have to multiply that by 2 or 3, since DNS is usually UDP-based and uses retries to compensate. Since my daemon needed to respond within a couple seconds, and since it gets tens to hundreds of queries per second, having a 15-second worst-case DNS timeout was a real problem… Fortunately, the DNS library I was using, c-ares, lets you customize both the timeouts and the retries, so I worked past this.
- Since I wasn’t always getting a core dump, I wondered whether the signal causing the abend might be inconsistent. It turned out that the daemon installed handlers for a few signals, but not all, and it didn’t report which signal triggered the crash. Getting disciplined about signal handling eliminated the confusion caused at this layer.
- The app had configurable limits on max sockets and max threads, but it didn’t seem to follow them reliably; it took me a while to realize that it was ignoring interactions with the hard and soft resource limits imposed by the OS (see the `getrlimit()` function and the `ulimit` shell command).
- The app used the `select()` POSIX API to discover which sockets became ready or timed out. But it turns out that `select()` blows up if you are monitoring more than `FD_SETSIZE` (=1024) sockets, although none of the example usages showed that, and the man page for `select()` only mentions it in a cryptic aside. I had to stumble upon this particular wisdom via google and stackoverflow. (Even after I capped open sockets safely below this limit, I had a performance problem because `select()` scans linearly across as many sockets as you have open; I need to switch to a more scalable interface such as `epoll()`.)
- The c-ares library is all about asynchronous resolution with lots of overlapping, parallelized I/O on many sockets. It’s great code, and I appreciate the generosity that made it open-source. It provides a `channel` construct to manage many pending requests, and users such as the curl dev team cite examples of thousands of pending requests. However, I learned after much pain that 16 pending requests per channel is what c-ares was designed for, and one request per channel is more standard; you can scale super-high, but only by using many channels. It took an email exchange with the author to discover this informal side agreement in the API.
- During spikes in traffic to the app, if the open socket count exceeded the OS’s soft limit but not the limit in the config file, it was possible to get into a state where we couldn’t bind an incoming socket to return an http `503 Service Unavailable` status to the client. This made it difficult to decide whether a traffic spike was a trigger for our problem, since we were identifying spikes by looking for 503 errors. I fixed this problem by reserving some sockets for returning errors.
- To make matters worse, when a socket failed to bind for this reason, some very low-level code (possibly in the kernel’s underlying TCP/IP stack) issued a retry… over and over again. I was seeing logs flood with over 200k retry attempts per second in this unusual state. I had to fix this by sensing the pain of a retry storm and tripping a circuit breaker to keep the logs moderately quiet.
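The `FD_SETSIZE` trap above is easy to make explicit instead of silent. Here is a minimal sketch (the `watch_fd` helper and its bookkeeping are mine, not from the daemon) that refuses to register a descriptor `select()` cannot legally monitor, turning undefined behavior into a reported error:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/select.h>

/* FD_SET() on a descriptor >= FD_SETSIZE writes past the end of the
 * fd_set bitmask -- undefined behavior that no error code warns about.
 * Guard every registration so the limit becomes a visible failure. */
static int watch_fd(fd_set *set, int fd, int *maxfd)
{
    if (fd < 0 || fd >= FD_SETSIZE) {
        fprintf(stderr, "fd %d exceeds FD_SETSIZE (%d); refusing to watch\n",
                fd, FD_SETSIZE);
        return -1;
    }
    FD_SET(fd, set);
    if (fd > *maxfd)
        *maxfd = fd;   /* select() needs maxfd + 1 as its first argument */
    return 0;
}
```

Capping the app’s configured socket limit safely below `FD_SETSIZE` (minus a reserve for error responses, per the 503 fix above) keeps the guard from ever firing in normal operation.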
The rest of the story
Eventually I waded through all these layers of undocumented, improperly handled constraints on program behavior, and my app was still crashing, still unpredictable. In desperation, I downloaded the source for c-ares, built it with debug symbols, captured a stack trace, and analyzed the code path. I discovered to my surprise that nowhere in the call graph of `ares_gethostbyname()` was the `channel` parameter protected by a mutex. This triggered some more targeted googling than I had previously managed, and I found an email thread from 2008 where the author of the library clarifies the undocumented mutexing rules. A light went on; 5 minutes later I had fixed the bug.
I feel a little silly; the library code is solid and reasonable, and the need for this particular mutex should have occurred to me long before. But in my defense, the library is all about parallelism, the channel object that needed mutexing is intended to support many simultaneous requests, and the docs are utterly silent about the topic. So are all the samples I found; not one shows a mutex. (That’s because all the samples are asynchronous but single-threaded; I should have been more careful as I extrapolated to my slightly different use case…)
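The fix itself is the classic one-mutex pattern: every thread that touches the shared channel must hold the same lock. Here is a sketch of the shape of it, with a stand-in struct in place of a real c-ares channel (the `shared_channel` and `submit_request` names are mine; in the daemon the guarded call was `ares_gethostbyname()`):

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Stand-in for a shared c-ares channel: any object that several
 * threads submit requests through but that is not internally locked. */
typedef struct {
    pthread_mutex_t lock;
    long pending;   /* stand-in for the channel's internal request state */
} shared_channel;

/* All reads and writes of channel state go through the same mutex.
 * Without the lock, concurrent callers race on the channel's internals
 * and corrupt memory -- exactly the crash described above. */
static void submit_request(shared_channel *ch)
{
    pthread_mutex_lock(&ch->lock);
    ch->pending++;
    pthread_mutex_unlock(&ch->lock);
}

static void *worker(void *arg)
{
    for (int i = 0; i < 10000; i++)
        submit_request((shared_channel *)arg);
    return NULL;
}
```

The point is not the counter; it is that the lock lives beside the shared object, so no call path can reach the channel without going through it.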
Anyway, I’m a sadder but wiser programmer now. :-) I need to add marks such as `+reasonable upper bound` and `+potential gotcha` to the `intent` programming language I’m creating, so that this type of issue is easy to formally communicate. But even without fancy new languages, the programming ecosystems that we all work in today would benefit from more attention to understanding, handling, and communicating limits.
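Even in plain C, a constraint can be communicated by encoding it at the point where it can fail, rather than burying it in a mailing-list thread. A sketch, with illustrative names and the 16-per-channel limit from the c-ares story above standing in for a documented design bound:

```c
#include <assert.h>

/* Communicate the limit where it bites: the 16-pending-requests-per-
 * channel design center from the c-ares anecdote.  The macro name and
 * enqueue_lookup() are illustrative, not from any real API. */
#define MAX_PENDING_PER_CHANNEL 16

static int enqueue_lookup(int pending_now)
{
    /* potential gotcha: beyond the design limit, behavior degrades,
     * then breaks -- fail loudly here instead of mysteriously later */
    assert(pending_now < MAX_PENDING_PER_CHANNEL);
    return pending_now + 1;
}
```

An assert is a blunt instrument compared to a language-level annotation, but it turns an undocumented constraint into one that announces itself on the first violation.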