Small Files Are Your Friends

Yesterday I was discussing refactoring priorities with a colleague who’s a brilliant engineer, and I happened to mention my strong desire for smaller files in our codebase. I told him that I thought .h and .cpp (or .py or .java or .whatever) files with thousands of lines were a problem.

He asked me why.

He told me that he wasn’t opposed to the idea, but he always felt like it was more of a stylistic choice than a true imperative for good code. And he was curious to see if I could convince him differently.

After I pondered his question for a while, I realized that some of my opinion really is traceable to prejudice. I usually use IDEs instead of vim/emacs, and I think that promotes click-back-and-forth-and-hyperlink-in-many-little-files instead of open-a-big-file-and-scroll. My compatriots that are more console-centric are just as smart and effective–maybe more. So I’ll write that part off.

However, I also found some arguments for the small-file principle that feel more substantive. Small files are your friends.

More small friends. Photo credit: miguelandresen (Flickr)

Named scopes and cognitive complexity

The case for small functions is discussed more often than the case for small files, and it has been made by almost every luminary in computer science. My colleague immediately conceded it, and I won’t repeat it here–but I will claim that many of the same arguments apply to files, because files, like functions, are an important named scope in software development. This in turn suggests some cognitive-complexity constraints on files.

Studies of memory and human attention consistently demonstrate that we think best about small sets. This fact is reflected by the amount of detail visible within any given named scope, both in programming and in other thought tasks. How many top-level menus in the average application? Colors in most cultures’ divisions of the rainbow? Parameters in an easy-to-understand function? Sections in the average book store? Steps in easy-to-follow driving directions? (There’s a whole field called cognitive ergonomics that explores why these questions always have similar answers.)

How many functions should we put in a reasonable file?

For me, 2 or 5 or 10 feels tractable. 50 feels excessive.

If a “good function” also respects the cognitive complexity constraints of the human brain–not being too big to read in a screen or two, for example–then you end up with a reasonable upper bound on file size of maybe 500 or 1000 lines. (See Steve Yegge’s insightful rant about code size being an engineer’s worst enemy. He focuses on codebase size, but much of what he says applies just as well at the next level down.)

I suppose that this argument is weakened by the features of some IDEs, which collapse tangential code blocks, display treeviews of functions, and support lots of hypertext-style navigation. But not all programmers use the same IDEs, and not all interactions with code are IDE-driven; file size remains relevant. There’s a reason why C# introduced partial classes to improve on Java’s lump-it-all-in-a-single-file constraint…

When humans try to remember more than their brains can fit, stuff falls out. Big files mean that coders have to mentally model relationships between stuff that’s separated by way too much screen real estate. This is a recipe for bugs. It is also a serious impediment to learnability.

Loose coupling and encapsulation

Files are a natural unit of coupling. In most programming languages, you can declare a construct (a variable, an internal function, or class) within a file, and have that construct be invisible to the outside world. This means there is a built-in temptation for functions and classes to bind more tightly when they’re in the same file, because they have access to common but private knowledge. By breaking large files apart, you remove the temptation, break unnecessary dependencies, and promote looser coupling.

Another way to say this is that file boundaries are an encapsulation barrier. Use them to hide data. (See my recent post about encapsulation as a simplicity strategy.)
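To make this concrete, here is a minimal C++ sketch (file and function names are hypothetical, not from any real codebase). A helper declared in an anonymous namespace is visible only inside its own .cpp file, so nothing outside that file can quietly grow a dependency on it:

route_scoring.cpp

#include <vector>

namespace {
    // File-private helper: other translation units cannot call this,
    // so no hidden coupling can form around it.
    double congestion_penalty(double load_factor) {
        return load_factor * load_factor;
    }
}

// The one deliberately public entry point for this file.
double score_route(const std::vector<double>& segment_loads) {
    double score = 0.0;
    for (double load : segment_loads)
        score += congestion_penalty(load);
    return score;
}

Split an oversized file in two, and helpers like this either stay private or have to earn a real, visible interface–either way, the coupling becomes explicit.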

Code reuse and testability

A consequence of files hiding data is that when a function that might be useful in a dozen different modules is buried in a large file, surrounded by dependencies extraneous to that function, reuse and testability are both frustrated. If the function lives in a file of its own, it’s more discoverable, and it’s reusable and testable without extra baggage.
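A tiny illustration (names hypothetical): a helper in its own header has nothing extraneous attached, so any module–or any unit test–can include just this one file.

weight_check.h

#pragma once

// Self-contained: no routing, bridge, or UI dependencies come along
// for the ride when a caller or a test includes this header.
inline bool exceeds_limit(double vehicle_weight, double max_weight) {
    return vehicle_weight > max_weight;
}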

Link optimization

A C/C++ corollary to the file boundary issue has to do with linkers and binary sizes. In many cases, linkers remove unused functions at the compilation-unit level, rather than at the level of individual functions. A .c or .cpp file is either in or out, as a unit. This means that if you have a .cpp file with 50 functions in it, and you call only 1 of them, all 50 get linked into the final binary. The result is bloated binaries. So: smaller .cpp files ==> smaller binaries. (Before you flame me about linker optimizations, I will admit that some linkers get more granular, depending on which switches you use. But it’s surprising how hard it is to do better than what I’ve described. Experiment and comment with your results.)
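For anyone who wants to run that experiment, here is a hedged sketch of the kind of setup I mean (file names hypothetical; exact behavior varies by toolchain and flags):

math_utils.cpp

// Two unrelated helpers that happen to share a translation unit.
int add(int a, int b) { return a + b; }
int triple(int n) { return 3 * n; }

// Even if main.cpp calls only add(), a traditional linker pulls in
// math_utils.o as a unit, so triple() ships in the binary too.
// With GCC or Clang you can often recover per-function granularity:
//   g++ -ffunction-sections -fdata-sections -c main.cpp math_utils.cpp
//   g++ -Wl,--gc-sections main.o math_utils.o -o app
// Compare binary sizes with and without those flags to see the difference.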

Counter Argument

I suppose you could argue that by making lots of small files, you’re creating more complexity in directories, in makefiles or projects, and so forth. Is 250 files in a folder worse than 15? Doesn’t that violate the “cognitive complexity” guideline above?

My comeback is: use packages or subdirectories or libraries (another level of management). You can’t subdivide forever, but you don’t need to.

The bottom line for me is experiential, not theoretical. I nearly always have cruddy experiences in codebases where large files are common. Small files don’t guarantee pleasant and productive work, but big ones seem to go hand-in-hand with other problems. I find it telling that codebases with big files are also codebases where people lament the lack of comments the most, for example. Over the years, I’ve become convinced that a simple rule of thumb about keeping files small will pay off more handsomely than almost any other coding best practice.

Action Item

Leave a comment to tell me what you think. Am I making a mountain out of a molehill? Or do you feel strongly about small file sizes as well? Have I omitted any important pros and cons from the discussion?

The Scaling Fallacy

If X works for 1 ___ [minute | user | computer | customer | …], then 100X ought to work for 100, right? And 1000X for 1000?

Sorry, Charlie. No dice.

One of my favorite books, Universal Principles of Design, includes a fascinating discussion of our tendency to succumb to scaling fallacies. The book makes its case using the strength of ants and winged flight as examples.

Have you ever heard that an ant can lift many times its own weight–and that if one were the size of a human, it could hoist a car over its head with ease? The first part of that assertion is true, but the conclusion folks draw is completely bogus. Exoskeletons cease to be a viable structure on which to anchor muscle and tissue at sizes much smaller than your average grown-up; the strength-to-weight ratio just isn’t good enough. Chitin is only about as tough as fingernails.

Tough little bugger — but not an Olympic champion at human scale. Image credit: D.A.Otee (Flickr)

I’d long understood the flaws in the big-ant-lifting-cars idea, but the flight example from the book was virgin territory for me.

Humans are familiar with birds and insects that fly. We know they have wings that beat the air. We naively assume that at much larger and much smaller scales, the same principles apply. But it turns out…

How Enums Spread Disease — And How To Cure It

Poorly handled enums can infect code with fragility and tight coupling like a digital Typhoid Mary.

Say you’re writing software that optimizes traffic flow patterns, and you need to model different vehicle types. So you code up something like this:

vehicle_type.h

enum VehicleType {
    eVTCar,
    eVTMotorcycle,
    eVTTruck,
    eVTSemi,
};

Then you press your enum into service:

route.cpp

if (vehicle.vt == eVTSemi || vehicle.vt == eVTTruck) {
    // These vehicle types sometimes have unusual weight, so we 
    // have to test whether they can use old bridges...
    if (vehicle.getWeight() > bridge.getMaxWeight()) {

Quickly your enum becomes handy in lots of other places as well:

if (vehicle.vt == eVTMotorcycle) {
    // vehicle is particularly sensitive to slippery roads

And…

switch (vehicle.vt) {
case eVTTruck:
case eVTSemi:
    // can't use high-occupancy/fuel-efficient lane
    break;
case eVTMotorcycle:
    // can always use high-occupancy/fuel-efficient lane
    break;
default:
    // can only use lane during off-peak hours
    break;
}

Diagnosis

The infection from your enum is already coursing through the bloodstream at this point. Do you recognize the warning signs?

  • Knowledge about the semantics of each member of the enum is spread throughout the code (one common way to centralize it is sketched after this list).
  • The members of the enum are incomplete. How will we account for cranes and bulldozers and tractors and vans?
  • The semantics are unsatisfying. We’re saying cars are never gas guzzlers or gas savers; what about massive old steel-framed jalopies and tiny new hybrids?
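One common remedy for that first symptom–not necessarily the cure this post goes on to prescribe–is to centralize the per-member knowledge behind a small class, so callers ask questions instead of testing raw enum values. A rough sketch, with hypothetical names:

vehicle_info.h

class VehicleInfo {
public:
    VehicleInfo(bool hov_allowed, bool weight_restricted)
        : hov_allowed_(hov_allowed), weight_restricted_(weight_restricted) {}

    // Callers query behavior instead of switching on an enum member.
    bool canUseHovLane() const { return hov_allowed_; }
    bool needsBridgeWeightCheck() const { return weight_restricted_; }

private:
    bool hov_allowed_;
    bool weight_restricted_;
};

// route.cpp then shrinks to something like:
//   if (vehicle.info().needsBridgeWeightCheck() &&
//       vehicle.getWeight() > bridge.getMaxWeight()) { ... }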

A vehicle that challenges our tidy enum. Photo credit: Manila Imperial Motor Sales (Flickr)

The infection amplifies when we want to represent the enum in a config file or a UI. Now we need to…

Progressive Disclosure Everywhere

If you google “progressive disclosure,” you’ll get hits that describe the phrase as an interaction design technique. UI folks have long recognized that it’s better to show a simple set of options, and allow users to drill into greater detail only when they need it. (Thanks to James Russell–a brilliant UI designer–for teaching me PD years ago.)

But calling progressive disclosure a “technique” is, I think, a serious understatement. Progressive disclosure aligns with a profound cognitive principle, and its use is (and should be) pervasive, if you have eyes to see.

The Principle

Here’s my best attempt to distill the operative rule behind progressive disclosure:

Focus on essence. Elaborate on demand.

In other words, begin by addressing fundamentals without cluttering detail. When more detail is needed, find the next appropriate state, and move there. Repeat as appropriate.

Stated that way, perhaps you’ll see the pattern of progressive disclosure in lots of unexpected places. I’ve listed a few that occur to me…

Manifestations

The scientific method is an iterative process in which hypotheses gradually align to increasingly detailed observation. We learn by progressive disclosure.

Good conversationalists don’t gush forever on a topic. They throw out an observation or a tidbit, and wait to see if others are interested. If yes, they offer more info.

The development of a complex organism from a one-celled zygote, through differentiation and all subsequent phases, into adulthood, could be considered a progressive disclosure of the patterns embedded in its DNA. The recursive incorporation of the golden mean in many morphologies is another tie to biology.

A nautilus grows–progressively discloses–the protective covering it requires over its lifespan. The golden mean, repeated and repeated… Photo credit: Wikimedia Commons.

In journalism, the inverted pyramid approach to storytelling is a form of progressive disclosure. So are headlines.

Depending on how you’re reading this post, you might see a “Read more…” link that I’ve inserted right after this paragraph. Making below-the-fold reading optional is progressive disclosure at work. TLDR…


Decoupling Interfaces as Versions Evolve, Part 3

This is part 3 of a series. You can read part 1 and part 2 as well.

Quick Review

We want all the encapsulation and data hiding benefits that interfaces provide. We want to be able to version our interfaces so consumers can depend on them reliably, but we don’t want the producer and consumer of an interface to have to coordinate tightly. We don’t want the producer of an interface to have to version so often that there’s a built-in disincentive to follow best practice. And we want all the compiler and IDE benefits that early binding typically offers to a programmer.

I claim that no current solution really provides all of this — not COM, not SOAP-based web services, not late-bound REST web services.

Fear not.

Summary of Solution

  1. The provider of an interface and the consumer of an interface each conform to a compiler-enforceable contract (.wsdl/.idl/etc.), but unlike the traditional approach, these contracts are allowed to differ.
  2. The test of whether the two interfaces are compatible is not done by traditional casting, but by testing the contents of the two sides for semantic compatibility – a consumer’s interface is compatible if it is a semantic subset of the provider’s.
  3. The consumer is required to write wrapper classes that forward from its own interface to that of the provider. (Using a language that supports reflection, like Java or C#, makes this task trivial.)

Alternative Approach

The Gory Details

This solution could be built on top of COM, RPC-over-SOAP-style web services, or a RESTful service interface more analogous to document-oriented web services. Other environments such as CORBA/EJB may also be candidates, though I am less familiar with the details there.

Most SOAP communication pipelines get a remote object and deserialize it to a tightly bound object type in a single step, using a type cast as a runtime check that the remote source meets the calling code’s expectations. Such code would have to change: the remote object is fetched and deserialized in an initial step, and the standard cast is then replaced with a function that builds a wrapper object implementing the local interface if compatibility tests pass.

TryCast Pseudocode

In COM code, the analogous initial step must return an IUnknown; the second step consists of composing the semantic union of all interfaces the IUnknown supports, and then using that überinterface as the basis for compatibility testing. Since IUnknown does not support enumeration, the semantic union of all interfaces in an IUnknown would require a list of possible IIDs to perform a series of QueryInterface calls, or a low-level analysis of the object’s vtable.
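As a rough illustration of that probing step (a sketch only; the helper name and the candidate-IID list are hypothetical, and error handling is minimal):

iid_probe.cpp

#include <windows.h>
#include <unknwn.h>
#include <vector>

// Probe an IUnknown against a list of candidate IIDs to discover which
// interfaces it actually supports. IUnknown offers no enumeration, so the
// candidate list has to come from somewhere else (a registry, a config, ...).
std::vector<IID> SupportedInterfaces(IUnknown* obj,
                                     const std::vector<IID>& candidates) {
    std::vector<IID> found;
    for (const IID& iid : candidates) {
        void* raw = nullptr;
        if (SUCCEEDED(obj->QueryInterface(iid, &raw)) && raw) {
            found.push_back(iid);
            static_cast<IUnknown*>(raw)->Release();  // balance the implicit AddRef
        }
    }
    return found;
}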

In a RESTful document-oriented web service, a URL returns an XML document that describes an arbitrary object using structural elements that do not vary across returned object types. For example, instead of

<book><title>Dragon’s Egg</title><author><fname>Stephen</fname><lname>King</lname></author></book>

you have

<doc><prop name="title" type="string">Dragon’s Egg</prop><prop name="author">Stephen King</prop></doc>

or something similar. This conveys the object’s semantic constraints along with its data, much like sending a table definition along with a tuple in response to a DB query. The initial step of deserialization constructs a generic object; the second step tests compatibility against the semantic constraints embedded directly in the document and constructs an instance of a wrapper class on success.
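Since the TryCast pseudocode itself isn’t reproduced above, here is a minimal C++ sketch of the two-step idea–deserialize into a generic object, then test semantic compatibility and build a forwarding wrapper. All type and function names are hypothetical:

trycast_sketch.cpp

#include <map>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// The deserialized payload carries its property names and values (its
// semantic constraints) along with the data, as described above.
struct GenericObject {
    std::map<std::string, std::string> props;
};

// Semantic-subset test: the consumer is compatible if every property its
// contract requires is present in what the provider actually sent.
bool IsSemanticSubset(const std::vector<std::string>& required,
                      const GenericObject& remote) {
    for (const auto& name : required)
        if (remote.props.find(name) == remote.props.end())
            return false;
    return true;
}

// Consumer-side wrapper: forwards from the consumer's own interface to the
// provider's data instead of binding to a generated stub type.
class LocalBook {
public:
    explicit LocalBook(std::shared_ptr<const GenericObject> remote)
        : remote_(std::move(remote)) {}
    std::string title() const  { return remote_->props.at("title"); }
    std::string author() const { return remote_->props.at("author"); }
private:
    std::shared_ptr<const GenericObject> remote_;
};

// Replaces the traditional cast: wrap on success, refuse on mismatch.
std::optional<LocalBook> TryCast(std::shared_ptr<const GenericObject> remote) {
    if (!IsSemanticSubset({"title", "author"}, *remote))
        return std::nullopt;
    return LocalBook(std::move(remote));
}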

It’s important to distinguish between read-only and read-write usage patterns in this mechanism. Consumers of an interface that only intend to display data are infinitely backward compatible if the runtime check for semantic compatibility passes, regardless of the version numbers/guids in play under a given scenario, because the wrapper classes depend on an interface mapping that’s generated dynamically at runtime. However, if a consumer of an object wants to update its state at the source, the wrapper class must contain every property that the provider will require – or else the provider must set such properties either before serving the object or when the update is requested. Using wrapper classes rather than the traditional generated SOAP stubs is an important element of this mechanism because this allows mods to objects that a client does not fully understand.

New Approach - Pros and Cons