I’ve been focusing on esoteric features of language design for a while. I thought it might be nice to take a detour and explore something eminently practical and easy to explain, for a change.
Let’s talk data and tables.
I don’t mean databases–relational or otherwise; I’m talking about tables of data in source code itself. Sooner or later, every coder uses them. We build jump tables, tables of unicode character attributes, tables of time zone properties, tables of html entities, tables of multipliers to use in hash functions, tables that map zip codes to states, tables of dispatch targets, tables that tell us the internet domain-name suffix for a particular country name…
Depending on the language you’re using and the nature of your data, you might code such tables using arrays, structs, enums, dictionaries, hash maps, and so forth.
I think this is a mediocre solution, at best. Shouldn’t programmers work on funner stuff, like “traveling salesman” problems? :-)
What’s wrong with how we code tables?
What’s wrong? In a word, the fact that we have to code them at all–that’s what.
If all the tables in our code had half a dozen items, the awkwardness of codifying them might not matter so much. But tabular data in code is often large, awkward, and difficult to format correctly. We rarely receive it expressed in a syntax that makes lexers happy.
I recently spent time writing a python script to reformat a 100 KLOC .cpp file that contained generated unicode ngram definitions represented as strings. The strings included bytes > 0x80 and < 0x20, and the compiler was refusing to process the file because it could tell the “source code” wasn’t ASCII.
I’m not sure how many hours I’ve spent doing regex search/replace to put quotes around strings that I copied/pasted from a table on a web page somewhere–but the tally is large. I’ve fiddled with smart quotes, tweaked project defs to declare code pages for my source, line-wrapped by hand across hundreds of lines of data, debugged missing escape sequences due to embedded backslashes, added and subtracted commas and curly braces and line continuations, and all kinds of similar fiddle work.
Coders have better things to do.
Notice that in most cases, data tables like these represent knowledge whose primary home isn’t really code, and whose true owner isn’t an engineer. Coding such data is therefore a setup for communication problems and busy work.
On another recent project, I needed a table to correlate ISO 3166 country codes, country names as stored in a whois DB, and telephone dialing prefixes. The providers of whois data helpfully offered a pipe-delimited text file on their web site that showed how their country names mapped to ISO 3166 codes, and a little googling gave me an HTML table that mapped those codes to dialing prefixes.
I knitted these two data sources together and built a .h that declared an array of structs to do my mapping. Easy. But I don’t own the data. Because it is “foreign” in the new code home I’ve built for it, I have some lingering problems. For one, what do I do about changes? If Syria fragments or Crimea is no longer a part of Ukraine, I will have bugs in my table, and I will have to hand-edit to fix them once I diagnose the problem. That diagnosis might never happen, since the owner of the whois data is unlikely to email me about it. Likewise, if phone companies decide that Antigua and Barbuda needs a new dialing prefix, how will I find out? Nobody is guaranteeing that the country-code-to-dialing-prefix table I found on the internet is up-to-date (or complete, or even accurate)–except me.
What would be better?
The world already has very mature ways to deal with tabular data. They’re called spreadsheets and databases. Imagine the master version of some of the types of data I’ve mentioned, and I suspect you’ll be imagining one or the other of these tools as part of the context. Don’t you think the definitive master lists of mappings between cities and postal codes live in postal service databases somewhere? Or that the guaranteed-accurate-and-up-to-date enumeration of stock ticker symbols lives in a spreadsheet at the NYSE or the FTSE?
What programming languages ought to do is allow coders to import data from their definitive sources–or at least from a small handful of exchange formats like CSV and XML–with no intermediate hand coding. In other words, I want what I’ll call direct compilation of data from native formats. If I create a currency-exchange app that needs a currency conversion table, what I want is to write code like this:
table currency_info:
    columns: id(enum), symbol(str), name(str), exchange_rate(float)
    rows: Attach("latest_currency_info.csv")
If a compiler supported such code, it might read the attached .csv file, parse it using CSV rules, and create an array of structs where each struct instance is a tuple or row of data. The array would be indexed by ID, a value that the compiler would generate in the same way enum values are assigned. The end result would be a static constant array, exactly as if I had hand-coded a manual translation of the data. Essentially, this is the technique I recommended when I wrote about how to avoid breaking encapsulation with enums.
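To make that concrete, here is a rough run-time approximation in python of what such a compiler might do. It is only a sketch: `load_table`, `CurrencyRow`, and `CurrencyId` are names I made up for illustration, and the inline CSV text stands in for the attached file, with placeholder exchange rates rather than real ones.

```python
import csv
import io
from enum import IntEnum
from typing import NamedTuple

# Stand-in for latest_currency_info.csv. The rates are made-up placeholders.
# "\u20ac" is the euro sign, written as an escape to keep this source ASCII.
SAMPLE_CSV = """id,symbol,name,exchange_rate
USD,$,US Dollar,1.0
EUR,\u20ac,Euro,0.5
JPY,Y,Yen,100.0
"""


class CurrencyRow(NamedTuple):
    id: IntEnum
    symbol: str
    name: str
    exchange_rate: float


def load_table(text):
    """Parse CSV text into (enum type, tuple of rows indexed by enum value)."""
    records = list(csv.DictReader(io.StringIO(text)))
    # Assign IDs the way enum values are usually assigned: 0, 1, 2, ...
    # in the order the rows appear in the file.
    ids = IntEnum("CurrencyId", [(r["id"], i) for i, r in enumerate(records)])
    rows = tuple(
        CurrencyRow(ids[r["id"]], r["symbol"], r["name"], float(r["exchange_rate"]))
        for r in records
    )
    return ids, rows


CurrencyId, CURRENCY_INFO = load_table(SAMPLE_CSV)
```

With this in place, `CURRENCY_INFO[CurrencyId.EUR].name` looks up a row by its generated ID. The difference from what I'm asking for is that this happens at import time in an interpreter, not inside a compiler that emits a true static constant.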
Think about the advantages for a minute. Christine Lagarde isn’t going to call me up or help me write code if the IMF decides to make loans in Bitcoin (to pick a ridiculous example)–but I can write a cron job that downloads data about accepted currencies worldwide, as published on xe.com. Suddenly my code is up-to-date. I never have to do reformatting work, and I don’t have to worry about code getting out-of-sync with reality.
This isn’t rocket science, but it’s remarkably powerful. You no longer need to use programming language syntax to describe data–you can use a familiar, standard data representation language. That means non-coders can give it to you directly. Data sources turn into code with minimal effort.
At work, I maintain code that helps categorize content on the web. The set of possible categories is in a coded table in both C++ and java, but it is not chosen by engineers–product managers and executives debate about what’s most useful to customers, and they periodically change their minds. If I had compilers that supported the behavior I’m advocating, I could tell my product management to email me a .xlsx whenever they make a change, and my reaction time would be minutes. And I could be certain that the C++ and java versions of the table were identical, since they used the same input data.
I can think of enhancements that would make such a mechanism even more powerful:
- Let tables be re-sorted during import.
- Let tables be indexed by multiple fields.
- Support joins, either during import or by making tables connectable at run-time. (Remember my problem where I had to connect whois country names and telephone dialing prefixes, with an ISO 3166 country code as a common column?)
- Allow columns that are populated by formula (evaluated at compile-time) instead of by an actual data value. Besides generating new or composite data, this would give us a way to normalize reliably.
- Support assertions about imported data, to guarantee integrity.
- In addition to supporting a rich set of input formats, allow a coder to write custom importers.
- Let coders reorder or suppress columns during import.
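Several of these enhancements are easy to picture as plain data manipulation. The python sketch below joins two small tables on a shared country-code column, asserts that every row found a partner, adds a formula column, re-sorts, and builds two indexes. The helper names (`read_rows`, `join_on`) and the column names are hypothetical, and the inline text is sample data standing in for the real sources.

```python
import csv
import io

# Hypothetical stand-ins for the two sources: a pipe-delimited file of whois
# country names and a comma-delimited table of dialing prefixes.
WHOIS_NAMES = """code|whois_name
US|United States
FR|France
DE|Germany
"""
DIAL_PREFIXES = """code,prefix
DE,49
FR,33
US,1
"""


def read_rows(text, delimiter=","):
    """Parse delimited text with a header line into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def join_on(left, right, key):
    """Inner-join two lists of dict rows on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    # Assertion about imported data: every left row must have a partner.
    missing = [row[key] for row in left if row[key] not in right_by_key]
    assert not missing, f"no match for keys: {missing}"
    return [{**row, **right_by_key[row[key]]} for row in left]


names = read_rows(WHOIS_NAMES, delimiter="|")
prefixes = read_rows(DIAL_PREFIXES)
countries = join_on(names, prefixes, "code")

# Column populated by formula, evaluated once at import time.
for row in countries:
    row["dial_string"] = "+" + row["prefix"]

# Re-sort during import, then index the same rows by more than one field.
countries.sort(key=lambda row: row["whois_name"])
by_code = {row["code"]: row for row in countries}
by_name = {row["whois_name"]: row for row in countries}
```

A compiler could do the same work before the program ever runs, so a failed assertion would show up as a build error instead of a run-time surprise.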
Making it real
As you might have guessed, direct compilation of data from native formats is a feature of the intent programming language I’m working on. But I think this technique might be implementable in some other programming languages, even without changes to a spec.
In java, you might be able to implement a custom class loader that generates bytecode for a table when given a .csv as the URL it should load from.
In python and perl, you could probably implement a class that generates dictionaries from a statement that looks quite similar to an
In C++, you could use a custom build step and an external app to generate a table from a csv. A bit klunky, but usable.
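For the C++ route, the external app could be a short python script that the build runs before compiling. Here is one possible sketch, not a finished tool: `generate_header`, `cpp_string`, and the struct, enum, and array names are all invented for illustration, and it assumes the `id` column holds valid C++ identifiers. Bytes outside printable ASCII are written as three-digit octal escapes so the generated header stays pure ASCII.

```python
import csv
import io

# Sample input with placeholder rates; "\u20ac" is the euro sign.
SAMPLE_CSV = """id,symbol,name,exchange_rate
USD,$,US Dollar,1.0
EUR,\u20ac,Euro,0.5
"""


def cpp_string(value):
    """Render text as an ASCII-only C++ string literal of UTF-8 bytes."""
    out = []
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        if ch in ('"', "\\"):
            out.append("\\" + ch)  # escape quotes and backslashes
        elif 0x20 <= byte < 0x7F:
            out.append(ch)  # printable ASCII passes through
        else:
            out.append(f"\\{byte:03o}")  # three-digit octal escape
    return '"' + "".join(out) + '"'


def generate_header(csv_text, struct_name="CurrencyInfo", array_name="currency_info"):
    """Turn CSV text into the source of a C++ header declaring a static table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    lines = [
        "// Generated file; edit the CSV source instead.",
        "#pragma once",
        "",
        f"struct {struct_name} {{",
        "    const char* symbol;",
        "    const char* name;",
        "    double exchange_rate;",
        "};",
        "",
        # Enum values double as array indexes, in file order.
        "enum CurrencyId { " + ", ".join(r["id"] for r in rows) + " };",
        "",
        f"static const {struct_name} {array_name}[] = {{",
    ]
    for r in rows:
        lines.append(
            f"    {{ {cpp_string(r['symbol'])}, {cpp_string(r['name'])}, "
            f"{float(r['exchange_rate'])!r} }},  // {r['id']}"
        )
    lines.append("};")
    return "\n".join(lines) + "\n"


header = generate_header(SAMPLE_CSV)
```

A build step would write `header` to a .h file whenever the CSV changes, and the C++ code would include it like any hand-written table.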
You might even be able to write a SWIG module that would do this in a whole bunch of different languages, all in one go.
If any of you have great ideas (or implementations), please share them.