The Considerate Data Modeler


Database modeling is closer to library science than computer science. No one is impressed by a librarian who gets creative and files cookbooks under “G” for “Gourmand.” The best catalog system is where everything is in an obvious place, where everything conforms to expectations.

As Alec Baldwin’s database consultant explained to a team of application developers he was brought in for the day to train in Glengarry Glen Ross, it comes down to three simple letters, LCD:

Always target the lowest common denominator. ANSI SQL isn’t the LCD; you want to be targeting the subset of the subset of ANSI SQL which is truly portable.

Why? Because your data is eventually going to end up in another RDBMS, pumped over an ETL pipeline to a data warehouse of some kind by a tool like dbt or ADF. Any unique, advanced feature of a database that you use will cause headaches and make you enemies.

I know this firsthand because at one time or another I’ve played each role. I’ve been the application developer using PostGIS spatial types in my application and explaining to the data team about spherical coordinate systems, polygons, and R-trees and why we absolutely needed them and couldn’t just use:

    latitude NUMERIC(8,2),
    longitude NUMERIC(8,2)

even if that’s their coding standard for their data warehouse. I’ve been the data scientist parachuted into a hostile operational database and given three days to write a query to flatten it down into usable training data and wincing over every JSON extraction. And I’ve been the guy advocating building a single data warehouse from three different databases built by different people at different companies before they were acquired and merged, dealing with the impedance mismatch that resulted. (If I never hear the phrase “conformed dimension” again in my life it will be too soon.) So I can tell you with great confidence and breadth of experience:

Nobody likes a smart ass. Nobody ever says, “wow, you leveraged the shit out of that unique, proprietary feature from your vendor of choice! You must really be on the cutting edge! It was a great business decision to lock us in like that, and we’re really enjoying learning the ins-and-outs of every quirk of your very special database! And the ETL team, oh, they’re just having great fun writing custom mapping logic! They were getting so bored using the default copy activity all the time!” No, they just grit their teeth, smile, and ask how it can be mapped back to third normal form.

Case Study: Enum

MySQL and Postgres have an enum type; Oracle, Databricks Delta, and Snowflake do not.

The poor DBA will probably grit their teeth and model your enum as a CHECK constraint on a string, which is wildly inefficient.

Think also of the poor analysts and data scientists who have to use your database. SQL is their mother tongue and they think nothing of LEFT JOINing to a few reference tables to pick up human-readable columns… at the end of their query. In the meat of the logic, all the CTAS statements and subqueries, they want to use fast, exact joins on primary keys.

Here is some homework for you:

1) I want to deprecate a code in an enum, but it is used in historical data and still meaningful there. If it was a ref table we would add a soft delete flag; how do I do the same thing with enum?

2) I need to associate additional information to each code in my enum. For example, I have a State enum, but I also need to store the two-letter state abbreviations, and distinguish between states and territories. How would I do that with an enum?

3) Write a query that returns one row for every state in the State enum and counts the number of orders. To be clear, it should return a row even for states where the count is zero, like ('Wisconsin', 0). Compare your answer to this query:

SELECT state.name, COUNT(orders.id) as n_orders
FROM state
LEFT JOIN orders ON state.id = orders.state_id
GROUP BY state.name

Which one an analyst would prefer? Note also exactly what happened: in order to make it work, it flattened the enum back to the relational view on the fly.

Keep it Boring

If the answer to literally every question about how to do some slightly advanced thing is “use a lookup table,” why not start with that in the first place?

You might argue, well, we could start there, but then migrate when these super advanced cases actually come up in practice. YAGNI! KISS!

The problem there is that the refactoring isn’t transparent at all. Relational databases have an “interface,” just like an API - that’s why we can swap tables for views, for example. But the surface area of that interface is quite large: every table name, column name, and type.

Going from an enum to a lookup table is a backward-incompatible change in the interface to the database. Every single SQL query that ever touched that enum will need to be rewritten. That means the custom maps in ETL pipelines, ORM models in application code, every snippet of saved SQL the analysts keep inside of DBeaver to quickly answer ad hoc questions from executives.

Here is something absolutely key for application developers to understand: the database is not for just you. It is not your private place to persist a few objects so your application works. It is an integral part of your interface to the rest of the organization, the back office in particular. If your application is successful, executives will want dashboards with metrics, analysts will want to do reporting, and data scientists will want to extract training data.

If it’s painless to migrate, it’s only because no one was using your database in the first place. Your database was already a failure.

But let’s say you push through the change management nightmare, and get the State enum refactored to the state column, or whatever. Now you have a data model where some columns are enums, some use the lookup table convention, and everyone has to check every single time they use any column.

Conclusion

What if there was a better way? What if, and hear me out on that, we just… didn’t do that? Any of that? Just built, plain ol’ boring data models that could be trivially instantiated on any RDBMS using conventions that everyone in the industry is familiar with? It would be like coming back to your favorite little bookstore, where everything is where it should be and you can always find what you’re looking for, where all the customers are happy, the staff is unstressed, and even the cat seems to be in exactly the right spot.

ABC: Always Be a Considerate data modeler. Design boring databases.

The lovely cat photo is by Diane Picchiottino on Unsplash.