Jezen Thomas

Make Your Database Tables Smaller

Every so often I will inspect the data that I’m storing in a web application and find columns that are near enough totally unused. This is a strong signal that the feature which relied on the column is not valuable, and the column along with the feature can and should be deleted.

This is a good thing.

One of the best moves you can make is to delete your code.


We often talk about reducing the scope of a problem until it fits inside your head. I think a similar principle often applies to database tables — a page of results when running SELECT * FROM … should be narrow enough to fit on the average laptop screen without wrapping.

If the table is too wide, look for columns that share some narrow domain — narrower than whatever domain is shared by the rest of the columns in that table. Does your users table have a few columns prefixed with oidc for single sign-on authentication? Those should probably be in their own table. Joins are cheap. Normalise earlier.
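As a rough sketch, here is what that split can look like using Persistent, the library behind the derivePersistField call later in this post. The entity and field names are invented for illustration:

{-# LANGUAGE DataKinds, DerivingStrategies, FlexibleInstances, GADTs,
             GeneralizedNewtypeDeriving, MultiParamTypeClasses, QuasiQuotes,
             StandaloneDeriving, TemplateHaskell, TypeFamilies,
             UndecidableInstances #-}

import Data.Text (Text)
import Database.Persist.TH

-- The single-sign-on columns move out of the users table and into
-- their own table, joined back with a foreign key.
share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
User
  email Text
  name Text

UserOidc
  userId UserId
  issuer Text
  subject Text
  UniqueUserOidc userId
|]

Code that handles sign-on joins on user_id; everything else never has to look at those columns.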

Sometimes, for the sake of expediency, people will throw a collection of keys and values into a single cell, encoded as JSON. I have done this, and I have encouraged others to do this. This can be all the more tempting given the JSON operators provided in PostgreSQL.

In retrospect, I think this approach has been a false economy. It has most often caused pain that could have been avoided with a relatively small amount of diligence. It may seem like persisting data in a flexible schema is smaller — perhaps because there are fewer columns in the table or because less effort is required upfront — but then querying becomes harder, analysis becomes harder, and data transformation becomes harder.
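To make the trade-off concrete, here is a sketch of both shapes as Persistent entity definitions. The same share boilerplate as the earlier sketch applies, and the names are again invented:

-- Tempting: one flexible cell, encoded as JSON,
-- e.g. {"kind":"signup","plan":"pro"}
EventBlob
  payload Text

-- Diligent: real columns with real types.
Event
  kind Text
  plan Text Maybe

Filtering the first shape by plan means reaching for PostgreSQL's JSON operators or decoding every row in application code; filtering the second is an ordinary WHERE clause.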

Finally, language implementation details shouldn’t leak into the database. In the past — again for the sake of expediency — I have persisted values in the database with exactly the same representation as Haskell would use when printing that value to the screen. This means I would see the skeletons of newtype wrappers in the database, e.g.,

db=# select customer_id from subscriptions;
                  customer_id
--------------------------------------------------
 CustomerKey {unCustomerKey = cus_Buc4bl7hPFJfcC}
 CustomerKey {unCustomerKey = cus_LO80QORn3LnNPV}
 CustomerKey {unCustomerKey = cus_GoRq02qsQbd3st}

The serialisation for this type is generated by the following Template Haskell, which round-trips values through their Show and Read instances — which is how that record syntax ends up in the column:

derivePersistField "CustomerId"

This is enticing because the amount of code is small but, as before, the data is now harder to work with and the signal-to-noise ratio is worse.

Writing the persistence instances by hand to wrap and unwrap the inner values for retrieval and storage respectively might require five times the number of lines of code, but it would make the problem space smaller.
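As a sketch — assuming the newtype wraps Text, and with hypothetical names — those hand-written instances might look like this:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import Database.Persist
import Database.Persist.Sql

newtype CustomerId = CustomerId { unCustomerId :: T.Text }
  deriving (Show, Eq)

-- Unwrap on the way in, so the column holds only the identifier
-- itself; wrap again on the way out.
instance PersistField CustomerId where
  toPersistValue = PersistText . unCustomerId
  fromPersistValue (PersistText t) = Right (CustomerId t)
  fromPersistValue other =
    Left ("CustomerId: expected text, got " <> T.pack (show other))

instance PersistFieldSql CustomerId where
  sqlType _ = SqlString

More code than the one-liner, but the column now holds a plain identifier that psql, reporting tools, and ad-hoc queries can read without ceremony.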

It’s a little extra work up front, but the benefits of making investments like this compound over time. And making choices in software engineering is all about good economics.