Once in a while, I encounter a query that looks something like this:
SELECT DISTINCT
ua1.id,
ha1.hubspot_account_foreign_id,
/* .. */
FROM user_account ua1
LEFT JOIN hubspot_account ha1 ON ha1.user_account_id = ua1.id
/* .. */
The original author of this query probably intended to write a query that returns a list of user accounts with their associated HubSpot account. However, we can infer that the query assumes that there is a one-to-one relationship between user accounts and HubSpot accounts. However, even if that assumption is true at the time of writing, it might not be true in the future.
Let's say that the hubspot_account
table records one-to-one relationships between user accounts and HubSpot accounts. This might be enforced by a unique constraint on the hubspot_account.user_account_id
column. In this case, the query above will return one row for each user account, as expected. However, that constraint might be removed in the future, and the query will start returning multiple user account records with a unique HubSpot account for each user account.
id | hubspot_account_foreign_id |
---|---|
1 | 75cce832 |
1 | 6e779469 |
This would lead to subtle bugs that are hard to catch.
DISTINCT ON
would be even more dangerous in this case, as it would return a random HubSpot account for each user account. But since it would return only one row per user account, it might not be immediately obvious that there is a problem.
Because of reasons discussed in this article, the above pattern is not recommended. Instead, you should add a column to the user_account
table that stores the foreign key to the hubspot_account
table. This way, you can enforce the one-to-one relationship at the database level.
So we have a problem: the query above makes an implicit assumption about the data that might not hold in the future. How can we make this assumption explicit?
Well, there really isn't a good way to do this in SQL. PostgreSQL (and as far as I am aware, any dialect of SQL) does not have a built-in mechanism for throwing errors directly based on query results. You could achieve something similar by writing a PL/pgSQL
function that checks the query results and throws an error if the assumption is violated. However, this is not a very elegant solution. What we need is a way to make these assumptions explicit in the query itself.
I propose a new feature for PostgreSQL: native assertions. This feature would allow you to add assertions to your queries that are checked at runtime. Here is how you could use this feature in the query above:
SELECT
ua1.id,
ha1.hubspot_account_foreign_id,
/* .. */
FROM user_account ua1
LEFT JOIN hubspot_account ha1 ON ha1.user_account_id = ua1.id
/* ... */
ASSERT ua1.id IS UNIQUE
In this query, the ASSERT
clause checks that the ua1.id
column is unique within the returned dataset. If it is not, the query will throw an error. This way, the assumption that there is a one-to-one relationship between user accounts and HubSpot accounts is made explicit in the query itself. If this assumption is violated in the future, the query will throw an error, making the bug easier to catch.
One way to achieve something similar in PostgreSQL is to avoid using JOIN
. Instead, you could use a subquery to fetch the HubSpot account for each user account. This way, you can make the assumption explicit in the query itself:
SELECT
ua1.id,
(
SELECT ha1.hubspot_account_foreign_id
FROM hubspot_account ha1
WHERE ha1.user_account_id = ua1.id
) AS hubspot_account_foreign_id
FROM user_account ua1
In this query, if there is more than one HubSpot account for a user account, the subquery will throw an error:
ERROR: more than one row returned by a subquery used as an expression
However, this solution has serious trade-offs:
hubspot_account_foreign_id
column in other parts of the query, as it is defined in a subquery.You could try reaching for libraries like Slonik that allow you to run assertions on query results. However, these are solutions to a different problem. The goal of native assertions is to make assumptions about the data explicit in the query itself. For example, consider a query that assumes that a certain column is never NULL
. You could add an assertion to the query that checks this assumption:
SELECT
id,
name,
email
FROM user_account
ASSERT email IS NOT NULL
Furthermore, the client-side assertions will not help with subqueries, as they only work on the top-level query.
Adding assertions to SQL queries is not a substitute for proper database design. In the earlier example, the correct way to enforce the one-to-one relationship between user accounts and HubSpot accounts is to add a column to the user_account
table that stores the foreign key to the hubspot_account
table. However, adding assertions to SQL queries would be a powerful feature that would make it easier to catch bugs caused by implicit assumptions in queries.