Once in a while, you might need to design a one-to-one relationship in SQL. This is a common pattern in database design, but it's not always straightforward to implement. In this article, I'll explain how to design one-to-one relationships in SQL and discuss some common pitfalls to avoid.
Let's define a clear use case for a one-to-one relationship.
Suppose you have user_account
table.
Column Name | Data Type | Description |
---|---|---|
id | integer | Primary key |
name | text | User's name |
A business requirement emerged that we need to sync the user's account with an external system (e.g. HubSpot). HubSpot assigns a unique identifier to each user, and we need to store this identifier in our database. We will call this identifier hubspot_object_foreign_id
.
The first decision we need to make is how to store this identifier in our database. There are two common ways to do this:
The first option is to add a column to the existing user_account
table.
Column Name | Data Type | Description |
---|---|---|
id | integer | Primary key |
name | text | User's name |
hubspot_object_foreign_id | text | HubSpot's unique identifier |
hubspot_synced_at | timestamp with time zone | Last time the user was synced with HubSpot |
This approach is simple and straightforward. If you only need to store a few additional columns, this is the way to go.
If you choose this option, you should add a unique constraint to the hubspot_object_foreign_id
column to ensure that each user has a unique identifier.
The main disadvantage of this approach is that it can lead to a bloated table. If you need to store a lot of additional information about the relationship between the user account and the HubSpot account, it might be better to create a separate table.
Adding additional columns to the existing table can have performance implications. If the user_account
table is large and you frequently query it, adding additional columns can slow down your queries. This is because the additional columns will increase the size of each row, which can lead to more disk I/O and memory usage.
How many is too many? It depends on the type of data you are storing (see PostgreSQL limits). If I had to give a rough estimate, I would say that if you foresee storing more than 100 columns, you should steer clear of this approach.
The second option is to create a separate table (user_account_hubspot_object
) to store the HubSpot identifier.
Column Name | Data Type | Description |
---|---|---|
id | integer | Primary key |
user_account_id | integer | Foreign key to user_account |
hubspot_object_foreign_id | text | HubSpot's unique identifier |
hubspot_synced_at | timestamp with time zone | Last time the user was synced with HubSpot |
synced_phone_number | text | User's phone number (data from HubSpot) |
The advantage of this approach is that it allows you to store additional information about the relationship between the user account and the HubSpot account. For example, in case of a two-way sync, you might need to store data that was added to HubSpot profile (synced_phone_number
).
The disadvantages of this approach:
user_account
and user_account_hubspot_object
tables.Option 2 is enticing because it keeps the user_account
table clean and allows you to store additional information about the relationship between the user account and the HubSpot account. However, it introduces a new problem: how do you enforce the one-to-one relationship between the user_account
and user_account_hubspot_object
tables?
This is where things get tricky. In SQL, there is no built-in way to enforce a one-to-one relationship between two tables.
You have a few options to enforce the one-to-one relationship (neither of them is perfect):
You can add a unique index to the user_account_id
column in the user_account_hubspot_object
table. This will ensure that each user account has at most one corresponding HubSpot object.
CREATE UNIQUE INDEX user_account_hubspot_object_user_account_id_idx
ON user_account_hubspot_object (user_account_id);
The disadvantages of this approach is that there is no way to tell just by looking at the query referencing the user_account_hubspot_object
table that it is a one-to-one relationship. You have to rely on the fact that the unique index is there. This can lead to the confusion and bugs down the line. Example: suppose that somewhere in our application we have a query that joins user_account
and user_account_hubspot_object
tables.
SELECT
ua1.id,
ua1.name,
uaho1.hubspot_object_foreign_id
FROM user_account ua1
INNER JOIN user_account_hubspot_object uaho1
ON uaho1.user_account_id = ua1.id
/* ... */
Now suppose that a new business requirement emerges that if a HubSpot object is deleted, we want to track that. The engineer assigned to implement this requirement decides that the best way to implement this is to add a deleted_at
column to the user_account_hubspot_object
table. However, we also need to handle a scenario where a new HubSpot object is created and associated with the same user account. The developer who is tasked to implement this requirement might not be aware (have access to) every code path that references the user_account_hubspot_object
table. As a result, they might think it is safe to drop the unique index and add a deleted_at
column to the user_account_hubspot_object
table. This will break the one-to-one relationship between the user_account
and user_account_hubspot_object
tables. Every query that references this table will now produce incorrect results.
This wouldn't be such a huge problem if PostgreSQL natively supported assertions about the query output. However, it doesn't, so we have to be extra careful when picking this approach.
You can enforce the one-to-one relationship by using the foreign key as the primary key in the user_account_hubspot_object
table.
Our user_account_hubspot_object
table would look like this:
Column Name | Data Type | Description |
---|---|---|
user_account_id | integer | Primary key and foreign key to user_account |
hubspot_object_foreign_id | text | HubSpot's unique identifier |
hubspot_synced_at | timestamp with time zone | Last time the user was synced with HubSpot |
This approach enforces the one-to-one relationship between the user_account
and user_account_hubspot_object
tables. The table design makes it clear that user_account_id
is the primary key and therefore it is highly unlikely that this table would ever be refactored to support a one-to-many relationship.
However, just like the 1st option, this approach has the disadvantage that there is no way to tell just by looking at the query referencing the user_account_hubspot_object
table that it is a one-to-one relationship.
You can somewhat mitigate the risk of the latter scenario by using a naming convention that indicates that the relationship is one-to-one. For example, you can prefix the table name with oto_
(one-to-one).
It is a lot easier to spot the one-to-one relationship in the query if the table name is oto_user_account_hubspot_object
instead of user_account_hubspot_object
.
This approach also makes it clear to whoever reads the query that the relationship between the user_account
and user_account_hubspot_object
tables is one-to-one. Example:
SELECT
ua1.id,
ua1.name,
uaho1.hubspot_object_foreign_id
FROM user_account ua1
INNER JOIN oto_user_account_hubspot_object uaho1
ON uaho1.user_account_id = ua1.id
/* ... */
If you have a naming convention that indicates that the relationship is one-to-one, it will be easier for developers to understand the relationship between the tables. However, this approach is not foolproof. Developers can still drop the unique index without understanding the implications.
You might also consider a convention that glues the two tables together using a custom string, e.g. table1_oto_table2
. The benefit of this approach is that it also makes it clear which two tables are in the relationship. This is especially true with long table names, e.g. oto_user_account_stripe_merchant_customer_organization_member_invite
VS user_account_stripe_merchant_customer_oto_organization_member_invite
You can enforce the one-to-one relationship by mandating that the table that owns the relationship (in our case, user_account
) has a column that references the associated row in the other table (user_account_hubspot_object
), e.g.
user_account
table:
Column Name | Data Type | Description |
---|---|---|
id | integer | Primary key |
name | text | User's name |
user_account_hubspot_object_id | integer | Foreign key to user_account_hubspot_object |
This is my preferred approach because it allows user_account
itself to document the relationship with user_account_hubspot_object
. It also makes it clear to whoever reads the query that the relationship between the user_account
and user_account_hubspot_object
tables is one-to-one. Example:
SELECT
ua1.id,
ua1.name,
uaho1.hubspot_object_foreign_id
FROM user_account ua1
INNER JOIN user_account_hubspot_object uaho1
ON uaho1.id = ua1.user_account_hubspot_object_id
/* ... */
The disadvantage of this approach is that there is nothing stopping someone from unknowingly constructing a query that joins user_account
and user_account_hubspot_object
tables on a column other than user_account_hubspot_object_id
. This can lead to bugs and incorrect results.
Here is a comparison of the options discussed above:
Criteria | Inline (A) | Separate Table (B) | |||
---|---|---|---|---|---|
1 | 2 | 3 | 4 | ||
Relies on a naming convention to indicate one-to-one relationship, e.g. oto_ | No | No | No | Yes | Yes |
Additional columns are stored in the same table | Yes | No | No | No | Yes |
Has performance implications as the number of columns increases | Yes / ๐บ | No / ๐ผ | No / ๐ผ | No / ๐ผ | No / ๐ผ |
One-to-one relationship is visible in queries that references tables, e.g. oto_ or user_account_hubspot_object_id | Yes / ๐ผ | No / ๐บ | No / ๐บ | Yes / ๐ผ | Yes / ๐ผ |
Easy to mistakenly construct a query that joins tables on a column other than the one that enforces the one-to-one relationship | No / ๐ผ | Yes / ๐บ | No / ๐ผ | Yes / ๐บ | Yes / ๐บ |
Easy to mistakenly convert the one-to-one relationship to one-to-many | No / ๐ผ | Yes / ๐บ | No / ๐ผ | No / ๐ผ | Yes / ๐บ |
๐ผ - Desirable ๐บ - Undesirable
Historically, I've used a separate table and identified the associated row in the table that owns the relationship (the B4 option) because it makes the relationship explicit and easy to understand. However, as demonstrated, it is not foolproof. Someone can still mistakenly construct a query that joins tables on a column other than the one that enforces the one-to-one relationship.
The more I think about this problem, the more I am inclined to pick any option that allows to identify the one-to-one relationship by reading the queries that reference the tables (options A and B3).
-- Option A โย can identify the one-to-one relationship by reading the query
SELECT
ua1.id,
ua1.name,
ua1.hubspot_object_foreign_id
FROM user_account ua1
-- Option B3 โย can identify the one-to-one relationship based on the "oto_" prefix
SELECT
ua1.id,
ua1.name,
uaho1.hubspot_object_foreign_id
FROM user_account ua1
INNER JOIN oto_user_account_hubspot_object uaho1
ON uaho1.user_account_id = ua1.id
-- Option B4 โย can also identify the one-to-one relationship by reading the query
-- However, the problem with this approach that someone could still by mistake join on a column other than user_account_hubspot_object_id
SELECT
ua1.id,
ua1.name,
uaho1.hubspot_object_foreign_id
FROM user_account ua1
INNER JOIN user_account_hubspot_object uaho1
ON uaho1.id = ua1.user_account_hubspot_object_id
Options A and B4 are therefore the most foolproof way to ensure that the one-to-one relationship can be easily identified and understood by anyone who reads the query, and it is the most likely to prevent someone from mistakenly converting the one-to-one relationship to one-to-many.
What's interesting is that the need to have a convention that expresses the one-to-one relationship in queries was not a problem in most of the projects I've worked on. This only became a problem when I realized during code reviews that we are making assumptions about the relationships between the tables based on our current business logic and not based on the database schema. Example:
SELECT
o1.id,
ot1.name
FROM organization o1
INNER JOIN organization_team ot1
ON ot1.organization_id = o1.id
It is a fair assumption that the organization_team
table is a one-to-one relationship if you are certain that the program only allows one team per organization. Perhaps today, there is no way to create multiple teams per organization in the application. Perhaps, there is even a UNIQUE INDEX
on organization_team.organization_id
that prevents it. Therefore, the assumption is true today. However, what happens when the business requirements change and the program is refactored to allow multiple teams per organization? This can happen with or without database schema changes. The query above would still work, but it would produce incorrect results.
It is impossible to predict the future and every possible way the program can be refactored. However, we can strive to adopt conventions that help us to communicate the original intent and prevent accidental misuse.
In this case, a convention such as outlined in B3 would have either avoided the above query to be written as an INNER JOIN
or would have made it clear that the relationship between the organization
and organization_team
tables is one-to-one.