Database normalization
If you are new to databases or have years of experience, it is key to grasp normalization basics to build well-organized and smooth-running databases. In this post we will focus only on that.
Introduction
Normalization is a method of organizing data to reduce redundancy, with the goal of creating a more efficient, reliable, and simple database. The main goals of the normalization is to organize data, remove data redundancy, ensure data integrity, ensure data consistency and optimize the database for storage and retrieval of the data.
There are six normal forms:
- First normal form (1NF)
- Second normal form (2NF)
- Third Normal Form (3NF)
- Boyce-Codd normal form (BCNF)
- Fourth normal form (4NF)
- Fifth normal form (5NF)
In most of the cases we are ok with the database being in the 3rd normal form. Higher level of the database normalization can lead to the slower query performance in databases where data retrieval is very important. That is one of the reasons why analytical databases are highly denormalized. Going beyond 3th normal form can also increase complexity of the database which can be bad for the onboarding of the new members and juniors as well as a bigger challenge for the current developers and database administrators.
In this post and related examples I will focus only on the first three normal forms. In future I will cover some use cases where other normal forms could be used.
1NF requires that each value in a column is atomic and that all rows have the same number of fields. This means that we should not store for example city, country and address in a single column because that complicates tasks like sorting records by city. If we want to store mentioned data in a single column then a relational database like MSSQL is not a suitable option and we can use a NOSQL columnar database like Casandra.
2NF requires 1NF to be valid, all rows are identified separately and completely dependent on the primary key and there are no repeating columns. Example for that would be a table which contains course id and name columns and instructor id and name columns. In this case the instructorName column would not depend on courseId which would violate 2NF and we would need to split the table into Course table, Instructor table and junction table if the instructor can teach multiple courses.
3NF requires 2NF to be satisfied and there are no transitive dependencies which means that no non-key attribute is dependent on another non-key attribute. This means that all non-required key columns must be independent of each other. Example of this would be if we add to table from 2NF example additional column Department name and department description which is recurring and which needs to be moved to separate table.
You can read more about Normalization on this links:
Example customers
Unnormalized Data
Customer ID | First Name | Last Name | Phone Number | Address | City | |
1 | John | Doe | johndoe@gmail.com | 555-1234 | 123 Main St | Anytown |
2 | Jane | Smith | janesmith@gmail.com | 555-5678 | 456 Elm St | Anytown |
3 | Bob | Johnson | bobjohnson@gmail.com | 555-8901 | 789 Oak St | Othertown |
4 | Sue | Lee | suelee@gmail.com | 555-2345 | 321 Maple St | Othertown |
In this table:
- The City column has repeated values like “Anytown” and “Othertown”
- The Address column is tied directly to each customer leading to some redundancy if multiple customers live in the same city or even at the same address
First Normal Form (1NF)
To achieve 1NF, we need to ensure that each column contains atomic (indivisible) values, and each record is unique. The table is already in 1NF because each field contains a single value per record and there are no repeating columns.
Second Normal Form (2NF)
Since the table is 1NF, there are no partial dependencies and all non-key attributes depend on the primary key, the table is already in 2NF.
Third Normal Form (3NF)
To achieve 3NF we will create two new tables:
- Addresses table where we are going to move addresses and reference the city they belong to
- Cities table where we will move city names
- (Note) in this example customer belongs to only single address. We are avoiding junction table in this example and it will be covered in next examples. In this example this case would be covered by another table called “junction table” which would contain CustomerID and AddressID giving us many to many relationship between those tables
Normalized tables
Customers Table:
Customer ID | First Name | Last Name | Phone Number | Address ID | |
1 | John | Doe | johndoe@gmail.com | 555-1234 | 1 |
2 | Jane | Smith | janesmith@gmail.com | 555-5678 | 2 |
3 | Bob | Johnson | bobjohnson@gmail.com | 555-8901 | 3 |
4 | Sue | Lee | suelee@gmail.com | 555-2345 | 4 |
Addresses Table:
Address ID | Address | City ID |
1 | 123 Main St | 1 |
2 | 456 Elm St | 1 |
3 | 789 Oak St | 2 |
4 | 321 Maple St | 2 |
Cities Table:
City ID | City Name |
1 | Anytown |
2 | Othertown |
This structure eliminates redundancy by storing city names only once in the cities table and addresses only once in the addresses table. Each customer now references their address using the Address ID.
Summary
- Customers table contains customer-specific data, including the first name, last name, email, phone number, and a reference to the address
- Addresses table stores address details and references the city using City ID
- Cities table stores city names
Example Employee timecards
Unnormalized Data
Employee ID | Employee Name | Pay Rate | Week Ending | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday | Total Hours |
101 | John Smith | $15/hr | 26/2/2022 | 8 | 8 | 8 | 8 | 0 | 0 | 40 |
102 | Jane Doe | $20/hr | 26/2/2022 | 8 | 8 | 8 | 8 | 0 | 0 | 40 |
103 | John Smith | $15/hr | 5/3/2022 | 8 | 8 | 8 | 0 | 0 | 0 | 32 |
In this table:
- Employee details (like name and pay rate) are repeated for each timecard
- Days of the week are columns rather than rows, which leads to a less flexible design
First Normal Form (1NF)
To achieve 1NF, we ensure that each column contains atomic values and there are no repeating groups of data. In this case days of week are not considered repeating groups because they each represent a distinct day of the week. Just because by definition it is not repeating groups it does not mean it is good. Example of repeating columns would be if we had for example Phone1, Phone2, Phone3 columns. The table is in 1NF.
Second Normal Form (2NF)
The table is 1NF but we have partial dependencies so it does not satisfy 2NF.
Third Normal Form (3NF)
Even if the table were to be brought to 2NF, the total hours column would still be problematic because it can be derived from the sum of the daily hours (Tuesday, Wednesday, etc.). This creates a transitive dependency where total hours depends on the daily hours which depends on the primary key.
Normalized tables
Employees Table:
Employee ID | Employee Name | Pay Rate |
101 | John Smith | $15/hr |
102 | Jane Doe | $20/hr |
103 | Joe Brown | $18/hr |
Timecards Table:
Timecard ID | Employee ID | Week Ending |
1 | 101 | 26/2/2022 |
2 | 102 | 26/2/2022 |
3 | 103 | 26/2/2022 |
4 | 101 | 5/3/2022 |
Days of Week Table:
Day of Week ID | Name |
1 | Monday |
2 | Tuesday |
3 | Wednesday |
4 | Thursday |
5 | Friday |
6 | Saturday |
7 | Sunday |
Timecard Entries Table:
Timecard ID | Day of Week ID | Hours |
1 | 1 | 8 |
1 | 2 | 8 |
1 | 3 | 8 |
1 | 4 | 8 |
1 | 5 | 8 |
2 | 1 | 8 |
2 | 2 | 8 |
2 | 3 | 8 |
2 | 4 | 8 |
2 | 5 | 8 |
3 | 1 | 8 |
3 | 2 | 8 |
3 | 3 | 8 |
3 | 4 | 8 |
4 | 2 | 8 |
4 | 3 | 8 |
4 | 4 | 8 |
4 | 5 | 8 |
Summary
To bring this table to 2NF and 3NF:
- 2NF removes partial dependencies by splitting the table into separate tables such as an Employees table for Employee Name and Pay Rate and a Timecards table for Week Ending and the daily hours
- 3NF removes transitive dependencies by eliminating the Total Hours column relying instead on calculating totals dynamically based on the daily hours stored in the Timecards table
- Employees table contains employee-specific data including the name and pay rate
- Timecards table contains timecard details linking each timecard to an employee and a specific week
- Days of week table stores day names and their corresponding IDs
- Timecard entries table records the number of hours worked for each day of the week for each timecard
Example countries
Unnormalized Data
Country Name | Capital City | Population | Continent | Area (sq km) | Language |
USA | Washington D.C. | 331,449,281 | North America | 9,147,593 | English |
Canada | Ottawa | 38,048,738 | North America | 9,984,670 | English, French |
China | Beijing | 1,402,193,353 | Asia | 9,596,961 | Mandarin |
Australia | Canberra | 25,762,395 | Oceania | 7,692,024 | English |
Brazil | Brasília | 213,445,417 | South America | 8,515,767 | Portuguese |
Mexico | Mexico City | 130,262,216 | North America | 1,964,375 | Spanish |
Russia | Moscow | 145,912,025 | Europe | 17,098,242 | Russian |
First Normal Form (1NF)
In this example in Language column we have multiple values and because of that table is not in 1NF
We can transform it to 1NF by splitting English,French into separate rows:
Country Name | Capital City | Population | Continent | Area (sq km) | Language |
USA | Washington D.C. | 331,449,281 | North America | 9,147,593 | English |
Canada | Ottawa | 38,048,738 | North America | 9,984,670 | English |
Canada | Ottawa | 38,048,738 | North America | 9,984,670 | French |
China | Beijing | 1,402,193,353 | Asia | 9,596,961 | Mandarin |
Australia | Canberra | 25,762,395 | Oceania | 7,692,024 | English |
Brazil | Brasília | 213,445,417 | South America | 8,515,767 | Portuguese |
Mexico | Mexico City | 130,262,216 | North America | 1,964,375 | Spanish |
Russia | Moscow | 145,912,025 | Europe | 17,098,242 | Russian |
Second Normal Form (2NF)
To move table to 2NF we need to create primary key on table and move the language column into separate table:
Countries Table:
Country ID | Country Name | Capital City | Population | Continent | Area (sq km) |
1 | USA | Washington D.C. | 331,449,281 | North America | 9,147,593 |
2 | Canada | Ottawa | 38,048,738 | North America | 9,984,670 |
3 | China | Beijing | 1,402,193,353 | Asia | 9,596,961 |
4 | Australia | Canberra | 25,762,395 | Oceania | 7,692,024 |
5 | Brazil | Brasília | 213,445,417 | South America | 8,515,767 |
6 | Mexico | Mexico City | 130,262,216 | North America | 1,964,375 |
7 | Russia | Moscow | 145,912,025 | Europe | 17,098,242 |
Country_Languages Table:
Country ID | Language |
1 | English |
2 | English |
2 | French |
3 | Mandarin |
4 | English |
5 | Portuguese |
6 | Spanish |
7 | Russian |
Third Normal Form (3NF)
We have now table in 2NF. To improve it even further we are going to pull continents to new table and do some changes on current tables:
Countries Table:
Country ID | Country Name | Capital City | Population | Area (sq km) | Continent ID |
1 | USA | Washington D.C. | 331,449,281 | 9,147,593 | 1 |
2 | Canada | Ottawa | 38,048,738 | 9,984,670 | 1 |
3 | China | Beijing | 1,402,193,353 | 9,596,961 | 2 |
4 | Australia | Canberra | 25,762,395 | 7,692,024 | 3 |
5 | Brazil | Brasília | 213,445,417 | 8,515,767 | 4 |
6 | Mexico | Mexico City | 130,262,216 | 1,964,375 | 1 |
7 | Russia | Moscow | 145,912,025 | 17,098,242 | 5 |
Continents Table:
Continent ID | Continent Name |
1 | North America |
2 | Asia |
3 | Oceania |
4 | South America |
5 | Europe |
Languages Table:
Language ID | Language Name |
1 | English |
2 | French |
3 | Mandarin |
4 | Portuguese |
5 | Spanish |
6 | Russian |
Country_Languages Table:
Country ID | Language ID |
1 | 1 |
2 | 1 |
2 | 2 |
3 | 3 |
4 | 1 |
5 | 4 |
6 | 5 |
7 | 6 |
Summary
By applying normalization, we’ve split the original table into multiple related tables:
- Countries table contains country-specific data and references to the continent
- Continents table stores continent names
- Languages table stores language names
- Country_Languages table acts as a junction table to establish a many-to-many relationship between countries and languages
Example songs
Unnormalized Data
Artist Name | Genre | Album Title | Release Date | Song Title | Duration | Playlist Title | Description |
Taylor Swift | Pop | Fearless | 11/11/2008 | Love Story | 3:56 | Party Mix | Upbeat dance songs |
Taylor Swift | Pop | Fearless | 11/11/2008 | You Belong With Me | 3:51 | Party Mix | Upbeat dance songs |
Taylor Swift | Pop | Red | 22/10/2012 | Shake It Off | 3:39 | Party Mix | Upbeat dance songs |
Bruno Mars | R&B | 24K Magic | 18/11/2016 | 24K Magic | 3:47 | Party Mix, Workout | Upbeat dance songs |
Coldplay | Alternative | A Head Full of Dreams | 4/12/2015 | Viva la Vida | 4:02 | Party Mix, Workout | Upbeat dance songs |
Taylor Swift | Pop | Fearless | 11/11/2008 | Love Story | 3:56 | Acoustic | Calming acoustic songs |
Coldplay | Alternative | A Head Full of Dreams | 4/12/2015 | Viva la Vida | 4:02 | Acoustic | Calming acoustic songs |
Taylor Swift | Pop | Red | 22/10/2012 | Love Story | 3:56 | Workout | High energy workout songs |
Taylor Swift | Pop | Red | 22/10/2012 | Shake It Off | 3:39 | Workout | High energy workout songs |
In this table:
- There is issue with atomicity in Playlist Title column where we have multiple values
- There are repeated values for Artist Name, Genre, Album Title, Release Date, and Playlist Title
- There is a lot of redundancy that can be removed by normalization
First Normal Form (1NF)
The current table has an issue with atomicity. The Playlist Title column has multiple values (Party Mix, Workout). This violates the requirement that each cell should have a single value.
Normalization to 1NF
We need to split the rows where Playlist Title contains multiple values into separate rows.
Artist Name | Genre | Album Title | Release Date | Song Title | Duration | Playlist Title | Description |
Taylor Swift | Pop | Fearless | 11/11/2008 | Love Story | 3:56 | Party Mix | Upbeat dance songs |
Taylor Swift | Pop | Fearless | 11/11/2008 | You Belong With Me | 3:51 | Party Mix | Upbeat dance songs |
Taylor Swift | Pop | Red | 22/10/2012 | Shake It Off | 3:39 | Party Mix | Upbeat dance songs |
Bruno Mars | R&B | 24K Magic | 18/11/2016 | 24K Magic | 3:47 | Party Mix | Upbeat dance songs |
Bruno Mars | R&B | 24K Magic | 18/11/2016 | 24K Magic | 3:47 | Workout | Upbeat dance songs |
Coldplay | Alternative | A Head Full of Dreams | 4/12/2015 | Viva la Vida | 4:02 | Party Mix | Upbeat dance songs |
Coldplay | Alternative | A Head Full of Dreams | 4/12/2015 | Viva la Vida | 4:02 | Workout | Upbeat dance songs |
Taylor Swift | Pop | Fearless | 11/11/2008 | Love Story | 3:56 | Acoustic | Calming acoustic songs |
Coldplay | Alternative | A Head Full of Dreams | 4/12/2015 | Viva la Vida | 4:02 | Acoustic | Calming acoustic songs |
Taylor Swift | Pop | Red | 22/10/2012 | Love Story | 3:56 | Workout | High energy workout songs |
Taylor Swift | Pop | Red | 22/10/2012 | Shake It Off | 3:39 | Workout | High energy workout songs |
Second Normal Form (2NF)
Genre, Release Date, and Duration are dependent only on part of the primary key (Artist Name, Album Title, Song Title), but not on Playlist Title. This violates 2NF.
Normalization to 2NF
We create separate tables to remove partial dependencies.
Artists Table:
Artist ID | Artist Name | Genre |
101 | Taylor Swift | Pop |
102 | Bruno Mars | R&B |
103 | Coldplay | Alternative |
Albums Table:
Album ID | Album Title | Release Date | Artist ID |
1 | Fearless | 11/11/2008 | 101 |
2 | Red | 22/10/2012 | 101 |
3 | 24K Magic | 18/11/2016 | 102 |
4 | A Head Full of Dreams | 4/12/2015 | 103 |
Songs Table:
Song ID | Song Title | Artist ID | Album ID | Duration |
1 | Love Story | 101 | 1 | 3:56 |
2 | You Belong With Me | 101 | 1 | 3:51 |
3 | Shake It Off | 101 | 2 | 3:39 |
4 | 24K Magic | 102 | 3 | 3:47 |
5 | Viva la Vida | 103 | 4 | 4:02 |
Playlists Table:
Playlist ID | Playlist Title | Description |
1 | Party Mix | Upbeat dance songs |
2 | Acoustic | Calming acoustic songs |
3 | Workout | High energy workout songs |
Playlist-Song Table:
Playlist ID | Song ID |
1 | 1 |
1 | 2 |
1 | 3 |
1 | 4 |
1 | 5 |
2 | 1 |
2 | 5 |
3 | 3 |
3 | 4 |
Third Normal Form (3NF)
The tables are already in 3NF with the above changes.
Summary
The provided tables (Artists, Albums, Songs, Playlists, Playlist-Song Association) are in Third Normal Form (3NF). This ensures that:
- Each table has a primary key
- Non-key attributes are fully functionally dependent on the entire primary key
- There are no transitive dependencies