What is Normalization? | How to apply Normalization up to the 5th Normal Form?
Normalization is a very important topic in the Database. For many people, it can be a little bit confusing especially the higher (4th and 5th) normal form. Most of the resources available online only cover up to the 3NF or the BCNF. There are very few resources for the 4th and 5th normal forms. So I am writing this article to explain the entire concept in a simple and easy way.
I have tried to explain the first three normal forms using the same example so the reader can understand the concept in the same context. Since higher normal forms were a little confusing so I have used different but more relevant examples.
Here is what we will cover in this article:
- What is Normalization?
- What is data redundancy?
- What are data anomalies?
- 1st Normal Form
- 2nd Normal Form & Partial Dependency
- 3rd Normal Form & Transitive Dependency
- Boyce Codd Normal Form
- 4th Normal Form & Multi-valued Dependency
- 5th Normal Form & Join Dependency
- Wrapping up
So let’s start!!
What is Normalization?
Normalization is the process of dividing the large database tables into smaller tables to reduce data redundancy.
What is data redundancy?
Data redundancy is the unnecessary duplication of the data in our database that not only increases the size of our database but also creates the data anomalies.
To understand this concept let’s take an example. Suppose we have an Employee table in which we have stored employee_id, name, email, and phone number. We have also stored computer_id to track computer machines (laptops, tablets, or mobile phones) that are given to employees by the company for office work. We also have the attributes Project_code and Project_name which stores the detail of the project that is assigned to the employee, and the final attribute Project_manager which stores the name of the project manager.
For each project, there will be more than one employee who is assigned to this project but the project will be the same. It will have the same code, same name, and same project manager for each employee.
If we add an entry for the new employee who is assigned to a project with Project_code ‘GL2018’, the project name and project manager will also be stored along with it. This is data redundancy. This will increase the size of the database and can cause data anomalies.
What are data anomalies?
Anomalies are the problems that occur when there is too much data redundancy which indicates the database is poorly planned and unnormalized.
There are three types of data anomalies:
- Update Anomaly:
It happens when we try to update the record in our table. In the employee table, we have a project manager for each project. Now, most probably each project has more than one employee, which means more than one record in our table. For each record, we are repeating the project code, project name, and project manager. Now If we have to change the project manager, let’s say for the GLO project from Qadir Shaikh to Hamza Imran. We have to change it for every record where the project is GLO. There is a chance we might miss one or two entries leaving the database in an inconsistent state. Where in some records project manager is Qadir Shaikh and for others, it is Hamza Imran.
- Insertion Anomaly:
Let’s say we want to add a new employee who has been hired recently and currently is in training so hasn’t been assigned any project yet. We can’t add employees without the project details or we have to fill the project fields with null values.
OR let’s say we want to add a new project in our database that we have received from the client but we haven’t started working on it yet, so currently, there is no employee assigned to this project. But we can’t store the project details without any employee since employee_id (primary key) is linked with the employee.
- Deletion Anomaly
Let’s say our project NJAS is completed and we want to remove it from our table so we will remove all the entries of the table where the project is NJAS. But If we do so we will also remove the record of all the employees (Umer shaikh all others) who were working on this project
We can remove these anomalies by normalizing our Employees table. There are five rules to normalize the table. Each rule is called the normal form. There is also another normal form called Boyce Codd normal form which is also called 3.5 normal form. Let’s look at this one by one.
1st Normal Form
The first Normal form says that:
- Each field (cell) in our table should have a single atomic value.
- There should not be any repeating groups
- Each attribute(column) should contain the value of the same type.
Let’s look at our Employees table. Can you identify if there is any violation of 1st normal form?
All attributes have the same type of data, each attribute has either string or a numeric value. And almost all the attributes have a single value for each row except one attribute, the ‘Computer_id’ attribute. Few employees have been given more than one computer so to track the record all the computer ids have been saved in a single field. This is a violation of the 1st normal form.
The first normal form says that each field (cell) in our relation should have a single atomic value.
So how to solve this problem? How about we create two different columns to store two different computer ids?
But read the 1st normal form again. It also says that there should not be any repeating groups. So doing this also violates our 1st normal form. What if the company wants to give a third machine to an employee (A Macbook, an iPad, and an iPhone) and then a fourth? We will have to keep changing our database schema again and again whenever there is little change. There will be another problem, there will be too many null values for the employees that have only one machine. A huge waste of memory!
So instead of making a separate column what if we store an additional record of the employee who has more than one machine!
Hurray!!! Now the Employee table is in the first normal form. Each cell has a single value and there are no repeating groups. And attribute has the same data type.
But wait a minute. Now there is one little problem with our table. We can’t uniquely identify each record using the primary key. For each employee who has been given more than one computer, there will more than one record with the same Employee_id. So what do we do now?
We can combine Employee_id and Computer_id to create a composite key that will uniquely identify each record.
Composite key: The composite key is the primary key made up of 2 or more attributes.
Now everything is fixed!
2nd Normal Form
To satisfy 2nd normal form:
- The table should be in first normal form
- There should be no partial dependency between non-prime attributes and the prime attributes.
Prime and non-prime attributes: Any attribute which is the part of the primary key is called a prime attribute and an attribute that is not the part of the primary key is called a non-prime attribute. In our case, the Employee_id and the Computer_id are the prime attributes and all others are non-prime attributes.
Understanding partial dependency: Partial dependency means when we have a composite primary key that is made up of more than one attribute then each non-prime attribute should be dependent on the whole primary key and not the part of the primary key.
Let’s look at the Employee table, the primary key is made up of Employee_id and Computer_id. But attributes like name, email, etc can be uniquely identified using only employee_id we don’t need computer_id for that. And if we had other attributes like computer name, model, and brand, etc then they could have been derived from Computer_id alone. This is a violation of 2nd normal form.
So what should we do to remove partial dependency? We will divide the table and make a separate table for the Computer_id.
We can reference Computer_id for each employee using Employee_id as a foreign key in the Employee_Computer table. Now our table is in 2nd normal form.
3rd Normal Form
To satisfy 3rd normal form:
- The table should be in 2nd normal form
- There should be no transitive dependency between prime attributes and non-prime attributes
Understanding transitive dependency: If A is a prime attribute and B and C are non-prime attributes and B depends on A and C depends on B then it means C also depends on A. This is called the transitive dependency
In simple words, 3NF says that any non-prime attribute should not depend on any other non-prime attribute and all the non-prime attributes should only be functionally dependent on prime-attribute
In the Employees table Project_code, Project_name, and project_manager are all non-prime attributes. Now if you look carefully the Project_name and the Project_manager are functionally dependent on Project_code. We can uniquely identify and find the project name and project manager from the project code. This is the violation of the third normal form.
So How can we apply the third normal form? We will again divide the table into two separate smaller tables.
We will remove all the attributes related to the project from the Employee table and make a new table for it. And we will use the Project_code as a foreign key in the Employees table to make relation with the project table
Now original Employees table has been decomposed into 3 separate tables. Now we can say our company database is normalized. Great progress :)
Boyce and Codd Normal Form (BCNF)
Boyce Codd normal form says that:
- The table should be in the 3rd normal form
- And, for any dependency A → B, A should be a super key.
Boyce Codd is the stricter version of 3rd normal form. It is also called 3.5 NF. Let’s cover what we have learned so far:
- The 2nd normal form says that any non-prime attribute should be functionally dependent on the whole primary key and not partial key.
- The 3rd normal form says that any non-prime attribute should not be dependent on any other non-prime attribute.
- BCNF says that any prime attribute should also not be dependent on the non-prime attributes.
The Employee table does not violate the BCNF. So we look at another example to understand this.
Let’s assume we have a course enrollment table where we are storing Student_id, the Course in which students are enrolled, and the Instructor of that course. Students can take more than one course and each course can have more than one instructor but each instructor can teach only one course.
Now in this table, Student_id and Course name are combined to form the composite primary key and using it we can identify all the other attributes. In this case, there is only one other attribute which is Professor.
Now Professor which is a non-prime attribute can be used to identify course attribute which is the prime attribute. Now this table satisfies all the normal forms we have learned so far:
- This table satisfies the first normal form because each cell has a single value and there are no repeating groups.
- This table also satisfies the 2nd normal form because there is no partial dependency between prime attributes and non-prime attributes.
- This table also satisfies the 3rd normal form since there is no transitive dependency between prime attributes and non-prime attributes.
But this table violates the BCNF. Because BCNF states that a prime attribute should not be dependent upon the non-prime attribute.
So how we can apply BCNF to solve this problem? You guessed it… we will divide the table.
Now we have one table which stores Professor_id, Professor_name, and the Course which they are assigned and another table for students which stores Student_id and Professor_id.
4th Normal Form
To Satisfy the fourth normal:
- The table should be in Boyce Codd normal form (BCNF)
- There should be no multi-valued dependency
In 2nd normal form, we removed the partial dependency. In 3rd normal form, we removed the transitive dependency. And now in 4th normal form, we have to check for multi-valued dependency. So let’s first look at what is the multi-valued dependency
Multi-valued Dependency: There are three conditions for multi-valued dependency
- In the relation A->B, for each value of A, there should be more than one value of B
- There should be at least three attributes
- And for the table with A, B, and C columns, B and C should be independent of each other. And both should be dependent on A.
Let’s look at an example to understand it better. Suppose we have a table for courses where we are storing the course name, professor’s name, and the book that is recommended in the course syllabus. Each course can have more than one instructor and each course can also have more than one book.
Note: We can also store for the Course_id, Professor_id, and Book_id instead of the names of these attributes. I have used the name for the ease of understanding
Let’s see if this table has a multi-valued dependency:
- There are 3 attributes
- For each value, of course, there is more than one value of the Professor and more than one value of the Book.
- The Professor and Book attributes are independent of each other but both are dependent on the Course attribute.
All the conditions are met. So we can say this table has a multi-valued dependency.
So why is the multi-valued dependency bad? Let’s see
For each value, of Course, we have more than one value of the book and more than one value of the professor and we have to store every combination of data, there is too much repetition.
For a single course ‘Database’, we have 4 records. This will get even messier if we increase the number of professors or books. Even further, let’s suppose we have an additional column of Room_no which stores in which classroom the lecture for the particular course is scheduled. For each course, there will be more than one Room_no, and Room_no is independent of Professor and Book. There will be a hell of a data. You see the problem!
This is where 4th normal form comes to help. It says that there should be no multi-valued dependency.
So how do we solve this problem? You know the pattern…divide the table.
We divided the table into two separate tables one will keep the record of courses and instructors and the other will keep the record of courses and books.
5th Normal Form (Project join normal form)
To satisfy 5th normal form:
- The table should be in 4th normal form
- It should not have the join dependency
What is join dependency? It is a new villain. Join dependency means If we have table A and if we decompose it into 2 or more tables then when we combine these smaller tables we should have the same original table without losing any information.
Confused? Let me make it simple for you.
In simple words, the 5th normal form is satisfied when a table can not be broken down into smaller tables without losing information.
Let’s look at an example to understand this.
We have a table Supplier_Customer_Product, which stores the name of the supplier, the product they sell, and the name of the customer who buys those products.
We have a composite primary key which is made up of all three attributes. We can divide this table into 3 separate tables
From our original Supplier_Customer_Product table, we can see that Bosch supplies S3 Batteries to Toyota, but from the newly created tables, we can not infer this.
Here is why!
Bosch supplies multiple products (S3 Batteries and Glow Plugs), and Bosch also supplies these products to multiple customers (Toyota and Volkswagen) and Toyota also buys multiple products (S3 Batteries and High Pressure Pipes). So we can not identify which supplier supplies which product to which customer. Information is lost.
This means our original table satisfies the 5th normal form, which says a table can not be further broken down into sub tables without losing information.
That’s it! We covered what is normalization and how can we use it to minimize data redundancy and therefore reduce data anomalies. We looked at first, second, third, Boyce Codd, 4th, and 5th normal form and used these rules to normalize our tables.
I hope you enjoyed this article!
If you have any query, or a feedback you can write in the comments section.
You can also reach out to me on linkedin