In this post I am going to talk about normalization. No matter at what is the professional age of a software engineer he has to know about database concepts and normalization is one if those important concept. And when it comes to giving interview, I bet every software engineer search internet to revise the concepts regarding normalization and same is the case with me.
So decide to write a post on normalization and dedicate it people who want to give interview. So what is normalization?
Normalization – It is the method of breaking down complex table structures in to simple table structures by using certain rules in order to reduce redundancy in a table and eliminate inconsistency problems and disk space usage.
OR
It is a process of elimination of duplication from the set of data and transforms it into specifically identifiable objects.
Benefits from normalization – following are the benefits of normalizing database.
- Avoiding repetitive entries
- Reducing required storage space
- Preventing the need to restructure existing tables to accommodate new data.
- Increased speed and flexibility of queries, sorts, and summaries
There are five stages of normalization from 1st NF to 5th NF. It has been seen that 1st and 2nd normal forms are used extensively and 3rd normal form is used only when it is needed. 4th and 5th normal are generally for academics and as far as there usage is concern they cause performance problems.
First normal form (1st NF) – A table is in first normal form (1NF) if all underlying domains contain atomic values only. So what does atomic values means?
Atomic Values – In most general sense, ‘Atomic’ means anything which has one value. It means each column in a tuple must have one and only one value, it should not be multi valued. To understand this lets look into following example:
Example # 1
Table: Customer
CustomerId | Name | Address | PhoneNumber | Postal Code | Area | City | Province |
1 | Ali Raza | ABC Society | 0345-6687675 0300-6421665 0321-4509861 | 74600 | Defence | Islamabad | Sindh |
2 | Asif Fattah | XYZ Society | 0333-2721835 0321-2125367 | 45240 | Gulberg | Haiderabad | Punjab |
|
|
|
|
|
|
|
|
In the above example column PhoneNumber does not have atomic values because it has multiple values and because of this column, Customer table is not in the 1st NF. To make this table satisfy 1st NF PhoneNumber column must be placed in a separate table and reference it via foreign key, like I have done in the following example:
Table: Customer
CustomerId | First Name | Last Name | Address | Postal Code | Area | City | Province |
1 | Ali | Raza | ABC Society | 74600 | Defence | Islamabad | Sindh |
2 | Asif | Fattah | XYZ Society | 45240 | Gulberg | Haiderabad | Punjab |
Table: CustomerPhoneNo
CustomerId | PhoneNumber |
2 | 0333-2721835 |
2 | 0321-2125367 |
1 | 0321-4509861 |
1 | 0300-6421665 |
1 | 0345-6687675 |
Example # 2
Table: Meeting
MeetingId | Date | Attendee |
123 | 1/2/2009 | Ali, Asif, Asad |
In the above example Attendees is multivalve column. To apply 1st NF on this table Attendee column should move in a separate table like I’ve done below:
Table: Meeting
MeetingId | Date |
123 | 1/2/2009 |
Table: MeetingAttendee
MeetingId | Attendee |
123 | Ali |
123 | Asif |
123 | Asad |
If we have a look of above example it can be figured out that each tuple (row) in Meeting table can have zero/one or many rows in MeetingAttendee table against it and in MeetingAttendee table one or many rows must have one and only one tuple (row) in Meeting table against it.
So, we can say that 1st NF requires to put multi valued column in separate and connect them by making one-to-many relationship.
Second normal form (2nd NF) – An entity is in second normal form (2nd NF) if it fulfill the requirement of 1st NF + (entity who does not have combinatorial key or composite key, all none key elements should be fully dependent on primary key)
Combinatorial Key or Composite Key – key consisted on more than one field is called combinatorial key or Composite Key.
Example # 1
Table: Student
StudentId (pk) | CourseId (pk) | CourseName | StudentName | Address | PhoneNumber |
1 | 1 | C++ | Ali | ABC Society | 0345-6687675 0300-6421665 0321-4509861 |
2 | 2 | Applied Calculus | Asif | XYZ Society | 0333-2721835 0321-2125367 |
In the above example studentId and CourseId both are primary keys means they are making combinatorial key. It is obvious from the Student table that not all the fields are completely dependent on StudentId or CourseId. We can figure out that there are two entities stored in one table. To make the Student table satisfy 2nd NF we need to break combinatorial key and put the information dependent on CourseId field in another table let’s say in Course table.
Table: Student
StudentId (pk) | CourseId (fk) | StudentName | Address | PhoneNumber |
1 | 1 | Ali | ABC Society | 0345-6687675 0300-6421665 0321-4509861 |
2 | 2 | Asif | XYZ Society | 0333-2721835 0321-2125367 |
Table: Course
CourseId (pk) | CourseName |
1 | C++ |
2 | Applied Calculus |
Table: CustomerPhoneNo
CustomerId(fk) | PhoneNumber |
2 | 0333-2721835 |
2 | 0321-2125367 |
1 | 0321-4509861 |
1 | 0300-6421665 |
1 | 0345-6687675 |
So we can say that 2nd NF requires putting multi valued attributes in separate table.
Third normal form (3rd NF) – An entity is in third normal form (3rd NF) if it fulfill the requirement of 2nd NF + (all non key element must be functionally dependent on primary key)
Functionally dependent – it means all non key elements must be dependent on primary key.
Example # 1
Table: Customer
CustomerId | Name | Address | Postal Code | Area | City | Province |
1 | Ali | ABC Society | 74600 | Nazimabad | Karachi | Sindh |
2 | Asif | XYZ Society | 45240 | Sanda | Lahore | Punjab |
3 | John | DEF society | 74600 | North Nazimabad | Karachi | Sindh |
|
|
|
|
|
|
|
In the above example City and Province are dependent on Postal Code, both of these fields are not dependent on CustomerId (Primary Key), so to apply 3rd NF on this table we can put City and Country in a separate table and connect both tables via Postal Code field.
If we look it in other perspective we are separating our static data and minimizing storage. Area, City and Country is static data and for each postal code it will remain same (hear I am assuming that postal code is unique for each area in the city).
Table: Customer
CustomerId | Name | Address | Postal Code |
1 | Ali | ABC Society | 74600 |
2 | Asif | XYZ Society | 45240 |
3 | John | DEF society | 74600 |
|
|
|
|
Table: PostalCode
Postal Code | Area | City | Province |
74600 | Nazimabad | Karachi | Sindh |
45240 | Sanda | Lahore | Punjab |
|
|
|
|
So we can say that 3rd form of normalization is about moving that data into other table that is functionally dependent on some other field rather than primary key.I have described the process of normalization in the sequence of 1st, 2nd, and 3rd phase, however; the order of these phases is not that important most of the times. The reason is while designing database the entities are already in 1st or 2nd NF. So you don’t need to apply 1st or 2nd NF. So the rule of thumb is that remember the rules of normalization instead of getting confuse with sequence.