Step by step to normalize Database data
- Tram Ho
It’s not that the project is developing a big feature, having to design more tables, the project brothers keep saying “Customers don’t follow DB standardization at all”, even though I still remember my university days, but to be sure of what you’re talking about, do some research on the internet; And now share it with you.”
The concept of Normalization is a technique in DB design, used to design tables of Relational Databases to higher “standards”.
This technique is divided into several levels including: in order to advance to the next level the previous level requirement must be satisfied. For example, we have a data set unnormalized form (this is the lowest level) and we are looking to achieve the highest level of normalization.
Then the first step must first make sure normalized form I is followed, then we further design the DB to satisfy the specific requirements of normalized form II,… and so on until the data set ours reaches the sixth normalized form.
NOTE
The fourth normalized form and above are only theoretical in nature, the problems that these forms aim to solve often “rare” appear in practice.
Initialization data
Here is an example of an initialized data set – not normalized.
Title | Author | Author Nationality | Format | Price | Subject | Pages | Thickness | Publisher | Publisher Country | Publication Type | Genre ID | Genre Name |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Beginning MySQL Database Design and Optimization | Chad Russell | American | Hardcover | 49.99 | MySQL,Database,Design | 520 | Thick | Apress | USA | E-book | 1 | Tutorial |
We assume that 1 book belongs to only 1 author
1NF – Normalized Form I
To satisfy 1NF, the value in each column needs to be a single value.
Looking at the table above, the Subject column is containing more than one value, causing this table to not satisfy 1NF.
One way to achieve 1NF in this case is to split those values into many other Subject columns.
Title | Format | Author | Author Nationality | Price | Subject 1 | Subject 2 | Subject 3 | Pages | Thickness | Publisher | Publisher country | Genre ID | Genre Name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Beginning MySQL Database Design and Optimization | Hardcover | Chad Russell | American | 49.99 | MySQL | Database | Design | 520 | Thick | Apress | USA | 1 | Tutorial |
The above results have met the standard 1NF but there is a problem. If a book is on more than 3 topics, we won’t be able to add that new topic to the table without adding another column to the table.
To solve this problem more subtile, it is necessary to first identify the objects that appear in the table and divide them into corresponding tables. With this topic, 3 tables will be: Book, Subject and Publisher
Title | Format | Author | Author Nationality | Price | Pages | Thickness | Genre ID | Genre Name | Publisher ID |
---|---|---|---|---|---|---|---|---|---|
Beginning MySQL Database Design and Optimization | Hardcover | Chad Russell | American | 49.99 | 520 | Thick | 1 | Tutorial | 1 |
Publisher Table
Publisher_ID | Name | Country |
---|---|---|
1 | Apress | USA |
Bảng Subject
Subject ID | Subject name |
---|---|
1 | MySQL |
2 | Database |
3 | Design |
Obviously, splitting the table like this will make the data no longer linked. Which means we will have to somehow represent the relationship between these tables.
You notice that the Publisher ID column in the Books Table is the foreign key determining the one – many relationship between the book and the publisher.
On the other hand, a book can belong to many different subjects, just as a subject can belong to many book rights. Therefore, we need a many-to-many relation represented in the intermediate table below:
Table Title – Subject
Title | Subject ID |
---|---|
Beginning MySQL Database Design and Optimization | 1 |
Beginning MySQL Database Design and Optimization | 2 |
Beginning MySQL Database Design and Optimization | 3 |
Up to this point, we have a total of 4 tables detached from the first initialization table – just to comply with the 1NF standard
Satisfying 2NF
To identify a book, for the table below we need to use the Composite Key of {Title, Format}
Title | Format | Author | Author Nationality | Price | Pages | Thickness | Genre ID | Genre Name | Publisher ID |
---|---|---|---|---|---|---|---|---|---|
Beginning MySQL Database Design and Optimization | Hardcover | Chad Russell | American | 49.99 | 520 | Thick | 1 | Tutorial | 1 |
Beginning MySQL Database Design and Optimization | E-book | Chad Russell | American | 22.34 | 520 | Thick | 1 | Tutorial | 1 |
The Relational Model for Database Management: Version 2 | E-book | E.F.Codd | British | 13.88 | 538 | Thick | 2 | Popular science | 2 |
The Relational Model for Database Management: Version 2 | Paperback | E.F.Codd | British | 39.99 | 538 | Thick | 2 | Popular science | 2 |
We see that the columns that are not in the composite key {Title, Format} all depend on Title, each column Price depends on both Title and Format.
To satisfy the 2NF criterion, we need to ensure that no attribute (column) depends only 1 part on the column key. In other words, in the 2NF standard, every attribute must depend completely on the coherent key.
For the current problem, we will split up a table so that the Title column becomes the primary key, all the remaining attributes (columns) depend entirely on it; and the Price column splits into another table where it depends on the Format column (and of course the Title)
Table Books
Title | Author | Author Nationality | Pages | Thickness | Genre ID | Genre Name | Publisher ID |
---|---|---|---|---|---|---|---|
Beginning MySQL Database Design and Optimization | Chad Russell | American | 520 | Thick | 1 | Tutorial | 1 |
The Relational Model for Database Management: Version 2 | E.F.Codd | British | 538 | Thick | 2 | Popular science | 2 |
The Books table is now standard 2NF
Table Format – Price
Title | Format | Price |
---|---|---|
Beginning MySQL Database Design and Optimization | Hardcover | 49.99 |
Beginning MySQL Database Design and Optimization | E-book | 22.34 |
The Relational Model for Database Management: Version 2 | E-book | 13.88 |
The Relational Model for Database Management: Version 2 | Paperback | 39.99 |
The Formats – Price table is still not up to the 2NF standard..
Satisfying 3NF
The Books table is still experiencing a phenomenon called: bắctransitional dependency, specifically: {Author Nationality} depends on {Author}, and {Author} depends on {Title}.
Its downside is: if the {Author} value is changed, then we will have to find and change all the corresponding {Author Nationality} in other rows (if any) in the table.
Same thing happens between {Genre Name} {Genre ID} and {Title}
We say that the Books table is not up to the 3NF standard now.
To comply with this standard, we will redesign as follows: Split Books into 3 tables with 2 new tables respectively Authors and Genres.
Table Books
Title | Author | Pages | Thickness | Genre ID | Publisher ID |
---|---|---|---|---|---|
Beginning MySQL Database Design and Optimization | Chad Russell | 520 | Thick | 1 | 1 |
The Relational Model for Database Management: Version 2 | E.F.Codd | 538 | Thick | 2 | 2 |
Table Authors
Author | Author Nationality |
---|---|
Chad Russell | American |
E.F.Codd | British |
Table Genres
Genre ID | Genre Name |
---|---|
1 | Tutorial |
2 | Popular science |
4NF
First, to meet this standard, you need to meet the BCNF standard, that is, every attribute (column) must be a member of the composite key; or simply having a column that is not part of the composite key is BCNF false.
Next, the 4NF standard requires that inside a relational table NO can have more than 1 many-valued relations..
Take a look at the following example:
Assume these Books are owned by Franchisees (Enterprises) based in many States in the USA. Now we need to add a new table describing where a book is present. Assumption: every Franchisee has books at all its headquarters
Table Franchisee – Book – Location
Franchisee ID | Title | Location |
---|---|---|
1 | Beginning MySQL Database Design and Optimization | California |
1 | Beginning MySQL Database Design and Optimization | Florida |
1 | Beginning MySQL Database Design and Optimization | Texas |
1 | The Relational Model for Database Management: Version 2 | California |
1 | The Relational Model for Database Management: Version 2 | Florida |
1 | The Relational Model for Database Management: Version 2 | Texas |
2 | Beginning MySQL Database Design and Optimization | California |
2 | Beginning MySQL Database Design and Optimization | Florida |
2 | Beginning MySQL Database Design and Optimization | Texas |
2 | The Relational Model for Database Management: Version 2 | California |
2 | The Relational Model for Database Management: Version 2 | Florida |
2 | The Relational Model for Database Management: Version 2 | Texas |
3 | Beginning MySQL Database Design and Optimization | Texas |
We see two relationships here:
- Franchisee ->> Location(1 Franchisee has many headquarters)
- Franchisee ->> Book (1 Franchisee sells many books)
At this point, we can say that this table has 2 many-valued relations.
To comply with 4NF ta must be redesigned so that there is only one value-to-value relation in a table. We separate as follows:
Table Franchisee – Book
Franchisee ID | Title |
---|---|
1 | Beginning MySQL Database Design and Optimization |
1 | The Relational Model for Database Management: Version 2 |
2 | Beginning MySQL Database Design and Optimization |
2 | The Relational Model for Database Management: Version 2 |
3 | Beginning MySQL Database Design and Optimization |
Table Franchisee – Location
Franchisee ID | Location |
---|---|
1 | California |
1 | Florida |
1 | Texas |
2 | California |
2 | Florida |
2 | Texas |
3 | Texas |
Summary
The rest of the standards are many, but the programmers I know here are probably OK. If you want to read more carefully the original document, please see the sources below for more:
Reference source
https://en.wikipedia.org/wiki/Database_normalization;
https://www.geeksforgeeks.org/introduction-of-4th-and-5th-normal-form-in-dbms/;