Database Design - Normalization of Relational Databases
Database Design
Overview
Introduction
What are the normal forms in relational database design theory?
-
Guidelines for how to design records.
-
They are general in nature, and apply to any relational database system.
-
They are designed to prevent rendundancy, ambiguity, anomalies, and data inconsistencies.
-
They tend to penalize retrieval (i.e. reading with
SELECT
statements), since data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records using a join in the normalized form. -
There is no obligation to fully normalize all records when actual performance requirements are taken into account.
Redundancy
One of the goals of normalization is to avoid data redundancy - the repetition of facts in multiple places within the data.
An example table to hold students
’ grades’, with lots of redundancy.
id | name | assessment_title | grade | |
---|---|---|---|---|
1 | John I. Rivero | JohnIRivero@jourrapide.com | Quiz #1 | 95 |
2 | John I. Rivero | JohnIRivero@jourrapide.com | Quiz #2 | 78 |
3 | John I. Rivero | JohnIRivero@jourrapide.com | Midterm Exam | 82 |
4 | Susan C. Anthony | AnthonySusan@bmoney.com | Quiz #2 | 64 |
… | … | … | … | … |
Anomalies
Another goal is to avoid anomalies, which come in three types.
-
Insertion anomalies
-
Update anomalies
-
Deletion anomalies
Insertion anomalies
Insertion anomalies occur when we are not able to insert certain attributes in the database without the presence of other attributes.
- For example, if we wanted to add a new student to our records, but couldn’t do so because they hadn’t yet taken any assessments, and we had mistakenly made the
assessment_title
andgrade
fieldsNOT NULL
.
id | name | assessment_title | grade | |
---|---|---|---|---|
1 | John I. Rivero | JohnIRivero@jourrapide.com | Quiz #1 | 95 |
2 | John I. Rivero | JohnIRivero@jourrapide.com | Quiz #2 | 78 |
3 | John I. Rivero | JohnIRivero@jourrapide.com | Midterm Exam | 82 |
4 | Susan C. Anthony | AnthonySusan@bmoney.com | Quiz #2 | 64 |
… | … | … | … | … |
Update anomalies
Update anomalies occur when a correct update of a record requires other actions, such as addition, deletion or both, in order to retain data integrity.
- For example, if we wanted to change
John I. Rivero
’s email address, but it requires us to update multiple records.
id | name | assessment_title | grade | |
---|---|---|---|---|
1 | John I. Rivero | JohnIRivero@jourrapide.com | Quiz #1 | 95 |
2 | John I. Rivero | JohnIRivero@jourrapide.com | Quiz #2 | 78 |
3 | John I. Rivero | JohnIRivero@jourrapide.com | Midterm Exam | 82 |
4 | Susan C. Anthony | AnthonySusan@bmoney.com | Quiz #2 | 64 |
… | … | … | … | … |
Deletion anomalies
Deletion anomalies occur when you delete a record, but because of the design of the tables, you accidentally delete information you shouldn’t have.
- For example, if
Helen C. Gonzalez
has only taken one assessment, Quiz #1, but we decide to drop that grade. Deleting that record would remove her email address entirely from our student data.
id | name | assessment_title | grade | |
---|---|---|---|---|
4 | Susan C. Anthony | AnthonySusan@bmoney.com | Quiz #2 | 64 |
5 | Mary G. Dickinson | MaryGDickinson@jourrapide.com | Quiz #1 | 95 |
6 | JSandra B. Kile | SandraBKile@teleworm.us | Quiz #2 | 78 |
7 | Helen C. Gonzalez | HelenCGonzalez@teleworm.us | Quiz #1 | 82 |
… | … | … | … | … |
First Normal Form
Introduction
First normal form deals with the shape of a record type.
-
All records in a table must contain the same number of fields.
-
In other words, all tables in relational database systems have a fixed schema.
-
A fixed schema is generally a requirement of modern relational database systems, and requires no extra work.
-
All values in a given field should also be singular values.
Second Normal Form
Introduction
Second and third normal forms both deal with the relationship between non-key and key fields.
-
Each record in second and third normal forms must satisfy first normal form.
-
A non-key field must provide a fact about the entity uniquely identified by the primary key.
-
It is not allowed for a non-key field to provide a fact about only a part of that entity or about some other unrelated entity.
-
The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee.
Applicability
Second normal form only applies to tables whose primary key is composed of two or more fields.
CREATE TABLE (
part TEXT NOT NULL
warehouse TEXT NOT NULL
quantity INTEGER
warehouse-address TEXT
PRIMARY KEY (part, warehouse)
)
-
While today it is possible to create such composite keys, it is increasingly uncommon, in preference for singular surrogate key fields containing an auto-incrementing arbitrary integer.
-
Nevertheless, we will explore it.
Example
Take, for example, the following table showing parts inventories in various warehouses:
part | warehouse | quantity | warehouse-address |
---|---|---|---|
Baby Bed Crib Screws Hardware Replacement Kit, cSeao 25-Set | Avenel, NJ | 2441 | 275 Omar Ave, Avenel, NJ 07001 |
Prime-Line N 7534 Bi-Fold Door Hardware Repair Kit | Florence, NJ | 1121 | 309 Cedar Ln, Florence, NJ 08518 |
HIMIKI Tailgate Hardware Rebuild Kit w/Handle Bezel Latch Cable | Avenel, NJ | 3567 | 275 Omar Ave, Avenel, NJ 07001 |
… | … | … | … |
The composite primary key is composed of part and warehouse, meaning that the combination of those two fields is guaranteed to be unique for each record.
Problems
This example does not meet the requirements of second normal form.
-
The field
warehouse-address
is a fact about thewarehouse
only, not a fact about thepart
/warehouse
combined entity that this table is about. -
This is unfortunate, since it requires the address of each warehouse to be repeated however many times there are parts in that warehouse.
-
This data redundancy makes maintaining the data difficult. Updating a warehouses address would have to be done across many records, not just in a singular place.
-
If there were no parts stored in a given warehouse, there would be nowhere in the database to store the address of that warehouse.
Solutions
In order to normalize these records, we could easily split the data into two tables.
One table for parts
:
part | warehouse_id | quantity |
---|---|---|
Baby Bed Crib Screws Hardware Replacement Kit, cSeao 25-Set | 1 | 2441 |
Prime-Line N 7534 Bi-Fold Door Hardware Repair Kit | 2 | 1121 |
HIMIKI Tailgate Hardware Rebuild Kit w/Handle Bezel Latch Cable | 1 | 3567 |
… | … | … |
And another for warehouses
:
id | address |
---|---|
1 | 275 Omar Ave, Avenel, NJ 07001 |
2 | 309 Cedar Ln, Florence, NJ 08518 |
… | … |
Third Normal Form
Introduction
As with second normal form, third normal forms deals with the relationship between non-key and key fields.
-
Records in third normal form must satisfy second normal form.
-
Third normal form is violated when a non-key field is a fact about another non-key field.
-
Whereas second normal form is only applicable to tables with composite primary keys, third normal form applies to all tables.
Example
Take, for example, a table about employees
at a company.
id | employee | department | location |
---|---|---|---|
1 | Henry K. Brinkman | Accounting | Fort Myers, FL |
2 | Darlene R. Gonzalez | Marketing | Jackson, MS |
3 | Abigail W. Wagner | Sales | Pleasantville, NJ |
4 | Norella T. Walker | Accounting | Fort Myers, FL |
… | … | … | … |
- Let’s assume that the
location
is dependent upon thedepartment
, with each department having a different location.
Problems
This example does not meet the requirements of third normal form.
-
The
id
field is the primary key and represents an employee. If each department is located in one place, then thelocation
field is a fact about thedepartment
, and not a fact about the employee. -
The department’s location is repeated in the record of every employee assigned to that department. If the location of the department changes, every such record must be updated.
-
Because of the redundancy, the data might become inconsistent, with different records showing different locations for the same department.
-
If a department has no employees, there may be no record in which to keep the department’s location.
Solution
As with violations of second normal form, the solution to a violation of third normal form is typically to split the data into multiple tables.
One table for employees
:
id | employee | department_id |
---|---|---|
1 | Henry K. Brinkman | 1 |
2 | Darlene R. Gonzalez | 2 |
3 | Abigail W. Wagner | 3 |
4 | Norella T. Walker | 1 |
… | … | … |
And another for departments
:
id | department | location |
---|---|---|
1 | Accounting | Fort Myers, FL |
2 | Marketing | Jackson, MS |
3 | Sales | Pleasantville, NJ |
… | … | … |
Fourth Normal Form
Introduction
Fourth normal form is concerned with multi-valued facts, which we’ll show by example. In order for a record to meet fourth normal form, it must:
-
satisfy the requirements of third normal form.
-
not contain more than one independent multi-valued fact about an entity.
Example
For example, consider a situation where we intend to store employee’s skills and foreign language abilities.
A single employee who has multiple skills and/or multiple languages (two independent multi-valued facts about them) might [erroneously] be represented with two or more independent multi-valued fact fields.
id | employee | skill | language |
---|---|---|---|
1 | Henry K. Brinkman | cook | |
2 | Henry K. Brinkman | type | |
3 | Henry K. Brinkman | French | |
4 | Henry K. Brinkman | German | |
5 | Henry K. Brinkman | Greek | |
6 | Norella T. Walker | type | |
… | … | … | … |
- In this representation, in addition to data redundancy, there is ambiguity in the meaning of the null values - does the employee lack those abilities, are they not applicable, or are they unknown?
Another Possibility
That same data, with two or more multi-value fact fields, might be represented a few different ways, including:
id | employee | skill | language |
---|---|---|---|
1 | Henry K. Brinkman | cook | French |
2 | Henry K. Brinkman | type | German |
3 | Henry K. Brinkman | type | Greek |
4 | Norella T. Walker | type | None |
… | … | … | … |
-
In this representation, we have removed the null values, but we still have redundancy and therefore difficulty maintaining data, and what to do with employees who either have no language or no skill?
-
Note that the
skill
andlanguage
fields are said in our description of the data to be independent. Thus, this model is forbidden by the fourth normal form. -
However, if a skill was dependent upon a specific language, this model would be allowed by the fourth normal form.
Solution
The solution to remove redundancy, ambiguity, and anomalies, as with previous normal forms, is to split the data up into multiple tables.
One for employees
:
id | employee |
---|---|
1 | Henry K. Brinkman |
2 | Darlene R. Gonzalez |
3 | Abigail W. Wagner |
4 | Norella T. Walker |
… | … |
Solution (continued)
The solution, as with previous normal forms, is to split the data up into multiple tables.
Another for employees_skills
:
employee_id | skill |
---|---|
1 | cook |
1 | type |
2 | type |
4 | type |
… | … |
- Note, here we use a composite primary key to make sure the employee-skill combinations are unique.
Solution (continued again)
The solution, as with previous normal forms, is to split the data up into multiple tables.
And a third for employees_languages
:
employee_id | language |
---|---|
1 | French |
1 | German |
1 | Greek |
… | … |
- Note, here we use a composite primary key to make sure the employee-language combinations are unique.
Conclusions
Thank you. Bye.