Chapter-4
DATABASE DESIGN THEORY AND
NORMALIZATION
Outline
Well structure relation
In logical database design, a process called normalization is
used to produce a data model that has the properties of simplicity,
non-redundancy and minimal maintenance.
When designing a database, the analyst should aim to design
well-structured relations/tables.
Cont’d...
A well-structured relation contains a minimum amount of
redundancy and allows users to insert, update and delete the
rows in a table without errors or inconsistencies.
It will be designed by using a technique called normalization.
Normalization is the process of decomposing relations with
anomalies to produce smaller, well-structured relations.
Basic concepts of normalization
A technique of organizing a data into multiple related tables to
minimize data redundancy.
Through normalization we want to design for our relational
database a set of files that
Have as little redundancy as possible,
Permit efficient updates of the data in the database, and
Avoid the danger of losing data unknowingly.
Data redundancy
What is data redundancy? And why should we reduce it?
Table
Row-1 X
X
Row-2
X
Row-3 X
Row-4
In a relation if a fact is stored more than once ,there is data
redundancy.
Cont’d…
Data redundancy also leads to other issues like:
Insertion , Deletion and Updation problems
Student_table
ID Stu-name Dept Hod Phone-no
Ababa CS Mr. x 0918502888
AAU01
Yadiel CS Mr. x 0918502888
AAU02
John CS Mr. x 0918502888
AAU03
Jonatan CS Mr. x 0918502888
AAU04
Cont’d…
Repetition of data increases the size of databases and leads to
other more issues such as
Insertion anomalies
Deletion anomalies and
Updation anomalies
These types of error can creep into databases that are
insufficiently normalized.
Insertion anomalies
An "Insertion Anomaly" is a failure to place information about a
new database entry into all the places in the database where
information about that new entry needs to be stored.
In a properly normalized database, information about a new entry
needs to be inserted into only one place in the database.
In an inadequately normalized database, information about a new
entry may need to be inserted into more than one place.
Deletion anomalies
A "Deletion Anomaly" is a failure to remove information about
an existing database entry when it is time to remove that entry.
In an inadequately normalized database, information about that
old entry may need to be deleted from more than one place, and,
human fallibility being what it is, some of the needed additional
deletions may be missed.
Loss of related data when some other data is deleted
Updation anomalies
Modification Anomaly: changing data in a row forces
changes to other rows because of duplication.
Leads to data inconsistency .
All three kinds of anomalies are highly undesirable.
The main purpose of normalization is to reduce the chances
for anomalies to occur in a database
Purpose of normalization
The main goal of database normalization is to restructure the
logical data model of a database to:
Eliminate redundancy
Organize data efficiently
Reduce the potential for data anomalies
Cont’d…
There are well-accepted principles and rules that are used to
normalize a data model.
Types of normalizations are :
First normal form
Second normal form
Third normal form, BCNF.
A databases must be normalize to at least the level of the 3rd Normal
Form (3 NF).
First normal form(1NF)
oA table (relation) is in 1NF if
There are no duplicated rows in the table.
Each cell is single-valued (i.e., there are no repeating
groups).
Entries in a column (attribute, field) are of the same
kind.
Each column should have a unique name
Example: Look the following Employee table below
Cont’d…
Second Normal Form (2NF)
A table is in 2NF if it is in 1NF and if all non-key attributes
are dependent on all of the key attributes.
FD = x y
If t1.x = t2.x then t1.y = t2.y
Functional dependency
Functional dependency is a relationship that exists when one
attribute uniquely determines another attributes.
For a relation R, attribute B is functionally dependent on attribute
A if for every valid instance of A, that value of A uniquely defines
the value of B.
If B is functionally dependent on A we write A B
The attribute on the left-hand side of the arrow is called a
determinant.
Cont’d...
An attribute may be functionally dependent on two or more attributes
Example
The relation EMP_COURSE (Emp_ID, Course Title,
Date_Completed) can have its functional dependencies described as:
Emp_ID, Course_Title Date_Completed
i.e., the date of course completed is completely determined by the identity of
the employee and the title of the course
Cont’d...
Other examples:
VIN Make, Model, Colour, i.e. the make, model and colour
of a vehicle are functionally dependent on the vehicle
identification number
ISBN Title, First_Author_Name, i.e. the title of a book and
the name of the first author are functionally dependent on the
book’s ISBN
Second normal form (2NF)
2NF = already in 1NF and every non key attribute is fully
functionally dependent on the primary key.
If 1NF will be in 2NF if any one of the following conditions
applies:
Cont’d...
The primary key consists of only one attribute(such as
Emp_ID in Employee)
thus all of the attributes in the relation are components of the
primary key
Every non key attribute is functionally dependent on the full
set of primary key attributes
Cont’d...
Employee is an example of a relation that is not in 2NF.
The primary key for this relation is the composite key Emp_ID
and Course_Title.
Therefore the non key attributes Name, Dept_Name and Salary
are functionally dependent on part of the primary key (Emp_ID)
but not on Course_Title.
Functional Dependencies in Employee
Dependency on entire primary key
EmpID CourseTitle Name DeptName Salary Date_Completed
Dependency on only part of the key
EmpID, CourseTitle DateCompleted
EmpID Name, DeptName, Salary
Therefore, NOT in 2nd Normal Form
Partial dependency
Partial functional dependency is a functional dependency in
which one or more nonkey attributes (such as Name) are
functionally dependent on part (but not all) of the primary key
In Employee this creates redundancy in that relation, which
results in anomalies when that table is updated (as noticed
previously: Insertion, Deletion and Modification anomalies)
Converting to 2NF
To do this we decompose the relation into new relations that
satisfy one (or more) of the conditions described above.
Employee relation is decomposed into the following 2 relations:
EMPLOYEE (Emp_ID, Name, Dept_Name, Salary). This
satisfies the first condition above and is in 2NF
EMP_COURSE(Emp_ID, Course_Title, Date_Completed)
This satisfies the third property above and is also in 2NF.
Cont’d…
Decomposed into two separate relations
EmpID Name DeptName Salary
EmpID CourseTitle DateCompleted
Both are full functional dependencies
Third normal form (3NF)
3NF = already in 2NF and no transitive dependencies exist.
Transitive dependency = a functional dependency between two (or more)
nonkey attributes.
E,g. SALES(Cust_ID, Name, Salesperson, Region) – Cust_ID the
primary key so that all the remaining attributes are functionally dependent
on this
However there is a transitive dependency, as Region is dependent on
Salesperson and Salesperson is dependent on Cust_ID
As a result there would be update anomalies for this .
Relation with transitive dependency
Relation with transitive dependency
CustID Name
CustID Salesperson BUT
CustID Region
CustID Salesperson Region
All this is OK Transitive dependency
(2nd NF) (not 3rd NF)
Anomalies
Insertion anomaly because a new salesperson assigned to the
‘North Region’ cannot be entered until a customer has been
assigned, because a value for Cust_ID must be provided to insert a
row in the table
Deletion anomaly because if customer 6837 is deleted from the
table, we lose the information that salesperson Hernandez is
assigned to the ‘East Region’
Modification anomaly because if salesperson Smith is reassigned
to the ‘East Region’, several rows must be changed to reflect that
fact.
Removing transitive
dependencies
This can be done by decomposing SALES into two relations
(see following table.)
Salesperson, which is the determinant in the transitive
dependency in SALES, becomes the primary key in Sperson
Salesperson becomes a foreign key in SALES1
Also uses less storage space because the dependent data (such
as Region) do not have to repeat for each customer
Decomposing the SALES relation
Relations in 3NF
Salesperson Region
CustID Name
CustID Salesperson
Now, there are no transitive dependencies, both relations are
in 3rd NF
BCNF
Boyce–Codd normal form (or BCNF or 3.5NF) is a normal form
used in database normalization.
It is a slightly stronger version of the third normal form (3NF).
Rules for BCNF
For a table to satisfy the Boyce-Codd Normal Form, it should satisfy
the following two conditions:
It should be in the Third Normal Form.
And, for any dependency A → B, A should be a super key.
it means, that for a dependency A → B, A cannot be a non-prime
attribute, if B is a prime attribute.
Example
student_id subject professor
101 Java P.Java
101 C++ P.Cpp
102 Java P.Java2
103 C# P.Chash
104 Java P.Java
Student id and subject together form PK. one professor teaches only
one subject, but one subject may have two different professors.
Hence, subject depends on the professor name.
To make this relation(table) satisfy BCNF, we will decompose this table
into two tables, student table and professor table.
Student Table
student_id p_id
And, Professor Table
p_id professor subject
Inference rules(Armstrong’s
Axioms)
The following six rules IR1 through IR6 are well-known inference rules for functional
dependencies:
IR1 (reflexive rule)1: If X ⊇ Y, then X →Y.
IR2 (augmentation rule)2: {X → Y} |=XZ → YZ.
IR3 (transitive rule): {X → Y, Y → Z} |=X → Z.
IR4 (decomposition, or projective rule): {X → YZ} |=X → Y.
IR5 (union, or additive rule): {X → Y, X → Z} |=X → YZ.
IR6 (pseudotransitive rule): {X → Y, WY → Z} |=WX → Z.
4NF
For a table to satisfy the Fourth Normal Form, it should satisfy the following
two conditions:
1. It should be in the Boyce-Codd Normal Form.
2. And, the table should not have any Multi-valued Dependency.
For a dependency A → B, if for a single value of A, multiple value of B
exists, then the table may have multi-valued dependency.
a table should have at-least 3 columns for it to have a multi-valued
dependency.
And, for a relation R(A,B,C), if there is a multi-valued dependency between,
A and B, then B and C should be independent of each other.
If all these conditions are true for any relation(table), it is said to have multi-
valued dependency.
Example
Below we have a college enrolment table with columns s_id, course
and hobby.
student with s_id 1 has opted for two courses, Science and Maths, and
has two hobbies, Cricket and Hockey.
s_id course hobby
1 Science Cricket
1 Maths Hockey
2 C# Cricket
2 Php Hockey
The two records for student with s_id 1, will give rise to two more
records, as shown below, because for one student, two hobbies exists,
hence along with both the courses, these hobbies should be specified.
s_id course hobby
1 Science Cricket
1 Maths Hockey
1 Science Hockey
1 Maths Cricket
CourseOpted Table Hobbies Table
s_id course s_id hobby
1 Science 1 Cricket
1 Maths 1 Hockey
2 C# 2 Cricket
2 Php 2 Hockey
Attribute closure(closure set)
set of attributes which can be functionally determined from it.
Example: R(A,B,C,D,E)
FD: {A → B, B → C , C → D, D → E }
A+ = {ABCDE}
AD + = {ADBEC}
B + = {BCDE}
Equivalence of Sets of Functional
Dependencies
A set of functional dependencies F is said to cover another set of
functional dependencies E if every FD in E is also in F+; that is, if every
dependency in E can be inferred from F; alternatively, we can say that E is
covered by F.
Two sets of functional dependencies E and F are equivalent if E+ = F+.
Therefore, equivalence means that every FD in E can be inferred from F,
and every FD in F can be inferred from E; that is, E is equivalent to F if
both the conditions—E covers F and F covers E—hold.
Minimal Sets(canonical) of
Functional Dependencies
F to be minimal if it satisfies the following conditions:
1. singleton RHS
2. No extraneous(two or more) attribute in LHS
3. Eliminate redundant
Example
R(A,B,C,D,E)
FD = { A → D, BC → AD, C → B, E → A, E → D}
Step 1 singleton RHS
{ A → D, BC → A, BC → D, C → B, E → A, E → D}
Step 2 extraneous attribute in LHS
BC → A, BC → D (eliminate either B,C in the first and B,C in the second)
To eliminate first compute B and C closure using the above FD
{ A → D, BC → A, BC → D, C → B, E → A, E → D}
B+ = B , C+ = CBAD so the C closure already contains the B so we can
eliminate B(this means we can reach to B using the C closure)
Example
{ A → D, BC → A, BC → D, C → B, E → A, E → D} eliminate B
= { A → D, C → A, BC → D, C → B, E → A, E → D}
B+ = B , C+ = CBAD so the C closure already contains the B so we
can eliminate B(this means we can reach to B using the C closure)
once again
{ A → D, C → A, BC → D, C → B, E → A, E → D} eliminate B
= { A → D, C → A, C → D, C → B, E → A, E → D}
Example
FD = { A → D, C → A, C → D, C → B, E → A, E → D}
Step 3: starting from first FD finding the redundant one so we take A→D and if A+
contains the D then we can eliminate other wise we can’t. when computing A + assuming
we eliminate A → D we get = A , we can’t eliminate so this stays and move to the second
one.
{ A → D, C → A, C → D, C → B, E → A, E → D} compute C + assuming you eliminate
C → A , C + = CDB it doesn’t contain A so we need this.
{ A → D, C → A, C → D, C → B, E → A, E → D} , compute C +
assuming you
eliminate C → D, C + = CADB , now we can eliminate C → D
Applying this to all FD we get = { A → D, C → A, C → B, E → A}
Properties of Relational Decompositions
In a database, it breaks the table into multiple tables. If the relation has no proper
decomposition, then it may lead to problems like loss of information. Decomposition
is used to eliminate some of the problems of bad design like anomalies,
inconsistencies, and redundancy.
The properties of a relational decomposition are
1.Attribute Preservation: The attributes in R will appear in at least one relation
schema Ri in the decomposition, i.e., no attribute is lost.
2.Dependency preservation :If a decomposition is not dependency preserving some
dependency is lost in decomposition.
If R decomposed into R1,R2,R3 then all the F1,F2,F3 Union should cover FD in R
Cont’d
3. Non Additive Join Property: ensures that no spurious tuples are generated when
a NATURAL JOIN operation is applied to the relations resulting from the
decomposition.
4. No redundancy: Decomposition is used to eliminate some of the problems of bad
design like anomalies, inconsistencies, and redundancy. If the relation has no proper
decomposition, then it may lead to problems like loss of information.
5. Lossless join property: It is the ability to ensure that any instance of the original
relation can be identified from corresponding instances in the smaller relations.
Cont’d
We must carefully consider the problems associated with NULLs when designing a
relational database schema. There is no fully satisfactory relational design theory as
yet that includes NULL values.
A tuple that does not participate in a natural join is called a dangling tuple.
Dangling tuples may indicate a consistency problem in the database.
Example
R (A,B,C,D,E)
FD: {A → B, B → C, C → D, D → A}
Check if these two decompositions preserve FD. R1(A,B,C) and R2(C,D,E) First, find the FD set
for R1
A+ = ABCD (A →BC) Since D IS NOT IN R1 we eliminate D from FD1
B+ = BCDA (B →CA)
C+ = CDAB (C →AB)
AB+ = ABCD (AB →C)
BC+ = BCDA (BC →A) These two are already in the above 3 closure above so FD1 will be
{A →BC, B →CA, C →AB}
Following same process for R2 we get FD2 {C → D, D → C}
Example
Then FD1 U FD2 should be equivalent to FD
i.e FD {A → B, B → C, C → D, D → A} should be a member of
FD1 U FD2 {A →BC, B →CA, C →AB, C → D, D → C}
Therefore, check each FD exist in FD1 U FD2, All the first 3FD exist in the union relation. but what
about the fourth? It does not seem to exist directly so we will check if it exists indirectly by finding
out the determinant D closure
D+ = DCAB since A is a member in closure set of D then we can say D → A is member of FD1 U
FD2 {A →BC, B →CA, C →AB, C → D, D → C}
We can conclude that the decomposed relation preserve dependency.