Normalization
Unit -IV
What is Normalization?
Normalization is a database design technique that organizes tables so as to
reduce data redundancy and undesirable dependencies.
• It divides larger tables into smaller tables and links them using relationships.
• Normalization serves two main purposes:
1. Eliminating redundant (repeated) data.
2. Ensuring data dependencies make sense, i.e. data is logically stored.
Redundant Information in Tuples and Update Anomalies
• Information is stored redundantly
• Wastes storage
• Causes problems with update anomalies
• Insertion anomalies
• Deletion anomalies
• Modification anomalies
EXAMPLE OF AN UPDATE ANOMALY
• Consider the relation:
• EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours)
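• Update Anomaly:
• Changing the name of a project in EMP_PROJ requires the same change in the row
of every employee working on that project. Likewise, in a relation that repeats the
department name for every employee, changing the name of department No. 5 from
“Research” to “Research and Technology” forces this update to be made for all 100
employees working in this department. If any row is missed, the data becomes inconsistent.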
EXAMPLE OF AN INSERT ANOMALY
• Consider the relation:
• EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours)
• Insert Anomaly:
• Cannot insert a project unless an employee is assigned to it.
• Conversely
• Cannot insert an employee unless he/she is assigned to a project.
EXAMPLE OF A DELETE ANOMALY
• Consider the relation:
• EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours)
• Delete Anomaly:
• When a project is deleted, it will result in deleting all the employees who
work on that project.
• Alternately, if an employee is the sole employee on a project, deleting that
employee would result in deleting the corresponding project.
A simplified COMPANY relational database schema
Design Alternative: Smaller Schemas
• How would we recognize that a schema requires repetition of information and should be split
into two schemas, such as employee and department?
• One alternative is to decompose larger schemas into smaller schemas.
• After decomposition, regenerating the tuples with a natural join must give back exactly the
original relation; otherwise the decomposition is lossy. An example is on the next slide.
• There is a formal methodology for evaluating whether a relational schema should be
decomposed.
• This methodology is based upon the concepts of keys and functional dependencies.
Example: Bad Decomposition
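A minimal sketch of why a decomposition can be lossy. The employee data, the attribute names and the helpers project and natural_join are all made up for this illustration; relations are modelled as lists of dicts.

```python
def project(rows, attrs):
    """Project rows onto attrs and drop duplicate tuples."""
    seen = []
    for row in rows:
        small = {a: row[a] for a in attrs}
        if small not in seen:
            seen.append(small)
    return seen


def natural_join(left, right):
    """Join two relations on all attributes they have in common."""
    common = set(left[0]) & set(right[0]) if left and right else set()
    joined = []
    for l_row in left:
        for r_row in right:
            if all(l_row[a] == r_row[a] for a in common):
                merged = {**l_row, **r_row}
                if merged not in joined:
                    joined.append(merged)
    return joined


# Hypothetical data: two different employees share the name "Kim".
employee = [
    {"id": 1, "name": "Kim", "street": "Main", "salary": 75000},
    {"id": 2, "name": "Kim", "street": "North", "salary": 67000},
]

# Bad decomposition: the only shared attribute is name, which is not a key.
r1 = project(employee, ["id", "name"])
r2 = project(employee, ["name", "street", "salary"])
rejoined = natural_join(r1, r2)

print(len(employee), len(rejoined))  # 2 4
```

The join returns the two original tuples plus two spurious ones, so we can no longer tell which street and salary belong to which id. Having more tuples means having less information.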
Keys and Functional Dependencies
A database models a set of entities and relationships in the real world.
There are usually a variety of constraints (rules) on the data in the real world.
For example, some of the constraints that are expected to hold in a university database are:
1. Students and instructors are uniquely identified by their ID.
2. Each student and instructor has only one name.
3. Each instructor and student is (primarily) associated with only one department.
Keys and Functional Dependencies
• Real-world constraints can be represented formally as keys (superkeys, candidate keys and
primary keys) or as functional dependencies.
• A superkey is a set of one or more attributes that, taken collectively, allows us to identify uniquely
a tuple in the relation.
• E.g. Student (sid, name, supervisor_id, specialization)
• {sid, name} is a superkey for the student table.
• So are {sid, name, supervisor_id}, etc.
• A candidate key is a minimal superkey: a set of attributes that determines the entire tuple and
from which no attribute can be removed.
• {sid, name} is not a candidate key because name can be removed.
• {sid} is a candidate key.
If there are multiple candidate keys, the DB designer designates one as the primary key.
Functional Dependency -Definition
A functional dependency allows us to express constraints under which the values of certain
attributes uniquely determine the values of other attributes.
Consider a relation schema r(R), where r is a relation and R is the set of all attributes of r,
and let X ⊆ R and Y ⊆ R be sets of attributes.
Given an instance of r(R), we say that the instance satisfies the functional dependency
X → Y (read as "X determines Y") if for all pairs of tuples t1 and t2 in the instance such that
t1[X] = t2[X], it is also the case that t1[Y] = t2[Y].
Example: Student (sid, name, supervisor_id, specialization):
• {supervisor_id} → {specialization} means
• If two student records have the same supervisor (e.g., Mishra), then their
specialization (e.g., Databases) must be the same.
• On the other hand, if the supervisors of two students are different, we do not care about
their specializations (they may be the same or different).
We say that the functional dependency X → Y holds on schema r(R) if every legal instance of
r(R) satisfies it.
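(Figure: sample instance with tuples t1 to t5 over attributes including A and C.)
• t1[A] = t2[A] and also t1[C] = t2[C]; if this is true for every legal instance, A → C holds.
• t4[C] = t5[C] but t4[A] ≠ t5[A], so C → A is not satisfied.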
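The definition can be checked mechanically on a given instance. The sketch below is only an illustration: the function name satisfies and the sample student rows are assumptions, not data from the slides.

```python
def satisfies(rows, lhs, rhs):
    """Return True if the instance `rows` satisfies the FD lhs -> rhs."""
    seen = {}  # maps each X-value to the Y-value first seen with it
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False  # two tuples agree on X but differ on Y
    return True


# Hypothetical Student instance.
students = [
    {"sid": 1, "name": "Asha", "supervisor_id": "Mishra", "specialization": "Databases"},
    {"sid": 2, "name": "Ravi", "supervisor_id": "Mishra", "specialization": "Databases"},
    {"sid": 3, "name": "Asha", "supervisor_id": "Rao", "specialization": "Databases"},
]

print(satisfies(students, ["supervisor_id"], ["specialization"]))  # True
print(satisfies(students, ["specialization"], ["supervisor_id"]))  # False
print(satisfies(students, ["name"], ["sid"]))                      # False
```

Note that one instance can only refute a dependency. That a dependency holds on the schema is a design decision about all legal instances, not something a single instance can prove.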
Trivial FDs
A functional dependency X → Y is trivial if Y is a subset of X
• {name, supervisor_id} → {name}
• If two records have the same values on both the name and supervisor_id
attributes, then they obviously have the same name.
• Trivial dependencies hold for all relation instances
A functional dependency X → Y is non-trivial if Y ∩ X = ∅
(more generally, whenever Y is not a subset of X)
• {supervisor_id} → {specialization}
• Non-trivial FDs are given implicitly in the form of constraints when designing a
database.
• For instance, the specialization of a student must be the same as that of the
supervisor.
• They constrain the set of legal relation instances. For instance, if I try to insert
two students under the same supervisor with different specializations, the
insertion will be rejected by a DBMS that enforces the constraint.
Closure of a Set of Functional Dependencies
• Given that a set of functional dependencies F holds on a relation r (R),
it may be possible to infer that certain other functional dependencies
must also hold on the relation.
• For example, given a schema r (A, B,C), if functional dependencies
A→ B and B → C, hold on r , we can infer the functional dependency
A→ C must also hold on r.
• Thus, the set of all functional dependencies that can be inferred from
the set F is the closure of F, denoted by F+.
• We can find all of F+ by applying Armstrong’s Axioms.
Armstrong’s Axioms
• If Y ⊆ X, then X → Y (reflexivity rule)
• If X → Y, and Z is a set of attributes, then ZX → ZY holds (augmentation rule)
• If X → Y and Y → Z, then X → Z (transitivity rule)
These rules are
• sound (generate only functional dependencies that actually hold) and
• complete (generate all functional dependencies that hold).
Additional rules that follow from the axioms:
• If X → Y and X → Z, then X → YZ (union)
• If X → YZ, then X → Y and X → Z (decomposition)
• If X → Y and ZY → W, then ZX → W (pseudotransitivity)
Example
R = (A, B, C, G, H, I)
F = { A → B,
A → C,
CG → H,
CG → I,
B → H }
Some members of F+:
• A → H
• by transitivity from A → B and B → H
• AG → I
• by augmenting A → C with G to get AG → CG, then transitivity with CG → I
(equivalently, the pseudotransitivity rule implies that AG → I holds)
• CG → HI
• by augmenting CG → I with CG to infer CG → CGI,
augmenting CG → H with I to infer CGI → HI,
and then transitivity
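The three derived dependencies can be double-checked mechanically. The sketch below does not apply the axioms one by one; it uses the attribute-closure test (X → Y is in F+ exactly when Y is contained in the closure of X). The helper names closure and holds are assumptions, and attributes are written as single characters.

```python
def closure(attrs, fds):
    """Closure of attribute set `attrs` under `fds`, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# F from the example; "CG" stands for the attribute set {C, G}.
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]


def holds(lhs, rhs):
    """True if lhs -> rhs follows from F."""
    return set(rhs) <= closure(lhs, F)


print(holds("A", "H"))    # True
print(holds("AG", "I"))   # True
print(holds("CG", "HI"))  # True
print(holds("C", "H"))    # False: C alone does not determine H
```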
Procedure for Computing F+
To compute the closure of a set of functional dependencies F:
F+ = F
repeat
    for each functional dependency f in F+
        apply reflexivity and augmentation rules on f
        add the resulting functional dependencies to F+
    for each pair of functional dependencies f1 and f2 in F+
        if f1 and f2 can be combined using transitivity then
            add the resulting functional dependency to F+
until F+ does not change any further
Closure of a Set of Attributes
For a set X of attributes, the closure of X (with respect to a set of
functional dependencies F), written X+, is the largest set of attributes
such that X → X+ (as a consequence of F).
Consider the relation scheme R(A,B,C,D) with functional dependencies
{A} → {C} and {B} → {D}.
{A}+ = {A,C}
{B}+ = {B,D}
{C}+={C}
{D}+={D}
{A,B}+ = {A,B,C,D}
Algorithm for Computing the Closure of a
Set of Attributes
Input:
• R, a relation scheme
• F, a set of functional dependencies
• X ⊆ R (the set of attributes for which we want to compute the closure)
Output:
• X+, the closure of X w.r.t. F
result := X;
while (changes to result) do
    for each FD Y → Z in F do
    begin
        if Y ⊆ result then
            result := result ∪ Z
    end
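The algorithm translates almost line by line into Python. This is a sketch under two assumptions: attributes are single characters (so "AB" means {A, B}) and F is given as a list of (left side, right side) pairs.

```python
def closure(attrs, fds):
    """Closure of attribute set `attrs` under `fds`, a list of (lhs, rhs) pairs."""
    result = set(attrs)          # result := X
    changed = True
    while changed:               # while (changes to result)
        changed = False
        for lhs, rhs in fds:     # for each FD Y -> Z in F
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)   # result := result ∪ Z
                changed = True
    return result


# R(A, B, C, D) with A -> C and B -> D, as in the earlier example.
fds = [("A", "C"), ("B", "D")]

print(sorted(closure("A", fds)))   # ['A', 'C']
print(sorted(closure("B", fds)))   # ['B', 'D']
print(sorted(closure("AB", fds)))  # ['A', 'B', 'C', 'D']
```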
Example of Attribute Set Closure
R = (A, B, C, D)
F = { A → B,
B → C,
C → D }
I) Computing (A)+
1. result = A
2. result = AB (A → B)
3. result = ABC (B → C)
4. result = ABCD (C → D)
II) Closures of the single attributes
(A)+ = ABCD
(B)+ = BCD
(C)+ = CD
(D)+ = D
Thus attribute A is the candidate key, as (A)+ contains all the attributes of R.
(AB)+ = ABCD
Thus AB is a superkey but not a candidate key, as its subset A is a key itself.
Prime attributes: A
Non-prime attributes: B, C, D
Example of Attribute Set Closure
R = (A, B, C, D)
F = { A → B,
B → C,
C → D,
D → A }
I) Computing (A)+
1. result = A
2. result = AB (A → B)
3. result = ABC (B → C)
4. result = ABCD (C → D)
II) Closures of the single attributes
(A)+ = ABCD
(B)+ = BCDA = ABCD
(C)+ = CDAB = ABCD
(D)+ = DABC = ABCD
Thus A, B, C and D are each a candidate key, as each one determines all the attributes of R.
(AB)+ = ABCD
Thus AB is a superkey but not a candidate key, as its subsets A and B are keys themselves.
Prime attributes: A, B, C, D
Non-prime attributes: none
Example of Attribute Set Closure
R = (A, B, C, G, H, I)
F = { A → B,
A → C,
CG → H,
CG → I,
B → H }
I) Computing (AG)+
1. result = AG
2. result = ABCG (A → C and A → B)
3. result = ABCGH (CG → H and CG ⊆ AGBC)
4. result = ABCGHI (CG → I and CG ⊆ AGBCH)
Therefore (AG)+ = ABCGHI
II) In this example, attributes A and G do not appear on the R.H.S. of any dependency,
i.e. these attributes are not determined by any other attributes.
Hence such attributes have to be in every candidate key.
Thus the candidate key is (AG), as it determines all the attributes of R.
Prime attributes: A, G
Non-prime attributes: B, C, H, I
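The candidate keys of the three examples can be found by brute force: try attribute sets in order of size and keep those whose closure is all of R and that contain no smaller key. This is a sketch for small schemas only (it is exponential in the number of attributes); the function names and the single-character attribute encoding are assumptions.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of attribute set `attrs` under `fds`, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    schema = set(schema)
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key: a superkey, not a candidate key
            if closure(candidate, fds) == schema:
                keys.append(candidate)
    return ["".join(sorted(key)) for key in keys]


chain = [("A", "B"), ("B", "C"), ("C", "D")]
cycle = chain + [("D", "A")]
third = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(candidate_keys("ABCD", chain))    # ['A']
print(candidate_keys("ABCD", cycle))    # ['A', 'B', 'C', 'D']
print(candidate_keys("ABCGHI", third))  # ['AG']
```

The prime attributes are then simply the attributes that occur in at least one of the returned keys.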
Uses of Attribute Closure
There are several uses of the attribute closure algorithm:
Testing for superkey and candidate key
• To test if X is a superkey, we compute X+ and check if X+ contains all attributes of R.
• X is a candidate key if it is a superkey and none of its proper subsets is a superkey.
Testing functional dependencies
• To check if a functional dependency X → Y holds (or, in other words, is in F+), just check if
Y ⊆ X+.
Computing the closure of F
• For each subset X ⊆ R, we find the closure X+, and for each Y ⊆ X+, we output a functional
dependency X → Y.
Computing if two sets of functional dependencies F and G are equivalent, i.e., F+ = G+
• For each functional dependency Y → Z in F
• Compute Y+ with respect to G
• If Z ⊆ Y+ then Y → Z is in G+
• And vice versa (for each functional dependency in G, test it against F)
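The dependency test and the equivalence test can be sketched in a few lines on top of attribute closure. The names implies and equivalent are assumptions made for this illustration; attributes are single characters.

```python
def closure(attrs, fds):
    """Closure of attribute set `attrs` under `fds`, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """True if lhs -> rhs is in the closure of fds."""
    return set(rhs) <= closure(lhs, fds)


def equivalent(f, g):
    """True if F+ = G+: every FD of F follows from G and vice versa."""
    return (all(implies(g, lhs, rhs) for lhs, rhs in f)
            and all(implies(f, lhs, rhs) for lhs, rhs in g))


f = [("A", "B"), ("B", "C"), ("A", "C")]
g = [("A", "B"), ("B", "C")]

print(implies(g, "A", "C"))          # True
print(equivalent(f, g))              # True
print(equivalent(g, [("A", "B")]))   # False: B -> C is lost
```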
Redundancy of FDs
Sets of functional dependencies may have redundant dependencies
that can be inferred from the others
• {A} → {C} is redundant in: {{A} → {B}, {B} → {C}, {A} → {C}}
Parts of a functional dependency may be redundant
• Example of extraneous/redundant attribute on RHS:
{{A} → {B}, {B} → {C}, {A} → {C,D}} can be simplified to
{{A} → {B}, {B} → {C}, {A} → {D}}
(because {A} → {C} is inferred from {A} → {B}, {B} → {C})
• Example of extraneous/redundant attribute on LHS:
{{A} → {B}, {B} → {C}, {A,C} → {D}} can be simplified to
{{A} → {B}, {B} → {C}, {A} → {D}}
(because of {A} → {C})
Desirable Properties of Decomposition
• Lossless-Join Decomposition
• Let R be a relation schema, and let F be a set of functional dependencies on R. Let
R1 and R2 form a decomposition of R. This decomposition is a lossless-join
decomposition of R if at least one of the following functional dependencies is in
F+:
• R1 ∩ R2 → R1
• R1 ∩ R2 → R2
• Dependency Preservation
• Each functional dependency X → Y specified in F should either appear
directly in one of the relation schemas Ri of the decomposition or be inferable
from the dependencies that appear in some Ri.
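The lossless-join test for a two-way decomposition reduces to one closure computation: the common attributes must determine all of R1 or all of R2. A minimal sketch, with is_lossless as an assumed name and a small made-up schema R(A, B, C) with A → B:

```python
def closure(attrs, fds):
    """Closure of attribute set `attrs` under `fds`, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_lossless(r1, r2, fds):
    """True if (R1 ∩ R2) -> R1 or (R1 ∩ R2) -> R2 follows from fds."""
    common_closure = closure(set(r1) & set(r2), fds)
    return set(r1) <= common_closure or set(r2) <= common_closure


fds = [("A", "B")]  # hypothetical R(A, B, C) with A -> B

print(is_lossless("AB", "AC", fds))  # True: A -> AB
print(is_lossless("AB", "BC", fds))  # False: B determines neither AB nor BC
```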
First Normal Form
• First Normal Form is part of the definition of a relation (table) itself.
• The rule requires that all the attributes in a relation have atomic domains.
The values in an atomic domain are indivisible units.
• Example (figures not reproduced): an unorganized relation is re-arranged
so that it is in First Normal Form.
1NF (First Normal Form) Rules
• Each table cell should contain a single value.
• Each record needs to be unique.
Second Normal Form
• Recall:
• Prime attribute: an attribute that is part of some candidate key.
• Non-prime attribute: an attribute that is not part of any candidate key.
• In second normal form, every non-prime attribute must be
fully functionally dependent on the candidate key.
• That is, if X → A holds for a candidate key X and a non-prime attribute A, then there
must not be any proper subset Y of X for which Y → A also holds.
• Definition: A relation schema R is in 2NF if every nonprime attribute A in R is
fully functionally dependent on the primary key of R.
STUD_ID   PROJ_ID   STUD_NAME   PROJ_NAME
1         101       ASHISH      Hotel Management System
1         102       ASHISH      Library Management System
3         101       PRIYA       Hotel Management System
4         101       ANUSHA      Hotel Management System
5         103       SACHIN      Courier Services System
• In the given example, students can work on multiple projects.
• The table suffers from an update anomaly: changing the name of a project affects every row
that mentions that project.
• In the Student_Project relation, the candidate key is {STUD_ID, PROJ_ID}.
• Prime attributes: STUD_ID, PROJ_ID
• Non-prime attributes: STUD_NAME and PROJ_NAME
• When a proper subset of the candidate key determines a non-prime attribute, this is called a partial
dependency, which is not allowed in Second Normal Form. Here STUD_ID → STUD_NAME and
PROJ_ID → PROJ_NAME are partial dependencies.
• According to the rule, the non-key attributes STUD_NAME and PROJ_NAME must depend on both key
attributes together and not on either prime attribute individually.
Converting into 2NF
1. Form all subsets of the attributes making up the primary key.
2. Begin a new table for each subset, using the subset as the primary key.
• (STUD_ID, ...)
• (PROJ_ID, ...)
• (STUD_ID, PROJ_ID, ...)
3. Now, from the original table, add to each subset the attributes that
depend on that subset's primary key.
• (STUD_ID, STUD_NAME)
• (PROJ_ID, PROJ_NAME)
• (STUD_ID, PROJ_ID)
4. Name each of the new tables appropriately.
In each resulting table, every non-prime attribute is fully functionally dependent
on the candidate key.
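The conversion can be sketched as three projections of the Student_Project table. The rows are the ones from the example table; the helper project_columns and the variable names are assumptions made for this illustration.

```python
COLUMNS = ("STUD_ID", "PROJ_ID", "STUD_NAME", "PROJ_NAME")

student_project = [
    (1, 101, "ASHISH", "Hotel Management System"),
    (1, 102, "ASHISH", "Library Management System"),
    (3, 101, "PRIYA", "Hotel Management System"),
    (4, 101, "ANUSHA", "Hotel Management System"),
    (5, 103, "SACHIN", "Courier Services System"),
]


def project_columns(rows, keep):
    """Keep only the named columns and drop duplicate rows."""
    idx = [COLUMNS.index(name) for name in keep]
    return sorted({tuple(row[i] for i in idx) for row in rows})


students = project_columns(student_project, ["STUD_ID", "STUD_NAME"])
projects = project_columns(student_project, ["PROJ_ID", "PROJ_NAME"])
assignments = project_columns(student_project, ["STUD_ID", "PROJ_ID"])

print(len(students), len(projects), len(assignments))  # 4 3 5
print(projects[0])  # (101, 'Hotel Management System')
```

Each project name is now stored once, so renaming a project touches a single row instead of every assignment row.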
Third Normal Form
• Definition: A relation schema R is in 3NF if it satisfies 2NF and no
nonprime attribute of R is transitively dependent on the primary key.
• For a relation to be in Third Normal Form, it must be in Second Normal
Form and the following must hold:
• No non-prime attribute is transitively dependent on a candidate key.
• A non-prime attribute cannot determine another non-prime attribute.
• A prime attribute can determine a non-prime attribute.
• A non-prime attribute can determine a prime attribute.
• Equivalently, for every non-trivial functional dependency X → A, either
• X is a superkey (the LHS is a candidate key or a superkey), or
• A is a prime attribute (the RHS is a prime attribute).
Third Normal Form
• To remove transitive dependencies:
• For each determinant that is not a candidate key, remove from the
table the columns that depend on this determinant (but don't remove
the determinant).
• Create a new table containing all the columns from the original table
that depend on this determinant.
• Make the determinant the primary key of this new table.
• Name the new table appropriately.
• In the Student_detail relation, Stu_ID is the key and the only prime
attribute. City can be identified by Stu_ID as well as by Zip. Zip is not a
superkey, and City is not a prime attribute.
• Additionally, Stu_ID → Zip → City, so there exists a transitive dependency.
• To bring this relation into third normal form, we break it into two relations: one that
keeps Stu_ID, Zip and the attributes that depend directly on Stu_ID, and one with Zip as
its key that holds City.
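The 3NF condition (X is a superkey or A is prime) can be tested directly on the given dependencies. The sketch below assumes only the three attributes named in the text (Stu_ID, Zip, City); any other columns of Student_detail are left out, and the function names are made up for this illustration.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of a set of attribute names under fds, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(schema, fds):
    """Brute-force search for minimal sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue
            if closure(candidate, fds) == schema:
                keys.append(candidate)
    return keys


def violations_3nf(schema, fds):
    """Pairs (X, A) where X -> A is non-trivial, X is no superkey and A is non-prime."""
    prime = set().union(*candidate_keys(schema, fds))
    bad = []
    for lhs, rhs in fds:
        if closure(lhs, fds) == schema:
            continue  # X is a superkey
        for attr in sorted(rhs - lhs):
            if attr not in prime:
                bad.append((lhs, attr))
    return bad


schema = {"Stu_ID", "Zip", "City"}
fds = [({"Stu_ID"}, {"Zip"}), ({"Zip"}, {"City"})]
print(violations_3nf(schema, fds))  # [({'Zip'}, 'City')]

# After the split, neither relation has a violation.
print(violations_3nf({"Stu_ID", "Zip"}, [({"Stu_ID"}, {"Zip"})]))  # []
print(violations_3nf({"Zip", "City"}, [({"Zip"}, {"City"})]))      # []
```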
SUMMARY OF NORMAL FORMS BASED ON PRIMARY KEYS
AND CORRESPONDING NORMALIZATION
First (1NF)
• Test: Relation should have no nonatomic attributes or nested relations.
• Remedy (normalization): Form new relations for each nonatomic attribute or nested relation.
Second (2NF)
• Test: For relations where the primary key contains multiple attributes, no nonkey attribute
should be functionally dependent on a part of the primary key.
• Remedy (normalization): Decompose and set up a new relation for each partial key with its
dependent attribute(s). Make sure to keep a relation with the original primary key and any
attributes that are fully functionally dependent on it.
Third (3NF)
• Test: Relation should not have a nonkey attribute functionally determined by another nonkey
attribute (or by a set of nonkey attributes). That is, there should be no transitive dependency
of a nonkey attribute on the primary key.
• Remedy (normalization): Decompose and set up a relation that includes the nonkey attribute(s)
that functionally determine(s) other nonkey attribute(s).
Boyce Codd normal form (BCNF)
• Definition. A relation schema R is in BCNF if whenever a nontrivial
functional dependency X → A holds in R, then X is a superkey of R.
• BCNF is stricter than 3NF.
Example: Suppose there is a company wherein employees work in more than one department.
They store the data like this:
emp_id   emp_nationality   emp_dept                       dept_type   dept_no_of_emp
1001     Austrian          Production and planning        D001        200
1001     Austrian          stores                         D001        250
1002     American          design and technical support   D134        100
1002     American          Purchasing department          D134        600
• Functional dependencies in the table above:
emp_id → emp_nationality
emp_dept → {dept_type, dept_no_of_emp}
• Candidate key: {emp_id, emp_dept}
• The table is not in BCNF because neither emp_id nor emp_dept alone is a superkey, yet each
is the left-hand side of a nontrivial functional dependency.
• To make the table comply with BCNF we can break the table into three tables like
this:
1: emp_nationality table:
emp_id   emp_nationality
1001     Austrian
1002     American
2: emp_dept table:
emp_dept                       dept_type   dept_no_of_emp
Production and planning        D001        200
stores                         D001        250
design and technical support   D134        100
Purchasing department          D134        600
3: emp_dept_mapping table:
emp_id   emp_dept
1001     Production and planning
1001     stores
1002     design and technical support
1002     Purchasing department
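The BCNF condition can be checked for the original table and for the three resulting tables. In this sketch the function name bcnf_violations is an assumption, and only the given dependencies are tested, which is enough when checking a relation against its own set F.

```python
def closure(attrs, fds):
    """Closure of a set of attribute names under fds, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(schema, fds):
    """Non-trivial dependencies whose left side is not a superkey of schema."""
    return [(lhs, rhs) for lhs, rhs in fds
            if not rhs <= lhs and closure(lhs, fds) != schema]


nationality_fd = ({"emp_id"}, {"emp_nationality"})
dept_fd = ({"emp_dept"}, {"dept_type", "dept_no_of_emp"})

original = {"emp_id", "emp_nationality", "emp_dept", "dept_type", "dept_no_of_emp"}
print(len(bcnf_violations(original, [nationality_fd, dept_fd])))  # 2

# The three tables of the decomposition, each with the dependencies that apply to it.
print(bcnf_violations({"emp_id", "emp_nationality"}, [nationality_fd]))        # []
print(bcnf_violations({"emp_dept", "dept_type", "dept_no_of_emp"}, [dept_fd]))  # []
print(bcnf_violations({"emp_id", "emp_dept"}, []))                              # []
```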
Comparison between 3NF and BCNF
BASIS FOR COMPARISON: 3NF vs. BCNF
Concept
• 3NF: No non-prime attribute may be transitively dependent on a candidate key.
• BCNF: For every nontrivial dependency X → Y in a relation R, X must be a super key of R.
Dependency
• 3NF: A 3NF decomposition can always be obtained without sacrificing any dependency.
• BCNF: Dependencies may not be preserved in BCNF.
Decomposition
• 3NF: Lossless decomposition can be achieved in 3NF.
• BCNF: Lossless decomposition can also be achieved in BCNF, but sometimes only by giving up
dependency preservation.