Dbms 2-1 Material
3 0 0 3
Unit II: Relational Model: Introduction to relational model, concepts of domain, attribute,
tuple, relation, importance of null values, constraints (Domain, Key constraints, integrity
constraints) and their importance, Relational Algebra, Relational Calculus. BASIC SQL:
Simple Database schema, data types, table definitions (create, alter), different DML
operations (insert, delete, update).
UNIT III: SQL: Basic SQL querying (select and project) using where clause, arithmetic &
logical operations, SQL functions (Date and Time, Numeric, String conversion). Creating
tables with relationship, implementation of key and integrity constraints, nested queries,
subqueries, grouping, aggregation, ordering, implementation of different types of joins,
views (updatable and non-updatable), relational set operations.
Web-Resources:
1. https://nptel.ac.in/courses/106/105/106105175/
https://infyspringboard.onwingspan.com/web/en/app/toc/lex_auth_0127580666728202
2456_shared/overview
UNIT 1- DATABASE MANAGEMENT SYSTEM
Database Users:-
There are 4 different types of database system users, differentiated by the way they expect to
interact with the system. Different types of user interfaces have been designed for the
different types of users.
1. Naive Users:
Naive users are unsophisticated users who interact with the system by invoking one of
the application programs that have been written previously. The typical user interface
for naive users is a forms interface, where the user can fill in appropriate fields of the
form. Naïve users may also simply read reports generated from the database.
2. Application Programmers:-
Application Programmers are computer professionals who write application programs.
Application Programmers can choose from many tools to develop user interfaces.
Rapid Application Development(RAD) tools are tools that enable an application
programmer to construct forms and reports with minimal programming effort.
3. Sophisticated Users:-
Sophisticated Users interact with the system without writing programs. Instead, they
form their requests either using a database query language or by using tools such as
data analysis software. Analysts who submit queries to explore data in the database
fall in this category.
4. Specialized Users:-
Specialized Users are sophisticated users who write specialized database applications
that do not fit into the traditional data processing framework. Among these
applications are computer-aided design systems, knowledge-base and expert systems,
systems that store data with complex data types (for example, graphics data and audio
data), and environment-modelling systems.
Enterprise Information:-
Sales:- For customer, product and purchase information.
Accounting:- For payments, receipts, account balances, assets and other accounting
information.
Human resources:- For information about employees, salaries, payroll taxes, and benefits,
and for generation of paychecks.
Manufacturing:- For management of the supply chain and for tracking production of
items in factories, inventories of items in warehouses and stores, and orders for items.
Online retailers:- For sales data noted above plus online order tracking, generation of
recommendation lists, and maintenance of online product evaluations.
Database Administrators:- One of the main reasons for using DBMS is to have central
control of both the data and the programs that access those data. A person who has such
central control over the system is called a database administrator.
Query Processor:-
The query processor components include:-
DDL interpreter:- Which interprets DDL statements
and records the definitions in the data dictionary.
DML compiler:- Which translates DML statements in a query language into an
evaluation plan consisting of low-level instructions that the query evaluation engine
understands.
A query can usually be translated into any of a number of alternative evaluation
plans that all give the same result. The DML compiler also performs query
optimization; that is, it picks the lowest-cost evaluation plan from among the
alternatives.
Query evaluation engine:- Which executes the low-level instructions generated by the
DML compiler.
Storage Manager:-
The Storage Manager is the component of a database system that provides the interface
between the low-level data stored in the database and the application programs and queries
submitted to the system.
The storage manager components include:-
Authorization and Integrity Manager: - Authorization and Integrity Manager,
which tests for the satisfaction of integrity constraints and checks the authority of
users to access data.
Transaction Manager: - Which ensures that the database remains in a consistent
state despite system failures and that concurrent transaction executions proceed
without conflicting.
File Manager:- Which manages the allocation of space on disk storage and the data
structures used to represent information stored on disk.
Buffer Manager:- Which is responsible for fetching data from disk storage into main
memory, and deciding what data to cache in main memory. The buffer manager is a
critical part of the database system since it enables the database to handle data sizes
that are much larger than the size of main memory.
The storage manager implements several data structures as part of the physical system
implementation: -
Data Files: - which store the database itself.
Data Dictionary: - which stores metadata about the structure of the database, in
particular the schema of the database.
Indices: - which can provide fast access to data items. Like the index in a textbook, a
database index provides pointers to those data items that hold a particular value.
Statistical Data:- It refers to the metadata and metrics collected about the database’s
performance, usage and structure. This data is used to optimize query performance,
storage and system management.
Three Tier Schema Architecture for data independence: -
View Level
Logical level
Physical Level
Physical Level: - The lowest level of abstraction describes how the data are actually stored.
The physical level describes complex low-level data structures in detail.
Logical Level: - The next higher level of abstraction describes what data are stored in the
database and what relationships exist among those data. The logical level thus describes the
entire database in terms of a small number of relatively simple structures.
View Level: - The highest level of abstraction describes only part of the entire database.
Even though the logical level uses simpler structures, complexity remains because of the
variety of information stored in a large database. The view level of abstraction exists to
simplify users' interaction with the system. The system may provide many views for the same
database.
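As a rough SQL illustration (the instructor table and its columns are hypothetical, not taken from these notes): the logical level corresponds to the base table definition, while the view level exposes only part of it to a particular group of users.

-- Logical level: the full relation, as the database designer defines it
CREATE TABLE instructor (
    ID     INT PRIMARY KEY,
    name   VARCHAR(50),
    dept   VARCHAR(30),
    salary DECIMAL(10,2)
);

-- View level: a view that hides the salary column from most users
CREATE VIEW instructor_public AS
SELECT ID, name, dept
FROM instructor;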
Entity Relationship Model
Introduction:
The Entity – Relationship (E – R) data model was developed to facilitate database design by
allowing specification of an enterprise schema that represents the overall logical structure of
a database.
E – R Model: -
The Entity – Relationship (E-R) Model is a conceptual representation of the data structures
and relationships in a database. It is a graphical representation that uses entities, attributes and
relationships to describe the organization of data.
Strong Entity: -
A strong entity is an entity that has a unique identifier and exists independently of other
entities. It is an entity that can be identified by its own attributes, without relying on another
entity.
Characteristics of a Strong Entity: -
1. Has a primary key (unique identifier).
2. It is represented by a single rectangle in an E-R diagram.
3. It exists independently.
4. Not dependent on another entity for its existence.
5. It can be created, updated and deleted independently.
Weak Entity: -
A weak entity is an entity that depends on another entity. It has no primary key of its own. It
has a foreign key that references the strong entity’s primary key.
Characteristics: -
1. It has no unique identifier (primary key).
2. It is represented by a double rectangle in an E-R diagram.
3. It is dependent on a strong entity for existence.
4. It has a foreign key referencing the strong entity.
5. It cannot be created, updated or deleted independently.
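A minimal relational sketch of this idea, using hypothetical Building (strong) and Apartment (weak) tables that are not part of these notes: the weak entity's key combines its own partial key with the strong entity's primary key.

-- Strong entity: identified by its own primary key
CREATE TABLE Building (
    Building_ID INT PRIMARY KEY,
    Address     VARCHAR(100)
);

-- Weak entity: identified only together with the owning Building
CREATE TABLE Apartment (
    Building_ID INT,                  -- foreign key referencing the strong entity
    Apt_No      INT,                  -- partial key, unique only within a building
    Area        DECIMAL(8,2),
    PRIMARY KEY (Building_ID, Apt_No),
    FOREIGN KEY (Building_ID) REFERENCES Building(Building_ID)
        ON DELETE CASCADE             -- an apartment cannot exist without its building
);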
Attributes: -
An attribute is a characteristic or property of an entity that describes it or defines its state.
Attributes provide more details about an entity and help to differentiate one entity from another.
Types of Attributes: -
1. Simple Attribute: - It is a type of attribute that has only one value, cannot be
further divided, and into which we cannot insert null values. Simple attributes provide basic
information about entities.
Ex: Name, class, Roll_no, DOB, etc.
2. Composite Attribute: - It is a type of attribute that is made up of multiple components
and can be further divided into smaller attributes.
Ex: 1. Address: street, city, state, ZIP code. 2. Name: First name, Middle name, Last name.
3. Multi-Valued Attribute: -
It is a type of attribute that can have multiple values for a single entity. It represents
the collection of values for a property of an entity. Allows an entity to have more than
one value for the attribute.
Ex: Phone Numbers – A person can have multiple phone numbers(e.g; home, work)
4. Single – Valued Attribute: -
It is a type of attribute that has only one value for a single entity. Represents a property
of an entity. It cannot have multiple values for the same entity.
5. Derived Attribute:-
It is a type of attribute that is calculated or derived from other attributes. Does not
store its own value. Depends on the values of other attributes.
Ex: Age
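A short sketch of how these attribute types commonly map to tables (the Person and PhoneNumber tables are hypothetical): a multi-valued attribute becomes a separate table, and a derived attribute such as Age is computed from DOB rather than stored.

CREATE TABLE Person (
    Person_ID INT PRIMARY KEY,        -- simple, single-valued attribute
    Name      VARCHAR(50),
    DOB       DATE                    -- stored attribute used to derive Age
);

-- Multi-valued attribute: one person can have several phone numbers
CREATE TABLE PhoneNumber (
    Person_ID INT,
    Phone     VARCHAR(15),
    PRIMARY KEY (Person_ID, Phone),
    FOREIGN KEY (Person_ID) REFERENCES Person(Person_ID)
);

-- Derived attribute: Age is calculated from DOB at query time (MySQL syntax)
SELECT Name, TIMESTAMPDIFF(YEAR, DOB, CURDATE()) AS Age
FROM Person;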
Relationship Set:
Characteristics of Relationship Set
Degree: The degree of a relationship set indicates the number of properties (attributes)
associated with the relationship set.
Arity: The arity of a relationship set indicates the number of participating entity sets. It can
be binary (involving two entity sets), ternary (involving three entity sets), and so on.
Cardinality: Cardinality characterizes the number of occurrences or records that can be
associated with each entity on both sides of the relationship.
This permits different kinds of relationships, such as one-to-one, one-to-
many, and many-to-many, reflecting real-world associations between entities.
Degree of Relationship Set
The number of entity sets that participate in a particular relationship set is called
the degree of that relationship set.
Degree of a Relationship Set = the number of entity sets that belong to that particular
relationship set
Relationship Set Example with Tables
Let the 'Customer' and 'Loan' entity sets define the relationship set 'Borrow' to denote
the association between customers and bank
loans.
In the above example, the Customer table shows a customer borrowing a loan from the Loan
table. In other words, the relationship Borrow is one-to-one.
Examples
1. Student-Student ID Relationship
In a one-to-one relationship, each student is assigned a unique student ID
one to one
2. Teacher-Student Relationship
Consider two relations: Teacher and Student. The relationship set between them can be
characterized as "mentors." Each teacher mentors one class, but a class can have
numerous students, so one teacher can have many students.
This is an illustration of a one-to-many relationship.
one to many
3. Student-Course Relationship
In this situation, we have two relations: Student and Course. The relationship set
between them can be characterized as "enrolled_in." Each student can be enrolled in
numerous courses, and each course can have many students.
This is an illustration of a many-to-many relationship.
many to many
In this ER diagram, both the entities customer and driving license have an arrow, which
means the entity Customer is participating in the relation "has a" in a one-to-one fashion. It
could be read as 'Each customer has exactly one driving license and every driving license is
associated with exactly one customer.'
The set-theoretic perspective of the ER diagram is shown below.
There may be customers who do not have a credit card, but every credit card is associated
with exactly one customer. Therefore, the entity credit card has total participation in the
relation.
2. One to many relationship (1:M)
Example:
This relationship is one to many because “There are some employees who manage more
than one team while there is only one manager to manage a team”.
Any of the four cardinalities of a binary relationship can have both sides partial, both sides
total, or one side partial and one side total participation, depending on the constraints specified
by user requirements.
Generalization
Generalization is the process of extracting common properties from a set of entities and
creating a generalized entity from it. It is a bottom-up approach in which two or more
entities can be generalized to a higher-level entity if they have some attributes in common.
For Example, STUDENT and FACULTY can be generalized to a higher-level entity called
PERSON as shown in Figure 1. In this case, common attributes like P_NAME, and P_ADD
become part of a higher entity (PERSON), and specialized attributes like S_FEE become
part of a specialized entity (STUDENT).
Generalization is also called the "bottom-up approach".
Specialization
In specialization, an entity is divided into sub-entities based on its characteristics. It is a
top-down approach where the higher-level entity is specialized into two or more lower-
level entities. For Example, an EMPLOYEE entity in an Employee management system can
be specialized into DEVELOPER, TESTER, etc. as shown in Figure 2. In this case,
common attributes like E_NAME, E_SAL, etc. become part of a higher entity
(EMPLOYEE), and specialized attributes like TES_TYPE become part of a specialized
entity (TESTER).
Specialization is also called the "top-down approach".
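One common way to realize this specialization in relational tables is sketched below; the notes do not prescribe a particular mapping, and the E_ID key column is an assumption.

-- Higher-level entity holding the common attributes named above
CREATE TABLE EMPLOYEE (
    E_ID   INT PRIMARY KEY,
    E_NAME VARCHAR(50),
    E_SAL  DECIMAL(10,2)
);

-- Lower-level (specialized) entity: shares the EMPLOYEE key and adds its own attribute
CREATE TABLE TESTER (
    E_ID     INT PRIMARY KEY,
    TES_TYPE VARCHAR(30),
    FOREIGN KEY (E_ID) REFERENCES EMPLOYEE(E_ID)
);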
•Employee Table:
•Department Table:
Using relational operations, you can query, join and manipulate these tables to retrieve
meaningful data, such as which employees belong to which department.
Concepts Of Domain: -
The concept of Domain in the relational model refers to the set of all possible values that an
attribute (column) can take. It is essentially the data type, or range of allowable values, for an
attribute in a relation (table).
Ex: - For an attribute like Age, the domain could be the set of all positive integers
between 1 and 100.
2. Atomicity: - A domain contains atomic (indivisible) values, meaning the values in a
domain are simple and cannot be broken down further.
3. Data Type: - The domain typically aligns with a specific data type, like:
•Integer: A set of whole numbers (e.g., 1, 2, 3)
4. Constraints on Domain: Domains can have constraints that define specific rules for allowable
values:
•Range: A domain can restrict values within a specific range (e.g., Age must be
between 18 and 65).
•Pattern: For example, the domain of an email address might enforce a specific pattern,
such as requiring an @ symbol.
•Uniqueness: In some cases, a domain may require that values be unique, as with a
primary key.
Consider a relation Students (Student_ID, Name, Age, Gender). Each attribute in this
Students relation has its own domain:
Attribute Domain
Student_ID Positive integers (1, 2, 3, ...)
•Student_ID is a positive integer, so the domain consists of all possible positive integers.
•Age is limited to integers within a specific range, such as between 18 and 30.
•Gender is restricted to one of three possible values: Male, Female or Other.
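A rough DDL sketch of these domains (the column sizes and CHECK syntax are assumptions; the notes only describe the intended domains):

CREATE TABLE Students (
    Student_ID INT PRIMARY KEY CHECK (Student_ID > 0),            -- positive integers
    Name       VARCHAR(100) NOT NULL,
    Age        INT CHECK (Age BETWEEN 18 AND 30),                 -- restricted range
    Gender     VARCHAR(10) CHECK (Gender IN ('Male', 'Female', 'Other'))
);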
Concepts of attributes:
In the relational model of a database management system, attributes are the columns of a
table (relation) that describe the properties or characteristics of the entities in the table.
Ex: - In a Student table, the attributes could be Student_ID, Name, Age, and Course.
•Name and Domain: Every attribute has a name (used to identify it) and a domain, which
defines the allowable set of values for that attribute.
Ex: The attribute Age might have a domain restricting values to positive integers.
•Data Type: Attributes have associated data types, which define the kind of values they can hold,
such as integer, string, or date.
Concepts Of Tuple: -
In the relational model, a tuple is a single row of a relation (table) that contains one value for
each attribute.
Example of a Tuple:
Here, each row represents a tuple, and the table contains three tuples. Let's break down each
tuple:
•Tuple 1: (1, "John Smith", 30, "HR")
Employee_ID: 1, Name: John Smith, Age: 30, Department: HR
•Tuple 2: (2, "Jane Smith", 28, "IT")
Employee_ID: 2, Name: Jane Smith, Age: 28, Department: IT
•Tuple 3: (3, "Alice", 35, "Marketing")
Employee_ID: 3, Name: Alice, Age: 35, Department: Marketing
Concepts of Relation: -
In the context of a relational database management system, a relation is a fundamental concept
that represents a table. It consists of tuples (rows) and attributes (columns), where each attribute
has a specific data type. Key Concepts of Relation:
3. Schema: The schema of a relation defines its structure, including the names
of attributes and their data types.
4. Primary Key: - A primary key is an attribute (or a set of attributes) that uniquely
identifies each tuple in a relation. It ensures that no two tuples are identical based on the
primary key.
5. Foreign Key: -A foreign key is an attribute in one relation that refers to the primary key
of another relation, establishing a link between the two tables.
Example of a Relation: -
• Attributes:
Employee_ID: Integer, unique identifier for each employee (primary key).
Name: String, name of the employee.
• Tuples: Each row in the table is a tuple representing an employee. For example, the first tuple
(1, "John Smith", 30, "HR", 60000) contains all the information for the employee with
Employee_ID 1.
Importance of null values: -
2. Representation of Unknown Data: NULL allows data to be stored when the value
is unknown but might be provided later, distinguishing between missing and unknown values.
Example: A customer signs up without providing an email address. The email field can be
left as NULL, indicating that the value is unknown.
INSERT INTO Customers (Customer_ID, Name, Email) VALUES (1, 'Alice', NULL);
3. Representation of Inapplicable Data: In some cases, certain fields may not apply to all
records, such as employee benefits for freelance workers. NULL can represent these
inapplicable fields.
4. Data Integrity: - NULL supports data integrity by ensuring that databases can capture
incomplete or uncertain data without relying on misleading placeholder values like "N/A" or "0".
7. Standard Approach: NULL provides a standardized way of dealing with the absence of
data across relational databases, ensuring consistency in data management practices.
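A small query sketch for working with such NULLs; it reuses the Customers table from the INSERT above, and the COALESCE call is just one illustrative way to display a substitute value.

-- Find customers whose email is still unknown
SELECT Customer_ID, Name
FROM Customers
WHERE Email IS NULL;

-- Show a substitute display value without storing a misleading placeholder
SELECT Name, COALESCE(Email, 'not provided') AS Email
FROM Customers;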
Constraints: -
In the relational model, constraints play a critical role in maintaining the integrity and consistency of
data within relational databases.
Constraints in the relational model are rules applied to relational tables (or relations) to
restrict the types of data that can be stored in them. They ensure that the data conforms to certain
standards and rules defined by the database schema.
1. Domain constraint
2. Key Constraint
3. Integrity Constraint
1. Domain Constraint: - A domain constraint in the relational
model is a rule that restricts the set of permissible values that can be stored in a particular
attribute (column) of a relation (table). It defines the valid values for an attribute in terms of
data type, length, range, or a specific set of values. Domain constraints ensure that data
entered into the database is valid and conforms to the defined specifications.
Example: -
CREATE TABLE Customers (Customer_ID INT PRIMARY KEY, Name VARCHAR(100)
NOT NULL, Age INT CHECK (Age BETWEEN 18 AND 100));
Age: This column has a check constraint that restricts the values to be between 18 and
100.
•Domain constraints ensure that only valid data types and values are stored in a column.
•All entries in a column will conform to the same standards, making it easier to
understand and analyze the data.
•Domain constraints help prevent common data entry errors at the point of data
insertion. By restricting the range of acceptable values, they minimize the chances of
user mistakes.
•When data is validated at the time of entry, it reduces the need for extensive data
cleaning and correction later. This simplifies database maintenance and management.
•Domain constraints can be used to enforce business rules directly within the
database schema.
•Domain constraints contribute to the overall integrity of the database.
2. Key Constraint: -
A key constraint is a statement that a certain minimal subset of the fields of a relation is a
unique identifier for a tuple. A set of fields that uniquely identifies a tuple according to a key
constraint is called a candidate key.
•Two distinct tuples in a legal instance (an instance that satisfies all ICs, including the
key constraint) cannot have identical values in all the fields of a key.
•No subset of the set of fields in a key is a unique identifier for a tuple.
There are different types of key constraints:
1. primary key: Uniquely identifies each record in a table. A primary Key must contain
unique values and cannot contain NULL Values.
2. Unique Key: Ensures that all values in a column (or a combination of columns) are
unique, but unlike a primary key it can contain NULL values.
Ex: - CREATE TABLE Students (Student_ID INT PRIMARY KEY, Name VARCHAR(100)
NOT NULL, Email VARCHAR(100) UNIQUE);
Student_ID: This column is defined as the primary key. It uniquely identifies each student and
cannot be NULL or duplicate.
Name: A non-nullable column to store the student's name.
Email: This column has a unique key constraint, ensuring that no two students can have the
same email address.
•Primary keys ensure that each record can be uniquely identified, which is essential for data
integrity.
•Key constraints prevent duplicate records, which helps maintain the accuracy of the data.
•Key constraints enable efficient data retrieval and indexing.
•Primary keys are often used in foreign key constraints, establishing relationships between tables
and maintaining referential integrity.
•Key constraints help organize data effectively, allowing for better data management and
retrieval strategies.
•They enable the creation of relationships between different tables, essential for normalized
database design.
3. Integrity Constraint: -
Integrity constraints in the relational model are rules that ensure the accuracy and
consistency of data within a database. These constraints are critical for maintaining the
integrity of the database by enforcing certain rules on the data entered.
•Entity Integrity: Ensures that each table has a primary key, and that this key cannot
contain NULL values.
•Referential Integrity: Ensures that relationships between tables remain consistent.
Specifically, it mandates that a foreign key must either be NULL or must match a primary
key in another table.
•Domain Integrity: Enforces valid entries for a given column through restrictions on
the types and ranges of data.
•User- Defined Integrity: Custom rules defined by users to enforce specific business rules or
requirements.
CREATE TABLE Students (Student_ID INT PRIMARY KEY, Name VARCHAR(100) NOT
NULL);
The Students table has a primary key (Student_ID) that uniquely identifies each student.
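To illustrate the referential integrity rule above, a hedged sketch that adds a hypothetical Enrollments table whose foreign key must either be NULL or match an existing Student_ID:

CREATE TABLE Enrollments (
    Enrollment_ID INT PRIMARY KEY,
    Student_ID    INT,                          -- must be NULL or match a Students row
    Course        VARCHAR(50),
    FOREIGN KEY (Student_ID) REFERENCES Students(Student_ID)
);

-- Accepted only if a student with Student_ID = 1 already exists
INSERT INTO Enrollments VALUES (1, 1, 'DBMS');
-- Rejected if no student with Student_ID = 99 exists (referential integrity violation)
-- INSERT INTO Enrollments VALUES (2, 99, 'DBMS');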
Importance of Integrity Constraint:
•Integrity constraints help ensure that the data stored in the database is accurate and conforms
to specified formats and rules.
•They promote consistency across the database by ensuring that all data conforms to defined rules.
•Referential integrity constraints ensure that relationships between tables are valid.
•Integrity constraints contribute to the overall quality of the data in the database.
•They allow the enforcement of business rules directly within the database schema.
Relational Algebra: -
Relational algebra is one of the two formal query languages within the relational model. It is
a procedural language. It provides a set of operations that take one or more
relations (tables) as input and produce a new relation as output.
•Fundamental Operations: - Fundamental operations are the basic operations that can be
performed on relational data. They are used to retrieve, manipulate, and combine data
from one or more relations (tables).
The fundamental operations of relational algebra:
1. Selection (σ):
•It is used to select required tuples of the relations.
•It retrieves rows from a relation that satisfy a specified condition (predicate).
2. Union (∪):
The union operation in relational algebra is the same as the union operation in set theory.
•The union operation combines the tuples of two relations, removing duplicates.
•It is denoted by "∪"
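For orientation only, here is how the selection and union operations roughly correspond to SQL (the Students tables and the Age attribute are hypothetical):

-- Selection σ Age > 20 (Students): tuples of Students satisfying the predicate
SELECT * FROM Students WHERE Age > 20;

-- Union Students_2023 ∪ Students_2024: combines tuples and removes duplicates
SELECT Name FROM Students_2023
UNION
SELECT Name FROM Students_2024;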
3. Intersection (⋂):
•The intersection operation returns the tuples that are present in both of the given relations.
•It is equivalent to selecting tuples that are common to both relations.
•It is denoted by "⋂"
Notation: Relation1 ⋂ Relation2
4. Natural join (⋈):
•A natural join is a type of join that automatically combines tuples from two relations based
on all common attributes.
•It is denoted by "⋈"
•Outer join:
•It is a type of join that returns not only the matching rows between
two tables but also the non-matching rows from one or both tables.
•Outer joins can be used to retrieve data even if one of the related tables does not have a
matching record, filling in the missing values with NULL.
1. Left outer join (⟕): - A left outer join returns all the rows from the left table, along with
the matching rows from the right table. If there is no match, NULL values are returned for the
columns from the right table.
Ex:-
Student Table Enrollment Table
Result:
Student_ID Name Course
1 Alice math
2 Bob science
3 Carol NULL
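The result above could be produced in SQL roughly as follows, assuming the Student and Enrollment tables have the columns shown in the result:

-- Left outer join: every student appears; Course is NULL for unenrolled students
SELECT s.Student_ID, s.Name, e.Course
FROM Student s
LEFT OUTER JOIN Enrollment e ON s.Student_ID = e.Student_ID;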
2. Right outer join (⟖): - A right outer join returns all rows from the right table, along with
the matching rows from the left table. If there is no match, NULL values are returned for the
columns from the left table.
Result:
3. Full outer join (⟗): - A full outer join returns all the rows from both tables. If there is a
match between the tables, it shows the matching rows. If no match is found, NULL
values are returned for the missing columns from either table.
Result:
Relational Calculus:
Relational calculus is an alternative to relational algebra. The relational calculus is a non-
procedural language in relational databases. Relational calculus focuses on what results to
obtain, leaving the specific execution details to the database system. In relational calculus,
two important quantifiers are used to express conditions in queries:
•Universal Quantifier (∀)
•Existential Quantifier (∃)
1. Universal Quantifier: - The universal quantifier states that a predicate must be true for all
possible values in the specified domain.
Notation: ∀x P(x)
2. Existential Quantifier: - The existential Quantifier states that a predicate is true for at
least one value in the specified domain.
Notation: ∃x P(x)
This means "there exist at least one value of x such that the
predicate P(x) is true”.
In tuple relational calculus, a condition such as T.Age > 20 restricts the result to only those
students whose age is greater than 20.
In domain relational calculus, a query has the general form {<x1, x2, ..., xn> | P(x1, x2, ..., xn)},
where:
•x1, x2, ..., xn are domain variables representing values of attributes.
•P(x1, x2, ..., xn) is a predicate (logical condition) that must be true for the variables.
Example: -
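As a worked example (assuming a Students relation with Name and Age attributes), the tuple-calculus query for students older than 20 and its SQL counterpart:

-- Tuple relational calculus:  { T.Name | T ∈ Students ∧ T.Age > 20 }
-- Equivalent SQL query:
SELECT Name FROM Students WHERE Age > 20;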
BASIC SQL:
Simple Database Schema:
A Simple database schema refers to the structure or blueprint of a database, defining how data is
organized into tables and how the relationships between these tables are established. Primary
components are:
1. Tables:
•A table is a collection of related data organised in rows and columns
2. Columns (Attributes):
•Columns are the vertical division of a table that represent specific attributes of the entity.
•Each column holds a particular piece of data for each record (row) in the table.
Example: In the students table, Attributes are Student_ID, FirstName, LastName, Age and
Department
3. Rows (Records):
•Rows are the horizontal entries in a table that represent Individual records or instances of the
entity.
•Each row contains data for each attribute defined by the columns.
Example:
4. Primary Key:
•A primary key is a column (or set of columns) that uniquely identifies each row in a table.
5. Foreign Key:
•A foreign key is a column that creates a relationship between two tables by referencing the
primary key of another table.
6. Unique Key: -
• Ensures that all values in a column or set of columns are unique.
•Schemas provide a structured way to organize and store data.
Data types:
In SQL, data types define the type of data that can be stored in each column of a table. They
ensure consistency and accuracy in data storage by restricting the kinds of data that can be
entered.
• DECIMAL / NUMERIC: Used for storing numbers with fixed decimal points.
• CHAR(n): Fixed length string where n defines the number of characters. If the text
is shorter than n, it is padded with spaces.
• TIMESTAMP: stores both date and time, typically used to track events.
• BLOB: Binary Large object. Used for storing large amounts of binary data like
images or audio.
•UUID: Stores a universally unique identifier, typically used for primary keys instead of
integer IDs.
• ENUM: Allows a column to have one value from a predefined list of possible values.
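A hedged illustration combining several of these data types in one table definition (the Product table and its columns are made up; ENUM and BLOB follow MySQL syntax):

CREATE TABLE Product (
    Product_ID INT PRIMARY KEY,
    Code       CHAR(8),                            -- fixed-length, space-padded string
    Name       VARCHAR(100),
    Price      DECIMAL(10,2),                      -- fixed decimal points
    Status     ENUM('active', 'discontinued'),     -- one value from a predefined list
    Image      BLOB,                               -- large binary data such as an image
    Created_At TIMESTAMP DEFAULT CURRENT_TIMESTAMP -- tracks when the row was created
);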
SQL COMMANDS
SQL (Structured Query Language) is a standard language used to manage and manipulate
relational databases. It allows users to create, retrieve, update, and delete data stored in a
database. SQL is used for tasks like querying data, inserting records, updating data and
creating database structures such as tables.
SQL commands are:
1.DDL (Data Definition Language): - DDL is a subset of SQL commands used to define and
manage database structures, such as tables, schemas, and indexes. DDL commands focus on
the creation, modification, and deletion of database objects.
DDL commands are:
1. CREATE: - Used to create database objects such as tables, views, or schemas.
→ table created
→ desc student;
Name Type
ST_ID NUMBER(10)
ST_PH_NO VARCHAR(10)
2. ALTER: - Used to modify the structure of an existing table.
Syntax: - alter table table_name add (column_name data_type(size));
Example: - alter table student add (st_address varchar(10));
→ table altered
→ desc student;
Name Type
ST_ID NUMBER(10)
ST_PH_NO VARCHAR(10)
ST_ADDRESS VARCHAR(10)
3. DROP: - Removes a table definition and all of its data.
→ table dropped
2. DML (Data Manipulation Language): - DML commands operate on the data stored in tables.
1. INSERT: - Adds new rows to a table. → 1 row inserted
2. UPDATE: - Modifies existing rows in a table. → 1 row updated
3. DELETE: - Deletes records from a table; the space for the records remains. → 1 row deleted
GRANT: - Gives a user privileges on a table.
→ This gives the user john permission to SELECT and INSERT data on the employees table.
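Since the statements behind the outputs above are not shown, here is a hedged, self-contained sequence that would produce such outputs (Oracle-style types to match the desc listing; the exact column values are assumptions):

-- DDL: create and then alter the table
CREATE TABLE student (st_id NUMBER(10), st_ph_no VARCHAR(10));   -- table created
ALTER TABLE student ADD (st_address VARCHAR(10));                -- table altered

-- DML: insert, update and delete rows
INSERT INTO student VALUES (1, '9876543210', 'Hyderabad');       -- 1 row inserted
UPDATE student SET st_address = 'Chennai' WHERE st_id = 1;       -- 1 row updated
DELETE FROM student WHERE st_id = 1;                             -- 1 row deleted

-- DDL: remove the table definition entirely
DROP TABLE student;                                              -- table dropped

-- DCL: give user john query and insert privileges on the employees table
GRANT SELECT, INSERT ON employees TO john;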
SELECT: It is used to query and retrieve data from a database. It can retrieve data from one or
more tables based on specific conditions.
The SQL WHERE clause allows filtering of records in queries. Whether you're retrieving
data, updating records, or deleting entries from a database, the WHERE clause plays an
important role in defining which rows will be affected by the query. Without it, SQL
queries would return all rows in a table, making it difficult to target specific data.
In this article, we will learn the WHERE clause in detail—from basic concepts to advanced
ones. We’ll cover practical examples, discuss common operators, provide optimization tips,
and address real-world use cases.
What is the SQL WHERE Clause?
The SQL WHERE clause is used to specify a condition while fetching or modifying data in a
database. It filters the rows that are affected by the SELECT, UPDATE, DELETE,
or INSERT operations. The condition can range from simple comparisons to complex
expressions, enabling precise targeting of the data.
Syntax:
SELECT column1,column2 FROM table_name WHERE column_name operator value;
Parameter Explanation:
column1,column2: fields in the table
table_name: name of table
column_name: name of field used for filtering the data
operator: operation to be considered for filtering
value: exact value or pattern to get related data in the result
Examples of WHERE Clause in SQL
First, create a basic employee table structure in SQL for performing all the WHERE clause
operations.
Query:
CREATE TABLE Emp1(EmpID INT PRIMARY KEY, Name VARCHAR(50), Country
VARCHAR(50), Age int(2), mob int(10));
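A small usage sketch on this table (the sample rows and values are made up for illustration):

-- Sample rows
INSERT INTO Emp1 VALUES (1, 'Aman', 'India', 24, 987654321);
INSERT INTO Emp1 VALUES (2, 'Riya', 'USA', 31, 912345678);

-- The WHERE clause restricts the result to rows matching the condition
SELECT Name, Country FROM Emp1 WHERE Age > 25;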
Here we have added 100 to each employee's salary, i.e., an addition operation on a
single column.
Let’s perform addition of 2 columns:
SELECT employee_id, employee_name, salary, salary + employee_id
AS "salary + employee_id" FROM addition;
Output:
employee_id employee_name salary salary + employee_id
2 rr 55000 55002
Here we have done addition of 2 columns with each other i.e, each employee’s employee_id
is added with its salary.
Subtraction (-) :
It is used to perform subtraction operations on data items; the items can be either a single
column or multiple columns.
Implementation:
SELECT employee_id, employee_name, salary, salary - 100
AS "salary - 100" FROM subtraction;
Output:
Here we have subtracted 100 from each employee's salary, i.e., a subtraction operation on a
single column.
Let’s perform subtraction of 2 columns:
SELECT employee_id, employee_name, salary, salary - employee_id
AS "salary - employee_id" FROM subtraction;
Output:
employee_id employee_name salary salary - employee_id
Here we have subtracted two columns from each other, i.e., each employee's
employee_id is subtracted from its salary.
Division (/) : For Division refer this link- Division in SQL
Multiplication (*) :
It is used to perform multiplication of data items.
Implementation:
SELECT employee_id, employee_name, salary, salary * 100
AS "salary * 100" FROM addition;
Output:
Here we have multiplied each employee's salary by 100, i.e., a multiplication
operation on a single column.
Let’s perform multiplication of 2 columns:
SELECT employee_id, employee_name, salary, salary * employee_id
AS "salary * employee_id" FROM addition;
Output:
employee_id employee_name salary salary * employee_id
Here we have multiplied two columns with each other, i.e., each employee's
employee_id is multiplied with its salary.
Modulus ( % ) :
It is used to get the remainder when one value is divided by another.
Implementation:
SELECT employee_id, employee_name, salary, salary % 25000
AS "salary % 25000" FROM addition;
Output:
employee_id employee_name salary salary % 25000
1 Finch 25000 0
Here we have taken each employee's salary modulo 25000, i.e., a modulus operation on a
single column.
Let’s perform modulus operation between 2 columns:
SELECT employee_id, employee_name, salary, salary % employee_id
AS "salary % employee_id" FROM addition;
Output:
employee_id employee_name salary salary % employee_id
1 Finch 25000 0
2 Peter 55000 0
3 Warner 52000 1
4 Watson 12312 0
Here we have taken the modulus of two columns with each other, i.e., each employee's salary is
divided by its id and the corresponding remainder is shown.
A common use of modulus is to check whether a number is even or odd: if a number divided
by 2 gives 1 as the remainder it is odd, and if it gives 0 as the remainder it is even.
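For instance, a query of the following shape (reusing the addition table from the examples above) flags each salary as even or odd:

-- salary % 2 is 0 for an even salary and 1 for an odd one
SELECT employee_id, employee_name, salary,
       CASE WHEN salary % 2 = 0 THEN 'Even' ELSE 'Odd' END AS parity
FROM addition;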
Concept of NULL :
If we perform any arithmetic operation on NULL, the answer is always NULL.
Implementation:
SELECT employee_id, employee_name, salary, type, type + 100
AS "type+100" FROM addition;
Output:
Here the output is always NULL, since performing any arithmetic operation on NULL results
in a NULL value.
Example:
In the example below, we will see how these logical operators work by creating a
database.
Step 1: Creating a Database
In order to create a database, we need to use the CREATE operator.
Query
CREATE DATABASE xstream_db;
Step 2: Create table employee
In this step, we will create the table employee inside the xstream_db database.
Query
CREATE TABLE employee (emp_id INT, emp_name VARCHAR(255),
emp_city VARCHAR(255),
emp_country VARCHAR(255),
PRIMARY KEY (emp_id));
Create Table
In order to insert the data inside the database, we need to use the INSERT operator.
Query
INSERT INTO employee VALUES (101, 'Utkarsh Tripathi', 'Varanasi', 'India'),
(102, 'Abhinav Singh', 'Varanasi', 'India'),
(103, 'Utkarsh Raghuvanshi', 'Varanasi', 'India'),
(104, 'Utkarsh Singh', 'Allahabad', 'India'),
(105, 'Sudhanshu Yadav', 'Allahabad', 'India'),
(106, 'Ashutosh Kumar', 'Patna', 'India');
Insert Value
Output
employee Table
Now the given below is the list of different logical operators.
AND Operator
The AND operator is used to combine two or more conditions; it is true only when all the
conditions are satisfied.
Query
SELECT * FROM employee WHERE emp_city = 'Allahabad' AND emp_country = 'India';
Output
IN Operator
It is used in place of multiple OR conditions in SELECT, INSERT, UPDATE,
or DELETE statements. We can also use NOT IN to exclude rows whose value appears in the
given list.
Query
SELECT * FROM employee WHERE emp_city IN ('Allahabad', 'Patna');
Output
NOT Operator
The NOT operator negates a condition, returning the rows for which the condition is false.
Query
SELECT * FROM employee WHERE emp_city NOT LIKE 'A%';
Output
OR Operator
The OR operator is used to combine two or more conditions; it is true when at least one of the
conditions is satisfied.
Query
SELECT * FROM employee WHERE emp_city = 'Varanasi' OR emp_country = 'India';
Output
LIKE Operator
In SQL, the LIKE operator is used in the WHERE clause to search for a specified pattern in a
column.
% – Matches zero or more characters.
_ – Matches exactly one character (fixed length).
Query
SELECT * FROM employee WHERE emp_city LIKE 'P%';
Output
BETWEEN Operator
The SQL BETWEEN condition allows you to easily test if an expression is within a range of
values (inclusive).
Query
SELECT * FROM employee WHERE emp_id BETWEEN 101 AND 104;
Output
ALL Operator
The ALL operator returns TRUE if all of the subquery values match the condition.
The ALL operator is used with SELECT, WHERE, and HAVING statements.
Query
SELECT * FROM employee WHERE emp_id = ALL
(SELECT emp_id FROM employee WHERE emp_city = 'Varanasi');
Output
ANY Operator
The ANY operator:
It returns a boolean value as a result
It returns TRUE if ANY of the subquery values match the condition
Query
SELECT * FROM employee WHERE emp_id = ANY
(SELECT emp_id FROM employee WHERE emp_city = 'Varanasi');
Output
EXISTS Operator
In SQL, the EXISTS operator is used to check whether the result of a correlated nested query is
empty or not.
The EXISTS operator is used with SELECT, UPDATE, INSERT or DELETE statements.
Query
SELECT emp_name FROM employee WHERE EXISTS
(SELECT emp_id FROM employee WHERE emp_city = 'Patna');
Output
SOME Operator
In SQL, the SOME operator is used with comparison operators (<, >, =, <=, etc.) to compare
a value with the result of a subquery.
Query
SELECT * FROM employee WHERE emp_id < SOME
(SELECT emp_id FROM employee WHERE emp_city = 'Patna');
Output
In MySQL, functions play a crucial role in performing various operations on data, such
as calculations, string manipulations, and date handling. These built-in functions simplify
complex queries and data transformations, making it easier to manage and analyze data
within a database.
In this article, we will look at the different categories of MySQL functions, with definitions
and examples in each category.
Functions in MySQL
In MySQL, functions are a fundamental part of the SQL language, enabling us to perform
calculations, manipulate data and retrieve information.
The functions in MySQL can edit rows and tables, alter strings, and help us to manage
organized and easy-to-navigate databases.
A function is a special type of predefined command set that performs some operation and
returns a value. Functions operate on zero, one, two, or more values that are provided to them.
The values that are provided to functions are called parameters or arguments.
The MySQL functions have been categorized into various categories, such as String
functions, Mathematical functions, Date and Time functions, etc.
String Functions
Numeric Functions
Date and Time Functions
String functions
String functions are used to perform an operation on input string and return an output string.
Following are the string functions defined in SQL:
1. CONCAT
Purpose: Combines two or more strings into a single string.
Syntax:
CONCAT(string1, string2, ...)
Example:
SELECT CONCAT('Hello', ' ', 'World') AS Result;
Result:
Result -------- Hello World
2. LENGTH / LEN
Purpose: Returns the length of the string.
Syntax:
LENGTH(string)
Example:
SELECT LENGTH('Database') AS Length;
Result:
Length -------- 8
3. UPPER
Purpose: Converts all characters in a string to uppercase.
Syntax:
UPPER(string)
Example:
SELECT UPPER('hello') AS UpperCase;
Result:
UpperCase -------- HELLO
4. LOWER
Purpose: Converts all characters in a string to lowercase.
Syntax:
LOWER(string)
Example:
SELECT LOWER('WORLD') AS LowerCase;
Result:
LowerCase -------- world
5. SUBSTRING / SUBSTR
Purpose: Extracts a part of the string.
Syntax:
SUBSTRING(string, start_position, length)
Example:
SELECT SUBSTRING('Database', 1, 4) AS Sub;
Result:
Sub -------- Data
6. TRIM
Purpose: Removes leading, trailing, or both spaces (or characters) from a string.
Syntax:
TRIM([BOTH | LEADING | TRAILING] 'char' FROM string)
Example:
SELECT TRIM(' SQL ') AS Trimmed;
Result:
Trimmed -------- SQL
7. REPLACE
Purpose: Replaces all occurrences of a substring with another substring.
Syntax:
REPLACE(string, old_substring, new_substring)
Example:
SELECT REPLACE('Learn SQL', 'SQL', 'DBMS') AS Replaced;
Result:
Replaced -------- Learn DBMS
8. REVERSE
Purpose: Reverses the characters of a string.
Syntax:
REVERSE(string)
Example:
SELECT REVERSE('DBMS') AS Reversed;
Result:
Reversed -------- SMBD
9. INSTR / CHARINDEX
Purpose: Finds the position of a substring in a string.
Syntax:
For MySQL: INSTR(string, substring)
For SQL Server: CHARINDEX(substring, string)
Example:
SELECT INSTR('Database', 'base') AS Position;
Result:
Position -------- 5
10. LEFT
Purpose: Returns the specified number of characters from the left of the string.
Syntax:
LEFT(string, number_of_characters)
Example:
SELECT LEFT('Database', 4) AS LeftPart;
Result:
LeftPart -------- Data
11. RIGHT
Purpose: Returns the specified number of characters from the right of the string.
Syntax:
RIGHT(string, number_of_characters)
Example:
SELECT RIGHT('Database', 4) AS RightPart;
Result:
RightPart -------- base
12. LPAD
Purpose: Pads a string on the left side with a specified character.
Syntax:
LPAD(string, length, pad_string)
Example:
SELECT LPAD('SQL', 5, '*') AS LeftPad;
Result:
LeftPad -------- **SQL
13. RPAD
Purpose: Pads a string on the right side with a specified character.
Syntax:
RPAD(string, length, pad_string)
Example:
SELECT RPAD('SQL', 5, '*') AS RightPad;
Result:
RightPad -------- SQL**
14. CONCAT_WS
Purpose: Combines strings with a specified separator.
Syntax:
CONCAT_WS(separator, string1, string2, ...)
Example:
SELECT CONCAT_WS('-', '2024', '06', '23') AS Date;
Result:
Date -------- 2024-06-23
15. ASCII
Purpose: Returns the ASCII value of the first character of a string.
Syntax:
ASCII(string)
Example:
SELECT ASCII('A') AS AsciiValue;
Result:
AsciiValue -------- 65
16. CHAR
Purpose: Converts an ASCII code to its corresponding character.
Syntax:
CHAR(ascii_value)
Example:
SELECT CHAR(65) AS Character;
Result:
Character -------- A
1. ABS()
Purpose: Returns the absolute value of a number.
Syntax:
SELECT ABS(column_name) AS Result FROM table_name;
Query:
SELECT ProductName, ABS(Discount) AS AbsoluteDiscount FROM Products;
Result:
ProductName AbsoluteDiscount
Laptop 5.5
Phone 10.0
Tablet 7.5
Headphones 15.0
Monitor 0.0
2. CEIL()
Purpose: Returns the smallest integer greater than or equal to a number.
Syntax:
SELECT CEIL(column_name) AS Result FROM table_name;
Query:
SELECT ProductName, CEIL(Price) AS CeilPrice FROM Products;
Result:
ProductName CeilPrice
Laptop 50001
Phone 20001
Tablet 15001
Headphones 2000
Monitor 12000
3. FLOOR()
Purpose: Returns the largest integer less than or equal to a number.
Syntax:
SELECT FLOOR(column_name) AS Result FROM table_name;
Query:
SELECT ProductName, FLOOR(Price) AS FloorPrice FROM Products;
Result:
ProductName FloorPrice
Laptop 50000
Phone 20000
Tablet 15000
Headphones 2000
Monitor 12000
4. ROUND()
Purpose: Rounds a number to the specified number of decimal places.
Syntax:
SELECT ROUND(column_name, decimal_places) AS Result FROM table_name;
Query:
SELECT ProductName, ROUND(Price, 1) AS RoundedPrice FROM Products;
Result:
ProductName RoundedPrice
Laptop 50000.8
Phone 20000.5
Tablet 15000.3
Headphones 2000.0
Monitor 12000.0
5. MOD()
Purpose: Returns the remainder of a division.
Syntax:
SELECT MOD(column1, column2) AS Result FROM table_name;
Query:
SELECT ProductName, MOD(Quantity, 3) AS QuantityRemainder FROM Products;
Result:
ProductName QuantityRemainder
Laptop 1
Phone 1
Tablet 0
Headphones 2
Monitor 2
6. POWER()
Purpose: Returns the value of a number raised to a power.
Syntax:
SELECT POWER(column_name, exponent) AS Result FROM table_name;
Query:
SELECT ProductName, POWER(2, 3) AS PowerResult FROM Products;
Result:
ProductName PowerResult
Laptop 8
Phone 8
Tablet 8
Headphones 8
Monitor 8
7. SQRT()
Purpose: Returns the square root of a number.
Syntax:
SELECT SQRT(column_name) AS Result FROM table_name;
Query:
SELECT ProductName, SQRT(Price) AS PriceSquareRoot FROM Products;
Result:
ProductName PriceSquareRoot
Laptop 223.61
Phone 141.42
Tablet 122.47
Headphones 44.72
Monitor 109.54
CURDATE()
Returns the current date.
Query:
SELECT CURDATE();
Output:
CURTIME()
Returns the current time.
Query:
SELECT CURTIME();
Output:
DATE()
Extracts the date part of a date or date/time expression. Example: For the below table
named ‘Test’
Id Name BirthTime
Query:
SELECT Name, DATE(BirthTime)
AS BirthDate FROM Test;
Output:
Name BirthDate
Pratik 1996-09-26
EXTRACT()
Returns a single part of a date/time.
Syntax
EXTRACT(unit FROM date);
Several units can be considered but only some are used such as MICROSECOND,
SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, YEAR, etc. And
‘date’ is a valid date expression. Example: For the below table named ‘Test’
Id Name BirthTime
Query:
SELECT Name, Extract(DAY FROM
BirthTime) AS BirthDay FROM Test;
Output:
Name Birthday
Pratik 26
Query:
SELECT Name, Extract(YEAR FROM BirthTime)
AS BirthYear FROM Test;
Output:
Name BirthYear
Pratik 1996
Query:
SELECT Name, Extract(SECOND FROM
BirthTime) AS BirthSecond FROM Test;
Output:
Name BirthSecond
Pratik 581
DATE_ADD()
Adds a specified time interval to a date.
Syntax:
DATE_ADD(date, INTERVAL expr type);
Where, date – valid date expression, and expr is the number of intervals we want to add.
and type can be one of the following: MICROSECOND, SECOND, MINUTE, HOUR,
DAY, WEEK, MONTH, QUARTER, YEAR, etc. Example: For the below table named
‘Test’
Id Name BirthTime
Query:
SELECT Name, DATE_ADD(BirthTime, INTERVAL
1 YEAR) AS BirthTimeModified FROM Test;
Output:
Name BirthTimeModified
Query:
SELECT Name, DATE_ADD(BirthTime, INTERVAL 30 DAY) AS BirthDayModified
FROM Test;
Output:
Name BirthDayModified
Query:
SELECT Name, DATE_ADD(BirthTime, INTERVAL
4 HOUR) AS BirthHourModified FROM Test;
Output:
Name BirthSecond
DATEDIFF(): Returns the number of days between two date values.
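A small usage sketch (the dates are illustrative):

SELECT DATEDIFF('2017-01-13', '2017-01-03') AS DaysBetween;   -- returns 10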
DATE_FORMAT()
Displays date/time data in different formats.
Syntax:
DATE_FORMAT(date,format);
the date is a valid date and the format specifies the output format for the date/time. The
formats that can be used are:
%a-Abbreviated weekday name (Sun-Sat)
%b-Abbreviated month name (Jan-Dec)
%c-Month, numeric (0-12)
%D-Day of month with English suffix (0th, 1st, 2nd, 3rd)
%d-Day of the month, numeric (00-31)
%e-Day of the month, numeric (0-31)
%f-Microseconds (000000-999999)
%H-Hour (00-23)
%h-Hour (01-12)
%I-Hour (01-12)
%i-Minutes, numeric (00-59)
%j-Day of the year (001-366)
%k-Hour (0-23)
%l-Hour (1-12)
%M-Month name (January-December)
%m-Month, numeric (00-12)
%p-AM or PM
%r-Time, 12-hour (hh:mm: ss followed by AM or PM)
%S-Seconds (00-59)
%s-Seconds (00-59)
%T-Time, 24-hour (hh:mm: ss)
%U-Week (00-53) where Sunday is the first day of the week
%u-Week (00-53) where Monday is the first day of the week
%V-Week (01-53) where Sunday is the first day of the week, used with %X
%v-Week (01-53) where Monday is the first day of the week, used with %x
%W-Weekday name (Sunday-Saturday)
%w-Day of the week (0=Sunday, 6=Saturday)
%X-Year for the week where Sunday is the first day of the week, four digits, used with %V
%x-Year for the week where Monday is the first day of the week, four digits, used with %v
%Y-Year, numeric, four digits
%y-Year, numeric, two digits
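For example, formatting an assumed date value with some of the specifiers above:

SELECT DATE_FORMAT('2024-06-23 14:05:00', '%W, %M %d %Y %h:%i %p') AS Formatted;
-- Result: Sunday, June 23 2024 02:05 PM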
SQL Query
Now, to retrieve data from the tables we write the following query to view the details of
both the tables.
SELECT students.Enrolment_No, students.Name, students.Roll_No, class_details.Class,
class_details.Division
FROM students
JOIN class_details ON students.Enrolment_No = class_details.Enrolment_No;
Query for retrieving records from both the tables
The above query retrieves records using the primary key of one table and the matching foreign
key of the other table. The query uses 'Enrolment_No' for referencing the records from both
tables: it is the primary key of the students table and a foreign key in the class_details table.
Output:
Explanation: From the above output we can see that each record of the first
table is related to exactly one record in the next table. Notice that in a
one-to-one relationship, multiple matching records are not possible.
one-to-one relationship(vice-versa)
In the definition, we saw that the one-to-one relationship is also vice-versa, this means
the class_details table is also related to exactly one record in students table.
Explanation: We get the same number of records (5) whether we retrieve from the students
table to the class_details table or from class_details to students, and each record is
associated with only one record in the other table. So the one-to-one relation holds between
students and class_details in both directions.
One-to-Many Relationship
In this relationship, each row/record of the first table can be related to multiple
rows in the second table.
Example:
Let's consider 2 tables "Parent" and "Child". Now the "Parent" table has columns
"ParentID", "Name"... and so on, and "Child" has columns or attributes "ChildID",
"Name", "Age", "ParentID" and so on.
In the above example we used the logic of parent and child and you can see that a parent
can have many children, but a child cannot have many/multiple parents.
Create tables 'parent' and 'child' as mentioned in the above table and let's see one-to-
many relation between these 2 tables.
one-to-many relationship between parent and child table
We can see that each record from the parent table is associated with one or more
records in the child table, representing a one-to-many relationship. A one-to-many
relationship requires that the reverse does not hold: if the reverse condition for a
one-to-many relationship also becomes true, then it is a many-to-many relationship.
SQL Query:
Let's see the query to retrieve data representing one-to-many relationships parent-to-child:
SELECT parent.parentID, parent.Name, parent.age, parent.Address, child.ChildID,
child.Name, child.age
FROM parent
JOIN child ON parent.parentID = child.ParentID;
Key Constraints
Keys are attributes used to uniquely identify an entity within its entity set. An
entity set can contain multiple keys, but out of them one key will be the primary key. A primary
key is always unique and does not contain any null value in the table.
Example:
20 Chandigarh
21 Punjab
25 Delhi
Conclusion
Integrity constraints act as the backbone of a reliable and robust database. They ensure that
the data stored is trustworthy, consistent and accurate within the database. By
implementing integrity constraints we can improve the quality of the data stored in the database,
so that as the database continues to grow it does not become inconsistent or inaccurate.
Example: Find employees whose salary is greater than the average salary.
Table: Employees
EmployeeID Name Salary
1 Alice 4000
2 Bob 6000
3 Carol 8000
4 David 7000
Query:
SELECT Name, Salary FROM Employees WHERE Salary > (SELECT AVG(Salary) FROM
Employees);
Execution:
Subquery: (SELECT AVG(Salary) FROM Employees) calculates the average salary (6250).
Outer Query: Fetches employees with Salary > 6250.
Result:
Name Salary
Carol 8000
David 7000
Example: Find employees who work in departments located in New York.
Departments
DeptID DeptName Location
1 HR New York
2 IT California
Employees
EmployeeID Name DeptID
1 Alice 1
2 Bob 2
3 Carol 3
Query:
SELECT Name FROM Employees WHERE DeptID IN (SELECT DeptID FROM
Departments WHERE Location = 'New York');
Execution:
Subquery: Fetches DeptID of departments in New York → (1, 3).
Outer Query: Fetches employees with matching DeptID.
Result:
Name
Alice
Carol
3. Correlated Subquery
A correlated subquery depends on the outer query for its values.
Example: Find employees whose salary is greater than the average salary of their department.
Tables:
Employees
EmployeeID Name Salary DeptID
1 Alice 4000 1
2 Bob 6000 2
3 Carol 7000 1
4 David 8000 2
Query:
SELECT Name, Salary FROM Employees E1 WHERE Salary > (SELECT AVG(Salary)
FROM Employees E2 WHERE E1.DeptID = E2.DeptID);
Execution:
The subquery calculates the average salary for each department (DeptID).
The outer query checks if the employee’s salary is greater than their department’s average
salary.
Result:
Name Salary
Carol 7000
David 8000
Department averages computed by the subquery:
DeptID AVG_Salary
1 5500
2 7000
5. Nested Query with EXISTS
The EXISTS operator checks whether a subquery returns any rows.
Example: Find employees who belong to departments located in New York.
Query:
SELECT Name FROM Employees E WHERE EXISTS (SELECT 1 FROM Departments D
WHERE E.DeptID = D.DeptID AND D.Location = 'New York');
Execution:
The subquery checks if a matching DeptID exists with Location = 'New York'.
If true, the outer query includes the employee.
Result:
Name
Alice
Carol
Subqueries in SQL
A subquery is a query nested inside another query. The result of the subquery is used by the
main query to perform further operations. Subqueries are also called inner queries, and the
main query is called the outer query.
Syntax of Subqueries
SELECT column1, column2 FROM table1 WHERE column_name OPERATOR (SELECT
column_name FROM table2 WHERE condition);
The subquery is placed inside parentheses ().
The result of the subquery can be a single value, multiple rows, or even a table.
Types of Subqueries
Single-Row Subquery – Returns one row and one column.
Multi-Row Subquery – Returns multiple rows.
Multi-Column Subquery – Returns multiple columns.
Correlated Subquery – Subquery depends on the outer query.
Nested Subquery – Subqueries inside other subqueries.
Examples of Subqueries
1. Single-Row Subquery
Returns a single value and uses operators like =, >, <, etc.
Example: Find employees whose salary is greater than the average salary.
Table: Employees
1 Alice 4000
2 Bob 6000
3 Carol 8000
4 David 7000
Query:
SELECT Name, Salary FROM Employees WHERE Salary > (SELECT AVG(Salary) FROM
Employees);
Subquery: Calculates AVG(Salary).
Outer Query: Finds employees with salary greater than the average.
Result:
Name Salary
Carol 8000
David 7000
2. Multi-Row Subquery
Returns multiple rows and is used with operators like IN, ANY, ALL.
Example: Find employees who work in departments located in New York.
Tables:
Departments
DeptID DeptName Location
1 HR New York
2 IT California
Employees
EmployeeID Name DeptID
1 Alice 1
2 Bob 2
3 Carol 3
Query:
SELECT Name FROM Employees WHERE DeptID IN (SELECT DeptID FROM
Departments WHERE Location = 'New York');
Subquery: Fetches DeptID values for departments located in New York.
Outer Query: Finds employees whose DeptID matches.
Result:
Name
Alice
Carol
3. Multi-Column Subquery
Returns multiple columns and is often used in comparison.
Example: Find employees with the same salary and department as 'Alice'.
Query:
SELECT Name, Salary, DeptID FROM Employees WHERE (Salary, DeptID) = (SELECT
Salary, DeptID FROM Employees WHERE Name = 'Alice');
Subquery: Returns Salary and DeptID for Alice.
Outer Query: Finds employees with the same Salary and DeptID.
Result:
Name Salary DeptID
Alice 4000 1
4. Correlated Subquery
A correlated subquery depends on the outer query for its values. It is executed for each row
in the outer query.
Example: Find employees whose salary is greater than the average salary of their department.
Query:
SELECT Name, Salary FROM Employees E1 WHERE Salary > (SELECT AVG(Salary)
FROM Employees E2 WHERE E1.DeptID = E2.DeptID);
Subquery: Calculates the average salary for each department.
Outer Query: Checks if an employee’s salary is greater than the average salary of their
department.
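The list of subquery types above also mentions nested subqueries (a subquery inside another subquery), but no example is worked out for them; the following is a minimal sketch using the same Employees and Departments tables, finding employees who earn more than the average salary of the departments located in New York:
SELECT Name, Salary
FROM Employees
WHERE Salary > (SELECT AVG(Salary)
                FROM Employees
                WHERE DeptID IN (SELECT DeptID
                                 FROM Departments
                                 WHERE Location = 'New York'));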
Grouping in SQL
Grouping in SQL is done using the GROUP BY clause. It is used to group rows that have
the same values in specified columns into summary rows (like totals or averages). Often, it
is used with aggregate functions such as SUM, AVG, COUNT, MIN, and MAX.
Syntax
SELECT column_name, aggregate_function(column_name) FROM table_name WHERE
condition GROUP BY column_name;
column_name: The column(s) on which the grouping is performed.
aggregate_function: Functions like SUM, AVG, COUNT, etc., applied to the grouped data.
Key Points about GROUP BY
The GROUP BY clause groups rows with the same value in specified columns.
Aggregate functions operate on each group, not on individual rows.
The GROUP BY clause must appear after the WHERE clause but before the ORDER
BY clause.
Columns in the SELECT statement must either be part of the GROUP BY clause or used
within an aggregate function.
Example: Find the total salary paid in each department.
Table: Employees
EmployeeID Name Department Salary
1 Alice HR 4000
2 Bob IT 6000
3 Carol HR 5000
4 David IT 7000
Query:
SELECT Department, SUM(Salary) AS TotalSalary FROM Employees GROUP BY
Department;
Result:
Department TotalSalary
HR 9000
IT 13000
Finance 8000
Number of employees in each department:
Department TotalEmployees
HR 2
IT 2
Finance 1
Total salary for selected departments:
Department TotalSalary
IT 13000
Finance 8000
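The head count shown above can be produced with COUNT(*), and groups can be filtered with the HAVING clause mentioned in the summary below; a minimal sketch over the same Employees table (the 10000 threshold is only an illustrative value):
-- Number of employees in each department
SELECT Department, COUNT(*) AS TotalEmployees
FROM Employees
GROUP BY Department;

-- Keep only departments whose total salary exceeds 10000
SELECT Department, SUM(Salary) AS TotalSalary
FROM Employees
GROUP BY Department
HAVING SUM(Salary) > 10000;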
Query:
SELECT Department, JobTitle, SUM(Salary) AS TotalSalary FROM Employees GROUP
BY Department, JobTitle;
Result:
Department JobTitle TotalSalary
HR Manager 4000
HR Assistant 5000
IT Developer 13000
Total salary for a single department:
Department TotalSalary
IT 13000
Total salary per department, sorted from highest to lowest:
Department TotalSalary
IT 13000
HR 9000
Finance 8000
Summary
GROUP BY is used to group rows that have the same values into aggregated results.
It works with aggregate functions like SUM, AVG, COUNT, MIN, MAX.
Use the HAVING clause to filter groups based on aggregate results.
Columns in SELECT must be either in the GROUP BY clause or part of an aggregate
function. Use ORDER BY to sort grouped results.
Aggregation in SQL
Aggregation in SQL involves using aggregate functions to summarize or compute values
over a group of rows. These functions allow you to perform calculations like totals, averages,
counts, and other summary statistics.
Function Description
SUM() Returns the total of the values
AVG() Returns the average of the values
COUNT() Returns the number of rows
MIN() Returns the smallest value
MAX() Returns the largest value
Examples of Aggregation
1. SUM() – Total of Values
Problem: Calculate the total salary of all employees.
Table: Employees
EmployeeID Name Department Salary
1 Alice HR 4000
2 Bob IT 6000
3 Carol HR 5000
4 David IT 7000
Query:
SELECT SUM(Salary) AS TotalSalary FROM Employees;
Result:
TotalSalary
22000
2. AVG() – Average of Values
AverageSalary
5500
3. COUNT() – Number of Rows
TotalEmployees
4
4. MIN() / MAX() – Smallest and Largest Values
MinSalary MaxSalary
4000 7000
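The queries behind these summaries follow the same pattern as the SUM() example; a minimal sketch over the same Employees table:
SELECT AVG(Salary) AS AverageSalary FROM Employees;                         -- 5500
SELECT COUNT(*) AS TotalEmployees FROM Employees;                           -- 4
SELECT MIN(Salary) AS MinSalary, MAX(Salary) AS MaxSalary FROM Employees;   -- 4000, 7000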
Ordering in SQL
Ordering in SQL allows you to sort the result set of a query in ascending or descending
order based on one or more columns. The ORDER BY clause is used to specify the sorting
order.
Syntax of ORDER by
SELECT column1, column2, ... FROM table_name WHERE condition ORDER BY column1
[ASC|DESC], column2 [ASC|DESC], ...;
column1, column2, ...: The columns by which you want to sort the result set.
ASC (default): Sorts the result in ascending order (from smallest to largest, alphabetically A
to Z).
DESC: Sorts the result in descending order (from largest to smallest, alphabetically Z to A).
Table: Employees
EmployeeID Name Salary
1 Alice 4000
2 Bob 6000
3 Carol 5000
4 David 7000
Query:
SELECT EmployeeID, Name, Salary FROM Employees ORDER BY Salary ASC;
Result:
EmployeeID Name Salary
1 Alice 4000
3 Carol 5000
2 Bob 6000
4 David 7000
Query (descending order):
SELECT EmployeeID, Name, Salary FROM Employees ORDER BY Salary DESC;
Result:
EmployeeID Name Salary
4 David 7000
2 Bob 6000
3 Carol 5000
1 Alice 4000
Table: Employees
EmployeeID Name Department Salary
1 Alice HR 4000
2 Bob IT 6000
3 Carol HR 5000
4 David IT 7000
Query:
SELECT EmployeeID, Name, Department, Salary FROM Employees ORDER BY
Department ASC, Salary DESC;
Result:
EmployeeID Name Department Salary
3 Carol HR 5000
1 Alice HR 4000
4 David IT 7000
2 Bob IT 6000
Table: Employees
EmployeeID Name Salary
1 Alice 4000
2 Bob NULL
3 Carol 5000
4 David 7000
Query:
SELECT EmployeeID, Name, Salary FROM Employees ORDER BY Salary ASC NULLS
FIRST;
Result:
EmployeeID Name Salary
2 Bob NULL
1 Alice 4000
3 Carol 5000
4 David 7000
SQL Views
Views in SQL are a type of virtual table that simplifies how users interact with data across
one or more tables. Unlike traditional tables, a view in SQL does not store data on disk;
instead, it dynamically retrieves data based on a pre-defined query each time it’s accessed.
SQL views are particularly useful for managing complex queries, enhancing security, and
presenting data in a simplified format. In this guide, we will cover the SQL create view
statement, updating and deleting views, and using the WITH CHECK OPTION clause.
What is a View in SQL?
A view in SQL is a saved SQL query that acts as a virtual table. It can fetch data from one or
more tables and present it in a customized format, allowing developers to:
Simplify Complex Queries: Encapsulate complex joins and conditions into a single object.
Enhance Security: Restrict access to specific columns or rows.
Present Data Flexibly: Provide tailored data views for different users.
Demo SQL Database
We will be using these two SQL tables for examples.
StudentDetails
-- Create StudentDetails table
CREATE TABLE StudentDetails (
S_ID INT PRIMARY KEY,
NAME VARCHAR(255),
ADDRESS VARCHAR(255)
);
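The CREATE VIEW statement itself is not shown above, so here is a minimal sketch against the StudentDetails table just defined (the view name and the column selection are illustrative):
-- A view exposing only the id and name, hiding the ADDRESS column
CREATE VIEW StudentNames AS
SELECT S_ID, NAME
FROM StudentDetails;

-- The view is queried like an ordinary table
SELECT * FROM StudentNames;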
Relational Set Operations in SQL
SQL provides set operations to combine the results of two SELECT statements: UNION, INTERSECT, EXCEPT, and UNION ALL.
1. UNION
The UNION operator combines the results of two SELECT statements, removing duplicates.
If you want to include duplicate rows, use UNION ALL.
Syntax:
SELECT column1, column2, ... FROM table1 UNION SELECT column1, column2, ...
FROM table2;
Key points:
The columns in the SELECT statements must match in number and data type.
By default, UNION removes duplicate rows from the result.
To include duplicates, use UNION ALL.
Example of UNION
Consider two tables: Employees_1 and Employees_2.
Table 1: Employees_1
EmployeeID Name
1 Alice
2 Bob
3 Carol
Table 2: Employees_2
EmployeeID Name
2 Bob
4 David
5 Eve
Query to get all distinct names from both tables:
SELECT Name FROM Employees_1 UNION SELECT Name FROM Employees_2;
Result:
Name
Alice
Bob
Carol
David
Eve
2. INTERSECT
The INTERSECT operator returns only the rows that are common between
two SELECT statements. The result includes only the rows that exist in both tables.
Syntax:
SELECT column1, column2, ... FROM table1 INTERSECT SELECT column1, column2, ...
FROM table2;
Key points:
The result set will only contain rows that exist in both tables.
Like UNION, the columns must match in number and data type.
Example of INTERSECT
Using the same Employees_1 and Employees_2 tables:
Query to get the common names between the two tables:
SELECT Name FROM Employees_1 INTERSECT SELECT Name FROM Employees_2;
Result:
Name
Bob
Here, only Bob is returned because it is the only common name in both tables.
3. EXCEPT
The EXCEPT operator returns the rows of the first SELECT statement that do not appear in the result of the second SELECT statement.
Syntax:
SELECT column1, column2, ... FROM table1 EXCEPT SELECT column1, column2, ...
FROM table2;
Key points:
The result includes rows that appear in the first table but not in the second.
The columns in both SELECT statements must match in number and data type.
Example of EXCEPT
Using the same Employees_1 and Employees_2 tables:
Query to get the names in Employees_1 but not in Employees_2:
SELECT Name FROM Employees_1 EXCEPT SELECT Name FROM Employees_2;
Result:
Name
Alice
Carol
Here, Alice and Carol are returned because they are in Employees_1 but not in Employees_2.
4. UNION ALL
While not strictly a set operation in the mathematical sense, UNION ALL is an important
extension. It combines the result sets of two SELECT statements without removing
duplicates.
Syntax:
SELECT column1, column2, ... FROM table1 UNION ALL SELECT column1, column2, ...
FROM table2;
Key points:
UNION ALL does not remove duplicates, so all rows from both tables are included, even if
they are identical.
Example of UNION ALL
Using the same Employees_1 and Employees_2 tables:
Query to get all names, including duplicates:
SELECT Name FROM Employees_1 UNION ALL SELECT Name FROM Employees_2;
Result:
Name
Alice
Bob
Carol
Bob
David
Eve
Here, Bob appears twice because UNION ALL does not eliminate duplicates.
What is Data Normalization and Why Is It Important? Normalization is the process of reducing
data redundancy in a table and improving data integrity. Data normalization is a technique used
in databases to organize data efficiently. Have you ever faced a situation where data redundancy
and anomalies affected the accuracy of your database? Data normalization ensures that your
data remains clean, consistent, and error-free by breaking it into smaller tables and linking them
through relationships. This process reduces redundancy, improves data integrity, and optimizes
database performance. Then why do you need it? If there is no normalization in SQL, there will be
many problems, such as:
Insert Anomaly: This happens when we cannot insert a piece of data into the table without also inserting unrelated data.
Update Anomaly: This is data inconsistency that arises when redundant copies of data are updated in some places but not in others.
Delete Anomaly: Occurs when deleting one piece of data unintentionally removes other attributes that are still needed.
What is Normalization in DBMS?
So, normalization is a way of organizing data in a database. Normalization involves organizing
the columns and tables in the database to ensure that their dependencies are correctly
implemented using database constraints. Normalization is the process of organizing data
properly. It is used to minimize the duplication of various relationships in the database.
It is also used to resolve anomalies such as insertion, deletion, and update anomalies in a table. It helps
to split a large table into several small normalized tables; relationships (links) between those tables are
then used to reduce redundancy. Normalization, also known as database normalization or data normalization,
is an important part of relational database design because it helps to improve the speed, accuracy,
and efficiency of the database.
Now the question arises what is the relationship between SQL and normalization? Well, SQL is
the language used to interact with the database. Normalization in SQL improves data distribution.
To initiate interaction, the data in the database must be normalized. Otherwise, we cannot continue
because it will cause an exception. Normalization can also make it easier to design the database
to have the best structure for atomic elements (that is, elements that cannot be broken down into
smaller parts). Usually, we break large tables into small tables to improve efficiency. Edgar F.
Codd defined the first normal form in 1970, and the other normal forms followed later. When normalizing a
database, organize data into tables and columns. Make sure that each table contains only relevant
data. If the data is not directly related, create a new table for that data. Normalization is necessary
to ensure that the table only contains data directly related to the primary key, each data field
contains only one data element, and to remove redundant (duplicated and unnecessary) data.
The process of refining the structure of a database to minimize redundancy and improve integrity
of database is known as Normalization. When a database has been normalized, it is said to be in
normal form.
Types of Normalization
Normalization usually occurs in phases where every phase is assigned its equivalent ‘Normal
form’. As we progress upwards the phases, the data gets more orderly and hence less permissible
to redundancy, and more consistent. The commonly used normal forms include:
1. First Normal Form (1NF): In the 1NF stage, each column in a table is unique, with no
repetition of groups of data. Here, each entry (or tuple) has a unique identifier known as a
primary key.
2. Second Normal Form (2NF): Building upon 1NF, at this stage, all non- key attributes are
fully functionally dependent on the primary key. In other words, the non-key columns in the
table should rely entirely on each candidate key.
3. Third Normal Form (3NF): This stage takes care of transitive functional dependencies. In the
3NF stage, every non-principal column should be non-transitively dependent on each key within
the table.
4. Boyce-Codd Normal Form (BCNF): BCNF is a stricter version of 3NF that guarantees the
validity of data dependencies. It requires that, for every non-trivial functional dependency X -> Y, the
determinant X is a super key of the relation; dependencies of attributes on non-key attributes are therefore removed.
5. Fourth Normal Form (4NF): 4NF reduces redundancy further by handling multi-valued facts.
Simply put, a table is in 4NF when it is in BCNF and has no non-trivial multi-valued dependencies,
so independent multi-valued attributes are stored in separate tables. This eliminates the data
redundancy caused by multi-valued dependencies.
Normalization and denormalization are distinct database techniques. Normalization is a method of
minimizing insertion, deletion and update anomalies by eliminating redundant data. Denormalization
is the reverse process: it deliberately adds redundancy to the data to improve application-specific
read performance.
When to normalize data
Normalization is particularly important for OLTP systems, where insert,
update and delete operations are fast and are usually initiated by the end-user. On the other hand,
normalization is not always seen as important for OLAP systems and data warehouses. Data is
usually denormalized to improve the performance of queries that need to be run in that context.
When to denormalize data
It is best to denormalize a database in several situations. Many data
warehouses and OLAP applications use denormalized databases. The main reason for this is
performance. These applications are often used to perform complex queries. Joining many tables
usually returns very large record sets. There may be other reasons for database denormalization, for
example, to enforce certain constraints that could not otherwise be enforced.
Here are some common reasons you might want to denormalize your database:
The most common queries require access to the entire concatenated data
set.
Most applications perform a table scan when joining tables.
The computational complexity of the derived column requires an overly complex
temporary table or query.
You can implement constraints (based on DBMS) that could not otherwise be
achieved
Although normalization is generally considered mandatory for OLTP and other transactional
databases, it is not always appropriate for some analytical applications.
Example
Let us assume the library database that maintains the required details of books and borrowers. In
an unnormalized database, the library records in one table the book details and the member who
borrowed it, as well as the member’s detail. This would result in repetitive information every time
a member borrows a book.
Normalization splits the data into separate tables, 'Books', 'Members' and 'Borrowed', and
connects 'Books' and 'Members' to 'Borrowed' through foreign keys. This removes
redundancy, which means data is well managed and less storage space is used.
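A minimal sketch of that decomposition in SQL (the table and column names are illustrative, not taken from the notes):
CREATE TABLE Books (
  BookID INT PRIMARY KEY,
  Title VARCHAR(255)
);
CREATE TABLE Members (
  MemberID INT PRIMARY KEY,
  MemberName VARCHAR(255)
);
-- Each borrowing links one book to one member; member details are never repeated
CREATE TABLE Borrowed (
  BookID INT,
  MemberID INT,
  BorrowDate DATE,
  PRIMARY KEY (BookID, MemberID, BorrowDate),
  FOREIGN KEY (BookID) REFERENCES Books(BookID),
  FOREIGN KEY (MemberID) REFERENCES Members(MemberID)
);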
Conclusion
The concepts of normalization, and the ability to put this theory into practice, are key to building
and maintaining comprehensive databases which are both strong and impervious to data anomalies
and redundancy. Properly applied and employed at the right times, normalization boosts database
quality, making it structured, small, and easily manageable.
Functional Dependency:-
A functional dependency X → Y in a relation means that the value of attribute set X uniquely determines the value of attribute set Y. Consider a student relation with attributes roll_no, name, dept_name and dept_building; some dependencies are valid and some are not:
name → dept_name: Students with the same name can have different dept_name, hence this is not a valid functional dependency.
dept_building → dept_name There can be multiple departments in the same
building. Example, in the above table departments ME and EC are in the same building
B2, hence dept_building → dept_name is an invalid functional dependency.
More invalid functional dependencies: name → roll_no, {name, dept_name} →
roll_no, dept_building → roll_no, etc.
Armstrong’s axioms/properties of functional dependencies:
Reflexivity: if Y is a subset of X, then X → Y (such dependencies are called trivial).
Augmentation: if X → Y, then XZ → YZ for any attribute set Z.
Transitivity: if X → Y and Y → Z, then X → Z.
Here, {roll_no, name} → name is a trivial functional dependency, since the dependent name is
a subset of determinant set {roll_no, name}. Similarly, roll_no → roll_no is also an example of
trivial functional dependency.
Here, enrol_no → dept and dept → building_no. Hence, according to the axiom of transitivity,
enrol_no → building_no is a valid functional dependency. This is an indirect functional
dependency, hence called Transitive functional dependency.
1. Data Normalization
Data normalization is the process of organizing data in a database in order to minimize
redundancy and increase data integrity. Functional dependencies play an important part in data
normalization. With the help of functional dependencies, we are able to identify the primary key
and candidate keys of a table, which in turn helps in normalization.
2. Query Optimization
With the help of functional dependencies, we are able to decide the connectivity between the
tables and which attributes need to be projected to retrieve the required data from the
tables. This helps in query optimization and improves performance.
3. Consistency of Data
Functional dependencies ensure the consistency of the data by removing any redundancies or
inconsistencies that may exist in the data. A functional dependency ensures that a change made
to one attribute does not introduce inconsistency in another set of attributes, and thus it
maintains the consistency of the data in the database.
Decomposition is lossy if R1 ⋈ R2 ⊃ R.
Decomposition is lossless if R1 ⋈ R2 = R.
To check for lossless join decomposition using the FD set, the following conditions must hold:
1. The union of the attributes of R1 and R2 must cover all attributes of R: Att(R1) ∪ Att(R2) = Att(R)
2. The common attributes must not be empty: Att(R1) ∩ Att(R2) ≠ ∅
3. The common attribute must be a key for at least one relation (R1 or R2):
Att(R1) ∩ Att(R2) -> Att(R1) or Att(R1) ∩ Att(R2) -> Att(R2)
For example, a relation R(A, B, C, D) with FD set {A -> BC} decomposed into R1(ABC) and R2(AD) is a lossless join decomposition because Att(R1) ∪ Att(R2) = {A, B, C, D} = Att(R), the common attribute set {A} is not empty, and A -> BC means A is a key of R1(ABC).
Schema Refinement: -
Schema refinement refers to refining the schema by using some technique, usually normalization.
Normalization: - Normalization means splitting tables into smaller tables that contain fewer
attributes. Normalization, or schema refinement, is a technique of organizing the data
in the database. It is a systematic approach of decomposing tables to eliminate data redundancy
and undesirable characteristics like insertion, update and deletion anomalies.
Types of Normalization: -
Example: -
1NF [First Normal Form]: - A relation is said to be in 1NF if every attribute holds only atomic values.
* Each attribute name must be unique
* Each attribute value must be single or atomic, i.e., a single-valued attribute
Example: -
2NF: - A relation is said to be in 2NF if it satisfies 1NF and it has no partial dependency
[no non-prime attribute depends on only a part of a candidate key].
Ex: -
id name Course C-fee
501 A CSE 80k
502 B CSM 60k
503 C CSM 60k
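In the table above, C-fee is determined by Course rather than by the key id alone. One way to remove that dependency is to split the table; the sketch below is illustrative (C-fee is written as C_fee to make it a valid column name):
-- Each course and its fee are stored exactly once
CREATE TABLE Course (
  Course VARCHAR(10) PRIMARY KEY,
  C_fee VARCHAR(10)
);
-- Student information keyed by id; the fee is no longer repeated per student
CREATE TABLE Student (
  id INT PRIMARY KEY,
  name VARCHAR(50),
  Course VARCHAR(10),
  FOREIGN KEY (Course) REFERENCES Course(Course)
);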
3NF: - A relation is in 3NF if it satisfies 2NF and there is no transitive functional dependency.
* No non-key attribute should depend on another non-key attribute.
4NF: - A relation is said to be in 4NF if it is in Boyce-Codd Normal Form and has no
multi-valued dependency.
Table 1: id name
501 A
502 B
503 C
Table 2: id C-fee
501 80k
502 60k
503 60k
5NF: -
* It is a database normalization technique that ensures data consistency and reduces data
redundancy. It is an extension of the Fourth Normal Form (4NF) and is considered a stronger
normal form.
Generalization: -
It works on the principle of a bottom-up approach. In generalization, lower-level entities
are combined to form a higher-level entity.
In the generalization process, common properties are drawn from particular entities as it combines
subclasses to form a superclass.
Specialization: -
Specialization is the opposite of generalization. In specialization, things are broken down into smaller
things to simplify them further. We can say that in specialization a particular entity gets divided
into sub-entities. Also, in specialization inheritance takes place.
Natural Key: A column, or group of columns, that is generated from the table’s data is
known as a natural key. For instance, since it uniquely identifies every client in the table, the
customer ID column in a customer table serves as a natural key.
Surrogate key: A column that is not generated from the data in the database is known
as a surrogate key. Rather, the DBMS generates a unique identifier for you. In
database tables, surrogate keys are frequently utilized as primary keys.
Surrogate Key
A surrogate key also called a synthetic primary key, is generated when a new record is inserted
into a table automatically by a database that can be declared as the primary key of that table. It is
the sequential number outside of the database that is made available to the user and the
application or it acts as an object that is present in the database but is not visible to the user or
application.
We can say that, in case we do not have a natural primary key in a table, then we need to
artificially create one in order to uniquely identify a row in the table, this key is called the
surrogate key or synthetic primary key of the table. However, the surrogate key is not always the
primary key. Suppose we have multiple objects in a database that are connected to the surrogate
key, then we will have a many-to-one association between the primary keys and the surrogate key
and the surrogate key cannot be used as the primary key.
The surrogate key is called the fact less key as it is added just for our ease of
identification of unique values and contains no relevant fact (or information) that is
useful for the table.
Consider an example: Suppose we have two tables of two different schools having the same
column registration_no, name, and percentage, each table having its own natural primary key,
that is registration_no.
Table of school A:
Table of school B:
Now, suppose we want to merge the details of both the schools in a single table.
Resulting table will be:
surr_no registration_no name percentage
1 210101 Harry 90
2 210102 Maxwell 65
3 210103 Lee 87
4 210104 Chris 76
5 CS107 Taylor 49
6 CS108 Simon 86
7 CS109 Sam 96
8 CS110 Andy 58
As we can observe the above table and see that registration_no cannot be the primary key of the
table as it does not match with all the records of the table though it is holding all unique values of
the table. Now, in this case, we have to artificially create one primary key for this table. We can do
this by adding a column surr_no in the table that contains anonymous integers and has no direct
relation with other columns. This additional column of surr_no is the surrogate key of the table.
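A surrogate key column such as surr_no is usually generated by the DBMS itself; a minimal sketch using an auto-increment column (the table name is illustrative, and the keyword varies by DBMS, e.g. AUTO_INCREMENT in MySQL, SERIAL or IDENTITY elsewhere):
CREATE TABLE MergedStudents (
  surr_no INT AUTO_INCREMENT PRIMARY KEY,  -- surrogate key with no business meaning
  registration_no VARCHAR(20),             -- natural identifier, not unique across both schools
  name VARCHAR(50),
  percentage INT
);

INSERT INTO MergedStudents (registration_no, name, percentage)
VALUES ('210101', 'Harry', 90),
       ('CS107', 'Taylor', 49);            -- surr_no values 1 and 2 are assigned automatically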
4. Flexibility: In the event that the natural key changes, rows can still be uniquely
identified using surrogate keys.
Conclusion
Surrogate keys are an important tool for designing and implementing databases. They can be
applied to enhance database systems, flexibility, stability, and performance.
BCNF is essential for good database schema design in higher-level systems where consistency
and efficiency are important, particularly when there are many candidate keys (as one often finds
with a delivery system).
Rules for BCNF
Rule 1: The table should be in the 3rd Normal Form.
Rule 2: X should be a super key for every functional dependency (FD) X−>Y in a given
relation.
Note: To test whether a relation is in BCNF, we identify all the determinants and make sure that
they are candidate keys.
You came across a similar hierarchy known as the Chomsky Normal Form in the Theory of
Computation. Now, carefully study the hierarchy above. It can be inferred that every relation in
BCNF is also in 3NF. To put it another way, a relation in 3NF need not be in BCNF. Ponder
over this statement for a while.
To determine the highest normal form of a given relation R with functional dependencies, the
first step is to check whether the BCNF condition holds. If R is found to be in BCNF, it can be
safely deduced that the relation is also in 3NF, 2NF, and 1NF as the hierarchy shows. The 1NF has
the least restrictive constraint – it only requires a relation R to have atomic values in each tuple.
The 2NF has a slightly more restrictive constraint.
The 3NF has a more restrictive constraint than the first two normal forms but is less restrictive than
the BCNF. In this manner, the restriction increases as we traverse down the hierarchy.
Examples
Here, we are going to discuss some basic examples which let you understand the properties of
BCNF. We will discuss multiple examples here.
Example 1
Let us consider the student database, in which data of the student are mentioned.
Stu_Branch Table
Stu_ID Stu_Branch
101 Computer Science & Engineering
102 Electronics & Communication
Engineering
Candidate Key for this table: Stu_ID.
Stu_Course Table
Stu_ID Stu_Course_No
101 201
101 202
102 401
102 402
After decomposing into further tables, now it is in BCNF, as it is passing the condition of Super
Key, that in functional dependency X−>Y, X is a Super Key.
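A sketch of the two decomposed relations as table definitions, with the keys the example implies (names and types are illustrative):
CREATE TABLE Stu_Branch (
  Stu_ID INT PRIMARY KEY,          -- Stu_ID -> Stu_Branch, and Stu_ID is the key
  Stu_Branch VARCHAR(100)
);

CREATE TABLE Stu_Course (
  Stu_ID INT,
  Stu_Course_No INT,
  PRIMARY KEY (Stu_ID, Stu_Course_No)   -- the whole row is the key; no other dependencies
);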
Example 2
Find the highest normal form of a relation R(A, B, C, D, E) with FD set {BC -> D, AC -> BE, B -> E}:
Explanation:
Step-1: As we can see, (AC)+ = {A, C, B, E, D} but none of its subsets can determine
all attributes of the relation, So AC will be the candidate key. A or C can’t be derived
from any other attribute of the relation, so there will be only 1 candidate key
{AC}.
Step-2: Prime attributes are those attributes that are part of candidate key {A, C} in this
example and others will be non-prime
{B, D, E} in this example.
Step-3: The relation R is in 1st normal form as a relational DBMS does not allow multi-
valued or composite attributes.
The relation is in 2nd normal form because BC->D is in 2nd normal form (BC is not a proper
subset of candidate key AC) and AC->BE is in 2nd normal form (AC is candidate key) and B->E
is in 2nd normal form (B is not a proper subset of candidate key AC).
The relation is not in 3rd normal form because in BC->D (neither BC is a super key nor D is a
prime attribute) and in B->E (neither B is a super key nor E is a prime attribute) but to satisfy 3rd
normal for, either LHS of an FD should be super key or RHS should be a prime attribute. So the
highest normal form of relation will be the 2nd Normal form.
Note: A prime attribute cannot be transitively dependent on a key in BCNF relation.
Consider these functional dependencies of some relation R
AB -> C
C -> B
AB -> B
From the above functional dependency, we get that the candidate key of R is AB and AC. A
careful observation is required to conclude that the above dependency is a Transitive
Dependency as the prime attribute B transitively depends on the key AB through C. Now, the first
and the third FD are in BCNF as they both contain the candidate key (or simply KEY) on their left
sides. The second dependency, however, is not in BCNF but is definitely in 3NF due to the
presence of the prime attribute on the right side. So, the highest normal form of R is 3NF as all
three FDs satisfy the necessary conditions to be in 3NF.
Example 3
A -> BC,
B -> A
Note: BCNF decomposition may always not be possible with dependency preserving, however,
it always satisfies the lossless join condition. For example, relation R (V, W, X, Y, Z), with
functional dependencies:
V, W -> X
Y, Z -> X
W -> Y
Note: Redundancies are sometimes still present in a BCNF relation as it is not always possible to
eliminate them completely.
There are also some higher-order normal forms, like the 4th Normal Form and the 5th Normal
Form.
For more, refer to the 4th and 5th Normal Forms.
Conclusion: -
In conclusion, we can say that Boyce-Codd Normal Form (BCNF) is essential to database
normalization, as it lets us normalize beyond the limits of 3NF. By making sure the determinant
of every functional dependency is a super key, BCNF helps
us avoid redundancy and update anomalies. This makes the BCNF a highly desirable property
and helps in achieving Data Integrity which is number one concern for any Database Designer.
a →→ b
It is read as: b is multivalued dependent on a (a multidetermines b). Suppose a person named Geeks is working on 2
projects, Microsoft and Oracle, and has 2 hobbies, namely Reading and Music. This can be
expressed in a tabular format in the following way.
Project and Hobby are multivalued attributes as they have more than one value for a single
person i.e., Geeks.
What is Multivalued Dependency?
When one attribute in a database depends on another attribute and has many independent values, it
is said to have multivalued dependency (MVD). It supports maintaining data accuracy and
managing intricate data interactions.
Multi Valued Dependency (MVD)
We can say that a multivalued dependency exists if the following conditions are met.
For a →→ b (here name →→ project):
t1[b] = t3[b] = MS
and
t2[b] = t4[b] = Oracle
And for a →→ c (here name →→ hobby):
t1[c] = t4[c] = Reading
and
t2[c] = t3[c] = Music
Hence, we know that MVD exists in the above table, and it can be stated by:
name →→ project
name →→ hobby
Conclusion
Multivalued Dependency (MVD) is a form of data dependency where two or more
attributes, other than the key attribute, are independent of each other, but
each of these attributes depends on the key.
Data errors and redundancies may result from Multivalued Dependency.
We can normalize the database to 4NF in order to get rid of Multivalued
Dependency.
Transaction:- A collection of operations that form a single logical unit of work is called a
transaction.
Example of a transaction:-
Consider a bank transfer where 100/- is transferred from Account A to Account B. The
transaction consists of two operations:
1. Deduct 100/- from Account A
2. Add 100/- to Account B
This transaction is atomic: either both operations complete successfully or neither occurs.
If the first operation (deducting 100/-) succeeds but the second (adding to Account B) fails,
the system will roll back the transaction to ensure consistency.
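The same transfer written as an explicit SQL transaction (a minimal sketch; the Accounts table, its columns, and the exact keyword for starting a transaction are assumptions that vary by DBMS):
START TRANSACTION;

UPDATE Accounts SET Balance = Balance - 100 WHERE AccountNo = 'A';
UPDATE Accounts SET Balance = Balance + 100 WHERE AccountNo = 'B';

COMMIT;     -- make both changes permanent
-- If either update fails before COMMIT, issue ROLLBACK instead to undo the partial work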
Transaction State:-
In DBMS, a transaction is a sequence of one or more operations performed on a database.
These operations must follow the ACID properties to ensure the database remains in a valid
state. A transaction can be in one of several states during its life cycle, which are described
below:
1. Active state:-
The transaction is in the active state when it is being executed.
Operations like reading and writing data are performed during this phase.
It can remain in the state until the transaction is complete or aborted.
2. Partially Committed State:-
After the transaction has executed its final operation (e.g., an update or insert), it
enters the partially committed state.
At this point, the DBMS must ensure that all changes made by the transaction can be
saved permanently. However, the changes are not yet made permanent in the database
until successful completion (commit).
3. Failed state:-
If a transaction fails during execution (due to an error or a system crash), it enters the
failed state.
In this state, any changes made to the database by the transaction must be rolled back
(undone).
4. Aborted State:-
If a transaction cannot proceed due to failure, it is rolled back and all the changes
made by the transaction are undone.
After the rollback, the transaction can either be restarted or completely terminated.
5. Committed State:
If a transaction completes all its operations successfully and the database changes are
saved permanently, it enters the committed state.
This is the final state, where the transaction’s changes become visible to other
transactions.
After the transaction enters the aborted state, the system has two options:
It can restart the transaction, but only if the transaction was aborted as a result
of some hardware or software error that was not created through the internal logic
of the transaction. A restarted transaction is considered to be a new transaction.
It can kill the transaction. It usually does so because of some internal logical error
that can be corrected only by rewriting the application program, or because the input
was bad, or because the desired data were not found in the database.
Example:-
We have two Accounts in a bank:
o Account A has a balance of 500/-
o Account B has a balance of 300/-
The goal is to transfer 100/- from Account A to B.
Transaction states:-
1.Active state:-
The transaction begins with a series of operation.
First, the system read the balance of Account A to check if there are sufficient funds.
Next, it deducts 100/- from Account A.
Then, the system reads the balance of Account B and prepares to add 100/- to Account
B.
Operations during Active state:-
Read balance of Account A:500/-
Deducted 100/-from Account A:new balance of Account A:400/-
Read balance of Account B:300/-
Prepare to add 100/- to Account B (but not done yet).
2. Partially committed state:-
After the final operation (adding 100/- to Account B) has executed, the transaction
enters the partially committed state; its changes are not yet permanent.
3. Committed state:-
If everything goes smoothly, the transaction moves into the committed state.
Now, the 100/- has been permanently added to account B, making its new balance
400/-
The changes to both account A&B are written to the database, and the transaction is
now complete.
Final status after commit:-
o Account A:400/-
o Account B:400/-
o The transaction is fully complete, and all changes are durable (i.e.., they will survive
after a crash).
4. Failed state:-
Let’s assume that after the system deducts 100/- from Account A but before it adds
100/- to Account B, a power failure occurs or a system crash happens.
The transaction cannot proceed, so it enters the failed state.
Since part of the transaction completed (the deduction from Account A), but it was not fully
committed, the system recognizes that this failure requires recovery to rollback.
5.Aborted state:-
Since the transaction has failed, the DBMS rolls back the entire transaction, undoing
all changes made during the transaction to maintain database consistency.
In this case 100/- is restored to account A, So its balance returns to 500/- , and
Account B remains at 300/-
The transaction then enters the aborted state. The DBMS ensures that no partial
updates remain in the database, and everything is restored to its original state.
ACID Properties:-
Atomicity: Either all operations of the transaction are reflected in the database or none are.
Consistency: Execution of a transaction in isolation preserves the consistency of the database.
Isolation: Each transaction is unaware of other transactions executing concurrently; intermediate results are not visible to other transactions.
Durability: After a transaction commits, its changes persist even if there is a system failure.
Concurrent Execution:-
Concurrent execution of transactions means that multiple transactions are executed concurrently
in the DBMS, with each transaction doing its own atomic unit of work. Concurrent execution can
cause the following problems.
Dirty read (write-read conflict):
A transaction reads data written by another transaction that has not yet committed.
Example:-
(A = 500)
T1: Read(A)
T1: A = A + 100
T1: Write(A)
T2: Read(A)
Let’s assume that T1 is modifying Account A from 500/- to 600/-, but it has not committed yet.
At the same time, T2 reads this uncommitted balance, which is 600/- (though the transaction
is uncommitted). T1 then encounters an error and rolls back the transaction, restoring the balance
of Account A to 500/-. This is called a dirty read.
3. Unrepeatable read (read-write conflict):
A transaction reads the same data twice but gets different results because another transaction
has modified the data in between.
Example:-
(A = 1000)
T1: Read(A) (sees 1000)
T2: Read(A)
T2: A = A + 200
T2: Write(A) (commit; A is now 1200)
T1: Read(A) (now sees 1200)
T1 reads the balance of Account A, which is 1000/-. T2 modifies the balance of Account A by
adding 200/-, making the new balance 1200/-, and this transaction commits. T1 re-reads the
balance of Account A and now sees 1200/-. This is called an unrepeatable read.
4. Incorrect Summary read (write-write conflict):
This problem occurs when a transaction computes an aggregate function over a set of data while
other transactions are updating that data simultaneously.
Example:-
T1: Sum = 0
T1: R(A)
T1: Sum = Sum + A
T2: R(y)
T2: y = y + 100
T2: W(y)
T1: R(y)
T1: Sum = Sum + y
Here, because T2 has not committed its update of y, the sum computed by T1 is based on inconsistent
data, and the sum operation has to be applied again in T1 to get the correct value, which is 300/-.
This is called an incorrect summary read.
Serializability:-
Serializability is a key concept in dbms that ensures the correctness of transaction executed
simultaneously. When multiple users or applications perform transaction concurrently on
database, the system must ensure that the integrity & consistency of the data are maintained.
Example:-
T1 T2
R(A)
W(A)
R(A)
W(A)
Types of Serializability:-
1. Conflict serializability
2. View serializability
1. Conflict serializability:-
A schedule is said to be conflict-serializable if its execution can be rearranged by
swapping non-conflicting operations to form a serial schedule without affecting the final
result. Two operations conflict if they:
Belong to different transactions
Access the same data item, and
At least one of them is a write operation
Example:
Hence s1 is a serial schedule that we got after swapping non- conflicting operations.
2. View serializability:-
A schedule is view serializable if it produces the same outcome as a serial schedule, even
though the order of the operations might differ. Two schedules S1 a non-serial schedule and
S2 a serial schedule are said to be view equivalent, if they satisfy all the following conditions.
I.e., if in both the schedules S1 & S2, the transaction that performs
Initial read
Final write and
Update read on each data item are same.
Example:-
Initial read
In schedule S1, T1 first reads the data item X. In S2 also, T1 first reads the data item X. Let’s
check for Y: in S1, T1 first reads the data item Y, and in S2 also, T1 first reads the data item Y.
We checked for both data items x &y and. the initial read condition is satisfied in S1 & S2.
Final write
In schedule S1, the final write operation on x is done by T2. In S2 also T2 performs the final
write on X.
Let’s check for y. In S1, the final write operation on y is done by T2. In S2 final write on y is
done by T2.
We checked for both data items X & Y and the final write condition is satisfied in S1 & S2.
Update read
In S1, transaction T2 reads the value of x, written by T1.In s2, the same transaction T2
reads the x after it is written by T1.
In S1, transaction T2 reads the value of y, written by T1.In S2, the same transaction T2
reads the value of y after it is updated by T1.
The update read condition is also satisfied for both the schedules.
Therefore, since all three conditions are satisfied, the schedules S1 and S2 are view
equivalent. Also, as we know that S2 is the serial schedule of S1, we can say that the
schedule S1 is a view serializable schedule.
The serializability test is used to check whether a given schedule is serializable or not.
A precedence graph is used to test the serializability of a non-serial schedule.
A precedence graph is a directed graph consisting of the pair G = (V, E),
where V = all the transactions in the schedule and
E = the set of edges Ti → Tj such that Ti performs a conflicting operation before Tj does,
for which one of three conditions holds:
1. Ti executes write (Q) before Tj executes read (Q).
2. Ti executes reads (Q) before Tj executes write (Q).
3. Ti executes write (Q) before Tj executes write (Q).
If the precedence graph contains no cycle (i.e., it is acyclic), the
schedule S is a serializable schedule.
Example:-
Topological sorting of the (acyclic) precedence graph gives the equivalent serial order.
Steps:-
Pick a vertex with no incoming edges (here T1) and remove it along with its outgoing edges.
Repeat on the remaining graph: T3 is picked next, and finally T2.
So the serializability order of the equivalent serial schedule is: T1 → T3 → T2.
Recoverability:-
Recoverability in DBMS refers to the ability of a system to ensure that a database can be
restored to a consistent state after a failure, such as system crashes, power outages, or
transaction failure.
A schedule is recoverable if it ensure that:
A transaction only commits after ensuring that all transactions whose changes it
depends on have also committed.
This prevents a situation where a transaction depends on an uncommitted
transaction, which might be rolled back later, leading to inconsistencies.
Example:
If transaction T2 reads a value written by transaction T1, then T2 should only
commit after T1 has committed.
Levels of isolation:
Isolation levels define the degree to which a transaction must be isolated from the
data modifications made by any other transaction in the database system.
There are 4 levels of isolation that balance data consistency against system performance:
1. Serializable
2. Repeatable read
3. Read committed
4. Read uncommitted
1. Serializable:
The highest isolation level. Transactions are executed in such a way that the
outcome is the same as if the Transactions were executed serially (one after
another). This prevents dirty reads, non-repeatable reads and phantom reads.
Use case: suitable for applications where strict consistency is required, such as
financial applications, though it can impact performance due to high
contention and blocking.
2. Repeatable Read:-
This is the second most restrictive isolation level, below Serializable.
It prevents a transaction from seeing changes made by other transactions to rows it has already read.
The transaction holds read locks on all rows it references.
It holds write locks on all rows it inserts, updates, or deletes.
It avoids non-repeatable reads.
Use case: ideal for situations where you need consistent reads of the same data
within a transaction.
3. Read committed:-
Transactions can only read committed data. They cannot see uncommitted changes from
other transactions, preventing dirty reads.
Use case: commonly used in many applications since it provides a balance
between performance and consistency.
4. Read uncommitted:-
Transactions can read data that other transactions have written but not yet committed (dirty
reads).
This is the least restrictive level.
No locks are enforced.
Dirty reads are allowed, meaning no transaction waits for others to complete.
Use case: suitable for scenarios where strict data consistency is not important
(example:- reporting or log processing).
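An isolation level is usually chosen per transaction with the standard SQL statement below; this is a minimal sketch (exact placement and defaults vary slightly across DBMSs, and the Accounts table is assumed for illustration):
-- Run the next transaction at READ COMMITTED
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
SELECT Balance FROM Accounts WHERE AccountNo = 'A';
COMMIT;

-- A stricter choice for something like a funds transfer
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;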
Isolation can be enforced through locking mechanisms and timestamp-based protocols.
1. Locking Mechanisms:
It ensures exclusive access to a data item for a transaction. This means that while one
transaction holds a lock on a data item, no other transaction can access the item.
Types of locks: there are 2 types of locks.
i) Shared lock (S-lock): used for data items that the transaction only reads. It is also
called a read lock.
ii) Exclusive lock (X-lock): used for data items that the transaction writes. It is also
called a write lock.
2. Timestamp-based protocol:
A timestamp is a value assigned to each transaction that indicates its order
relative to other transactions.
A timestamp-based protocol does not use locks. Instead, whenever a new transaction
starts it is associated with a new timestamp, and timestamps are strictly increasing;
they can be taken from the system clock or from a logical counter.
The timestamps determine the order of execution of the transactions and ensure that
older transactions get priority.
Types of locks:-
1. Shared Lock (s-Lock):-A Shared lock is applied when a transaction wants to read a data
item. Multiple transactions can hold a shared lock on the same data item, allowing them to
read concurrently. However, no transaction can modify the data while it’s locked in shared
mode.
We require that every transaction request a lock in an appropriate mode on a data item,
depending on the type of operation that it will perform on that item. The transaction makes
the request to the concurrency-control manager, and it can proceed with the operation only
after the concurrency-control manager grants the lock to it. The use of these
two lock modes allows multiple transactions to read a data item but limits write access to just
one transaction at a time.
Example:
Consider the banking example. Let A and B be two accounts that are accessed by transaction
T1
T1: lock-X(B);
read(B);
B := B - 50;
write(B);
unlock(B);
lock-X(A);
read(A);
A := A + 50;
write(A);
unlock(A);
Figure:-Transaction T1.
and T2. Transaction T1 transfers ₹50 from account B to account A. Transaction T2 displays the
total amount of money in accounts A and B, that is, the sum A + B.
Suppose that the values of accounts A and B are ₹100 and ₹200, respectively. If these two
transactions are executed serially, either in the order T1, T2 or the order T2, T1, then
transaction T2 will display the value ₹300.
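In SQL, an exclusive row lock is normally acquired implicitly by UPDATE, or explicitly with SELECT ... FOR UPDATE (available in MySQL, PostgreSQL and Oracle, among others). A sketch of T1 in that style, reusing the assumed Accounts table:
START TRANSACTION;
-- Lock the rows exclusively before changing them
SELECT Balance FROM Accounts WHERE AccountNo = 'B' FOR UPDATE;
UPDATE Accounts SET Balance = Balance - 50 WHERE AccountNo = 'B';
SELECT Balance FROM Accounts WHERE AccountNo = 'A' FOR UPDATE;
UPDATE Accounts SET Balance = Balance + 50 WHERE AccountNo = 'A';
COMMIT;    -- all locks are released when the transaction ends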
Two-Phase Locking Protocol:-
The two-phase locking (2PL) protocol requires that each transaction issue its lock and unlock requests in two phases:
1) Growing phase: A transaction may obtain locks, but may not release any lock.
2) Shrinking phase: A transaction may release locks, but may not obtain any new locks.
The point in the schedule where the transaction has obtained its final lock is called the
lock point of the transaction.
Two phase locking is of two types
Strict two phase locking
Rigorous two phase locking
Strict two-phase locking: All exclusive locks taken by a transaction are held until that
transaction commits. (Only exclusive locks)
Rigorous two-phase locking: All locks taken by a transaction are held until the transaction
commits. (Both shared and exclusive locks)
Converting shared lock to exclusive lock is called lock upgrade.
Converting exclusive lock to shared lock is called lock downgrade.
Lock conversion cannot be allowed arbitrarily. Rather, upgrading can take place only in the
growing phase, whereas downgrading can take place only in the shrinking phase.
Example:
lock-X(A); (growing phase)
lock-X(B); (growing phase)
read(A);
A := A - 50;
write(A);
read(B);
B := B + 50;
write(B);
unlock(A); (shrinking phase)
unlock(B); (shrinking phase)
Deadlock Handling:-
A system is in a deadlock state if there exists a set of transactions such that every transaction
in the set is waiting for another in the set.
Let T0, T1, ..., Tn be a set of transactions such that T0 is waiting for a data item that T1
holds, T1 is waiting for a data item that T2 holds, ..., Tn-1 is waiting for a data
item that Tn holds, and Tn is waiting for a data item that T0 holds. None of the transactions can
make progress, and this situation is called a deadlock.
There are two principal methods for dealing with the deadlock problem. We can use a
deadlock prevention protocol to ensure that the System will never enter a deadlock state.
Alternatively, we can allow the System to enter a deadlock state, and then try to recover by
using a deadlock detection and deadlock recovery.
Deadlock Detection
Deadlocks can be described in terms of a wait-for graph G = (V, E).
An edge Ti → Tj exists if transaction Ti is waiting for a data item currently held by transaction Tj.
A deadlock exists in the system if and only if the wait-for graph contains a cycle.
Deadlock Recovery
The most common solution for deadlock is to rollback one or more transactions to break the
deadlock
1. Selection of a victim
2. Rollback
3. Starvation
1. Selection of a victim:-
Given a set of deadlocked transaction, we must determine which transaction to rollback to
break the deadlock
Many factors may determine the cost of a rollback including
a. How long the transaction has computed, and how much longer the transaction will
compute before it completes its designated task.
b. How many data item the transaction has used
c. How many more data items the transaction needs for it to complete.
d. How many transactions will be involved in the rollback.
2. Rollback:-
Once we have decided that a particular transaction must be rolled back, we must
determine how for this transaction should be rolled back.
There are two options for rollback:
a) Total Rollback
b) Partial Rollback
Total rollback:- The transaction is aborted and then restarted from the beginning.
Partial rollback:- The transaction is rolled back only as far as necessary to break the deadlock. It
requires the system to maintain additional information about the state of all the running
transactions. The system should decide which locks the selected transaction needs to release
in order to break the deadlock.
3. Starvation:-
In a system where the selection of victims is based primarily on cost factors. It may
happen that the same transaction is always picked as a victim. As a result the transaction
never completes and it is known as starvation.
The most common solution is to include the number of rollbacks in the cost factor.
Deadlock prevention:-
It includes mechanisms that ensures that the system will never enter a deadlock stage.
Two deadlock prevention schemes using time stamps have been proposed.
1. Wait-die schema
2. Wound-wait schema
Wait-die scheme:-
It is a non-preemptive technique.
When transaction Ti requests a data item currently held by Tj, Ti is allowed to wait
only if Ti has a timestamp smaller than that of Tj. Otherwise Ti is rolled back.
If Ti requests an item Q held by Tj and Ts(Ti) < Ts(Tj) (Ti is older), then Ti waits.
If Ti requests an item Q held by Tj and Ts(Ti) > Ts(Tj) (Ti is younger), then Ti dies (is rolled back).
Wound-wait Scheme:-
It is a preemptive technique.
When transaction Ti requests a data item currently held by Tj, Ti is allowed to
wait only if it has a timestamp larger than that of Tj (i.e., Ti is younger).
Otherwise, Tj is rolled back (wounded).
If Ti requests an item Q held by Tj and Ts(Ti) < Ts(Tj) (Ti is older), then Tj is wounded (rolled back).
If Ti requests an item Q held by Tj and Ts(Ti) > Ts(Tj) (Ti is younger), then Ti waits.
Timestamp-based protocol:-
A protocol which is based on the ordering of the transaction is known as timestamp-based
protocol.
Timestamp: A unique number assigned to each transaction before the transaction starts its
execution is known as its timestamp.
The timestamp of a transaction Ti is denoted by Ts (Ti). There are two methods for assigning
timestamp.
1. The value of the system clock is assigned as the timestamp of the transaction.
2. The value of the logical counter is assigned as the timestamp of the transaction. The
value of the logical counter is incremented after a new timestamp has been assigned.
To implement this scheme, we associate each data item Q two timestamp values:
W-timestamp (Q):-It denotes the largest timestamp of any transaction that executed write (Q)
Successfully
R-timestamp (Q):-It denotes the largest timestamp of any transaction that executed read (Q)
Successfully.
These timestamps are updated wherever a new read (Q) or write (Q) instruction is
executed.
The timestamp-ordering protocol ensures that any conflicting read and write
operations are executed in timestamp order.
This protocol operates as follows:
1. Suppose that transaction Ti issues read (Q):
a) If Ts (Ti) <W-timestamp (Q), then the read operation is rejected and Ti is rolled
back.
b) If Ts(Ti) >= W-timestamp(Q), then the read operation is executed and R-timestamp(Q)
is set to the maximum of R-timestamp(Q) and Ts(Ti).
2. Suppose that transaction Ti issues write (Q)
a) If Ts (Ti) <R-timestamp (Q), then the write operation is rejected and Ti is rolled
back.
b) If Ts (Ti) <W-timestamp (Q), then the write operation is rejected and Ti is rolled
back.
c) Otherwise, the system executes the write operation and sets W-timestamp (Q) to
Ts (Ti).
Recovery Algorithm:-
A recovery algorithm in a dbms is a set of actions that a system uses to recover from a failure
and restore the database to a consistent state.
Transaction Rollback-
First consider transaction rollback during normal operation (that is, not during recovery from
a system crash). Rollback of transaction Ti is performed as follows:
1. The log is scanned backward, and for each log record of Ti of the form <Ti, Xj, V1,
V2>that is found:
a) The value V1 is written to data item Xj and
b) A special redo-only log record <Ti, Xj, V1> is written to the log, where V1 is
the value being restored to data item Xj during the rollback. These log records
are sometimes called compensation log records. Such records do not need
undo information, since we never need to undo such an undo operation.
2. Once the log record<Ti start> is found the backward scan is stopped, and a log
record<Ti abort> is written to the log.
Recovery After a system crash
Recovery actions, when the database system is restarted after a crash, takes place in two
phases:
1. In the redo phase, the system replays updates of all transaction by scanning the log
forward from the last checkpoint.
The specific steps taken while scanning the log are as follows:
a. The list of transactions to be rolled back, undo-list, is initially set to the list L in the
<checkpoint L> log record.
b. Whenever a normal log record of the form <Ti, Xj, V1, V2> or a redo-only log
record of the form<Ti, Xj, V2> is encountered, the operation is redone; that is the
value V2 is written to data item Xj.
c. Whenever a log record of the form<Ti start> is found, Ti is added to undo-list.
d. Whenever a log record of the form<Ti abort> or<Ti commit> is found, Ti is
removed from undo-list.
At the end of the redo phase, undo-list contains the list of all transactions that are incomplete,
that is, they neither committed nor completed their rollback before the crash.
2. In the undo phase, the system rolls back all transactions in the undo-list. It performs
the rollback by scanning the log backward from the end.
a. Whenever it finds a log record belonging to a transaction in the undo list, it
performs undo actions just as if the log record had been found during the rollback
of a failed transaction.
b. When the system finds a<Ti start> log record for a transaction Ti in undo-list, it
writes a<Ti abort > log record to the log, and removes Ti from undo-list.
c. The undo phase terminates once undo-list becomes empty, that is, once the system has
found <Ti start> log records for all transactions that were initially in undo-list.
After the undo phase of recovery terminates, normal transaction processing can resume.
Recovery algorithm contains log-based recovery, shadow paging, checkpoints to
maintain consistency of database.
Insertion in B+ Trees:
Insertion in B+ Trees is done via the following steps.
Every element in the tree has to be inserted into a leaf node. Therefore, it is necessary to
go to a proper leaf node.
Insert the key into the leaf node in increasing order if there is no overflow.
Deletion in B+ Trees:
Deletion in B+ Trees is not just deletion; it is a combined process of searching, deletion,
and balancing. In the last step of the deletion process, it is mandatory to rebalance the B+
tree, otherwise it fails to satisfy the properties of B+ Trees.
Advantages of B+ Trees:
A B+ tree with ‘l’ levels can store more entries in its internal nodes compared to a B-
tree having the same ‘l’ levels. This significantly improves the search time for any given
key. Having fewer levels and the presence of Pnext pointers implies that B+ trees are very
quick and efficient in accessing records from disk.
Data stored in a B+ tree can be accessed both sequentially and directly.
It takes an equal number of disk accesses to fetch records.
B+ trees store search keys redundantly: a search key may appear both in an internal node and in a leaf node.
Disadvantages of B+ Trees:
The major drawback of B-tree is the difficulty of traversing the keys sequentially. The
B+ tree retains the rapid random access property of the B-tree while also allowing rapid
sequential access.
Application of B+ Trees:
Multilevel Indexing
Faster operations on the tree (insertion, deletion, search)
Introduction:-
The B+ Tree is a type of self-balancing, multi-way (n-ary) search tree. It uses multilevel indexing, with leaf
nodes holding the actual data references. An important feature is that all leaf nodes in a B+ tree
are kept at the same depth.
Search is made efficient as the leaf nodes in a B+ Tree are linked in the form of singly-linked
lists. You can see the difference between B trees and B+ tree structures, respectively, where
records in the latter are stored as linked lists.
B+ Trees in DBMS store a huge amount of data that cannot be stored in the limited main
memory. The leaf nodes are stored in secondary memory, while only the internal nodes are
stored in the main memory. This is called multilevel indexing.
B+ Trees are a type of self-balancing tree data structure that is used to store and retrieve
large amounts of data efficiently. They are similar to B-Trees, but with some additional
features that make them better suited for use in database systems. B+ Trees maintain a
separate leaf-node level that is connected by a linked list, which allows for efficient range
queries and sequential access to data. Additionally, B+ Trees have a higher fanout than B-
trees.
Why Use a B+ Tree?
A B+ tree is a self-balancing tree data structure primarily used in database and file systems.
It provides an efficient way to store and retrieve data by maintaining balance across its nodes.
B+ trees are an enhancement of B-trees, a balanced tree structure, but with a few distinct
features that make them particularly well-suited for disk-based storage systems.
Here’s why B+ trees are commonly used:
Efficient Range Queries: Unlike binary search trees, B+ trees store all keys in leaf
nodes in a sorted manner. Leaf nodes are linked, allowing sequential access, which
makes range queries efficient (e.g., retrieving all entries in a certain range).
Fast Access to Disk-Based Storage: Since B+ trees minimize the depth of the tree, they
reduce the number of disk I/O operations required to access data. Non-leaf nodes only
contain keys (not actual records), which allows more keys to be stored per node,
minimizing disk reads and reducing the search time.
Improved Memory Utilization: B+ trees use a large branching factor, which minimizes
the height of the tree and enables more efficient use of memory, allowing them to store
more data without increasing the tree’s depth significantly.
Balanced Structure: Like other balanced trees, B+ trees keep all leaf nodes at the same
depth, ensuring uniform data retrieval times. This balance is maintained automatically
with insertions and deletions.
High Insertion and Deletion Efficiency: In B+ trees, insertions and deletions require
minimal reorganization; since records are stored only in the leaf nodes, usually only those
nodes (and sometimes their parents) need to be adjusted when the structure changes.
Implementation of B+ Tree in DBMS:-
In DBMS, a B+ tree is used for indexing large datasets. The key concepts involved in a B+
tree implementation include nodes, branching factor, and balancing mechanisms. Here’s
an overview of how a B+ tree works in a DBMS context:
Nodes and Order:
A B+ tree has a defined order, which dictates the maximum number of children a node
can have. For instance, an order-4 B+ tree allows a maximum of 4 children per node.
Internal nodes (non-leaf nodes) contain only keys for navigation, while leaf
nodes contain actual data records or pointers to the records.
Insertion:
Inserting a new key follows a search for the appropriate leaf node. If the leaf node has
space, the new key is added.
If the node is full, it splits, promoting the middle key to the parent node and creating a
new leaf. This process continues up the tree as necessary, ensuring balance.
Deletion:
When a key is deleted, it’s removed from the leaf node. If this causes underflow (fewer
keys than the minimum), neighboring nodes may lend keys. If that’s not possible, nodes
may merge, and balancing continues up the tree if needed.
Range Queries:
Since all data records are stored in sorted order at the leaf level and leaf nodes are linked,
range queries can be done quickly by traversing through the leaf nodes, making it very
efficient.
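A minimal Python sketch of this idea follows (the Leaf class, its field names, and range_scan are illustrative assumptions, not a standard DBMS API). The scan starts at the leaf that a point search for the lower bound would reach and simply follows the leaf links.

# Sketch: answering a range query by walking the linked leaf level of a B+ tree.
class Leaf:
    def __init__(self, keys, records, nxt=None):
        self.keys = keys        # sorted search-key values in this leaf
        self.records = records  # one record (or record pointer) per key
        self.next = nxt         # link to the right sibling leaf

def range_scan(start_leaf, low, high):
    """Return all records whose key lies in [low, high], starting from the
    leaf that a point search for `low` would reach."""
    results = []
    leaf = start_leaf
    while leaf is not None:
        for key, rec in zip(leaf.keys, leaf.records):
            if key > high:          # keys are sorted, so we can stop early
                return results
            if key >= low:
                results.append(rec)
        leaf = leaf.next            # follow the leaf link to the next leaf
    return results

# Example: three linked leaves holding keys 10..60
l3 = Leaf([50, 60], ["r50", "r60"])
l2 = Leaf([30, 40], ["r30", "r40"], l3)
l1 = Leaf([10, 20], ["r10", "r20"], l2)
print(range_scan(l1, 20, 50))   # ['r20', 'r30', 'r40', 'r50']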
Splitting and Merging:
Splits and merges happen as part of maintaining balance. Splitting involves dividing a
full node and pushing one key up to maintain order. Merging occurs when nodes have
too few keys, keeping the tree balanced without redundancy.
Search Operation:
Searching in a B+ tree involves traversing from the root to the leaf level, following keys
in the internal nodes. Since B+ trees are balanced and have a shallow depth, this search
process is efficient.
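As a rough illustration of the traversal described above, here is a small, self-contained Python sketch (the Node class and search function are assumed names, not the implementation of any particular DBMS). Internal nodes hold only keys and child pointers; leaves hold keys and record pointers.

# Sketch: B+ tree search, descending from the root to the leaf level.
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None, records=None, is_leaf=False):
        self.keys = keys                 # sorted keys in this node
        self.children = children or []   # child pointers (internal nodes only)
        self.records = records or []     # record pointers (leaf nodes only)
        self.is_leaf = is_leaf

def search(root, key):
    """Descend from the root to the leaf level, then look up the key."""
    node = root
    while not node.is_leaf:
        # bisect_right picks the child whose key range covers `key`
        node = node.children[bisect_right(node.keys, key)]
    if key in node.keys:
        return node.records[node.keys.index(key)]
    return "record not found"

# Example: root [20, 40] with three leaf children
leaves = [Node([10], records=["r10"], is_leaf=True),
          Node([20, 30], records=["r20", "r30"], is_leaf=True),
          Node([40, 50], records=["r40", "r50"], is_leaf=True)]
root = Node([20, 40], children=leaves)
print(search(root, 30))   # r30
print(search(root, 35))   # record not found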
For example, in a B+ tree where every node holds between two and four keys, each internal node and
each leaf stays within that range. The key values in the leaves are shown with an asterisk (to
indicate other information associated with a search key); the leaves hold pointers to that
additional information, i.e. the actual records.
Searching a record in B+ Tree
Searching a record in a B+ tree involves the following steps:
1. Start at the root node of the tree.
2. Compare the search key with the keys in the current node.
3. If the search key is less than the smallest key in the node, follow the leftmost pointer to
the child node.
4. If the search key is greater than or equal to the largest key in the node, follow the
rightmost pointer to the child node.
5. If the search key lies between two keys in the node, follow the child pointer that lies
immediately before the first key greater than the search key.
6. Repeat steps 2-5 until a leaf node is reached.
7. Search for the record in the leaf node using the search key to locate the corresponding
entry.
8. If the record is found, return it. If not, return a "record not found" message.
Because B+ trees are balanced and all leaf nodes are at the same level, the search time is
O(log n), where n is the number of records in the tree.
To search for a record with a key of 45, we would start at the root node, which contains the
keys 23, 34, and 45. Because 45 is greater than or equal to the largest key in the node (45),
we would follow the rightmost pointer to the corresponding child node.
In that child node, we would find the key value 45 in the leaf node. We would then
use this key to locate the corresponding record in the leaf node and return the record to the
user. If the key value were not found, we would return a "record not found" message.
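The child-selection rule used in steps 2-5 (and in the example above) can be expressed compactly. The snippet below is only an illustrative sketch using Python's bisect module; child_index is an assumed helper name, not part of any DBMS API.

from bisect import bisect_right

def child_index(node_keys, search_key):
    # the number of keys <= search_key gives the child slot to descend into
    return bisect_right(node_keys, search_key)

root_keys = [23, 34, 45]
print(child_index(root_keys, 45))   # 3 -> rightmost child, since 45 >= 45
print(child_index(root_keys, 30))   # 1 -> child between keys 23 and 34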
Inserting a record in B+ Tree
Inserting a record in a B+ tree involves the following steps:
1. Search the tree to find the appropriate leaf node for the new record.
2. If the leaf node has room for the new record (i.e., it has fewer than m-1 keys, where m
is the order of the tree), insert the new key in the correct position and update the
pointers as necessary.
3. If the leaf node is full, split the node into two halves and promote (copy) the median key to the
parent node.
4. If the parent node overflows in turn, repeat the split at the parent, continuing up the
tree until the tree is balanced.
Here's an example of inserting a record with a key value of 30 in a B+ tree with an order of 3:
[20, 40]
/ | \
[10] [20] [40, 50]
1. Search the tree to determine the appropriate leaf node for the new record. In this case,
the appropriate leaf node is the one that contains the key value 20 (keys greater than or
equal to 20 and less than 40 belong there).
2. Because the leaf node has room for the new record, we can simply insert it in the correct
position, giving the leaf [20, 30].
3. If the leaf node were full, we would split it into two halves and promote the median key
to the parent node.
4. We would then update the pointers in the parent node to reflect the new child nodes. If
the parent node were full, we would repeat the split process until the tree was balanced.
A small code sketch of this leaf-level insertion logic follows.
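The leaf-level part of this procedure (steps 2 and 3 above) can be sketched as follows; the function name, the split point, and the copy-up convention are illustrative assumptions rather than a fixed specification.

# Sketch: insert a key into a sorted B+ tree leaf, splitting it on overflow.
from bisect import insort

def insert_into_leaf(leaf_keys, key, order):
    """Insert `key` into the sorted list `leaf_keys` of a leaf in a B+ tree
    of the given order (at most order-1 keys per leaf).
    Returns (left_keys, right_keys, promoted_key); right_keys and
    promoted_key are None when no split was needed."""
    insort(leaf_keys, key)                 # keep the leaf sorted
    if len(leaf_keys) <= order - 1:        # still fits: no split needed
        return leaf_keys, None, None
    mid = len(leaf_keys) // 2              # overflow: split in the middle
    left, right = leaf_keys[:mid], leaf_keys[mid:]
    # in a B+ tree the separator key is copied up; it also stays in the right leaf
    return left, right, right[0]

# No split: inserting 30 into the leaf [20] of an order-3 tree
print(insert_into_leaf([20], 30, order=3))        # ([20, 30], None, None)
# Split: inserting 25 into the now-full leaf [20, 30]
print(insert_into_leaf([20, 30], 25, order=3))    # ([20], [25, 30], 25)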
Deleting a record in B+ Tree
Deleting a record from a B+ tree involves the following steps:
1. Search the tree to find the leaf node containing the record to be deleted.
2. Remove the key (and its record pointer) from the leaf node.
3. If the leaf node still has at least the minimum number of records (i.e., at least ceil((m-1)/2)
records, where m is the order of the tree), we are done.
4. If the leaf node has fewer than the minimum number of records, try to redistribute
records from neighboring nodes. If redistribution is not possible, merge the node with a
neighboring node.
5. Update the parent node's keys and pointers as necessary; if the parent underflows, continue
rebalancing up the tree.
Here's an example of deleting a record with a key value of 20 from a B+ tree with an order of
3:
[20, 40]
/ | \
[10] [20, 30] [40, 50]
1. Search the tree to find the leaf node containing the record to be deleted. In this case, the
leaf node is the one that contains the key values 20 and 30.
2. Remove the key 20 from that leaf, leaving it with the key 30.
3. Because the leaf node still has at least the minimum number of records, we are done.
4. If the leaf node had fewer than the minimum number of records, we would try to
redistribute records from neighboring nodes. For example, if the key 10 had been deleted
instead, its leaf would underflow and we would have to redistribute keys from a
neighboring leaf or, if redistribution were not possible, merge the node with a neighboring node.
5. We would then update the parent node's keys and pointers as necessary. A short sketch of
this borrow-or-merge decision follows.
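The borrow-or-merge decision referred to in steps 4 and 5 can be sketched roughly as follows; handle_underflow and its behaviour (borrowing from the right sibling, working on plain key lists) are simplifying assumptions, not a complete deletion algorithm.

# Sketch: fix a leaf underflow after deletion by borrowing from a sibling or merging.
import math

def handle_underflow(leaf_keys, sibling_keys, order):
    """Return (action, leaf_keys, sibling_keys), where action is
    'ok', 'borrowed' or 'merged'."""
    min_keys = math.ceil((order - 1) / 2)
    if len(leaf_keys) >= min_keys:
        return "ok", leaf_keys, sibling_keys          # no underflow at all
    if len(sibling_keys) > min_keys:                  # sibling can spare a key
        borrowed = sibling_keys.pop(0)                # assume a right sibling
        leaf_keys.append(borrowed)
        return "borrowed", leaf_keys, sibling_keys    # parent separator must then be updated
    merged = sorted(leaf_keys + sibling_keys)         # otherwise merge the two leaves
    return "merged", merged, []

# Order 4: leaves must keep at least ceil(3/2) = 2 keys
print(handle_underflow([30], [40, 50, 60], order=4))  # ('borrowed', [30, 40], [50, 60])
print(handle_underflow([30], [40, 50], order=4))      # ('merged', [30, 40, 50], [])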
Properties of B+ Tree:-
B+ tree is a specialized Data Structure used for indexing and searching large amounts of data.
Here are the key properties of a B+ tree:
All keys are stored in the leaves, and each leaf node is linked to its neighbouring nodes.
Each internal node (except for the root) has at least ceil(m/2) child pointers, where m is the
order of the tree, and therefore at least ceil(m/2) - 1 keys; each leaf holds at least
ceil((m-1)/2) keys.
All leaf nodes are at the same level, which ensures efficient range queries.
The root node may have fewer keys and child pointers than other nodes.
The tree is balanced, which means that the path from the root to any leaf node has the
same length.
Features of B+ Tree
The following are some key features of B+ Trees:-
Balanced Tree Structure
Disk-based storage
Application of B+ Trees
The applications of B+ trees are:
Database Indexing: B+ trees are widely used in database management systems for indexing.
They provide efficient retrieval of records based on the values of the indexed columns.
File Systems: B+ trees are employed in file systems to organize and manage large amounts
of data efficiently. They help in quick retrieval of file blocks and support sequential access.
Information Retrieval: B+ trees are used in information retrieval systems, such as search
engines, to index and quickly locate relevant documents based on keywords.
Geographic Information Systems (GIS): GIS applications use B+ trees to index spatial data
efficiently, facilitating spatial queries and range searches.
DNS Servers: Domain Name System (DNS) servers often use B+ trees to store and manage
domain names, enabling fast lookups and updates.
File Databases: B+ trees are utilized in file databases, where they help organize and manage
large volumes of data with efficient search and retrieval operations.
Caching Mechanisms: B+ trees can be employed in caching mechanisms to efficiently store
and retrieve frequently accessed data, improving overall system performance.
Memory Management: B+ trees are used in memory management systems to efficiently
organize and locate data in virtual memory.
Hashing in DBMS:-
Hashing in DBMS is a technique to quickly locate a data record in a database, irrespective
of the size of the database. For larger databases containing thousands or millions of
records, the indexing technique becomes inefficient because searching for a
specific record through the index consumes more time. This does not align with the goals
of a DBMS, where performance matters and data retrieval time must be kept to a minimum. So, to
counter this problem, the hashing technique is used. The following sections describe various
hashing techniques.
What is Hashing?
The hashing technique utilizes an auxiliary hash table to store the data records using a hash
function. There are three key components in hashing:
Hash Table: A hash table is an array-like data structure whose size is determined by the
total volume of data records present in the database. Each memory location in a hash
table is called a 'bucket' or hash index; it stores a data record's exact location and can
be accessed through a hash function.
Bucket: A bucket is a memory location (index) in the hash table that stores the data
record. These buckets generally store a disk block, which in turn stores multiple records.
It is also known as the hash index.
Hash Function: A hash function is a mathematical equation or algorithm that takes a
data record's primary key as input and computes the hash index as output.
Hash Function:-
A hash function is a mathematical algorithm that computes the index or the location where
the current data record is to be stored in the hash table so that it can be accessed efficiently
later. This hash function is the most crucial component that determines the speed of
fetching data.
Working of Hash Function:-
The hash function generates a hash index from the primary key of the data record.
Now, there are 2 possibilities:
1. The hash index generated isn't already occupied by any other value, so the address of
the data record is stored there.
2. The hash index generated is already occupied by some other value. This is called a
collision; to counter it, a collision resolution technique is applied.
Whenever we later query a specific record, the hash function is applied again and returns
the data record comparatively faster than indexing, because we can reach the exact
location of the record directly through the hash function rather than searching through
indices one by one.
Example:
The primary key is used as the input to the hash function and the hash function generates the
output as the hash index (bucket's address) which contains the address of the actual data
record on the disk block.
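A tiny Python sketch of such a static, modulo-based hash function is shown below; the bucket count, key values, and names are illustrative assumptions only.

# Sketch: mapping a record's primary key to a bucket index in a fixed-size hash table.
NUM_BUCKETS = 10                       # fixed at creation time in static hashing

def hash_index(primary_key: int) -> int:
    """Modulo hash: the same key always maps to the same bucket."""
    return primary_key % NUM_BUCKETS

# Each bucket would normally point to a disk block holding the records.
buckets = [[] for _ in range(NUM_BUCKETS)]
for key in (103, 76, 93, 155):
    buckets[hash_index(key)].append(key)

print(hash_index(103))   # 3 -> bucket 3
print(buckets[3])        # [103, 93]  (93 also hashes to 3: a collision stored in the same bucket)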
Static Hashing has the following Properties:
Data Buckets: The number of buckets in memory remains constant. The size of the hash
table is decided initially; it may also implement chaining to handle some collision issues,
though this is only a slight optimization and may not prove worthwhile if the database size
keeps fluctuating.
Hash function: It uses a simple hash function to map each data record to its
appropriate bucket, generally a modulo hash function (bucket = key mod number of buckets).
Efficient for known data size: It is very efficient when the data size and its
distribution in the database are known in advance.
It is inefficient and inaccurate when the data size varies dynamically, because the space is
limited and the hash function always generates the same value for a given input. When the
data size fluctuates often it is not useful at all, because collisions keep happening and
result in problems such as bucket skew and insufficient buckets.
To resolve this problem of bucket overflow, techniques such as chaining and open
addressing are used. Here's a brief overview of both:
1. Chaining:-
Chaining is a mechanism in which each bucket of the hash table heads a linked list of nodes, so
a bucket can hold a long chain of entries. Even if the hash function generates the same value for
several data records, they can still be stored in the same bucket by adding a new node to its chain.
However, this can give rise to the problem of bucket skew: if the hash function keeps
generating the same value again and again, hashing becomes inefficient because the
remaining buckets stay unoccupied or hold very little data.
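A minimal Python sketch of chaining is shown below; the class names and the choice of a modulo hash function are illustrative assumptions, and a real DBMS would chain disk blocks rather than in-memory nodes.

# Sketch: a chained hash table where each bucket heads a linked list of nodes.
class ChainNode:
    def __init__(self, key, record):
        self.key = key
        self.record = record
        self.next = None                   # next node in this bucket's chain

class ChainedHashTable:
    def __init__(self, num_buckets=10):
        self.buckets = [None] * num_buckets

    def _index(self, key):
        return key % len(self.buckets)     # simple modulo hash function

    def insert(self, key, record):
        node = ChainNode(key, record)
        idx = self._index(key)
        node.next = self.buckets[idx]      # push onto the front of the chain
        self.buckets[idx] = node

    def search(self, key):
        node = self.buckets[self._index(key)]
        while node is not None:            # walk the chain in this bucket
            if node.key == key:
                return node.record
            node = node.next
        return None

table = ChainedHashTable()
table.insert(103, "rec-103")
table.insert(93, "rec-93")      # collides with 103 (both map to bucket 3)
print(table.search(93))         # rec-93
print(table.search(7))          # None (not stored)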