Data Structure Notes
Introduction
Basic Terminology-
Data structures are the programmatic way of storing data so that
it can be used efficiently. Almost every enterprise application
uses various types of data structures in one way or another.
Multidimensional-
Address Calculation-
Address Calculation in Single (One) Dimension Array:
The address of an element of an array, say A[ I ], is calculated
using the following formula:
Address of A [ I ] = B + W * ( I – LB )
Where,
B = Base address
W = Storage Size of one element stored in the array (in byte)
I = Subscript of element whose address is to be found
LB = Lower limit / Lower Bound of subscript, if not specified
assume 0 (zero).
Example:
Given the base address of an array B[1300…..1900] as 1020 and
size of each element is 2 bytes in the memory. Find the address
of B[1700].
Solution:
The given values are: B = 1020, LB = 1300, W = 2, I = 1700
Address of A [ I ] = B + W * ( I – LB )
= 1020 + 2 * (1700 – 1300)
= 1020 + 2 * 400
= 1020 + 800
= 1820 [Ans]
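The formula can be checked with a few lines of C++. This is only a minimal sketch: the function name elementAddress and its parameter names are illustrative and do not come from these notes.

```cpp
// Address of A[i] in a one-dimensional array:
//   base + width * (i - lowerBound)
// base       : address of the first element (B)
// width      : storage size of one element in bytes (W)
// i          : subscript of the element whose address is wanted (I)
// lowerBound : lower bound of the subscript (LB), 0 if not specified
long elementAddress(long base, long width, long i, long lowerBound = 0)
{
    return base + width * (i - lowerBound);
}
```

With the values of the example, elementAddress(1020, 2, 1700, 1300) evaluates to 1820.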
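Address Calculation in Two Dimension Array:
In row-major order (rows stored one after another), the address of
an element A[ I ][ J ] is calculated using the following formula:
Address of A [ I ][ J ] = B + W * [ N * ( I – Lr ) + ( J – Lc ) ]
In column-major order (columns stored one after another), the
formula is:
Address of A [ I ][ J ] = B + W * [ ( I – Lr ) + M * ( J – Lc ) ]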
Where,
B = Base address
I = Row subscript of element whose address is to be found
J = Column subscript of element whose address is to be found
W = Storage Size of one element stored in the array (in byte)
Lr = Lower limit of row/start row index of matrix, if not given
assume 0 (zero)
Lc = Lower limit of column/start column index of matrix, if not
given assume 0 (zero)
M = Number of rows of the given matrix
N = Number of columns of the given matrix
Examples:
Q 1. An array X [-15…10, 15…40] requires one byte of storage for
each element. If the beginning location is 1500, determine the
location of X [15][20].
Solution:
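The number of rows and columns is not given in the question, so
they are calculated from the bounds (Ur and Uc are the upper bounds
of the row and column subscripts):
Number of rows, M = (Ur – Lr) + 1 = [10 – (–15)] + 1 = 26
Number of columns, N = (Uc – Lc) + 1 = [40 – 15] + 1 = 26
(i) Column-major order:
Address of A [ I ][ J ] = B + W * [ ( I – Lr ) + M * ( J – Lc ) ]
= 1500 + 1 * [(15 – (–15)) + 26 * (20 – 15)]
= 1500 + 1 * [30 + 26 * 5]
= 1500 + 1 * [160]
= 1660 [Ans]
(ii) Row-major order:
Address of A [ I ][ J ] = B + W * [ N * ( I – Lr ) + ( J – Lc ) ]
= 1500 + 1 * [26 * (15 – (–15)) + (20 – 15)]
= 1500 + 1 * [26 * 30 + 5]
= 1500 + 1 * [780 + 5]
= 1500 + 785
= 2285 [Ans]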
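The two formulas used in the solution (the one with M is commonly called column-major order, the one with N row-major order) can be sketched in C++ as below. The function names extent, rowMajorAddress and columnMajorAddress are assumptions made for this illustration.

```cpp
// Number of subscripts in the inclusive range [lower, upper].
long extent(long lower, long upper)
{
    return upper - lower + 1;
}

// Row-major order: rows are stored one after another.
// n is the number of columns; lr and lc are the lower bounds.
long rowMajorAddress(long base, long width, long i, long j,
                     long lr, long lc, long n)
{
    return base + width * (n * (i - lr) + (j - lc));
}

// Column-major order: columns are stored one after another.
// m is the number of rows; lr and lc are the lower bounds.
long columnMajorAddress(long base, long width, long i, long j,
                        long lr, long lc, long m)
{
    return base + width * ((i - lr) + m * (j - lc));
}
```

For the array X of the example, extent(-15, 10) and extent(15, 40) both give 26, rowMajorAddress(1500, 1, 15, 20, -15, 15, 26) gives 2285 and columnMajorAddress(1500, 1, 15, 20, -15, 15, 26) gives 1660.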
Applications of Arrays-
Arrays are used to implement mathematical vectors and matrices, as
well as other kinds of rectangular tables. Many databases, small and
large, consist of one-dimensional arrays whose elements
are records.
Arrays are used to implement other data structures, such as
lists, heaps, hash tables, deques, queues and stacks.
One or more large arrays are sometimes used to emulate in-
program dynamic memory allocation, particularly memory
pool allocation. Historically, this has sometimes been the only way to
allocate "dynamic memory" portably.
Arrays can be used to determine partial or complete control flow in
programs, as a compact alternative to (otherwise repetitive)
multiple "if" statements. They are known in this context as control
tables and are used in conjunction with a purpose-built interpreter
whose control flow is altered according to values contained in the
array. The array may contain subroutine pointers (or relative
subroutine numbers that can be acted upon by SWITCH statements)
that direct the path of execution.
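A control table of subroutine pointers might look like the following sketch. The three operations and the names operations and dispatch are invented for the illustration; a real interpreter would define its own opcodes and error handling.

```cpp
// Each operation is a small function with the same signature.
int add(int a, int b)      { return a + b; }
int subtract(int a, int b) { return a - b; }
int multiply(int a, int b) { return a * b; }

// The control table: an array of subroutine pointers indexed by opcode.
int (*const operations[])(int, int) = { add, subtract, multiply };
const int operationCount = sizeof(operations) / sizeof(operations[0]);

// Looks the handler up in the table instead of testing the opcode
// with a chain of "if" statements. An opcode outside the table
// returns 0, which is an arbitrary choice for this sketch.
int dispatch(int opcode, int a, int b)
{
    if (opcode < 0 || opcode >= operationCount)
        return 0;
    return operations[opcode](a, b);
}
```

Adding a new operation then means adding one entry to the array rather than another branch.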
Character String in C-
Strings are actually one-dimensional arrays of characters terminated
by a null character '\0'. Thus a null-terminated string contains the
characters that comprise the string followed by a null.
The following declaration and initialization create a string
consisting of the word "Hello". To hold the null character at the end
of the array, the size of the character array containing the string is
one more than the number of characters in the word "Hello".
char greeting[6] = {'H', 'e', 'l', 'l', 'o', '\0'};
If you follow the rule of array initialization then you can write the
above statement as follows −
char greeting[] = "Hello";
Following is the memory presentation of the above defined string in C/C++ −
Index:  0    1    2    3    4    5
Value: 'H'  'e'  'l'  'l'  'o'  '\0'
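The role of the terminating null can be seen by counting characters by hand. The sketch below behaves like the standard strlen function; the name countChars is made up for the illustration.

```cpp
#include <cstring>

// Counts the characters before the terminating '\0'.
// The null itself is not counted, which is why "Hello" has
// length 5 but needs an array of 6 characters.
std::size_t countChars(const char text[])
{
    std::size_t n = 0;
    while (text[n] != '\0')
        ++n;
    return n;
}
```

For char greeting[] = "Hello"; countChars(greeting) gives 5 while sizeof(greeting) is 6.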
Character String Operation-
A character string is a series of characters represented by bits of
code and organized into a single variable. This string variable
holding characters can be set to a specific length or analyzed by a
program to identify its length.
A character string can play many roles in a computer program. For
example, a programmer can create an unpopulated character string
with a command in the load function of a program.
A user event can input data into that character string. If the user
types in a word or phrase such as "hello world," the program can
then later read that character string and print it, display it on the
screen, reserve it for storage, etc.
In modern programming, character strings are often involved in data
capture and data storage functions that take in names or other types
of information.
Array as Parameters-
An array can also be passed to a function as an argument. The
function processes the array and produces output. Passing an array
as a parameter in C++ is as easy as passing any other value: just
create a function that accepts an array as an argument and then
processes it. The following demonstration shows how to pass an
array as an argument in C++ programming.
#include <iostream>
using namespace std;

// Prints the n elements of array a, separated by spaces.
void show(int a[], int n)
{
    for (int i = 0; i < n; i++)
        cout << a[i] << " ";
    cout << endl;
}

int main()
{
    int arr[5] = {1, 2, 3, 4, 5};
    show(arr, 5);   // the array is passed as a pointer to its first element
    return 0;
}
Ordered List-
The structure of an ordered list is a collection of items where each
item holds a relative position that is based upon some underlying
characteristic of the item. The ordering is typically either ascending
or descending and we assume that list items have a meaningful
comparison operation that is already defined.
In a dynamic data structure the size of the structure is not fixed
and can be modified during the operations performed on it. Dynamic
data structures are designed to facilitate change of data structures
at run time.
[Figure: stack with Insert and Delete at the top of stack (TOS), LIFO order]
Pop Operation
Accessing the content while removing it from the stack is known as
a Pop Operation. In an array implementation of the pop() operation,
the data element is not actually removed; instead, top is decremented
to a lower position in the stack to point to the next value. But in a
linked-list implementation, pop() actually removes the data element
and deallocates memory space.
A Pop operation may involve the following steps −
Step 1 − Checks if the stack is empty.
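The array implementation described above can be sketched as follows. The capacity MAX_SIZE and the names Stack, push, pop and isEmpty are assumptions of this sketch; the steps after the emptiness check follow the paragraph above, not a list given in these notes.

```cpp
const int MAX_SIZE = 100;   // capacity chosen arbitrarily for the sketch

struct Stack {
    int data[MAX_SIZE];
    int top = -1;           // index of the top element, -1 when empty
};

bool isEmpty(const Stack& s)
{
    return s.top == -1;
}

// Stores value on top of the stack; returns false on overflow.
bool push(Stack& s, int value)
{
    if (s.top == MAX_SIZE - 1)
        return false;
    s.data[++s.top] = value;
    return true;
}

// Copies the top element into out and decrements top.
// The array slot is not erased; it is simply no longer part of the stack.
// Returns false if the stack is empty.
bool pop(Stack& s, int& out)
{
    if (isEmpty(s))
        return false;
    out = s.data[s.top];
    --s.top;
    return true;
}
```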
Frame- When the CPU keeps a process waiting, the state of that
process is stored on the stack.
Polish Notation-
1. Prefix
2. Infix
3. Postfix
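Converting from Infix to Postfix- (for example, A+B becomes AB+)
The conversion follows the BEDMAS order of operations. The
operators are applied in this sequence (division and multiplication
rank equally and are taken left to right, as are addition and
subtraction)-
B=Brackets
E=Exponent
D=Division /
M=Multiplication *
A=Addition +
S=Subtraction -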
Example- A/B+C-D*G
(A/B)+C-D*G
(AB/)+C-D*G Let AB/=P
P+C-(D*G)
P+C-(DG*) Let DG*=Q
(P+C)-Q
(PC+)-Q Let PC+=R
R-Q
RQ-
RDG*- (substituting Q=DG*)
PC+DG*- (substituting R=PC+)
AB/C+DG*- Ans (substituting P=AB/)
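The same conversion can be done mechanically with a stack of operators. The sketch below is one common way to do it, not necessarily the method intended in these notes; it assumes single-character operands, the operators + - * / ^ and parentheses, and the names precedence and infixToPostfix are illustrative.

```cpp
#include <cctype>
#include <stack>
#include <string>

// Higher number = applied earlier: ^ then * and / then + and -.
int precedence(char op)
{
    if (op == '^') return 3;
    if (op == '*' || op == '/') return 2;
    if (op == '+' || op == '-') return 1;
    return 0;   // '(' and anything else
}

std::string infixToPostfix(const std::string& infix)
{
    std::string postfix;
    std::stack<char> ops;
    for (char c : infix) {
        if (std::isspace(static_cast<unsigned char>(c)))
            continue;
        if (std::isalnum(static_cast<unsigned char>(c))) {
            postfix += c;                 // operands go straight to the output
        } else if (c == '(') {
            ops.push(c);
        } else if (c == ')') {
            while (!ops.empty() && ops.top() != '(') {
                postfix += ops.top();
                ops.pop();
            }
            if (!ops.empty())
                ops.pop();                // discard the '('
        } else {
            // Pop operators that must be applied before c. Equal
            // precedence pops too, except for right-associative ^.
            while (!ops.empty() && ops.top() != '(' &&
                   (precedence(ops.top()) > precedence(c) ||
                    (precedence(ops.top()) == precedence(c) && c != '^'))) {
                postfix += ops.top();
                ops.pop();
            }
            ops.push(c);
        }
    }
    while (!ops.empty()) {                // flush the remaining operators
        postfix += ops.top();
        ops.pop();
    }
    return postfix;
}
```

For the expression of the example, infixToPostfix("A/B+C-D*G") returns "AB/C+DG*-".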
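Q. Explain the stack operations for evaluating 40 20 30 + -
Symbol   Operation                          Stack
40       Push 40                            40
20       Push 20                            40 20
30       Push 30                            40 20 30
+        Pop 30, pop 20, push 20+30=50      40 50
-        Pop 50, pop 40, push 40-50=-10     -10
The operand popped second is the left operand, so the result is
40-50=-10 Ans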
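Postfix evaluation with a stack can be sketched as below. Under the usual convention the operand popped second is the left operand, so 40 20 30 + - means 40 - (20 + 30). The sketch assumes a well-formed expression with integer operands and space-separated tokens; the name evaluatePostfix is illustrative.

```cpp
#include <sstream>
#include <stack>
#include <string>

// Evaluates a postfix expression whose tokens are separated by spaces.
// Assumes the expression is well formed and uses only + - * /.
int evaluatePostfix(const std::string& expression)
{
    std::stack<int> values;
    std::istringstream tokens(expression);
    std::string token;
    while (tokens >> token) {
        if (token == "+" || token == "-" || token == "*" || token == "/") {
            int right = values.top(); values.pop();   // popped first
            int left  = values.top(); values.pop();   // popped second
            if (token == "+")      values.push(left + right);
            else if (token == "-") values.push(left - right);
            else if (token == "*") values.push(left * right);
            else                   values.push(left / right);
        } else {
            values.push(std::stoi(token));            // operand: push it
        }
    }
    return values.top();
}
```

evaluatePostfix("40 20 30 + -") returns -10.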
Application of Queue-
A queue, as the name suggests, is used whenever we need to
manage a group of objects in an order in which the first
one coming in also gets out first while the others wait for
their turn, as in the following scenario:
Serving requests on a single shared resource, like a printer:
the requests are handled in the same order as they arrive,
i.e. first come, first served.
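The printer scenario can be sketched with the standard std::queue container. The function name serveInOrder and the idea of representing a request by its file name are assumptions of the sketch.

```cpp
#include <queue>
#include <string>
#include <vector>

// Puts every arriving job in a queue, then serves the jobs and
// returns the order in which they were served.
std::vector<std::string> serveInOrder(const std::vector<std::string>& arrivals)
{
    std::queue<std::string> waiting;
    for (const std::string& job : arrivals)
        waiting.push(job);                   // enqueue at the rear

    std::vector<std::string> served;
    while (!waiting.empty()) {
        served.push_back(waiting.front());   // the oldest job is at the front
        waiting.pop();                       // dequeue it
    }
    return served;
}
```

The returned order is always the arrival order, which is the first come, first served behaviour described above.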
Tree-
A tree is a data structure made up of nodes or vertices and
edges without having any cycle. The tree with no nodes is
called the null or empty tree. A tree that is not empty
consists of a root node and potentially many levels of
additional nodes that form a hierarchy.
In computer science, a tree is a widely used abstract data
type (ADT)—or data structure implementing this ADT—that
simulates a hierarchical tree structure, with a root value
and sub trees of children with a parent node, represented as
a set of linked nodes.
Descendant
A node reachable by repeated proceeding from parent to
child.
Ancestor
A node reachable by repeated proceeding from child to
parent.
Leaf (less commonly called External node)
A node with no children.
Degree
The number of subtrees of a node.
Edge
The connection between one node and another.
Path
A sequence of nodes and edges connecting a node with a
descendant.
Level
The level of a node is defined by 1 + (the number of
connections between the node and the root).
Height of node
The height of a node is the number of edges on the longest
path between that node and a leaf.
Height of tree
The height of a tree is the height of its root node.
Depth
The depth of a node is the number of edges from the tree's
root node to the node.
Forest
A forest is a set of n ≥ 0 disjoint trees.
Binary Tree-
A binary tree is a non-linear data structure in which each
node has at most two children, which are referred to as the
left child and the right child.
Types of Binary Tree-
Full Binary Tree
A binary tree is full if every node has 0 or 2 children.
We can also say a full binary tree is a binary tree in which
all nodes except leaves have two children.
A general tree can be converted to a binary tree as follows:
1. Use the root of the general tree as the root of the binary tree.
2. Determine the first child of the root. This is the leftmost
node in the general tree at the next level.
3. Insert this node. The child reference of the parent node
refers to this node.
4. Continue finding the first child of each parent node and
insert it below the parent node, with the child reference of
the parent referring to this node.
5. When no more first children exist in the path just used, move
back to the parent of the last node entered and repeat the
above process. In other words, determine the first sibling of
the last node entered.
6. Complete the tree for all nodes. To locate where a node
fits, search for the first child at that level and then follow
the sibling references to a nil, where the next sibling can be
inserted. The children of any sibling node can be inserted by
locating the parent and then inserting the first child. Then
the above process is repeated.
Tree Traversal-
Traversal is a process to visit all the nodes of a tree, and it
may print their values too. Because all nodes are connected
via edges (links), we always start from the root (head) node.
That is, we cannot randomly access a node in a tree. There
are three ways which we use to traverse a tree –
In-order Traversal
Pre-order Traversal
Post-order Traversal
In-order Traversal
In this traversal method, the left subtree is visited first, then
the root and later the right sub-tree. We should always
remember that every node may represent a subtree itself.
If a binary search tree is traversed in-order, the output will
produce sorted key values in ascending order.
Algorithm
Until all nodes are traversed −
Step 1 − Recursively traverse left subtree.
Step 2 − Visit root node.
Step 3 − Recursively traverse right subtree.
Pre-order Traversal
In this traversal method, the root node is visited first, then
the left subtree and finally the right subtree.
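Post-order Traversal
In this traversal method, the root node is visited last: first
the left subtree is traversed, then the right subtree and
finally the root node.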
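The three traversal orders differ only in where the root is visited relative to the two recursive calls. In the sketch below the Node structure and the function names are assumptions, and "visiting" a node means appending its key to a vector.

```cpp
#include <vector>

struct Node {
    int key;
    Node* left;
    Node* right;
};

// Left subtree, then root, then right subtree.
void inOrder(const Node* root, std::vector<int>& out)
{
    if (root == nullptr) return;
    inOrder(root->left, out);
    out.push_back(root->key);
    inOrder(root->right, out);
}

// Root, then left subtree, then right subtree.
void preOrder(const Node* root, std::vector<int>& out)
{
    if (root == nullptr) return;
    out.push_back(root->key);
    preOrder(root->left, out);
    preOrder(root->right, out);
}

// Left subtree, then right subtree, then root.
void postOrder(const Node* root, std::vector<int>& out)
{
    if (root == nullptr) return;
    postOrder(root->left, out);
    postOrder(root->right, out);
    out.push_back(root->key);
}
```

For a tree with root 2, left child 1 and right child 3, the orders are 1 2 3 (in-order), 2 1 3 (pre-order) and 1 3 2 (post-order).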
Rotation of Tree-
Height balanced tree-
Also called AVL tree.
Named after its inventors Adelson-Velsky and Landis.
AVL trees are height-balancing binary search trees. An AVL
tree checks the heights of the left and the right sub-trees and
assures that the difference is not more than 1. This
difference is called the Balance Factor.
[Figure: three example trees; the first is balanced and the next
two are not balanced]
In the second tree, the left subtree of C has height 2 and the
right subtree has height 0, so the difference is 2. In the third
tree, the right subtree of A has height 2 and the left is
missing, so it is 0, and the difference is 2 again. An AVL tree
permits the difference (balance factor) to be at most 1.
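The balance check can be sketched as below. The sketch follows the convention of this passage, in which a missing subtree has height 0 and a single node has height 1; the Node structure and the function names are assumptions. A practical AVL tree would store the height in each node instead of recomputing it.

```cpp
#include <algorithm>

struct Node {
    int key;
    Node* left;
    Node* right;
};

// Height counted in nodes: empty subtree = 0, single node = 1.
int height(const Node* node)
{
    if (node == nullptr) return 0;
    return 1 + std::max(height(node->left), height(node->right));
}

// Height of the left subtree minus height of the right subtree.
// The caller must pass a non-null node.
int balanceFactor(const Node* node)
{
    return height(node->left) - height(node->right);
}

// True when every node has a balance factor of -1, 0 or 1.
bool isBalanced(const Node* node)
{
    if (node == nullptr) return true;
    int bf = balanceFactor(node);
    return bf >= -1 && bf <= 1 &&
           isBalanced(node->left) && isBalanced(node->right);
}
```

For a chain C, B, A in which each node is the left child of the previous one, balanceFactor of C is 2 and isBalanced returns false.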
AVL Rotations-
To balance itself, an AVL tree may perform the following four
kinds of rotations –
Left rotation
Right rotation
Left-Right rotation
Right-Left rotation
The first two rotations are single rotations and the next two
rotations are double rotations. To have an unbalanced tree, we at
least need a tree of height 2. With this simple tree, let's
understand them one by one.
Left Rotation
If a tree becomes unbalanced when a node is inserted into the
right subtree of the right subtree, then we perform a single left
rotation.
Right-Left Rotation
The second type of double rotation is Right-Left Rotation. It is a
combination of right rotation followed by left rotation.
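The pointer changes behind the single and double rotations can be sketched as follows. The Node structure and function names are assumptions, the functions expect the child pointers they use to be non-null, and the height bookkeeping of a full AVL implementation is left out.

```cpp
struct Node {
    int key;
    Node* left;
    Node* right;
};

// Left rotation: the right child becomes the new root of the subtree.
// Returns the new subtree root.
Node* rotateLeft(Node* root)
{
    Node* pivot = root->right;
    root->right = pivot->left;   // pivot's left subtree moves under the old root
    pivot->left = root;          // the old root becomes pivot's left child
    return pivot;
}

// Right rotation: mirror image of rotateLeft.
Node* rotateRight(Node* root)
{
    Node* pivot = root->left;
    root->left = pivot->right;
    pivot->right = root;
    return pivot;
}

// Right-Left rotation: a right rotation on the right child,
// followed by a left rotation on the root.
Node* rotateRightLeft(Node* root)
{
    root->right = rotateRight(root->right);
    return rotateLeft(root);
}
```

For the chain A, B, C in which each node is the right child of the previous one, rotateLeft on A returns B with A as its left child and C as its right child.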
B-Tree-
A B-tree is a self-balancing tree data structure that
keeps data sorted and allows searches, sequential access,
insertions, and deletions in logarithmic time. The B-tree is a
generalization of a binary search tree in that a node can have
more than two children. Unlike self-balancing binary
search trees, the B-tree is optimized for systems that read
and write large blocks of data. B-trees are a good example of
a data structure for external memory. They are commonly
used in databases and filesystems.
Graph-
A graph is a non-linear type of data structure.
A graph is a pair of sets (V, E), where V is the set of vertices
and E is the set of edges, each connecting a pair of vertices.
A --- B
|     |
C --- D --- E
In the above graph,
V={A,B,C,D,E}
E={AB,AC,BD,CD,DE}
Types of Graph-
There are 2 types of Graph-
1. Directed Graph
2. Undirected Graph
[Figure: a directed graph and an undirected graph on the vertices
A, B, C, D]
Graph Terminology-
Weighted graph- A weighted graph is a graph in which
each edge is given a numerical weight. A weighted
graph is therefore a special type of labeled graph in which
the labels are numbers (which are usually taken to be
positive).
Unweighted graph- An unweighted graph is a graph in
which no edge has a numerical weight.
Adjacent- If there is an edge between vertices A and B
then both A and B are said to be adjacent. In other words,
two vertices A and B are said to be adjacent if there is an
edge whose end vertices are A and B.
Degree- The total number of edges connected to a vertex is
said to be the degree of that vertex.
Incident edge- An edge is said to be incident on a vertex if
the vertex is one of the endpoints of that edge.
Isolated vertex- An isolated vertex is a vertex with degree
zero; that is, a vertex that is not an endpoint of any edge.
Path- A path is a sequence of alternating vertices and edges
that starts at a vertex and ends at a vertex such that each edge
is incident to its predecessor and successor vertex.
Self-loop- A self-loop is an edge whose two end vertices are the
same vertex.
Graph Representation-
A graph data structure is represented using the following
representations.
Adjacency Matrix
Incidence Matrix
Adjacency List
Adjacency Matrix-
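As an illustration of the adjacency matrix representation, the sketch below builds the matrix of an undirected graph from a list of edges. Numbering the vertices 0, 1, 2, ... and the name adjacencyMatrix are assumptions of the sketch.

```cpp
#include <utility>
#include <vector>

// Builds the adjacency matrix of an undirected graph whose vertices
// are numbered 0 .. vertexCount-1. matrix[u][v] is 1 when an edge
// joins u and v, and 0 otherwise.
std::vector<std::vector<int>> adjacencyMatrix(
    int vertexCount, const std::vector<std::pair<int, int>>& edges)
{
    std::vector<std::vector<int>> matrix(vertexCount,
                                         std::vector<int>(vertexCount, 0));
    for (const auto& edge : edges) {
        matrix[edge.first][edge.second] = 1;
        matrix[edge.second][edge.first] = 1;   // undirected: symmetric entry
    }
    return matrix;
}
```

With A=0, B=1, C=2, D=3, E=4 and the edges AB, AC, BD, CD, DE, row 3 of the matrix contains three 1s, which is the degree of vertex D.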
Graph Traversal-
Graph traversal is a technique used for searching a vertex in a
graph. The graph traversal is also used to decide the order of
vertices to be visited in the search process. A graph traversal
finds the edges to be used in the search process without
creating loops; that is, using graph traversal we visit all
vertices of the graph without getting into a looping path.
BFS (Breadth First Search)-
BFS traversal of a graph produces a spanning tree as the final
result. A spanning tree is a connected subgraph that contains
every vertex and has no loops. We use a queue data structure,
with a maximum size equal to the total number of vertices in
the graph, to implement BFS traversal of a graph.
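A queue-based BFS can be sketched as follows. The graph is assumed to be given as an adjacency list with vertices numbered from 0, and the function name bfs is illustrative. The sketch returns the visit order rather than the spanning tree itself.

```cpp
#include <queue>
#include <vector>

// Breadth-first traversal from start over an adjacency list.
// Returns the vertices in the order in which they are visited.
std::vector<int> bfs(const std::vector<std::vector<int>>& adjacent, int start)
{
    std::vector<bool> visited(adjacent.size(), false);
    std::vector<int> order;
    std::queue<int> pending;

    visited[start] = true;
    pending.push(start);
    while (!pending.empty()) {
        int u = pending.front();
        pending.pop();
        order.push_back(u);
        for (int v : adjacent[u]) {
            // Marking on enqueue keeps each vertex in the queue at
            // most once, so the queue never exceeds the vertex count.
            if (!visited[v]) {
                visited[v] = true;
                pending.push(v);
            }
        }
    }
    return order;
}
```

The edges (u, v) along which a vertex v is first marked are the edges of the spanning tree.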
Shortest Path-
The shortest path problem is the problem of finding the
shortest path in a graph from one vertex to another.
"Shortest" may mean the least number of edges, the least
total weight, etc.
[Figure: shortest path (A, C, E, D, F) between vertices A and F
in a weighted directed graph]
Transitive Closure-
Given a directed graph, find out if a vertex j is reachable from
another vertex i for all vertex pairs (i, j) in the given graph.
Here reachable means that there is a path from vertex i to
j. The reachability matrix is called the transitive closure of
the graph.
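One standard way to compute the reachability matrix is Warshall's algorithm, sketched below. These notes do not say which algorithm is intended, so this is only one possible choice; the function name transitiveClosure and the convention that every vertex reaches itself are assumptions of the sketch.

```cpp
#include <cstddef>
#include <vector>

// adjacency[i][j] is true when there is an edge from i to j.
// Returns reach, where reach[i][j] is true when j is reachable from i.
std::vector<std::vector<bool>> transitiveClosure(
    const std::vector<std::vector<bool>>& adjacency)
{
    const std::size_t n = adjacency.size();
    std::vector<std::vector<bool>> reach = adjacency;
    for (std::size_t i = 0; i < n; ++i)
        reach[i][i] = true;              // a vertex reaches itself

    // Allow vertex k as an intermediate stop, for k = 0, 1, ..., n-1.
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (reach[i][k] && reach[k][j])
                    reach[i][j] = true;
    return reach;
}
```

For the edges 0 to 1, 1 to 2 and 2 to 3, vertex 3 is reachable from vertex 0 but vertex 0 is not reachable from vertex 3.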
Algorithm
Linear Search ( Array A, Value x)
Step 1: Set i to 1
Step 2: if i > n then go to step 7
Step 3: if A[i] = x then go to step 6
Step 4: Set i to i + 1
Step 5: Go to Step 2
Step 6: Print Element x Found at index i and go to step 8
Step 7: Print element not found
Step 8: Exit
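The steps above translate almost directly into C++. The sketch uses 0-based indexes, as C++ arrays do, whereas the steps count from 1, and it returns -1 instead of printing "not found". The function name linearSearch is illustrative.

```cpp
// Returns the index of the first element equal to x,
// or -1 if x does not occur in a[0..n-1].
int linearSearch(const int a[], int n, int x)
{
    for (int i = 0; i < n; ++i)
        if (a[i] == x)
            return i;
    return -1;
}
```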
Binary Search-
Binary search is a fast search algorithm with run-time
complexity of Ο(log n). This search algorithm works on the
principle of divide and conquer. For this algorithm to work
properly, the data collection should be in the sorted form.
Binary search looks for a particular item by comparing it with
the middle item of the collection. If a match occurs, then
the index of the item is returned. If the middle item is greater
than the item, then the item is searched for in the sub-array to
the left of the middle item. Otherwise, the item is searched
for in the sub-array to the right of the middle item. This
process continues on the sub-array until the item is found or
the size of the sub-array reduces to zero.
How Binary Search Works?
For a binary search to work, it is mandatory for the target
array to be sorted. We shall learn the process of binary
search with a pictorial example. The following is our sorted
array and let us assume that we need to search the location
of value 31 using binary search.
We change our low to mid + 1 and find the new mid value again.
low = mid + 1
mid = low + (high - low) / 2
Our new mid is 7 now. We compare the value stored at location 7
with our target value 31.
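The complete search can be sketched as below, using the same mid = low + (high - low) / 2 computation. The function name binarySearch is illustrative, and the sample array in the note after the code is an assumption chosen so that the second midpoint is location 7, as in the text.

```cpp
// Searches the sorted (ascending) array a[0..n-1] for target.
// Returns its index, or -1 if it is not present.
int binarySearch(const int a[], int n, int target)
{
    int low = 0;
    int high = n - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        if (a[mid] == target)
            return mid;
        if (a[mid] < target)
            low = mid + 1;     // target can only be in the right part
        else
            high = mid - 1;    // target can only be in the left part
    }
    return -1;                 // the sub-array has shrunk to size zero
}
```

With the assumed array 10, 14, 19, 26, 27, 31, 33, 35, 42, 44 and target 31, the midpoints examined are 4, 7 and 5, and the function returns 5.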
Insertion Sort-
This is an in-place comparison-based sorting algorithm.
Here, a sub-list is maintained which is always sorted. For
example, the lower part of an array is maintained to be
sorted. An element which is to be inserted in this sorted
sub-list has to find its appropriate place and then it has to
be inserted there. Hence the name, insertion sort.
The array is searched sequentially and unsorted items are
moved and inserted into the sorted sub-list (in the same
array). This algorithm is not suitable for large data sets as its
average and worst case complexity are of O(n^2), where n is
the number of items.
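A minimal sketch of insertion sort on an int array follows; the function name insertionSort is illustrative.

```cpp
// Sorts a[0..n-1] into ascending order in place.
void insertionSort(int a[], int n)
{
    for (int i = 1; i < n; ++i) {
        int key = a[i];          // next unsorted item
        int j = i - 1;
        // Shift larger items of the sorted sub-list a[0..i-1] one
        // place to the right to open a gap for key.
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];
            --j;
        }
        a[j + 1] = key;          // insert key at its place
    }
}
```

On an already sorted array the inner loop never runs, which gives the linear best case.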
Selection Sort-
Selection sort is a simple sorting algorithm. This sorting
algorithm is an in-place comparison-based algorithm in
which the list is divided into two parts, the sorted part at the
left end and the unsorted part at the right end. Initially, the
sorted part is empty and the unsorted part is the entire list.
The smallest element is selected from the unsorted array
and swapped with the leftmost element, and that element
becomes a part of the sorted array. This process continues
moving unsorted array boundary by one element to the
right.
This algorithm is not suitable for large data sets as its
average and worst case complexities are of O(n^2), where n is
the number of items.
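A minimal sketch of selection sort on an int array follows; the function name selectionSort is illustrative.

```cpp
#include <utility>

// Sorts a[0..n-1] into ascending order in place.
void selectionSort(int a[], int n)
{
    // a[0..i-1] is the sorted part, a[i..n-1] the unsorted part.
    for (int i = 0; i < n - 1; ++i) {
        int smallest = i;
        for (int j = i + 1; j < n; ++j)
            if (a[j] < a[smallest])
                smallest = j;
        // Swap the smallest unsorted element with the leftmost
        // unsorted element; the boundary then moves one to the right.
        if (smallest != i)
            std::swap(a[i], a[smallest]);
    }
}
```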
Analysis of sorting algorithm-
Time Complexity Analysis–
The best, average and worst case complexities of different
sorting techniques are listed below with possible scenarios.
Comparison Based Sorting–
In comparison based sorting, elements of an array are compared
with each other to find the sorted array.
Bubble Sort and Insertion Sort–
Average and worst case time complexity: n^2
Best case time complexity: n, when the array is already sorted.
Selection Sort–
Best, average and worst case time complexity: n^2, which is
independent of the distribution of data.
Merge Sort–
Best, average and worst case time complexity: nlogn, which is
independent of the distribution of data.
Heap Sort–
Best, average and worst case time complexity: nlogn, which is
independent of the distribution of data.
Quick Sort–
It is a divide and conquer approach with recurrence relation:
T(n) = T(k) + T(n-k-1) + cn
where k is the number of elements smaller than the pivot.
Worst case: the partition is as uneven as possible (k = 0), for
example when the pivot is always the smallest or largest element:
T(n) = T(0) + T(n-1) + cn
Solving this we get, T(n) = O(n^2)
Best and average case: the partition divides the array into two
parts of about equal size:
T(n) = 2T(n/2) + cn
Solving this we get, T(n) = O(nlogn)
Non-comparison Based Sorting–
In non-comparison based sorting, elements of an array are not
compared with each other to find the sorted array.
Radix Sort–
Best, average and worst case time complexity: nk, where k is
the maximum number of digits in the elements of the array.
Count Sort–
Best, average and worst case time complexity: n+k, where k is
the size of the count array.
Bucket Sort–
Best and average time complexity: n+k, where k is the number
of buckets.
Worst case time complexity: n^2, if all elements belong to the
same bucket.
Lower bounds-
A lower bound of a subset S of an ordered set K is an element
of K which is less than or equal to every element of S. A set
with a lower bound is said to be bounded from below by
that bound.
For example, 5 is a lower bound for the set
{ 5, 8, 42, 34, 13934 }; so is 4; but 6 is not.
MergeSort(headRef)
1) If head is NULL or there is only one element in the linked
list, then return.
2) Else divide the linked list into two halves.
      FrontBackSplit(head, &a, &b); /* a and b are two halves */
3) Sort the two halves a and b.
      MergeSort(a);
      MergeSort(b);
4) Merge the sorted a and b (using SortedMerge()) and update
the head pointer using headRef.
      *headRef = SortedMerge(a, b);
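The outline above can be filled in as the following sketch. It passes the head by C++ reference where the outline uses a pointer to the head pointer (headRef), and the bodies of frontBackSplit and sortedMerge are one possible implementation, since the notes do not show them. The ListNode structure is also an assumption.

```cpp
struct ListNode {
    int value;
    ListNode* next;
};

// Splits a list of at least two nodes into two halves, using a slow
// pointer that moves one step while a fast pointer moves two steps.
void frontBackSplit(ListNode* head, ListNode*& front, ListNode*& back)
{
    ListNode* slow = head;
    ListNode* fast = head->next;
    while (fast != nullptr && fast->next != nullptr) {
        slow = slow->next;
        fast = fast->next->next;
    }
    front = head;
    back = slow->next;
    slow->next = nullptr;        // cut the list after the midpoint
}

// Merges two sorted lists into one sorted list by relinking nodes.
ListNode* sortedMerge(ListNode* a, ListNode* b)
{
    ListNode dummy{0, nullptr};  // placeholder in front of the result
    ListNode* tail = &dummy;
    while (a != nullptr && b != nullptr) {
        if (a->value <= b->value) { tail->next = a; a = a->next; }
        else                      { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = (a != nullptr) ? a : b;   // append the remainder
    return dummy.next;
}

// Sorts the list in ascending order and updates head.
void mergeSort(ListNode*& head)
{
    if (head == nullptr || head->next == nullptr)
        return;                  // zero or one element: already sorted
    ListNode* a = nullptr;
    ListNode* b = nullptr;
    frontBackSplit(head, a, b);
    mergeSort(a);
    mergeSort(b);
    head = sortedMerge(a, b);
}
```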
Quick sort-
Quick sort is a highly efficient sorting algorithm and is based
on partitioning of array of data into smaller arrays. A large
array is partitioned into two arrays one of which holds
values smaller than the specified value, say pivot, based on
which the partition is made and another array holds values
greater than the pivot value.
Quick sort partitions an array and then calls itself
recursively twice to sort the two resulting subarrays. This
algorithm is quite efficient for large-sized data sets as its
average case complexity is O(n log n); its worst case
complexity is O(n^2), where n is the number of items.
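A minimal sketch of quick sort follows. It uses the last element as the pivot (the Lomuto partition scheme), which is only one of several common choices and is not prescribed by these notes; the names partitionRange and quickSort are illustrative.

```cpp
#include <utility>

// Rearranges a[low..high] so that values smaller than the pivot come
// first, then the pivot, then the remaining values.
// Returns the final index of the pivot.
int partitionRange(int a[], int low, int high)
{
    int pivot = a[high];         // last element chosen as pivot
    int i = low;                 // next slot for a value < pivot
    for (int j = low; j < high; ++j) {
        if (a[j] < pivot) {
            std::swap(a[i], a[j]);
            ++i;
        }
    }
    std::swap(a[i], a[high]);    // put the pivot between the two parts
    return i;
}

// Sorts a[low..high] into ascending order in place.
void quickSort(int a[], int low, int high)
{
    if (low >= high)
        return;                  // zero or one element
    int p = partitionRange(a, low, high);
    quickSort(a, low, p - 1);    // values smaller than the pivot
    quickSort(a, p + 1, high);   // values not smaller than the pivot
}
```

To sort a whole array of n elements, call quickSort(a, 0, n - 1).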
File Structure
External Storage device-
External storage comprises devices that store information
outside a computer. Such devices may be permanently
attached to the computer, may be removable or may use
removable media.
Examples of external storage:
o Magnetic storage: magnetic tape, floppy disk, external hard
disk drives
o Optical storage: CD, DVD, Blu-ray
o Flash memory devices: memory card, memory stick, USB drives
Files-
A file is an object on a computer that
stores data, information, settings, or commands used with a
computer program. In a graphical user interface (GUI) such
as Microsoft Windows, files display as icons that relate to
the program that opens the file. For example, a PDF file is
shown with an icon associated with Adobe Acrobat. If this file
was on your computer, double-clicking the icon in Windows
would open that file in Adobe Acrobat or the PDF reader
installed on the computer.
Sequential Organization-
In sequential organization the records are placed
sequentially onto the storage media i.e. occupy consecutive
locations in the case of tape that means placing records
adjacent to each other.
In addition the physical sequence of records is ordered on
some key called the primary key.
Sequential organization is also possible in the case of a direct
access storage device (DASD) such as a disk. Even though disk
storage is really two dimensional (cylinder x surface), it may
be mapped down into one dimensional memory.
If the disk has c cylinders and s surfaces, one possibility is
to view disk memory as the sequence of tracks given below.
Using the notation tij to represent the jth track of the ith
surface, the sequence is t11, t21, t31, …, ts1, t12, t22, …, ts2, etc.
This sequential interpretation is particularly efficient
for batched update and retrieval, as the tracks are
accessed in order: all tracks on cylinder 1 followed by all
tracks on cylinder 2, etc. As a result, the read/write
heads are moved one cylinder at a time, and this movement is
necessitated only once for every s tracks.
Its main advantages are:
o It is easy to implement;
o It provides fast access to the next record using
lexicographic order.
Its disadvantages:
o It is difficult to update - inserting a new record may
require moving a large proportion of the file;
o Random access is extremely slow.
Random Organization-
Records are stored at random locations on the disk. This
randomization could be achieved by any of several
techniques: direct addressing, directory lookup, hashing.
Direct addressing: in direct addressing with equi-size records,
available disk space is divided out into nodes large enough to
hold a record. Numeric value of primary key is used to
determine the node into which a particular record is to be
stored.
Directory lookup: the index is not direct access type but is a
dense index maintained using a structure suitable for index
operations. Retrieving a record involves searching the index for
the record address and then accessing the record itself. The
storage management scheme will depend on whether fixed size
or variable size nodes are being used. It requires more accesses
for retrieval and update, since index searching will generally
require more than one access. In both direct addressing and
directory lookup, some provision must be made to handle
collisions.
Hashing: the available file space is divided into buckets and
slots. Some space may have to be set aside for an overflow area
in case chaining is being used to handle overflows. When
variable size records are present, the number of slots per bucket
will be only a rough indicator of the number of records a bucket
can hold. The actual number will vary dynamically with the size
of the records in a particular bucket.
Random organization on the primary key using any of the above
three techniques overcomes the difficulties of sequential
organization. Insertions and deletions become easy. But batch
processing of queries becomes inefficient, as records are not
maintained in order of primary key. Handling range queries
becomes very inefficient except in the case of directory lookup.
Linked Organization-
Linked organizations differ from sequential organizations
essentially in that the logical sequence of records is generally
different from the physical sequence.
In sequential organization, if the ith record is placed at
location li, then the (i+1)st record is placed at li + c, where c
is the length of the ith record or some fixed constant.
In linked organization the next logical record is obtained by
following the link value from the present record. Linking in
order of increasing primary key eases insertion and deletion.
Searching for a particular record is difficult since no index is
available, so only sequential search is possible.
We can facilitate searching by maintaining indexes
corresponding to ranges of employee numbers, e.g. 501-700,
701-900. All records within the same range will be linked
together in a list.
We can generalize this idea for secondary key level also. We
just set up indexes for each key and allow records to be in
more than one list. This leads to the multilist structure for
file representation.
Inverted File-
Inverted files are similar to multilists. In multilists, records
with the same key value are linked together, with the link
information being kept in the individual records. In the case of
inverted files the link information is kept in the index itself.
We assume that every key is dense. Since the index
entries are variable length, index maintenance becomes more
complex than for multilists. The benefit is that Boolean queries
require only one access per record satisfying the query.
Queries of the type k1=xx and k2=yy can be handled by
intersecting the two index lists.
The retrieval works in two steps. In the first step, the indexes
are processed to obtain a list of records satisfying the query
and in the second, these records are retrieved using the list.
The no. of disk accesses needed is equal to the no. of records
being retrieved + the no. to process the indexes.
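The first step, processing the indexes for a query of the type k1=xx and k2=yy, can be sketched as an intersection of two lists of record addresses. The sketch assumes that each index entry keeps its record addresses as a sorted list of integers, which these notes do not state; the function name intersect is illustrative.

```cpp
#include <cstddef>
#include <vector>

// Returns the record addresses that occur in both sorted lists.
// Only the index lists are read here; no record is accessed.
std::vector<int> intersect(const std::vector<int>& first,
                           const std::vector<int>& second)
{
    std::vector<int> both;
    std::size_t i = 0;
    std::size_t j = 0;
    while (i < first.size() && j < second.size()) {
        if (first[i] < second[j]) {
            ++i;                           // address only in first
        } else if (second[j] < first[i]) {
            ++j;                           // address only in second
        } else {
            both.push_back(first[i]);      // address in both lists
            ++i;
            ++j;
        }
    }
    return both;
}
```

In the second step only the records in the returned list are fetched, one access per record.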
Inverted files represent one extreme of file organization in
which only the index structures are important. The records
themselves can be stored in any way.
Inverted files may also result in space saving compared with
other file structures when record retrieval doesn’t require
retrieval of key fields. In this case key fields may be deleted
from the records unlike multilist structures.
Indexing Techniques-
We know that data is stored in the form of records. Every
record has a key field, which helps it to be recognized
uniquely.
Indexing is a data structure technique to efficiently retrieve
records from the database files based on some attributes on
which the indexing has been done. Indexing in database
systems is similar to what we see in books.
Indexing is defined based on its indexing attributes. Indexing
can be of the following types −
Primary Index − Primary index is defined on an ordered data
file. The data file is ordered on a key field. The key field is
generally the primary key of the relation.
Secondary Index − Secondary index may be generated from a
field which is a candidate key and has a unique value in every
record, or a non-key with duplicate values.
Clustering Index − Clustering index is defined on an ordered
data file. The data file is ordered on a non-key field.