1 Introduction to File Structures
Introduction to File Organization
Data processing from a computer science perspective:
– Storage of data
– Organization of data
– Access to data
This course builds on your knowledge of Data Structures.
CIS 256 (File Structures)
Data Structures vs. File Structures
Both involve:
Representation of data + operations for accessing data
Difference:
– Data Structures deal with data in main memory
– File Structures deal with data on secondary storage
devices (files).
Computer Architecture
– CPU: registers and cache
– Main memory (semiconductor RAM): fast, small, expensive, volatile
– Secondary storage (disk, tape, DVD-R): slow, large, cheap, stable
Memory Hierarchy
[Figure: memory hierarchy]
On systems with 32-bit addressing, only 2^32 bytes (4 GiB) can be
directly referenced in main memory.
The number of data objects may exceed this number!
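As a quick check of the arithmetic above, the following Python sketch computes the size of a 32-bit address space and compares it with a data set. The record size and record count are hypothetical values chosen only for illustration.

```python
# Size of the address space reachable with 32-bit addresses.
ADDRESS_BITS = 32
addressable_bytes = 2 ** ADDRESS_BITS           # 4,294,967,296 bytes
addressable_gib = addressable_bytes / 2 ** 30   # 4.0 GiB

# Hypothetical data set: 100 million records of 100 bytes each.
record_size = 100
record_count = 100_000_000
data_bytes = record_size * record_count         # 10,000,000,000 bytes

# The data set is larger than what can be addressed directly.
fits_in_address_space = data_bytes <= addressable_bytes
print(addressable_bytes, addressable_gib, fits_in_address_space)
```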
Data must be maintained across program executions. This
requires storage devices that retain information when the
computer is restarted.
– We call such storage nonvolatile.
– Primary storage is usually volatile, whereas secondary and
tertiary storage are nonvolatile.
Definition
File Structures is the organization of data on secondary
storage devices in a way that minimizes both access
time and storage space.
A File Structure is a combination of representations for
data in files and of operations for accessing the data.
A File Structure allows applications to read, write and
modify data. It might also support finding the data that
matches some search criteria or reading through the data
in some particular order.
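To make the definition concrete, here is a minimal Python sketch of one possible file structure: fixed-length records plus operations to write, read, modify and search them. The record layout (a 4-byte integer key followed by a 20-byte name field) and all function names are assumptions made for this example, and an in-memory byte stream stands in for a file on disk.

```python
import io
import struct

# Hypothetical record layout: 4-byte integer key + 20-byte name field.
RECORD = struct.Struct("<i20s")


def write_record(f, slot, key, name):
    """Write (or overwrite) the record stored in the given slot."""
    f.seek(slot * RECORD.size)
    f.write(RECORD.pack(key, name.encode("utf-8")))


def read_record(f, slot):
    """Read the record stored in the given slot."""
    f.seek(slot * RECORD.size)
    key, raw = RECORD.unpack(f.read(RECORD.size))
    return key, raw.rstrip(b"\x00").decode("utf-8")


def find_by_key(f, key):
    """Sequentially scan the file for the first record with this key."""
    f.seek(0)
    while True:
        chunk = f.read(RECORD.size)
        if len(chunk) < RECORD.size:
            return None  # reached end of file without a match
        k, raw = RECORD.unpack(chunk)
        if k == key:
            return k, raw.rstrip(b"\x00").decode("utf-8")


# An in-memory byte stream stands in for a file on disk.
f = io.BytesIO()
write_record(f, 0, 17, "Ada")
write_record(f, 1, 42, "Grace")
write_record(f, 1, 42, "Grace Hopper")  # modify slot 1 in place
```

Because every record has the same size, the position of slot n is simply n times the record size, so reading or modifying a known slot needs a single seek. Searching by key, in contrast, still scans the file sequentially in this sketch.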
Why Study File Structure Design?
I. Data Storage
Computer data can be stored in three kinds of locations:
– Primary Storage ==> Memory [computer memory]
– Secondary Storage [online disk/tape/CD-ROM that can be
accessed by the computer] (our focus)
– Tertiary Storage ==> Archival Data [offline disk/tape/CD-ROM
not directly available to the computer]
II. Memory versus Secondary Storage
Secondary storage such as disks can pack thousands of
megabytes into a small physical space.
Computer Memory (RAM) is limited.
However, relative to memory, access to secondary
storage is extremely slow [e.g., getting information
from slow RAM takes 120 × 10^-9 seconds (= 120
nanoseconds), while getting information from disk
takes 30 × 10^-3 seconds (= 30 milliseconds)].
III. How Can Secondary Storage Access Time be Improved?
By improving the File Structure.
Since the details of the representation of the data and the
implementation of the operations determine the efficiency
of the file structure for particular applications, improving
these details can help improve secondary storage access
time.
Overview of File Structure Design
I. General Goals
Get the information we need with one access to the disk.
If that’s not possible, then get the information with as few
accesses as possible.
Group information so that we are likely to get everything we need
with only one trip to the disk.
II. Fixed versus Dynamic Files
It is relatively easy to come up with file structure designs
that meet the general goals when the files never change.
When files grow or shrink as information is added
and deleted, it is much more difficult.
II. The Emergence of Disks and Indexes
As files grew very large, unaided sequential access was not a good
solution.
Disks allowed for direct access.
Indexes made it possible to keep a list of keys and pointers in a
small file that could be searched very quickly.
With the key and pointer, the user had direct access to the large,
primary file.
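The idea of keys and pointers can be sketched in a few lines of Python. In this illustration the "pointer" is the byte offset of a record in the primary file, the index is an in-memory dictionary, and an in-memory byte stream stands in for the large file on disk. The keys, record contents and the `fetch` helper are all made up for the example.

```python
import io

# Records for a small "primary file" (keys and contents are made up).
records = [("K003", "Carol,Lisbon"), ("K001", "Alice,Cairo"), ("K002", "Bob,Oslo")]

primary = io.BytesIO()  # stands in for a large file on disk
index = {}              # key -> byte offset of the record in the primary file
for key, payload in records:
    index[key] = primary.tell()
    primary.write(f"{key}|{payload}\n".encode("utf-8"))


def fetch(key):
    """Use the index to jump directly to a record instead of scanning."""
    offset = index.get(key)
    if offset is None:
        return None
    primary.seek(offset)
    return primary.readline().decode("utf-8").rstrip("\n")


print(fetch("K002"))
```

The search happens in the small index; the primary file is touched only once, at the offset the index supplies.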
How Fast?
Typical times for getting info:
– Main memory: ~120 nanoseconds = 120 × 10^-9 seconds
– Magnetic disks: ~30 milliseconds = 30 × 10^6 nanoseconds
An analogy keeping the same time proportion as above:
– Looking at the index of a book: 20 seconds
versus
– Going to the library: 58 days
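The analogy can be checked with a short calculation. The sketch below uses only the figures quoted on this slide (120 nanoseconds, 30 milliseconds, 20 seconds); the variable names are illustrative.

```python
# Access times quoted above, in seconds.
ram_access = 120e-9   # ~120 nanoseconds
disk_access = 30e-3   # ~30 milliseconds

# How many times slower a disk access is than a memory access.
ratio = disk_access / ram_access

# Scale the 20-second book lookup by the same ratio, then convert to days.
book_lookup_seconds = 20
library_trip_days = book_lookup_seconds * ratio / (24 * 60 * 60)

print(f"ratio: {ratio:,.0f}")
print(f"library trip: {library_trip_days:.1f} days")
```

The ratio comes out at about 250,000, and 20 seconds scaled by that factor is roughly 58 days, which matches the figure in the analogy.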
Comparison
Main Memory
– Fast (since electronic)
– Small (since expensive)
– Volatile (information is lost when power failure occurs)
Secondary Storage
– Slow (since electronic and mechanical)
– Large (since cheap)
– Stable, persistent (information is preserved longer)
Goal of the Course
Minimize the number of trips to the disk needed to get the
desired information. Ideally, get what we need in one disk
access, or with as few disk accesses as possible.
Grouping related information so that we are likely to get
everything we need with only one trip to the disk (e.g.
name, address, phone number, account balance).
Locality of Reference in Time and Space
Good File Structure Design
Fast access to great capacity
Reduce the number of disk accesses
By collecting data into buffers, blocks or buckets
Manage growth by splitting these collections
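A rough Python sketch of why collecting records into blocks reduces disk accesses, and of what splitting a full block looks like. The record size, block size and function names are hypothetical, and the model assumes one disk access per block transferred.

```python
import math

# Hypothetical sizes chosen for illustration.
RECORD_SIZE = 128                              # bytes per record
BLOCK_SIZE = 4096                              # bytes transferred per disk access
RECORDS_PER_BLOCK = BLOCK_SIZE // RECORD_SIZE  # 32 records fit in one block


def accesses_unblocked(num_records):
    """One disk access per record."""
    return num_records


def accesses_blocked(num_records):
    """One disk access per block of records."""
    return math.ceil(num_records / RECORDS_PER_BLOCK)


def split_block(block):
    """Split a full block of sorted keys into two roughly half-full blocks."""
    mid = len(block) // 2
    return block[:mid], block[mid:]


print(accesses_unblocked(1000), accesses_blocked(1000))
print(split_block([3, 8, 15, 21, 40, 57]))
```

Under these assumptions, reading 1000 records takes 1000 accesses one record at a time but only 32 accesses when records are grouped into blocks. Splitting leaves free space in both halves, so later insertions do not force the whole file to be reorganized.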
History of File Structure Design
1. In the beginning… it was the tape
– Sequential access
– Access cost proportional to size of file
[Analogy to sequential access to array data structure]
2. Disks became more common
– Direct access
[Analogy to access to position in array]
– Indexes were invented
• list of keys and pointers stored in a small file
• allows direct access to a large primary file
Great if index fits into main memory.
As the file grows, the index grows with it, and we have the same
problem we had with a large primary file.
History of File Structure Design
3. Tree structures emerged for main memory (1960s)
– Binary search trees (BSTs)
– Balanced, self-adjusting BSTs: e.g. AVL trees (1962)
4. A tree structure suitable for files was invented:
B-trees (1979) and B+ trees,
good for accessing millions of records with 3 or 4 disk
accesses.
5. What about getting info with a single request?
– Hash tables (theory developed over the 60’s and 70’s, but still
a research topic):
good when files do not change too much over time.
– Expandable, dynamic hashing (late 70’s and 80’s):
one or two disk accesses even if the file grows dramatically.
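The "3 or 4 disk accesses" figure in item 4 follows from how quickly a wide tree grows. The sketch below is a simplified model, not a B-tree implementation: it assumes every node has the same number of children (the fan-out), that reading one node costs one disk access, and that a fan-out of 100 is plausible. All of these are assumptions made for illustration.

```python
def levels_needed(num_records, fanout):
    """Smallest number of levels h such that fanout**h >= num_records."""
    levels, capacity = 0, 1
    while capacity < num_records:
        capacity *= fanout
        levels += 1
    return levels


# Hypothetical fan-out of 100 children per node.
print(levels_needed(1_000_000, 100))    # one million records
print(levels_needed(100_000_000, 100))  # one hundred million records
```

With this fan-out, one million records fit in 3 levels and one hundred million in 4, so a search visits only 3 or 4 nodes.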
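Item 5 describes getting a record with a single request. A minimal sketch of that idea with static hashing follows: the key is hashed to a bucket number, and only that bucket is read. The bucket count, the use of CRC-32 as the hash function and the function names are assumptions for this example; in-memory lists stand in for buckets on disk, and the table does not grow, so this is not the expandable scheme mentioned above.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical number of fixed-size buckets in the file


def bucket_for(key):
    """Map a key to a bucket address with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


# Each list stands in for one bucket stored on disk.
buckets = [[] for _ in range(NUM_BUCKETS)]


def insert(key, value):
    """Store the pair in the bucket its key hashes to."""
    buckets[bucket_for(key)].append((key, value))


def lookup(key):
    """Read exactly one bucket (one simulated disk access) and search it."""
    for k, v in buckets[bucket_for(key)]:
        if k == key:
            return v
    return None


insert("smith", "account 1001")
insert("jones", "account 1002")
print(lookup("smith"))
```

With a fixed number of buckets, buckets fill up as the file grows, which is why this approach works best when the file does not change much over time.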