Data Storage and Access Methods
Min Song IS698
Database Design Process
[Figure: the database design process. Labels: conceptual requirements for Applications 1-4, Conceptual Model, Logical Model, External Models for Applications 1-4, Internal Model, Physical Design.]
Physical Database Design
Many physical database design decisions are implicit in the technology adopted.
Organizations may also have standards or an information architecture that specifies operating systems, DBMS, and data access languages, which constrains the range of possible physical implementations.
We will be concerned with some of the possible physical implementation issues.
Physical Database Design
The primary goal of physical database design is data processing efficiency.
We will concentrate on choices often available to optimize performance of database services.
Physical database design requires information gathered during earlier stages of the design process.
Physical Design Information
Information needed for physical file and database design includes:
Normalized relations, plus size estimates for them
Definitions of each attribute
Descriptions of where and when data are used: entered, retrieved, deleted, updated, and how often
Expectations and requirements for response time, data security, backup, recovery, retention, and integrity
Descriptions of the technologies used to implement the database
Physical Design Decisions
There are several critical decisions that will affect the integrity and performance of the system:
Storage format
Physical record composition
Data arrangement
Indexes
Query optimization and performance tuning
Storage Format
Choose the storage format of each field (attribute).
The DBMS provides a set of data types that can be used for the physical storage of fields in the database.
The data type (format) is chosen to minimize storage space and maximize data integrity.
Objectives of data type selection
Minimize storage space
Represent all possible values
Improve data integrity
Support all data manipulations
The correct data type should, in minimal space, represent every possible value (but exclude illegal values) for the associated attribute and support the required data manipulations (e.g., numerical or string operations).
Access Data Types
Numeric (1, 2, 4, or 8 bytes; fixed or float)
Text (255 characters max)
Memo (64,000 characters max)
Date/Time (8 bytes)
Currency (8 bytes; 15 digits plus 4 decimal digits)
AutoNumber (4 bytes)
Yes/No (1 bit)
OLE (limited only by disk space)
Hyperlink (up to 64,000 characters)
Access Numeric types
Setting | Description | Decimal precision | Storage size
Byte | Stores numbers from 0 to 255 (no fractions). | None | 1 byte
Integer | Stores numbers from -32,768 to 32,767 (no fractions). | None | 2 bytes
Long Integer (default) | Stores numbers from -2,147,483,648 to 2,147,483,647 (no fractions). | None | 4 bytes
Single | Stores numbers from -3.402823E38 to -1.401298E-45 for negative values and from 1.401298E-45 to 3.402823E38 for positive values. | 7 | 4 bytes
Double | Stores numbers from -1.79769313486231E308 to -4.94065645841247E-324 for negative values and from 4.94065645841247E-324 to 1.79769313486231E308 for positive values. | 15 | 8 bytes
Replication ID | Globally unique identifier (GUID). | N/A | 16 bytes
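The integer ranges follow directly from the storage sizes. A small sketch, assuming Byte is an unsigned 8-bit value and Integer and Long Integer are signed two's-complement values; the helper names are made up for illustration:

```python
def unsigned_range(n_bytes):
    """Smallest and largest value of an unsigned integer of n_bytes bytes."""
    return 0, 2 ** (8 * n_bytes) - 1


def signed_range(n_bytes):
    """Smallest and largest value of a two's-complement integer of n_bytes bytes."""
    bits = 8 * n_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


print("Byte:        ", unsigned_range(1))  # (0, 255)
print("Integer:     ", signed_range(2))    # (-32768, 32767)
print("Long Integer:", signed_range(4))    # (-2147483648, 2147483647)
```

Choosing the smallest size whose range still covers every legal value is the "minimal space, all possible values" objective in practice.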
Designing Physical Records
A physical record is a group of fields stored in adjacent memory locations and retrieved together as a unit.
Fields may be fixed length or variable length.
Data Storage
Storing data: disks
Buffer manager
Representing relational data on disk
The Memory Hierarchy
[Figure: the memory hierarchy, from fastest to slowest.]
Processor cache: access time 10 nanoseconds; 512K
Main memory (acts as a disk cache): volatile; 256M-1G; access time 10-100 nanoseconds
Disk: persistent; 10-100 GB storage; transfer rate 5-10 MB/s; access time 10-15 msecs
Tape: 1.5 MB/s transfer rate; 280 GB typical capacity; only sequential access; not for operational data
Main Memory
Fastest, most expensive (excluding cache)
Today: 512 MB is common even on PCs
Many databases could fit in memory
New industry trend: main memory databases, e.g., TimesTen
Main issue is volatility
Secondary Storage
Disks
Slower, cheaper than main memory
Persistent!
The unit of disk I/O is the block
Typically 1 block = 4 KB
A disk block is also called a disk page, or simply a page
Used with a main memory buffer
Block
Blocking factor (bfr) for a file is the average number of records stored in a disk block. Suppose the block size of a database system is 2000 bytes. Customer table has an average record length of 190 bytes. Assume the overhead of a block for the data is 100 bytes.
What is the blocking factor?
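One way to work the exercise, assuming records are not spanned across blocks (so only whole records count) and that the 100-byte overhead is subtracted from the block before records are packed. The function name is made up for illustration:

```python
def blocking_factor(block_size, record_length, overhead=0):
    """Whole records that fit in one block after subtracting block overhead."""
    usable = block_size - overhead
    return usable // record_length


# Customer table from the exercise: 2000-byte blocks, 190-byte records,
# 100 bytes of per-block overhead.
bfr = blocking_factor(block_size=2000, record_length=190, overhead=100)
print(bfr)  # 10
```

Here the 1900 usable bytes hold exactly ten 190-byte records, so no space in the block is wasted.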
The Mechanics of Disk
Mechanical characteristics:
Rotation speed (5400 RPM)
Number of platters (1-30)
Number of tracks (<= 10000)
Number of sectors (256 per track)
Number of bytes per sector (2^9 = 512)
Block size (2^12 = 4096)
[Figure: disk anatomy, labeling the disk head, spindle, platters, tracks, a sector, a cylinder, the arm assembly, and arm movement.]
Important Disk Access Characteristics
Block access time = disk latency + transfer time
Disk latency = seek time + rotational latency
Seek time = time for the head to reach the right track, 10 ms to 40 ms
Rotational latency = rotation time to get to the right sector
Time for one rotation = 10 ms
Average rotational latency = 10 ms / 2 = 5 ms
Transfer rate = typically 5-10 MB/s
Disks read/write one block at a time (typically 4 kB)
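The formulas above can be combined into a rough estimate. This is a sketch using the slide's figures, with the added assumptions that 1 MB = 1,000,000 bytes and that a block is 4096 bytes; the function name is made up for illustration:

```python
def block_access_time_ms(seek_ms, rotation_ms, block_bytes, transfer_mb_per_s):
    """Block access time = seek time + average rotational latency + transfer time."""
    rotational_latency_ms = rotation_ms / 2  # on average, half a rotation
    transfer_ms = block_bytes / (transfer_mb_per_s * 1_000_000) * 1000
    return seek_ms + rotational_latency_ms + transfer_ms


# Best case on the slide: 10 ms seek, 10 MB/s transfer.
best = block_access_time_ms(10, 10, 4096, 10)
# Worst case on the slide: 40 ms seek, 5 MB/s transfer.
worst = block_access_time_ms(40, 10, 4096, 5)
print(f"{best:.2f} ms to {worst:.2f} ms")  # 15.41 ms to 45.82 ms
```

The transfer itself is well under a millisecond; seek time and rotational latency dominate, which is why the number of block accesses is the usual cost measure.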
Representing Data Elements
Relational database elements:
CREATE TABLE Product (
    pid INT PRIMARY KEY,
    name CHAR(20),
    description VARCHAR(200),
    maker CHAR(10) REFERENCES Company(name)
)
A tuple is represented as a record
Record Formats: Fixed Length
[Figure: a fixed-length record with fields F1-F4 of lengths L1-L4, starting at base address B. The third field starts at address B + L1 + L2.]
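Information about field types is the same for all records in a file and is stored in the system catalogs.
Finding the i-th field does not require a scan of the record: its address is the base address plus the lengths of the preceding fields.
Note the importance of schema information!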
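The address rule in the figure (B + L1 + L2 for the third field) generalizes to a sum over the preceding field lengths. A minimal sketch; the field lengths and base address below are made-up values, not taken from the slides:

```python
def field_address(base, field_lengths, i):
    """Address of the i-th field (0-based) of a fixed-length record stored at base."""
    return base + sum(field_lengths[:i])


lengths = [4, 20, 10, 8]  # hypothetical L1..L4, as the catalog would record them
base = 1000               # hypothetical base address B

print(field_address(base, lengths, 2))                  # 1024, i.e. B + L1 + L2
print([field_address(0, lengths, i) for i in range(4)])  # [0, 4, 24, 34]
```

Only the catalog's list of lengths is consulted; the record's bytes are never inspected.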
Record Header
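[Figure: a record that begins with a header, followed by fields F1-F4 of lengths L1-L4. The header holds a pointer to the schema, the record length, and a timestamp.]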
The header is needed because:
The schema may change, and for a while new and old records may coexist
Records from different relations may coexist
Variable Length Records
[Figure: a variable-length record. The header holds the record length and other header information, followed by fields F1-F4 of lengths L1-L4.]
Place the fixed-length fields first: F1, F2
Then the variable-length fields: F3, F4
Null values take 2 bytes only
Sometimes they take 0 bytes (when at the end)
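One possible byte layout that follows this ordering is sketched below. The layout itself is an assumption made for illustration: a header of three 2-byte integers (record length, end offset of F3, end offset of F4), then F1 as a 4-byte integer and F2 as a 10-byte string, then the variable-length bytes of F3 and F4. Real systems differ in the details.

```python
import struct

HEADER = struct.Struct("<HHH")  # record length, end of F3, end of F4
FIXED = struct.Struct("<i10s")  # F1: 4-byte integer, F2: 10-byte string


def pack_record(f1, f2, f3, f4):
    """Header, then fixed-length fields F1 and F2, then variable-length F3 and F4."""
    var3, var4 = f3.encode(), f4.encode()
    end3 = HEADER.size + FIXED.size + len(var3)
    end4 = end3 + len(var4)
    header = HEADER.pack(end4, end3, end4)
    return header + FIXED.pack(f1, f2.encode()) + var3 + var4


def unpack_record(rec):
    """Recover the four fields using only the offsets stored in the header."""
    _length, end3, end4 = HEADER.unpack_from(rec, 0)
    f1, f2 = FIXED.unpack_from(rec, HEADER.size)
    start3 = HEADER.size + FIXED.size
    return (f1, f2.rstrip(b"\x00").decode(),
            rec[start3:end3].decode(), rec[end3:end4].decode())


rec = pack_record(7, "Acme", "Widget", "A small widget")
print(len(rec), unpack_record(rec))
```

In this layout an empty variable-length field costs only its 2-byte offset entry in the header, which is in the spirit of the remark that null values take 2 bytes.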
Records With Referencing Fields
[Figure: a record with referencing fields. The header holds the record length and other header information, followed by fields F1-F3 of lengths L1-L3.]
E.g. to represent one-many or many-many relationships
Storing Records in Blocks
Blocks have fixed size (typically 4k)
[Figure: a block holding records R1, R2, R3, and R4.]
Spanning Records Across Blocks
[Figure: two blocks, each with a block header. The first holds R1 and the first part of R2; the second holds the rest of R2 and R3.]
Used when records are very large
Also useful for medium-size records: saves space in blocks
BLOB
Binary large objects
Supported by modern database systems
E.g., images, sounds, etc.
Storage: attempt to cluster blocks together
Modifications: Insertion
File is unsorted
add it to the end
File is sorted:
Is there space in the right block ?
Yes: we are lucky, store it there
Is there space in a neighboring block ?
Look 1-2 blocks to the left/right, shift records
If all else fails, create an overflow block
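The decision sequence for a sorted file can be sketched as follows. This is a simplified model, not a real storage engine: blocks are Python lists of keys, the capacity of 3 records per block is made up, only the immediate left and right neighbors are examined, and overflow blocks are kept in a dictionary keyed by block position.

```python
import bisect

CAPACITY = 3  # records per block (hypothetical)


def insert_sorted(blocks, overflow, key):
    """Insert key into a sorted file of blocks; return what had to be done."""
    # The "right block" is the first one whose last key is >= key,
    # or the last block if key is larger than everything stored.
    i = next((n for n, b in enumerate(blocks) if b and b[-1] >= key),
             len(blocks) - 1)
    if len(blocks[i]) < CAPACITY:
        bisect.insort(blocks[i], key)  # space in the right block
        return "in place"
    if i + 1 < len(blocks) and len(blocks[i + 1]) < CAPACITY:
        bisect.insort(blocks[i], key)  # shift the largest key to the right
        blocks[i + 1].insert(0, blocks[i].pop())
        return "shifted right"
    if i > 0 and len(blocks[i - 1]) < CAPACITY:
        bisect.insort(blocks[i], key)  # shift the smallest key to the left
        blocks[i - 1].append(blocks[i].pop(0))
        return "shifted left"
    overflow.setdefault(i, []).append(key)  # nothing else worked
    return "overflow"


blocks = [[10, 20, 30], [40, 50], [60, 70, 80]]
overflow = {}
print(insert_sorted(blocks, overflow, 25), blocks)    # shifted right
print(insert_sorted(blocks, overflow, 65), overflow)  # overflow
```

After the first insert every block is full, so the second insert has no neighbor to shift into and lands in an overflow block.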
Overflow Blocks
[Figure: blocks n-1, n, and n+1 of the file, with an overflow block.]
After a while the file starts being dominated by overflow blocks: time to reorganize
Modifications: Deletions
Free space in the block, shift records
May be able to eliminate an overflow block
Modifications: Updates
If the new record is shorter than the previous one: easy
If it is longer: need to shift records or create overflow blocks
Physical Addresses
Each block and each record have a physical address that consists of:
The host
The disk
The cylinder number
The track number
The block within the track
For records: an offset in the block (sometimes this is in the block's header)
Logical Addresses
Logical address: a string of bytes (10-16)
More flexible: can move blocks/records around
But needs a translation table:
Logical address | Physical address
L1 | P1
L2 | P2
L3 | P3
Main Memory Address
When a block is read into main memory, it receives a main memory address.
The buffer manager has another translation table:
Memory address | Logical address
M1 | L1
M2 | L2
M3 | L3
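The two translation tables can be modeled as dictionaries. This is an illustrative sketch only: the string addresses mirror the placeholders on the slides, and the function names are made up. The point is that moving a block changes one table entry, while every reference that uses the logical address stays valid.

```python
# Logical -> physical translation table (placeholder addresses from the slides).
logical_to_physical = {"L1": "P1", "L2": "P2", "L3": "P3"}
# Buffer manager's table, filled in as blocks are read into main memory.
logical_to_memory = {}


def resolve(logical):
    """Prefer the in-memory copy; otherwise fall back to the physical address."""
    if logical in logical_to_memory:
        return ("memory", logical_to_memory[logical])
    return ("disk", logical_to_physical[logical])


def move_block(logical, new_physical):
    """Relocate a block on disk; holders of the logical address are unaffected."""
    logical_to_physical[logical] = new_physical


print(resolve("L1"))            # ('disk', 'P1')
logical_to_memory["L1"] = "M1"  # the block has been read into the buffer
print(resolve("L1"))            # ('memory', 'M1')
```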
Designing Physical/Internal Model
Overview
Terminology
Access methods
Physical Design
Internal Model/Physical Model
[Figure: a user request enters the DBMS through Interface 1 (external model). The DBMS internal model access methods connect through Interface 2 to the operating system access methods, which connect through Interface 3 to the database.]
Physical Design
Interface 1: User request to the DBMS. The user presents a query, and the DBMS determines which physical databases are needed to resolve the query.
Interface 2: The DBMS uses an internal model access method to access the data stored in a logical database.
Interface 3: The internal model access methods and OS access methods access the physical records of the database.
Physical File Design
Physical file: a portion of secondary storage (disk space) allocated for the purpose of storing physical records
Pointer: a field of data that can be used to locate a related field or record of data
Access method: an operating system algorithm for storing and locating data in secondary storage
Page: the amount of data read or written in one disk input or output operation
Internal Model Access Methods
Many types of access methods:
Physical Sequential
Indexed Sequential
Indexed Random
Inverted
Direct
Hashed
They differ in:
Access efficiency
Storage efficiency
Physical Sequential
Key values of the physical records are in logical sequence
Main use is for dump and restore
Access method may be used for storage as well as retrieval
Storage efficiency is near 100%
Access efficiency is poor (unless physical records are of fixed size)
Indexed Sequential
Key values of the physical records are in logical sequence
Access method may be used for storage and retrieval
An index of key values is maintained, with entries for the highest key value per block
Access efficiency depends on the levels of index, storage allocated for index, number of database records, and amount of overflow
Storage efficiency depends on size of index and volatility of database
Index Sequential
Index (actual value | address block number):
Dumpling | 1
Harty | 2
Texaci | 3
Data file:
Block 1: Adams, Becker, Dumpling
Block 2: Getta, Harty
Block 3: Mobile, Sunoci, Texaci
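A lookup against this example can be sketched with a binary search over the index, which holds the highest key of each block. The data below is taken from the figure; the function name and the use of Python lists to stand in for blocks are illustrative assumptions.

```python
import bisect

# Index: highest key value in each block, in key order.
index_keys = ["Dumpling", "Harty", "Texaci"]
index_blocks = [1, 2, 3]
data_file = {
    1: ["Adams", "Becker", "Dumpling"],
    2: ["Getta", "Harty"],
    3: ["Mobile", "Sunoci", "Texaci"],
}


def lookup(key):
    """Return the block number holding key, or None if it is absent."""
    pos = bisect.bisect_left(index_keys, key)  # first index entry >= key
    if pos == len(index_keys):
        return None  # key is beyond the highest key in the file
    block = index_blocks[pos]
    return block if key in data_file[block] else None  # one block read


print(lookup("Getta"))   # 2
print(lookup("Carter"))  # None
```

Only one data block is read per lookup, whether or not the key is present.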
Indexed Sequential: Two Levels
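Top-level index (key value | address of second-level index block):
385 | 7
678 | 8
805 | 9
Second-level index (key value | address of data block):
Block 7: 150 | 1, 385 | 2
Block 8: 536 | 3, 678 | 4
Block 9: 785 | 5, 805 | 6
Data blocks:
Block 1: 001, 003, ..., 150
Block 2: 251, ..., 385
Block 3: 455, 480, ..., 536
Block 4: 605, 610, ..., 678
Block 5: 705, 710, ..., 785
Block 6: 791, ..., 805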
Indexed Random
Key values of the physical records are not necessarily in logical sequence
The index may be stored and accessed with the Indexed Sequential access method
The index has an entry for every database record, in ascending order, so the index keys are in logical sequence
Database records are not necessarily in ascending sequence
Access method may be used for storage and retrieval
Indexed Random
Index (actual value | address block number):
Adams | 2
Becker | 1
Dumpling | 3
Getta | 2
Harty | 1
Data file:
Block 1: Becker, Harty
Block 2: Adams, Getta
Block 3: Dumpling
B-tree
[Figure: a B-tree index.]
Root node: F | P | Z
Second-level nodes: B | D | F, H | L | P, R | S | Z
Leaf records: Aces, Boilers, Cars, Devils, Flyers, Hawkeyes, Hoosiers, Minors, Panthers, Seminoles
Inverted
Key values of the physical records are not necessarily in logical sequence
Access method is better used for retrieval
An index may be built for every field to be inverted
Access efficiency depends on number of database records, levels of index, and storage allocated for index
Inverted
Inverted index on course number (actual value | record addresses):
CH 145 | 101, 103, 104
CS 201 | 102
CS 623 | 105, 106
PH 345 | (addresses not shown)
Data file, blocks 1-3 (student name | course number):
Adams | CH 145
Becker | CS 201
Dumpling | CH 145
Getta | CH 145
Harty | CS 623
Mobile | CS 623
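Building such an index amounts to grouping record addresses by the value of the inverted field. A minimal sketch: the pairing of addresses 101-106 with individual students is inferred from the index entries in the figure, and the function name is made up for illustration.

```python
from collections import defaultdict

# address -> (student name, course number); address assignment is inferred.
records = {
    101: ("Adams", "CH 145"),
    102: ("Becker", "CS 201"),
    103: ("Dumpling", "CH 145"),
    104: ("Getta", "CH 145"),
    105: ("Harty", "CS 623"),
    106: ("Mobile", "CS 623"),
}


def build_inverted_index(records, field_pos):
    """Map each value of the chosen field to the addresses of records holding it."""
    index = defaultdict(list)
    for address in sorted(records):
        index[records[address][field_pos]].append(address)
    return dict(index)


course_index = build_inverted_index(records, field_pos=1)
print(course_index["CH 145"])  # [101, 103, 104]
```

Retrieval by course is then a single index probe, while every insert or update must also maintain the index, which is why the method suits retrieval better than storage.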
Direct
Key values of the physical records are not necessarily in logical sequence
There is a one-to-one correspondence between a record key and the physical address of the record
May be used for storage and retrieval
Access efficiency is always 1
Storage efficiency depends on density of keys
No duplicate keys permitted
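A small sketch of why access efficiency is always 1 while storage efficiency depends on key density. It assumes integer keys, one fixed-size slot per possible key value, and a made-up slot size of 100 bytes; the function names are illustrative.

```python
RECORD_SIZE = 100  # bytes per record slot (hypothetical)
BASE = 0           # address of the slot for the smallest key (hypothetical)


def direct_address(key, min_key=1):
    """One-to-one mapping from key to address: a single access finds the record."""
    return BASE + (key - min_key) * RECORD_SIZE


def storage_efficiency(keys):
    """Fraction of reserved slots actually used, given one slot per possible key."""
    return len(set(keys)) / (max(keys) - min(keys) + 1)


print(direct_address(3))                    # 200
print(storage_efficiency([1, 2, 3, 4, 5]))  # 1.0
print(storage_efficiency([1, 50, 100]))     # 0.03
```

Dense keys fill every slot; sparse keys leave most of the reserved space empty.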
Hashing
Key values of the physical records are not necessarily in logical sequence
Many key values may share the same physical address (block)
May be used for storage and retrieval
Access efficiency depends on distribution of keys, algorithm for key transformation, and space allocated
Storage efficiency depends on distribution of keys and algorithm used for key transformation
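A sketch of a key transformation, assuming integer keys and the simple division-remainder method (key mod number of blocks). The block count and the keys are made up, and were chosen to show how an uneven key distribution piles records into a few blocks.

```python
NUM_BLOCKS = 5  # space allocated (hypothetical)


def block_for(key):
    """Key transformation: division-remainder hashing."""
    return key % NUM_BLOCKS


def load(keys):
    """Place each key in the block its hash selects."""
    blocks = {b: [] for b in range(NUM_BLOCKS)}
    for key in keys:
        blocks[block_for(key)].append(key)
    return blocks


print(load([12, 27, 35, 42, 50]))
# {0: [35, 50], 1: [], 2: [12, 27, 42], 3: [], 4: []}
```

Three of the five blocks stay empty while block 2 holds three records, so both access and storage efficiency suffer with this combination of keys and hash function.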
Comparative Access Methods
Factor | Sequential | Indexed | Hashed
Storage space | No wasted space | No wasted space for data, but extra space for index | More space needed for addition and deletion of records after initial load
Sequential retrieval on primary key | Very fast | Moderately fast | Impractical
Random retrieval | Impractical | Moderately fast | Very fast
Multiple-key retrieval | Possible but needs a full scan | Very fast with multiple indexes | Not possible
Deleting records | Can create wasted space | OK if dynamic | Very easy
Adding records | Requires rewriting file | OK if dynamic | Very easy
Updating records | Usually requires rewriting file | Easy but requires maintenance of indexes | Very easy