Computer Operating Systems Guide
BSEH201
Benny M Nyambo
(2014)
Table of Contents
COURSE GUIDE DESCRIPTION ............................................................................................ 11
INTRODUCTION ........................................................................................................................ 11
TEXT ARRANGEMENT GUIDE .............................................................................................. 12
Unit 1 Operating System: An Overview ................................................................................... 13
1.0 Introduction ................................................................................................................... 13
1.1 Learning Outcomes ....................................................................................................... 13
1.2 What is an Operating System ....................................................................................... 13
1.3 Goals of an Operating System ...................................................................................... 15
1.4 Generations of Operating Systems ............................................................................... 15
1.4.1. 0th Generations ........................................................................................................ 16
1.4.2. First Generations (1951 – 1956)............................................................................... 16
1.4.3. Second Generations (1956 – 1964) .......................................................................... 17
1.4.4. Third Generations (1964 – 1979) ............................................................................. 19
1.4.5. Fourth Generations (1979 - Present) ........................................................................ 20
1.5 Types of Operating Systems ......................................................................................... 21
1.5.1. Batch Processing Operating System ........................................................................ 21
1.5.2. Time Sharing ............................................................................................................ 22
1.5.3. Real Time Operating System (RTOS) ..................................................................... 22
1.5.4. Multiprogramming Operating System ..................................................................... 22
1.5.5. Multiprocessing System ........................................................................................... 22
1.5.6. Networking Operating System ................................................................................. 22
1.5.7. Distributed Operating System .................................................................................. 23
1.5.8. Operating Systems for Embedded Devices .............................................................. 23
1.6 Desirable Qualities of Operating System..................................................................... 24
1.7 Operating Systems: Some Examples............................................................................ 24
1.7.1. DOS .......................................................................................................................... 24
1.7.2. UNIX ........................................................................................................................ 25
1.7.3. Windows................................................................................................................... 25
1.7.4. Macintosh ................................................................................................................. 25
1.8 Functions of Operating System .................................................................................... 26
1.8.1. Process Management ................................................................................................ 26
1.8.2. Memory Management .............................................................................................. 27
1.8.3. Secondary Storage Management .............................................................................. 27
1.8.4. I/O Management ....................................................................................................... 28
1.8.5. File Management ...................................................................................................... 28
1.8.6. Protection ................................................................................................................. 28
1.8.7. Networking ............................................................................................................... 29
1.8.8. Command Interpretation .......................................................................................... 29
1.9 Summary ........................................................................................................................ 30
Unit 2 Processes ........................................................................................................................ 32
2.0. Introduction .................................................................................................................. 32
2.1. Learning Outcomes ...................................................................................................... 32
2.2. The Concept of Process ............................................................................................... 33
2.2.1. Implicit and Explicit Tasking ................................................................................... 34
2.2.2. Processes Relationship ............................................................................................. 35
2.2.3. Process States ........................................................................................................... 35
2.2.4. Implementation of Processes .................................................................................... 37
Context Switch ....................................................................................................................... 38
2.2.5. Process Hierarchy ..................................................................................................... 39
2.2.6. Threads ..................................................................................................................... 40
Why Use Threads? ................................................................................................................. 40
2.2.7. Level of Threads ....................................................................................................... 41
2.3. Systems Calls for Process Management ...................................................................... 41
2.4. Process Scheduling ........................................................................................................ 43
2.4.1. Scheduling Objectives .............................................................................................. 43
2.4.2. Types of Schedulers ................................................................................................. 44
2.4.3. Scheduling Criteria ................................................................................................... 45
2.5. Scheduling Algorithms .................................................................................................. 46
2.5.1. First-Come First-Serve (FCFS) ................................................................................ 47
2.5.2. Shortest-Job First (SJF) ............................................................................................ 49
2.5.3. Round Robin (RR) ................................................................................................... 50
2.5.4. Shortest Remaining Time Next (SRTN) .................................................................. 51
2.5.5. Priority Based Scheduling or Event Driven (ED) Scheduling ................................. 52
2.6. Performance Evaluation of the Scheduling Algorithms ............................................ 54
2.7. Summary ........................................................................................................................ 56
REFERENCES ......................................................................................................................... 57
Unit 3 Interprocess Communication and Synchronisation .................................................... 58
3.0 Introduction ................................................................................................................... 58
3.1 Learning Outcomes ....................................................................................................... 59
3.2 Interprocess Communication ....................................................................................... 59
3.2.1. Shared-Memory System ........................................................................................... 60
3.1.2 Message-Passing System............................................................................................... 60
3.2 Interprocess Synchronisation ............................................................................................ 63
3.2.1 Serialisation ................................................................................................................... 63
3.2.2 Mutexes: Mutual Exclusion .......................................................................................... 63
3.2.3 Critical Sections: The Mutex Solution .......................................................................... 64
3.2.4 Dekker's Solution for Mutual Exclusion ....................................................................... 65
3.2.5 Bakery’s Algorithm ....................................................................................................... 66
3.3 Semaphores ......................................................................................................................... 67
3.4 Classical Problems in Concurrent Programming ........................................................... 70
3.4.1 Producers/Consumers Problem ..................................................................................... 70
3.4.2 Readers and Writers Problem ........................................................................................ 71
3.4.3 Dining Philosophers Problem........................................................................................ 73
3.4.4 Sleeping Barber Problem .............................................................................................. 74
3.5 Locks ................................................................................................................................... 75
3.6 Monitors and Condition Variables .................................................................................. 76
3.7 Summary ............................................................................................................................ 79
Unit 4 Deadlocks ....................................................................................................................... 81
4.0 Introduction ................................................................................................................... 81
4.1 Learning Outcomes...................................................................................................... 82
4.2 Deadlocks ...................................................................................................................... 82
Example 1:.................................................................................................................................. 82
4.3 Characterisation of a Deadlock........................................................................................ 83
4.3.1. Mutual Exclusion Condition .................................................................................... 83
4.3.2. Hold and Wait Condition ......................................................................................... 83
4.3.3. No-Preemptive Condition ........................................................................................ 83
4.3.4. Circular Wait Condition ........................................................................................... 83
4.4 A Resource Allocation Graph ....................................................................................... 84
4.5 Dealing with Deadlock Situations ................................................................................ 86
4.5.1 Deadlock Prevention ................................................................................................ 86
4.5.2 Deadlock Avoidance ................................................................................................ 88
4.5.3 Deadlock Detection and Recovery ........................................................................... 91
4.6. Summary ........................................................................................................................ 92
Unit 5 Memory Management ................................................................................................ 94
6.0 Introduction ................................................................................................................... 94
6.1 Learning Outcomes.......................................................................................................... 95
6.2 Overlays and Swapping ................................................................................................... 95
6.3 Logical and Physical Address Space ............................................................................... 98
6.4 Single Process Monitor (Monoprogramming) ................................................................. 98
6.5 Contiguous Allocation Methods ...................................................................................... 99
5.5.1 Single-partition System ............................................................................................... 100
5.5.2 Multiple–partition System: Fixed-sized Partition ....................................................... 100
6.6 Paging ............................................................................................................................... 104
5.6.1 Principles of Operation........................................................................................... 104
5.6.2 Page Allocation ........................................................................................................... 104
5.6.3 Hardware Support for Paging ...................................................................................... 105
5.6.4 Protection and Sharing ................................................................................................ 107
6.7 Segmentation ................................................................................................................ 108
5.7.1 Principles of Operation................................................................................................ 108
5.7.2 Address Translation..................................................................................................... 109
5.7.3 Protection and Sharing ................................................................................................ 110
6.8 Summary ...................................................................................................................... 110
Unit 6 Virtual Memory......................................................................................................... 112
6.0 Introduction ................................................................................................................. 112
6.2 Virtual Memory ........................................................................................................... 113
6.2.1 Principles of Operation.......................................................................................... 114
6.1.2 Virtual Memory Management ................................................................................ 115
6.2.3 Protection and Sharing ................................................................................................ 117
6.3 Demand Paging ............................................................................................................ 119
6.4 Page Replacement Policies .......................................................................................... 121
6.4.1 First In First Out (FIFO) .............................................................................................. 121
6.4.2 Second Chance (SC).................................................................................................... 122
6.4.3 Least Recently Used (LRU) ........................................................................................ 122
6.4.4 Optimal Algorithm (OPT) ........................................................................................... 122
6.5 Thrashing ................................................................................................................... 123
6.5.1 Working-Set Model ..................................................................................................... 124
6.5.2 Page-fault Rate ............................................................................................................ 125
6.6 Demand Segmentation ................................................................................................ 125
6.7 Combined Systems ....................................................................................................... 126
6.7.1 Segmented Paging ....................................................................................................... 126
6.7.2 Paged Segmentation .................................................................................................... 127
UNIT 7 I/O and File Management ........................................................................................ 129
7.0. Introduction ................................................................................................................. 129
7.4. Disk Organisation ........................................................................................................ 132
7.4.1 Device Drivers and IDs ............................................................................................... 133
7.4.2 Checking Data Consistency and Formatting ............................................................... 134
7.5. Disk Scheduling............................................................................................................ 134
7.5.1 FCFS Scheduling......................................................................................................... 135
7.5.2 SSTF Scheduling ......................................................................................................... 135
7.5.3 SCAN Scheduling ....................................................................................................... 136
7.5.4 C-SCAN Scheduling ................................................................................................... 137
7.5.5 LOOK and C-LOOK Scheduling ................................................................................ 137
7.6. RAID ............................................................................................................................. 138
7.7. Disk Cache .................................................................................................................... 138
7.8. Command Language User’s View of the File System .............................................. 139
7.9. The System Programmer’s View of the File System ................................................ 139
7.10. The Operating System’s View of File Management ............................................. 139
7.10.1 Directories ................................................................................................................. 140
7.10.2 Disk Space Management ........................................................................................... 143
7.10.3 Disk Address Translation .......................................................................................... 147
7.10.4 File Related System Services .................................................................................... 148
7.10.5 Asynchronous Input/Output ...................................................................................... 149
7.11. Summary ................................................................................................................... 150
Unit 8 Introduction to Networking .................................................................................... 152
8.0. Introduction ................................................................................................................. 152
8.1. Learning Outcomes ..................................................................................................... 152
8.2. Network Basics ............................................................................................................. 152
8.2.1 Network Protocols ....................................................................................................... 154
8.3. Remote Procedure Call Protocols .............................................................................. 170
8.3.1 Marshalling.................................................................................................................. 172
8.3.2 Reliable Semantics ...................................................................................................... 175
8.4. Summary ...................................................................................................................... 183
Unit 9 Distributed Systems ................................................................................................. 184
9.0. Introduction ................................................................................................................. 184
9.2.1 Economy...................................................................................................................... 185
9.2.2 Reliability .................................................................................................................... 185
9.2.3 Resource sharing ......................................................................................................... 185
9.2.4 Performance ................................................................................................................ 185
9.2.5 Incremental growth. .................................................................................................... 186
9.3. Naming .......................................................................................................................... 186
9.3.1 Name server................................................................................................................. 186
9.3.2 Internet names and addresses ...................................................................................... 186
9.3.3 Internet name server .................................................................................................... 187
9.4. Operating Systems ....................................................................................................... 189
9.4.1 Network operating system ........................................................................................... 189
9.4.2 Distributed operating system ....................................................................................... 189
9.4.3 Hybrid systems ............................................................................................................ 190
9.5. Sockets. ......................................................................................................................... 191
9.5.1 The socket interface..................................................................................................... 191
9.5.2 Creating a socket ......................................................................................................... 192
9.5.3 Binding an address ...................................................................................................... 192
9.5.4 Connecting sockets ...................................................................................................... 192
9.6. Remote Procedure Call ............................................................................................... 193
9.6.1 Overview ..................................................................................................................... 193
9.6.2 Generating stubs .......................................................................................................... 194
9.7. Distributed Mutual Exclusion .................................................................................... 194
9.7.1 Centralized algorithm .................................................................................................. 195
9.7.2 A distributed algorithm ............................................................................................... 195
9.8. Deadlock in Distributed Systems ................................................................................ 195
9.8.1 Centralized algorithm .................................................................................................. 196
9.8.2 Distributed algorithm .................................................................................................. 196
9.8.3 Recovery...................................................................................................................... 197
9.9. Distributed Shared Memory ....................................................................................... 197
9.9.1 Implementation............................................................................................................ 197
9.9.2 Local run-time support ................................................................................................ 198
9.10. Distributed File Systems .......................................................................................... 198
9.10.1 Client/server systems................................................................................................. 198
9.10.2 Naming schemes ....................................................................................................... 199
9.10.3 Reading and writing .................................................................................................. 200
9.10.4 Caching...................................................................................................................... 200
9.10.5 Replication ................................................................................................................ 201
9.10.6 Sun Network File System .......................................................................................... 201
9.11. Summary ................................................................................................................... 202
FURTHER READING. ........................................................................................................ 202
Unit10 Fault Tolerance and Security ................................................................................... 203
10.0. Introduction .............................................................................................................. 203
10.1. Learning Outcomes .................................................................................................. 203
10.2. Fault tolerance .......................................................................................................... 203
10.2.1 Types of fault ............................................................................................................ 204
10.2.2 Detecting faults ......................................................................................................... 204
10.2.3 Recovering from faults .............................................................................................. 204
10.2.4 Faults in file systems ................................................................................................. 205
10.2.5 Faults in distributed systems ..................................................................................... 205
10.3. Security ..................................................................................................................... 206
10.3.1 Security policy........................................................................................................... 206
10.3.2 Protection mechanisms .............................................................................................. 206
10.3.3 Access matrix ............................................................................................................ 207
Activity 10.1 ......................................................................................................................... 208
10.4. Security in Distributed Systems .............................................................................. 208
10.4.1 Authentication ........................................................................................................... 208
10.4.2 Cryptography ............................................................................................................. 209
10.5.3 Digital signatures ....................................................................................................... 211
10.5. UNIX Security .......................................................................................................... 211
10.5.1. Unix Protection System ........................................................................................ 212
10.5.2 Unix Authorization .................................................................................................... 213
10.5.3 UNIX Security Analysis ............................................................................................ 215
10.5.4 UNIX Vulnerabilities ................................................................................................ 216
10.6. Windows Security .................................................................................................... 218
10.6.1 Windows Protection System ..................................................................................... 218
10.6.2 Windows Authorization ............................................................................................ 220
10.6.3 Windows Security Analysis ...................................................................................... 222
10.6.4 Windows Vulnerabilities ........................................................................................... 223
10.7. Summary ................................................................................................................... 225
REFERENCES ....................................................................................................................... 225
Unit 11 Case Study 1: LINUX ............................................................................................. 226
11.0. Introduction .............................................................................................................. 226
11.1. Learning Outcomes .................................................................................................. 226
11.2. History of UNIX and LINUX .................................................................................. 226
11.3. Overview of LINUX ................................................................................................. 228
11.3.1 Linux Goals ............................................................................................................... 228
11.3.2 Interfaces to Linux .................................................................................................... 228
11.3.3 The Shell ................................................................................................................... 229
11.3.4 Linux Utility Programs .............................................................................................. 229
11.3.5 Kernel Structure ........................................................................................................ 230
11.4. Processes in Linux .................................................................................................... 232
11.4.1 Fundamental Concepts .............................................................................................. 232
11.4.2 Process Management System Calls in Linux ............................................................ 236
11.5. Memory Management in Linux .............................................................................. 240
11.6. Input/Output in Linux ............................................................................................. 242
11.6.1 Disk Scheduling ........................................................................................................ 242
11.6.2 The Elevator Scheduler ............................................................................................. 242
11.6.3 Deadline Scheduler ................................................................................................... 243
11.6.4 Anticipatory I/O Scheduler ....................................................................................... 244
11.6.5 Linux Page Cache ...................................................................................................... 244
11.7. The Linux File System ............................................................................................. 245
11.7.1 Fundamental Concepts .............................................................................................. 245
11.7.2. The file system in reality .......................................................................................... 248
11.8. Summary ................................................................................................................... 250
Unit 12 Case Study: Windows Vista..................................................................................... 251
12.0. Introduction .............................................................................................................. 251
12.2.1 Single-User Multitasking .......................................................................................... 254
12.2.2 Architecture ............................................................................................................... 255
12.2.3 Operating System Organization ................................................................................ 256
12.2.4 Client/Server Model .................................................................................................. 259
12.2.5 Threads and SMP ...................................................................................................... 260
12.2.6 Windows Objects ...................................................................................................... 260
12.3. Windows Thread and SMP Management .............................................................. 262
12.3.1 Process and Thread Objects ...................................................................................... 264
12.3.2. Multithreading .......................................................................................................... 265
12.3.4 Support for OS Subsystems....................................................................................... 268
12.3.5.Symmetric Multiprocessing Support ......................................................................... 268
12.4. Windows Concurrency Mechanisms ...................................................................... 269
12.4.1 Wait Functions .......................................................................................................... 269
12.4.2 Dispatcher Objects .................................................................................................... 269
12.3.3 Critical Sections ........................................................................................................ 270
12.4.4 Slim Read-Writer Locks and Condition Variables ................................................... 271
12.5. Windows Memory Management ............................................................................ 271
12.5.1 Windows Virtual Address Map ................................................................................. 272
12.5.2 Windows Paging ....................................................................................................... 273
12.6. Windows Scheduling ................................................................................................ 273
12.6.1 Process and Thread Priorities .................................................................................... 274
12.6.2 Multiprocessor Scheduling ........................................................................................ 276
12.7. Windows I/O ............................................................................................................. 277
12.7.1. Basic I/O Facilities ............................................................................................. 277
12.7.3 Software RAID .......................................................................................................... 278
12.8. Windows File System ............................................................................................... 279
12.8.1 Key Features of NTFS ............................................................................................... 279
12.8.2 NTFS Volume and File Structure .............................................................................. 280
12.9. Summary ................................................................................................................... 284
COURSE GUIDE
COURSE GUIDE DESCRIPTION
You must read this Course Guide carefully from the beginning to the end. It tells you briefly what
the course is about and how you can work your way through the course material. It also suggests
the amount of time you are likely to spend in order to complete the course successfully. Please
keep on referring to the Course Guide as you go through the course material as it will help you to
clarify important study components or points that you might miss or overlook.
INTRODUCTION
BIT 201 Operating System is one of the courses offered by the Faculty of Information Technology and Multimedia Communication at the Zimbabwe Open University (ZOU).
COURSE AUDIENCE
This module aims to give learners a detailed understanding of operating systems and, at the same time, examines several architectural aspects of an operating system.
As an open and distance learner, you should be able to learn independently and optimise the
learning modes and environment available to you. Before you begin this course, please confirm
the course material, the course requirements and how the course is conducted.
STUDY SCHEDULE
It is a standard ZOU practice that learners accumulate 40 study hours for every credit hour. As
such, for a three-credit hour course, you are expected to spend 120 study hours. Table 1 gives an
estimation of how the 120 study hours could be accumulated.
TEXT ARRANGEMENT GUIDE
Before you go through this module, it is important that you note the text arrangement.
Understanding the text arrangement should help you to organise your study of this course to be
more objective and more effective. Generally, the text arrangement for each unit is as follows:
Learning Outcomes: This section refers to what you should achieve after you have completely
gone through a unit. As you go through each unit, you should frequently refer to these learning
outcomes. By doing this, you can continuously gauge your progress of digesting the unit.
Activity: Activities are placed at various locations or junctures throughout the module.
An activity can take various forms, such as questions, short case studies, or a request to
carry out an observation or some research. An activity may also ask for your opinion and
evaluation of a given scenario. When you come across an activity, you should try to relate
what you have gathered from the module to real situations. You should engage in
higher-order thinking, where you might be required to analyse, synthesise and evaluate
instead of just recalling and defining.
Summary: You can find this component at the end of each unit. This component helps you to
recap the whole unit. By going through the summary, you should be able to gauge your
knowledge retention level. Should you find points inside the summary that you do not fully
understand, it would be a good idea for you to revisit the details from the module.
Key Terms: This component can be found at the end of each unit. You should go through this
component to remind yourself of important terms or jargon used throughout the module. Should
you find terms here that you are not able to explain, you should look them up in the
module.
References: This is where a list of relevant and useful textbooks, journals, articles,
electronic content or other sources can be found. The list can appear in a few locations, such
as in the Course Guide (in the References section), at the end of every unit or at the back of
the module. You are encouraged to read and refer to the suggested sources to obtain the
additional information you need and to enhance your overall understanding of the course.
PRIOR KNOWLEDGE
No prior knowledge of the subject matter is required to follow this module. However, a basic
knowledge of computer systems will be an advantage.
Unit 1 Operating System: An Overview
1.0 Introduction
Computer software can be divided into two main categories: application software and system
software. Application software consists of the programs for performing tasks particular to the
machine’s utilisation. This software is designed to solve a particular problem for users. Examples
of application software include spreadsheets, database systems, desktop publishing systems,
program development software and games.
On the other hand, system software is more transparent and less noticed by the typical computer
user. This software provides a general programming environment in which programmers can
create specific applications to suit their needs. This environment provides new functions that are
not available at the hardware level and performs tasks related to executing the application
program. System software acts as an interface between the hardware of the computer and the
application software that users need to run on the computer. The most important type of system
software is the operating system.
An Operating System (OS) is a collection of programs that acts as an interface between the user of
a computer and the computer hardware. The purpose of an operating system is to provide an
environment in which a user may execute programs. Operating systems are viewed as
resource managers. The main resource is the computer hardware in the form of processors,
storage, input/output devices, communication devices, and data. Some of the operating system
functions are implementing the user interface, sharing hardware among users, allowing users to
share data among themselves, preventing users from interfering with one another, scheduling
resource usage, facilitating parallel operations, organising data for secure and rapid access, and
handling network communications. This unit presents the definition of an operating system, the
goals of an operating system, the generations of operating systems, the different types of
operating systems and the functions of an operating system.
An operating system controls and coordinates the use of the hardware among the various systems
programs and application programs for the various users.
We can view an operating system as a resource allocator. A computer system has many resources
(hardware and software) that may be required to solve a problem: CPU time, memory space, file
storage space, input/output devices and so on. The operating system acts as the manager of these
resources and allocates them to specific programs and users as necessary for their tasks. Since
there may be many, possibly conflicting, requests for resources, the operating system must decide
which requests are allocated resources so that the computer system operates fairly and efficiently.
An operating system is a control program. This program controls the execution of user programs
to prevent errors and improper use of the computer. Operating systems exist because they are a
reasonable way to solve the problem of creating a usable computing system. The fundamental
goal of a computer system is to execute user programs and to solve user problems.
While there is no universally agreed-upon definition of the concept of an operating system, the
following is a reasonable starting point.
A computer's operating system is a group of programs designed to serve two basic purposes:
(a) To control the allocation and use of the computing system’s resources among the various
users and tasks; and
(b) To provide an interface between the computer hardware and the programmer that
simplifies and makes feasible the creation, coding, debugging, and maintenance of
application programs.
Though systems programs such as editors and translators, and the various utility programs (such
as sort and file transfer programs), are not usually considered part of the operating system, the
operating system is responsible for providing access to these system resources.
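To make this interface role concrete, the short C sketch below is offered as an illustration; it is not part of the original module and it assumes a POSIX-style system such as UNIX or Linux. The program never drives the output hardware itself: it asks the operating system, through the write() system call, to deliver the bytes, and the kernel decides how the request reaches a physical device.

/* Illustrative sketch, assuming a POSIX system: the user program only
 * requests a service; the operating system performs the device-level work. */
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "Hello from a user program\n";

    /* File descriptor 1 is standard output. The kernel decides which
     * physical device (console, file, pipe) actually receives the bytes. */
    write(STDOUT_FILENO, msg, strlen(msg));
    return 0;
}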
The abstract view of the components of a computer system and the positioning of the OS is shown
in Figure 1.1.
Figure 1.1: Abstract view of the components of a computer system
ACTIVITY 1.1
1.4.1. 0th Generations
The term 0th generation is used to refer to the period of development of computing that
predated the commercial production and sale of computer equipment. We consider that this period
might reach as far back as Charles Babbage's invention of the Analytical Engine. Afterwards came
the computer built by John Atanasoff in 1940; the Mark I, built by Howard Aiken and a group of IBM
engineers at Harvard in 1944; the ENIAC, designed and constructed at the University of
Pennsylvania by J. Presper Eckert and John Mauchly; and the EDVAC, developed in 1944-46 by
John von Neumann, Arthur Burks, and Herman Goldstine (which was the first to fully
implement the idea of the stored program and serial execution of instructions). The
development of EDVAC set the stage for the evolution of commercial computing and operating
system software. The hardware component technology of this period was electronic vacuum
tubes.
The actual operation of these early computers took place without the benefit of an operating
system. Early programs were written in machine language and each contained code for initiating
operation of the computer itself.
The mode of operation was called “open-shop” and this meant that users signed up for computer
time and when a user’s time arrived, the entire (in those days quite large) computer system was
turned over to the user. The individual user (programmer) was responsible for all machine set up
and operation, and subsequent clean-up and preparation for the next user. This system was clearly
inefficient and dependent on the varying competencies of individual programmers as
operators.
Operation continued without the benefit of an operating system for a time. The mode was called
"closed shop" and was characterised by the appearance of hired operators who would select the
job to be run, and so forth. Programs began to be written in higher-level, procedure-oriented
languages, and thus the operator's routine expanded. The operator now selected a job, ran the
translation program to assemble or compile the source program, combined the translated
object program along with any existing library programs that the program might need for input to
the linking program, loaded and ran the composite linked program, and then handled the next job
in a similar fashion.
Application programs were run one at a time, and were translated with absolute computer
addresses that bound them to be loaded and run from preassigned storage addresses set by
the translator, so a program could not be moved to a different location in storage for any reason.
Similarly, a program bound to specific devices could not be run at all if any of these devices were
busy or broken.
The inefficiencies inherent in the above methods of operation led to the development of the
mono-programmed operating system, which eliminated some of the human intervention in
running jobs and provided programmers with a number of desirable functions. The OS consisted
of a permanently resident kernel in main storage, and a job scheduler and a number of utility
programs kept in secondary storage. User application programs were preceded by control or
specification cards (in those days, computer programs were submitted on data cards) which
informed the OS of which system resources (such as tape drives and printers) were needed to run a
particular application. The systems were designed to be operated as batch processing systems.
These systems continued to operate under the control of a human operator who initiated
operations by mounting a magnetic tape that contained the operating system executable code onto
a "boot device", and then pushing the IPL (Initial Program Load) or "boot" button to initiate the
bootstrap loading of the operating system. Once the system was loaded, the operator entered the
date and time, and then initiated the operation of the job scheduler program which read and
interpreted the control statements, secured the needed resources, executed the first user program,
recorded timing and accounting information, and then went back to begin processing of another
user program, and so on, as long as there were programs waiting in the input queue to be
executed.
The first generation saw the evolution from hands-on operation to closed-shop operation to the
development of mono-programmed operating systems. At the same time, the development of
programming languages was moving away from basic machine languages: first to assembly
language, and later to procedure-oriented languages, the most significant being the development
of FORTRAN by John W. Backus in 1956. Several problems remained, however. The most
obvious was the inefficient use of system resources, which was most evident when the CPU
waited while the relatively slow, mechanical I/O devices were reading or writing program data.
In addition, system protection was a problem because the operating system kernel was not
protected from being overwritten by an errant application program. Moreover, other user
programs in the queue were not protected from destruction by executing programs.
The most significant innovations addressed the problem of excessive central processor delay due
to waiting for input/output operations. Recall that programs were executed by processing the
machine instructions in a strictly sequential order. As a result, the CPU, with its high-speed
electronic components, was often forced to wait for the completion of I/O operations which involved
mechanical devices (card readers and tape drives) that were orders of magnitude slower. Near the
end of the first generation this problem led to the introduction of the data channel, an integral and
special-purpose computer with its own instruction set, registers and control unit, designed to
process input/output operations separately and asynchronously from the operation of the
computer's main CPU; the data channel saw widespread adoption in the second generation.
The data channel allowed some I/O to be buffered. That is, a program’s input data could be read
“ahead” from data cards or tape into a special block of memory called a buffer. Then, when the
user’s program came to an input statement, the data could be transferred from the buffer locations
at the faster memory access speed rather than the slower I/O device speed. Similarly, a program's
output could be written to another buffer and later moved from the buffer to the printer, tape or card
punch. What made this all work was the data channel's ability to work asynchronously and
concurrently with the main processor. Thus, the slower mechanical I/O could be happening
concurrently with main program processing. This process was called I/O overlap.
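The small C program below is an illustrative sketch only; the historical data channels were programmed quite differently, and the POSIX thread and the channel_read() helper used here are stand-ins invented for the example. It shows the essence of I/O overlap with two buffers: while one buffer is being filled, the processor works on the data already read ahead into the other.

/* Illustrative sketch only: overlapping "input" with computation using two
 * buffers and a helper thread standing in for the data channel. */
#include <pthread.h>
#include <stdio.h>

#define BUF_SIZE 64
#define RECORDS  4

static char buffer[2][BUF_SIZE];

/* Stand-in for the data channel: fills the requested buffer while the
 * main "CPU" works on the other one. */
static void *channel_read(void *arg)
{
    int idx = *(int *)arg;
    snprintf(buffer[idx], BUF_SIZE, "record placed in buffer %d", idx);
    return NULL;
}

int main(void)
{
    pthread_t channel;
    int ready = 0, filling = 1;

    channel_read(&ready);              /* prime the first buffer (read-ahead) */

    for (int i = 0; i < RECORDS; i++) {
        /* Start the "channel" on the spare buffer ...                       */
        pthread_create(&channel, NULL, channel_read, &filling);

        /* ... while the CPU consumes the data already read ahead.           */
        printf("step %d: processing %s\n", i, buffer[ready]);

        pthread_join(channel, NULL);   /* the channel reports "I/O complete" */
        ready ^= 1;                    /* swap the roles of the two buffers  */
        filling ^= 1;
    }
    return 0;
}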
The data channel was controlled by a channel program set up by the operating system I/O control
routines and initiated by a special instruction executed by the CPU. Then, the channel
independently processed data to or from the buffer. This provided communication from the CPU
to the data channel to initiate an I/O operation. It remained for the channel to communicate to the
CPU such events as data errors and the completion of a transmission. At first, this communication
was handled by polling: the CPU stopped its work periodically and polled the channel to
determine if there was any message.
Polling was obviously inefficient (imagine stopping your work periodically to go to the post
office to see if an expected letter has arrived) and led to another significant innovation of the
second generation: the interrupt. The data channel was able to interrupt the CPU with a message,
usually "I/O complete". In fact, the interrupt idea was later extended from I/O to allow the signalling
of a number of exceptional conditions such as arithmetic overflow, division by zero and time
run-out. Of course, interval clocks were added in conjunction with the latter, and thus the operating
system came to have a way of regaining control from an exceptionally long or indefinitely
looping program.
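A hedged sketch of the two styles is given below in C; it is not from the module, and a POSIX alarm signal merely stands in for the data channel's "I/O complete" interrupt. The commented-out loop shows the polling style, while the working code simply gives up the CPU until it is interrupted.

/* Illustration of polling versus interrupts, using a timer signal as a
 * stand-in for an I/O completion interrupt (assumes a POSIX system). */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t io_complete = 0;

/* "Interrupt handler": runs asynchronously when SIGALRM arrives. */
static void on_io_complete(int sig)
{
    (void)sig;
    io_complete = 1;
}

int main(void)
{
    signal(SIGALRM, on_io_complete);
    alarm(1);                 /* the "device" will finish in about 1 second */

    /* Polling style: the CPU keeps stopping to ask "are you done yet?"
     *     while (!io_complete) { ...do a little work, then check again... } */

    /* Interrupt style: do other work (here, just sleep) until notified.    */
    while (!io_complete)
        pause();              /* give up the CPU until a signal arrives     */

    printf("I/O complete signalled; resuming the user program\n");
    return 0;
}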
These hardware developments led to enhancements of the operating system. I/O and data channel
communication and control became functions of the operating system, both to relieve the
application programmer from the difficult details of I/O programming and to protect the integrity
of the system. The operating system also provided improved service to users by segmenting jobs
and running shorter jobs first (during "prime time") and relegating longer jobs to lower priority or
night-time runs. System libraries became more widely available and more comprehensive as new
utilities and application software components were made available to programmers.
In order to further mitigate the I/O wait problem, systems were set up to spool the input batch
from slower I/O devices such as the card reader to the much higher-speed tape drive and,
similarly, the output from the higher-speed tape to the slower printer. In this scenario, the user
submitted a job at a window, a batch of jobs was accumulated and spooled from cards to tape "off
line", the tape was moved to the main computer, the jobs were run, and their output was collected
on another tape that was later taken to a satellite computer for off-line tape-to-printer output. Users
then picked up their output at the submission window.
Toward the end of this period, as random access devices became available, tape-oriented
operating systems began to be replaced by disk-oriented systems. With the more sophisticated disk
hardware and the operating system supporting a greater portion of the programmer's work, the
computer system that users saw was more and more removed from the actual hardware: users saw
a virtual machine.
The second generation was a period of intense operating system development. It was also the
period of sequential batch processing. The sequential processing of one job at a time
remained a significant limitation: there continued to be low CPU utilisation for I/O-bound
jobs and low I/O device utilisation for CPU-bound jobs. This was a major concern, since
computers were still very large (room-size) and expensive machines. Researchers began to
experiment with multiprogramming and multiprocessing in their computing services, called
time-sharing systems. A noteworthy example is the Compatible Time Sharing System (CTSS),
developed at MIT during the early 1960s.
Operating system development continued with the introduction and widespread adoption of
multiprogramming. This was marked first by the appearance of more sophisticated I/O buffering in
the form of spooling operating systems, such as the HASP (Houston Automatic Spooling) system
that accompanied the IBM OS/360 system. These systems worked by introducing two new
system programs, a system reader to move input jobs from cards to disk, and a system writer to
move job output from disk to printer, tape, or cards. Operation of the spooling system was, as before,
transparent to the computer user, who perceived input as coming directly from the cards and
output as going directly to the printer.
The idea of taking fuller advantage of the computer's data channel I/O capabilities continued to develop. That is, designers recognised that I/O needed only to be initiated by a CPU instruction - the actual I/O data transmission could take place under control of a separate and asynchronously operating channel program. Thus, by switching control of the CPU between the currently executing user program, the system reader program, and the system writer program, it was possible to keep the slower mechanical I/O devices running and to minimise the amount of time the CPU spent waiting for I/O completion. The net result was an increase in system throughput and resource utilisation, to the benefit of both users and providers of computer services.
This concurrent operation of three programs (more properly, apparent concurrent operation, since systems had only one CPU, and could, therefore, execute just one instruction at a time) required that additional features and complexity be added to the operating system. First, the fact that the input queue was now on disk, a direct access device, freed the system scheduler from the first-come-first-served policy so that it could select the „best” next job to enter the system (looking for either the shortest job or the highest priority job in the queue). Second, since the CPU was to be shared by the user program, the system reader, and the system writer, some processor allocation rule or policy was needed. Since the goal of spooling was to increase resource utilisation by enabling the slower I/O devices to run asynchronously with user program processing, and since I/O processing required the CPU only for short periods to initiate data channel instructions, the CPU was dispatched to the reader, the writer, and the program in that order. Moreover, if the writer or the user program was executing when something became available to read, the reader program would preempt the currently executing program to regain control of the CPU for its initiation instruction, and the writer program would preempt the user program for the same
purpose. This rule, called the static priority rule with preemption, was implemented in the
operating system as a system dispatcher program.
The spooling operating system was in fact a form of multiprogramming, since more than one program was resident in main storage at the same time. Later this basic idea of multiprogramming was extended to include more than one active user program in memory at a time. To accommodate this extension, both the scheduler and the dispatcher were enhanced. The scheduler became able to manage the diverse resource needs of the several concurrently active user programs, and the dispatcher included policies for allocating processor resources among the competing user programs. In addition, memory management became more sophisticated in order to assure that the program code for each job, or at least that part of the code being executed, was resident in main storage.
The advent of large-scale multiprogramming was made possible by several important hardware
innovations such as:
(a) The widespread availability of large capacity, high-speed disk units to accommodate the spooled input streams and the memory overflow, together with the maintenance of several concurrently active programs in execution.
(b) Relocation hardware which facilitated the moving of blocks of code within memory
without any undue overhead penalty.
(c) The availability of storage protection hardware to ensure that user jobs are protected from
one another and that the operating system itself is protected from user programs.
(d) Some of these hardware innovations involved extensions to the interrupt systems in order
to handle a variety of external conditions such as program malfunctions, storage
protection violations and machine checks in addition to I/O interrupts. In addition, the
interrupt system became the technique for the user program to request services from the
operating system kernel.
(e) The advent of privileged instructions allowed the operating system to maintain coordination and control over the multiple activities now going on within the system.
Successful implementation of multiprogramming opened the way for the development of a new way of delivering computing services - time-sharing. In this environment, several terminals, sometimes up to 200 of them, were attached (hard wired or via telephone lines) to a central computer. Users at their terminals „logged in” to the central system and worked interactively with the system. The system's apparent concurrency was enabled by the multiprogramming operating system. Users shared not only the system hardware but also its software resources and file system disk space.
The third generation was an exciting time, indeed, for the development of both computer hardware and the accompanying operating systems. During this period, the topic of operating systems became, in reality, a major element of the discipline of computing.
1.4.5. Fourth Generation (1979 - Present)
The fourth generation is characterised by the appearance of the personal computer and the workstation. The miniaturisation of electronic circuits, with thousands of transistors on a small chip, made possible the development of desktop computers with capabilities exceeding those that filled entire rooms and floors of buildings just twenty years earlier.
The operating systems that control these desktop machines have brought us back full circle to the open shop type of environment, where each user occupies an entire computer for the duration of a job's execution. This works much better now because the progress made over the years has made the virtual computer resulting from the operating system/hardware combination so much easier to use or, in the words of the popular press, “user-friendly”.
However, improvements in hardware miniaturisation and technology have evolved so fast that we now have inexpensive workstation-class computers capable of supporting multiprogramming and time-sharing. Hence the operating systems that support today's personal computers and workstations look much like those which were available for the minicomputers of the third generation. Examples are Microsoft's DOS for IBM-compatible personal computers and UNIX for workstations.
However, many of these desktop computers are now connected as networked or distributed systems. Computers in a networked system each have their operating systems augmented with communication capabilities that enable users to remotely log into any system on the network and transfer information among machines that are connected to the network. The machines that make up a distributed system operate as a virtual single-processor system from the user's point of view; a central operating system controls and makes transparent the location in the system of the particular processor or processors and file systems that are handling any given program.
ACTIVITY 1.2
1.5.2. Time Sharing
Another mode for delivering computing services is provided by time sharing operating systems. In this environment a computer provides computing services to several or many users concurrently on-line. Here, the various users share the central processor, the memory, and other resources of the computer system in a manner facilitated, controlled, and monitored by the operating system. The user, in this environment, has nearly full interaction with the program during its execution, and the computer's response time may be expected to be no more than a few seconds.
1.5.3. Real Time Operating System (RTOS)
Real time operating systems are used to control machinery, scientific instruments and industrial systems. An RTOS typically has very little user-interface capability, and no end-user utilities. A very important part of an RTOS is managing the resources of the computer so that a particular operation executes in precisely the same amount of time every time it occurs. In a complex machine, having a part move more quickly just because system resources are available may be just as catastrophic as having it not move at all because the system is busy.
1.5.6. Networking Operating System
Network operating systems are not fundamentally different from single-processor operating systems. They obviously need a network interface controller and some low-level software to drive it, as well as programs to achieve remote login and remote file access, but these additions do not change the essential structure of the operating system.
1.5.7. Distributed Operating System
A distributed operating system, in contrast, is one that appears to its users as a traditional uni-processor system, even though it is actually composed of multiple processors. In a true distributed system, users should not be aware of where their programs are being run or where their files are located; that should all be handled automatically and efficiently by the operating system.
True distributed operating systems require more than just adding a little code to a uni-processor operating system, because distributed and centralised systems differ in critical ways. Distributed systems, for example, often allow programs to run on several processors at the same time, thus requiring more complex processor scheduling algorithms in order to optimise the amount of parallelism achieved.
Let us study various examples of the popular operating systems in the next section.
ACTIVITY 1.3
1.6 Desirable Qualities of Operating System
The desirable qualities of an operating system are in terms of:
1. Usability
(a) Robustness;
(b) Consistency;
(c) Proportionality;
(d) Convenience; and
(e) Powerful with high level facilities.
2. Facilities
(i) Sufficient for intended use;
(ii) Complete; and
(iii) Appropriate.
3. Costs
(a) Low cost and efficient services;
(b) Good algorithms: make use of space/time trade-offs and special hardware;
(c) Low overhead: the cost of doing nothing should be low, e.g. idle time at a terminal; and
(d) Low maintenance cost: the system should not require constant attention.
4. Adaptability
(i) Tailored to the Environment
Support necessary activities and do not impose unnecessary restrictions. Make the things people do most often easy.
(ii) Changeable Over Time
Adapt as needs and resources change, e.g. expanding memory, new devices or a new user population.
(iii) Extendible - Extensible
New facilities and features can be added which look like the old ones.
1.7 Operating Systems: Some Examples
1.7.1. DOS
DOS (Disk Operating System) was the first widely-installed operating system for personal
computers. It is a master control program that is automatically run when you start your personal
computer (PC). DOS stays in the computer all the time letting you run a program and manage
files. It is a single-user operating system from Microsoft for the PC. It was the first OS for the PC
and is the underlying control program for Windows 3.1, 95, 98 and ME. Windows NT, 2000 and
XP emulate DOS in order to support existing DOS applications.
1.7.2. UNIX
UNIX operating systems are used in widely-sold workstation products from Sun Microsystems,
Silicon Graphics, IBM and a number of other companies. The UNIX environment and the
client/server program model were important elements in the development of the Internet and the
reshaping of computing as centered in networks rather than in individual computers. Linux, a
UNIX derivative available in both “free software” and commercial versions, is increasing in
popularity as an alternative to proprietary operating systems. UNIX is written in C. Both UNIX
and C were developed by AT&T and freely distributed to government and academic institutions,
causing it to be ported to a wider variety of machine families than any other operating system. As
a result, UNIX became synonymous with “open systems”.
UNIX is made of the kernel, file system and shell (command line interface). The major shells are
the Bourne shell (original), C Shell and Korn shell. The UNIX vocabulary is exhaustive with
more than 600 commands that manipulate data and text in every way conceivable. Many
commands are cryptic, but just as Windows hid the DOS prompt, the Motif GUI presents a
friendlier image to UNIX users. Even with its many versions, UNIX is widely used in mission
critical applications for client/server and transaction processing systems. The UNIX versions that
are widely used are Sun’s Solaris, Digital’s UNIX, HP’s HP-UX, IBM’s AIX and SCO’s
UnixWare. A large number of IBM mainframes also run UNIX applications, because the UNIX interfaces were added to MVS and OS/390, which have obtained UNIX branding. Linux, another variant of UNIX, is also gaining enormous popularity.
1.7.3. Windows
Windows is a personal computer operating system from Microsoft that, together with some
commonly used business applications such as Microsoft Word and Excel, has become a de facto
„standard” for individual users in most corporations as well as in most homes. Windows contains built-in networking, which allows users to share files and applications with each other if their PCs are connected to a network. In large enterprises, Windows clients are often connected to a network of UNIX and NetWare servers. The server versions of Windows NT and 2000 are gaining market share, providing a Windows-only solution for both the client and server. Windows
is supported by Microsoft, the largest software company in the world, as well as the Windows
industry at large, which includes tens of thousands of software developers.
This networking support is the reason why Windows became successful in the first place. However, Windows 95, 98, ME, NT, 2000 and XP are complicated operating environments. Certain combinations of hardware and software running together can cause problems, and troubleshooting can be daunting. Each new version of Windows has interface changes that constantly confuse users and keep support people busy. Microsoft has made Windows 2000 and Windows XP more resilient to installation problems and to crashes in general.
1.7.4. Macintosh
The Macintosh (often called “the Mac”), introduced in 1984 by Apple Computer, was the first widely sold personal computer with a Graphical User Interface (GUI). The Mac was designed to provide users with a natural, intuitively understandable, and, in general, “user-friendly” computer interface. This includes the mouse, the use of icons or small visual images to represent objects or actions, the point-and-click and click-and-drag actions, and a number of window operation ideas.
Microsoft was successful in adapting user interface concepts first made popular by the Mac in its
first Windows operating system. The primary disadvantage of the Mac is that there are fewer
Mac applications on the market than for Windows. However, all the fundamental applications are
available, and the Macintosh is a perfectly useful machine for almost everybody. Data
compatibility between Windows and Mac is an issue, although it is often overblown and readily
solved.
The Macintosh has its own operating system, Mac OS, which in its latest version is called Mac OS X. Originally built on Motorola's 68000 series microprocessors, Mac versions today are powered by the PowerPC microprocessor, which was developed jointly by Apple, Motorola and IBM. While Mac users represent only about 5% of the total number of personal computer users, Macs are highly popular and almost a cultural necessity among graphic designers and online visual artists and the companies they work for.
In this section, we will discuss some services of the operating system used by its users. Users of an operating system can be divided into two broad classes: command language users and system call users. Command language users are those who interact with the operating system using commands. On the other hand, system call users invoke services of the operating system by means of run-time system calls during the execution of programs.
ACTIVITY 1.4
1. Discuss different views of the Operating Systems.
2. Summarise the characteristics of the following Operating Systems:
(a) Batch Processing
(b) Time Sharing
(c) Real time
The operating system is responsible for the following activities in connection with process management:
(a) The creation and deletion of both user and system processes;
(b) The suspension and resumption of processes;
(c) The provision of mechanisms for process synchronisation; and
(d) The provision of mechanisms for deadlock handling.
There are various algorithms that depend on the particular situation to manage the memory.
Selection of a memory management scheme for a specific system depends upon many factors, but
especially upon the hardware design of the system. Each algorithm requires its own hardware
support. The operating system is responsible for the following activities in connection with memory management:
(a) Keep track of which parts of memory are currently being used and by whom;
(b) Decide which processes are to be loaded into memory when memory space becomes
available; and
(c) Allocate and deallocate memory space as needed.
Memory management techniques will be discussed in great detail in Unit 5 of this course.
The operating system is responsible for the following activities in connection with disk
management:
(a) Free space management;
(b) Storage allocation; and
(c) Disk scheduling.
1.8.4. I/O Management
One of the purposes of an operating system is to hide the peculiarities of specific hardware devices from the user. For example, in UNIX, the peculiarities of I/O devices are hidden from the bulk of the operating system itself by the I/O system. The operating system is responsible for the following activities in connection with I/O management:
(a) A buffer caching system;
(b) To activate a general device driver code; and
(c) To run the driver software for specific hardware devices as and when required.
1.8.5. File Management
For convenient use of the computer system, the operating system provides a uniform logical view of information storage. The operating system abstracts from the physical properties of its storage devices to define a logical storage unit, the file. Files are mapped, by the operating system, onto physical devices. A file is a collection of related information defined by its creator. Commonly, files represent programs (both source and object forms) and data. Data files may be numeric, alphabetic or alphanumeric. Files may be free-form, such as text files, or may be rigidly formatted. In general a file is a sequence of bits, bytes, lines or records whose meaning is defined by its creator and user. It is a very general concept.
The operating system implements the abstract concept of the file by managing mass storage devices, such as tapes and disks. Also, files are normally organised into directories to ease their use. Finally, when multiple users have access to files, it may be desirable to control by whom and in what ways files may be accessed.
The operating system is responsible for the following activities in connection with file management:
(a) The creation and deletion of files;
(b) The creation and deletion of directories;
(c) The support of primitives for manipulating files and directories;
(d) The mapping of files onto disk storage;
(e) Backup of files on stable (non volatile) storage; and
(f) Protection and security of the files.
1.8.6. Protection
The various processes in an operating system must be protected from each other's activities. For that purpose, various mechanisms are used to ensure that the files, memory segments, CPU and other resources can be operated on only by those processes that have gained proper authorisation from the operating system.
For example, memory addressing hardware ensures that a process can only execute within its own address space. The timer ensures that no process can gain control of the CPU without relinquishing it. Finally, no process is allowed to do its own I/O, to protect the integrity of the various peripheral devices. Protection refers to a mechanism for controlling the access of programs, processes, or users to the resources defined by a computer system. This mechanism must provide means for the specification of the controls to be imposed, together with some means of enforcement.
Protection can improve reliability by detecting latent errors at the interfaces between component
subsystems. Early detection of interface errors can often prevent contamination of a healthy
subsystem by a subsystem that is malfunctioning. An unprotected resource cannot defend against
use (or misuse) by an unauthorised or incompetent user.
1.8.7. Networking
A distributed system is a collection of processors that do not share memory or a clock. Instead,
each processor has its own local memory, and the processors communicate with each other
through various communication lines such as high speed buses or telephone lines. Distributed
systems vary in size and function. They may involve microprocessors, workstations,
minicomputers and large general purpose computer systems.
The processors in the system are connected through a communication network, which can be configured in a number of different ways. The network may be fully or partially connected. The communication network design must consider routing and connection strategies and the problems of contention and security.
A distributed system provides the user with access to the various resources the system maintains.
Access to a shared resource allows computation speed-up, data availability and reliability.
Figure 1.2 depicts the role of the operating system in coordinating all the functions.
Figure 1.2: Functions coordinated by the operating system
ACTIVITY 1.5
1. Mention the advantages and limitations of multiuser operating systems.
2. What is a multitasking system? Mention its advantages.
3. Illustrate a simple operating system for a security control system.
1.9 Summary
This unit presented the principal operations of an operating system. We briefly described the history, the generations and the types of operating systems. An operating system is a program that acts as an interface between a user of a computer and the computer hardware. The purpose of an operating system is to provide an environment in which a user may execute programs. The primary goal of an operating system is to make the computer convenient to use, and the secondary goal is to use the hardware in an efficient manner.
Operating systems may be classified by both how many tasks they can perform
„simultaneously” and by how many users can be using the system „simultaneously”. That
is: single-user or multi-user and single-task or multi-tasking. A multi-user system must
clearly be multi-tasking. In the next unit we will discuss the concept of processes and
their management by the OS.
KEY TERMS
Application program, Central processing unit (CPU), Disk operating system (DOS), Embedded device, Macintosh, Memory, Multiprogramming, Multiprocessing, Multi-user, Operating system, Real time operating system (RTOS), Single task, Single user, Time sharing, UNIX, Windows
REFERENCES
Deitel, H. M. (1984). An introduction to operating systems. Reading, MA: Addison-Wesley Publishing Company.
Dhamdhere, D. M. (2006). Operating systems: A concept-based approach. New Delhi: Tata McGraw-Hill.
Madnick, S., & Donovan, J. (1974). Operating systems: Concepts and design. New York: McGraw-Hill International Education.
Milenkovic, M. (2000). Operating systems: Concept and design. New York: McGraw-Hill International Education.
Silberschatz, A., & Peterson, J. L. (1985). Operating system concepts. Reading, MA: Addison-Wesley Publishing Company.
Tanenbaum, A. S., & Woodhull, A. S. (2009). Operating system design and implementation. UK: Pearson.
Unit 2 Processes
2.0. Introduction
In the earlier unit we studied the overview and the functions of an operating system. In this unit we will have a detailed discussion of processes and their management by the operating system. The other resource management features of operating systems will be discussed in the subsequent units. The CPU executes a large number of programs. While its main concern is the execution of user programs, the CPU is also needed for other system activities. These activities are called processes. A process is a program in execution. Typically, a batch job is a process. A time-shared user program is a process. A system task, such as spooling, is also a process. For now, a process may be considered as a job or a time-shared program, but the concept is actually more general.
In general, a process will need certain resources such as CPU time, memory, files, I/O devices,
etc, to accomplish its task. These resources are given to the process when it is created. In addition
to the various physical and logical resources that a process obtains when it is created, some
initialisation data (input) may be passed along. For example, a process whose function is to
display the status of a file, say F1, on the screen will get the name of the file F1 as an input and
execute the appropriate program to obtain the desired information.
We emphasise that a program by itself is not a process; a program is a passive entity, while a process is an active entity. Two processes may be associated with the same program; they are nevertheless considered two separate execution sequences. A process is the unit of work in a system. Such a system consists of a collection of processes, some of which are operating system processes (those that execute system code) and the rest of which are user processes (those that execute user code). All of these processes can potentially execute concurrently.
The operating system is responsible for the following activities in connection with process management:
(a) The creation and deletion of both user and system processes;
(b) The suspension and resumption of processes;
(c) The provision of mechanisms for process synchronisation; and
(d) The provision of mechanisms for deadlock handling.
We will learn the operating system view of the processes, types of schedulers, different types of
scheduling algorithms, in the subsequent sections of this topic.
2.2. The Concept of Process
The term „process” was first used by the operating system designers of the MULTICS system in the 1960s. There are different definitions to explain the concept of a process. Some of these are, a process is:
(a) An instance of a program in execution;
(b) An asynchronous activity;
(c) The “animated spirit” of a procedure;
(d) The “locus of control” of a procedure in execution;
(e) The “dispatchable” unit; and
(f) Unit of work individually schedulable by an operating system.
Formally, we can define a process as an executing program, including the current values of the program counter, registers and variables. The subtle difference between a process and a program is that the program is a group of instructions whereas the process is the activity.
The operating system is the computer system software that assists the hardware in performing process management functions. The operating system keeps track of all the active processes and allocates system resources to them according to policies devised to meet design performance objectives. To meet process requirements the OS must maintain many data structures efficiently. The process abstraction is a fundamental means for the OS to manage concurrent program execution. The OS must interleave the execution of a number of processes to maximise processor use while providing reasonable response time. It must allocate resources to processes in conformance with a specific policy. In general, a process will need certain resources such as CPU time, memory, files and I/O devices to accomplish its tasks. These resources are allocated to the process when it is created. A single processor may be shared among several processes, with some scheduling algorithm being used to determine when to stop work on one process and provide service to a different one; we will discuss this later in this topic.
Operating systems must provide some way to create all the processes needed. In simple systems, it may be possible to have all the processes that will ever be needed be present when the system comes up. In almost all systems, however, some way is needed to create and destroy processes as needed during operation. In UNIX, for instance, processes are created by the fork system call, which makes an identical copy of the calling process. In other systems, system calls exist to create a process, load its memory, and start it running. In general, processes need a way to create other processes. Each process has one parent process, but may have zero, one, two or more child processes.
A distributed computing network server can handle multiple concurrent client sessions by dedicating an individual task to each active client session.
Consider, as an example, a pipeline of two processes in which the first process runs cat to concatenate and output three files while the second process runs grep to search that output. Depending on the relative speed of the two processes, it may happen that grep is ready to run, but there is no input waiting for it. It must then block until some input is available. It is also possible for a process that is ready and able to run to be blocked because the operating system has decided to allocate the CPU to another process for a while.
A process may be in one of the following states:
(a) New
The process is being created.
(b) Ready
The process is waiting to be assigned to a processor.
(c) Running
Instructions are being executed.
(d) Waiting/Suspended/Blocked
The process is waiting for some event to occur.
(e) Halted/Terminated
The process has finished execution.
The transitions between the process states are shown in Figure 2.1 and the corresponding transitions are described below.
As shown in Figure 2.1, four transitions are possible among the states. Transition 1 occurs when a process discovers that it cannot continue. In order to get into the blocked state, in some systems the process must execute a system call such as block. In other systems, when a process reads from a pipe or special file and there is no input available, the process is blocked automatically.
Transitions 2 and 3 are caused by the process scheduler, a part of the operating system. Transition 2 occurs when the scheduler decides that the running process has run long enough, and it is time to let another process have some CPU time. Transition 3 occurs when all other processes have had their share and it is time for the first process to run again.
Transition 4 occurs when the external event for which a process was waiting happens. If no other process is running at that instant, transition 3 will be triggered immediately, and the process will start running. Otherwise, it may have to wait in the ready state for a little while until the CPU is available.
Using the process model, it becomes easier to think about what is going on inside the system. There are many processes, like user processes, disk processes, terminal processes and so on, which may be blocked while they are waiting for something to happen. When the awaited event occurs (for example, the expected input is typed), the process waiting for it is unblocked and is ready to run again. The process model, an integral part of an operating system, can be summarised as follows: the lowest level of the operating system is the scheduler, with a number of processes on top of it. These aspects are studied in the subsequent sections.
To implement the process model, the operating system maintains a table (an array of structures) called the Process Table or Process Control Block (PCB) or Switch Frame. Each entry identifies a process with information such as the process state, its program counter, stack pointer, memory allocation, the status of its open files, and its accounting and scheduling information. In other words, it must contain everything about the process that must be saved when the process is switched from the running state to another state. The following is the information stored in a PCB:
(a) Process state and process number, together with the program counter, which indicates the address of the next instruction to be executed for this process;
(b) CPU registers, which vary in number and type depending on the concrete microprocessor architecture;
(c) Memory management information, which includes base and bounds registers or the page table;
(d) I/O status information, comprising outstanding I/O requests, I/O devices allocated to this process, a list of open files and so on;
(e) Processor scheduling information, which includes the process priority, pointers to scheduling queues and any other scheduling parameters; and
(f) List of open files.
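As a rough illustration only, the PCB can be pictured as a C structure; the field names and sizes below are hypothetical and not taken from any particular operating system:

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_OPEN_FILES 16

    enum proc_state { NEW, READY, RUNNING, WAITING, TERMINATED };

    /* Hypothetical process control block: one entry per process in the
       process table, holding what must be saved when the process is switched. */
    struct pcb {
        int             pid;                        /* process number                  */
        enum proc_state state;                      /* new, ready, running, ...        */
        uint64_t        program_counter;            /* address of the next instruction */
        uint64_t        registers[16];              /* saved CPU registers             */
        uint64_t        base, bounds;               /* memory management information   */
        int             priority;                   /* processor scheduling information */
        int             open_files[MAX_OPEN_FILES]; /* I/O status information          */
    };

    int main(void)
    {
        struct pcb p = {0};
        p.pid = 42;
        p.state = READY;
        printf("process %d is in state %d\n", p.pid, p.state);
        return 0;
    }

A real PCB would also record accounting data and pending I/O requests, as described in the list above.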
Context Switch
A context switch (also sometimes referred to as a process switch or a task switch) is the switching of the CPU (central processing unit) from one process to another. A context is the contents of a CPU's registers and program counter at any point in time. A context switch is sometimes described as the kernel suspending execution of one process on the CPU and resuming execution of some other process that had previously been suspended.
Let us understand this with the help of an example. Suppose two processes A and B are in the ready queue. If the CPU is executing Process A and Process B is in the wait state, and an interrupt occurs for Process A, the operating system suspends the execution of the first process, stores the current information of Process A in its PCB and switches context to the second process, namely Process B. In doing so, the program counter from the PCB of Process B is loaded, and thus execution can continue with the new process. The switching between the two processes, Process A and Process B, is illustrated in Figure 2.3.
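The essence of a context switch can be sketched in C as saving the running process's context into its PCB and loading the next one's; the simplified structure and functions below are purely illustrative, since a real switch is performed inside the kernel, partly in assembly:

    #include <stdio.h>

    /* Simplified "context": just a program counter and one register. */
    struct pcb {
        int pid;
        int program_counter;
        int reg0;
    };

    /* Hardware state used by whichever process is currently running. */
    static int cpu_pc, cpu_reg0;

    static void context_switch(struct pcb *curr, struct pcb *next)
    {
        curr->program_counter = cpu_pc;    /* save the context of the running process */
        curr->reg0            = cpu_reg0;
        cpu_pc   = next->program_counter;  /* load the context of the next process    */
        cpu_reg0 = next->reg0;
        printf("switched from process %d to process %d\n", curr->pid, next->pid);
    }

    int main(void)
    {
        struct pcb a = {1, 100, 7};
        struct pcb b = {2, 200, 9};

        cpu_pc   = a.program_counter;      /* process A is running */
        cpu_reg0 = a.reg0;
        context_switch(&a, &b);            /* interrupt: switch to process B */
        return 0;
    }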
2.2.5. Process Hierarchy
Modern general purpose operating systems permit a user to create and destroy processes. A process may create several new processes during its time of execution. The creating process is called the parent process, while the new processes are called child processes. There are different possibilities concerning new processes:
(a) Execution
The parent process continues to execute concurrently with its children processes, or it waits until all of its children processes have terminated (sequential).
(b) Sharing
Either the parent and children processes share all resources (like memory or files), or the children processes share only a subset of their parent's resources, or the parent and children processes share no resources in common.
A parent process can terminate the execution of one of its children for one of these reasons:
(a) The child process has exceeded its usage of the resources it has been allocated. In order to
do this, a mechanism must be available to allow the parent process to inspect the state of
its children processes.
(b) The task assigned to the child process is no longer required.
Let us discuss this concept with an example. In UNIX this is done by the fork system call, which creates a child process, and the exit system call, which terminates the current process. After a fork both parent and child keep running (indeed they have the same program text) and each can fork off other processes. This results in a process tree. The root of the tree is a special process created by the OS during startup. A process can choose to wait for its children to terminate. For example, if C issued a wait() system call it would block until G finished. This is shown in Figure 2.4.
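A minimal sketch of this parent/child relationship (POSIX environment assumed) is given below; the parent forks a child, the child exits, and the parent blocks in wait until the child has terminated:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();                     /* create a child process */

        if (pid == 0) {
            printf("child %d running\n", (int)getpid());
            exit(0);                            /* the child terminates via exit */
        } else if (pid > 0) {
            int status;
            wait(&status);                      /* the parent blocks until the child terminates */
            printf("parent %d: child %d has terminated\n", (int)getpid(), (int)pid);
        } else {
            perror("fork");
        }
        return 0;
    }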
Old or primitive operating systems like MS-DOS are not multiprogrammed, so when one process starts another, the first process is automatically blocked and waits until the second has finished.
2.2.6. Threads
Threads, sometimes called Light-Weight Processes (LWPs), are independently scheduled parts of a single program. We say that a task is multithreaded if it is composed of several independent sub-processes which do work on common data, and if each of those pieces could (at least in principle) run in parallel.
If we write a program which uses threads, there is only one program, one executable file, one task in the normal sense. Threads simply enable us to split up that program into logically separate pieces and have the pieces run independently of one another, until they need to communicate. In a sense, threads are a further level of object orientation for multitasking systems. They allow certain functions to be executed in parallel with others.
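For instance, on systems that provide POSIX threads, two functions of one program can be run as separate threads roughly as sketched below (compile with -lpthread; the two worker functions are invented purely for illustration):

    #include <pthread.h>
    #include <stdio.h>

    /* Two independent pieces of work inside one program. */
    static void *count_words(void *arg)
    {
        (void)arg;
        printf("counting words\n");
        return NULL;
    }

    static void *index_pages(void *arg)
    {
        (void)arg;
        printf("indexing pages\n");
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, count_words, NULL);  /* start both threads */
        pthread_create(&t2, NULL, index_pages, NULL);
        pthread_join(t1, NULL);                        /* wait until both have finished */
        pthread_join(t2, NULL);
        return 0;
    }

Both functions belong to the same executable and share its global data; only their flows of control are separate.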
On a truly parallel computer (several CPUs) we might imagine parts of a program (different
subroutines) running on quite different processors, until they need to communicate. When one
part of the program needs to send data to the other part, the two independent pieces must be
synchronised, or be made to wait for one another. But what is the point of this? We can always run independent procedures in a program as separate programs, using the process mechanisms we have already introduced. They could communicate using normal interprocess communication. Why introduce another new concept? Why do we need threads?
The point is that threads are cheaper than normal processes, and that they can be scheduled for execution in a user-dependent way, with less overhead. Threads are cheaper than a whole process because they do not have a full set of resources each. Whereas the process control block for a heavyweight process is large and costly to context switch, the PCBs for threads are much smaller, since each thread has only a stack and some registers to manage. It has no open file lists or resource lists, and no accounting structures to update. All of these resources are shared by all threads within the process. Threads can be assigned priorities - a higher priority thread will get put to the front of the queue. Let us compare heavyweight and lightweight processes with the help of Table 2.1.
Threads make it possible to organise the execution of a program in such a way that something is always being done, whenever the scheduler gives the heavyweight process CPU time.
(a) Threads allow a programmer to switch between lightweight processes when it is best for the program. (The programmer has control.)
(b) A process which uses threads does not get more CPU time than an ordinary process - but the CPU time it gets is used to do work on the threads. It is possible to write a more efficient program by making use of threads.
(c) Inside a heavyweight process, threads are scheduled on a FCFS basis, unless the program decides to force certain threads to wait for other threads. If there is only one CPU, then only one thread can be running at a time.
(d) Threads context switch without any need to involve the kernel - the switching is performed by a user level library, so time is saved because the kernel doesn't need to know about the threads.
If the kernel itself is multithreaded, the scheduler assigns CPU time on a thread basis rather than on a process basis. A kernel level thread behaves like a virtual CPU, or a power point to which user processes can connect in order to get computing power. The kernel has as many system level threads as it has CPUs, and each of these must be shared between all of the user threads on the system.
Since threads work “inside” a single task, the normal process scheduler cannot normally tell which thread to run and which not to run - that is up to the program. When the kernel schedules a process for execution, it must then find out from that process which is the next thread it must execute. If the program is lucky enough to have more than one processor available, then several threads can be scheduled at the same time.
Some important implementations of threads are:
(a) The Mach System / OSF/1 (user and system level);
(b) Solaris 1 (user level);
(c) Solaris 2 (user and system level);
(d) OS/2 (system level only);
(e) NT threads (user and system level);
(f) IRIX threads; and
(g) POSIX standardised user threads interface.
As an example of how system calls are used, consider writing a simple program to read data from one file and copy it to another file. There are two names of two different files: one is the input file and the other is the output file. One approach is for the program to ask the user for the names of the two files. In an interactive system, this approach will require a sequence of system calls, first to write a prompting message on the screen and then to read from the keyboard the characters that make up the two file names. Once the two file names are obtained the program must open the input file and create the output file. Each of these operations requires another system call and may encounter possible error conditions. When the program tries to open the input file, it may find that no file of that name exists or that the file is protected against access. In these cases the program should print a message on the console and then terminate abnormally, which requires another system call. If the input file exists then we must create the new output file. We may find an output file with the same name. This situation may cause the program to abort, or we may delete the existing file and create a new one. After opening the files, we may enter a loop that reads from the input file and writes to the output file. Each read and write must return status information regarding various possible error conditions. Finally, after the entire file is copied, the program may close both files (a sketch of such a copy program is given after the list of calls below). Examples of some operating system calls are:
(i) Create
In response to the create call the operating system creates a new process with the specified or default attributes and identifier. Some of the parameters definable at process creation time include:
(i) Level of privilege, such as system or user;
(ii) Priority;
(iii) Size and memory requirements;
(iv) Maximum data area and/or stack size;
(v) Memory protection information and access rights; and
(vi) Other system dependent data.
(ii) Delete
The delete service is also called destroy, terminate or exit. Its execution causes the
operating system to destroy the designated process and remove it from the system.
(iii) Abort
It is used to terminate a process forcibly. Although a process could conceivably abort itself, the most frequent use of this call is for involuntary terminations, such as removal of a malfunctioning process from the system.
(iv) Fork/Join
Another method of process creation and termination is by means of the FORK/JOIN pair, originally introduced as primitives for multiprocessor systems. The FORK operation is used to split a sequence of instructions into two concurrently executable sequences. JOIN is used to merge the two sequences of code divided by the FORK, and it is available to a parent process for synchronisation with a child.
(v) Suspend
The suspend system call is also called BLOCK in some systems. The designated process is suspended indefinitely and placed in the suspended state. A process may suspend itself or another process when authorised to do so.
(vi) Resume
The resume system call is also called WAKEUP in some systems. This call resumes the target process, which is presumably suspended. Obviously a suspended process cannot resume itself, because a process must be running to have its operating system call processed. So a suspended process depends on a partner process to issue the resume.
(vii) Delay
The system call delay is also known as SLEEP. The target process is suspended for the
duration of the specified time period. The time may be expressed in terms of system clock
ticks that are system dependent and not portable or in standard time units such as seconds
and minutes. A process may delay itself or, optionally, delay some other process.
(viii) Get_Attributes
It is an enquiry to which the operating system responds by providing the current values of
the process attributes, or their specified subset, from the PCB.
(ix) Change Priority
It is an instance of a more general SET-PROCESS-ATTRIBUTES system call. Obviously,
this call is not implemented in systems where process priority is static.
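Below is a minimal sketch of the file-copy program described at the start of this section, using the POSIX open, read, write and close calls; error handling is reduced to the essentials, and the file names are taken from the command line rather than prompted for:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        char buf[4096];
        ssize_t n;

        if (argc != 3) {
            fprintf(stderr, "usage: %s input output\n", argv[0]);
            return 1;
        }

        int in = open(argv[1], O_RDONLY);                            /* open the input file */
        if (in < 0) { perror(argv[1]); return 1; }                   /* e.g. file not found */

        int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644); /* create the output file */
        if (out < 0) { perror(argv[2]); close(in); return 1; }

        while ((n = read(in, buf, sizeof buf)) > 0)                  /* copy loop: read ...   */
            if (write(out, buf, (size_t)n) != n) {                   /* ... and write         */
                perror("write");
                break;
            }

        close(in);                                                   /* close both files */
        close(out);
        return 0;
    }

Each call in the sketch (open, read, write, close) corresponds to one of the system calls discussed in the paragraph above; a failed call returns a negative value, which is where the error handling described earlier would hook in.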
ACTIVITY 2.1
1. Explain the difference between a process and a thread with some examples.
2. Identify the different states a live process may occupy and show how a process moves between these states.
3. Define what is meant by a context switch. Explain the reason many systems use two
levels of scheduling.
(b) Maximise Interactive Users
Maximise the number of interactive users receiving acceptable response times.
(c) Be Predictable
A given job should utilise the same amount of time and should cost the same regardless of
the load on the system.
(d) Minimise Overhead
Scheduling should minimise the wasted resources overhead.
(e) Balance Resource Use
The scheduling mechanisms should keep the resources of the system busy. Processes that
will use under utilised resources should be favoured.
(f) Achieve a Balance between Response and Utilisation
The best way to guarantee good response times is to have sufficient resources available whenever they are needed. In real time systems fast responses are essential and resource utilisation is less important.
(g) Enforce Priorities
In environments in which processes are given priorities, the scheduling mechanism should
favour the higher-priority processes.
(h) Give Preference to Processes Holding Key Resources
Even though a low priority process may be holding a key resource, the resource may be in demand by high priority processes. If the resource is not preemptible, then the scheduling mechanism should give the process better treatment than it would ordinarily receive so that the process will release the key resource sooner.
(i) Degrade Gracefully Under Heavy Loads
A scheduling mechanism should not collapse under heavy system load. Either it should prevent excessive loading by not allowing new processes to be created when the load is heavy, or it should provide service to the heavier load by providing a moderately reduced level of service to all processes.
(j) Avoid Indefinite Postponement
Scheduling should be fair: all processes should be treated the same, and no process should suffer indefinite postponement.
frequently when compared with the short-term scheduler. It controls the degree of multiprogramming (the number of processes in memory at a time). If the degree of multiprogramming is to be kept stable (say 10 processes at a time), the long-term scheduler may only need to be invoked when a process finishes execution. The long-term scheduler must select a good process mix of I/O-bound and processor-bound processes. If most of the processes selected are I/O-bound, then the ready queue will almost be empty, while the device queue(s) will be very crowded. If most of the processes are processor-bound, then the device queue(s) will almost be empty while the ready queue is very crowded, and that will cause the short-term scheduler to be invoked very frequently. Time-sharing systems (mostly) have no long-term scheduler. The stability of these systems either depends upon a physical limitation (the number of available terminals) or the self-adjusting nature of users (if you can't get a response, you quit). It can sometimes be good to reduce the degree of multiprogramming by removing processes from memory and storing them on disk.
(c) Medium Term Scheduler
A process that has been removed from memory and stored on disk in this way can later be reintroduced into memory by the medium-term scheduler. This operation is also known as swapping. Swapping may be necessary to improve the process mix or to free memory.
(a) Processor Utilisation
Processor Utilisation = (Processor Busy Time) / (Processor Busy Time + Processor Idle Time)
(b) Throughput
It refers to the amount of work completed in a unit of time. One way to measure throughput is by means of the number of processes that are completed in a unit of time. The higher the number of processes, the more work apparently is being done by the system. But this approach is not very useful for comparison, because it depends on the characteristics and resource requirements of the processes being executed. Therefore, to compare the throughput of several scheduling algorithms, each should be fed processes with similar requirements. The throughput can be calculated by using the formula:
Throughput = (No. of processes completed) / (Time unit)
(d) Waiting Time
The scheduling algorithm affects only the amount of time that a process spends waiting in the ready queue. Thus, rather than looking at turnaround time, it is often simpler to consider the waiting time for each process.
(e) Response Time
It is most frequently considered in time sharing and real time operating systems. However, its characteristics differ in the two systems. In a time sharing system it may be defined as the interval from the time the last character of a command line of a program or transaction is entered to the time the last result appears on the terminal. In a real time system it may be defined as the interval from the time an internal or external event is signalled to the time the first instruction of the respective service routine is executed.
One of the problems in designing schedulers and selecting a set of its performance criteria
is that they often conflict with each other. For example, the fastest response time in time
sharing and real time system may result in low CPU utilisation.
Throughput and CPU utilisation may be increased by executing a larger number of processes, but then response time may suffer. Therefore, the design of a scheduler usually requires a balance of all the different requirements and constraints. In the next section we will discuss various scheduling algorithms.
CPU scheduling deals with the problem of deciding which of the processes in the ready queue is to be allocated to the CPU. There are several scheduling algorithms which will be examined in this section. A major division among scheduling algorithms is whether they support a preemptive or a non-preemptive scheduling discipline.
(a) Preemptive Scheduling
Preemption means the operating system moves a process from running to ready without the process requesting it. An OS implementing such algorithms switches to the processing of a new request before completing the processing of the current request. The preempted request is put back into the list of pending requests. Its servicing would be resumed sometime in the future when it is scheduled again. Preemptive scheduling is more useful for high priority processes which require immediate response. For example, in a real time system the consequence of missing one interrupt could be dangerous.
Round Robin scheduling, priority based scheduling or event driven scheduling, and SRTN are considered to be preemptive scheduling algorithms.
(b) Non-Preemptive Scheduling
A scheduling discipline is non-preemptive if, once a process has been allotted the CPU, the CPU cannot be taken away from that process. A non-preemptive discipline always processes a scheduled request to its completion. In non-preemptive systems, short jobs are made to wait by longer jobs, but the treatment of all processes is fairer.
First Come First Served (FCFS) and Shortest Job First (SJF) are considered to be non-preemptive scheduling algorithms.
The decision whether to schedule preemptively or not depends on the environment and the type of application most likely to be supported by a given operating system.
ACTIVITY 2.2
1. Briefly explain the following Operating System calls:
a) Create
b) Fork/Join
c) Delay
2. Distinguish between a foreground and a background process in UNIX.
3. Identify the information which must be maintained by the operating system for each live
process.
FCFS tends to favour CPU-bound processes. Consider a system with a CPU-bound process and a number of I/O-bound processes. The I/O-bound processes will tend to execute briefly, then block for I/O. A CPU-bound process in the ready queue will not have to wait long before being made runnable. The system will frequently find itself with all the I/O-bound processes blocked and the CPU-bound process running. As the I/O operations complete, the ready queue fills up with the I/O-bound processes.
Under some circumstances, CPU utilisation can also suffer. In the situation described above, once
a CPU-bound process does issue an I/O request, the CPU can return to process all the I/O-bound
processes. If their processing completes before the CPU-bound process’s I/O completes, the CPU
sits idle. So with no preemption, component utilisation and the system throughput rate may be
quite low.
Example:
Calculate the turnaround time, waiting time, average turnaround time, average waiting time, throughput and processor utilisation for the given set of processes, which arrive at the arrival times shown in the table, with the length of processing time given in milliseconds:
Process  Arrival Time  Processing Time
P1       0             3
P2       2             3
P3       3             1
P4       5             4
P5       8             2
If the processes arrive as per the arrival times, the Gantt chart will be:
P1   P2   P3   P4   P5
0    3    6    7    11   13
Processor Utilisation = (13/13)*100 = 100%
Throughput = 5/13 = 0.38
Note : If all the processes arrive at time 0, then the order of scheduling will be P3, P5, P1, P2
and P4.
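The same FCFS calculation can be carried out programmatically; the short C sketch below recomputes the turnaround and waiting times for the five processes above, assuming they are already ordered by arrival time:

    #include <stdio.h>

    int main(void)
    {
        /* arrival and processing times of P1..P5 from the example above */
        int arrival[]    = {0, 2, 3, 5, 8};
        int processing[] = {3, 3, 1, 4, 2};
        int n = 5, clock = 0;
        double total_tat = 0, total_wait = 0;

        for (int i = 0; i < n; i++) {
            if (clock < arrival[i])          /* CPU idle until the process arrives */
                clock = arrival[i];
            clock += processing[i];          /* completion time under FCFS */
            int turnaround = clock - arrival[i];
            int waiting    = turnaround - processing[i];
            printf("P%d: turnaround = %d, waiting = %d\n", i + 1, turnaround, waiting);
            total_tat  += turnaround;
            total_wait += waiting;
        }
        printf("average turnaround = %.2f, average waiting = %.2f\n",
               total_tat / n, total_wait / n);
        return 0;
    }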
Example:
Consider the following set of processes, which arrived at the same time, with the following processing times given in milliseconds:
Process  Processing Time
P1       06
P2       08
P3       07
P4       03
Using SJF scheduling, because the shortest process gets executed first, the Gantt chart will be:
P4   P1   P3   P2
0    3    9    16   24
The shortest processing time is that of process P4, then process P1, then P3 and then process P2. The waiting time for process P1 is 3 ms, for process P2 it is 16 ms, for process P3 it is 9 ms and for process P4 it is 0 ms.
Processor Utilisation = (24/24)*100 = 100%
Throughput = 4/24 = 0.16
Example:
Consider the following set of processes, with the processing time given in milliseconds:
Process  Processing Time
P1       24
P2       03
P3       03
If we use Round Robin scheduling with a time quantum of 4 milliseconds, then process P1 gets the first 4 milliseconds. Since it requires another 20 milliseconds, it is preempted after the first time quantum, and the CPU is given to the next process in the queue, process P2. Since process P2 does not need 4 milliseconds, it quits before its time quantum expires. The CPU is then given to the next process, process P3. Once each process has received one time quantum, the CPU is returned to process P1 for an additional time quantum. The Gantt chart will be:
P1   P2   P3   P1   P1   P1   P1   P1
0    4    7    10   14   18   22   26   30
Process  Processing Time  Turnaround Time  Waiting Time
P1       24               30 - 0 = 30      30 - 24 = 6
P2       03               7 - 0 = 7        7 - 3 = 4
P3       03               10 - 0 = 10      10 - 3 = 7
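A small C sketch of this round robin behaviour is given below; it simulates the three processes above with a 4 ms quantum and prints each completion time (a simple array scan stands in for a real FIFO ready queue, which is sufficient for this example):

    #include <stdio.h>

    #define QUANTUM 4

    int main(void)
    {
        int remaining[] = {24, 3, 3};        /* remaining time of P1, P2, P3 */
        int n = 3, done = 0, clock = 0;

        while (done < n) {
            for (int i = 0; i < n; i++) {    /* cycle through the processes in order */
                if (remaining[i] == 0)
                    continue;
                int slice = remaining[i] < QUANTUM ? remaining[i] : QUANTUM;
                clock += slice;              /* run the process for one quantum (or less) */
                remaining[i] -= slice;
                if (remaining[i] == 0) {     /* the process has finished */
                    printf("P%d completes at time %d\n", i + 1, clock);
                    done++;
                }
            }
        }
        return 0;
    }

Running it reproduces the completion times in the table above: P2 at 7, P3 at 10 and P1 at 30.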
Shortest Remaining Time Next (SRTN) is the preemptive version of Shortest Job First. It permits a process that enters the ready list to preempt the running process if the time for the new process (or for its next burst) is less than the remaining time for the running process (or for its current burst). Let us understand this with the help of an example.
Example:
Consider the set of four processes that arrive at the times described in the table:
Process  Arrival Time  Processing Time
P1       0             5
P2       1             2
P3       2             5
P4       3             3
At time 0, only process P1 has entered the system, so it is the process that executes. At time 1, process P2 arrives. At that time, process P1 has 4 time units left to execute. At this juncture, process P2's processing time is less than P1's remaining time (4 units), so P2 starts executing at time 1. At time 2, process P3 enters the system with a processing time of 5 units. Process P2 continues executing as it has the minimum remaining time when compared with P1 and P3. At time 3, process P2 terminates and process P4 enters the system. Of the processes P1, P3 and P4,
P4 has the smallest remaining execution time so it starts executing. When process P1 terminates
at time 10, process P3 executes. The Gantt chart is shown below:
P1 P2 P4 P1 P3
0 1 3 6 10 15
The turnaround time for each process can be computed by subtracting its arrival time from the time it terminated:
Turnaround Time = t(Process Completed) - t(Process Submitted)
The turnaround time for each of the processes is:
P1: 10 - 0 = 10
P2: 3 - 1 = 2
P3: 15 - 2 = 13
P4: 6 - 3 = 3
The waiting time is the turnaround time minus the processing time:
P1: 10 - 5 = 5
P2: 2 - 2 = 0
P3: 13 - 5 = 8
P4: 3 - 3 = 0
In priority scheduling, a priority is associated with each process, and the scheduler always picks the highest priority process for execution from the ready queue. Equal priority processes are scheduled FCFS. The level of priority may be determined on the basis of resource requirements, process characteristics and run time behaviour.
A major problem with priority based scheduling is indefinite blocking of a low priority process by high priority processes. In general, completion of a process within finite time cannot be guaranteed with this scheduling algorithm. A solution to the problem of indefinite blockage of low priority processes is provided by aging. Aging is a technique of gradually increasing the priority of processes (of low priority) that wait in the system for a long time. Eventually, the older processes attain high priority and are ensured of completion in finite time.
Example:
Consider the following set of five processes, assumed to have arrived at the same time, with the length of processor time given in milliseconds and with priorities as shown (a smaller number indicates a higher priority):
Process  Processing Time  Priority
P1       10               3
P2       01               1
P3       02               4
P4       01               5
P5       05               2
Using priority scheduling we would schedule these processes according to the following Gantt chart:
P2   P5   P1   P3   P4
0    1    6    16   18   19
Priorities can be defined either internally or externally. Internally defined priorities use some measurable quantity or quantities to compute the priority of a process.
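To see how aging counteracts indefinite blocking, the fragment below sketches one possible aging policy: on every scheduler pass the priority value of each waiting process is decreased slightly (a smaller value meaning higher priority), so a long-waiting process eventually reaches the top. The structure and the decrement step are illustrative assumptions, not a prescribed algorithm from this text:

    #include <stdio.h>

    struct proc {
        int pid;
        int priority;   /* smaller value = higher priority */
        int waiting;    /* 1 if the process is sitting in the ready queue */
    };

    /* One possible aging step: every waiting process becomes slightly
       more urgent each time the scheduler runs. */
    static void age_priorities(struct proc *p, int n)
    {
        for (int i = 0; i < n; i++)
            if (p[i].waiting && p[i].priority > 0)
                p[i].priority--;
    }

    int main(void)
    {
        struct proc table[] = { {1, 3, 1}, {2, 1, 1}, {3, 5, 1} };

        for (int tick = 0; tick < 3; tick++)   /* three scheduler passes */
            age_priorities(table, 3);

        for (int i = 0; i < 3; i++)
            printf("P%d priority is now %d\n", table[i].pid, table[i].priority);
        return 0;
    }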
SELF-CHECK 2.2
2.6. Performance Evaluation of the Scheduling Algorithms
Performance of an algorithm for a given set of processes can be analysed if the appropriate information about the processes is provided. But how do we select a CPU-scheduling algorithm for a particular system? There are many scheduling algorithms, so the selection of an algorithm for a particular system can be difficult. To select an algorithm there are some specific criteria, such as:
(a) Maximise CPU utilisation while keeping the maximum response time within acceptable limits; and
(b) Maximise throughput.
Example:
Assume that we have the following five processes, arrived at time 0 in the order given, with the length of CPU time given in milliseconds:

Process    Processing Time
P1         10
P2         29
P3         03
P4         07
P5         12

First consider the FCFS scheduling algorithm for the set of processes. For FCFS scheduling the Gantt chart will be:

P1 | P2 | P3 | P4 | P5
0    10   39   42   49   61

Next consider the SJF (non-preemptive) scheduling algorithm; its Gantt chart will be:

P3 | P4 | P1 | P5 | P2
0    3    10   20   32   61

The waiting time of each process under SJF is then:

Process    Processing Time    Waiting Time
P1         10                 10
P2         29                 32
P3         03                 00
P4         07                 03
P5         12                 20
Now consider the Round Robin scheduling algorithm with a quantum of 10 milliseconds. The Gantt chart will be:

P1 | P2 | P3 | P4 | P5 | P2 | P5 | P2
0    10   20   23   30   40   50   52   61

Now if we compare the average waiting times of the above algorithms (FCFS: (0 + 10 + 39 + 42 + 49)/5 = 28 ms; SJF: (10 + 32 + 0 + 3 + 20)/5 = 13 ms; RR: 23 ms), we see that the SJF policy results in less than one half of the average waiting time obtained with FCFS scheduling, while the RR algorithm gives an intermediate value. So the performance of an algorithm can be measured when all the necessary information is provided.
ACTIVITY 2.3
Process Processing Time
P1 13
P2 08
P3 83
4. For the given five processes arriving at time 0, in the order given, with the length of CPU time in milliseconds:
Process    Processing Time
P1         10
P2         29
P3         03
P4         07
P5         12
Consider the FCFS, SJF and RR (time slice = 10 milliseconds) scheduling algorithms for the above set of processes. Which algorithm would give the minimum average waiting time?
2.7. Summary
A process is an instance of a program in execution. It is an important concept in modern operating systems.
Processes provide a suitable means for informing the operating system about independent activities that may be scheduled for concurrent execution. Each process is represented by a Process Control Block (PCB).
Several PCBs can be linked together to form a queue of waiting processes. The selection and allocation of processes is done by a scheduler. There are several scheduling algorithms.
We have discussed the FCFS, SJF, RR, SRT and priority algorithms, along with their performance evaluation.
KEY TERMS
REFERENCES
Deitel, H. M. (1984). An introduction to operating systems. Reading, MA: Addison-Wesley Publishing Company.
Dhamdhere, D. M. (2006). Operating systems: A concept-based approach. New Delhi: Tata McGraw-Hill.
Madnick, S., & Donovan, J. (1974). Operating systems - Concepts and design. New York: McGraw-Hill International Education.
Milenkovic, M. (2000). Operating systems: Concept and design. New York: McGraw-Hill International Education.
Silberschatz, A., & Peterson, J. L. (1985). Operating system concepts. Reading, MA: Addison-Wesley Publishing Company.
Tanenbaum, A. S., & Woodhull, A. S. (2009). Operating system design and implementation. UK: Pearson.
Unit 3 Interprocess Communication and Synchronisation
3.0 Introduction
In the earlier unit we have studied the concept of processes. In addition to process scheduling,
another important responsibility of the operating system is process synchronisation.
Synchronisation involves the orderly sharing of system resources by processes.
A simple batch operating system can be viewed as three processes - a reader process, an executor
process and a printer process. The reader reads cards from the card reader and places card images in an input buffer. The executor process reads card images from the input buffer, performs the specified computation and stores the results in an output buffer. The printer process retrieves the data from the output buffer and writes them to a printer. Concurrent processing is the basis of an operating system which supports multiprogramming.
The operating system supports concurrent execution of programs without necessarily supporting elaborate forms of memory and file management. This form of operation is also known as multitasking. One of the benefits of multitasking is that several processes can be made to cooperate in order to achieve their goals. To do this, they must do one of the following:
(a) Communicate
Interprocess communication (IPC) involves sending information from one process to another. This can be achieved using a "mailbox" system, a socket which behaves like a virtual communication network (loopback), or through the use of "pipes". Pipes are a system construct which enables one process to open another process as if it were a file for writing or reading (a minimal pipe sketch is given after this list).
(b) Share Data
A segment of memory must be available to both the processes. (Most memory is locked to
a single process).
(c) Waiting
Some processes wait for other processes to give a signal before continuing. This is an issue of synchronisation.
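As a minimal illustration of the pipe mechanism mentioned in item (a) above (this sketch is not taken from the text), the following C program lets a parent process write a message that its child then reads, much as if the pipe were a file opened by both:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];                       /* fd[0] = read end, fd[1] = write end */
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    if (fork() == 0) {               /* child: the receiving process */
        char buf[32];
        close(fd[1]);
        ssize_t n = read(fd[0], buf, sizeof(buf) - 1);
        buf[n > 0 ? n : 0] = '\0';
        printf("child received: %s\n", buf);
        _exit(0);
    }
    close(fd[0]);                    /* parent: the sending process */
    write(fd[1], "hello", 6);
    close(fd[1]);
    wait(NULL);
    return 0;
}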
Synchronisation is often necessary when processes communicate. Processes are executed with unpredictable speeds, yet to communicate one process must perform some action, such as setting the value of a variable or sending a message, that the other detects. This only works if the action and its detection are constrained to happen in the correct order. Thus one can view synchronisation as a set of constraints on the ordering of events. The programmer employs a synchronisation mechanism to delay execution of a process in order to satisfy such constraints.
In this unit, let us study the concept of interprocess communication and synchronisation, the need for semaphores, classical problems in concurrent processing, critical regions, monitors and message passing.
Example:
Consider a machine with a single printer running a time-sharing operation system. If a process
needs to print its results, it must request that the operating system gives it access to the printer’s
device driver. At this point, the operating system must decide whether to grant this request,
depending upon whether the printer is already being used by another process. If it is not, the
operating system should grant the request and allow the process to continue; otherwise, the
operating system should deny the request and perhaps classify the process as a waiting process
until the printer becomes available. Indeed, if two processes were given simultaneous access to
the machine’s printer, the results would be worthless to both.
Consider the following related definitions to understand the example in a better way:
(a) Critical Resource
It is a resource shared with constraints on its use (e.g., memory, files, printers, etc.).
(b) Critical Section
It is the code that accesses a critical resource.
(c) Mutual Exclusion
At most one process may be executing a critical section with respect to a particular critical resource at any one time.
In the example given above, the printer is the critical resource. Let's suppose that the processes which are sharing this resource are called process A and process B. The critical sections of process A and process B are the sections of the code which issue the print command. In order to ensure that both processes do not attempt to use the printer at the same time, they must be granted mutually exclusive access to the printer driver.
First we consider the interprocess communication part. There exist two complementary inter-
process communication types:
(a) Shared-memory system; and
(b) Message-passing system.
It is clear that these two schemes are not mutually exclusive, and could be used simultaneously
within a single operating system.
A critical problem occurring in shared-memory system is that two or more processes are reading
or writing some shared variables or shared data, and the final results depend on who runs
precisely and when. Such situations are called race conditions. In order to avoid race conditions
we must find some way to prevent more than one process from reading and writing shared
variables or shared data at the same time, i.e., we need the concept of mutual exclusion (which we will discuss in a later section). It must be ensured that if one process is using a shared variable, the other process will be excluded from doing the same thing.
The function of a message-passing system is to allow processes to communicate with each other
without the need to resort to shared variable. An interprocess communication facility basically
provides two operations: send (message) and receive (message). In order to send and to receive
messages, a communication link must exist between two involved processes. This link can be
implemented in different ways. The possible basic implementation questions are:
(a) How are links established?
(b) Can a link be associated with more than two processes?
(c) How many links can there be between every pair of processes?
(d) What is the capacity of a link? That is, does the link have some buffer space? If so, how
much?
(e) What is the size of the message? Can the link accommodate variable size or fixed-size
message?
(f) Is the link unidirectional or bi-directional?
In the following, we consider several methods for logically implementing a communication link
and the send/receive operations. These methods can be classified into two categories:
(a) Naming
It consists of direct and indirect communication.
(i) Direct Communication
In direct communication, each process that wants to send or receive a message must explicitly name the recipient or sender of the communication. In this case, the send and receive primitives are defined as follows:
Send (P, message). To send a message to process P.
Receive (Q, message). To receive a message from process Q.
This scheme exhibits symmetry in addressing, i.e., both the sender and the receiver have to name one another in order to communicate. In contrast to this, asymmetry in addressing can be used, i.e., only the sender has to name the recipient; the recipient is not required to name the sender. In that case the send and receive primitives are defined as follows:
Send (P, message). To send a message to process P.
Receive (id, message). To receive a message from any process; id is set to the name of the process with which the communication has taken place.
(ii) Indirect Communication
With indirect communication, the messages are sent to, and received from a mailbox. A
mailbox can be abstractly viewed as an object into which messages may be placed and
from which messages may be removed by processes. In order to distinguish one from the
other, each mailbox owns a unique identification. A process may communicate with some
other process by a number of different mailboxes. The send and receive primitives are
defined as follows (a mailbox sketch using POSIX message queues is given after this classification):
Send (A, message). To send a message to mailbox A.
Receive (A, message). To receive a message from mailbox A.
In addition, the operating system provides mechanisms that allow a process to:
a. Create a new mailbox, and send and receive messages through the mailbox; and
b. Destroy a mailbox.
Since all processes with access rights to a mailbox may terminate, a mailbox may
no longer be accessible by any process after some time. In this case, the operating
system should reclaim whatever space was used for the mailbox.
(b) Buffering
(ii) Messages
Messages sent by a process may be one of three varieties: a) fixed-sized, b) variable-
sized and c) typed messages. If only fixed-sized messages can be sent, the physical
implementation is straightforward. However, this makes the task of programming
more difficult. On the other hand, variable-size messages require more complex
physical implementation, but the programming becomes simpler. Typed messages, i.e,
associating a type with each mailbox, are applicable only to indirect communication.
The messages that can be sent to, and received from a mailbox are restricted to the
designated type.
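The mailbox abstraction maps fairly directly onto POSIX message queues. The sketch below is only illustrative (the queue name /mailbox_A, the message sizes and the single-process usage are assumptions), but it shows the create, send, receive and destroy lifecycle described above:

#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    struct mq_attr attr = { .mq_flags = 0, .mq_maxmsg = 10,
                            .mq_msgsize = 64, .mq_curmsgs = 0 };
    mqd_t a = mq_open("/mailbox_A", O_CREAT | O_RDWR, 0644, &attr);  /* create mailbox A */
    if (a == (mqd_t)-1) { perror("mq_open"); return 1; }

    const char *msg = "hello";
    mq_send(a, msg, strlen(msg) + 1, 0);          /* Send(A, message)    */

    char buf[64];
    mq_receive(a, buf, sizeof(buf), NULL);        /* Receive(A, message) */
    printf("received: %s\n", buf);

    mq_close(a);
    mq_unlink("/mailbox_A");                      /* destroy the mailbox */
    return 0;
}

(On Linux this is linked with -lrt.)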
ACTIVITY 3.1
3.2 Interprocess Synchronisation
When two or more processes work on the same data simultaneously, strange things can happen. For example, when two parallel threads attempt to update the same variable simultaneously, the result is unpredictable. The value of the variable afterwards depends on which of the two threads was the last one to change the value. This is called a race condition: the value depends on which of the threads wins the race to update the variable. What we need in a multitasking system is a way of making such situations predictable. This is called serialisation. Let us study the serialisation concept in detail in the next section.
3.2.1 Serialisation
The key idea in process synchronisation is serialisation. This means that we have to go to some pains to undo the work we have put into making an operating system perform several tasks in parallel. As we mentioned in the case of print queues, parallelism is not always appropriate.
Synchronisation is a large and difficult topic, so we shall only undertake to describe the problem and some of the principles involved here.
There are essentially two strategies for serialising processes in a multitasking environment:
(a) The scheduler can be disabled for a short period of time, to prevent control being given to another process during a critical action like modifying shared data. This method is very inefficient on multiprocessor machines, since all other processors have to be halted every time one process wishes to execute a critical section.
(b) A protocol can be introduced which all programs sharing data must obey. The protocol ensures that processes have to queue up to gain access to shared data. Processes which ignore the protocol do so at their own peril (and the peril of the remainder of the system!). This method works on multiprocessor machines also, though it is more difficult to visualise. The responsibility of serialising important operations falls on programmers. The OS cannot impose any restrictions on silly behaviour - it can only provide tools and mechanisms to assist the solution of the problem.
Consider the following examples: two processes sharing a printer must take turns using it; if they attempt to use it simultaneously, the output from the two processes will be mixed into an arbitrary jumble which is unlikely to be of any use. Two processes attempting to update the same bank account must take turns; if each process reads the current balance from some database, updates it, and writes it back, one of the updates will be lost.
Both of the above examples can be solved if there is some way for each process to exclude the other from using the shared object during critical sections of code. Thus, the general problem is described as the mutual exclusion problem. The mutual exclusion problem was recognised (and successfully solved) as early as 1963 in the Burroughs AOSP operating system, but the problem was sufficiently difficult that it was not widely understood for some time after that. A significant number of attempts to solve the mutual exclusion problem have suffered from two specific problems: the lockout problem, in which a subset of the processes can conspire to indefinitely lock some other process out of a critical section, and the deadlock problem, where two or more processes simultaneously trying to enter a critical section lock each other out.
Mutual exclusion can be achieved by a system of locks. A mutual exclusion lock is colloquially called a mutex. You can see an example of mutex locking in the multithreaded file reader in the previous section.
Get_Mutex(m);
// update shared data
Release_Mutex(m);
This protocol is meant to ensure that only one process at a time can get past the function Get_Mutex. All other processes or threads are made to wait at the function Get_Mutex until that one process calls Release_Mutex to release the lock. A method for implementing this is discussed below. Mutexes are a central part of multithreaded programming.
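For a concrete flavour (this example is not from the text), here is the same protocol written with POSIX threads; pthread_mutex_lock and pthread_mutex_unlock play the roles of Get_Mutex and Release_Mutex, and the shared counter merely stands in for "shared data":

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;               /* shared data */

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&m);        /* Get_Mutex(m)       */
        counter++;                     /* update shared data */
        pthread_mutex_unlock(&m);      /* Release_Mutex(m)   */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);
    return 0;
}

Compiled with -pthread, this always prints 200000; without the mutex the two threads would race and the result would vary from run to run.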
Mutual exclusion is needed around:
(a) Code that refers to one or more variables in a "read-update-write" fashion while any of those variables is possibly being altered by another thread;
(b) Code that alters one or more variables that are possibly being referenced in a "read-update-write" fashion by another thread;
(c) Code that uses a data structure while any part of it is possibly being altered by another thread; and
(d) Code that alters any part of a data structure while it is possibly in use by another thread.
In the past it was possible to implement this by generalising the idea of interrupt masks. By switching off interrupts (or more appropriately, by switching off the scheduler) a process can guarantee itself uninterrupted access to shared data. This method has drawbacks:
(a) Masking interrupts can be dangerous - there is always the possibility that important interrupts will be missed;
(b) It is not general enough in a multiprocessor environment, since interrupts will continue to be serviced by other processors - so all processors would have to be switched off; and
(c) It is too harsh. We only need to prevent two programs from being in their critical sections simultaneously if they share the same data. Programs A and B might share different data from programs C and D, so why should they wait for C and D?
In 1981 G. L. Peterson discovered a simple algorithm for achieving mutual exclusion between two processes with PID equal to 0 and 1. The code is as follows:

int turn;
int interested[2];

void Get_Mutex(int pid)
{
    int other;
    other = 1 - pid;
    interested[pid] = true;
    turn = pid;
    while (turn == pid && interested[other])   // loop (busy-wait) until no
    {                                          // one else is interested
    }
}

void Release_Mutex(int pid)
{
    interested[pid] = false;
}
Where more processes are involved, some modifications are necessary to this algorithm. The key to serialisation here is that, if a second process tries to obtain the mutex when another already has it, it will get caught in a loop which does not terminate until the other process has released the mutex. This solution is said to involve busy waiting - i.e., the program actively executes an empty loop, wasting CPU cycles, rather than moving the process out of the scheduling queue. This is also called a spin lock, since the system 'spins' on the loop while waiting. Let us see another algorithm which handles the critical section problem for two processes: Dekker's algorithm.
type processid = 0..1;
var need: array [processid] of boolean   { initially false };
    turn: processid                      { initially either 0 or 1 };

procedure dekkerwait (me: processid);
var other: processid;
begin
    other := 1 - me;
    need[me] := true;
    while need[other] do begin           { there is contention }
        if turn = other then begin
            need[me] := false;           { let the other take a turn }
            while turn = other do { nothing };
            need[me] := true;            { re-assert my interest }
        end;
    end;
end { dekkerwait };

procedure dekkersignal (me: processid);
begin
    need[me] := false;
    turn := 1 - me                       { now it is the other's turn };
end { dekkersignal };
Dekker's solution to the mutual exclusion problem requires that each of the contending processes has a unique process identifier, called "me", which is passed to the wait and signal operations. Although none of the previously mentioned solutions require this, most systems provide some form of process identifier which can be used for this purpose.
It should be noted that Dekker's solution does rely on one very simple assumption about the underlying hardware: it assumes that if two processes attempt to write two different values into the same memory location at the same time, one or the other value will be stored, and not some mixture of the two. This is called the atomic update assumption. The atomically updatable unit of memory varies considerably from one system to another; on some machines any update of a memory word is atomic but an attempt to update a byte is not, while on others updating a byte is atomic while words are updated by a sequence of byte updates.
The bakery algorithm handles the critical section problem for n processes as follows:
(a) Before entering its critical section, a process receives a number. The holder of the smallest number enters the critical section.
(b) If processes Pi and Pj receive the same number, then if i < j, Pi is served first; else Pj is served first.
(c) The numbering scheme always generates numbers in increasing order of enumeration; i.e., 1, 2, 3, 3, 3, 4, 5 ...
(d) Notation: < denotes lexicographical order on (ticket #, process id #):
(i) (a, b) < (c, d) if a < c, or if a = c and b < d.
(ii) max(a0, ..., an-1) is a number k such that k >= ai for i = 0, ..., n - 1.
(e) Shared data:
boolean choosing[n];   // initialise all to false
int number[n];         // initialise all to 0
(f) The data structures are initialised to false and 0, respectively.
do
{
    choosing[i] = true;
    number[i] = max(number[0], number[1], ..., number[n-1]) + 1;
    choosing[i] = false;
    for (int j = 0; j < n; j++)
    {
        while (choosing[j] == true)
        {
            /* do nothing */
        }
        while ((number[j] != 0) &&
               ((number[j], j) < (number[i], i)))   // lexicographical comparison, see (d)
        {
            /* do nothing */
        }
    }
    // critical section
    number[i] = 0;
    // remainder section
} while (true);
In the next section, we will study how semaphores provide a much more organised approach to the synchronisation of processes.
3.3 Semaphores
Semaphores provide a much more organised approach to controlling the interaction of multiple
processes than would be available if each user had to solve all interprocess communications using
simple variables, but more organisation is possible. In a sense, semaphores are something like the
goto statement in early programming languages; they can be used to solve a variety of problems,
but they impose little structure on the solution and the results can be hard to understand without
the aid of numerous comments. Just as there have been numerous control structures devised in
sequential programs to reduce or even eliminate the need for goto statements, numerous
specialized concurrent control structures have been developed which reduce or eliminate the need
for semaphores.
Definition: The effective synchronisation tools often used to realise mutual exclusion in more complex systems are semaphores. A semaphore S is an integer variable which can be accessed only through two standard atomic operations: wait and signal. The definitions of the wait and signal operations are:

Wait(S):   while S <= 0 do skip;
           S := S - 1;
Signal(S): S := S + 1;

Or in C language notation we can write it as:

Wait(S)
{
    while (S <= 0)
    {
        /* do nothing */
    }
    S = S - 1;
}

Signal(S)
{
    S = S + 1;
}

It should be noted that the test (S <= 0) and the modification of the integer value of S (S := S - 1) must be executed without interruption. In general, if one process modifies the integer value of S in the wait and signal operations, no other process can simultaneously modify that same S value.
We briefly explain the usage of semaphores in the following example:
Consider two concurrently running processes: P1 with a statement S1 and P2 with a statement S2. Suppose that we require that S2 be executed only after S1 has completed. This scheme can be implemented by letting P1 and P2 share a common semaphore synch, initialised to 0, and by inserting the statements:
S1;
signal(synch);
in the process P1, and the statements:
wait(synch);
S2;
in the process P2.
Since synch is initialised to 0, P2 will execute S2 only after P1 has invoked signal(synch), which is after S1.
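A minimal runnable version of this ordering scheme, using a POSIX unnamed semaphore and two threads standing in for P1 and P2 (the statements S1 and S2 are just printfs here; none of this is from the original text), could look as follows:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t synch;                    /* shared semaphore, initialised to 0 */

static void *p1(void *arg) {
    printf("S1\n");                    /* statement S1  */
    sem_post(&synch);                  /* signal(synch) */
    return NULL;
}

static void *p2(void *arg) {
    sem_wait(&synch);                  /* wait(synch): blocks until P1 signals */
    printf("S2\n");                    /* S2 runs only after S1                */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    sem_init(&synch, 0, 0);            /* initial value 0, shared between threads */
    pthread_create(&t2, NULL, p2, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&synch);
    return 0;
}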
The disadvantage of the semaphore definition given above is that it requires busy-waiting, i.e., while a process is in its critical section, any other process trying to enter its critical section must loop continuously in the entry code. It is clear that through busy-waiting, CPU cycles are wasted that some other process might have used productively.
To overcome busy-waiting, we modify the definition of the wait and signal operations. When a process executes the wait operation and finds that the semaphore's value is not positive, the process blocks itself. The block operation places the process into a waiting state. Using a scheduler, the CPU can then be allocated to other processes which are ready to run.
A process that is blocked, i.e., waiting on a semaphore S, should be restarted by the execution of a signal operation by some other process, which changes its state from blocked to ready. To implement a semaphore under this definition, we define a semaphore as:

struct semaphore
{
    int value;
    List *L;    // a list of processes
};

Each semaphore has an integer value and a list of processes. When a process must wait on a semaphore, it is added to this list. A signal operation removes one process from the list of waiting processes and awakens it. The semaphore operations can now be defined as follows:

wait(S)
{
    S.value = S.value - 1;
    if (S.value < 0)
    {
        add this process to S.L;
        block();
    }
}

signal(S)
{
    S.value = S.value + 1;
    if (S.value <= 0)
    {
        remove a process P from S.L;
        wakeup(P);
    }
}
The block operation suspends the process. The wakeup(P) operation resumes the execution of a blocked process P. These two operations are provided by the operating system as basic system calls.
One of the most critical problems concerning the implementation of semaphores is the situation where two or more processes are waiting indefinitely for an event that can be caused only by one of the waiting processes: these processes are said to be deadlocked. To illustrate this, consider a system consisting of two processes P1 and P2, each accessing two semaphores S and Q, set to the value one:
P1 P2
wait(S); wait(Q);
wait(Q); wait(S);
……
signal(S); signal(Q);
signal(Q); signal(S);
Suppose P1 executes wait(S) and then P2 executes wait(Q). When P1 executes wait(Q), it must wait until P2 executes signal(Q). Similarly, when P2 executes wait(S), it must wait until P1 executes signal(S). Since neither signal operation can ever be executed, P1 and P2 are deadlocked. It is clear that a set of processes is in a deadlocked state when every process in the set is waiting for an event that can only be caused by another process in the set.
ACTIVITY 3.2
Shared Data
char item;             //could be any data type
char buffer[n];
semaphore full = 0;    //counting semaphore
semaphore empty = n;   //counting semaphore
semaphore mutex = 1;   //binary semaphore
char nextp, nextc;

Producer Process
do
{
    produce an item in nextp;
    wait(empty);
    wait(mutex);
    add nextp to buffer;
    signal(mutex);
    signal(full);
} while (true);

Consumer Process
do
{
    wait(full);
    wait(mutex);
    remove an item from buffer to nextc;
    signal(mutex);
    signal(empty);
    consume the item in nextc;
} while (true);
An airline reservation system consists of a huge database with many processes that read and write
the data. Reading information from the database will not cause a problem since no data is
changed. The problem lies in writing information to the database. If no constraints are put on
access to the database, data may change at any moment. By the time a reading process displays
the result of a request for information to the user, the actual data in the database may have changed. What if, for instance, a process reads the number of available seats on a flight, finds
value of one, and reports it to the customer? Before the customer has a chance to make their
reservation, another process makes a reservation for another customer, changing the number of
available seats to zero.
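The scenario above is the classical readers-writers problem. Below is a minimal sketch of the usual "first readers-writers" solution with POSIX semaphores; the names wrt, mutex and readcount are the conventional ones (not taken from the text) and the seats variable merely stands in for the database:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t wrt;                  /* held by writers, and by the first/last reader */
static sem_t mutex;                /* protects readcount                            */
static int readcount = 0;
static int seats = 1;              /* the shared "database" value                   */

static void *reader(void *arg) {
    sem_wait(&mutex);
    if (++readcount == 1) sem_wait(&wrt);    /* first reader locks out writers */
    sem_post(&mutex);

    printf("reader sees %d seat(s)\n", seats);   /* reading needs no further locking */

    sem_wait(&mutex);
    if (--readcount == 0) sem_post(&wrt);    /* last reader lets writers in */
    sem_post(&mutex);
    return NULL;
}

static void *writer(void *arg) {
    sem_wait(&wrt);                /* exclusive access while updating */
    seats = 0;                     /* e.g. the reservation is made    */
    sem_post(&wrt);
    return NULL;
}

int main(void) {
    pthread_t r1, r2, w;
    sem_init(&wrt, 0, 1);
    sem_init(&mutex, 0, 1);
    pthread_create(&r1, NULL, reader, NULL);
    pthread_create(&w,  NULL, writer, NULL);
    pthread_create(&r2, NULL, reader, NULL);
    pthread_join(r1, NULL); pthread_join(r2, NULL); pthread_join(w, NULL);
    return 0;
}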
3.4.3 Dining Philosophers Problem
Five philosophers sit around a circular table. Each philosopher spends his life alternately thinking and eating. In the centre of the table is a large bowl of rice. A philosopher needs two chopsticks to eat. Only five chopsticks are available and a chopstick is placed between each pair of philosophers. They agree that each will only use the chopsticks to his immediate right and left. From time to time, a philosopher gets hungry and tries to grab the two chopsticks that are immediately to his left and right. When a hungry philosopher has both his chopsticks at the same time, he eats without releasing them. When he finishes eating, he puts down both chopsticks and starts thinking again.
Here’s a solution for the problem which does not require a process to write another process’s
state, and gets equivalent parallelism.
eat();
put_forks();
}
}
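A self-contained sketch of one standard semaphore-based solution (not necessarily the listing intended above) is given below: odd- and even-numbered philosophers pick up their chopsticks in opposite orders, which breaks the circular wait; the eat() and put_forks() steps correspond to the printf and sem_post lines.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 5

static sem_t chopstick[N];            /* one binary semaphore per chopstick */

static void *philosopher(void *arg) {
    int i = *(int *)arg;
    int first  = (i % 2) ? (i + 1) % N : i;    /* asymmetric pick-up order */
    int second = (i % 2) ? i : (i + 1) % N;    /* prevents circular wait   */

    for (int round = 0; round < 3; round++) {
        /* think(); */
        sem_wait(&chopstick[first]);
        sem_wait(&chopstick[second]);
        printf("philosopher %d eats\n", i);    /* eat()       */
        sem_post(&chopstick[second]);          /* put_forks() */
        sem_post(&chopstick[first]);
    }
    return NULL;
}

int main(void) {
    pthread_t t[N];
    int id[N];
    for (int i = 0; i < N; i++) sem_init(&chopstick[i], 0, 1);
    for (int i = 0; i < N; i++) { id[i] = i; pthread_create(&t[i], NULL, philosopher, &id[i]); }
    for (int i = 0; i < N; i++) pthread_join(t[i], NULL);
    return 0;
}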
up (&mutex); //shop is full: do not wait
}
{
Explanation (a code sketch of these procedures follows this list):
(a) This problem is similar to various queuing situations.
(b) The problem is to program the barber and the customers without getting into race conditions. The solution uses three semaphores:
• customers: counts the waiting customers;
• barbers: the number of barbers ready to cut hair (0 or 1);
• mutex: used for mutual exclusion; and
• a variable waiting is also needed; it also counts the waiting customers (reason: there is no way to read the current value of a semaphore).
(c) The barber executes the procedure barber, causing him to block on the semaphore customers (initially 0);
(d) The barber then goes to sleep;
(e) When a customer arrives, he executes customer, starting by acquiring mutex to enter a critical region;
(f) If another customer enters shortly thereafter, the second one will not be able to do anything until the first one has released mutex;
(g) The customer then checks to see if the number of waiting customers is less than the number of chairs;
(h) If not, he releases mutex and leaves without a haircut;
(i) If there is an available chair, the customer increments the integer variable waiting;
(j) Then he does an up on the semaphore customers;
(k) When the customer releases mutex, the barber begins the haircut.
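A sketch of the corresponding code is shown below. It uses POSIX semaphores; the constants, the small driver in main and the fixed number of customers are assumptions made to give a runnable example, not the text's own listing.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define CHAIRS 5
#define NCUST  3

sem_t customers, barbers, mutex;
int waiting = 0;                       /* customers sitting in the waiting chairs */

void *barber(void *arg) {
    for (int served = 0; served < NCUST; served++) {
        sem_wait(&customers);          /* sleep until a customer arrives */
        sem_wait(&mutex);
        waiting--;                     /* take one waiting customer      */
        sem_post(&barbers);            /* the barber is now ready        */
        sem_post(&mutex);
        printf("barber: cutting hair\n");
    }
    return NULL;
}

void *customer(void *arg) {
    sem_wait(&mutex);
    if (waiting < CHAIRS) {
        waiting++;
        sem_post(&customers);          /* wake the barber if necessary */
        sem_post(&mutex);
        sem_wait(&barbers);            /* wait for the barber          */
        printf("customer: getting haircut\n");
    } else {
        sem_post(&mutex);              /* shop is full: do not wait    */
    }
    return NULL;
}

int main(void) {
    pthread_t b, c[NCUST];
    sem_init(&customers, 0, 0);
    sem_init(&barbers, 0, 0);
    sem_init(&mutex, 0, 1);
    pthread_create(&b, NULL, barber, NULL);
    for (int i = 0; i < NCUST; i++) pthread_create(&c[i], NULL, customer, NULL);
    for (int i = 0; i < NCUST; i++) pthread_join(c[i], NULL);
    pthread_join(b, NULL);
    return 0;
}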
3.5 Locks
Locks are another synchronisation mechanism. A lock has two atomic operations (similar to a semaphore) to provide mutual exclusion. These two operations are Acquire and Release. A process acquires the lock before accessing a shared variable and releases it afterwards. A process locking a variable will run the following code:

Lock_Acquire();
... critical section ...
Lock_Release();

The difference between a lock and a semaphore is that a lock is released only by the process that acquired it earlier; as we discussed above, any process can increment the value of a semaphore. To implement locks, here are some things you should keep in mind:
(a) Make Acquire() and Release() atomic (a test-and-set sketch follows this list);
(b) Build a wait mechanism; and
(c) Make sure that only the process that acquired the lock releases it.
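One common way of making Acquire atomic is a hardware test-and-set instruction. The following sketch (an illustration, not the text's own code) uses C11's atomic_flag as a stand-in; the busy-wait mirrors the spin lock described earlier:

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int shared = 0;

/* Acquire: spin until the test-and-set finds the flag clear. */
static void Lock_Acquire(void) {
    while (atomic_flag_test_and_set(&lock)) {
        /* busy-wait until the holder releases the lock */
    }
}

/* Release: clear the flag so a waiting process can proceed. */
static void Lock_Release(void) {
    atomic_flag_clear(&lock);
}

int main(void) {
    Lock_Acquire();
    shared++;                          /* critical section */
    Lock_Release();
    return 0;
}

Note that this sketch does not by itself enforce point (c); a fuller implementation would also record which process or thread currently owns the lock.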
A common observation about critical sections is that many of the procedures for manipulating shared abstract data types such as files have critical sections making up their entire bodies. Such abstract data types have come to be known as monitors; the critical sections and semaphores involved in using them are implicit. All that this notation requires is that the programmer encloses the declarations of the procedures and the representation of the data type in a monitor block; the compiler supplies the semaphores and the wait and signal operations that this implies. Using Hoare's suggested notation, shared counters might be implemented as shown below:
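Hoare's notation has no direct C equivalent, so the listing below is only a rough approximation of such a counter monitor (the struct and function names are assumptions): a pthread mutex plays the role of the monitor's implicit semaphore, and each exported procedure is wholly a critical section.

#include <pthread.h>

/* A counter "monitor": the mutex is the monitor's implicit semaphore. */
struct counter {
    pthread_mutex_t lock;
    int value;
};

void counter_increment(struct counter *i) {
    pthread_mutex_lock(&i->lock);      /* implicit wait on entry  */
    i->value = i->value + 1;
    pthread_mutex_unlock(&i->lock);    /* implicit signal on exit */
}

int counter_read(struct counter *i) {
    pthread_mutex_lock(&i->lock);
    int v = i->value;
    pthread_mutex_unlock(&i->lock);
    return v;
}

int main(void) {
    struct counter i = { PTHREAD_MUTEX_INITIALIZER, 0 };
    counter_increment(&i);
    return counter_read(&i);           /* returns 1 */
}

A call such as counter_increment(&i) corresponds to the "i.increment" call described next.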
Calls to procedures within the body of a monitor are done using record notation; thus, to increment one of the counters declared in the above example, one would call "i.increment". This call would implicitly do a wait operation on the semaphore implicitly associated with "i", then execute the body of the "increment" procedure before doing a signal operation on that semaphore. Note that the call to "i.increment" implicitly passes a specific instance of the monitor as a parameter to the "increment" procedure, and that fields of this instance become global variables to the body of the procedure, as if there were an implicit "with" statement.
There are a number of problems with monitors which have been ignored in the above example. For example, consider the problem of assigning a meaning to a call from within one monitor procedure to a procedure within another monitor. This can easily lead to a deadlock, for example when procedures within two different monitors each call the other. It has sometimes been proposed that such calls should never be allowed, but they are sometimes useful! We will study more on deadlocks in the next unit of this course.
The most important problem with monitors is that of waiting for resources when they are not available. For example, consider implementing a queue monitor with internal procedures for the enqueue and dequeue operations. When the queue empties, a call to dequeue must wait, but this wait must not block further entries to the monitor through the enqueue procedure. In effect, there must be a way for a process to temporarily step outside of the monitor, releasing mutual exclusion while it waits for some other process to enter the monitor and do some needed action.
Hoare's suggested solution to this problem involves the introduction of condition variables which may be local to a monitor, along with the operations wait and signal. Essentially, if s is the monitor semaphore and c is a semaphore representing a condition variable, "wait c" is equivalent to "signal(s); wait(c); wait(s)" and "signal c" is equivalent to "signal(c)". The details of Hoare's wait and signal operations were somewhat more complex than is shown here, because the waiting process was given priority over other processes trying to enter the monitor, and condition variables had no memory; signaling a condition on which no process was waiting had no effect. Following is an example monitor:
monitor synch
    integer i;
    condition c;
    procedure producer(x);
    .
    .
    end;
    procedure consumer(x);
    .
    .
    end;
end monitor;
Only one process can be inside a monitor at a time; therefore every monitor has its own waiting list of processes waiting to enter it.
Let us see how the dining philosophers problem, which was explained with semaphores in the section above, can be rewritten using monitors:
Example: Solution to the Dining Philosophers Problem using Monitors
monitor dining_philosophers
{
    enum state {thinking, hungry, eating};
    state state[5];
    condition self[5];

    void pickup(int i)
    {
        state[i] = hungry;
        test(i);
        if (state[i] != eating)
            self[i].wait;
    }

    void putdown(int i)
    {
        state[i] = thinking;
        test((i + 4) % 5);
        test((i + 1) % 5);
    }

    void test(int k)
    {
        if ((state[(k + 4) % 5] != eating) && (state[k] == hungry)
            && (state[(k + 1) % 5] != eating))
        {
            state[k] = eating;
            self[k].signal;
        }
    }

    init
    {
        for (int i = 0; i < 5; i++)
            state[i] = thinking;
    }
}
Condition Variables
If a process cannot proceed inside the monitor it must block itself. This operation is provided by condition variables. Like locks and semaphores, a condition variable has a wait and a signal function, but it also has a broadcast signal. Implementation of condition variables is normally provided as part of the thread synchronisation library. Wait(), Signal() and Broadcast() have the following semantics:
Wait() releases the lock, gives up the CPU until signalled, and then re-acquires the lock.
Signal() wakes up a thread if there are any waiting on the condition variable.
Broadcast() wakes up all threads waiting on the condition.
When you implement a condition variable, you must keep a queue of the processes waiting on it (a pthread-based sketch follows).
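As an illustration using POSIX threads (the flag name and the two-thread structure are assumptions, not part of the text), the sketch below blocks one thread on a condition variable until another signals it; pthread_cond_wait releases the lock while sleeping and re-acquires it before returning, matching the Wait() semantics above.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;                  /* the condition being waited for */

static void *waiter(void *arg) {
    pthread_mutex_lock(&lock);
    while (!ready)                     /* re-check: signals have no memory */
        pthread_cond_wait(&cond, &lock);   /* releases lock, sleeps, re-acquires */
    printf("condition satisfied\n");
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *signaller(void *arg) {
    pthread_mutex_lock(&lock);
    ready = 1;
    pthread_cond_broadcast(&cond);     /* wake all waiters (signal wakes just one) */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, waiter, NULL);
    pthread_create(&t2, NULL, signaller, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}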
ACTIVITY 3.3
1. What are race conditions? How do race conditions occur in operating systems?
2. What is a critical section? Explain.
3. Provide the solution to a classical synchronisation problem namely “cigarette smoker’s
problem”. The problem is defined as follows:
There are four processes in this problem: three smoker processes and an agent process.
Each of the smoker processes will make a cigarette and smoke it. To make a cigarette
requires tobacco, paper, and matches. Each smoker process has one of the three items.
i.e. one process has tobacco, another has paper, and a third has matches. The agent has
an infinite supply of all three. The agent places two of the three items on the table, and the
smoker that has the third item makes the cigarette.
3.7 Summary
• Interprocess communication provides a mechanism to allow processes to communicate with other processes. An interprocess communication facility is often best provided by a message passing system, and message systems can be defined in many different ways. If there is a collection of cooperating sequential processes that share some data, mutual exclusion must be provided. Different algorithms are available for solving the critical section problem, which we have discussed in this unit. The bakery algorithm is used for solving the n-process critical section problem.
• Interprocess synchronisation allows processes to synchronise their activities. Semaphores can be used to solve synchronisation problems. A semaphore can only be accessed through two atomic operations, wait and signal, and can be implemented efficiently.
• There are a number of classical synchronisation problems which we have discussed in this unit (such as the producer-consumer problem, the readers-writers problem and the dining philosophers problem). These problems are important mainly because they are examples of a large class of concurrency-control problems. In the next unit we will study an important related topic called "Deadlocks".
KEY TERMS
REFERENCES
Dhamdhere, D. M. (2006). Operating systems: A concept-based approach. New Delhi: Tata McGraw-Hill.
Milenkovic, M. (2000). Operating systems: Concept and design. New York: McGraw-Hill International Education.
Tanenbaum, A. S., & Woodhull, A. S. (2009). Operating system design and implementation. UK: Pearson.
Stallings, W. (2001). Operating systems (4th ed.). New Jersey: Prentice Hall.
Unit 4 Deadlocks
4.0 Introduction
The operating system is responsible for making sure that the requesting process has been
allocated the resource. A system table indicates if each resource is free or allocated, and if
allocated, to which process. If a process requests a resource that is currently allocated to another
process, it can be added to a queue of processes waiting for this resource.
In some cases, several processes may compete for a fixed number of resources. A process
requests resources and if the resources are not available at that time, it enters a wait state. It may
happen that it will never gain access to the resources, since those resources are being held by
other waiting processes.
For example, assume a system with one tape drive and one plotter. Process P1 requests the tape
drive and process P2 requests the plotter. Both requests are granted. Now P1 requests the plotter
(without giving up the tape drive) and P2 requests the tape drive (without giving up the plotter).
Neither request can be granted, so both processes enter a situation called deadlock.
Each of the deadlocked processes waits for the release of the remaining resources held by the others, thus making it impossible for any of the deadlocked processes to proceed.
In the earlier units, we have gone through the concept of process and the need for the interprocess
communication and synchronisation. In this unit we will study about the deadlocks, its
characterisation, deadlock avoidance and its recovery.
4.2 Deadlocks
Before studying deadlocks, let us look at the various types of resources. There are two types of resources, namely pre-emptable and non-pre-emptable resources. A pre-emptable resource is one that can be taken away from the process owning it without ill effects (for example, memory), while a non-pre-emptable resource cannot be taken away from its current owner without causing the computation to fail (for example, a printer in the middle of a job).
Reallocating resources can resolve deadlocks that involve pre-emptable resources. Deadlocks that involve non-pre-emptable resources are difficult to deal with. Let us see how a deadlock occurs.
Definition
A set of processes is in a deadlock state if each process in the set is waiting for an event that can be caused only by another process in the set. In other words, each member of the set of deadlocked processes is waiting for a resource that can be released only by a deadlocked process. None of the processes can run, none of them can release any resources and none of them can be awakened. It is important to note that the number of processes and the number and kind of resources possessed and requested are unimportant.
Let us understand the deadlock situation with the help of examples.
Example 1:
The simplest example of deadlock is where process 1 has been allocated a non-shareable resource
A, say, a tape drive, and process 2 has been allocated a non-sharable resource B, say, a printer.
Now, if it turns out that process 1 needs resource B (printer) to proceed and process 2 needs
resource A (the tape drive) to proceed, and these are the only two processes in the system, each has blocked the other and all useful work in the system stops. This situation is termed deadlock.
The system is in deadlock state because each process holds a resource being requested by the
other process and neither process is willing to release the resource it holds.
Example 2:
Consider a system with three disk drives. Suppose there are three processes, each holding one of these three disk drives. If each process now requests another disk drive, the three processes will be in a deadlock state, because each process is waiting for the event "disk drive is released", which can only be caused by one of the other waiting processes. A deadlock state can involve processes competing not only for the same resource type, but also for different resource types.
Deadlocks occur most commonly in multitasking and client/server environments and are also known as a "deadly embrace". Ideally, the programs that are deadlocked, or the operating system, should resolve the deadlock, but this doesn't always happen.
From the above examples, we have understood the concept of deadlocks. In the examples we were given some instances; let us now study the necessary conditions for a deadlock to occur.
Coffman (1971) identified four necessary conditions that must hold simultaneously for a deadlock to occur:
(a) Mutual exclusion: the resources involved are non-shareable. At least one resource must be held in a non-shareable mode, that is, only one process at a time claims exclusive control of the resource. If another process requests that resource, the requesting process must be delayed until the resource has been released.
(b) Hold and wait: a process holding at least one resource is waiting to acquire additional resources that are currently held by other processes.
(c) No pre-emption: resources cannot be forcibly taken away from the processes holding them; they must be released voluntarily.
(d) Circular wait: a circular chain of processes exists in which each process holds one or more resources that are requested by the next process in the chain.
Figure 4.1: Traffic deadlock
The simple rule to avoid traffic deadlock is that a vehicle should only enter an intersection if it is assured that it will not have to stop inside the intersection. It is not possible to have a deadlock involving only one single process. The deadlock involves a circular "hold-and-wait" condition between two or more processes, so "one" process cannot hold a resource yet be waiting for another resource that it itself is holding. In addition, such a deadlock is not possible between two threads in the same process, since each thread has access to the resources held by the process.
We can use these graphs to determine if a deadlock has occurred or may occur. If, for example, all
resources have only one instance (all resource node rectangles have one dot) and the graph is
circular, then a deadlock has occurred. If on the other hand some resources have several
instances, then a deadlock may occur. If the graph is not circular, a deadlock cannot occur (the
circular wait condition wouldn’t be satisfied). The following are the tips which will help you to
check the graph easily to predict the presence of cycles:
(a) If no cycle exists in the resource allocation graph, there is no deadlock.
(b) If there is a cycle in the graph and each resource has only one instance, then there is a
deadlock. In this case, a cycle is a necessary and sufficient condition for deadlock.
(c) If there is a cycle in the graph, and each resource has more than one instance, there may
or may not be a deadlock. (A cycle may be broken if some process outside the cycle has a
resource instance that can break the cycle). Therefore, a cycle in the resource allocation
graph is a necessary but not sufficient condition for deadlock, when multiple resource
instances are considered.
Figure 4.3: Resource allocation graph having a cycle and not in a deadlock
The graph shown in Figure 4.3 has a cycle and yet is not in deadlock.
(Resource 1 has one instance, shown by a star.)
(Resource 2 has two instances, a and b, shown as two stars.)
Its edges are: R1 → P1, P1 → R2 (instance a), R2 (instance b) → P2, P2 → R1.
Let’s examine each strategy one by one to evaluate their respective strengths and weaknesses.
For example, a program requiring ten tape drives must request and receive all ten drives before it begins executing. If the program needs only one tape drive to begin execution and then does not need the remaining tape drives for several hours, then substantial computer resources (nine tape drives) will sit idle for several hours. This strategy can also cause indefinite postponement (starvation), since not all the required resources may become available at once.
Rule: Processes can request resources whenever they want to, but all requests must be made in numerical order. A process may request first a printer and then a tape drive (order: 2, 4), but it may not request first a plotter and then a printer (order: 3, 2). The problem with this strategy is that it may be impossible to find an ordering that satisfies everyone.
This strategy, if adopted, may result in low resource utilisation and in some cases starvation
is possible too.
The Banker’s Algorithm is based on the banking system, which never allocates its available cash
in such a manner that it can no longer satisfy the needs of all its customers. Here we must have
the advance knowledge of the maximum possible claims for each process, which is limited by the
resource availability. During the run of the system we should keep monitoring the resource
allocation status to ensure that no circular wait condition can exist.
If the necessary conditions for a deadlock are in place, it is still possible to avoid deadlock by
being careful when resources are allocated. The following are the features that are to be
considered for avoidance of the deadlock as per the Banker’s Algorithms.
(a) Each process declares the maximum number of resources of each type that it may need;
(b) Keep the system in a safe state, in which we can allocate resources to each process in some order and avoid deadlock;
(c) Check for the safe state by finding a safe sequence <P1, P2, ..., Pn>, where the resources that Pi needs can be satisfied by the available resources plus the resources held by Pj where j < i; and
(d) The resource allocation graph algorithm uses claim edges to check for a safe state.
The resource allocation state is now defined by the number of available and allocated resources,
and the maximum demands of the processes. Subsequently the system can be in either of the
following states:
(a) Safe State
Such a state occurs when the system can allocate resources to each process (up to its
maximum) in some order and avoid a deadlock. This state will be characterised by a safe
sequence. It must be mentioned here that we should not falsely conclude that all unsafe states are deadlocked, although an unsafe state may eventually lead to a deadlock.
(b) Unsafe State
If the system did not follow the safe sequence of resource allocation from the beginning
and it is now in a situation, which may lead to a deadlock, then it is in an unsafe state.
(c) Deadlock State
If the system has some circular wait condition existing for some processes, then it is in
deadlock state.
Let us study this concept with the help of an example as shown below:
Consider an analogy in which four processes (P1, P2, P3 and P4) can be compared with the customers in a bank, resources such as printers etc. as cash available in the bank, and the operating system as the banker.

Table 4.2
Processes    Resources Used    Maximum Resources
P1           0                 6
P2           0                 5
P3           0                 4
P4           0                 7
Available resources = 10

Table 4.3
Processes    Resources Used    Maximum Resources
P1           1                 6
P2           1                 5
P3           2                 4
P4           4                 7
Available resources = 2

Safe State: The key to a state being safe is that there is at least one way for all users to finish. In other words, the state of Table 4.3 is safe because with 2 units left the operating system can delay any request except P3's, thus letting P3 finish and release all four of its resources. With four units in hand, the operating system can then let either P4 or P2 have the necessary units, and so on.
Unsafe State: Consider what would happen if a request from P2 for one more unit were granted in Table 4.3. We would have the situation shown in Table 4.4.
Table 4.4: Unsafe State
Processes Resources Used Maximum Resources
P1 1 6
P2 2 5
P3 2 4
P4 4 7
Available resources = 1
This is an unsafe state.
If all the processes then request their maximum resources, the operating system could not satisfy any of them and we would have a deadlock.
Important Note: It is important to note that an unsafe state does not imply the existence or even the eventual existence of a deadlock. What an unsafe state does imply is simply that some unfortunate sequence of events might lead to a deadlock.
The Banker's algorithm is thus used to consider each request as it occurs, and to see if granting it leads to a safe state. If it does, the request is granted; otherwise, it is postponed until later. Haberman [1969] has shown that execution of the algorithm has a complexity proportional to N², where N is the number of processes, and since the algorithm is executed each time a resource request occurs, the overhead is significant.
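To make the safety test concrete, here is a small C sketch of the safe-sequence check for the single-resource example above (the Used/Max figures and the available count come from Table 4.3; the function name and loop structure are just one possible way to code the check, not part of the text):

#include <stdio.h>
#include <stdbool.h>

#define N 4

int used[N] = {1, 1, 2, 4};            /* figures from Table 4.3 */
int max[N]  = {6, 5, 4, 7};
int available = 2;

/* True if some order lets every process obtain its maximum and finish. */
bool is_safe(void) {
    bool finished[N] = {false};
    int free_units = available;

    for (int done = 0; done < N; ) {
        bool progressed = false;
        for (int i = 0; i < N; i++) {
            int need = max[i] - used[i];
            if (!finished[i] && need <= free_units) {
                free_units += used[i];  /* Pi runs to completion and releases everything */
                finished[i] = true;
                done++;
                progressed = true;
                printf("P%d can finish (free becomes %d)\n", i + 1, free_units);
            }
        }
        if (!progressed) return false;  /* no process can proceed: unsafe */
    }
    return true;
}

int main(void) {
    printf("state is %s\n", is_safe() ? "safe" : "unsafe");
    return 0;
}

For Table 4.3 this reports a safe state (P3 finishes first); changing used[1] to 2 and available to 1, as in Table 4.4, makes it report an unsafe state.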
(e) A resource becoming unavailable (e.g., a tape drive breaking) can result in an unsafe state.
Consider the scenario where a process is in the middle of updating a data file and it is terminated. The file may be left in an incorrect state by the unexpected termination of the updating process. Further, processes should be terminated based on some criterion/policy. Some of the criteria may be as follows:
(i) Priority of a process;
(ii) CPU time used and expected usage before completion;
(iii) Number and type of resources being used (can they be pre-empted easily?);
(iv) Number of resources needed for completion;
(v) Number of processes that need to be terminated; and
(vi) Whether the processes are interactive or batch.
If a deadlock is detected, one or more processes are restarted from their last checkpoint.
Restarting a process from a checkpoint is called rollback. It is done with the expectation
that the resource requests will not interleave again to produce deadlock.
Deadlock recovery is generally used when deadlocks are rare, and the cost of recovery
(process termination or rollback) is low.
Process checkpointing can also be used to improve reliability (long running
computations), assist in process migration, or reduce startup costs.
ACTIVITY 4.1
4.6. Summary
A deadlock occurs when a process holds some resource and is waiting to acquire another resource, while that resource is being held by some process that is waiting to acquire the resource that is being held by the first process.
A deadlock needs four conditions to occur: mutual exclusion, hold and wait, non-preemption and circular waiting.
We can handle deadlocks in three major ways: we can prevent them, handle them when we detect them, or simply ignore the whole deadlock issue altogether.
KEY TERMS
REFERENCES
Unit 5 Memory Management
5.0 Introduction
In the previous units, we have studied about introductory concepts of the OS, process
management and deadlocks. In this unit, we will go through another important function of the
Operating System - the memory management.
Memory is central to the operation of a modern computer system. Memory is a large array of
words or bytes, each location with its own address. Interaction is achieved through a sequence of
reads/writes of specific memory addresses. The program is fetched from the hard disk and stored in memory. If a program is to be executed, it must be mapped to absolute addresses
and loaded into memory.
In a multiprogramming environment, in order to improve both the CPU utilisation and the speed
of the computer’s response, several processes must be kept in memory.
There are many different algorithms depending on the particular situation to manage the memory.
Selection of a memory management scheme for a specific system depends upon many factors, but
especially upon the hardware design of the system.
Each algorithm requires its own hardware support.
The Operating System is responsible for the following activities in connection with memory
management:
(a) Keep track of which parts of memory are currently being used and by whom;
(b) Decide which processes are to be loaded into memory when memory space becomes
available; and
(c) Allocate and deallocate memory space as needed.
Memory management also requires protection support, such as processor (hardware) support that is able to trap instructions violating protection and trying to interfere with other processes.
This unit collectively depicts such memory management related responsibilities in detail by the
OS. Further we will discuss, the basic approaches of allocation are of two types:
(a) Contiguous Memory Allocation
Each program's data and instructions are allocated a single contiguous block of memory.
(b) Non-Contiguous Memory Allocation
Each program's data and instructions are allocated memory space that is not necessarily contiguous.
This unit focuses on the contiguous memory allocation scheme.
Usually, programs reside on a disk in the form of executable files, and for execution they must be brought into memory and placed within a process. Such programs form the ready queue. In the general scenario, processes are fetched from the ready queue, loaded into memory and then executed. During these stages, addresses may be represented in different ways, for example as symbolic addresses in source code (e.g., LABEL). The compiler will bind these symbolic addresses to relocatable addresses (for example, 16 bytes from the base address or start of the module). The linking editor will bind these relocatable addresses to absolute addresses. Before a program is loaded into memory, we must bind the memory addresses that the program is going to use. Binding is basically assigning which addresses the code and data are going to occupy. Binding can happen at compile time, load time or execution time, as explained below:
(a) Compile-time
If the memory location is known a priori, absolute code can be generated.
(b) Load-time
If the memory location is not known at compile time, the compiler must generate relocatable code, and final binding is delayed until load time.
(c) Execution-time
Binding is delayed until run-time; the process can be moved during its execution. We need hardware support for address maps (base and limit registers).
For better memory utilisation, all modules can be kept on disk in a relocatable format; only the main program is loaded into memory and executed, and other routines are called, loaded and their addresses updated only when they are needed. Such a scheme is called dynamic loading, which is the user's responsibility rather than the OS's, although the operating system provides library routines to implement dynamic loading.
In the above discussion, we have seen that the entire program and its related data are loaded into physical memory for execution. But what if a process is larger than the amount of memory allocated to it? We can overcome this problem by adopting a technique called overlays. Like dynamic loading, overlays can also be implemented by users without OS support. The entire program or application is divided into instruction and data sets such that when one instruction set is needed, it is loaded into memory and, after its execution is over, the space is released. Such instruction sets are called overlays, and they are loaded and unloaded by the program.
Definition: An overlay is a part of an application which has been loaded at the same origin where previously some other part(s) of the program was residing.
A program based on an overlay scheme mainly consists of the following:
(a) A root piece which is always memory resident; and
(b) A set of overlays.
Overlays give the program a way to extend limited main storage. An important aspect related to identifying overlays in a program is the concept of mutual exclusion, i.e., routines which do not invoke each other need not be loaded in memory simultaneously.
For example, suppose the total available memory is 140K. Consider a program with four subroutines named Read(), Function1(), Function2() and Display(). First, Read is invoked and reads a set of data. Based on the values in this data set, either Function1 or Function2 is then called conditionally. Finally, Display is called to output the results. Here, Function1 and Function2 are mutually exclusive and are not required simultaneously in memory. The memory requirement can be shown as in Figure 5.1.
Figure 5.1: Example of overlay
Without overlays the program requires 180K of memory, whereas with overlay support the memory requirement is only 130K. An overlay manager/driver is responsible for loading and unloading overlay segments as required. But this scheme suffers from limitations, chiefly that the overlay structure must be designed and specified manually by the programmer, which requires complete knowledge of the program's structure and is time consuming.
Swapping
Swapping is an approach to memory management in which each process is brought into main memory in its entirety, run, and then put back on the disk, so that another program may be loaded into that space. Swapping is a technique that lets you use a disk file as an extension of memory. Lower priority user processes are swapped to backing store (disk) when they are waiting for I/O or some other event, such as the arrival of higher priority processes. This is roll-out swapping. Swapping the process back into store when some event occurs or when it is needed (possibly into a different partition) is known as roll-in swapping. Figure 5.2 depicts the techniques of swapping.
Though swapping has these benefits, it has a few limitations too: the entire program must be resident in store when it is executing, and processes with changing memory requirements will need to issue system calls for requesting and releasing memory. It is also necessary to know exactly how much memory a user process is using and whether it is blocked or waiting for I/O.
97
Transfer time = 100K/ 1,000 = 1/10 seconds
= 100 milliseconds
Access time = 10 milliseconds
Total time = 110 milliseconds
As both swap-out and swap-in must take place, the total swap time is then about 220 milliseconds (the above time doubled). For round robin CPU scheduling, the time quantum should be much larger than this swap time of 220 milliseconds. Also, if a process is not utilising its memory space and is just waiting for an I/O operation or blocked, it should be swapped out.
The computer uses logical and physical addressing to map memory. The logical address is the one generated by the CPU, also referred to as a virtual address; this is the address space the program perceives. The physical address is the actual address understood by the computer hardware, i.e., the memory unit. Logical to physical address translation is taken care of by the operating system. The term virtual memory refers to the abstraction of separating LOGICAL memory (i.e., memory as seen by the process) from PHYSICAL memory (i.e., memory as seen by the processor). Because of this separation, the programmer needs to be aware of only the logical memory space, while the operating system maintains two or more levels of physical memory space.
ACTIVITY 5.1
In the compile-time and load-time address binding schemes, the logical and physical addresses tend to be the same. They differ in the execution-time address binding scheme, where the Memory Management Unit (MMU) handles the translation between them.
Definition: The MMU (as shown in Figure 5.3) is a hardware device that maps a logical address to a physical address; it maps the virtual address to the real store location. In the simplest MMU scheme, the contents of a relocation register are added to every address generated by a user process at the time it is sent to memory.
In the simplest case of a single-user system everything was easy, as at any time there was just one process in memory and no address translation was done by the operating system dynamically during execution. Protection of the OS (or part of it) can be achieved by keeping it in ROM. We can also have a separate space accessible only in supervisor mode, as shown in Figure 5.4.
The user can employ overlays if the memory requirement of a program exceeds the size of physical memory. In this approach only one process at a time can be in the running state in the system. An example of such a system is MS-DOS, which is a single-tasking system with a command interpreter. Such an arrangement is limited in capability and performance.
Figure 5.5: The hierarchy of memory allocation
The contents of a relocation register are implicitly added to any address reference generated by the program. Some systems use base registers as relocation registers for easy addressability, as these are within the programmer's control. Also, in some systems relocation is managed and accessed by the operating system only.
To summarise, we can say that in the dynamic relocation scheme, if the logical address space range is 0 to Max then the physical address space range is R+0 to R+Max (where R is the relocation register contents). Similarly, a limit register is checked by hardware to ensure that the logical address generated by the CPU does not exceed the size of the program.
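As a rough illustration of this check, the following Python sketch models a relocation register R and a limit register; the register values and the sample logical address are made up for illustration.

    # Sketch of dynamic relocation with a relocation (base) register and a limit register.
    # The register values below are illustrative only.
    RELOCATION_REGISTER = 14000   # R: start of the partition in physical memory
    LIMIT_REGISTER = 3000         # size of the program's logical address space (0..Max)

    def translate(logical_address):
        """Return the physical address, or raise an error if the address is out of range."""
        if logical_address >= LIMIT_REGISTER:       # hardware trap: addressing error
            raise MemoryError("logical address %d exceeds limit %d"
                              % (logical_address, LIMIT_REGISTER))
        return RELOCATION_REGISTER + logical_address

    print(translate(346))    # 14346: physical address R + 346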
Fixed-sized Partition:
A simple memory management scheme is to divide memory into n (possibly unequal) fixed-sized partitions, each of which can hold exactly one process. This is also known as a static partitioning scheme, as shown in Figure 5.7. The degree of multiprogramming is dependent on the number of partitions. IBM used this scheme for the System/360 OS/MFT (Multiprogramming with a Fixed number of Tasks). The partition boundaries are not movable (the system must be rebooted to move a job). We can have one queue per partition or just a single queue for all the partitions.
Initially, the whole memory is available for user processes and is like one large block of available memory. The operating system keeps details of available memory blocks and occupied blocks in tabular form. The OS also keeps track of the memory requirements of each process. As processes enter the input queue, when sufficient space is available a process is allocated space and loaded. After its execution is over it releases its occupied space and the OS fills this space with other processes from the input queue. A block of available memory is known as a hole. Holes of various sizes are scattered throughout the memory. When any process arrives, it is allocated memory from a hole that is large enough to accommodate it. This example is shown in Figure 5.8.
If the selected hole is larger than the requested size, it is split into two parts:
(a) one that is allocated to the next process in the input queue; and
(b) the remainder, which is added to the set of holes.
Within a partition, if two holes are adjacent then they can be merged to make a single large hole. But this scheme suffers from the fragmentation problem. Storage fragmentation occurs either because the user processes do not completely fill the allotted partition, or because a partition remains unused if it is too small to hold any process from the input queue. Main memory utilisation is extremely inefficient.
Any program, no matter how small, occupies an entire partition. In our example, process B takes 150K of partition 2 (which is 200K in size), so we are left with a 50K hole. This phenomenon, in which there is wasted space internal to a partition, is known as internal fragmentation. It occurs because a process is loaded into a partition that is large enough to hold it, and the allocated memory that is internal to the partition but not in use is wasted.
Variable-sized Partition:
This scheme is also known as dynamic partitioning. In this scheme, boundaries are not fixed. Processes are allocated memory according to their requirements. There is no wastage, as the partition size is exactly the same as the size of the user process. Initially, when processes start, this wastage can be avoided, but later on when they terminate they leave holes in the main storage. Other processes can be accommodated in these holes, but eventually the holes become too small to accommodate new jobs, as shown in Figure 5.9.
IBM used this technique for OS/MVT (Multiprogramming with a Variable number of Tasks), in which the partitions are of variable length and number. But the fragmentation anomaly still exists in this scheme. As time goes on and processes are loaded into and removed from memory, fragmentation increases and memory utilisation declines. This wastage of memory, which is external to the partitions, is known as external fragmentation. Here, although there may be enough total memory to satisfy a request, it is not contiguous: it is fragmented into small holes that cannot be utilised.
The external fragmentation problem can be resolved by coalescing holes and by storage compaction. Coalescing holes is the process of merging an existing hole adjacent to a process that is about to terminate and free its allocated space; the new hole and the existing hole can then be viewed as a single large hole and be utilised efficiently. There is another possibility, that holes are distributed throughout the memory. For utilising such scattered holes, shuffle all occupied areas of memory to one end and leave all free memory space as a single large block, which can then be utilised. This mechanism is known as storage compaction, as shown in Figure 5.10.
In a multiprogramming system, memory is divided into a number of fixed-size or variable-sized partitions or regions, which are allocated to running processes. For example, a process that needs m words of memory may run in a partition of n words, where n is greater than or equal to m. The variable-sized partition scheme may result in a situation where available memory is not contiguous but fragmented into many scattered holes. The difference (n − m) is called internal fragmentation: memory which is internal to a partition but is not being used. If a partition is unused and available but too small to be used by any waiting process, it is accounted for as external fragmentation. These memory fragments cannot be used.
In order to solve this problem, we can either compact the memory making large free memory
blocks, or implement paging scheme which allows a program’s memory to be non-contiguous,
thus permitting a program to be allocated physical memory wherever it is available.
ACTIVITY 5.2
5. List three limitations of storage compaction.
5.6 Paging
We will see the principles of operation of paging in the next section. The paging scheme solves the problems, such as external fragmentation, faced with variable-sized partitions.
Now the question arises: which placement strategy (first-fit, best-fit or worst-fit) should be used to select a hole? In practice, best-fit and first-fit are better than worst-fit, and both are efficient in terms of time and storage requirements. First-fit requires the least overhead in its implementation because of its simplicity. Worst-fit sometimes leaves large holes that can later be used to accommodate other processes. Thus all these policies have their own merits and demerits.
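As a rough illustration of the three placement strategies just mentioned, the following sketch picks a hole for a request under each policy; the hole sizes and request size are made up for illustration.

    # Selecting a hole for a request of a given size under three placement policies.
    holes = [120, 500, 200, 300, 600]   # sizes (in K) of free holes, illustrative values
    request = 212

    def first_fit(holes, size):
        for i, h in enumerate(holes):
            if h >= size:
                return i
        return None

    def best_fit(holes, size):
        candidates = [(h, i) for i, h in enumerate(holes) if h >= size]
        return min(candidates)[1] if candidates else None   # smallest adequate hole

    def worst_fit(holes, size):
        candidates = [(h, i) for i, h in enumerate(holes) if h >= size]
        return max(candidates)[1] if candidates else None   # largest hole

    print(first_fit(holes, request), best_fit(holes, request), worst_fit(holes, request))
    # first-fit -> hole 1 (500K), best-fit -> hole 3 (300K), worst-fit -> hole 4 (600K)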
This is known as the address translation scheme. For example, a 16-bit logical address can be divided as shown in Figure 5.12: the high-order 5 bits (bits 15 to 11) give the page number (p) and the low-order 11 bits (bits 10 to 0) give the displacement (d). For instance, the address 00110 00000101010 has page number 6 and displacement 42.
Here, as the page number takes 5 bits, its range of values is 0 to 31 (i.e., 2^5 − 1). Similarly, the offset uses 11 bits, so its range is 0 to 2047 (i.e., 2^11 − 1). Summarising, this paging scheme uses 32 pages, each with 2048 locations.
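Because the page size is a power of two, the page number and displacement can be obtained by shifting and masking, as the short sketch below shows for the 5-bit/11-bit split above.

    PAGE_OFFSET_BITS = 11                     # 2**11 = 2048 locations per page
    PAGE_MASK = (1 << PAGE_OFFSET_BITS) - 1

    def split(address):
        page_number = address >> PAGE_OFFSET_BITS   # upper 5 bits of a 16-bit address
        offset = address & PAGE_MASK                # lower 11 bits
        return page_number, offset

    # 0b00110_00000101010 -> page 6, displacement 42
    print(split(0b0011000000101010))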
The table that holds the virtual-address-to-physical-address translations is called the page table. As the displacement is constant, only translation of the virtual page number to the physical page (frame) number is required. This can be seen diagrammatically in Figure 5.12.
The page number is used as an index into the page table, and the latter contains the base address of each corresponding physical memory page (frame). This reduces the dynamic relocation effort. The paging hardware support is shown diagrammatically in Figure 5.13.
Figure 5.13: Address translation scheme
This is the case of direct mapping, as the page table maps directly to the physical memory page. This is shown in Figure 5.14. The disadvantage of this scheme is its speed of translation: the page table is kept in primary storage and its size can be considerably large, which increases the instruction execution time (as well as the access time) and hence decreases system speed. To overcome this, additional hardware support in the form of registers and buffers can be used, as explained in the next section.
Paging Address Translation with Associative Mapping
This scheme is based on the use of dedicated registers with high speed and efficiency. These small, fast-lookup caches help to place part of the page table into a content-addressed associative store, hence speeding up the lookup. They are known as associative registers or Translation Look-aside Buffers (TLBs). Each register consists of two entries: a page number and the corresponding frame number.
This is similar to the direct mapping scheme, but here, as the TLB contains only a few page table entries, the search is fast. It is, however, quite expensive due to the register support, so the direct and associative mapping schemes can also be combined to get more benefits. Here, the page number is matched with all associative registers simultaneously. The percentage of times a page is found in the TLB is called the hit ratio. If the page is not found there, it is searched for in the page table and added to the TLB; if the TLB is already full, a replacement policy is used, since the number of entries in the TLB is limited. This combined scheme is shown in Figure 5.15.
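The influence of the hit ratio can be estimated with the usual effective access time calculation; the TLB lookup time, memory access time and hit ratio below are assumed values chosen only for illustration.

    # Effective access time (EAT) with a TLB.
    # On a TLB hit one memory access is needed; on a miss the page table in memory
    # must be consulted first, so two memory accesses are needed.
    tlb_lookup = 20        # nanoseconds, assumed
    memory_access = 100    # nanoseconds, assumed
    hit_ratio = 0.90       # fraction of references found in the TLB, assumed

    eat = hit_ratio * (tlb_lookup + memory_access) \
          + (1 - hit_ratio) * (tlb_lookup + 2 * memory_access)
    print("Effective access time: %.1f ns" % eat)   # 130.0 ns for these values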
If the page tables of two processes map to the same physical page, the processes share that memory; if one process writes the data, the other process sees the changes. It is a very efficient way to communicate. Sharing must also be controlled, to protect the data in one process from being modified or accessed improperly by another process. For this, programs are kept separated into procedures and data, and procedures and data that are non-modifiable (pure/reentrant code) can be shared. Reentrant code cannot modify itself and must make sure that it has a separate copy of per-process global variables. Non-modifiable procedures are also known as pure procedures or reentrant code (they cannot change during execution). For example, only one copy of an editor or compiler code can be kept in memory, and all editor or compiler processes can execute that single shared copy of the code. This helps memory utilisation. The major advantages of the paging scheme are:
(a) The virtual address space can be greater than the main memory size, i.e., we can execute a program with a large logical address space compared with the physical address space;
(b) It avoids external fragmentation and hence the need for storage compaction; and
(c) It allows full utilisation of available main storage.
The disadvantages of paging include the internal fragmentation problem, i.e., wastage within an allocated page when a process is smaller than the page boundary. Also, there is extra resource consumption and overhead for the paging hardware and for virtual-to-physical address translation.
ACTIVITY 5.3
5.7 Segmentation
In the earlier section we saw the memory management scheme called paging. In general, a user or programmer prefers to view system memory as a collection of variable-sized segments rather than as a linear array of words. Segmentation is a memory management scheme that supports this view of memory.
5.7.2 Address Translation
This mapping between the two is done by the segment table, which contains the segment base and its limit. The segment base holds the starting physical address of the segment, and the segment limit provides the length of the segment. This scheme is depicted in Figure 5.16.
The offset d must lie between 0 and the segment limit/length, otherwise an addressing error is generated. For example, consider the situation shown in Figure 5.17.
This scheme is similar to the variable partition allocation method, with the improvement that the process is divided into parts. For fast retrieval the segment table can be kept in fast registers, as in the paging scheme; a segment-table length register (STLR) indicates the number of segments. The segments in a segmentation scheme correspond to logical divisions of the process and are defined by the programmer. To translate an address, first extract the segment number and offset from the logical address. Then use the segment number as an index into the segment table to obtain the segment base address and its limit/length. Also check that the offset is not greater than the limit given in the segment table. The final physical address is obtained by adding the offset to the base address.
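A minimal sketch of this translation, assuming an illustrative segment table of (base, limit) pairs, is given below.

    # Segment table: segment number -> (base, limit). The values are illustrative only.
    segment_table = {
        0: (1400, 1000),   # segment 0 starts at physical address 1400 and is 1000 long
        1: (6300,  400),
        2: (4300,  400),
    }

    def translate(segment, offset):
        base, limit = segment_table[segment]
        if offset >= limit:              # hardware checks the offset against the limit
            raise MemoryError("addressing error: offset %d out of segment %d"
                              % (offset, segment))
        return base + offset

    print(translate(2, 53))     # 4353
    print(translate(0, 999))    # 2399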
Sharing of segments can be done by making common entries in the segment tables of two different processes which point to the same physical location. Segmentation may suffer from external fragmentation, i.e., when blocks of free memory are not large enough to accommodate a segment. Storage compaction and coalescing can minimise this drawback.
ACTIVITY 5.4
5.8 Summary
• In this unit, we have learnt how the memory resource is managed and how processes are protected from each other.
• The previous two sections covered memory allocation techniques like swapping and overlays, which tackle the utilisation of memory. Paging and segmentation were presented as memory management schemes. Both have their own merits and demerits.
• We have also seen how paging is based on the physical form of a process and is independent of its programming structure, while segmentation is dependent on the logical structure of the process as viewed by the user.
• We have also considered the fragmentation problems (internal and external) and ways to tackle them in order to increase the level of multiprogramming and system efficiency. The concepts of relocation and compaction help to overcome external fragmentation.
KEY TERMS
REFERENCES
Deitel, H. M. (1984). An introduction to operating systems. Addison-Wesley Publishing Company.
Ritchie, C. (2003). Operating systems incorporating UNIX and Windows (4th ed.). New Delhi: BPB Publications.
Silberschatz, A., & Galvin, P. B. (1997). Operating system concepts (5th ed.). New Delhi: Wiley and Sons (Asia).
Tanenbaum, A. S., & Woodhull, A. S. (2009). Operating system design and implementation. UK: Pearson.
Unit 6 Virtual Memory
6.0 Introduction
In the earlier unit we studied memory management, covering topics like overlays, contiguous memory allocation, static and dynamic partitioned memory allocation, and paging and segmentation techniques. In this unit we will study an important aspect of memory management known as virtual memory.
Storage allocation has always been an important consideration in computer programming due to
the high cost of the main memory and the relative abundance and lower cost of secondary
storage. Program code and data required for execution of a process must reside in the main
memory but the main memory may not be large enough to accommodate the needs of an entire
process. Early computer programmers divided programs into sections that were transferred into the main memory as needed and replaced sections that were not needed at that time. In this early era of computing, the programmer was responsible for devising this overlay system.
As higher-level languages became popular for writing more complex programs and the
programmer became less familiar with the machine, the efficiency of complex programs suffered
from poor overlay systems. The problem of storage allocation became more complex.
Two theories for solving the problem of inefficient memory management emerged - static and
dynamic allocation. Static allocation assumes that the availability of memory resources and the
memory reference string of a program can be predicted. Dynamic allocation relies on memory
usage increasing and decreasing with actual program needs, not on predicting memory needs.
Programming objectives and machine advancements in the 1960s made the predictions required
for static allocation difficult, if not impossible. Therefore, the dynamic allocation solution was
generally accepted, but opinions about implementation were still divided. One group believed the programmer should continue to be responsible for storage allocation, which would be accomplished by system calls to allocate or deallocate memory. The second group supported automatic storage allocation performed by the operating system, because of the increasing complexity of storage allocation and the emerging importance of multiprogramming. In 1961, two solutions were proposed. The first was to provide a very large main memory to alleviate any need for storage allocation; this solution was not possible due to its very high cost. The second proposal is known as virtual memory.
The most common way of doing this is a technique called virtual memory, which has been known
since the 1960s but has become common on computer systems since the late 1980s. The virtual
memory scheme divides physical memory into blocks and allocates blocks to different processes.
Of course, in order to do this sensibly it is highly desirable to have a protection scheme that
restricts a process to be able to access only those blocks that are assigned to it. Such a protection
scheme is thus a necessary, and somewhat involved, aspect of any virtual memory
implementation. One other advantage of using virtual memory that may not be immediately apparent is that it often reduces the time taken to launch a program, since not all the program code and data need to be in physical memory before program execution can start. Although sharing the physical address space is a desirable end, it was not the sole reason that virtual memory became common on contemporary systems. Until the late 1980s, if a program became too large to fit in one piece in physical memory, it was the programmer's job to see that it fit. Programmers typically did this by breaking programs into pieces, each of which was mutually exclusive in its logic. When a program was launched, the main piece that initiated the execution would first be loaded into physical memory, and then the other parts, called overlays, would be loaded as needed.
It was the programmer’s task to ensure that the program never tried to access more physical
memory than was available on the machine, and also to ensure that the proper overlay was loaded
into physical memory whenever required. These responsibilities made for complex challenges for
programmers, who had to be able to divide their programs into logically separate fragments, and
specify a proper scheme to load the right fragment at the right time. Virtual memory came about
as a means to relieve programmers creating large pieces of software of the wearisome burden of designing overlays.
Virtual memory automatically manages two levels of the memory hierarchy, representing the main memory and the secondary storage, in a manner that is invisible to the program that is running.
virtual address space. A mechanism called relocation allows for the same program to run in any
location in physical memory, as well. Prior to the use of virtual memory, it was common for
machines to include a relocation register just for that purpose. Resolving relocation in software instead of hardware would be an expensive and messy solution, as it would increase the running times of programs significantly, among other things.
Virtual memory enables a program to ignore the physical location of any desired block of its address space; a process can simply seek to access any block of its address space without concern for
where that block might be located. If the block happens to be located in the memory, access is
carried out smoothly and quickly; else, the virtual memory has to bring the block in from
secondary storage and allow it to be accessed by the program.
The technique of virtual memory is similar, to a degree, to the use of processor caches. However, the differences lie in the block size of virtual memory being typically much larger (64 kilobytes and up) as compared with the typical processor cache (128 bytes and up). The hit time, the miss penalty (the time taken to retrieve an item that is not in the cache or primary storage), and the transfer time are all larger in the case of virtual memory. However, the miss rate is typically much smaller. (This is no accident: since a secondary storage device, typically a magnetic storage device with much lower access speeds, has to be read in case of a miss, designers of virtual memory make every effort to reduce the miss rate to a level even much lower than that allowed in processor caches.)
Virtual memory systems are of two basic kinds: those using fixed-size blocks called pages, and those that use variable-size blocks called segments. Suppose, for example, that a main memory of
64 megabytes is required but only 32 megabytes is actually available. To create the illusion of the
larger memory space, the memory manager would divide the required space into units called
pages and store the contents of these pages in mass storage. A typical page size is no more than
four kilobytes. As different pages are actually required in main memory, the memory manager
would exchange them for pages that are no longer required, and thus the other software units
could execute as though there were actually 64 megabytes of main memory in the machine. In
brief we can say that virtual memory is a technique that allows the execution of processes that
may not be completely in memory. One major advantage of this scheme is that the program can
be larger than physical memory. Virtual memory can be implemented via demand paging and
demand segmentation.
To facilitate copying virtual memory into real memory, the operating system divides virtual memory into pages, each of which contains a fixed number of addresses. Each page is stored on disk until it is needed. When the page is needed, the operating system copies it from disk to main memory, translating the virtual addresses into real addresses.
Addresses generated by programs are virtual addresses. The actual memory cells have physical
addresses. A piece of hardware called a memory management unit (MMU) translates virtual
addresses to physical addresses at run-time. The process of translating virtual addresses into real
addresses is called mapping. The copying of virtual pages from disk to main memory is known as
paging or swapping.
Some physical memory is used to keep a list of references to the most recently accessed
information on an I/O (input/output) device, such as the hard disk. The optimisation it provides is
that it is faster to read the information from physical memory than use the relevant I/O channel to
get that information. This is called caching. It is implemented inside the OS.
Before going into the details of the management of the virtual memory, let us see the functions of
the virtual memory manager. It is responsible to:
(a) Make portions of the logical address space resident in physical RAM;
(b) Make portions of the logical address space immovable in physical RAM;
(c) Map logical to physical addresses; and
(d) Defer execution of application-defined interrupt code until a safe time.
Before considering the methods that various operating systems use to support virtual memory, it is useful to consider an abstract model that is not cluttered by too much detail.
As the processor executes a program it reads an instruction from memory and decodes it. In decoding the instruction it may need to fetch or store the contents of memory locations holding operands. The processor then executes the instruction and moves on to the next instruction in the program. In this way the processor is always accessing memory, either to fetch instructions or to fetch and store data.
In a virtual memory system all of these addresses are virtual addresses, not physical addresses. These virtual addresses are converted into physical addresses by the processor based on information held in a set of tables maintained by the operating system.
To make this translation easier, virtual and physical memory are divided into small blocks called pages. These pages are all of the same size. (It is not necessary that all the pages be the same size, but if they were not, the system would be very hard to administer.) Linux on Alpha AXP systems uses 8 Kbyte pages and on Intel x86 systems it uses 4 Kbyte pages. Each of these pages is given a unique number: the page frame number (PFN), as shown in Figure 6.1.
In this paged model, a virtual address is composed of two parts: an offset and a virtual page frame number. If the page size is 4 Kbytes, bits 11:0 of the virtual address contain the offset and bits 12 and above are the virtual page frame number. Each time the processor encounters a virtual address it must extract the offset and the virtual page frame number. The processor must translate the virtual page frame number into a physical one and then access the location at the correct offset into that physical page. To do this the processor uses page tables.
Figure 6.1 shows the virtual address spaces of two processes, process X and process Y, each with its own page table. These page tables map each process's virtual pages onto physical pages in memory. The figure shows that process X's virtual page frame number 0 is mapped into memory at physical page frame number 1 and that process Y's virtual page frame number 1 is mapped into physical page frame number 4. Each entry in this theoretical page table contains a valid flag, the physical page frame number and access control information.
The page table is accessed using the virtual page frame number as an offset. Virtual page frame number 5, for example, would be the sixth element of the table (0 is the first element).
To translate a virtual address into a physical one, the processor must first work out the virtual address's page frame number and the offset within that virtual page. By making the page size a power of 2 this can be done easily by masking and shifting. Looking again at Figure 6.1 and assuming a page size of 0x2000 bytes (which is decimal 8192) and an address of 0x2194 in process Y's virtual address space, the processor would translate that address into offset 0x194 within virtual page frame number 1.
The processor uses the virtual page frame number as an index into the process's page table to retrieve its page table entry. If the page table entry at that offset is valid, the processor takes the physical page frame number from this entry. If the entry is invalid, the process has accessed a non-existent area of its virtual memory. In this case the processor cannot resolve the address and must pass control to the operating system so that it can fix things up.
Just how the processor notifies the operating system that a process has attempted to access a virtual address for which there is no valid translation is specific to the processor. However the processor delivers it, this is known as a page fault, and the operating system is notified of the faulting virtual address and the reason for the page fault. A page fault is serviced in a number of steps.
Assuming that this is a valid page table entry, the processor takes the physical page frame number and multiplies it by the page size to get the address of the base of the page in physical memory. Finally, the processor adds in the offset to reach the instruction or data that it needs.
Using the above example again, process Y's virtual page frame number 1 is mapped to physical page frame number 4, which starts at 0x8000 (4 × 0x2000). Adding in the 0x194 byte offset gives us a final physical address of 0x8194. By mapping virtual to physical addresses in this way, the virtual memory can be mapped into the system's physical pages in any order.
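The worked example above can be reproduced with a small sketch; the page table below contains only the single mapping mentioned in the text (virtual PFN 1 to physical PFN 4) and is otherwise hypothetical.

    PAGE_SIZE = 0x2000                       # 8192 bytes
    # Process Y's page table from the example: virtual PFN -> physical PFN.
    # An absent entry models an invalid (non-resident) page.
    page_table_Y = {1: 4}

    def translate(virtual_address, page_table):
        vpfn = virtual_address // PAGE_SIZE
        offset = virtual_address % PAGE_SIZE
        if vpfn not in page_table:
            raise LookupError("page fault at virtual address 0x%X" % virtual_address)
        return page_table[vpfn] * PAGE_SIZE + offset

    print(hex(translate(0x2194, page_table_Y)))   # 0x8194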
Access to each page is controlled by the operating system; valid/invalid bits are also used to allow or disallow access to the corresponding page. This is shown in Figure 6.2.
The paging scheme supports the possibility of sharing common program code. For example, consider a system that supports 40 users, each of whom executes a text editor. If the text editor consists of 30 KB of code and 5 KB of data, we would need 1400 KB. If the code is reentrant (non-self-modifying code), it could be shared as presented in Figure 6.3.
Figure 6.3: Paging scheme supporting the sharing of program code
Only one copy of the editor needs to be stored in the physical memory. In each page table, the included editor pages are mapped onto the same physical copy of the editor, but the data pages are mapped onto different frames. So, to support 40 users, we only need one copy of the editor, i.e., 30 KB, plus 40 copies of the 5 KB of data pages per user; the total required space is now 230 KB instead of 1400 KB.
Other heavily used programs, such as assemblers, compilers, database systems, etc., can also be shared among different users. The only condition is that the code must be re-entrant. It is crucial for the correct functionality of the shared paging scheme that the pages remain unchanged: if one user changed a location, it would be changed for all other users.
ACTIVITY 6.1
1. Explain what is meant by virtual memory.
2. List four functions of the virtual memory manager.
3. List eight steps of a page fault.
Physical memory is divided into fixed-size blocks called frames. Logical memory is also divided into blocks of the same fixed size, called pages. When a program is to be executed, its pages are loaded from the disk into any available memory frames. The disk is also divided into fixed-sized blocks that are the same size as the memory frames.
A very important aspect of paging is the clear separation between the user's view of memory and the actual physical memory. Normally, a user believes that memory is one contiguous space containing only his/her program. In fact, the logical memory is scattered through the physical memory, which also contains other programs. Thus, the user can work correctly with his/her own view of memory because of the address translation or address mapping. The address mapping, which is controlled by the operating system and is transparent to users, translates logical memory addresses into physical addresses.
Because the operating system is managing the memory, it must be aware of the state of physical memory: for example, which frames are available, which are allocated, how many total frames there are, and so on. All these parameters are kept in a data structure called the frame table, which has one entry for each physical frame of memory, indicating whether it is free or allocated and, if allocated, to which page of which process.
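As a rough sketch of such a frame table (the field names and frame count below are made up for illustration, not the layout any particular OS uses):

    # A toy frame table: one entry per physical frame, recording whether the frame
    # is free and, if allocated, which page of which process occupies it.
    NUM_FRAMES = 8
    frame_table = [{"free": True, "process": None, "page": None} for _ in range(NUM_FRAMES)]

    def allocate_frame(process, page):
        for frame_no, entry in enumerate(frame_table):
            if entry["free"]:
                entry.update(free=False, process=process, page=page)
                return frame_no
        return None          # no free frame: a page must be replaced first

    print(allocate_frame("X", 0))   # 0
    print(allocate_frame("Y", 3))   # 1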
As there is much less physical memory than virtual memory, the operating system must be careful that it does not use the physical memory inefficiently. One way to save physical memory is to load only those virtual pages that are currently being used by the executing program. For example, a database program may be run to query a database. In this case not the entire database needs to be loaded into memory, just those data records that are being examined. Also, if the database query is a search query, then it is not necessary to load the parts of the program that handle other operations. This technique of loading virtual pages into memory only as they are accessed is known as demand paging.
When a process attempts to access a virtual address that is not currently in memory, the CPU cannot find a page table entry for the virtual page referenced. For example, in Figure 6.1 there is no entry in process X's page table for virtual PFN 2, so if process X attempts to read from an address within PFN 2 the CPU cannot translate the address into a physical one. At this point the CPU cannot cope and needs the operating system to fix things up. It notifies the operating system that a page fault has occurred, and the operating system makes the process wait whilst it fixes things up. The CPU must bring the appropriate page into memory from the image on disk. Disk access takes a long time, so the process must wait, and if there are other processes that could run the operating system will select one of them to run. The fetched page is written into a free physical page frame and an entry for the virtual PFN is added to the process's page table. The process then continues to run. This is known as demand paging and occurs not only when the system is busy but also when an image is first loaded into memory. This mechanism means that a process can execute an image that only partially resides in physical memory at any one time.
The valid/invalid bit of the page table entry for a page which is swapped in is set as valid; otherwise it is set as invalid, which will have no effect as long as the program never attempts to access that page. If all and only those pages actually needed are swapped in, the process will execute exactly as if all pages had been brought in.
If the process tries to access a page which was not swapped in, i.e., the valid/invalid bit of its page table entry is set to invalid, then a page fault trap will occur. Rather than indicating an ordinary "invalid address" error, this trap indicates the operating system's failure to bring a valid part of the program into memory at the right time, in its attempt to minimise swapping overhead.
In order to continue the execution of process, the operating system schedules a disk read
operation to bring the desired page into a newly allocated frame. After that, the corresponding
page table entry will be modified to indicate that the page is now in memory. Because the state
(program counter, registers etc.) of the interrupted process was saved when the page fault trap
occurred, the interrupted process can be restarted at the same place and state. As shown, it is
possible to execute programs even though parts of them are not (yet) in memory.
In the extreme case, a process could be started with none of its pages in memory. A page fault trap would occur with the first instruction; after this page was brought into memory, the process would continue to execute. Page fault traps would occur in this way until every page that was needed was in memory. This kind of paging is called pure demand paging. Pure demand paging says: "never bring a page into memory until it is required."
There are many approaches to the problem of deciding which page to replace, but the objective is the same for all: to select the page that will not be referenced again for the longest time. A few page replacement policies are described below.
Belady’s Anomaly
Normally, as the number of page frames increases, the number of page faults should decrease. However, for FIFO there are cases where this generalisation fails! This is called Belady's Anomaly. Notice that the optimal (OPT) policy never suffers from Belady's anomaly.
Example: Consider the reference string 1, 2, 3, 4, 5, 1, 2, 3, 4, 5.
[Table: frame contents under FIFO replacement for this reference string.]
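The behaviour can be checked with a short simulation. The sketch below counts FIFO page faults; it uses the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5, which is the classic illustration of Belady's anomaly (not necessarily the string from the original table): with 3 frames it produces 9 faults, but with 4 frames it produces 10.

    from collections import deque

    def fifo_faults(reference_string, num_frames):
        """Count page faults under FIFO replacement."""
        frames = deque()
        faults = 0
        for page in reference_string:
            if page not in frames:
                faults += 1
                if len(frames) == num_frames:
                    frames.popleft()          # evict the oldest resident page
                frames.append(page)
        return faults

    classic = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    print(fifo_faults(classic, 3))   # 9 faults
    print(fifo_faults(classic, 4))   # 10 faults - more frames, yet more faults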
The Least Frequently Used (LFU) replacement policy selects for replacement a page that has not been used often in the past. This policy keeps a count of the number of times each page is accessed. Pages with the lowest counts are replaced, while pages with higher counts remain in primary memory.
ACTIVITY 6.2
Explain five page replacement policies of virtual memory.
6.5 Thrashing
Thrashing occurs when a system spends more time processing page faults than executing transactions. While processing page faults is necessary in order to realise the benefits of virtual memory, thrashing has the opposite, negative effect on the system.
As the page fault rate increases, more transactions need servicing from the paging device. The queue at the paging device increases, resulting in increased service time for a page fault. While the transactions in the system are waiting for the paging device, CPU utilisation, system throughput and system response time decrease, resulting in below-optimal performance of the system.
Thrashing becomes a greater threat as the degree of multiprogramming of the system increases. The graph in Figure 6.4 shows that there is a degree of multiprogramming that is optimal for system performance. CPU utilisation reaches a maximum before a swift decline as the degree of multiprogramming increases and thrashing occurs in the over-extended system. This indicates that controlling the load on the system is important to avoid thrashing. In the system represented by the graph, it is important to maintain the multiprogramming degree that corresponds to the peak of the graph.
The selection of a replacement policy to implement virtual memory plays an important part in the elimination of the potential for thrashing. A policy based on the local mode will tend to limit the effect of thrashing: in local mode, a transaction will replace pages only from its assigned partition, so its need to access memory will not affect transactions using other partitions. If the other transactions have enough page frames in the partitions they occupy, they will continue to be processed efficiently.
A replacement policy based on the global mode is more likely to cause thrashing. Since all pages of memory are available to all transactions, a memory-intensive transaction may occupy a large portion of memory, making other transactions susceptible to page faults and resulting in a system that thrashes. To prevent thrashing we must provide a process with as many frames as it needs. There are two techniques for this: the Working-Set Model and the Page-Fault Rate.
Principle of Locality
Pages are not accessed randomly. At each instant of execution a program tends to use only a small set of pages. As the pages in the set change, the program is said to move from one phase to another. The principle of locality states that most references will be to the current small set of pages in use. Examples are:
(a) Instructions are fetched sequentially (except for branches) from the same page;
(b) Array processing usually proceeds sequentially through the array; and
(c) Functions repeatedly access variables in the top stack frame.
Ramification
If we have locality, we are unlikely to continually suffer page faults. If a page consists of 1000 instructions in a self-contained loop, we will fault only once (at most) to fetch all 1000 instructions.
Note: Exact computation of the working set of each process is difficult, but it can be estimated by using the reference bits maintained by the hardware to implement an aging mechanism for pages.
When loading a process for execution, certain pages can be pre-loaded. This prevents a process from having to "fault in" its working set. This may be only a rough guess at start-up, but can be quite accurate on swap-in.
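To make the idea of the working set concrete, the sketch below computes it from a reference string over a window of the last Δ references; the reference string and window size are hypothetical.

    def working_set(reference_string, t, delta):
        """Pages referenced in the window of the last `delta` references ending at time t."""
        start = max(0, t - delta + 1)
        return set(reference_string[start:t + 1])

    refs = [1, 2, 1, 3, 4, 4, 4, 2, 1, 5]    # hypothetical reference string
    print(working_set(refs, t=6, delta=4))   # {3, 4}
    print(working_set(refs, t=9, delta=4))   # {1, 2, 4, 5}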
Programs generally divide up their memory usage by function: some memory holds instructions, some static data, some dynamically allocated data, some execution frames. All of these memory types have different protection, growth and sharing requirements. In the monolithic memory allocation of classic paging, a process has a single uniform address space in which these differing requirements cannot be expressed separately.
Segmentation addresses this by providing multiple sharable, protectable, growable address spaces that processes can access.
In a pure segmentation architecture, segments are allocated like variable partitions, although the memory management hardware is involved in decoding addresses. Pure segmentation replaces the page identifier in the virtual address with a segment identifier, and finds the proper segment (not page) to which to apply the base/limit checks or explicit access control lists.
Of course, the segment name space must be carefully managed, and the OS must provide a method of doing this. The file system can come to the rescue here: a process can ask for a file to be mapped into a segment and have the OS return the segment register to use. This is known as memory-mapping files. It is slightly different from memory-mapping devices, because one file system abstraction (a segment) is providing an interface to another (a file). Memory-mapped files may be reflected back into the file system or not, and may be shared or not, at the process's discretion.
The biggest problem with segmentation is the same as with variable-sized real memory allocation: managing variable-sized partitions can be very inefficient, especially when the segments are large compared to physical memory. External fragmentation can easily result, finding a place for a large segment when it is loaded can be expensive, and swapping large segments (even when compaction is not required) can be costly.
Demand Segmentation
Demand segmentation applies the same idea as demand paging to segments.
If a segment is loaded, its base and limit are stored in the Segment Table Entry (STE) and the valid bit is set in the Page Table Entry (PTE). The PTE is accessed for each memory reference. If the segment is not loaded, the valid bit is unset; the base and limit, as well as the disk address of the segment, are stored in an OS table. A reference to a non-loaded segment generates a segment fault (analogous to a page fault). To load a segment, we must solve both the placement question and the replacement question (for demand paging, there is no placement question).
In segmented paging, we can treat the high-order bits of the virtual address as specifying a segment, and the low-order bits as specifying an offset within the segment. If we have tree-structured page tables in which we use k bits to select a child of the root, and we always start segments at some multiple of 2^(word size − k), then the top level of the tree looks very much like a segment table; segments are then provided by the paging structure itself rather than by some explicit, architecturally visible mechanism like segment registers. Basically all modern operating systems on page-based machines use segmented paging.
In MULTICS, there was a separate page table for each segment. The segment offset was interpreted as consisting of a page number and a page offset. The base address in the segment table entry is added to the segment offset and looked up in the page table in the normal way. Note that in a machine with pure segmentation, given a fast way to find base/bound pairs (e.g., segment registers), there is no need for a TLB. The practical difference once we go to paged segmentation lies in the user's programming model and in the addressing modes of the CPU. On a segmented architecture, the user generally specifies addresses using an effective address that includes a segment register specification. On a paged architecture, there are no segment registers. In practical terms, managing segment registers (loading them with appropriate values at appropriate times) is a bit of a nuisance to the assembly language programmer or compiler writer.
On the other hand, since it only takes a few bits to indicate a segment register, while the base address in the segment table entry can have many bits, segments provide a means of expanding the virtual address space beyond 2^(word size). We can't do this on a machine with 16-bit addresses, and it is beginning to get problematical on machines with 32-bit addresses: we certainly can't have a store in which every file is a segment accessed with ordinary loads and stores. Segmented architectures provide a way to get the effect we want (lots of logically separate segments that can grow without practical bound) without requiring that we buy into very large addresses. As 64-bit architectures become more common, it is possible that pure segmentation will regain an advantage over segmented paging: protection information is logically associated with a segment, and could perhaps be specified in the segment table and then left out of the page table. Unfortunately, protection bits are used for lots of purposes other than simply making sure you cannot write your code or execute your data.
ACTIVITY 6.3
1. What are the steps that are followed by the Operating System in order to handle the page
fault?
2. What is demand paging?
3. How can you implement the virtual memory?
4. When do the following occur?
5. What should be the features of the page swap algorithm?
6. What is a working set and what happens when the working set is very different from the
set of pages that are physically resident in memory?
6.8 Summary
With previous schemes, all the code and data of a program have to be in main memory when the program is running. With virtual memory, only some of the code and data have to be in main memory; the size of a program (including its data) can thus exceed the amount of available main memory.
There are two main approaches to virtual memory: paging and segmentation. Both approaches rely on the separation of the concepts of virtual address and physical address.
Addresses generated by programs are virtual addresses. The actual memory cells have physical addresses. A piece of hardware called a memory management unit (MMU) translates virtual addresses to physical addresses at run-time.
In this unit we have discussed the concept of virtual memory, its advantages, demand paging, demand segmentation, page replacement algorithms and combined systems.
KEY TERMS
REFERENCES
Deitel, H. M. (1984). An introduction to operating systems. Addison-Wesley Publishing Company.
Dhamdhere, D. M. (2006). Operating systems: A concept-based approach. New Delhi: Tata McGraw-Hill.
Godbole, A. S. (2005). Operating systems (2nd ed.). New Delhi: Tata McGraw-Hill.
Milenkovic, M. (2000). Operating systems: Concept and design. New York: McGraw-Hill International Education.
Silberschatz, A., & Galvin, P. B. (1997). Operating system concepts (5th ed.). New Delhi: Wiley and Sons (Asia).
Stallings, W. (2001). Operating systems (4th ed.). New Jersey: Prentice Hall.
Tanenbaum, A. S. (2003). Operating system design and implementation (2nd ed.). New Delhi: Prentice Hall of India Pvt. Ltd.
Definition: A file is a collection of related information defined by its creator. Commonly, files
represent programs (both source and object forms) and data. Data files may be numeric,
alphabetic or alphanumeric. Files may be free-form, such as text files, or may be rigidly
formatted. In general, a file is a sequence of bits, bytes, lines or records whose meaning is defined
by its creator and user.
File management is one of the most visible services of an operating system. Computers can store
information in several different physical forms among which magnetic tape, disk, and drum are
the most common forms. Each of these devices has its own characteristics and physical organisation.
Normally files are organised into directories to ease their use. When multiple users have access to files, it may be desirable to control by whom and in what ways files may be accessed. The operating system is responsible for the following activities in connection with file management:
(a) The creation and deletion of files;
(b) The creation and deletion of directories;
(c) The support of primitives for manipulating files and directories;
(d) The mapping of files onto disk storage; and
(e) The backup of files on stable (non-volatile) storage.
The most significant problem in an I/O system is the speed mismatch between I/O devices and the memory, and also the processor. This is because the I/O system involves both hardware and software support, and there is a large variation in the nature of I/O devices, so they cannot compete with the speed of the processor and memory.
A well-designed file management structure makes file access quick and allows files to be easily moved to a new machine. It also facilitates sharing of files and protection of non-public files. For security
and privacy, file system may also provide encryption and decryption capabilities. This makes
information accessible to the intended user only.
In this unit we will study the I/O and the file management techniques used by the operating
system in order to manage them efficiently.
The basic idea is to organise the I/O software as a series of layers, with the lower ones hiding the physical hardware and other complexities from the upper ones, which present a simple, regular interface to users. Based on this, the I/O software can be structured in the following four layers:
(a) Interrupt handlers;
(b) Device drivers;
(c) Device-independent operating system software; and
(d) User-level I/O software.
ACTIVITY 7.1
Describe the four layers of the I/O software structure.
In the case of single-buffered transfer, blocks are first read into a buffer and then moved to the user's work area. When the move is complete, the next block is read into the buffer and processed in parallel with the first block. This helps in minimising the speed mismatch between devices and the processor. It also allows process computation in parallel with input/output, as shown in Figure 7.2.
Double buffering is an improvement over this. A pair of buffers is used: blocks/records generated by a running process are initially stored in the first buffer until it is full. They are then transferred from this buffer to secondary storage. During this transfer the other blocks generated are deposited in the second buffer, and when this second buffer is also full and the first buffer's transfer is complete, transfer from the second buffer is initiated. This process of alternation between the buffers continues, which allows I/O to occur in parallel with a process's computation. The scheme increases complexity but yields improved performance, as shown in Figure 7.3.
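A much-simplified sketch of the alternation between the two buffers follows; the "disk" here is just a list standing in for secondary storage, the block size is arbitrary, and the overlap of transfer with computation (which would need asynchronous I/O) is not modelled.

    # Toy double buffering: records fill one buffer while the other full buffer is written out.
    BLOCK_SIZE = 4
    data = list(range(10))          # records generated by the running process
    disk = []                       # stands in for secondary storage

    buffers = [[], []]
    active = 0                      # index of the buffer currently being filled

    for record in data:
        buffers[active].append(record)
        if len(buffers[active]) == BLOCK_SIZE:
            disk.append(list(buffers[active]))   # transfer the full buffer
            buffers[active].clear()
            active = 1 - active                  # switch to the other buffer
    if buffers[active]:                          # flush whatever remains
        disk.append(list(buffers[active]))

    print(disk)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]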
Figure 7.4: Hard Disk with three platters
The disk shown in the Figure 7.4 has three platters and 6 recording surfaces (two on each platter).
A separate read head is provided for each surface. Although the disks are made of continuous
magnetic material, there is a limit to the density of information which can be stored on the disk.
The heads are controlled by a stepper motor which moves them in fixed-distance intervals across
each surface. i.e., there is a fixed number of tracks on each surface. The tracks on all the surfaces
are aligned, and the sum of all the tracks at a fixed distance from the edge of the disk is called a
cylinder. To make the disk access quicker, tracks are usually divided up into sectors - or fixed
size regions which lie along tracks.
When writing to a disk, data are written in units of a whole number of sectors. (In this respect,
they are similar to pages or frames in physical memory). On some disks, the sizes of sectors are
decided by the manufacturers in hardware. On the other systems (often microcomputers) it might
be chosen in software when the disk is prepared for use (formatting). Because the heads of the
disk move together on all surfaces, we can increase read-write efficiency by allocating blocks in
parallel across all surfaces. Thus, if a file is stored in consecutive blocks on a disk with n surfaces and n heads, n sectors can be read at a time without any head movement.
When a disk is supplied by a manufacturer, the physical properties of the disk (number of tracks,
number of heads, sectors per track, speed of revolution) are provided with the disk. An operating
system must be able to adjust to different types of disk. Clearly sectors per track is not a constant,
nor is the number of tracks. The numbers given are just a convention used to work out a
consistent set of addresses on a disk and may not have anything to do with the hard and fast
physical limits of the disk. To address any portion of a disk, we need a three component address
consisting of (surface, track and sector).
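Given the geometry, such a three-component address can be folded into a single block number; the sketch below assumes made-up geometry figures and numbers sectors from zero.

    # Illustrative geometry: these figures are made up, not taken from a real drive.
    SURFACES = 6              # recording surfaces / read-write heads
    TRACKS_PER_SURFACE = 1024
    SECTORS_PER_TRACK = 63

    def logical_block(surface, track, sector):
        """Map a (surface, track, sector) address to a single block number, counting from 0."""
        return (track * SURFACES + surface) * SECTORS_PER_TRACK + sector

    print(logical_block(surface=0, track=0, sector=0))    # 0
    print(logical_block(surface=2, track=10, sector=5))   # (10*6 + 2)*63 + 5 = 3911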
Some disk controllers simply perform head motor movements, while others contain small decision-making computers of their own. The most popular type of drive for larger personal computers and workstations is the SCSI drive. SCSI (pronounced "skuzzy", Small Computer System Interface) is a protocol which now exists in several variants: SCSI-1, SCSI-2 and SCSI-3. SCSI disks live on a data bus, which is a fast parallel
data link to the CPU and memory, rather like a very short network. Each drive coupled to the bus
identifies itself by a SCSI address and each SCSI controller can address up to seven units. If more
disks are required, a second controller must be added. SCSI is more efficient at multiple accesses
sharing than other disk types for microcomputers. In order to talk to a SCSI disk, an operating
system must have a SCSI device driver. This is a layer of software which translates disks requests
from the operating system’s abstract command-layer into the language of signals which the SCSI
controller understands.
On more intelligent drives, like the SCSI drives, the disk itself keeps a defect list which contains
a list of all bad sectors. A new disk from the manufacturer contains a starting list and this is
updated as time goes by if more defects occur. Formatting is a process by which the sectors of the disk are:
(a) (if necessary) created by setting out 'signposts' along the tracks; and
(b) labelled with an address, so that the disk controller knows when it has found the correct sector.
On simple disks used by microcomputers, formatting is done manually. On other types, like SCSI
drives, there is low-level formatting already on the disk when it comes from the manufacturer.
This is part of the SCSI protocol, in a sense. High level formatting on top of this is not necessary,
since an advanced enough file system will be able to manage the hardware sectors.
Data consistency is checked by writing to disk and reading back the result. If there is
disagreement, an error occurs. This procedure can best be implemented inside the hardware of the
disk - modern disk drives are small computers in their own right. Another cheaper way of
checking data consistency is to calculate a number for each sector, based on what data are in the
sector and store it in the sector. When the data are read back, the number is recalculated and if
there is disagreement then an error is signaled. This is called a cyclic redundancy check (CRC) or
error correcting code. Some device controllers are intelligent enough to be able to detect bad
sectors and move data to a spare 'good' sector if there is an error. Disk design is still a subject of
considerable research and disks are improving both in speed and reliability by leaps and bounds.
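The consistency check described above can be illustrated with a cyclic redundancy check over a sector's data; the sketch below uses Python's zlib.crc32 and a made-up sector payload.

    import zlib

    sector_data = b"example sector contents" * 20   # made-up sector payload

    stored_crc = zlib.crc32(sector_data)     # computed and stored when the sector is written

    # Later, when the sector is read back:
    read_back = sector_data                  # in a healthy read, the same bytes come back
    if zlib.crc32(read_back) != stored_crc:
        print("CRC mismatch: bad sector, signal an error")
    else:
        print("sector data consistent")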
ACTIVITY 7.2
1. Explain how a hard disk is composed of several physical disks stacked on top of each other
2. What are the purposes of formatting?
The seek time is the time required for the disk arm to move the head to the cylinder with the
desired sector. The rotational latency is the time required for the disk to rotate the desired sector
until it is under the read-write head. The disk bandwidth is the total number of bytes transferred
per unit time.
Both the access time and the bandwidth can be improved by scheduling disk I/O requests efficiently. Disk drives appear as large one-dimensional arrays of logical blocks to be transferred. Because disks are so heavily used, proper scheduling algorithms are required.
A scheduling policy should attempt to maximize throughput (defined as the number of requests
serviced per unit time) and also to minimize mean response time (i.e., average waiting time plus
service time). These scheduling algorithms are discussed below.
Assume, for example, that the disk head is initially at cylinder 50; it then moves to 91, then to 150 and so on, servicing requests in the order of arrival. The total head movement in this scheme is 610 cylinders, which makes the system slow because of the wild swings. Proper scheduling of the movement towards a particular direction could decrease this and further improve performance. The FCFS scheme is depicted in Figure 7.5.
and hence improves performance. Like SJF (Shortest Job First) for CPU scheduling, SSTF also
suffers from the starvation problem, because requests may arrive at any time. Suppose we have
requests in the disk queue for cylinders 18 and 150, and while servicing the 18-cylinder request
some other request close to it arrives; that request will be serviced next. This can continue,
making the request at cylinder 150 wait for a long time. Thus a continual stream of requests near
one another could arrive and keep a far-away request waiting indefinitely. This scheduling is
shown in Figure 7.6.
As the arm behaves like an elevator in a building, the SCAN algorithm is also sometimes known
as the elevator algorithm. The limitation of this scheme is that a few requests may have to wait a
long time because of the reversal of the head direction. For the same request queue, this
scheduling algorithm results in a total head movement of only 200 cylinders. Figure 7.7 shows
this scheme.
Figure 7.7: SCAN scheduling
Similar to the SCAN algorithm, C-SCAN also moves the head from one end to the other,
servicing all the requests in its way. The difference is that after the head reaches the end it
immediately returns to the beginning, skipping all the requests on the return trip. The servicing of
requests is thus done only along one path. This scheme gives a comparatively uniform wait time,
because the cylinders are treated like a circular list that wraps around from the last cylinder to the
first one.
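The head-movement totals quoted above can be reproduced with a short sketch. The Python fragment below is illustrative only; the request queue shown is an assumption, chosen so that FCFS yields the 610 cylinders and SCAN the 200 cylinders mentioned in the text (the actual queue appears in Figures 7.5 to 7.7).

def fcfs(start, requests):
    """Serve requests strictly in arrival order."""
    moves, pos = 0, start
    for r in requests:
        moves += abs(r - pos)
        pos = r
    return moves

def sstf(start, requests):
    """Always serve the pending request closest to the current head position."""
    pending, moves, pos = list(requests), 0, start
    while pending:
        nearest = min(pending, key=lambda r: abs(r - pos))
        moves += abs(nearest - pos)
        pos = nearest
        pending.remove(nearest)
    return moves

def scan(start, requests):
    """Elevator algorithm with the head initially moving toward cylinder 0:
    sweep down to cylinder 0, reverse, and sweep up until the highest request
    is served. (If nothing lies below the head, it simply sweeps upward.)"""
    return start + max(requests) if min(requests) < start else max(requests) - start

queue = [91, 150, 42, 130, 18, 140, 70, 60]               # assumed request queue
print(fcfs(50, queue), sstf(50, queue), scan(50, queue))  # 610 248 200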
The performance and choice of all these scheduling algorithms depend heavily on the number
and type of requests and on the nature of disk usage. The file allocation methods like contiguous,
linked or indexed, also affect the requests. For example, a contiguously allocated file will
generate nearby requests and hence reduce head movements whereas linked or indexed files may
generate requests from blocks that are scattered throughout the disk and hence increase the head
movements. While searching for files, the directories will be accessed frequently, so the location
of the directories, and of the data blocks within them, is also an important criterion. All these
peculiarities force the disk scheduling algorithms to be written as a separate module of the operating system,
so that these can easily be replaced. For heavily used disks the SCAN / LOOK algorithms are
well suited because they take care of the hardware and access requests in a reasonable order.
There is no real danger of starvation, especially in the C-SCAN case. The arrangement of data on
a disk plays an important role in deciding the efficiency of data-retrieval.
ACTIVITY 7.3
1. Explain five disk scheduling algorithms.
2. What is the difference between SCAN and LOOK scheduling?
7.6. RAID
Disks have high failure rates and hence there is the risk of loss of data and lots of downtime for
restoring and disk replacement. To improve disk usage many techniques have been implemented.
One such technology is RAID (Redundant Array of Inexpensive Disks). Its organisation is based
on disk striping (or interleaving), which uses a group of disks as one storage unit. Disk Striping is
a way of increasing the disk transfer rate up to a factor of N, by splitting files across N different
disks. Instead of saving all the data from a given file on one disk, it is split across many. Since the
N heads can now search independently, the speed of transfer is, in principle, increased manifold.
Logical disk data/blocks can be written on two or more separate physical disks which can further
transfer their sub-blocks in parallel. The total transfer rate of the system is directly proportional to the
number of disks. The larger the number of physical disks striped together, the larger the total
transfer rate of the system. Hence, the overall performance and disk accessing speed is also
enhanced. The enhanced version of this scheme is mirroring or shadowing. In this RAID
organisation a duplicate copy of each disk is kept. It is costly but a much faster and more reliable
approach. The disadvantage with disk striping is that, if one of the N disks becomes damaged,
then the data on all N disks is lost. Thus striping needs to be combined with a reliable form of
backup in order to be successful.
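A minimal sketch of block-level striping follows. The round-robin mapping of logical blocks to disks is one common layout and is an assumption here, not a description of any particular RAID controller.

def locate_block(logical_block: int, n_disks: int):
    """Map a logical block number to (disk index, block offset on that disk)
    under simple block-level striping across n_disks drives."""
    return logical_block % n_disks, logical_block // n_disks

# With 4 disks, logical blocks 0, 1, 2, 3 land on disks 0, 1, 2, 3 and can be
# transferred in parallel; logical block 4 is the second block on disk 0.
print(locate_block(4, 4))   # (0, 1)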
Another RAID scheme uses some disk space for holding parity blocks. Suppose three or more
disks are used; then one of the disks acts as the parity disk, whose blocks hold the parity of the
corresponding bit positions of the blocks on all the other disks. If some error occurs or one of the
disks develops a problem, all of its data bits can be reconstructed from the remaining disks and
the parity. This technique is known as disk striping with parity or block-interleaved parity; it adds
reliability while keeping the speed benefit of striping. But writing or updating any data on a disk
requires a corresponding recalculation and change in the parity block. To overcome this, the
parity blocks can be distributed over all the disks.
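The parity idea can be illustrated in a few lines of Python. The sketch below assumes a single parity block computed as the bytewise XOR of the data blocks; with it, any one missing block can be rebuilt from the surviving blocks.

def parity_block(blocks):
    """Compute the parity block as the bytewise XOR of the given blocks."""
    parity = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing block by XOR-ing parity with the survivors."""
    return parity_block(list(surviving_blocks) + [parity])

d0, d1, d2 = b"\x0f" * 4, b"\xf0" * 4, b"\xaa" * 4
p = parity_block([d0, d1, d2])
assert reconstruct([d0, d2], p) == d1   # d1 recovered after its disk fails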
ACTIVITY 7.4
1. Indicate the major characteristics which differentiate I/O devices.
2. Explain the term device independence. What is the role of devices in this context?
3. Describe the ways in which a device driver can be implemented.
4. What is the main advantage of the double buffering scheme over single buffering?
5. What are the key objectives of the I/O system?
The operating system allows users to define named objects called files which can hold
interrelated data, programs or any other thing that the user wants to store/save.
(d) The mapping of files onto disk; and
(e) Backup of files on stable storage media (non-volatile).
The coming sub-sections cover these details as viewed by the operating system.
7.10.1 Directories
A file directory is a group of files organised together. An entry within a directory refers to the file
or another directory. Hence, a tree structure/hierarchy can be formed. The directories are used to
group files belonging to different applications/users. Large-scale time sharing systems and
distributed systems store thousands of files and bulk of data. For this type of environment a file
system must be organised properly. A file system can be broken into partitions or volumes, which
provide separate areas within one disk, each treated as a separate storage device in which files and
directories reside. Thus directories enable files to be separated on the basis of users and user
applications, simplifying system management issues like backups, recovery, security,
integrity, the name-collision problem (file name clashes), housekeeping of files, etc.
The device directory records information like the name, location, size and type of all the files on
the partition. A root refers to the part of the disk from where the root directory begins; it points
to the user directories. The root directory is distinct from sub-directories in that it is in a fixed
position and of fixed size. The directory is thus like a symbol table that converts file names into
their corresponding directory entries. Typical operations performed on a directory or file system
include searching for a file, creating, deleting and renaming files, listing a directory and
traversing the file system.
The most common schemes for describing logical directory structure are:
(i) Single-level Directory
All the files are inside the same directory, which is simple and easy to understand; but the
limitation is that all files must have unique names. Also, even with a single user, as the
number of files increases it becomes difficult to remember and track the names of all the files.
This hierarchy is depicted in Figure 7.8.
Figure 7.8: Single-level directory
(ii) Two-level Directory
We can overcome the limitations of the previous scheme by creating a separate directory
for each user, called the User File Directory (UFD). Initially, when the user logs in, the
system's Master File Directory (MFD) is searched; it is indexed by username/account and
holds the UFD reference for that user. Thus different users may have the same file names,
but within each UFD they must be unique. This resolves the name-collision problem to some
extent, but this directory structure isolates one user from another, which is not always
desired when users need to share or cooperate on some task. Figure 7.9 shows
this scheme clearly.
(iii)Tree-structured Directory
The two-level directory structure is like a 2-level tree. Thus to generalise, we can extend
the directory structure to a tree of arbitrary height. Thus the user can create his/her own
directory and subdirectories and can also organise files. One bit in each directory entry
defines entry as a file (0) or as a subdirectory (1).
The tree has a root directory, and every file in it has a unique path name (the path from the root,
through all subdirectories, to the specified file). The pathname, prefixed to a filename, helps to
reach the required file by traversing from a base directory. Pathnames can be of two types:
absolute and relative. An absolute path name begins at the root and follows a path down to the
particular file; it is a full pathname starting from the root directory. A relative path name defines
a path from the current directory. For example, if we assume that the current directory is
/Hello/Hello2, then the file /Hello/Hello2/Test2/F4.doc can be referred to by the relative
pathname Test2/F4.doc. Pathnames simplify the searching of a tree-structured directory
hierarchy. Figure 7.10 shows the layout:
Figure 7.10: Tree-structured directory
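As a small illustration of the example above, the following sketch shows how an absolute pathname and a relative pathname (interpreted from the current directory) resolve to the same file; the resolve function is hypothetical, not part of any real file system interface.

def resolve(pathname: str, current_dir: str) -> str:
    """Turn a pathname into an absolute path: absolute names start at the
    root ('/'), relative names are interpreted from the current directory."""
    if pathname.startswith("/"):
        return pathname
    return current_dir.rstrip("/") + "/" + pathname

print(resolve("/Hello/Hello2/Test2/F4.doc", "/Hello/Hello2"))  # absolute pathname
print(resolve("Test2/F4.doc", "/Hello/Hello2"))                # relative pathname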
(iv) Acyclic-Graph Directory
As the name suggests, this scheme is a graph with no cycles allowed in it. It adds the
concept of a shared common subdirectory/file, which exists in the file system in two
(or more) places at once. Note that merely keeping two copies of a file is not the same
thing: a change made in one copy is not reflected in the other copy.
The limitations of this approach are the difficulty of traversing the entire file system,
because of multiple absolute path names, and the presence of dangling pointers to files
that have already been deleted. We can overcome the latter by preserving the file until all
references to it are deleted: every time a link or a copy is made, an entry is added to the
file-reference list. In reality, since such a list would be too lengthy, only a count of the
number of references is kept. This count is incremented when a reference to the file is
added and decremented when one is deleted.
(v) General Graph Directory
Unlike the previous scheme, a general graph directory allows cycles. When cycles exist,
the reference count may be non-zero even when the directory or file is no longer
referenced. In such a situation garbage collection is useful: the first pass traverses the
whole file system and marks the accessible entries only; the second pass then collects
everything that is unmarked onto a free-space list. This is depicted in Figure 7.12.
SELF-CHECK 7.4
ACTIVITY 7.5
Explain five schemes for describing logical directory structure.
(a) Continuous
This is also known as contiguous allocation, as each file in this scheme occupies a set of
contiguous blocks on the disk. A linear ordering of disk addresses is seen on the disk. It was
used in VM/CMS, an old interactive system. The advantage of this approach is that
successive logical records are physically adjacent and require no head movement, so disk
seek time is minimal and access to records is fast. Also, this scheme is relatively simple
to implement. The technique in which the operating system provides units of file space on
demand to running user processes is known as dynamic allocation of disk space. Generally,
space is allocated in units of a fixed size, called an allocation unit or 'cluster' in MS-DOS.
A cluster is a simple multiple of the physical sector size of the disk, usually 512 bytes. Disk space
can be considered as a one-dimensional array of data stores, each store being a cluster. A
large cluster size reduces the potential for fragmentation, but increases the likelihood that a
cluster will have unused space. In other words, using clusters larger than one sector reduces
fragmentation at the cost of unused areas on the disk.
Contiguous allocation merely records the disk address of the first block (the start of the file)
and the length of the file (in block units). If a file is n blocks long and begins at location b
(in blocks), then it occupies blocks b, b+1, b+2, ..., b+n-1. First-fit and best-fit strategies can
be used to select a free hole from the available ones. The major problem here is searching
for sufficient space for a new file. Figure 7.13 depicts this scheme.
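Because the blocks of a file are adjacent, locating any logical block under contiguous allocation is simple arithmetic, as the following sketch illustrates (the function and the example figures are illustrative only).

def physical_block(start_block: int, length: int, logical_block: int) -> int:
    """Contiguous allocation: a file starting at block b with n blocks
    occupies blocks b, b+1, ..., b+n-1, so lookup is simple arithmetic."""
    if not 0 <= logical_block < length:
        raise ValueError("logical block outside the file")
    return start_block + logical_block

# A file recorded in the directory as (start=14, length=3) occupies blocks 14, 15, 16.
print(physical_block(14, 3, 2))   # 16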
relocation is not required. The disadvantage is that it is potentially inefficient for
direct-access files, since to reach a given block the pointers must be followed from one
block to the next. It can be used effectively for sequential access only, and even then it may
generate long seeks between blocks. Another issue is the extra storage space required for
the pointers. There is also a reliability problem, due to the possible loss or damage of any
pointer. The use of doubly linked lists could be a solution to this problem, but it would add
more overhead for each file. A doubly linked list does, however, facilitate searching, as the
blocks are threaded both forward and backward. Figure 7.14 depicts linked/chained
allocation, where each block contains the information about the next block (i.e., a pointer to
the next block).
MS-DOS and OS/2 use another variation on the linked list called the FAT (File Allocation
Table). The beginning of each partition contains a table having one entry for each disk block,
indexed by the block number. The directory entry contains the block number of
the first block of the file. The table entry indexed by a block number contains the block number
of the next block in the file. The table entry of the last block in the file holds an end-of-file
(EOF) value; the chain is followed until this EOF entry is encountered. We still
have to traverse the next pointers linearly, but at least we don't have to go to the disk for each
of them. A zero (0) table value indicates an unused block, so allocation of free blocks with the
FAT scheme is straightforward: just search for the first entry with a 0 table pointer. This
scheme is depicted in Figure 7.15.
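Traversing a FAT chain can be sketched in a few lines. The table below is hypothetical, and -1 is used here merely as a stand-in for the EOF table value (real FATs use a reserved bit pattern).

EOF = -1   # stand-in for the end-of-file table value

def file_blocks(fat, first_block):
    """Follow the FAT chain from the directory's first-block entry,
    yielding every disk block of the file until the EOF entry."""
    block = first_block
    while block != EOF:
        yield block
        block = fat[block]

# Hypothetical FAT: a file starts at block 4 and continues 4 -> 7 -> 2.
fat = {4: 7, 7: 2, 2: EOF}
print(list(file_blocks(fat, 4)))   # [4, 7, 2]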
Figure 7.15: File-allocation table (FAT)
(ii) Indexed Allocation
In this scheme each file has its own index block. Each entry of the index points to a disk
block containing actual file data; i.e., the index keeps an array of block addresses. The i-th
entry in the index block points to the i-th block of the file. The main directory contains the
address of the index block on the disk. Initially, all the pointers in the index block are
set to NIL. The advantage of this scheme is that it supports both sequential and random
access. The searching may take place within the index blocks themselves. The index blocks
may be kept close together in secondary storage to minimise seek time. Also, space is wasted
only on the index, which is not very large, and there is no external fragmentation. A
limitation of this scheme is handling a file that grows larger than the predicted size
(index-block overflow); insertions can also require complete reconstruction of the index
blocks. The indexed allocation scheme is diagrammatically shown in Figure 7.16.
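A sketch of the lookup path under indexed allocation follows; the index-block contents are hypothetical. Note that, unlike the FAT chain, the i-th block is found with a single table lookup.

def read_block(index_block, logical_block):
    """Indexed allocation: the i-th entry of the file's index block holds the
    disk address of the i-th data block; NIL (None) means not yet allocated."""
    addr = index_block[logical_block]
    if addr is None:
        raise ValueError("block not allocated")
    return addr

index = [9, 16, 1, 10, 25, None, None, None]   # hypothetical index block
print(read_block(index, 3))   # 10 -- direct access, no chain to follow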
Figure 7.16: Indexed Allocation on the Disk
The MS-DOS file system allocates storage in clusters, where a cluster is one or more contiguous
sectors. MS-DOS bases the cluster size on the size of the partition. As a file is written on the disk,
the file system allocates the appropriate number of clusters to store the file's data. For the purpose
of isolating special areas of the disk, most operating systems allow the disk surface to be divided
into partitions. A partition (also called a cylinder group) is just that: a group of cylinders which lie
next to each other. By defining partitions we divide up the storage of data into special areas, for
convenience. Each partition is assigned a separate logical device, and each device can only write
to the cylinders which are defined as being its own. To access the disk the computer needs to
convert the physical disk geometry (the number of cylinders on the disk, the number of heads per
cylinder, and the number of sectors per track) to a logical configuration that is compatible with the
operating system. This conversion is called translation. Since sector translation works between the
disk itself and the system BIOS or firmware, the operating system is unaware of the actual
characteristics of the disk, as long as the number of cylinders, heads, and sectors per track the
computer needs is within the range supported by the disk. MS-DOS presents disk devices as
logical volumes that are associated with a drive code (A, B, C and so on) and have a volume name
(optional), a root directory, and from zero to many additional directories and files.
a) Online services
Most operating systems provide interactive facilities to enable on-line users to work with
files. A few of these facilities are built-in commands of the system, while others are provided
by separate utility programs. But in basic operating systems like MS-DOS, with limited
security provisions, the powers given to the user can be potentially risky, so these facilities
should be used by technical support staff or experienced users only. For example, the
DEL *.* command can erase all the files in the current directory, and FORMAT C: can erase
the entire contents of the mentioned drive/disk. Many such services provided by the operating
system related to directory operations are listed below:
b) Programming services
The complexity of the file services offered by the operating system varies from one operating
system to another, but there is a basic set of operations: open (make the file ready for
processing), close (make a file unavailable for processing), read (input data from the file),
write (output data to the file) and seek (select a position in the file for data transfer).
All these operations are used in the form of language syntax, procedures or built-in library
routines of the high-level language used, such as C, Pascal or Basic. More complicated file
operations supported by the operating system provide a wider range of facilities/services.
These include facilities like reading and writing records, locating a record with respect to a
primary key value, etc.
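As an illustration of this basic set of operations, the following Python fragment opens, writes, seeks, reads and closes a file; the filename is arbitrary and the language is chosen only for brevity.

# Create a file, write to it, then reopen it and seek before reading.
f = open("example.dat", "wb")        # open: make the file ready for processing
f.write(b"operating systems")        # write: output data to the file
f.close()                            # close: file no longer available for I/O

f = open("example.dat", "rb")
f.seek(10)                           # seek: select a position for the transfer
print(f.read())                      # read: input data from that position on
f.close()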
In addition to the file functions described above, the operating system must also provide support
for directory operations such as:
These operations are not always implemented as built-in features of a high-level language, but
can be supplied as procedure libraries. For example, UNIX uses C as its system programming
language, so all system call requests are implemented as C procedures.
Each I/O request is handled by using a device-status table. This table holds an entry for each I/O
device indicating the device's type, address, and current status (busy or idle). When any I/O device
needs service, it interrupts the operating system. After determining which device interrupted, the
operating system checks its status and modifies the table entry to reflect the occurrence of the
interrupt. Control is then returned to the user process.
ACTIVITY 7.6
4. List a few file attributes.
5. Why is SCAN scheduling also called the elevator algorithm?
6. In an MS-DOS disk system, calculate the number of entries (i.e., number of clusters) required
in the FAT table. Assume the following parameters:
Disk Capacity - 40 Mbytes
Block Size - 512 Bytes
Blocks/Cluster - 4
7. Assuming a cluster size of 512 bytes, calculate the percentage wastage in file space due to
incomplete filling of the last cluster, for the file sizes below:
7.11. Summary
• This unit briefly describes the aspects of I/O and file management. We began by
looking at I/O controllers and organisation, and at how I/O buffering is effective in smoothing out the
speed mismatch between I/O rates and processor speed. We also looked at the four levels
of I/O software: the interrupt handlers, device drivers, the device-independent I/O
software, and the I/O libraries and user-level software.
• A well-designed file system should provide a user-friendly interface. The file system
generally includes access methods and file management issues like file integrity, storage,
sharing, security and protection, etc. We have discussed the services provided by the
operating system to the user to enable fast access and processing of files.
• The important concepts related to the file system are also introduced in this unit, like file
concepts, attributes, directories, tree structures, the root directory, pathnames, file services,
etc. Also, a number of techniques applied to improve disk system performance have been
discussed, and in summary these are: disk caching, disk scheduling algorithms (FCFS,
SSTF, SCAN, C-SCAN, LOOK, etc.), types of disk space management (contiguous and
non-contiguous, i.e. linked and indexed), disk address translation, and RAID, based on the
interleaving concept. Auxiliary storage management is also considered, as it is mainly
concerned with the allocation of space for files.
KEY TERMS
REFERENCES
Unit 8 Introduction to Networking
8.0. Introduction
Computer networks are essential components of modern computer systems. Though the
architecture of networks and the protocols used for communication are not directly related to
operating systems, we cover them for two reasons. The first is that distributed file systems, which
are important components of operating systems, depend on them. Much of their design is
governed by what standard network protocols do and do not provide. The second is that network
protocols are usually implemented in the operating system. Doing this well presents a number of
challenges to the operating-system designer. We first look at network protocols, concentrating on
the Internet’s TCP/IP. We then look at remote procedure calls, a notion introduced in Chapter 4.
Together these give us the foundations for the next chapter, on distributed file systems.
What exactly is a computer network? For our purposes, it’s a way to interconnect computers so
they can communicate data with one another. Examples range from a home network
interconnecting two or three computers to the Internet, interconnecting most of the computers on
earth (and perhaps beyond).
To better appreciate how networks work, let’s look at their components. We say that two
computers are directly connected if they can send each other data without involving any other
parties. They might be connected by cable or by radio. And a number of computers might be
directly connected, since they are connected to a broadcast medium: anything sent by one can be
received by all. We call such directly connected networks base networks. Base networks can be
combined to form larger composite networks. A computer that is attached to two separate base
networks can forward data from one to the other. Thus a computer on the first network can send
data to one on the second by sending it to the computer in the middle, which will forward it to the
intended recipient. We can extend this forwarding technique to allow communication among
computers on any number of base networks, as long as there is a forwarding path from one
network to another.
In general, a message going from one computer to another goes through a sequence of
intermediate base networks, each with a computer in common with the networks immediately
before and after it. Two approaches are commonly used to arrange this sort of communication.
One, known as circuit switching, is to set up this network sequence before communicating,
forming a path known as a virtual circuit. This is much like placing a telephone call. First you
dial the other party’s number; once connected, you can talk or send as much data over the
connection as you like. In ages past such a telephone connection really was made by an electric
circuit between the two parties. Today’s virtual circuit is implemented by arranging for each
computer in the path to forward messages on to the next one. The computers in such a setup are
called switches and are often specialized for this chore.
A real circuit has a definite bandwidth that’s constantly available to the two parties
communicating over it. A virtual circuit makes an approximation of this constant bandwidth by
reserving capacity on each of the switches and on the “wires” that connect them. Thus messages
sent over a virtual circuit have a high probability of reaching their destination. Furthermore, since
they all take the same path, they arrive at the destination in the order they were sent. This
reserved capacity is both a feature and a problem: a feature because you can be sure that the
capacity is always available; a problem because you must pay for it even when you’re not using
it.
The other approach is known as packet switching: data is divided into pieces called packets that
are sent independently across the composite network. Each packet is tagged with the address of
its destination. The computers it passes through on its way to the destination are known as routers
(and, like switches, are often dedicated to this purpose). With packet switching, no path is set up
through the routers ahead of time. Instead, each router forwards packets on the basis of its latest
information on the best route to the destination. Since this information can change over time,
consecutive packets might take different routes to the destination and may well arrive out of
order. In general there is no reservation of capacity in the routers (particularly since it’s not
necessarily known ahead of time which routers will be used). The collective routers of the
network make a “best effort” to get each packet to its destination. Due to overloaded routers and
other problems, some packets may not make it to their destinations. So, unlike circuit switching,
with packet switching there is no guaranteed capacity but, on the other hand, you aren’t paying
for capacity you aren’t using.
The Internet is essentially a large version of the networks described above. In general, packet
switching is used for communicating data over the Internet, though some portions, typically
long-haul networks provided by phone companies, use circuit switching. Thus from the point of
view of messages being sent over it, the Internet is simply a large collection of routers. The
circuit-switched components are usually made to appear to the packet-switching world as point-
to-point links.
In this book we ignore the details of computing routes. However, this is where the Internet can no
longer be thought of as simply a large network. It consists of a number of networks, each with
separate administration. Each of these autonomous systems handles its own internal routing, using a
variety of routing protocols. Routing among autonomous systems is currently handled by a rather
complicated protocol known as BGP (border gateway protocol).
8.2.1 Network Protocols
We’ve defined networks as collections of interconnected base networks. But we need to look at
many more details to see how we communicate over them. In particular, we need to address the
following concerns:
• The base networks are not homogeneous; different sorts have different characteristics such as
the packet sizes that can be transmitted, the size of network addresses, bandwidth, and so forth.
• Routes through the network must be computed and utilized.
• Data passing through the network can be lost or reordered.
• Too much traffic can overwhelm the routers and switches.
These concerns and others were addressed by a committee operating under the auspices of the
International Organization for Standardization (known as ISO) in the 1970s, by defining a
network model consisting of seven layers, known as the Open Systems Interconnect (OSI) model
(Figure 8.1). Each layer is built on top of the next lower layer and provides certain types of
functionality. Protocols can then be designed and implemented to provide the functionality of a
particular layer. Here’s a brief description of the OSI model’s layers.
1. The physical layer corresponds to the “wire.” Concerns here have to do with
electromagnetic waves and the medium through which they are propagating.
2. The data link layer provides the means for putting data on the wire (and for taking it off).
An example is the Ethernet. Concerns here include how to represent bits as
electromagnetic waves. Data is represented as sequences of bits known as frames. If, as in
the Ethernet, the physical layer can be shared with potentially more than one other
computer, some means for sharing must be provided; doing this properly is known as
medium access control (MAC). The MAC address is used to indicate who should receive
a frame. Important parameters include the form of the MAC address and the maximum
and minimum frame sizes.
3. The network layer sees to it that the data travels to the intended destination (perhaps via a
number of intermediate points). It deals with data in units known as packets. Some notion
of a network address is needed here to identify other computers.
4. The transport layer is responsible for making sure that communication is reliable, i.e.,
that what’s sent is received unchanged.
FIGURE 8.1 ISO’s open systems interconnect (OSI) model.
5. The session layer builds on the reliable connection provided by the transport layer.
Among the services provided here can be dialog control, which indicates whose turn it is
to transmit, and synchronization, which tracks progress for error recovery. For example, if
the transport connection fails, a new one can be established under the same session as the
original.
6. The presentation layer deals with the representation of data. It copes with the different
ways machines represent basic data items (such as integers and floating-point numbers)
and provides a means for communicating more complicated data items, such as arrays and
structures.
7. The application layer is not necessarily where the application resides, but rather where
high-level software used by the application for network access resides. For example, the
HTTP protocol used for web browsing can be considered to reside here.
The bottom three layers (layers 1–3) are sometimes called the communications subnet. Data that
must pass through a number of machines on its way to the destination is forwarded by an
implementation of protocols in these lower layers on each intermediate machine. The distinctions
among the top three layers are in general pretty much ignored. Many applications use remote-
procedure-call and similar protocols that are built on top of the transport layer and incorporate all
the functionality of layers 5 through 7.
The OSI model is useful in helping us understand a number of networking issues, but as a model
it’s not strictly followed in practice, where “practice” means the Internet. The Internet’s model is
considerably simpler and more specific: while the OSI model was intended as the basis of any
number of network protocols, the Internet model was designed as a model for the Internet, period.
The OSI model can be considered an a priori model in the sense that it came first, with the idea
that protocols were to follow. With the Internet model, the reverse happened: first there were
protocols, then a model to describe them. This is an a posteriori model: the model came after the
protocols. The protocols used on the Internet are known as the Internet protocols (also called
TCP/IP). They don’t fit precisely into the OSI model (for example, there is no analog of the
session and presentation layers), but the rough correspondence is shown in Figure 8.2.
There was much speculation in the 1980s that protocols designed to fit in the OSI model not only
would be competitors of the Internet protocols, but would replace them. Today one hears very
little of the OSI protocols (though the OSI seven-layer terminology is much used); whatever
competition there was between the OSI protocols and the Internet protocols was definitely won
by the latter.
First, we say a few words about what’s called a protocol data unit (PDU). This is the information
sent as one unit by a protocol at a particular level. The PDU of IP, sitting at the network layer, is
known as a packet. In general, PDUs contain control information as well as data. This control
information is usually segregated from the data and placed at the beginning, where it’s called a
header; however, sometimes some or all of it may be at the end of the PDU, where it’s called a
trailer. What’s important is that the data portion of a PDU is the PDU of the next-higher-layer
protocol. In particular, the data portion of an IP packet is the PDU of the transport layer (where
the PDU is called a segment). Similarly, the network-layer packet is the data portion of the
datalink-layer PDU (a frame).
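The nesting of PDUs can be sketched as follows; the header strings are placeholders, not real header formats.

def encapsulate(data: bytes, header: bytes, trailer: bytes = b"") -> bytes:
    """Each layer's PDU is the next-higher layer's PDU wrapped in that
    layer's control information (a header, and sometimes a trailer)."""
    return header + data + trailer

app_data = b"GET / HTTP/1.1"
segment  = encapsulate(app_data, b"[TCP hdr]")            # transport-layer PDU
packet   = encapsulate(segment,  b"[IP hdr]")             # network-layer PDU
frame    = encapsulate(packet,   b"[Eth hdr]", b"[FCS]")  # data-link-layer frame
print(frame)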
IP forms packets from the segments given it by the transport protocol. Ordinarily IP simply takes
the segment as is and puts a header in the front (see Figure 8.3, where the transport layer’s
segment is called data). However, if the resulting packet would be too large for the data-link layer
to handle (for example, Ethernet’s maximum transfer unit (MTU) is 1500 bytes, meaning that its
frames cannot be larger than that), IP breaks the segment into some number of fragments, so that
each, when combined with IP and Ethernet headers, is no larger than the MTU, and transmits
each as a separate packet. When forwarding packets on a router, IP ordinarily simply takes them
and forwards them on to their destination. However, if the packet is too large for the outgoing
data-link layer, it again breaks the packet data into appropriately sized fragments and forwards
each separately. It’s the responsibility of the ultimate destination’s IP to reassemble the fragments
into the segment expected by the transport protocol. The fragment offset field of the IP header
(Figure 8.3) indicates the byte offset relative to the beginning of the original segment of the data
portion of a fragmented IP packet. The identification field identifies the original segment of
which this is a fragment.
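A sketch of the fragmentation arithmetic follows. It assumes a 20-byte IP header and uses the rule that fragment offsets are carried in 8-byte units, so every fragment except the last carries a multiple of 8 data bytes.

def fragment(segment_len: int, mtu: int, header_len: int = 20):
    """Split a transport segment into IP fragments whose size (data plus IP
    header) does not exceed the MTU; returns (offset, length) pairs."""
    max_data = ((mtu - header_len) // 8) * 8   # largest multiple of 8 that fits
    fragments, offset = [], 0
    while offset < segment_len:
        length = min(max_data, segment_len - offset)
        fragments.append((offset, length))
        offset += length
    return fragments

# A 3000-byte segment over Ethernet (MTU 1500, 20-byte IP header):
print(fragment(3000, 1500))   # [(0, 1480), (1480, 1480), (2960, 40)]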
IP’s primary job is forwarding: getting packets to their destination. Each packet contains the
source and destination addresses. Based on the destination address, IP either keeps the packet,
sending it up to the higher-level protocol, or forwards it on to the next hop in its route.
Determining the route is the tough part. Routing information is not maintained by IP, but supplied
either by a separate routing protocol or by an administrator. In either case, routes are stored in a
table for IP’s use. If the destination is another host on a directly connected network, IP forwards
it directly to that host. Otherwise it checks the routing table for either an entry containing a route
to the given address or, if that’s not present, an entry giving a route to the base network of the
destination. If neither is present, then there should be a default entry giving a route to a router
with more information. Ultimately this leads to one of a set of routers (originally called core
routers) without default entries in their tables, but with routes to all known base networks.
FIGURE 8.3 IP packet showing header.
Internet addresses are structured 32-bit values usually written in dot notation, in which each byte,
from most significant to least, is written in decimal and is separated from the next by a dot. For
example, 0x8094400a is written as 128.148.64.10. The original idea was that these addresses
were split into three fields: a class identifier, a network number (identifying a base network), and
a host number (Figure 8.4). Three classes of addresses, known as A, B, and C, were defined, each
providing a different split between the bits identifying the network and the bits identifying a host
on that network. Each network uses just one class of addresses — networks are referred to as
class-A, class-B, or class-C. A fourth class of addresses, class D, was also defined, as explained
below.
Class-A networks have 7 bits to identify the network and 24 to identify the host on that network.
Thus there could be up to 2⁷–2 class-A networks (the all-zeros and all-ones addresses are
special), each with 2²⁴–2 hosts. Class-B networks have 14 bits to identify the network and 16 bits
to identify the host. Class-C networks have 21 bits for the network and 8 bits for the host. Thus
there could be 2¹⁴–2 class-B networks, each with 2¹⁶–2 hosts, and 2²¹–2 class-C networks, each
with 2⁸–2 hosts. Class-D addresses are used for multicast: one-to-many communication.
Not counting the multicast addresses, this scheme allows a total of 2,113,658 networks and
3,189,604,356 hosts.
There are two problems with these address classes. The first is that the numbers are too big. The
other is that they’re too small. They’re too big because they require huge routing tables and too
small because there aren’t enough usable addresses. Since the core routers must have routes for
all networks, their routing tables potentially have 2,113,658 entries. What’s more, the routing
protocols require that the core routers periodically exchange their tables. The memory and
communication costs for this were prohibitive in the 1980s (though they are not that bad now).
Making the problem of not enough network addresses even worse was that a fair portion of the
class-A and class-B addresses were wasted. This was because few if any class-A networks had
anything close to 2²⁴–2 hosts on them, or could even contemplate having that many. Even with
class-B networks, 2¹⁶–2 hosts were far more than was reasonable.
FIGURE 8.4 Class-based Internet addresses.
To cope with these issues, the definition of “network” was changed. The network portion of an
address was reinterpreted as identifying not a base network, but a collection of networks, perhaps
all those of one institution. So, instead of holding routes to base networks, the core routers hold
routes to these aggregated networks. Depending on how you look at it, this either allows smaller
routing tables, since each entry is a route to a number of base networks, or allows more base
networks.
What didn’t happen was an increase in the size of a network address. Thus the bits needed to
distinguish the base network from the others within an aggregate network had to be taken from
the bits previously dedicated to identifying a host. What had been the host number was split in
two and now identifies both a base network (known as a subnet) and a host. Which bits are in
each portion are indicated by a subnet mask, which needs to be known only by the hosts on the
subnet. The beauty of this technique is that the core routers (and any other part of the Internet)
don’t need to know about the subnet mask of an aggregate network. Only when packets reach the
aggregate does the subnet become relevant.
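Applying a subnet mask is a simple bitwise operation, as the following sketch shows. The address is the example used earlier in this section; the mask 255.255.255.0 is an assumed example, not one taken from the text.

import ipaddress

# The subnet mask, known only within the aggregate network, splits what used
# to be the host field into a subnet part and a host part.
addr = ipaddress.ip_address("128.148.64.10")
mask = ipaddress.ip_address("255.255.255.0")          # assumed mask for the example

subnet = ipaddress.ip_address(int(addr) & int(mask))  # subnet (network) portion
host   = int(addr) & ~int(mask) & 0xFFFFFFFF          # host portion
print(subnet, host)   # 128.148.64.0 10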
Despite the improvements from subnetting, however, full routing tables were still too large and
too many network addresses were still being wasted. It became clear that the class-based network
addresses weren’t such a good idea. So, at least in assigning new addresses, the class-based
approach was eliminated and a new approach was introduced, classless Internet domain routing
(CIDR, pronounced “cider”), in which any number of subnets could be combined into a single
aggregate network. Thus, rather than a class-dependent fixed boundary between the network and
subnet/host portions of the address, the boundary could be anyplace. This was accomplished by
giving routers extra information: a small integer indicating how many of the most significant bits
are the network number.
So, for example, one might assign a company 32 consecutive class-C networks: 198.17.160.000
through 198.17.191.255. Instead of having 32 routing-table entries, these would be aggregated
and represented as the single entry 198.17.160.000/19, meaning that the 19 most significant bits
represent the (aggregate) network number. A subnet mask would still show which of the
remaining 13 bits indicate the subnet and which indicate the host, but, as before, this subnet mask
needs to be known only locally.
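The aggregate-route test can be sketched as a prefix match on the top bits of the address; the helper below and the two sample addresses are illustrative only.

def in_prefix(addr: str, prefix: str, length: int) -> bool:
    """True if the address falls within the aggregate network written
    prefix/length, i.e. its top `length` bits match the prefix's."""
    to_int = lambda a: int.from_bytes(bytes(int(x) for x in a.split(".")), "big")
    mask = ~((1 << (32 - length)) - 1) & 0xFFFFFFFF
    return (to_int(addr) & mask) == (to_int(prefix) & mask)

print(in_prefix("198.17.171.5", "198.17.160.0", 19))   # True  (inside the /19)
print(in_prefix("198.17.192.1", "198.17.160.0", 19))   # False (just outside it)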
IP version 6 promises to solve some of IP’s addressing problems (while perhaps adding others)
by using 128-bit addresses. If all are usable, there are enough for over 6×10²² unique addresses
per square foot of the earth’s surface, or over a billion unique addresses per cubic mile of a
sphere with radius equal to the mean distance from the Sun to Pluto. It probably won’t be
necessary for all addresses to be usable.
Despite the uncertainties of Internet communication, we’d like to make certain that data is
communicated acceptably well. What this means depends on the application. For our concerns
here — primarily distributed file systems — “acceptably well” means that data is communicated
reliably: nothing is lost, nothing is garbled, the receiver receives exactly what the sender sent. We
might also, of course, want communication to be extremely fast, but this is of secondary
importance. Alternatively, if the data being communicated were real-time video, absolute
reliability is of secondary importance to minimal delay — it is OK to lose packets here and there
as long as most of them are arriving at the receiver soon after they were sent. Attaining reliable
communication is pretty straightforward. The sender sends data to the receiver. The receiver
sends back an acknowledgment that it’s received the data. If the sender doesn’t get an
acknowledgment in a reasonable amount of time, it resends the data, and does so repeatedly until
it receives the acknowledgment. To make this work, we need a way to keep track of what’s been
sent and what’s been received. Note that, in general, communication is in both directions, so each
party is both a sender and a receiver.
The two primary Internet transport protocols are UDP (user datagram protocol) and TCP
(transmission control protocol). UDP provides no reliability other than a checksum on the data.
It’s useful for applications in which reliability is not important and for those that provide their
own implementation of reliability. TCP, on the other hand, does provide reliable communication.
Furthermore, it copes with network congestion problems. Both TCP and UDP augment the
network addresses in their headers with 16-bit port numbers indicating which entity on the
sending machine sent the segment and which entity on the receiving machine is to receive it. It
would be convenient to have some sort of permanent assignment of port numbers to applications,
but 16 bits isn’t a whole lot, so, except for the first 1024, these port numbers are assigned
dynamically. TCP keeps track of what has been sent as well as what has been successfully
received by using sequence numbers. Each byte of data, as well as instances of certain control
information, is numbered with consecutive 32-bit sequence numbers. Data is grouped into
segments, each containing the sequence number of the first byte of data (or an instance of control
information) and the number of data bytes and control information instances (though the latter is
always 1). The receiver responds with a segment containing an acknowledgment sequence
number that indicates that all bytes and control information with sequence numbers less than it
have been successfully received.
Since there are only 2³² sequence numbers, however, what happens when we use them all up? It
takes a little less than two and a half hours at 4 million bits/second (the speed my ISP promises
me over broadband cable) to transmit 2³² bytes. Clearly we must allow sequence numbers to wrap
around: the sequence number immediately after 2³²–1 must be 0. But this presents a problem if
different segments take different routes over the Internet. For example, we might transmit a
segment whose data starts with sequence number 1000. For some reason this segment gets lost.
The sender, after not receiving an acknowledgement in due time, retransmits the segment. It then
sends just under 2³² more bytes of data and, around two and a half hours later, it again sends a
segment whose data starts with sequence number 1000. At around this time the original segment
finds its way to the destination. How is the receiver to distinguish this late-arriving segment from
the most recently transmitted one?
It can’t. The only way to deal with this problem is to make sure it cannot happen. Built into the
Internet is what’s called the maximum segment lifetime (MSL): the maximum time a segment can
exist on the Internet before being discarded. MSL is set (pretty much by fiat) at two minutes. It’s
not strictly enforced — there really isn’t a mechanism for doing so — but in practice it’s rarely
exceeded. Thus, at a communication bandwidth of four megabits/second, this “duplicate
sequence number” problem won’t happen. However, at 100 megabits/second (“Fast Ethernet”
speed), 2³² bytes are sent in less than six minutes, and at a gigabit/second, 2³² bytes are sent in
less than 35 seconds. To cope with these speeds, the protocol can be augmented by effectively
increasing the number of sequence-number bits through time stamps. For our purposes, we
assume bandwidths are small enough that sequence-number wraparound can’t happen in less than
some reasonable multiple of the maximum segment lifetime. However, there is still another
problem, known as session reincarnation. Which sequence number is used to start a connection?
Making it always zero is another source of duplicate sequence numbers. Suppose we start
transmitting segments and then for some reason quit while some of the segments are still in
transit. Then we start again, “reincarnating” the session.
If we start with the same sequence number as we started with previously, the late-arriving
segments from the previous session may be mistaken for segments belonging to the current
session. So, when starting a new session, we need to choose an initial sequence number that we
know is not in a previously transmitted segment that’s still out on the Internet. It might seem
reasonable to keep track of the last sequence number used and use one greater than that for the
next connection to a particular destination. But this would mean keeping track of sequence
numbers of all recently terminated connections. This probably wouldn’t be all that unreasonable
today, but was considered so in 1981. It also wouldn’t completely solve the problem, as
explained below.
What was done instead was to guess the maximum speed at which sequence numbers are being
consumed and then assign initial sequence numbers assuming that any previous connection ran at
that speed. The speed chosen was 250,000 bytes/second. So, for example, a connection starting at
time 0 was given an initial sequence number of 0. If that connection terminated for some reason
and a new connection was made 1000 seconds after the first one started, since the old connection
must have transmitted less than 250,000,000 bytes, the new one was safely given an initial
sequence number of 250,000,000. This approach is known as using an initial sequence-number
(ISN) generator. Of course, if the actual communication speed was greater than 250,000
bytes/second, or if the ISN generator wrapped around completely (it had a period of 4.55 hours)
while a connection was still active, there still might be a duplicate sequence-number problem.
However, in 1981, communication that fast and that long-lasting didn’t happen, so this approach
worked.
There were other problems, though. If a system crashed and then restarted, the ISN generator
would be restarted as well. One might think that the generator could be based on a time-of-day
clock that survives crashes, but this has problems as well. So, the suggestion was to wait an MSL
after rebooting before starting up TCP. In practice, machines took far longer than an MSL to
reboot, so such a delay was not needed. A more serious problem has to do with sequence-
number-guessing attacks — see RFC 1948. Let’s look at how TCP actually works. Figure 8.5
shows a typical TCP segment, including its header. There’s no need for IP addresses, since these
are in the IP header; but the header does contain the sending and receiving port numbers. The
flags field contains various control bits. Two of these bits — the SYN and FIN flags — appear in
the sequence number space and thus are sent reliably if set. If the ACK flag is set, then the
acknowledgment sequence number field contains an acknowledgment as described above: all data
bytes and control bits with sequence numbers less than this have been received. The RST (reset)
bit, used to indicate something has gone wrong, generally means that the connection is being
unilaterally terminated. The PSH (push) bit indicates that the segment should be passed to the
application without delay. The URG (urgent data) bit indicates that the urgent pointer field is
meaningful. If the URG bit is set, then the urgent-pointer field contains the sequence number of
the end of “urgent data” that should be delivered to the application even before data with earlier
sequence numbers. (The urgent data begins at the beginning of the first segment with the URG
flag set.)
The checksum field is a checksum not only on the TCP header, but on the TCP data and on the
address portions of the IP header as well. If a receiver determines the checksum is bad, it discards
the packet and relies on the sender to retransmit it. The options field contains variable-length
optional information that we won’t discuss. Finally, the window-size field indicates how much
buffer space the sender has to receive additional data. We discuss this in more detail below.
TCP’s actions are controlled by the rather elaborate state machine shown in Figure 8.6. Here the
edges between states are labelled with the event causing a state transition as well as the action
performed when changing states. An entity using TCP starts with its connection in the closed
state. If it’s a server, it performs what’s called a passive open and goes to the listen state, where
it’s ready to receive connections from clients. It’s the client that actually initiates a connection. It
starts a “three-way handshake” in which both parties come up with a suitable initial sequence
number and reliably communicate it to the other. The client performs an active open, in which it
sends a special synchronize segment to the server and goes to the syn-sent state. This synchronize
segment has the SYN bit set in the header’s flags and contains the client’s initial sequence
number — thus the SYN bit itself has this number. When the synchronize segment reaches the
server, it responds with its own synchronize segment and goes to the syn-received state. Its
synchronize segment has the SYN and ACK bits set in the flags, contains the server’s initial
sequence number, and acknowledges the client’s initial sequence number (that is, its
acknowledgment-sequence-number field contains a value that’s one greater than the client’s
initial sequence number). When this reaches the client, the client goes to the established state and
responds by sending back a segment that acknowledges receipt by having the ACK bit set in the
flags and one greater than the server’s initial sequence number in the acknowledgment-sequence-
number field. Finally, on receipt of the acknowledgment segment, the server goes to the
established state.
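The sequence numbers exchanged during the three-way handshake can be summarised in a short sketch; the two initial sequence numbers below are arbitrary examples.

# Hypothetical initial sequence numbers for the two sides.
client_isn, server_isn = 1000, 5000

# 1. Client -> Server: SYN, seq = client_isn               (client: syn-sent)
# 2. Server -> Client: SYN+ACK, seq = server_isn,
#                      ack = client_isn + 1                (server: syn-received)
# 3. Client -> Server: ACK, ack = server_isn + 1           (client: established)
# On receiving message 3 the server also enters the established state.
handshake = [
    ("client", {"SYN"},        client_isn, None),
    ("server", {"SYN", "ACK"}, server_isn, client_isn + 1),
    ("client", {"ACK"},        None,       server_isn + 1),
]
for sender, flags, seq, ack in handshake:
    print(f"{sender}: flags={flags}, seq={seq}, ack={ack}")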
Once in the established state, both parties may send segments containing data. We discuss this
below, but first let’s continue with the state machine by looking at how a connection is
terminated. One party in a TCP connection shouldn’t terminate a connection unilaterally, because
it doesn’t know if the other party has more to send. Thus safe termination is done by a very polite
sequence of messages that go essentially like this: “I have no more data to send to you, but I will
happily receive more data from you.” “Thank you for saying that, I do indeed have more data for
you.” “I now have no more data for you, please let me know when you’ve received everything
I’ve sent.” “I have received everything you’ve sent; it’s been nice communicating with you.”
In terms of the state machine, here is what happens: One party (it could be either client or server)
receives a close request from its application, meaning that it has no more data to send.
This party goes to the fin-wait-1 state and sends a finish segment to the other, which is a segment
in which the FIN bit is set in the header flags. This bit is sent reliably and hence is given the next
sequence number. On receipt of the segment, the other party goes to the close-wait state and
sends back an acknowledgment. On receipt of the acknowledgment, the first party enters the fin-
wait-2 state. At this point communication is one-way — from the second party to the first — and
can continue indefinitely. At some point the second party receives a close request from its
application. It sends a finish segment to the first party and goes to the last-ack state. The first
party, on receipt of the finish segment, sends an acknowledgment back to the second party. At
this point we have a problem.
Suppose this acknowledgement is lost — that is, suppose it never makes it to the second party.
After a while, the second party will time out (how long this takes to happen we discuss below)
and, assuming that its finish segment never made it to the first party, will retransmit the segment.
If the first party, after sending the acknowledgment, disappeared (i.e., closed its connection and
removed all trace of it), the retransmitted finish segment will be rejected at the first party’s site,
since there is no trace of the connection it was terminating. The first party’s site will respond with
a reset segment, i.e., one with the RST bit set, which tells the second party that there’s a problem.
This is reflected to the application as an error, giving it reason to believe that the first party may
have crashed before receiving all the data that was sent.
So we see that the first party shouldn’t go away — it should keep its side of the connection active
long enough for it to receive a retransmitted acknowledgment. How long should it wait? One
approach might be for the second party to acknowledge receiving the acknowledgment, but then
that would have to be acknowledged as well, and so forth, resulting in an endless sequence of
acknowledgments. Instead, let’s rely on the maximum segment lifetime (MSL) again. Now the
first party waits long enough for the second party to time out waiting for the acknowledgement
and retransmit the finish segment. The mandated waiting time is twice the MSL — four minutes.
It’s not perfect — there is no perfect solution — but it works well in practice. However, since
connections are identified by the addresses and port numbers of their endpoints, this solution does
mean that a new connection from the first party to the second can’t be created until the first party
goes into the closed state, four minutes after it receives the finish segment from the second party.
We now turn our attention to reliability. Some segments might not make it to the destination.
Others might be delayed, and thus be received out of order. Thus the sender must pay attention to
acknowledgments and retransmit segments that appear to be lost. And the receiver must hold on
to early-arriving segments, delaying their delivery to the application until they can be delivered in
the correct order. We discuss below how long the sender should wait. The receiver must reserve
buffer space to hold incoming segments, both those arriving in order and those arriving out of
order, until they are consumed by the application.
FIGURE 8.7 TCP receive window.
Since the receiver can reserve only a limited amount of buffer space and it would waste network
bandwidth for the sender to transmit segments that the receiver has no room for, we need a means
for the receiver to tell the sender how much buffer space it has. Such a means is known as flow
control.
Through the window-size field of the header, each side of a TCP connection tells the other how
much buffer space it has. For example, one side might indicate that it has 10,000 bytes reserved.
If the other side has sent 8000 bytes of data that have yet to be acknowledged, then it knows that
really only 2000 bytes are available, so it must not send any more than that. Let’s look at things
from the receiver’s point of view by describing its receive window (see Figure 8.7, where the
shaded regions represent received data that have not yet been consumed by the application). Let’s
say the receiver has 10,000 bytes of buffer space for incoming data.
It’s already received and acknowledged 15,000 data bytes with sequence numbers from 20,001
through 35,000. Moreover, the application has consumed the first 12,000 bytes of these. Thus
3000 bytes of data in its buffer have been acknowledged but not consumed — it must hold on to
this data until the application consumes it. It’s also received data with sequence numbers 35,001
through 36,000 and 37,001 through 38,000 that it hasn’t yet acknowledged (and, of course, it must
not acknowledge data in the latter range until the data in the range 36,001 through 37,000
arrives). Thus there are an additional 2000 bytes of data in its buffer, leaving room for 5000 more
bytes. If it receives new data with sequence numbers in the range 20,001 through 36,000 or in the
range 37,001 through 38,000, it may assume these are duplicates and discard them. However,
new data in the ranges 36,001 through 37,000 and 38,001 through 42,000 are not duplicates and
must be stored in the buffer. If it receives anything whose sequence number is greater than
42,000, it must discard it because it doesn’t have room for it.
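To make the arithmetic concrete, the following is a minimal sketch in C of the receiver's bookkeeping for this example; the variable names are ours, not those of any actual TCP implementation.

#include <stdio.h>

/* A sketch of the receive-window arithmetic in the example above. */
int main(void)
{
    unsigned buffer_size  = 10000;  /* total receive buffer                 */
    unsigned last_acked   = 35000;  /* highest sequence number acknowledged */
    unsigned acked_unread = 3000;   /* acknowledged but not yet consumed    */
    unsigned out_of_order = 2000;   /* buffered but not yet acknowledgeable */

    unsigned window = buffer_size - acked_unread;        /* space beyond last_acked: 7000   */
    unsigned highest_acceptable = last_acked + window;   /* 42,000: discard anything above  */
    unsigned free_space = window - out_of_order;         /* 5000 bytes of buffer still free */

    printf("window=%u highest acceptable=%u free=%u\n",
           window, highest_acceptable, free_space);
    return 0;
}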
Now for the sender’s point of view — we look at its send window (see Figure 8.8). It has sent out
and received acknowledgments for bytes with sequence numbers from 20,001 through 34,000.
The sender has also sent out but hasn’t received acknowledgments for data bytes with sequence
numbers from 34,001 through 39,000. It must retain a copy of these data bytes just in case it has
to resend them. The most recent segment the sender has received from the other side indicates the
receive window is 8000 bytes. However, since it knows 5000 bytes’ worth of data have been sent
but not acknowledged, it knows that it really can send no more than 3,000 additional bytes of data
— up through sequence number 42,000.
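The sender's side of the same calculation might be sketched like this (again, the names are purely illustrative):

#include <stdio.h>

int main(void)
{
    unsigned advertised_window = 8000;   /* from the most recent segment received */
    unsigned sent_unacked      = 5000;   /* bytes 34,001 through 39,000           */
    unsigned last_sent         = 39000;

    unsigned usable = advertised_window - sent_unacked;   /* 3000 more bytes may be sent */
    printf("may send %u bytes, up through sequence number %u\n",
           usable, last_sent + usable);                    /* ... up through 42,000       */
    return 0;
}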
If the sender doesn’t receive an acknowledgment for a range of data bytes, it must resend them.
How long should it wait before deciding to resend? This retransmission timeout period (RTO)
must be chosen with care. If it’s too short, data is retransmitted needlessly and thus makes for
unnecessary network congestion. But if the RTO is too long, then there are needless delays in
getting the data to the receiving application.
The approach suggested in the TCP specification RFC 793 is to keep track of the average time
between sending a segment and getting its acknowledgment — this is known as the smoothed
roundtrip time (SRTT). If an acknowledgement hasn’t come after a period significantly longer
than SRTT, then the corresponding segment is retransmitted. What’s “significantly longer”? The
specification suggests an initial period of four times the average deviation of the roundtrip time.
How long should the sender wait for an acknowledgment after retransmitting a segment? If
there’s still no acknowledgment even after waiting the RTO period after the retransmission, it
could well be that some problem in the communication path has made the original RTO too short.
So an exponential backoff approach is used: the sender doubles the RTO after sending the first
retransmission. If it times out again, it doubles the RTO again and continues to double as needed
up to some maximum value before deciding that the connection is dead and giving up.
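The timeout logic just described might be sketched as follows. The smoothing gains and the factor of four are the values commonly quoted for this calculation, and everything here should be read as an illustration rather than as the code of any particular TCP implementation.

/* Illustrative RTO estimation and exponential backoff; names and constants are ours. */
static double srtt = 0.0;      /* smoothed round-trip time, in seconds            */
static double rttvar = 0.0;    /* smoothed deviation of the round-trip time       */
static double rto = 1.0;       /* current retransmission timeout                  */

#define RTO_MAX 60.0           /* stop doubling here; eventually give up entirely */

/* Called whenever an acknowledgment yields a round-trip measurement rtt. */
void on_rtt_sample(double rtt)
{
    if (srtt == 0.0) {                       /* first measurement */
        srtt = rtt;
        rttvar = rtt / 2;
    } else {
        double err = rtt - srtt;
        srtt += 0.125 * err;                                 /* weighted running average    */
        rttvar += 0.25 * ((err < 0 ? -err : err) - rttvar);  /* running average deviation   */
    }
    rto = srtt + 4 * rttvar;    /* "significantly longer" than the smoothed round-trip time */
}

/* Called when the timer expires without an acknowledgment having arrived. */
void on_retransmit_timeout(void)
{
    rto = 2 * rto;              /* exponential backoff */
    if (rto > RTO_MAX)
        rto = RTO_MAX;
    /* ... retransmit the oldest unacknowledged segment here ... */
}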
A more general question is: how fast should the sender be sending segments? It might seem
reasonable for it to transmit as fast as its directly attached network will accept them. But suppose
some router between it and the destination computer can’t handle the segments that quickly. This
router is likely to drop the packets it can’t handle right away — to simply get rid of them. The
sender, oblivious to the problems at this router, continues to transmit at top speed, but perhaps
most of these packets are being dropped. At some point the sending TCP times out waiting for
the acknowledgments and retransmits the dropped segments. But the router is still overloaded and
drops them again. Furthermore, retransmitting these dropped packets adds to the router’s
overload. Thus, though the sender’s transmission speed is high, real throughput is low — most of
the segments have to be retransmitted a number of times.
The problem is clearly that the sender is sending too quickly. Its retransmissions are adding to
network congestion and aggravating an already bad situation. Even though successive
retransmissions have an exponentially increasing interval between them, once an
acknowledgment is received, transmission reverts to the original speed and the problems continue
— the sender continues to be oblivious to network congestion. The only throttle on its speed is
the send window — it can’t transmit more data than what the receiver is prepared to receive.
What’s more, it’s not just one sender who’s doing this, but all senders — each is transmitting as
fast as it can, retransmitting again and again all the segments that were dropped by overloaded
routers. In the early days of the Internet, this was pretty much how things were done. The result
was a dramatically congested Internet and a clear need for improvement.
Such an improvement was devised by V. Jacobson (Jacobson 1988) in the slow-start procedure.
It’s useful to think of a network connection as being an oil pipeline. It has a finite capacity
that’s determined by its length and diameter. In a real pipeline, back pressure prevents us from
sending too much oil. What we need, then, is some analog to this back pressure to prevent
senders from sending data too quickly. As we’ve seen, the only indication the sender gets that
something’s wrong is a retransmission timeout, indicating that a segment probably did not reach
its destination. We’ve been inclined to think of these RTOs as caused by acts of God —
lightning, solar activity, or some other seemingly random event that has caused our segment to
disappear. But, in fact, segments normally disappear not because of acts of God but because of
acts of routers. So, let’s use RTOs as back pressure.
If we start transmitting at high speed, an RTO will tell us that we’re going too fast, but this comes at the
expense of having to retransmit a fair number of segments. Furthermore, we don’t get an
indication of how fast we should be transmitting, other than slower than we have been. The slow-
start approach involves carefully and deliberately discovering the current pipeline capacity. The
current estimate of this capacity is the congestion window, a quantity indicating how many
unacknowledged bytes may be in the pipeline. It starts with a value of one segment. Each time an
acknowledgment is received, indicating that the pipeline’s capacity hasn’t been exceeded, the
congestion window is increased by one segment. Though this is called slow-start, the rate of
increase is actually quick — as a congestion window’s worth of segments is acknowledged, the
congestion-window size doubles.
Of course, once the congestion-window size surpasses the pipeline’s capacity, some overloaded
router will drop a packet and eventually there will be a retransmission timeout. So the RTO
indicates that our transmission rate has exceeded the capacity of the pipeline. Halving the
congestion-window size reverts to a value giving us a transmission rate that’s probably somewhat
less than capacity, but not more. At this point slow-start increases the congestion window size
linearly rather than exponentially, dropping it again by a factor of two whenever there is an RTO.
One might think the result would be the saw-tooth pattern shown in Figure 8.9, but there are
additional factors. The congestion window gives us the capacity of the pipeline, but we still need
to know the rate of flow through it — how quickly may the sender put segments into it? A neat
trick known as ack clocking provides the answer. In a real pipeline, oil flows out at the same rate
it flows in. The sender does know how fast segments are being received by the receiver — each
acknowledgment indicates something came out at the other end. So, receipt of an
acknowledgment means that more may be transmitted and thus transmission speed is determined
by the rate at which acknowledgments are received.
FIGURE 8.9 Idealized congestion-window size vs. time using the slow-start approach.
FIGURE 8.10 Actual congestion-window size vs. time, taking into account the need to
reestablish ack clocking.
However, this strategy leads us to a small problem. Slow start involves increasing the congestion-
window size until there’s an RTO. But by the time the timeout occurs, all the segments that were
in the pipeline have been received, and thus there is no source of acknowledgment to drive ack
clocking. So, even though a reduced congestion-window size has been determined, the sender
must go through slow start once again to reestablish ack clocking — rather than increasing the
congestion-window size until it gets another RTO, it stops once it reaches half its previous value.
Thus the actual plot of congestion-window size is not as in Figure 8.9, but as in Figure 8.10.
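The congestion-window behaviour described above might be sketched roughly as follows; the names and the initial threshold are illustrative, and a real TCP tracks considerably more state.

/* Rough sketch of the congestion-window rules. */
static unsigned cwnd = 1;             /* congestion window, in segments                */
static unsigned ssthresh = 64;        /* where exponential growth becomes linear       */
static unsigned acks_this_window = 0;

void cwnd_on_ack(void)
{
    if (cwnd < ssthresh) {
        cwnd += 1;                    /* slow start: one segment per ack, so the
                                         window doubles every round trip               */
    } else {
        acks_this_window += 1;        /* linear phase: one segment per window's worth  */
        if (acks_this_window >= cwnd) {
            cwnd += 1;
            acks_this_window = 0;
        }
    }
}

void cwnd_on_timeout(void)
{
    ssthresh = cwnd / 2;              /* roughly half the capacity just exceeded       */
    cwnd = 1;                         /* redo slow start to re-establish ack clocking,
                                         stopping the exponential growth at ssthresh   */
}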
What’s needed is a way to determine that a segment has been dropped without waiting for an
RTO. A simple addition to the TCP protocol provides this. When a segment is received whose
sequence number is greater than expected, it’s pretty clear (to the receiver) that one or more
segments are late or lost. So the receiver repeats its most recent acknowledgment. When the
sender receives two identical acknowledgments in a row, the segments are clearly reaching the
receiver — something is inducing the acknowledgments — yet at least one segment didn’t make
it. If the sender receives three identical acknowledgments in a row, it’s really clear that a segment
was lost. So the sender can immediately retransmit that segment, halve the congestion window
size, yet still retain ack clocking.
The full procedure as just described is known as slow start with fast retransmit. It’s been in use
on the Internet since the early 1990s. Further improvements for even better congestion control
have been added over the years — see (Floyd 2001).
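The duplicate-acknowledgment rule can be sketched as a simple counter (illustrative names again):

/* Fast retransmit: identical acknowledgments in a row signal a lost segment. */
static unsigned last_ack = 0;
static unsigned identical_acks = 0;

void on_ack_arrival(unsigned ack_number)
{
    if (ack_number == last_ack) {
        identical_acks++;
    } else {
        last_ack = ack_number;
        identical_acks = 1;
    }
    if (identical_acks == 3) {
        /* three identical acknowledgments in a row: retransmit the segment that
           starts at ack_number, halve the congestion window, keep ack clocking */
    }
}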
Activity 8.1
5. Consider the size of the receive window used for one end of a TCP connection.
a. What problems are caused if it is too small?
b. Are there any problems if it is too large?
c. Two important parameters of a TCP connection are its bandwidth — the average
number of bytes/second transmitted over it, and its delay — the average time
required to receive an acknowledgment for a segment, measured from the moment
the segment was sent. In terms of these two quantities, what should be the size of
the receive window? Explain.
6. The window-size field of the TCP header is used by one end of a TCP connection to
inform the other of the size of its receive window. A common special case of TCP is
when data is being sent in only one direction, such as when a file is transferred. Let’s say
a file is being transferred from A to B.
a. In such a one-way transfer, all segments from B to A (containing
acknowledgments and window sizes) have the same sequence number, since no
data is being transferred in that direction. If such segments arrive out of order,
how can A determine which contains the most recent window size?
b. Suppose A’s send window is full: A may not send more data at the moment,
though it has data to send. At some point B’s receive window will open up
(because its application has consumed data). May B immediately send a segment
containing an updated window size? Explain.
c. A’s send window is still full. Are there potential problems if A repeats its most
recent transmission to B so as to solicit a response containing the most recent
window size? Explain.
d. How should A find out that B’s receive window is no longer full?
7. Suppose we have a 10 Gbit/second network with a round-trip time of 100 milliseconds;
the maximum segment size is 1500 bytes. How long will it take a TCP connection to
reach maximum speed after starting in slow-start mode?
8. Explain how ack clocking serves to help senders determine how fast to send data. Explain
how the modification to the TCP protocol described at the end of Section 8.1.1.2
(repeating the most recent acknowledgement) allows ack clocking to work even though
packets are being lost and thus cannot be acknowledged.
9. You’d like to transmit one hundred one-megabyte files from computer A to computer B
over the Internet. Assume that the latency from A to B is very small, and that the
bandwidth on the intermediate connection from A to B is quite large. Also assume the
routers on the path from A to B have a large amount of buffer space. Three possible
approaches are:
i) Use 100 separate TCP connections that are set up and used sequentially: the
first is made, the first file is completely transmitted, and then the connection is
torn down, then the second connection is set up and the second file is
transmitted, etc.
ii) Use 100 separate TCP connections, all set up and used in parallel (i.e., all 100
files are opened, 100 connections are made, and then all files are transmitted
concurrently).
iii) Send all 100 files sequentially, but over the same connection.
Which of the three approaches should be the fastest? Explain.
8.3. Remote procedure call
The process of putting data, either arguments or return values, into a packet to be transmitted is
called marshalling. The reverse, extracting the data from packets, is called unmarshalling.
In addition to the concerns about reliable data transmission discussed in the previous section,
RPC protocols must deal with passing data between machines that represent data differently, as
well as providing correct operation in spite of computer crashes and communication breakdowns.
Though a number of RPC protocols have been developed over the past few decades, two of
particular importance are Open Network Computing RPC (ONC RPC), developed by Sun
Microsystems in the 1980s, and what we call Microsoft RPC.
8.3.1 Marshalling
Marshalling and unmarshalling data are not as trivial as they might seem. Marshalled data,
produced from a data representation in a language system on one sort of machine, must be
unmarshalled into the representation of that data in a possibly different language system on a
possibly different sort of machine. This requires a detailed description of the data’s type,
including size — information that is not available in standard programming languages such as C.
To make this more concrete, consider the C declarations below:
typedef struct {
    int comp1;
    float comp2[6];
    char *annotation;
} value_t;

typedef struct list {
    value_t element;
    struct list *next;   /* next element in the list */
} list_t;

/* The three procedures of the database interface; the exact signatures are
   inferred from the description that follows. */
bool add(int key, value_t *element);
bool remove(int key, value_t *element);
list_t *query(int key);
We’ve declared a simple database interface with three procedures. From this declaration, we
would like to be able to create automatically a set of client and server stubs to do the appropriate
marshalling and unmarshalling for remote access to the database. But, as we will soon see, this
declaration does not contain enough information to do this. Here we rely on our knowledge of the
programmer’s intent to show how marshalling and unmarshalling are done; happily, both ONC
RPC and Microsoft RPC provide a separate language for describing remote procedural interfaces
so that stubs can be generated automatically.
Our database consists of a collection of elements of type value_t, each identified by an integer
key. The add routine adds a new element with a particular key, returning false or true depending
on whether or not an identical key/element was already there. The remove routine removes a
particular key/element combination. The query routine returns a list of all elements with a
particular key.
Figure 8.12 illustrates placing a call to add, and Figure 8.13 illustrates the return from add. To
marshal add’s arguments the client-side stub must put them into a form such that they can be
unmarshalled when they reach the server. The first argument, key, should be easy — it’s just an
integer. How should the stub marshal it?
There are two issues: how big the integer is and how it is represented. Though older versions of C
allowed the size of ints to depend on the architecture, modern C compilers set the size to 32 bits.
But the representation still depends on the architecture, which might be big-endian (the byte with
the lowest address contains the most significant bits) or little-endian (the byte with the lowest
address contains the least significant bits).
One approach to dealing with these two possibilities is to choose one, say big-endian, as the
standard and have the marshalling code in the client-side stub convert to that standard if
necessary. Then the unmarshalling code in the server-side stub knows the representation of the
incoming integer and converts, if necessary, to its architecture’s representation. This approach is
simple, though perhaps a bit inefficient if both sender and receiver are little-endian. It is what is
used in ONC RPC, however.
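For instance, a stub on a little-endian machine can convert an integer to the big-endian standard with the standard library routine htonl before copying it into the outgoing message, and the receiving stub can convert back with ntohl. The helpers below are only a sketch of this idea.

#include <arpa/inet.h>   /* htonl, ntohl */
#include <stdint.h>
#include <string.h>

/* Marshal a 32-bit integer into big-endian (network) order, and back again. */
size_t marshal_int32(char *buf, int32_t value)
{
    uint32_t net = htonl((uint32_t)value);
    memcpy(buf, &net, sizeof net);
    return sizeof net;
}

size_t unmarshal_int32(const char *buf, int32_t *value)
{
    uint32_t net;
    memcpy(&net, buf, sizeof net);
    *value = (int32_t)ntohl(net);      /* convert to the local representation */
    return sizeof net;
}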
An alternative approach is for the sender to marshal the data using its own representation, but
also to supply a code indicating what that representation is. Then the receiver can determine if
conversion is necessary. This approach is used in Microsoft RPC.
Marshalling the second argument in the add routine above is a bit more complicated, since it’s a
structured type. But this is handled simply by marshalling each of the components in turn. The
second component is itself a structured type, a fixed-length array. Since the length (six) is known
by both parties, each element is marshalled in turn. Each of the elements is a float, which has
similar issues to those involved in the int. The solution is the same: either use an agreed-upon
standard representation, or send it using the sender’s representation, along with a code indicating
what that representation is.
The third component of the structure, annotation, presents a problem: not enough information is
provided in the C declaration to marshal it. Though it’s probably clear to the human reader that
annotation is a null-terminated character string, the declaration allows other possibilities. For
example, it might be a pointer to a single character, in the same way as an int * would be a
pointer to a single integer. If the stub is to be produced automatically, more information about
annotation is needed. An interface language might have an explicit string type that could be used
to indicate unambiguously that annotation is a null-terminated string. In any event, to marshal
annotation, the stub must compute its length and send it along with the bytes of the string. Figure
8.14 shows the marshalled arguments of add (a standard data representation is assumed).
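Putting the pieces together, a hand-written client-side stub might marshal add's arguments roughly as follows. This is only a sketch: it assumes the agreed standard representation is big-endian IEEE 754 and that annotation is a null-terminated string, as discussed above.

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: marshal key and value into buf, returning the number of bytes used. */
size_t marshal_add_args(char *buf, int32_t key, const value_t *value)
{
    size_t off = 0;
    uint32_t net;

    net = htonl((uint32_t)key);                       /* the integer key        */
    memcpy(buf + off, &net, 4); off += 4;

    net = htonl((uint32_t)value->comp1);              /* comp1                  */
    memcpy(buf + off, &net, 4); off += 4;

    for (int i = 0; i < 6; i++) {                     /* comp2: six floats      */
        uint32_t bits;
        memcpy(&bits, &value->comp2[i], 4);           /* reinterpret IEEE bits  */
        net = htonl(bits);
        memcpy(buf + off, &net, 4); off += 4;
    }

    uint32_t len = (uint32_t)strlen(value->annotation);
    net = htonl(len);                                 /* the string's length    */
    memcpy(buf + off, &net, 4); off += 4;
    memcpy(buf + off, value->annotation, len);        /* then its bytes         */
    off += len;

    return off;
}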
Similarly, it expects the next component to be a fixed-length array of floats, and interprets the 24
bytes as that array. Finally, it expects a string and expects that the next four bytes will be the
length of that string. It then creates a null-terminated string from that number of remaining bytes.
The query routine from our database interface returns a linked list. While at first glance the
pointers in such a data structure might seem difficult to marshal, remember that the desired result
of unmarshalling the marshalled linked list is a copy of the original list. One approach to doing
this, used in Microsoft RPC, is to marshal the list as an array and marshal the pointers as array
indices, as in Figure 8.15.
The linked list in Figure 8.15 can be easily reconstructed by the receiver. It might either allocate
storage for all of it at once, converting the indices into pointers, or allocate storage for each
value_t separately, linking them together with pointers.
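The flattening step might be sketched as follows (purely illustrative): each node is copied into the next slot of an array and its pointer is replaced by the index of the slot holding the following node, with -1 marking the end of the list. A real stub would then marshal each slot in turn, handling annotation as before.

#include <stdint.h>

struct flat_node {
    value_t element;
    int32_t next_index;   /* index of the next element, or -1 at the end of the list */
};

int flatten_list(const list_t *head, struct flat_node *out, int max)
{
    int n = 0;
    for (const list_t *p = head; p != NULL && n < max; p = p->next, n++) {
        out[n].element = p->element;
        out[n].next_index = (p->next != NULL) ? n + 1 : -1;
    }
    return n;                 /* number of elements placed in the array */
}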
To achieve this reliability, it makes sense to layer RPC on top of a reliable transport protocol
such as TCP. Thus an RPC request is sent over a TCP connection, and the response comes back
over that connection. TCP handles all the details of ensuring that everything gets where it’s
supposed to and gets there in the correct order. As long as the TCP connection is operational, the
RPC exchanges running on top of it are perfectly reliable.
But what happens if we lose the TCP connection? This might happen because a temporary
communication problem makes the connection time out, or the server might crash, then restart.
Can we simply create a new TCP connection and resume the RPC activity? Consider an example
from the client’s point of view. It sends a request, but before it receives a response, perhaps even
before its TCP gets an acknowledgment, the connection times out. The client might successfully
create a new connection to the server, but should it retransmit its original request or not?
Maybe the original request was received by the server and acted upon, but either the server
crashed before the response could be transmitted, or a network problem terminated the
connection while the response was in transit. Or perhaps such a problem occurred before the
request arrived at the server in the first place. Thus the client is in a quandary: it doesn’t know
whether or not the server has acted on the original request (see Figure 8.17).
FIGURE 8.16 How RPC is supposed to work: requests are followed by responses.
FIGURE 8.17 Uncertainty when a request is not followed by the response: was the request lost
or the response lost?
There are two rather obvious things for the RPC client to do. One is to give up and admit to the
application that it doesn’t really know what happened. Thus, rather than guaranteeing that it
executes the remote procedure exactly once, it has to fall back on guaranteeing at most once.
Alternatively, the client portion of RPC might try again, and, if necessary, continue trying again,
until it finally gets a response from the server. This, of course, has the danger that the server
might end up executing the remote procedure more than once. Thus the reliability guarantee
drops from exactly once to at least once.
In many cases, this at-least-once semantics isn’t a problem. Some procedures are idempotent,
meaning that the effect of executing them twice in succession is no different from the effect of
executing them just once. Consider a request that writes data to a certain location in a file.
Performing it twice is inefficient, but the net result is just the same as doing it once (see Figure
8.18). But certainly not all procedures are idempotent — consider one that transfers money from
one account to another.
One possibility for getting around this uncertainty of whether or not the server has executed the
remote procedure is that the server keeps track of the requests it has executed. Of course, if it
receives a repeat of a request, it can’t simply ignore it, since the client is probably still waiting for
the response. So, the server might hold onto responses from past requests and resend a response if
it receives a retransmission of the request. This is the approach NFS uses with ONC RPC. But
note that it requires the server to hold state information in addition to what’s held by the transport
protocol, and also requires this state information to continue to exist even if the transport
connection is lost.
So, since the functionality of the reliable transport protocol must be augmented by RPC, why not
dispense with the transport protocol entirely, run RPC on an unreliable protocol, such as UDP,
and have RPC do all the work of providing reliability? This was the preferred approach in the
early days of RPC, where UDP was often used as the transport, and it is still used in some
situations today. For the next part of our discussion, we assume that UDP is the transport. This
allows us to see more clearly what the issues peculiar to RPC are. We then go back to TCP and
discover that, even though some RPC problems are less likely to occur on TCP than on UDP,
they are still possible.
One reason for using UDP rather than TCP is that, in principle, RPC can provide reliability more
efficiently on its own, because it can take advantage of its own simplicity. A simple exchange of
a request from a client followed by a response from the server in general requires four packets if
TCP is used: the request, followed by an acknowledgment from the server, and then the response,
followed by an acknowledgment from the client. But over UDP only two packets are needed,
because the acknowledgments are not necessary. Rather, the response serves as the
acknowledgment of the request; if the client is single-threaded, i.e., it has just one request in
progress at a time, then the client’s next request serves as the acknowledgment of the response.
However, the important case is the multithreaded client. For example, in most implementations of
NFS (see Chapter 10) the client is the operating system, acting on behalf of all threads on the
machine — many requests are in progress at once. A great deal of work has been done to make
the multithreaded case just as efficient (in terms of the number of messages exchanged) as the
single-threaded case. The focus of all this attention is the server’s response to the client’s request:
making its transmission reliable is the key to reliability of RPC.
So let’s fill in a few more details of an RPC protocol layered on top of UDP, in an environment in
which clients have multiple requests in progress at once. We follow the design of ONC RPC, as
used with NFS.
Requests and their matching responses are identified by unique integers called transmission IDs
(XIDs). It’s not necessary that they form a sequence, just as long as each request/response pair is
uniquely identified. A client’s task is simple. It sends a request. If it doesn’t get a response in due
time, it resends the request. If it receives a response for which it doesn’t have a matching,
unresponded-to request (which is thus probably a retransmission of a previous response), it
ignores it.
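The client's loop might be sketched as follows; send_request and recv_response are assumed wrappers around the socket calls, not part of any real RPC library.

#include <stdint.h>
#include <sys/types.h>

/* Assumed helpers: send_request transmits a request tagged with xid; recv_response
   blocks (with a timeout) and returns the number of bytes received, or -1 on
   timeout, filling in the xid carried by the response. */
void send_request(uint32_t xid, const char *req, size_t len);
ssize_t recv_response(uint32_t *xid, char *buf, size_t max);

int call_remote(uint32_t xid, const char *req, size_t req_len,
                char *resp, size_t resp_max)
{
    for (;;) {
        send_request(xid, req, req_len);                  /* (re)transmit the request      */
        for (;;) {
            uint32_t got;
            ssize_t n = recv_response(&got, resp, resp_max);
            if (n < 0)
                break;                                    /* timed out: resend the request */
            if (got == xid)
                return (int)n;                            /* the matching response         */
            /* a response with no matching outstanding request: ignore it */
        }
    }
}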
The server’s task is a bit more complicated. It differentiates between idempotent and non-
idempotent procedures, and maintains a cache, called the duplicate request cache (DRC), of
copies of responses to recently executed non-idempotent procedures. Responses stay in this cache
for a certain period of time, say a minute or so. When the server receives a request to execute an
idempotent procedure, it simply executes it and returns the response. Since the procedure is
idempotent, there’s no harm if the request is a duplicate. When it receives a request to execute a
non-idempotent procedure, it first checks if the request is in the cache, i.e., if the request is a
retransmission of one received earlier. If so, it doesn’t execute it again, but simply returns
(retransmits) the response it returned the first time. Otherwise it executes the request and sends
back (and saves a copy of) the response.
This design is simple and the protocol is efficient. Unfortunately, it doesn’t always work.
Suppose a request is sent by the client but gets delayed somewhere in the network. The client, not
receiving a response soon enough, retransmits the request. This time the request gets to the server
right away and is answered quickly. The client receives the response and then sends its next
request. This reaches the server, which handles it and sends back a response. At this point, the
first transmission of the client’s original request finally makes it through the network and arrives,
very late, at the server. If it’s an idempotent request, the server executes it, since it has no
information about whether it is a duplicate. If it’s a non-idempotent request, the server first
checks the DRC. But the request may have been delayed long enough so that the other instance of
it is no longer in the DRC. Thus the server executes it.
Does this cause any harm? It clearly is a problem in the non-idempotent case, but it’s also a
problem in the idempotent case. For example, suppose the original request was to write a current
account balance ($10,000) into the beginning of a file. The second request was to write the new
balance ($5) into the same location. Thus the final result should be a balance of $5. However, the
delayed first request overwrites the location with old data — $10,000. Particularly in
environments in which ONC RPC is commonly used, we often assume that the network and its
routers are well behaved and that such a scenario will not happen. But in general, on the other
hand, we have to assume that routers are conspiring against us to do whatever it takes to
invalidate the assumptions made about the network. Such malicious routers are known as
Byzantine routers.
Similar symptoms observed by (Juszczak 1989) were due not to Byzantine routers, but to a bug in
the implementation of the protocol in which the servers were, essentially, not thread-safe.
(Juszczak 1989) suggested a fix in which the headers of both idempotent and non-idempotent
requests (enough to identify duplicates) are cached in the duplicate request cache along with the
complete responses to non-idempotent requests. As Figure 8.19 shows, when the server receives
a request, it immediately checks if it is a duplicate of a request still in the cache. If not, the
request is performed. If so, and if the original is still in progress, the duplicate is discarded, i.e.,
the client must have timed out and retransmitted prematurely. Otherwise, the request was
executed, but evidently the client did not receive the response.
If the request succeeded the first time it was executed, but was non-idempotent, then, as in the
original approach, the original reply is sent back without re-executing the request. But, if the
original request was unsuccessful (and no doubt will still be unsuccessful if re-executed) or if it
was successful and idempotent, then it makes more sense to simply discard it without re-
executing it. The rationale, backed up by experience, is that the response really did make it to the
client (or is on its way), but that the client has prematurely retransmitted the request. Thus the
client will soon (if not already) receive the response and be happy. However, if the response
really was lost (which is considered unlikely), the client will repeatedly retransmit the request
until it does get a response. At some point the original request will have timed out of the cache,
and the request will be treated as a new request (and a response will be sent to the client).
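The decision logic of Figure 8.19 might be sketched as follows; every type and helper routine here is an assumption made for the sketch, not part of ONC RPC or of any actual NFS server.

#include <stdbool.h>
#include <stddef.h>

struct request;               /* an incoming RPC request                 */
struct reply;                 /* the reply produced by executing it      */

struct cache_entry {
    bool in_progress;         /* original request still being executed   */
    bool succeeded;           /* it completed successfully               */
    bool idempotent;          /* the procedure it named is idempotent    */
    struct reply *saved_reply;/* kept only for non-idempotent requests   */
};

struct cache_entry *drc_lookup(const struct request *r);   /* NULL if not a duplicate */
void drc_insert_in_progress(const struct request *r);
void drc_record_done(const struct request *r, const struct reply *rep);
struct reply *execute(const struct request *r);
void send_reply(const struct request *r, const struct reply *rep);

void handle_request(struct request *req)
{
    struct cache_entry *e = drc_lookup(req);

    if (e == NULL) {                          /* not a duplicate: a new request    */
        drc_insert_in_progress(req);
        struct reply *rep = execute(req);
        drc_record_done(req, rep);
        send_reply(req, rep);
    } else if (e->in_progress) {
        /* a premature retransmission: discard it, a reply is already coming */
    } else if (e->succeeded && !e->idempotent) {
        send_reply(req, e->saved_reply);      /* resend the original reply         */
    } else {
        /* failed, or idempotent and successful: discard the duplicate; the
           original reply is assumed to have reached (or be reaching) the client */
    }
}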
This fixed the implementation bug and made the bank-balance scenario outlined above less
likely, but did not prevent it completely. The protocol was still susceptible to Byzantine-router
problems — if the delayed request arrives after the other instance has been removed from the
cache — but, in the environments in which the protocol is used, such problems are exceedingly
rare. (We have more to say about this later in this section.)
A final issue is the effect of a crash. Client crashes are bad news for client applications, but are
not a real concern to RPC. On the other hand, a server crash is a concern. If a server crashes after
the client sends a request but before the response arrives, the client has the same uncertainty
mentioned at the beginning of this subsection about whether or not the request actually took place
on the server.
Furthermore, since any state information kept by the server in volatile storage is lost after a crash,
there is no direct way to find out.
Let’s now bring back TCP. Its implementations perform far better than they did in the 1970s
and 1980s, so there is much less reason than in the past to avoid it on performance grounds. In
fact, because it provides congestion control (see Section 8.1.1), it’s a better overall “network
citizen” than UDP and is in general preferred to it.
Because TCP is a reliable transport, and thus data is sequenced, using it might seem to eliminate
the need for a duplicate request cache and also nullify any problems from Byzantine routers. If
RPC could depend on never losing its TCP connection, then TCP’s sequencing would indeed
make all this so. But if the underlying TCP connection fails for some reason and a new one is
created, TCP’s sequencing information is lost.
For example, suppose a client sends a request and the server sends back a response. But the
response is lost when a router that’s forwarding it crashes. The TCP connection is lost as well, so
the server’s TCP doesn’t retransmit the response. The client creates a new connection. Since it’s
still waiting for the response, it retransmits the request. If the server doesn’t have a duplicate
request cache, it cannot determine that the retransmission is a duplicate. And, even if it has such a
cache, if the original request is no longer in the DRC, it still cannot determine whether the
retransmission is a duplicate, just as in the UDP case.
FIGURE 8.19: Juszczak’s algorithm for handling possibly duplicate incoming RPC requests
(adapted from (Juszczak 1989)).
As another example, suppose a client sends a request, but the underlying TCP connection to the
server times out before it receives a response. The client creates a new connection and then
retransmits its request over it. Assuming the request is idempotent, there should be no danger
from its being executed twice on the server. However, the server end of the original TCP
connection has not timed out and is still in the established state (this is unlikely, but possible).
The original request turns out to have been delayed at a Byzantine router. It finally arrives at the
server over the old connection, long after its retransmission, and perhaps long after the
retransmission has been removed from the duplicate request cache and after other requests from
the same client have been received. The server thus does not detect the original request as a
duplicate and happily executes it.
Despite its vulnerability to problems with duplicates, ONC RPC using Juszczak’s DRC algorithm
has worked well in practice — in most environments these problems simply don’t occur. We call
this form of ONC RPC the original ONC RPC in order to distinguish it from what we discuss
next. What is needed to make the original ONC RPC reliable in the face of dropped TCP
connections and Byzantine routers is a way to detect a duplicate request that isn’t so dependent
on the replacement policy and the finiteness of the duplicate request cache. So, let’s modify the
caching approach a bit. To distinguish the new approach from the old, servers have reply caches
rather than duplicate request caches. A client establishes a session with a server, and associated
with each session is a separate reply cache, holding a single request and its complete response.
FIGURE 8.20 Session-oriented ONC RPC.
Each request contains a sequence number assigned by the client. If the client is single-threaded,
i.e., has only one request in progress at a time, then requests should be arriving at the server in
sequence-number order. If a request arrives with the same sequence number as the previous one,
the response to that first request must not have arrived at the client and thus the server repeats its
previous response, which is guaranteed to be in the reply cache. If a request arrives with a
sequence number less than that of the previous one, it must be a duplicate whose response has
already been received by the client, and thus can be ignored — it may have come via a Byzantine
router.
To handle clients that have multiple concurrent requests, the client and server agree ahead of time
to a maximum number of concurrent requests, say n, and have the server give the client n slots in
its reply cache (see Figure 8.20, where client 1 has specified a maximum concurrency of 2 and
client 2 has specified a maximum concurrency of 3). The client then has effectively n channels
through which it can send requests, and only one request at a time is active on each channel.
The client provides a slot number and sequence number with each request, and the server holds
onto its most recent response for each channel. The server determines how to respond to a
request, just as in the previous paragraph.
Versions of ONC RPC that use sequence numbers and
channels as just described are called session-oriented ONC RPC. It is used in NFS starting with
version 4.1 (see Section 8.5.1).
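The per-session state kept by such a server might look something like the sketch below (compare Figure 8.20); the names and the slot limit are illustrative only.

#include <stdint.h>

struct reply;                  /* the stored response; its details don't matter here */

#define MAX_SLOTS 16           /* an arbitrary upper bound for this sketch           */

struct slot {
    uint32_t seq;              /* sequence number of the last request on this channel */
    struct reply *last_reply;  /* its complete response                               */
};

struct session {
    unsigned nslots;           /* maximum concurrency agreed with the client */
    struct slot slots[MAX_SLOTS];
};

/* On receiving a request carrying (slot s, sequence q):
     q == slots[s].seq : the reply was lost, so resend last_reply
     q <  slots[s].seq : an old duplicate (perhaps via a Byzantine router); discard it
     q >  slots[s].seq : a new request; execute it, store the reply, set seq to q      */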
Activity 8.2
1. A simple technique for marshalling a pointer is to transmit just the target of the pointer.
Thus to marshal an int *, the sender transmits the pointed-at int. The receiver unmarshals
by allocating storage for an int, copying the received int into the storage, and providing a
pointer to that storage as the unmarshalled result. Does this marshalling/unmarshalling
technique work for all uses of pointers? Explain.
2. A remote procedure may be declared to have no return value, i.e., it returns void. Would it
be safe for clients to treat such procedures as asynchronous requests, i.e., once the call is
placed, the client application may immediately proceed to its next instruction?
3. Suppose we have a remote procedure that is called to delete the file whose name is
supplied as an argument.
a. Explain why it is not idempotent.
b. Assuming there is no duplicate request cache, will the file system be damaged if a
client retransmits its request to delete a file after the response to its previous
request was lost (and the retransmitted request is executed)? Explain.
c. Explain how the scenario of part b might harm a client application program.
4. An RPC request is said to have “at-least-once” semantics if, once the request returns to
the calling application, it is guaranteed that the remote procedure was executed at least
once on the server. This makes sense for idempotent procedures. The other extreme is “at-
most-once” semantics, where the remote procedure is guaranteed not to be executed more
than once. This makes sense for a non-idempotent procedure (though one certainly hopes
that most of the time it is in fact executed once). Of course, what we really want is
“exactly-once” semantics: a guarantee that the remote procedure is executed exactly one
time.
a. Explain how exactly-once semantics can be achieved if we can guarantee that the
server never crashes.
b. Is exactly-once semantics possible if we no longer have the no-crash guarantee?
What additional mechanisms would be needed? Explain.
8.4. Summary
This chapter has covered some of the details of the TCP/IP protocols so as to understand how
the reliable communication required for RPC protocols is made to work. The intent has been to
remove the magic behind reliable communication and, perhaps, motivate the reader to take a
course on network protocols. Similarly, we explained RPC in depth, since it is used to support
distributed file systems, the topic of the next chapter.
REFERENCES
Unit 9 Distributed Systems
9.0. Introduction
Computer systems have developed from standalone machines, to direct connections between two
machines, to networks where one machine can communicate with any other networked machine.
But even with this, the user is always aware of the connection and has to issue explicit commands
for the movement of data.
Now we are on the verge of the next development, building on networking. This involves groups
of machines acting together as one. A distributed system is a collection of individual computers
which are networked together not just to share data, but to cooperate, to distribute computation
among several physical machines. A distributed operating system looks to its users like an
ordinary centralized operating system but runs on multiple, independent central processing units
(CPUs). Distribution must be transparent, both to the user and to programs at the system call
interface. This means that the user or programmer should not be able to tell that a remote
machine is involved. Ideally, a distributed system should look like a conventional system to
users. Software for this is just emerging.
The simplest possible architecture used to structure a distributed system is to arrange for some
processes to provide services to others. Those which provide services are known as servers,
naturally enough. Those which use these services are known as clients. The whole arrangement is
known as a client/server system. What the server does, and what it sends back to the client (if
anything) can vary enormously. But all of the different models of distributed computing can be
reduced to this.
9.2. Features of distributed systems
Distributed systems have some features not found in standalone systems, and these are
influencing the pace of development.
9.2.1 Economy
Probably the single most important argument for the move towards distributing computing
resources is the economic one. At present the ratio between price and performance is in favour of
multiple small machines. A microcomputer can only provide limited performance; but if
microcomputers can be added together to provide a cumulative performance, they will do so at a
fraction of the cost of a mainframe.
Sharing the workload over idle workstations is one of the long-term goals of distributed systems.
The ideal here is that when a personal workstation is idle, it would make itself available and
undertake work for other, busier, machines. But it would always be fully available for its owner
when required.
9.2.2 Reliability
Distributed systems can offer the high reliability and fault tolerance needed by critical
applications. This is achieved by redundancy in processing power, other hardware and storage of
data. If one machine in a group of 10 crashes, 90% of the processing power is still available. One
copy of a database may be destroyed, but with proper systems in place it can be ensured that
other up-to-date copies exist and are immediately available. The user need not even be aware that
there was a problem.
9.2.3 Sharing resources
The ability to share resources is another factor driving the development of distributed systems.
Such sharing can be for purely economic reasons, for example to use an expensive printer or other
specialized hardware. A multi-user licence for one shared copy of a piece of software is cheaper
than many single-user licences. Apart from economic reasons, it can often be very convenient to
share resources. It is not
really feasible to share a company’s database by distributing copies on floppy disks to individual
machines – it would never be up to date! The whole area of computer-supported cooperative
work relies heavily on distributed systems.
9.2.4 Performance
Obviously performance can be improved by using more machines. But there is a downside.
Communication is the great bottleneck here. So processes and the resources they use should still
be on the same machine as much as possible. Another critical factor is the ability to adapt to
increased load. The system should not just collapse under load. Performance should degrade
gracefully.
9.2.5 Incremental growth.
Distributed systems allow for the incremental growth of a computer installation. It is not
necessary to buy all the processing power, memory or disk drives at the one time. It is possible to
install just what is necessary to begin with, knowing that the system can expand to keep pace
with growing demand into the foreseeable future.
9.3. Naming
In a distributed system, it is necessary to be able to identify uniquely all of the resources –
individual machines, processes, files, printers. At the system level, these identifiers are binary
numbers. There are two problems with such identifiers.
One is that humans find such binary numbers, or even their decimal equivalents, difficult to
remember and input. We are much more at home with names.
The second problem is even more serious. A client needs to know the identifier of the server
machine and also the identifier of the process on that machine which is providing the required
service. Suppose this server process crashes and restarts. It will now have a different process id,
and clients will not be able to reach it. Another possibility is that the machine itself may crash
and the system administrator may move the server process to another machine. Clients would
continue sending requests to the old server, with no results.
There is one solution to both of these problems, and that is to provide a name server in the
system. At the human level, resources are identified by meaningful names, such as ‘timeserver’
or ‘laserprinter’. At the machine level they are still identified by binary numbers. The link
between a name and a number is called a binding. Then we add in a name server process,
somewhere in the system. This maintains a database of such bindings and performs translations
on behalf of clients. For example, a process sends the name server the string ‘laserprinter’, and it
sends back the unique identifier of that printer. If the identifier changes, it is only necessary to
inform the name server. As long as a process can find the name server, it can find any other
resources in the system.
Each computer attached to the Internet is known by a name and also by a number.
All of the names given to computers on the Internet are arranged in a tree structure, just like the
directory structure of a file system. Each non-leaf node in this tree is known as a domain.
Names can be relative (to the local domain), or they can be absolute. Absolute names always
terminate with a dot.
There is a root, then first-level domains, such as com, org and edu. There are also first-level
domains for individual countries, such as us and ie etc. Each of these domains is divided into
subdomains, and so on. Figure 9.1 gives an example of a tiny fraction of the namespace. A
typical name for a computer attached to the Internet would be shannon.cs.ul.ie.
While we are familiar with these human-readable names, we must remember that the Internet
works with binary addresses. All of the data passing around the Internet is directed to its
destination by means of these numeric addresses. Such Internet protocol (IP) addresses are 32 bit
integers. When humans deal with them, we usually break them into four bytes and write the
decimal equivalent of each byte separated by a dot, e.g. 136.201.24.2. This is known as dotted
decimal format.
The Internet domain name system (DNS) is the system which keeps track of all of the computers
attached to the Internet. Its role is to translate a name such as shannon.cs.ul.ie into an IP address,
such as 136.201.24.2.
Figure 9.2 illustrates the procedure. A client knows the name of a server. But in order to send a
request to that server, it must know its address. So it first sends a request to the name server,
asking it to translate the name to an address. The name server looks up the required address in its
database and sends it back to the client. The client is then able to send a request for service.
Figure 9.2 Using a name server
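On most systems an application does not talk to the name server directly but goes through the resolver library. A minimal sketch using the standard getaddrinfo call is shown below; the host name is the example used above and the service name is arbitrary.

#include <stdio.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints = {0}, *res;
    hints.ai_family = AF_INET;                 /* IPv4 only, for simplicity        */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo("shannon.cs.ul.ie", "http", &hints, &res) != 0)
        return 1;                              /* the name could not be resolved   */

    char text[INET_ADDRSTRLEN];
    struct sockaddr_in *sin = (struct sockaddr_in *)res->ai_addr;
    inet_ntop(AF_INET, &sin->sin_addr, text, sizeof text);
    printf("%s\n", text);                      /* the address in dotted decimal form */

    freeaddrinfo(res);
    return 0;
}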
Distributed database
Having one name server to cater for the whole of the Internet just would not work. So the
designers came up with the idea of breaking up the overall database of name/number pairs, and
keeping parts of it on many different machines around the world. This may seem to make the
problem even worse, but in fact it is a very elegant solution, as we shall see.
Each domain of the Internet has an administrator who undertakes to assign names and numbers to
machines in that domain. The administrator also undertakes to provide a name translation server
for that domain.
To tie the whole system together, there are a number of root servers in different parts of the world
which know the addresses of the servers for all of the first-level domains – org, com, us, ie etc.
All servers the world over have the addresses of these root servers.
If a name server is presented with a name in its own domain, then it can give a reply immediately.
However, if the name is in some other domain, then there are further steps involved. For
example, a server presented with the name shannon.cs.ul.ie will first of all query a root server for
the address of the name server for the ie domain. Then it will query that server for the address of
the name server for the ul.ie domain. Then it will query that server for the address of the name
server for the cs.ul.ie domain. Finally it will query that server for the address of the machine
shannon. Like a good librarian, the server may not have some particular information, but it must
know where to look for it. The full procedure is outlined in Figure 9.3.
The whole system is rather like phone books. We know the name of the person we want to phone,
but we need a number. All of the phone numbers in the world are not listed in the one book. Each
subscriber has a book of local numbers (the name server for the local domain). If they need a
number not in the local book, they can apply to directory enquiries (higher level domain server).
If directory enquiries do not have the number, they know where to get it (the root servers).
Caching
The system as it has just been described would work, but there would be an unacceptable number
of queries travelling around the world. Instead, it relies heavily on caching. Every time a server
sends a query for the address of another machine, it saves the name/address combination in
memory. It always tries to satisfy a request from this cache before sending a query to another
machine. When a server has been running for a while it will have all of the most frequently used
addresses cached, and only rarely will it have to send queries over the network.
Two different types of underlying operating system have developed for use in distributed
environments: the network operating system (NOS) and the true distributed operating system.
The most common use of a NOS is to attach file systems from a remote server on to a local
machine, where they appear as part of the local directory structure. The user sees no difference
between local and remote files. All of this is run on the server side by a network operating
system, which is really just a general-purpose operating system with enhancements. It only
transfers those portions of a file which are actually in use. If the file is modified, then the changes
are written back to the server. There is a similarity between this and paging. Unix provides all of
these services, as does Windows NT Server.
Such a NOS also allows other remote resources, such as a printer, to appear as if they were
actually attached to the local machine.
A true distributed operating system must, at the very least, begin to blur the boundaries between
machines.
Obviously it will be responsible for managing all local resources, such as the CPU and peripheral
devices, including network interfaces. As well as this, it is responsible for advertising resources
which are free and available, as well as exporting and importing processes to and from other
machines. And it should do all of this transparently.
9.4.3 Hybrid systems
Fully distributed operating systems are not yet in widespread use. The current state of the art is to
use specially adapted single-machine operating systems. These communicate among themselves,
advertising resources which are free. So they are not fully transparent.
At present, the way forward is not clear. Several different alternatives have been proposed as the
basis for the distributed systems of the future. We will consider two of these, but only time will
tell exactly how the whole area will develop.
CORBA
The most important aspect of distributed design is the interface between the different
components. Object orientation has been proposed as the best technique for defining such
interfaces.
The Object Management Group has published a specification for a Common Object Request
Broker Architecture. As its name implies, CORBA provides access to objects distributed across a
system by matching up requests with objects. It is envisaged that CORBA would be used both for
building distributed systems and for integrating existing and new applications.
In a distributed system, a server typically manages resources on behalf of clients. Like all object-
oriented systems, CORBA encapsulates these resources in modules and makes them available
only through interface procedures. This allows an application to be broken into components
which can communicate with each other very easily, no matter where they are in the distributed
system.
CORBA has been described as acting like a software bus. Just as hardware components all
communicate with each other over the system bus, so software components communicate with
each other through CORBA.
Within the Microsoft Windows environment, the Distributed Component Object Model (DCOM)
performs a role similar to that of CORBA. Objects have DCOM interfaces which can be
exported.
DCE
The Open Software Foundation has developed DCE, which is an attempt to build a distributed
computing environment on top of existing operating systems. It has been ported for example to
Windows, OS/2, Tru64 Unix, and OpenVMS. So in theory you can take a number of existing
machines, of different architectures, each running its own operating system, and just by putting
the DCE software on top of these you have an instant distributed system, without disturbing any
of the existing applications. In most cases this can be done at user level, without affecting the
operating system itself.
In essence, DCE provides tools for building distributed applications, such as its own threads
facility and services for running such distributed systems, including security and protection, a
name service and a time server. The advantage is that all of these are integrated and do as much
work as possible for the programmer.
Activity 9.1
1. Outline five features which are specific to distributed systems and are influencing the
pace of their development.
2. Explain the role of a nameserver in a distributed system
3. Explain how the Internet DNS translates a name to an address.
4. Distinguish between a network operating system and a fully distributed operating system.
9.5. Sockets.
Any distributed system relies totally on the ability of different machines to communicate with
one another. So we will now examine how such communication is implemented by the operating
system on each machine. Then we will go on to consider how a distributed system could be built
on top of the facilities provided by this layer.
The whole area of communication between machines is the province of computer networks.
Despite the best efforts of the standards organisations, there are many different protocols or rules
in use for communication between computers.
While there is no getting away from such differences, attempts have been made to provide one
standard interface to all of these communication domains. One such mechanism is known as the
socket interface. The socket facility is a set of system calls which allow processes to send and
receive data across a network without having to worry about any of the underlying protocols.
Originally a Berkeley Unix extension, it has since been standardized as part of POSIX; it is
available on most systems and is used in many network applications.
A top level view of the socket system is as follows. A process creates a socket, which begins life
as just an anonymous data structure. Next it is uniquely identified within the whole system. Then
this socket has to be connected with another socket in a different process. This can be done
passively or actively. A process can wait to be contacted by another process or it can take the
initiative. After this, data can be transferred across the connection. Finally a socket is closed and
removed from the system. This sequence of operations, for both a server and a client, is outlined
in Figure 9.4.
While sockets can be created in several different flavours, there are two really important types.
The first is connection-oriented, where it is assumed that a stream of data is going to flow
between the sender and receiver. This is similar to the telephone service, where a connection is
first set up, and after that you just talk. In the Internet domain these sockets are implemented by
the Transmission Control Protocol (TCP). The other type of socket is connectionless, or
datagram, where the communication is going to be one or more individual messages. This is
similar to the postal service, where each letter must be individually addressed. In the Internet
domain these sockets are implemented by the User Datagram Protocol (UDP). A socket is normally
created as part of the I/O subsystem. It has a file descriptor and an entry in both the local and
global file tables. It may or may not have a directory entry.
In the Internet domain, messages are sent over a network to a destination machine which is
identified by an IP address. But each message also has an identifier for the protocol which should
receive it, such as UDP or TCP. Then within the particular Internet protocol, a 16 bit number,
called a port number, is used to identify the specific socket. So a connection between two sockets
is fully specified by source IP address, source port, protocol in use, destination port and
destination IP address.
A stream socket can wait passively to be contacted or it can actively connect to another socket.
Accepting connections
Let us first look at the passive situation. This is typical with servers, where they set themselves
up, and wait for clients to contact them. First of all the process which created the socket specifies
a particular port number to be associated with it. This port number is usually well known in
network circles. Then it blocks, listening for requests to that port. When a request for connection
does come in, a new socket is created to handle it, and the original one continues to listen at the
well-known port number for further requests.
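As a concrete illustration (not part of the discussion above, and stripped of all error handling), the following C sketch shows the passive sequence just described for an Internet stream socket: create, bind to a well-known port, listen, and accept. The port number 5000 is an arbitrary example.

/* Sketch of the passive (server) side: create, bind, listen, accept.
 * Error checking is omitted; port 5000 is an arbitrary example. */
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);      /* stream (TCP) socket */

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);            /* any local interface */
    addr.sin_port = htons(5000);                         /* the well-known port */

    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, 5);                                 /* wait passively      */

    for (;;) {
        int conn = accept(listener, NULL, NULL);         /* new socket per client */
        char buf[256];
        ssize_t n = read(conn, buf, sizeof(buf));
        if (n > 0)
            write(conn, buf, n);                         /* echo the data back  */
        close(conn);       /* the listener keeps listening for further requests */
    }
}

Note how accept() hands back a new socket for each connection while the original socket continues to listen, exactly as described above.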
Setting up a connection
A client can ask the system to connect it to a remote socket. For this, it must be able to identify
the machine (e.g. IP address), the protocol (e.g. TCP) and the port number of the socket listening
at the other end. Once a connection has been established, data can then be transferred across it.
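The active side can be sketched in the same spirit. The server address 192.0.2.1 (a documentation address) and port 5000 are placeholders; a real client would obtain these values from configuration or from a name service.

/* Sketch of the active (client) side: identify the server by IP address,
 * protocol and port, connect, then transfer data. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(5000);                        /* server's port        */
    inet_pton(AF_INET, "192.0.2.1", &server.sin_addr);    /* server's IP address  */

    connect(s, (struct sockaddr *)&server, sizeof(server));
    write(s, "hello", 5);                                 /* send a few bytes     */

    char reply[256];
    ssize_t n = read(s, reply, sizeof(reply));            /* wait for the answer  */
    if (n > 0)
        printf("received %zd bytes\n", n);
    close(s);
    return 0;
}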
9.6. Remote procedure call
But programmers are probably even more familiar with function calls. Such a function call
diverts control to an out-of-line function, possibly passing parameters to it as well. When the
function finishes, control returns to the main program and a result value is also made available in
the main program. So another possible way to develop distributed systems is to implement a
mechanism which would allow programs to call functions on other machines. In this way, all the
details of how the network operates (even its existence) can be hidden from the application
program.
9.6.1 Overview
The remote machine has a module containing one or more procedures. This gives the mechanism
its name – remote procedure call, or RPC. We try to make it look like a normal function call as
far as possible – input and output parameters, and a return value. This is achieved by the client
having a dummy procedure in a library on its own machine, known as a stub. The client calls this
in the usual way, just like any other local procedure. As far as the client knows, this is the
procedure which is doing the work. But it is not. All it does is to format the parameters into a
message, add some identification of the procedure it wants executed on the remote machine, and
send the message off to that machine using an interface such as a socket. The client stub then
blocks, waiting for a reply.
When the message arrives at the remote machine, the RPC server process there unpacks the
message, and identifies which procedure is being requested. It then calls that procedure in the
normal way and the procedure returns as normal, passing back a return value to the RPC server
process. This in turn packages the result as a message, and sends it to the communication layer
for transmission back to the waiting stub on the requesting machine. This stub is then woken up,
unpacks the message and passes back the result to the client process in the normal way. Figure
9.5 illustrates the flow of control in this case.
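A client stub for a trivial remote procedure, say int add(int a, int b), might look roughly like the following. The message layout and the helpers send_request() and wait_reply() are invented for this sketch; in practice they would be socket code such as that shown earlier, and the stub itself would normally be generated by a tool (see below).

/* Hypothetical client stub for a remote procedure  int add(int a, int b).
 * The message layout and the helpers send_request() and wait_reply() are
 * assumptions made for this sketch; in a real system the stub would be
 * generated automatically and would use sockets underneath. */
struct rpc_msg {
    int proc_id;                      /* which remote procedure, e.g. 1 == add */
    int args[2];                      /* marshalled input parameters           */
    int result;                       /* filled in from the reply              */
};

void send_request(struct rpc_msg *m); /* transmit to the RPC server (assumed)    */
void wait_reply(struct rpc_msg *m);   /* block until the reply arrives (assumed) */

int add(int a, int b)                 /* called exactly like a local function  */
{
    struct rpc_msg m;
    m.proc_id = 1;                    /* identify the procedure to be executed */
    m.args[0] = a;                    /* format the parameters into a message  */
    m.args[1] = b;
    send_request(&m);                 /* send it off to the remote machine     */
    wait_reply(&m);                   /* the stub blocks, waiting for a reply  */
    return m.result;                  /* pass the result back to the caller    */
}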
9.6.2 Generating stubs
All of this could be done by hand, in a conventional programming language, but it would be very
prone to error. The order and type of the parameters must be the same in all four modules that
deal with them. If all four are coded independently, there is certainly room for inconsistencies to
creep in, particularly over time, as changes are made in one place but not in all.
The ideal is to automate as much of this as possible. The programmer only writes the client
program and the server procedure. Tools have been developed which generate C code to
implement RPC. They are supplied with the characteristics of remote procedures that are visible
to clients, such as the name of the procedure and the number and types of the parameters. From
this specification they generate a server stub and a client stub, which make the network calls.
These are then compiled and linked in with the code written by the programmer.
9.7. Mutual exclusion
The concept of critical sections extends to cover mutual exclusion in distributed systems. But
mechanisms such as semaphores are difficult to distribute. They rely on shared variables, which
by definition exist in one place. It is difficult to guarantee indivisible uninterruptible access to
them over a network. So other algorithms have been developed to control mutual exclusion in
distributed systems. These fall into two classes: centralized and fully distributed.
In the centralized approach, there is one dedicated coordinator process somewhere in the system. Such
a process could control one critical section or many. When a process wishes to enter its critical
section, it asks permission from this coordinator, waits until it gets it, enters the critical section,
and then informs the coordinator when it leaves its critical section. The coordinator must ensure
that only one process has permission to be in its critical section at any time.
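The coordinator's logic is simple enough to sketch. The message primitives receive_request() and send_grant() below are hypothetical placeholders for whatever interprocess communication the system actually provides.

/* Sketch of a centralized mutual-exclusion coordinator.  The message
 * primitives receive_request() and send_grant() are hypothetical. */
enum req_type { ENTER, EXIT };

struct request { enum req_type type; int pid; };

struct request receive_request(void);   /* assumed messaging primitives */
void send_grant(int pid);

void coordinator(void)
{
    int holder = -1;                 /* pid currently inside its critical section */
    int queue[64];                   /* processes waiting for permission (FIFO)   */
    int head = 0, tail = 0;

    for (;;) {
        struct request r = receive_request();
        if (r.type == ENTER) {
            if (holder == -1) {      /* section free: grant permission at once    */
                holder = r.pid;
                send_grant(r.pid);
            } else {                 /* otherwise the requester must wait         */
                queue[tail % 64] = r.pid;
                tail++;
            }
        } else {                     /* EXIT from the current holder              */
            if (head != tail) {      /* hand permission to the next waiter        */
                holder = queue[head % 64];
                head++;
                send_grant(holder);
            } else {
                holder = -1;         /* nobody waiting: section becomes free      */
            }
        }
    }
}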
Such a system will work if there are no failures or lost messages. One problem with it, as with all
centralized algorithms, is that it introduces a communications bottleneck. Also the whole system
will deadlock if a process crashes while in its critical section. But the main problem with any
centralized algorithm is that the coordinator process may crash. Such systems are usually built so
that any process can act as coordinator, but only one does so at a time. So when the current
coordinator crashes, it is necessary to ensure that another one takes over – but only one other.
There must be some way of recognising that the coordinator is no longer functioning. This can
usually be detected by failure to receive an acknowledgement after a timeout period. The process
which detects the lack of a coordinator identifies a successor from among all of the others based
on the combination of process id and IP address, which we assume is globally unique. Generally,
the process with the highest id is selected as coordinator. Two processes can discover a dead
coordinator at the same time, but both will identify the one new coordinator.
Because of the drawbacks associated with centralized algorithms, fully distributed algorithms
have been developed. With these, each process takes its share of the responsibility for arranging
mutual exclusion on a critical section. Such an algorithm assumes that all processes know each
other. When a process wants to enter its critical section, it multicasts a request to all of the others.
A process will only reply to this request if it is not in, or wanting to go into, its critical section.
The requesting process waits until it has got permission from all of the others, then enters its
critical section. One of the drawbacks of this algorithm is the large number of messages it
requires. Another problem is the need to know all the processes involved. But it is suitable for
small, stable sets of cooperating processes.
9.8. Deadlock
Deadlock avoidance is not used in distributed systems. Remember, avoidance algorithms need
advance knowledge of all the resources required by processes. This is difficult, if not impossible,
to know even in a standalone system, and is really only relevant to batch systems. Once we move
into distributed computing, it is not feasible to talk about advance knowledge of all resource
usage.
That leaves us with deadlock detection and recovery. Each individual machine maintains its own
local resource allocation graph, typically in a reduced form which only records dependencies
between processes, known as a wait-for graph. Then the problem is to check for cycles in the
union of all of these graphs. When such a cycle exists, one process is chosen and aborted. Of
course that presupposes that we know how to maintain and check a wait-for graph for a whole
distributed system. The algorithm to implement this can be centralized or distributed.
Each machine maintains a graph for its own resources, and could even implement local detection
and recovery. Then there is a coordinator process which maintains the union of these graphs. A
process could send a message to the coordinator each time a local graph is changed. Or it could
send messages about the state of the local graph periodically. Or the coordinator could ask for
information at fixed intervals. The coordinator examines the distributed graph periodically. There
could be a cycle in this distributed graph which is not in any local one, so indicating a distributed
deadlock.
For example, the resource allocator on machine A sees the graph on the left of Figure 9.6. The
resource allocator on machine B sees the graph in the centre of Figure 9.6. Neither of these has a
cycle, so there does not appear to be any deadlock. But the coordinator sees the union of the two
graphs, as shown on the right of Figure 9.6. Clearly there is a system-wide cycle, and these four
processes are deadlocked.
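Checking the union graph for a cycle is straightforward. The sketch below assumes, purely for simplicity, that each process waits for at most one other process, so the whole graph can be held in a single array; it walks the wait-for chain with two pointers and reports a process on any cycle it finds, which can then be chosen as the victim.

/* Sketch of cycle detection in the union wait-for graph.  The caller fills
 * waits_for[i] with the process that i is waiting for, or -1 if i is not
 * waiting.  Each process is assumed to wait for at most one other. */
#define NPROC 8

int waits_for[NPROC];

int find_deadlock(void)
{
    for (int start = 0; start < NPROC; start++) {
        int slow = start, fast = start;
        /* Walk the wait-for chain with two pointers; if they ever meet,
         * the chain loops back on itself and those processes are deadlocked. */
        while (fast != -1 && waits_for[fast] != -1) {
            slow = waits_for[slow];
            fast = waits_for[waits_for[fast]];
            if (slow == fast)
                return slow;          /* a process on the cycle: candidate victim */
        }
    }
    return -1;                        /* no cycle found, so no deadlock */
}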
Every time a transaction has to wait for a resource, it sends a message to the process holding it.
This message contains the id of the blocked process. If the recipient is waiting on something, it
updates the message with its own id and forwards it to the process holding that resource. If the
message ever comes back to the original sender, then there is a cycle.
This scheme is attractive, but sending a message when you are blocked is not exactly trivial.
9.8.3 Recovery
Discovering a cycle in a distributed system is one thing. But then there is the question of how to
break a cycle when one is discovered. There are two options available: either kill one or more
processes or preempt some resources from one or more processes. One possibility is for the
blocked process to terminate itself. But this could be overkill if more than one process discovers
the cycle at about the same time. With the distributed algorithm, each blocked process could add
its id number to the message instead of replacing it. This way the ids of all processes involved in
the cycle would be known to all of the others. The highest or lowest could then choose itself as
victim.
9.9. Distributed shared memory
Effort is now being directed towards allowing memory to be shared by processes on different
machines. This would allow a shared memory programming model to be used by cooperating
processes in a distributed system. With such a scheme, a standalone system could be distributed
with minimum effort.
9.9.1 Implementation
To a programmer, there should be no difference between distributed shared memory and shared
memory on a standalone machine. Each process has its own virtual address space. Some of the
physical memory backing this address space is also mapped into the address space of other
processes. Whether these processes are on the same machine or on remote machines should not
really be relevant to a programmer. On a standalone machine there is no question about where the
shared memory will be physically located – it will have to be somewhere in the physical memory
of that machine. But with a distributed system, it could physically exist on any of the
machines. The simplest way to implement distributed shared memory is to have one server
machine which manages the shared memory on behalf of client processes on remote machines.
These communicate with the server by means of RPC.
The initial mapping of a range of such distributed shared memory into the address space of a
process can be handled in a manner very similar to POSIX shared memory. An initial RPC
identifies the block of shared memory which is being requested. The server checks permissions
and access mode etc., and then returns a handle. This is a unique identifier which is used to
identify and authenticate all further accesses to that block of distributed shared memory. The
server exports two further procedures, one for reading, and one for writing. Finally there is a
procedure which lets the server know that the client has no further use for this block of shared
memory.
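The interface exported by such a server might look like the following declarations. All of the names and types here are invented for illustration only; a real system would define them in whatever form its RPC tools require.

/* Hypothetical procedures exported by a distributed shared memory server. */
typedef long dsm_handle_t;   /* identifies and authenticates a mapped block */

dsm_handle_t dsm_open(const char *block_name, int access_mode);            /* check permissions, return handle */
int dsm_read (dsm_handle_t h, long offset, void *buf, long nbytes);        /* fetch data  */
int dsm_write(dsm_handle_t h, long offset, const void *buf, long nbytes);  /* store data  */
int dsm_close(dsm_handle_t h);                                             /* client has no further use for the block */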
9.9.2 Local run-time support
While easy to understand, the foregoing scheme is very inefficient and can be improved on in a
number of ways. So far, the local memory manager is not involved at all – the client process has
to make an RPC for every read and write. Ideally, distributed shared memory should be
transparent to a process. Assignment to variables in distributed shared memory should be
identical to assignment to variables in local memory – the memory manager should take care of
all of the necessary overhead.
This can be accomplished by a slight extension to the local memory manager and to the mmap()
system service so that it can map this shared memory into the address map of the process,
possibly setting it up as a segment in its own right. Then it returns a pointer to this segment. From
here on the process accesses the distributed shared memory using standard local pointers. The first
time a program references an address which is not local a remote page fault occurs, and the
distributed shared memory run-time fetches the appropriate page. To a user, this looks exactly
like the traditional system. The main difference is that the backing store is a remote server, not a
local disk.
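Locally, the effect is much the same as mapping POSIX shared memory. The sketch below uses the ordinary shm_open() and mmap() calls to make the point that, once mapped, the region is reached through plain assignments; in a distributed implementation the name (here the arbitrary "/dsm_block") would identify a remote block and missing pages would be fetched over the network when a page fault occurs.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* "/dsm_block" is an arbitrary example name. */
    int fd = shm_open("/dsm_block", O_RDWR | O_CREAT, 0600);
    ftruncate(fd, 4096);                       /* size of the shared block */

    int *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    shared[0] = 42;       /* an ordinary assignment; no special call needed */

    munmap(shared, 4096);
    close(fd);
    shm_unlink("/dsm_block");
    return 0;
}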
9.10. Distributed file systems
The simplest way to implement a distributed file system is to have the files on one machine (the
server) and the user on another machine (the client). In a realistic system there will be many
clients, and possibly more than one server. Servers can run on dedicated machines, or a machine
can be both server and client. Typically, a user would make normal system calls which the client
software translates to RPC calls to the server.
A server can track each file being accessed by each client. It can implement locks and perform
read-ahead for sequential reads, just like a standalone file system. All of this implies that it must
maintain information about every file that each client has opened, and that this information must
be maintained until the file is closed. Because of this, it is known as a stateful server. Such a
server can be in difficulties after a crash, when all of this information is lost. It will not know
about which files any particular client has open. Another possibility is that it can simply provide
blocks as requested. In this case the server does not keep track of which clients are accessing
which files. Such a stateless server is slower, but it is more robust, in that it simplifies the
recovery procedures after a server crashes and reboots. A client must provide a stateless server
with a filename and offset at each request. No open or close requests are sent to the server.
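The difference shows up directly in what each request must carry. The two structures below are purely illustrative and are not taken from any real protocol.

/* Illustrative request layouts, not taken from any real protocol. */
struct stateful_read_req {      /* the server already knows which file and where */
    int  open_file_handle;      /* refers to per-client state kept on the server  */
    long count;                 /* how much to read                               */
};

struct stateless_read_req {     /* everything must be repeated on every request   */
    char filename[256];         /* which file                                     */
    long offset;                /* where in the file                              */
    long count;                 /* how much to read                               */
};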
9.10.2 Naming schemes
One basic problem for any distributed file system is to provide a namespace that uniquely
identifies each file in the system. File systems have traditionally used directories for this,
organized in some form of tree or graph structure.
At one end of the scale, such a distributed system could be implemented by integrating all files
and directories into a single global namespace spanning all machines in the system. The
complexity of administering such a namespace, particularly if the file system is distributed over
many machines, outweighs any benefits it may bring.
At the other end of the scale, a two-part naming scheme, such as host:localname, would be
relatively easy to implement but would not be location transparent. A user would have to know
exactly where each file was in the system. A middle of the road approach is normally acceptable.
Remote file systems are attached to local systems and appear seamlessly as just another part of
the local system. A user should not be able to tell the difference between a local directory and a
remote one. Obviously both the exporting and the importing machine would need extra software.
Figure 9.7 shows two simple file systems, one on a local machine and the other on a remote
machine. Figure 9.8 shows the situation on the local machine after it has attached the contents of
the remote directory ‘projects’ to its directory ‘programs’. Note that both of these files still exist
on the remote machine; they only appear as if they were attached to the local ‘programs’
directory. Note also that the local machine must know where they are in order to attach them, but
after that all references to these files are transparent. Finally, the original contents of the
‘programs’ directory on the local machine are no longer visible. They are still there, but cannot
be accessed again until the remote directory has been detached.
The set of requests which the client can make on the server are normally very similar to the
standard POSIX system services for files, but are implemented as synchronous remote procedure
calls. This means that the client blocks until the server replies. On a client machine, the system
calls and all of the high-level processing are identical for local and remote files. The local
operating system determines whether a file is local or remote at open(). If it is a local file, then it
is handled by its own particular file system code. If the file is a remote one, then the operations
field in its global file table entry points to functions supplied by the distributed file system (DFS).
Figure 9.9 illustrates how the RPC mechanism is integrated with the virtual file system.
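One common way to achieve this is for the open() code to fill in a small table of function pointers, after which all reads and writes are dispatched through that table. The function names below (local_read, dfs_read and so on) are invented for the sketch.

/* Sketch of dispatch through an operations table. */
long local_read (int fd, void *buf, long n);
long local_write(int fd, const void *buf, long n);
long dfs_read   (int fd, void *buf, long n);        /* issues an RPC to the server */
long dfs_write  (int fd, const void *buf, long n);

struct file_ops {
    long (*read) (int fd, void *buf, long n);
    long (*write)(int fd, const void *buf, long n);
};

/* Chosen at open() time, depending on where the file actually lives. */
struct file_ops local_ops = { local_read, local_write };
struct file_ops dfs_ops   = { dfs_read,   dfs_write   };

struct open_file {
    struct file_ops *ops;       /* operations field in the global file table entry */
    /* ... other fields ... */
};

long do_read(struct open_file *f, int fd, void *buf, long n)
{
    return f->ops->read(fd, buf, n);   /* caller cannot tell local from remote */
}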
9.10.4 Caching
A server will normally cache directories or blocks of a file to save on disk accesses. This is just
normal file system buffering, and presents no extra problems. A client will also cache file blocks
to avoid the delays associated with using a network and to reduce the volume of traffic on the
network, as well as the load on the server. There is a problem keeping cached copies of a shared
file consistent with the original and with each other.
There are a number of approaches to this.
Write-through Any write to cached data is also written to the server. Write-through is
very reliable, but it implies heavy overheads. It really uses the cache only for reading.
Delayed-write A write to cached data is not passed on to the server immediately, but at
fixed intervals.
Write-on-close Writes are only visible to other processes when a file is closed. In this
case, if two processes are writing then the last one to close overwrites anything written by
the other one.
9.10.5 Replication
Multiple copies of the same file may be kept on different servers to increase reliability and to
distribute the workload over the system. This is not just caching, but permanently replicating
resources in their entirety. Replication should be transparent to the user, implemented by the
system. Each replica manager retains a physical copy of some or all files. These replica managers
can periodically swap updates. Replication of read-only files is trivial, but it does ensure
availability when one of the servers is down. If requests are processed in parallel by all of the
replica servers, then it is even fault-tolerant. But once writable files are replicated there are
questions of consistency if some are updated and others not.
One of the most commonly used distributed file systems is NFS. It was originally designed as part of SunOS,
the operating system for Sun workstations. The definitions of its protocols are in the public
domain, which led to its widespread adoption so that it has become a de facto standard. It has
been implemented on top of a range of different hardware and operating systems. NFS maintains
only one copy of a file, and the most recent update is the version visible to the system. Any
replication or caching is extra, not part of NFS. As several clients can import the same directory,
sharing is implicit in the system. This sharing is controlled by the access permission mechanism
on the server.
Implementations of distributed file systems will be required to scale well to the very large
systems of the future, possibly even worldwide systems. This rules out any mechanisms which
rely on centralisation. Broadcasting, or sending a message to all machines on the network, is
another mechanism which does not scale well to larger systems. So there will be more emphasis
placed on fully distributed algorithms.
ACTIVITY 9.2
1. Outline the approach to building distributed systems taken by CORBA and by DCE.
2. Outline the steps involved in setting up communication using a stream socket, both on the
server side and on the client side.
3. Give an overview of the RPC mechanism.
4. Explain both the centralized and distributed approach to mutual exclusion in a distributed
system.
5. Explain both the centralized and distributed approaches to deadlock in a distributed
system.
6. Describe how distributed shared memory could be implemented.
7. Explain the difference between stateful and stateless distributed file systems.
8. Explain how a remote file system can be integrated into the directory structure of a local
machine.
9. Outline some different approaches to keeping cached copies of shared files consistent
with the original and with each other.
9.11. Summary
A distributed system is a collection of networked computers which cooperate to distribute
computation among themselves. Network operating systems provide some distributed
applications, such as remote directory and printer sharing. A true distributed operating system
must go further than this, and at least begin to blur the boundaries between machines.
Interprocess communication between different machines is an essential prerequisite for any
distributed system. The socket facility allows processes to send and receive data across a network
without having to worry about any of the underlying mechanisms. Remote procedure call is a
mechanism which allows programs to call functions on remote machines in a transparent manner.
A server manages distributed shared memory for a group of clients. These send requests to the
server using RPC. On a local machine, the onus could be put on the memory manager to mask
any difference between local and remote memory. A distributed file system allows a file on one
machine to appear as if it actually existed on another machine. The system consists of a file
server and a client. A user makes normal system calls, which are translated to RPC calls to a
server. Such a distributed file system can provide stateful or stateless service. The Sun Network
File System is a commonly used distributed file system.
FURTHER READING.
Silberschatz, A. and Galvin, P. (1998) Operating System Concepts, 5th edn. Reading, MA:
Addison-Wesley.
Stallings, W. (1995) Operating Systems, 2nd edn. Englewood Cliffs, NJ: Prentice Hall.
Stevens, W. R. (1992) Advanced Programming in the Unix Environment. Reading, MA: Addison-Wesley.
Tanenbaum, A. (1992) Modern Operating Systems. Englewood Cliffs, NJ: Prentice Hall.
Tanenbaum, A. (1995) Distributed Operating Systems. Englewood Cliffs, NJ: Prentice Hall.
Tanenbaum, A. and Woodhull, A. (1997) Operating Systems: Design and Implementation, 2nd
edn. Upper Saddle River, NJ: Prentice Hall.
Unit 10 Fault Tolerance and Security
10.0. Introduction
As more and more information is stored in computer systems, the need to protect it is
becoming increasingly important. Protecting this information against unauthorized usage is
therefore a major concern of all operating systems. Unfortunately, it is also becoming
increasingly difficult due to the widespread acceptance of system bloat as being a normal and
acceptable phenomenon. In the following sections we will look at a variety of issues concerned
with security and protection, some of which have analogies to real-world protection of
information on paper, but some of which are unique to computer systems. In this chapter we will
examine computer security as it applies to operating systems.
Some people use the terms “security” and “protection” interchangeably. Nevertheless, it is
frequently useful to make a distinction between the general problems involved in making sure
that files are not read or modified by unauthorized persons, which include technical,
administrative, legal, and political issues on the one hand, and the specific operating system
mechanisms used to provide security, on the other. To avoid confusion, we will use the term
security to refer to the overall problem, and the term protection mechanisms to refer to the
specific operating system mechanisms used to safeguard information in the computer. The
boundary between them is not well defined, however. First we will look at security to see what
the nature of the problem is. Later on in the chapter we will look at the protection mechanisms
and models available to help achieve security. Security has many facets. Three of the more
important ones are the nature of the threats, the nature of intruders, and accidental data loss. We
will now look at these in turn.
10.2. Fault Tolerance
Fault tolerance is the ability of a system to keep providing its service when some of its
components fail, so that any degradation in service is proportional to the severity of
the failure, as compared to a naïvely designed system in which even a small failure can cause
total breakdown. A fault-tolerant design enables a system to continue its intended operation,
possibly at a reduced level, rather than failing completely, when some part of the system fails.
A fault is a malfunction, the cause of an error. It can be a hardware or a software fault. It can be
transient, intermittent, or permanent. A transient fault occurs once only. It is usually a hardware
fault, such as a random cosmic ray flipping a bit in memory. There is not much that can be done
about identifying the fault and repairing it, as it does not occur again. The emphasis is on
detecting it and recovering from it. An intermittent fault occurs again and again, at unpredictable
intervals. It is the most difficult type of fault to diagnose, as you never know when it will occur
again. It can be a failing hardware component or a bug in software. A permanent fault means that
something is broken and must be replaced. This can be a hardware component or a piece of code
which is not doing what it is supposed to do.
The first requirement for a fault-tolerant system is that it be able to detect a fault. This should be
done as soon as possible, as the longer it goes undetected the more errors will be caused, reducing
the chance of identifying the underlying fault. The general approach to this is to use redundancy.
Data faults can be detected by using information redundancy, such as check-bits or checksums.
They can indicate that a data item has been corrupted. A step up from this is to use error-
correcting codes. There is a greater overhead involved here, but they make it possible to recreate
the correct value. Another possibility is to use time redundancy by repeating the operation. For
example, two different copies of a file can be read and compared. If they do not match exactly,
we know that an error has occurred. This more than doubles the time involved. Sometimes it is
possible to use physical redundancy by installing extra equipment, e.g. duplicate processing on
two different CPUs.
When a fault-tolerant system detects a fault, it either fails gracefully or masks it. Graceful failure
means that it informs the user, notifies any other processes it is communicating with, closes all
I/O streams it has open, returns memory, and stops the process. Masking a fault would include
retrying an operation on a different CPU, attempting to contact a different server on a network,
using a different software routine, or using an alternative source for data.
At the very least, faults should be confined to the process in which they occur. They must not
spread to other processes or to the operating system itself.
10.2.4 Faults in file systems
The following are some common faults which are specific to file systems.
Bad read. One or more sectors of a disk cannot be read. At best, the user loses part of a
file. At worst, a whole directory, an index block or a bitmap can be lost.
Bad write. Information is written to the wrong sector. Chains of pointers can be corrupted.
One file can be linked into another or, even worse, into the free list, resulting in chaos.
When a system uses a disk buffer cache, a power failure can leave the disk in an
inconsistent state.
Viruses can cause corruption or even total loss of data.
There are always faults attributable to humans, whether intentional or not.
Precautions
In designing safeguards against these faults, the following factors have to be balanced.
What is the mean time between failure (MTBF) for the hardware? In other words, what
are the odds against the system crashing?
What is the operational cost of making backup copies? If it only requires a click on an
icon or can be done automatically, then why not do it? But in some systems it may mean
shutting the computer down.
What is the cost of loss of information? The loss of the latest version of a student program
is very different from the loss of banking information.
One hundred per cent protection requires that everything be recorded in duplicate or triplicate.
This involves two or three similar drives, all writing in unison. Reads are also duplicated and
compared. Another security feature is to read after every write to check that the data has been
properly written. This involves a heavy time overhead.
Generally some degree of loss can be tolerated, and the policy adopted is regular backups. One
approach to this is to do a total backup at fixed intervals. The entire file system is copied to tape
or to another disk. Recovery from a crash is easy when using a total backup. The disk is
reformatted or replaced, and the backup tape is copied to the disk. We then have a copy of the file
system as it was a day, a week or a month ago.
10.3. Security
There may also be deliberate attempts to cause the system to malfunction, and this is the area of
security. Threats to security come at different levels. Leakage happens when confidential
information is accidentally made available to an unauthorized agent. Stealing is when such an
agent takes positive action to access the information. Tampering is when data in the system is
changed in such a way that it still appears to be valid. Vandalism is when data is changed so as to
be meaningless. Once there is more than one user, or even more than one process, on a machine, it
becomes necessary to provide some security protection.
A security policy is a statement of the rules and practices that regulate how a computer system
manages, protects and distributes sensitive information. The security policy decides whether a
given subject (user, process etc.) can be permitted to gain access to a specific object (resource,
data, file etc.). A computer system should have sufficient hardware and software to enforce such
a security policy. Each user has a set of privileges which give rights to access certain objects
through operating system functions. These privileges are acquired when a user logs on to the
system, and are normally inherited by each new process that the user creates.
Protection can be implemented in many different ways, from an all-or-nothing level down to a
very fine granularity. Generally there is a tradeoff between the granularity of the protection and
the overhead of implementing it.
Physical exclusion. If the system can only be accessed from specific terminals in a designated
area, then traditional security methods such as locks or identity checks can be used to control
access to those terminals. This level of protection is normally only used in the highest security
sites.
Exclusion of unauthorized users. The traditional approach to this is to issue each user with an
account name and a password. The problems with passwords are well known. Many systems will
not allow a user to set a password which is easily cracked. It is common practice that only
encrypted versions of passwords are stored, and the encryption algorithm cannot be reversed.
Distinguishing between users. We do not, however, want even authorized users to have access
to everything. So we must make some further distinctions after access. The system maintains a
list of the privileges granted to each user and checks every request for resources against this. This
is certainly an improvement, but it still has the weakness that access rights remain unchanged
during the lifetime of a process. A process may need access to a particular resource just once, for
example at initialisation. But it retains that right, even though it is unneeded. This is a potential
security hole.
Access for current activity only. The most fine-grained protection is based on the idea of need-
to-know, or access rights for the current activity only. For example, consider two processes
performing two-way communication through a shared buffer. Access privileges for the buffer
segment should be dependent on whether a process is currently engaged in reading or writing. So
we need mechanisms for granting and revoking privileges while a process is running.
The most general way of tracking who can access what, and how, in a system is to use an access
matrix.
The rows of the matrix represent processes, also known as subjects or domains. This is the ‘who’
part. The columns represent the resources, or objects. This is the ‘what’. The entries in the array
represent the ‘how’ part.
The information in the access matrix should not be accessible to user-level processes. It is itself
highly protected by the operating system.
An access matrix for even a small system can grow quite large. Most of the entries will be empty.
Such a sparse matrix is very wasteful of space. So they are rarely implemented as actual matrices.
Other methods are used, purely to pack the relevant data in more concisely. We will now
consider some of these.
Global table
Each entry in such a table is an ordered triple <Domain, Object, RightsSet>. For example,
the information from the first row of Figure 10.1 would be encapsulated as <Domain1, File1,
Read>, <Domain1, File3, Read>.
If the triple corresponding to a particular operation exists, then that operation is valid. Otherwise
it is invalid.
Even though it does not take up as much space as a full access matrix, such a table can be quite
large. Also, if a particular object can be accessed by every subject, then it must have a separate
entry corresponding to each domain. This tends to inflate the size of the table.
Capability lists
One way of slicing the access matrix is to have a separate list for each domain. This would
consist of the couples <Object, RightsSet>. As such a list specifies what a subject operating in
that domain can do, not surprisingly it is known as a capability list. The capability list for
Domain1 in Figure 10.1 would be <File1, Read>, <File3, Read>.
Access control lists
Another way of compacting an access matrix is to store each column in the matrix as a separate
list of ordered pairs <Domain, RightsSet>. With this scheme, each object has its own access
control list. For example, the
access list for File1 in Figure 10.1 would be <Domain1, Read>, <Domain4, Read/Write>.
If there are any default access rights, these could be put at the head of the list, e.g. <AllDomains,
RightsSet>. So the defaults would be checked first before going on to scan the remainder of the
list.
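With either slicing, checking a request reduces to scanning a short list. The following sketch checks an access control list, with any default entry for all domains placed at the head as suggested above; the constants and types are invented for this example.

/* Sketch of an access control list check. */
#define ALL_DOMAINS  -1              /* matches any subject: a default entry */
#define RIGHT_READ   0x1
#define RIGHT_WRITE  0x2

struct acl_entry { int domain; int rights; };

/* Returns non-zero if 'domain' may perform the operations in 'wanted'. */
int check_access(const struct acl_entry *acl, int n, int domain, int wanted)
{
    for (int i = 0; i < n; i++) {
        if (acl[i].domain == ALL_DOMAINS || acl[i].domain == domain)
            if ((acl[i].rights & wanted) == wanted)
                return 1;            /* a matching entry grants the rights   */
    }
    return 0;                        /* no matching entry: access is refused */
}

With this representation, the access list for File1 in Figure 10.1 would simply be the two entries {Domain1, RIGHT_READ} and {Domain4, RIGHT_READ | RIGHT_WRITE}.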
Activity 10.1
10.4.1 Authentication
This involves the set of techniques whereby a server can identify a remote user or a client can
verify that it is dealing with a legitimate server.
Servers themselves are always in danger of being infiltrated. The best known method is still
password cracking. Viruses are another way to get control of a server. A virus (in this context) is
a program which can modify another program, such as the part of the operating system
implementing the security policy.
One-way authentication
The classic mechanism for a user to authenticate itself to a server has been passwords. An
improvement for standalone machines would be non-forgeable identifiers, such as fingerprints or
voice patterns. These are difficult to transmit over a network, and so are of less value in a
distributed system.
Two-way authentication
It is not sufficient for a machine to authenticate a user; the user must also be able to authenticate
the machine. It is important to know that it is the legitimate server and not a fake. A common
method is to have one trusted authentication server in the system. Each machine can agree a
password with this server.
For example, a user wishing to log on to a file server would send its request through the
authentication server. This can verify the user and authenticate the request to the file server. It
can also verify the file server and assure the user that it is dealing with the authorized server.
10.4.2 Cryptography
Messages passing over a network can be intercepted, copied and replayed by an intruder.
For example, authorisation to an ATM to issue money could be replayed later in the hope of
getting the machine to issue the money again.
The whole aim of cryptography is to conceal private information from unauthorized eyes. The
sender uses a rule to transform the data into an unintelligible encrypted form; the recipient uses
an inverse rule. However, if an intruder discovers the rule, then all secrecy is lost.
An improvement on a rule is a function with a key as a parameter. This relies on the secure
distribution and storage of keys. Cryptography using keys also involves authentication, as
possession of the appropriate encryption or decryption key can be used to authenticate the sender
or receiver. Modern systems are moving towards authentication servers which both authenticate
users and issue keys. Of course this is putting all of the eggs in one basket. If an intruder gains
access to this server, all security is broken. It is now accepted that such security systems need to
be rigidly designed using formal methods.
The key is issued in two forms. One is used to encrypt messages. Another, which can be sent
securely to the recipient, is used to decrypt them. But secure distribution of keys is a problem.
One possibility is to use an authentication server to distribute them. Such a server maintains a
table of <name, secret key> pairs, which is used to authenticate clients.
Suppose, for example, a client A wishes to communicate secretly with B. It first authenticates
itself to the server by sending a message encrypted with its secret key. This message asks for a
key to communicate with its destination B. This is the first of the sequence of messages
illustrated in Figure 10.2.
The server uses A’s secret key to decrypt this message, so proving that it must have come from
A. It then generates a one-off key for this communication between A and B, and sends this key,
as well as a copy of this key encrypted with B’s secret key, back to A. This whole message is
encrypted with A’s secret key, so A can decrypt it. This is the second message in Figure 10.2.
A then sends the encrypted version of the one-off key to B. As the server originally encrypted
this with B’s secret key, B is able to decrypt it and extract the one-off key. A never learns B’s
secret key.
Then A encrypts its message using the one-off key, and sends this encrypted message to B. B
uses the one-off key to decrypt it, thus proving that it must have come from A.
Each recipient has two keys, such that either can be used to decrypt a message encrypted with the
other. One, the private key, is kept very secret. The other, the public key, is freely available, e.g.
on the Web. Anyone can encrypt a message with the public key, but only the recipient can
decrypt it, using the private key.
10.4.3 Digital signatures
With the growth in the number of computer documents, we need to be able to authenticate such
documents. Again, either public or secret keys can be used for this.
Public keys
The document is encrypted with the private key. Anyone can decrypt it using the public key, but
only the originator could have encrypted it. So its authenticity is guaranteed.
Secret keys
This requires the use of an authentication server. The source process sends the message,
encrypted with its secret key, to this server, which verifies the sender. The server then adds a
certificate of authenticity, and encrypts the message with the secret key of the destination
process. The receiver has the assurance of the server it trusts that the message is authentic.
Activity 10.2
1. Explain the problem of mutual authentication in a distributed system, and some of the
approaches taken.
2. Outline how secret keys and public keys can be used to encrypt data passing over
communication lines.
3. Explain how digital signatures work.
10.5. UNIX Security
UNIX security aims to protect users from each other and the system’s trusted computing base
(TCB) from all users. Informally, the UNIX TCB consists of the kernel and several processes that
run with the identity of the privileged user, root or superuser. These root processes provide a
variety of services, including system boot, user authentication, administration, network services,
etc. Both the kernel and root processes have full system access. All other processes have limited
access based on their associated user’s identity.
10.5.1. Unix Protection System
UNIX implements a classical protection system, not a secure protection system. As stated
before, a UNIX protection system consists of a protection state and a set of operations that enable
processes to modify that state. Thus, UNIX is a discretionary access control (DAC) system.
However, UNIX does have some aspects of a secure protection system. First,
the UNIX protection system defines a transition state that describes how processes change
between protection domains. Second, the labelling state is largely ad hoc. Trusted services
associate processes with user identities, but users can control the assignment of permissions to
system resources (i.e., files). In the final analysis, these mechanisms and the discretionary
protection system are insufficient to build a system that satisfies the secure operating system
requirements.
Recall that a protection state describes the operations that the system’s subjects can perform on
that system’s objects. The UNIX protection state associates process identities (subjects) with their
access to files (objects). Each UNIX process identity consists of a user id (UID), a group id
(GID), and a set of supplementary groups. These are used in combination to determine access as
described below.
All UNIX resources are represented as files. The protection state specifies that subjects may
perform read, write, and execute operations on files, with the standard meaning of these
operations. While directories are not files, they are represented as files in the UNIX protection
state, although the operations have different semantics (e.g., execute means search for a
directory). Files are also associated with an owner UID and an owner GID which conveys special
privileges to processes with these identities. A process with the owner UID can modify any
aspect of the protection state for this file. Processes with either the owner UID or group GID may
obtain additional rights to access the file as described below.
The limited set of objects and operations enabled UNIX designers to use a compressed access
control list format called UNIX mode bits, to specify the access rights of identities to files. Mode
bits define the rights of three types of subjects: (1) the file owner UID; (2) the file group GID;
and (3) all other subjects. Using mode bits authorization is performed as follows. First, the UNIX
authorization mechanism checks whether the process identity’s UID corresponds to the owner
UID of the file, and if so, uses the mode bits for the owner to authorize access. If the process
identity’s GID or supplementary groups correspond to the file’s group GID, then the mode bits
for the group permissions are used. Otherwise, the permissions assigned to all others are used.
Example 10.1. UNIX mode bits are of the form {owner bits, group bits, others bits} where each
element in the tuple consists of a read bit, a write bit, and an execute bit. The mode bits:
rwxr--r-- mean that a process with the same UID as the owner can read, write, or execute the file,
a process with a GID or supplementary group that corresponds to the file’s group can read the
file, and all other processes can only read the file. Suppose a set of files have the following owners,
groups, and others mode bits as described below:
Name Owner Group Mode Bits
foo alice faculty rwxr--r--
bar bob students rw-rw-r--
baz charlie faculty rwxrwxrwx
Then, processes running as alice with the group faculty can read, write, or execute foo and baz,
but only read bar. For bar, Alice does not match the UID (bob), nor does she have the associated group
(students). The process has the appropriate owner to gain all privileges for foo and the
appropriate group to gain privileges to baz.
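The authorization rule described above can be written down directly. The sketch below selects the owner, group or others bits according to the process identity and then tests the requested rights; the request mask is expressed using the standard "others" constants from <sys/stat.h> (S_IROTH and so on), and the sample identities in the closing comment are made up.

#include <sys/stat.h>   /* for the permission bit constants S_IROTH, S_IWOTH, ... */

/* Returns non-zero if gid or one of the supplementary groups matches file_gid. */
int in_group(int gid, const int *groups, int ngroups, int file_gid)
{
    if (gid == file_gid)
        return 1;
    for (int i = 0; i < ngroups; i++)
        if (groups[i] == file_gid)
            return 1;
    return 0;
}

/* 'want' is a mask built from the "others" constants, e.g. S_IROTH | S_IWOTH.
 * The owner bits sit 6 places to the left of the others bits, the group bits 3. */
int allowed(int mode, int file_uid, int file_gid,
            int uid, int gid, const int *groups, int ngroups, int want)
{
    if (uid == file_uid)
        return ((mode >> 6) & want) == want;   /* use the owner bits  */
    if (in_group(gid, groups, ngroups, file_gid))
        return ((mode >> 3) & want) == want;   /* use the group bits  */
    return (mode & want) == want;              /* use the others bits */
}

/* Example 10.1 revisited (the uids and gids are invented): alice reading foo,
 * mode 0744 with a matching owner uid, is allowed; alice writing bar,
 * mode 0664 with neither the owner uid nor the group matching, is refused. */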
As described above, the UNIX protection system is a discretionary access control system.
Specifically, this means that a file’s mode bits, owner UID, or group GID may be changed by any
UNIX process run by the file’s owner (i.e., any process that has the same UID as the file owner). If we
trust all user processes to act in the best interests of the user, then the user’s security goals can be
enforced. However, this is no longer a reasonable assumption. Nowadays, users run a variety of
processes, some of which may be supplied by attackers and others may be vulnerable to
compromise from attackers, so the user will have no guarantee that these processes will behave
consistently with the user’s security goals. As a result, a secure operating system cannot use
discretionary access control to enforce user security goals.
Since discretionary access control permits users to change their files’ owner UID and group GID
in addition to the mode bits, file labelling is also discretionary. A secure protection system
requires a mandatory labelling state, so this is another reason that UNIX systems cannot satisfy
the requirements of a secure operating system.
UNIX processes are labelled by trusted services from a set of labels (i.e., user UIDs and group
GIDs) defined by trusted administrators, and child processes inherit their process identity from
their parent. This mandatory approach to labelling processes with identities would satisfy the
secure protection system requirements, although it is rather inflexible.
Finally, UNIX mode bits also include a specification for protection domain transitions, called the
setuid bit. When this bit is set on a file, any process that executes the file will automatically
perform a protection domain transition to the file’s owner UID and group GID. For example, if a
root process sets the setuid bit on a file that it owns, then any process that executes that file will
run under the root UID. Since the setuid bit is a mode bit, it can be set by the file’s owner, so it is
also managed in a discretionary manner. A secure protection state requires a mandatory transition
state to describe all protection domain transitions, so the use of discretionary setuid bits is
insufficient.
UNIX authorization occurs when files are opened, and the operations allowed on the file are
verified on each file access. The requesting process provides the name of the file and the
operations that will be requested upon the file in the open system call. If authorized, UNIX
creates a file descriptor that represents the process’s authorized access to perform future
operations on the file.
File descriptors are stored in the kernel, and only an index is returned to the process. Thus, file
descriptors are a form of capability. User processes present their file descriptor index to the
kernel when they request operations on the files that they have opened. UNIX authorization
controls traditional file operations by mediating file open for read, write, and execute
permissions. However, the use of these permissions does not always have the expected effect: (1)
these permissions and their semantics do not always enable adequate control and (2) some objects
are not represented as files, so they are unmediated. If a user has read access to a file, this is
sufficient to perform a wide variety of operations on the file besides reading. For example,
simply via possession of a file descriptor, a user process can perform any ad hoc command on the
file using the system calls ioctl or fcntl, as well as read and modify file metadata. Further, UNIX
does not mediate all security-sensitive objects, such as network communications. Host firewalls
provide some control of network communication, but they do not restrict network communication
by process identity.
The UNIX authorization mechanism depends on user-level authentication services, such as login
and sshd, to determine the process identity (i.e., UID, GID, and supplementary groups, see
Section 4.2.1).When a user logs in to a system, her processes are assigned her login identity. All
subsequent processes created in this login session inherit this identity unless there is a domain
transition (see below). Such user-level services also need root privileges in order to change the
identity of a process, so they run with this special UID. However, several UNIX services need to
run as root in order to have the privileges necessary to perform their tasks. These privileges
include the ability to change process identity, access system files and directories, change file
permissions, etc.
Some of these services are critical to the correct operation of UNIX authorization, such as sshd
and passwd, but others are not, such as inetd and ftp. However, a UNIX system’s trusted
computing base must include all root processes, thus risking compromise of security critical
services and the kernel itself. UNIX protection domain transitions are performed by the setuid
mechanism. setuid is used in two ways: (1) a root process can invoke the setuid system call to
change the UID of a process, and (2) a file can have its setuid mode bit set, such that whenever
it is executed its identity is set to the owner of the file. In the first case, a privileged process, such
as login or sshd, can change the identity of a process. For example, when a user logs in, the login
program must change the process identity of the user’s first process, her shell, to the user to
ensure correct access control. In the second case, the use of the setuid bit on a file is typically
used to permit a lower privileged entity to execute a higher privileged program, almost always as
root.
For example, when a user wishes to change her password, she uses the passwd program. Since
the passwd program modifies the password file, it must be privileged, so a process running with
the user’s identity could not change the password file. The setuid bit on the root-owned passwd
executable is set, so when any user executes passwd, the resultant process identity
transitions to root. While the identity transition does not impact the user’s other processes, the
writers of the passwd program must be careful not to allow the program to be tricked into
allowing the user to control how passwd uses its additional privileges.
UNIX also has a couple of mechanisms that enable a user to run a process with a reduced set of
permissions. Unfortunately, these mechanisms are difficult to use correctly, are only available to
root processes, and can only implement modest restrictions. First, UNIX systems have a special
principal nobody that owns no files and belongs to no groups. Therefore, a process’s permissions
can be restricted by running as nobody since it never has owner or group privileges.
Unfortunately, nobody, like all subjects, still has the others privileges. Also, since only root can call
setuid, only a superuser process can change the process identity to nobody. Second, UNIX chroot
can be used to limit a process to a subtree of the file system [262]. Thus, the process is limited to
only its rights to files within that subtree. Unfortunately, a chroot environment must be set up
carefully to prevent the process from escaping the limited domain. For example, if an attacker can
create /etc/passwd and /etc/shadow files in the subtree, she can add an entry for root, login as this
root, and escape the chroot environment (e.g., using root access to kernel memory). Also, a
chroot environment can only be set up by a root process, so it is not usable by regular system
users. In practice, neither of these approaches has proven to be an effective way to limit process
permissions.
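Used together, and in the right order, the two mechanisms look roughly as follows in a root process: chroot() first (it needs root), then give up the group and user identities. The uid and gid value 65534 for nobody is a common convention but not universal, and the jail directory /var/empty is only an example.

/* Sketch: a root process confining itself before handling untrusted input. */
#include <stdio.h>
#include <unistd.h>

int drop_privileges(const char *jail)
{
    if (chroot(jail) != 0 || chdir("/") != 0)  /* restrict to a subtree; needs root */
        return -1;
    if (setgid(65534) != 0)                    /* give up group privileges          */
        return -1;
    if (setuid(65534) != 0)                    /* give up root itself, last         */
        return -1;
    return 0;                                  /* now running as nobody             */
}

int main(void)
{
    if (drop_privileges("/var/empty") != 0) {
        perror("drop_privileges");
        return 1;
    }
    /* ... handle untrusted input here, with only "others" permissions ... */
    return 0;
}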
1. Complete Mediation: How does the reference monitor interface ensure that all security
sensitive operations are mediated correctly? The UNIX reference monitor interface
consists of hooks to check access for file or inode permission on some system calls. The
UNIX reference monitor interface authorizes access to the objects that the kernel will use
in its operations. A problem is that the limited set of UNIX operations (read, write, and
execute) is not expressive enough to control access to information. UNIX permits
modifications to files without the need for write permission (e.g., fcntl).
2. Complete Mediation: Does the reference monitor interface mediate security-sensitive
operations on all system resources? UNIX authorization does not provide complete
mediation of all system resources. For some objects, such as network communications,
UNIX itself provides no authorization at all.
3. Complete Mediation: How do we verify that the reference monitor interface provides
complete mediation? Since the UNIX reference monitor interface is placed where the
security-sensitive operations are performed, it is difficult to know whether all operations
have been identified and all paths have been mediated. No specific approach has been
used to verify complete mediation.
4. Tamperproof: How does the system protect the reference monitor, including its
protection system, from modification? The reference monitor and protection system are
stored in the kernel, but this does not guarantee tamper-protection. First, the protection
system is discretionary, so it may be tampered with by any running process. Untrusted user
processes can modify permissions to their user’s data arbitrarily, so enforcing security
goals on user data is not possible. Second, the UNIX kernel is not as protected from
untrusted user processes as the Multics kernel is. Both use protection rings for isolation,
but the Multics system also explicitly specifies gates for verifying the legality of the ring
transition arguments. While UNIX kernels often provide procedures to verify system call
arguments, such procedures may be misplaced. Finally, user-level processes have a
variety of interfaces to access and modify the kernel itself above and beyond system calls,
ranging from the ability to install kernel modules to special file systems (e.g., /proc or
sysfs) to interfaces through netlink sockets to direct access to kernel memory (e.g., via the
device file /dev/kmem). Ensuring that these interfaces can only be accessed by trusted
code has become impractical.
5. Tamperproof: Does the system’s protection system protect the trusted computing base
programs? In addition to the kernel, the UNIX TCB consists of all root processes,
including all processes run by a user logged in as a root user. Since these processes could
run any program, guaranteeing the tamper-protection of the TCB is not possible. Even
ignoring root users, the amount of TCB code is far too large and faces far too many
threats to claim a tamperproof trusted computing base. For example, several root
processes have open network ports that may be used as avenues to compromise these
processes. If any of these processes is compromised, the UNIX system is effectively
compromised as there is no effective protection among root processes. Also, any root
process can modify any aspect of the protection system. As we show below, UNIX root
processes may not be sufficiently trusted or protected, so unauthorized modification of the
protection system, in general, is possible. As a result, we cannot depend on a tamperproof
protection system in a UNIX system.
6. Verifiable: What is the basis for the correctness of the system’s TCB? Any basis for
correctness in a UNIX system is informal. The effectively unbounded size of the TCB
prevents any effective formal verification. Further, the size and extensible nature of the
kernel (e.g., via new device drivers and other kernel modules) makes it impractical to
verify its correctness.
7. Verifiable: Does the protection system enforce the system’s security goals? Verifiable
enforcement of security goals is not possible because of the lack of complete mediation
and the lack of tamper proofing. Since we cannot express a policy rich enough to prevent
unauthorized data leakage or modification, we cannot enforce secrecy or integrity security
goals. Since we cannot prove that the TCB is protected from attackers, we cannot prove
that the system will remain able to enforce our intended security goals, even if they
could be expressed properly.
Network-facing Daemons Several of the root processes in the UNIX TCB accept input directly
from the network and are therefore known as network-facing daemons. In order to maintain the
integrity of the system’s trusted computing base, and hence achieve the reference monitor
guarantees, such processes must protect themselves from such input.
However, several vulnerabilities have been reported for such processes, particularly due to buffer
overflows [232, 318], enabling remote attackers to compromise the system TCB. Some of these
daemons have been redesigned to remove many of such vulnerabilities (e.g., Postfix [317, 73] as
a replacement for sendmail and privilege-separated SSH [251]), but a comprehensive justification
of integrity protection for the resulting daemons is not provided. Thus, integrity protection of
network-facing daemons in UNIX is incomplete and ad hoc.
Further, some network-facing daemons, such as remote login daemons (e.g., telnet, rlogin, etc.),
ftpd, and NFS, put an undue amount of trust in the network. The remote login daemons and ftpd
are notorious for sending passwords in the clear. Fortunately, such daemons have been obsoleted
or replaced by more secure versions (e.g., vsftpd for ftpd). Also, NFS is notorious for accepting
any response to a remote file system request as being from a legitimate server. Network-facing
daemons must additionally protect the integrity of their secrets and authenticate the sources of
remote data whose integrity is crucial to the process.
Rootkits Modern UNIX systems support extension via kernel modules that may be loaded
dynamically into the kernel. However, a malicious or buggy module may enable an attacker to
execute code in the kernel, with full system privileges. A variety of malware packages, called
rootkits, have been created for taking advantage of kernel module loading or other interfaces to
the kernel available to root processes. Such rootkits enable the implementation of attacker
functionality and provide measures to evade detection. Despite efforts to detect malware in the
kernel, such rootkits are difficult to detect, in general.
Environment Variables UNIX systems support environment variables, system variables that are
available to processes to convey state across applications. One such variable is LIBPATH whose
value determines the search order for dynamic libraries. A common vulnerability is that an
attacker can change LIBPATH to load an attacker-provided file as a dynamic library. Since
environment variables are inherited when a child process is created, an untrusted process can
invoke a TCB program (e.g., a program file which setuids to root on invocation) under an
untrusted environment. If the TCB process depends on dynamic libraries and does not set the
LIBPATH itself, it may be vulnerable to running malicious code. As many TCB programs can be
invoked via setuid, this is a widespread issue.
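As a purely illustrative sketch (not taken from any particular TCB program), a setuid program written in C can sanitize its environment before doing anything privileged. On Linux the loader search-path variable corresponding to LIBPATH is LD_LIBRARY_PATH; the other variable names below are common examples, and modern loaders already ignore some of them for setuid binaries, so this only demonstrates the general principle:

#include <stdlib.h>

int main(void)
{
    /* Drop variables that influence dynamic linking and program lookup,
       then install a known-safe PATH before doing anything privileged. */
    unsetenv("LD_LIBRARY_PATH");   /* Linux counterpart of LIBPATH */
    unsetenv("LD_PRELOAD");
    unsetenv("IFS");
    setenv("PATH", "/usr/bin:/bin", 1);

    /* ... privileged work, or execve() of a trusted helper, would go here ... */
    return 0;
}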
Further, TCB programs may be vulnerable to any input value supplied by an untrusted process,
such as malicious input arguments. For example, a variety of programs permit the caller to define
the configuration file of the process. A configuration file typically describes all the other places
that the program should look for inputs to describe how it should function, sometimes including
the location of libraries that it should use and the location of hosts that provide network
information. If the attacker can control the choice of a program's configuration file, she often has a
variety of ways to compromise the running process. TCB programs must therefore ensure their
integrity regardless of how they are invoked.
Shared Resources If TCB processes share resources with untrusted processes, then they may be
vulnerable to attack. A common problem is the sharing of the /tmp directory. Since any process
can create files in this directory, an untrusted process is able to create files in this directory and
grant other processes, in particular a TCB process, access to such files as well. If the untrusted
process can guess the name of a TCB process's /tmp file, it can create this file in advance, grant
access to the TCB process, and then have access itself to a TCB file. TCB processes can prevent
this problem by checking for the existence of such files upon creation (e.g., by opening with the
O_CREAT and O_EXCL flags). However, programmers have been prone to forget such safeguards.
TCB processes must take care when using any objects shared by untrusted processes.
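A minimal sketch of the safe pattern in C, assuming the process only needs a private scratch file: mkstemp() generates an unpredictable name and opens it with O_CREAT and O_EXCL, so a file planted in advance by an attacker is never silently reused.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* mkstemp() picks an unpredictable name and opens the file with
       O_CREAT | O_EXCL, so a pre-created file with the same name fails
       the open instead of being handed to the TCB process. */
    char template[] = "/tmp/tcbproc-XXXXXX";
    int fd = mkstemp(template);
    if (fd < 0) {
        perror("mkstemp");
        return 1;
    }
    /* ... use fd for scratch data ... */
    unlink(template);   /* remove the name so it cannot be recycled */
    close(fd);
    return 0;
}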
As a result of the discretionary protection system, the size of the system TCB, and these types of
vulnerabilities, converting a UNIX system to a secure operating system is a significant challenge.
Ensuring that TCB processes protect themselves, and thus protect a reference monitor from
tampering, is a complex undertaking as untrusted processes can control how TCB processes are
invoked and provide inputs in multiple ways: network, environment, and arguments. Further,
untrusted processes may use system interfaces to manipulate any shared resources and may even
change the binding between object name and the actual object.
Figure 10.3: Windows Access Control Lists (ACLs) and process tokens for Examples 10.2 and
10.3
Specifically, the Windows protection system differs from UNIX mainly in the variety of its
objects and operations and the additional flexibility it provides for assigning them to subjects.
When the Windows 2000 access control model was being developed, there were a variety of
security systems being developed that provided administrators with extensible policy languages
that permitted flexible policy specification, such as the Java 2 model. While these models address
some of the shortcomings of the UNIX model by enabling the expression of any protection state,
they do not ensure a secure system.
Subjects in Windows are similar to subjects in UNIX. In Windows, each process is assigned a
token that describes the process's identity. A process identity consists of a user security identifier
(principal SID, analogous to a UNIX UID), a set of group SIDs (rather than a single UNIX GID
and a set of supplementary groups), a set of alias SIDs (to enable actions on behalf of another
identity), and a set of privileges (ad hoc privileges just associated with this token). A Windows
identity is still associated with a single user identity, but a process token for that user may contain
any combination of rights.
Unlike UNIX, Windows objects can belong to a number of different data types besides files. In
fact, applications may define new data types, and add them to the active directory, the
hierarchical name space for all objects known to the system. From an access control perspective,
object types are defined by their set of operations. The Windows model also supports a more
general view of the operations that an object type may possess. Windows defines up to 30
operations per object type, including some operations that are specific to the data type [74]. This
contrasts markedly with the read, write, and execute operations in the UNIX protection state.
Even for file objects, the Windows protection system defines many more operations, such as
operations to access file attributes and synchronize file operations. In addition, applications may
add new object types and define their own operations.
The other major difference between a Windows and UNIX protection state is that Windows
supports arbitrary access control lists (ACLs) rather than the limited mode bits approach of
UNIX. A Windows ACL stores a set of access control entries (ACEs) that describe which
operations an SID (user, group, or alias) can perform on that object. The operations in an ACE
are interpreted based on the object type of the target object. In Windows, ACEs may either grant
or deny an operation. Thus, Windows uses negative access rights, whereas UNIX does not,
generating some difference in their authorization mechanisms.
Example 10.2. Figure 10.3 shows an example ACL for an object foo. foo’s ACL contains three
ACEs. The field principal SID specifies the SID to which the ACE applies. These ACEs apply to
the SIDs Alice, Bob, and Group1. The other two important fields in an ACE are its type (grant or
deny) and the access rights (a bitmask). The Alice and Bob ACEs grant rights, and the Group1
ACE denies access to certain rights. The access rights bitmask is interpreted based on the object
type field in the ACE. We describe how the ACL is used in authorization in the next section.
Windows authorization queries are processed by a specific component called the Security
Reference Monitor (SRM). The SRM is a kernel component that takes a process token, an object
SID, and a set of operations, and it returns a boolean result of an authorization query. The SRM
uses the object SID to retrieve its ACL from which it determines the query result.
Because of the negative permissions, the way that the SRM processes authorization queries is
more complicated than in the UNIX case. The main difference is that the ACEs in an ACL are
ordered, and the ACEs are examined in that order. The SRM searches the ACEs until it finds a
set of ACEs that permits the operation or a single ACE that denies the operation. If an ACE
grants the necessary operations, then the request is authorized. However, if a deny ACE is
encountered that includes one of the requested operations, then the entire request is denied.
Example 10.3. Returning to Example 10.2 above, the ACEs of the object's ACL are ordered as
shown in Figure 10.3. Note that the ACE field for access rights is really a bitmap, but we list the
operations to simplify understanding. Further, we specify the process tokens for two processes,
P1 and P2. Below, we show the authorization results for a set of queries by these processes for
the target object.
P1, read: ok
P1, read, write: no
P2, read: ok
P2, read, write: no
Both P1 and P2 can read the target object, but neither can write the object. P1 cannot write the
object because the P1 token includes Group1, which matches the deny ACE for writing. P2 cannot
write the object because the ACE for Bob does not permit writing.
Mediation in Windows is determined by a set of object managers. Rather than a monolithic set of
system calls to access homogeneous objects (i.e., files) in UNIX, each object type in Windows
has an object manager that implements the functions of that type. While the Windows object
managers all run in the kernel, the object managers are independent entities. This can be
advantageous from a modularity perspective, but the fact that object managers may extend the
system presents some challenges for mediation. We need to know that each new object manager
mediates all operations and determines the rights for those operations correctly. There is no
process for ensuring this in Windows.
In Windows, the trusted computing base consists of all system services and processes running as
a trusted user identity, such as Administrator. Windows provides a setuid-like mechanism for
invoking Windows Services that run at a predefined privilege, at least sufficient to support all
clients. Thus, vulnerabilities in such services would lead to system compromise. Further, the ease
of software installation and complexity of the discretionary Windows access control model often
result in users running as Administrator. In this case, any user program would be able to take
control of the system. This is often a problem on Windows systems. With the release of Windows
Vista, the Windows model is extended to prevent programs downloaded from the Internet from
automatically being able to write Windows applications and the Windows system, regardless of
the user’s process identity [152]. While this does provide some integrity protection, it does not
fully protect the system’s integrity. It prevents low integrity processes from writing to high
integrity files, but does not prevent invocation, malicious requests, or spoofing the high integrity
code into using a low integrity file.
Windows also provides a means for restricting the permissions available to a process flexibly,
called restricted contexts. By defining a restricted context for a process, the permissions
necessary to perform an operation must be available to both the process using its token and to the
restricted context. That is, the permissions of a process running in a restricted context are the
intersection of the restricted context and the process’s normal permissions. Since a restricted
context may be assigned an arbitrary set of permissions, this mechanism is much more flexible
than the UNIX option of running as nobody. Also, since restricted contexts are built into the
access control system, this approach is less error-prone than chroot. Nonetheless, restricted contexts are
difficult for administrators to define correctly, so they are not used commonly, and not at all by
the user community.
1. Complete Mediation: How does the reference monitor interface ensure that all security
sensitive operations are mediated correctly? In Windows, mediation is provided by object
managers. Without the source code, it is difficult to know where mediation is performed,
but we would presume that object managers would authorize the actual objects used in the
security-sensitive operations, similarly to UNIX.
2. Complete Mediation: Does the reference monitor interface mediate security-sensitive
operations on all system resources? Object managers provide an opportunity for complete
mediation, but provide no guarantee of mediation. Further, the set of managers may be
extended, resulting in the addition of potentially insecure object managers. Without a
formal approach that defines what each manager does and how it is to be secured, it will
not be possible to provide a guarantee of complete mediation.
3. Complete Mediation: How do we verify that the reference monitor interface provides
complete mediation? As for UNIX, no specific approach has been used to verify complete
mediation.
4. Tamperproof: How does the system protect the reference monitor, including its
protection system, from modification? Windows suffers from the same problems as UNIX
when it comes to tampering. First, the protection system is discretionary, so it may be
tampered with by any running process. Untrusted user processes can modify permissions to
their user’s data arbitrarily, so enforcing security goals on user data is not possible. Since
users have often run as Administrator to enable ease of system administration, any aspect
of the protection system may be modified. Second, there are limited protections for the
kernel itself. Like UNIX, a Windows kernel can be modified through kernel modules. In
Windows Vista, a code signing process can be used to determine the certifier of a kernel
module (i.e., the signer, not necessarily the writer of the module). Of course, the
administrator (typically an end user) must be able to determine the trustworthiness of the
signer. Security procedures that depend on the decision-making of users are often prone to
failure, as users are often ignorant of the security implications of such decisions. Also,
like UNIX, the Windows kernel also does not define protections for system calls (e.g.,
Multics gates).
5. Tamperproof: Does the system’s protection system protect the trusted computing base
programs? The TCB of a Windows system is no better than that of UNIX. Nearly any
program may be part of the Windows TCB, and any process running these programs can
modify other TCB programs, invalidating the TCB. Like UNIX, any compromised TCB
process can modify the protection system, invalidating the enforcement of system security
goals, and modify the Windows kernel itself through the variety of interfaces provided to
TCB processes to access kernel state. Unlike UNIX, Windows provides APIs to tamper
with other processes in ways that UNIX does not. For example, Windows provides the
CreateRemoteThread function, which enables a process to initiate a thread in another
process. Windows also provides functions for writing a process's memory via
OpenProcess and WriteProcessMemory, so one process can also write the desired code
into that process prior to initiating a thread in that process. While all of these operations
require the necessary access rights to the other process, they usually require enabling the
privileges necessary for debugging a process (via AdjustTokenPrivileges). While such
privileges are typically only available to processes under the same SID, we must verify
that these privileges cannot be misused in order to ensure tamper-protection of our TCB.
6. Verifiable: What is the basis for the correctness of the system's trusted computing base? As
for UNIX, any basis for correctness is informal. Windows also has an unbounded TCB
and extensible kernel system that prevent any effective formal verification.
7. Verifiable: Does the protection system enforce the system’s security goals? The general
Windows model enables any permission combination to be specified, but no particular
security goals are defined in the system. Thus, it is not possible to tell whether a system is
secure. Since the model is more complex than the UNIX model and can be extended
arbitrarily, this makes verifying security even more difficult.
Not surprisingly given its common limitations, Windows suffers from the same kinds of
vulnerabilities as the UNIX system. For example, there are books devoted to constructing
Windows rootkits. Here we highlight a few vulnerabilities that are specific to Windows systems
or are more profound in Windows systems.
The Windows Registry The Windows Registry is a global, hierarchical database used to store data for
all programs. When a new application is loaded, it may update the registry with application-specific
information, such as security-sensitive data like the paths to libraries and executables to be
loaded for the application. While each registry entry can be associated with a security context that
limits access, such limitations are generally not used effectively. For example, the standard
configuration of AOL adds a registry entry that specifies the name of a Windows library file (i.e.,
DLL) to be loaded with AOL software. However, the permissions were set such that any user
could write the entry.
This use of the registry is not uncommon, as vendors have to ensure that their software will
execute when it is downloaded. Naturally, a user will be upset if she downloads some newly-
purchased software, and it does not execute correctly because it could not access its necessary
libraries. Since the application vendors cannot know the ad hoc ways that a Windows system is
administered, they must turn on permissions to ensure that whatever the user does the software
runs. If the registry entry is later used by an attacker to compromise the Windows system, that is
not really the application vendor’s problem—selling applications is.
Administrator Users We mentioned in the Windows security evaluation that traditionally users
ran under the identity Administrator or at least with administrative privileges enabled. The reason
for this is similar to the reason that broad access is granted to registry entries: the user also wants
to be sure that they can use whatever functionality is necessary to make the system run. If the user
downloads some computer game, the user would need special privileges to install the game, and
likely need special privileges to run the device-intensive game program. The last thing the user
wants is to have to figure out why the game will not run, so enabling all privileges works around
this issue. UNIX systems are generally used by more experienced computer users who
understand the difference between installing software (e.g., run sudo) and the normal operation of
the computer. As a result, the distinction between root users and sudo operations has been utilized
more effectively in UNIX.
Enabled By Default Like users and software vendors, Windows deployments also came with full
permissions and functionality enabled. This resulted in the famous Code Red worms [88], which
attacked the SQL server component of the Microsoft IIS web server. Many people who ran IIS
did not have an SQL server running or even know that the SQL server was enabled by default in
their IIS system. But in those halcyon times, IIS web servers ran with all software enabled, so
attackers could send malicious requests to SQL servers on any system, triggering a buffer
overflow that was the basis for this worm's launch. Subsequent versions of IIS are now "locked
down", such that software has to be manually enabled to be accessible.
Activity 10.3
10.7. Summary
Computer systems are particularly prone to faults. The first step is to detect them. Then a system
can try to mask them in some way, such as reconfiguring the system to work around the faulty
element. Users must have confidence that their files will be there when required. Disks do fail, so
operating systems have procedures built into them to recover from such failures. The simplest
method is frequent backups. Distributed systems are more prone to faults, but faulty components
can easily be masked by functioning ones. Threats to security do not just come from malicious
outsiders, but can also result from faults within the system itself. Modern computer systems are
expected to have a formal description of just how secure they are. Protection mechanisms can
range from physical exclusion, through exclusion of unauthorized users, then distinguishing
between users, to fine-grained protection domains.
This investigation of the UNIX and Windows protection systems shows that it is not enough just
to design an operating system to enforce security policies. Security enforcement must be
comprehensive (i.e., mediate completely), mandatory (i.e., tamperproof), and verifiable. Both
UNIX and Windows originated in an environment in which security requirements were very
limited. For UNIX, the only security requirement was protection from other users, and for
Windows, users were assumed to be mutually-trusted on early home computers. The connection
of these systems to untrusted users and malware on the Internet changed the security
requirements for such systems, but the systems did not evolve.
REFERENCES
Gray, J. (1997) Interprocess Communications in Unix. Upper Saddle River, NJ: Prentice Hall.
Robbins, K. and Robbins, S. (1996) Practical Unix Programming. Upper Saddle River, NJ:
Prentice Hall.
Silberschatz, A. and Galvin, P. (1998) Operating System Concepts, 5th edn. Reading, MA:
Addison-Wesley.
Stallings, W. (1995) Operating Systems, 2nd edn. Englewood Cliffs, NJ: Prentice Hall.
Stevens, W. R. (1992) Advanced Programming in the Unix Environment. Reading, MA: Addison-Wesley.
Tanenbaum, A. (1992) Modern Operating Systems. Englewood Cliffs, NJ: Prentice Hall.
Tanenbaum, A. (1995) Distributed Operating Systems. Englewood Cliffs, NJ: Prentice Hall.
Tanenbaum, A. and Woodhull, A. (1997) Operating Systems: Design and Implementation, 2nd
edn. Upper Saddle River, NJ: Prentice Hall.
Unit 11 Case Study 1: LINUX
11.0. Introduction
In this unit we will begin with Linux, a popular variant of UNIX, which runs on a wide variety of
computers. It is one of the dominant operating systems on high-end workstations and servers, but
it is also used on systems ranging from cell phones to supercomputers. It also illustrates many
important design principles well.
Why Linux? Linux is a variant of UNIX, but there are many other versions and variants of UNIX
including AIX, FreeBSD, HP-UX, SCO UNIX, System V, Solaris, and others. Fortunately, the
fundamental principles and system calls are pretty much the same for all of them (by design).
Furthermore, the general implementation strategies, algorithms, and data structures are similar,
but there are some differences. To make the examples concrete, it is best to choose one of them
and describe it consistently.
Since most readers are more likely to have encountered Linux than any of the others, we will use
it as our running example, but again be aware that except for the information on implementation,
much of this chapter applies to all UNIX systems.
Table 11.1 UNIX/Linux History
11.3. Overview of LINUX
In this section we will provide a general introduction to Linux and how it is used, for the benefit
of readers not already familiar with it. Nearly all of this material applies to just about all UNIX
variants with only small deviations. Although Linux has several graphical interfaces, the focus
here is on how Linux appears to a programmer working in a shell window on X. Subsequent
sections will focus on system calls and how it works inside.
UNIX was always an interactive system designed to handle multiple processes and multiple users
at the same time. It was designed by programmers, for programmers, to use in an environment in
which the majority of the users are relatively sophisticated and are engaged in (often quite
complex) software development projects. In many cases, a large number of programmers are
actively cooperating to produce a single system, so UNIX has extensive facilities to allow people
to work together and share information in controlled ways. The model of a group of experienced
programmers working together closely to produce advanced software is obviously very different
from the personal computer model of a single beginner working alone with a word processor, and
this difference is reflected throughout UNIX from start to finish. Linux also inherited many of
these goals, even though the first version was for a personal computer.
A Linux system can be regarded as a kind of pyramid, as illustrated in Fig. 10-1. At the bottom is
the hardware, consisting of the CPU, memory, disks, a monitor and keyboard, and other devices.
Running on the bare hardware is the operating system. Its function is to control the hardware and
provide a system call interface to all the programs. These system calls allow user programs to
create and manage processes, files, and other resources.
Programs make system calls by putting the arguments in registers (or sometimes, on the stack),
and issuing trap instructions to switch from user mode to kernel mode. Since there is no way to
write a trap instruction in C, a library is provided, with one procedure per system call. These
procedures are written in assembly language, but can be called from C. Each one first puts its
arguments in the proper place, then executes the trap instruction. Thus to execute the read system
call, a C program can call the read library procedure. All versions of Linux supply a large
number of standard programs, which include the command processor (shell), compilers, editors,
text processing programs, and file manipulation utilities. It is these programs that a user at the
keyboard invokes. Thus we can speak of three different interfaces to Linux: the true system call
interface, the library interface, and the interface formed by the set of standard utility programs.
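As a minimal sketch of the library interface, the following C program calls the read and write library procedures, which place their arguments where the kernel expects them and execute the trap on the program's behalf:

#include <unistd.h>

int main(void)
{
    char buf[64];
    /* read() is the library procedure for the read system call.
       File descriptor 0 is standard input, 1 is standard output. */
    ssize_t n = read(0, buf, sizeof(buf));
    if (n > 0)
        write(1, buf, (size_t)n);   /* echo what was read back to the user */
    return 0;
}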
Popular desktop environments for Linux include GNOME (GNU Network Object Model
Environment) and KDE (K Desktop Environment). GUIs on Linux are supported by the X
Window System, commonly known as X11 or just X, which defines communication and display
protocols for manipulating windows on bitmap displays for UNIX and UNIX-like systems. The
X server is the main component which controls devices such as keyboards, mouse, and screen
and is responsible for redirecting input to or accepting output from client programs.
Most programmers and sophisticated users prefer a command-line interface, called the shell.
Common shells include bash, ksh, and csh, but bash is the default shell on most Linux systems.
When the shell starts up, it initializes itself, then types a prompt character. When the user types a
command line, the shell extracts the first word from it, assumes it is the name of a program to be
run, searches for this program, and if it finds it, runs the program. The shell then suspends itself
until the program terminates, at which time it tries to read the next command. For example, the
command line cp src dest invokes the cp program with two arguments, src and dest. This program
interprets the first one to be the name of an existing file. It makes a copy of this file and calls the
copy dest. To make it easy to specify multiple file names, the shell accepts magic characters,
sometimes called wild cards. An asterisk, for example, matches all possible strings, so
ls *.c tells ls to list all the files whose name ends in .c. If files named x.c, y.c, and z.c all exist, the
above command is equivalent to typing ls x.c y.c z.c. Another wild card is the question mark,
which matches any one character. A list of characters inside square brackets selects any of them,
so ls [ape]* lists all files beginning with "a", "p", or "e".
The POSIX 1003.2 standard specifies the syntax and semantics of just under 100 of these,
primarily in the first three categories. A selection of the POSIX utility programs is listed in
Figure 11.2, along with a short description of each. All Linux systems have them and many more.
Figure 11.2. A few of the common Linux utility programs required by POSIX.
The kernel sits directly on the hardware and enables interactions with I/O devices and the
memory management unit and controls CPU access to them. At the lowest level, as shown in
Figure 11-3, it contains interrupt handlers, which are the primary way for interacting with devices, and
the low-level dispatching mechanism. This dispatching occurs when an interrupt happens. The
low-level code here stops the running process, saves its state in the kernel process structures, and
starts the appropriate driver. Process dispatching also happens when the kernel completes some
operations and it is time to start up a user process again. The dispatching code is in assembler and
is quite distinct from scheduling.
We divide the various kernel subsystems into three main components: I/O, memory
management, and process management. The I/O component is responsible for interacting with
devices and performing network and storage I/O operations. Memory management tasks include
maintaining the virtual-to-physical memory mappings, maintaining a cache of recently accessed
pages, implementing a good page replacement policy, and bringing new pages of needed code
and data into memory on demand. The process management component's job is the creation
and termination of processes. It also includes the process scheduler, which chooses which process
or, rather, thread to run next. Finally, code for signal handling also belongs to this component.
The kernel itself consists of an interacting collection of components, with arrows indicating the
main interactions. The underlying hardware is also depicted as a set of components with arrows
indicating which kernel components use or control which hardware components. All of the kernel
components, of course, execute on the CPU but, for simplicity, these relationships are not shown.
• Signals: The kernel uses signals to call into a process. For example, signals are used to
notify a process of certain faults, such as division by zero. Table 2.6 gives a few examples
of signals.
• System calls: The system call is the means by which a process requests a specific kernel
service. There are several hundred system calls, which can be roughly grouped into six
categories: filesystem, process, scheduling, interprocess communication, socket
(networking), and miscellaneous.
• Processes and scheduler: Creates, manages, and schedules processes.
• Virtual memory: Allocates and manages virtual memory for processes.
• File systems: Provides a global, hierarchical namespace for files, directories, and other
file related objects and provides file system functions.
• Network protocols: Supports the Sockets interface to users for the TCP/IP protocol suite.
• Character device drivers: Manages devices that require the kernel to send or receive
data one byte at a time, such as terminals, modems, and printers.
• Block device drivers: Manages devices that read and write data in blocks, such as
various forms of secondary memory (magnetic disks, CDROMs, etc.).
• Network device drivers: Manages network interface cards and communications ports
that connect to network devices, such as bridges and routers.
• Traps and faults: Handles traps and faults generated by the CPU, such as a memory
fault.
• Physical memory: Manages the pool of page frames in real memory and allocates pages
for virtual memory.
• Interrupts: Handles interrupts from peripheral devices.
Figure 11-3. Structure of the Linux kernel
The main active entities in a Linux system are the processes. Each process runs a single program
and initially has a single thread of control. In other words, it has one program counter, which
keeps track of the next instruction to be executed. Linux allows a process to create additional
threads once it starts executing.
Many processes run in the background as daemons, started when the system is booted, to handle
incoming and outgoing electronic mail, manage the line printer queue, check if there are enough
free pages in memory, and so forth. Each daemon is a separate process, independent of all other processes.
Figure 11.4. Linux Process/Thread Model
Processes are created in Linux in an especially simple manner. The fork system call creates an
exact copy of the original process. The forking process is called the parent process. The new
process is called the child process. The parent and child each have their own, private memory
images. If the parent subsequently changes any of its variables, the changes are not visible to the
child, and vice versa. Open files are shared between parent and child. That is, if a certain file was
open in the parent before the fork, it will continue to be open in both the parent and the child
afterward.
Processes are named by their PIDs. When a process is created, the parent is given the child's PID.
Both processes normally check the return value and act accordingly, as shown in Fig. 13-5. If the
child wants to know its own PID, there is a system call, getpid that provides it. PIDs are used in a
variety of ways. For example, when a child terminates, the parent is given the PID of the child
that just finished. This can be important because a parent may have many children. Since children
may also have children, an original process can build up an entire tree of children, grandchildren,
and further descendants.
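A minimal sketch of this pattern in C is shown below: fork returns the child's PID to the parent and 0 to the child, and the parent uses that PID to wait for its particular child (the messages printed are illustrative only).

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();              /* returns 0 in the child, the child's PID in the parent */
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        printf("child: my pid is %d\n", (int)getpid());
    } else {
        printf("parent: created child %d\n", (int)pid);
        waitpid(pid, NULL, 0);       /* wait for that particular child to terminate */
    }
    return 0;
}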
Processes in Linux can communicate with each other using a form of message passing. It is
possible to create a channel between two processes into which one process can write a stream of
bytes for the other to read. These channels are called pipes. Synchronization is possible because
when a process tries to read from an empty pipe it is blocked until data are available.
Shell pipelines are implemented with pipes: the shell connects the standard output of one
command to the standard input of the next through a pipe.
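A minimal sketch of the underlying mechanism in C, assuming a single pipe between a parent and one child: the parent blocks in read until the child has written data into the pipe.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    if (pipe(fd) < 0) {              /* fd[0] is the read end, fd[1] the write end */
        perror("pipe");
        return 1;
    }
    if (fork() == 0) {               /* child: write a message into the pipe */
        close(fd[0]);
        const char *msg = "hello through the pipe\n";
        write(fd[1], msg, strlen(msg));
        close(fd[1]);
        _exit(0);
    }
    close(fd[1]);                    /* parent: read blocks until data arrive */
    char buf[64];
    ssize_t n = read(fd[0], buf, sizeof(buf));
    if (n > 0)
        write(1, buf, (size_t)n);
    close(fd[0]);
    return 0;
}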
When a signal occurs, the process has to tell the kernel what to do with it. A signal can be
disposed of in one of three ways:
1. The signal can be ignored, meaning that nothing is done when the signal occurs. Most
signals can be ignored, but ignoring signals generated by hardware exceptions, such as
division by zero, can have unpredictable consequences. Also, a couple of signals,
SIGKILL and SIGSTOP, cannot be ignored.
2. The signal can be caught. In this case the process registers a handler function with the
kernel, and the kernel calls this function when the signal occurs. If the signal is non-fatal
for the process, the handler can deal with it properly; otherwise the process can choose to
terminate gracefully. A small sketch of catching and ignoring signals follows this list.
3. The default action can be allowed to apply. Every signal has a default action, such as
terminating the process or ignoring the signal.
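The sketch below, in C, illustrates the second and first options: it catches SIGINT with sigaction and ignores SIGQUIT, leaving everything else at its default action. The handler name and printed message are illustrative only.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint = 0;

static void handle_sigint(int signo)
{
    (void)signo;
    got_sigint = 1;                  /* just record the event; keep handlers small */
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = handle_sigint;   /* option 2: catch SIGINT (Ctrl-C) */
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGINT, &sa, NULL);

    signal(SIGQUIT, SIG_IGN);        /* option 1: ignore SIGQUIT */
                                     /* all other signals keep their default action (option 3) */
    while (!got_sigint)
        pause();                     /* sleep until a signal arrives */
    printf("caught SIGINT, exiting cleanly\n");
    return 0;
}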
11.4.2 Process Management System Calls in Linux
Let us now look at the Linux system calls dealing with process management. The main ones are
listed in Fig. 11-7. A system call is how a program requests a service from the kernel. This may
include hardware-related services (e.g., accessing the hard disk), creating and executing new
processes, and communicating with integral kernel services (such as scheduling). System calls
provide an essential interface between a process and the operating system. When a program
makes a system call, the arguments are packaged up and handed to the kernel, which takes over
execution of the program until the call completes.
Some system calls are very powerful and can exert great influence on the system. For instance,
some system calls enable you to shut down the Linux system or to allocate system resources and
prevent other users from accessing them. These calls have the restriction that only processes
running with superuser privilege (programs run by the root account) can invoke them. These calls
fail if invoked by a nonsuperuser process.
Figure 11-7. Some system calls relating to processes. The return code s is -1 if an error has
occurred, pid is a process ID, and residual is the remaining time in the previous alarm. The
parameters are what the names suggest.
Figure 11-8. A highly simplified shell.
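The figure itself is not reproduced in this guide. As a stand-in, the following is a minimal sketch in C of what such a highly simplified shell might look like: read a line, split it into arguments, fork a child that execs the command, and wait for it to finish (no pipes, redirection, or wildcards).

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char line[256];
    for (;;) {
        printf("$ ");                            /* print the prompt */
        fflush(stdout);
        if (fgets(line, sizeof(line), stdin) == NULL)
            break;                               /* end of input: leave the shell */
        line[strcspn(line, "\n")] = '\0';        /* strip the trailing newline */
        if (line[0] == '\0')
            continue;

        char *argv[32];                          /* split into whitespace-separated words */
        int argc = 0;
        for (char *tok = strtok(line, " "); tok != NULL && argc < 31; tok = strtok(NULL, " "))
            argv[argc++] = tok;
        argv[argc] = NULL;

        pid_t pid = fork();                      /* create a child to run the command */
        if (pid < 0) {
            perror("fork");
            continue;
        }
        if (pid == 0) {
            execvp(argv[0], argv);               /* replace the child with the command */
            perror(argv[0]);                     /* only reached if exec fails */
            _exit(127);
        }
        waitpid(pid, NULL, 0);                   /* parent waits until the command finishes */
    }
    return 0;
}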
Linux Threads
Traditional UNIX systems support a single thread of execution per process, while modern UNIX
systems typically provide support for multiple kernel-level threads per process. As with
traditional UNIX systems, older versions of the Linux kernel offered no support for
multithreading. Instead, applications would need to be written with a set of user-level library
functions, the most popular of which is the pthread (POSIX threads) library, with all of
the threads mapping into a single kernel-level process. We have seen that modern versions of
UNIX offer kernel-level threads. Linux provides a unique solution in that it does not recognize a
distinction between threads and processes. Using a mechanism similar to the lightweight
processes of Solaris, user-level threads are mapped into kernel-level processes. Multiple user
level threads that constitute a single user-level process are mapped into Linux kernel-level
processes that share the same group ID. This enables these processes to share resources such as
files and memory and to avoid the need for a context switch when the scheduler switches among
processes in the same group.
A new process is created in Linux by copying the attributes of the current process. A new process
can be cloned so that it shares resources, such as files, signal handlers, and virtual memory. When
the two processes share the same virtual memory, they function as threads within a single
process. However, no separate type of data structure is defined for a thread. In place of the usual
fork() command, processes are created in Linux using the clone() command. The traditional
fork() system call is implemented by Linux as a clone() system call with all of the clone flags
cleared.
When the Linux kernel performs a switch from one process to another, it checks whether the
address of the page directory of the current process is the same as that of the to-be-scheduled
process. If they are, then they are sharing the same address space, so that a context switch is
basically just a jump from one location of code to another location of code. Although cloned
processes that are part of the same process group can share the same memory space, they cannot
share the same user stacks. Thus the clone() call creates separate stack spaces for each process.
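A minimal sketch of using the glibc clone() wrapper directly is shown below; the flag combination and stack size are illustrative, not the exact flags used by any particular thread library. With CLONE_VM the child shares the parent's address space, so its write to the shared variable is visible to the parent, while its stack is the separate buffer supplied by the caller.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared = 0;                 /* visible to both tasks because of CLONE_VM */

static int child_fn(void *arg)
{
    (void)arg;
    shared = 42;                       /* writes the parent's memory as well */
    return 0;
}

int main(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);  /* clone() needs a caller-supplied stack */
    if (stack == NULL)
        return 1;

    /* Share the address space, filesystem information and open files, i.e.
       behave much like a thread; SIGCHLD lets the parent wait for the child.
       The stack top is passed because stacks grow downward on most machines. */
    pid_t pid = clone(child_fn, stack + stack_size,
                      CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, NULL);
    if (pid < 0) {
        perror("clone");
        return 1;
    }
    waitpid(pid, NULL, 0);
    printf("shared = %d\n", shared);   /* prints 42 because memory is shared */
    free(stack);
    return 0;
}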
The implementation of threads on GNU/Linux differs from that on many other UNIX-like
systems in an important way: on GNU/Linux, threads are implemented as processes. Whenever
you call pthread_create to create a new thread, Linux creates a new process that runs that thread.
However, this process is not the same as a process you would create with fork; in particular, it
shares the same address space and resources as the original process rather than receiving copies.
The program thread-pid, shown below, demonstrates this. The program creates a thread; both
the original thread and the new one call the getpid function, print their respective PIDs, and
then spin forever.
/* Compile with: cc thread-pid.c -o thread-pid -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

void *thread_function(void *arg)
{
    (void)arg;
    fprintf(stderr, "child thread pid is %d\n", (int)getpid());
    while (1);                      /* Spin forever. */
    return NULL;
}

int main(void)
{
    pthread_t thread;
    fprintf(stderr, "main thread pid is %d\n", (int)getpid());
    pthread_create(&thread, NULL, &thread_function, NULL);
    while (1);                      /* Spin forever. */
    return 0;
}
Run the program in the background, and then invoke ps x to display your running processes.
Don't forget to kill the thread-pid program afterward—it consumes lots of CPU doing nothing.
Here's what the output might look like:
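The original listing is not reproduced in this guide; the lines below are only an illustration of the relevant part of the ps x output, using the PIDs discussed next (the TTY, STAT, and TIME columns are invented for the example).

  PID TTY      STAT   TIME COMMAND
14608 pts/9    R      0:01 ./thread-pid
14609 pts/9    S      0:00 ./thread-pid
14610 pts/9    R      0:01 ./thread-pid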
Notice that there are three processes running the thread-pid program. The first of these, with pid
14608, is the main thread in the program; the third, with pid 14610, is the thread we created to
execute thread_function. How about the second thread, with pid 14609? This is the "manager
thread," which is part of the internal implementation of GNU/Linux threads. The manager thread
is created the first time the program calls pthread_create.
11.5. Memory Management in Linux
Linux shares many of the characteristics of the memory management schemes of other UNIX
implementations but has its own unique features. Overall, the Linux memory-management
scheme is quite complex. In this section, we give a brief overview of the two main aspects of
Linux memory management: process virtual memory, and kernel memory allocation.
Linux uses a three-level page table structure for virtual memory. To use it, a virtual address is
viewed as consisting of four fields (Figure 11.11). The leftmost (most significant) field is used as an index into the page
directory. The next field serves as an index into the page middle directory. The third field serves
as an index into the page table. The fourth field gives the offset within the selected page of
memory.
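As a purely illustrative sketch (the real field widths depend on the architecture and kernel configuration), the following C fragment splits a 32-bit virtual address into the four indices, assuming widths of 2, 9, 9, and 12 bits:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Assumed, illustrative field widths: 2-bit directory index, 9-bit
       middle-directory index, 9-bit page-table index, 12-bit page offset. */
    uint32_t vaddr = 0xC0A81234u;

    uint32_t offset = vaddr & 0xFFF;           /* bits 0-11  */
    uint32_t pte    = (vaddr >> 12) & 0x1FF;   /* bits 12-20 */
    uint32_t pmd    = (vaddr >> 21) & 0x1FF;   /* bits 21-29 */
    uint32_t pgd    = (vaddr >> 30) & 0x3;     /* bits 30-31 */

    printf("pgd=%u pmd=%u pte=%u offset=0x%03x\n", pgd, pmd, pte, offset);
    return 0;
}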
The Linux page table structure is platform independent and was designed to accommodate the
64-bit Alpha processor, which provides hardware support for three levels of paging. With 64-bit
addresses, the use of only two levels of pages on the Alpha would result in very large page tables
and directories. The 32-bit Pentium/x86 architecture has a two-level hardware paging
mechanism. The Linux software accommodates the two-level scheme by defining the size of the
page middle directory as one. Note that all references to an extra level of indirection are
optimized away at compile time, not at run time. Therefore, there is no performance overhead for
using the generic three-level design on platforms that support only two levels in hardware.
Figure 11.11 Address Translation in Linux Virtual Memory Scheme
Page Allocation
To enhance the efficiency of reading in and writing out pages to and from main memory, Linux
defines a mechanism for dealing with contiguous blocks of pages mapped into contiguous blocks
of page frames. For this purpose, the buddy system is used. The kernel maintains a list of
contiguous page frame groups of fixed size; a group may consist of 1, 2, 4, 8, 16, or 32 page
frames. As pages are allocated and deallocated in main memory, the available groups are split
and merged using the buddy algorithm.
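A minimal sketch of just the rounding step is shown below; the real allocator also maintains a free list per group size and merges freed buddies back into larger groups.

#include <stdio.h>

/* Round a request of n page frames up to the group size the buddy allocator
   would hand out; groups come in sizes 1, 2, 4, 8, 16, and 32 frames. */
static unsigned buddy_group_size(unsigned n)
{
    unsigned size = 1;
    while (size < n)
        size <<= 1;
    return size;
}

int main(void)
{
    for (unsigned n = 1; n <= 9; n++)
        printf("request for %u frames -> allocate a group of %u\n",
               n, buddy_group_size(n));
    return 0;
}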
The Linux memory model is straightforward, to make programs portable and to make it possible
to implement Linux on machines with widely differing memory management units, ranging from
essentially nothing (e.g., the original IBM PC) to sophisticated paging hardware. This is an area
of the design that has barely changed in decades. It has worked well so it has not needed much
revision. We will now examine the model and how it is implemented.
When a new I/O request arrives, the elevator scheduler in Linux handles it according to the
following rules:
(a) If the request is to the same on-disk sector or an immediately adjacent sector to a pending
request in the queue, then the existing request and the new request are merged into one
request.
(b) If a request in the queue is sufficiently old, the new request is inserted at the tail of the
queue.
(c) If there is a suitable location, the new request is inserted in sorted order.
(d) If there is no suitable location, the new request is placed at the tail of the queue.
An even more serious problem concerns the distinction between read and write requests.
Typically, a write request is issued asynchronously. That is, once a process issues the write
request, it need not wait for the request to actually be satisfied. When an application issues a
write, the kernel copies the data into an appropriate buffer, to be written out as time permits.
Once the data are captured in the kernel's buffer, the application can proceed. However, for many
read operations, the process must wait until the requested data are delivered to the application
before proceeding. Thus, a stream of write requests (for example, to place a large file on the disk)
can block a read request for a considerable time and thus block a process.
To overcome these problems, the deadline I/O scheduler makes use of three queues (Figure
11.12). Each incoming request is placed in the sorted elevator queue, as before. In addition, the
same request is placed at the tail of a read FIFO queue for a read request or a write FIFO queue
for a write request. Thus, the read and write queues maintain a list of requests in the sequence in
which the requests were made. Associated with each request is an expiration time, with a default
value of 0.5 seconds for a read request and 5 seconds for a write request. Ordinarily, the
scheduler dispatches from the sorted queue. When a request is satisfied, it is removed from the
head of the sorted queue and also from the appropriate FIFO queue. However, when the item at
the head of one of the FIFO queues becomes older than its expiration time, then the scheduler
next dispatches from that FIFO queue, taking the expired request, plus the next few requests from
the queue. As each request is dispatched, it is also removed from the sorted queue. The deadline
I/O scheduler scheme overcomes the starvation problem and also the read versus write problem.
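The dispatch decision can be sketched as follows in C; the function name and values are illustrative, not the actual kernel code. A FIFO head that has passed its deadline is serviced first, otherwise dispatching continues from the sorted elevator queue.

#include <stdio.h>

/* Illustrative sketch of the deadline scheduler's dispatch decision.
   Each pending request carries an expiry time: reads default to 0.5 s
   after arrival, writes to 5 s after arrival. */
enum source { SORTED, READ_FIFO, WRITE_FIFO };

static enum source pick_queue(double read_head_expiry,
                              double write_head_expiry,
                              double now)
{
    if (read_head_expiry <= now)   return READ_FIFO;   /* an expired read is served first */
    if (write_head_expiry <= now)  return WRITE_FIFO;  /* then an expired write           */
    return SORTED;                                     /* otherwise elevator order        */
}

int main(void)
{
    double now = 10.0;   /* current time in seconds; all values are made up for the demo */
    printf("nothing expired -> %d\n", pick_queue(10.4, 14.0, now)); /* 0: SORTED     */
    printf("read expired    -> %d\n", pick_queue( 9.9, 14.0, now)); /* 1: READ_FIFO  */
    printf("write expired   -> %d\n", pick_queue(10.4,  9.5, now)); /* 2: WRITE_FIFO */
    return 0;
}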
Figure 11.12. The Linux Deadline I/O Scheduler
The original elevator scheduler and the deadline scheduler both are designed to dispatch a new
request as soon as the existing request is satisfied, thus keeping the disk as busy as possible.
Typically, an application will wait until a read request is satisfied and the data are available before
issuing the next request. The small delay between receiving the data for the last read and issuing
the next read enables the scheduler to turn elsewhere for a pending request and dispatch that
request.
Because of the principle of locality, it is likely that successive reads from the same process will
be to disk blocks that are near one another. If the scheduler were to delay a short period of time
after satisfying a read request, to see if a new nearby read request is made, the overall
performance of the system could be enhanced. This is the philosophy behind the anticipatory
scheduler, and is implemented in Linux 2.6.
In Linux, the anticipatory scheduler is superimposed on the deadline scheduler. When a read
request is dispatched, the anticipatory scheduler causes the scheduling system to delay for up to 6
milliseconds, depending on the configuration. During this small delay, there is a good chance that
the application that issued the last read request will issue another read request to the same region
of the disk. If so, that request will be serviced immediately. If no such read request occurs, the
scheduler resumes using the deadline scheduling algorithm.
The page cache confers two benefits. First, when it is time to write back dirty pages to disk, a
collection of them can be ordered properly and written out efficiently. Second, because of the
principle of temporal locality, pages in the page cache are likely to be referenced again before
they are flushed from the cache, thus saving a disk I/O operation. Dirty pages are written back to
disk in two situations:
• When free memory falls below a specified threshold, the kernel reduces the size of the
page cache to release memory to be added to the free memory pool.
• When dirty pages grow older than a specified threshold, a number of dirty pages are
written back to disk.
The I/O system in Linux is fairly straightforward and much the same as in other UNIX systems. Basically, all
I/O devices are made to look like files and are accessed as such with the same read and write
system calls that are used to access all ordinary files. In some cases, device parameters must be
set, and this is done using a special system call. We will study these issues in the following
sections.
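As a small illustration of the "devices look like files" idea, the following C program opens a device file with the ordinary open call and reads from it with the ordinary read call; /dev/urandom is used here only because it is a convenient, commonly present device.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* A device is opened and read with the same calls used for ordinary files. */
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    unsigned char buf[8];
    if (read(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf)) {
        for (size_t i = 0; i < sizeof(buf); i++)
            printf("%02x", buf[i]);     /* print the bytes read from the device */
        printf("\n");
    }
    close(fd);
    return 0;
}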
For convenience, the Linux file system is usually thought of in a tree structure. On a standard
Linux system you will find the layout generally follows the scheme presented below. This is a
layout from a RedHat system. Depending on the system admin, the operating system and the
mission of the UNIX machine, the structure may vary, and directories may be left out or added at
will. The names are not even required; they are only a convention. The tree of the file system
starts at the trunk or slash, indicated by a forward slash (/). This directory, containing all
underlying directories and files, is also called the root directory or "the root" of the file system.
Directories that are only one level below the root directory are often preceded by a slash, to
indicate their position and prevent confusion with other directories that could have the same
name. When starting with a new system, it is always a good idea to take a look in the root
directory. Let's see what you could run into:
emmy:~>cd /
emmy:/>ls
bin/ dev/ home/ lib/ misc/ opt/ root/ tmp/ var/
boot/ etc/ initrd/ lost+found/ mnt/ proc/ sbin/ usr/
Directory      Content
/bin           Common programs, shared by the system, the system administrator and the users.
/boot          The startup files and the kernel, vmlinuz. In some recent distributions also grub data. Grub is the GRand Unified Boot loader and is an attempt to get rid of the many different boot-loaders we know today.
/dev           Contains references to all the CPU peripheral hardware, which are represented as files with special properties.
/etc           Most important system configuration files are in /etc; this directory contains data similar to those in the Control Panel in Windows.
/home          Home directories of the common users.
/initrd        (on some distributions) Information for booting. Do not remove!
/lib           Library files, includes files for all kinds of programs needed by the system and the users.
/lost+found    Every partition has a lost+found in its upper directory. Files that were saved during failures are here.
/misc          For miscellaneous purposes.
/mnt           Standard mount point for external file systems, e.g. a CD-ROM or a digital camera.
/net           Standard mount point for entire remote file systems.
/opt           Typically contains extra and third party software.
/proc          A virtual file system containing information about system resources. More information about the meaning of the files in proc is obtained by entering the command man proc in a terminal window. The file proc.txt discusses the virtual file system in detail.
/root          The administrative user's home directory. Mind the difference between /, the root directory, and /root, the home directory of the root user.
/sbin          Programs for use by the system and the system administrator.
/tmp           Temporary space for use by the system, cleaned upon reboot, so don't use this for saving any work!
/usr           Programs, libraries, documentation etc. for all user-related programs.
/var           Storage for all variable files and temporary files created by users, such as log files, the mail queue, the print spooler area, space for temporary storage of files downloaded from the Internet, or to keep an image of a CD before burning it.
How can you find out which partition a directory is on? Using the df command with a dot (.) as
an option shows the partition the current directory belongs to, and informs about the amount of
space used on this partition:
sandra:/lib>df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/hda7 980M 163M 767M 18% /
As a general rule, every directory under the root directory is on the root partition, unless it has a
separate entry in the full listing from df (or df -h with no other options).
For most users and for most common system administration tasks, it is enough to accept that files
and directories are ordered in a tree-like structure. The computer, however, doesn't understand a
thing about trees or tree-structures. Every partition has its own file system. By imagining all those
file systems together, we can form an idea of the tree-structure of the entire system, but it is not
as simple as that. In a file system, a file is represented by an inode, a kind of serial number
containing information about the actual data that makes up the file: to whom this file belongs, and
where it is located on the hard disk.
Every partition has its own set of inodes; throughout a system with multiple partitions, files with
the same inode number can exist. Each inode describes a data structure on the hard disk, storing
the properties of a file, including the physical location of the file data. When a hard disk is
initialized to accept data storage, usually during the initial system installation process or when
adding extra disks to an existing system, a fixed number of inodes per partition is created. This
number is the maximum number of files, of all types (including directories, special files,
links etc.) that can exist at the same time on the partition. We typically count on having 1 inode
per 2 to 8 kilobytes of storage.
At the time a new file is created, it gets a free inode. That inode stores the file's properties, such
as its owner and group, its access permissions, its size, its timestamps, and the location of its data
blocks on disk. The only information not included in an inode is the file name and directory;
these are stored in the special directory files. By comparing file names and inode numbers, the
system can make up a tree-structure that the user understands. Users can display inode numbers
using the -i option to ls. The inodes have their own separate space on the disk.
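The inode information can also be examined from a program through the stat system call, as in the following small C sketch (the file name /etc/passwd is just a convenient example):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    /* stat() returns the information kept in the file's inode. */
    if (stat("/etc/passwd", &st) != 0) {
        perror("stat");
        return 1;
    }
    printf("inode number: %lu\n", (unsigned long)st.st_ino);
    printf("owner uid:    %u\n",  (unsigned)st.st_uid);
    printf("size:         %lld bytes\n", (long long)st.st_size);
    printf("link count:   %lu\n", (unsigned long)st.st_nlink);
    return 0;
}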
ACTIVITY 11.1
1. What are the advantages and disadvantages of making only some of the symbols defined
inside a kernel accessible to a loadable kernel module?
2. The Linux scheduler implements soft real-time scheduling. What features necessary for
certain real-time programming tasks are missing? How might they be added to the kernel?
3. In what ways does the Linux setuid feature differ from the setuid feature in standard
Unix?
4. What socket type should be used to implement an intercomputer file-transfer program?
What type should be used for a program that periodically tests to see whether another
computer is up on the network? Explain your answer.
5. What scenarios would cause a page of memory to be mapped into a user program's
address space with the copy-on-write attribute enabled?
6. What extra costs are incurred in the creation and scheduling of a process, compared with
the cost of a cloned thread?
7. Linux runs on a variety of hardware platforms. What steps must Linux developers take to
ensure that the system is portable to different processors and memory-management
architectures and to minimize the amount of architecture-specific kernel code?
8. Multithreading is a commonly used programming technique. Describe three different
ways to implement threads, and compare these three methods with the Linux clone()
mechanism. When might using each alternative mechanism be better or worse than using
clones?
9. The Linux source code is freely and widely available over the Internet and from CD-
ROM vendors. What are three implications of this availability for the security of the
Linux system?
10. Under what circumstances would a user process request an operation that results in the
allocation of a demand-zero memory region?
11. What are the primary goals of the conflict-resolution mechanism used by the Linux kernel
for loading kernel modules?
12. What are the advantages and disadvantages of writing an operating system in a high-level
language, such as C?
13. In Linux, shared libraries perform many operations central to the operating system. What
is the advantage of keeping this functionality out of the kernel? Are there any drawbacks?
Explain your answer.
14. The Linux kernel does not allow paging out of kernel memory. What effect does this
restriction have on the kernel's design? What are two advantages and two disadvantages
of this design decision?
15. The directory structure of a Linux operating system could include files corresponding to
several different file systems, including the Linux /proc file system. How might the need
to support different file-system types affect the structure of the Linux kernel?
16. Would you classify Linux threads as user-level threads or as kernel-level threads? Support
your answer with the appropriate arguments.
17. At one time, UNIX systems used disk-layout optimizations based on the rotation position
of disk data, but modern implementations, including Linux, simply optimize for
sequential data access. Why do they do so? Of what hardware characteristics does
sequential access take advantage? Why is rotational optimization no longer so useful?
18. Discuss how the clone() operation supported by Linux is used to support both processes
and threads.
19. In what circumstances is the system-call sequence fork() exec() most appropriate? When
is vfork() preferable?
11.8. Summary
Linux began its life as an open-source, full production UNIX clone, and is now used on machines
ranging from notebook computers to supercomputers. Three main interfaces to it exist: the shell,
the C library, and the system calls themselves. In addition, a graphical user interface is often used
to simplify user interaction with the system. Process management in Linux differs from that in
other UNIX systems in that Linux views each execution entity (a single-threaded process, each
thread of a multithreaded process, or the kernel) as a distinguishable task. The memory
model consists of three segments per process: text, data, and stack. Memory management is done
by paging. The file system is hierarchical with files and directories. All disks are mounted into a
single directory tree starting at a unique root. Individual files can be linked into a directory from
elsewhere in the file system.
REFERENCES
Unit 12 Case Study: Windows Vista
12.0. Introduction
Windows is a modern operating system that runs on consumer and business desktop PCs and
enterprise servers. The most recent desktop version is Windows 8. In this Case Study we are
going to look at Windows Vista starting with a brief history, then moving on to its architecture.
After this we will look at processes, memory management, caching, I/O, the file system, and
finally, security.
DOS 2.0 was released in 1983 for the hard-disk-based personal computer, the PC XT. This version had support for the hard disk and provided hierarchical directories. Before this, a
disk could contain only one directory of files, supporting a maximum of 64 files. This was too
limited for a hard disk, and the single-directory restriction was too clumsy. DOS 2.0 allowed
directories to contain subdirectories as well as files. It also contained a richer set of commands
embedded in the OS to provide functions that had to be performed by external programs provided
as utilities with Release 1. Among the capabilities added were several UNIX-like features, such as
I/O redirection, which is the ability to change the input or output identity for a given application,
and background printing. The memory-resident portion grew to 24 Kbytes.
In 1984, Microsoft introduced DOS 3.0 for the PC AT computer which had the Intel 80286
processor. The 80286 provided extended addressing and memory protection features, although
these were not used by DOS. To remain compatible with previous releases, the OS simply used
the 80286 as a “fast 8086.” The OS did provide support for new keyboard and hard disk
peripherals. Even so, the memory requirement grew to 36 Kbytes. There were several notable
upgrades to the 3.0 release.
In 1984, Microsoft released DOS 3.1, which contained support for networking of PCs. The size of the
resident portion did not change; this was achieved by increasing the amount of the OS that could
be swapped. DOS 3.3, released in 1987, provided support for the new line of IBM computers, the
PS/2. Again, this release did not take advantage of the processor capabilities of the PS/2,
provided by the 80286 and the 32-bit 80386 chips. The resident portion at this stage had grown to
a minimum of 46 Kbytes, with more required if certain optional extensions were selected.
The introduction of the 80486 and then the Intel Pentium chip provided power and features that
could not be exploited by the simple-minded DOS. To compete with Macintosh, whose OS was
unsurpassed for ease of use beginning in the early 1980s, Microsoft began development of a
graphical user interface (GUI) that would be interposed between the user and DOS. By 1990,
Microsoft had a version of the GUI, known as Windows 3.0, which incorporated some of the user
friendly features of Macintosh. However, it was still hamstrung by the need to run on top of
DOS.
By 1993, Microsoft had developed a new OS called Windows NT. Windows NT exploits the
capabilities of contemporary microprocessors and provides multitasking in a single-user or
multiple-user environment. The first version of Windows NT (3.1) had the same GUI as
Windows 3.1, another Microsoft OS (the follow-on to Windows 3.0). However, NT 3.1 was a
new 32-bit OS with the ability to support older DOS and Windows applications as well as
provide OS/2 support.
After several versions of NT 3.x, Microsoft released NT 4.0. NT 4.0 has essentially the same
internal architecture as 3.x. The most notable external change is that NT 4.0 provides the same
user interface as Windows 95 (an enhanced upgrade to Windows 3.1). The major architectural
change is that several graphics components that ran in user mode as part of the Win32 subsystem
in 3.x have been moved into the Windows NT Executive, which runs in kernel mode.
In 2000, Microsoft introduced the next major upgrade: Windows 2000 which had the underlying
Executive and Kernel architecture fundamentally the same as in NT 4.0, but new features have
been added. The emphasis in Windows 2000 is the addition of services and functions to support
distributed processing. The central element of Windows 2000’s new features is Active Directory,
which is a distributed directory service able to map names of arbitrary objects to any kind of
information about those objects. Windows 2000 also added the plug-and-play and power-
management facilities that were already in Windows 98, the successor to Windows 95.
One final general point to make about Windows 2000 is the distinction between Windows 2000
Server and Windows 2000 desktop. In essence, the kernel and executive architecture and services remain the same, but the Server edition includes additional services required for use as a network server. In 2001, Windows XP was released. Both home PC and business workstation versions of XP were offered.
In 2003, Microsoft introduced a new server version, known as Windows Server 2003, supporting
both 32-bit and 64-bit processors. The 64-bit version of Server 2003 was designed specifically
for the 64-bit Intel Itanium hardware. With the first service pack update for Server 2003,
Microsoft introduced support for the AMD64 processor architecture for both desktops and
servers. In 2007, a new desktop version of Windows was released, known as Windows Vista.
Vista supports both the Intel x86 and AMD x64 architectures. The main features of the release
were changes to the GUI and many security improvements. The corresponding server release is
Windows Server 2008.
Table 12.1. A History of Windows starting from DOS up to Windows 8
Year Event
1983 Bill Gates announces Microsoft Windows on November 10, 1983.
1985 Microsoft Windows 1.0 is introduced on November 20, 1985 and is initially sold for $100.00.
1987 Microsoft Windows 2.0 is released December 9, 1987 and is initially sold for $100.00.
1987 Microsoft Windows/386 or Windows 386 is introduced December 9, 1987 and is initially sold for $100.00.
1988 Microsoft Windows/286 or Windows 286 is introduced June 1988 and is initially sold for $100.00.
1990 Microsoft Windows 3.0 is released May 22, 1990. The full version is priced at $149.95 and the upgrade version at $79.95.
1991 Following its decision not to develop operating systems cooperatively with IBM, Microsoft changes the name of OS/2 to Windows NT.
1991 Microsoft Windows 3.0 or Windows 3.0a with multimedia is released October 1991.
1992 Microsoft Windows 3.1 is released April 1992 and sells more than 1 million copies within the first two months of its release.
1992 Microsoft Windows for Workgroups 3.1 is released October 1992.
1993 Microsoft Windows NT 3.1 is released July 27, 1993.
1993 Microsoft Windows 3.11, an update to Windows 3.1, is released December 31, 1993.
1993 The number of licensed users of Microsoft Windows now totals more than 25 million.
1994 Microsoft Windows for Workgroups 3.11 is released February 1994.
1994 Microsoft Windows NT 3.5 is released September 21, 1994.
1995 Microsoft Windows NT 3.51 is released May 30, 1995.
1995 Microsoft Windows 95 is released August 24, 1995 and sells more than 1 million copies within 4 days.
1996 Microsoft Windows 95 Service Pack 1 (4.00.950A) is released February 14, 1996.
1996 Microsoft Windows NT 4.0 is released July 29, 1996.
1996 Microsoft Windows 95 (4.00.950B), aka OSR2, with FAT32 and MMX support is released August 24, 1996.
1996 Microsoft Windows CE 1.0 is released November 1996.
1997 Microsoft Windows CE 2.0 is released November 1997.
1997 Microsoft Windows 95 (4.00.950C), aka OSR2.5, is released November 26, 1997.
1998 Microsoft Windows 98 is released June 1998.
1998 Microsoft Windows CE 2.1 is released July 1998.
1998 In October 1998, Microsoft announces that future releases of Windows NT will no longer carry the initials NT and that the next edition will be Windows 2000.
1999 Microsoft Windows 98 SE (Second Edition) is released May 5, 1999.
1999 Microsoft Windows CE 3.0 is released in 1999.
2000 On January 4 at CES, Bill Gates announces that the new version of Windows CE will be called Pocket PC.
2000 Microsoft Windows 2000 is released February 17, 2000.
2000 Microsoft Windows ME (Millennium) is released June 19, 2000.
2001 Microsoft Windows XP is released October 25, 2001.
2001 Microsoft Windows XP 64-Bit Edition (Version 2002) for Itanium systems is released March 28, 2003.
2003 Microsoft Windows Server 2003 is released March 28, 2003.
2003 Microsoft Windows XP 64-Bit Edition (Version 2003) for Itanium 2 systems is released on March 28, 2003.
2003 Microsoft Windows XP Media Center Edition 2003 is released on December 18, 2003.
2004 Microsoft Windows XP Media Center Edition 2005 is released on October 12, 2004.
2005 Microsoft Windows XP Professional x64 Edition is released on April 24, 2005.
2005 Microsoft announces on July 23, 2005 that its next operating system, codenamed "Longhorn", will be named Windows Vista.
2006 Microsoft releases Windows Vista to corporations on November 30, 2006.
2007 Microsoft releases Windows Vista and Office 2007 to the general public on January 30, 2007.
2008 Microsoft releases Windows Server 2008 to the public on February 27, 2008.
2009 Microsoft releases Windows 7 on October 22, 2009.
2012 Microsoft releases Windows 8 on October 26, 2012.
One of the most significant features of Windows operating systems is that they are multitasking operating systems. Two main developments have triggered the need for multitasking on personal computers, workstations, and servers. First, with the increased speed and memory capacity of microprocessors, together with support for virtual memory, applications have become more complex and interrelated. For example, a user may wish to employ a word processor, a drawing program, and a spreadsheet application simultaneously to produce a document. Without multitasking, if a user wishes to create a drawing and paste it into a word processing document, the user must first open the drawing program, create and save the drawing, close the drawing program, open the word processing program, and insert the drawing.
If any changes are desired, the user must close the word processing program, open the drawing
program, edit the graphic image, save it, close the drawing program, open the word processing
program, and insert the updated image. This becomes tedious very quickly. As the services and
capabilities available to users become more powerful and varied, the single-task environment
becomes more clumsy and user unfriendly. In a multitasking environment, the user opens each
application as needed, and leaves it open. Information can be moved around among a number of
applications easily. Each application has one or more open windows, and a graphical interface
with a pointing device such as a mouse allows the user to navigate quickly in this environment.
A second motivation for multitasking is the growth of client/server computing. With client/server
computing, a personal computer or workstation (client) and a host system (server) are used jointly
to accomplish a particular application. The two are linked, and each is assigned that part of the
job that suits its capabilities. Client/server can be achieved in a local area network of personal
computers and servers or by means of a link between a user system and a large host such as a
mainframe. An application may involve one or more personal computers and one or more server
devices. To provide the required responsiveness, the OS needs to support high-speed networking
interfaces and the associated communications protocols and data transfer architectures while at
the same time supporting ongoing user interaction.
The foregoing remarks apply to the desktop versions of Windows. The Server versions are also
multitasking but may support multiple users. They support multiple local server connections as
well as providing shared services used by multiple users on the network. As an Internet server,
Windows may support thousands of simultaneous Web connections.
12.2.2 Architecture
Figure 12.1 illustrates the overall structure of Windows 2000; later releases of Windows,
including Vista, have essentially the same structure at this level of detail. Its modular structure
gives Windows considerable flexibility. It is designed to execute on a variety of hardware
platforms and supports applications written for a variety of other operating systems. As of this
writing, desktop Windows is only implemented on the Intel x86 and AMD64 hardware platforms.
Windows Server also supports the Intel IA64 (Itanium).
Figure 12.1: Windows architecture. (Lsass = local security authentication server; POSIX = portable operating system interface; GDI = graphics device interface; DLL = dynamic link libraries. The colored area indicates the Executive.)
As with virtually all operating systems, Windows separates application oriented software from
the core OS software. The latter, which includes the Executive, the Kernel, device drivers, and
the hardware abstraction layer, runs in kernel mode. Kernel mode software has access to system
data and to the hardware. The remaining software, running in user mode, has limited access to
system data.
Each system function is managed by a single component of the OS, and the rest of the system accesses that function through the responsible component using standard interfaces. Key system data can only be accessed through the appropriate function. In principle, any module can be removed, upgraded, or replaced without rewriting the entire system or its standard application program interfaces (APIs).
The Windows Executive includes components for specific system functions and provides an
API for user-mode software. Following is a brief description of each of the Executive
modules:
• I/O manager: Provides a framework through which I/O devices are accessible to
applications, and is responsible for dispatching to the appropriate device drivers for
further processing. The I/O manager implements all the Windows I/O APIs and enforces
security and naming for devices, network protocols, and file systems (using the object
manager). Windows I/O is discussed later in this unit.
• Cache manager: Improves the performance of file-based I/O by causing recently
referenced file data to reside in main memory for quick access, and by deferring disk
writes by holding the updates in memory for a short time before sending them to the disk.
• Object manager: Creates, manages, and deletes Windows Executive objects and abstract
data types that are used to represent resources such as processes, threads, and
synchronization objects. It enforces uniform rules for retaining, naming, and setting the
security of objects. The object manager also creates object handles, which consist of
access control information and a pointer to the object. Windows objects are discussed
later in this section.
• Plug-and-play manager: Determines which drivers are required to support a particular
device and loads those drivers.
• Power manager: Coordinates power management among various devices and can be
configured to reduce power consumption by shutting down idle devices, putting the
processor to sleep, and even writing all of memory to disk and shutting off power to the
entire system.
• Security reference monitor: Enforces access-validation and audit-generation rules. The
Windows object-oriented model allows for a consistent and uniform view of security,
right down to the fundamental entities that make up the Executive. Thus, Windows uses
the same routines for access validation and for audit checks for all protected objects,
including files, processes, address spaces, and I/O devices. Windows security is discussed later in this unit.
• Virtual memory manager: Manages virtual addresses, physical memory, and the paging
files on disk. Controls the memory management hardware and data structures which map
virtual addresses in the process’s address space to physical pages in the computer’s
memory. Windows virtual memory management is described later in this unit.
• Process/thread manager: Creates, manages, and deletes process and thread objects.
Windows process and thread management are described later in this unit.
• Configuration manager: Responsible for implementing and managing the system
registry, which is the repository for both system wide and per-user settings of various
parameters.
• Local procedure call (LPC) facility: Implements an efficient cross-process procedure
call mechanism for communication between local processes implementing services and
subsystems. Similar to the remote procedure call (RPC) facility used for distributed
processing.
User-Mode Processes
Four basic types of user-mode processes are supported by Windows:
• Special system processes: User mode services needed to manage the system, such as the
session manager, the authentication subsystem, the service manager, and the logon
process
• Service processes: The printer spooler, the event logger, user mode components that
cooperate with device drivers, various network services, and many, many others. Services
are used by both Microsoft and external software developers to extend system
functionality as they are the only way to run background user mode activity on a
Windows system.
• Environment subsystems: Provide different OS personalities (environments). The
supported subsystems are Win32/WinFX and POSIX. Each environment subsystem
includes a subsystem process shared among all applications using the subsystem and
dynamic link libraries (DLLs) that convert the user application calls to LPC calls on the
subsystem process, and/or native Windows calls.
• User applications: Executables (EXEs) and DLLs that provide the functionality users run
to make use of the system. EXEs and DLLs are generally targeted at a specific environment subsystem, although some of the programs that are provided as part of the
OS use the native system interfaces (NTAPI). There is also support for running 16-bit
programs written for Windows 3.1 or MS-DOS.
The native NT API is a set of kernel-based services which provide the core abstractions used by
the system, such as processes, threads, virtual memory, I/O, and communication. Windows
provides a far richer set of services by using the client/server model to implement functionality in
user-mode processes. Both the environment subsystems and the Windows user-mode services are
implemented as processes that communicate with clients via RPC. Each server process waits for a
request from a client for one of its services (for example, memory services, process creation
services, or networking services).A client, which can be an application program or another server
program, requests a service by sending a message. The message is routed through the Executive
to the appropriate server. The server performs the requested operation and returns the results or
status information by means of another message, which is routed through the Executive back to
the client.
This client/server architecture has several advantages, including the following:
• It provides a suitable base for distributed computing. Typically, distributed computing
makes use of a client/server model, with remote procedure calls implemented using
distributed client and server modules and the exchange of messages between clients and
servers. With Windows, a local server can pass a message on to a remote server for
processing on behalf of local client applications.
Clients need not know whether a request is serviced locally or remotely. Indeed, whether a
request is serviced locally or remotely can change dynamically based on current load conditions
and on dynamic configuration changes.
Windows draws on familiar object-oriented design concepts. For example:
• Polymorphism: Internally, Windows uses a common set of API functions to manipulate objects of any type; this is a feature of polymorphism as the term is used in object-oriented design. However, Windows is not completely polymorphic, because there are many APIs that are specific to particular object types.
The reader unfamiliar with object-oriented concepts should consult an introductory reference on object-oriented design. Not all entities in Windows are objects. Objects are used in cases where data are intended
for user mode access or when data access is shared or restricted. Among the entities represented
by objects are files, processes, threads, semaphores, timers, and windows. Windows creates and
manages all types of objects in a uniform way, via the object manager. The object manager is
responsible for creating and destroying objects on behalf of applications and for granting access
to an object’s services and data.
Each object within the Executive, sometimes referred to as a kernel object (to distinguish from
user-level objects not of concern to the Executive), exists as a memory block allocated by the
kernel and is directly accessible only by kernel mode components. Some elements of the data
structure (e.g., object name, security parameters, usage count) are common to all object types,
while other elements are specific to a particular object type (e.g., a thread object’s priority).
Because these object data structures are in the part of each process’s address space accessible
only by the kernel, it is impossible for an application to reference these data structures and read
or write them directly. Instead, applications manipulate objects indirectly through the set of
object manipulation functions supported by the Executive. When an object is created, the
application that requested the creation receives back a handle for the object. In essence, a handle is an index into an Executive table containing a pointer to the referenced object. This handle can
then be used by any thread within the same process to invoke Win32 functions that work with
objects, or can be duplicated into other processes.
Objects may have security information associated with them, in the form of a Security Descriptor
(SD). This security information can be used to restrict access to the object based on contents of a
token object which describes a particular user. For example, a process may create a named
semaphore object with the intent that only certain users should be able to open and use that
semaphore. The SD for the semaphore object can list those users that are allowed (or denied)
access to the semaphore object along with the sort of access permitted (read, write, change, etc.).
In Windows, objects may be either named or unnamed. When a process creates an unnamed
object, the object manager returns a handle to that object, and the handle is the only way to refer
to it. Named objects are also given a name that other processes can use to obtain a handle to the
object. For example, if process A wishes to synchronize with process B, it could create a named
event object and pass the name of the event to B. Process B could then open and use that event
object. However, if A simply wished to use the event to synchronize two threads within itself, it
would create an unnamed event object, because there is no need for other processes to be able to
use that event.
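As a simple illustration of named objects, the following C sketch (not from the original text; the name "Local\\ExampleEvent" is invented) shows how one process can create and signal a named event that another process opens by name:

#include <windows.h>
#include <stdio.h>

/* Process A: create a named event and signal it when ready. */
int main(void)
{
    /* Auto-reset (FALSE), initially non-signalled (FALSE) event. */
    HANDLE hEvent = CreateEventA(NULL, FALSE, FALSE, "Local\\ExampleEvent");
    if (hEvent == NULL) {
        fprintf(stderr, "CreateEvent failed: %lu\n", GetLastError());
        return 1;
    }

    /* ... perform some work that process B is waiting for ... */
    SetEvent(hEvent);          /* move the event to the signalled state */
    CloseHandle(hEvent);
    return 0;
}

/* Process B would obtain a handle to the same object by name:
 *     HANDLE h = OpenEventA(SYNCHRONIZE, FALSE, "Local\\ExampleEvent");
 *     WaitForSingleObject(h, INFINITE);   // blocks until A calls SetEvent
 */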
There are two categories of objects used by Windows for synchronizing the use of the processor:
• Dispatcher objects: The subset of Executive objects which threads can wait on to control
the dispatching and synchronization of thread-based system operations.
• Control objects: Used by the Kernel component to manage the operation of the processor
in areas not managed by normal thread scheduling. The table below lists the Kernel control objects.
• Asynchronous Procedure Call: Used to break into the execution of a specified thread and to cause a procedure to be called in a specified processor mode.
• Deferred Procedure Call: Used to postpone interrupt processing to avoid delaying hardware interrupts. Also used to implement timers and inter-processor communication.
• Interrupt: Used to connect an interrupt source to an interrupt service routine by means of an entry in an Interrupt Dispatch Table (IDT). Each processor has an IDT that is used to dispatch interrupts that occur on that processor.
• Process: Represents the virtual address space and control information necessary for the execution of a set of thread objects. A process contains a pointer to an address map, a list of ready threads containing thread objects, a list of threads belonging to the process, the total accumulated time for all threads executing within the process, and a base priority.
• Thread: Represents thread objects, including scheduling priority and quantum, and which processors the thread may run on.
• Profile: Used to measure the distribution of run time within a block of code. Both user and system code can be profiled.
The native process structures and services provided by the Windows Kernel are relatively simple and general purpose, allowing each OS subsystem to emulate a particular process structure and functionality. The important characteristics of Windows processes are described below.
Figure 12.2 illustrates the way in which a process relates to the resources it controls or uses.
Each process is assigned a security access token, called the primary token of the process. When a
user first logs on, Windows creates an access token that includes the security ID for the user.
Every process that is created by or runs on behalf of this user has a copy of this access token.
Windows uses the token to validate the user’s ability to access secured objects or to perform
restricted functions on the system and on secured objects. The access token controls whether the
process can change its own attributes. In this case, the process does not have a handle opened to
its access token. If the process attempts to open such a handle, the security system determines
whether this is permitted and therefore whether the process may change its own attributes.
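As a small, hedged illustration of the primary token, the following C sketch uses the documented Win32 calls OpenProcessToken and GetTokenInformation to ask a question about the calling process's own token (link with Advapi32):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE hToken;
    /* Open the primary token of the calling process for read-only queries. */
    if (!OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &hToken)) {
        fprintf(stderr, "OpenProcessToken failed: %lu\n", GetLastError());
        return 1;
    }

    TOKEN_ELEVATION elevation;
    DWORD returned = 0;
    /* Ask the security system whether this token is elevated (Vista and later). */
    if (GetTokenInformation(hToken, TokenElevation, &elevation,
                            sizeof(elevation), &returned)) {
        printf("Process token is %selevated.\n",
               elevation.TokenIsElevated ? "" : "not ");
    }
    CloseHandle(hToken);
    return 0;
}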
Also related to the process is a series of blocks that define the virtual address space currently
assigned to this process. The process cannot directly modify these structures but must rely on the
virtual memory manager, which provides a memory allocation service for the process.
Finally, the process includes an object table, with handles to other objects known to this process.
One handle exists for each thread contained in this process. Figure 12.2 shows a single thread. In
addition, the process has access to a file object and to a section object that defines a section of
shared memory.
Each Windows process is represented by an object whose general structure is shown in Figure
12.3a. Each process is defined by a number of attributes and encapsulates a number of actions, or
services, that it may perform. A process will perform a service when called upon through a set of
published interface methods. When Windows creates a new process, it uses the object class, or
type, defined for the Windows process as a template to generate a new object instance. At the
time of creation, attribute values are assigned. Table 12.2 gives a brief definition of each of the
object attributes for a process object.
A Windows process must contain at least one thread to execute. That thread may then create
other threads. In a multiprocessor system, multiple threads from the same process may execute in
parallel. Figure 12.3b depicts the object structure for a thread object, and Table 12.3 defines the
thread object attributes. Note that some of the attributes of a thread resemble those of a process.
In those cases, the thread attribute value is derived from the process attribute value. For example,
the thread processor affinity is the set of processors in a multiprocessor system that may execute
this thread; this set is equal to or a subset of the process processor affinity.
Note that one of the attributes of a thread object is context. This information enables threads to be
suspended and resumed. Furthermore, it is possible to alter the behavior of a thread by altering its
context when it is suspended.
Figure 12.3: Windows Process and Thread Objects
12.3.2. Multithreading
Windows supports concurrency among processes because threads in different processes may
execute concurrently. Moreover, multiple threads within the same process may be allocated to
separate processors and execute simultaneously. A multithreaded process achieves concurrency
without the overhead of using multiple processes. Threads within the same process can exchange
information through their common address space and have access to the shared resources of the
process.
Threads in different processes can exchange information through shared memory that has been
set up between the two processes. An object-oriented multithreaded process is an efficient means
of implementing a server application. For example, one server process can service a number of
clients.
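A minimal C sketch of creating a second thread in a Win32 process (the worker routine and its argument are invented for illustration):

#include <windows.h>
#include <stdio.h>

/* Hypothetical worker routine run by the new thread. */
static DWORD WINAPI Worker(LPVOID arg)
{
    printf("worker received %d\n", *(int *)arg);
    return 0;
}

int main(void)
{
    int value = 42;
    /* Create a second thread in this process; it shares the address space. */
    HANDLE hThread = CreateThread(NULL, 0, Worker, &value, 0, NULL);
    if (hThread == NULL)
        return 1;

    WaitForSingleObject(hThread, INFINITE);  /* wait for the thread to finish */
    CloseHandle(hThread);
    return 0;
}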
Thread States
An existing Windows thread is in one of six states (Figure 12.4):
• Ready: May be scheduled for execution. The Kernel dispatcher keeps track of all ready
threads and schedules them in priority order.
• Standby: A standby thread has been selected to run next on a particular processor. The
thread waits in this state until that processor is made available. If the standby thread’s
priority is high enough, the running thread on that processor may be preempted in favor of
the standby thread. Otherwise, the standby thread waits until the running thread blocks or
exhausts its time slice.
• Running: Once the Kernel dispatcher performs a thread switch, the standby thread enters the Running state and begins execution; it continues executing until it is preempted by a higher-priority thread, exhausts its time slice, blocks, or terminates. In the first two cases, it goes back to the Ready state.
• Waiting: A thread enters the Waiting state when (1) it is blocked on an event (e.g., I/O),
(2) it voluntarily waits for synchronization purposes, or (3) an environment subsystem
directs the thread to suspend itself. When the waiting condition is satisfied, the thread
moves to the Ready state if all of its resources are available.
• Transition: A thread enters this state after waiting if it is ready to run but the resources
are not available. For example, the thread’s stack may be paged out of memory. When the
resources are available, the thread goes to the Ready state.
• Terminated: A thread can be terminated by itself, by another thread, or when its parent
process terminates. Once housekeeping chores are completed, the thread is removed from
the system, or it may be retained by the executive for future reinitialization.
Table 12.3. Windows Thread Object Attributes
12.3.4 Support for OS Subsystems
The general-purpose process and thread facility must support the particular process and thread
structures of the various OS clients. It is the responsibility of each OS subsystem to exploit the
Windows process and thread features to emulate the process and thread facilities of its
corresponding OS. This area of process/thread management is complicated, and we give only a
brief overview here.
Process creation begins with a request for a new process from an application. The application
issues a create-process request to the corresponding protected subsystem, which passes the
request to the Windows executive. The executive creates a process object and returns a handle to
that object to the subsystem. When Windows creates a process, it does not automatically create a
thread. In the case of Win32, a new process is always created with a thread. Therefore, for these
operating systems, the subsystem calls the Windows process manager again to create a thread for
the new process, receiving a thread handle back from Windows. The appropriate thread and
process information are then returned to the application. In the case of 16-bit Windows and
POSIX, threads are not supported. Therefore, for these operating systems, the subsystem obtains
a thread for the new process from Windows so that the process may be activated but returns only
process information to the application.
The fact that the application process is implemented using a thread is not visible to the
application. When a new process is created in Win32, the new process inherits many of its
attributes from the creating process. However, in the Windows environment, this process creation
is done indirectly. An application client process issues its process creation request to the OS
subsystem; then a process in the subsystem in turn issues a process request to the Windows
executive. Because the desired effect is that the new process inherits characteristics of the client
process and not of the server process, Windows enables the subsystem to specify the parent of the
new process. The new process then inherits the parent’s access token, quota limits, base priority,
and default processor affinity.
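From a Win32 application's point of view this machinery is hidden behind a single call. A hedged C sketch follows; the child command line "notepad.exe" is only an example:

#include <windows.h>

int main(void)
{
    STARTUPINFOA si = { sizeof(si) };   /* describes the child's initial state */
    PROCESS_INFORMATION pi;             /* receives process and thread handles */
    char cmd[] = "notepad.exe";         /* example child; any command line works */

    /* The new process inherits this process's access token, quota limits, and
       base priority unless overridden by the creation flags. */
    if (CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi)) {
        WaitForSingleObject(pi.hProcess, INFINITE);  /* wait for the child to exit */
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
    }
    return 0;
}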
Windows supports an SMP hardware configuration. The threads of any process, including those
of the executive, can run on any processor. In the absence of affinity restrictions, explained in the
next paragraph, the microkernel assigns a ready thread to the next available processor. This
assures that no processor is idle or is executing a lower-priority thread when a higher-priority
thread is ready. Multiple threads from the same process can be executing simultaneously on
multiple processors. As a default, the microkernel uses the policy of soft affinity in assigning
threads to processors: The dispatcher tries to assign a ready thread to the same processor it last
ran on. This helps reuse data still in that processor’s memory caches from the previous execution
of the thread. It is possible for an application to restrict its thread execution to certain processors
(hard affinity).
12.4. Windows Concurrency Mechanisms
Each dispatcher object instance can be in either a signalled or unsignalled state. A thread can be
blocked on an object in an unsignalled state; the thread is released when the object enters the
signalled state. The mechanism is straightforward: a thread issues a wait request to the Windows
Executive, using the handle of the synchronization object. When an object enters the signalled
state, the Windows Executive releases one or all of the thread objects that are waiting on that
dispatcher object.
The event object is useful in sending a signal to a thread indicating that a particular event has
occurred. For example, in overlapped input and output, the system sets a specified event object to
the signalled state when the overlapped operation has been completed. The mutex object is used
to enforce mutually exclusive access to a resource, allowing only one thread object at a time to
gain access. It therefore functions as a binary semaphore. When the mutex object enters the
signalled state, only one of the threads waiting on the mutex is released. Mutexes can be used to
synchronize threads running in different processes. Like mutexes, semaphore objects may be
shared by threads in multiple processes. The Windows semaphore is a counting semaphore. In
essence, the waitable timer object signals at a certain time and/or at regular intervals.
Table 12.4. Windows Synchronization Objects
Note: Shaded rows correspond to objects that exist for the sole purpose of synchronization
Critical sections provide a mutual exclusion mechanism that can be used only by the threads of a single process. The process is responsible for allocating the memory used by a critical section. Typically, this is done by simply declaring a variable of type CRITICAL_SECTION. Before the threads of the process can use it, the critical section must be initialized using the InitializeCriticalSection or InitializeCriticalSectionAndSpinCount function.
A thread uses the EnterCriticalSection or TryEnterCriticalSection function to request ownership
of a critical section. It uses the LeaveCriticalSection function to release ownership of a critical
section. If the critical section is currently owned by another thread, EnterCriticalSection waits
indefinitely for ownership.
In contrast, when a mutex object is used for mutual exclusion, the wait functions accept a
specified time-out interval. The TryEnterCriticalSection function attempts to enter a critical
section without blocking the calling thread. Critical sections use a sophisticated algorithm when
trying to acquire the mutex. If the system is a multiprocessor, the code will attempt to acquire a
spin-lock. This works well in situations where the critical section is acquired for only a short
time. Effectively the spinlock optimizes for the case where the thread that currently owns the
critical section is executing on another processor. If the spinlock cannot be acquired within a
reasonable number of iterations, a dispatcher object is used to block the thread so that the Kernel
can dispatch another thread onto the processor.
The dispatcher object is only allocated as a last resort. Most critical sections are needed for
correctness, but in practice are rarely contended. By lazily allocating the dispatcher object the
system saves significant amounts of kernel virtual memory.
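A minimal C sketch of the critical-section calls just described, protecting an invented shared counter:

#include <windows.h>

static CRITICAL_SECTION cs;     /* memory for the critical section, owned by the process */
static long shared_counter = 0; /* example data protected by the critical section */

static DWORD WINAPI Worker(LPVOID arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        EnterCriticalSection(&cs);   /* spins briefly, blocks only if contended */
        shared_counter++;
        LeaveCriticalSection(&cs);
    }
    return 0;
}

int main(void)
{
    InitializeCriticalSection(&cs);          /* must be done before first use */
    HANDLE t = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    Worker(NULL);                            /* main thread does the same work */
    WaitForSingleObject(t, INFINITE);
    CloseHandle(t);
    DeleteCriticalSection(&cs);
    return 0;
}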
Windows Vista added a user-mode slim reader-writer (SRW) lock. Like critical sections, the reader-writer lock enters the kernel to block only after attempting to use a spin-lock. It is slim in the sense that it normally requires only the allocation of a single pointer-sized piece of memory.
To use an SRW lock, a process declares a variable of type SRWLOCK and calls InitializeSRWLock to initialize it. Threads call AcquireSRWLockExclusive or AcquireSRWLockShared to acquire the lock and ReleaseSRWLockExclusive or ReleaseSRWLockShared to release it.
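A corresponding sketch for the slim reader-writer lock; the protected table is hypothetical, and the static SRWLOCK_INIT initializer is used in place of an explicit InitializeSRWLock call:

#include <windows.h>

static SRWLOCK lock = SRWLOCK_INIT;   /* static initializer; InitializeSRWLock also works */
static int table[16];                 /* example shared data */

int read_entry(int i)
{
    AcquireSRWLockShared(&lock);      /* many readers may hold the lock concurrently */
    int v = table[i];
    ReleaseSRWLockShared(&lock);
    return v;
}

void write_entry(int i, int v)
{
    AcquireSRWLockExclusive(&lock);   /* writers get exclusive access */
    table[i] = v;
    ReleaseSRWLockExclusive(&lock);
}

int main(void)
{
    write_entry(3, 7);
    return read_entry(3) == 7 ? 0 : 1;
}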
Windows Vista also added condition variables. The process must declare a
CONDITION_VARIABLE and initialize it in some thread by calling
InitializeConditionVariable. Condition variables can be used with either critical sections or SRW
locks, so there are two methods, SleepConditionVariableCS and SleepConditionVariableSRW,
which sleep on the specified condition and release the specified lock as an atomic operation.
There are two wake methods, WakeConditionVariable and WakeAllConditionVariable, which
wake one or all of the sleeping threads, respectively.
Condition variables are used in the standard wait-loop pattern: a thread acquires the lock, tests its condition in a loop, and sleeps on the condition variable (which atomically releases the lock) whenever the condition does not hold.
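A minimal C sketch of this pattern, assuming the Win32 names described above and an invented one-slot work queue:

#include <windows.h>

static CRITICAL_SECTION   cs;
static CONDITION_VARIABLE cv;
static int item_ready = 0;            /* hypothetical one-slot "queue" */

void consumer(void)
{
    EnterCriticalSection(&cs);
    while (!item_ready)                               /* re-check after every wakeup */
        SleepConditionVariableCS(&cv, &cs, INFINITE); /* releases cs atomically */
    item_ready = 0;                                   /* consume the item */
    LeaveCriticalSection(&cs);
}

void producer(void)
{
    EnterCriticalSection(&cs);
    item_ready = 1;                                   /* produce the item */
    LeaveCriticalSection(&cs);
    WakeConditionVariable(&cv);                       /* wake one sleeping consumer */
}

int main(void)
{
    InitializeCriticalSection(&cs);
    InitializeConditionVariable(&cv);
    producer();      /* normally called from another thread */
    consumer();      /* returns immediately because an item is ready */
    return 0;
}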
12.5.1 Windows Virtual Address Map
On 32-bit platforms, each Windows user process sees a separate 32-bit address space, allowing 4
Gbytes of virtual memory per process. By default, a portion of this memory is reserved for the
operating system, so each user process actually has 2 Gbytes of available virtual address space, and all processes share the same 2 Gbytes of system space. There is an option that allows user space to be increased to 3 Gbytes, leaving 1 Gbyte for system space. This feature is intended to support large, memory-intensive applications on servers with multiple gigabytes of RAM; the use of the larger address space can dramatically improve performance for applications such as decision support or data mining.
Figure 12.5 shows the default virtual address space seen by a normal 32-bit user process. It
consists of four regions:
• 0x00000000 to 0x0000FFFF: Set aside to help programmers catch NULL pointer
assignments.
• 0x00010000 to 0x7FFEFFFF: Available user address space. This space is divided into
pages that may be loaded into main memory.
• 0x7FFF0000 to 0x7FFFFFFF: A guard page inaccessible to the user. This page makes it
easier for the operating system to check on out-of-bounds pointer references.
• 0x80000000 to 0xFFFFFFFF: System address space. This 2-Gbyte region is used for
the Windows Executive, Kernel, and device drivers.
On 64-bit platforms, 8TB of user address space is available in Windows Vista.
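These boundaries can be observed from a program. A short C sketch using the documented GetSystemInfo call (the printed values are typical, not guaranteed):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    /* Lowest and highest addresses an application may use, plus the page size. */
    printf("user space: %p .. %p\n",
           si.lpMinimumApplicationAddress, si.lpMaximumApplicationAddress);
    printf("page size : %lu bytes\n", si.dwPageSize);
    /* On a default 32-bit system this typically prints roughly
       0x00010000 .. 0x7FFEFFFF, matching the regions listed above. */
    return 0;
}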
The resident set management scheme used by Windows is variable allocation, local scope. When
a process is first activated, it is assigned data structures to manage its working set. As the pages
needed by the process are brought into physical memory the memory manager uses the data
structures to keep track of the pages assigned to the process. Working sets of active processes are
adjusted using the following general conventions:
• When main memory is plentiful, the virtual memory manager allows the resident sets of
active processes to grow. To do this, when a page fault occurs, a new physical page is
added to the process but no older page is swapped out, resulting in an increase of the
resident set of that process by one page.
• When memory becomes scarce, the virtual memory manager recovers memory for the
system by removing less recently used pages out of the working sets of active processes,
reducing the size of those resident sets.
12.6.1 Process and Thread Priorities
Priorities in Windows are organized into two bands, or classes: real time and variable. Each of
these bands consists of 16 priority levels. Threads requiring immediate attention are in the real-
time class, which includes functions such as communications and real-time tasks. Overall,
because Windows makes use of a priority-driven preemptive scheduler, threads with real-time
priorities have precedence over other threads. On a uniprocessor, when a thread becomes ready
whose priority is higher than the currently executing thread, the lower-priority thread is
preempted and the processor given to the higher-priority thread.
Priorities are handled somewhat differently in the two classes (Figure 12.6). In the real-time
priority class, all threads have a fixed priority that never changes. All of the active threads at a
given priority level are in a round-robin queue. In the variable priority class, a thread’s priority
begins at some initial assigned value and then may be temporarily boosted (raised) during the
thread’s lifetime. There is a FIFO queue at each priority level; a thread will change queues among
the variable priority classes as its own priority changes. However, a thread at priority level 15 or
below is never boosted to level 16 or any other level in the real-time class.
Figure 12.6 Windows Thread Dispatching Priorities
The initial priority of a thread in the variable priority class is determined by two quantities:
process base priority and thread base priority. The process base priority is an attribute of the
process object, and can take on any value from 0 through 15. Each thread object associated with a
process object has a thread base priority attribute that indicates the thread’s base priority relative
to that of the process. The thread’s base priority can be equal to that of its process or within two
levels above or below that of the process. So, for example, if a process has a base priority of 4
and one of its threads has a base priority of -1, then the initial priority of that thread is 3.
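In Win32 terms, the process base priority roughly corresponds to the priority class and the thread base priority to the relative thread priority. A brief, hedged C sketch:

#include <windows.h>

int main(void)
{
    /* Set the process base priority (priority class) ... */
    SetPriorityClass(GetCurrentProcess(), NORMAL_PRIORITY_CLASS);

    /* ... and give the current thread a base priority one level above it.
       Relative values range from THREAD_PRIORITY_LOWEST (-2) to
       THREAD_PRIORITY_HIGHEST (+2), matching the "within two levels" rule above. */
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);

    /* The scheduler may still boost or decay the thread's current priority
       at run time, as described in the text. */
    return 0;
}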
Once a thread in the variable priority class has been activated, its actual priority, referred to as the
thread’s current priority, may fluctuate within given boundaries. The current priority may never
fall below the thread’s base priority and it may never exceed 15. Figure 12.7 gives an example.
The process object has a base priority attribute of 4. Each thread object associated with this
process object must have an initial priority of between 2 and 6. Suppose the base priority for the thread is 4. Then the current priority for that thread may fluctuate in the range from 4 through 15
depending on what boosts it has been given. If a thread is interrupted to wait on an I/O event, the
Windows Kernel boosts its priority. If a boosted thread is interrupted because it has used up its
current time quantum, the Kernel lowers its priority. Thus, processor-bound threads tend toward
lower priorities and I/O-bound threads tend toward higher priorities. In the case of I/O-bound
threads, the Kernel boosts the priority more for interactive waits (e.g., wait on keyboard or
display) than for other types of I/O (e.g., disk I/O). Thus, interactive threads tend to have the
highest priorities within the variable priority class.
Lower-priority threads may also have their priority boosted to 15 for a very short time if they are
being starved, solely to correct instances of priority inversion. The foregoing scheduling
discipline is affected by the processor affinity attribute of a thread. If a thread is ready to execute
but the only available processors are not in its processor affinity set, then that thread is forced to
wait, and the Kernel schedules the next available thread.
The Windows I/O manager works with several classes of drivers:
• File system drivers: The I/O manager treats a file system driver as just another device
driver and routes I/O requests for file system volumes to the appropriate software driver
for that volume. The file system, in turn, sends I/O requests to the software drivers that
manage the hardware device adapter.
• Network drivers: Windows includes integrated networking capabilities and support for
remote file systems. The facilities are implemented as software drivers rather than part of
the Windows Executive.
• Hardware device drivers: These software drivers access the hardware registers of the
peripheral devices using entry points in the kernel’s Hardware Abstraction Layer. A set of
these routines exists for every platform that Windows supports; because the routine names
are the same for all platforms, the source code of Windows device drivers is portable
across different processor types.
12.7.2. Asynchronous and Synchronous I/O
Windows offers two modes of I/O operation: asynchronous and synchronous. The asynchronous
mode is used whenever possible to optimize application performance. With asynchronous I/O, an
application initiates an I/O operation and then can continue processing while the I/O request is
fulfilled. With synchronous I/O, the application is blocked until the I/O operation completes.
Asynchronous I/O is more efficient, from the point of view of the calling thread, because it
allows the thread to continue execution while the I/O operation is queued by the I/O manager and
subsequently performed. However, the application that invoked the asynchronous I/O operation
needs some way to determine when the operation is complete. Windows provides five different
techniques for signaling
I/O completion:
• Signaling the file object: With this approach, the event associated with a file object is set
when an operation on that object is complete. The thread that invoked the I/O operation
can continue to execute until it reaches a point where it must stop until the I/O operation
is complete. At that point, the thread can wait until the operation is complete and then
continue. This technique is simple and easy to use but is not appropriate for handling
multiple I/O requests.
For example, if a thread needs to perform multiple simultaneous actions on a single file,
such as reading from one portion and writing to another portion of the file, with this
technique, the thread could not distinguish between the completion of the read and the
completion of the write. It would simply know that some requested I/O operation on this
file was complete.
• Signaling an event object: This technique allows multiple simultaneous I/O requests against a single device or file. The thread creates an event for each request. Later, the thread can wait on a single one of these requests or on an entire collection of requests. (A C sketch of this technique appears after this list.)
• Asynchronous procedure call: This technique makes use of a queue associated with a
thread, known as the asynchronous procedure call (APC) queue. In this case, the thread
makes I/O requests, specifying a user mode routine to call when the I/O completes. The
I/O manager places the results of each request in the calling thread’s APC queue. The
next time the thread blocks in the kernel, the APCs will be delivered; each causing the
thread to return to user mode and execute the specified routine.
• I/O completion ports: This technique is used on a Windows server to optimize the use of
threads. The application creates a pool of threads for handling the completion of I/O
requests. Each thread waits on the completion port, and the Kernel wakes threads to
handle each I/O completion. One of the advantages of this approach is that the application
can specify a limit for how many of these threads will run at a time.
• Polling: Asynchronous I/O requests write a status and transfer count into the process’
user virtual memory when the operation completes. A thread can just check these values
to see if the operation has completed.
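As a concrete illustration of the event-signaling technique, the following C sketch issues an overlapped read and waits on an event object; the file name "example.dat" is invented:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Open the file for asynchronous (overlapped) I/O. */
    HANDLE hFile = CreateFileA("example.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return 1;

    char buf[4096];
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);  /* manual-reset event */

    /* Start the read; ReadFile typically returns at once with ERROR_IO_PENDING. */
    if (!ReadFile(hFile, buf, sizeof(buf), NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING)
        return 1;

    /* ... the thread can do other work here while the I/O proceeds ... */

    WaitForSingleObject(ov.hEvent, INFINITE);        /* block until completion */
    DWORD bytes = 0;
    GetOverlappedResult(hFile, &ov, &bytes, FALSE);  /* collect the result */
    printf("read %lu bytes\n", bytes);

    CloseHandle(ov.hEvent);
    CloseHandle(hFile);
    return 0;
}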
Windows supports RAID configurations implemented either in hardware or in software. Software RAID refers to noncontiguous disk space combined into one or more logical partitions by the fault-tolerant software disk driver, FTDISK.
In hardware RAID, the controller interface handles the creation and regeneration of redundant
information. The software RAID, available on Windows Server, implements the RAID
functionality as part of the operating system and can be used with any set of multiple disks. The
software RAID facility implements RAID 1 and RAID 5. In the case of RAID 1 (disk mirroring),
the two disks containing the primary and mirrored partitions may be on the same disk controller
or different disk controllers. The latter configuration is referred to as disk duplexing.
Volume Shadow Copies
Shadow copies are an efficient way of making consistent snapshots of volumes so that they can
be backed up. They are also useful for archiving files on a per-volume basis. If a user deletes a
file, he or she can retrieve an earlier copy from any available shadow copy made by the system
administrator. Shadow copies are implemented by a software driver that makes copies of data on
the volume before it is overwritten.
Volume Encryption
Starting with Windows Vista, the operating system supports the encryption of entire volumes.
This is more secure than encrypting individual files, since everything written to the volume is protected rather than relying on each application to protect its own data. Up to three different methods of supplying the cryptographic key can be provided, allowing multiple interlocking layers of security.
The native Windows file system is NTFS. Key features of NTFS include the following:
• Large disks and large files: NTFS supports very large disks and very large files more
efficiently than most other file systems, including FAT.
• Multiple data streams: The actual contents of a file are treated as a stream of bytes. In
NTFS it is possible to define multiple data streams for a single file. An example of the
utility of this feature is that it allows Windows to be used by remote Macintosh systems to
store and retrieve files. On Macintosh, each file has two components: the file data and a
resource fork that contains information about the file. NTFS treats these two components
as two data streams. (A brief C sketch after this list shows how an alternate data stream can be written.)
• Journaling: NTFS keeps a log of all changes made to files on the volumes. Programs,
such as desktop search, can read the journal to identify what files have changed.
• Compression and Encryption: Entire directories and individual files can be
transparently compressed and/or encrypted.
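To make the idea of multiple data streams concrete, the following C sketch writes to a named alternate stream of an ordinary NTFS file; the names example.txt and notes are invented:

#include <windows.h>

int main(void)
{
    /* The "file:stream" syntax selects an alternate data stream on NTFS. */
    HANDLE h = CreateFileA("example.txt:notes", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    DWORD written = 0;
    WriteFile(h, "hidden metadata", 15, &written, NULL);
    CloseHandle(h);
    /* A directory listing still shows only example.txt; the "notes" stream
       travels with the file as a second, independent stream of bytes. */
    return 0;
}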
The cluster is the fundamental unit of allocation in NTFS, which does not recognize sectors. For
example, suppose each sector is 512 bytes and the system is configured with two sectors per
cluster (one cluster = 1K bytes). If a user creates a file of 1600 bytes, two clusters are allocated to
the file. Later, if the user updates the file to 3200 bytes, another two clusters are allocated. The
clusters allocated to a file need not be contiguous; it is permissible to fragment a file on the disk.
Currently, the maximum file size supported by NTFS is 2^32 clusters, which is equivalent to a maximum of 2^48 bytes. A cluster can contain at most 2^16 bytes.
NTFS Volume Layout
NTFS uses a remarkably simple but powerful approach to organizing information on a disk volume. Every element on a volume is a file, and every file consists of a collection of attributes. Even the data contents of a file are treated as an attribute. With this simple
structure, a few general-purpose functions suffice to organize and manage a file system. Figure
12.9 shows the layout of an NTFS volume, which consists of four regions.
The first few sectors on any volume are occupied by the partition boot sector (although it is
called a sector, it can be up to 16 sectors long), which contains information about the volume
layout and the file system structures as well as boot startup information and code. This is
followed by the master file table (MFT), which contains information about all of the files and
folders (directories) on this NTFS volume. In essence, the MFT is a list of all files and their
attributes on this NTFS volume, organized as a set of rows in a relational database structure.
Following the MFT is a region, typically about 1 Mbyte in length, containing system files.
Among the files in this region are the following:
• MFT2: A mirror of the first three rows of the MFT, used to guarantee access to the MFT
in the case of a single-sector failure
Master File Table
The heart of the Windows file system is the MFT. The MFT is organized as a
table of 1024-byte rows, called records. Each row describes a file on this volume, including the
MFT itself, which is treated as a file. If the contents of a file are small enough, then the entire file
is located in a row of the MFT. Otherwise, the row for that file contains partial information and
the remainder of the file spills over into other available clusters on the volume, with pointers to
those clusters in the MFT row of that file.
Each record in the MFT consists of a set of attributes that serve to define the file (or folder)
characteristics and the file contents. Table 12.6 lists the attributes that may be found in a row,
with the required attributes indicated by shading.
Recoverability
NTFS makes it possible to recover the file system to a consistent state following a system crash
or disk failure. The key elements that support recoverability are as follows (Figure 12.10):
• I/O manager: Includes the NTFS driver, which handles the basic open, close, read, write
functions of NTFS. In addition, the software RAID module FTDISK can be configured
for use.
• Log file service: Maintains a log of file system metadata changes on disk. The log file is
used to recover an NTFS-formatted volume in the case of a system failure (i.e., without
having to run the file system check utility).
• Cache manager: Responsible for caching file reads and writes to enhance performance.
The cache manager optimizes disk I/O.
• Virtual memory manager: The NTFS accesses cached files by mapping file references
to virtual memory references and reading and writing virtual memory.
Figure 12.10: Windows NTFS Components
It is important to note that the recovery procedures used by NTFS are designed to recover file
system metadata, not file contents. Thus, the user should never lose a volume or the directory/file
structure of an application because of a crash. However, user data are not guaranteed by the file
system. Providing full recoverability, including user data, would make for a much more elaborate
and resource consuming recovery facility.
The essence of the NTFS recovery capability is logging. Each operation that alters a file system is
treated as a transaction. Each suboperation of a transaction that alters important file system data
structures is recorded in a log file before being recorded on the disk volume. Using the log, a
partially completed transaction at the time of a crash can later be redone or undone when the
system recovers.
In general terms, these are the steps taken to ensure recoverability, as described in [RUSS05]:
1. NTFS first calls the log file system to record in the log file in the cache any transactions
that will modify the volume structure.
2. NTFS modifies the volume (in the cache).
3. The cache manager calls the log file system to prompt it to flush the log file to disk.
4. Once the log file updates are safely on disk, the cache manager flushes the volume
changes to disk.
ACTIVITY 12.1
12.9. Summary
Kernel mode in Windows Vista is structured in the HAL, the kernel and executive layers of NTOS, and a large number of device drivers implementing everything from device services to file systems and networking to graphics. The operating system also creates objects internally. The object manager maintains a name space into which objects can be inserted for subsequent lookup. The most important objects in Windows are processes, threads, and sections. I/O is performed by device drivers, which follow the Windows Driver Model. Each driver starts out by initializing a driver object that contains the addresses of the procedures that the system can call to manipulate devices. The NTFS file system is based on a master file table, which has one record per file or directory. All the metadata in an NTFS file system is itself part of an NTFS file. Each file has multiple attributes, which can either be in the MFT record or nonresident (stored in blocks outside the MFT). NTFS supports Unicode, compression, journaling, and encryption among many other features. Finally, Windows Vista has a sophisticated security system based on access control lists and integrity levels. Each process has an authentication token that tells the identity of the user and what special privileges the process has, if any.
REFERENCES