Windows Kernel Internals
Overview
9 October 2006
Singapore
Dave Probert, Ph.D.
Architect, Windows Kernel Group
Windows Core Operating Systems Division
Microsoft Corporation
v3 © Microsoft Corporation 2006 1
History of NT/OS2
• 1988: Bill Gates recruits VMS Architect Dave Cutler
• Business goals:
• an advanced commercial OS for desktops/servers
• compatible with OS/2
• Technical goals:
• scalable on symmetric multiprocessors
• secure, reliable, performant
• portable
NT Timeline first 17 years
2/1989 Coding Begins
7/1993 NT 3.1
9/1994 NT 3.5
5/1995 NT 3.51
7/1996 NT 4.0
12/1999 NT 5.0 Windows 2000
8/2001 NT 5.1 Windows XP
3/2003 NT 5.2 Server 2003
8/2004 NT 5.1 Windows XP SP2
4/2005 NT 5.2 Windows XP 64 Bit Edition (& WS03SP1)
2006 NT 6.0 Windows Vista (client)
Important NT kernel features
• Highly multi-threaded
• Completely asynchronous I/O model
• Thread-based scheduling
• Object-manager provides unified management of
• kernel data structures
• kernel references
• user references (handles)
• namespace
• synchronization objects
• resource charging
• cross-process sharing
• Centralized ACL-based security reference monitor
• Configuration store decoupled from file system
Important NT kernel features (cont)
• Extensible filter-based I/O model with driver layering,
standard device models, notifications, tracing, journaling,
namespace, services/subsystems
• Virtual address space managed separately from memory
objects
• Advanced VM features for databases (app management
of virtual addresses, physical memory, I/O, dirty bits, and
large pages)
• Plug-and-play, power-management
• System library mapped in every process provides trusted
entrypoints
Major Kernel Functions
• Manage naming & security OB, SE
• Manage address spaces PS, MM
• Manage physical memory MM, CACHE
• Manage CPU KE
• Provide I/O & net abstractions IO, drivers
• Implement cross-domain calls LPC
• Abstract low-level hardware HAL
• Internal support functions EX, RTL
• Internal configuration mgmt CONFIG
Major NT Kernel Components
OB – Object Manager
SE – Security Reference Monitor
PS – Process/Thread management
MM – Memory Manager
CACHE – Cache Manager
KE – Scheduler
IO – I/O manager, PnP, device power mgmt, GUI
Drivers – devices, file systems, volumes, network
LPC – Local Procedure Calls
HAL – Hardware Abstraction Layer
EX – Executive functions
RTL – Run-Time Library
CONFIG – Persistent configuration state (registry)
Major Kernel Services
Object Manager
Naming, referencing, synchronizing
Process management
Process/thread creation
Security reference monitor
Access checks, token management
Memory manager
Virtual address mgmt, physical memory mgmt, paging, Services
for sharing, copy-on-write, mapped files, GC support, large apps
Local Procedure Call (LPC)
Native transport for RPC and user-mode system services.
I/O manager (& plug-and-play & power)
Maps user requests into IRP requests, configures/manages I/O
devices, implements services for drivers
Cache manager
Provides file-based caching to buffer file system I/O
Scheduler (aka ‘kernel’)
Schedules thread execution on each processor
Windows Architecture
[Figure: layered block diagram of the system]
User-mode: Applications; Subsystem servers; DLLs; System Services;
Login/GINA; Critical services; Kernel32; User32 / GDI;
ntdll / run-time library
Kernel-mode: Trap interface / LPC; Security refmon; I/O Manager;
Memory Manager; Procs & threads; Win32 GUI; Scheduler; Synchronization
I/O stacks: Net devices, Net protocols, Net interfaces; File filters,
File systems, Filesys run-time, Volume mgrs, Cache mgr, Device stacks
Base: Object Manager / Configuration Management (registry);
Kernel run-time / Hardware Abstraction Layer
Windows Kernel Organization
Kernel-mode organized into
NTOS (kernel-mode services)
• Run-time Library, Scheduling, Executive services, object
manager, services for I/O, memory, processes, …
HAL (hardware abstraction layer)
• Insulates NTOS & drivers from hardware details
• Provides facilities, such as device access, timers, interrupt
servicing, clocks, spinlocks
Drivers
• Kernel extensions (devices, file systems, network)
Namespace
Components
Manage naming and security
Manage references to kernel data structures
Extensible mechanisms, scalable
Provides general synchronization
NT Object Manager
– Provides underlying NT namespace
– Unifies kernel data structure referencing
– Unifies user-mode referencing via handles
– Simplifies resource charging
– Central facility for security protection
– Other namespaces ‘mount’ on OB nodes
– Provides device & I/O support
[Figure: resolving \Global??\C: in the OB namespace]
\Global??\C:
L“\” <directory> → L“Global??” <directory> → L“C:” <symbolic link>
symbolic link target: \Device\HarddiskVolume1
\Device\HarddiskVolume1
L“\” <directory> → L“Device” <directory> → L“HarddiskVolume1” <device>
<device> implemented by I/O manager
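The lookup in the figure can be sketched as a small Python model. This is only an illustration: the `Directory`, `SymbolicLink`, and `Device` classes and the `lookup` function are hypothetical stand-ins, not kernel interfaces, and the reparse limit is an arbitrary assumption.

```python
# Toy model of the OB namespace walk; names mirror the slide, the
# data structures are assumptions made for illustration only.
class Directory(dict):
    """Maps a component name to a child object."""

class SymbolicLink:
    def __init__(self, target):
        self.target = target  # absolute path to restart the lookup from

class Device:
    def __init__(self, name):
        self.name = name

root = Directory({
    "Global??": Directory({"C:": SymbolicLink("\\Device\\HarddiskVolume1")}),
    "Device": Directory({"HarddiskVolume1": Device("HarddiskVolume1")}),
})

def lookup(path, max_reparse=8):
    """Walk the path one component at a time, restarting on symbolic links."""
    for _ in range(max_reparse):
        node = root
        parts = [p for p in path.split("\\") if p]
        for i, part in enumerate(parts):
            node = node[part]
            if isinstance(node, SymbolicLink):
                # Substitute the link target and keep the unparsed remainder.
                path = "\\".join([node.target] + parts[i + 1:])
                break
            if isinstance(node, Device):
                # The device's parse method (the I/O manager on the slide)
                # would take over here with the remaining path.
                return node, "\\".join(parts[i + 1:])
        else:
            return node, ""
    raise RuntimeError("too many symbolic links")
```

For example, `lookup("\\Global??\\C:\\foo\\bar.txt")` follows the symbolic link and stops at the `HarddiskVolume1` device with `foo\bar.txt` left for the device's parse routine.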
Security Reference Monitor
• Based on discretionary access controls
– Single module for access checks
– Implements Security Descriptors, System and
Discretionary ACLs, Privileges, and Tokens
– Collaborates with Local Security Authority
Service to obtain authenticated credentials
– Provides auditing and fulfills other Common
Criteria requirements
Object Mgr and Sec Monitor
[Figure: kernel code asks the Object Manager for a name lookup; the
Object Manager has the Security Ref Monitor perform access checks and
returns a ref’d ptr to the kernel data object; the ref’d ptr is used
until deref]
OB Namespace: objdir \
Name                        Type
ArcName                     Directory
BaseNamedObjects            Directory
Callback                    Directory
Cdfs                        Device
Device                      Directory
Dfs                         Device
DosDevices                  SymbolicLink - \??
Driver                      Directory
ErrorLogPort                Port
FileSystem                  Directory
GLOBAL??                    Directory
i8042PortAccessMutex        Event
KernelObjects               Directory
KnownDlls                   Directory
LanmanServerAnnounceEvent   Event
LsaAuthenticationPort       Port
NETLOGON_SERVICE_STARTED    Event
NLAPrivatePort              WaitablePort
NLAPublicPort               WaitablePort
NLS                         Directory
Ntfs                        Device
ObjectTypes                 Directory
REGISTRY                    Key
RPC Control                 Directory
SAM_SERVICE_STARTED         Event
Security                    Directory
SeLsaCommandPort            Port
SeLsaInitEvent              Event
SeRmCommandPort             Port
Sessions                    Directory
SmApiPort                   Port
SmSsWinStationApiPort       Port
SystemRoot                  SymbolicLink - \Device\Harddisk0\Partition1\WINDOWS
ThemeApiPort                Port
UniqueSessionIdEvent        Event
Windows                     Directory
XactSrvLpcPort              Port
OB Extensibility: Object Methods
Note that the methods are unrelated to actual
operations on the underlying objects:
OPEN: Create/Open/Dup/Inherit handle
CLOSE: Called when each handle closed
DELETE: Called on last dereference
PARSE: Called looking up objects by name
SECURITY: Usually SeDefaultObjectMethod
QUERYNAME: Return object-specific name
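A minimal sketch of when the CLOSE and DELETE methods fire, written as a Python model. The `ObjectType` and `KernelObject` classes and the helper functions are hypothetical, and the rule that each handle also holds one pointer reference is an assumption of this model.

```python
class ObjectType:
    """Hypothetical type object holding per-type callbacks."""
    def __init__(self, name, close=None, delete=None):
        self.name = name
        self.close = close or (lambda obj: None)
        self.delete = delete or (lambda obj: None)

class KernelObject:
    def __init__(self, obj_type):
        self.type = obj_type
        self.pointer_count = 1   # the creator's reference
        self.handle_count = 0

def open_handle(obj):
    obj.handle_count += 1
    obj.pointer_count += 1       # assumption: a handle holds a reference

def dereference(obj):
    obj.pointer_count -= 1
    if obj.pointer_count == 0:
        obj.type.delete(obj)     # DELETE: called on the last dereference

def close_handle(obj):
    obj.handle_count -= 1
    obj.type.close(obj)          # CLOSE: called when each handle is closed
    dereference(obj)
```

With two handles open and the creator's reference dropped, closing both handles runs the close callback twice and the delete callback once, after the second close.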
OB Extensibility: \ObjectTypes
Adapter File Semaphore
Callback IoCompletion SymbolicLink
Controller Job Thread
DebugObject Key Timer
Desktop KeyedEvent Token
Device Mutant Type
Directory Port WaitablePort
Driver Process WindowStation
Event Profile WMIGuid
EventPair Section
Object referencing: Handles
General mechanism: shorthand for referencing an opaque
data structure
e.g. a kernel structure (file, process, …)
[Figure: a handle in user mode goes through a mapping mechanism to
reach a data structure in the kernel]
Process/Thread structure
[Figure: labels: user-mode execution, read(handle); Process’ Handle
Table; Object Manager; Any Handle Table Object (Files, Events, Devices,
Drivers); Process Object; Threads; Virtual Address Descriptors;
Memory Manager Structures]
Handle Table
– NT handles allow user code to reference
kernel data structures (similar to, but more
general than, UNIX file descriptors)
– NT APIs use explicit handles to refer to
objects (simplifying cross-process operations)
– Handles can be used for synchronization,
including WaitMultiple
– Implementation is highly scalable
Handle Table Requirements
• Perform well (time & memory) across a broad range of
handle table sizes
• Handles can’t change as table expands
• Efficient allocate, duplicate, free operations
• Scalable performance on high-MP systems
One level: (to 512 handles)
Two levels: (to 512K handles)
Three levels: (to 16M handles)
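The three limits suggest a fixed decomposition of the handle index, which is also why a handle value does not have to change as the table grows. The sketch below is a Python model under stated assumptions: the fan-outs 512, 1024, and 32 are inferred from the slide's limits (512, 512K, 16M), and the two low tag bits of a handle value are an assumption. `split_handle` is a hypothetical name.

```python
ENTRIES_PER_LEAF = 512    # one-level limit from the slide
LEAVES_PER_MID = 1024     # 512 * 1024 = 512K, the two-level limit
MIDS_PER_TOP = 32         # 512K * 32 = 16M, the three-level limit
HANDLE_TAG_BITS = 2       # assumption: handle values advance in steps of 4

def split_handle(handle):
    """Return (top, mid, leaf) table indices for a handle value."""
    index = handle >> HANDLE_TAG_BITS
    leaf = index % ENTRIES_PER_LEAF
    mid = (index // ENTRIES_PER_LEAF) % LEAVES_PER_MID
    top = index // (ENTRIES_PER_LEAF * LEAVES_PER_MID)
    if top >= MIDS_PER_TOP:
        raise ValueError("handle beyond the 16M-entry limit")
    return top, mid, leaf
```

A table with one level only ever sees `top == 0` and `mid == 0`; adding levels later leaves the leaf index of every existing handle unchanged.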
Kernel Handles
IO Support: IopParseDevice
[Figure: user-mode NtCreateFile() enters the kernel through the trap
mechanism; the ObjMgr lookup reaches IopParseDevice() with a context
(DevObj, Dev Stack); the Security RefMon performs the access check;
the file system fills in the File object; a handle to the File object
is returned]
Object Manager Implementation
• Implements standard operations
– Open, close, delete, parse, security, query
• Dynamic definition of OB types, including
callbacks for standard ops and allocation
• Implements a unified API
– OpenByName, reference, dereference
– Namespace and synchronization functions
• Relies on Security Reference Monitor
• Every object has standard OBJECT_HEADER
OBJECT_HEADER
PointerCount
HandleCount
pObjectType
oNameInfo oHandleInfo oQuotaInfo Flags
pQuotaBlockCharged
pSecurityDescriptor
CreateInfo + NameInfo + HandleInfo + QuotaInfo
OBJECT BODY
[with optional DISPATCHER_HEADER]
Uniform Synchronization:
DISPATCHER_HEADER
Fundamental kernel synchronization mechanism
Equivalent to a KEVENT at front of dispatcher objects
Object Body → Inserted Size Absolute Type
SignalState
WaitListHead.flink
WaitListHead.blink
[Figure: structure used by WaitMultiple. The KPRCB WaitListHead links
the WaitListEntry of each waiting Thread; each Thread's WaitBlockList
chains its WaitBlocks through NextWaitBlock; each WaitBlock also sits
as a WaitListEntry on the WaitListHead of one Object->Header, next to
that object's Signaled state]
Address Spaces
Memory Mgmt
• Virtual Address management, processes
• Shared memory, cache management
• Virtual Address Translation, page tables
• Physical pageframe (& pagefile) management
• Large app support
Address Space Layout (2GB mode)
0x7FFFFFFF
            No access region
0x7FFE1000
0x7FFE0000  Shared User Data
            Private Process Space:
            PEB, TEBs, Module images, Stacks,
            Virtual Allocations, Heaps, Unused
0x0000FFFF
            No access region
0x00000000
Process/Thread structure
[Figure: labels: Process’ Handle Table; Object Manager; Any Handle
Table Object (Files, Events, Devices, Drivers); Process Object;
Threads; Virtual Address Descriptors; Memory Manager Structures]
Processes
• An environment for program execution
(conceptually)
• Binds
– namespaces
– virtual address mappings
– ports (debug, exceptions)
– threads
• Not a virtualization of a processor
Virtual Address Descriptors
• Tree representation of an address space
• Types of VAD nodes
– invalid
– reserved
– committed
– committed to backing store
– app-managed (large pages, AWE, physical)
• Backing store represented by section
objects
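As a rough illustration of how a descriptor is found for a faulting address, here is a Python model. A sorted list with binary search stands in for the VAD tree, which is an assumption made for brevity; `Vad` and `AddressSpace` are hypothetical names.

```python
import bisect

class Vad:
    def __init__(self, start, end, kind):
        # end is inclusive; kind is e.g. "reserved" or "committed"
        self.start, self.end, self.kind = start, end, kind

class AddressSpace:
    """Sorted list standing in for the tree of VAD nodes."""
    def __init__(self):
        self.vads = []

    def insert(self, vad):
        starts = [v.start for v in self.vads]
        i = bisect.bisect_left(starts, vad.start)
        prev_overlaps = i > 0 and self.vads[i - 1].end >= vad.start
        next_overlaps = i < len(self.vads) and self.vads[i].start <= vad.end
        if prev_overlaps or next_overlaps:
            raise ValueError("range overlaps an existing VAD")
        self.vads.insert(i, vad)

    def find(self, address):
        starts = [v.start for v in self.vads]
        i = bisect.bisect_right(starts, address) - 1
        if i >= 0 and self.vads[i].end >= address:
            return self.vads[i]
        return None   # no descriptor covers the address: it is invalid
```

An address that falls between two descriptors returns `None`, which corresponds to the "invalid" case in the list above.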
Shared Memory Data Structures
[Figure: labels: File Object (Handle); Section Object (Handle);
Control Area; Segment; Subsections; Shared Cache Map; Proto PTEs;
Page Directory; Page Table; Process; VAD]
Cache Manager Summary
• Virtual block cache for files, not a logical block cache for
disks
• Memory manager is the ACTUAL cache manager
• Cache Manager context integrated into FileObjects
• Cache Manager manages views on files in kernel virtual
address space
• I/O has special fast path for cached accesses
• The Lazy Writer periodically flushes dirty data to disk
• Filesystems need two interfaces to CC: map and pin
The Big Block Diagram
[Figure: labels: Fast IO Read/Write and Cached IO (into the Cache
Manager); IRP-based Read/Write (into the Filesystem); Cache Access,
Flush, Purge (Cache Manager and Memory Manager); Page Fault;
Noncached IO; Storage Drivers; Disk]
Filesystem & Cache Manager
• 3 basic types of I/O: cached, noncached and “paging”
• Paging I/O is I/O generated by Mm – flushing or faulting
– the data section implies the file is big enough
– can never extend a file
• A filesystem will recurse on the same callstack as Mm
dispatches cache pagefaults
– This makes things exciting! (ERESOURCEs)
Three File Sizes
• FileSize – normal length expected by the user
• AllocationSize – backing store allocated on the volume
– multiple of cluster size, which is 2^n * sector size
• ValidDataLength – size written so far
– ValidDataLength <= FileSize <= AllocationSize
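The relationship between the three sizes can be checked with a few lines of Python. This is a sketch only: the default sector size of 512 bytes and n = 3 (4 KB clusters) are assumed example values, and both function names are hypothetical.

```python
def allocation_size(file_size, sector_size=512, n=3):
    """Round file_size up to whole clusters, where cluster = 2**n sectors."""
    cluster = (2 ** n) * sector_size
    return -(-file_size // cluster) * cluster   # ceiling division

def sizes_consistent(valid_data_length, file_size, alloc_size):
    """ValidDataLength <= FileSize <= AllocationSize."""
    return 0 <= valid_data_length <= file_size <= alloc_size
```

With these defaults a 4097-byte file needs two clusters, so its AllocationSize would be 8192 while FileSize stays 4097.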
Letting the Filesystem Into The Cache
• Two distinct access interfaces
– Map – given File+FileOffset, return a cache address
– Pin – same, but acquires synchronization – this is a
range lock on the stream
• Lazy writer acquires synchronization, allowing it to serialize
metadata production with metadata writing
• Pinning also allows setting of a log sequence
number (LSN) on the update, for transactional
FS
– FS receives an LSN callback from the lazy writer prior
to range flush
Virtual Address Translation
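[Figure: CR3 → PD (1024 PDEs) → PT (1024 PTEs) → page (4096 bytes) →
DATA. The 32-bit virtual address
0000 0000 0000 0000 0000 0000 0000 0000
splits into a 10-bit PD index, a 10-bit PT index, and a 12-bit byte
offset within the page]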
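The split follows directly from the sizes in the figure (1024 PDEs, 1024 PTEs, 4096-byte pages). A minimal Python sketch, with `split_va` as a hypothetical helper name:

```python
def split_va(va):
    """Split a 32-bit virtual address into (PD index, PT index, offset)."""
    pde_index = (va >> 22) & 0x3FF   # top 10 bits select one of 1024 PDEs
    pte_index = (va >> 12) & 0x3FF   # next 10 bits select one of 1024 PTEs
    offset = va & 0xFFF              # low 12 bits address a byte in the page
    return pde_index, pte_index, offset
```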
Self-mapping page tables
• Page Table Entries (PTEs) and Page Directory Entries
(PDEs) contain Physical Frame Numbers (PFNs)
– But Kernel runs with Virtual Addresses
• To access PDE/PTE from kernel use the self-
map for the current process:
PageDirectory[0x300] uses PageDirectory as
PageTable
– GetPdeAddress(va): 0xc0300000 + ((va >> 22) << 2)
– GetPteAddress(va): 0xc0000000 + ((va >> 12) << 2)
• PDE/PTE formats are compatible!
• Access another process VA via thread ‘attach’
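The self-map arithmetic can be checked in a few lines of Python. The base addresses come from the slide; the 4-byte entry size is an assumption that matches non-PAE 32-bit x86, and the function names are hypothetical Python stand-ins for the kernel macros.

```python
PTE_BASE = 0xC0000000   # virtual base of the page-table window (from the slide)
PDE_BASE = 0xC0300000   # virtual base of the page directory (from the slide)
ENTRY_SIZE = 4          # assumption: 4-byte PDEs/PTEs (non-PAE x86)

def get_pde_address(va):
    """Virtual address of the PDE that maps va."""
    return PDE_BASE + (va >> 22) * ENTRY_SIZE

def get_pte_address(va):
    """Virtual address of the PTE that maps va."""
    return PTE_BASE + (va >> 12) * ENTRY_SIZE
```

Two consequences fall out of the arithmetic: `get_pte_address(PTE_BASE)` equals `PDE_BASE`, so the page directory is simply the page table that maps the page-table window, and `get_pde_address(PTE_BASE)` is `0xC0300C00`, the self-map entry at index 0x300.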
Self-mapping page tables
Virtual Access to PageDirectory[0x300]
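Phys: PD[0xc0300000>>22] = PD
Virt: *(0xc0300c00) == PD
[Figure: CR3 → PD; entry 0x300 of the PD points back to the PD, which
is then read as a page table holding the PTE.
0xc0300c00 = 1100 0000 0011 0000 0000 1100 0000 0000:
PD index 0x300, PT index 0x300, byte offset 0xc00]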
Self-mapping page tables
Virtual Access to PTE for va 0xe4321000
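GetPteAddress: 0xe4321000 => 0xc0390c84
[Figure: CR3 → PD → PT → PTE. PD index 0x300 selects the PD itself
(the self-map), index 0x390 then selects the page table for the va,
and byte offset 0xc84 selects entry 0x321.
0xc0390c84 = 1100 0000 0011 1001 0000 1100 1000 0100]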
Writing Cached Data
• There are three basic sets of threads involved,
only one of which is Cc’s
– Mm’s modified page writer (paging file)
– Mm’s mapped page writer (mapped file)
– Cc’s lazy writer pool (cleans data in cache)
The Lazy Writer
• Name is misleading; it's really a delayed writer
• All files with dirty data have been queued onto
CcDirtySharedCacheMapList
• Work queueing – CcLazyWriteScan()
– Once per second, queues work to arrive at writing 1/8th of dirty data
given current dirty and production rates
– Fairness considerations are interesting
• CcLazyWriterCursor rotated around the list, pointing at the
next file to operate on (fairness)
– 16th pass rule for user and metadata streams
• Work issuing – CcWriteBehind()
– Uses a special mode of CcFlushCache() which flushes front to back
Physical Frame Management
• Table of PFN data structures
– represent all pageable pages
– synchronize page-ins
– linked to management lists
• Page Tables
– hierarchical index of page directories and tables
– leaf-node is page table entry (PTE)
– PTE states:
• Active/valid
• Transition
• Modified-no-write
• Demand zero
• Page file
• Mapped file
Paging Overview
Working Sets: list of valid pages for each process
(and the kernel)
Pages ‘trimmed’ from working set on lists
Standby list: pages backed by disk
Modified list: dirty pages to push to disk
Free list: pages not associated with disk
Zero list: supply of demand-zero pages
Modified/standby pages can be faulted back into a
working set w/o disk activity (soft fault)
Background system threads trim working sets,
write modified pages and produce zero pages
based on memory state and config parameters
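The list transitions described above can be modelled in Python as follows. This is a simplified sketch: the `PageLists` class and its method names are hypothetical, pages are plain integers, and policy (when to trim, how many pages to write) is left out entirely.

```python
from collections import deque

class PageLists:
    def __init__(self):
        self.working_set = set()
        self.standby, self.modified = deque(), deque()
        self.free, self.zero = deque(), deque()

    def trim(self, page, dirty):
        """Remove a page from the working set onto standby or modified."""
        self.working_set.remove(page)
        (self.modified if dirty else self.standby).append(page)

    def soft_fault(self, page):
        """Return a trimmed page to the working set without disk activity."""
        for lst in (self.standby, self.modified):
            if page in lst:
                lst.remove(page)
                self.working_set.add(page)
                return True
        return False   # not on either list: a hard fault or zero page is needed

    def write_modified(self):
        """Modified page writer: once written, a page is clean (standby)."""
        while self.modified:
            self.standby.append(self.modified.popleft())

    def zero_free_pages(self):
        """Zero thread: turn free pages into demand-zero pages."""
        while self.free:
            self.zero.append(self.free.popleft())
```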
Physical Frame Management
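[Figure: physical page state changes. Pages trimmed from a
Process/System Working Set go to the Standby List if clean or to the
Modified List if dirty; the Modified Page-writer moves pages from the
Modified List to the Standby List; a Soft Fault returns a page from
either list to the working set. Delete Page and MM Low Memory feed the
Free List; the Zero Thread moves pages from the Free List to the Zero
List; a Hardfault (DISK) or Zerofault (FILL) brings a page into the
working set]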
Managing Working Sets
Aging pages: Increment age counts for pages
which haven't been accessed
Estimate unused pages: count in working set and
keep a global count of estimate
When getting tight on memory: replace rather
than add pages when a fault occurs in a working
set with significant unused pages
When memory is tight: reduce (trim) working sets
which are above their maximum
Balance Set Manager: periodically runs Working
Set Trimmer, also swaps out kernel stacks of
long-waiting threads
Bypassing Memory Management
[Figure: labels: Working-set list; Working-set Manager; VAD tree;
Sections; Image (executable, c-o-w); Standby List; Free List;
Modified List; Modified Page Writer; Application; Phys Data;
SQL db Data; File Data (datafile); File Data (pagefile)]
CPU
Processes versus Threads
Lighterweight multi-threading
CPU scheduling
CPU mechanisms:
APCs, ISRs/DPCs, system worker threads
Process
Container for an address space and threads
Associated User-mode Process Environment Block (PEB)
Primary Access Token
Quota, Debug port, Handle Table etc
Unique process ID
Queued to the Job, global process list and Session list
MM structures like the WorkingSet, VAD tree, AWE etc
Thread
Fundamental schedulable entity in the system
Represented by ETHREAD that includes a KTHREAD
Queued to the process (both E and K thread)
IRP list
Impersonation Access Token
Unique thread ID
Associated User-mode Thread Environment Block (TEB)
User-mode stack
Kernel-mode stack
Processor Control Block (in KTHREAD) for cpu state when
not running
Process/Thread structure
[Figure: labels: Process’ Handle Table; Object Manager; Any Handle
Table Object (Files, Events, Devices, Drivers); Process Object;
Threads; Virtual Address Descriptors]
Mitigating thread costs
Thread pools
• Driven by work items
• User-mode thread pool
• Kernel-mode worker threads
Fibers
• user-mode threads
• allows user-mode control of scheduling
• better performance for certain apps, but generally
discouraged
• has most of the usual user vs. kernel thread issues
Thread latencies
Scheduling introduces bad latencies
– Preemption
• introduces fairness and responsiveness
• creates priority inversion if holding locks/resources
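– Scheduling
• allows prioritized sharing
• defeats RPC
[Figure: IPC between caller and callee through the scheduler: the
caller blocks and the callee becomes ready, then the callee blocks and
the caller becomes ready; label: Boost priority]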
Scheduling
Windows schedules threads, not processes
Scheduling is preemptive and priority-based, with round-robin among
threads at the highest priority
16 real-time priorities above 16 normal priorities
Scheduler tries to keep a thread on its ideal processor/node to
avoid perf degradation of cache/NUMA-memory
Threads can specify affinity mask to run only on certain processors
Each thread has a current & base priority
Base priority initialized from process
Non-realtime threads have priority boost/decay from base
Boosts for GUI foreground, waking for event
Priority decays, particularly if thread is CPU bound (running at
quantum end)
Scheduler is state-driven by timer, setting thread priority,
thread block/exit, etc
Priority inversions can lead to starvation
balance manager periodically boosts non-running runnable threads
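A minimal Python sketch of priority-based, round-robin selection with one ready queue per priority level. The `ReadyQueues` class is hypothetical, and the decay rule (drop one level per expired quantum, never below base, dynamic range only) is an assumption chosen to illustrate the idea rather than the exact kernel policy.

```python
from collections import deque

NUM_PRIORITIES = 32   # 16 real-time levels above 16 normal levels

class ReadyQueues:
    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]

    def make_ready(self, thread, priority):
        self.queues[priority].append(thread)

    def pick_next(self):
        """Highest non-empty priority wins; FIFO order within a level."""
        for priority in range(NUM_PRIORITIES - 1, -1, -1):
            if self.queues[priority]:
                return self.queues[priority].popleft(), priority
        return None, None

    def quantum_end(self, thread, priority, base_priority):
        """Decay a boosted dynamic priority toward base, then requeue."""
        if priority < 16 and priority > base_priority:
            priority -= 1
        self.make_ready(thread, priority)
        return priority
```

Requeuing at the tail on quantum end is what produces round-robin among threads that share the highest ready priority.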
[Figure: thread state diagram around the Scheduler: Ready, Running,
Blocked, Swapped]
Thread scheduling states
• Main quasi-states:
– Ready – able to run (queued on Prcb ReadyList)
– Running – current thread (Prcb CurrentThread)
– Waiting – waiting for an event
• For scalability Ready is three real states:
– DeferredReady – queued on any processor
– Standby – will imminently start Running
– Ready – queued on target processor by priority
• Goal is granular locking of thread priority queues
• Red states related to swapped stacks and processes
NT thread priorities
[Figure: two columns of priority levels; each row pairs level n with
level n+16]
Levels 0-15: normal (dynamic), with bands labelled IDLE, NORMAL-,
NORMAL, NORMAL+, HIGH; label: worker threads
Levels 16-31: real-time (fixed)
Row labels: critical (15 / 31), idle (1 / 17), zero thread (0 / 16)
CPU Control-flow
Thread scheduling occurs at PASSIVE or APC level
(IRQL < 2)
APCs (Asynchronous Procedure Calls) deliver I/O
completions, thread/process termination, etc (IRQL == 1)
Not a general mechanism like unix signals (user-mode code must
explicitly block pending APC delivery)
Interrupt Service Routines run at IRQL > 2
ISRs defer most processing to run at IRQL==2 (DISPATCH
level) by queuing a DPC to their current processor
A pool of worker threads available for kernel components to
run in a normal thread context when user-mode thread is
unavailable or inappropriate
Normal thread scheduling is round-robin among priority
levels, with priority adjustments (except for fixed priority
real-time threads)
Asynchronous Procedure Calls
APCs execute routine in thread context
not as general as UNIX signals
user-mode APCs run when blocked & alertable
kernel-mode APCs used extensively: timers,
notifications, swapping stacks, debugging, set
thread ctx, I/O completion, error reporting,
creating & destroying processes & threads, …
APCs generally blocked in critical sections
e.g. don’t want thread to exit holding resources
Deferred Procedure Calls
DPCs run a routine on a particular processor
DPCs are higher priority than threads
common usage is deferred interrupt processing
ISR queues DPC to do bulk of work
• long DPCs harm perf, by blocking threads
• Drivers must be careful to flush DPCs before unloading
also used by scheduler & timers (e.g. at quantum end)
High-priority routines use IPI (inter-processor intr)
used by MM to flush TLB in other processors
System Threads
System threads have no user-mode context
Run in ‘system’ context, use system handle table
System thread examples
Dedicated threads
Lazy writer, modified page writer, balance set manager,
mapped page writer, other housekeeping functions
General worker threads
Used to move work out of context of user thread
Must be freed before drivers unload
Sometimes used to avoid kernel stack overflows
Driver worker threads
Extends pool of worker threads for heavy hitters, like file server
Synchronization
Multiple tailored mechanisms for synchronization
and resource sharing
Examples:
PushLocks
Fast Referencing
Kernel synchronization mechanisms
Pushlocks DISPATCHER_HEADER
Fastref KQUEUEs
Rundown protection KEVENTs
Spinlocks Guarded mutexes
Queued spinlocks Mutants
IPI Semaphores
SLISTs EventPairs
ERESOURCEs
Critical Sections
Push Locks
• Acquired shared or exclusive
• NOT recursive
• Locks granted in order of arrival
• Fast non-contended / Slow contended
• Sizeof(pushlock) == Sizeof(void*)
• Pageable
• Acquire/release are lock-free
• Contended case blocks using local stack
Pushlock format
Fast Referencing
• Used to protect rarely changing reference
counted data
• Small pageable structure that’s the size of
a pointer
• Scalable since it requires no lock acquires
in over 99% of calls
Fast Referencing Internals
Object Pointer R
Object: RefCnt: R + 1 + N
Obtaining a Fast Reference
[Figure: a Reference takes the count stored with the Object Pointer
from 3 to 2; a Dereference takes it back from 2 to 3]
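The counting scheme can be illustrated with a single-threaded Python model. Several things here are assumptions: the cache size of 7 (three spare low bits in an aligned pointer), the class and method names, and the fallback path. The real mechanism would update the pointer-sized word atomically, which this sketch does not attempt.

```python
MAX_FAST_REFS = 7     # assumption: three spare low bits in an aligned pointer

class Obj:
    def __init__(self):
        self.ref_count = 0

class FastRef:
    """Pointer plus a small count of references charged in advance."""
    def __init__(self, obj):
        self.obj = obj
        self.cached = MAX_FAST_REFS            # R on the slide
        obj.ref_count += MAX_FAST_REFS + 1     # R cached refs + 1 for this slot

    def reference(self):
        if self.cached > 0:
            self.cached -= 1          # fast path: hand out a pre-charged ref
        else:
            self.obj.ref_count += 1   # slow path: charge the object directly
        return self.obj

    def dereference(self, obj):
        if obj is self.obj and self.cached < MAX_FAST_REFS:
            self.cached += 1          # give the reference back to the cache
        else:
            obj.ref_count -= 1        # slot changed or cache full
```

In this model the object's count always equals the cached count, plus one for the slot, plus the references currently handed out, which matches the slide's R + 1 + N.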
I/O
Driver stacks
I/O Request Packets
Synchronous vs Asynchronous I/O
I/O completion ports
File Systems
NtCreateFile
[Figure: NtCreateFile in the I/O Manager calls ObOpenObjectByName; the
Object Manager invokes IopParseDevice; the I/O Manager builds the IRP
and File Object and passes the IRP with IoCallDriver through the FS
filter drivers, NTFS, the Volume Mgr, and the Disk Driver (above the
HAL). Result: File Object filled in by NTFS]
Layering Drivers
Device objects attach one on top of another using
IoAttachDevice* APIs creating device stacks
– IO manager sends IRP to top of the stack
– drivers store next lower device object in their private
data structure
– stack tear down done using IoDetachDevice and
IoDeleteDevice
Device objects point to driver objects
– driver objects represent driver state, including dispatch table
File objects point to open files
File systems are drivers which manage file objects for
volumes (described by VolumeParameterBlocks)
IO Request Packet (IRP)
• IO operations encapsulated in IRPs.
• IO requests travel down a driver stack in an IRP.
• Each driver gets a stack location which contains
parameters for that IO request.
• IRP has major and minor codes to describe IO
operations.
• Major codes include create, read, write, PNP,
devioctl, cleanup and close.
• IRPs are associated with the thread that made the
IO request.
IRP Fields
[Figure: IRP layout, labels: Flags; Buffer Pointers (System, User,
MDL); MDL Chain; Thread; Thread’s IRPs; Completion/Cancel Info;
Completion APC block; Driver Queuing & Comm.; IRP Stack Locations]
Each IRP Stack Location
Major/Minor Function Codes
Flags & Control
Parameters:
Create: security, options
Read: len, key, offset
DeviceObject
FileObject
Completion Routine & Parameter
[Figure: the stack location points to DrvrObj, DevObj, and FileObj;
label: MDL Chain]
IRP flow of control (synchronous)
IOMgr (e.g. IopParseDevice) creates IRP, fills in top
stack location, calls IoCallDriver to pass to stack
driver determined by top device object on device stack
driver passed the device object and IRP
IoCallDriver
copies stack location for next driver
driver routine determined by major function in drvobj
Each driver in turn
does work on IRP, if desired
keeps track in the device object of the next stack device
Calls IoCallDriver on next device
Eventually bottom driver completes IO and returns on callstack
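The synchronous flow above can be sketched as a toy Python model. All names (`Irp`, `DeviceObject`, `io_call_driver`, `pass_down`, `complete`) are hypothetical stand-ins for the kernel structures and routines, and filling stack locations from index 0 upward is a simplification of this model.

```python
class Irp:
    def __init__(self, major, depth):
        self.major = major
        self.stack = [{} for _ in range(depth)]   # one location per driver
        self.current = -1
        self.trace = []

class DeviceObject:
    def __init__(self, name, dispatch, lower=None):
        self.name = name
        self.dispatch = dispatch   # major code -> routine (driver's table)
        self.lower = lower         # next lower device, kept by the driver

def io_call_driver(device, irp):
    """Advance to the next stack location and invoke the driver's routine."""
    irp.current += 1
    irp.stack[irp.current]["device"] = device.name
    return device.dispatch[irp.major](device, irp)

def pass_down(device, irp):
    irp.trace.append(device.name)          # optional work on the IRP
    return io_call_driver(device.lower, irp)

def complete(device, irp):
    irp.trace.append(device.name)          # bottom driver finishes the IO
    return "STATUS_SUCCESS"

disk = DeviceObject("disk", {"READ": complete})
volmgr = DeviceObject("volmgr", {"READ": pass_down}, lower=disk)
ntfs = DeviceObject("ntfs", {"READ": pass_down}, lower=volmgr)
```

Calling `io_call_driver(ntfs, Irp("READ", 3))` visits the three drivers top to bottom, and the status from the bottom driver returns back up the same call stack.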
IRP flow of control (asynch)
Eventually a driver decides to be asynchronous
driver queues IRP for further processing
driver returns STATUS_PENDING up call stack
higher drivers may return all the way to user, or may
wait for IO to complete (synchronizing the stack)
Eventually a driver decides IO is complete
usually due to an interrupt/DPC completing IO
each completion routine in device stack is called,
possibly at DPC or in arbitrary thread context
IRP turned into APC request delivered to original thread
APC runs final completion, accessing process memory
Asynchronous I/O
• Applications can issue asynchronous IO requests to files
opened with FILE_FLAG_OVERLAPPED by passing
an LPOVERLAPPED parameter to the IO API (e.g.,
ReadFile(…))
• Five methods available to wait for IO completion,
– Wait on the file handle
– Wait on an event handle passed in the overlapped
structure (e.g., GetOverlappedResult(…))
– Specify a routine to be called on IO completion
– Use completion ports
– Poll status variable
I/O Completion Ports
• Five methods to receive notification of completion for
asynchronous I/O:
– poll status variable
– wait for the file handle to be signalled
– wait for an explicitly passed event to be signalled
– specify a routine to be called in the originating thread
– use an I/O completion port
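The completion-port idea, one queue of completion packets drained by a small pool of threads, can be modelled with Python's standard `queue` and `threading` modules. This is an analogy rather than the Win32 API: `CompletionPort`, `run_workers`, and the `None` shutdown sentinel are assumptions of the sketch.

```python
import queue
import threading

class CompletionPort:
    """Queue of completion packets drained by a pool of worker threads."""
    def __init__(self):
        self._packets = queue.Queue()

    def post(self, key, result):
        self._packets.put((key, result))

    def get(self):
        return self._packets.get()   # blocks until a packet is available

def run_workers(port, n_workers, n_packets):
    handled = []
    lock = threading.Lock()

    def worker():
        while True:
            key, result = port.get()
            if key is None:          # sentinel: shut this worker down
                return
            with lock:
                handled.append((key, result))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for i in range(n_packets):       # simulated I/O completions
        port.post(i, f"io-{i} done")
    for _ in threads:
        port.post(None, None)
    for t in threads:
        t.join()
    return sorted(handled)
```

Any worker may pick up any completion, so the number of threads can be sized to the processors rather than to the number of outstanding requests.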
Completing Asynchronous I/O
[Figure: user (U) and kernel (K) views of two cases. Normal
completion: each thread issues a request and receives the completion
of its own I/O. I/O completion ports: threads issue requests, the
completions are queued to a single I/O Completion port, and threads
take them from the port]
File System Device Stack
[Figure: labels, top to bottom. User: Application; Kernel32 / ntdll.
Kernel: NT I/O Manager; File System Filters; File System Driver;
Cache Manager; Virtual Memory Manager; Disk Class Manager;
Disk Driver; Partition/Volume Storage Manager; DISK]
Discussion