Co Case Study Final Report
1. Memory Access Patterns - Differing page table structures and TLB management
affect how efficiently applications access memory
Through this structured analysis, the study aims to provide system architects, performance engineers, and AI
researchers with:
operating system design. We will explore the hardware and software mechanisms that enable
efficient memory abstraction, the various algorithms and policies that optimize performance,
and the specialized techniques developed for handling memory-intensive workloads.
Main memory is central to the operation of a modern computer. It is a large array of words or bytes, ranging in size from hundreds of thousands to billions, and serves as a repository of rapidly accessible information shared by the CPU and I/O devices. Main memory is where programs and data are kept while the processor is actively using them. There are multiple levels of memory (Fig. 2), each with a different size, cost, and speed. Some types, such as cache and main memory, are faster than others but offer less capacity at higher cost, whereas other types offer greater capacity but are slower. Access speed likewise differs across the levels. These levels, stacked in order, are known as the Memory Hierarchy, which helps optimize the memory available in the computer.
Registers are small, high-speed memory units located in the CPU. They are used to
store the most frequently used data and instructions. Registers have the fastest access time
and the smallest storage capacity, typically ranging from 16 to 64 bits.
Cache Memory is a small, fast memory unit located close to the CPU. It stores
frequently used data and instructions that have been recently accessed from the main
memory.
Main Memory, also known as RAM (Random Access Memory), is the primary
memory of a computer system. It has a larger storage capacity than cache memory, but it
is slower. Main memory is used to store data and instructions that are currently in use by
the CPU.
Secondary Storage, such as hard disk drives (HDD) and solid-state drives (SSD),
is a non-volatile memory unit that has a larger storage capacity than main memory. It is
used to store data and instructions that are not currently in use by the CPU.
Magnetic disks are circular platters fabricated from metal or plastic and coated with a magnetizable material. They rotate at high speed inside the computer and are frequently used for storage.
Magnetic tape is a magnetic recording medium consisting of a plastic film with a magnetizable coating, generally used for data backup. Access time for magnetic tape is comparatively slow, since the tape must be wound to the position of the requested data.
According to the memory hierarchy, the memory levels supported by the system are described below:
Loading a process into main memory is done by a loader. There are two different types of loading:
• Static Loading: the entire program is loaded into memory at a fixed address before execution begins, which requires enough memory for the whole program.
• Dynamic Loading: if the entire program and all data of a process had to be in physical memory for the process to execute, the size of a process would be limited to the size of physical memory. To gain better memory utilization, dynamic loading is used: a routine is not loaded until it is called.
To perform a linking task, a linker is used. A linker is a program that takes one or
more object files generated by a compiler and combines them into a single executable file.
• Static Linking: In static linking, the linker combines all necessary program modules
into a single executable program. So there is no runtime dependency. Some
operating systems support only static linking, in which system language libraries are
treated like any other object module.
• Dynamic Linking: here, linking is postponed until run time, and a stub is included in the program image for each library-routine reference. A stub is a small piece of code. When the stub is executed, it checks whether the needed routine is already in memory; if not, the program loads the routine into memory. A minimal sketch of this idea follows.
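To make the dynamic case concrete, here is a minimal sketch of run-time loading and symbol resolution on a POSIX system using the standard dlopen/dlsym interface. The math library libm.so.6 and the cos routine are merely convenient, well-known examples; error handling is kept minimal.

```c
/* Minimal sketch of dynamic loading/linking on a POSIX system.
 * Compile with: gcc demo.c -ldl */
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

int main(void) {
    /* The library is not loaded until this call -- dynamic loading. */
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return EXIT_FAILURE;
    }

    /* Resolve the routine at run time, as a stub would. */
    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (!cosine) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return EXIT_FAILURE;
    }

    printf("cos(0.0) = %f\n", cosine(0.0));
    dlclose(handle);  /* Unload when no longer needed. */
    return 0;
}
```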
Memory management mostly concerns main memory: the task of subdividing memory among different processes. It is the method by which the operating system manages data movement between main memory and disk during process execution. Different memory management techniques are:
• Swapping temporarily moves entire processes between main memory and secondary
storage to free up memory for higher-priority tasks. While simple, swapping can incur
significant overhead due to the large amounts of data being moved.
• Paging (Fig. 4) divides memory into fixed-size blocks called pages (typically 4 KB).
The system maintains page tables that map virtual pages to physical frames. When a
process accesses a page not currently in memory (page fault), the operating system:
Fig. 4. Paging
– If the CPU refers to a page that is not currently present in main memory, it generates an interrupt indicating a memory access fault.
– The OS puts the interrupted process in a blocked state; for execution to proceed, the OS must bring the required page into memory.
– The OS locates the required page on the backing store (secondary storage).
– The required page is brought from the backing store into the physical address space. Page replacement algorithms are used to decide which page in physical memory to replace.
– A signal is sent to the CPU to continue program execution, and the process is placed back into the ready state.
Hence, whenever a page fault occurs, these steps are followed by the operating system and the required page is brought into memory.
The time taken to service a page fault is called the page fault service time; it includes the time taken to perform all of the steps above. Let
m = main memory access time,
s = page fault service time,
p = page fault rate.
Then, effective memory access time (EAT) = p × s + (1 − p) × m.
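A small numeric sketch of this formula; the timing values (100 ns memory access, 8 ms fault service) are assumed for illustration, and show how even a tiny fault rate dominates the effective access time:

```c
/* Worked example of EAT = p*s + (1 - p)*m with assumed timings. */
#include <stdio.h>

int main(void) {
    double m = 100.0;       /* main memory access time, ns (assumed) */
    double s = 8000000.0;   /* page fault service time, ns (assumed 8 ms) */

    for (int i = 0; i <= 4; i++) {
        double p = i * 0.000005;             /* page fault rate */
        double eat = p * s + (1.0 - p) * m;  /* effective access time */
        printf("p = %.6f  ->  EAT = %.1f ns\n", p, eat);
    }
    return 0;
}
```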
• Segmentation divides virtual memory into segments of different sizes. Segments that
aren’t currently needed can be moved to the hard drive. The system uses a segment
table to keep track of each segment’s status, including whether it’s in memory, if it’s
been modified, and its physical address. Segments are mapped into a process’s address
space only when needed.
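To make the mechanism concrete, here is a hedged sketch of segment-table translation; the base/limit/present fields and the example values are illustrative assumptions, not any particular OS's layout:

```c
/* Sketch of segmentation-style translation: a logical address is a
 * (segment, offset) pair checked against a segment table entry. */
#include <stdio.h>
#include <stdint.h>

struct segment {
    uint32_t base;   /* physical starting address */
    uint32_t limit;  /* segment length in bytes */
    int present;     /* is the segment currently in memory? */
};

struct segment seg_table[] = {
    { 0x10000, 0x4000, 1 },  /* segment 0: 16 KB, resident */
    { 0x50000, 0x1000, 0 },  /* segment 1: 4 KB, moved to disk */
};

int translate(int seg, uint32_t offset, uint32_t *phys) {
    if (offset >= seg_table[seg].limit) return -1;  /* protection fault */
    if (!seg_table[seg].present) return -2;  /* segment fault: must load it */
    *phys = seg_table[seg].base + offset;
    return 0;
}

int main(void) {
    uint32_t phys;
    if (translate(0, 0x0123, &phys) == 0)
        printf("segment 0 + 0x0123 -> physical 0x%X\n", (unsigned)phys);
    printf("segment 1 -> %s\n",
           translate(1, 0x10, &phys) == -2 ? "segment fault" : "ok");
    return 0;
}
```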
The MMU enables the virtual memory abstraction while minimizing performance overhead through hardware acceleration of address translation.
Page replacement algorithms are critical components of virtual memory systems that
determine which memory pages to evict when free physical memory becomes scarce. These
algorithms significantly impact system performance, particularly under memory pressure
conditions. The primary goal of page replacement is to minimize page faults: situations where a requested page isn't in physical memory and must be loaded from disk. Effective algorithms aim to:
• Minimize disk I/O: retrieving a page from disk is roughly 100,000× slower than a RAM access
The OPT algorithm follows an offline approach, making replacement decisions based on
perfect future knowledge of memory accesses. When a page fault occurs, it:
• Identifies the page that will not be used for the longest time in the future.
• For example, when page 4 is loaded and a page fault occurs, OPT replaces page 7 (not used again) instead of 0 or 2 (both needed soon).
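A minimal simulation of OPT under these rules; the reference string is an assumed illustration, and the lookahead scan is exactly what makes OPT unrealizable online:

```c
/* Sketch of the OPT (Belady) policy: on a fault with full frames,
 * evict the page whose next use lies farthest in the future. */
#include <stdio.h>

#define NFRAMES 3

int next_use(const int *refs, int n, int from, int page) {
    for (int i = from; i < n; i++)
        if (refs[i] == page) return i;
    return n + 1;  /* never used again: ideal victim */
}

int main(void) {
    int refs[] = {7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2};  /* illustrative */
    int n = sizeof refs / sizeof refs[0];
    int frames[NFRAMES], used = 0, faults = 0;

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int f = 0; f < used; f++)
            if (frames[f] == refs[i]) hit = 1;
        if (hit) continue;
        faults++;
        if (used < NFRAMES) { frames[used++] = refs[i]; continue; }
        /* Evict the frame whose page is needed farthest in the future. */
        int victim = 0, farthest = -1;
        for (int f = 0; f < NFRAMES; f++) {
            int nu = next_use(refs, n, i + 1, frames[f]);
            if (nu > farthest) { farthest = nu; victim = f; }
        }
        frames[victim] = refs[i];
    }
    printf("OPT faults: %d\n", faults);
    return 0;
}
```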
The FIFO algorithm maintains pages in a queue in the order they were loaded into memory:
• The page at the head of the queue (the oldest loaded page) is selected for replacement
FIFO has a critical weakness: increasing the number of memory frames can paradoxically increase the number of page faults (Belady's Anomaly).
• Example reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• With three frames this string causes 9 faults under FIFO, but with four frames it causes 10, as the sketch below demonstrates.
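A compact FIFO simulation confirming the anomaly on this exact string; frame slots are filled in load order, so the head index always points at the oldest page:

```c
/* FIFO simulation demonstrating Belady's Anomaly. */
#include <stdio.h>

int fifo_faults(const int *refs, int n, int nframes) {
    int frames[16], used = 0, head = 0, faults = 0;
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int f = 0; f < used; f++)
            if (frames[f] == refs[i]) hit = 1;
        if (hit) continue;
        faults++;
        if (used < nframes) {
            frames[used++] = refs[i];   /* free frame still available */
        } else {
            frames[head] = refs[i];     /* evict the oldest page */
            head = (head + 1) % nframes;
        }
    }
    return faults;
}

int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    int n = sizeof refs / sizeof refs[0];
    printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));  /* 9  */
    printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));  /* 10 */
    return 0;
}
```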
The LRU algorithm approximates optimal behaviour by tracking page usage recency:
• Replacement policy: On page fault, evicts the page with the oldest last-access
timestamp
Using the reference string 1, 2, 3, 4, 1, 2, 5 with three frames, LRU produces:
[1, 2, 3] (faults load 1, 2, 3)
[2, 3, 4] (replace 1)
[3, 4, 1] (replace 2)
[4, 1, 2] (replace 3)
[1, 2, 5] (replace 4)
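The same trace can be reproduced mechanically; this sketch timestamps each access and evicts the oldest timestamp on a fault:

```c
/* LRU simulation: evict the page with the oldest last-access time. */
#include <stdio.h>

#define NFRAMES 3

int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 2, 5};
    int n = sizeof refs / sizeof refs[0];
    int frames[NFRAMES], last[NFRAMES];  /* page and last-access time */
    int used = 0, faults = 0;

    for (int t = 0; t < n; t++) {
        int hit = -1;
        for (int f = 0; f < used; f++)
            if (frames[f] == refs[t]) hit = f;
        if (hit >= 0) { last[hit] = t; continue; }  /* refresh recency */
        faults++;
        if (used < NFRAMES) {
            frames[used] = refs[t]; last[used] = t; used++;
        } else {
            int lru = 0;  /* index of the least recently used page */
            for (int f = 1; f < NFRAMES; f++)
                if (last[f] < last[lru]) lru = f;
            printf("Replace %d with %d\n", frames[lru], refs[t]);
            frames[lru] = refs[t]; last[lru] = t;
        }
    }
    printf("LRU faults: %d\n", faults);
    return 0;
}
```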
This priority ordering comes from clock-style algorithms that consult a dirty (modified) bit alongside the reference bit. Replacement priority: Unreferenced + Clean > Unreferenced + Dirty > Referenced + Clean > Referenced + Dirty.
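A hedged sketch of such a victim scan, assuming per-frame reference and dirty bits; the frame contents are illustrative:

```c
/* Hypothetical enhanced second-chance victim selection. Lower class
 * means better victim: 0 = unreferenced+clean, 1 = unreferenced+dirty,
 * 2 = referenced+clean, 3 = referenced+dirty. */
#include <stdio.h>

#define NFRAMES 4

struct frame { int page, referenced, dirty; };

int pick_victim(struct frame *f, int *hand) {
    for (int pass = 0; pass < 4; pass++) {
        int want_dirty = pass % 2;  /* clean first, then dirty */
        for (int i = 0; i < NFRAMES; i++) {
            int idx = (*hand + i) % NFRAMES;
            if (!f[idx].referenced && f[idx].dirty == want_dirty) {
                *hand = (idx + 1) % NFRAMES;
                return idx;
            }
        }
        /* After scanning both unreferenced classes, clear reference
         * bits so a victim is guaranteed on the next two passes. */
        if (pass % 2 == 1)
            for (int i = 0; i < NFRAMES; i++) f[i].referenced = 0;
    }
    return *hand;  /* unreachable once the bits are cleared */
}

int main(void) {
    struct frame f[NFRAMES] = {
        {10, 1, 0}, {11, 0, 1}, {12, 0, 0}, {13, 1, 1}
    };
    int hand = 0;
    int v = pick_victim(f, &hand);
    printf("Evict page %d (frame %d)\n", f[v].page, v);  /* page 12 */
    return 0;
}
```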
Working set algorithms optimize memory usage by tracking the actively used pages
of each process. Unlike pure page replacement policies (e.g., LRU, Clock), they focus on:
• Identifying a process’s working set: The set of pages actively referenced in a given
time window.
• Memory trimming: Gradually reclaiming pages from processes that exceed their
working set.
This approach reduces thrashing by ensuring processes retain their actively used pages while
inactive pages are reclaimed.
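As a sketch of the bookkeeping involved, the following computes the working set size W(t, Δ), the number of distinct pages referenced in the last Δ references; the window size and the trace are assumed for illustration:

```c
/* Working-set tracking over a reference trace. */
#include <stdio.h>

#define DELTA 4      /* window size (assumed) */
#define MAXPAGE 16

int working_set_size(const int *refs, int t) {
    int seen[MAXPAGE] = {0}, size = 0;
    int start = (t - DELTA + 1 > 0) ? t - DELTA + 1 : 0;
    for (int i = start; i <= t; i++)          /* scan the window */
        if (!seen[refs[i]]) { seen[refs[i]] = 1; size++; }
    return size;
}

int main(void) {
    int refs[] = {1, 2, 1, 3, 1, 2, 4, 4, 4, 4, 1};
    int n = sizeof refs / sizeof refs[0];
    for (int t = 0; t < n; t++)
        printf("t=%2d ref=%d |W|=%d\n", t, refs[t],
               working_set_size(refs, t));
    return 0;
}
```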
At any given time, only a few pages of any process are in the main memory, and
therefore, more processes can be maintained in memory. Furthermore, time is saved because
unused pages are not swapped in and out of memory. However, the OS must be clever
about how it manages this scheme. In the steady state practically all of the main memory
will be occupied with process pages, so that the processor and OS have direct access to as
many processes as possible. Thus when the OS brings one page in, it must throw another
out. If it throws out a page just before it is used, then it will just have to get that page again
almost immediately. Too much of this leads to a condition called Thrashing. The system
spends most of its time swapping pages rather than executing instructions. So a good page
replacement algorithm is required.
In Fig. 5, as the degree of multiprogramming increases up to a certain point (λ), CPU utilization is very high and system resources are close to fully utilized. If the degree of multiprogramming is increased beyond that point, however, CPU utilization falls drastically: the system spends most of its time on page replacement, and the time taken to complete the execution of processes increases. This situation is called thrashing.
Linux's paging design was originally intended to serve the 64-bit Alpha processor, which provided the hardware support needed for three levels of paging. Linux uses a hierarchical three-level page table structure that is platform-independent and consists of the following types of tables, each one page in size. The three levels are:
1. Page global directory: (Fig. 6) Each active process has a single page global
directory, and this directory must be resident in one page in main memory for an
active process. Each entry in this directory points to one page of the page middle
directory.
2. Page middle directory: (Fig. 6) Each entry in the page middle directory points
to one page in the page table. This directory may span multiple pages.
3. Page table: (Fig. 6) Each entry in the page table maps one virtual page of the process to a physical page frame. The page table may also span multiple pages.
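As an illustration of how a virtual address is split across the three levels, here is a hedged sketch; the field widths (2/9/9/12 bits) are assumptions for demonstration, not the layout Linux uses on any particular architecture:

```c
/* Hypothetical three-level split of a 32-bit virtual address into page
 * global directory, page middle directory, and page table indices. */
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 12  /* 4 KB pages */
#define PT_BITS      9
#define PMD_BITS     9
#define PGD_BITS     2  /* 2 + 9 + 9 + 12 = 32 */

int main(void) {
    uint32_t vaddr = 0xC0ABCDEF;
    uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);
    uint32_t pt  = (vaddr >> OFFSET_BITS) & ((1u << PT_BITS) - 1);
    uint32_t pmd = (vaddr >> (OFFSET_BITS + PT_BITS)) & ((1u << PMD_BITS) - 1);
    uint32_t pgd = vaddr >> (OFFSET_BITS + PT_BITS + PMD_BITS);

    /* A page walk would use pgd to find a page middle directory, pmd
     * to find a page table, pt to find the frame, then add offset. */
    printf("pgd=%u pmd=%u pt=%u offset=0x%x\n",
           (unsigned)pgd, (unsigned)pmd, (unsigned)pt, (unsigned)offset);
    return 0;
}
```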
Linux uses a page size of 4 Kbytes. For speedy allocation and deallocation of contiguous page frames (for mapping contiguous blocks of pages), it uses a buddy system allocator with a group of fixed block sizes of 1, 2, 4, 8, 16, or 32 page frames. The buddy system allocator is also advantageous for traditional I/O operations involving DMA that require contiguous allocation of main memory.
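A small sketch of the buddy sizing rule: a request is rounded up to the next power-of-two block of page frames, which is what makes splitting blocks and coalescing freed "buddies" cheap (the splitting and merging themselves are omitted here):

```c
/* Buddy-system sizing: round a request up to the next power of two. */
#include <stdio.h>

int buddy_frames(int frames_wanted) {
    int block = 1;
    while (block < frames_wanted) block <<= 1;  /* next power of two */
    return block;  /* caller would check this stays within 32 frames */
}

int main(void) {
    int requests[] = {1, 3, 5, 9, 17, 32};
    for (int i = 0; i < 6; i++)
        printf("%2d frames requested -> %2d-frame block\n",
               requests[i], buddy_frames(requests[i]));
    return 0;
}
```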
Linux essentially uses the clock algorithm described earlier, with the slight change that the reference bit associated with each page frame in memory is replaced by an 8-bit age variable. Each time a page is accessed, its age variable is incremented. In the background, Linux periodically sweeps through the global page pool and decrements the age variable of each page it traverses. Thus, the lower the value of a page's age variable, the higher its probability of being removed at replacement time; a larger age value implies the page is less eligible for removal when replacement is required. In this way, Linux implements a form of least frequently used (LFU) policy.
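A hedged sketch of this age-variable policy; the increment/decrement amounts and the tiny page pool are illustrative, and the real kernel's bookkeeping is considerably more involved:

```c
/* Age-based (LFU-like) replacement: accesses raise a page's 8-bit age,
 * a periodic sweep decays every age, lowest age is evicted first. */
#include <stdio.h>
#include <stdint.h>

#define NPAGES 4

uint8_t age[NPAGES];  /* zero-initialized ages */

void on_access(int page) {
    if (age[page] < 255) age[page]++;  /* reward recent/frequent use */
}

void periodic_sweep(void) {
    for (int p = 0; p < NPAGES; p++)
        if (age[p] > 0) age[p]--;      /* background decay */
}

int pick_victim(void) {
    int victim = 0;
    for (int p = 1; p < NPAGES; p++)
        if (age[p] < age[victim]) victim = p;  /* lowest age */
    return victim;
}

int main(void) {
    on_access(0); on_access(0); on_access(2);  /* page 0 is hot */
    periodic_sweep();                          /* ages become 1, 0, 0, 0 */
    printf("evict page %d\n", pick_victim());  /* a zero-age page (1) */
    return 0;
}
```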
A Linux system always tries to maintain a sufficient number of free page frames so that page faults can be serviced quickly from this pool. For this purpose it uses two lists, the active list and the inactive list, and takes steps to keep the active list at about two-thirds the size of the inactive list. When the number of free page frames falls below a lower threshold, it executes a series of actions until enough page frames have been freed. As usual, a page frame is moved from the inactive list to the active list when it is referenced.
Linux also uses the buddy algorithm for kernel memory in units of one or more pages, in a way similar to the page allocation mechanism used for the virtual memory of user processes; the minimum amount allocated is one page. To satisfy requests for the odd sizes of small, short-term memory that the kernel sometimes needs, the memory allocator implements a different approach in addition to the existing one: slab allocation (the slab allocator was discussed earlier), which provides chunks of memory smaller than a page from within an allocated page. The size of a slab is always a power of 2 and depends on the page size. On a machine based on the Pentium x86 processor, the page size is 4 Kbytes, and the slab sizes that can be allocated within a page range from 32 to 4096 bytes.
Modern GPU driver stacks manage memory efficiently within their virtual memory systems. The Nouveau/NVIDIA driver stack
implements Unified Memory (UM) with Heterogeneous Memory Management (HMM),
allowing seamless CPU-GPU memory access while supporting demand paging through
GPU page faults. AMD’s ROCm memory architecture enhances this with fine-grained
page migration and Shared Virtual Memory (SVM) capabilities, particularly beneficial for
APU systems. Both approaches leverage IOMMU/SMMU integration for secure memory
mapping, utilizing DMA-BUF heaps for zero-copy transfers between devices and RDMA-
aware page pinning to minimize latency in high-performance computing scenarios. These
implementations differ significantly between Windows and Linux, with Linux typically
offering more granular control through its open-source driver ecosystem.
Windows provides various types of page table organization and uses different page table formats for different system architectures. It uses two-level, three-level, and even four-level page tables, and consequently the virtual addresses used for addressing also take different formats to match these differently organized page tables.
Windows allows a process to occupy the entire user space of 2 gigabytes (minus 128
Kbytes) when it is created. This space is divided into fixed-size pages. But the sizes of
pages may be different, from 4 to 64 Kbytes, depending on the processor architecture. For
example, 4 Kbytes is used on Intel, PowerPC, and MIPS platforms, while in DEC Alpha
systems, pages are 8 Kbytes in size.
In this scheme (Fig. 7), a 32-bit virtual address is divided into three parts: a 10-bit page directory index, a 10-bit page table index, and a 12-bit page offset (for 4 KB pages). The virtual address space is 4 GB, requiring 2^20 pages (4 GB ÷ 4 KB). Each page table entry is 4 bytes, so a full single-level page table would need 4 MB (2^20 × 4 B). To avoid storing such large tables in memory, a two-level paging system is used:
• A page directory with 2^10 entries (one 4 KB page) points to 1,024 page tables.
• Each page table also has 2^10 entries, each mapping an individual 4 KB page.
• Thus, the directory maps 2^10 × 2^10 = 2^20 pages, covering the full 4 GB virtual space.
The page directory (root table) is always resident in memory, while the actual page tables
and pages can be swapped in/out as needed.
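The arithmetic of the 10/10/12 split can be shown directly; this sketch only decomposes an example address and consults no real page tables:

```c
/* Decomposing a 32-bit virtual address under the 10/10/12 two-level
 * scheme described above (4 KB pages). */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t vaddr = 0x12345678;

    uint32_t dir_index   = vaddr >> 22;            /* top 10 bits    */
    uint32_t table_index = (vaddr >> 12) & 0x3FF;  /* middle 10 bits */
    uint32_t offset      = vaddr & 0xFFF;          /* low 12 bits    */

    /* 1024 directory entries x 1024 table entries x 4 KB = 4 GB. */
    printf("PD index = %u, PT index = %u, offset = 0x%03X\n",
           (unsigned)dir_index, (unsigned)table_index, (unsigned)offset);
    return 0;
}
```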
At the time of handling the sharing of pages, the pages to be shared are represented as
section objects held in a section of memory. Processes that share the section object have their
own individual view of this object. A view controls the part of the object that the process wants
to view. A process maps a view of a section into its own address space by issuing a system
(kernel) call with parameters indicating the part of the section object that is to be mapped
(in fact, an offset), the number of bytes to be mapped, and the logical address in the
address space of the process where the object is to be mapped. When a view is accessed
for the first time, the kernel allocates memory to that view unless memory is already
allocated to it. If the memory section to be shared carries the based attribute, the shared memory has the same virtual address in the logical address space of every sharing process.
Windows uses the variable allocation, local scope scheme (see replacement scope, described earlier) to manage its resident set. As usual, when a process is first activated, it is
allocated a certain number of page frames as its working set. When a process references a page
not in main memory, the virtual memory manager resolves this page-fault situation by
adjusting the working set of the process using the following standard procedures:
• When a sufficient amount of main memory is available, the virtual memory manager
simply offers an additional page frame to bring in the new page as referenced without
swapping out any existing page of the faulting process. This eventually results in
an increase in the size of the resident set of the process.
• When there is a dearth of available memory space, the virtual memory manager swaps
less recently used pages out of the working set of the process to make room for the new
page to be brought into memory. This ultimately reduces the size of the resident set
of the process.
Windows implements advanced GPU memory paging through its DirectX 12 Ultimate
and WDDM 3.0 architecture. The system employs GPU Page Fault Isolation, allowing
precise handling of memory access violations while supporting tiered memory hierarchies
that combine VRAM, system RAM, and persistent NVDIMM storage. Key innovations
include:
• WDDM 3.0 Scheduler implements predictive page prefetching using usage pattern
analysis
• Hardware: Dell R750xa, Dual Xeon 8380, 512GB RAM, A100 GPUs
The experimental results provide detailed insights into the performance characteristics of
Windows and Linux virtual memory management across multiple dimensions.
• Page Faults: Windows 1.2 µs (minor), 8.7 µs (swap) vs. Linux 1.45 µs (minor), 9.3 µs (swap)
• Swap I/O:
Table 3. Trade-offs

Category        Windows Strength       Linux Advantage
Latency         Consistent response    NUMA optimization
Throughput      Sequential I/O         Concurrent operations
ML Support      DX12 integration       Huge page management
Configuration   Automated              Granular control
This case study reveals how fundamental design philosophies translate to measurable
performance differences in virtual memory implementations. Windows’ centralized VMM
architecture, optimized for desktop responsiveness, delivers superior deterministic latency (12-22% better in real-time scenarios) through techniques like Xpress compression and WDDM
3.0 scheduling. Conversely, Linux’s decentralized approach achieves 15-28% better memory
efficiency and throughput via transparent huge pages and NUMA-aware allocations. The
findings validate theoretical memory hierarchy principles. Windows’ working set model aligns
with locality-of-reference expectations, while Linux’s overcommit capabilities demonstrate
the practical value of probabilistic allocation.
Study limitations include hardware-specific results (tested only on x86-64) and version
dependencies (Windows 11/22H2, Linux 5.19). Future work should investigate ARM
implementations, persistent memory integration, and ML-specific optimizations like Tensor-aware page prefetching. These insights empower system architects to make informed OS
selections - Windows for latency-critical applications like real-time inference, Linux for
memory-bound workloads such as distributed training - while suggesting opportunities for
cross-platform learning, particularly in GPU memory management.