UNIT-2
Memory Hierarchies:
Computer memory can be divided into five major levels based on speed as well as use. The processor can move from any one level to another on the basis of its requirements. These five levels in a system's memory hierarchy are registers, cache memory, main memory, magnetic disk, and magnetic tape.
What are the Design and Characteristics of Memory Hierarchy?
Memory Hierarchy, in computer system design, is an enhancement that organises the memory so that access time is minimised. The Memory Hierarchy was developed based on a property of program behaviour known as locality of reference. The figure below illustrates the various levels of the memory hierarchy:

[Figure: Memory Hierarchy Design]

This hierarchy design of memory is divided into two main types:
External or Secondary Memory
It consists of magnetic tape, optical disks, and magnetic disks, i.e. the peripheral storage devices that are accessible to the system's processor via an I/O module.
Internal Memory or Primary Memory
It consists of CPU registers, Cache Memory, and Main Memory. It is
accessible directly by the processor.
Characteristics of Memory Hierarchy
One can infer these characteristics of a Memory Hierarchy Design from the
figure given above:
1. Capacity
It refers to the total volume of data that a system’s memory can store. The
capacity increases moving from the top to the bottom in the Memory
Hierarchy.
2. Access Time
It refers to the time interval between a read/write request and the availability of the data. Access time increases as we move from the top to the bottom of the Memory Hierarchy.
3. Performance
Earlier, when computer systems were designed without a Memory Hierarchy, the speed gap between the CPU registers and Main Memory kept growing because of the large difference in their access times. This resulted in lower system performance, and an enhancement was required. That enhancement came in the form of Memory Hierarchy Design, which increased the system's performance. One of the primary ways to increase system performance is to minimise how far down the memory hierarchy the processor must go to manipulate data.
4. Cost per bit
The cost per bit increases as one moves from the bottom to the top in the
Memory Hierarchy, i.e. External Memory is cheaper than Internal Memory.
Design of Memory Hierarchy
In computers, the memory hierarchy primarily includes the following:
1. Registers
Registers are usually static RAM (SRAM) in the computer processor and hold the data word, which is typically 64 or 128 bits. Most processors use a status word register and an accumulator. The accumulator is primarily used to store the results of arithmetic operations, and the status word register is primarily used for decision making.
2. Cache Memory
The cache holds chunks of frequently used information from the main memory and is also found inside the processor. A single-core processor rarely has multiple cache levels; present multi-core processors typically provide two private cache levels for every individual core, plus one further level that is shared among all the cores. (A short C sketch of cache-friendly access appears after this list.)
3. Main Memory
In a computer, the main memory is the memory unit that communicates directly with the CPU. It is the primary storage unit of a computer system: a fast and fairly large memory used for storing programs and data throughout the computer's operation. This type of memory is made up of both RAM and ROM.
4. Magnetic Disks
In a computer, magnetic disks are circular plates fabricated from metal or plastic and coated with a magnetisable material. Both faces of a disk are frequently used, and several disks can be stacked on a single spindle, with read/write heads available for each surface. All the disks rotate together at high speed.
5. Magnetic Tape
Magnetic tape is a conventional magnetic recording medium consisting of a thin magnetisable coating on a long, narrow strip of plastic film. It is used mainly to back up huge volumes of data. When a computer needs to access a tape, it must first be mounted; once the data has been read, it is unmounted. Access to data on magnetic tape is far slower than for the other memory levels, and it can take minutes to reach a given position on a tape.
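As the sketch promised in the cache memory item above (a hypothetical C example, not part of the source text): the two loops below compute the same sum over the same array, but the first walks memory in storage order and reuses each fetched cache line, while the second strides across rows and typically runs several times slower.

    #include <stdio.h>

    #define N 1024
    static double a[N][N];   /* C stores this array row-major */

    int main(void) {
        double sum = 0.0;

        /* Cache-friendly: consecutive iterations touch adjacent
           addresses, so every fetched cache line is fully used. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Cache-hostile: consecutive iterations are N*8 bytes
           apart, so almost every access misses in the cache. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }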
Multi-core Processors: Homogeneous and Heterogeneous:
A multicore processor is an integrated circuit that has two or more
processor cores attached for enhanced performance and reduced power
consumption. These processors also enable more efficient simultaneous
processing of multiple tasks, such as with parallel
processing and multithreading. A dual core setup is similar to having
multiple, separate processors installed on a computer. However, because
the two processors are plugged into the same socket, the connection
between them is faster.
The use of multicore processors or microprocessors is one approach to boosting processor performance without exceeding the practical limits of semiconductor design and fabrication. Using multiple cores also helps keep concerns such as heat generation within safe operating limits.
How do multicore processors work?
The heart of every processor is an execution engine, also known as a core. The
core is designed to process instructions and data according to the direction of
software programs in the computer's memory. Over the years, designers found that
every new processor design had limits. Numerous technologies were developed to
accelerate performance, including the following ones:
• Clock speed. One approach was to make the processor's clock faster.
The clock is the "drumbeat" used to synchronize the processing of
instructions and data through the processing engine. Clock speeds have
accelerated from several megahertz to several gigahertz (GHz) today.
However, transistors use up power with each clock tick. As a result,
clock speeds have nearly reached their limits given current
semiconductor fabrication and heat management techniques.
• Hyper-threading. Another approach involved the handling of multiple
instruction threads. Intel calls this hyper-threading. With hyper-
threading, processor cores are designed to handle two separate
instruction threads at the same time. When properly enabled and
supported by both the computer's firmware and operating system (OS),
hyper-threading techniques enable one physical core to function as two
logical cores. Still, the processor only possesses a single physical core.
The logical abstraction of the physical processor added little real
performance to the processor other than to help streamline the behavior
of multiple simultaneous applications running on the computer.
• More chips. The next step was to add processor chips -- or dies -- to the
processor package, which is the physical device that plugs into the
motherboard. A dual-core processor includes two separate processor
cores. A quad-core processor includes four separate cores. Today's
multicore processors can easily include 12, 24 or even more processor
cores. The multicore approach is almost identical to the use of
multiprocessor motherboards, which have two or four separate processor
sockets. The effect is the same. Today's high processor performance comes from products that combine fast clock speeds with multiple hyper-threaded cores.
Multicore processors have multiple processing units incorporated in them. They connect
directly with their internal cache, as well as with the system bus and memory.
However, multicore chips have several issues to consider. First, the addition of
more processor cores doesn't automatically improve computer performance. The
OS and applications must be written so that their instructions recognize and use the multiple cores, dispatching work in parallel as separate threads to different cores within the processor package. Some software applications may need to be refactored to support and use multicore processor platforms; otherwise, only the first processor core is used, and any additional cores sit idle.
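A minimal sketch of this point (illustrative C code using POSIX threads; NTHREADS is a value chosen here for the example, and the program links with -lpthread): unless the application explicitly creates extra threads, the additional cores have nothing to run.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4   /* e.g. one thread per core */

    /* Each thread executes this function; the OS is free to
       schedule the threads on different physical cores. */
    static void *worker(void *arg) {
        long id = (long)arg;
        printf("worker %ld running\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        /* Without the threads above, only one core does work. */
        return 0;
    }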
Second, the performance benefit of additional cores is not a direct multiple: adding a second core does not double the processor's performance, nor does a quad-core processor quadruple it.
This happens because of the shared elements of the processor, such as access to
internal memory or caches, external buses and computer system memory.
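One common way to quantify this effect, added here as an illustration (the text above does not name it), is Amdahl's law: if a fraction p of a program's work can run in parallel on n cores, the overall speedup is S(n) = 1 / ((1 - p) + p/n). For example, with p = 0.9 on four cores, S(4) = 1 / (0.1 + 0.9/4) ≈ 3.1 rather than 4, and no number of cores can push the speedup past 1/(1 - p) = 10.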
The benefit of multiple cores can be substantial, but there are practical limits. Still,
the acceleration is typically better than a traditional multiprocessor system because
the coupling between cores in the same package is tighter and there are shorter
distances and fewer components between cores.
Consider the analogy of cars on a road. Each car might be a processor, but each car
must share the common roads and traffic limitations. More cars can transport more
people and goods in a given time, but more cars also cause congestion and other
problems.
What are multicore processors used for?
Multicore processors work on any modern computer hardware platform, and virtually all PCs and laptops today include some multicore processor model. However, the
true power and benefit of these processors depend on software applications
designed to emphasize parallelism. A parallel approach divides application work
into numerous processing threads, and then distributes and manages those threads
across two or more processor cores.
There are several major use cases for multicore processors, including the following
five:
1. Virtualization. A virtualization platform, such as VMware, is designed
to abstract the software environment from the underlying hardware.
Virtualization is capable of abstracting physical processor cores into
virtual processors or central processing units (vCPUs) which are then
assigned to virtual machines (VMs). Each VM becomes a virtual server
capable of running its own OS and application. It is possible to assign
more than one vCPU to each VM, allowing each VM and its application
to run parallel processing software if desired.
2. Databases. A database is a complex software platform that frequently
needs to run many simultaneous tasks such as queries. As a result,
databases are highly dependent on multicore processors to distribute and
handle these many task threads. The use of multiple processors in
databases is often coupled with extremely high memory capacity that can
reach 1 terabyte or more on the physical server.
3. Analytics and HPC. Big data analytics, such as machine learning, and
high-performance computing (HPC) both require breaking large,
complex tasks into smaller and more manageable pieces. Each piece of
the computational effort can then be solved by distributing each piece of
the problem to a different processor. This approach enables each
processor to work in parallel to solve the overarching problem far faster
and more efficiently than with a single processor.
4. Cloud. Organizations building a cloud will almost certainly adopt
multicore processors to support all the virtualization needed to
accommodate the highly scalable and highly transactional demands of
cloud software platforms such as OpenStack. A set of servers with
multicore processors can allow the cloud to create and scale up more
VM instances on demand.
5. Visualization. Graphics applications, such as games and data-rendering
engines, have the same parallelism requirements as other HPC
applications. Visual rendering is math- and task-intensive, and
visualization applications can make extensive use of multiple processors
to distribute the calculations required. Many graphics applications rely
on graphics processing units (GPUs) rather than CPUs. GPUs are
tailored to optimize graphics-related tasks. GPU packages often contain
multiple GPU cores, similar in principle to multicore processors.
Homogenous vs. heterogeneous multicore processors
The cores within a multicore processor may be homogeneous or heterogeneous.
Mainstream Intel and AMD multicore processors for x86 computer architectures
are homogeneous and provide identical cores. Consequently, most discussion of multicore processors is about homogeneous processors.
However, dedicating a complex general-purpose core to a simple job is often wasteful and inefficient. There is a heterogeneous multicore processor market that uses different cores for different purposes. Heterogeneous cores are generally found in embedded or Arm processors that might mix microprocessor and microcontroller cores in the same package.
There are three general goals for heterogeneous multicore processors:
1. Optimized performance. While homogeneous multicore processors are
typically intended to provide vanilla or universal processing capabilities,
many processors are not intended for such generic system use cases.
Instead, they are designed and sold for use in embedded -- dedicated or
task-specific -- systems that can benefit from the unique strengths of
different processors. For example, a processor intended for a signal
processing device might use an Arm processor that contains a Cortex-A
general-purpose processor with a Cortex-M core for dedicated signal
processing tasks.
2. Optimized power. Providing simpler processor cores reduces the
transistor count and eases power demands. This makes the processor
package and the overall system cooler and more power-efficient.
3. Optimized security. Jobs or processes can be divided among different
types of cores, enabling designers to deliberately build high levels of
isolation that tightly control access among the various processor cores.
This greater control and isolation offer better stability and security for
the overall system, though at the cost of general flexibility.
Shared-memory Symmetric Multiprocessors:
Shared-memory symmetric multiprocessors (SMPs) play a significant role in high-
performance computing (HPC) environments. SMP architectures are designed to provide
multiple processors or cores with access to a shared main memory, allowing them to work
concurrently on a single task or multiple tasks. Here's an overview of how shared-memory
SMPs are used in HPC:
1. Parallelism and Scalability: SMPs excel at exploiting parallelism in HPC applications. By
dividing computational tasks into smaller units or threads, these threads can be executed in
parallel across multiple processors or cores within an SMP system. This parallel execution
allows for improved performance and efficient utilization of resources. Moreover, SMPs can
scale up by increasing the number of processors or cores, enabling HPC systems to handle
increasingly complex and demanding workloads.
2. Memory Access and Data Sharing: The shared-memory architecture of SMPs simplifies
memory access and data sharing between processors. All processors have equal access to the
entire memory space, which facilitates efficient communication and data sharing among
threads. This characteristic is particularly advantageous for applications that require frequent
data exchange or synchronization, such as iterative algorithms, numerical simulations, and
parallel data processing.
3. Synchronization and Communication: SMPs provide mechanisms for synchronization and
communication between threads or processes running on different processors. This allows for
coordination and cooperation among parallel tasks, enabling efficient parallelization of
algorithms and minimizing overheads associated with inter-processor communication.
4. Programming Models and Libraries: SMPs in HPC often support various programming
models and libraries that facilitate the development of parallel applications. Examples include
OpenMP, which allows developers to express parallelism using shared-memory constructs,
and MPI (Message Passing Interface), which enables communication and coordination
between distributed memory systems. These programming models and libraries simplify the task of parallel programming and enable software developers to harness the power of shared-memory SMPs effectively (a minimal OpenMP sketch follows this list).
5. NUMA Architectures: Non-Uniform Memory Access (NUMA) architectures are a variant
of SMPs that incorporate multiple memory banks or nodes, each associated with a subset of
processors. NUMA architectures optimize memory access by minimizing latency and
maximizing bandwidth for each processor's local memory. This design improves performance
for applications with locality of memory references, while still providing the shared-memory
programming model for inter-node communication.
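Here is the minimal OpenMP sketch of shared-memory parallelism on an SMP promised above (illustrative only; compile with an OpenMP-capable compiler, e.g. gcc -fopenmp): every thread reads the same shared array, and the reduction clause performs the synchronization of the per-thread partial sums.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double x[N];
        for (int i = 0; i < N; i++)
            x[i] = 1.0;

        double sum = 0.0;
        /* All threads share x; OpenMP divides the iterations
           among them and combines the per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += x[i];

        printf("sum = %f using up to %d threads\n",
               sum, omp_get_max_threads());
        return 0;
    }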
Shared-memory SMPs remain a widely used and effective computing paradigm in HPC,
providing a balance between parallelism, memory access, and communication. However, it's
important to note that other computing models, such as distributed memory systems and
hybrid approaches combining shared-memory and distributed-memory models, are also
prevalent in HPC to address the challenges of scalability and performance for larger-scale
applications.
Vector Computers:
Vector computers have played a significant role in the history of high-performance
computing (HPC). These machines were specifically designed to exploit vector processing
capabilities, where operations are performed on arrays or vectors of data elements
simultaneously, resulting in efficient parallel execution. Let's delve into more detailed content
about vector computers in HPC:
1. Vector Processing Architecture: Vector computers employ a vector processing
architecture, also known as array processing. In this architecture, instructions operate on
multiple data elements simultaneously, stored in vector registers. Vector registers are large
registers capable of holding a significant number of data elements, typically ranging from a
few dozen to thousands.
2. Data-Level Parallelism: Vector architectures excel at executing computations that can be
parallelized. Instead of processing one data element at a time, vector instructions enable
operations on entire vectors or array sections with a single instruction. This data-level
parallelism allows for a higher degree of parallel execution and improved performance for
suitable applications.
3. Vector Instructions: Vector instructions encompass a range of operations, including vector
addition, subtraction, multiplication, division, and other arithmetic and logical operations.
These instructions operate on entire vectors or portions of vectors, allowing for efficient
computation across multiple data elements simultaneously.
4. Performance Benefits: Vector computers achieved high performance by exploiting data-
level parallelism and efficient memory access patterns. By processing multiple data elements
simultaneously, vector architectures provided significant speedups over scalar processors for
applications that could take advantage of vectorization. This made them particularly suitable
for scientific simulations and computations in fields like physics, weather forecasting,
computational chemistry, and other data-intensive applications.
5. Memory Bandwidth and Capacity: Memory bandwidth is crucial for vector performance,
and vector architectures incorporated specialized memory subsystems to support high data
transfer rates. Vector computers typically featured large memory capacities to accommodate
the large datasets required by scientific simulations and computations. High memory capacity
allowed for efficient processing of vast amounts of data.
6. Programming Models and Languages: Software development for vector computers often
required specific programming models and languages. Fortran, a popular programming
language for scientific computing, often had vector extensions to take advantage of vector
architectures. These extensions provided additional directives and constructs to facilitate
vectorization and maximize performance.
7. Compiler Support: Vectorizing compilers played a crucial role in automatically identifying
and exploiting vectorization opportunities in code. These compilers could recognize loops
that could be vectorized and generate vector instructions accordingly. Compiler optimizations
and techniques, such as loop unrolling and loop interchange, aided in enabling vectorization
and improving performance.
8. Transition and Decline: Despite their high performance, vector computers faced challenges
as general-purpose microprocessors and parallel processing architectures evolved. The rise of
scalar and parallel processing architectures, along with advancements in commodity CPUs
and GPUs, led to a decline in dedicated vector machines. However, vectorization techniques
and concepts continue to be relevant in modern HPC architectures, where SIMD (Single
Instruction, Multiple Data) units in CPUs and GPUs provide vectorized execution
capabilities.
9. Modern SIMD and GPU Architectures: Modern CPUs and GPUs often feature vectorized
instructions and SIMD execution units, allowing processors to operate on multiple data
elements simultaneously, similar to the vector operations in traditional vector computers.
GPUs, in particular, have become popular in HPC due to their high parallelism and
vectorized execution capabilities. Programming models and libraries like CUDA and
OpenCL facilitate programming and utilization of SIMD and GPU architectures in HPC
applications.
10. Vectorization Techniques: In modern HPC, vectorization techniques are employed to
optimize critical sections of code and improve performance on SIMD and GPU architectures.
Identifying vectorization opportunities involves analyzing loops and data dependencies to
determine if parallel execution is possible. Efficient vectorization often requires data to be stored in a contiguous and aligned manner to achieve optimal memory access. Compiler directives or pragmas, such as OpenMP's '#pragma omp simd', can guide the compiler to vectorize a particular loop.
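A minimal C sketch of such a directive (assumptions: an OpenMP-capable compiler and non-overlapping arrays): the pragma asks the compiler to emit SIMD instructions for the loop, since each iteration is independent.

    #include <stdio.h>

    #define N 1024

    /* SAXPY-style kernel, a classic vectorizable loop: the pragma
       tells the compiler the iterations are independent, so it can
       operate on several elements per vector instruction. */
    static void saxpy(float a, const float *x, float *y, int n) {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        static float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy(3.0f, x, y, N);
        printf("%f\n", y[0]);   /* prints 5.000000 */
        return 0;
    }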
Distributed Memory Computers:
Distributed memory computers play a crucial role in high-performance computing (HPC) by
enabling the parallel processing of large-scale computational problems. These computers are
designed to work with distributed memory systems, where each processing unit has its own
local memory. Let's explore the concept of distributed memory computers in HPC:
1. Distributed Memory Architecture: In a distributed memory architecture, each processing
unit, often referred to as a node, has its own local memory and operates independently. Nodes
are connected via a high-speed interconnect, such as a network, allowing them to
communicate and exchange data during computation.
2. Scalability: Distributed memory computers excel at scaling up to handle large-scale
problems by adding more nodes to the system. This scalability allows HPC systems to tackle
computationally intensive tasks that require a significant amount of memory and processing
power.
3. Message Passing Interface (MPI): To facilitate communication between nodes, a standard
programming interface called Message Passing Interface (MPI) is commonly used in
distributed memory systems. MPI provides a set of functions and libraries that allow
developers to coordinate the execution of tasks across multiple nodes, exchange data, and
synchronize computation.
4. Parallelism and Load Balancing: Distributed memory computers leverage parallelism by
dividing computational tasks into smaller pieces that can be executed independently on
different nodes. Load balancing techniques are applied to distribute the workload evenly
across nodes, ensuring efficient resource utilization and minimizing idle time.
5. Data Distribution: In distributed memory systems, data is often partitioned and distributed
among the nodes. Each node works on its assigned portion of the data, performing
computations locally. Efficient data distribution strategies are essential to minimize
communication overhead and ensure balanced computation across nodes.
6. Latency and Bandwidth: Communication latency and bandwidth between nodes can
significantly impact the performance of distributed memory computers. High-speed
interconnects and efficient communication protocols are employed to minimize latency and
maximize bandwidth, allowing for efficient data exchange between nodes.
7. Hybrid Approaches: In some cases, distributed memory computers are combined with
shared memory systems, creating hybrid architectures. These hybrid approaches leverage the
strengths of both distributed and shared memory models to achieve high performance and
scalability.
8. Supercomputers and Clusters: Distributed memory computers are commonly used in the
construction of supercomputers and computing clusters. Supercomputers are built using a
massive number of processing nodes, interconnected to form a high-performance computing
infrastructure. Clusters, on the other hand, consist of a collection of individual computers or
servers, each acting as a processing node, interconnected to create a distributed memory
system.
9. HPC Applications: Distributed memory computers are widely used in a range of HPC
applications, including weather forecasting, climate modeling, computational fluid dynamics,
molecular dynamics simulations, and large-scale data analytics. These applications often
require vast amounts of computational power and memory to process complex simulations
and analyze massive datasets.
10. Programming Models: Programming models such as MPI, OpenMP, and hybrid models
like MPI+OpenMP are commonly used to develop parallel applications for distributed
memory computers. These models provide abstractions and libraries to express parallelism, coordinate communication, and optimize performance on distributed memory systems (a minimal MPI sketch follows this list).
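Here is the minimal MPI sketch in C noted above (illustrative; launch with, e.g., mpirun -np 4): each rank works on its own data in its own local memory, and the explicit MPI_Reduce call moves data across the interconnect to combine the results.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each node computes on its own slice of the problem
           using only its local memory. */
        double local = (double)rank;

        /* Explicit communication: combine the partial results
           on rank 0. */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %f\n", size, total);

        MPI_Finalize();
        return 0;
    }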
Distributed memory computers have revolutionized the field of HPC by enabling the parallel
processing of large-scale problems. They provide the necessary scalability and computational
power to tackle complex simulations and data-intensive applications, making them a
cornerstone of modern high-performance computing infrastructure.
Application Accelerators / Reconfigurable Computing:
Reconfigurable Computing (RC) is an interesting paradigm for accelerating applications by mapping algorithms onto programmable hardware.
One of the enabling technologies useful in RC is the field-programmable gate array
(FPGA). Putting FPGAs on add-on cards or motherboards allows them to serve as compute-intensive co-processors. FPGAs can be reconfigured over and over again to perform a multitude of operations, enabling application-specific, dynamically "programmable" hardware accelerators.
A number of scientific and engineering applications find RC technology useful, to name a few: satellite networks with adaptive communication algorithms, scalable computing systems, encryption/decryption engines, and pattern recognition.
C-DAC has pioneered the RC technology for HPC in India through its state-of-the-art
design of hardware, system software and hardware libraries ('Avatars'). Avatars are
dynamically changeable circuits, corresponding to the compute intensive routines of
the application code. C-DAC with its expertise in RC is capable of providing
accelerated solutions for a wide spectrum of scientific and engineering areas.
Applications
Scientific and engineering applications in the areas of fracture mechanics, radio astronomy, and bioinformatics have been ported to RC, providing up to 240X speedup compared with purely software-based solutions. These speedups can increase further depending on the hardware configuration and the nature of the application. To obtain the same performance as RC hardware, one would require a huge cluster of computing nodes.
• A bioinformatics sequence search solution using RC gave results 240 times faster.
• C-DAC's own fracture mechanics code, whose double-precision Cholesky factorization and forward-backward substitution steps were ported to RC, achieved a 16X speedup.
• High-speed data acquisition and signal processing solutions designed for Very Long Baseline Interferometry (VLBI) and power spectrum experimentation in radio astronomy replaced a sizable computing cluster.
• Double-precision matrix multiplication implemented on RC performed better than the Intel Math Kernel Library.
RC Product Features:
• Multi-million-gate FPGAs for mapping compute-intensive portions of application codes
• Standard bus interfaces such as PCI/PCI-X/PCI Express
• System software interface for all standard Linux distributions
• Low power consumption
• FCC and CE compliant
Supercomputers and Petascale Systems
Supercomputers and petascale systems are at the forefront of high-performance computing
(HPC), pushing the boundaries of computational power and enabling groundbreaking
scientific research, simulations, and data analysis. Let's explore these concepts in more detail:
Supercomputers:
1. Definition: Supercomputers are highly advanced computing systems designed to deliver
exceptional performance for solving complex problems. They are capable of performing
massive calculations at unprecedented speeds, surpassing the capabilities of conventional
computers.
2. Parallel Processing: Supercomputers leverage parallel processing, where multiple
processors or cores work together simultaneously on different parts of a problem. This allows
for high-speed computation by breaking down complex tasks into smaller, manageable
pieces.
3. Performance Metrics: Supercomputers are measured in terms of their processing power,
typically quantified in floating-point operations per second (FLOPS). The performance of
supercomputers is often expressed in teraflops (trillions of calculations per second) or petaflops (quadrillions of calculations per second); a worked example follows this list.
4. Architecture: Supercomputers employ a variety of architectures, including vector
processors, multi-core processors, graphics processing units (GPUs), and specialized
accelerators like field-programmable gate arrays (FPGAs). These architectures are optimized
for high-performance and efficient parallel processing.
5. Interconnects: Supercomputers rely on high-speed interconnects to facilitate
communication between processing elements. Interconnect technologies such as InfiniBand
and Cray's Aries enable rapid data exchange and synchronization among distributed
components of the system.
6. Supercomputer Centers: Supercomputers are typically housed in specialized facilities
known as supercomputer centers. These centers provide the necessary infrastructure, cooling
systems, and power supply to support the immense computing power and data storage
requirements of supercomputers.
7. Application Areas: Supercomputers are utilized in a wide range of scientific, engineering,
and research fields. They are instrumental in applications like climate modeling, astrophysics,
computational fluid dynamics, drug discovery, genome sequencing, and nuclear simulations,
where massive computational resources are required.
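As the worked example promised above (a hypothetical machine, not drawn from the text): theoretical peak performance is commonly estimated as nodes × cores per node × clock rate × FLOPs per core per cycle. A system with 1,000 nodes, 64 cores per node, a 2 GHz clock, and 16 double-precision FLOPs per core per cycle would peak at 1,000 × 64 × (2 × 10^9) × 16 ≈ 2.05 × 10^15 FLOPS, i.e. roughly 2 petaflops; sustained application performance is normally only a fraction of this peak figure.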
Petaflop and Exascale Systems:
1. Petaflop Systems: Petaflop-scale systems are supercomputers capable of performing at least one petaflop (1,000 trillion floating-point operations) per second. The first such systems appeared in the late 2000s and early 2010s and marked a significant milestone in HPC.
2. Exascale Systems: Exascale systems represent the next frontier in supercomputing,
targeting computational capabilities of at least one exaflop (1,000 petaflops) per second.
Exascale systems have the potential to deliver unprecedented computational power, enabling
scientific breakthroughs in areas such as weather prediction, energy research, drug discovery,
and more.
3. Challenges: Building exascale systems poses significant challenges, including power
consumption, system resilience, memory management, and efficient parallelization of
applications. Overcoming these challenges requires innovative hardware designs, software
optimizations, and advancements in cooling technologies.
4. International Initiatives: Various countries, including the United States, China, Japan, and
the European Union, have launched initiatives to develop exascale systems. These initiatives
aim to foster collaboration between academia, industry, and government organizations to
realize the potential of exascale computing.
5. Impact on Research: Exascale systems have the potential to revolutionize scientific
research by enabling simulations and data analysis on an unprecedented scale. They will
facilitate advancements in fields such as climate modeling, materials science, genomics,
fusion energy research, and artificial intelligence.
Supercomputers and petascale systems represent the pinnacle of computational power, driving innovation and discovery across numerous scientific disciplines. As the demand for more powerful computing resources continues to grow, the development of exascale systems will push these capabilities even further.
Novel Computers: Stream, Multithreaded, and Purpose-Built:
Novel computers with stream processing, multithreading, and purpose-built architectures
have emerged as powerful tools in high-performance computing (HPC), offering unique
capabilities to address specific computational challenges. Let's explore each of these concepts
in the context of HPC:
1. Stream Computers:
Stream computers, also known as stream processors or stream architectures, are designed to
efficiently process streams of data. They excel at handling data-intensive applications that
involve large amounts of streaming data, such as multimedia processing, signal processing,
and data analytics.
- Stream Processing: Stream computers employ a data-centric approach, where data flows
through a pipeline of processing elements. Each processing element operates on a stream of
data elements simultaneously, executing a series of operations in a pipelined fashion. This
enables efficient parallel processing and high throughput for stream-based workloads.
- Dataflow Model: Stream computers are based on the dataflow model, where computations
are driven by the availability of input data. They focus on exploiting data-level parallelism
and optimizing data movement to maximize performance. This makes them well-suited for
applications with irregular data access patterns or data dependencies.
- Programming Models: Stream programming models, such as CUDA (Compute Unified
Device Architecture) for GPUs and OpenCL (Open Computing Language) for heterogeneous
platforms, provide frameworks for developing applications targeting stream architectures.
These models offer low-level control over the hardware, allowing developers to optimize
performance for specific stream-based workloads.
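To make the dataflow idea concrete, here is a plain-C sketch (illustrative only; a real stream processor implements the stages in hardware and runs them on many elements at once): each element of the input stream flows through a fixed pipeline of simple per-element stages.

    #include <stdio.h>

    #define N 8

    /* Two simple per-element pipeline stages; a stream processor
       would apply such stages to many elements in parallel. */
    static float stage_scale(float v) { return v * 2.0f; }
    static float stage_bias(float v)  { return v + 1.0f; }

    int main(void) {
        float in[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        float out[N];

        /* Data-driven: each element flows through the pipeline
           as soon as it is available. */
        for (int i = 0; i < N; i++)
            out[i] = stage_bias(stage_scale(in[i]));

        for (int i = 0; i < N; i++)
            printf("%f\n", out[i]);
        return 0;
    }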
2. Multithreaded Computers:
Multithreaded computers utilize multiple threads of execution to parallelize computations and
maximize performance. They can significantly enhance the throughput and responsiveness of
applications by overlapping computations, hiding latencies, and exploiting available
parallelism.
- Thread-Level Parallelism: Multithreaded architectures allow concurrent execution of
multiple threads, enabling parallelism at the thread level. Each thread represents an
independent stream of instructions that can be executed simultaneously, taking advantage of
the available processing resources.
- Simultaneous Multithreading (SMT): SMT is a technique employed in some multithreaded
architectures, where multiple threads are executed concurrently on a single core. SMT
enhances resource utilization by leveraging thread-level parallelism and dynamically
scheduling instructions from multiple threads.
- Task Parallelism: Multithreaded architectures can also support task parallelism, where
independent tasks are distributed among multiple threads for simultaneous execution. Task
parallelism is particularly beneficial for irregular workloads or applications with dynamic
task creation.
- Programming Models: Multithreaded programming models, such as OpenMP (Open Multi-
Processing) and pthreads (POSIX threads), provide abstractions for parallel programming and
thread management. These models facilitate the development of multithreaded applications,
allowing developers to express parallelism and optimize performance.
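A short sketch of task parallelism using OpenMP tasks (illustrative; one of several possible models, compiled with an OpenMP-capable compiler such as gcc -fopenmp): independent units of work are handed to a pool of threads rather than being tied to a fixed loop structure.

    #include <omp.h>
    #include <stdio.h>

    /* A stand-in for an independent unit of work. */
    static void do_work(int id) {
        printf("task %d ran on thread %d\n", id, omp_get_thread_num());
    }

    int main(void) {
        #pragma omp parallel
        #pragma omp single        /* one thread creates the tasks */
        {
            for (int i = 0; i < 8; i++) {
                #pragma omp task firstprivate(i)
                do_work(i);       /* any idle thread may run them */
            }
        }   /* implicit barrier: all tasks complete here */
        return 0;
    }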
3. Purpose-Built Computers:
Purpose-built computers are designed with specific workloads or applications in mind,
tailoring the architecture to optimize performance for targeted tasks. In the context of HPC,
purpose-built computers are developed to address the unique requirements of scientific
simulations, data analytics, or other specialized computational domains.
- Customized Architecture: Purpose-built computers often feature specialized hardware or
architectures that are specifically designed to accelerate specific computations. Examples
include graphics processing units (GPUs) for parallel processing, field-programmable gate
arrays (FPGAs) for hardware customization, or application-specific integrated circuits
(ASICs) designed for specific algorithms or tasks.
- Performance Optimization: Purpose-built architectures prioritize performance optimizations
that align with the targeted workloads. This may involve incorporating specialized functional
units, memory hierarchies, interconnects, or instruction sets that cater to the specific
computational requirements.
- Energy Efficiency: Purpose-built architectures can also focus on energy efficiency to
optimize power consumption for specific applications. By tailoring the hardware and
optimizing power management techniques, purpose-built computers can achieve higher
performance per watt for targeted workloads.