Multithreading, SMT and CMP

The document discusses hardware multithreading, explaining its basics, types, and benefits, such as improved processor utilization and latency tolerance. It details fine-grained, coarse-grained, and simultaneous multithreading, highlighting their advantages and disadvantages. Additionally, it covers chip-level multiprocessing (CMP) and the integration of multiple cores or threads, emphasizing the importance of these technologies in modern high-performance computing environments.

Hardware Multithreading:

Basics:
▪ Thread: A thread is a lightweight process with its own instructions and data.

🢭 Instruction stream with state (registers and memory)


🢭 Register state is also called “thread context”
▪ Threads could be part of the same process (program) or from different programs
🢭 Threads in the same program share the same address space (shared memory
model)
▪ Traditionally, the processor keeps track of the context of a single thread
▪ Multitasking: When a new thread needs to be executed, the old thread's context in
hardware is written back to memory and the new thread's context is loaded
▪ Process: A process includes one or more threads, the address space, and the
operating system state.
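The shared-memory model mentioned above can be sketched in software. The sketch below is illustrative only (the counter, the lock, and the thread count are arbitrary choices, not from the text): two threads in one process update the same variable, showing that they see the same address space.

```python
# A minimal sketch of the shared-memory model: two software threads
# in the same process read and write the same variables.
import threading

shared = {"count": 0}          # lives in the process's address space
lock = threading.Lock()        # coordination is needed for shared state

def worker():
    for _ in range(100_000):
        with lock:             # without the lock, updates could be lost
            shared["count"] += 1

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["count"])  # 200000: both threads updated the same memory
```

The lock is what makes the shared updates safe; it is exactly this kind of sharing that distinguishes threads of one process from separate processes with separate address spaces.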
Hardware Multithreading:
▪ Increasing utilization of a processor by switching to another thread when one
thread is stalled.
▪ General idea: Have multiple thread contexts in a single processor
🢭 When the hardware executes from those hardware contexts determines the
granularity of multithreading
▪ Why?
🢭 To tolerate latency (initial motivation)
▪ Latency of memory operations, dependent instructions, branch resolution
▪ By utilizing processing resources more efficiently
🢭 To improve system throughput
▪ By exploiting thread-level parallelism
▪ By improving superscalar processor utilization
🢭 To reduce context switch penalty
▪ Tolerate latency
🢭 When one thread encounters a long-latency operation, the processor can
execute a useful operation from another thread
▪ Benefit
🢭 + Latency tolerance
🢭 + Better hardware utilization (when other threads have independent work to fill stall cycles)
🢭 + Reduced context switch penalty
▪ Cost
🢭 - Requires multiple thread contexts to be implemented in hardware (area,
power, latency cost)
🢭 - Usually reduced single-thread performance
- Resource sharing, contention
- Switching penalty (can be reduced with additional hardware)

Types of Multithreading:
▪ Fine-grained Multithreading
🢭 Cycle by cycle
▪ Coarse-grained Multithreading
🢭 Switch on event (e.g., cache miss)
▪ Simultaneous Multithreading (SMT)
🢭 Instructions from multiple threads executed concurrently in the same cycle

Fine-grained Multithreading:
▪ Idea: Switch to another thread every cycle such that no two instructions from the
same thread are in the pipeline concurrently
▪ Improves pipeline utilization by taking advantage of multiple threads
▪ Alternative way of looking at it: Tolerates the control and data dependency
latencies by overlapping the latency with useful work from other threads
▪ Advantages
+ No need for dependency checking between instructions (only one
instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions from different
threads
+ Improved system throughput, latency tolerance, utilization
▪ Disadvantages
- Extra hardware complexity: multiple hardware contexts, thread selection logic
- Reduced single thread performance (one instruction fetched every N cycles)
- Resource contention between threads in caches and memory
- Dependency checking logic between threads remains (load/store)

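The cycle-by-cycle policy above can be modelled with a few lines of code. This is a toy sketch (the thread names and cycle count are hypothetical): the fetch stage simply picks a different hardware context every cycle, round-robin, which also shows why each thread gets only 1/N of the pipeline bandwidth.

```python
# Toy model of fine-grained multithreading: the fetch stage selects a
# different hardware context every cycle, in round-robin order.
def fine_grained_schedule(threads, cycles):
    """Return which thread issues in each cycle (round-robin)."""
    order = []
    n = len(threads)
    for cycle in range(cycles):
        order.append(threads[cycle % n])
    return order

schedule = fine_grained_schedule(["T0", "T1", "T2", "T3"], 8)
print(schedule)
# ['T0', 'T1', 'T2', 'T3', 'T0', 'T1', 'T2', 'T3']
```

With four contexts, each thread issues once every four cycles: high overall utilization, low single-thread throughput, exactly as listed in the disadvantages above.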
Coarse-grained Multithreading:
▪ A version of hardware multithreading that implies switching between threads
only after significant events, such as a last-level cache miss.
▪ Idea: When a thread is stalled due to some event, switch to a different hardware context
🢭 Switch-on-event multithreading
▪ Possible stall events
🢭 Cache misses
🢭 Synchronization events (e.g., load an empty location)
🢭 FP operations

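The switch-on-event idea can be sketched the same way. In this toy model (thread names, instruction names, and the "MISS" marker are all illustrative), the processor keeps running one thread until it hits a stall event, then switches to the next context; a real design would also switch back when the miss resolves.

```python
# Toy model of coarse-grained (switch-on-event) multithreading:
# run one thread until it hits a long-latency event, then switch.
import itertools

def coarse_grained_schedule(threads, cycles):
    """threads: dict of name -> instruction iterator.
    Switch contexts only when the current thread stalls ("MISS")."""
    names = list(threads)
    current = 0
    trace = []
    for _ in range(cycles):
        name = names[current]
        instr = next(threads[name])
        trace.append((name, instr))
        if instr == "MISS":                  # long-latency event: switch
            current = (current + 1) % len(names)
    return trace

t0 = itertools.cycle(["add", "load", "MISS"])
t1 = itertools.cycle(["mul", "sub"])
trace = coarse_grained_schedule({"T0": t0, "T1": t1}, 6)
print(trace)
```

Note the contrast with the fine-grained sketch: T0 keeps the pipeline to itself until its miss, so single-thread performance is better, but the switch itself would cost a pipeline flush in hardware.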
Fine-grained vs. Coarse-grained MT


▪ Fine-grained advantages
+ Simpler to implement, can eliminate dependency checking, branch prediction
logic completely
+ Switching need not have any performance overhead (i.e., no dead cycles)
+ Coarse-grained requires a pipeline flush or a lot of hardware to save
pipeline state
🡪 Higher performance overhead with deep pipelines and large windows
▪ Disadvantages
- Low single thread performance: each thread gets 1/Nth of the bandwidth
of the pipeline

Simultaneous Multithreading (SMT):


▪ A version of multithreading that lowers the cost of multithreading by utilizing
the resources needed for a multiple-issue, dynamically scheduled
microarchitecture.
▪ Instructions from multiple threads are issued in the same cycle
🢭 Uses the register renaming and dynamic scheduling facility of a multiple-issue
architecture
▪ Needs more hardware support
🢭 Register files and PCs for each thread
🢭 Temporary result registers before commit
🢭 Support to sort out which threads get results from which instructions
▪ Maximizes utilization of execution units
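The key SMT property, that one cycle's issue slots are filled from several threads at once, can be sketched as follows. This is a toy model (thread names, instruction names, and the issue width are illustrative, not from any real machine):

```python
# Toy model of SMT issue: in each cycle, up to `width` ready
# instructions are issued, drawn from several threads at once.
def smt_issue(ready_queues, width):
    """ready_queues: dict of thread name -> list of ready instructions.
    Fill one cycle's issue slots from all threads."""
    slots = []
    for name, queue in ready_queues.items():
        while queue and len(slots) < width:
            slots.append((name, queue.pop(0)))
    return slots

queues = {"T0": ["add", "mul"], "T1": ["load"], "T2": ["sub", "xor"]}
cycle = smt_issue(queues, width=4)
print(cycle)  # slots filled from T0, T1 and T2 in the same cycle
```

Unlike the fine- and coarse-grained sketches, no single thread owns a cycle; this is what lets SMT keep the execution units of a wide superscalar busy.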

Fig: Hardware Multithreading options

The four threads at the top show how each would execute running alone on a
standard superscalar processor without multithreading support. The three examples at the
bottom show how they would execute running together in three multithreading options. The
horizontal dimension represents the instruction issue capability in each clock cycle. The
vertical dimension represents a sequence of clock cycles. An empty (white) box indicates
that the corresponding issue slot is unused in that clock cycle. The shades of gray and color
correspond to four different threads in the multithreading processors.
Comparison of Multithreading options:

1. Thread Scheduling Policy
- Fine-grained: Issues instructions for different threads after every cycle in a round-robin fashion
- Coarse-grained: One thread runs until it is blocked by an event that normally would create a long-latency stall
- Simultaneous: Instructions from multiple threads are issued in the same cycle

2. Pipeline Partitioning
- Fine-grained: Dynamic, no flush
- Coarse-grained: None, flush on switch
- Simultaneous: Dynamic, no flush

3. Efficiency
- Fine-grained: More efficient than coarse-grained multithreading
- Coarse-grained: Less efficient
- Simultaneous: More efficient than the other two

4. Required Threads
- Fine-grained: Requires more threads to keep the processor busy
- Coarse-grained: Requires fewer threads to keep the processor busy
- Simultaneous: Requires more threads to keep the processor busy

5. Hardware Complexity
- Fine-grained: Extra hardware is required
- Coarse-grained: No such extra hardware is required
- Simultaneous: Extra hardware is required

6. Advantages
- Fine-grained: Conceptually simple
- Coarse-grained: Simple implementation, low cost
- Simultaneous: Hides memory latency

7. Disadvantages
- Fine-grained: Very poor single-thread performance
- Coarse-grained: Not suitable for out-of-order execution
- Simultaneous: Increased conflicts in shared resources

8. Example
- Fine-grained: UltraSPARC T1 (Sun Niagara 1)
- Coarse-grained: IBM Northstar/Pulsar
- Simultaneous: Intel Pentium 4
CMP Architecture
Ø Chip-level multiprocessing (CMP or multicore): integrates two or more independent cores
(normally CPUs) into a single package composed of a single integrated circuit (IC), called a die,
or of multiple dies in one package, with each core executing threads independently.
Ø Every functional unit of a processor is duplicated.
Ø Multiple processors, each with a full set of architectural resources, reside on the same die
Ø Processors may share an on-chip cache or each can have its own cache
Ø Examples: HP Mako, IBM Power4
Ø Challenges: Power, Die area (cost)

Fig: Single core computer


Chip Multithreading
Chip Multithreading = Chip Multiprocessing + Hardware Multithreading.
Ø Chip multithreading is the capability of a processor to process multiple software threads
simultaneously on multiple hardware threads of execution.
Ø It is achieved by multiple cores on a single chip, multiple threads on a single core, or both.
Ø CMT processors are especially suited to server workloads, which generally have high levels
of Thread-Level Parallelism (TLP).
CMP’s Performance
Ø CMPs are now the only way to build high-performance microprocessors, for a variety of
reasons:
Ø Large uniprocessors are no longer scaling in performance, because it is only possible to
extract a limited amount of parallelism from a typical instruction stream.
Ø Cannot simply ratchet up the clock speed on today’s processors, or the power dissipation will
become prohibitive.
Ø CMT processors support many h/w strands through efficient sharing of on-chip resources
such as pipelines, caches and predictors.
Ø CMT processors are a good match for server workloads, which have high levels of TLP and
relatively low levels of ILP.
SMT and CMP
Ø CMP is easier to implement, but only SMT has the ability to hide latencies.
Ø Functional partitioning is not exactly reached within an SMT processor due to the centralized
instruction issue.
Ø A separation of the thread queues is a possible solution, although it does not remove the
central instruction issue.
Ø A combination of simultaneous multithreading with the CMP may be superior.
Ø Research: combine the SMT or CMP organization with the ability to create threads, either with
compiler support or fully dynamically, out of a single thread.
Ø Thread-level speculation
Ø Close to multiscalar
