Multithreading - Intro to Computer Architecture - Lecture Slides (Computer Architecture and Organization)

In this part of the Intro to Computer Architecture course, we study the main concepts of multithreading: pipeline hazards, peripheral processors, a simple multithreaded pipeline, multithreading costs, thread scheduling policies, coarse-grained multithreading, multithreading design choices, and the MTA instruction format.

CS 162 Computer Architecture
Lecture 10: Multithreading

Pipeline Hazards

  LW   r1, 0(r2)
  LW   r5, 12(r1)
  ADDI r5, r5, #12
  SW   12(r1), r5

• Each instruction may depend on the next
  – Without bypassing, need interlocks
• Bypassing cannot completely eliminate interlocks or delay slots

Simple Multithreaded Pipeline

• Have to carry thread select down the pipeline to ensure correct state bits are read/written at each pipe stage

Multithreading Costs

• Appears to software (including OS) as multiple slower CPUs
• Each thread requires its own user state
  – GPRs
  – PC
• Also needs its own OS control state
  – virtual memory page table base register
  – exception handling registers
• Other costs?

Thread Scheduling Policies

• Fixed interleave (CDC 6600 PPUs, 1965)
  – each of N threads executes one instruction every N cycles
  – if a thread is not ready to go in its slot, insert a pipeline bubble (see the sketch after this slide)
• Software-controlled interleave (TI ASC PPUs, 1971)
  – OS allocates S pipeline slots amongst N threads
  – hardware performs fixed interleave over S slots, executing whichever thread is in that slot
• Hardware-controlled thread scheduling (HEP, 1982)
  – hardware keeps track of which threads are ready to go
  – picks next thread to execute based on hardware priority scheme
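To make the fixed-interleave policy concrete, here is a minimal Python sketch (not from the slides): each of N threads owns every Nth issue slot, and the pipeline issues a bubble when the slot's owner is stalled. The stalled_until table is a hypothetical stand-in for real readiness logic, e.g. an outstanding memory access.

# Fixed-interleave thread scheduling, CDC 6600 PPU style: thread (cycle mod N)
# owns each issue slot; a not-ready thread yields a pipeline bubble (None).
def fixed_interleave(num_threads, num_cycles, stalled_until):
    """Return the issue trace: one thread id per cycle, or None for a bubble."""
    trace = []
    for cycle in range(num_cycles):
        tid = cycle % num_threads              # round-robin slot ownership
        if cycle >= stalled_until.get(tid, 0):
            trace.append(tid)                  # thread issues one instruction
        else:
            trace.append(None)                 # thread not ready: bubble
    return trace

# Example: 4 threads; thread 2 is stalled (say, on memory) until cycle 9.
print(fixed_interleave(4, 12, {2: 9}))
# -> [0, 1, None, 3, 0, 1, None, 3, 0, 1, 2, 3]

Note how the other threads' slots are unaffected: fixed interleave trades single-thread latency for a simple, fair pipeline.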
Denelcor HEP (Burton Smith, 1982)

• First commercial machine to use hardware threading in the main CPU
  – 120 threads per processor
  – 10 MHz clock rate
  – Up to 8 processors
  – precursor to Tera MTA (Multithreaded Architecture)

Tera MTA Overview

• Up to 256 processors
• Up to 128 active threads per processor
• Processors and memory modules populate a sparse 3D torus interconnection fabric
• Flat, shared main memory
  – No data cache
  – Sustains one main memory access per cycle per processor
• 50 W/processor @ 260 MHz

MTA Instruction Format

• Three operations packed into a 64-bit instruction word (short VLIW)
• One memory operation, one arithmetic operation, plus one arithmetic or branch operation
• Memory operations incur ~150 cycles of latency
• Explicit 3-bit "lookahead" field in the instruction gives the number of subsequent instructions (0-7) that are independent of this one
  – c.f. instruction grouping in VLIW
  – allows fewer threads to fill the machine pipeline
  – used for variable-sized branch delay slots
  – (a sketch of computing such a field appears at the end of these slides)
• Thread creation and termination instructions

Coarse-Grain Multithreading

• Tera MTA designed for supercomputing applications with large data sets and low locality
  – No data cache
  – Many parallel threads needed to hide large memory latency
• Other applications are more cache friendly
  – Few pipeline bubbles when the cache is getting hits
  – Just add a few threads to hide occasional cache miss latencies
  – Swap threads on cache misses (see the switch-on-miss sketch at the end of these slides)

MIT Alewife

• Modified SPARC chips
  – register windows hold different thread contexts
• Up to four threads per node
• Thread switch on local cache miss

IBM PowerPC RS64-III (Pulsar)

• Commercial coarse-grain multithreading CPU
• Based on PowerPC with quad-issue in-order five-stage pipeline
• Each physical CPU supports two virtual CPUs
• On an L2 cache miss, the pipeline is flushed and execution switches to the second thread
  – short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
  – flush pipeline to simplify exception handling

Vertical Multithreading

• Cycle-by-cycle interleaving of a second thread removes vertical waste

Ideal Multithreading for Superscalar

• Interleave multiple threads to multiple issue slots with no restrictions

Simultaneous Multithreading

• Add multiple contexts and fetch engines to a wide out-of-order superscalar processor
  – [Tullsen, Eggers, Levy, UW, 1995]
• OOO instruction window already has most of the circuitry required to schedule from multiple threads
• Any single thread can utilize the whole machine

From Superscalar to SMT

• Extra pipeline stages for accessing thread-shared register files

From Superscalar to SMT

• Fetch from the two highest-throughput threads. Why? (see the fetch-policy sketch at the end of these slides)

From Superscalar to SMT

• Small items
  – per-thread program counters
  – per-thread return stacks
  – per-thread bookkeeping for instruction retirement, trap & instruction dispatch queue flush
  – thread identifiers, e.g., with BTB & TLB entries
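The MTA's lookahead field can be illustrated with a small sketch (my construction, not the MTA's actual encoder): for each instruction, count how many of the following instructions, capped at 7, are independent of it. Instructions are modeled here as (reads, writes) register sets; real dependence analysis would also have to consider memory and other state.

# Compute an MTA-style 3-bit "lookahead" value per instruction: the number of
# following instructions (capped at 7) with no register dependence on it.
def lookahead(instrs, cap=7):
    out = []
    for i, (reads_i, writes_i) in enumerate(instrs):
        n = 0
        for reads_j, writes_j in instrs[i + 1 : i + 1 + cap]:
            # Dependent if j reads what i writes (RAW), writes what i reads
            # (WAR), or writes what i writes (WAW).
            if writes_i & reads_j or reads_i & writes_j or writes_i & writes_j:
                break
            n += 1
        out.append(n)
    return out

# LW r1,0(r2); LW r5,12(r1); ADDI r5,r5,#12; SW 12(r1),r5 (from the slides)
instrs = [
    ({"r2"}, {"r1"}),
    ({"r1"}, {"r5"}),
    ({"r5"}, {"r5"}),
    ({"r1", "r5"}, set()),
]
print(lookahead(instrs))  # -> [0, 0, 0, 0]: each instruction depends on the next

A lookahead of 0 everywhere is the worst case for the hardware: with no independent instructions to overlap, the pipeline can only be filled by interleaving other threads.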
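The switch-on-miss policy from the coarse-grain slides can be modeled as follows. This is a simplified sketch, not any machine's actual logic; MISS_LATENCY and SWITCH_PENALTY are assumed parameters, with the 4-cycle penalty borrowed from the Pulsar slide. Run one thread until it misses, pay the flush penalty, swap in the next ready thread, and wake the missing thread when its access completes.

import collections
import itertools

MISS_LATENCY = 50     # assumed cycles to service a cache miss
SWITCH_PENALTY = 4    # assumed pipeline flush/refill cost (cf. Pulsar: 4 cycles)

def run(threads, num_cycles):
    """threads: dict tid -> iterator yielding True when an instruction misses.
    Returns the number of cycles spent doing useful work."""
    ready = collections.deque(threads)   # runnable thread ids
    waiting = {}                         # tid -> cycle when its miss completes
    useful, cycle = 0, 0
    current = ready.popleft()
    while cycle < num_cycles:
        # Wake threads whose outstanding miss has been serviced.
        for tid in [t for t, done in waiting.items() if cycle >= done]:
            ready.append(tid)
            del waiting[tid]
        if current is None:
            if ready:
                current = ready.popleft()
                cycle += SWITCH_PENALTY  # pipeline flush/refill on switch
            else:
                cycle += 1               # nothing runnable: stall
            continue
        useful += 1
        if next(threads[current]):       # this instruction missed in the cache
            waiting[current] = cycle + MISS_LATENCY
            current = None
        cycle += 1
    return useful

def miss_every(n):                       # hypothetical workload: miss every nth
    return itertools.cycle([False] * (n - 1) + [True])

# Two threads that each miss every 20th instruction:
print(run({0: miss_every(20), 1: miss_every(20)}, 1000))

With a single thread the loop would idle for the full 50-cycle miss; with a second ready thread, much of that latency is hidden behind useful work at the cost of a 4-cycle switch.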
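The "fetch from the two highest-throughput threads" heuristic resembles the ICOUNT policy from the SMT literature (Tullsen et al.). A minimal sketch, assuming a hypothetical per-thread count of instructions sitting in the front end: threads that are draining quickly have low counts and win the fetch slots, so a stalled thread cannot clog the shared instruction queue.

import heapq

# ICOUNT-style SMT fetch choice: each cycle, fetch from the k threads with the
# fewest instructions in decode/rename/issue queues. `inflight` is a
# hypothetical per-thread occupancy count supplied by the pipeline.
def pick_fetch_threads(inflight, k=2):
    return heapq.nsmallest(k, inflight, key=inflight.get)

# Example: thread 1 is stalled with a full queue; threads 0 and 3 drain fast.
print(pick_fetch_threads({0: 3, 1: 28, 2: 11, 3: 5}))  # -> [0, 3]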