CS 162 Computer Architecture
Lecture 10: Multithreading

Pipeline Hazards

    LW   r1, 0(r2)
    LW   r5, 12(r1)
    ADDI r5, r5, #12
    SW   12(r1), r5

• Each instruction may depend on the next
  – Without bypassing, interlocks are needed
• Bypassing cannot completely eliminate interlocks or delay slots

Simple Multithreaded Pipeline
• Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage

Multithreading Costs
• Appears to software (including the OS) as multiple slower CPUs
• Each thread requires its own user state
  – GPRs
  – PC
• Each thread also needs its own OS control state
  – virtual-memory page-table base register
  – exception-handling registers
• Other costs?

Thread Scheduling Policies
• Fixed interleave (CDC 6600 PPUs, 1965)
  – each of N threads executes one instruction every N cycles
  – if a thread is not ready to go in its slot, insert a pipeline bubble
• Software-controlled interleave (TI ASC PPUs, 1971)
  – OS allocates S pipeline slots amongst N threads
  – hardware performs a fixed interleave over the S slots, executing whichever thread is in that slot
• Hardware-controlled thread scheduling (HEP, 1982)
  – hardware keeps track of which threads are ready to go
  – picks the next thread to execute based on a hardware priority scheme

Denelcor HEP (Burton Smith, 1982)
• First commercial machine to use hardware threading in the main CPU
  – 120 threads per processor
  – 10 MHz clock rate
  – up to 8 processors
  – precursor to the Tera MTA (Multithreaded Architecture)

Tera MTA Overview
• Up to 256 processors
• Up to 128 active threads per processor
• Processors and memory modules populate a sparse 3D torus interconnection fabric
• Flat, shared main memory
  – No data cache
  – Sustains one main memory access per cycle per
processor
• 50 W/processor @ 260 MHz

MTA Instruction Format
• Three operations packed into a 64-bit instruction word (short VLIW)
• One memory operation, one arithmetic operation, plus one arithmetic or branch operation
• Memory operations incur ~150 cycles of latency
• An explicit 3-bit "lookahead" field in the instruction gives the number of subsequent instructions (0-7) that are independent of this one
  – cf. instruction grouping in VLIW
  – allows fewer threads to fill the machine pipeline
  – used for variable-sized branch delay slots
• Thread creation and termination instructions

Coarse-Grain Multithreading
• Tera MTA was designed for supercomputing applications with large data sets and low locality
  – No data cache
  – Many parallel threads needed to hide large memory latency
• Other applications are more cache-friendly
  – Few pipeline bubbles when the cache is getting hits
  – Just add a few threads to hide occasional cache-miss latencies
  – Swap threads on cache misses

MIT Alewife
• Modified SPARC chips
  – register windows hold different thread contexts
• Up to four threads per node
• Thread switch on local cache miss

IBM PowerPC RS64-III (Pulsar)
• Commercial coarse-grain multithreading CPU
• Based on PowerPC with a quad-issue, in-order, five-stage pipeline
• Each physical CPU supports two virtual CPUs
• On an L2 cache miss, the pipeline is flushed and execution switches to the second thread
  – the short pipeline minimizes the flush penalty (4 cycles), small compared to memory access latency
  – flushing the pipeline also simplifies exception handling

Vertical Multithreading
• Cycle-by-cycle interleaving of a second thread removes vertical waste

Ideal Multithreading for Superscalar
• Interleave multiple threads across multiple issue slots with no restrictions

Simultaneous Multithreading
• Add multiple contexts and fetch engines to a wide out-of-order superscalar processor [Tullsen, Eggers, Levy, UW, 1995]
• The OOO instruction window already has most of the
circuitry required to schedule from multiple threads
• Any single thread can utilize the whole machine

From Superscalar to SMT
• Extra pipeline stages for accessing thread-shared register files

From Superscalar to SMT
• Fetch from the two highest-throughput threads
• Why?

From Superscalar to SMT
• Small items:
  – per-thread program counters
  – per-thread return stacks
  – per-thread bookkeeping for instruction retirement, trap & instruction dispatch queue flush
  – thread identifiers, e.g., with BTB & TLB entries
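The fixed-interleave policy and latency-hiding argument above can be sketched as a toy cycle-by-cycle model. This is an illustrative simulation only; the function name, parameters, and the latency/miss numbers are hypothetical, not taken from the slides.

```python
# Toy model of fixed-interleave multithreading (CDC 6600 PPU style):
# each of N threads gets one issue slot every N cycles; if a thread's
# outstanding memory access has not completed by its slot, the slot
# becomes a pipeline bubble. All names and numbers are illustrative.

def run(n_threads, n_instrs_per_thread, mem_latency, mem_every):
    """Return (total_cycles, issue-slot utilization).

    Every `mem_every`-th instruction of a thread is a memory access
    that keeps the thread busy for `mem_latency` cycles before it can
    issue again.
    """
    done = [0] * n_threads      # instructions retired per thread
    ready_at = [0] * n_threads  # cycle at which each thread may issue next
    issued = 0
    cycle = 0
    while min(done) < n_instrs_per_thread:
        t = cycle % n_threads   # fixed round-robin issue slot
        if done[t] < n_instrs_per_thread and ready_at[t] <= cycle:
            done[t] += 1
            issued += 1
            if done[t] % mem_every == 0:   # this was a memory access
                ready_at[t] = cycle + mem_latency
        # else: bubble -- thread not ready in its slot
        cycle += 1
    return cycle, issued / cycle

# With one thread, every memory access stalls the pipeline;
# with more threads, the bubbles are filled by other threads.
for n in (1, 2, 4, 8):
    cycles, util = run(n, 100, mem_latency=8, mem_every=4)
    print(f"{n} thread(s): {cycles} cycles, utilization {util:.2f}")
```

With a single thread the issue slot sits idle for most of each memory access, while with enough threads the slot is almost always full. This is the same reason the Tera MTA supports up to 128 active threads per processor and the HEP ran 120: with no data cache, only thread-level parallelism can cover the ~150-cycle memory latency.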