Computer Architecture: Single Cycle vs. Multicycle & Pipelined Processors, study notes in Electrical and Electronics Engineering

An overview of computer architecture, comparing single cycle and multicycle implementations, and introducing pipelined processors. Topics include the advantages and disadvantages of single cycle and multicycle implementations, the five stages of a load instruction, and the differences between single cycle, multicycle, and pipelined timing.

Typology: Study notes (pre-2010), uploaded on 09/17/2009 by koofers-user-iq2

14:332:331 Computer Architecture and Assembly Language, Fall 2006
Week 12: Introduction to Pipelined Datapath
[Adapted from Dave Patterson's UCB CS152 slides and Mary Jane Irwin's PSU CSE331 slides]

W12.2 Single Cycle Disadvantages & Advantages
- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
- Is wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared within a clock cycle
- but it is simple and easy to understand
[Timing diagram: with the single cycle implementation, a clock period long enough for lw leaves wasted time in the cycle when executing sw]

W12.5 Single Cycle vs. Multiple Cycle Timing
- Single cycle implementation: one long clock cycle per instruction (lw fills Cycle 1; sw leaves waste in Cycle 2)
- Multiple cycle implementation: one short clock cycle per stage (IFetch, Dec, Exec, Mem, WB), so lw takes five cycles, sw takes four, and an R-type follows
- Note: the multicycle clock is slower than 1/5th of the single cycle clock due to the overhead of the inter-stage flipflops

W12.6 Pipelined MIPS Processor
- Start the next instruction while still working on the current one
- Improves throughput, the total amount of work done in a given time
- Instruction latency (execution time, delay time, response time) is not reduced: the time from the start of an instruction to its completion is unchanged
[Diagram: lw, sw, and an R-type instruction each pass through IFetch, Dec, Exec, Mem, WB, overlapped one cycle apart across Cycles 1-8]
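The single cycle vs. multicycle trade-off can be made concrete with a small calculation. This is a sketch under assumed numbers: the 2 ns stage latency and 2.2 ns multicycle clock are illustrative choices, not figures from the notes.

```python
# Illustrative comparison of single-cycle vs. multicycle execution time.
# The 2 ns stage latency and 2.2 ns multicycle clock are assumed numbers
# chosen for illustration; they are not taken from the notes.

# Stages used by each instruction class in the 5-stage datapath.
STAGES = {
    "lw":     ["IFetch", "Dec", "Exec", "Mem", "WB"],
    "sw":     ["IFetch", "Dec", "Exec", "Mem"],
    "R-type": ["IFetch", "Dec", "Exec", "WB"],
}
STAGE_NS = 2.0   # assumed latency of each stage

def single_cycle_time(program):
    # The clock period must accommodate the slowest instruction (lw, which
    # uses all five stages), so every instruction pays that full latency.
    period = max(len(stages) for stages in STAGES.values()) * STAGE_NS
    return period * len(program)

def multicycle_time(program, cycle_ns=2.2):
    # One short cycle per stage actually used; the cycle is slightly longer
    # than a stage (2.2 ns vs. 2 ns) due to inter-stage flipflop overhead.
    return cycle_ns * sum(len(STAGES[op]) for op in program)

program = ["lw", "sw", "R-type", "R-type"]
print(single_cycle_time(program))   # 4 instructions x 10 ns = 40.0
print(multicycle_time(program))     # 17 stage-cycles x 2.2 ns, about 37.4
```

With these numbers the multicycle machine wins slightly because sw and R-type skip stages, but the flipflop overhead (2.2 ns vs. 2 ns) eats into the gain, which is exactly the point the timing slide makes.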
W12.7 Single Cycle, Multiple Cycle, vs. Pipeline
- Single cycle implementation: one long cycle fits a Load; a Store wastes the rest of its cycle
- Multiple cycle implementation: lw takes IFetch, Dec, Exec, Mem, WB; sw takes IFetch, Dec, Exec, Mem; an R-type follows with a wasted cycle
- Pipeline implementation: lw, sw, and the R-type overlap, each passing through IFetch, Dec, Exec, Mem, WB one cycle apart

W12.10 MIPS Pipeline Control Path Modifications
- All control signals are determined during Decode and held in the state registers between pipeline stages
[Datapath figure: PC, Instruction Memory, Register File, Sign Extend, Shift left 2, ALU, and Data Memory, separated by the pipeline state registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB, all driven by the System Clock]

W12.11 Graphically Representing the MIPS Pipeline
[Each instruction is drawn as IM, Reg, ALU, DM, Reg]
- Can help with answering questions like: How many cycles does it take to execute this code? What is the ALU doing during cycle 4? Is there a hazard, why does it occur, and how can it be fixed?

W12.12 Why Pipeline? For Throughput!
[Diagram: Inst 0 through Inst 4 in instruction order, overlapped one cycle apart, each as IM, Reg, ALU, DM, Reg]
- Once the pipeline is full, one instruction is completed every cycle
- The start-up cost is the time to fill the pipeline

W12.15 How About Register File Access?
[Diagram: an add followed by a later add that accesses the register file in the same cycle the first one writes it]
- Can fix the register file access hazard by doing reads in the second half of the cycle and writes in the first half
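The throughput claim above ("once the pipeline is full, one instruction is completed every cycle") can be sketched as a quick cycle count. The helper functions below are hypothetical, not code from the notes:

```python
# Cycle counts for a k-stage pipeline vs. a multicycle machine, to make the
# throughput argument concrete (hypothetical helpers, not code from the notes).

def pipelined_cycles(n, stages=5):
    # (stages - 1) cycles to fill the pipeline, then one instruction
    # completes every cycle.
    return 0 if n == 0 else (stages - 1) + n

def multicycle_cycles(n, stages=5):
    # Without pipelining, every instruction occupies all its stage cycles
    # before the next one starts.
    return stages * n

n = 1000
print(pipelined_cycles(n))                         # 1004
print(multicycle_cycles(n) / pipelined_cycles(n))  # ~4.98, approaching 5, the stage count
```

For large n the speedup approaches the number of pipe stages, which is the "potential speedup" bound quoted in the summary slide.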
W12.16 Register Usage Can Cause Data Hazards
- Dependencies backward in time cause hazards
- Example: add r1,r2,r3 followed by sub r4,r1,r5; and r6,r1,r7; xor r4,r1,r5; or r8,r1,r9 - the nearby instructions need r1 before the add has written it back

W12.17 One Way to "Fix" a Data Hazard
- Can fix a data hazard by waiting - stalling - but stalls affect throughput
- Example: after add r1,r2,r3, both sub r4,r1,r5 and and r6,r1,r7 must insert stall cycles until r1 is written

W12.20 Stores Can Cause Data Hazards
- Dependencies backward in time cause hazards
- Example: add r1,r2,r3 followed by sw r1,100(r5); and r6,r1,r7; xor r4,r1,r5; or r8,r1,r9

W12.21 Forwarding with Load-use Data Hazards
- Example: lw $1,100($2) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5
- Will still need one stall cycle even with forwarding, since the loaded value is not available until the end of the Mem stage

W12.22 Branch Instructions Cause Control Hazards
- Dependencies backward in time cause hazards
- Example: add; beq; lw; Inst 3; Inst 4 - instructions after the beq are fetched before the branch outcome is known
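The data hazards above can be sketched as a small detector over an instruction list. The tuple encoding of an instruction (op, dest, srcs, is_load) is a simplifying assumption made for illustration:

```python
# Sketch of RAW (read-after-write) hazard detection over a short instruction
# sequence, in the spirit of the hazard slides. The tuple encoding of an
# instruction (op, dest, srcs, is_load) is a simplifying assumption.
from collections import namedtuple

Instr = namedtuple("Instr", "op dest srcs is_load")

def raw_hazards(program, window=2):
    """Return (producer, consumer) index pairs where a consumer reads a
    register written at most `window` instructions earlier, the distance a
    5-stage pipeline must cover by forwarding or stalling."""
    pairs = []
    for j, consumer in enumerate(program):
        for i in range(max(0, j - window), j):
            if program[i].dest and program[i].dest in consumer.srcs:
                pairs.append((i, j))
    return pairs

program = [
    Instr("lw",  "$1", ("$2",), True),        # lw  $1,100($2)
    Instr("sub", "$4", ("$1", "$5"), False),  # uses $1 right after the load
    Instr("and", "$6", ("$1", "$7"), False),  # uses $1 two instructions later
]

for i, j in raw_hazards(program):
    # A load-use hazard needs one stall cycle even with forwarding; the
    # other RAW hazards in this window can be fully forwarded.
    needs_stall = program[i].is_load and j == i + 1
    print(program[i].op, "->", program[j].op, "stall" if needs_stall else "forward")
```

Running this on the lw/sub/and sequence reports a stall for the load-use pair and a forward for the later use, matching the W12.21 slide.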
W12.25 Sample Pipeline Alternatives
- ARM7: three stages (IM, Reg, EX) covering PC update, IM access, decode, reg access, ALU op, DM access, shift/rotate, and commit result (write back)
- StrongARM-1: five stages (IM, Reg, ALU, DM, Reg)
- XScale: a deeper pipeline (IM1, IM2, Reg, SHFT, ALU, DM1, DM2/Reg) covering PC update, BTB access, IM access, decode, reg 1 access, shift/rotate, reg 2 access, ALU op, start of DM access, exception handling, DM write, and reg write

W12.26 Summary
- All modern day processors use pipelining
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Pipeline rate is limited by the slowest pipeline stage; unbalanced lengths of the pipe stages reduce speedup
- Time to "fill" the pipeline and time to "drain" it reduce speedup
- Hazards must be detected and resolved; stalling negatively affects throughput

W12.27 Performance
- Purchasing perspective: given a collection of machines, which has the best performance? the least cost? the best performance/cost?
- Design perspective: faced with design options, which has the best performance improvement? the least cost? the best performance/cost?
- Both require a basis for comparison and a metric for evaluation
- Our goal is to understand the cost and performance implications of architectural choices

W12.30 Example: Concorde vs. Boeing 747
- Time: the Concorde is 1350 mph / 610 mph = 2.2 times faster (3 hours vs. 6.5 hours)
- Throughput: the Concorde delivers 178,200 passenger-mph vs. the Boeing's 286,700 passenger-mph, i.e. 178,200 / 286,700 = 0.62 "times faster"; equivalently, the Boeing is 286,700 / 178,200 = 1.6 "times faster"
- So the Boeing is 1.6 times ("60%") faster in terms of throughput, while the Concorde is 2.2 times ("120%") faster in terms of flying time
- We will focus primarily on execution time for a single job

W12.31 Basis of Evaluation
- Actual target workload. Pros: representative. Cons: very specific, non-portable, difficult to run or measure, hard to identify the cause of a result
- Full application benchmarks. Pros: portable, widely used, improvements useful in reality. Cons: less representative
- Small "kernel" benchmarks. Pros: easy to run, usable early in the design cycle. Cons: easy to "fool"
- Microbenchmarks. Pros: identify peak capability and potential bottlenecks. Cons: "peak" may be a long way from application performance

W12.32 SPEC95
- Eighteen application benchmarks (with inputs) reflecting a technical computing workload
- Eight integer: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
- Ten floating-point intensive: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
- Must be run with standard compiler flags, eliminating special undocumented incantations that may not even generate working code for real programs

W12.35 CPI
- "Average cycles per instruction":
  CPI = (CPU Time * Clock Rate) / Instruction Count = Clock Cycles / Instruction Count
- CPU time = ClockCycleTime * sum over i = 1..n of (CPI_i * I_i)
- CPI = sum over i = 1..n of (CPI_i * F_i), where F_i = I_i / Instruction Count is the "instruction frequency"
- Invest resources where time is spent!

W12.36 Example (RISC processor)
Base machine (Reg/Reg), typical mix:

  Op      Freq   Cycles   CPI(i)   % Time
  ALU     50%    1        0.5      23%
  Load    20%    5        1.0      45%
  Store   10%    3        0.3      14%
  Branch  20%    2        0.4      18%
  Overall CPI:            2.2

- How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
- How does this compare with using branch prediction to shave a cycle off the branch time?
- What if two ALU instructions could be executed at once?
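The CPI table above can be checked, and the first two questions answered, in a few lines. The `cpi` helper is hypothetical; the instruction mix is the one from the notes:

```python
# Working the CPI example: CPI = sum over instruction classes of CPI_i * F_i.
# The helper function is hypothetical; the mix is the table from the notes.

def cpi(mix):
    # mix maps op -> (frequency F_i, cycles CPI_i)
    return sum(f * c for f, c in mix.values())

base = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}
print(round(cpi(base), 2))           # 2.2, matching the slide

# A better data cache reduces the average load time from 5 to 2 cycles:
better_cache = dict(base, Load=(0.20, 2))
print(round(cpi(better_cache), 2))   # 1.6, a speedup of 2.2 / 1.6 ~ 1.38

# Branch prediction shaves one cycle off the branch time instead:
branch_pred = dict(base, Branch=(0.20, 1))
print(round(cpi(branch_pred), 2))    # 2.0, a speedup of only 2.2 / 2.0 = 1.1
```

Since loads account for 45% of the time in the base machine and branches only 18%, attacking the load latency pays off far more, which is the "invest resources where time is spent" rule in action.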
W12.37 Amdahl's Law
- Speedup due to enhancement E:
  Speedup(E) = ExTime(without E) / ExTime(with E) = Performance(with E) / Performance(without E)
- Suppose enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
  ExTime(with E) = ((1 - F) + F/S) * ExTime(without E)
  Speedup(with E) = 1 / ((1 - F) + F/S)
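The formula above can be wrapped in a small helper to see how quickly the unenhanced fraction dominates. The function name is a hypothetical choice, not code from the notes:

```python
# Amdahl's Law as stated on the slide, wrapped in a hypothetical helper:
# Speedup = 1 / ((1 - F) + F / S) when a fraction F of the task is
# accelerated by a factor S and the remainder is unaffected.

def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# Speeding up 40% of the task by 10x yields far less than 10x overall:
print(amdahl_speedup(0.40, 10))            # about 1.5625
# Even an infinite speedup of that 40% caps the overall gain at 1 / 0.6:
print(amdahl_speedup(0.40, float("inf")))  # about 1.67
```

The second call shows the law's key consequence: the fraction left untouched (here 60%) bounds the achievable speedup no matter how large S gets.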