CIS501 Introduction to Computer Architecture
Prof. Milo Martin
Unit 1: Technology, Cost, Performance, Power, and Reliability

This Unit
• What is a computer and what is computer architecture
• Forces that shape computer architecture
  • Applications (covered last time)
  • Semiconductor technology
• Evaluation metrics: parameters and technology basis
  • Cost
  • Performance
  • Power
  • Reliability

Readings
• H+P
  • Chapter 1
• Paper
  • G. Moore, "Cramming More Components onto Integrated Circuits"
• Reminders
  • Pre-quiz
  • Paper review
    • Groups of 3-4, send via e-mail to cis501+review@cis.upenn.edu
    • Don't worry (much) about the power question, as we might not get to it today

What is Computer Architecture? (review)
• Design of interfaces and implementations...
• Under a constantly changing set of external forces...
  • Applications: change from above (discussed last time)
  • Technology: changes transistor characteristics from below
  • Inertia: resists changing all levels of the system at once
• To satisfy different constraints
  • CIS 501 is mostly about performance
  • Cost
  • Power
  • Reliability
• An iterative process driven by empirical evaluation
• The art/science of tradeoffs

Abstraction and Layering
• Abstraction: the only way of dealing with complex systems
  • Divide the world into objects, each with an...
    • Interface: knobs, behaviors, knobs → behaviors
    • Implementation: "black box" (ignorance + apathy)
  • Only specialists deal with the implementation; the rest of us deal with the interface
  • Example: a car; only mechanics know how the implementation works
• Layering: the abstraction discipline makes life even simpler
  • Removes the need to even know the interfaces of most objects
  • Divide the objects in a system into layers
    • Layer X objects are implemented in terms of the interfaces of layer X-1 objects
    • They don't even need to know the interfaces of layer X-2 objects
    • But sometimes it helps if they do
  • A code sketch of this discipline follows below
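Aside: a minimal Python sketch of the interface/implementation split (my illustration, not from the course; the Storage name and its methods are hypothetical). Layer-X code is written against layer X-1's interface and never touches the implementation behind it, so the implementation can change without breaking the layer above.

  from abc import ABC, abstractmethod

  class Storage(ABC):                        # layer X-1 interface: the "knobs"
      @abstractmethod
      def read(self, addr: int) -> int: ...
      @abstractmethod
      def write(self, addr: int, value: int) -> None: ...

  class DictStorage(Storage):                # one implementation: the "black box"
      def __init__(self):
          self._mem = {}
      def read(self, addr: int) -> int:
          return self._mem.get(addr, 0)
      def write(self, addr: int, value: int) -> None:
          self._mem[addr] = value

  def copy_word(dst: int, src: int, mem: Storage) -> None:   # layer X code
      mem.write(dst, mem.read(src))          # uses only the layer X-1 interface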
Abstraction, Layering, and Computers
• Computers are complex systems, built in layers
  • Applications
  • O/S, compiler
  • Firmware, device drivers
  • Processor, memory, raw I/O devices
  • Digital circuits, digital/analog converters
  • Gates
  • Transistors
• 99% of users don't know the implementation of the hardware layers
• 90% of users don't know the implementation of any layer
• That's OK, the world still works just fine
• But unfortunately, the layers sometimes break down
  • Someone needs to understand what's "under the hood"

CIS501: A Picture
[Figure: the layer stack (Application / OS, Compiler / Firmware / CPU, Memory, I/O / Digital Circuits / Gates & Transistors), split into software and hardware by the Instruction Set Architecture (ISA)]
• Computer architecture
  • Definition of the ISA to facilitate implementation of the software layers
• CIS 501 is mostly about computer micro-architecture
  • Design of the CPU, memory, and I/O to implement the ISA

Semiconductor Technology Background
[Figure: the same layer stack, highlighting digital circuits, gates, and transistors]
• Transistor: invention of the century
• Fabrication

Manufacturing Process
• Grow SiO2
• Grow photo-resist
• Burn "wire-level-1" mask
• Dissolve unburned photo-resist
  • And underlying SiO2
• Grow copper "wires"
• Dissolve remaining photo-resist
• Continue with the next wire layer...
• Typical number of wire layers: 3-6

Defects
• Defects can arise
  • Under-/over-doping
  • Over-/under-dissolved insulator
  • Mask mis-alignment
  • Particle contaminants
• Try to minimize defects
  • Process margins
  • Design rules
    • Minimum transistor size and separation
• Or, tolerate defects
  • Redundant or "spare" memory cells
[Figure: example dies marked defective or slow]

Empirical Evaluation
• Metrics
  • Cost
  • Performance
  • Power
  • Reliability
• Often more important in combination than individually
  • Performance/cost (MIPS/$)
  • Performance/power (MIPS/W)
• Basis for
  • Design decisions
  • Purchasing decisions

Cost
• Metric: $
• In the grand scheme, the CPU accounts for a fraction of total cost
  • Some of that is profit (Intel's, Dell's)

                Desktop     Laptop      PDA        Phone
  $             $100-$300   $150-$350   $50-$100   $10-$20
  % of total    10-30%      10-20%      20-30%     20-30%
  Other costs: memory, display, power supply/battery, disk, packaging

• We are concerned about Intel's cost (it transfers to you)
• Unit cost: cost to manufacture individual chips
• Startup cost: cost to design the chip, build the fab line, marketing

Unit Cost: Integrated Circuit (IC)
• Chips are built in multi-step chemical processes on wafers
  • Cost per wafer is roughly constant: f(wafer size, number of steps)
• Chip (die) cost is proportional to area
  • Larger chips mean fewer of them
  • Larger chips mean fewer working ones
  • Why? Uniform defect density
  • Chip cost ~ chip area^α, with α = 2 to 3
• Wafer yield: % of the wafer that is chips
• Die yield: % of chips that work (see the cost/yield sketch below)
  • Yield is increasingly non-binary: fast vs. slow chips
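The cost and yield relationships above can be made concrete. Below is a minimal Python sketch (mine, not the course's) using the standard dies-per-wafer approximation and the negative-binomial die-yield model from H+P; the slide table on the next page appears to use slightly different edge-loss assumptions, so its die counts come out a little lower than this formula's.

  import math

  def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> int:
      """Usable dies per wafer: wafer area over die area, minus an edge-loss
      term for partial dies around the circumference (H+P approximation)."""
      r = wafer_diameter_cm / 2
      return int(math.pi * r**2 / die_area_cm2
                 - math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2))

  def die_yield(wafer_yield: float, defects_per_cm2: float,
                die_area_cm2: float, alpha: float = 2.0) -> float:
      """Fraction of dies that work, per the negative-binomial defect model."""
      return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** -alpha

  # Slide parameters: wafer yield 90%, alpha = 2, defect density 2/cm^2
  d = dies_per_wafer(25.4, 1.0)     # 10" wafer, 100 mm^2 die: ~450 dies
  y = die_yield(0.90, 2.0, 1.0)     # 0.225, i.e. ~23%
  print(d, y, int(d * y))           # good dies land near the table's 100 mm^2 column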
Yield/Cost Examples
• Parameters: wafer yield = 90%, α = 2, defect density = 2/cm2

  Die size (mm2)    100       144       196       256       324       400
  Die yield         23%       19%       16%       12%       11%       10%
  10" wafer         431 (96)  290 (53)  206 (32)  153 (20)  116 (13)  90 (9)
  8" wafer          256 (59)  177 (32)  124 (19)  90 (11)   68 (7)    52 (5)
  6" wafer          139 (31)  90 (16)   62 (9)    44 (5)    32 (3)    23 (2)
  (dies per wafer; good dies in parentheses)

                 Wafer    Defect   Area   Dies  Yield  Die    Package      Test   Total
                 cost     (/cm2)   (mm2)               cost   cost (pins)  cost
  Intel 486DX2   $1200    1.0      81     181   54%    $12    $11 (168)    $12    $35
  IBM PPC601     $1700    1.3      196    66    27%    $95    $3 (304)     $21    $119
  DEC Alpha      $1500    1.2      234    53    19%    $149   $30 (431)    $23    $202
  Intel Pentium  $1500    1.5      296    40    9%     $417   $19 (273)    $37    $473

Startup Costs
• Startup costs must be amortized over chips sold
• Research and development: ~$100M per chip
  • 500 person-years @ $200K each
• Fabrication facilities: ~$2B per new line
  • Clean rooms (bunny suits), lithography, testing equipment
• If you sell 10M chips, startup adds ~$200 to the cost of each
• Companies (e.g., Intel) don't make money on new chips
  • They make money on proliferations (shrinks and frequency bumps)
  • No startup cost for these

Moore's Effect on Cost
• Scaling has opposite effects on unit and startup costs
+ Reduces unit integrated-circuit cost
  • Either lower cost for the same functionality...
  • Or the same cost for more functionality
– Increases startup cost
  • More expensive fabrication equipment
  • Takes longer to design, verify, and test chips

Performance
• Two definitions
  • Latency (execution time): time to finish a fixed task
  • Throughput (bandwidth): number of tasks in a fixed time
• Very different: throughput can exploit parallelism, latency cannot
  • Baking-bread analogy
  • Often contradictory
  • Choose the definition that matches your goals (most frequently throughput)
• Example: move people from A to B, 10 miles
  • Car: capacity = 5, speed = 60 miles/hour
  • Bus: capacity = 60, speed = 20 miles/hour
  • Latency: car = 10 min, bus = 30 min
  • Throughput: car = 15 PPH (counting the return trip), bus = 60 PPH

Performance Improvement
• Processor A is X times faster than processor B if
  • Latency(P,A) = Latency(P,B) / X
  • Throughput(P,A) = Throughput(P,B) * X
• Processor A is X% faster than processor B if
  • Latency(P,A) = Latency(P,B) / (1 + X/100)
  • Throughput(P,A) = Throughput(P,B) * (1 + X/100)
• Car/bus example (sketched in code below)
  • Latency? The car is 3 times (i.e., 200%) faster than the bus
  • Throughput? The bus is 4 times (i.e., 300%) faster than the car

What Is 'P' in Latency(P,A)?
• Program
  • Latency(A) makes no sense; a processor executes some program
  • But which one?
• The actual target workload?
  + Accurate
  – Not portable/repeatable, overly specific, hard to pinpoint problems
• Some representative benchmark program(s)?
  + Portable/repeatable, pretty accurate
  – Hard to pinpoint problems, may not be exactly what you run
• Some small kernel benchmarks (micro-benchmarks)?
  + Portable/repeatable, easy to run, easy to pinpoint problems
  – Not representative of the complex behaviors of real programs
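Circling back to the speedup definitions two slides up, here is a minimal Python sketch (function names are mine) that reproduces the car/bus numbers:

  def times_faster_latency(lat_b: float, lat_a: float) -> float:
      """A is N times faster than B (by latency) when Latency(B)/Latency(A) = N."""
      return lat_b / lat_a

  def percent_faster_latency(lat_b: float, lat_a: float) -> float:
      """A is X% faster than B when Latency(B)/Latency(A) = 1 + X/100."""
      return (lat_b / lat_a - 1) * 100

  def times_faster_throughput(thr_a: float, thr_b: float) -> float:
      """A is N times faster than B (by throughput) when Thr(A)/Thr(B) = N."""
      return thr_a / thr_b

  # Car vs. bus: latency in minutes, throughput in people per hour
  print(times_faster_latency(30, 10))      # 3.0   -> car is 3x faster
  print(percent_faster_latency(30, 10))    # 200.0 -> i.e., 200% faster
  print(times_faster_throughput(60, 15))   # 4.0   -> bus is 4x (300%) faster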
SPEC Benchmarks
• SPEC (Standard Performance Evaluation Corporation)
  • http://www.spec.org/
  • Consortium of companies that collects, standardizes, and distributes benchmark programs
  • Posts SPECmark results for different processors
    • One number that represents performance for the entire suite
  • Benchmark suites for CPU, Java, I/O, Web, Mail, etc.
  • Updated every few years, so companies don't target the benchmarks
• SPEC CPU 2000
  • 12 "integer": gzip, gcc, perl, crafty (chess), vortex (DB), etc.
  • 14 "floating point": mesa (OpenGL), equake, facerec, etc.
  • Written in C and Fortran (a few in C++)

Another CPI Example
• Assume a processor with these instruction frequencies and costs
  • Integer ALU: 50%, 1 cycle
  • Load: 20%, 5 cycles
  • Store: 10%, 1 cycle
  • Branch: 20%, 2 cycles
• Which change would improve performance more?
  • A. Branch prediction, to reduce branch cost to 1 cycle?
  • B. A bigger data cache, to reduce load cost to 3 cycles?
• Compute CPI (worked in code below)
  • Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2
  • A = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8
  • B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6 (winner)

Increasing Clock Frequency: Pipelining
• The CPU is a pipeline: compute stages separated by latches
  • http://…/~amir/cse371/lecture_slides/pipeline.pdf
• Clock period: maximum delay of any stage
  • Number of gate levels in a stage
  • Delay of individual gates (these days, wire delay matters more)
[Figure: simple datapath with PC, instruction memory, register file (s1, s2, d), data memory (a, d), and a +4 PC increment]

Increasing Clock Frequency: Pipelining
• Reduce pipeline stage delay
  • Reduce logic levels and wire lengths (better design)
  • Complementary to technology efforts (described later)
• Increase the number of pipeline stages (multi-stage operations)
  – Often causes CPI to increase
  – At some point, actually causes performance to decrease
  • The "optimal" pipeline depth is program- and technology-specific
• Remember this example
  • The Pentium III (12-stage pipeline) at 800 MHz was faster than
    the Pentium 4 (22-stage pipeline) at 1 GHz
  • Intel's next design: more like the Pentium III
• Much more about this later

CPI and Clock Frequency
• System components are "clocked" independently
  • E.g., increasing the processor clock frequency doesn't improve memory performance
• Example
  • Processor A: CPI_CPU = 1, CPI_MEM = 1, clock = 500 MHz
  • What is the speedup if we double the clock frequency?
  • Base: CPI = 2 → IPC = 0.5 → MIPS = 250
  • New: clock *= 2 → CPI_MEM *= 2 → CPI = 3 → IPC = 0.33 → MIPS = 333
  • Speedup = 333/250 = 1.33 << 2
• What about an infinite clock frequency?
  • Only a 2x speedup (an example of Amdahl's Law)

Measuring CPI
• How are CPI and execution time actually measured?
  • Execution time: time (Unix): wall clock + CPU + system
  • CPI = CPU time / (clock frequency * dynamic insn count)
  • How is dynamic instruction count measured?
• More useful is a CPI breakdown (CPI_CPU, CPI_MEM, etc.)
  • So we know what the performance problems are and what to fix
• CPI breakdowns
  • Hardware event counters
    • Calculate CPI using counter frequencies/event costs
  • Cycle-level micro-architecture simulation (e.g., SimpleScalar)
    + Measure exactly what you want
    + Measure the impact of potential fixes
    • Must model the micro-architecture faithfully
    • The method of choice for many micro-architects (and you)
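A minimal Python sketch of the two CPI calculations above (the instruction-mix comparison and the clock-doubling example); the function names and dictionary layout are mine:

  def cpi(mix: dict) -> float:
      """Average CPI from {insn class: (frequency, cycles)}; frequencies sum to 1."""
      return sum(freq * cycles for freq, cycles in mix.values())

  base  = {'alu': (0.5, 1), 'load': (0.2, 5), 'store': (0.1, 1), 'branch': (0.2, 2)}
  opt_a = {**base, 'branch': (0.2, 1)}   # branch prediction: branches cost 1 cycle
  opt_b = {**base, 'load':   (0.2, 3)}   # bigger data cache: loads cost 3 cycles
  print(cpi(base), cpi(opt_a), cpi(opt_b))   # 2.0 1.8 1.6 -> B wins

  # Clock-doubling example: CPI splits into CPU and memory parts, and doubling
  # the clock doubles the *cycle* cost of the unchanged memory.
  clock_mhz, cpi_cpu, cpi_mem = 500, 1.0, 1.0
  mips_base = clock_mhz / (cpi_cpu + cpi_mem)            # 250 MIPS
  mips_fast = 2 * clock_mhz / (cpi_cpu + 2 * cpi_mem)    # ~333 MIPS
  print(mips_fast / mips_base)                           # speedup ~1.33, not 2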
Improving CPI
• CIS501 is more about improving CPI than frequency
  • Historically, clock accounts for 70%+ of performance improvement
  • Achieved via deeper pipelines
  • That will (have to) change
    • Deep pipelining is not power efficient
    • Physical speed limits are approaching
    • 1 GHz: 1999, 2 GHz: 2001, 3 GHz: 2002, 4 GHz? almost 2006
• Techniques we will look at
  • Caching, speculation, multiple issue, out-of-order issue
  • Vectors, multiprocessing, more...
• Moore helps because CPI reduction requires transistors
  • The definition of parallelism is "more transistors"
  • But the best example is caches

Moore's Effect on Performance
• Moore's Curve: the common interpretation of Moore's Law
  • "CPU performance doubles every 18 months"
  • A self-fulfilling prophecy
  • 2X every 18 months is ~1% per week
  • Q: Would you add a feature that improved performance 20% if it took 8 months to design and test? (8 months on the curve is ~36%, so no)
• Processors under Moore's Curve (that arrive too late) fail spectacularly
  • E.g., Intel's Itanium, Sun's Millennium
[Figure: performance vs. year, 1982-1994, for RISC and Intel x86 processors; ~35%/year growth]

Performance Rules of Thumb
• Make the common case fast
  • Sometimes called "Amdahl's Law"
  • Corollary: don't optimize the 1% to the detriment of the other 99%
• Build a balanced system
  • Don't over-engineer capabilities that cannot be utilized
• Design for actual, not peak, performance
  • For actual performance X, machine capability must be > X

Transistor Speed, Power, and Reliability
• Transistor characteristics and scaling impact:
  • Switching speed
  • Power
  • Reliability
• The "undergrad" gate delay model for architecture
  • Each NOT, NAND, NOR, AND, OR gate has a delay of "1"
  • Reality is not so simple

Transistors and Wires
[Figure: IBM SOI technology cross-section (© IBM); from slides © Krste Asanović, MIT]
[Figure: IBM CMOS7, 6 layers of copper wiring (© IBM); from slides © Krste Asanović, MIT]

Simple RC Delay Model
[Figure: inverter chain with 0→1 and 1→0 transitions charging/discharging node capacitances]
• Switching time is an RC circuit (charge or discharge)
  • R, resistance: slows the rate of current flow
    • Depends on material, length, cross-section area
  • C, capacitance: electrical charge storage
    • Depends on material, area, distance
  • Voltage affects speed, too

Moore's Effect on RC Delay
• Scaling helps reduce wire and gate delays in some ways, hurts in others (see the sketch after this list)
+ Wires become shorter (Length↓ → Resistance↓)
+ Wire "surface areas" become smaller (Capacitance↓)
+ Transistors become shorter (Resistance↓)
+ Transistors become narrower (Capacitance↓, Resistance↑)
– Gate insulator thickness becomes smaller (Capacitance↑)
– Distance between wires becomes smaller (Capacitance↑)
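To see the tug-of-war in the list above, here is a toy Python sketch of the RC model; the wire geometry, resistivity, and permittivity numbers are illustrative assumptions (roughly copper and SiO2), not process data:

  import math

  def rc_time_to_fraction(r_ohms: float, c_farads: float, frac: float = 0.5) -> float:
      """Time for an RC node to charge to a fraction of its final voltage:
      t = -R*C*ln(1 - frac); for frac = 0.5 this is ~0.69*R*C."""
      return -r_ohms * c_farads * math.log(1 - frac)

  def wire_rc(rho: float, eps: float, length: float,
              width: float, thickness: float, spacing: float):
      """Toy wire model (a sketch, not a real extractor):
      R = rho*L/(W*T), C ~ eps*L*T/spacing (coupling to one neighbor only)."""
      return rho * length / (width * thickness), eps * length * thickness / spacing

  # Shrink every lateral dimension by s = 0.7 (one generation): R grows by 1/s,
  # C shrinks by s, so the R*C product of a *local* wire stays roughly constant,
  # which is exactly the balance of + and - items in the list above.
  r0, c0 = wire_rc(1.7e-8, 3.5e-11, 1e-3,   1e-7,   2e-7,   1e-7)
  r1, c1 = wire_rc(1.7e-8, 3.5e-11, 0.7e-3, 0.7e-7, 1.4e-7, 0.7e-7)
  print(rc_time_to_fraction(r1, c1) / rc_time_to_fraction(r0, c0))  # ~1.0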
Improving RC Delay
• Exploit the good effects of scaling
• Fabrication technology improvements
  + Use copper instead of aluminum for wires (resistivity ρ↓ → Resistance↓)
  + Use lower-dielectric insulators (permittivity ε↓ → Capacitance↓)
  + Increase voltage
• Design implications
  + Use bigger cross-section wires (Area↑ → Resistance↓)
    • Typically means taller, otherwise fewer of them
    – Increases "surface area" and capacitance (Capacitance↑)
  + Use wider transistors (Area↑ → Resistance↓)
    – Increases capacitance (not for you, but for upstream transistors)
    – Use selectively

Another Constraint: Power and Energy
• Power (Watt, or Joule/second): a short-term (peak, max) concern
  • Mostly a dissipation (heat) concern
  • Power density (Watt/cm2): an important related metric
  – Thermal cycle: power dissipation↑ → power density↑ → temperature↑ → resistance↑ → power dissipation↑...
  • Cost (and form factor): packaging, heat sink, fan, etc.
• Energy (Joule): a long-term concern
  • Mostly a consumption concern
  • The primary issue is battery life (cost and weight of the battery, too)
  • Low power implies low energy, but not the other way around
• 10 years ago, nobody cared

Sources of Energy Consumption
[Figure: CMOS inverter showing capacitor-charging, short-circuit, subthreshold-leakage, and diode-leakage currents; from slides © Krste Asanović, MIT]
• Dynamic power
  • Capacitor charging (85-90% of active power)
    • Energy is ½CV² per transition
  • Short-circuit current (10-15% of active power)
    • When both the p and n transistors turn on during a signal transition
• Static power
  • Subthreshold leakage (dominates when inactive)
    • Transistors don't turn off completely
  • Diode leakage (negligible)
    • Parasitic source and drain diodes leak to the substrate

Moore's Effect on Power
• Scaling has largely good effects on local power
  + Shorter wires/smaller transistors (Length↓ → Capacitance↓)
  + Shorter transistor length (Resistance↓, Capacitance↓)
  – Global effects largely undone by increased transistor counts
• Scaling has a largely negative effect on power density
  + Transistor/wire power decreases linearly
  – Transistor/wire density increases quadratically
  – Power density increases linearly
    • Thermal cycle
• Controlled somewhat by reduced VDD (5 → 3.3 → 1.6 → 1.3 → 1.1)
  • Reduced VDD sacrifices some switching speed

Reducing Power
• Reduce the supply voltage (VDD), as the sketch below illustrates
  + Reduces dynamic power quadratically and static power linearly
  • But poses a tough choice regarding the threshold voltage VT
    – Constant VT slows circuit speed → clock frequency → performance
    – Reduced VT increases static power exponentially
• Reduce the clock frequency (f)
  + Reduces dynamic power linearly
  – Doesn't reduce static power
  – Reduces performance linearly
  • Generally doesn't make sense without also reducing VDD...
  • Except that frequency can be adjusted cycle-to-cycle and locally
  • More on this later
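A minimal Python sketch of the dynamic-power arithmetic above, P = a*C*V^2*f; the capacitance, voltage, and frequency values are hypothetical:

  def dynamic_power(c_farads: float, vdd: float, f_hz: float,
                    activity: float = 0.5) -> float:
      """Dynamic switching power, P = a * C * V^2 * f. Each transition costs
      ~1/2*C*V^2; 'activity' folds in that 1/2 and how often nodes switch
      (a sketch, not a sign-off power model)."""
      return activity * c_farads * vdd**2 * f_hz

  # Scaling V and f together compounds: ~25% lower VDD at 75% of the clock
  base   = dynamic_power(1e-9, 1.2, 2e9)    # hypothetical 1 nF, 1.2 V, 2 GHz
  scaled = dynamic_power(1e-9, 0.9, 1.5e9)
  print(scaled / base)                      # ~0.42: better than a 2x power saving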
Dynamic Voltage Scaling (DVS)
• Dynamic voltage scaling (DVS)
  • The OS reduces voltage/frequency when peak performance is not needed

              Mobile PentiumIII    TM5400               Intel X-Scale
              "SpeedStep"          "LongRun"            (StrongARM2)
  Frequency   300-1000 MHz         200-700 MHz          50-800 MHz
              (50 MHz steps)       (33 MHz steps)       (50 MHz steps)
  Voltage     0.9-1.7 V            1.1-1.6 V            0.7-1.65 V
              (0.1 V steps)        (continuous)         (continuous)
  High-speed  3400 MIPS @ 34 W     1600 MIPS @ 2 W      800 MIPS @ 0.9 W
  Low-power   1100 MIPS @ 4.5 W    300 MIPS @ 0.25 W    62 MIPS @ 0.01 W

± X-Scale is power efficient (6200 MIPS/W), but not IA32 compatible

Reducing Power: Processor Modes
• Modern electrical components have low-power modes
  • Note: no low-power disk mode; disks are magnetic (non-volatile)
• "Standby" mode
  • Turn off the internal clock
  • Leave the external signal controller and pins on
  • Restart the clock on an interrupt
  ± Cuts dynamic power linearly, doesn't affect static power
  • Laptops go into this mode between keystrokes
• "Sleep" mode
  • Flush caches; the OS may also flush DRAM to disk
  • Turn off the processor power plane
  – Needs a "hard" restart
  + Cuts dynamic and static power
  • Laptops go into this mode after ~10 idle minutes

Reliability
• Mean Time Between Failures (MTBF)
  • How long before you have to reboot or buy a new one
  • Not very quantitative yet; people are just starting to think about this
• CPU reliability is small in the grand scheme
  • Software is the most unreliable component in a system
    • Much more difficult to specify and test
    • Much more of it
  • The most unreliable hardware component... the disk
    • Subject to mechanical wear

Moore's Bad Effect on Reliability
• CMOS devices: CPU and memory
  • Historically almost perfectly reliable
  • Moore has made them less reliable over time
• Two sources of electrical faults
  • Energetic particle strikes (from the sun)
    • Randomly charge nodes, cause bits to flip; transient
  • Electro-migration: change in electrical interfaces/properties
    • Temperature-driven, happens gradually; permanent
• Large, high-energy transistors are immune to these effects
  – Scaling brings node energy closer to particle energy
  – Scaling increases power density, which increases temperature
• Memory (DRAM) was hit first: denser, smaller devices than SRAM

Moore's Good Effect on Reliability
• The key to providing reliability is redundancy
  • The same scaling that makes devices less reliable...
  • Also increases device density to enable redundancy
• Classic example
  • Error correcting codes (ECC) for DRAM (see the sketch below)
  • ECC is also starting to appear for caches
• More reliability techniques later
• Today's big open questions
  • Can we protect logic?
  • Can architectural techniques help hardware reliability?
  • Can architectural techniques help with software reliability?
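As an illustration of redundancy, here is a minimal Python sketch of a single-error-correcting Hamming(7,4) code (my example; real DRAM ECC uses wider SECDED codes, e.g., 72 bits protecting 64, but the principle is the same):

  def hamming74_encode(data):
      """4 data bits -> 7-bit codeword; parity bits sit at positions 1, 2, 4."""
      c = [0] * 8                        # index 0 unused; positions 1..7
      c[3], c[5], c[6], c[7] = data
      c[1] = c[3] ^ c[5] ^ c[7]          # covers positions with bit 0 set
      c[2] = c[3] ^ c[6] ^ c[7]          # covers positions with bit 1 set
      c[4] = c[5] ^ c[6] ^ c[7]          # covers positions with bit 2 set
      return c[1:]

  def hamming74_correct(codeword):
      """Recompute the parities; the syndrome is the position of a flipped bit."""
      c = [0] + list(codeword)
      syndrome = ((c[1] ^ c[3] ^ c[5] ^ c[7])
                  | (c[2] ^ c[3] ^ c[6] ^ c[7]) << 1
                  | (c[4] ^ c[5] ^ c[6] ^ c[7]) << 2)
      if syndrome:
          c[syndrome] ^= 1               # fix the single flipped bit
      return [c[3], c[5], c[6], c[7]]

  data = [1, 0, 1, 1]
  cw = hamming74_encode(data)
  cw[4] ^= 1                             # transient fault, e.g. a particle strike
  assert hamming74_correct(cw) == data   # the data still comes back intact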
Summary: A Global Look at Moore
• Device scaling (Moore's Law)
  + Increases performance
    • Reduces transistor/wire delay
    • Gives us more transistors with which to reduce CPI
  + Reduces local power consumption
    – Which is quickly undone by increased integration
  – Aggravates power-density and temperature problems
  – Aggravates the reliability problem
    + But gives us the transistors to solve it via redundancy
  + Reduces unit cost
    – But increases startup cost
• Will we fall off Moore's Cliff? (for real, this time?)
  • What's next: nanotubes, quantum dots, optics, spintronics, DNA?