Introduction to Computer Architecture, Study notes of Computer Architecture and Organization

This document gives an overview of computer architecture, including its definition, underlying technology, history, and design. It discusses the importance of computer architecture in understanding how computers work and in improving their performance, reliability, and security. It also includes a case study on deep learning and the impact of specialized architectures on the computational model, and is likely to be useful as study notes or a summary for university students in computer science or engineering courses.

Welcome to 15-740! (Fall '19, Nathan Beckmann)

Topics
• What is computer architecture?
• Underlying technology
• History (of x86)
• ISA design
• Information about the class

What is computer architecture?
• "The science and art of selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals." [Wikipedia]
• Computer architecture provides the abstractions that bridge the gap between applications and physics.

Responsive to technology
• Underlying components:
◦ Relays → Tubes → Transistors → VLSI → ??? Carbon nanotubes ???
◦ Mercury delay lines → Magnetic core → DRAM → Flash → ??? Resistive RAM ???
• What to optimize for: transistors, memory, instructions, performance, power, parallelism
• Technology is constantly changing!

Responsive to applications
• "500 million apps downloaded. And counting."

Moore's Law
• [Figure, repeated across three slides: "goodness" (log scale) growing over time.]
• [Figure: Moore's original 1965 prediction – log2 of the number of components per integrated function, plotted by year from 1959 to 1975.]
• Moore's law is really about economics. [Figure from Moore's 1965 paper: relative manufacturing cost per component vs. number of components per integrated circuit, with curves for 1962 and 1965.]

Technology changes architecture
• It isn't just transistor density:
◦ Transistor size, density, speed, power, cost
◦ Memory size, density, latency, throughput
◦ Disks
◦ Networks
◦ Communication
• These trends lead to an exponential increase in ops/sec-$-m³-watt
• Which in turn leads to changes in applications:
◦ Mainframes → Desktops → Mobile
• Which leads to new design goals

Why you should study computer architecture
• Understand how computers work
• It's not just how to build them:
◦ Why does my program run slowly?
◦ How do I increase performance?
◦ How do I improve reliability?
◦ Is my system secure?
◦ What can I expect tomorrow?
• We are at a crossroads…

Case study: Deep learning
• "Deep learning" (a.k.a. neural networks) is taking over the world… an old technique that had fallen out of favor for decades.
• What happened?
◦ 1) Big data – massive training datasets
◦ 2) GPUs – massive compute available for little $$
• Now, "neural accelerators" are the hottest topic in computer architecture
◦ E.g., ~one-third of papers at top architecture conferences since 2016
◦ Google, Apple, Microsoft, and hundreds of startups are building & deploying custom hardware
• Highly specialized architectures disrupt the computational model we've all grown up with

Moore's Law is ending
• [Figure: Ops/sec/$ from 1880 to ~2030, combining data from Hans Moravec, Larry Roberts, and Gordon Bell (word size × ops/s ÷ system price, from Gray's Turing Award lecture): doubling every 7.5 years in the mechanical/relay era, every 2.3 years in the tube/transistor era, and every 1.0 year in the CMOS era.]
• No… Moore's Law is ending. (Nanometer doubles every few months?)
History perspective
• Dramatic changes are coming for architecture and computing.

The microprocessor revolution
• Technology threshold crossed in the 1970s: enough transistors (~25K) to fit a 16-bit processor on one chip
• Huge performance advantages: fewer slow chip-crossings
• Even bigger cost advantages: one "stamped-out" component
• Created new applications:
◦ Desktops, CD/DVD players, laptops, game consoles, set-top boxes, mobile phones, digital cameras, MP3 players, GPS, automotive, …
• And replaced incumbents in existing segments:
◦ Supercomputers, "mainframes", "minicomputers", etc.

First microprocessor: Intel 4004 (1971)
◦ The first single-chip CPU!
◦ Application: calculators
◦ Technology: 10,000 nm
◦ 2,300 transistors in 13 mm²
◦ 740 kHz, 8 or 16 cycles/instruction
◦ Multiple cycles to transfer data
◦ 12 volts
◦ 640-byte address space
◦ 4-bit data
• Reminder: this looks silly today, but it is a miraculous machine by any historical standard.

Tracing the microprocessor revolution
• How were growing transistor counts used?
• Initially, to widen the datapath:
◦ 4004: 4 bits → Pentium 4: 64 bits
• And to add more powerful instructions:
◦ To amortize the overhead of fetch and decode
◦ To simplify programming (which was done by hand then)
◦ To reduce memory requirements for the program
◦ Could get absurd: e.g., the VAX "POLY" instruction

Nearing the end of uniprocessors: Intel Pentium 4 (2003)
◦ Application: desktop/server
◦ Technology: 90 nm (1% of the 4004)
◦ 55M transistors (20,000×)
◦ 101 mm² (10×)
◦ 3.4 GHz (10,000×)
◦ 3 instructions/cycle (superscalar)
◦ 1.2 volts (1/10×)
◦ 32/64-bit data (16×)
◦ 22-stage pipelined datapath
◦ Two levels of on-chip cache
◦ Data-parallel "vector" (SIMD) instructions, hyperthreading

Explicit parallelism
• Then, to support explicit data- and thread-level parallelism:
◦ Hardware provides parallel resources, software specifies usage
◦ Why? Diminishing returns on instruction-level parallelism
• First using (sub-word) vector instructions…
◦ E.g., in Intel's SSE, one instruction does four parallel multiplies (see the short SSE example below)
• …then adding support for multi-threaded programs…
◦ Coherent caches, hardware synchronization primitives
• New architectures, e.g., programmable GPUs
◦ Some attempts at convergence between CPUs and GPUs (e.g., Intel's Xeon Phi)

Multicore: Intel Core i7 (2013)
◦ Application: desktop/server
◦ Technology: 22 nm (25% of the P4)
◦ 1.4B transistors (30×)
◦ 177 mm² (2×)
◦ 3.5 GHz to 3.9 GHz (~1×)
◦ 1.8 volts (~1×)
◦ 256-bit data (2×)
◦ 14-stage pipelined datapath (0.5×)
◦ 4 instructions per cycle (~1×)
◦ Three levels of on-chip cache (1.5×)
◦ Data-parallel "vector" (SIMD) instructions, hyperthreading
◦ Four-core multicore (4×) ???
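The SSE point above is easy to see concretely. Below is a minimal sketch (my addition, not lecture material) of sub-word SIMD in C: a single multiply intrinsic operates on four packed single-precision floats at once. It assumes an x86 compiler with SSE enabled (e.g., gcc -msse).

    /* Minimal SIMD sketch (illustrative): one SSE multiply does four
     * floating-point multiplies in a single instruction. */
    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
        float c[4];

        __m128 va = _mm_loadu_ps(a);     /* load four floats */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_mul_ps(va, vb);  /* four multiplies, one instruction */
        _mm_storeu_ps(c, vc);

        for (int i = 0; i < 4; i++)
            printf("%.1f ", c[i]);       /* prints 10.0 40.0 90.0 160.0 */
        printf("\n");
        return 0;
    }

Modern compilers will often auto-vectorize simple loops into exactly this kind of code, which is one reason ISAs grew SSE/AVX-style extensions in the first place.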
System-on-chip: Qualcomm Snapdragon 835
◦ Application: mobile
◦ Technology: 10 nm
• ARM CPUs – heterogeneous "big.LITTLE" design
◦ 4 "performance" cores – 2.45 GHz, 2 MB L2 cache
◦ 4 "efficiency" cores – 1.9 GHz, 1 MB L2 cache
◦ "Performance" cores are 20% faster; "efficiency" cores are used 80% of the time
• Graphics processing unit (GPU)
◦ ~650 MHz
◦ Specialized floating-point datapath, e.g., for interpolation of textures
◦ Data-parallel: 16 pixels/clock
◦ Processor dynamically finds & schedules work ("warp scheduling")
• Digital signal processor (DSP)
◦ Data-parallel SIMD architecture with 4 instructions/cycle
◦ No floating point
◦ Compiler statically schedules parallelism ("VLIW")
• Other custom accelerators (camera, modem, etc.)
(*The slide's die shot is of the Snapdragon 820 – the only one I could find.)

Architectures today
• Multicore CPUs (e.g., Intel Xeon)
◦ Traditional hard-to-parallelize code – web serving
◦ Renewed focus on CPU microarchitecture – sequential performance still matters!
• GPUs (e.g., Nvidia)
◦ "Embarrassingly parallel" code – science, graphics, DNNs
◦ Increasing programmability, converging towards a traditional vector design
• Systems-on-chip & domain-specialized accelerators
◦ Energy efficiency – embedded, mobile, (datacenter – Google's TPU???)
◦ Lots of open questions: How many accelerators do we need? Which ones? How specialized should they be?

Computer architecture in broad strokes

Broad strokes: Processing vs. memory
• Computer scientists make a fundamental distinction between processing and memory.
• This makes sense, but it is a choice (contrast with, e.g., neural networks).
• Historically, computer science focuses on processing (also true of architects).
• But increasingly, memory/communication is the primary challenge:
◦ Increasing data sizes + technology ➔ data is increasingly expensive
◦ Consistency of parallel updates to data
◦ Compute is easier to specialize
◦ Some recent designs attempt to eliminate this dichotomy ("processing in memory")
(See the small cache-traversal example at the end of this section for how much data movement alone can matter.)

Abstraction layers in modern systems
• From top to bottom: Application, Algorithm, Programming Language, Operating System/Virtual Machine, Instruction Set Architecture (ISA), Microarchitecture, Gates/Register-Transfer Level (RTL), Circuits, Devices, Physics.
• [Figure annotations mark the shifting scope of "architecture" in the '50s, the '90s, and the future.]

Broad strokes: Erosion of familiar abstractions
• During the mid '80s to early '00s, processors got steadily faster each year ➔ most computer scientists learned it was safe to ignore computer architecture.
• SURPRISE! Technology scaling ends: first Dennard scaling in the early '00s, now Moore's Law in the next ~five years.
• Architectural limits: pipelining + ILP give diminishing returns (more on this later).
• ➔ Software must change:
◦ 2000-2010: multicore, GPGPU
◦ 2010-now: accelerators
• Software folks must pay more attention to hardware! (Even if you don't want to… CPUs have stopped scaling, but applications haven't.)
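To make the processing-vs.-memory point concrete, here is a minimal sketch (my own illustration, not part of the lecture). Both functions do identical arithmetic, but one walks the array in the order it is laid out in memory while the other strides across it, so the second typically runs several times slower purely because of data movement. The array size N is an arbitrary assumption chosen to exceed typical cache sizes.

    /* Cache-behavior sketch (illustrative): same work, very different
     * memory traffic. */
    #include <stdio.h>
    #include <time.h>

    #define N 2048                       /* 2048*2048 doubles = 32 MB */
    static double a[N][N];

    static double sum_row_major(void) {  /* contiguous accesses */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    static double sum_col_major(void) {  /* strided accesses -> many cache misses */
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;

        clock_t t0 = clock();
        double s1 = sum_row_major();
        clock_t t1 = clock();
        double s2 = sum_col_major();
        clock_t t2 = clock();

        printf("row-major:    sum=%.0f  %.3f s\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("column-major: sum=%.0f  %.3f s\n", s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }

The instruction count is essentially the same in both cases; only the memory system sees a difference, which is exactly why architects increasingly treat data movement, not compute, as the scarce resource.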
Abstraction & your program
• High-level language
◦ Level of abstraction closer to the problem domain
◦ Provides productivity and portability
• Assembly language
◦ Textual representation of instructions (ISA)
• Hardware representation
◦ Binary representation of instructions (ISA)
• (All of these sit above the microarchitecture, which actually executes the binary.)

Instruction set architecture (ISA)
• The ISA defines the functional contract between the software and the hardware.
• The ISA is an abstraction that hides details of the implementation from the software.
• ➔ The ISA is a functional abstraction of the processor (a "mental model"):
◦ What operations can be performed
◦ How to name storage locations
◦ The format (bit pattern) of the instructions
• The ISA typically does NOT define:
◦ Timing of the operations
◦ Power used by operations
◦ How operations/storage are implemented
• If timing leaks information, is it Intel's fault? No, the abstraction is broken.

What goes into an ISA?
• Operands
◦ How many? What kind? Addressing mechanisms
• Operations
◦ What kind? How many?
• Format/encoding
◦ Length(s) of the bit pattern; which bits mean what

Operands per instruction
• Stack, 0-address: "add" means push(pop() + pop())
• Accumulator, 1-address: "add A" means Acc ← Acc + mem[A]
• Register-Memory, 2-address: "add R1, A" means R1 ← R1 + mem[A]
• Register-Memory, 3-address: "add R1, R2, A" means R1 ← R2 + mem[A]
• Load-Store, 3-address: "add R1, R2, R3" means R1 ← R2 + R3
• Load-Store, 2-address: "load R1, R2" means R1 ← mem[R2]; "store R1, R2" means mem[R1] ← R2

Examples: code for A = X*Y – B*C
• Memory layout: X, Y, B, C, A at SP, +4, +8, +12, +16.
• Stack (note that the SP-relative offsets shift as values are pushed):
    push 8(SP)
    push 16(SP)
    mult
    push 4(SP)
    push 12(SP)
    mult
    sub
    st 20(SP)
    pop
• Accumulator:
    ld 8(SP)
    mult 12(SP)
    st 20(SP)
    ld 4(SP)
    mult 0(SP)
    sub 20(SP)
    st 16(SP)

Machine model tradeoffs
• Stack and Accumulator:
◦ Each instruction encoding is short
◦ Instruction count (IC) is high
◦ Very simple exposed architecture
• Register-Memory:
◦ Instruction encoding is much longer
◦ More work per instruction
◦ IC is low
◦ Architectural state is more complex
• Load/Store:
◦ Medium encoding length
◦ Less work per instruction
◦ IC is high
◦ Architectural state is more complex

ISA design goals / code generation
• Ease of programming (software perspective)
• Ease of implementation (hardware perspective)
• Good performance
• Compatibility
• Completeness (e.g., Turing-complete)
• Compactness – reduce program size
• Scalability / extensibility
• Features: support for the OS, parallelism, …
• Etc.

Ease of programming
• The ISA should make it easy to express programs and easy to create efficient programs.
• Who is creating the programs?
◦ Early days: humans. Why?

"The Iron Law of Performance"
• CPU Time = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
• What determines each factor? How does the ISA impact each?
• Instructions / program = dynamic instruction count (not code size)
◦ Determined by the program, compiler, and ISA
• Cycles / instruction (CPI)
◦ Determined by the ISA, microarchitecture, program, and compiler
• Seconds / cycle (critical path)
◦ Determined by the microarchitecture and technology

Cycles per instruction (CPI)
• Different instruction classes take different numbers of cycles.
• In fact, even the same instruction can take a different number of cycles. (Example?)
• When we say CPI, we really mean weighted CPI:
◦ CPI = Total clock cycles / Instruction count = Σᵢ (CPIᵢ × Instruction countᵢ) / Instruction count

How to improve performance
• CPU Time = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
1. Reduce instruction count
2. Reduce cycles per instruction
3. Reduce cycle time
• But there is a tension between these… (a small worked example of the Iron Law follows below)
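A small worked example of the Iron Law and weighted CPI, written as a C program so the arithmetic is explicit. The instruction mix, per-class CPIs, and clock rate are made-up numbers for illustration, not measurements from the lecture.

    /* Iron Law / weighted CPI worked example (illustrative numbers). */
    #include <stdio.h>

    int main(void) {
        /* Hypothetical dynamic instruction mix for one program run. */
        double count[] = { 50e6, 30e6, 20e6 };  /* ALU, load/store, branch */
        double cpi[]   = {  1.0,  2.0,  1.5 };  /* cycles per class */
        double clock_hz = 2e9;                  /* 2 GHz -> 0.5 ns per cycle */

        double insns = 0.0, cycles = 0.0;
        for (int i = 0; i < 3; i++) {
            insns  += count[i];
            cycles += count[i] * cpi[i];
        }

        double weighted_cpi = cycles / insns;           /* 140e6 / 100e6 = 1.40 */
        double cpu_time = insns * weighted_cpi / clock_hz;

        printf("weighted CPI = %.2f\n", weighted_cpi);  /* 1.40 */
        printf("CPU time     = %.3f s\n", cpu_time);    /* 100e6 * 1.40 / 2e9 = 0.070 s */
        return 0;
    }

Halving the instruction count, halving the CPI, or doubling the clock each cuts CPU time in half in isolation; the tension is that an ISA or microarchitecture change rarely moves one factor without pushing on the others.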
Compatibility
• "Between 1970 and 1985 many thought the primary job of the computer architect was the design of instruction sets. … The educated architect was expected to have strong opinions about the strengths and especially the weaknesses of the popular computers. The importance of binary compatibility in quashing innovation in instruction set design was unappreciated by many researchers and textbook writers, giving the impression that many architects would get a chance to design an instruction set." - H&P, Appendix A
• The ISA separates interface from implementation, so many different implementations are possible:
◦ The IBM/360 was the first to do this, introducing 7 different machines all with the same ISA
◦ Intel from the 8086 → Core i7 → Xeon Phi → ?
◦ The ARM ISA from mobile → server
• Compatibility protects software investment.
• It is important to decide what should be exposed and what should be kept hidden.
◦ E.g., MIPS "branch delay slots"

RISC vs. CISC
• [Excerpt from the Intel manual: the dozens of encodings of the x86 ADD instruction (ADD AL, imm8; ADD r/m32, imm32; ADD r64, r/m64; and so on, each with its opcode, operand form, and 64-bit/legacy-mode validity), illustrating how much complexity a single CISC instruction's definition carries.]

How did RISC happen?
• Pre-1980:
◦ Lots of hand-written assembly
◦ Compiler technology in its infancy
◦ Multi-chip implementations
◦ Small memories at ~CPU speed
• Early '80s:
◦ VLSI makes a single-chip processor possible (but only if it is very simple)
◦ Compiler technology improving
• RISC goals:
◦ Enable a single-chip CPU
◦ Rely on the compiler
◦ Aim for high frequency & low CPI

Schools of ISA design & performance
• CPU Time = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
• Complex instruction set computer (CISC):
◦ Complex instructions ➔ lots of work per instruction ➔ fewer instructions per program
◦ But… more cycles per instruction & a longer clock period
◦ Modern microarchitecture gets around most of this!
• Reduced instruction set computer (RISC):
◦ Fine-grain instructions ➔ less work per instruction ➔ more instructions per program
◦ But… fewer cycles per instruction & a shorter clock period
◦ Heavy reliance on the compiler to "do the right thing"

The case for RISC
• CISC is fundamentally handicapped.
• At any given technology, the RISC implementation will be faster:
◦ Current technology enables single-chip RISC
◦ When it enables single-chip CISC, RISC will be pipelined
◦ When it enables pipelined CISC, RISC will have caches
◦ When it enables CISC with caches, RISC will have ...
• ➔ RISC will always be one step ahead of CISC!

What actually happened?
• Pre-1980:
◦ Lots of hand-written assembly
◦ Compiler technology in its infancy
◦ Multi-chip implementations
◦ Small memories at ~CPU speed
• Early '80s:
◦ VLSI makes a single-chip processor possible (but only if it is very simple)
◦ Compiler technology improving
• Late '90s:
◦ CPU speed vastly faster than memory speed
◦ More transistors make ops possible

Potential micro-op scheme
• Most instructions are a single micro-op
◦ Add, xor, compare, branch, etc.
◦ Load example: mov -4(%rax), %ebx
◦ Store example: mov %ebx, -4(%rax)
• Each memory access adds a micro-op
◦ "addl -4(%rax), %ebx" is two micro-ops (load, add)
◦ "addl %ebx, -4(%rax)" is three micro-ops (load, add, store)
• Function call (CALL) – 4 µops
◦ Get the program counter, store the program counter to the stack, adjust the stack pointer, unconditional jump to the function start
• Return from function (RET) – 3 µops
◦ Adjust the stack pointer, load the return address from the stack, jump register
• (Again, just a basic idea; micro-ops are specific to each chip.)

More about micro-ops
• Two forms of µop "cracking":
◦ Hard-coded logic: fast, but complex (for instructions that crack into a few µops)
◦ Table: slow, but "off to the side", doesn't complicate the rest of the machine; handles the really complicated instructions
• Core precept of architecture: make the common case fast, make the rare case correct.
(A toy sketch of the counting rule above follows below.)
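Here is a toy sketch (my own illustration; real decoders are chip-specific and far more involved) of the micro-op counting rule above, just to make the load/add/store decomposition concrete:

    /* Toy micro-op counting sketch (illustrative only). */
    #include <stdio.h>

    enum mem_form { MEM_NONE, MEM_SRC, MEM_DST_RMW };

    /* One micro-op for the ALU work, plus one for a load, plus one more
     * for the store of a read-modify-write memory destination. */
    static int uops_for_alu_op(enum mem_form form) {
        switch (form) {
        case MEM_SRC:     return 2;  /* addl -4(%rax), %ebx : load, add        */
        case MEM_DST_RMW: return 3;  /* addl %ebx, -4(%rax) : load, add, store */
        default:          return 1;  /* addl %ecx, %ebx     : add              */
        }
    }

    int main(void) {
        printf("reg-reg add:         %d uops\n", uops_for_alu_op(MEM_NONE));
        printf("mem-source add:      %d uops\n", uops_for_alu_op(MEM_SRC));
        printf("mem-destination add: %d uops\n", uops_for_alu_op(MEM_DST_RMW));
        /* Truly complex instructions (CALL, RET, string ops, ...) would come
         * from a separate microcode table, as the notes describe. */
        return 0;
    }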
Redux: Are ISAs important?
• Does the "quality" of an ISA actually matter?
◦ Not for performance (mostly); it mostly shows up as a design complexity issue
◦ Instructions/program: everything is compiled, and compilers are good
◦ Cycles/instruction and seconds/cycle: ISA, plus many other tricks
◦ What about power efficiency? Somewhat… RISC is the most power-efficient today
• Does the "nastiness" of an ISA matter?
◦ Mostly no; only compiler writers and hardware designers see it
• Even compatibility is not what it used to be
◦ Software emulation, cloud services
◦ Open question: will "ARM compatibility" be the next x86?

Lectures
• Please come and participate! (10% of grade)
• The lecture schedule and slides are online.
• Until Exam 1: memory hierarchy & parallelism
• After Exam 1: microarchitecture & recent research topics

Labs
• 2 labs early in the semester (5% of grade each)
• Work in groups of 2-3
• Goals:
◦ Become familiar with some tools
◦ Understand performance measurement
◦ Understand optimization, a.k.a. how architecture affects use
• First lab out this week! (More info next time.)

Paper readings, discussion & reviews
• We are trying something new ➔ feedback is welcome, and we'll respond to it!
• Each lecture has an associated paper.
• Reviews:
◦ Before each discussion, pick two papers from the intervening lectures to review
◦ Max half-page summary + discussion of the paper: What's the main idea? What problem is it solving? How does it solve it? How does it evaluate the solution?
◦ Include 3 questions you would ask the authors
◦ 10% of grade – more importantly, an essential skill
• Discussion:
◦ You will each present once per semester (~20 total)
◦ The instructor (that's me) will then lead an open discussion of the paper