
Computer Architecture: Vector Processing and Vector Units, Slides of Electronics engineering

An in-depth review of vector processing, its advantages over superscalar processors, and the implementation of vector units in various systems. Topics include vector length, vector execution time, vector load/store units & memories, vector stride, and common vector metrics. Real-world examples and metrics are also included.

Typology: Slides

2012/2013

Uploaded on 03/23/2013

dhrupad 🇮🇳


Graduate Computer Architecture
Vectors, Branch Prediction, Dependence Speculation, and Data Prediction

Review: Alternative Model: Vector Processing
• Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
• SCALAR (1 operation): add r3, r1, r2 — adds the contents of two registers
• VECTOR (N operations): add.vv v3, v1, v2 — adds two vector registers element by element, over the vector length

Playstation 2000 Continued
• Sample Vector Unit
  – 2-wide VLIW
  – Includes microcode memory
  – High-level instructions like matrix-multiply
• Emotion Engine:
  – Superscalar MIPS core
  – Vector coprocessor pipelines
  – RAMBUS DRAM interface

Virtual Processor Vector Model
• Vector operations are SIMD (single instruction, multiple data) operations
• Each element is computed by a virtual processor (VP)
• Number of VPs given by vector length – vector control register

Vector Architectural State
• Virtual processors VP0 … VP$vlr-1 (number given by $vlr)
• General-purpose vector registers vr0–vr31, $vdw bits per element
• Flag registers vf0–vf31 (32), 1 bit per element
• Control registers vcr0–vcr31, 32 bits each

Vector Execution Time
• Time = f(vector length, data dependencies, structural hazards)
• Initiation rate: rate at which the FU consumes vector elements (= number of lanes; usually 1 or 2 on Cray T-90)
• Convoy: set of vector instructions that can begin execution in the same clock (no structural or data hazards)
• Chime: approximate time for a vector operation
• m convoys take m chimes; if each vector length is n, then they take approximately m × n clock cycles (ignores overhead; a good approximation for long vectors)
• Example: 4 convoys, 1 lane, VL = 64 ⇒ 4 × 64 = 256 clocks (or 4 clocks per result)
  1: LV V1,Rx        ;load vector X
  2: MULV V2,F0,V1   ;vector-scalar mult.
     LV V3,Ry        ;load vector Y
  3: ADDV V4,V2,V3   ;add
  4: SV Ry,V4        ;store the result

DLXV Start-up Time
• Start-up time: pipeline latency (depth of FU pipeline); another source of overhead
• Start-up penalties (from CRAY-1):

  Operation          Start-up penalty
  Vector load/store  12
  Vector multiply     7
  Vector add          6

• Assume convoys don't overlap; vector length = n:

  Convoy       Start    1st result  Last result
  1. LV        0        12          11+n (i.e., 12+n-1)
  2. MULV, LV  12+n     12+n+7      18+2n   (multiply start-up)
               12+n+1   12+n+13     24+2n   (load start-up)
  3. ADDV      25+2n    25+2n+6     30+3n   (wait for convoy 2)

Why startup time for each vector instruction?
• Why not overlap the startup time of back-to-back vector instructions?
• Cray machines were built from many ECL chips operating at high clock rates; hard to do?
• Berkeley vector design ("T0") didn't know it wasn't supposed to do overlap, so it has no startup times for functional units (except load)

Strip Mining
• Suppose Vector Length > Maximum Vector Length (MVL)?
• Strip mining: generation of code such that each vector operation is done for a size ≤ the MVL
• 1st loop does the short piece (n mod MVL); the rest use VL = MVL:

      low = 1
      VL = (n mod MVL)            /*find the odd-size piece*/
      do 1 j = 0,(n / MVL)        /*outer loop*/
         do 10 i = low, low+VL-1  /*runs for length VL*/
10          Y(i) = a*X(i) + Y(i)  /*main operation*/
         low = low + VL           /*start of next strip*/
         VL = MVL                 /*reset length to maximum*/
1     continue

Common Vector Metrics
• R∞: MFLOPS rate on an infinite-length vector
  – vector "speed of light"
  – Real problems do not have unlimited vector lengths, and the start-up penalties encountered in real problems will be larger
  – (Rn is the MFLOPS rate for a vector of length n)
• N1/2: the vector length needed to reach one-half of R∞
  – a good measure of the impact of start-up
• Nv: the vector length needed to make vector mode faster than scalar mode

Vector Stride
• Suppose adjacent elements are not sequential in memory:

      do 10 i = 1,100
         do 10 j = 1,100
            A(i,j) = 0.0
            do 10 k = 1,100
10             A(i,j) = A(i,j)+B(i,k)*C(k,j)

• Either the B or the C accesses are not adjacent (800 bytes between them)
• Stride: distance separating elements that are to be merged into a single vector (caches do unit stride)

Vector Opt #2:
Conditional Execution
• Suppose:

      do 100 i = 1, 64
         if (A(i) .ne. 0) then
            A(i) = A(i) - B(i)
         endif
100   continue

• Vector-mask control takes a Boolean vector: when the vector-mask register is loaded from a vector test, vector instructions operate only on vector elements whose corresponding entries in the vector-mask register are 1

Vector Opt #3: Sparse Matrices
• Suppose:

      do 100 i = 1,n
100      A(K(i)) = A(K(i)) + C(M(i))

• The gather (LVI) operation takes an index vector and fetches data from each address in the index vector
  – This produces a "dense" vector in the vector registers
• After these elements are operated on in dense form, the sparse vector can be stored back in expanded form by a scatter store (SVI), using the same index vector

Sparse Matrix Example
• Cache (1993) vs. Vector (1988):

                     IBM RS6000   Cray YMP
  Clock              72 MHz       167 MHz
  Cache              256 KB       0.25 KB
  Linpack            140 MFLOPS   160 (1.1)
  Sparse Matrix      17 MFLOPS    125 (7.3)
  (Cholesky Blocked)

• Cache: 1 address per cache block (32B to 64B)

MMX Instructions
• Move 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optional signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b

Vectors and Variable Data Width
• Programmer thinks in terms of vectors of data of some width (8, 16, 32, or 64 bits)
• Good for multimedia; more elegant than MMX-style extensions
• Don't have to worry about how data is stored in hardware
  – No need for explicit pack/unpack operations
• Just think of more virtual processors operating on narrow data
• Expand Maximum Vector Length when operating on narrower data

Mediaprocessing: Vectorizable? Vector Lengths?
  Kernel                              Vector length
  Matrix transpose/multiply           # vertices at once
  DCT (video, communication)          image width
  FFT (audio)                         256-1024
  Motion estimation (video)           image width, iw/16
  Gamma correction (video)            image width
  Haar transform (media mining)       image width
  Median filter (image processing)    image width
  Separable convolution (img. proc.)  image width

(from Pradeep Dubey - IBM, http://www.research.ibm.com/people/p/pradeep/tutor.html)

Vector Advantages
• Easy to get high performance; N operations:
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in the same order as previous instructions
  – access contiguous memory words or a known pattern
  – can exploit large memory bandwidth
  – hide memory latency (and any other latency)
• Scalable: get higher performance by adding HW resources
• Compact: describe N operations with 1 short instruction
• Predictable: performance vs. statistical performance (cache)
• Multimedia ready: N × 64b, 2N × 32b, 4N × 16b, 8N × 8b
• Mature, developed compiler technology
• Vector disadvantage: out of fashion?
  – Hard to say. Many irregular loop structures seem to still be hard to vectorize automatically.
  – Theory of some researchers that the SIMD model has great potential.

Vector Processing and Power
• If code is vectorizable, then simple hardware is more energy efficient than out-of-order machines
• Can decrease power by lowering frequency so that voltage can be lowered, then duplicating hardware to make up for the slower clock:

  Power ∝ C·V²·f
  Performance constant: f = f0/n,  Lanes = n × Lanes0,  V = δ·V0  (δ < 1)
  ⇒ Power changes by the factor δ² < 1

CS252 Administrivia
• Select project by next Friday (we will look at some options later in the lecture)
  – Need to have a partner for this. News group/email list?
  – Web site (as we shall see) has a number of suggestions
  – I am certainly open to other suggestions
    • make one project fit two classes?
    • Something close to your research?

Need Address at Same Time as Prediction
• Branch Target Buffer (BTB): address of the branch indexes the buffer to get the prediction AND the branch address (if taken)
  – Note: must check for a branch match now, since we can't use the wrong branch address (Figure 4.22, p. 273)
• [Figure: at FETCH, the PC of the instruction is compared (=?) against the stored branch PC; on a match the BTB supplies the predicted PC and predicts taken or untaken]

Dynamic Branch Prediction
• Performance = f(accuracy, cost of misprediction)
• Branch History Table: lower bits of the PC address index a table of 1-bit values
  – Says whether or not the branch was taken last time
  – No address check
• Problem: in a loop, a 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):
  – End-of-loop case, when it exits instead of looping as before

Dynamic Branch Prediction (Jim Smith, 1981)
• Solution: 2-bit scheme where the prediction changes only after two mispredictions in a row (Figure 4.13, p. 264)
• [Figure: four-state transition diagram — two Predict Taken states and two Predict Not Taken states, with T/NT edges between them]

Correlating Branches
• (2,2) GAs predictor
  – First 2 means that we keep two bits of history
  – Second means that we have 2-bit counters in each slot
  – Then the behavior of recent branches selects between, say, four predictions of the next branch
• For instance, consider a global-history, set-indexed BHT. That gives us a GAs history table. Each slot is a 2-bit counter
• [Figure: the branch address plus a 2-bit global branch history register index the 2-bits-per-branch predictors to produce the prediction]

Accuracy of Different Schemes (Figure 4.21, p. 272)
• [Chart: frequency of mispredictions (0%–18%) on nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li for three schemes — 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 (2,2) entries; the 4,096-entry 2-bit rates are 0%, 1%, 5%, 6%, 6%, 11%, 4%, 6%, 5%, and 1% respectively]

Re-evaluating Correlation
• Several of the SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches:

  program    branch %   static #   # = 90%
  compress   14%        236        13
  eqntott    25%        494        5
  gcc        15%        9531       2020
  mpeg       10%        5598       532
  real gcc   13%        17361      3214

• Real programs + OS are more like gcc

Discussion of Young/Smith paper

Discussion of Store Sets
• Design problem: improve the answer

CS252 Projects
• DynaCOMP related (or Introspective Computing)
• OceanStore related
• IRAM project related
• BRASS project related
• Industry suggested/MISC

OceanStore Vision

Ubiquitous Devices ⇒ Ubiquitous Storage
• Consumers of data move, change from one device to another, work in cafes, cars, airplanes, the office, etc.
• Properties REQUIRED for Endeavour storage substrate:
  – Strong security: data must be encrypted whenever in the infrastructure; resistance to monitoring
  – Coherence: too much data for naïve users to keep coherent "by hand"
  – Automatic replica management and optimization: huge quantities of data cannot be managed manually

OceanStore: Utility-based Infrastructure
• Service provided by a confederation of companies
  – Monthly fee paid to one service provider
  – Companies buy and sell capacity from each other
• [Figure: providers such as Pac Bell, Sprint, IBM, AT&T, and a Canadian OceanStore interconnected as one utility]

Intelligent PDA (2003?)
• Pilot PDA (todo, calendar, calculator, addresses, ...)
  + Gameboy (Tetris, ...)
  + Nikon Coolpix (camera)
  + Cell phone, pager, GPS, tape recorder, TV remote, AM/FM radio, garage door
• Speech control of all devices
• Vision to see surroundings, scan documents, read bar codes, measure the room

V-IRAM-2: 0.13 µm, Fast Logic, 1 GHz
• 16 GFLOPS (64b) / 64 GOPS (16b) / 128 MB
• [Block diagram: a 2-way superscalar processor with 8K I-cache, 8K D-cache, and instruction queue feeding a vector unit with vector registers and +, ×, ÷, and load/store pipes (each 8 × 64, or 16 × 32, or 32 × 16); a memory crossbar switch connects to the DRAM banks and serial I/O]

Tentative VIRAM-1 Floorplan
• 0.18 µm DRAM: 32 MB in 16 banks × 256b, 128 subbanks (2 blocks of 128 Mbits / 16 MBytes)
• 0.25 µm, 5-metal logic
• 200 MHz MIPS core, 16K I$, 16K D$
• 4 200-MHz FP/integer vector units (4 vector pipes/lanes), ring-based switch, I/O
• die: 16×16 mm; transistors: 270M; power: 2 Watts

SCORE: Stream-oriented computation model
• Goal: provide a view of reconfigurable hardware which exposes its strengths while abstracting physical resources.
• Computations are expressed as data-flow graphs.
• Graphs are broken up into compute pages.
• Compute pages are linked together in a data-flow manner with streams.
• A run-time manager allocates and schedules pages for computations and memory.

Summary #1
• Vector model accommodates long memory latency; doesn't rely on caches as do out-of-order superscalar/VLIW designs
• Much easier for hardware: more powerful instructions, more predictable memory accesses, fewer hazards, fewer branches, fewer mispredicted branches, ...
• What % of computation is vectorizable?
• Is vector a good match to new apps such as multimedia and DSP?

Summary #2: Dynamic Branch Prediction
• Prediction is becoming an important part of scalar execution.
  – Prediction is exploiting "information compressibility" in execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch.
  – Either different branches (GA)
  – Or different executions of the same branches (PA).
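The convoy/chime arithmetic from the Vector Execution Time slide (m convoys over length-n vectors ≈ m × n clocks, plus per-convoy start-up) can be sketched as a small calculator. This is an illustrative Python model, not anything from the lecture: the start-up penalties are the CRAY-1 numbers quoted on the DLXV slide, and the assumption that convoys don't overlap is the slide's.

```python
# Chime-model estimate of vector execution time: m convoys over vectors of
# length n take about m*n clocks; adding each convoy's worst start-up
# latency gives the DLXV-style estimate (convoys assumed not to overlap).
STARTUP = {"LV": 12, "SV": 12, "MULV": 7, "ADDV": 6}  # CRAY-1 penalties

def chime_estimate(num_convoys, vlen):
    """Pure chime approximation: ignores all start-up overhead."""
    return num_convoys * vlen

def with_startup(convoys, vlen):
    """Add the slowest start-up penalty of each convoy to the chime time."""
    time = 0
    for instrs in convoys:
        time += max(STARTUP[op] for op in instrs) + vlen
    return time

# The 4-convoy sequence from the slides: LV | MULV+LV | ADDV | SV
convoys = [["LV"], ["MULV", "LV"], ["ADDV"], ["SV"]]
print(chime_estimate(4, 64))   # 256 clocks, i.e. 4 clocks per result
print(with_startup(convoys, 64))
```

With VL = 64 the pure chime estimate is the slide's 256 clocks; under these simplifying assumptions, start-up adds 42 more.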
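The strip-mining slide's loop structure — an odd-size first piece of length n mod MVL, then full-MVL strips — can be sketched in Python. The helper name is made up for illustration, and the strip bounds are 0-indexed rather than Fortran's 1-indexed:

```python
# Strip mining: process a vector of arbitrary length n with vector
# operations no longer than MVL. The first strip handles the odd-size
# piece (n mod MVL); every later strip is exactly MVL long.
def strip_mine(n, mvl):
    """Return the (start, length) pair of each strip, 0-indexed."""
    strips = []
    low = 0
    vl = n % mvl or mvl      # odd-size piece first (or a full strip if none)
    while low < n:
        strips.append((low, vl))
        low += vl
        vl = mvl             # all remaining strips use the maximum length
    return strips

print(strip_mine(200, 64))   # [(0, 8), (8, 64), (72, 64), (136, 64)]
```

Note how the strips cover all n elements exactly once, and no strip exceeds MVL — the two invariants the strip-mined Fortran loop maintains.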
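The 800-byte figure on the Vector Stride slide falls out of column-major (Fortran) storage of 100×100 double-precision arrays: stepping k in B(i,k) skips a whole 100-element column each time. A small sketch, assuming 8-byte elements and a hypothetical base address of 0:

```python
# Vector stride: in the loop A(i,j) += B(i,k)*C(k,j), the k index walks B
# along a row. Fortran stores arrays column-major, so consecutive B(i,k)
# elements are a full column apart: 100 doubles = 800 bytes.
def fortran_address(base, i, j, rows, elem_size=8):
    """Byte address of 1-indexed element (i,j) in a column-major array."""
    return base + ((j - 1) * rows + (i - 1)) * elem_size

# Addresses of B(1,1), B(1,2), B(1,3): an 800-byte stride between them.
addrs = [fortran_address(0, 1, k, rows=100) for k in (1, 2, 3)]
print(addrs)                  # [0, 800, 1600]
print(addrs[1] - addrs[0])    # the 800-byte stride from the slide
```

By contrast, walking down a column (varying i) gives unit stride of 8 bytes, which is why caches handle only one of the two access patterns well.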
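The vector-mask semantics from the Conditional Execution slide can be mimicked in plain Python: a vector test builds a Boolean mask, and the masked operation updates only elements whose mask bit is 1. The helper names here are made up for illustration:

```python
# Vector-mask control: a vector compare produces a Boolean mask; a masked
# vector subtract then updates only the elements whose mask bit is 1 --
# the software analogue of the slide's "if (A(i) .ne. 0)" loop.
def vector_mask(vec, pred):
    """Build a 0/1 mask from a per-element test."""
    return [1 if pred(x) else 0 for x in vec]

def masked_sub(a, b, mask):
    """Element-wise a - b wherever mask is 1; elements untouched elsewhere."""
    return [x - y if m else x for x, y, m in zip(a, b, mask)]

A = [3, 0, 5, 0]
B = [1, 1, 1, 1]
mask = vector_mask(A, lambda x: x != 0)   # [1, 0, 1, 0]
print(masked_sub(A, B, mask))             # [2, 0, 4, 0]
```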
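Gather/scatter from the Sparse Matrices slide can be sketched with Python lists standing in for memory and an index vector. The LVI/SVI instructions themselves are hardware operations; this only mirrors their effect, with made-up data:

```python
# Sparse-matrix gather/scatter: gather (LVI) uses an index vector to pull
# scattered elements into a dense vector; after operating on the dense
# form, scatter (SVI) stores them back through the same index vector.
def gather(mem, index):
    """Fetch mem at each position named by the index vector."""
    return [mem[k] for k in index]

def scatter(mem, index, values):
    """Store values back to the positions named by the index vector."""
    for k, v in zip(index, values):
        mem[k] = v

A = [0.0] * 8
K = [5, 1, 6]                  # index vector: positions of the nonzeros
C = [2.0, 3.0, 4.0]
dense = gather(A, K)                        # dense form of A(K(i))
dense = [a + c for a, c in zip(dense, C)]   # the add, done densely
scatter(A, K, dense)                        # A(K(i)) = A(K(i)) + C(M(i))
print(A)   # [0.0, 3.0, 0.0, 0.0, 0.0, 2.0, 4.0, 0.0]
```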
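The δ² result on the Vector Processing and Power slide is easy to check numerically: with Power ∝ C·V²·f, replacing one lane at (V0, f0) by n lanes at (δ·V0, f0/n) keeps elements-per-second constant while power scales by δ². A sketch with normalized constants (the function name and δ value are illustrative):

```python
# Power scaling for vector lanes: Power ∝ C * V^2 * f. Cut frequency to
# f0/n, drop voltage to delta*V0 (delta < 1), and replicate n lanes to
# keep throughput constant; power then scales by delta^2.
def relative_power(n_lanes, delta):
    """New power / old power for n-way lane replication at voltage delta*V0."""
    C, V0, f0 = 1.0, 1.0, 1.0                        # normalized constants
    old = C * V0**2 * f0                             # one lane at (V0, f0)
    new = n_lanes * C * (delta * V0)**2 * (f0 / n_lanes)
    return new / old

print(round(relative_power(4, 0.7), 2))   # 0.49: same work, half the power
```

Note the lane count n cancels out: the saving comes entirely from the squared voltage reduction the lower frequency permits.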
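The 2-bit scheme's payoff on the slide's loop example — one misprediction per loop exit instead of the 1-bit BHT's two — can be verified with a toy saturating counter. This is an illustrative model, initialized to strongly-taken:

```python
# Jim Smith's 2-bit branch predictor: one saturating counter per BHT
# entry. The prediction flips only after two consecutive mispredictions,
# so a loop branch costs one misprediction per exit instead of two.
class TwoBitPredictor:
    def __init__(self):
        self.counter = 3          # 0-1: predict not taken; 2-3: predict taken

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
outcomes = [True] * 9 + [False]   # 9 loop iterations, then the exit
mispredicts = 0
for taken in outcomes:
    mispredicts += p.predict() != taken
    p.update(taken)
print(mispredicts)                # 1: only the loop exit is mispredicted
```

After the exit, the counter sits at 2, so the next entry into the loop is still predicted taken — the 1-bit scheme's second misprediction never happens.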
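A (2,2) GAs predictor as described on the Correlating Branches slide — 2 bits of global history selecting one of four 2-bit counters per address slot — can be sketched as follows. The table size and the alternating-branch training sequence are arbitrary choices for illustration:

```python
# (2,2) GAs predictor sketch: a 2-bit global history register selects
# among four 2-bit saturating counters in the slot indexed by the branch
# address, so a branch correlated with its predecessors becomes predictable.
class GAsPredictor:
    def __init__(self, slots=16):
        self.history = 0                              # 2-bit global history
        self.table = [[1] * 4 for _ in range(slots)]  # 4 counters per slot

    def predict(self, pc):
        return self.table[pc % len(self.table)][self.history] >= 2

    def update(self, pc, taken):
        ctr = self.table[pc % len(self.table)]
        h = self.history
        ctr[h] = min(3, ctr[h] + 1) if taken else max(0, ctr[h] - 1)
        self.history = ((h << 1) | taken) & 0b11      # shift in the outcome

p = GAsPredictor()
# Train on a strictly alternating branch; the per-history counters learn
# that "previous outcome not-taken" implies "taken next".
for taken in [True, False, True, False, True, False]:
    p.update(0, taken)
print(p.predict(0))   # True: next outcome in the pattern is taken
```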