Introduction to Software Pipelining - Lecture Slides | CS 6241, Study notes of Computer Science

Material Type: Notes; Professor: Clark; Class: Compiler Design; Subject: Computer Science; University: Georgia Institute of Technology-Main Campus; Term: Spring 2008;


Uploaded on 08/05/2009

koofers-user-o6p

CS 6241 – Class 15
Intro to Software Pipelining
Georgia Tech. February 26, 2008

- 1 - Scalar Scheduling Wrap Up

• SB scheduling has no bookkeeping, so it's simpler
  » Replicate code during SB formation, thus eliminating the need for bookkeeping
• Trace scheduling
  » In general has less code expansion than SB
  » But, it can be quite messy due to compensation code
• Elcor/Impact
  » Uses SB/HB scheduling
  » Restricted or general speculation models supported
    • General is default
• Next topic – Modulo scheduling for loops

- 4 - Unroll Then Schedule Larger Body

[Figure: iterations scheduled in unrolled pairs (1,2), (3,4), (5,6), ..., (n-1,n) along the time axis]

Schedule each iteration

resources: 4 issue, 2 alu, 1 mem, 1 br
latencies: add = 1, cmpp = 1, mpy = 3, ld = 2, st = 1, br = 1

    1: r3 = load(r1)
    2: r4 = r3 * 26
    3: store (r2, r4)
    4: r1 = r1 + 4
    5: r2 = r2 + 4
    6: p1 = cmpp (r1 < r9)
    7: brct p1 Loop

Schedule for the body unrolled 2x (primed ops belong to the second copy):

    time   ops
    0      1, 4
    1      1', 6, 4'
    2      2, 6'
    3      2'
    4      -
    5      3, 5, 7
    6      3', 5', 7'

Total time = 7 * n/2

- 5 - Problems With Unrolling

• Code bloat
  » Typical unroll is 4-16x
  » Use profile statistics to only unroll "important" loops
  » But still, code grows fast
• Barrier across unrolled bodies
  » I.e., for unroll 2, can only overlap iterations 1 and 2, 3 and 4, ...
• Does this mean unrolling is bad?
  » No, in some settings it's very useful
    • Low trip count
    • Lots of branches in the loop body
  » But, in other settings, there is room for improvement

- 6 - Overlap Iterations Using Pipelining

[Figure: iterations 1, 2, 3, ..., n along the time axis, executed back to back versus overlapped]

With hardware pipelining, while one instruction is in fetch, another is in decode and another is in execute. The same thing happens here: multiple iterations are processed simultaneously, with each iteration in a separate stage. One iteration still takes the same time, but the time to complete n iterations is reduced!
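
The unrolled schedule on slide 4 can be checked mechanically against the stated machine resources. The following is a minimal sketch, not part of the original slides: the names `LIMITS`, `UNIT`, `SCHEDULE`, `resource_violations`, and `unrolled_total_time` are hypothetical, the op-to-unit mapping follows the ResMII example (ALU: 2, 4, 5, 6; Mem: 1, 3; Br: 7), and every unit is assumed to be fully pipelined, so an op occupies its unit only in the cycle it issues.

```python
from collections import Counter

# Machine limits from slide 4: 4 issue slots, 2 ALUs, 1 memory port, 1 branch unit.
LIMITS = {"issue": 4, "alu": 2, "mem": 1, "br": 1}

# Unit class per operation number (assumption: same mapping as the ResMII example).
UNIT = {1: "mem", 2: "alu", 3: "mem", 4: "alu", 5: "alu", 6: "alu", 7: "br"}

# Slide 4 schedule: cycle -> list of (op, copy); copy 0 is unprimed, copy 1 is primed.
SCHEDULE = {
    0: [(1, 0), (4, 0)],
    1: [(1, 1), (6, 0), (4, 1)],
    2: [(2, 0), (6, 1)],
    3: [(2, 1)],
    4: [],
    5: [(3, 0), (5, 0), (7, 0)],
    6: [(3, 1), (5, 1), (7, 1)],
}


def resource_violations(schedule, unit, limits):
    """Return (cycle, resource, count) for every per-cycle limit that is exceeded."""
    problems = []
    for cycle, ops in sorted(schedule.items()):
        used = Counter(unit[op] for op, _copy in ops)
        used["issue"] = len(ops)
        for resource, count in used.items():
            if count > limits[resource]:
                problems.append((cycle, resource, count))
    return problems


def unrolled_total_time(n, body_length=7, unroll=2):
    """Total time = body_length * n / unroll; assumes n is a multiple of unroll."""
    return body_length * (n // unroll)


violations = resource_violations(SCHEDULE, UNIT, LIMITS)
cycles_for_100 = unrolled_total_time(100)
```

An empty `violations` list means no cycle oversubscribes a resource under these assumptions; the check does not verify latencies or dependences.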
- 9 - Creating Software Pipelines (2)

• Create a schedule for 1 iteration of the loop such that when the same schedule is repeated at intervals of II cycles
  » No intra-iteration dependence is violated
  » No inter-iteration dependence is violated
  » No resource conflict arises between operations in the same or distinct iterations
• We will start out assuming Itanium-style hardware support, then remove it later
  » Rotating registers
  » Predicates
  » Brtop

- 10 - Terminology

[Figure: Iter 1, Iter 2, Iter 3 along the time axis, each starting II cycles after the previous one]

Initiation Interval (II) = fixed delay between the start of successive iterations
Each iteration can be divided into stages consisting of II cycles each
Number of stages in 1 iteration is termed the stage count (SC)
Takes SC-1 stages, i.e. (SC-1) * II cycles, to fill/drain the pipe

- 11 - Resource Usage Legality

• Need to guarantee that
  » No resource is used at 2 points in time that are separated by an interval which is a multiple of II
  » I.e., within a single iteration, the same resource is never used more than 1x at the same time modulo II
  » Known as the modulo constraint, which is where the name modulo scheduling comes from
  » Modulo reservation table solves this problem
    • To schedule an op at time T needing resource R
    • The entry for R at T mod II must be free
    • Mark busy at T mod II if scheduled

Modulo reservation table for II = 3 (one row per cycle modulo II, one column per resource):

           alu1   alu2   mem   bus0   bus1   br
    0
    1
    2

- 14 - Physical Realization of EVRs

• An EVR (expanded virtual register) may contain an unlimited number of values
  » But, only a finite contiguous set of elements of an EVR are ever live at any point in time
  » These must be given physical registers
• Conventional register file
  » Remaps are essentially copies, so each EVR is realized by a set of physical registers and copies are inserted
• Rotating registers
  » Direct support for EVRs
  » No copies needed
  » File "rotated" after each loop iteration is completed

- 15 - Loop Dependence Example

    1: r3[-1] = load(r1[0])
    2: r4[-1] = r3[-1] * 26
    3: store (r2[0], r4[-1])
    4: r1[-1] = r1[0] + 4
    5: r2[-1] = r2[0] + 4
    6: p1[-1] = cmpp (r1[-1] < r9)
       remap r1, r2, r3, r4, p1
    7: brct p1[-1] Loop
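
The modulo reservation table rule from slide 11 (an op at time T needing resource R is legal only if the entry for R at T mod II is free) can be sketched as below. This is an illustrative sketch, not the lecture's implementation: the class name `ModuloReservationTable`, its methods, and the op labels are hypothetical, and each op is assumed to occupy exactly one resource for one cycle.

```python
class ModuloReservationTable:
    """One row per cycle modulo II, one column per resource (hypothetical sketch)."""

    def __init__(self, ii, resources):
        self.ii = ii
        # table[resource][row] holds the op occupying that slot, or None if free.
        self.table = {r: [None] * ii for r in resources}

    def is_free(self, resource, t):
        """The entry for `resource` at t mod II must be free."""
        return self.table[resource][t % self.ii] is None

    def reserve(self, op, resource, t):
        """Mark the slot busy at t mod II if it is free; report success."""
        if not self.is_free(resource, t):
            return False
        self.table[resource][t % self.ii] = op
        return True


# Resources and II = 3 taken from the slide 11 figure.
mrt = ModuloReservationTable(3, ["alu1", "alu2", "mem", "bus0", "bus1", "br"])
first = mrt.reserve("load", "mem", 0)    # row 0 of mem is free
second = mrt.reserve("store", "mem", 3)  # 3 mod 3 == 0 collides with the load
third = mrt.reserve("store", "mem", 4)   # 4 mod 3 == 1 is still free
```

A failed `reserve` is the signal for a scheduler to try a later time slot (or another unit of the same kind); that retry policy is outside this sketch.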
[Figure: dependence graph over operations 1-7, each edge labeled <delay, distance>; labels shown: 1,0  1,0  0,0  3,0  2,0  1,1  1,1  1,1  1,1  0,0]

In DSA form, there are no inter-iteration anti or output dependences!

- 16 - Class Problem

    1: r1[-1] = load(r2[0])
    2: r3[-1] = r1[1] - r1[2]
    3: store (r3[-1], r2[0])
    4: r2[-1] = r2[0] + 4
    5: p1[-1] = cmpp (r2[-1] < 100)
       remap r1, r2, r3
    6: brct p1[-1] Loop

Draw the dependence graph showing both intra- and inter-iteration dependences
Latencies: ld = 2, st = 1, add = 1, cmpp = 1, br = 1

- 19 - ResMII Example

resources: 4 issue, 2 alu, 1 mem, 1 br
latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1

    1: r3 = load(r1)
    2: r4 = r3 * 26
    3: store (r2, r4)
    4: r1 = r1 + 4
    5: r2 = r2 + 4
    6: p1 = cmpp (r1 < r9)
    7: brct p1 Loop

ALU: used by 2, 4, 5, 6 → 4 ops / 2 units = 2
Mem: used by 1, 3 → 2 ops / 1 unit = 2
Br: used by 7 → 1 op / 1 unit = 1

ResMII = MAX(2,2,1) = 2

- 20 - RecMII

Approach: Enumerate all irredundant elementary circuits in the dependence graph

RecMII = MAX (delay(c) / distance(c)) for all c in C
delay(c) = total latency in dependence cycle c (sum of delays)
distance(c) = total iteration distance of cycle c (sum of distances)

[Figure: two-op recurrence, edge 1 → 2 labeled <1,0> and edge 2 → 1 labeled <3,1>]

    cycle   op
    k       1
    k+1     2
    k+2
    k+3
    k+4     1
    k+5     2

4 cycles, RecMII = 4

delay(c) = 1 + 3 = 4
distance(c) = 0 + 1 = 1
RecMII = 4/1 = 4

- 21 - RecMII Example

    1: r3 = load(r1)
    2: r4 = r3 * 26
    3: store (r2, r4)
    4: r1 = r1 + 4
    5: r2 = r2 + 4
    6: p1 = cmpp (r1 < r9)
    7: brct p1 Loop

[Figure: dependence graph over operations 1-7, each edge labeled <delay, distance>; labels shown: 1,0  1,0  0,0  3,0  2,0  1,1  1,1  1,1  1,1  0,0]

4 → 4: 1 / 1 = 1
5 → 5: 1 / 1 = 1
4 → 1 → 4: 1 / 1 = 1
5 → 3 → 5: 1 / 1 = 1

RecMII = MAX(1,1,1,1) = 1

Then, MII = MAX(ResMII, RecMII)
MII = MAX(2,1) = 2
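
The ResMII and RecMII arithmetic above can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: the function names `res_mii` and `rec_mii` are hypothetical, the dependence cycles are supplied by hand as (delay, distance) pairs rather than enumerated from the graph, and ratios are rounded up because II must be a whole number of cycles (the slides write the plain ratio, which is already an integer in both examples).

```python
from collections import Counter


def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def res_mii(op_units, unit_counts):
    """Max over resource classes of ceil(ops using the class / units of the class)."""
    uses = Counter(op_units.values())
    return max(ceil_div(uses[u], unit_counts[u]) for u in uses)


def rec_mii(cycles):
    """Max over dependence cycles of ceil(delay(c) / distance(c)).

    `cycles` is a list of (total delay, total distance) pairs; an empty
    list (no recurrences) yields 0 in this sketch.
    """
    return max((ceil_div(d, dist) for d, dist in cycles), default=0)


# Slide 19: ALU used by ops 2, 4, 5, 6; Mem by 1, 3; Br by 7.
OP_UNITS = {1: "mem", 2: "alu", 3: "mem", 4: "alu", 5: "alu", 6: "alu", 7: "br"}
UNIT_COUNTS = {"alu": 2, "mem": 1, "br": 1}

# Slide 21: the four listed circuits, each with delay 1 and distance 1.
CYCLES = [(1, 1), (1, 1), (1, 1), (1, 1)]

res = res_mii(OP_UNITS, UNIT_COUNTS)
rec = rec_mii(CYCLES)
mii = max(res, rec)

# Slide 20: two-op recurrence with delay 1 + 3 and distance 0 + 1.
two_op = rec_mii([(1 + 3, 0 + 1)])
```

Enumerating the elementary circuits themselves is the harder part of RecMII and is deliberately left out here.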