CS 6241 – Class 15
Intro to Software Pipelining
Georgia Tech. February 26, 2008

- 1 - Scalar Scheduling Wrap Up

Superblock (SB) scheduling has no bookkeeping, so it's simpler
  » Code is replicated during SB formation, which eliminates the need for bookkeeping
Trace scheduling
  » In general has less code expansion than SB
  » But it can be quite messy due to compensation code
Elcor/Impact
  » Uses SB/HB (superblock/hyperblock) scheduling
  » Restricted or general speculation models supported
      General is default
Next topic – Modulo scheduling for loops

- 4 - Unroll Then Schedule Larger Body

[Figure: iterations grouped in pairs along the time axis: 1,2 | 3,4 | 5,6 | … | n-1,n]

Schedule each iteration
  resources: 4 issue, 2 alu, 1 mem, 1 br
  latencies: add = 1, cmpp = 1, mpy = 3, ld = 2, st = 1, br = 1

  1: r3 = load(r1)
  2: r4 = r3 * 26
  3: store (r2, r4)
  4: r1 = r1 + 4
  5: r2 = r2 + 4
  6: p1 = cmpp (r1 < r9)
  7: brct p1 Loop

Schedule for the body unrolled 2x (primed ops belong to the second copy):

  time  ops
  0     1, 4
  1     1’, 6, 4’
  2     2, 6’
  3     2’
  4     -
  5     3, 5, 7
  6     3’, 5’, 7’

Total time = 7 * n/2

- 5 - Problems With Unrolling

Code bloat
  » Typical unroll is 4-16x
  » Use profile statistics to only unroll “important” loops
  » But still, code grows fast
Barrier across unrolled bodies
  » I.e., for unroll 2, can only overlap iterations 1 and 2, 3 and 4, …
Does this mean unrolling is bad?
  » No, in some settings it's very useful
      Low trip count
      Lots of branches in the loop body
  » But, in other settings, there is room for improvement

- 6 - Overlap Iterations Using Pipelining

[Figure: iterations 1, 2, 3, …, n plotted against time, shown twice]

With hardware pipelining, while one instruction is in fetch, another is in decode and another is in execute. The same thing happens here: multiple iterations are processed simultaneously, with each iteration in a separate stage. One iteration still takes the same time, but the time to complete n iterations is reduced!
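To make the claim about n iterations concrete, here is a minimal Python sketch comparing the unroll-by-2 schedule with overlapped iterations. The 7-cycle unrolled body comes from the schedule on slide 4. The 6-cycle single-iteration length is read off the same schedule (the unprimed ops occupy cycles 0 through 5). The start-to-start interval of 2 cycles is an assumed, illustrative value, and the function names are hypothetical.

```python
import math


def unrolled_cycles(n, unroll=2, body_cycles=7):
    """Cycles for n iterations when each unrolled body covers `unroll`
    iterations, takes `body_cycles`, and bodies never overlap."""
    return math.ceil(n / unroll) * body_cycles


def overlapped_cycles(n, iteration_cycles=6, interval=2):
    """Cycles for n iterations when a new iteration starts every
    `interval` cycles (assumed value) and each iteration still takes
    `iteration_cycles` from start to finish."""
    if n == 0:
        return 0
    # The last iteration starts at (n - 1) * interval and then runs to completion.
    return (n - 1) * interval + iteration_cycles


n = 100
print(unrolled_cycles(n))    # 350
print(overlapped_cycles(n))  # 204
```

Under these assumptions a single iteration is no faster (6 cycles either way), but the cost per additional iteration drops from 3.5 cycles to 2.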
- 9 - Creating Software Pipelines (2)

Create a schedule for 1 iteration of the loop such that, when the same schedule is repeated at intervals of II cycles:
  » No intra-iteration dependence is violated
  » No inter-iteration dependence is violated
  » No resource conflict arises between operations in the same or distinct iterations
We will start out assuming Itanium-style hardware support, then remove it later
  » Rotating registers
  » Predicates
  » Brtop

- 10 - Terminology

[Figure: Iter 1, Iter 2, Iter 3 staggered along the time axis, each starting II cycles after the previous one]

Initiation Interval (II) = fixed delay between the start of successive iterations
Each iteration can be divided into stages consisting of II cycles each
Number of stages in 1 iteration is termed the stage count (SC)
Takes SC-1 stages, i.e. (SC-1) * II cycles, to fill/drain the pipe

- 11 - Resource Usage Legality

Need to guarantee that
  » No resource is used at 2 points in time that are separated by an interval which is a multiple of II
  » I.e., within a single iteration, the same resource is never used more than 1x at the same time modulo II
  » Known as the modulo constraint, which is where the name modulo scheduling comes from
  » Modulo reservation table solves this problem
      To schedule an op at time T needing resource R
        The entry for R at T mod II must be free
        Mark busy at T mod II if scheduled

Modulo reservation table for II = 3 (one row per cycle mod II, one column per resource):

       alu1  alu2  mem  bus0  bus1  br
  0
  1
  2

- 14 - Physical Realization of EVRs

An EVR (expanded virtual register) may contain an unlimited number of values
  » But only a finite contiguous set of elements of an EVR are ever live at any point in time
  » These must be given physical registers
Conventional register file
  » Remaps are essentially copies, so each EVR is realized by a set of physical registers and copies are inserted
Rotating registers
  » Direct support for EVRs
  » No copies needed
  » File “rotated” after each loop iteration is completed

- 15 - Loop Dependence Example

  1: r3[-1] = load(r1[0])
  2: r4[-1] = r3[-1] * 26
  3: store (r2[0], r4[-1])
  4: r1[-1] = r1[0] + 4
  5: r2[-1] = r2[0] + 4
  6: p1[-1] = cmpp (r1[-1] < r9)
     remap r1, r2, r3, r4, p1
  7: brct p1[-1] Loop

[Figure: dependence graph over ops 1-7, each edge labeled <delay, distance>. Edge labels as extracted: 1,0 1,0 0,0 3,0 2,0 1,1 1,1 1,1 1,1 0,0]

In DSA (dynamic single assignment) form, there are no inter-iteration anti or output dependences!

- 16 - Class Problem

  1: r1[-1] = load(r2[0])
  2: r3[-1] = r1[1] - r1[2]
  3: store (r3[-1], r2[0])
  4: r2[-1] = r2[0] + 4
  5: p1[-1] = cmpp (r2[-1] < 100)
     remap r1, r2, r3
  6: brct p1[-1] Loop

Draw the dependence graph showing both intra- and inter-iteration dependences
Latencies: ld = 2, st = 1, add = 1, cmpp = 1, br = 1

- 19 - ResMII Example

resources: 4 issue, 2 alu, 1 mem, 1 br
latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1

  1: r3 = load(r1)
  2: r4 = r3 * 26
  3: store (r2, r4)
  4: r1 = r1 + 4
  5: r2 = r2 + 4
  6: p1 = cmpp (r1 < r9)
  7: brct p1 Loop

  ALU: used by 2, 4, 5, 6    4 ops / 2 units = 2
  Mem: used by 1, 3          2 ops / 1 unit = 2
  Br:  used by 7             1 op / 1 unit = 1
ResMII = MAX(2,2,1) = 2

- 20 - RecMII

Approach: Enumerate all irredundant elementary circuits in the dependence graph
RecMII = MAX (delay(c) / distance(c)) for all c in C, where C is the set of those circuits
  delay(c) = total latency in dependence cycle c (sum of delays)
  distance(c) = total iteration distance of cycle c (sum of distances)

Example: ops 1 and 2 form a cycle, with edge 1 → 2 labeled <1,0> and edge 2 → 1 labeled <3,1>

  cycle  op
  k      1
  k+1    2
  k+2
  k+3
  k+4    1
  k+5    2

Op 1 can issue again only 4 cycles after its previous issue (1 + 3), so RecMII = 4
  delay(c) = 1 + 3 = 4
  distance(c) = 0 + 1 = 1
  RecMII = 4/1 = 4

- 21 - RecMII Example

  1: r3 = load(r1)
  2: r4 = r3 * 26
  3: store (r2, r4)
  4: r1 = r1 + 4
  5: r2 = r2 + 4
  6: p1 = cmpp (r1 < r9)
  7: brct p1 Loop

[Figure: dependence graph over ops 1-7, each edge labeled <delay, distance>. Edge labels as extracted: 1,0 1,0 0,0 3,0 2,0 1,1 1,1 1,1 1,1 0,0]

Cycles:
  4 → 4:      1 / 1 = 1
  5 → 5:      1 / 1 = 1
  4 → 1 → 4:  1 / 1 = 1
  5 → 3 → 5:  1 / 1 = 1
RecMII = MAX(1,1,1,1) = 1

Then, MII = MAX(ResMII, RecMII)
MII = MAX(2,1) = 2
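The ResMII, RecMII and MII arithmetic above can be reproduced with a short Python sketch. The function names are hypothetical, and the dependence cycles are entered by hand as (delay, distance) totals rather than enumerated from a graph. Rounding each ratio up to a whole cycle is an assumption added here, since the slide examples all divide evenly.

```python
import math


def res_mii(op_counts, unit_counts):
    """Resource bound: for each resource, ops that use it divided by the
    number of units (rounded up, an assumption), then the maximum."""
    return max(math.ceil(op_counts[r] / unit_counts[r]) for r in op_counts)


def rec_mii(cycles):
    """Recurrence bound: max over dependence cycles of delay / distance.
    `cycles` holds hand-entered (delay, distance) totals per cycle;
    enumerating the elementary circuits is not shown here."""
    return max((math.ceil(delay / dist) for delay, dist in cycles), default=0)


# Values taken from the ResMII and RecMII example slides.
op_counts = {"alu": 4, "mem": 2, "br": 1}
unit_counts = {"alu": 2, "mem": 1, "br": 1}
cycles = [(1, 1), (1, 1), (1, 1), (1, 1)]  # 4->4, 5->5, 4->1->4, 5->3->5

res = res_mii(op_counts, unit_counts)
rec = rec_mii(cycles)
mii = max(res, rec)
print(res, rec, mii)       # 2 1 2

# The two-op cycle from the RecMII slide: delay 1 + 3, distance 0 + 1.
print(rec_mii([(4, 1)]))   # 4
```

MII is only a lower bound on II. A scheduler would typically start at II = MII and try larger values if no legal schedule is found.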
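The modulo reservation table from the Resource Usage Legality slide can be sketched in a few lines of Python. This is a minimal illustration, not a full scheduler: the class and method names are hypothetical, and each op is assumed to occupy exactly one resource for one cycle.

```python
class ModuloReservationTable:
    """One slot per (resource, cycle mod II); None means the slot is free."""

    def __init__(self, ii, resources):
        self.ii = ii
        self.slots = {r: [None] * ii for r in resources}

    def is_free(self, resource, t):
        # The entry for the resource at T mod II must be free.
        return self.slots[resource][t % self.ii] is None

    def reserve(self, resource, t, op):
        """Mark the slot busy at T mod II; return False on a conflict."""
        if not self.is_free(resource, t):
            return False
        self.slots[resource][t % self.ii] = op
        return True


# II = 3 and the resource names are taken from the table on that slide.
mrt = ModuloReservationTable(3, ["alu1", "alu2", "mem", "bus0", "bus1", "br"])
print(mrt.reserve("mem", 0, "op1"))  # True
print(mrt.reserve("mem", 3, "op3"))  # False, because 3 mod 3 == 0 is taken
print(mrt.reserve("mem", 4, "op3"))  # True, because 4 mod 3 == 1 is free
```

The second call fails even though cycles 0 and 3 are different times within one iteration: with II = 3, cycle 3 of one iteration coincides with cycle 0 of the next, which is exactly the modulo constraint.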