Introduction to Software Pipelining - Lecture Slides | CS 6241, Study notes of Computer Science

Georgia Institute of Technology - Main Campus Computer Science

Prof. Nathan Clark

Material Type: Notes; Professor: Clark; Class: Compiler Design; Subject: Computer Science; University: Georgia Institute of Technology-Main Campus; Term: Spring 2008;

Typology: Study notes

Pre 2010

Uploaded on 08/05/2009

CS 6241 – Class 15
Intro to Software Pipelining
Georgia Tech. February 26, 2008

- 1 - Scalar Scheduling Wrap Up

Superblock (SB) scheduling has no bookkeeping, so it is simpler
  » Code is replicated during SB formation, which eliminates the need for bookkeeping
Trace scheduling
  » In general, has less code expansion than SB
  » But it can be quite messy due to compensation code
Elcor/Impact
  » Uses SB/HB (superblock/hyperblock) scheduling
  » Restricted or general speculation models supported (general is the default)
Next topic: modulo scheduling for loops

- 4 - Unroll Then Schedule Larger Body

[Figure: iterations grouped in pairs along the time axis: (1,2), (3,4), (5,6), ..., (n-1,n)]

Schedule each iteration

resources: 4 issue, 2 alu, 1 mem, 1 br
latencies: add = 1, cmpp = 1, mpy = 3, ld = 2, st = 1, br = 1

1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
6: p1 = cmpp (r1 < r9)
7: brct p1 Loop

Schedule for the body unrolled 2x (primed ops belong to the second copy):

time   ops
0      1, 4
1      1', 6, 4'
2      2, 6'
3      2'
4      -
5      3, 5, 7
6      3', 5', 7'

Total time = 7 * n/2

- 5 - Problems With Unrolling

Code bloat
  » Typical unroll is 4-16x
  » Use profile statistics to unroll only “important” loops
  » But still, code grows fast
Barrier across unrolled bodies
  » I.e., for unroll 2, can only overlap iterations 1 and 2, 3 and 4, …
Does this mean unrolling is bad?
  » No, in some settings it is very useful: low trip count, lots of branches in the loop body
  » But in other settings, there is room for improvement

- 6 - Overlap Iterations Using Pipelining

[Figure: two timelines of iterations 1, 2, 3, ..., n against time]

With hardware pipelining, while one instruction is in fetch, another is in decode and another is in execute. The same thing happens here: multiple iterations are processed simultaneously, with each iteration in a different stage. One iteration still takes the same time, but the time to complete n iterations is reduced!
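
To make the comparison concrete, here is a rough Python sketch of the two cycle counts. The unrolled count uses the 7-cycle schedule for two iterations from the example above. The pipelined count, (n - 1) * II + L, assumes that a new iteration starts every II cycles and that one iteration takes L cycles from start to finish. The names `ii` and `iter_len` and the values 2 and 6 passed below are illustrative assumptions, not numbers stated on these slides.

```python
def unrolled_cycles(n, body_cycles=7, unroll=2):
    """Cycles for n iterations when each unrolled body must finish
    before the next one starts (the barrier between unrolled bodies)."""
    bodies = -(-n // unroll)  # ceiling division: number of unrolled bodies
    return body_cycles * bodies


def pipelined_cycles(n, ii, iter_len):
    """Cycles for n iterations when a new iteration starts every `ii`
    cycles and each iteration takes `iter_len` cycles start to finish."""
    if n == 0:
        return 0
    # The last iteration starts at (n - 1) * ii and then runs to completion.
    return (n - 1) * ii + iter_len


if __name__ == "__main__":
    n = 100
    print(unrolled_cycles(n))         # 7 * n/2 when n is even
    print(pipelined_cycles(n, 2, 6))  # assumed ii = 2, assumed iter_len = 6
```

Under these assumed values the overlapped loop needs roughly `ii` cycles per iteration once the pipeline is full, while the unrolled loop needs 3.5.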

- 9 - Creating Software Pipelines (2)

Create a schedule for 1 iteration of the loop such that, when the same schedule is repeated at intervals of II cycles
  » No intra-iteration dependence is violated
  » No inter-iteration dependence is violated
  » No resource conflict arises between operations in the same or distinct iterations
We will start out assuming Itanium-style hardware support, then remove it later
  » Rotating registers
  » Predicates
  » Brtop

- 10 - Terminology

[Figure: Iter 1, Iter 2, Iter 3 staggered along the time axis, each starting II cycles after the previous one]

Initiation Interval (II) = fixed delay between the start of successive iterations
Each iteration can be divided into stages consisting of II cycles each
The number of stages in 1 iteration is termed the stage count (SC)
It takes SC-1 stages, i.e., (SC-1) * II cycles, to fill/drain the pipe

- 11 - Resource Usage Legality

Need to guarantee that
  » No resource is used at 2 points in time that are separated by an interval which is a multiple of II
  » I.e., within a single iteration, the same resource is never used more than once at the same time modulo II
  » This is known as the modulo constraint, which is where the name modulo scheduling comes from
  » A modulo reservation table solves this problem
      To schedule an op at time T needing resource R:
      The entry for R at T mod II must be free
      Mark the entry busy at T mod II if the op is scheduled

Modulo reservation table (II = 3, all entries initially free):

time mod II   alu1   alu2   mem   bus0   bus1   br
0
1
2

- 14 - Physical Realization of EVRs

An EVR (expanded virtual register) may contain an unlimited number of values
  » But only a finite contiguous set of elements of an EVR is ever live at any point in time
  » These must be given physical registers
Conventional register file
  » Remaps are essentially copies, so each EVR is realized by a set of physical registers and copies are inserted
Rotating registers
  » Direct support for EVRs
  » No copies needed
  » File is “rotated” after each loop iteration is completed

- 15 - Loop Dependence Example

1: r3[-1] = load(r1[0])
2: r4[-1] = r3[-1] * 26
3: store (r2[0], r4[-1])
4: r1[-1] = r1[0] + 4
5: r2[-1] = r2[0] + 4
6: p1[-1] = cmpp (r1[-1] < r9)
remap r1, r2, r3, r4, p1
7: brct p1[-1] Loop

[Figure: dependence graph over ops 1-7 with edges labeled <delay, distance>. Edge labels appearing in the figure: <1,0>, <1,0>, <0,0>, <3,0>, <2,0>, <1,1>, <1,1>, <1,1>, <1,1>, <0,0>]

In DSA form, there are no inter-iteration anti or output dependences!

- 16 - Class Problem

1: r1[-1] = load(r2[0])
2: r3[-1] = r1[1] - r1[2]
3: store (r3[-1], r2[0])
4: r2[-1] = r2[0] + 4
5: p1[-1] = cmpp (r2[-1] < 100)
remap r1, r2, r3
6: brct p1[-1] Loop

Draw the dependence graph showing both intra- and inter-iteration dependences

Latencies: ld = 2, st = 1, add = 1, cmpp = 1, br = 1

- 19 - ResMII Example

resources: 4 issue, 2 alu, 1 mem, 1 br
latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1

1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
6: p1 = cmpp (r1 < r9)
7: brct p1 Loop

resource   used by ops   ops / units
ALU        2, 4, 5, 6    4 ops / 2 units = 2
Mem        1, 3          2 ops / 1 unit = 2
Br         7             1 op / 1 unit = 1

ResMII = MAX(2,2,1) = 2

- 20 - RecMII

Approach: Enumerate all irredundant elementary circuits in the dependence graph

RecMII = MAX (delay(c) / distance(c)) for all c in C, the set of circuits
delay(c) = total latency in dependence cycle c (sum of delays)
distance(c) = total iteration distance of cycle c (sum of distances)

[Figure: two-op dependence cycle, 1 -> 2 labeled <1,0> and 2 -> 1 labeled <3,1>]

cycle   op
k       1
k+1     2
k+2
k+3
k+4     1
k+5     2

Successive instances of op 1 are 4 cycles apart, so RecMII = 4

delay(c) = 1 + 3 = 4
distance(c) = 0 + 1 = 1
RecMII = 4/1 = 4

- 21 - RecMII Example

1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
6: p1 = cmpp (r1 < r9)
7: brct p1 Loop

[Figure: dependence graph over ops 1-7 with edges labeled <delay, distance>, as in the Loop Dependence Example]

circuit        delay / distance
4 -> 4         1 / 1 = 1
5 -> 5         1 / 1 = 1
4 -> 1 -> 4    1 / 1 = 1
5 -> 3 -> 5    1 / 1 = 1

RecMII = MAX(1,1,1,1) = 1

Then, MII = MAX(ResMII, RecMII)
MII = MAX(2,1) = 2
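
The MII calculation above can be sketched in Python as follows. The function and variable names are hypothetical, and the circuits are entered by hand as (delay, distance) pairs rather than enumerated from a dependence graph. Rounding each ratio up with `ceil` is an assumption added here so that the bound is a whole number of cycles; the examples on the slides all divide evenly.

```python
from math import ceil


def res_mii(op_counts, unit_counts):
    """Resource bound: for each resource class, the number of ops that
    use it divided by the number of units, taking the maximum."""
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)


def rec_mii(circuits):
    """Recurrence bound: maximum over circuits of total delay divided by
    total distance. Each circuit is a (delay, distance) pair, distance > 0."""
    return max((ceil(delay / dist) for delay, dist in circuits), default=0)


def mii(op_counts, unit_counts, circuits):
    """Minimum initiation interval: the larger of the two bounds."""
    return max(res_mii(op_counts, unit_counts), rec_mii(circuits))


# Numbers from the example loop: ALU ops 2, 4, 5, 6; Mem ops 1, 3; Br op 7.
ops = {"alu": 4, "mem": 2, "br": 1}
units = {"alu": 2, "mem": 1, "br": 1}
# Circuits 4->4, 5->5, 4->1->4, 5->3->5, each with delay 1 and distance 1.
circuits = [(1, 1), (1, 1), (1, 1), (1, 1)]

if __name__ == "__main__":
    print(res_mii(ops, units), rec_mii(circuits), mii(ops, units, circuits))
```

For the example loop this gives ResMII = 2, RecMII = 1, and MII = 2, matching the slides. Passing the single circuit `(4, 1)` from the two-op RecMII illustration gives a recurrence bound of 4.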
