Software Instruction Level Parallelism - Lecture Slides | CMSC 411, Study notes of Computer Science

Material Type: Notes; Professor: Sussman; Class: SYSTM ARCHITECTURE; Subject: Computer Science; University: University of Maryland; Term: Unknown 1989;

Uploaded on 02/13/2009

Partial preview of the text

CMSC 411 - A. Sussman (from D. O'Leary)

Computer Systems Architecture
CMSC 411
Unit 4b – Software instruction-level parallelism
Alan Sussman
October 30, 2003

Administrivia

• Homework #4 due today
• Midterms Tuesday – through Unit 4 – questions?
• Read Chapter 4
• Homework #4b (Chapter 4) posted soon
• Workshop on grad school next Thursday, Nov. 6, 5-7 PM, CSIC 1115

Last time

• Hardware speculation
  – register renaming – as an alternative to the ROB
  – don't speculate instructions that may cause a very expensive exceptional event (e.g., a load)
  – sometimes useful to speculate through multiple branches – to find more ILP in the program
• Limits-of-ILP study
  – start with an ideal processor (no ILP constraints), then look at the effects of more realistic constraints:
    • limited window size and maximum issue count
    • realistic branch and jump prediction
    • limited number of registers for renaming
    • imperfect memory address alias analysis

Last time (cont.)
• P6 microarchitecture
  – dynamically scheduled MIPS-like core
  – each IA-32 instruction is translated into 1 or more MIPS-like uops
  – pipeline is speculative, with both a ROB and register renaming
  – performance study shows the benefits of a dynamically scheduled pipeline, and that resource limitations cause most stalls

Loop unrolling

• Improves the performance of pipelines and simple multiple-issue processors
  – for static issue, and for dynamic issue with static scheduling
• Fills pipeline stalls
  – can be done dynamically in hardware, or statically in the compiler
• Look again at an example we used before

Example – loop unrolling

original loop:
    for i = 1000, 999, ..., 1
        x[i] = x[i] + s;
    end for

unrolled to a depth of 4:
    for i = 1000, 996, 992, ..., 4
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    end for

Loop: L.D    F0,0(R1)     ; x[i] = x[i] + s
      ADD.D  F4,F0,F2     ; uses F0 and F4
      S.D    F4,0(R1)
      L.D    F6,-8(R1)    ; x[i-1] = x[i-1] + s
      ADD.D  F8,F6,F2     ; uses F6 and F8
      S.D    F8,-8(R1)
      L.D    F10,-16(R1)  ; x[i-2] = x[i-2] + s
      ADD.D  F12,F10,F2   ; uses F10 and F12
      S.D    F12,-16(R1)
      L.D    F14,-24(R1)  ; x[i-3] = x[i-3] + s
      ADD.D  F16,F14,F2   ; uses F14 and F16
      S.D    F16,-24(R1)
      DSUBI  R1,R1,#32    ; point to next element
      BNE    R1,R2,Loop

And reschedule the loop

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DSUBI  R1,R1,#32
      S.D    F12,16(R1)   ; offsets adjusted: DSUBI has already executed
      BNE    R1,R2,Loop
      S.D    F16,8(R1)

Example (cont.)

• Note: if 1000 were not divisible by 4, we would have a loop like this plus added code to take care of the last few elements
• How well does the unrolled code pipeline?
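As a language-neutral sketch of the transformation itself (my own Python illustration, not the slides' MIPS code), unrolling by a factor of 4 with a cleanup loop for trip counts not divisible by 4 might look like this:

```python
def add_scalar(x, s):
    """Baseline loop: x[i] = x[i] + s for every element."""
    for i in range(len(x)):
        x[i] = x[i] + s

def add_scalar_unrolled(x, s):
    """Unrolled by 4: one test-and-branch per 4 elements instead of
    per element. A cleanup loop handles the last len(x) % 4 elements,
    as the slides note for trip counts not divisible by 4."""
    n = len(x)
    i = 0
    while i + 4 <= n:           # main unrolled body
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    while i < n:                # cleanup for leftover elements
        x[i] = x[i] + s
        i += 1
```

Both versions compute the same result; the unrolled one simply amortizes the loop overhead (the DSUBI/BNE pair in the MIPS version) over four elements.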
  – 14 cycles per 4 elements (3.5 cycles per element), instead of the original code's 6 cycles per element
    • assuming issue of 1 instruction per cycle, and the standard MIPS pipeline organization (load delays, functional-unit latencies)

Loop unrolling (cont.)

• Limited only by:
  – the number of available registers
  – the size of the instruction cache – want the unrolled loop to fit
• What is gained:
  – fewer pipeline stalls/bubbles
  – less loop overhead – fewer DSUBIs and BNEs
• What is lost:
  – longer code
  – many possibilities to introduce errors
  – slower compilation
  – more work for either the programmer or the compiler writer

What did the compiler have to do?

• Determine that it was legal to move the S.D instructions after DSUBI and BNE, and adjust the S.D offsets
• Determine that loop unrolling would be useful – improve performance
• Use different registers to avoid name dependences
• Eliminate the extra test and branch instructions, and adjust the loop-termination and counter code
• Determine that the loads and stores could be interchanged
  – the ones from different iterations are independent
  – requires memory address analysis
• Schedule the code, preserving true dependences

Dependences limit loop unrolling

• In unrolling, we removed the intermediate DSUBI instructions to reduce the data dependence for the L.D and the control dependence for the BNE
• There are also antidependences, so we also made sure that later copies of the unrolled code used registers other than F0 and F4 – eliminating name dependences

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DSUBI  R1,R1,#8
      BNE    R1,R2,Loop   ; L.D and BNE depend on the result of DSUBI

True data dependences also limit unrolling

• The first assignment statement is an example of a loop-carried dependence
• The second assignment statement doesn't limit unrolling, but makes scheduling trickier

for i = 1, ..., 1000
    x[i+1] = x[i] + c[i]      ; uses the value from the previous iteration
    b[i+1] = d[i] + x[i+1]    ; uses the value just computed
end for

Software pipelining

• Another compiler technique for parallelism
• The compiler symbolically unrolls the loop to create one copy that interleaves instructions from different iterations (Fig. 4.6)

Example – again

Loop: L.D    F0,0(R1)    ; get next element of x
      ADD.D  F4,F0,F2    ; add s to x-element
      S.D    F4,0(R1)    ; store result
      DSUBI  R1,R1,#8    ; point to next x-element
      BNE    R1,R2,Loop  ; test done

Example (cont.)

• Three copies of the loop body, unrolled symbolically:

Iteration i:    L.D    F0,0(R1)   ; x[i] = x[i] + s
                ADD.D  F4,F0,F2
                S.D    F4,0(R1)
Iteration i-1:  L.D    F0,0(R1)   ; x[i-1] = x[i-1] + s
                ADD.D  F4,F0,F2
                S.D    F4,0(R1)
Iteration i-2:  L.D    F0,0(R1)   ; x[i-2] = x[i-2] + s
                ADD.D  F4,F0,F2
                S.D    F4,0(R1)

• The software-pipelined loop takes one statement from each copy and interleaves them:

Loop: S.D    F4,0(R1)    ; store element i
      ADD.D  F4,F0,F2    ; add for i-1
      L.D    F0,0(R1)    ; get element i-2
      DSUBI  R1,R1,#8    ; next x
      BNEZ   R1,Loop     ; test done

• with initialization before, clean-up after, and 2 fewer iterations

Software pipelining (cont.)
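The symbolic interleaving above (store for element i, add for element i-1, load for element i-2) can be sketched in Python with an explicit prologue and epilogue; this is my own illustration, not code from the slides:

```python
def saxpy_pipelined(x, s):
    """Software-pipelined x[i] = x[i] + s.

    The steady-state kernel issues the store for iteration i, the add
    for iteration i+1, and the load for iteration i+2, so no iteration
    waits on its own load-add-store chain. The prologue starts the
    first two iterations; the epilogue drains them -- the kernel runs
    2 fewer times than the original loop, as the slides note."""
    n = len(x)
    if n < 3:                   # too short to pipeline; plain loop
        for i in range(n):
            x[i] += s
        return
    # prologue: fill the pipeline
    f0 = x[0]                   # load for iteration 0
    f4 = f0 + s                 # add for iteration 0
    f0 = x[1]                   # load for iteration 1
    # kernel: n - 2 steady-state iterations
    for i in range(n - 2):
        x[i] = f4               # store for iteration i
        f4 = f0 + s             # add for iteration i+1
        f0 = x[i + 2]           # load for iteration i+2
    # epilogue: drain the last two iterations
    x[n - 2] = f4
    x[n - 1] = f0 + s
```

The two scratch variables play the role of F4 and F0 in the MIPS kernel: each value crosses one kernel iteration between its load, add, and store.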
• Doesn't reduce loop overhead (as loop unrolling does)
• But reduces data hazards
  – similar to what hardware dynamic scheduling does, but in software (so it works for VLIW and static scheduling)

Global code scheduling

• Requires moving instructions across branches
  – e.g., for effective scheduling of a loop body
• Want to compact a code fragment with branches (control statements) into the shortest possible sequence while preserving data and control dependences
  – means finding the shortest sequence for the critical path – the longest sequence of data-dependent instructions

Global code motion

• Needs estimates of the relative frequency of different paths through the code
  – since moving code across branches will often affect its frequency of execution
• No guarantee that the code will be faster, but if the frequency information is accurate, the compiler can decide whether the code is likely to be faster

Example – inner loop body

    A[i] = A[i] + B[i]
    if (A[i] == 0)
        B[i] = ...      ; common path
    else
        C[i] = ...

• Moving the B[i] or C[i] assignment requires complex analysis

Global scheduling

• For example, good scheduling may require moving the assignments to B or C before the test on A
• To move the B assignment:
  – can't change data flow or exceptions
  – for exceptions, don't move certain types of instructions (e.g., memory references) that can cause exceptions
  – for data flow, can't change the results of instructions before the test

Global scheduling (cont.)
• To move the C assignment:
  – first move it into the then part, and also keep a copy in the else part (to avoid the control dependence on the A test)
  – to move it above the A test, it can't affect any data flow up to the A test
  – can then remove the copy in the else part
• We'll talk about hardware support for this later

Global code scheduling algorithms

• Trace scheduling – good for VLIWs
  – trace selection – pick the likely frequent path of basic blocks
  – trace compaction – schedule the resulting set of blocks
  – branches are just jumps into and out of the trace
  – need extra bookkeeping code to fix things up when branching into or out of the trace, but that is not supposed to happen too often

Scheduling algorithms (cont.)

• Superblocks
  – similar to a trace, but with only 1 entry point and multiple exits
  – use tail duplication to create a separate block for the part of the trace after the entry – see the loop example in H&P, Fig. 4.10
  – disadvantage is possibly larger code than for trace scheduling

Hardware support for the compiler

• Conditional/predicated instructions – to eliminate branches
• Methods to help the compiler move code past branches – mainly to deal with exceptions properly
• Checks for address conflicts – to help with reordering loads and stores

Conditional instructions

• The condition is evaluated as part of the instruction's execution
  – if the condition is true, normal execution
  – if the condition is false, the instruction is turned into a no-op
• Example: conditional move
  – move a value from one register to another if a condition is true
  – can eliminate a branch in simple code sequences

Example: conditional move

• For the code: if (A == 0) { S = T; }
  – assume R1, R2, R3 hold the values of A, S, T
• With a branch:
        BNEZ  R1,L
        ADDU  R2,R3,R0
  L:
• With a conditional move (moves if the 3rd operand equals zero):
        CMOVZ R2,R3,R1
• Converts the control dependence into a data dependence
  – for a pipeline, this moves the dependence from near the beginning of the pipeline (branch resolution) to the end (register write)

Superscalar execution

• Predication helps with scheduling
• Example: a superscalar that can issue 1 memory reference and 1 ALU op per cycle, or just 1 branch

  Before (the branch blocks further issue):
      1st instruction       2nd instruction
      LW    R1,40(R2)       ADD  R3,R4,R5
      ADD   R6,R3,R7
      BEQZ  R10,L
      LW    R8,0(R10)
      LW    R9,0(R8)

  After (LWC loads if its 3rd operand is not 0):
      1st instruction       2nd instruction
      LW    R1,40(R2)       ADD  R3,R4,R5
      LWC   R8,0(R10),R10   ADD  R6,R3,R7
      BEQZ  R10,L
      LW    R9,0(R8)

Limitations of conditional instructions

• Predicated instructions that are squashed still use processor resources
  – doesn't matter if the resources would have been idle anyway
• Most useful when the predicate can be evaluated early
  – otherwise a control hazard is merely replaced by a data hazard
• Hard to use for complex control flow
  – for example, moving across multiple branches
• Conditional instructions may have a higher cycle count or slower clock rate than unconditional ones

Compiler speculation with hardware support

• To move speculated instructions not just before the branch, but before the condition evaluation
• The compiler can help find instructions that can be speculatively moved without affecting the program's data flow
• The hard part is preserving exception behavior
  – a speculated instruction that is mispredicted should not cause an exception
  – 4 methods are described in Section 4.5, so it can be done

Memory reference speculation with hardware support

• To move loads across stores when the compiler can't be sure it is legal
• Use a speculative load instruction
  – hardware saves the address of the memory location
  – if a subsequent store changes that location before the check (which ends the speculation), the speculation failed; otherwise it succeeded
  – on failure, redo the load and re-execute all speculated instructions after the speculative load
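As a rough model of the conditional-move example (my own Python sketch of the semantics, not executable MIPS), if-conversion replaces the branch around S = T with a single operation whose result depends on the condition as data:

```python
def cmovz(r2, r3, r1):
    """Models the slides' CMOVZ R2,R3,R1: the destination gets r3 when
    r1 == 0, and keeps its old value r2 otherwise. The control
    dependence (a branch) becomes a data dependence on r1."""
    return r3 if r1 == 0 else r2

def select_with_branch(a, s, t):
    """The branching version of: if (A == 0) { S = T; } return S."""
    if a == 0:
        s = t
    return s
```

Both compute the same value; the difference the slides emphasize is where the dependence is resolved in the pipeline, not what is computed.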