Understanding Computer Performance Metrics in CIS 371, Slides of Design

Design

This document from CIS 371 covers various performance metrics, including latency and throughput, and provides examples of how to calculate speedup and CPI. It also discusses the importance of considering dynamic instruction count and the pitfalls of partial performance metrics.

Typology: Slides

2021/2022

Uploaded on 08/05/2022

nguyen_99 🇻🇳

Partial preview of the text

CIS 371: Comp. Org. | Dr. Joe Devietti | Performance

CIS 371 Computer Organization and Design
Unit 6: Performance Metrics
Based on slides by Profs. Amir Roth, Milo Martin, C.J. Taylor, Benedict Brown

This Unit
• Metrics
• Latency and throughput
• Speedup
• Averaging
• CPU Performance
• Performance Pitfalls
• Benchmarking
[Figure: system stack showing applications, system software, CPU, memory, and I/O]

Performance: Latency vs. Throughput
• Latency (execution time): time to finish a fixed task
• Throughput (bandwidth): number of tasks in fixed time
  • Different: exploit parallelism for throughput, not latency (e.g., bread)
  • Often contradictory (latency vs. throughput)
  • Will see many examples of this
• Choose definition of performance that matches your goals
  • Scientific program? Latency. Web server? Throughput.
• Example: move people 10 miles
  • Car: capacity = 5, speed = 60 miles/hour
  • Bus: capacity = 60, speed = 20 miles/hour
  • Latency: car = 10 min, bus = 30 min
  • Throughput: car = 15 people per hour (PPH), counting the return trip; bus = 60 PPH
• Fastest way to send 10 TB of data? (1+ Gbit/second)

Amazon Does This…

Comparing Performance - Speedup
• A is X times faster than B if
  • X = Latency(B) / Latency(A) (divide by the faster)
  • X = Throughput(A) / Throughput(B) (divide by the slower)
• A is X% faster than B if
  • Latency(A) = Latency(B) / (1 + X/100)
  • Throughput(A) = Throughput(B) * (1 + X/100)
• Car/bus example
  • Latency? Car is 3 times (and 200%) faster than bus
  • Throughput? Bus is 4 times (and 300%) faster than car

Harmonic Mean Example
• You drive two miles
  • 30 miles per hour for the first mile
  • 90 miles per hour for the second mile
• Question: what was your average speed?
  • Hint: the answer is not 60 miles per hour
  • 0.03333 hours per mile for 1 mile
  • 0.01111 hours per mile for 1 mile
  • 0.02222 hours per mile on average
  • = 45 miles per hour

Mean (Average) Performance Numbers
• Arithmetic: $\frac{1}{N}\sum_{P=1}^{N}\mathrm{Latency}(P)$
  • For units that are proportional to time (e.g., latency)
• Harmonic: $N \big/ \sum_{P=1}^{N}\frac{1}{\mathrm{Throughput}(P)}$
  • For units that are inversely proportional to time (e.g., throughput)
• You can add latencies, but not throughputs
  • Latency(P1+P2, A) = Latency(P1, A) + Latency(P2, A)
  • Throughput(P1+P2, A) != Throughput(P1, A) + Throughput(P2, A)
  • 1 mile @ 30 miles/hour + 1 mile @ 90 miles/hour
  • Average is not 60 miles/hour
• Geometric: $\sqrt[N]{\prod_{P=1}^{N}\mathrm{Speedup}(P)}$
  • For unitless quantities (e.g., speedup ratios)

CPU Performance

CPI Example
• Assume a processor with instruction frequencies and costs
  • Integer ALU: 50%, 1 cycle
  • Load: 20%, 5 cycles
  • Store: 10%, 1 cycle
  • Branch: 20%, 2 cycles
• Which change would improve performance more?
  • A. Pipeline change to reduce branch cost to 1 cycle?
  • B. Faster data memory to reduce load cost to 3 cycles?
• Compute CPI
  • Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2 CPI
  • A = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8 CPI (1.11x or 11% faster)
  • B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6 CPI (1.25x or 25% faster)
• B is faster

Measuring CPI
• How are CPI and execution time actually measured?
  • Execution time? Stopwatch timer (Unix “time” command)
  • CPI = (CPU time * clock frequency) / dynamic insn count
  • How is dynamic instruction count measured?
• More useful is CPI breakdown (CPI_CPU, CPI_MEM, etc.)
  • So we know what the performance problems are and what to fix
• Hardware event counters
  • Available in most processors today
  • One way to measure dynamic instruction count
  • Calculate CPI using counter frequencies / known event costs
• Cycle-level micro-architecture simulation
  + Measure exactly what you want … and impact of potential fixes!
  • Method of choice for many micro-architects

Pitfalls of Partial Performance Metrics

Performance Rules of Thumb
• Design for actual performance, not peak performance
  • Peak performance: “Performance you are guaranteed not to exceed”
  • Greater than “actual” or “average” or “sustained” performance
  • Why? Cache misses, branch mispredictions, etc.
  • For actual performance X, machine capability must be > X
• Easier to “buy” bandwidth than latency
  • Which is easier for transporting more cargo by train:
  • (1) build another track or (2) make a train that goes twice as fast?
  • Use bandwidth to reduce latency
• Build a balanced system
  • Don’t over-optimize 1% to the detriment of other 99%
  • System performance often determined by slowest component

Performance Rules of Thumb
• Amdahl’s Law
  • Literally: total speedup limited by non-accelerated piece
  • Speedup(n, p, s) = (s + p) / (s + (p/n))
  • p is “parallel fraction”, s is “serial fraction”
  • Example: can optimize 50% of program A
  • Even “magic” optimization that makes this 50% disappear…
  • …only yields a 2X speedup

Benchmarking

Another Example: GeekBench
• Set of cross-platform multicore benchmarks
  • Can run on iPhone, Android, laptop, desktop, etc.
• Tests integer, floating point, memory, memory bandwidth performance
• GeekBench stores all results online
  • Easy to check scores for many different systems, processors
• Pitfall: Workloads are simple, may not be a completely accurate representation of performance
  • We know they evaluate compared to a baseline benchmark

GeekBench Numbers
• Desktop:
  • Intel “Ivy bridge” at 3.4 GHz (4 cores): 11,456
• Laptop:
  • Intel Core i7-3520M at 2.9 GHz (2 cores): 7,807
• Phones:
  • iPhone 5, Apple A6 at 1 GHz (2 cores): 1,589
  • iPhone 4S, Apple A5 at 0.8 GHz (2 cores): 642
  • Samsung Galaxy S III (North America), Qualcomm Snapdragon S3 at 1.5 GHz (2 cores): 1,429

Other Benchmarks
• Parallel benchmarks
  • SPLASH2: Stanford Parallel Applications for Shared Memory
  • NAS: another parallel benchmark suite
  • SPECopenMP: parallelized versions of SPECfp 2000
  • SPECjbb: Java multithreaded database-like workload
• Transaction Processing Performance Council (TPC)
  • TPC-C: On-line transaction processing (OLTP)
  • TPC-H/R: Decision support systems (DSS)
  • TPC-W: E-commerce database backend workload
  • Have parallelism (intra-query and inter-query)
  • Heavy I/O and memory components

Measuring Frequency
• Use Vivado’s post-implementation timing summary

Summary
• Latency = seconds / program =
  • (instructions / program) * (cycles / instruction) * (seconds / cycle)
• Instructions / program: dynamic instruction count
  • Function of program, compiler, instruction set architecture (ISA)
• Cycles / instruction: CPI
  • Function of program, compiler, ISA, micro-architecture
• Seconds / cycle: clock period
  • Function of micro-architecture, technology parameters
• Optimize each component
  • This course focuses mostly on CPI (caches, parallelism)
  • …but some on dynamic instruction count (compiler, ISA)
  • …and some on clock frequency (pipelining, technology)
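As a recap, the worked numbers on the slides can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names, the `BASE_MIX` dictionary layout, and the instruction count and clock frequency in the last example (1e9 instructions at 1 GHz) are assumptions chosen for illustration only.

```python
import math


def speedup(latency_slower, latency_faster):
    """'X times faster': divide the slower latency by the faster one."""
    return latency_slower / latency_faster


def harmonic_mean(rates):
    """Average of rates (e.g., miles/hour) measured over equal amounts of work."""
    return len(rates) / sum(1.0 / r for r in rates)


def weighted_cpi(mix):
    """CPI as the sum of (instruction fraction * cycles) over instruction classes."""
    return sum(fraction * cycles for fraction, cycles in mix.values())


def amdahl_speedup(n, p, s):
    """Speedup(n, p, s) = (s + p) / (s + p/n), as written on the Amdahl's Law slide."""
    return (s + p) / (s + p / n)


def latency_seconds(insn_count, cpi, clock_hz):
    """Latency = instructions * (cycles / instruction) * (seconds / cycle)."""
    return insn_count * cpi / clock_hz


# Instruction mix from the CPI Example slide: class -> (fraction, cycles).
BASE_MIX = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 1), "branch": (0.2, 2)}
OPTION_A = {**BASE_MIX, "branch": (0.2, 1)}  # A: branch cost reduced to 1 cycle
OPTION_B = {**BASE_MIX, "load": (0.2, 3)}    # B: load cost reduced to 3 cycles

if __name__ == "__main__":
    base = weighted_cpi(BASE_MIX)
    for name, mix in (("A", OPTION_A), ("B", OPTION_B)):
        cpi = weighted_cpi(mix)
        print(f"{name}: CPI {cpi:.2f}, {base / cpi:.2f}x faster than base")

    print(f"car vs. bus latency speedup: {speedup(30, 10):.1f}x")
    print(f"average speed: {harmonic_mean([30, 90]):.1f} miles/hour")

    # A "magic" optimization that makes the parallel half disappear (n -> infinity).
    print(f"Amdahl limit: {amdahl_speedup(math.inf, 0.5, 0.5):.1f}x")

    # Hypothetical program: 1e9 dynamic instructions on a 1 GHz clock (assumed values).
    print(f"latency: {latency_seconds(1e9, base, 1e9):.2f} s")
```

Run as a script, this should reproduce the CPI, speedup, and average-speed figures quoted on the slides. Keeping the instruction mix as data makes it easy to try other "which change helps more?" questions by editing one entry.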
