Understanding Computer Performance Metrics in CIS 371, Slides of Design

This document from CIS 371 covers various performance metrics, including latency and throughput, and provides examples of how to calculate speedup and CPI. It also discusses the importance of considering dynamic instruction count and the pitfalls of partial performance metrics.

Typology: Slides

2021/2022

Uploaded on 08/05/2022

nguyen_99



Partial preview of the text

CIS 371: Comp. Org. | Dr. Joe Devietti | Performance

CIS 371 Computer Organization and Design
Unit 6: Performance Metrics
Based on slides by Profs. Amir Roth, Milo Martin, C.J. Taylor, Benedict Brown

This Unit
• Metrics
  • Latency and throughput
  • Speedup
  • Averaging
• CPU Performance
• Performance Pitfalls
• Benchmarking
[Figure: system layers: App, System software, CPU, Mem, I/O]

Performance: Latency vs. Throughput
• Latency (execution time): time to finish a fixed task
• Throughput (bandwidth): number of tasks completed in a fixed time
• Different: exploit parallelism for throughput, not latency (e.g., bread)
• Often contradictory (latency vs. throughput)
  • Will see many examples of this
• Choose the definition of performance that matches your goals
  • Scientific program? Latency. Web server? Throughput.
• Example: move people 10 miles
  • Car: capacity = 5, speed = 60 miles/hour
  • Bus: capacity = 60, speed = 20 miles/hour
  • Latency: car = 10 min, bus = 30 min
  • Throughput: car = 15 people per hour (PPH, counting the return trip), bus = 60 PPH
• Fastest way to send 10 TB of data? (1+ Gbit/second)

Amazon Does This…

Comparing Performance - Speedup
• A is X times faster than B if
  • X = Latency(B) / Latency(A) (divide by the faster)
  • X = Throughput(A) / Throughput(B) (divide by the slower)
• A is X% faster than B if
  • Latency(A) = Latency(B) / (1+X/100)
  • Throughput(A) = Throughput(B) * (1+X/100)
• Car/bus example
  • Latency? Car is 3 times (and 200%) faster than bus
  • Throughput? Bus is 4 times (and 300%) faster than car

Harmonic Mean Example
• You drive two miles
  • 30 miles per hour for the first mile
  • 90 miles per hour for the second mile
• Question: what was your average speed?
  • Hint: the answer is not 60 miles per hour
  • 0.03333 hours per mile for 1 mile
  • 0.01111 hours per mile for 1 mile
  • 0.02222 hours per mile on average
  • = 45 miles per hour

Mean (Average) Performance Numbers
• Arithmetic: $\frac{1}{N} \sum_{P=1}^{N} \mathrm{Latency}(P)$
  • For units that are proportional to time (e.g., latency)
• Harmonic: $N \big/ \sum_{P=1}^{N} \frac{1}{\mathrm{Throughput}(P)}$
  • For units that are inversely proportional to time (e.g., throughput)
• You can add latencies, but not throughputs
  • Latency(P1+P2,A) = Latency(P1,A) + Latency(P2,A)
  • Throughput(P1+P2,A) != Throughput(P1,A) + Throughput(P2,A)
  • 1 mile @ 30 miles/hour + 1 mile @ 90 miles/hour
  • Average is not 60 miles/hour
• Geometric: $\sqrt[N]{\prod_{P=1}^{N} \mathrm{Speedup}(P)}$
  • For unitless quantities (e.g., speedup ratios)

CPU Performance

CPI Example
• Assume a processor with these instruction frequencies and costs
  • Integer ALU: 50%, 1 cycle
  • Load: 20%, 5 cycles
  • Store: 10%, 1 cycle
  • Branch: 20%, 2 cycles
• Which change would improve performance more?
  • A. Pipeline change to reduce branch cost to 1 cycle?
  • B. Faster data memory to reduce load cost to 3 cycles?
• Compute CPI
  • Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2 CPI
  • A = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8 CPI (1.11x or 11% faster)
  • B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6 CPI (1.25x or 25% faster)
  • B is faster

Measuring CPI
• How are CPI and execution time actually measured?
  • Execution time? Stopwatch timer (Unix “time” command)
  • CPI = (CPU time * clock frequency) / dynamic insn count
  • How is dynamic instruction count measured?
• More useful is a CPI breakdown (CPI_CPU, CPI_MEM, etc.)
  • So we know what the performance problems are and what to fix
• Hardware event counters
  • Available in most processors today
  • One way to measure dynamic instruction count
  • Calculate CPI using counter frequencies / known event costs
• Cycle-level micro-architecture simulation
  + Measure exactly what you want … and impact of potential fixes!
  • Method of choice for many micro-architects

Pitfalls of Partial Performance Metrics

Performance Rules of Thumb
• Design for actual performance, not peak performance
  • Peak performance: “Performance you are guaranteed not to exceed”
  • Greater than “actual” or “average” or “sustained” performance
    • Why? Cache misses, branch mispredictions, etc.
  • For actual performance X, machine capability must be > X
• Easier to “buy” bandwidth than latency
  • Which is the easier way to transport more cargo by train:
    (1) build another track or (2) make a train that goes twice as fast?
  • Use bandwidth to reduce latency
• Build a balanced system
  • Don’t over-optimize 1% to the detriment of the other 99%
  • System performance is often determined by the slowest component

Performance Rules of Thumb
• Amdahl’s Law
  • Literally: total speedup limited by non-accelerated piece
  • Speedup(n, p, s) = (s+p) / (s + (p/n))
    • p is “parallel fraction”, s is “serial fraction”, n is the speedup applied to the parallel fraction
  • Example: can optimize 50% of program A
    • Even “magic” optimization that makes this 50% disappear…
    • …only yields a 2X speedup

Benchmarking

Another Example: GeekBench
• Set of cross-platform multicore benchmarks
  • Can run on iPhone, Android, laptop, desktop, etc.
• Tests integer, floating point, memory, and memory bandwidth performance
• GeekBench stores all results online
  • Easy to check scores for many different systems and processors
• Pitfall: workloads are simple and may not be a completely accurate representation of performance
  • Scores are reported relative to a baseline system

GeekBench Numbers
• Desktop:
  • Intel “Ivy Bridge” at 3.4 GHz (4 cores): 11,456
• Laptop:
  • Intel Core i7-3520M at 2.9 GHz (2 cores): 7,807
• Phones:
  • iPhone 5, Apple A6 at 1 GHz (2 cores): 1,589
  • iPhone 4S, Apple A5 at 0.8 GHz (2 cores): 642
  • Samsung Galaxy S III (North America), Qualcomm Snapdragon S3 at 1.5 GHz (2 cores): 1,429

Other Benchmarks
• Parallel benchmarks
  • SPLASH2: Stanford Parallel Applications for Shared Memory
  • NAS: another parallel benchmark suite
  • SPECopenMP: parallelized versions of SPECfp 2000
  • SPECjbb: Java multithreaded database-like workload
• Transaction Processing Performance Council (TPC)
  • TPC-C: On-line transaction processing (OLTP)
  • TPC-H/R: Decision support systems (DSS)
  • TPC-W: E-commerce database backend workload
  • Have parallelism (intra-query and inter-query)
  • Heavy I/O and memory components

Measuring Frequency
• Use Vivado’s post-implementation timing summary

Summary
• Latency = seconds / program =
  • (instructions / program) * (cycles / instruction) * (seconds / cycle)
• Instructions / program: dynamic instruction count
  • Function of program, compiler, instruction set architecture (ISA)
• Cycles / instruction: CPI
  • Function of program, compiler, ISA, micro-architecture
• Seconds / cycle: clock period
  • Function of micro-architecture, technology parameters
• Optimize each component
  • This course focuses mostly on CPI (caches, parallelism)
  • …but some on dynamic instruction count (compiler, ISA)
  • …and some on clock frequency (pipelining, technology)
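
The arithmetic in this unit can be checked with a short script. The following is a minimal Python sketch, not part of the original slides: the function names are hypothetical, and the instruction count and clock frequency used for the latency equation are assumed values chosen only for illustration. The instruction mix, the 30/90 miles-per-hour speeds, and the 50% Amdahl example are taken from the slides.

```python
# Hypothetical helper names; only the numbers marked "from the slides"
# come from the original material.

def cpi(mix):
    """Average CPI: sum over instruction classes of frequency * cycle cost."""
    return sum(freq * cycles for freq, cycles in mix.values())

def latency_seconds(insn_count, cpi_value, clock_hz):
    """seconds/program = (insns/program) * (cycles/insn) * (seconds/cycle)."""
    return insn_count * cpi_value / clock_hz

def speedup(latency_old, latency_new):
    """How many times faster the new design is (ratio of latencies)."""
    return latency_old / latency_new

def harmonic_mean(rates):
    """Mean for rates such as throughput or speed over equal amounts of work."""
    return len(rates) / sum(1.0 / r for r in rates)

def amdahl(n, p, s):
    """Amdahl's Law: parallel fraction p sped up n times, serial fraction s."""
    return (s + p) / (s + p / n)

# Instruction mix from the slides: class -> (frequency, cycles).
BASE = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 1), "branch": (0.2, 2)}
OPTION_A = dict(BASE, branch=(0.2, 1))  # branch cost reduced to 1 cycle
OPTION_B = dict(BASE, load=(0.2, 3))    # load cost reduced to 3 cycles

# Assumed values, not from the slides.
INSN_COUNT = 1_000_000_000
CLOCK_HZ = 2e9

if __name__ == "__main__":
    for name, mix in (("base", BASE), ("A", OPTION_A), ("B", OPTION_B)):
        c = cpi(mix)
        t = latency_seconds(INSN_COUNT, c, CLOCK_HZ)
        print(f"{name}: CPI={c:.2f} latency={t:.3f}s "
              f"speedup={speedup(cpi(BASE), c):.2f}x")
    print(f"average speed: {harmonic_mean([30, 90]):.1f} miles/hour")
    print(f"Amdahl limit for p=0.5: {amdahl(1e12, 0.5, 0.5):.2f}x")
```

Because instruction count and clock frequency are held constant across the three configurations, the latency ratio reduces to the CPI ratio, which is why the slides can compare options A and B using CPI alone.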