CIS 371: Comp. Org. | Dr. Joe Devietti | Performance

CIS 371 Computer Organization and Design
Unit 6: Performance Metrics
Based on slides by Profs. Amir Roth, Milo Martin, C.J. Taylor, Benedict Brown

This Unit
• Metrics
  • Latency and throughput
  • Speedup
  • Averaging
• CPU Performance
• Performance Pitfalls
• Benchmarking
[Figure: system stack with Apps on top of System software, on top of CPU, Mem, and I/O]

Performance: Latency vs. Throughput
• Latency (execution time): time to finish a fixed task
• Throughput (bandwidth): number of tasks in fixed time
  • Different: exploit parallelism for throughput, not latency (e.g., bread)
  • Often contradictory (latency vs. throughput)
  • Will see many examples of this
• Choose definition of performance that matches your goals
  • Scientific program? Latency. Web server? Throughput.
• Example: move people 10 miles
  • Car: capacity = 5, speed = 60 miles/hour
  • Bus: capacity = 60, speed = 20 miles/hour
  • Latency: car = 10 min, bus = 30 min
  • Throughput: car = 15 PPH (count return trip), bus = 60 PPH
• Fastest way to send 10 TB of data? (1+ Gbit/s)

Amazon Does This…

Comparing Performance - Speedup
• A is X times faster than B if
  • X = Latency(B) / Latency(A) (divide by the faster)
  • X = Throughput(A) / Throughput(B) (divide by the slower)
• A is X% faster than B if
  • Latency(A) = Latency(B) / (1+X/100)
  • Throughput(A) = Throughput(B) * (1+X/100)
• Car/bus example
  • Latency? Car is 3 times (and 200%) faster than bus
  • Throughput? Bus is 4 times (and 300%) faster than car

Harmonic Mean Example
• You drive two miles
  • 30 miles per hour for the first mile
  • 90 miles per hour for the second mile
• Question: what was your average speed?
• Hint: the answer is not 60 miles per hour
  • 0.03333 hours per mile for 1 mile (1/30)
  • 0.01111 hours per mile for 1 mile (1/90)
  • 0.02222 hours per mile on average
  • = 45 miles per hour

Mean (Average) Performance Numbers
• Arithmetic: $\frac{1}{N} \sum_{P=1}^{N} \mathrm{Latency}(P)$
  • For units that are proportional to time (e.g., latency)
• Harmonic: $\frac{N}{\sum_{P=1}^{N} 1/\mathrm{Throughput}(P)}$
  • For units that are inversely proportional to time (e.g., throughput)
• You can add latencies, but not throughputs
  • Latency(P1+P2,A) = Latency(P1,A) + Latency(P2,A)
  • Throughput(P1+P2,A) != Throughput(P1,A) + Throughput(P2,A)
  • 1 mile @ 30 miles/hour + 1 mile @ 90 miles/hour
    • Average is not 60 miles/hour
• Geometric: $\sqrt[N]{\prod_{P=1}^{N} \mathrm{Speedup}(P)}$
  • For unitless quantities (e.g., speedup ratios)

CPU Performance

CPI Example
• Assume a processor with instruction frequencies and costs
  • Integer ALU: 50%, 1 cycle
  • Load: 20%, 5 cycles
  • Store: 10%, 1 cycle
  • Branch: 20%, 2 cycles
• Which change would improve performance more?
  • A. Pipeline change to reduce branch cost to 1 cycle?
  • B. Faster data memory to reduce load cost to 3 cycles?
• Compute CPI
  • Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2 CPI
  • A = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8 CPI (1.11x or 11% faster)
  • B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6 CPI (1.25x or 25% faster)
  • B is faster

Measuring CPI
• How are CPI and execution time actually measured?
  • Execution time? Stopwatch timer (Unix “time” command)
  • CPI = (CPU time * clock frequency) / dynamic insn count
  • How is dynamic instruction count measured?
• More useful is CPI breakdown (CPI_CPU, CPI_MEM, etc.)
  • So we know what performance problems are and what to fix
• Hardware event counters
  • Available in most processors today
  • One way to measure dynamic instruction count
  • Calculate CPI using counter frequencies / known event costs
• Cycle-level micro-architecture simulation
  + Measure exactly what you want … and impact of potential fixes!
  • Method of choice for many micro-architects

Pitfalls of Partial Performance Metrics

Performance Rules of Thumb
• Design for actual performance, not peak performance
  • Peak performance: “Performance you are guaranteed not to exceed”
  • Greater than “actual” or “average” or “sustained” performance
    • Why? Cache misses, branch mispredictions, etc.
  • For actual performance X, machine capability must be > X
• Easier to “buy” bandwidth than latency
  • Which is easier, to transport more cargo via train:
    • (1) build another track or (2) make a train that goes twice as fast?
  • Use bandwidth to reduce latency
• Build a balanced system
  • Don’t over-optimize 1% to the detriment of other 99%
  • System performance often determined by slowest component

Performance Rules of Thumb
• Amdahl’s Law
  • Literally: total speedup limited by non-accelerated piece
  • Speedup(n, p, s) = (s+p) / (s + (p/n))
    • p is “parallel fraction”, s is “serial fraction”, n is the factor by which the parallel fraction is sped up
  • Example: can optimize 50% of program A
    • Even “magic” optimization that makes this 50% disappear…
    • …only yields a 2X speedup

Benchmarking

Another Example: GeekBench
• Set of cross-platform multicore benchmarks
  • Can run on iPhone, Android, laptop, desktop, etc.
• Tests integer, floating point, memory, memory bandwidth performance
• GeekBench stores all results online
  • Easy to check scores for many different systems, processors
• Pitfall: workloads are simple, so they may not accurately represent real performance
• Scores are evaluated relative to a baseline

GeekBench Numbers
• Desktop:
  • Intel “Ivy Bridge” at 3.4 GHz (4 cores): 11,456
• Laptop:
  • Intel Core i7-3520M at 2.9 GHz (2 cores): 7,807
• Phones:
  • iPhone 5, Apple A6 at 1 GHz (2 cores): 1,589
  • iPhone 4S, Apple A5 at 0.8 GHz (2 cores): 642
  • Samsung Galaxy S III (North America), Qualcomm Snapdragon S3 at 1.5 GHz (2 cores): 1,429

Other Benchmarks
• Parallel benchmarks
  • SPLASH2: Stanford Parallel Applications for Shared Memory
  • NAS: another parallel benchmark suite
  • SPECopenMP: parallelized versions of SPECfp 2000
  • SPECjbb: Java multithreaded database-like workload
• Transaction Processing Performance Council (TPC)
  • TPC-C: On-line transaction processing (OLTP)
  • TPC-H/R: Decision support systems (DSS)
  • TPC-W: E-commerce database backend workload
  • Have parallelism (intra-query and inter-query)
  • Heavy I/O and memory components

Measuring Frequency
• Use Vivado’s post-implementation timing summary

Summary
• Latency = seconds / program =
  • (instructions / program) * (cycles / instruction) * (seconds / cycle)
• Instructions / program: dynamic instruction count
  • Function of program, compiler, instruction set architecture (ISA)
• Cycles / instruction: CPI
  • Function of program, compiler, ISA, micro-architecture
• Seconds / cycle: clock period
  • Function of micro-architecture, technology parameters
• Optimize each component
  • This course focuses mostly on CPI (caches, parallelism)
  • …but some on dynamic instruction count (compiler, ISA)
  • …and some on clock frequency (pipelining, technology)
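
The latency equation in the Summary can be sketched in a few lines of Python. This is a minimal illustration and not part of the original slides: the instruction mix is taken from the deck's CPI example, while the instruction count (`INSN_COUNT`), the clock frequency (`CLOCK_HZ`), and all function names are hypothetical values chosen only to make the arithmetic concrete.

```python
# Instruction mix from the deck's CPI example: class -> (fraction, cycle cost).
BASE_MIX = {
    "alu": (0.5, 1),
    "load": (0.2, 5),
    "store": (0.1, 1),
    "branch": (0.2, 2),
}

# Option B from the CPI example: faster data memory, loads cost 3 cycles.
FASTER_LOADS = dict(BASE_MIX, load=(0.2, 3))

# Hypothetical values, not from the slides.
INSN_COUNT = 1_000_000_000   # dynamic instruction count
CLOCK_HZ = 2_000_000_000     # clock frequency (2 GHz)


def cpi(mix):
    """Average cycles per instruction: sum of fraction * cycle cost."""
    return sum(fraction * cycles for fraction, cycles in mix.values())


def latency_seconds(insn_count, cpi_value, clock_hz):
    """Latency = instructions * (cycles / instruction) * (seconds / cycle)."""
    seconds_per_cycle = 1.0 / clock_hz
    return insn_count * cpi_value * seconds_per_cycle


def speedup(latency_old, latency_new):
    """X = Latency(B) / Latency(A): how many times faster the new design is."""
    return latency_old / latency_new


base_latency = latency_seconds(INSN_COUNT, cpi(BASE_MIX), CLOCK_HZ)
option_b_latency = latency_seconds(INSN_COUNT, cpi(FASTER_LOADS), CLOCK_HZ)

print(f"base CPI:     {cpi(BASE_MIX):.2f}")
print(f"option B CPI: {cpi(FASTER_LOADS):.2f}")
print(f"speedup:      {speedup(base_latency, option_b_latency):.2f}x")
```

Because instruction count and clock frequency are held fixed here, the speedup depends only on the CPI ratio, which is why the CPI example can compare designs without knowing either value.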