Parallel Processing Techniques: Multi-Core and SIMD Execution

This document covers parallel processing techniques using multi-core and SIMD (Single Instruction, Multiple Data) execution: using multiple cores to run different instruction streams, and using SIMD units to process multiple data elements simultaneously. It also discusses the benefits and costs of these techniques and gives examples of their implementation. The material is from Carnegie Mellon University's course 15-418/618 (Parallel Computer Architecture and Programming), Fall 2018.

Parallel Computer Architecture and Programming
CMU 15-418/15-618, Fall 2018
Lecture 2: A Modern Multi-Core Processor
(Forms of parallelism + understanding latency and bandwidth)

Today
▪ Today we will talk about computer architecture
▪ Four key concepts about how modern computers work
  - Two concern parallel execution
  - Two concern challenges of accessing memory
▪ Understanding these architecture basics will help you
  - Understand and optimize the performance of your parallel programs
  - Gain intuition about what workloads might benefit from fast parallel machines

Compile program
Example program (computes sin(x) for each array element via a Taylor expansion):

  void sinx(int N, int terms, float* x, float* result) {
    for (int i=0; i<N; i++) {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;
      for (int j=1; j<=terms; j++) {
        value += sign * numer / denom;
        numer *= x[i] * x[i];
        denom *= (2*j+2) * (2*j+3);
        sign *= -1;
      }
      result[i] = value;
    }
  }

The compiler turns the loop body into an instruction stream that loads x[i] and stores result[i]:

  ld  r0, addr[r1]
  mul r1, r0, r0
  mul r1, r1, r0
  ...
  st  addr[r2], r0

Execute program
[Diagram, shown on two slides: a very simple processor with one Fetch/Decode unit, one ALU (Execute), one execution context, and a program counter (PC) stepping through the instruction stream from x[i] to result[i].]
My very simple processor: executes one instruction per clock.

Superscalar processor
[Diagram: one execution context with two Fetch/Decode units (Fetch/Decode 1, Fetch/Decode 2) and two execution units (Exec 1, Exec 2) working on the same instruction stream.]
Recall from last class: instruction-level parallelism (ILP). Decode and execute two instructions per clock (if possible). Note: no ILP exists in this region of the program (referring to a region marked on the slide).

Aside: Pentium 4
Image credit: http://ixbtlabs.com/articles/pentium4/index.html

Processor: pre multi-core era
[Diagram: one Fetch/Decode unit, one ALU (Execute), one execution context, a big data cache, out-of-order control logic, a fancy branch predictor, and a memory pre-fetcher.]
The majority of chip transistors are used to perform operations that help a single instruction stream run fast. More transistors = larger cache, smarter out-of-order logic, smarter branch predictor, etc. (Also: more transistors → smaller transistors → higher clock frequencies.)

But our program expresses no parallelism
The sinx program above, compiled with gcc, will run as one thread on one of the processor cores. If each of the simpler processor cores were 0.75X as fast as the original single complicated one, our program now has a "speedup" of 0.75 (i.e., it is slower).
Expressing parallelism using pthreads
The serial sinx kernel from the "Compile program" slide is reused unchanged. The main thread launches one worker thread for the first half of the array and processes the other half itself:

  typedef struct {
    int N;
    int terms;
    float* x;
    float* result;
  } my_args;

  void* my_thread_start(void* thread_arg) {
    my_args* thread_args = (my_args*)thread_arg;
    sinx(thread_args->N, thread_args->terms, thread_args->x, thread_args->result); // do work
    return NULL;
  }

  void parallel_sinx(int N, int terms, float* x, float* result) {
    pthread_t thread_id;
    my_args args;
    args.N = N/2;
    args.terms = terms;
    args.x = x;
    args.result = result;
    pthread_create(&thread_id, NULL, my_thread_start, &args);  // launch thread
    sinx(N - args.N, terms, x + args.N, result + args.N);      // do work
    pthread_join(thread_id, NULL);
  }

Data-parallel expression (in our fictitious data-parallel language)

  void sinx(int N, int terms, float* x, float* result) {
    // declare independent loop iterations
    forall (int i from 0 to N-1) {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;
      for (int j=1; j<=terms; j++) {
        value += sign * numer / denom;
        numer *= x[i] * x[i];
        denom *= (2*j+2) * (2*j+3);
        sign *= -1;
      }
      result[i] = value;
    }
  }

Loop iterations are declared by the programmer to be independent. With this information, you could imagine how a compiler might automatically generate parallel threaded code.

Multi-core examples
Intel "Coffee Lake" Core i7 hexa-core CPU (2017): six cores (Core 1 through Core 6) plus an integrated GPU. NVIDIA GTX 1080 GPU (2016): 20 replicated processing cores ("SMs").

More multi-core examples
Intel Xeon Phi "Knights Landing" 76-core CPU (2015). Apple A9 dual-core CPU (2015).
A9 image credit: Chipworks (obtained via Anandtech) http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/3

Data-parallel expression (in our fictitious data-parallel language)
Another interesting property of the forall version of sinx above: parallelism is across iterations of the loop, and all the iterations of the loop do the same thing: evaluate the sine of a single input number.
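For readers who want to try the forall idea in real C, here is a minimal sketch of my own (not from the lecture) that conveys the same "iterations are independent" claim with OpenMP. It assumes an OpenMP-capable compiler (e.g., gcc or clang with -fopenmp); the function name sinx_omp is mine.

  // Hedged sketch: the forall version of sinx expressed with OpenMP.
  // Assumes OpenMP support (compile with, e.g., gcc -O2 -fopenmp sinx_omp.c).
  #include <omp.h>

  void sinx_omp(int N, int terms, float* x, float* result) {
    // The pragma asserts what forall asserted: iterations are independent,
    // so the runtime may split them across cores.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;
      for (int j = 1; j <= terms; j++) {
        value += sign * numer / denom;
        numer *= x[i] * x[i];
        denom *= (2*j+2) * (2*j+3);
        sign *= -1;
      }
      result[i] = value;
    }
  }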
Scalar program
Original compiled program: processes one array element at a time using scalar instructions on scalar registers (e.g., 32-bit floats). The serial sinx code shown earlier compiles to the scalar instruction stream:

  ld  r0, addr[r1]
  mul r1, r0, r0
  mul r1, r1, r0
  ...
  st  addr[r2], r0

Vector program (using AVX intrinsics)
Intrinsics available to C programmers:

  #include <immintrin.h>

  void sinx(int N, int terms, float* x, float* result) {
    float three_fact = 6; // 3!
    for (int i=0; i<N; i+=8) {
      __m256 origx = _mm256_load_ps(&x[i]);
      __m256 value = origx;
      __m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));
      __m256 denom = _mm256_broadcast_ss(&three_fact);
      int sign = -1;
      for (int j=1; j<=terms; j++) {
        // value += sign * numer / denom
        __m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps((float)sign), numer), denom);
        value = _mm256_add_ps(value, tmp);
        numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));
        denom = _mm256_mul_ps(denom, _mm256_set1_ps((float)((2*j+2) * (2*j+3))));
        sign *= -1;
      }
      _mm256_store_ps(&result[i], value);
    }
  }

Compiled program: processes eight array elements simultaneously using vector instructions on 256-bit vector registers:

  vloadps  xmm0, addr[r1]
  vmulps   xmm1, xmm0, xmm0
  vmulps   xmm1, xmm1, xmm0
  ...
  vstoreps addr[xmm2], xmm0

(The slides show this vector version twice, once writing to result and once to an output array named sinx; the code is otherwise the same.)

What about conditional execution?
(Assume the logic below is to be executed for each element in input array 'A', producing output into the array 'result'.)

  float x = A[i];
  if (x > 0) {
    float tmp = exp(x, 5.f);
    tmp *= kMyConst1;
    x = tmp + kMyConst2;
  } else {
    float tmp = kMyConst1;
    x = 2.f * tmp;
  }
  result[i] = x;

[Diagram, shown on two slides: eight ALUs (ALU 1 ... ALU 8) step through this code over time. For the eight elements processed together, the condition evaluates per lane to T T T F F F F F, so different lanes want different branches.]

Mask (discard) output of ALU
[Diagram: the same eight lanes execute both the if and else clauses; each lane's output is kept only for the branch its mask bit says it actually took.]
Not all ALUs do useful work! Worst case: 1/8 peak performance.
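To make the masking idea concrete, here is a small sketch of my own (not from the slides) that writes the per-lane select explicitly with AVX intrinsics. It assumes an AVX-capable CPU and that N is a multiple of 8; kMyConst1 and kMyConst2 are passed in as parameters, and a plain polynomial stands in for the slide's exp(x, 5.f) so the example stays self-contained.

  // Hedged sketch: both branches are evaluated for all 8 lanes, then a per-lane
  // mask selects which result each lane keeps (the "mask and discard" idea).
  #include <immintrin.h>

  void conditional_example(const float* A, float* result, int N,
                           float kMyConst1, float kMyConst2) {
    __m256 c1   = _mm256_set1_ps(kMyConst1);
    __m256 c2   = _mm256_set1_ps(kMyConst2);
    __m256 zero = _mm256_setzero_ps();
    for (int i = 0; i < N; i += 8) {          // assumes N % 8 == 0
      __m256 x = _mm256_loadu_ps(&A[i]);
      // mask lane = all ones where x > 0, all zeros otherwise
      __m256 mask = _mm256_cmp_ps(x, zero, _CMP_GT_OS);
      // "if" side: x^5 (stand-in for exp(x, 5.f)), then * kMyConst1 + kMyConst2
      __m256 x2  = _mm256_mul_ps(x, x);
      __m256 tmp = _mm256_mul_ps(_mm256_mul_ps(x2, x2), x);
      __m256 then_val = _mm256_add_ps(_mm256_mul_ps(tmp, c1), c2);
      // "else" side: 2 * kMyConst1
      __m256 else_val = _mm256_mul_ps(_mm256_set1_ps(2.f), c1);
      // keep then_val where the mask is set, else_val elsewhere
      __m256 out = _mm256_blendv_ps(else_val, then_val, mask);
      _mm256_storeu_ps(&result[i], out);
    }
  }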
SIMD execution on modern CPUs
▪ SSE instructions: 128-bit operations: 4x32 bits or 2x64 bits (4-wide float vectors)
▪ AVX instructions: 256-bit operations: 8x32 bits or 4x64 bits (8-wide float vectors)
▪ Instructions are generated by the compiler
  - Parallelism explicitly requested by programmer using intrinsics
  - Parallelism conveyed using parallel language semantics (e.g., forall example)
  - Parallelism inferred by dependency analysis of loops (hard problem, even best compilers are not great on arbitrary C/C++ code)
▪ Terminology: "explicit SIMD": SIMD parallelization is performed at compile time
  - Can inspect program binary and see instructions (vstoreps, vmulps, etc.)

SIMD execution on many modern GPUs
▪ "Implicit SIMD"
  - Compiler generates a scalar binary (scalar instructions)
  - But N instances of the program are *always run* together on the processor: execute(my_function, N) // execute my_function N times
  - In other words, the interface to the hardware itself is data-parallel
  - Hardware (not compiler) is responsible for simultaneously executing the same instruction from multiple instances on different data on SIMD ALUs
▪ SIMD width of most modern GPUs ranges from 8 to 32
  - Divergence can be a big issue (poorly written code might execute at 1/32 the peak capability of the machine!)

Example: Intel Core i7
4 cores, 8 SIMD ALUs per core (AVX instructions).
On campus:
  - GHC machines: 4 cores, 8 SIMD ALUs per core
  - Machines in GHC 5207 (old GHC 3000 machines): 6 cores, 4 SIMD ALUs per core
  - CPUs in "latedays" cluster: 6 cores, 8 SIMD ALUs per core

[Diagram, repeated across four slides: one SIMD core with a Fetch/Decode unit, eight ALUs (ALU 0 ... ALU 7), and an execution context.]

Part 2: accessing memory

Terminology
▪ Memory latency
  - The amount of time for a memory request (e.g., load, store) from a processor to be serviced by the memory system
  - Example: 100 cycles, 100 nsec
▪ Memory bandwidth
  - The rate at which the memory system can provide data to a processor
  - Example: 20 GB/s

Stalls
▪ A processor "stalls" when it cannot run the next instruction in an instruction stream because of a dependency on a previous instruction.
▪ Accessing memory is a major source of stalls:

    ld  r0, mem[r2]
    ld  r1, mem[r3]
    add r0, r0, r1

  Dependency: the 'add' instruction cannot execute until the data at mem[r2] and mem[r3] has been loaded from memory.
▪ Memory access times ~ 100's of cycles
  - Memory "access time" is a measure of latency

Prefetching reduces stalls (hides latency)
▪ All modern CPUs have logic for prefetching data into caches
  - Dynamically analyze program's access patterns, predict what it will access soon
▪ Reduces stalls since data is resident in cache when accessed

    predict value of r2, initiate load
    predict value of r3, initiate load
    ...                                (data arrives in cache)
    ...                                (data arrives in cache)
    ld  r0, mem[r2]    <- these loads are cache hits
    ld  r1, mem[r3]    <-
    add r0, r0, r1

Note: prefetching can also reduce performance if the guess is wrong (hogs bandwidth, pollutes caches). (More detail later in the course.)
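As a side note not in the slides, some compilers also let the programmer issue prefetches by hand. The sketch below, my own, uses GCC/Clang's __builtin_prefetch to request data a fixed distance ahead of the current iteration; the lookahead of 16 elements is an arbitrary assumption, and for a simple streaming pattern like this the hardware prefetcher would usually do the job on its own.

  // Hedged sketch: explicit software prefetching with __builtin_prefetch (GCC/Clang).
  // The lookahead distance (16 elements) is an assumption; real tuning is workload-specific.
  float sum_array(const float* a, int N) {
    float sum = 0.f;
    for (int i = 0; i < N; i++) {
      if (i + 16 < N)
        __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* high temporal locality */);
      sum += a[i];
    }
    return sum;
  }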
Multi-threading reduces stalls
▪ Idea: interleave processing of multiple threads on the same core to hide stalls
▪ Like prefetching, multi-threading is a latency hiding, not a latency reducing technique

Hiding stalls with multi-threading
[Diagram, shown on two slides: one core with a Fetch/Decode unit and eight ALUs (ALU 0 ... ALU 7). With one thread (elements 0 ... 7), the core idles during stalls. With four hardware threads (thread 1: elements 0 ... 7, thread 2: elements 8 ... 15, thread 3: elements 16 ... 23, thread 4: elements 24 ... 31), whenever the running thread stalls the core switches to another runnable thread, so stalls overlap with useful work until each thread is done.]

Throughput computing trade-off
Key idea of throughput-oriented systems: potentially increase the time to complete the work of any one thread in order to increase overall system throughput when running multiple threads. While a thread waits its turn it is runnable, but it is not being executed by the processor (the core is running some other thread).

Storing execution contexts
[Diagram: one core with a Fetch/Decode unit, eight ALUs, and a block of context storage (or L1 cache).]
Consider on-chip storage of execution contexts a finite resource.

Hardware-supported multi-threading
▪ Core manages execution contexts for multiple threads
  - Runs instructions from runnable threads (the processor makes the decision about which thread to run each clock, not the operating system)
  - Core still has the same number of ALU resources: multi-threading only helps use them more efficiently in the face of high-latency operations like memory access
▪ Interleaved multi-threading (a.k.a. temporal multi-threading)
  - What I described on the previous slides: each clock, the core chooses a thread, and runs an instruction from the thread on the ALUs
▪ Simultaneous multi-threading (SMT)
  - Each clock, core chooses instructions from multiple threads to run on ALUs
  - Extension of superscalar CPU design
  - Example: Intel Hyper-threading (2 threads per core)

Multi-threading summary
▪ Benefit: use a core's ALU resources more efficiently
  - Hide memory latency
  - Fill multiple functional units of superscalar architecture (when one thread has insufficient ILP)
▪ Costs
  - Requires additional storage for thread contexts
  - Increases run time of any single thread (often not a problem, we usually care about throughput in parallel apps)
  - Requires additional independent work in a program (more independent work than ALUs!)
  - Relies heavily on memory bandwidth
    - More threads → larger working set → less cache space per thread
    - May go to memory more often, but can hide the latency
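A practical footnote of my own, not part of the lecture: on Linux a program can ask how many hardware execution contexts (logical CPUs, i.e., cores times SMT threads) the OS exposes, which is a common way to size a thread pool. The sketch below assumes a Linux/POSIX system.

  // Hedged sketch (assumes Linux/POSIX): query the number of logical CPUs,
  // which counts SMT (e.g., Hyper-threading) contexts as well as physical cores.
  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
    long logical_cpus = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical CPUs (hardware threads) online: %ld\n", logical_cpus);
    return 0;
  }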
Our fictitious multi-core chip
16 cores, 8 SIMD ALUs per core (128 total), 4 threads per core. That is 16 simultaneous instruction streams and 64 total concurrent instruction streams, and 512 independent pieces of work are needed to run the chip with maximal latency-hiding ability.

NVIDIA GTX 480: more detail (just for the curious)
[Diagram: one GTX 480 core with two Fetch/Decode units, SIMD function units whose control is shared across 16 units (1 MUL-ADD per clock), "shared" memory (16+48 KB), and execution contexts (128 KB).]
• This process occurs on another set of 16 ALUs as well
• So there are 32 ALUs per core
• 15 cores × 32 = 480 ALUs per chip
Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G

NVIDIA GTX 480
Recall, there are 15 cores on the GTX 480: that's 23,000 pieces of data being processed concurrently!

CPU vs. GPU memory hierarchies
CPU: big caches, few threads, modest memory bandwidth; relies mainly on caches and prefetching. Each core (Core 1 ... Core N) has an L1 cache (32 KB) and an L2 cache (256 KB), the cores share an L3 cache (8 MB), and DDR3 DRAM memory (gigabytes) is reached at about 25 GB/sec.
GPU: small caches, many threads, huge memory bandwidth; relies mainly on multi-threading. Each core (Core 1 ... Core N) has execution contexts (128 KB), a GFX texture cache (12 KB), and a scratchpad L1 cache (64 KB); the cores share an L2 cache (768 KB), and GDDR5 DRAM memory (~1 GB) is reached at about 177 GB/sec.

Bandwidth is a critical resource
Performant parallel programs will:
▪ Organize computation to fetch data from memory less often
  - Reuse data previously loaded by the same thread (traditional intra-thread temporal locality optimizations)
  - Share data across threads (inter-thread cooperation)
▪ Request data less often (instead, do more arithmetic: it's "free")
  - Useful term: "arithmetic intensity" — ratio of math operations to data access operations in an instruction stream
  - Main point: programs must have high arithmetic intensity to utilize modern processors efficiently
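To make "arithmetic intensity" concrete, here is a worked mini-example of my own (not from the slides) using the classic saxpy kernel; the byte counts assume 4-byte floats and that every access actually reaches memory.

  // Hedged example: arithmetic intensity of saxpy, assuming 4-byte floats and
  // that each x[i] load, y[i] load, and y[i] store goes all the way to memory.
  void saxpy(int N, float a, const float* x, float* y) {
    for (int i = 0; i < N; i++)
      y[i] = a * x[i] + y[i];   // 2 math ops (mul + add) per iteration
  }
  // Per iteration: 2 floating-point operations vs. 12 bytes of memory traffic
  // (load x[i] = 4 B, load y[i] = 4 B, store y[i] = 4 B).
  // Arithmetic intensity ≈ 2 / 12 ≈ 0.17 flops per byte, far too low to keep a
  // modern core's ALUs busy, so saxpy is bandwidth bound.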
Summary
▪ Three major ideas that all modern processors employ to varying degrees
  - Employ multiple processing cores
    - Simpler cores (embrace thread-level parallelism over instruction-level parallelism)
  - Amortize instruction stream processing over many ALUs (SIMD)
    - Increase compute capability with little extra cost
  - Use multi-threading to make more efficient use of processing resources (hide latencies, fill all available resources)
▪ Due to high arithmetic capability on modern chips, many parallel applications (on both CPUs and GPUs) are bandwidth bound
▪ GPU architectures use the same throughput computing ideas as CPUs, but GPUs push these concepts to extreme scales

For the rest of this class, know these terms
▪ Multi-core processor
▪ SIMD execution
▪ Coherent control flow
▪ Hardware multi-threading
  - Interleaved multi-threading
  - Simultaneous multi-threading
▪ Memory latency
▪ Memory bandwidth
▪ Bandwidth bound application
▪ Arithmetic intensity

Review: superscalar execution
Unmodified program (the serial sinx code shown earlier). My single-core, superscalar processor executes up to two instructions per clock from a single instruction stream.
[Diagram: one execution context with two Fetch/Decode units and two execution units (Exec 1, Exec 2).]
Independent operations in the instruction stream are detected by the processor at run-time and may be executed in parallel on execution units 1 and 2.

Review: multi-core execution (two cores)
Modify the program to create two threads of control (two instruction streams): the pthread version of sinx shown earlier (my_args, parallel_sinx, my_thread_start). My dual-core processor executes one instruction per clock from an instruction stream on each core.
[Diagram: two cores, each with a Fetch/Decode unit, an ALU (Execute), and an execution context.]

Review: multi-core + superscalar execution
Same two-thread pthread program. My superscalar dual-core processor executes up to two instructions per clock from an instruction stream on each core.
[Diagram: two cores, each with two Fetch/Decode units, two execution units (Exec 1, Exec 2), and an execution context.]

Review: four SIMD, multi-threaded cores
Observation: memory operations have very long latency.
Solution: hide the latency of loading data for one iteration by executing arithmetic instructions from other iterations (the forall version of sinx shown earlier).
[Diagram: four cores, each with a Fetch/Decode unit and two execution contexts; memory loads and stores overlap with SIMD arithmetic.]
My multi-threaded, SIMD quad-core processor executes one SIMD instruction per clock from one instruction stream on each core, but can switch to processing the other instruction stream when faced with a stall.
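Since the review slides reuse the two-thread pthread decomposition, here is a sketch of my own (not from the lecture) generalizing it to T worker threads, each handling a contiguous slice of the array; NUM_THREADS = 4 is an arbitrary assumption (e.g., one thread per core), and sinx is the serial kernel from earlier.

  // Hedged sketch: the two-thread pthread decomposition generalized to T threads.
  #include <pthread.h>

  #define NUM_THREADS 4   // assumption: e.g., one thread per core

  typedef struct { int N; int terms; float* x; float* result; } my_args;

  void sinx(int N, int terms, float* x, float* result);  // serial kernel from earlier

  static void* worker(void* arg) {
    my_args* a = (my_args*)arg;
    sinx(a->N, a->terms, a->x, a->result);
    return NULL;
  }

  void parallel_sinx_T(int N, int terms, float* x, float* result) {
    pthread_t tid[NUM_THREADS];
    my_args args[NUM_THREADS];
    int base = N / NUM_THREADS;
    for (int t = 0; t < NUM_THREADS; t++) {
      int start = t * base;
      // last thread picks up any remainder
      args[t].N = (t == NUM_THREADS - 1) ? (N - start) : base;
      args[t].terms = terms;
      args[t].x = x + start;
      args[t].result = result + start;
      pthread_create(&tid[t], NULL, worker, &args[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
      pthread_join(tid[t], NULL);
  }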
Summary: four superscalar, SIMD, multi-threaded cores
[Diagram: four cores, each with two execution contexts, two Fetch/Decode units, a SIMD execution unit (Exec 1), and a scalar execution unit (Exec 2).]
My multi-threaded, superscalar, SIMD quad-core processor executes up to two instructions per clock from one instruction stream on each core (in this example: one SIMD instruction + one scalar instruction). The processor can switch to execute the other instruction stream when faced with a stall.

Connecting it all together
Our simple quad-core processor:
[Diagram: four cores, each with two execution contexts, two Fetch/Decode units, a SIMD execution unit, a scalar execution unit, and a per-core L1 cache and L2 cache; an on-chip interconnect links the cores to a shared L3 cache and a memory controller driving the memory bus (to DRAM).]
Four cores, two-way multi-threading per core (max eight threads active on chip at once), up to two instructions per clock per core (one of those instructions is 8-wide SIMD).
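As a closing worked calculation of my own (applying the same counting the lecture used for the fictitious 16-core chip, so treat it as an illustration rather than a slide): 4 cores × 2 threads per core = 8 concurrent instruction streams, and 8 streams × 8 SIMD lanes each = 64 independent pieces of work needed to run this quad-core chip with maximal latency-hiding ability.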