Parallel Architecture
CS 415/515
Jingke Li, Portland State University

Parallel Computers

A "conventional" computer consists of
• a single CPU (typically with a set of pipelined functional units)
• a single memory hierarchy (i.e. caches and main memory)
and operations are performed one after another in sequential order.

A parallel computer, in contrast, has multiple components. There are many ways to organize a parallel computer:
• Single core with multiple functional units
• Multiple cores with a shared cache hierarchy
• Multiple processors with a shared memory module
• Multiple processors with individual memory modules
• Clusters of subsystems

Flynn's Taxonomy

S: Single, M: Multiple, I: Instruction stream, D: Data stream
• SISD — Conventional uniprocessors
• MISD — Not very realistic
• SIMD — The same operation is simultaneously applied to multiple data items
  • Can take different forms, e.g. single-CPU computers, vector computers, and large-scale supercomputers
  • Recent SIMD computers are mostly small-scale
• MIMD — Multiple threads of instructions operating on multiple threads of data. In the most general case, the threads could be independent programs!
  • A very broad category; can be refined into many sub-categories
  • Most current large-scale parallel systems fall into this category

Parallel Architectures Today

• Processor-Level Parallelism
  • Single CPU with special parallel instructions
  • Multi-core processors
  • GPUs
  • Vector processors
• System-Level Architectures
  • SIMD systems
  • Symmetric multiprocessors (SMPs)
  • Non-uniform memory access machines (NUMAs)
  • Networks of workstations (NOWs)
• Supercomputer Architectures
  • Massively Parallel Processing systems (MPPs) (thousands of processors)
  • Large-scale clusters (tens of thousands of processors)
  • Constellations (clusters of powerful vector processors)

General-Purpose GPUs (GPGPUs)

A new trend in this field:
• Add more flexibility to the GPU's programming model
• Extend to non-graphical, but still matrix-based, applications

GPGPUs are best described as co-processors. To handle general applications, CPU hosts are still needed.

Two GPU programming languages have emerged:
• CUDA (Compute Unified Device Architecture) — developed by NVIDIA.
• OpenCL (Open Computing Language) — initially conceived by Apple, now managed by the non-profit consortium Khronos Group.

Challenges:
• Data movement between the host CPU and the GPU is slow.
• Integer operations are weak.
• Programming is still very hard.

Vector Processors

Vector processors are machines built primarily to handle large scientific and engineering calculations. Their performance derives from a heavily pipelined architecture which can execute special vector instructions very efficiently. They are the "traditional" supercomputers.

Key Components:
• A set of pipelined functional units.
• Special vector registers — Data is read into the vector registers, which are FIFO queues capable of holding 50-100 floating-point values.
• Special vector instructions — such as loading/storing a vector register from/to a location in memory, or performing operations on the elements in the vector registers. A sketch of the kind of loop these instructions accelerate appears below.

Sample Machines: Early CRAY series, CDC Cyber 205, IBM 3090 family, FPS-164.
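To make the vector model concrete, here is a minimal sketch in C (the function name and loop are illustrative, not taken from any particular machine's ISA): a SAXPY-style loop whose iterations are all independent, which is exactly the pattern a vector processor can execute as a short sequence of vector loads, a vector multiply-add, and vector stores.

```c
#include <stddef.h>

/* SAXPY: y = a*x + y. Each iteration is independent of the others,
 * so a vectorizing compiler can map this loop onto vector loads, a
 * vector multiply-add, and vector stores, processing many elements
 * per instruction instead of one. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The same data-parallel pattern is what the SIMD machines and GPUs discussed in these slides exploit: one operation, many data items.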
SIMD Systems

SIMD systems consist of an array of worker processors and a distinguished control processor.
• On each clock cycle, the control processor issues an instruction to all processors over the control bus.
• Each processor executes that instruction and (optionally) returns a result to memory via the data bus.
• The individual processors may have their own memory, or the whole system may share a single main memory.

[Figure: processors P1, P2, P3, ..., Pn connected to the control processor by a control bus and to memory by a data bus]

Sample Machines: ILLIAC IV, TMC CM1, CM2, IBM GF11, MasPar MP1.

SIMD Machine Execution Example

[Figure: step-by-step execution example on a SIMD machine]

MIMD Systems

MIMD systems consist of a collection of processors:
• Each processor is capable of running a distinct thread of computation.
• The processors coordinate on a joint program via a shared address space or through message passing.

Shared-Memory MIMD Systems

The processors share a single address space. This address space can be realized either through a single physical memory accessible to all processors, or through a set of distributed memory modules attached to the processors. Respectively, the two sub-categories of systems are called symmetric multiprocessors (SMPs) and non-uniform memory access machines (NUMAs).

Advantages:
• No need to partition or duplicate data
• Less communication overhead
• Programming style close to the sequential programming style

Main Issues:
• Scalability of the interconnection network
• Memory-cache consistency

Memory-Cache Consistency

• In modern computer architectures, a memory hierarchy (main memory plus multiple levels of cache) is used to overcome the memory access latency problem.
• A single data item may have multiple copies residing at different levels of the memory hierarchy, and these copies may not always be identical.
• On a uniprocessor, disagreement between the cache and the memory is not a problem, because the cache copy is always accessed first.
• But on a shared-memory multiprocessor system, multiple caches are connected to the same memory.

Consistency Problems — Example 1

With Write-Through Caches:
1. Processor P1 reads x from main memory, bringing a copy into its cache
2. Processor P2 reads x from main memory, bringing a copy into its cache
3. P2 changes x's value; the new value is written through to main memory
4. Processor P1 reads x's value again — it gets the old value from its cache!

Consistency Problems — Example 2

With Write-Back Caches (steps 1 and 2 are as in Example 1):
3. P2 changes x's value; the new value stays in P2's cache
4. P1 reads x and gets the stale value from its cache
5. Any other processor that reads x will also get the stale value from memory

In addition, if multiple processors have distinct values of x to write back, the final value of x in main memory is determined by the order in which the cache lines arrive at memory, which may have nothing to do with the order of the writes to x.
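The write-through scenario in Example 1 can be traced with a toy C model. This simulates the two private caches with plain variables; it illustrates the bookkeeping behind the example, not real hardware behavior.

```c
#include <stdio.h>

/* Toy model of the write-through example: one memory location x and a
 * private cached copy per processor. No coherence mechanism exists in
 * this model, so P1's copy silently goes stale after P2's write. */
int main(void) {
    int mem_x = 1;            /* x in main memory                      */
    int p1_cache = mem_x;     /* 1. P1 reads x, copy enters its cache  */
    int p2_cache = mem_x;     /* 2. P2 reads x, copy enters its cache  */

    p2_cache = 2;             /* 3. P2 changes x ...                   */
    mem_x = p2_cache;         /*    ... write-through updates memory   */

    /* 4. P1 reads x again -- it hits in its own cache and sees 1,
     * even though main memory now holds 2. */
    printf("P1 sees %d, memory holds %d\n", p1_cache, mem_x);
    return 0;
}
```

The invalidate and update solutions described next both amount to ensuring that step 3 also does something to P1's stale copy.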
Solutions: Invalidate or Update

• Invalidate — whenever a data item is written, all other copies in the memory system are invalidated.
• Update — whenever a data item is written, all other copies are updated.

Cache Coherence through Bus Snooping

• Assume multiprocessors with private caches are placed on a shared bus.
• Each processor's cache controller continuously snoops on the bus, watching for relevant transactions (i.e. those involving cache lines of which it has a copy).
• Once such a transaction is caught, the controller takes one of two actions: invalidate its copy of the line, or update its copy of the line.

Directory-Based Cache Coherence

Use a cache directory to record the locations and states of all cached lines.
• A directory entry contains the locations of all remote copies of the same line, plus status information
• The main advantage is scalability; it also works on machines with physically distributed memory

Directory Schemes:
• Centralized — A single, centralized directory for the whole system
• Flat, memory-based — Directory information co-locates with the memory module that is home to that memory line; each directory entry contains pointers to all sharers of the line
• Flat, cache-based — Also uses a home directory, but each directory entry contains only a pointer to the first sharer; the remaining sharers are joined together in a distributed, doubly linked list
• Hierarchical — Uses a hierarchy of caches; each parent keeps track of exactly which of its immediate children have a copy of the data
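As a concrete illustration of the flat, memory-based scheme, a directory entry is commonly modeled as a line state plus a bit vector of sharers, one bit per node. The sketch below is a hedged illustration: the field names, the state names, and the 64-node limit are assumptions made for this example, not part of any specific machine.

```c
#include <stdint.h>

/* One directory entry per memory line (flat, memory-based scheme).
 * The sharer set is a bit vector: bit i is set if node i holds a copy. */
enum line_state { UNCACHED, SHARED, MODIFIED };

struct dir_entry {
    enum line_state state;
    uint64_t sharers;     /* supports up to 64 nodes in this sketch */
};

/* Under an invalidate protocol, a write by node `w` requires sending
 * invalidations to every sharer other than the writer; afterwards the
 * writer holds the only valid copy. */
static inline uint64_t invalidation_targets(const struct dir_entry *e, int w) {
    return e->sharers & ~(1ULL << w);
}
```

A full bit vector keeps lookups simple but costs one bit per node per line, which is exactly the scaling pressure that motivates the cache-based (linked-list) and hierarchical variants.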
MPPs with a Flat Interconnection

Example: the ASCI Red supercomputer (built by Intel; a successor to the Intel Paragon).

[Figure: the ASCI Red system]

ASCI Red System Parameters

Compute Nodes: 4,640
Service, I/O, System, and Network Nodes: 16, 74, 2, 20
System Footprint: 2,500 square feet
Number of Cabinets (Compute/Switch/Disk): 104 (76/8/20)
System RAM (Compute Nodes/I/O Nodes): 606 GB (128 MB/256 MB)
Topology: Mesh (38 × 32 × 2)
Node Link Bandwidth (bi-directional): 800 MB/s
Cross-Section Bandwidth (bi-directional): 51.2 GB/s
Total Number of PII Xeon Core Processors: 9,536
Compute Node Peak Performance: 666 MOPs
System Peak Performance: 3.15 TOPs
Total RAID Disk Storage: 12.5 TB
Total RAID I/O Bandwidth: 4.0 GB/s

All aspects of this system architecture are scalable: communication bandwidth, main memory, internal disk storage capacity, and I/O.

MPPs with a Hierarchical Interconnection

Example: the ASCI Blue/White supercomputers (i.e. IBM SP2).

Each cabinet (system frame) holds sixteen nodes, communicating through an SP Switch at 110 MB/second peak, full duplex. To make a 128-processor setup, use eight cabinets.

IBM SP2 Node and Frame:

[Figure: an IBM SP2 node and frame]

IBM SP2 Communication System

[Figure: the IBM SP2 communication system]

Large-Scale Cluster Systems

Large-scale clusters offer an attractive alternative to MPPs for supercomputing:
• The latest processors can easily be incorporated into the system as they become available.
• They tend to be more scalable.

The IBM Roadrunner System

The world's fastest computer (as of 07/2008).
• It is considered an Opteron cluster with Cell accelerators.
• Each node consists of a Cell attached to an Opteron core, and the Opterons are connected to each other.
• A total of 6,948 dual-core Opterons and 12,960 Cell chips in 294 racks.
• The final cluster is made up of 18 connected units, which are connected via eight additional (second-stage) InfiniBand ISR2012 switches.
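MPPs and clusters like the systems above have no shared address space, so they are programmed through message passing, most commonly with MPI. The following is a minimal sketch in C using standard MPI calls; the rank assignments and the payload are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI sketch: rank 0 sends an integer to rank 1.
 * Every process runs the same program; behavior branches on rank. */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two processes, e.g. mpirun -np 2 ./a.out. On a cluster, the same program scales simply by launching more ranks across more nodes.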