Parallel Architecture
CS 415/515
Jingke Li, Portland State University

Parallel Computers

A "conventional" computer consists of
• a single CPU (typically with a set of pipelined functional units)
• a single memory hierarchy (i.e. caches and main memory)
and operations are performed one after another in sequential order.

A parallel computer, in contrast, has multiple components. There are many ways to organize a parallel computer:
• Single core with multiple functional units
• Multiple cores with a shared cache hierarchy
• Multiple processors with a shared memory module
• Multiple processors with individual memory modules
• Clusters of subsystems

Flynn's Taxonomy

S: Single, M: Multiple, I: Instruction stream, D: Data stream
• SISD — Conventional uniprocessors
• MISD — Not very realistic
• SIMD — The same operation is simultaneously applied to multiple data items
  • Can take different forms, e.g. single-CPU computers, vector computers, and large-scale supercomputers
  • Recent SIMD computers are mostly small-scale
• MIMD — Multiple threads of instructions operating on multiple threads of data. In the most general case, the threads could be independent programs!
  • A very broad category; can be refined into many sub-categories
  • Most current large-scale parallel systems fall into this category

Parallel Architectures Today

• Processor-Level Parallelism
  • Single CPU with special parallel instructions
  • Multi-core processors
  • GPUs
  • Vector processors
• System-Level Architectures
  • SIMD systems
  • Symmetric multiprocessors (SMPs)
  • Non-uniform memory access machines (NUMAs)
  • Networks of workstations (NOWs)
• Supercomputer Architectures
  • Massively Parallel Processing systems (MPPs) (thousands of processors)
  • Large-scale clusters (tens of thousands of processors)
  • Constellations (clusters of powerful vector processors)

General-Purpose GPUs (GPGPUs)

A new trend in this field:
• Add more flexibility to the GPU's programming model
• Extend to non-graphical, but still matrix-based, applications

GPGPUs are best described as co-processors. To handle general applications, CPU hosts are still needed.

Two GPU programming languages have emerged:
• CUDA (Compute Unified Device Architecture) — developed by NVIDIA.
• OpenCL (Open Computing Language) — initially conceived by Apple, now managed by the non-profit consortium Khronos Group.

Challenges:
• Data movement between the host CPU and the GPU is slow.
• Integer operations are weak.
• Programming is still very hard.

Vector Processors

Vector processors are machines built primarily to handle large scientific and engineering calculations. Their performance derives from a heavily pipelined architecture which can execute special vector instructions very efficiently. They are the "traditional" supercomputers.

Key Components:
• A set of pipelined functional units.
• Special vector registers — Data is read into the vector registers, which are FIFO queues capable of holding 50-100 floating-point values.
• Special vector instructions — such as loading/storing a vector register from/to a location in memory, or performing operations on the elements in the vector registers. A sketch of the kind of loop these instructions accelerate appears below.

Sample Machines: Early CRAY series, CDC Cyber 205, IBM 3090 family, FPS-164.
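To make the vector model concrete, here is a minimal sketch in C (the function name and loop are illustrative, not taken from any particular machine's ISA): a SAXPY-style loop whose iterations are all independent, which is exactly the pattern a vector processor can execute as a short sequence of vector loads, a vector multiply-add, and vector stores.

```c
#include <stddef.h>

/* SAXPY: y = a*x + y. Each iteration is independent of the others,
 * so a vectorizing compiler can map this loop onto vector loads, a
 * vector multiply-add, and vector stores, processing many elements
 * per instruction instead of one. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The same data-parallel pattern is what the SIMD machines and GPUs discussed in these slides exploit: one operation, many data items.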
SIMD Systems

SIMD systems consist of an array of worker processors and a distinguished control processor.
• On each clock cycle, the control processor issues an instruction to all processors over the control bus.
• Each processor executes that instruction and (optionally) returns a result to memory via the data bus.
• The individual processors may have their own memory, or the whole system may share a single main memory.

[Figure: processors P1, P2, P3, ..., Pn connected to the control processor by a control bus and to memory by a data bus]

Sample Machines: ILLIAC IV, TMC CM1, CM2, IBM GF11, MasPar MP1.

SIMD Machine Execution Example

[Figure: step-by-step execution example on a SIMD machine]

MIMD Systems

MIMD systems consist of a collection of processors:
• Each processor is capable of running a distinct thread of computation.
• The processors coordinate on a joint program via a shared address space or through message passing.

Shared-Memory MIMD Systems

The processors share a single address space. This address space can be realized either through a single physical memory accessible to all processors, or through a set of distributed memory modules attached to the processors. Respectively, the two sub-categories of systems are called symmetric multiprocessors (SMPs) and non-uniform memory access machines (NUMAs).

Advantages:
• No need to partition or duplicate data
• Less communication overhead
• Programming style close to the sequential programming style

Main Issues:
• Scalability of the interconnection network
• Memory-cache consistency

Memory-Cache Consistency

• In modern computer architectures, a memory hierarchy (main memory plus multiple levels of cache) is used to overcome the memory access latency problem.
• A single data item may have multiple copies residing at different levels of the memory hierarchy, and these copies may not always be identical.
• On a uniprocessor, disagreement between the cache and the memory is not a problem, because the cache copy is always accessed first.
• But on a shared-memory multiprocessor system, multiple caches are connected to the same memory.

Consistency Problems — Example 1

With Write-Through Caches:
1. Processor P1 reads x from main memory, bringing a copy into its cache
2. Processor P2 reads x from main memory, bringing a copy into its cache
3. P2 changes x's value; the new value is written through to main memory
4. Processor P1 reads x's value again — it gets the old value from its cache!

Consistency Problems — Example 2

With Write-Back Caches (steps 1 and 2 are as in Example 1):
3. P2 changes x's value; the new value stays in P2's cache
4. P1 reads x and gets the stale value from its cache
5. Any other processor that reads x will also get the stale value from memory

In addition, if multiple processors have distinct values of x to write back, the final value of x in main memory is determined by the order in which the cache lines arrive at memory, which may have nothing to do with the order of the writes to x.
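The write-through scenario in Example 1 can be traced with a toy C model. This simulates the two private caches with plain variables; it illustrates the bookkeeping behind the example, not real hardware behavior.

```c
#include <stdio.h>

/* Toy model of the write-through example: one memory location x and a
 * private cached copy per processor. No coherence mechanism exists in
 * this model, so P1's copy silently goes stale after P2's write. */
int main(void) {
    int mem_x = 1;            /* x in main memory                      */
    int p1_cache = mem_x;     /* 1. P1 reads x, copy enters its cache  */
    int p2_cache = mem_x;     /* 2. P2 reads x, copy enters its cache  */

    p2_cache = 2;             /* 3. P2 changes x ...                   */
    mem_x = p2_cache;         /*    ... write-through updates memory   */

    /* 4. P1 reads x again -- it hits in its own cache and sees 1,
     * even though main memory now holds 2. */
    printf("P1 sees %d, memory holds %d\n", p1_cache, mem_x);
    return 0;
}
```

The invalidate and update solutions described next both amount to ensuring that step 3 also does something to P1's stale copy.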
Solutions: Invalidate or Update

• Invalidate — whenever a data item is written, all other copies in the memory system are invalidated.
• Update — whenever a data item is written, all other copies are updated.

Cache Coherence through Bus Snooping

• Assume multiprocessors with private caches are placed on a shared bus.
• Each processor's cache controller continuously snoops on the bus, watching for relevant transactions (i.e. those involving cache lines of which it has a copy).
• Once such a transaction is caught, the controller takes one of two actions: invalidate its copy of the line, or update its copy of the line.

Directory-Based Cache Coherence

Use a cache directory to record the locations and states of all cached lines.
• A directory entry contains the locations of all remote copies of the same line, plus status information
• The main advantage is scalability; it also works on machines with physically distributed memory

Directory Schemes:
• Centralized — A single, centralized directory for the whole system
• Flat, memory-based — Directory information co-locates with the memory module that is home to that memory line; each directory entry contains pointers to all sharers of the line
• Flat, cache-based — Also uses a home directory, but each directory entry contains only a pointer to the first sharer; the remaining sharers are joined together in a distributed, doubly linked list
• Hierarchical — Uses a hierarchy of caches; each parent keeps track of exactly which of its immediate children have a copy of the data
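As a concrete illustration of the flat, memory-based scheme, a directory entry is commonly modeled as a line state plus a bit vector of sharers, one bit per node. The sketch below is a hedged illustration: the field names, the state names, and the 64-node limit are assumptions made for this example, not part of any specific machine.

```c
#include <stdint.h>

/* One directory entry per memory line (flat, memory-based scheme).
 * The sharer set is a bit vector: bit i is set if node i holds a copy. */
enum line_state { UNCACHED, SHARED, MODIFIED };

struct dir_entry {
    enum line_state state;
    uint64_t sharers;     /* supports up to 64 nodes in this sketch */
};

/* Under an invalidate protocol, a write by node `w` requires sending
 * invalidations to every sharer other than the writer; afterwards the
 * writer holds the only valid copy. */
static inline uint64_t invalidation_targets(const struct dir_entry *e, int w) {
    return e->sharers & ~(1ULL << w);
}
```

A full bit vector keeps lookups simple but costs one bit per node per line, which is exactly the scaling pressure that motivates the cache-based (linked-list) and hierarchical variants.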
MPPs with a Flat Interconnection

Example: the ASCI Red supercomputer (built by Intel; a successor to the Intel Paragon).

[Figure: the ASCI Red system]

ASCI Red System Parameters

Compute Nodes: 4,640
Service, I/O, System, and Network Nodes: 16, 74, 2, 20
System Footprint: 2,500 square feet
Number of Cabinets (Compute/Switch/Disk): 104 (76/8/20)
System RAM (Compute Nodes/I/O Nodes): 606 GB (128 MB/256 MB)
Topology: Mesh (38 × 32 × 2)
Node Link Bandwidth (bi-directional): 800 MB/s
Cross-Section Bandwidth (bi-directional): 51.2 GB/s
Total Number of PII Xeon Core Processors: 9,536
Compute Node Peak Performance: 666 MOPs
System Peak Performance: 3.15 TOPs
Total RAID Disk Storage: 12.5 TB
Total RAID I/O Bandwidth: 4.0 GB/s

All aspects of this system architecture are scalable: communication bandwidth, main memory, internal disk storage capacity, and I/O.

MPPs with a Hierarchical Interconnection

Example: the ASCI Blue/White supercomputers (i.e. IBM SP2).

Each cabinet (system frame) holds sixteen nodes, communicating through an SP Switch at 110 MB/second peak, full duplex. To make a 128-processor setup, use eight cabinets.

IBM SP2 Node and Frame:

[Figure: an IBM SP2 node and frame]

IBM SP2 Communication System

[Figure: the IBM SP2 communication system]

Large-Scale Cluster Systems

Large-scale clusters offer an attractive alternative to MPPs for supercomputing:
• The latest processors can easily be incorporated into the system as they become available.
• They tend to be more scalable.

The IBM Roadrunner System

The world's fastest computer (as of 07/2008).
• It is considered an Opteron cluster with Cell accelerators.
• Each node consists of a Cell attached to an Opteron core, and the Opterons are connected to each other.
• A total of 6,948 dual-core Opterons and 12,960 Cell chips in 294 racks.
• The final cluster is made up of 18 connected units, which are connected via eight additional (second-stage) InfiniBand ISR2012 switches.
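MPPs and clusters like the systems above have no shared address space, so they are programmed through message passing, most commonly with MPI. The following is a minimal sketch in C using standard MPI calls; the rank assignments and the payload are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI sketch: rank 0 sends an integer to rank 1.
 * Every process runs the same program; behavior branches on rank. */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two processes, e.g. mpirun -np 2 ./a.out. On a cluster, the same program scales simply by launching more ranks across more nodes.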