Parallel Programming in Computer Science: Making Use of Multiple Processing Units - Study Notes in Computer Architecture and Organization

This document, from Rensselaer Polytechnic Institute's Computer Science 2500 course, explores parallel programming, specifically how to make use of multiple processing units in a computer system. It covers the concept of concurrent programming, the need for communication and synchronization between processes, and the process of achieving parallelism. It also provides examples of shared memory parallelism using threads and introduces the concept of critical sections to prevent race conditions.

Computer Science 2500: Computer Organization
Rensselaer Polytechnic Institute
Spring 2009

Topic Notes: Parallel Programming Intro

Given a multicore/SMT processor or a computer with multiple processors on separate chips (a symmetric multiprocessor (SMP)), how can we make use of the multiple processing units?

This level of parallelism is at a much higher level than the instruction-level parallelism we looked at before. There, the compiler and/or architecture takes a single program made up of a sequential series of instructions, and executes those instructions in parallel in a way that produces the same result as a one-by-one sequential execution of the instructions.

For a computer with multiple processors, we need to provide multiple streams of instructions to be executed by the processors. A single stream of instructions will only make use of one of our processors at a time.

The easiest way to program these systems is to program them just like a regular single-processor system, but to run multiple programs at once. Each program being run will be assigned to a CPU by the operating system. However, we would like to consider an approach where a single program can make use of these multiple CPUs.

If we are going to do this, we first need to think about how we would break down the problem to be solved into components that can be executed in parallel, then write a program to achieve it. Consider some examples:

• Taking a census of Troy. One person doing this would visit each house, count the people, and ask whatever questions are supposed to be asked. This person would keep running counts. At the end, this person has gathered everything.

  If there are two people, they can work concurrently. Each visits some houses, and they need to “report in” along the way or at the end to combine their information. But how to split up the work?

  – Each person could do what the individual was originally doing, but would check to make sure each house along the way had not yet been counted.

  – Each person could start at the city hall, get an address that has not yet been visited, go visit it, then go back to the city hall to report the result and get another address to visit. Someone at city hall keeps track of the cumulative totals. This is nice because neither person will be left without work to do until the whole thing is done. This is the master-slave method of breaking up the work.

  – The city could be split up beforehand. Each could get a randomly selected collection of addresses to visit. Maybe one person takes all houses with even street numbers and the other all houses with odd street numbers. Or perhaps one person would take everything north of Hoosick St. and the other everything south of Hoosick St. The choice of how to divide up the city may have a big effect on the total cost. There could be excessive travel if one person walks right past a house that has not yet been visited. Also, one person could finish completely while the other still has a lot of work to do. This is a domain decomposition approach.

• Grading a stack of exams. Suppose each has several questions. Again, assume two graders to start.

  – Each person could take half of the stack. Simple enough. But we still have the potential of one person finishing before the other.
  – Each person could take a paper from the “ungraded” stack, grade it, then put it into the “graded” stack.

  – Perhaps it makes more sense to have each person grade half of the questions instead of half of the exams, maybe because it would be unfair to have the same question graded by different people. Here, we could use variations on the approaches above. Each takes half the stack, grades his own questions, then they swap stacks.

  – Or we form a pipeline, where each exam goes from one grader to the next to the finished pile. Some time is needed to start up the pipeline and drain it out, especially if we add more graders. These models could be applied to the census example, if different census takers each went to every house to ask different questions.

  – Suppose we also add in a “grade totaler and recorder” person. Does that make any of the approaches better or worse?

• Adding two 1,000,000 × 1,000,000 matrices.

  – Each matrix entry in the sum can be computed independently, so we can break this up any way we like. Could use the master-slave approach, though a domain decomposition would probably make more sense. Depending on how many processes we have, we might break it down by individual entries, or maybe by rows or columns.

In each of these cases, we have taken what we might normally think of as a sequential process, and taken advantage of the availability of concurrent processing to make use of multiple workers (processing units).

Some Terminology

Sequential Program: sequence of actions that produce a result (statements + variables), called a process, task, or thread (of control). The state of the program is determined by the code, data, and a single program counter.

The initialization can all be done in any order – each i and j combination is independent of each other, and the assignment of a[i][j] and b[i][j] can be done in either order.

In the actual matrix-matrix multiply, each c[i][j] must be initialized to 0 before the sum can start to be accumulated. Also, iteration k of the inner loop can only be done after row i of a and column j of b have been initialized.

Finally, the sum contribution of each c[i][j] can be added as soon as that c[i][j] has been computed, and after sum has been initialized to 0.

That granularity seems a bit cumbersome, so we might step back and just say that we can initialize a and b in any order, but that it should be completed before we start computing values in c. Then we can initialize and compute each c[i][j] in any order, but we do not start accumulating sum until c is completely computed.

But all of these dependencies in this case can be determined by a relatively straightforward computation. Seems like a job for a compiler! (And in this case, it can be.)

Unfortunately, not everything can be parallelized by the compiler. If we change the initialization code to:

for (i=0; i<SIZE; i++) {
  for (j=0; j<SIZE; j++) {
    if ((i == 0) || (j == 0)) {
      a[i][j] = i+j;
      b[i][j] = i-j;
    }
    else {
      a[i][j] = a[i-1][j-1] + i + j;
      b[i][j] = b[i-1][j-1] + i - j;
    }
  }
}

it can't be parallelized, so no matter how many processors we throw at it, we can't speed it up.
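The matrix code whose dependencies are analyzed above appears on pages not included in this preview. A rough reconstruction based on that discussion might look like the following sketch; the array names a, b, c, the scalar sum, and the bound SIZE come from the discussion, while the simple i+j / i-j initialization is an assumption inferred from the modified version just shown.

#define SIZE 1000

double a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];

double compute(void) {
  int i, j, k;
  double sum;

  /* initialization: each (i,j) entry is independent of all others */
  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      a[i][j] = i+j;   /* assumed form of the original initialization */
      b[i][j] = i-j;
    }
  }

  /* matrix-matrix multiply: c[i][j] must be zeroed before its k loop runs,
     and needs row i of a and column j of b to be initialized already */
  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      c[i][j] = 0;
      for (k=0; k<SIZE; k++) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }

  /* accumulate sum: each c[i][j] can be added once it has been computed
     and sum has been initialized to 0 */
  sum = 0;
  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      sum += c[i][j];
    }
  }
  return sum;
}

In this version every iteration of each loop nest is independent (subject only to the ordering constraints described above), which is what makes it a good candidate for automatic parallelization, unlike the modified initialization.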
Approaches to Parallelism

Automatic parallelism is great, when it's possible. We got it for free (at least once we bought the compiler)! It does have limitations, though:

• some potential parallelization opportunities cannot be detected automatically – we can add directives to help

• a bigger complication – this executable cannot run on distributed-memory systems

Parallel programs can be categorized by how the cooperating processes communicate with each other:

• Shared Memory – some variables are accessible from multiple processes. Reading and writing these values allow the processes to communicate.

• Message Passing – communication requires explicit messages to be sent from one process to the other when they need to communicate.

These are functionally equivalent given appropriate operating system support. For example, one can write message-passing software using shared memory constructs, and one can simulate a shared memory by replacing accesses to non-local memory with a series of messages that access or modify the remote memory.

The automatic parallelization we have seen to this point is a shared memory parallelization, though we don't have to think about how it's done. The main implication is that we have to run the parallelized executable on a computer with multiple processors.

Our first tool for explicit parallelization will be shared memory parallelism using threads.

A Brief Intro to POSIX threads

Multithreading usually allows for the use of shared memory. Many operating systems provide support for threads, and a standard interface has been developed: POSIX Threads, or pthreads. A good online tutorial is available at https://computing.llnl.gov/computing/tutorials/pthreads/. You should read through it and remember that it's there for reference. A Google search for “pthread tutorial” yields many others.

Pthreads are available on the Solaris nodes in the cluster, and are standard on most modern Unix-like operating systems.

The basic idea is that we can create and destroy threads of execution in a program, on the fly, during its execution. These threads can then be executed in parallel by the operating system scheduler. If we have multiple processors, we should be able to achieve a speedup over the single-threaded equivalent.

We start with a look at a pthreads “Hello, world” program:

See: /cs/terescoj/shared/cs2500/examples/pthreadhello

The most basic functionality involves the creation and destruction of threads:

• pthread_create(3THR) – This creates a new thread. It takes 4 arguments. The first is a pointer to a variable of type pthread_t. Upon return, this contains a thread identifier that may be used later in a call to pthread_join(). The second is a pointer to a pthread_attr_t structure that specifies thread creation attributes. In the pthreadhello program, we pass in NULL, which will request the system default attributes. The third argument is a pointer to a function that will be called when the thread is started. This function must take a single parameter of type void * and return void *. The fourth parameter is the pointer that will be passed as the argument to the thread function.

• pthread_exit(3THR) – This causes the calling thread to exit. This is called implicitly if the thread function called during the thread creation returns. Its argument is a return status value, which can be retrieved by pthread_join().

• pthread_join(3THR) – This causes the calling thread to block (wait) until the thread with the identifier passed as the first argument to pthread_join() has exited. The second argument is a pointer to a location where the return status passed to pthread_exit() can be stored. In the pthreadhello program, we pass in NULL, and hence ignore the value.

Prototypes for pthread functions are in pthread.h and programs need to link with libpthread.a (use -lpthread at link time). When using the Sun compiler, the -mt flag should also be specified to indicate multithreaded code.
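The pthreadhello source file itself is not reproduced in these notes, but a minimal program along those lines might look like the following sketch; the thread count and messages are made up for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define NUM_THREADS 4

/* thread function: must take and return void * */
void *hello(void *arg) {
  long id = (long)arg;
  printf("Hello, world, from thread %ld\n", id);
  pthread_exit(NULL);   /* same effect as returning from the function */
}

int main(int argc, char *argv[]) {
  pthread_t threads[NUM_THREADS];
  long t;

  for (t=0; t<NUM_THREADS; t++) {
    /* default attributes (NULL); pass the loop index as the argument */
    if (pthread_create(&threads[t], NULL, hello, (void *)t) != 0) {
      fprintf(stderr, "pthread_create failed\n");
      exit(1);
    }
  }

  for (t=0; t<NUM_THREADS; t++) {
    /* wait for each thread; NULL means we ignore its return status */
    pthread_join(threads[t], NULL);
  }

  printf("All threads have exited.\n");
  return 0;
}

This would be compiled and linked with -lpthread (plus -mt under the Sun compiler), as noted above.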
A slightly more interesting example:

See: /cs/terescoj/shared/cs2500/examples/proctree_threads

This example builds a “tree” of threads to a depth given on the command line. It includes calls to pthread_self(). This function returns the thread identifier of the calling thread. Try it out and study the code to make sure you understand how it works.

A bit of extra initialization is necessary to make sure the system will allow your threads to make use of all available processors. It may, by default, allow only one thread in your program to be executing at any given time. If your program will create up to n concurrent threads, you should make the call:

pthread_setconcurrency(n+1);

somewhere before your first thread creation. The “+1” is needed to account for the original thread plus the n you plan to create.

You may also want to specify actual attributes as the second argument to pthread_create(). To do this, declare a variable for the attributes:

pthread_attr_t attr;

and initialize it with:

pthread_attr_init(&attr);

and set parameters on the attributes with calls such as:

pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);
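Putting these pieces together, a program in the spirit of proctree_threads might be structured like the sketch below. The actual example file is not shown in this preview, so the recursive structure, the way the depth is passed, and the use of pthread_setconcurrency() here are assumptions for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* each node reports its own thread identifier, then creates two children
   one level deeper until the requested depth is reached */
void *treenode(void *arg) {
  long depth = (long)arg;
  pthread_t left, right;

  /* printing pthread_self() this way assumes it is an integer type,
     as it is on Solaris and Linux */
  printf("thread %lu at depth %ld\n",
         (unsigned long)pthread_self(), depth);

  if (depth > 0) {
    pthread_create(&left, NULL, treenode, (void *)(depth-1));
    pthread_create(&right, NULL, treenode, (void *)(depth-1));
    pthread_join(left, NULL);
    pthread_join(right, NULL);
  }
  return NULL;
}

int main(int argc, char *argv[]) {
  long depth = (argc > 1) ? atol(argv[1]) : 3;

  /* a full binary tree of this depth has 2^(depth+1)-1 nodes in total
     (the root runs in the original thread), so allow that many to run */
  pthread_setconcurrency((1 << (depth+1)) - 1);

  treenode((void *)depth);
  return 0;
}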