Lecture Notes on Message Passing Interface - Parallel Computing (CS 432)
Study notes - Professor: Bangalore; Subject: Computer Science; University: University of Alabama at Birmingham; Term: Fall 2005

MPI Tutorial
Purushotham Bangalore, Ph.D.
Anthony Skjellum, Ph.D.
Department of Computer and Information Sciences
University of Alabama at Birmingham

MPI Tutorial 2 - Overview
• Message Passing Interface - MPI
  – Point-to-point communication
  – Collective communication
  – Communicators
  – Datatypes
  – Topologies
  – Inter-communicators
  – Profiling

MPI Tutorial 3 - Message Passing Interface (MPI)
• A message-passing library specification
  – Message-passing model
  – Not a compiler specification
  – Not a specific product
• For parallel computers, clusters, and heterogeneous networks
• Designed to aid the development of portable parallel software libraries
• Designed to provide access to advanced parallel hardware for
  – End users
  – Library writers
  – Tool developers

MPI Tutorial 4 - Message Passing Interface - MPI
• The MPI-1 standard is widely accepted by vendors and programmers
  – MPI implementations are available on most modern platforms
  – A huge number of MPI applications have been deployed
  – Several tools exist to trace and tune MPI applications
• MPI provides a rich set of functionality to support library writers, tool developers, and application programmers

MPI Tutorial 5 - MPI Salient Features
• Point-to-point communication
• Collective communication on process groups
• Communicators and groups for safe communication
• User-defined datatypes
• Virtual topologies
• Support for profiling

MPI Tutorial 6 - A First MPI Program

(C version)
    #include <stdio.h>
    #include <mpi.h>
    int main( int argc, char **argv )
    {
        MPI_Init( &argc, &argv );
        printf( "Hello World!\n" );
        MPI_Finalize( );
        return 0;
    }

(Fortran version)
      program main
      include 'mpif.h'
      integer ierr
      call MPI_INIT( ierr )
      print *, 'Hello world!'
      call MPI_FINALIZE( ierr )
      end

MPI Tutorial 7 - Starting the MPI Environment
• MPI_INIT ( )
  Initializes the MPI environment. This function must be called, and it must be the first MPI function called in a program (exception: MPI_INITIALIZED).
  Syntax:
    int MPI_Init ( int *argc, char ***argv )
    MPI_INIT ( IERROR )
      INTEGER IERROR

MPI Tutorial 8 - Exiting the MPI Environment
• MPI_FINALIZE ( )
  Cleans up all MPI state. Once this routine has been called, no MPI routine (not even MPI_INIT) may be called.
  Syntax:
    int MPI_Finalize ( );
    MPI_FINALIZE ( IERROR )
      INTEGER IERROR

MPI Tutorial 17 - Sample SGE script

    #!/bin/bash
    #
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash
    #
    #$ -pe mpi 4
    MPI_DIR=/opt/mpipro/bin
    EXE="/home/puri/examples/psum 1000"
    $MPI_DIR/mpirun -np $NSLOTS -machinefile $TMPDIR/machines $EXE

Point-to-Point Communications

MPI Tutorial 19 - Sending and Receiving Messages
• Basic message passing process: Process 0 sends data from buffer A; Process 1 receives it into buffer B
• Questions
  – To whom is data sent?
  – Where is the data?
  – What type of data is sent?
  – How much data is sent?
  – How does the receiver identify it?
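The slides introducing MPI_COMM_RANK and MPI_COMM_SIZE fall in the gap of this preview, but both routines are used in the later examples. As a minimal sketch (my addition, not one of the original slides), they fit into the hello-world skeleton above like this; the program can be compiled with an MPI compiler wrapper such as mpicc and launched through mpirun as in the SGE script above.

    /* Each process reports its rank and the communicator size; ranks are
     * how the "to whom is data sent?" question is answered below. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes   */
        printf("Hello from process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }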
MPI Tutorial 20 - Message Organization in MPI
• A message is divided into data and envelope
• data
  – buffer
  – count
  – datatype
• envelope
  – process identifier (source/destination rank)
  – message tag
  – communicator

MPI Tutorial 21 - Generalizing the Buffer Description
• Specified in MPI by starting address, count, and datatype, where a datatype is one of the following:
  – Elementary (all C and Fortran datatypes)
  – Contiguous array of datatypes
  – Strided blocks of datatypes
  – Indexed array of blocks of datatypes
  – General structure
• Datatypes are constructed recursively
• Specifying an application-oriented layout of data allows maximal use of special hardware
• Elimination of length in favor of count is clearer
  – Traditional: send 20 bytes
  – MPI: send 5 integers

MPI Tutorial 22 - MPI C Datatypes
    MPI datatype           C datatype
    MPI_CHAR               signed char
    MPI_SHORT              signed short int
    MPI_INT                signed int
    MPI_LONG               signed long int
    MPI_UNSIGNED_CHAR      unsigned char
    MPI_UNSIGNED_SHORT     unsigned short int
    MPI_UNSIGNED_LONG      unsigned long int
    MPI_UNSIGNED           unsigned int
    MPI_FLOAT              float
    MPI_DOUBLE             double
    MPI_LONG_DOUBLE        long double
    MPI_BYTE
    MPI_PACKED

MPI Tutorial 23 - MPI Fortran Datatypes
    MPI datatype           Fortran datatype
    MPI_INTEGER            INTEGER
    MPI_REAL               REAL
    MPI_REAL8              REAL*8
    MPI_DOUBLE_PRECISION   DOUBLE PRECISION
    MPI_COMPLEX            COMPLEX
    MPI_LOGICAL            LOGICAL
    MPI_CHARACTER          CHARACTER
    MPI_BYTE
    MPI_PACKED

MPI Tutorial 24 - Process Identifier
• An MPI communicator consists of a group of processes
  – Initially "all" processes are in the group
  – MPI provides group management routines (to create, modify, and delete groups)
• All communication takes place among members of a group of processes, as specified by a communicator
• Naming a process
  – A destination is specified by ( rank, group )
  – Processes are named according to their rank in the group
  – Groups are enclosed in a "communicator"
  – The MPI_ANY_SOURCE wildcard rank is permitted in a receive

MPI Tutorial 25 - Message Tag
• Tags allow programmers to deal with the arrival of messages in an orderly manner
• MPI tags are guaranteed to range from 0 to at least 32767
• The upper bound on the tag value is provided by the attribute MPI_TAG_UB
• MPI_ANY_TAG can be used as a wildcard value

MPI Tutorial 26 - MPI Basic Send/Receive
• Thus the basic (blocking) send has become:
    MPI_Send ( start, count, datatype, dest, tag, comm )
• And the receive has become:
    MPI_Recv ( start, count, datatype, source, tag, comm, status )
• The source, tag, and count of the message actually received can be retrieved from status

MPI Tutorial 27 - Bindings for Send and Receive

    int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest,
                  int tag, MPI_Comm comm )
    MPI_SEND( BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERR )
      <type> BUF( * )
      INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERR

    int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source,
                  int tag, MPI_Comm comm, MPI_Status *status )
    MPI_RECV( BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERR )
      <type> BUF( * )
      INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS( MPI_STATUS_SIZE ), IERR

MPI Tutorial 28 - Getting Information About a Message
• The following functions can be used to get information about a message

    MPI_Status status;
    MPI_Recv( ..., &status );
    tag_of_received_message = status.MPI_TAG;
    src_of_received_message = status.MPI_SOURCE;
    MPI_Get_count( &status, datatype, &count );

• MPI_TAG and MPI_SOURCE are primarily of use when MPI_ANY_TAG and/or MPI_ANY_SOURCE is used in the receive
• The function MPI_GET_COUNT may be used to determine how much data of a particular type was received
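A short sketch (my addition, not one of the original examples) tying together MPI_Send, MPI_Recv, the wildcards, and the status fields described above; it assumes the program is run with at least two processes.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, data[5] = {1, 2, 3, 4, 5};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* send 5 integers to process 1 with tag 99 */
            MPI_Send(data, 5, MPI_INT, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            int recvbuf[5], count;
            /* wildcard source and tag, then query the envelope */
            MPI_Recv(recvbuf, 5, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &count);
            printf("Received %d ints from rank %d with tag %d\n",
                   count, status.MPI_SOURCE, status.MPI_TAG);
        }
        MPI_Finalize();
        return 0;
    }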
MPI Tutorial 37 - Non-Blocking Communication
• Non-blocking operations return (immediately) "request handles" that can be waited on and queried
    MPI_ISEND( start, count, datatype, dest, tag, comm, request )
    MPI_IRECV( start, count, datatype, src, tag, comm, request )
    MPI_WAIT( request, status )
• Non-blocking operations allow overlapping computation and communication
• One can also test without waiting using MPI_TEST
    MPI_TEST( request, flag, status )
• Anywhere you use MPI_Send or MPI_Recv, you can use the pair MPI_Isend/MPI_Wait or MPI_Irecv/MPI_Wait

MPI Tutorial 38 - Non-Blocking Send-Receive
[Timeline diagram, send side and receive side:]
• T0: MPI_Irecv is called; from this point the receive buffer is unavailable to the user. T1: MPI_Irecv returns.
• T2: MPI_Isend is called; the send buffer is unavailable. T3: MPI_Isend returns.
• T4: MPI_Wait is called on the send side; the sender completes at T5, and the send buffer is available again after MPI_Wait.
• T6: MPI_Wait is called on the receive side. T7: the transfer finishes; internal completion is soon followed by the return of MPI_Wait (T8/T9), at which point the receive buffer is filled.
• High-performance implementations offer low overhead for non-blocking calls.

MPI Tutorial 39 - Multiple Completions
• It is often desirable to wait on multiple requests
• An example is a worker/manager program, where the manager waits for one or more workers to send it a message (a sketch of this pattern follows the buffering notes below)
    MPI_WAITALL( count, array_of_requests, array_of_statuses )
    MPI_WAITANY( count, array_of_requests, index, status )
    MPI_WAITSOME( incount, array_of_requests, outcount, array_of_indices, array_of_statuses )
• There are corresponding versions of test for each of these, viz., MPI_Testall, MPI_Testany, MPI_Testsome

MPI Tutorial 40 - Probing the Network for Messages
• MPI_PROBE and MPI_IPROBE allow the user to check for incoming messages without actually receiving them
• MPI_IPROBE returns "flag == TRUE" if there is a matching message available. MPI_PROBE does not return until a matching message is available
    MPI_IPROBE ( source, tag, communicator, flag, status )
    MPI_PROBE ( source, tag, communicator, status )

MPI Tutorial 41 - Message Completion and Buffering
• A send has completed when the user-supplied buffer can be reused
• The send mode used (standard, ready, synchronous, buffered) may provide additional information
• Just because the send completes does not mean that the receive has completed
  – The message may be buffered by the system
  – The message may still be in transit

    *buf = 3;
    MPI_Send ( buf, 1, MPI_INT, ... );
    *buf = 4;  /* OK, receiver will always receive 3 */

    *buf = 3;
    MPI_Isend( buf, 1, MPI_INT, ... );
    *buf = 4;  /* Undefined whether the receiver will get 3 or 4 */
    MPI_Wait ( ... );
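The worker/manager pattern mentioned under Multiple Completions, sketched with MPI_Irecv and MPI_Waitany (my own illustration, not from the original slides; the "work" each worker does is a stand-in value).

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            int nworkers = size - 1, index;
            MPI_Request req[64];          /* assumes at most 64 workers */
            int results[64];
            MPI_Status status;

            /* post one non-blocking receive per worker (ranks 1..size-1) */
            for (int w = 0; w < nworkers; w++)
                MPI_Irecv(&results[w], 1, MPI_INT, w + 1, 0,
                          MPI_COMM_WORLD, &req[w]);

            /* service workers in whatever order they finish */
            for (int done = 0; done < nworkers; done++) {
                MPI_Waitany(nworkers, req, &index, &status);
                printf("Manager: got %d from worker %d\n",
                       results[index], status.MPI_SOURCE);
            }
        } else {
            int work = rank * rank;       /* stand-in for real work */
            MPI_Send(&work, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }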
MPI Tutorial 42 - Example-3, I. (Fortran)

      program main
      include 'mpif.h'
      integer ierr, rank, size, tag, num, next, from
      integer stat1(MPI_STATUS_SIZE), stat2(MPI_STATUS_SIZE)
      integer req1, req2
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
      tag = 201
      next = mod(rank + 1, size)
      from = mod(rank + size - 1, size)
      if (rank .EQ. 0) then
         print *, "Enter the number of times around the ring"
         read *, num
         print *, "Process 0 sends", num, " to 1"
         call MPI_ISEND(num, 1, MPI_INTEGER, next, tag,
     $        MPI_COMM_WORLD, req1, ierr)
         call MPI_WAIT(req1, stat1, ierr)
      endif

MPI Tutorial 43 - Example-3, II. (Fortran)

   10 continue
      call MPI_IRECV(num, 1, MPI_INTEGER, from, tag,
     $     MPI_COMM_WORLD, req2, ierr)
      call MPI_WAIT(req2, stat2, ierr)
      print *, "Process ", rank, " received ", num, " from ", from
      if (rank .EQ. 0) then
         num = num - 1
         print *, "Process 0 decremented num"
      endif
      print *, "Process", rank, " sending", num, " to", next
      call MPI_ISEND(num, 1, MPI_INTEGER, next, tag,
     $     MPI_COMM_WORLD, req1, ierr)
      call MPI_WAIT(req1, stat1, ierr)
      if (num .EQ. 0) then
         print *, "Process", rank, " exiting"
         goto 20
      endif
      goto 10
   20 if (rank .EQ. 0) then
         call MPI_IRECV(num, 1, MPI_INTEGER, from, tag,
     $        MPI_COMM_WORLD, req2, ierr)
         call MPI_WAIT(req2, stat2, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end

MPI Tutorial 44 - Example-3, I. (C)

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int num, rank, size, tag, next, from;
        MPI_Status status1, status2;
        MPI_Request req1, req2;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank( MPI_COMM_WORLD, &rank);
        MPI_Comm_size( MPI_COMM_WORLD, &size);
        tag = 201;
        next = (rank + 1) % size;
        from = (rank + size - 1) % size;
        if (rank == 0) {
            printf("Enter the number of times around the ring: ");
            scanf("%d", &num);
            printf("Process %d sending %d to %d\n", rank, num, next);
            MPI_Isend(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD, &req1);
            MPI_Wait(&req1, &status1);
        }

MPI Tutorial 45 - Example-3, II. (C)

        do {
            MPI_Irecv(&num, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &req2);
            MPI_Wait(&req2, &status2);
            printf("Process %d received %d from process %d\n", rank, num, from);
            if (rank == 0) {
                num--;
                printf("Process 0 decremented number\n");
            }
            printf("Process %d sending %d to %d\n", rank, num, next);
            MPI_Isend(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD, &req1);
            MPI_Wait(&req1, &status1);
        } while (num != 0);

        if (rank == 0) {
            MPI_Irecv(&num, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &req2);
            MPI_Wait(&req2, &status2);
        }
        MPI_Finalize();
        return 0;
    }

MPI Tutorial 46 - Send Modes
• Standard mode ( MPI_Send, MPI_Isend )
  – The standard MPI send; it will not complete until the send buffer can be reused (the message may or may not be buffered by the system)
• Synchronous mode ( MPI_Ssend, MPI_Issend )
  – The send does not complete until after a matching receive has been posted
• Buffered mode ( MPI_Bsend, MPI_Ibsend )
  – User-supplied buffer space is used for system buffering
  – The send will complete as soon as the send buffer is copied to the system buffer
• Ready mode ( MPI_Rsend, MPI_Irsend )
  – The send will send eagerly under the assumption that a matching receive has already been posted (an erroneous program otherwise)

MPI Tutorial 47 - Standard Send-Receive
[Timeline diagram:]
• T0: MPI_Send is called on the send side; T1: MPI_Recv is called on the receive side, and from that point the receive buffer is unavailable to the user
• T2: the sender returns; the send buffer can be reused
• T3: the transfer starts; T4: the transfer completes, and internal completion is soon followed by the return of MPI_Recv with the receive buffer filled

MPI Tutorial 48 - Synchronous Send-Receive
[Timeline diagram:]
• T0: MPI_Ssend is called; T1: MPI_Recv is called, and the receive buffer becomes unavailable to the user
• T2: the transfer starts; T3: the sender returns, and the send buffer can be reused (the receive has started)
• T4: the transfer completes; internal completion is soon followed by the return of MPI_Recv with the receive buffer filled
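A small sketch of buffered mode from the Send Modes slide above (my addition, not from the slides): MPI_Bsend requires the user to attach buffer space with MPI_Buffer_attach first, sized to include MPI_BSEND_OVERHEAD per outstanding message. It assumes at least two processes.

    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, msg = 42, bufsize;
        void *attached;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
            attached = malloc(bufsize);
            MPI_Buffer_attach(attached, bufsize);

            /* completes as soon as the data is copied to the attached buffer */
            MPI_Bsend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

            /* detach blocks until buffered messages have been transmitted */
            MPI_Buffer_detach(&attached, &bufsize);
            free(attached);
        } else if (rank == 1) {
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }
        MPI_Finalize();
        return 0;
    }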
MPI Tutorial 57 - Collective Communications
• Communication is coordinated among a group of processes, as specified by a communicator, not necessarily among all processes
• All collective operations are blocking, and no message tags are used
• All processes in the communicator group must call the collective operation
• Collective and point-to-point messaging are separated by different "contexts"
• Three classes of collective operations
  – Data movement
  – Collective computation
  – Synchronization

MPI Tutorial 58 - MPI Basic Collective Operations
• Two simple collective operations
    MPI_BCAST( start, count, datatype, root, comm )
    MPI_REDUCE( start, result, count, datatype, operation, root, comm )
• The routine MPI_BCAST sends data from one process to all others
• The routine MPI_REDUCE combines data from all processes, using a specified operation, and returns the result to a single process

MPI Tutorial 59 - Broadcast and Reduce
[Diagram:] Bcast (root=0): the root's buffer A is replicated to every process. Reduce (root=0): the values A, B, C, D contributed by ranks 0-3 are combined into X = A op B op C op D in the root's receive buffer.

MPI Tutorial 60 - Scatter and Gather
[Diagram:] Scatter (root=0): the root's send buffer holding A, B, C, D is split so that rank i receives the i-th piece. Gather (root=0): each rank contributes its element, and the root's receive buffer ends up holding A, B, C, D in rank order.

MPI Tutorial 61 - Allreduce and Allgather
[Diagram:] Allgather: every process ends up with the full sequence A, B, C, D. Allreduce: every process ends up with X = A op B op C op D.

MPI Tutorial 62 - Alltoall and Scan
[Diagram:] Alltoall: rank i sends its j-th block to rank j, transposing the data across processes (rank 0 ends with A0, A1, A2, A3; rank 1 with B0, B1, B2, B3; and so on). Scan: rank i receives the prefix result over ranks 0..i (rank 0 gets A, rank 1 gets A op B, ..., rank 3 gets A op B op C op D).

MPI Tutorial 63 - MPI Collective Routines
• Several routines:
    MPI_ALLGATHER       MPI_ALLGATHERV      MPI_BCAST
    MPI_ALLTOALL        MPI_ALLTOALLV       MPI_REDUCE
    MPI_GATHER          MPI_GATHERV         MPI_SCATTER
    MPI_REDUCE_SCATTER  MPI_SCAN            MPI_SCATTERV
    MPI_ALLREDUCE
• The "ALL" versions deliver results to all participating processes
• "V" versions allow the chunks to have different sizes
• MPI_ALLREDUCE, MPI_REDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN take both built-in and user-defined combination functions

MPI Tutorial 64 - Built-In Collective Computation Operations
    MPI Name      Operation
    MPI_MAX       Maximum
    MPI_MIN       Minimum
    MPI_PROD      Product
    MPI_SUM       Sum
    MPI_LAND      Logical and
    MPI_LOR       Logical or
    MPI_LXOR      Logical exclusive or ( xor )
    MPI_BAND      Bitwise and
    MPI_BOR       Bitwise or
    MPI_BXOR      Bitwise xor
    MPI_MAXLOC    Maximum value and location
    MPI_MINLOC    Minimum value and location

MPI Tutorial 65 - User-Defined Collective Computation Operations
    MPI_OP_CREATE( user_function, commute_flag, user_op )
    MPI_OP_FREE( user_op )
• The user_function should look like this:
    user_function ( invec, inoutvec, len, datatype )
• The user_function should perform the following:

  In C:
    for ( i = 0; i < len; i++ )
        inoutvec[i] = invec[i] op inoutvec[i];

  In Fortran:
    do i = 1, len
       inoutvec(i) = invec(i) op inoutvec(i)
    end do

• A concrete C sketch of MPI_OP_CREATE follows the Synchronization slide below

MPI Tutorial 66 - Synchronization
• MPI_BARRIER ( comm )
• The function blocks until all processes in "comm" call it
• Often not needed at all in many message-passing codes
• When needed, mostly for highly asynchronous programs or ones with speculative execution
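A minimal C sketch of a user-defined reduction (my illustration; the slides give only the generic form). It defines a hypothetical sum-of-absolute-values operation, registers it with MPI_OP_CREATE, and uses it with MPI_REDUCE; the callback shown uses the standard C prototype for MPI user functions.

    #include <stdlib.h>
    #include <stdio.h>
    #include <mpi.h>

    /* Combine function: inoutvec[i] = |invec[i]| + |inoutvec[i]| */
    void abs_sum(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
    {
        int *in = (int *)invec, *inout = (int *)inoutvec;
        for (int i = 0; i < *len; i++)
            inout[i] = abs(in[i]) + abs(inout[i]);
    }

    int main(int argc, char **argv)
    {
        int rank, value, result;
        MPI_Op op;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        value = (rank % 2 == 0) ? rank : -rank;   /* data with mixed signs */

        MPI_Op_create(abs_sum, 1 /* commutative */, &op);
        MPI_Reduce(&value, &result, 1, MPI_INT, op, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("Sum of |rank| over all processes = %d\n", result);
        MPI_Op_free(&op);

        MPI_Finalize();
        return 0;
    }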
MPI Tutorial 67 - Example 5, I. (Fortran)

      program main
      include 'mpif.h'
      integer iwidth, iheight, numpixels, i, val, my_count, ierr
      integer rank, comm_size, sum, my_sum
      real rms
      character recvbuf(65536), pixels(65536)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, comm_size, ierr)

      if (rank.eq.0) then
         iheight = 256
         iwidth = 256
         numpixels = iwidth * iheight
C        Read the image (placeholder data)
         do i = 1, numpixels
            pixels(i) = char(mod(i, 256))
         enddo
C        Calculate the number of pixels in each sub image
         my_count = numpixels / comm_size
      endif

MPI Tutorial 68 - Example 5, II. (Fortran)

C     Broadcast my_count to all the processes
      call MPI_BCAST(my_count, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
C     Scatter the image
      call MPI_SCATTER(pixels, my_count, MPI_CHARACTER, recvbuf,
     $     my_count, MPI_CHARACTER, 0, MPI_COMM_WORLD, ierr)
C     Take the sum of the squares of the partial image
      my_sum = 0
      do i = 1, my_count
         my_sum = my_sum + ichar(recvbuf(i))*ichar(recvbuf(i))
      enddo
C     Find the global sum of the squares
      call MPI_REDUCE( my_sum, sum, 1, MPI_INTEGER, MPI_SUM, 0,
     $     MPI_COMM_WORLD, ierr)
C     Rank 0 calculates the root mean square
      if (rank.eq.0) then
         rms = sqrt(real(sum)/real(numpixels))
         print *, 'RMS = ', rms
      endif

MPI Tutorial 77 - Uses of MPI_COMM_WORLD
• Contains all processes available at the time the program was started
• Provides the initial safe communication space
• Simple programs communicate with MPI_COMM_WORLD
• Complex programs duplicate and subdivide copies of MPI_COMM_WORLD
• MPI_COMM_WORLD provides the basic unit of MIMD concurrency and execution lifetime for MPI-2

MPI Tutorial 78 - Uses of MPI_COMM_NULL
• An invalid communicator
• Cannot be used as input to any operation that expects a communicator
• Used as an initial value for communicators yet to be defined
• Returned as a result in certain cases
• The value that communicator handles are set to when freed

MPI Tutorial 79 - Uses of MPI_COMM_SELF
• Contains only the local process
• Not normally used for communication (since it refers only to oneself)
• Holds certain information:
  – holding cached attributes appropriate to the process
  – providing a singleton entry for certain calls (especially MPI-2)

MPI Tutorial 80 - Duplicating a Communicator: MPI_COMM_DUP
• It is a collective operation: all processes in the original communicator must call this function
• Duplicates the communicator group, allocates a new context, and selectively duplicates cached attributes
• The resulting communicator is not an exact duplicate; it is a whole new separate communication universe with similar structure

    int MPI_Comm_dup( MPI_Comm comm, MPI_Comm *newcomm )
    MPI_COMM_DUP( COMM, NEWCOMM, IERR )
      INTEGER COMM, NEWCOMM, IERR

MPI Tutorial 81 - Subdividing a Communicator with MPI_COMM_SPLIT

    int MPI_Comm_split( MPI_Comm comm, int color, int key, MPI_Comm *newcomm )
    MPI_COMM_SPLIT( COMM, COLOR, KEY, NEWCOMM, IERR )
      INTEGER COMM, COLOR, KEY, NEWCOMM, IERR

• MPI_COMM_SPLIT partitions the group associated with the given communicator into disjoint subgroups
• Each subgroup contains all processes having the same value for the argument color
• Within each subgroup, processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old communicator
MPI Tutorial 82 - Subdividing a Communicator: Example 1
• To divide a communicator into two non-overlapping groups:

    color = (rank < size/2) ? 0 : 1;
    MPI_Comm_split(comm, color, 0, &newcomm);

[Diagram:] comm contains ranks 0-7; the two resulting newcomm communicators contain ranks 0-3 and 4-7, each renumbered 0-3.

MPI Tutorial 83 - Subdividing a Communicator: Example 2
• To divide a communicator such that
  – all processes with even ranks are in one group
  – all processes with odd ranks are in the other group
  – the reverse order by rank is maintained

    color = (rank % 2 == 0) ? 0 : 1;
    key = size - rank;
    MPI_Comm_split(comm, color, key, &newcomm);

[Diagram:] comm contains ranks 0-7; one newcomm holds the even ranks in the order 6, 4, 2, 0 and the other holds the odd ranks in the order 7, 5, 3, 1, each renumbered 0-3.

MPI Tutorial 84 - Subdividing a Communicator with MPI_COMM_CREATE
• Creates a new communicator containing all the processes in the specified group, with a new context
• The call is erroneous if all the processes do not provide the same handle
• MPI_COMM_NULL is returned to processes not in the group
• MPI_COMM_CREATE is useful if we already have a group; otherwise a group must be built using the group manipulation routines

    int MPI_Comm_create( MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm )
    MPI_COMM_CREATE( COMM, GROUP, NEWCOMM, IERR )
      INTEGER COMM, GROUP, NEWCOMM, IERR

MPI Tutorial 85 - Group Manipulation Routines
• To obtain an existing group, use
    MPI_COMM_GROUP ( comm, group );
• To free a group, use
    MPI_GROUP_FREE ( group );
• A new group can be created by specifying the members to be included/excluded from an existing group using the following routines
  – MPI_GROUP_INCL: specified members are included
  – MPI_GROUP_EXCL: specified members are excluded
  – MPI_GROUP_RANGE_INCL and MPI_GROUP_RANGE_EXCL: a range of members is included or excluded
  – MPI_GROUP_UNION and MPI_GROUP_INTERSECTION: a new group is created from two existing groups
• Other routines: MPI_GROUP_COMPARE, MPI_GROUP_TRANSLATE_RANKS

MPI Tutorial 86 - Tools for Writing Libraries
• MPI is specifically designed to make it easier to write message-passing libraries
• Communicators solve the tag/source wild-card problem
• Attributes provide a way to attach information to a communicator

MPI Tutorial 87 - Private Communicators
• One of the first things that a library should normally do is create a private communicator
• This allows the library to send and receive messages that are known only to the library

    MPI_Comm_dup( old_comm, &new_comm );

MPI Tutorial 88 - Attributes
• Attributes are data that can be attached to one or more communicators
• Attributes are referenced by keyval; keyvals are created with MPI_KEYVAL_CREATE
• Attributes are attached to a communicator with MPI_ATTR_PUT and their values accessed with MPI_ATTR_GET
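A small C sketch of caching an attribute on a private communicator (my addition). It uses the MPI-1 attribute routines named above with the predefined null copy/delete functions; later MPI versions deprecate these in favor of MPI_Comm_create_keyval and friends, and the cached value here is just an illustrative integer.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm libcomm;
        int keyval, flag, *valptr;
        static int libdata = 12345;   /* attribute value to cache (example) */

        MPI_Init(&argc, &argv);

        /* private communicator for a library, as on the slide above */
        MPI_Comm_dup(MPI_COMM_WORLD, &libcomm);

        /* create a keyval, attach the attribute, and read it back */
        MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, &keyval, NULL);
        MPI_Attr_put(libcomm, keyval, &libdata);

        MPI_Attr_get(libcomm, keyval, &valptr, &flag);
        if (flag)
            printf("Cached attribute value: %d\n", *valptr);

        MPI_Keyval_free(&keyval);
        MPI_Comm_free(&libcomm);
        MPI_Finalize();
        return 0;
    }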
MPI Tutorial 97 - Datatypes in MPI
• Elementary: language-defined types
  – MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, etc.
• Vector: separated by a constant "stride"
  – MPI_TYPE_VECTOR
• Contiguous: vector with a stride of one
  – MPI_TYPE_CONTIGUOUS
• Hvector: vector, with the stride in bytes
  – MPI_TYPE_HVECTOR
• Indexed: array of indices (for scatter/gather)
  – MPI_TYPE_INDEXED
• Hindexed: indexed, with the indices in bytes
  – MPI_TYPE_HINDEXED
• Struct: general mixed types (for C structs etc.)
  – MPI_TYPE_STRUCT

MPI Tutorial 98 - Primitive Datatypes in MPI (C)
  (Same table as slide 22 above: MPI_CHAR/signed char through MPI_LONG_DOUBLE/long double, plus MPI_BYTE and MPI_PACKED.)

MPI Tutorial 99 - Primitive Datatypes in MPI (Fortran)
  (Same table as slide 23 above: MPI_INTEGER/INTEGER through MPI_CHARACTER/CHARACTER, plus MPI_BYTE and MPI_PACKED.)

MPI Tutorial 100 - Example: Building Structures (C)

    struct {
        char display[50];    /* name of display                 */
        int maxiter;         /* max # of iterations             */
        double xmin, ymin;   /* lower left corner of rectangle  */
        double xmax, ymax;   /* upper right corner              */
        int width;           /* of display in pixels            */
        int height;          /* of display in pixels            */
    } cmdline;

    /* set up 4 blocks */
    int i;
    int blockcounts[4] = {50, 1, 4, 2};
    MPI_Datatype types[4] = {MPI_CHAR, MPI_INT, MPI_DOUBLE, MPI_INT};
    MPI_Aint displs[4];
    MPI_Datatype cmdtype;

    /* initialize types and displs with addresses of items */
    MPI_Address(&cmdline.display, &displs[0]);
    MPI_Address(&cmdline.maxiter, &displs[1]);
    MPI_Address(&cmdline.xmin, &displs[2]);
    MPI_Address(&cmdline.width, &displs[3]);
    for (i = 3; i >= 0; i--)
        displs[i] -= displs[0];
    MPI_Type_struct(4, blockcounts, displs, types, &cmdtype);
    MPI_Type_commit(&cmdtype);

MPI Tutorial 101 - Example: Building Structures (Fortran)

      character display(50)
      integer maxiter
      double precision xmin, ymin
      double precision xmax, ymax
      integer width
      integer height
      common /cmdline/ display, maxiter, xmin, ymin, xmax, ymax, width, height

      integer blockcounts(4), types(4), displs(4), cmdtype
      data blockcounts/50, 1, 4, 2/
      data types/MPI_CHARACTER, MPI_INTEGER, MPI_DOUBLE_PRECISION,
     $     MPI_INTEGER/

      call MPI_Address(display, displs(1), ierr)
      call MPI_Address(maxiter, displs(2), ierr)
      call MPI_Address(xmin, displs(3), ierr)
      call MPI_Address(width, displs(4), ierr)
      do i = 4, 1, -1
         displs(i) = displs(i) - displs(1)
      end do
      call MPI_Type_struct(4, blockcounts, displs, types, cmdtype, ierr)
      call MPI_Type_commit(cmdtype, ierr)

MPI Tutorial 102 - Structures
• Structures are described by
  – the number of blocks
  – an array of numbers of elements (array_of_len)
  – an array of displacements or locations (array_of_displs)
  – an array of datatypes (array_of_types)

    MPI_TYPE_STRUCT( count, array_of_len, array_of_displs, array_of_types, newtype );

MPI Tutorial 103 - Example: Building Vectors
• To specify one column of a matrix stored in row order, use
    MPI_TYPE_VECTOR( count, blocklen, stride, oldtype, newtype )
    MPI_TYPE_COMMIT( newtype )
• The exact code for this (in C, for a 7 x 7 array of doubles) is

    MPI_Type_vector(7, 1, 7, MPI_DOUBLE, &newtype);
    MPI_Type_commit(&newtype);

[Figure: a 7 x 7 array numbered 1-49 in row-major order; the vector datatype picks out one column, i.e., every 7th element.]
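A short usage sketch for the column datatype above (my addition): rank 0 sends the j-th column of a 7 x 7 row-major array in a single send, and rank 1 receives the same seven values contiguously.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, i, j = 2;                /* send column j (illustration) */
        double a[7][7], col[7];
        MPI_Datatype coltype;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* one column of a 7 x 7 row-major array: 7 doubles, stride 7 */
        MPI_Type_vector(7, 1, 7, MPI_DOUBLE, &coltype);
        MPI_Type_commit(&coltype);

        if (rank == 0) {
            for (i = 0; i < 49; i++)
                a[i / 7][i % 7] = (double)(i + 1);     /* 1..49 as in the figure */
            MPI_Send(&a[0][j], 1, coltype, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(col, 7, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            for (i = 0; i < 7; i++)
                printf("col[%d] = %g\n", i, col[i]);   /* 3, 10, 17, ... for j = 2 */
        }

        MPI_Type_free(&coltype);
        MPI_Finalize();
        return 0;
    }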
MPI Tutorial 104 - Extents
• The extent of a datatype is (normally) the distance between its first and last member
• We can set an artificial extent by using MPI_UB and MPI_LB in MPI_TYPE_STRUCT
• The routine MPI_TYPE_EXTENT must be used to obtain the extent of a datatype (not sizeof in C), since datatypes are opaque objects
    MPI_TYPE_EXTENT ( datatype, extent )
[Figure: the memory locations specified by a datatype, with the extent spanning from the first location to the last.]

MPI Tutorial 105 - Vectors Revisited
• To create a datatype for an arbitrary number of elements in a column of an array stored in row-major format, use

    int blens[2] = {1, 1};
    MPI_Aint displs[2], sizeofdouble;
    MPI_Datatype types[2] = {MPI_DOUBLE, MPI_UB};
    MPI_Datatype coltype;

    MPI_Type_extent(MPI_DOUBLE, &sizeofdouble);
    displs[0] = 0;
    displs[1] = number_in_columns * sizeofdouble;
    MPI_Type_struct(2, blens, displs, types, &coltype);
    MPI_Type_commit(&coltype);

• To send n elements, we can use

    MPI_Send(buf, n, coltype, ...);

MPI Tutorial 106 - Structures Revisited
• When sending an array of structures, it is important to ensure that MPI and the compiler have the same value for the size of each structure
• The most portable way to do this is to use MPI_UB in the structure definition for the end of the structure. In the previous example, this would be:

    MPI_Datatype types[5] = {MPI_CHAR, MPI_INT, MPI_DOUBLE, MPI_INT, MPI_UB};
    /* blockcounts now needs 5 entries: {50, 1, 4, 2, 1} */

    /* initialize types and displs */
    MPI_Address(&cmdline.display, &displs[0]);
    MPI_Address(&cmdline.maxiter, &displs[1]);
    MPI_Address(&cmdline.xmin, &displs[2]);
    MPI_Address(&cmdline.width, &displs[3]);
    MPI_Address(&cmdline + 1, &displs[4]);   /* one past the end of the struct */
    for (i = 4; i >= 0; i--)
        displs[i] -= displs[0];
    MPI_Type_struct(5, blockcounts, displs, types, &cmdtype);
    MPI_Type_commit(&cmdtype);

MPI Tutorial 107 - Interleaving Data
• We can interleave data by moving the upper-bound marker inside the data
• To distribute a matrix among 4 processes, we can create a block datatype and use MPI_SCATTERV
[Figure: a 6 x 6 matrix, numbered 0-35 down the columns, divided into four 3 x 3 quadrants, one per process.]
• NOTE: Scatterv does the following for all processes (i = 0 to size-1):
    send( buf + displs(i)*extent(sendtype), sendcounts(i), sendtype, ... )

MPI Tutorial 108 - An Interleaved Datatype - C Example
• Define a vector datatype

    MPI_Type_vector(3, 3, 6, MPI_DOUBLE, &vectype);

• Define a block whose extent is just one entry

    int blens[2] = {1, 1};
    MPI_Aint sizeofdouble, indices[2];
    MPI_Datatype types[2];

    MPI_Type_extent(MPI_DOUBLE, &sizeofdouble);
    indices[0] = 0;
    indices[1] = sizeofdouble;
    types[0] = vectype;
    types[1] = MPI_UB;
    MPI_Type_struct(2, blens, indices, types, &block);
    MPI_Type_commit(&block);

    int len[4] = {1, 1, 1, 1};
    int displs[4] = {0, 3, 18, 21};
    MPI_Scatterv(sendbuf, len, displs, block,
                 recvbuf, 9, MPI_DOUBLE, 0, comm);

MPI Tutorial 117 - Cartesian Topologies
A 4 x 3 cartesian grid of 12 processes, rank (row, col):

    0 (0,0)   1 (0,1)   2 (0,2)
    3 (1,0)   4 (1,1)   5 (1,2)
    6 (2,0)   7 (2,1)   8 (2,2)
    9 (3,0)  10 (3,1)  11 (3,2)
MPI Tutorial 118 - Defining a Cartesian Topology
• The routine MPI_CART_CREATE creates a Cartesian decomposition of the processes
    MPI_CART_CREATE( MPI_COMM_WORLD, ndim, dims, periods, reorder, comm2d )
  – ndim: number of cartesian dimensions
  – dims: an array of size ndim specifying the number of processes in each dimension
  – periods: an array of size ndim specifying the periodicity in each dimension
  – reorder: flag allowing the ranks to be reordered for better performance
  – comm2d: new communicator with the cartesian information cached

MPI Tutorial 119 - The Periods Argument
• In the non-periodic case, a neighbor may not exist; this is indicated by a rank of MPI_PROC_NULL
• This rank may be used in send and receive calls in MPI
• In either case, the action is as if the call was not made

MPI Tutorial 120 - Defining a Cartesian Topology

(Fortran)
      ndim = 2
      dims(1) = 4
      dims(2) = 3
      periods(1) = .false.
      periods(2) = .false.
      reorder = .true.
      call MPI_CART_CREATE(MPI_COMM_WORLD, ndim, dims,
     $     periods, reorder, comm2d, ierr)

(C)
    ndim = 2;
    dims[0] = 4;  dims[1] = 3;
    periods[0] = 0;  periods[1] = 0;
    reorder = 1;
    MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, periods, reorder, &comm2d);

MPI Tutorial 121 - Finding Neighbors
• MPI_CART_CREATE creates a new communicator with the same processes as the input communicator, but with the specified topology
• The question "Who are my neighbors?" can be answered with MPI_CART_SHIFT
    MPI_CART_SHIFT( comm, direction, displacement, src_rank, dest_rank )
    MPI_CART_SHIFT( comm2d, 0, 1, nbrtop, nbrbottom )
    MPI_CART_SHIFT( comm2d, 1, 1, nbrleft, nbrright )
• The values returned are the ranks, in the communicator comm2d, of the neighbors shifted by +/- 1 in the two dimensions
• The values returned can be used in an MPI_SENDRECV call as the ranks of source and destination (see the sketch after the Other Topology Routines slide below)

MPI Tutorial 122 - Partitioning a Cartesian Topology
• A cartesian topology can be divided using MPI_CART_SUB on the communicator returned by MPI_CART_CREATE
• MPI_CART_SUB is closely related to MPI_COMM_SPLIT
• To create a communicator with all processes in dimension 1, use

(Fortran)
      remain_dims(1) = .false.
      remain_dims(2) = .true.
      call MPI_Cart_sub(comm2d, remain_dims, comm_row, ierr)

(C)
    remain_dims[0] = 0;
    remain_dims[1] = 1;
    MPI_Cart_sub(comm2d, remain_dims, &comm_row);

MPI Tutorial 123 - Partitioning a Cartesian Topology (continued)
• To create a communicator with all processes in dimension 0, use

(Fortran)
      remain_dims(1) = .true.
      remain_dims(2) = .false.
      call MPI_Cart_sub(comm2d, remain_dims, comm_col, ierr)

(C)
    remain_dims[0] = 1;
    remain_dims[1] = 0;
    MPI_Cart_sub(comm2d, remain_dims, &comm_col);

MPI Tutorial 124 - Cartesian Topologies
[Figure: the 4 x 3 cartesian grid from slide 117 (ranks 0-11 with their (row, col) coordinates), showing comm2d covering the whole grid, each comm_row spanning one row, and each comm_col spanning one column.]

MPI Tutorial 125 - Other Topology Routines
• MPI_CART_COORDS: returns the cartesian coordinates of the calling process, given its rank
• MPI_CART_RANK: translates cartesian coordinates into the process ranks used by the point-to-point routines
• MPI_DIMS_CREATE: returns a good choice for the decomposition of the processors
• MPI_CART_GET: returns the cartesian topology information that was associated with the communicator
• MPI_GRAPH_CREATE: allows the creation of a general graph topology
• Several routines similar to the cartesian topology routines exist for general graph topologies
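A hedged sketch (my addition) of how the MPI_CART_SHIFT results might feed an MPI_SENDRECV, exchanging one double with the top and bottom neighbors. It assumes comm2d was created as on the slides above; with non-periodic boundaries the MPI_PROC_NULL neighbors make the edge transfers no-ops, as described on the Periods slide.

    #include <mpi.h>

    /* Each process exchanges one double with its row neighbors in a 2-D
     * cartesian communicator.  MPI_PROC_NULL neighbors are silently skipped. */
    void exchange_with_row_neighbors(MPI_Comm comm2d, double mine,
                                     double *from_top, double *from_bottom)
    {
        int nbrtop, nbrbottom;
        MPI_Status status;

        MPI_Cart_shift(comm2d, 0, 1, &nbrtop, &nbrbottom);

        /* send down / receive from above */
        MPI_Sendrecv(&mine, 1, MPI_DOUBLE, nbrbottom, 0,
                     from_top, 1, MPI_DOUBLE, nbrtop, 0,
                     comm2d, &status);

        /* send up / receive from below */
        MPI_Sendrecv(&mine, 1, MPI_DOUBLE, nbrtop, 1,
                     from_bottom, 1, MPI_DOUBLE, nbrbottom, 1,
                     comm2d, &status);
    }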
MPI Tutorial 126 - Example 8, I. (Fortran)

      program topology
      include "mpif.h"
      integer NDIMS
      parameter (NDIMS = 2)
      integer dims(NDIMS), local(NDIMS)
      logical periods(NDIMS), reorder, remain_dims(2)
      integer comm2d, row_comm, col_comm, rowsize, colsize
      integer nprow, npcol, myrow, mycol, myrank, numnodes, ierr
      integer left, right, top, bottom, sum_row, sum_col

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myrank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numnodes, ierr )

      dims(1) = 0
      dims(2) = 0
      call MPI_DIMS_CREATE( numnodes, NDIMS, dims, ierr )
      nprow = dims(1)
      npcol = dims(2)
      periods(1) = .TRUE.
      periods(2) = .TRUE.
      reorder = .TRUE.
      call MPI_CART_CREATE( MPI_COMM_WORLD, NDIMS, dims, periods,
     $     reorder, comm2d, ierr )

MPI Tutorial 127 - Example 8, II. (Fortran)

      call MPI_CART_COORDS( comm2d, myrank, NDIMS, local, ierr )
      myrow = local(1)
      mycol = local(2)

      remain_dims(1) = .FALSE.
      remain_dims(2) = .TRUE.
      call MPI_CART_SUB( comm2d, remain_dims, row_comm, ierr )
      remain_dims(1) = .TRUE.
      remain_dims(2) = .FALSE.
      call MPI_CART_SUB( comm2d, remain_dims, col_comm, ierr )

      call MPI_Comm_size(row_comm, rowsize, ierr)
      call MPI_Comm_size(col_comm, colsize, ierr)
      if (myrank.eq.0) print *, 'rowsize = ', rowsize, ' colsize = ', colsize

      call MPI_CART_SHIFT(comm2d, 1, 1, left, right, ierr)
      call MPI_CART_SHIFT(comm2d, 0, 1, top, bottom, ierr)

      print *, 'myrank[', myrank, '] (p,q) = (', myrow, mycol, ' )'
      print *, 'myrank[', myrank, '] left ', left, ' right ', right
      print *, 'myrank[', myrank, '] top ', top, ' bottom ', bottom

      call MPI_Finalize(ierr)
      end

MPI Tutorial 128 - Example 8, I. (C)

    #include <mpi.h>
    #include <stdio.h>

    typedef enum { FALSE, TRUE } BOOLEAN;
    #define N_DIMS 2

    int main(int argc, char **argv)
    {
        MPI_Comm comm_2d, row_comm, col_comm;
        int myrank, size, P, Q, p, q, reorder,
            left, right, bottom, top, rowsize, colsize;
        int dims[N_DIMS],        /* number of processes per dimension   */
            local[N_DIMS],       /* local row and column positions      */
            period[N_DIMS],      /* periodicity flags                   */
            remain_dims[N_DIMS]; /* sub-dimension computation flags     */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Generate a new communicator with a virtual topology */
        dims[0] = dims[1] = 0;
        MPI_Dims_create(size, N_DIMS, dims);
        P = dims[0];
        Q = dims[1];
        reorder = TRUE;
        period[0] = period[1] = TRUE;
        MPI_Cart_create(MPI_COMM_WORLD, N_DIMS, dims, period, reorder, &comm_2d);

MPI Tutorial 137 - Inter-communicator Merge
• MPI_INTERCOMM_MERGE creates an intra-communicator by merging the local and remote groups of an inter-communicator
    MPI_INTERCOMM_MERGE( intercomm, high, newintracomm )
• The process groups are ordered based on the value of high
• All processes in one group should have the same value for high
[Figure: the two groups of an intercomm merged into a single newintracomm.]

Profiling Interface

MPI Tutorial 139 - Profiling Interface
• The objective of the MPI profiling interface is to assist profiling tools in interfacing their code to different MPI implementations
• Profiling tools can obtain performance information without access to the underlying MPI implementation
• All MPI routines have two entry points: MPI_.. and PMPI_..
• Users can use the profiling interface without modification to the source code by linking with a profiling library
• A log file can be generated by replacing -lmpi with -llmpi -lpmpi -lm
[Figure: layering of the calls — the user program calls MPI_Send, the profile library intercepts MPI_Send and calls PMPI_Send, which is implemented by the MPI library.]
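A minimal sketch of the interception idea in the figure above (my addition): a profiling library provides its own MPI_Send, does some bookkeeping, and forwards to the name-shifted PMPI_Send entry point. The counters are illustrative; the prototype shown is the MPI-1 form used throughout these notes (newer MPI versions declare the first argument const).

    #include <mpi.h>

    /* Accumulated statistics gathered by the wrapper below. */
    static double total_send_time = 0.0;
    static int    send_calls = 0;

    /* The profiling library defines MPI_Send itself ...                  */
    /* (MPI-1 prototype; MPI-3 and later use 'const void *buf')           */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = PMPI_Wtime();
        /* ... and forwards the call to the real implementation. */
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        total_send_time += PMPI_Wtime() - t0;
        send_calls++;
        return rc;
    }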
MPI Tutorial 140 - Timing MPI Programs
• MPI_WTIME returns a floating-point number of seconds, representing elapsed wall-clock time since some time in the past
    double MPI_Wtime( void )
    DOUBLE PRECISION MPI_WTIME( )
• MPI_WTICK returns the resolution of MPI_WTIME in seconds, i.e., as a double-precision value, the number of seconds between successive clock ticks
    double MPI_Wtick( void )
    DOUBLE PRECISION MPI_WTICK( )

MPI Tutorial 141 - Output Servers
• Portable, moderate- to high-performance output capability for distributed-memory programs
• Master-slave approach
  – Reserve one "master" processor for I/O
  – It waits to receive messages from the workers
  – Instead of printing, workers send data to the master
  – The master receives the messages and assembles a single, global output file
• MPI-2 I/O functions

Case Study

MPI Tutorial 143 - 2-D Laplace Solver
• Mathematical formulation
• Numerical method
• Implementation
  – Topologies for structured meshes
  – MPI datatypes
  – Optimizing point-to-point communication

MPI Tutorial 144 - Mathematical Formulation
• The Poisson equation can be written as

    \nabla^2 u = f(x, y)    in the interior     (1)
    u(x, y) = g(x, y)       on the boundary     (2)

• Using a 5-point finite-difference Laplace stencil, eq. (1) can be discretized as

    ( u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4 u_{i,j} ) / h^2 = f_{i,j},
    where h = \Delta x = \Delta y

MPI Tutorial 145 - Numerical Method
• The Poisson equation can be solved using a Jacobi iteration

    u^{k+1}_{i,j} = (1/4) ( u^k_{i+1,j} + u^k_{i-1,j} + u^k_{i,j+1} + u^k_{i,j-1} - h^2 f_{i,j} )

• A simple algorithm to solve the Poisson equation:
    Initialize the right-hand side
    Set up an initial solution guess
    do
        for all the grid points
            compute u^{k+1}_{i,j}
        compute the norm
    until convergence
    print out the solution

MPI Tutorial 146 - Implementation Details
[Figure: the 5-point stencil — updating point (i,j) uses its four neighbors (i-1,j), (i+1,j), (i,j-1), and (i,j+1).]

MPI Tutorial 147 - Parallel Implementation
[Figure: the grid partitioned among processes.]

MPI Tutorial 148 - Parallel Algorithm
– Initialize the right-hand side
– Set up an initial solution guess
– do
  – for all the grid points
    – compute u^{k+1}_{i,j}
  – exchange data across process boundaries
  – compute the norm
– until convergence
– print out the solution
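To make the parallel algorithm above concrete, here is a hedged C sketch of one Jacobi iteration step (my illustration; the original notes do not include this code). It assumes a 1-D row-block decomposition with one ghost row on each side and a fixed interior width NX: ghost rows are exchanged with the up/down neighbors, the update is applied, and the global norm is formed with MPI_Allreduce. A driver would call it in a loop until the returned norm drops below a tolerance, matching the "until convergence" step.

    #include <math.h>
    #include <mpi.h>

    #define NX 64        /* global interior width (assumed for illustration) */

    /* u, unew, f: (nlocal + 2) x (NX + 2) arrays including ghost rows/columns.
     * nbrup/nbrdown may be MPI_PROC_NULL at the physical boundary. */
    double jacobi_step(double u[][NX + 2], double unew[][NX + 2],
                       double f[][NX + 2], double h, int nlocal,
                       int nbrup, int nbrdown, MPI_Comm comm)
    {
        int i, j;
        double local_norm = 0.0, global_norm;
        MPI_Status status;

        /* exchange ghost rows: send last real row down, receive ghost row 0
         * from above, and vice versa */
        MPI_Sendrecv(&u[nlocal][0], NX + 2, MPI_DOUBLE, nbrdown, 0,
                     &u[0][0],      NX + 2, MPI_DOUBLE, nbrup,   0,
                     comm, &status);
        MPI_Sendrecv(&u[1][0],          NX + 2, MPI_DOUBLE, nbrup,   1,
                     &u[nlocal + 1][0], NX + 2, MPI_DOUBLE, nbrdown, 1,
                     comm, &status);

        /* Jacobi update on the local interior points */
        for (i = 1; i <= nlocal; i++)
            for (j = 1; j <= NX; j++) {
                unew[i][j] = 0.25 * (u[i + 1][j] + u[i - 1][j] +
                                     u[i][j + 1] + u[i][j - 1] -
                                     h * h * f[i][j]);
                local_norm += (unew[i][j] - u[i][j]) * (unew[i][j] - u[i][j]);
            }

        /* global convergence measure over all processes */
        MPI_Allreduce(&local_norm, &global_norm, 1, MPI_DOUBLE, MPI_SUM, comm);
        return sqrt(global_norm);
    }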