Parallel Computing Wrapup

Topics for Today

• Dissemination barrier
• Fourier transform
• Review
• Some things we didn't cover
• Parallel programming top ten list

Dissemination Pattern

• Strength
 —avoids reduction tree + broadcast tree
  – reduction + broadcast tree: critical path length = 2 log2 P
 —faster by a factor of two in practice for that reason
• Also useful for idempotent reductions
 —barrier (a coarray barrier sketch follows the FFT examples below)
 —min, max
 —or, and

1D Discrete Fourier Transform

• Discrete Fourier transform on complex numbers (the defining formula is restated after the code examples below)
• Widely employed in signal processing, solving PDEs, multiplying large integers, ...
• Fast Fourier Transform (FFT): O(N log N) implementation

[Figure: radix-8 FFT butterfly network; image not included in the extraction.]
Figure credit: http://upload.wikimedia.org/wikipedia/commons/5/54/FFT_Butterfly_radix8.svg
Figure license: http://commons.wikimedia.org/wiki/Commons:GNU_Free_Documentation_License

Productive Parallel 1D FFT (n = 2^k)

Two stages:
 —bit-reversal permutation
 —butterfly pattern of element combinations

subroutine fft(c, n)
  implicit complex(c)
  dimension c(0:n-1), irev(0:n-1)
!HPF$ processors p(number_of_processors())
!HPF$ template t(0:n-1)
!HPF$ align c(i) with t(i)
!HPF$ align irev(i) with t(i)
!HPF$ distribute t(block) onto p
  two_pi = 2.0d0 * acos(-1.0d0)
  levels = number_of_bits(n) - 1
  irev = (/ (bitreverse(i, levels), i = 0, n-1) /)
  forall (i = 0:n-1) c(i) = c(irev(i))
  do l = 1, levels                  ! --- for each level in the FFT
    m  = ishft(1, l)
    m2 = ishft(1, l - 1)
    do k = 0, n - 1, m              ! --- for each butterfly in a level
      do j = k, k + m2 - 1          ! --- for each point in a half bfly
        ce = exp(cmplx(0.0, (j - k) * -two_pi / real(m)))
        cr = ce * c(j + m2)
        cl = c(j)
        c(j)      = cl + cr
        c(j + m2) = cl - cr
      end do
    end do
  end do
end subroutine fft

SPMD Implementation of FFT in CAF 2.0

• Radix-2 FFT implementation
• Block distribution of coarray "c" across all processors
• Sketch in CAF 2.0:

complex, allocatable :: c(:,2)[*], spare(:)[*]
...
! permute data to bit-reversed indices (uses team_alltoall)
call bitreverse(c, n_world_size, world_size, spare)
! local FFT computation for levels that fit in the memory of an image
do l = 1, loc_comm - 1
  ...
! transpose from block to cyclic data distribution (uses team_alltoall)
call transpose(c, n_world_size, world_size, spare)
! local FFT computation for remaining levels
do l = loc_comm, levels
  ...
! transpose back from cyclic to block data distribution (uses team_alltoall)
call transpose(c, n_world_size, n_local_size/world_size, spare)

SPMD Element-wise Bitreverse in CAF

Each processor performs the reversal for one block of indices, [ rank * n/npe ... (rank + 1) * n/npe - 1 ]. The top log P bits of each reversed index select the destination processor; the remaining log n - log P bits select the position within that processor's block.

local_n = n / npe
global_j = rank * local_n
do local_j = 0, local_n - 1
  jrev = bitreverse(global_j, n_bits)
  processor = ishft(jrev, -local_n_bits)  ! top log P bits
  index = iand(jrev, local_mask)          ! low log n - log P bits
  crev(index)[processor] = c(local_j)
  global_j = global_j + 1
enddo
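The HPF code and the element-wise SPMD loop above both call a two-argument bitreverse function that the slides never define. Below is a minimal sketch of one plausible definition; the name and signature match the calls above, but the actual course library may differ.

pure integer function bitreverse(j, nbits)
  ! Reverse the low 'nbits' bits of j using standard Fortran bit
  ! intrinsics, e.g. bitreverse(3, 4) = 12 (binary 0011 -> 1100).
  integer, intent(in) :: j, nbits
  integer :: b
  bitreverse = 0
  do b = 0, nbits - 1
    if (btest(j, b)) bitreverse = ibset(bitreverse, nbits - 1 - b)
  end do
end function bitreverse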
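For reference, the transform all of these kernels compute is the standard 1D DFT of an N-point complex sequence x_0, ..., x_{N-1}:

X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i \, n k / N}, \qquad k = 0, \ldots, N-1

Evaluated directly, this costs O(N^2) operations; the radix-2 butterfly structure in the code above reuses shared partial sums to reach O(N log N).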
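Returning to the dissemination pattern slide: below is a minimal dissemination-barrier sketch in standard Fortran 2008 coarray syntax (not the CAF 2.0 dialect used in class). In round k, image i synchronizes with images i +/- 2^k (mod P); after ceil(log2 P) rounds every image has transitively heard from every other, so all have reached the barrier. Note that sync images is pairwise (two-way), whereas a classic dissemination barrier uses one-way notify/wait flags; the two-way variant is slightly stronger but keeps the O(log P) critical path.

program dissemination_barrier
  implicit none
  integer :: me, np, dist, up, down

  me = this_image()
  np = num_images()

  dist = 1
  do while (dist < np)
    up   = 1 + mod(me - 1 + dist, np)       ! partner I notify
    down = 1 + mod(me - 1 - dist + np, np)  ! partner I await
    if (up == down) then
      sync images (up)            ! np even and dist = np/2: single partner
    else
      sync images ([up, down])
    end if
    dist = 2 * dist
  end do
  ! every image reaching this point knows all others have arrived
end program dissemination_barrier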
All-to-All Personalized Communication

[Figure: diagram of an all-to-all personalized exchange among processors; the image content did not survive extraction.]
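The operation the lost figure depicts is the one team_alltoall performs in the CAF 2.0 sketches above: each of P processors sends a distinct block of data to every other processor. For illustration only, here is a sketch of the same pattern using MPI from Fortran; the buffer contents and the per-destination count nper are made up for the example.

program alltoall_demo
  use mpi
  implicit none
  integer, parameter :: nper = 4   ! elements sent to each rank (illustrative)
  integer :: rank, nprocs, ierr, i
  integer, allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate (sendbuf(nper*nprocs), recvbuf(nper*nprocs))
  ! block d of sendbuf holds what this rank sends to rank d; after the
  ! call, block s of recvbuf holds what rank s sent to this rank
  sendbuf = [(rank*1000 + i, i = 1, nper*nprocs)]

  call MPI_Alltoall(sendbuf, nper, MPI_INTEGER, &
                    recvbuf, nper, MPI_INTEGER, MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program alltoall_demo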
Course Objectives

• Learn fundamentals of parallel computing
 —principles of parallel algorithm design
 —programming models and methods
 —parallel computer architectures
 —modeling and analysis of parallel programs and systems
 —parallel algorithms
• Develop skill in writing parallel programs
 —programming assignments employing a variety of models
• Develop skill in analyzing parallel computing problems
 —develop parallelizations for different styles of computations

Principles of Parallel Algorithm Design

• Algorithm models
 —data-parallel, task graph, work pool
 —master-slave, pipeline, hybrid
• Decomposition techniques
 —recursive decomposition
 —data decomposition: input data, output data, intermediate data
 —hybrid decomposition
 —exploratory decomposition
 —speculative decomposition
• Task generation
 —static vs. dynamic

[Figure: example task graph and candidate task-to-process assignments; details not recoverable from the extraction.]

Parallel Architectures

• Control structure and communication models
 —control structure: SIMD, MIMD
 —communication models
  – shared address space
  – message passing platforms
• Network topologies
 —static/direct vs. dynamic/indirect networks
 —bus, crossbar, omega, hypercube, fat tree, mesh, Kautz graph
 —hybrid interconnects
 —evaluation metrics
  – degree, diameter, bisection width, channel width & rate, cost
• Coherence, routing, and network embeddings
 —blocking vs. non-blocking networks
 —routing techniques: store & forward, packet, wormhole
 —cache coherence: protocols, snoopy caches, directories, SCI
 —embeddings: dilation, congestion

Analytical Modeling of Parallel Systems

• Overheads
 —idling: serialization, load imbalance
 —data movement, resource sharing, extra computation
• Metrics
 —time, total overhead, speedup, efficiency
• Scalability
 —cost optimality
 —isoefficiency
  – how the problem size must grow as a function of the number of PEs to maintain efficiency
 —Amdahl's law and limits on speedup (restated after this review)
• Asymptotic analysis
 —algorithm complexity
 —scalability

Synchronization

• Insufficient synchronization causes data races
 —unordered, conflicting operations
• Mutual exclusion: classical algorithms for locks
 —explore formal reasoning about concurrent operations
• Lock synchronization with atomic primitives
 —practical algorithms for pairwise coordination
• Barrier synchronization
 —separate phases to prevent overlap of conflicting operations
 —strategies for fast, primitive collective synchronization

Reasoning about Concurrent Operations

• Serializability
 —as if operations were executed in some sequential order
• Sequential consistency
 —as if all operations were executed in some sequential order
 —operations of each processor appear in program order
• Linearizability
 —each operation appears to take effect instantaneously
 —interleaving consistent with operation execution in time

Active Areas of Research

• Architectures
 —the quest for exascale performance: 10^18 operations per second
 —multi-core processors
  – number of cores and threads is rapidly increasing
  – hardware support for simplifying parallelism
 —heterogeneity is already here; more is coming
• Compilers for parallel systems
• Operating systems for scalable architectures
 —lightweight microkernels: ZeptoOS, Compute Node Linux
• Multiscale algorithms: regular and irregular variants
• Transactional memory
 —first transactional hardware delivered on Blue Gene/Q
 —coming in 2013 on Intel Haswell
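Before the tips, the one formula from the modeling review worth restating (flagged above) is Amdahl's law. If a fraction f of the work is inherently serial, the speedup on P processors is bounded by

S(P) = \frac{1}{f + (1-f)/P} \;\le\; \frac{1}{f}

For example, f = 0.05 caps speedup at 20 no matter how many processors are used, which is why avoiding serialization appears among the tips below.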
Top Ten Tips for Parallel Computing

It's all about the performance:

• Use an efficient algorithm
 —a clever implementation will still yield to asymptotic inefficiency at scale
• Partition your data and computation carefully
 —the wrong data partitioning can yield high communication volume
 —the wrong computation partitioning can lead to load imbalance
  – work stealing can help
• Choose your programming model judiciously
 —shared-memory models make irregular problems easier
• Avoid serialization
 —efficiency requires all processors to be computing
 —may require changes to algorithm and partitioning of data & computation
• Choose the proper grain size for computation
 —the wrong grain size can lead to excessive communication frequency