Fast Fourier Transforms and Graphical Processing Units - Outline | CMSC 828, Study notes of Computer Science

Material Type: Notes; Subject: Computer Science; University: University of Maryland; Term: Unknown 1989;

Uploaded on 07/30/2009

Partial preview of the text

Fast Fourier Transforms (FFTs) and Graphical Processing Units (GPUs)
Kate Despain, CMSC 828e

Outline
• Motivation
• Introduction to FFTs
  • Discrete Fourier Transforms (DFTs)
  • Cooley-Tukey Algorithm
• CUFFT Library
• High Performance DFTs on GPUs by Microsoft Corporation
  • Coalescing
  • Use of Shared Memory
  • Calculation-rich Kernels

Introduction: What is an FFT?
• An algorithm to compute the Discrete Fourier Transform (DFT)
• A straightforward implementation requires O(N^2) MADD (multiply-add) operations:

  X_k = Σ_{n=0}^{N−1} x_n exp(−2πi kn / N)

Introduction: Cooley-Tukey
• FFTs are a family of efficient algorithms that require only O(N log N) MADD operations
• Most FFTs are based on the Cooley-Tukey algorithm (originally discovered by Gauss and rediscovered several times by a host of other people)

Consider N as a composite, N = r_1 r_2, and let k = k_1 r_1 + k_0 and n = n_1 r_2 + n_0. Then

  X(k_1, k_0) = Σ_{n_0=0}^{r_2−1} Σ_{n_1=0}^{r_1−1} x(n_1, n_0) exp(−2πi k (n_1 r_2 + n_0) / N)

The inner sum over n_1 depends only on k_0, costing r_1 operations per output value; the outer sum over n_0 costs r_2 operations per output value, for a total of r_1 + r_2 operations per output element. Dividing the transform into equal factors of radix r ultimately gives r N log_r N operations.

Cartoon Math for FFT - I
For each element of the output vector F(k), we multiply each element of the input vector f(n) by the corresponding exponential term e^{−2πi nk/N}, where n is the index of the input element and k is the index of the output element.

Cartoon Math for FFT - IV
If we factor appropriately, we can eliminate one multiplication per input element! (This translates to a net savings of N/2 MADD operations.) But wait, there's more...
Cartoon Math for FFT - V
This time we could take the output vector and divide it into two, according to whether k is even or odd.

Cartoon Math for FFT - VI
For any given k we now have something that looks like our original Fourier transform, and we can repeat this procedure recursively. Each output element requires ~log_2 N operations, and since there are N output elements, we get O(N log_2 N) operations, as promised.

CUFFT - FFT for CUDA
• A library for performing FFTs on the GPU
• Can handle:
  • 1D, 2D, or 3D data
  • Complex-to-complex, complex-to-real, and real-to-complex transforms
  • Batch execution in 1D
  • In-place or out-of-place transforms
  • Up to 8 million elements in 1D
  • Between 2 and 16384 elements in any direction for 2D and 3D

1D Complex-to-Complex
Example for the batched, in-place case:

  #include <cufft.h>
  #define NX 256
  #define BATCH 10

  cufftHandle plan;
  cufftComplex *data;
  cudaMalloc((void **)&data, sizeof(cufftComplex) * NX * BATCH);

  /* Create a 1D FFT plan. */
  cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);

  /* Use the CUFFT plan to transform the signal in place. */
  cufftExecC2C(plan, data, data, CUFFT_FORWARD);

  /* Destroy the CUFFT plan. */
  cufftDestroy(plan);
  cudaFree(data);

CUDA CUFFT Library, v. 2.1 (2008). Santa Clara, CA: NVIDIA Corporation.

2D Complex-to-Real
Example for the out-of-place case:

  #include <cufft.h>
  #define NX 256
  #define NY 128

  cufftHandle plan;
  cufftComplex *idata;
  cufftReal *odata;
  cudaMalloc((void **)&idata, sizeof(cufftComplex) * NX * NY);
  cudaMalloc((void **)&odata, sizeof(cufftReal) * NX * NY);

  /* Create a 2D FFT plan. */
  cufftPlan2d(&plan, NX, NY, CUFFT_C2R);

  /* Use the CUFFT plan to transform the signal out of place. */
  cufftExecC2R(plan, idata, odata);

  /* Destroy the CUFFT plan. */
  cufftDestroy(plan);
  cudaFree(idata);
  cudaFree(odata);

CUDA CUFFT Library, v. 2.1 (2008). Santa Clara, CA: NVIDIA Corporation.

CUFFT Performance
CUFFT seems to be a sort of "first pass" implementation.
It doesn't appear to fully exploit the strengths of mature FFT algorithms or the hardware of the GPU. For example:

"Many FFT algorithms for real data exploit the conjugate symmetry property to reduce computation and memory cost by roughly half. However, CUFFT does not implement any specialized algorithms for real data, and so there is no direct performance benefit to using real-to-complex (or complex-to-real) plans instead of complex-to-complex." - CUDA CUFFT Library, v. 2.1 (2008). Santa Clara, CA: NVIDIA Corporation.

Latest Developments
"High Performance Discrete Fourier Transforms on Graphics Processors" - Govindaraju, NK, et al. Presented at SC08.
• Uses the Stockham algorithm (requires an out-of-place transform)
• Exploits coalescing with global memory
• Exploits fast access to shared memory
• Uses calculation-rich kernels

Coalescing I
• Refers to global memory access
• Memory is transferred in "segments"
  • For compute capability 1.0 or 1.1 (e.g. the 9800 series and below), segment size = 64 or 128 bytes
  • For compute capability 1.2 and higher, segment size = 32, 64, or 128 bytes
• To achieve coalescing:
  • A half-warp should utilize all bytes in a given memory transfer (e.g. each thread accesses a 16-bit word)
  • Adjacent threads should access adjacent memory
NVIDIA CUDA Programming Guide, v. 2.1 (2008).

Challenges of Shared Memory
• Limited to roughly 16 kB per multiprocessor (or block)
• Organized into 16 banks; 32-bit words are distributed in round-robin fashion
• Bank conflicts arise if two threads from the same half-warp try to access the same bank at the same time (anything bigger than a float is going to have problems)
• Bank conflicts are handled through serialization
• Some overhead resides there:
  • Function arguments
  • Execution configurations
NVIDIA CUDA Programming Guide, v. 2.1 (2008).

Bank Conflicts
(Figure from the NVIDIA CUDA Programming Guide, v. 2.1 (2008), p. 69.)

Advantages of Shared Memory
• Access can take as little as two clock cycles!
• Atomic operations are allowed
• Employs a broadcast mechanism
NVIDIA CUDA Programming Guide, v. 2.1 (2008).

Additional Tricks
• Hierarchical FFTs - break a large FFT into small FFTs that fit into shared memory
• Mixed-radix FFTs - handle non-power-of-two sizes
• Multi-dimensional FFTs - handled similarly to hierarchical FFTs
• Real FFTs - exploit symmetry

Batched 1D Power-of-Two Data
"For large N ... our FFTs are up to 4 times faster than CUFFT and 19 times faster than MKL."
"High Performance Discrete Fourier Transforms on Graphics Processors" - Govindaraju, NK, et al. SC08.

2D Power-of-Two Data
Top: single 2D FFTs of size N×N; Middle: batched 2D FFTs; Bottom: 2D FFTs of fixed size 2^24.
"High Performance Discrete Fourier Transforms on Graphics Processors" - Govindaraju, NK, et al. SC08.