基于CUDA的FFT并行計算研究

發(fā)布時間：2018-07-22 12:24

【摘要】：離散傅立葉變換是數(shù)字信號處理系統(tǒng)中常用的重要數(shù)學變換，算法的可行性、復雜度和運行效率等都是影響計算結果的重要因素。近年來，GPU正在以大大超過摩爾定律的速度高速發(fā)展，主流GPU的單精度浮點處理能力和外部存儲器帶寬相對于同時期的CPU都有明顯的優(yōu)勢，基于圖形硬件GPU的通用計算正成為并行領域的研究熱點。特別是NVIDIA公司于2007年推出的CUDA統(tǒng)一計算設備架構，在編程、優(yōu)化等方面都得到了顯著的提升，極大地增強了GPU的通用計算能力。CUDA不需要借助于圖形API，采用類C語言進行開發(fā)，使開發(fā)人員比較容易的從CPU編程模式過渡到GPU編程模式。隨著GPU可編程能力、并行處理能力以及應用范圍的不斷提升和擴展，GPU已發(fā)展成為一種高度并行化、多線程、多核的處理器。利用GPU的并行處理能力，以CPU+GPU混合加速為特征的異構并行計算系統(tǒng)將會成為未來高性能計算的主流。本文首先分析了CUDA硬件架構和編程模型，在分析GPU通用計算現(xiàn)狀的基礎上，提出CUDA程序設計的方法。然后深入探討了快速傅立葉變換的基本原理，詳細介紹了時域抽取基2-FFT算法的實現(xiàn)過程及相關性質，根據(jù)快速傅立葉算法高度并行分治的特征，結合CUDA編程模型及實現(xiàn)機制，用CUDA的類C語言設計了快速傅立葉變換的并行算法。改進算法采用CPU+GPU異構模型方式，將GPU引入到計算中來，讓GPU承擔程序中的大規(guī)模計算——復數(shù)的加法與算數(shù)的乘法。傳統(tǒng)串行算法實現(xiàn)N點序列的快速傅立葉變換需要三層循環(huán)，時間復雜度為O (Nlog2N)。改進后的算法采用線程層次組織結構，將同一級中相互獨立的N/2個蝶形運算實現(xiàn)并行操作，原有的三層循環(huán)可以用兩層循環(huán)來完成，，時間復雜度變?yōu)镺 (N)，從而實現(xiàn)對快速傅立葉變換的加速與優(yōu)化。文章最后搭建CUDA實驗運行環(huán)境，實現(xiàn)傳統(tǒng)快速傅立葉算法在CPU上的運行，以及改進后的算法在GPU上的運行，同時還調用了FFTW函數(shù)庫的程序代碼和CUFFT函數(shù)庫的程序代碼，并將以上結果進行比較，通過對實驗數(shù)據(jù)的分析證明了運用CUDA架構實現(xiàn)快速傅立葉算法的優(yōu)越性，也驗證了GPU在處理大量數(shù)據(jù)計算時所占的優(yōu)勢。
[Abstract]:Discrete Fourier transform (DFT) is an important mathematical transformation commonly used in digital signal processing system. The feasibility, complexity and efficiency of the algorithm are the important factors that affect the calculation results. In recent years, GPU is developing at a speed that greatly exceeds Moore's Law. The single precision floating-point processing capability and external memory bandwidth of mainstream GPUs are obviously superior to those of CPU in the same period. General computing based on graphics hardware GPU is becoming a research hotspot in parallel field. In particular, NVIDIA introduced the CUDA unified computing equipment architecture in 2007, which has been greatly improved in programming, optimization and so on, greatly enhanced the GPU's general computing capability .CUDA does not need to use graphics API, using C language to develop. Make it easier for developers to transition from CPU programming mode to GPU programming mode. With the development of GPU's programmable ability, parallel processing ability and application scope, GPU has developed into a highly parallel, multithreaded, multi-core processor. Using the parallel processing capability of GPU, heterogeneous parallel computing systems characterized by CPU-GPU hybrid acceleration will become the mainstream of high-performance computing in the future. In this paper, the hardware architecture and programming model of CUDA are analyzed. Based on the analysis of the current situation of GPU general computing, the method of CUDA programming is put forward. Then the basic principle of Fast Fourier transform (FFT) is discussed in depth, and the implementation process and related properties of 2-FFT algorithm in time domain are introduced in detail. According to the characteristics of high parallel division and control of FFT algorithm, combined with CUDA programming model and implementation mechanism, The parallel algorithm of fast Fourier transform is designed with C-like language of CUDA. The improved algorithm adopts the CPU GPU heterogeneous model, and introduces GPU into the calculation, which allows GPU to undertake the addition and multiplication of the complex number and the large scale computation in the program. The traditional serial algorithm for the fast Fourier transform of N-point sequence needs three-layer cycle, and the time complexity is O (Nlog2N). The improved algorithm uses thread-level organization structure to realize parallel operation of independent N / 2 butterfly operations in the same level, and the original three-layer loop can be completed by two-layer loop. The time complexity becomes O (N), which can accelerate and optimize the fast Fourier transform (FFT). Finally, the paper builds up the CUDA experimental running environment, realizes the traditional fast Fourier algorithm running on CPU, and the improved algorithm running on GPU. At the same time, the program code of FFTW function library and CUFFT function library are also called. Through the analysis of experimental data, the superiority of fast Fourier algorithm using CUDA architecture is proved, and the advantage of GPU in dealing with a large amount of data is verified.
【學位授予單位】：湖南大學
【學位級別】：碩士
【學位授予年份】：2012
【分類號】：TP338.6

【參考文獻】

相關期刊論文前8條

1 李偉;孫進平;王俊;李少洪;;一種基于FPGA的超高速32k點FFT處理器[J];北京航空航天大學學報;2007年12期

2 韓博;周秉鋒;;GPGPU性能模型及應用實例分析[J];計算機輔助設計與圖形學學報;2009年09期

3 趙麗麗;張盛兵;張萌;姚濤;;基于CUDA的高速FFT計算[J];計算機應用研究;2011年04期

4 王芳;張學鋒;程增會;;快速傅立葉變換中的一種倒位序生成法[J];計算機應用與軟件;2011年02期

5 林一松;楊學軍;唐滔;王桂彬;徐新海;;一種基于并行度分析模型的GPU功耗優(yōu)化技術[J];計算機學報;2011年04期

6 朱林;王志凌;黃天戍;;基于DSP并行系統(tǒng)的FFT算法實現(xiàn)[J];武漢理工大學學報;2009年20期

7 董惠;衛(wèi)銘斐;江麗;曾俊;;基于FPGA的FFT處理器的設計與仿真[J];微電子學與計算機;2008年11期

8 王潤澤;王穎;楊棟毅;;大規(guī)模FFT并行計算中二維SRAM的設計[J];中國科學院研究生院學報;2008年01期

本文編號：2137456

資料下載