计算机科学 ›› 2020, Vol. 47 ›› Issue (1): 31-39.doi: 10.11896/jsjkx.190900179

• 计算机体系结构 • 上一篇    下一篇

一种偶数基Cooley-Tukey FFT高性能实现方法

龚彤艳1,2,张广婷2,贾海鹏2,袁良2   

  1. (贵州财经大学信息学院 贵阳 550025)1;
    (中国科学院计算技术研究所计算机体系结构国家重点实验室 北京100190)2
  • 收稿日期:2019-08-21 发布日期:2020-01-19
  • 通讯作者: 张广婷([email protected])
  • 基金资助:
    国家重点研发计划(2018YFC0809306);国家自然科学基金青年科学基金(61602443);国家自然科学基金重点项目(61432018)

High-performance Implementation Method for Even Basis of Cooley-Tukey FFT

GONG Tong-yan1,2,ZHANG Guang-ting2,JIA Hai-peng2,YUAN Liang2   

  1. (School of Information,Guizhou University of Finance and Economics,Guiyang 550025,China)1;
    (State Key Laboratory of Computer Architecture,Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China)2
  • Received:2019-08-21 Published:2020-01-19
  • About author:GONG Tong-yan,born in 1992,postgraduate,not member of China Computer Federation (CCF).Her main research interests include parallel processing and high-performance computing;ZHANG Guang-ting,born in 1987,postgraduate,is member of China Computer Federation (CCF).Her main research interests include parallel algorithms and big data.
  • Supported by:
    This work was supported by the National Key Research and Development Program of China (2018YFC0809306),Young Scientists Fund of the National Natural Science Foundation of China (61602442) and Key Program of the National Natural Science Foundation of China (61432018).

摘要: 快速傅里叶变换(Fast Fourier Transform,FFT)是最重要的基础算法之一,在科学计算、信号处理、图像处理等领域都有着广泛的应用。随着这些应用领域对实时性需求的进一步提高,FFT算法面临着越来越高的性能要求。在现有的FFT算法库中,FFT算法的求解速度和计算精度受到一定程度的限制,而且也少有研究者对偶数基Cooley-Tukey FFT的高性能实现提出相应的优化策略并对技术进行深入研究。基于此,文中提出了一套针对偶数基的Cooley-Tukey FFT的优化策略和方法。首先构建一个SIMD(Single Instruction Multiple Data)友好、支持混合基的蝶形网络,然后根据偶数基旋转因子特性最大限度地降低蝶形计算的复杂度,接着通过SIMD汇编优化、汇编指令重排及选择、寄存器分配策略制定、高性能矩阵转置算法等方法来优化应用,最后实现一个高性能的FFT算法库。目前,最流行、应用最广的FFT有FFTW和Intel MKL。实验结果表明,在X86计算平台上,新提出的这套针对偶数基Cooley-Tukey FFT的技术所实现的FFT算法库的性能全面优于MKL和FFTW。所提出的这套高性能算法优化和实现技术体系,可推广到除偶数基以外的其他基的实现和优化上,为进一步的研究开发工作奠定一定的基础,进而突破FFT算法在硬件平台上的性能瓶颈,实现一套针对特定平台的高性能FFT算法库。

关键词: SIMD汇编优化, 蝶形计算优化, 蝶形网络优化, 高性能FFT库, 快速傅里叶变换算法, 偶数基

Abstract: Fast Fourier transform (FFT) is one of the most important basic algorithms,which is widely used in scientific calculation,signal processing,image processing and other fields.With the further improvement of real-time requirements in these application fields,fast Fourier transform algorithms are facing higher and higher performance requirements.In the existing FFT algorithm library,the solution speed and calculation accuracy of FFT algorithm are limited to a certain extent,and few researchers put forward corresponding optimization strategies and conducted in-depth research on the implementation of cooley-tukey fast Fourier transform based on even Numbers.Based on this,this paper put forward a set of for even basis of optimization strategy and methodfor Colley-Turkey fast Fourier transform.Firstly,a friendly butterfly network supporting SIMD mixed is constructed.Secondly,according to the even base rotation factor characteristics,the complexity of the butterfly calculation is reduced to a maximum degree.Thirdly,through the SIMD assembly optimization,assembly instruction rearrangement and selection,register allocation strategy and high performance matrix transpose algorithm method,the application is optimized .Finally a high performance FFT algorithm library is achieved.Currently,the most popular and widely used FFT are FFTW and Intel MKL.Experimental results show that on X86 computing platform,the performance of FFT library based on cooley-tukey FFT is better than MKL and FFTW.The high performance algorithm is put forward by the new optimization method and implementation technology system,which can be generalized to other except the even base based on the realization and optimization of a certain basis for further research and development work,to break through the FFT algorithm performance bottlenecks in the hardware platform,to achieve a high performance FFT algorithms library for a specific platform.

Key words: Butterfly calculation optimization, Butterfly network optimization, Even basis, Fast Fourier transform algorithm, High performance FFT library, SIMD assembly optimization

中图分类号: 

  • TP311.52
[1]COOLEY J W,TUKEY J W.An algorithm for the machine calculation of complex Fourier series[J].Mathematics of Computation,1965,19(90):297-301.
[2]COCHRAN W T,COOLEY J W,FAVIN D L,et al.What is the fast Fourier transform?[J].Proceedings of the IEEE,1967,55(10):1664-1674.
[3]DUHAMEL P,VETTERLI M.Fast Fourier transforms:a tutorial review and a state of the art[J].Signal Processing,1990,19(4):259-299.
[4]STRANG,GILBERT.“Wavelets”[J].American Scientist, 1994,2(3):250-255.
[5]KENT R D,READ C.Acoustic Analysis of Speech[M].Singular Publishing Group,2002.
[6]DONGARRA J,SULLIVAN F.Guest editors' introduction:The top 10 algorithms[J].Computing in Science & Engineering,2000,2(1):22-23.
[7]FRIGO M,JOHNSON S G.FFTW:An adaptive software architecture for the FFT[C]∥Proceedings of the 1998 IEEE International Conference on Acoustics,Speech and Signal Processing.IEEE,1998,3:1381-1384.
[8]Intel® Math Kernel Library Release Notes and New Features .(2017-09-10) .https://software.intel.com/en-us/articles/intel-math-kernel-library-release-notes-and-new-features.
[9]Developer Reference for Intel® Math Kernel Library - C .(2019-09-10) .https://software.intel.com/zhcn/download/developer-reference-for-intel-math-kernel-libra-ry-c.
[10]FRIGO M,JOHNSON S G.The Design and Implementation of FFTW3∥Proceedings of the IEEE.2005,216-231.
[11]JOHNSON S G,FRIGO M.Implementing FFTs in practice.https://scholar.google.com/scholar?q=Implementing+FFTs+in+practice&hl=zh-CN&as_sdt=0&as_vis=1&oi=scholart.
[12]MOHAMED K,ALI F H H M,ARIFFIN S,et al.An Improved AES S-box Based on Fibonacci Numbers and Prime Factor.IJ Network Security,2018,20(6):1206-1214.
[13]JOHNSON S G,FRIGO M.A modified split-radix FFT with fewer arithmetic operations.Signal Processing,2007,55(1):111-119.
[14]DUHAMEL P,HOLLMANN H.Split radix'FFT algorithm.Electronics letters,1984,20(1):14-16.
[15]BADAR S,DANDEKAR D R.High speed FFT processor design using radix? 4 pipelined architecture[C]∥2015 International Conference on Industrial Instrumentation and Control (ICIC).IEEE,2015:1050-1055.
[16]NUGTEREN C.Clblast:A tuned opencl BLAS library[C]∥Proceedings of the International Workshop on OpenCL.ACM,2018:5.
[17]NAKATA M.Basics and Practice of Linear Algebra Calculation Library BLAS and LAPACK[M]∥The Art of High Performance Computing for Computational Science. Singapore:Springer,2019:83-112.
[18]BUJANOVIC' Z,DRMACˇ Z.New robust ScaLAPACK routine for computing the QR factorization with column pivoting.arXiv:1910.05623,2019.
[19]MULTIPLICATION M,SICARD-RAMÍREZ A.Intel R G Math Kernel Library.Universidad EAFIT,2018,2018:3-21.
[20]INTEL R.Math kernel library[J/OL].https://scholar.google.com.hk/scholar?hl=zh-CN&as_sdt=0%2C5&as_yhi=2009&q=Intel+Math+Kernel+Library&btnG=.
[21]FOG A. Intel’s “cripple AMD” function.(2009-12-30). https://www.agner.org/optimize/blog/read.php?i=49#49.
[22]YANG M R.MKL has bad performances on an AMD cpu[EB/QL].(2019-02-02).https://sites.google.com/a/uci.edu/mi ngru-yang/programming/mkl-has-bad-performance-on-an-amd-cpu.
[23]The NVIDIA CUDA Fast Fourier Transform library [EB/OL].(2019-08-14) .https://developer.nvidia.com/cufft.
[24]ALGLIB User Guide online [EB/OL].(2019-02-21) .http://www.alglib.net/fasttransforms/fft.php.
[25]EISL J,GRIMMER M,SIMON D,et al.Trace-based Register Allocation in a JIT Compiler[C]∥Proceedings of the 13th International Conference on Principles and Practices of Programming on the Java Platform:Virtual Machines,Languages,and Tools.ACM,2016:14.
[26]EISL J,LEOPOLDSEDER D,MÖSSENBÖCK H.Parallel trace register allocation[C]∥Conference on Managed Languages & Runtimes (ManLang’18).2018.
[1] 余敦辉, 成涛, 袁旭.
基于排序学习的软件众包任务推荐算法
Software Crowdsourcing Task Recommendation Algorithm Based on Learning to Rank
计算机科学, 2020, 47(12): 106-113. https://doi.org/10.11896/jsjkx.200300107
[2] 王栋, 商红慧, 张云泉, 李琨, 贺新福, 贾丽霞.
原子动力学蒙特卡洛程序MISA-KMC在反应堆压力容器钢辐照损伤研究中的应用
Application of Atomic Dynamics Monte Carlo Program MISA-KMC in Study of Irradiation Damage of Reactor Pressure Vessel Steel
计算机科学, 2020, 47(4): 30-35. https://doi.org/10.11896/jsjkx.191100045
[3] 曲广强, 孙斌.
基于区块链的实验教学经费可信任回溯机制研究
Study on Trustworthy Backtracking Mechanism of Experimental Teaching Fund Based on Blockchain
计算机科学, 2019, 46(11A): 553-556.
[4] 侯禹臣, 吴伟.
静态图像行为标注众包系统的设计与实现
Design and Implementation of Crowdsourcing System for Still Image Activity Annotation
计算机科学, 2019, 46(11A): 580-583.
[5] 吕小虎, 韩笑冬, 宫江雷, 王志杰, 刘小鲲.
基于系统多维要素的安全关键软件验证方法
Systemic Muti-factors Based Verification Method for Safety-critical Software
计算机科学, 2019, 46(9): 156-161. https://doi.org/10.11896/j.issn.1002-137X.2019.09.022
[6] 姚庆, 郑凯, 刘垚, 王肃, 孙军, 徐梦轩.
SOM算法在申威众核上的实现和优化
Implementation and Optimization of SOM Algorithm on Sunway Many-core Processors
计算机科学, 2018, 45(11A): 591-596.
[7] 刘正浩, 张广泉.
基于WSAN的物联网软件分布式知识框架及其实现方法
Distributed Knowledge Framework of IOT Software Based on WSAN and Its Implementation Strategy
计算机科学, 2019, 46(1): 148-154. https://doi.org/10.11896/j.issn.1002-137X.2019.01.023
[8] 郝俊生,李冰锋,陈曦,高文娟.
基于Android平台的高校网络订餐系统的设计与实现
Design and Implementation of Network Subscription System Based on Android Platform
计算机科学, 2018, 45(6A): 591-594.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!