Lightning Talk: Automatic Generation of 3-D FFTs, Brian Duff, SpiralGen, Inc.
#### Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

**DOWNLOAD PRESENTATION**

**WATCH VIDEO**

Automatic Generation of 3-D FFTs Brian Duff[a], Jason Larkin[a], Mike Franusich[a], Franz Franchetti[a][b] [a] SpiralGen, Inc. [b] Dept. of Electrical and Computer Engineering, Carnegie Mellon University BACKGROUND Parallel software development is notoriously difficult. The quest for exascale computing has led to fast-changing, increasingly complex, and diverse supercomputing architectures, which poses a central problem in parallel scientific computing: how can portable optimal performance be achieved with reasonable effort? One possible solution is to generate highly optimized code from a high level specification. Spiral [1, 2] is such a tool for the performance-critical domain of linear transforms, such as the ubiquitous Fourier transform. For a specified transform, Spiral automatically generates high performance code that is tuned to a given architecture. Spiral formulates the tuning as an optimization problem, and exploits the domain-specific mathematical structure of transform algorithms to implement a feedback-driven optimizer. Similar to a human expert, for a specified transform, Spiral “intelligently” generates and explores algorithmic and implementation choices to find the best match to the computer’s micro-architecture. The “intelligence” is provided by a search and learning technique that exploits the structure of the algorithm and implementation space to guide the exploration and optimization. Spiral generates high performance code for a broad set of transforms including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by Spiral competes with, and consistently outperforms, the best available human tuned library code. In this work we extend Spiral to the computer generation of 3D-FFT code crucial in oil exploration and other domains. RESULTS We present results obtained by SpiralGen, Inc., the corporate face of Spiral, on a Blue Gene/Q system for a three dimensional fast Fourier transform (FFT). While Spiral can generate code for a range of different transform algorithms, the FFT is chosen as an example because of its ubiquitous application in diverse scientific fields, including oil exploration. The code was generated for one node of an IBM Blue Gene/Q with up to 64 threads in a batch filling half of the node’s memory. The FFT was performed on n x n x n data cubes of varying size n (shown on the x-axis) and the performance is reported in giga-floating point operations per second (Gflops/s). Figure 1 shows a comparison of Spiral-generated code results against FFTW, a well-known C library for calculating FFTs. The Spiral-generated code is consistently more than two times faster. Reasons include FFTW’s possible suboptimal support for BlueGene’s vector extensions, and Spiral’s ability to detailed tuning for vector extensions and multicore. Figure 2 compares Spiral-generated code with IBM’s own Engineering and Scientific Subroutine Library (ESSL). The ESSL implementation has overhead which causes it to be non-competitive at small FFT sizes. While EESL is competitive at some larger data sizes, the results are again consistently below those from the Spiral-generated code. CONCLUSIONS We show results with highly optimized 3-D-FFT code generated by Spiral for a BlueGene/Q platform. The focus was on a single node and we demonstrated significant speed-up compared to alternative libraries due to full support of all 64 nodes and BlueGene’s vector extension. An extension to more nodes is possible as we already did for 1-D FFTs as part of winning the HPC challenge supporting 128k cores [3]. We note that all Spiral code is fully generated. This implies that customization (e.g., when parts of the input are known to be zero) or porting (to future BlueGene platform) can be done quickly by retargeting the generator. 1. Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo SPIRAL: Code Generation for DSP Transforms Proceedings of the IEEE special issue on "Program Generation, Optimization, and Adaptation," Vol. 93, No. 2, 2005, pp. 232-275 2. Markus Püschel, Franz Franchetti, and Yevgen Voronenko Spiral in Encyclopedia of Parallel Computing, Eds. David Padua, Springer 2011 3. Franz Franchetti, Yevgen Voronenko, and Gheorghe Almasi Automatic Generation of the HPC Challenge’s Global FFT Benchmark for BlueGene/P High Performance Computing for Computational Science – VECPAR 2012, Eds. Michel Daydé, Osni Marques, Kengo Nakajima, Springer 2013, pp. 187-200

## Attendees
(0)

Automatic Generation of 3-D FFTs Brian Duff[a], Jason Larkin[a], Mike Franusich[a], Franz Franchetti[a][b] [a] SpiralGen, Inc. [b] Dept. of Electrical and Computer Engineering, Carnegie Mellon University BACKGROUND Parallel software development is notoriously difficult. The quest for exascale computing has led to fast-changing, increasingly complex, and diverse supercomputing architectures, which poses a central problem in parallel scientific computing: how can portable optimal performance be achieved with reasonable effort? One possible solution is to generate highly optimized code from a high level specification. Spiral [1, 2] is such a tool for the performance-critical domain of linear transforms, such as the ubiquitous Fourier transform. For a specified transform, Spiral automatically generates high performance code that is tuned to a given architecture. Spiral formulates the tuning as an optimization problem, and exploits the domain-specific mathematical structure of transform algorithms to implement a feedback-driven optimizer. Similar to a human expert, for a specified transform, Spiral “intelligently” generates and explores algorithmic and implementation choices to find the best match to the computer’s micro-architecture. The “intelligence” is provided by a search and learning technique that exploits the structure of the algorithm and implementation space to guide the exploration and optimization. Spiral generates high performance code for a broad set of transforms including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by Spiral competes with, and consistently outperforms, the best available human tuned library code. In this work we extend Spiral to the computer generation of 3D-FFT code crucial in oil exploration and other domains. RESULTS We present results obtained by SpiralGen, Inc., the corporate face of Spiral, on a Blue Gene/Q system for a three dimensional fast Fourier transform (FFT). While Spiral can generate code for a range of different transform algorithms, the FFT is chosen as an example because of its ubiquitous application in diverse scientific fields, including oil exploration. The code was generated for one node of an IBM Blue Gene/Q with up to 64 threads in a batch filling half of the node’s memory. The FFT was performed on n x n x n data cubes of varying size n (shown on the x-axis) and the performance is reported in giga-floating point operations per second (Gflops/s). Figure 1 shows a comparison of Spiral-generated code results against FFTW, a well-known C library for calculating FFTs. The Spiral-generated code is consistently more than two times faster. Reasons include FFTW’s possible suboptimal support for BlueGene’s vector extensions, and Spiral’s ability to detailed tuning for vector extensions and multicore. Figure 2 compares Spiral-generated code with IBM’s own Engineering and Scientific Subroutine Library (ESSL). The ESSL implementation has overhead which causes it to be non-competitive at small FFT sizes. While EESL is competitive at some larger data sizes, the results are again consistently below those from the Spiral-generated code. CONCLUSIONS We show results with highly optimized 3-D-FFT code generated by Spiral for a BlueGene/Q platform. The focus was on a single node and we demonstrated significant speed-up compared to alternative libraries due to full support of all 64 nodes and BlueGene’s vector extension. An extension to more nodes is possible as we already did for 1-D FFTs as part of winning the HPC challenge supporting 128k cores [3]. We note that all Spiral code is fully generated. This implies that customization (e.g., when parts of the input are known to be zero) or porting (to future BlueGene platform) can be done quickly by retargeting the generator. 1. Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo SPIRAL: Code Generation for DSP Transforms Proceedings of the IEEE special issue on "Program Generation, Optimization, and Adaptation," Vol. 93, No. 2, 2005, pp. 232-275 2. Markus Püschel, Franz Franchetti, and Yevgen Voronenko Spiral in Encyclopedia of Parallel Computing, Eds. David Padua, Springer 2011 3. Franz Franchetti, Yevgen Voronenko, and Gheorghe Almasi Automatic Generation of the HPC Challenge’s Global FFT Benchmark for BlueGene/P High Performance Computing for Computational Science – VECPAR 2012, Eds. Michel Daydé, Osni Marques, Kengo Nakajima, Springer 2013, pp. 187-200

Thursday March 6, 2014 11:45am - 11:50am PST

BRC 103*Rice University 6500 Main Street at University, Houston, TX 77030*

BRC 103