2014 Rice Oil & Gas HPC has ended


Industry Technology Solution Pitch Session
Thursday, March 6
 

11:30am PST

Lightning Talk: Reverse Time Migration with Manycore Coprocessors, Leonardo Borges, Intel

Moderators
Speakers

Leo Borges

Sr. Staff Engineer, Intel


Thursday March 6, 2014 11:30am - 11:35am PST
BRC 103 Rice University 6500 Main Street at University, Houston, TX 77030

11:35am PST

Lightning Talk: Accelerating Reverse Time Migration: A Dataflow Approach, Hicham Lahlou, Xcelerit

PRESENTATION NOT AVAILABLE

VIDEO NOT AVAILABLE

As the age of harvesting easily accessible Oil and Gas resources is coming to an end, more complex geologies have to be explored to find new reservoirs. These geologies often violate the assumptions underlying the Kirchhoff Time Migration (KTM) algorithm, calling for more complex algorithms to reconstruct the Earth's subsurface from seismic wave measurement data. Hence, Reverse Time Migration (RTM) is the current state-of-the-art algorithm for seismic imaging, giving more accurate 2D and 3D images of the subsurface than KTM. Until recently, the enormous computational complexity involved hindered the widespread application of the RTM algorithm in the industry. With hardware advances in multi-core CPUs as well as the increased use of high-performance accelerator processors such as GPUs or the Xeon Phi, it is now possible to reconstruct subsurface images within reasonable time frames. However, most programming approaches available for these processors do not provide enough hardware abstraction for end users, i.e., geophysicists. This poses a significant barrier to adopting advanced HPC hardware and using it efficiently.

We briefly explain the RTM algorithm and how it is typically implemented. The algorithm is analyzed to identify the key performance bottlenecks, both for computation and for data access. The main implementation challenges are detailed, such as managing the data, parallelizing and distributing the computation, and exploiting the hardware capabilities of multi-core CPUs, GPUs, and Xeon Phi. To cope with these challenges, we propose to model RTM as a dataflow graph and automate the performance optimizations and execution management. Dataflow graphs are directed graphs of processing stages (actors), where data is streamed along the edges and processed by the actors. This model exposes several types of parallelism and optimization opportunities, such as pipeline parallelism, data parallelism, and memory locality.

Using this model, programmers can focus on the algorithm itself, while the performance optimizations and execution management are left to an automated tool. Further, the actors themselves can be implemented independently of the execution device, enabling code portability between different hardware. We give a mapping of RTM algorithms to a dataflow graph and show that it is independent of the target execution hardware. The full algorithm is captured in the model, and data and task dependencies are fully exposed, without explicitly using parallel programming concepts. The benefits of this approach, and how it can overcome the implementation challenges mentioned earlier, are explained in detail. Using an example implementation, important aspects of the execution management, such as memory access patterns, data transfers, cache efficiency, and asynchronous execution, are detailed. We give mappings of these aspects to multi-core CPUs, GPUs, and Xeon Phi, explaining the similarities and differences. As typical systems have more than one accelerator processor, we also cover scheduling dataflow graphs onto multiple execution devices.

As a practical example, we use the Xcelerit SDK as an implementation framework based on a dataflow programming model. It exploits the mentioned optimization opportunities and abstracts the hardware specifics from the user. The performance has been measured on both multi-core CPUs and GPUs for a range of algorithm parameters. It is within 5% of equivalent hand-tuned implementations of the algorithm, but achieved with significantly lower implementation effort. This shows the potential of a dataflow approach for RTM.
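As a rough illustration of the dataflow model described above (this is not the Xcelerit SDK API; the actor names and RTM-flavored stages below are hypothetical stubs), an RTM-style pipeline can be sketched as actors connected by queues, with a scheduler stepping whichever actor has work:

```python
from collections import deque

class Actor:
    """One processing stage: consumes items from its input queue,
    applies fn, and deposits results on the downstream actor's queue."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self.inbox = deque()
        self.downstream = None  # edge to the next actor, if any

    def step(self):
        """Process one queued item, if any; return True if work was done."""
        if not self.inbox:
            return False
        result = self.fn(self.inbox.popleft())
        if self.downstream is not None:
            self.downstream.inbox.append(result)
        return True

def connect(*actors):
    """Wire actors into a linear pipeline (a simple dataflow graph)."""
    for a, b in zip(actors, actors[1:]):
        a.downstream = b
    return actors

# Hypothetical RTM-flavored stages: forward-propagate the source wavefield,
# back-propagate the receiver data, cross-correlate into an image (all stubbed).
results = []
pipeline = connect(
    Actor("forward",   lambda shot: ("fwd", shot)),
    Actor("backward",  lambda fwd:  ("bwd", fwd)),
    Actor("correlate", lambda bwd:  ("img", bwd)),
    Actor("stack",     lambda img:  results.append(img)),
)

for shot in range(3):          # stream shots along the graph's edges
    pipeline[0].inbox.append(shot)

# Naive scheduler: in a real runtime, different actors could execute
# concurrently on different devices, yielding pipeline parallelism.
while any(a.step() for a in pipeline):
    pass

print(len(results))  # 3 partial images stacked
```

Because each actor only sees its queues, the same graph description could in principle be scheduled onto CPUs, GPUs, or Xeon Phi, which is the portability argument the abstract makes.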

Moderators
Speakers

Thursday March 6, 2014 11:35am - 11:40am PST
BRC 103 Rice University 6500 Main Street at University, Houston, TX 77030

11:40am PST

Lightning Talk: Accelerating Compute Intense Applications, Geoff Clark, Acceleware Ltd.

Moderators
Speakers

Thursday March 6, 2014 11:40am - 11:45am PST
BRC 103 Rice University 6500 Main Street at University, Houston, TX 77030

11:45am PST

Lightning Talk: Automatic Generation of 3-D FFTs, Brian Duff, SpiralGen, Inc.

DOWNLOAD PRESENTATION

WATCH VIDEO

Automatic Generation of 3-D FFTs
Brian Duff[a], Jason Larkin[a], Mike Franusich[a], Franz Franchetti[a][b]
[a] SpiralGen, Inc. [b] Dept. of Electrical and Computer Engineering, Carnegie Mellon University

BACKGROUND

Parallel software development is notoriously difficult. The quest for exascale computing has led to fast-changing, increasingly complex, and diverse supercomputing architectures, which poses a central problem in parallel scientific computing: how can portable optimal performance be achieved with reasonable effort? One possible solution is to generate highly optimized code from a high-level specification. Spiral [1, 2] is such a tool for the performance-critical domain of linear transforms, such as the ubiquitous Fourier transform. For a specified transform, Spiral automatically generates high-performance code that is tuned to a given architecture. Spiral formulates the tuning as an optimization problem and exploits the domain-specific mathematical structure of transform algorithms to implement a feedback-driven optimizer. Similar to a human expert, for a specified transform, Spiral "intelligently" generates and explores algorithmic and implementation choices to find the best match to the computer's micro-architecture. The "intelligence" is provided by a search-and-learning technique that exploits the structure of the algorithm and implementation space to guide the exploration and optimization. Spiral generates high-performance code for a broad set of transforms including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by Spiral competes with, and consistently outperforms, the best available human-tuned library code. In this work we extend Spiral to the computer generation of 3-D FFT code crucial in oil exploration and other domains.

RESULTS

We present results obtained by SpiralGen, Inc., the corporate face of Spiral, on a Blue Gene/Q system for a three-dimensional fast Fourier transform (FFT). While Spiral can generate code for a range of different transform algorithms, the FFT is chosen as an example because of its ubiquitous application in diverse scientific fields, including oil exploration. The code was generated for one node of an IBM Blue Gene/Q with up to 64 threads, in a batch filling half of the node's memory. The FFT was performed on n x n x n data cubes of varying size n (shown on the x-axis), and the performance is reported in giga-floating point operations per second (Gflop/s). Figure 1 shows a comparison of the Spiral-generated code against FFTW, a well-known C library for calculating FFTs. The Spiral-generated code is consistently more than two times faster. Reasons include FFTW's possibly suboptimal support for Blue Gene's vector extensions, and Spiral's ability to perform detailed tuning for the vector extensions and multicore. Figure 2 compares Spiral-generated code with IBM's own Engineering and Scientific Subroutine Library (ESSL). The ESSL implementation has overhead that makes it non-competitive at small FFT sizes. While ESSL is competitive at some larger data sizes, its results are again consistently below those of the Spiral-generated code.

CONCLUSIONS

We show results with highly optimized 3-D FFT code generated by Spiral for a Blue Gene/Q platform. The focus was on a single node, and we demonstrated significant speed-ups compared to alternative libraries due to full support of all 64 hardware threads and Blue Gene's vector extensions. An extension to more nodes is possible, as we already demonstrated for 1-D FFTs as part of winning the HPC Challenge with support for 128k cores [3]. We note that all Spiral code is fully generated. This implies that customization (e.g., when parts of the input are known to be zero) or porting (to a future Blue Gene platform) can be done quickly by retargeting the generator.

1. Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo. SPIRAL: Code Generation for DSP Transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation," Vol. 93, No. 2, 2005, pp. 232-275.
2. Markus Püschel, Franz Franchetti, and Yevgen Voronenko. Spiral. In Encyclopedia of Parallel Computing, Ed. David Padua, Springer, 2011.
3. Franz Franchetti, Yevgen Voronenko, and Gheorghe Almasi. Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P. High Performance Computing for Computational Science – VECPAR 2012, Eds. Michel Daydé, Osni Marques, Kengo Nakajima, Springer, 2013, pp. 187-200.
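The Gflop/s figures in the abstract follow the usual FFT accounting, which charges an FFT of N points 5 N log2(N) floating-point operations regardless of the actual algorithm used. As a rough, machine-dependent illustration of the metric (using NumPy as a stand-in baseline, not Spiral-generated code), the same number can be computed for an n x n x n cube like this:

```python
import time
import numpy as np

def fft3d_gflops(n, repeats=3):
    """Time an n x n x n complex 3-D FFT with NumPy and report Gflop/s,
    using the conventional 5*N*log2(N) flop count with N = n**3."""
    data = (np.random.rand(n, n, n) + 1j * np.random.rand(n, n, n))
    best = float("inf")
    for _ in range(repeats):          # best-of-repeats timing
        t0 = time.perf_counter()
        np.fft.fftn(data)
        best = min(best, time.perf_counter() - t0)
    flops = 5.0 * n**3 * np.log2(float(n**3))
    return flops / best / 1e9

print(round(fft3d_gflops(64), 2))     # machine-dependent
```

This is only the reporting convention; the interesting part of the Spiral work is that the generated code's measured time, plugged into the same formula, beats FFTW and ESSL on the Blue Gene/Q node.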

Moderators
Speakers

Brian Duff

Software Engineer, SpiralGen Inc.


Thursday March 6, 2014 11:45am - 11:50am PST
BRC 103 Rice University 6500 Main Street at University, Houston, TX 77030

11:50am PST

Lightning Talk: HueSpace: The next generation software development platform for E&P Visual Computing, Michele Isernia, HUE AS

DOWNLOAD PRESENTATION

WATCH VIDEO

Building modern, high-performance, interactive visual computing software is quite difficult, as the major technology gap between the E&P software currently in use and the latest available computing technologies shows. HueSpace comes from the 3D interactive gaming industry and is being adopted by major commercial ISVs and oil majors to develop the next generation of E&P software. HUE has been developing HueSpace since 2001; the platform is solid and validated in production environments, and it is also broad, spanning from Seismic to Reservoir to Drilling and Production.

Over the last 10 years, the 3D gaming industry adopted the "engine" model, in which a single engine controls visualization, computation, and large-data streaming. This model has enabled the gaming industry to expand and grow tremendously. HueSpace is the only commercial solution bringing this model to E&P. HueSpace enables practically unlimited data sizes, utilizing intelligent streaming and advanced wavelet compression to stream data on demand and to apply advanced computing algorithms to the data "in flight," driven by the interactive user experience and workflow. This approach is so powerful and so different that it is literally changing many of the traditional workflows in E&P. HueSpace takes care of all the data management around computing and visualization, and automatically takes advantage of multiple accelerators and the required data decomposition.

During the presentation we will cover the core architecture and programming model. We will then demonstrate an application that handles massive TB-scale datasets, applies interactive computing and visualization that would normally require cluster computing and multiple hours or days to solve, and shows some of the most advanced 3D visualization available to date. We will also demo the same application working in the cloud and collaboratively across multiple users, from laptops, browsers, tablets, and more.

HueSpace supports Linux and Windows, as well as C, C++, .Net/C#, Java, and Python, and can be used to develop brand-new interactive visual applications as well as to extend existing software. HueSpace supports hybrid architectures, enabling GPU computing and other accelerators.

Moderators
Speakers

Thursday March 6, 2014 11:50am - 11:55am PST
BRC 103 Rice University 6500 Main Street at University, Houston, TX 77030

11:55am PST

Lightning Talk: Kalray MPPA-256 scalable compute cartridge: an efficient architecture applied to Oil & Gas HPC, Benoît Ganne, Kalray SA

DOWNLOAD PRESENTATION

WATCH VIDEO

Kalray MPPA-256 scalable compute cartridge: an efficient architecture applied to Oil & Gas HPC
Benoît Ganne, Christian Chabrerie, Thierry Strudel
benoit.ganne@kalray.eu, christian.chabrerie@kalray.eu, thierry.strudel@kalray.eu

Introduction

Kalray MPPA-256 is a manycore, low-power, distributed-memory supercomputer-on-a-chip (SCoC). It is composed of 16 clusters of 17 cores each - 16 dedicated computational cores and 1 control core - sharing 2 MB of SRAM, plus several Input/Output (I/O) capabilities controlled by 4 SMP quad-cores, including 2 PCIe Gen3 controllers, 2 DDR3-1600 ECC controllers, and 2 40 Gbps Ethernet (GbE) controllers, among others. Each core implements the Kalray-1 VLIW architecture with a dedicated IEEE-754 single-precision (SP) and double-precision (DP) floating-point unit (FPU). The 16 clusters and the 4 SMP quad-cores are interconnected through a high-bandwidth, low-latency network-on-a-chip (NoC) using a 2D-torus topology. In addition to the standard I/O capabilities, a single MPPA-256 can interconnect its NoC to 4 neighboring MPPA-256 chips using the Kalray NoCX interconnect. This capability makes it possible to present a single virtual manycore, composed of multiple MPPA-256 chips, to the programmer; multiple MPPA-256 chips can be traversed transparently in each direction. The MPPA-256 topology is depicted in figure 1.

Figure 1: Kalray MPPA-256 topology

This architecture can be used as a building block for highly energy-efficient supercomputers: a cartridge with 4 MPPA-256 chips, as depicted in figure 2. The 4 MPPA-256 chips are interconnected on-board using NoCX, presenting to the programmer a single manycore of 64 clusters (1024 computational cores) and 16 SMP quad-cores. The boards can be further interconnected through NoCX with external connectors, or using a chassis interconnect, to build an even bigger virtual manycore.
Programming model

The Kalray MPPA-256 supports C, C++, and Fortran with different programming models and Application Programming Interfaces (APIs), and can be programmed with MPI and OpenMP. Each MPPA-256 cluster is an MPI process, and within this MPI process OpenMP can be used to easily exploit the 16 computational cores.

Due to the distributed and asymmetric nature of the MPPA-256, the best programming model for Oil & Gas algorithms such as Reverse Time Migration (RTM) or Full Waveform Inversion (FWI) is a double-buffering model (an application pipeline of depth 2), as depicted in figure 3: each cluster divides its 2 MB of SRAM in two, so that while the 16 computational cores are working on one SRAM half, the next data can be pushed by DMA into the other half.

System architecture

The seismic data are stored on storage servers and sent through multiple 10GbE links into the DDR of the Kalray MPPA-256 scalable compute cartridges during the initialization phase. The cartridges can then be partitioned or paired as needed, depending on the workload's memory size and required computational power. All computation is then done locally, with frontier exchanges between the DDR of the cartridges involved.

For example, using a single cartridge with 32 GB of DDR (4 GB per DDR interface, 8 GB per MPPA-256), a typical RTM shot might be computed. If a shot exceeds this amount of memory, multiple cartridges can be paired together; conversely, multiple shots can be computed on a single cartridge if they fit in memory. During the computation phase, snapshots can be sent back to the storage server through the 10GbE links.

First experiments

We experimented with typical HPC workloads on a single Kalray MPPA-256 scalable compute cartridge prototype based on 4 interconnected MPPA-256 chips, each MPPA-256 using 4 GB of DDR (2 GB per DDR interface).
For each experiment, the achieved single-precision GFLOPS/W and the scalability are measured. The GFLOPS are measured using hardware performance counters, and the power consumption is measured using an on-board power-measurement circuit.

The first experiment is a general matrix multiply (GEMM[5]) on a 4096x4096 matrix, scaling from a single cluster on a single MPPA-256 to the 64 clusters available on the 4 MPPA-256 chips. The results are presented in figure 4. The following table compares the GFLOPS/W of different architectures[1][2]:

Platform            | GFLOPS | Power (W) | GFLOPS/W
nVidia M2090 Fermi  |    780 |       225 |      3.5
Intel i7-3820       |    209 |        95 |      2.2
DSP: TI C6678       |     93 |        10 |      9.3
MPPA-256            |    123 |        10 |     11.9
4x MPPA-256         |    433 |        41 |     10.5

The Intel results are measured using OpenBLAS[6] on the host CPU of the MPPA developer workstation. The scalability is nearly linear, demonstrating the architecture's scalability, while the GFLOPS/W is among the best available today.

The second experiment is a complex fast Fourier transform (FFT) of 1K to 256K points, scaling from a single cluster to the 16 clusters available on a single MPPA-256. The results are presented in figure 5. The scalability is nearly linear, once again demonstrating the architecture's scalability. More experiments will be done to scale up to 4 MPPA-256 chips and to compare against other architectures. Results using benchmarks more relevant to Oil & Gas HPC, such as 3-dimensional finite difference (3DFD) algorithms, will also be shown.

Conclusion

We showed that the Kalray MPPA-256 scalable compute cartridge exposes 2 key characteristics needed to support future Oil & Gas Exascale HPC:

Scalability: it allows a system to be built as a stacking of well-known, simpler systems.
Power efficiency: Exascale systems will need more than 50 GFLOPS/W[4].

Still, the Kalray MPPA-256 scalable compute cartridge is only a first step in the direction of Oil & Gas Exascale HPC.
More power efficiency will be needed in the coming years, and the authors believe that the model of simple, power-efficient building blocks, such as scalable interconnections of multiple manycores[3], will remain. The distributed-memory nature of this architecture guarantees its scalability, and as such the system can be precisely sized and expanded as needed. This paves the way to a new paradigm for scalable software-defined systems.

References

[1] NVIDIA, NVIDIA CUBLAS performance, available at https://developer.nvidia.com/cublas.
[2] Francisco D. Igual, Murtaza Ali, Arnon Friedmann, Eric Stotzer, Timothy Wentz, and Robert van de Geijn, Unleashing DSPs for General-Purpose HPC, available at http://www.cs.utexas.edu/users/flame/pubs/FLAWN61.pdf.
[3] US National Academy of Science, The New Global Ecosystem in Advanced Computing: Implications for U.S. Competitiveness and National Security (2012).
[4] DARPA, Power Efficiency Revolution For Embedded Computing Technologies (PERFECT) program.
[5] National Science Foundation, available at http://www.netlib.org/blas/.
[6] OpenBLAS, available at http://www.openblas.net/.
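The depth-2 double-buffering pipeline described in the abstract (compute on one SRAM half while DMA fills the other) can be sketched as follows. This is a hosted illustration, not Kalray code: the DMA engine is simulated with a prefetch thread, and dma_fetch/compute are hypothetical stubs.

```python
import threading
from queue import Queue

HALF = 2 * 1024 * 1024 // 2   # half of a cluster's 2 MB SRAM, in bytes

def dma_fetch(tile_id):
    """Stand-in for a DMA transfer from DDR into one SRAM half."""
    return bytes(HALF)         # hypothetical zero-filled payload

def compute(buf, tile_id):
    """Stand-in for the 16 computational cores working on one SRAM half."""
    return sum(buf[:16]) + tile_id

def pipeline(n_tiles):
    """Depth-2 pipeline: while the cores compute on buffers[cur],
    a prefetch thread (the 'DMA') fills buffers[1 - cur]."""
    buffers = [dma_fetch(0), None]
    results, cur = [], 0
    for tile in range(n_tiles):
        prefetched = Queue(maxsize=1)
        if tile + 1 < n_tiles:
            # Kick off the 'DMA' for the next tile before computing.
            t = threading.Thread(
                target=lambda nxt=tile + 1: prefetched.put(dma_fetch(nxt)))
            t.start()
        results.append(compute(buffers[cur], tile))   # overlaps with fetch
        if tile + 1 < n_tiles:
            t.join()
            buffers[1 - cur] = prefetched.get()
            cur = 1 - cur                              # swap the halves
    return results

print(pipeline(4))  # [0, 1, 2, 3]
```

The point of the scheme is that, in steady state, each tile's transfer cost is hidden behind the previous tile's computation, which matters on a chip whose clusters have only 2 MB of local SRAM.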

Moderators
Speakers

Thursday March 6, 2014 11:55am - 12:00pm PST
BRC 103 Rice University 6500 Main Street at University, Houston, TX 77030