2014 Rice Oil & Gas HPC has ended
Thursday, March 6 • 2:00pm - 2:20pm
Applications Session II: Automatic Performance Tuning of Reverse Time Migration Using The Abstract Data and Communication Library, Saber Feki, KAUST


With the increased complexity and diversity of mainstream HPC systems, significant effort is required to tune applications to achieve the best possible performance on each platform. This task is becoming more challenging and requires a larger set of skills. Automatic performance tuning is therefore becoming a must for optimizing applications such as Reverse Time Migration (RTM), which is widely used in seismic imaging for oil and gas exploration. In the RTM application, the time-dependent partial differential acoustic wave equation is discretized in space and time, and the resulting system of linear equations is solved at each time step using an explicit scheme. The 3-D version of RTM is computationally intensive, and its execution time becomes reasonable for field data only with a parallel implementation using domain decomposition: for each shot, the simulation grid is split into smaller 3-D blocks distributed across multiple MPI processes. At each time step, computing the boundary grid points requires neighboring processes to exchange the values of the stencil points that belong to neighboring subdomains. Typical implementations use Message Passing Interface (MPI) routines for this data exchange, which adds execution time for the communication operations. The communication overhead that stems from the parallelization of the RTM algorithm can be considerably reduced using an auto-tuning tool such as the Abstract Data and Communication Library (ADCL) [1, 2]. ADCL is an MPI-based communication library that aims to provide the lowest possible execution time for communication operations and to ease software development through high-level data abstraction and predefined routines. ADCL allows the parallel code to adapt itself to the current architecture and software environment at runtime.
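The domain decomposition and neighbor exchange described above can be illustrated with a minimal single-process sketch. It is not the authors' RTM code and uses no MPI: the "processes" are plain Python lists, the problem is reduced to 1-D, and the helper names (`split`, `exchange_halos`, `decomposed_laplacian`) are hypothetical. The halo width of 4 points per side is an assumption that matches an 8th-order central stencil, whose standard coefficients are used here with unit grid spacing.

```python
# 8th-order central-difference coefficients for the second derivative
# (standard values, unit grid spacing assumed). COEFFS[0] is the centre
# weight; COEFFS[k] applies to the points at offsets -k and +k.
COEFFS = [-205.0 / 72.0, 8.0 / 5.0, -1.0 / 5.0, 8.0 / 315.0, -1.0 / 560.0]
HALO = len(COEFFS) - 1  # 4 ghost points needed on each side


def laplacian_1d(padded):
    """Apply the stencil to a list that already carries HALO points per side."""
    out = []
    for i in range(HALO, len(padded) - HALO):
        acc = COEFFS[0] * padded[i]
        for k in range(1, HALO + 1):
            acc += COEFFS[k] * (padded[i - k] + padded[i + k])
        out.append(acc)
    return out


def split(field, nblocks):
    """Split the global field into equal contiguous subdomains.

    Assumes len(field) is divisible by nblocks and each block holds
    at least HALO points.
    """
    size = len(field) // nblocks
    return [field[r * size:(r + 1) * size] for r in range(nblocks)]


def exchange_halos(blocks):
    """Emulate the neighbour exchange: every block receives HALO points
    from its left and right neighbours. Physical boundaries are zero-padded."""
    padded = []
    for r, block in enumerate(blocks):
        left = blocks[r - 1][-HALO:] if r > 0 else [0.0] * HALO
        right = blocks[r + 1][:HALO] if r < len(blocks) - 1 else [0.0] * HALO
        padded.append(left + block + right)
    return padded


def decomposed_laplacian(field, nblocks):
    """Compute the stencil block by block after a halo exchange."""
    result = []
    for padded in exchange_halos(split(field, nblocks)):
        result.extend(laplacian_1d(padded))
    return result


def global_laplacian(field):
    """Reference result computed on the undivided grid."""
    zeros = [0.0] * HALO
    return laplacian_1d(zeros + field + zeros)
```

Calling `decomposed_laplacian(field, 4)` on a 32-point field gives the same values as `global_laplacian(field)`, which is the property the halo exchange has to preserve. In a real MPI code the slices copied in `exchange_halos` become messages between neighboring ranks, and that is the communication cost the abstract targets.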
The idea behind ADCL is to select the fastest of the available implementations for a given communication pattern during the regular execution of the application. For example, ADCL provides 20 different implementations of multi-dimensional (e.g., 2-D, 3-D) neighborhood communication, using different combinations of (i) the number of simultaneous communication partners, (ii) the handling of non-contiguous messages, and (iii) the MPI data transfer primitive. ADCL uses the first iterations of the application to determine the fastest neighborhood communication routine for the current execution conditions. Once performance data from a sufficient number of iterations are available, ADCL decides at runtime which alternative to use for the rest of the simulation. Using ADCL involves three main steps: preparation, communication, and finalization. In this work, we showcase the performance benefit of auto-tuning the parallel RTM application. For that purpose, we implement two versions of the RTM code for each of (i) isotropic (ISO) and (ii) tilted transversely isotropic (TTI) media. The first version is the classic scenario, in which the commonly used MPI implementation of neighborhood communication is used. The second is the automatically tuned version, in which ADCL transparently selects the best MPI implementation of neighborhood communication according to the runtime environment. The numerical scheme is a finite-difference method that is 2nd order in time and 8th order in space. We run the simulations for a total of 720 time steps. We carry out our tests on two different parallel platforms at TOTAL E&P Research and Technology USA, LLC. The first cluster (Appro) is based on AMD CPUs, with 2 GB of memory per core and an InfiniBand DDR interconnect. The second (IBM) is an Intel-based cluster, with 3 GB of memory per core and an InfiniBand QDR interconnect.
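The runtime selection idea described above (time each alternative during the first iterations, then keep the fastest for the rest of the run) can be sketched as follows. This is an illustration only and not ADCL's API: the class name `AutoTunedOperation`, its parameters, and the use of a simple mean of the measured times as the decision rule are all assumptions made for this example.

```python
import time


class AutoTunedOperation:
    """Try each candidate for a few iterations, then keep the fastest.

    Hypothetical helper for illustration; this is not ADCL's interface.
    """

    def __init__(self, candidates, samples_per_candidate=3,
                 clock=time.perf_counter):
        self.candidates = list(candidates)  # list of (name, callable) pairs
        self.samples = samples_per_candidate
        self.clock = clock                  # injectable for testing
        self.timings = {name: [] for name, _ in self.candidates}
        self.calls = 0                      # learning-phase calls so far
        self.winner = None                  # set when the learning phase ends

    def _mean_time(self, candidate):
        times = self.timings[candidate[0]]
        return sum(times) / len(times)

    def __call__(self, *args):
        # After the learning phase, always dispatch to the selected winner.
        if self.winner is not None:
            return self.winner[1](*args)

        # Learning phase: cycle through the candidates and time each call.
        name, func = self.candidates[self.calls % len(self.candidates)]
        start = self.clock()
        result = func(*args)
        self.timings[name].append(self.clock() - start)
        self.calls += 1

        # Once every candidate has enough samples, fix the decision.
        if self.calls == self.samples * len(self.candidates):
            self.winner = min(self.candidates, key=self._mean_time)
        return result
```

In an application, the candidates would be alternative implementations of the same neighborhood exchange, and the wrapped operation would be called once per time step. Every call still performs a valid exchange, so the learning phase costs only the time lost on slower alternatives during the first iterations.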
The InfiniBand network in both clusters has a fat-tree topology. We report the MPI communication times of both the ISO and TTI kernels, for both platforms and for each version of the code (with and without ADCL). The main advantage of using ADCL is performance, which here means a reduction in the execution time of the communication operations. First, ADCL is able to select a different implementation of 3-D neighborhood communication for each of the execution environments and for each of the ISO and TTI kernels. Second, the auto-tuned versions using ADCL provide up to 40% improvement in the communication time of RTM, as detailed in Figure 2. Another advantage of using ADCL is productivity: ADCL allows developers to implement the neighborhood communication functions of the RTM algorithm very easily. The developer does not need to worry about the choice of MPI communication routines or the memory management required for the halo cells (handling non-contiguous data). By keeping track of the memory addresses of the data structures that are passed to the main RTM function, one can integrate ADCL into both the isotropic and the tilted transversely isotropic RTM algorithms with minor changes to the original code. We are currently working on optimizing the MPI runtime parameters using the Open Tool for Parameters Optimization (OTPO) [3], which is based on ADCL, to further improve MPI communication performance. We are also looking into automatic tuning of OpenACC-accelerated kernels on the latest NVIDIA GPUs. Encouraging preliminary results will be presented [4].
References:
[1] E. Gabriel, S. Feki, K. Benkert, M. Chaarawi. The Abstract Data and Communication Library, Journal of Algorithms and Computational Technology, Vol. 2, No. 4, pp. 581-600, December 2008.
[2] E. Gabriel, S. Feki, K. Benkert, M. Resch. Towards Performance and Portability through Runtime Adaption for High Performance Computing Applications, Concurrency and Computation: Practice and Experience, Vol. 22, No. 16, pp. 2230-2246, 2010.
[3] M. Chaarawi, J. Squyres, E. Gabriel, S. Feki. A Tool for Optimizing Runtime Parameters of Open MPI, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, Vol. 5205, pp. 210-217, 2008.
[4] S. Feki, S. Siddiqui. Towards Automatic Performance Tuning of OpenACC Accelerated Scientific Applications, NVIDIA GPU Technology Conference, San Jose, California, USA, March 18-21, 2013.
Acknowledgement: This work was done while Hakan Haberdar was an intern and Saber Feki was an employee at TOTAL E&P USA Research & Technology. The authors would like to thank Total for supporting this work, and senior HPC advisor Terrence Liao for his help and advice.

Moderators

Simanti Das

Manager, High Performance Computing Software Development & Support, ExxonMobil Technical Computing Company
Simanti Das is currently the manager of the High Performance Computing software development and support group in the ExxonMobil Upstream IT organization. She is responsible for providing software development, optimization and support for massively parallel seismic imaging technologies for...

Speakers

Saber Feki

Computational Scientist, KAUST Supercomputing Laboratory
Saber Feki received his M.S. and PhD in computer science at the University of Houston in 2008 and 2010, respectively. In 2011, he joined the oil and gas industry with TOTAL as an HPC Research Scientist working on seismic imaging applications using different programming models including...


Thursday March 6, 2014 2:00pm - 2:20pm PST
BRC 282 Rice University 6500 Main Street at University, Houston, TX 77030
