Thursday, March 6 • 11:55am - 12:00pm
Lightning Talk: Kalray MPPA-256 scalable compute cartridge: an efficient architecture applied to Oil & Gas HPC, Benoît Ganne, Kalray SA




Kalray MPPA-256 scalable compute cartridge: an efficient architecture applied to Oil & Gas HPC

Benoît Ganne, Christian Chabrerie, Thierry Strudel
benoit.ganne@kalray.eu, christian.chabrerie@kalray.eu, thierry.strudel@kalray.eu

Introduction

Kalray MPPA-256 is a manycore, low-power, distributed-memory supercomputer-on-a-chip (SCoC). It is composed of 16 clusters, each with 17 cores (16 dedicated computational cores and 1 control core) sharing 2MB of SRAM, and of several Input/Output (I/O) capabilities controlled by 4 SMP quad-cores, including 2 PCIe Gen3 controllers, 2 DDR3 ECC 1600 controllers and 2 40Gbps Ethernet (GbE) controllers. Each core implements the Kalray-1 VLIW architecture with a dedicated IEEE-754 single precision (SP) and double precision (DP) floating point unit (FPU). The 16 clusters and the 4 SMP quad-cores are interconnected through a high-bandwidth, low-latency network-on-a-chip (NoC) using a 2D-torus topology.

In addition to the standard I/O capabilities, a single MPPA-256 can connect its NoC to 4 neighboring MPPA-256 using the Kalray NoCX interconnect. This makes it possible to present to the programmer a single virtual manycore composed of multiple MPPA-256. Multiple MPPA-256 can be traversed in each direction transparently. The MPPA-256 topology is depicted in figure 1.

Figure 1: Kalray MPPA-256 topology

This architecture can be used as a building block for highly energy-efficient supercomputers: a cartridge with 4 MPPA-256, as depicted in figure 2. The 4 MPPA-256 are interconnected on-board using NoCX, presenting to the programmer a single manycore of 64 clusters (1024 computational cores) and 16 SMP quad-cores. The boards can be further interconnected through NoCX with external connectors or using a chassis interconnect to build an even bigger virtual manycore.
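The resource counts quoted in the introduction follow from per-chip figures multiplied by the number of interconnected MPPA-256. The following is a minimal Python sketch of that arithmetic, not a Kalray API: the names VirtualManycore and virtual_manycore are hypothetical, and only the per-chip constants come from the text.

```python
from dataclasses import dataclass

# Per-chip figures taken from the introduction.
CLUSTERS_PER_CHIP = 16
COMPUTE_CORES_PER_CLUSTER = 16
CONTROL_CORES_PER_CLUSTER = 1
SMP_QUAD_CORES_PER_CHIP = 4
SRAM_MB_PER_CLUSTER = 2


@dataclass(frozen=True)
class VirtualManycore:
    """Aggregate resources seen by the programmer (hypothetical helper type)."""
    chips: int
    clusters: int
    compute_cores: int
    control_cores: int
    smp_quad_cores: int
    cluster_sram_mb: int


def virtual_manycore(chips: int) -> VirtualManycore:
    """Sum the resources of `chips` MPPA-256 joined through NoCX."""
    if chips < 1:
        raise ValueError("at least one MPPA-256 is required")
    clusters = chips * CLUSTERS_PER_CHIP
    return VirtualManycore(
        chips=chips,
        clusters=clusters,
        compute_cores=clusters * COMPUTE_CORES_PER_CLUSTER,
        control_cores=clusters * CONTROL_CORES_PER_CLUSTER,
        smp_quad_cores=chips * SMP_QUAD_CORES_PER_CHIP,
        cluster_sram_mb=clusters * SRAM_MB_PER_CLUSTER,
    )


if __name__ == "__main__":
    # A cartridge is 4 MPPA-256: 64 clusters, 1024 computational cores,
    # 16 SMP quad-cores, matching the figures given in the text.
    print(virtual_manycore(4))
```

Calling virtual_manycore with a larger chip count models several boards joined through external NoCX connectors, under the assumption that every chip contributes the same resources.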
Programming model

The Kalray MPPA-256 supports C, C++ and Fortran with different programming models and Application Programming Interfaces (APIs), and can be programmed with MPI and OpenMP. Each MPPA-256 cluster is an MPI process, and within this MPI process OpenMP can be used to exploit the 16 computational cores.

Due to the distributed and asymmetric nature of the MPPA-256, the best programming model for Oil & Gas algorithms such as Reverse Time Migration (RTM) or Full Waveform Inversion (FWI) is a double-buffering model (application pipeline of depth 2), as depicted in figure 3: each cluster divides its 2MB SRAM space in two, so that while the 16 computational cores are working on one SRAM half, the next data can be pushed by DMA to the other half.

System architecture

The seismic data are stored on storage servers and sent through multiple 10GbE links to the DDR of the Kalray MPPA-256 scalable compute cartridges during the initialization phase. The cartridges can then be partitioned or paired as needed, depending on the workload memory size and required computational power. All the computation is then done locally, with frontier exchanges happening between the DDR of the cartridges involved.

For example, a typical RTM shot might be computed using a single Kalray MPPA-256 scalable compute cartridge with 32GB of DDR (4GB per DDR interface, 8GB per MPPA). If the shot exceeds this amount of memory, multiple cartridges can be paired together; conversely, multiple shots can be computed on a single cartridge if they fit in memory. During the computation phase, snapshots can be sent back to the storage server through the 10GbE links.

First experiments

We ran typical HPC workloads on a single Kalray MPPA-256 scalable compute cartridge prototype based on 4 interconnected MPPA-256, each using 4GB of DDR (2GB per DDR interface).
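Returning briefly to the double-buffering model described under Programming model, the idea can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: it is written in Python rather than with the MPPA toolchain, a background thread stands in for the DMA engine, and dma_load and process_tiles are hypothetical names, not Kalray APIs. Only the 2MB SRAM size and the split into two halves come from the text.

```python
from concurrent.futures import ThreadPoolExecutor

SRAM_BYTES = 2 * 1024 * 1024   # 2MB of cluster SRAM (from the text)
HALF_BYTES = SRAM_BYTES // 2   # each half holds one tile of data


def dma_load(tile, half):
    """Hypothetical stand-in for a DMA push: copy one tile into an SRAM half."""
    if len(tile) > len(half):
        raise ValueError("tile does not fit in one SRAM half")
    half[:len(tile)] = tile
    return len(tile)


def process_tiles(tiles, compute):
    """Apply compute() to each tile with a pipeline of depth 2."""
    halves = [bytearray(HALF_BYTES), bytearray(HALF_BYTES)]
    results = []
    if not tiles:
        return results
    # One worker thread plays the role of the DMA engine.
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(dma_load, tiles[0], halves[0])
        for i in range(len(tiles)):
            size = pending.result()  # wait until tile i is resident
            if i + 1 < len(tiles):
                # Start loading tile i+1 into the other half while the
                # cores work on the current half.
                pending = dma.submit(
                    dma_load, tiles[i + 1], halves[(i + 1) % 2]
                )
            results.append(compute(halves[i % 2][:size]))
    return results


if __name__ == "__main__":
    tiles = [bytes([1]) * 10, bytes([2]) * 20, bytes([3]) * 5]
    print(process_tiles(tiles, sum))
```

The transfer of tile i+1 is only issued after the computation on the half it will later overwrite has finished, so the two halves are never read and written at the same time. On real hardware the computation step would be the OpenMP region running on the 16 computational cores.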
The achieved GFLOPS/W in single precision and the scalability are measured for each experiment. The GFLOPS are measured using hardware performance counters and the power consumption is measured using an on-board power measurement circuit.

The first experiment is a general matrix multiply algorithm (GEMM[5]) on a 4096x4096 matrix, scaling from a single cluster on a single MPPA-256 to the 64 clusters available on the 4 MPPA-256. The results are presented in figure 4. The following table compares the GFLOPS/W between different architectures[1][2]:

Platform               GFLOPS   Power (W)   GFLOPS/W
nVidia M2090 Fermi        780         225        3.5
Intel i7-3820             209          95        2.2
DSP: TI C6678              93          10        9.3
MPPA-256                  123          10       11.9
4x MPPA-256               433          41       10.5

The Intel results are measured using OpenBLAS[6] on the MPPA developer workstation host CPU. The scalability is nearly linear, demonstrating the architecture scalability, and the GFLOPS/W is among the best available today.

The second experiment is a complex fast Fourier transform (FFT) of 1K points to 256K points, scaling from a single cluster to the 16 clusters available on a single MPPA. The results are presented in figure 5. The scalability is nearly linear, once again demonstrating the architecture scalability. More experiments will be done, to scale up to 4 MPPA-256 and to compare to other architectures. Results using benchmarks more relevant for Oil & Gas HPC, such as 3-dimensional finite difference (3DFD) algorithms, will be shown.

Conclusion

We showed that the Kalray MPPA-256 scalable compute cartridge exposes 2 key characteristics to support future Oil & Gas Exascale HPC:

Scalability: a system can be built as a stack of well-known, simpler systems.
Power efficiency: Exascale systems will need more than 50GFLOPS/W[4].

Still, the Kalray MPPA-256 scalable compute cartridge is only a first step toward Oil & Gas Exascale HPC.
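To put the GEMM table in perspective against the 50GFLOPS/W figure, the following Python sketch computes two derived quantities: the fraction of ideal linear scaling retained from 1 to 4 MPPA-256, and the factor by which efficiency would still have to improve. The function names are hypothetical, the inputs are the rounded values printed in the table, and it is assumed that the single MPPA-256 row corresponds to all 16 clusters of one chip.

```python
EXASCALE_TARGET_GFLOPS_PER_W = 50.0  # lower bound quoted in the text [4]


def scaling_efficiency(base_gflops, scaled_gflops, scale_factor):
    """Fraction of ideal linear speed-up retained when scaling by scale_factor."""
    if base_gflops <= 0 or scale_factor <= 0:
        raise ValueError("base_gflops and scale_factor must be positive")
    return scaled_gflops / (base_gflops * scale_factor)


def efficiency_gap(achieved_gflops_per_w,
                   target_gflops_per_w=EXASCALE_TARGET_GFLOPS_PER_W):
    """Factor by which GFLOPS/W must still improve to reach the target."""
    if achieved_gflops_per_w <= 0:
        raise ValueError("achieved_gflops_per_w must be positive")
    return target_gflops_per_w / achieved_gflops_per_w


if __name__ == "__main__":
    # Table values: 123 GFLOPS on one MPPA-256, 433 GFLOPS on four.
    print(f"scaling efficiency: {scaling_efficiency(123, 433, 4):.2f}")
    # Table value: 10.5 GFLOPS/W for the 4x MPPA-256 cartridge.
    print(f"gap to target: {efficiency_gap(10.5):.1f}x")
```

With the table's rounded inputs this gives a scaling efficiency of about 0.88 and a gap of roughly 4.8x to the 50GFLOPS/W mark; both numbers inherit the rounding of the published figures.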
More power efficiency will be needed in the coming years, and the authors think that the model of simple, power-efficient building blocks, such as scalable interconnections of multiple manycores[3], will remain. The distributed-memory nature of this architecture guarantees its scalability, and as such the system can be precisely sized and expanded as needed. This paves the way to a new paradigm for scalable software-defined systems.

References

[1] NVIDIA, NVIDIA CUBLAS performance, available at https://developer.nvidia.com/cublas.
[2] Francisco D. Igual, Murtaza Ali, Arnon Friedmann, Eric Stotzer, Timothy Wentz and Robert van de Geijn, Unleashing DSPs for General-Purpose HPC, available at http://www.cs.utexas.edu/users/flame/pubs/FLAWN61.pdf.
[3] US National Academy of Science, The New Global Ecosystem in Advanced Computing: Implications for U.S. Competitiveness and National Security (2012).
[4] DARPA, Power Efficiency Revolution For Embedded Computing Technologies (PERFECT) program.
[5] National Science Foundation, available at http://www.netlib.org/blas/.
[6] OpenBLAS, available at http://www.openblas.net/.


Thursday March 6, 2014 11:55am - 12:00pm
BRC 103 Rice University 6500 Main Street at University, Houston, TX 77030
