This workshop aims at bringing together researchers and practitioners from academia and industry to present and discuss ongoing projects that take, or can take, advantage of hybrid computing systems. This includes computational systems that combine traditional CPUs (either multi-core or many-core) with GPUs, FPGAs, and even ASICs.
The workshop will be co-located with the International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) and will take place on Friday, Oct. 20, in Room #5, in two sessions of two hours each, from 8:00 to 10:00 and from 14:30 to 16:30. This dual-session format will allow workshop participants to attend the SBAC-PAD invited-speaker sessions.
Workshop participation requires registration for the SBAC-PAD symposium. Registration fees for Brazilian researchers can be found here.
Lucas Mello Schnorr, Arnaud Legrand, Samuel Thibault, Luka Stanisic, Vinícius Garcia Pinto, Vincent Danjean (Univ. Federal do Rio Grande do Sul)
Programming paradigms in High-Performance Computing have been shifting towards task-based models, which readily adapt to heterogeneous and scalable supercomputers. Detecting performance outliers in such environments is particularly difficult because architecture heterogeneity and variability must be taken into account. In this work we show how we have employed a very simple performance model to highlight task outliers of the well-known tiled dense Cholesky factorization running on top of StarPU-MPI, a runtime for task-based applications. This work has been integrated into our visualization framework based on the R programming language and the tidyverse meta-package. Experiments were conducted in a controlled environment using the Chifflet cluster in Lille, part of the Grid’5000 infrastructure, with up to eight nodes, each equipped with 28 cores and two GPUs. The preliminary results, derived from the collected traces, indicate that explicit binding of the MPI and GPU-managing threads within StarPU alleviates the outlier problem and leads to performance gains.
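The simple performance model used in this work is not reproduced here; as a generic illustration of the idea, the hypothetical sketch below flags a task duration as an outlier when it exceeds the upper quartile of its group of comparable tasks by more than 1.5 times the interquartile range.

```cpp
// Hypothetical outlier rule (not the model from the talk): within one group of
// comparable task durations, flag values above Q3 + 1.5 * IQR.
#include <algorithm>
#include <vector>

double quantile(std::vector<double> v, double q)
{
    // Nearest-rank quantile on a copy of the data; adequate for a sketch.
    std::sort(v.begin(), v.end());
    size_t idx = static_cast<size_t>(q * (v.size() - 1));
    return v[idx];
}

std::vector<bool> flag_outliers(const std::vector<double> &durations)
{
    double q1 = quantile(durations, 0.25);
    double q3 = quantile(durations, 0.75);
    double cutoff = q3 + 1.5 * (q3 - q1);
    std::vector<bool> out;
    out.reserve(durations.size());
    for (double d : durations)
        out.push_back(d > cutoff);  // true marks a task to highlight in the trace
    return out;
}
```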
João V. Lima (Univ. Federal de Santa Maria) jvlima@inf.ufsm.br
The task-based parallelism model allows applications to relax synchronization and efficiently exploit parallelism on modern CPU-GPU platforms. This talk presents examples of scientific applications designed with task-based programming models in order to exploit CPU-GPU platforms.
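As a minimal illustration of what task-based code looks like, the sketch below uses StarPU's C task-insertion API (assuming StarPU 1.2 or later) to run a single hypothetical CPU codelet on a registered vector; it is not one of the applications discussed in the talk.

```cpp
// Minimal StarPU sketch (assumes StarPU >= 1.2): one CPU codelet applied to a
// vector registered with the runtime. Hypothetical example, not talk material.
#include <starpu.h>

static void scale_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    // The runtime hands the task its data through the registered handle.
    double *v = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    for (unsigned i = 0; i < n; i++)
        v[i] *= 2.0;
}

int main(void)
{
    const unsigned n = 1 << 20;
    double *v = new double[n];
    for (unsigned i = 0; i < n; i++) v[i] = 1.0;

    starpu_init(NULL);

    // Describe the kernel: one buffer, accessed read-write, CPU implementation.
    struct starpu_codelet scale_cl;
    starpu_codelet_init(&scale_cl);
    scale_cl.cpu_funcs[0] = scale_cpu;
    scale_cl.nbuffers = 1;
    scale_cl.modes[0] = STARPU_RW;

    starpu_data_handle_t vh;
    starpu_vector_data_register(&vh, STARPU_MAIN_RAM, (uintptr_t)v, n, sizeof(double));

    // The runtime schedules the task on any worker providing an implementation.
    starpu_task_insert(&scale_cl, STARPU_RW, vh, 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(vh);
    starpu_shutdown();
    delete[] v;
    return 0;
}
```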
Sabrina B. M. Sambatti (INPE, São José dos Campos) sabrinabms@gmail.com
Numerical prediction can be performed by time integration of mathematical equations, and the procedure that combines observations with the model background is called data assimilation. Here, an artificial neural network (NN) is applied to carry out data assimilation for a shallow-water system representing ocean dynamics, with the NN designed to emulate the Kalman filter. The data assimilation system is implemented on a hybrid computer system: the NN is implemented on an FPGA, while the shallow-water model is coded using finite differences and executed on the CPU. Data assimilation using the hardware device shows performance similar to that of the software implementation.
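The CPU side of such a setup can be sketched with a small finite-difference routine; the code below is a schematic 1D linearized shallow-water step (Lax-Friedrichs) and only a stand-in for the actual model, and the NN/FPGA emulation of the Kalman filter is not shown.

```cpp
// Schematic 1D linearized shallow-water step (Lax-Friedrichs scheme for
// h_t + H u_x = 0, u_t + g h_x = 0). Illustrative stand-in only; not the
// model used in the talk.
#include <vector>

void shallow_water_step(std::vector<double> &h, std::vector<double> &u,
                        double H, double g, double dt, double dx)
{
    const size_t n = h.size();
    std::vector<double> h_new(h), u_new(u);  // boundaries kept from old step
    for (size_t i = 1; i + 1 < n; ++i) {
        h_new[i] = 0.5 * (h[i + 1] + h[i - 1])
                 - dt * H * (u[i + 1] - u[i - 1]) / (2.0 * dx);
        u_new[i] = 0.5 * (u[i + 1] + u[i - 1])
                 - dt * g * (h[i + 1] - h[i - 1]) / (2.0 * dx);
    }
    h.swap(h_new);
    u.swap(u_new);
}
```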
Haroldo F. de Campos Velho (INPE, São José dos Campos) haroldo.camposvelho@inpe.br
Unmanned Aerial Vehicles (UAVs) can perform autonomous navigation using an Inertial Navigation System (INS) associated with a Global Navigation Satellite System (GNSS). However, for critical missions where the GNSS signal may disappear, other methodologies have been developed. One scheme for UAV positioning uses image processing: by comparing a georeferenced satellite image with an image taken by the UAV, it is possible to identify the UAV position. After extracting edges from both images and correlating them, the UAV position is determined. The pre-processing (image filtering, scaling, and rotation) and the correlation step are done on the CPU, while the edge extraction is performed by an optimal supervised neural network. This hybrid computation has been employed with a reduction in computing time and power consumption.
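As a rough, hypothetical illustration of the correlation step (not the actual system), the sketch below slides an edge-extracted UAV patch over a georeferenced reference edge map and keeps the offset with the highest normalized cross-correlation.

```cpp
// Hypothetical sketch of the matching step: slide an edge-extracted UAV patch
// over a georeferenced edge map and keep the offset with the highest
// normalized cross-correlation. Illustrative only.
#include <cmath>
#include <utility>
#include <vector>

struct Image {
    int w, h;
    std::vector<float> px;  // row-major, size w * h
    float at(int x, int y) const { return px[y * w + x]; }
};

std::pair<int, int> best_match(const Image &ref, const Image &patch)
{
    std::pair<int, int> best{0, 0};
    double best_score = -2.0;
    for (int oy = 0; oy + patch.h <= ref.h; ++oy)
        for (int ox = 0; ox + patch.w <= ref.w; ++ox) {
            double num = 0.0, den_r = 0.0, den_p = 0.0;
            for (int y = 0; y < patch.h; ++y)
                for (int x = 0; x < patch.w; ++x) {
                    double r = ref.at(ox + x, oy + y), p = patch.at(x, y);
                    num += r * p; den_r += r * r; den_p += p * p;
                }
            double score = num / (std::sqrt(den_r * den_p) + 1e-12);
            if (score > best_score) { best_score = score; best = {ox, oy}; }
        }
    return best;  // offset of the patch inside the reference image
}
```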
Víctor Martínez (1), Fabrice Dupros (2), Philippe O. A. Navaux (1) (1. Universidade Federal do Rio Grande do Sul, Brazil, 2. BRGM, France)
Simulation of seismic wave propagation is a crucial tool in geophysics for efficient strong-motion analysis and risk mitigation. Because of its simplicity and numerical efficiency, the finite-differences method is one of the standard techniques used to solve the elastodynamics equation. We analyze two implementations of the finite-differences discretization of the elastodynamics equation. We use Ondes3D, a seismic model with a standard implementation in CUDA and a task-based implementation in StarPU. First, we analyze the load of the standard implementation on a cluster of Graphics Processing Units (GPUs) to find which factors affect the performance of hybrid architectures as the number of GPUs increases. Second, we compare the performance obtained with the classical CPU-only or GPU-only versions and the impact of load balancers on the heterogeneous implementation. Results demonstrate significant speedups in comparison with the best implementation suitable for homogeneous cores. Finally, we analyze the use of a low-power manycore architecture, the NVIDIA Jetson TK1 board. Considering energy-to-solution metrics, the NVIDIA Jetson platform provides significant improvements in comparison with standard hybrid CPU+GPU platforms.
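For readers unfamiliar with this class of codes, the sketch below shows a much-simplified CUDA finite-difference kernel (a second-order acoustic update rather than the full elastodynamics system); it is illustrative only and is not taken from Ondes3D.

```cuda
// Much-simplified 3D finite-difference kernel: second-order acoustic wave
// update p_next = 2*p_cur - p_prev + (c^2 dt^2 / dx^2) * laplacian(p_cur).
// The elastodynamics case in the talk uses the same stencil structure on
// velocity and stress fields; this is only an illustration, not Ondes3D code.
__global__ void fd_step(const float *p_prev, const float *p_cur, float *p_next,
                        int nx, int ny, int nz, float c2dt2_over_dx2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || j < 1 || k < 1 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1)
        return;

    size_t plane = (size_t)nx * ny;
    size_t idx = (size_t)k * plane + (size_t)j * nx + i;

    float lap = p_cur[idx - 1] + p_cur[idx + 1]
              + p_cur[idx - nx] + p_cur[idx + nx]
              + p_cur[idx - plane] + p_cur[idx + plane]
              - 6.0f * p_cur[idx];

    p_next[idx] = 2.0f * p_cur[idx] - p_prev[idx] + c2dt2_over_dx2 * lap;
}
```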
Eduardo Barbosa, Silvio Fernandes and Guido Araujo (UniCamp)
Data structures are important constructs used in all computer programs. Accelerating data-structure operations is a relevant performance goal that can benefit most applications. This talk will present a set of new hardware engines designed with the goal of accelerating STL-based data-structure operations on the Intel HARP2 architecture.
Jeronimo Penha, Jansen Silva and Lucas Bragança (Univ. Federal de Viçosa)
This presentation introduces a Coarse-Grained Reconfigurable Architecture (CGRA) as an overlay approach to allow fast dynamic programming on heterogeneous CPU-FPGA platforms. Our proposal was validated on an Intel-Altera cluster node consisting of a 10-core Xeon superscalar multiprocessor attached to a Stratix V FPGA through a coherent shared-memory interface. The CGRA consists of 16 functional units and can be configured from high-level languages such as Java or C/C++.
Ricardo Ferreira (Univ. Federal de Viçosa) ricardo@ufv.br
Recent advances in Systems Biology pose computational challenges that surpass the capabilities of current computing platforms based on conventional CPUs and GPUs. A particular challenge in this regard is simulating the dynamics of Gene Regulatory Networks (GRNs), because the number of possible network states, and thus the required computational time, grows exponentially with the number of network components. FPGA-based accelerators appear as promising alternatives to simulations on conventional software platforms. We present a high-level framework that takes advantage of hardware acceleration for the simulation of GRNs without compromising flexibility.
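As a hypothetical illustration of why the problem explodes combinatorially, the sketch below exhaustively advances every state of a toy three-gene synchronous Boolean network; with N genes there are 2^N such states, which is exactly what motivates hardware acceleration.

```cpp
// Exhaustive simulation of a toy synchronous Boolean gene regulatory network:
// every one of the 2^N states is advanced once, so the work (and the appeal of
// FPGA acceleration) grows exponentially with N. Hypothetical 3-gene network.
#include <cstdint>
#include <cstdio>

constexpr int N = 3;  // number of genes

// Synchronous update rule for the toy network:
//   g0' = g1 AND g2,  g1' = NOT g0,  g2' = g0 OR g1
uint32_t step(uint32_t s)
{
    bool g0 = s & 1, g1 = (s >> 1) & 1, g2 = (s >> 2) & 1;
    bool n0 = g1 && g2, n1 = !g0, n2 = g0 || g1;
    return (uint32_t)n0 | ((uint32_t)n1 << 1) | ((uint32_t)n2 << 2);
}

int main()
{
    // Enumerate all 2^N states and print the successor of each one.
    for (uint32_t s = 0; s < (1u << N); ++s)
        std::printf("state %u -> %u\n", s, step(s));
    return 0;
}
```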
Ciro Ceissler, Ramon Nepomuceno and Guido Araujo (UniCamp)
The computing industry has recently proposed the usage of FPGAs as a way to improve energy efficiency in modern cloud clusters. Unfortunately, using such FPGA clusters is a very hard and complex task. In this talk we propose a novel and simple mechanism to offload computation to the FPGAs available in the Intel HARP2 architecture. This is done by extending OpenMP directives in such a way that the FPGA becomes just another OpenMP acceleration device that can be used directly from any user program.
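The directive extension itself is part of the talk and is not reproduced here; as an analogy, standard OpenMP 4.5 target offload, where the FPGA would simply appear as another device, looks like the sketch below.

```cpp
// Standard OpenMP 4.5 target offload, shown only as an analogy for the
// FPGA-offload directives proposed in the talk (which are not reproduced here).
#include <cstdio>

int main()
{
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20], c[1 << 20];
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // In the proposal, the FPGA would appear as just another OpenMP device.
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    std::printf("c[0] = %f\n", c[0]);
    return 0;
}
```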
Rommel A. Q. Cruz, Lucia Drummond, Esteban Clua (Universidade Federal Fluminense) and Cristiana Bentes (Univ. do Estado do Rio de Janeiro) romeluko@gmail.com
Nowadays, GPUs are continuously becoming more powerful processing devices, to the extent that many CUDA kernels fail to take advantage of all this computational power. Modern NVIDIA microarchitectures support concurrent kernel execution by default, in the hope of achieving better overall utilization of the GPU. However, not every subset of kernels can run efficiently in a concurrent way, because they share, and potentially compete for, several resources. We propose a slowdown estimation model and present preliminary results, obtained on both synthetic and real-world applications, which demonstrate that interference between concurrent kernels is a critical issue and that these interactions must be taken into account when choosing an appropriate scheduling based on the kernels' resource usage.
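Concurrent kernels are typically expressed with CUDA streams, as in the sketch below; the slowdown-estimation model proposed in the talk is not reproduced, and the example only shows the setting in which inter-kernel interference arises.

```cuda
// Concurrent kernel execution via CUDA streams: two independent kernels are
// launched into different streams so the hardware may overlap them, which is
// the setting in which inter-kernel interference arises. Illustrative only;
// the slowdown-estimation model from the talk is not shown.
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 22;
    float *x, *y, *u, *v;  // device buffers (initialization omitted for brevity)
    cudaMalloc(&x, n * sizeof(float)); cudaMalloc(&y, n * sizeof(float));
    cudaMalloc(&u, n * sizeof(float)); cudaMalloc(&v, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernels in different streams are eligible for concurrent execution.
    saxpy<<<(n + 255) / 256, 256, 0, s1>>>(2.0f, x, y, n);
    saxpy<<<(n + 255) / 256, 256, 0, s2>>>(3.0f, u, v, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(x); cudaFree(y); cudaFree(u); cudaFree(v);
    return 0;
}
```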
Fredy Alves (Univ. Federal de Viçosa) fredyamalves1@gmail.com
Collision detection algorithms are used to detect when virtual objects collide with one another and to calculate the results of these collisions. These algorithms typically involve critical real-time calculations needed for applications such as simulation, tolerance checking, and video games. In this presentation, we present an implementation of a Sphere Collision Detection Algorithm on the Intel Heterogeneous CPU-FPGA Platform HARP. The FPGA is used to accelerate a particular collision stage as part of a complete collision detection pipeline on a real system, to demonstrate how collision detection can benefit from co-processing.
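The elementary test behind sphere collision detection is simple to state: two spheres collide when the distance between their centres is at most the sum of their radii. The hypothetical host-side sketch below shows this test; the FPGA pipeline from the talk is not reproduced.

```cpp
// Sphere-sphere collision test: two spheres collide when the distance between
// their centres is at most the sum of their radii. Hypothetical sketch of the
// elementary test only; the HARP FPGA pipeline is not shown.
struct Sphere { float x, y, z, r; };

bool collides(const Sphere &a, const Sphere &b)
{
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    float dist2 = dx * dx + dy * dy + dz * dz;
    float rsum = a.r + b.r;
    // Compare squared distances to avoid the square root.
    return dist2 <= rsum * rsum;
}
```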
Dr. Pedro C. Diniz (Instituto de Engenharia de Sistemas e Computadores - ID, Lisboa, Portugal)