Section: New Results

Large-scale GPU-centric optimization

Participants: J. Gmys, T. C. Pessoa and N. Melab; external collaborators: M. Mezmaz and D. Tuyttens from University of Mons (Belgium) and F. H. De Carvalho Junior from Universidade Federal do Ceará (Brazil)

Nowadays, accelerator-centric architectures offer orders-of-magnitude performance and energy improvements. Interest in these parallel resources has recently been heightened by the advent of deep learning, which has made them key building blocks of modern supercomputers. In 2018, in collaboration with A. Zomaya (The Univ. of Sydney) and I. Chakroun (IMEC, Leuven), N. Melab guest-edited a special issue on this topic (editorial in [16]). In addition, we have focused on investigating these devices in the context of parallel optimization. Two major contributions are reported below: (1) many-core Branch-and-Bound for GPU accelerators and MIC coprocessors; (2) CUDA Dynamic Parallelism (CDP) for backtracking.

  • Many-core Branch-and-Bound for GPU accelerators and MIC coprocessors. Solving large optimization problems with Branch-and-Bound (B&B) generates a very large pool of subproblems and requires the time-intensive evaluation of their associated lower bounds. Generating and evaluating those subproblems on coprocessors raises several issues, including processor-coprocessor data transfer optimization, vectorization and thread divergence. In [15], [32], we have investigated the offload-based parallel design and implementation of B&B algorithms for coprocessors that address these issues. Two major many-core architectures are considered and compared: Nvidia GPU and Intel MIC. The proposed approaches have been evaluated on the Flow-Shop scheduling problem using two hardware configurations that are equivalent in terms of energy consumption: Nvidia Tesla K40 and Intel Xeon Phi 5110P. The reported results show that the GPU-accelerated approach outperforms the MIC offload-based one, even in its vectorized version. Moreover, vectorization improves the efficiency of the MIC offload-based approach by a factor of two.

  • Dynamic Configuration of CUDA Runtime Variables for CDP-based Divide-and-Conquer Algorithms. CUDA Dynamic Parallelism (CDP) is an extension of the GPGPU programming model proposed to better address irregular applications and recursive patterns of computation. However, processing memory-demanding problems with CDP is not straightforward because of its particular memory organization. In [23] (an extension of [13]), we have proposed an algorithm that addresses this issue by dynamically calculating and configuring the CDP runtime variables and the GPU heap on the basis of an analysis of the partial backtracking tree. We have implemented the algorithm for solving permutation problems and evaluated it on two test cases: N-Queens and the Asymmetric Travelling Salesman Problem. The proposed algorithm allows different CDP-based backtracking algorithms from the literature to solve memory-demanding problems, adapting to the number of recursive kernel generations and the presence of dynamic allocations on the GPU.
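To make the offload pattern of the first contribution concrete, the following is a minimal CPU-only sketch in Python, not the implementation of [15], [32]. It branches a Flow-Shop subproblem into a pool of children and evaluates all their lower bounds in one batch; in an offload-based design, that batch evaluation is the step that would be shipped to the GPU or MIC. The function names, the tiny instance and the simple one-machine bound used here are assumptions chosen for illustration.

```python
from itertools import permutations

def completion_times(p, order):
    """Completion time on each machine after scheduling `order`.
    p[j][k] is the processing time of job j on machine k."""
    c = [0] * len(p[0])
    for j in order:
        c[0] += p[j][0]
        for k in range(1, len(c)):
            c[k] = max(c[k], c[k - 1]) + p[j][k]
    return c

def lower_bound(p, prefix):
    """Simple one-machine bound (illustrative, not the bound of the cited work)."""
    remaining = [j for j in range(len(p)) if j not in prefix]
    c = completion_times(p, prefix)
    if not remaining:
        return c[-1]  # complete schedule: the bound is the exact makespan
    best = 0
    for k in range(len(c)):
        work = sum(p[j][k] for j in remaining)                # load left on machine k
        tail = min(sum(p[j][k + 1:]) for j in remaining)      # shortest way out after k
        best = max(best, c[k] + work + tail)
    return best

def evaluate_pool(p, pool):
    """Batch evaluation: the part an offload design would run on the coprocessor,
    typically with one subproblem per thread."""
    return [lower_bound(p, node) for node in pool]

def solve(p):
    """Depth-first B&B over permutations; returns (makespan, permutation)."""
    n = len(p)
    best_cost, best_perm = float("inf"), ()
    stack = [()]
    while stack:
        node = stack.pop()
        pool = [node + (j,) for j in range(n) if j not in node]  # branching
        for child, lb in zip(pool, evaluate_pool(p, pool)):       # bounding
            if lb >= best_cost:
                continue                                          # pruning
            if len(child) == n:
                best_cost, best_perm = lb, child
            else:
                stack.append(child)
    return best_cost, best_perm

def brute_force_makespan(p):
    """Exhaustive check, usable only on very small instances."""
    return min(completion_times(p, perm)[-1] for perm in permutations(range(len(p))))
```

On a small hypothetical instance such as `p = [[3, 2, 4], [1, 5, 2], [4, 1, 3], [2, 3, 1]]`, `solve(p)` can be cross-checked against `brute_force_makespan(p)`. Batching the bound evaluations per pool is what makes the data transfer and thread divergence issues mentioned above relevant once the pool is moved to a coprocessor.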
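The idea behind the second contribution, sizing the CDP runtime from a partial backtracking tree, can be sketched as follows. This is a rough Python illustration under stated assumptions, not the algorithm of [23]: it enumerates the N-Queens tree down to a few hypothetical cutoff depths (one per kernel generation), counts the surviving nodes, and derives candidate values that a host program could pass to `cudaDeviceSetLimit` for `cudaLimitDevRuntimePendingLaunchCount`, `cudaLimitDevRuntimeSyncDepth` and `cudaLimitMallocHeapSize`. The sizing rules, the `bytes_per_node` parameter and the `slack` factor are assumptions made for illustration.

```python
import math

def is_safe(prefix, col):
    """True if a queen in the next row at `col` attacks none of the queens in `prefix`.
    prefix[r] is the column of the queen placed in row r."""
    row = len(prefix)
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(prefix))

def frontier(n, depth):
    """Partial backtracking: all valid placements of `depth` queens on an n x n board."""
    nodes = [()]
    for _ in range(depth):
        nodes = [node + (col,) for node in nodes for col in range(n) if is_safe(node, col)]
    return nodes

def plan_cdp_config(n, cutoff_depths, bytes_per_node, slack=1.25):
    """Derive candidate runtime limits from frontier sizes (illustrative rules only).

    cutoff_depths: increasing tree depths, one per kernel generation.
    bytes_per_node: assumed device-heap footprint of one stored subproblem.
    slack: safety margin applied to the estimates.
    """
    counts = [len(frontier(n, d)) for d in cutoff_depths]
    # Assumption: every node of a generation except the last launches one child grid.
    launches = sum(counts[:-1])
    # Assumption: nodes of every generation except the first live on the device heap.
    heap_nodes = sum(counts[1:])
    return {
        # candidate for cudaLimitDevRuntimePendingLaunchCount
        "pending_launch_count": math.ceil(launches * slack),
        # candidate for cudaLimitDevRuntimeSyncDepth (one level per generation)
        "nesting_depth": len(cutoff_depths),
        # candidate for cudaLimitMallocHeapSize, in bytes
        "malloc_heap_bytes": math.ceil(heap_nodes * bytes_per_node * slack),
        "frontier_sizes": counts,
    }
```

For example, `plan_cdp_config(12, [2, 4], bytes_per_node=16)` would estimate the limits for a two-generation search on a 12 x 12 board. Because the frontier sizes depend on the instance and on the cutoff depths, recomputing these values per run is what allows the configuration to adapt to the number of kernel generations.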