Section: New Results
Design of Experiments
Performance engineering of scientific HPC applications requires repeatedly measuring the performance of applications or of computation kernels, which consumes a large amount of time and resources. It is essential to design experiments so as to reduce this cost as much as possible. Our contribution along this axis is twofold: (1) the investigation of sound exploration techniques and (2) the control of experiments to ensure the measurements are as representative as possible of real workloads.
Writing, porting, and optimizing scientific applications makes
autotuning techniques fundamental to lowering the cost of leveraging the
improvements in execution time and power consumption provided by the
latest software and hardware platforms. Despite the need for economy,
most autotuning techniques still require large budgets of costly
experimental measurements to provide good results, while rarely
providing exploitable knowledge after
optimization. In [16], we investigate the use of
Design
of Experiments to propose a user-transparent autotuning technique that
operates under tight budget constraints by significantly reducing the
measurements needed to find good optimizations. Our approach enables
users to make informed decisions on which optimizations to pursue and
when to stop. We present an experimental evaluation of our approach
and show it is capable of leveraging user decisions to find the best
global configuration of a GPU Laplacian kernel using half of the
measurement budget used by other common autotuning techniques. We show
that our approach is also capable of finding substantial speedups.
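The budget-saving idea above can be illustrated with a deliberately simplified sketch: instead of the factorial and screening designs investigated in [16], the snippet below uses plain random sampling over a toy two-parameter GPU-kernel tuning space, with a synthetic cost model standing in for real kernel timings. All names, parameters, and the cost function are hypothetical; the point is only to show how a fixed measurement budget (here, half the space) replaces exhaustive exploration.

```python
import itertools
import random

# Hypothetical tuning space: two kernel parameters (illustrative only).
BLOCK_SIZES = [32, 64, 128, 256]
UNROLL_FACTORS = [1, 2, 4, 8]

def measure(block, unroll):
    """Stand-in for a costly kernel timing run (synthetic cost model)."""
    return abs(block - 128) * 0.01 + abs(unroll - 4) * 0.5 + random.gauss(0, 0.05)

random.seed(0)
space = list(itertools.product(BLOCK_SIZES, UNROLL_FACTORS))

# Exhaustive search: one measurement per configuration (16 runs).
best_exhaustive = min(space, key=lambda cfg: measure(*cfg))

# Budgeted exploration: measure only half the configurations,
# keep the best observed point.
budget = len(space) // 2
sampled = random.sample(space, budget)
best_sampled = min(sampled, key=lambda cfg: measure(*cfg))

print("exhaustive:", best_exhaustive,
      "sampled:", best_sampled,
      "budget:", budget)
```

A real Design of Experiments approach would additionally fit a model to the sampled measurements to predict promising unexplored configurations and to tell the user when further measurements are unlikely to pay off.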
Our second contribution is related to the control of measurements. In [40], we report a surprising observation about the performance of the highly optimized and regular DGEMM function on modern processors. The DGEMM function is a widely used implementation of the matrix product. While the asymptotic complexity of the algorithm depends only on the sizes of the matrices, we show that performance is significantly impacted by the content of the matrices. Although one might expect special values like 1 or 0 to yield specific behavior, we show that arbitrary constant values are no different and that random values incur a significant performance drop. Our experiments suggest that this may be due to bit flips in the CPU causing an energy consumption overhead. This phenomenon underscores the importance of thoroughly randomizing every single parameter of an experiment to avoid bias toward specific behaviors.
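The measurement methodology can be sketched as follows. This toy harness compares a matrix product on constant-filled operands against randomized operands; it is not the experimental setup of [40] (a pure-Python triple loop will not reproduce a hardware-level effect observed on optimized DGEMM), but it shows the shape of the comparison: same sizes, same algorithm, only the operand content changes. The matrix size and repetition counts are arbitrary choices for illustration.

```python
import random
import timeit

N = 64  # small illustrative size; the observed effect concerns large DGEMM calls

def matmul(a, b):
    """Naive triple-loop square-matrix product (stand-in for DGEMM)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            row_b = b[k]
            row_c = c[i]
            for j in range(n):
                row_c[j] += aik * row_b[j]
    return c

def bench(fill):
    """Time the product on operands generated by fill(), best of 3 runs."""
    a = [[fill() for _ in range(N)] for _ in range(N)]
    b = [[fill() for _ in range(N)] for _ in range(N)]
    return min(timeit.repeat(lambda: matmul(a, b), number=1, repeat=3))

t_const = bench(lambda: 1.0)              # constant-filled operands
t_rand = bench(random.random)             # randomized operands
print(f"constant fill: {t_const:.4f}s  random fill: {t_rand:.4f}s")
```

In an actual study one would call the platform's optimized BLAS, pin frequencies and cores, and measure energy as well as time; the takeaway from [40] is that the random-fill case belongs in every such benchmark, since constant inputs can silently flatter the results.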