Abstract
General matrix multiplication (GEMM) is one of the most widely used algorithms in fields such as deep learning, astrophysics, signal processing, and advanced physical analysis. It is especially important in deep learning, and in convolutional neural networks in particular, because many of their computations are converted into matrix multiplications to exploit the parallel processing power of GPUs. However, the converted matrices are generally too small to fully occupy the GPU. In this paper, we focus on the impact of GEMM on deep learning and propose a framework that computes a batch of GEMMs in a single kernel function to increase GPU occupancy. We design a suite of tiling strategies for batches of small matrices with variable sizes. The tiling strategy for each GEMM is selected according to its kernel occupancy, so that it adapts to different matrix sizes and GPU architectures. As a representative case, we implement GoogLeNet using MIOpen and integrate the batched GEMM framework into it. Experimental results show that, compared with MAGMA, GoogLeNet optimized with our framework achieves speedups of 2.60x and 2.79x in elapsed time on AMD Radeon Instinct MI50 and MI100 GPUs, respectively.
