My previous CUDA Fortran post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses.

The code we wish to optimize is a transpose of a matrix of single-precision values that operates out-of-place, i.e. the input and output are separate arrays in memory. For simplicity of presentation, we'll consider only square matrices whose dimensions are integral multiples of 32 on a side. The entire code is available on Github. It consists of several kernels, as well as host code that performs typical tasks such as allocation, data transfers between host and device, launching and timing the kernels, validating their results, and deallocating host and device memory. In this post I'll only include the kernel code; you can view the rest or try it out on Github.

In addition to performing several different matrix transposes, we run simple matrix copy kernels, because copy performance indicates the performance that we would like the matrix transpose to achieve. All kernels in this study launch blocks of 32×8 threads (TILE_DIM=32, BLOCK_ROWS=8 in the code), and each thread block transposes (or copies) a tile of size 32×32. For both matrix copy and transpose, the relevant performance metric is effective bandwidth, calculated in GB/s by dividing twice the size of the matrix in GB (once for loading the matrix and once for storing it) by the time of execution in seconds. Using a thread block with fewer threads than there are elements in a tile is advantageous for the matrix transpose because each thread transposes four matrix elements, so much of the index calculation cost is amortized over these elements.

The kernels in this example map threads to matrix elements using a Cartesian (x,y) mapping rather than a row/column mapping to simplify the meaning of the components of the automatic variables in CUDA Fortran: threadIdx%x is horizontal and threadIdx%y is vertical. This mapping is up to the programmer; the important thing to remember is that to ensure memory coalescing we want to map the quickest varying component to contiguous elements in memory. In Fortran, contiguous addresses correspond to the first index of a multidimensional array, and threadIdx%x and blockIdx%x vary quickest within blocks and grids, respectively.

Let's start by looking at the matrix copy kernel, copy, sketched below. Each thread computes its matrix indices as x = (blockIdx%x-1) * TILE_DIM + threadIdx%x and y = (blockIdx%y-1) * TILE_DIM + threadIdx%y, and then copies four elements of the matrix in a loop at the end of the routine, because the number of threads in a block is smaller by a factor of four (TILE_DIM/BLOCK_ROWS) than the number of elements in a tile. Note also that TILE_DIM must be used in the calculation of the matrix index y rather than BLOCK_ROWS or blockDim%y. The loop iterates over the second dimension and not the first so that contiguous threads load and store contiguous data, and all reads from idata and writes to odata are coalesced.
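The post quotes only the kernel header and the two index calculations, so the following is a minimal sketch of how they might fit together, assuming a hypothetical module (transpose_kernels here) that defines TILE_DIM, BLOCK_ROWS, and matrix extents nx and ny; the complete, authoritative code is in the Github repository.

```fortran
module transpose_kernels
  implicit none
  ! TILE_DIM and BLOCK_ROWS come from the text; nx, ny, and the module
  ! itself are assumptions made so the sketch is self-contained.
  integer, parameter :: TILE_DIM = 32, BLOCK_ROWS = 8
  integer, parameter :: nx = 1024, ny = 1024
contains
  attributes(global) subroutine copy(odata, idata)
    real, intent(out) :: odata(nx,ny)
    real, intent(in)  :: idata(nx,ny)
    integer :: x, y, j

    ! Cartesian (x,y) index of the first element handled by this thread;
    ! TILE_DIM, not BLOCK_ROWS or blockDim%y, is used to compute y.
    x = (blockIdx%x-1) * TILE_DIM + threadIdx%x
    y = (blockIdx%y-1) * TILE_DIM + threadIdx%y

    ! A 32x8 thread block covers a 32x32 tile: each thread copies four
    ! elements, stepping the second index by BLOCK_ROWS, so contiguous
    ! threads touch contiguous addresses and all accesses are coalesced.
    do j = 0, TILE_DIM-1, BLOCK_ROWS
       odata(x,y+j) = idata(x,y+j)
    end do
  end subroutine copy
end module transpose_kernels
```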
Our first transpose kernel, transposeNaive, looks very similar to the copy kernel; the only difference is that the indices for odata are swapped, as the sketch below shows. In transposeNaive the reads from idata are coalesced as in the copy kernel, but for our 1024×1024 test matrix the writes to odata have a stride of 1024 elements, or 4096 bytes, between contiguous threads.
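A matching sketch of transposeNaive, again filling in everything beyond the quoted header and the swapped odata indices as an assumption; it would sit alongside copy in the contains section of the module sketched above.

```fortran
  attributes(global) subroutine transposeNaive(odata, idata)
    real, intent(out) :: odata(ny,nx)
    real, intent(in)  :: idata(nx,ny)
    integer :: x, y, j

    x = (blockIdx%x-1) * TILE_DIM + threadIdx%x
    y = (blockIdx%y-1) * TILE_DIM + threadIdx%y

    ! Reads from idata(x,y+j) are contiguous across threadIdx%x and so
    ! remain coalesced; writes to odata(y+j,x) are separated by the
    ! leading dimension of odata (1024 elements, or 4096 bytes, for the
    ! 1024x1024 test matrix), i.e. they are strided rather than coalesced.
    do j = 0, TILE_DIM-1, BLOCK_ROWS
       odata(y+j,x) = idata(x,y+j)
    end do
  end subroutine transposeNaive
```

On the host side, either kernel would be launched over the whole matrix with a grid of dim3(nx/TILE_DIM, ny/TILE_DIM, 1) thread blocks of dim3(TILE_DIM, BLOCK_ROWS, 1) threads each, e.g. call transposeNaive<<<grid, tBlock>>>(odata_d, idata_d), where grid, tBlock, odata_d, and idata_d are names assumed here for the launch configuration and the device arrays.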
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran.