CUDA Memory Types (CPS343 Parallel and High Performance Computing, Spring 2013)

Outline
1. Device memory: CUDA memory types and uses; CUDA type qualifiers; programming scenarios
2. Matrix multiplication: matrix-matrix multiplication; global memory version; shared memory version

Acknowledgements
Some material used in creating these slides comes from NVIDIA's CUDA C Programming Guide (memory.pdf).

Compute Capability 1.x
- Global memory (read and write): slow and uncached; requires sequential and aligned 16-byte reads/writes to be fast (coalesced reads/writes)
- Texture memory (read only): cache optimized for 2D access patterns
- Constant memory: where constants and kernel arguments are stored; slow, but cached
- Shared memory (16 KB per SM): fast, but subject to bank conflicts; permits exchange of data between threads in a block
- Local memory: used for whatever does not fit into registers; part of global memory, so slow and uncached
- Registers: fast, but thread scope only

Compute Capability 2.x
- Global memory (read and write): slow, but cached
- Texture memory (read only): cache optimized for 2D access patterns
- Constant memory: where constants and kernel arguments are stored; special LoaD Uniform (LDU) instruction
- Shared memory (48 KB per SM): fast, but subject to (different) bank conflicts
- Local memory: used for whatever does not fit into registers; part of global memory; slow, but now cached
- Registers: 32768 32-bit registers per SM

Memory limitations
Global memory
- Best if 64 or 128 bytes (16 or 32 single-precision values, or 8 or 16 double-precision values) are read at once
- Coalesced reads/writes: parallel reads/writes from the threads in a block to sequential memory locations, with appropriate alignment
- Otherwise up to 10x slower!
Shared memory
- Fastest if all threads read from the same shared memory location and/or all threads index a shared array via a permutation (e.g. linear reads/writes)
- Otherwise there can be bank conflicts
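The coalescing rule above is easiest to see in code. The sketch below is not from the original slides; the kernel names and the stride parameter are illustrative. In the first kernel consecutive threads touch consecutive addresses, so a warp's accesses coalesce into a few aligned transactions; in the second each thread jumps by a stride, so the accesses scatter across many memory segments.

    // Consecutive threads read consecutive elements, so each warp's
    // accesses fall into a few aligned segments (coalesced).
    __global__ void copy_coalesced( const float* in, float* out, int n )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ( i < n )
            out[i] = in[i];
    }

    // Consecutive threads read elements that are 'stride' apart, so each
    // warp's accesses scatter across many segments and cannot coalesce.
    __global__ void copy_strided( const float* in, float* out, int n, int stride )
    {
        int i = ( blockIdx.x * blockDim.x + threadIdx.x ) * stride;
        if ( i < n )
            out[i] = in[i];
    }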
CUDA type qualifiers

    Variable declaration             Memory     Scope    Lifetime
    int localvar;                    register   thread   thread
    int localarray[10];              local      thread   thread
    __shared__ int sharedvar;        shared     block    block
    __device__ int globalvar;        global     grid     application
    __constant__ int constantvar;    constant   grid     application

- Automatic variables without any qualifier reside in a register...
- ...except arrays, which reside in local memory...
- ...or if there are not enough registers.

CUDA type performance

    Variable declaration             Memory     Performance penalty
    int localvar;                    register   1x
    int localarray[10];              local      100x
    __shared__ int sharedvar;        shared     1x
    __device__ int globalvar;        global     100x
    __constant__ int constantvar;    constant   1x

Scenario 1
Task:
- Load data from global memory
- Do thread-local computations
- Store result to global memory

Solution:
- Load data from global memory (coalesced):

    float a = d_ptr[ blockIdx.x*blockDim.x + threadIdx.x ];

- Do computation with registers:

    float res = f( a );

- Store result (coalesced):

    d_ptr[ blockIdx.x*blockDim.x + threadIdx.x ] = res;

Scenario 2
Task:
- Load data from global memory
- Do block-local computations
- Store result to global memory

Solution:
- Load data to shared memory:

    __shared__ float a_sh[ BLOCK_SIZE ];
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a_sh[ threadIdx.x ] = d_ptr[ idx ];
    __syncthreads();  // important!

- Do computation:

    float res = f( a_sh[ threadIdx.x ] );

- Store result (coalesced):

    d_ptr[ idx ] = res;
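Put together, the Scenario 2 fragments form one short kernel. The sketch below is an illustration rather than slide code: the squaring function standing in for f(), the name scenario2Kernel, and the block size of 256 are assumptions, and BLOCK_SIZE must equal blockDim.x at launch.

    #define BLOCK_SIZE 256   // assumed block size; must equal blockDim.x at launch

    // stand-in for the f() used in the scenarios (assumption: simple squaring)
    __device__ float f( float x ) { return x * x; }

    __global__ void scenario2Kernel( float* d_ptr, int n )
    {
        __shared__ float a_sh[ BLOCK_SIZE ];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        if ( idx < n )
            a_sh[ threadIdx.x ] = d_ptr[ idx ];    // coalesced load into shared memory

        __syncthreads();   // every thread waits until the block has finished loading

        if ( idx < n )
        {
            float res = f( a_sh[ threadIdx.x ] );  // block-local computation
            d_ptr[ idx ] = res;                    // coalesced store
        }
    }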
Matrix-matrix multiplication
Consider our familiar matrix product C = AB:

    for ( i = 0; i < A.height; i++ )
        for ( j = 0; j < B.width; j++ )
        {
            c[i][j] = 0;
            for ( k = 0; k < A.width; k++ )
                c[i][j] += a[i][k] * b[k][j];
        }

- How many times is each element of matrix A accessed? B.width times.
- How many times is each element of matrix B accessed? A.height times.
- How many times is each element of matrix C accessed? A.width times.

Consider an element c[row][col]. There are B.width elements in a row of C and A.height elements in a column of C. To compute each of these elements, we access a row of A and a column of B. We therefore access each row of A B.width times and each column of B A.height times.

Kernel development
- A CUDA kernel to compute the matrix product is straightforward.
- In this simple implementation we assume that our matrices are square, N x N, and stored using linear arrays.
- Access to the (i, j) element is facilitated via the macro

    #define IDX(i,j,n) ((i)*(n)+j)

Matrix multiply: Global memory version kernel

    // matrix-matrix kernel using only global memory
    __global__ void matmulglobal( float* c, float* a, float* b, int N )
    {
        // compute row and column for our matrix element
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;

        if ( col < N && row < N )
        {
            float sum = 0.0;
            for ( int k = 0; k < N; k++ )
                sum += a[ IDX(row,k,N) ] * b[ IDX(k,col,N) ];
            c[ IDX(row,col,N) ] = sum;
        }
    }

Note: sum is accumulated in a register, so this kernel makes only one global memory reference to C per element.
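For completeness, here is a hedged sketch (not from the slides) of the host-side code that allocates device memory, copies the operands, launches matmulglobal, and copies the result back. The 16 x 16 block shape and the function name matmulOnDevice are assumptions.

    #include <cuda_runtime.h>

    // hypothetical host-side driver for the global memory kernel above
    void matmulOnDevice( float* c, const float* a, const float* b, int N )
    {
        size_t bytes = (size_t)N * N * sizeof(float);
        float *d_a, *d_b, *d_c;
        cudaMalloc( (void**)&d_a, bytes );
        cudaMalloc( (void**)&d_b, bytes );
        cudaMalloc( (void**)&d_c, bytes );
        cudaMemcpy( d_a, a, bytes, cudaMemcpyHostToDevice );
        cudaMemcpy( d_b, b, bytes, cudaMemcpyHostToDevice );

        // one thread per element of C, in 16 x 16 blocks (assumed shape)
        dim3 threads( 16, 16 );
        dim3 grid( ( N + threads.x - 1 ) / threads.x,
                   ( N + threads.y - 1 ) / threads.y );
        matmulglobal<<< grid, threads >>>( d_c, d_a, d_b, N );

        cudaMemcpy( c, d_c, bytes, cudaMemcpyDeviceToHost );
        cudaFree( d_a );
        cudaFree( d_b );
        cudaFree( d_c );
    }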
Matrix multiply: Shared memory version kernel (parts 1-3)

    // matrix-matrix kernel using global and shared memory
    __global__ void matmulshared( float* c, float* a, float* b, int N )
    {
        // compute row and column for our matrix element
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;

        // compute the number of blocks we need
        int M = ( N + BlockSize - 1 ) / BlockSize;

        float sum = 0.0;
        for ( int m = 0; m < M; m++ )
        {
            // all threads in the block copy their element from
            // matrix a and matrix b to shared memory
            __shared__ float a_s[ BlockSize ][ BlockSize ];
            __shared__ float b_s[ BlockSize ][ BlockSize ];
            int c = m * BlockSize + threadIdx.x;
            int r = m * BlockSize + threadIdx.y;
            a_s[ threadIdx.y ][ threadIdx.x ] = a[ IDX(row,c,N) ];
            b_s[ threadIdx.y ][ threadIdx.x ] = b[ IDX(r,col,N) ];

            // make sure all threads are finished
            __syncthreads();

            // compute partial sum using the shared memory block;
            // K is the block size except at the right or bottom edge,
            // where we may not have a full block of data
            int K = ( m == M - 1 ? N - m * BlockSize : BlockSize );
            for ( int k = 0; k < K; k++ )
                sum += a_s[ threadIdx.y ][ k ] * b_s[ k ][ threadIdx.x ];
            __syncthreads();
        }

        if ( col < N && row < N )
            c[ IDX(row,col,N) ] = sum;
    }
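BlockSize must be a compile-time constant visible to the kernel (it sizes the shared arrays) and must match the launch configuration. The launch sketch below is not from the slides; the value 16 and the wrapper name are assumptions.

    #define BlockSize 16   // assumed tile width; define before compiling the kernel

    // hypothetical launch wrapper; d_a, d_b, d_c are device pointers
    void launchMatmulShared( float* d_c, float* d_a, float* d_b, int N )
    {
        dim3 threads( BlockSize, BlockSize );
        dim3 grid( ( N + BlockSize - 1 ) / BlockSize,
                   ( N + BlockSize - 1 ) / BlockSize );
        matmulshared<<< grid, threads >>>( d_c, d_a, d_b, N );
    }

With this tiling, each element of A and B is loaded from global memory roughly N / BlockSize times instead of N times, so with BlockSize = 16 the shared memory version issues about one sixteenth of the global memory traffic of the global memory version.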