c++ - Improving asynchronous execution in CUDA


I am writing a program that performs large simulations on the GPU using the CUDA API. In order to accelerate performance, I tried to run my kernels simultaneously and then asynchronously copy the results back to host memory. The code looks roughly like this:

#define NSTREAMS   8
#define BLOCKDIMX  16
#define BLOCKDIMY  16

void domainUpdate(float* domain_cpu,       // pointer to the domain on the host
                  float* domain_gpu,       // pointer to the domain on the device
                  const unsigned int dimX,
                  const unsigned int dimY,
                  const unsigned int dimZ)
{
    dim3 blocks((dimX + BLOCKDIMX - 1) / BLOCKDIMX, (dimY + BLOCKDIMY - 1) / BLOCKDIMY);
    dim3 threads(BLOCKDIMX, BLOCKDIMY);

    for (unsigned int ii = 0; ii < NSTREAMS; ++ii) {
        // each stream updates its own z-slice of the domain
        updateDomain3D<<<blocks, threads, 0, streams[ii]>>>(domain_gpu,
                                                            dimX, 0, dimX - 1,  // dimX, minX, maxX
                                                            dimY, 0, dimY - 1,  // dimY, minY, maxY
                                                            dimZ, dimZ * ii / NSTREAMS, dimZ * (ii + 1) / NSTREAMS - 1);  // dimZ, minZ, maxZ

        unsigned int offset = dimX * dimY * dimZ * ii / NSTREAMS;
        cudaMemcpyAsync(domain_cpu + offset,
                        domain_gpu + offset,
                        sizeof(float) * dimX * dimY * dimZ / NSTREAMS,
                        cudaMemcpyDeviceToHost, streams[ii]);
    }

    cudaDeviceSynchronize();
}
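(For completeness: the `streams` array above is created elsewhere in my code. A minimal sketch of that setup, with illustrative names, might look like the following; note that `cudaMemcpyAsync` only overlaps with kernel execution when the host buffer is pinned.)

```cpp
#include <cuda_runtime.h>

cudaStream_t streams[NSTREAMS];

void setupStreams(float** domain_cpu, size_t nElems)
{
    for (unsigned int ii = 0; ii < NSTREAMS; ++ii)
        cudaStreamCreate(&streams[ii]);

    // Pinned (page-locked) host allocation; with a plain malloc'ed buffer,
    // cudaMemcpyAsync degrades to a staged, effectively synchronous copy.
    cudaMallocHost(domain_cpu, nElems * sizeof(float));
}

void teardownStreams(float* domain_cpu)
{
    for (unsigned int ii = 0; ii < NSTREAMS; ++ii)
        cudaStreamDestroy(streams[ii]);
    cudaFreeHost(domain_cpu);
}
```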

All of this runs in a simple for-loop, looping over the streams (8 in my case) and dividing up the work. This did make things a good deal faster (up to a 30% performance gain), although maybe less than I had hoped. I analysed a typical cycle in NVIDIA's Compute Visual Profiler, and the execution looks like this:

CUDA API trace in Compute Visual Profiler

As can be seen in the picture, the kernels do overlap, although there are never more than two kernels running at the same time. I tried the same thing with different numbers of streams and different sizes of the simulation domain, but this is always the case.

So my question is: is there a way to encourage/force the GPU scheduler to run more than two things at the same time? Or is this a limitation of the GPU device that cannot be influenced from the code?

My system specifications are: 64-bit Windows 7, and a GeForce GTX 670 graphics card (that's Kepler architecture, compute capability 3.0).

Kernels only overlap if the GPU has resources left to run a second kernel. Once the GPU is fully loaded, there is no gain in running more kernels in parallel, so the driver does not do that.
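You can check what the device itself supports by querying its properties; a minimal sketch (requires the CUDA toolkit, and is not taken from the original post):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // concurrentKernels: whether the device can run multiple kernels at once.
    // asyncEngineCount:  number of copy engines; GeForce Kepler boards
    // typically report 1, so two async copies cannot overlap each other.
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    return 0;
}
```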

