Improving asynchronous execution in CUDA
I am writing a program that performs large simulations on the GPU using the CUDA API. In order to accelerate performance, I tried to run the kernels simultaneously and then asynchronously copy the results back to host memory. The code looks roughly like this:
    #define NSTREAMS  8
    #define BLOCKDIMX 16
    #define BLOCKDIMY 16

    // streams[] is assumed to be created elsewhere with cudaStreamCreate
    void domainUpdate(float* domain_cpu,       // pointer to domain on host
                      float* domain_gpu,       // pointer to domain on device
                      const unsigned int dimx,
                      const unsigned int dimy,
                      const unsigned int dimz)
    {
        dim3 blocks((dimx + BLOCKDIMX - 1) / BLOCKDIMX,
                    (dimy + BLOCKDIMY - 1) / BLOCKDIMY);
        dim3 threads(BLOCKDIMX, BLOCKDIMY);

        for (unsigned int ii = 0; ii < NSTREAMS; ++ii)
        {
            // each stream processes one slab of the domain along z
            updateDomain3D<<<blocks, threads, 0, streams[ii]>>>(
                domain_gpu,
                dimx, 0, dimx - 1,                      // dimx, minx, maxx
                dimy, 0, dimy - 1,                      // dimy, miny, maxy
                dimz, dimz * ii / NSTREAMS,
                dimz * (ii + 1) / NSTREAMS - 1);        // dimz, minz, maxz

            unsigned int offset = dimx * dimy * dimz * ii / NSTREAMS;
            cudaMemcpyAsync(domain_cpu + offset, domain_gpu + offset,
                            sizeof(float) * dimx * dimy * dimz / NSTREAMS,
                            cudaMemcpyDeviceToHost, streams[ii]);
        }

        cudaDeviceSynchronize();
    }
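Note that `streams` is used but not declared in the snippet, and for `cudaMemcpyAsync` to actually run asynchronously, `domain_cpu` must point to pinned (page-locked) host memory; with ordinary pageable memory the copy degrades to synchronous behaviour. A minimal sketch of the setup the function above relies on, under those assumptions (`setup` is a hypothetical helper, not from the original code):

    cudaStream_t streams[NSTREAMS];

    void setup(float** domain_cpu, float** domain_gpu,
               unsigned int dimx, unsigned int dimy, unsigned int dimz)
    {
        const size_t bytes = sizeof(float) * dimx * dimy * dimz;

        // Pinned host memory is required for truly asynchronous copies.
        cudaMallocHost((void**)domain_cpu, bytes);
        cudaMalloc((void**)domain_gpu, bytes);

        // One stream per slab of the domain.
        for (unsigned int ii = 0; ii < NSTREAMS; ++ii)
            cudaStreamCreate(&streams[ii]);
    }

The matching teardown would use cudaStreamDestroy, cudaFreeHost, and cudaFree.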
Everything happens in a simple for-loop, looping over the streams (8 in this case) and dividing up the work. This is indeed a good deal faster (up to a 30% performance gain), although perhaps less than I had hoped. I analysed a typical cycle in NVIDIA's Compute Visual Profiler, and the execution looks like this:
As can be seen in the picture, the kernels do overlap, although never more than two kernels are running at the same time. I tried the same thing with different numbers of streams and different sizes of the simulation domain, but this is always the case.
So my question is: is there a way to encourage/force the GPU scheduler to run more than two things at the same time? Or is this a limitation that depends on the GPU device and cannot be changed in code?
My system specifications are: 64-bit Windows 7, and a GeForce GTX 670 graphics card (that's Kepler architecture, compute capability 3.0).
Kernels only overlap if the GPU has resources left over to run a second kernel. Once the GPU is fully loaded, there is no gain from running more kernels in parallel, so the driver does not do that.
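To make that resource argument concrete, here is a hypothetical sketch (`smallKernel` and its launch configuration are illustrative, not from the question). Each launch uses a single block, so a GTX 670 with 7 SMXs has idle multiprocessors left over, and the profiler should show several of these launches overlapping; replace the grid with one large enough to fill the device and the same launches serialize:

    #include <cuda_runtime.h>

    __global__ void smallKernel(float* data, int n)
    {
        // Busy-work on a small slice; one block per launch keeps
        // most of the device free for the other streams.
        int i = threadIdx.x;
        if (i < n)
            for (int k = 0; k < 10000; ++k)
                data[i] = data[i] * 1.000001f + 0.5f;
    }

    int main()
    {
        const int NS = 8;
        cudaStream_t s[NS];
        float* d;
        cudaMalloc((void**)&d, NS * 256 * sizeof(float));
        cudaMemset(d, 0, NS * 256 * sizeof(float));

        for (int ii = 0; ii < NS; ++ii)
            cudaStreamCreate(&s[ii]);

        // One block of 256 threads per stream: the device has spare SMs,
        // so these launches can be scheduled concurrently.
        for (int ii = 0; ii < NS; ++ii)
            smallKernel<<<1, 256, 0, s[ii]>>>(d + ii * 256, 256);

        cudaDeviceSynchronize();

        for (int ii = 0; ii < NS; ++ii)
            cudaStreamDestroy(s[ii]);
        cudaFree(d);
        return 0;
    }

In the original code the per-stream grids are already large (the whole x-y plane per slab), so two of them are likely enough to occupy all SMXs, which would explain why at most two kernels ever overlap.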