c++ - Improving asynchronous execution in CUDA


I am writing a program that performs large simulations on the GPU using the CUDA API. In order to accelerate performance, I tried to run my kernels simultaneously and then asynchronously copy the results back to host memory. The code looks roughly like this:

#define NSTREAMS   8
#define BLOCKDIMX  16
#define BLOCKDIMY  16

void domainUpdate(float* domain_cpu,       // pointer to the domain on the host
                  float* domain_gpu,       // pointer to the domain on the device
                  const unsigned int dimX,
                  const unsigned int dimY,
                  const unsigned int dimZ)
{
    dim3 blocks((dimX + BLOCKDIMX - 1) / BLOCKDIMX, (dimY + BLOCKDIMY - 1) / BLOCKDIMY);
    dim3 threads(BLOCKDIMX, BLOCKDIMY);

    for (unsigned int ii = 0; ii < NSTREAMS; ++ii) {
        // each stream updates its own z-slice of the domain
        updateDomain3D<<<blocks, threads, 0, streams[ii]>>>(domain_gpu,
                                                            dimX, 0, dimX - 1,  // dimX, minX, maxX
                                                            dimY, 0, dimY - 1,  // dimY, minY, maxY
                                                            dimZ, dimZ * ii / NSTREAMS, dimZ * (ii + 1) / NSTREAMS - 1);  // dimZ, minZ, maxZ

        unsigned int offset = dimX * dimY * dimZ * ii / NSTREAMS;
        cudaMemcpyAsync(domain_cpu + offset,
                        domain_gpu + offset,
                        sizeof(float) * dimX * dimY * dimZ / NSTREAMS,
                        cudaMemcpyDeviceToHost, streams[ii]);
    }

    cudaDeviceSynchronize();
}
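(For completeness: the `streams` array above is created elsewhere in my code. A minimal sketch of that setup, with illustrative names, might look like the following; note that `cudaMemcpyAsync` only overlaps with kernel execution when the host buffer is pinned.)

```cpp
#include <cuda_runtime.h>

cudaStream_t streams[NSTREAMS];

void setupStreams(float** domain_cpu, size_t nElems)
{
    for (unsigned int ii = 0; ii < NSTREAMS; ++ii)
        cudaStreamCreate(&streams[ii]);

    // Pinned (page-locked) host allocation; with a plain malloc'ed buffer,
    // cudaMemcpyAsync degrades to a staged, effectively synchronous copy.
    cudaMallocHost(domain_cpu, nElems * sizeof(float));
}

void teardownStreams(float* domain_cpu)
{
    for (unsigned int ii = 0; ii < NSTREAMS; ++ii)
        cudaStreamDestroy(streams[ii]);
    cudaFreeHost(domain_cpu);
}
```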

All of this runs in a simple for-loop, looping over the streams (8 in my case) and dividing up the work. This did make things a good deal faster (up to a 30% performance gain), although maybe less than I had hoped. I analysed a typical cycle in NVIDIA's Compute Visual Profiler, and the execution looks like this:

CUDA API trace in Compute Visual Profiler

As can be seen in the picture, the kernels do overlap, although there are never more than two kernels running at the same time. I tried the same thing with different numbers of streams and different sizes of the simulation domain, but this is always the case.

So my question is: is there a way to encourage/force the GPU scheduler to run more than two things at the same time? Or is this a limitation of the GPU device that cannot be influenced from the code?

My system specifications are: 64-bit Windows 7, and a GeForce GTX 670 graphics card (that's Kepler architecture, compute capability 3.0).

Kernels only overlap if the GPU has resources left to run a second kernel. Once the GPU is fully loaded, there is no gain in running more kernels in parallel, so the driver does not do that.
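You can check what the device itself supports by querying its properties; a minimal sketch (requires the CUDA toolkit, and is not taken from the original post):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // concurrentKernels: whether the device can run multiple kernels at once.
    // asyncEngineCount:  number of copy engines; GeForce Kepler boards
    // typically report 1, so two async copies cannot overlap each other.
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    return 0;
}
```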

