CUDA compute and copy engine queue limits


I seem to have encountered a limit on the number of asynchronous kernel launches that can be queued in the compute engine queue. Once that limit is reached, the host blocks and GPU-CPU concurrency is lost. This is not mentioned in the CUDA Programming Guide.

  • What is the maximum number of asynchronous kernel launches that can be queued in the compute engine queue?
  • Does this maximum number depend in any way on the kernel being launched?
  • Does the time it takes the CPU to put a kernel launch in the compute engine queue depend on the kernel being launched?
  • What is the maximum number of asynchronous memcpys that can be queued in the copy engine queue?

I'm not sure there is a universal answer to this question. To some degree it is platform and CUDA version specific, AFAIK. To answer your bullet points:

  • The limit is on queue size. I believe there is a maximum number of queued operations rather than of kernel launches, so the same total limit should apply to any combination of kernels, copy operations, and stream events. The total number of operations depends on the platform and CUDA version.
  • No.
  • No, but once the driver queue is filled, the time taken to submit an asynchronous operation increases considerably.
  • See the first point. I don't believe the driver distinguishes between copies, kernel launches, or events.

I can recall doing some benchmarking circa CUDA 2.1 and finding that everything ran fine until 24 operations had been queued, after which the time taken to queue each subsequent operation increased. By the time CUDA 3.0 had been released, I didn't have any code that hit the limit that existed in older versions, so something had changed. It should be trivial to write a benchmark to check what more modern CUDA versions do.
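For what it's worth, here is a minimal sketch of the kind of benchmark I have in mind. It is not the code I used back then. It keeps the GPU busy with a kernel that spins for a fixed number of clock cycles, then times each launch of an empty kernel from the host and reports the first launch call that blocks. The kernel names `spin` and `empty`, the cycle count, the launch cap, and the 50 ms stall threshold are all my own assumptions; tune them for your hardware (and keep the spin short enough to stay under any display watchdog timeout).

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Occupies the GPU for roughly `cycles` clock ticks so that later
// launches in the same stream pile up behind it.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Does nothing; only the cost of submitting it is of interest.
__global__ void empty() { }

int main()
{
    // Assumed values, not taken from any documentation.
    const long long spinCycles = 2000000000LL; // about 1-2 s on a 1-2 GHz GPU
    const int maxLaunches = 1 << 15;           // give up after this many
    const double stallMs = 50.0;               // "the host blocked" threshold

    // Create the context and warm up so setup costs are not measured.
    CHECK(cudaFree(0));
    empty<<<1, 1>>>();
    CHECK(cudaDeviceSynchronize());

    // Block the default stream, then queue empty kernels behind the spinner.
    spin<<<1, 1>>>(spinCycles);
    CHECK(cudaGetLastError());

    int stalledAt = -1;
    for (int i = 0; i < maxLaunches; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        empty<<<1, 1>>>();
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms > stallMs) {   // this launch call did not return promptly
            stalledAt = i;
            break;
        }
    }
    CHECK(cudaDeviceSynchronize());

    if (stalledAt >= 0)
        printf("Launch call %d blocked; about %d launches were queued first\n",
               stalledAt, stalledAt);
    else
        printf("No launch blocked within %d launches\n", maxLaunches);
    return 0;
}
```

Build it with something like `nvcc -o queue_depth queue_depth.cu` (the file name is arbitrary). Replacing the empty kernel launch with a `cudaMemcpyAsync` or `cudaEventRecord` call in the timed loop should show whether copies and events count against the same limit.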

