Idle Time Breakdown

Understanding how much time the GPU is idle and its causes can help direct optimization strategies. A GPU is considered idle when no kernel is running on it. We developed an algorithm to categorize the Idle time into 3 categories:

  1. Host wait: is the idle duration on the GPU due to the CPU not enqueuing kernels fast enough to keep the GPU busy. These kinds of inefficiencies can be resolved by examining the CPU operators that are contributing to the slow down, increasing the batch size and applying operator fusion.

  2. Kernel wait: constitutes the short overhead to launch consecutive kernels on the GPU. The idle time attributed to this category can be minimized by using CUDA Graph optimizations.

  3. Other wait: Lastly, this category includes idle we could not currently attribute due to insufficient information. The likely causes include synchronization among CUDA streams using CUDA events and delays in launching kernels.

The host wait time can be interpreted as the time when the GPU is stalling due to the CPU. To attribute the idle time as kernel wait we use the following heuristic:

gap between consecutive kernels < threshold

The default threshold value is 30 nanoseconds and can be configured using the consecutive_kernel_delay argument. By default, the idle time breakdown is computed for rank 0 only. In order to calculate the breakdown for other ranks, use the ranks argument in the get_idle_time_breakdown function. The idle time breakdown can be generated as follows:

analyzer = TraceAnalysis(trace_dir = "/path/to/trace/folder")
idle_time_df = analyzer.get_idle_time_breakdown()
../../_images/idle_time_breakdown_percentage.png

The function returns a tuple of dataframes. The first dataframe contains the idle time by category on each stream for each rank.

../../_images/idle_time.png

The second dataframe is generated when show_idle_interval_stats is set to True. It contains the summary statistics of the idle time for each stream on each rank.

../../_images/idle_time_summary.png

Tip

By default, the idle time breakdown presents the percentage of each of the idle time categories. Setting the visualize_pctg argument to False, the function renders with absolute time on the y-axis. See image below.

../../_images/idle_time_breakdown.png