Trace Analysis API

class hta.trace_analysis.TraceAnalysis(trace_files: Optional[Dict[int, str]] = None, trace_dir: str = '/tmp/trace')[source]
generate_trace_with_counters(time_series: Optional[TimeSeriesTypes] = None, ranks: Optional[List[int]] = None, output_suffix: str = '_with_counters') None[source]

Adds a set of time series to the trace to aid in debugging. Creates a new trace file for each requested rank with the suffix ‘_with_counters.json.gz’. The following time series are available in the TimeSeriesTypes flag type.

  1. Queue length - adds a time series to the trace indicating the size of the queue at any given time on each CUDA stream.

  2. Memory copy bandwidth - adds a time series to the trace indicating the memory bandwidth used for device to host, host to device and device to device operations.

Either or both of the above can be enabled.

Parameters
  • time_series (Flag) – Used to set the requested time series. Available values are TimeSeriesTypes.QUEUE_LENGTH and TimeSeriesTypes.MEMCPY_BANDWIDTH. By default both time series are added to the trace.

  • ranks (List[int]) – List of ranks to generate the counters for. Default = [0].

  • output_suffix (str) – Suffix to add to the trace file name. Default = ‘_with_counters’; the ‘.json.gz’ extension is appended to the output file.

Returns

None
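
A minimal usage sketch is shown below. The trace directory is a placeholder for a folder of PyTorch Profiler traces; TimeSeriesTypes is assumed to be importable from hta.trace_analysis alongside TraceAnalysis.

    from hta.trace_analysis import TraceAnalysis, TimeSeriesTypes

    # "/path/to/traces" is a placeholder; point trace_dir at your profiler output.
    analyzer = TraceAnalysis(trace_dir="/path/to/traces")

    # Overlay only the queue length counters for rank 0. TimeSeriesTypes is a
    # Flag type, so values can be combined with "|"; omitting time_series adds
    # both series by default.
    analyzer.generate_trace_with_counters(
        time_series=TimeSeriesTypes.QUEUE_LENGTH,
        ranks=[0],
    )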

get_comm_comp_overlap(visualize: bool = True) DataFrame[source]

Compute the communication-computation overlap percentage for each rank.

Parameters

visualize (bool) – Set to True to display the graph. Default = True.

Returns

pd.DataFrame

A dataframe containing the communication-computation overlap percentage for each rank.
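
For example, a sketch of fetching the overlap table without rendering the graph (the trace path is a placeholder):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="/path/to/traces")  # placeholder path
    overlap_df = analyzer.get_comm_comp_overlap(visualize=False)
    print(overlap_df)  # one overlap percentage per rank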

get_cuda_kernel_launch_stats(ranks: Optional[List[int]] = None, runtime_cutoff: int = 50, launch_delay_cutoff: int = 100, include_memory_events: bool = True, visualize: bool = True) Dict[int, DataFrame][source]

For each event launched on the GPU there is a corresponding scheduling event on the CPU. These events are linked by a common correlation id. This feature calculates the duration of the CPU op, the GPU kernel, and the launch delay (the difference between the GPU kernel starting and the CPU op ending) for each correlation id on the specified rank(s). This function finds:

  1. GPU events with a shorter duration than the corresponding CPU events.

  2. CPU runtime events with a large duration, i.e. outliers. The outliers are defined using the runtime_cutoff value.

  3. CPU events which have a large launch delay, i.e. launch delay outliers. The launch delay outliers are defined using the launch_delay_cutoff value.

Parameters
  • ranks (List[int]) – List of ranks on which to run the analysis. Default = [0].

  • runtime_cutoff (int) – Duration in microseconds to determine outliers for CUDA runtime events. Default = 50 microseconds.

  • launch_delay_cutoff (int) – Duration in microseconds to determine outliers for launch delay. Default value is 100 microseconds.

  • include_memory_events (bool) – Toggle to include cudaMemcpyAsync and cudaMemsetAsync events. Default = True.

  • visualize (bool) – Toggle to display the generated graphs. Default = True.

Returns

Dict[int, pd.DataFrame]

The function returns a dictionary of dataframes. The key corresponds to the rank and the value is a dataframe containing the cpu_duration, gpu_duration and launch_delay for each correlation id.
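
A sketch of running the analysis on rank 0 with the default cutoffs (the trace path is a placeholder):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="/path/to/traces")  # placeholder path
    launch_stats = analyzer.get_cuda_kernel_launch_stats(
        ranks=[0],
        runtime_cutoff=50,        # CUDA runtime outlier threshold, in microseconds
        launch_delay_cutoff=100,  # launch delay outlier threshold, in microseconds
        visualize=False,
    )
    rank0_df = launch_stats[0]  # cpu_duration, gpu_duration, launch_delay per correlation id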

get_cupti_counter_data_with_operators(ranks: Optional[List[int]] = None) List[DataFrame][source]

Performance counters provide insights on how to speed up GPU kernels. The PyTorch Profiler has a lightweight API, the CUPTI Range Profiler API (https://docs.nvidia.com/cupti/r_main.html#r_profiler), that enables users to monitor performance counters from the device.

When CUPTI Profiler mode is enabled, PyTorch emits the performance counters and annotates them in the trace.

  • The events are logged under the cuda_profiler_range category.

  • Counter values are logged in the args section of the trace.

This API enables investigating performance measurements per kernel and associating them with the operators that the kernel belongs to. A single kernel can map to multiple levels of operators (as operators can be nested). To represent this, we provide a list column called op_stack. For further convenience, we also add columns for the top-level and bottom-level operators.

Parameters

ranks (List[int]) – List of ranks on which to run the analysis. Default = [0].

Returns

List[pd.DataFrame]

A list of dataframes, one per rank, containing kernel name, op_stack (operator stack), top and bottom level op, and columns for individual performance counters.
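
A sketch, assuming the trace was collected with CUPTI Profiler mode enabled (the path is a placeholder):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="/path/to/cupti_traces")  # placeholder path
    counter_dfs = analyzer.get_cupti_counter_data_with_operators(ranks=[0])
    df = counter_dfs[0]  # kernel name, op_stack, top/bottom level ops, counter columns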

get_frequent_cuda_kernel_sequences(operator_name: str, output_dir: str, min_pattern_len: int = 3, rank: int = 0, top_k: int = 5, visualize: bool = False, compress_other_kernels: bool = True) DataFrame[source]

Computes the most frequent CUDA kernel sequences originating from the CPU op with name operator_name. Generates a dataframe summarizing the sequence of kernels, their frequency and total time taken. Additionally, writes a new trace file to output_dir with the top_k frequent patterns overlaid on top of the original trace file.

Parameters
  • operator_name (str) – Name of the operator from which the CUDA kernels are launched.

  • output_dir (str) – Path to the output folder where the new trace file, with the top_k frequent patterns overlaid, is written.

  • min_pattern_len (int) – Minimum length of the CUDA kernel sequences that should be identified. Default = 3.

  • rank (int) – Rank on which the analysis is performed. Default = 0.

  • top_k (int) – Number of the most frequent patterns to visualize and overlay. Default = 5.

  • visualize (bool) – Whether to show the histogram of the top_k frequent patterns inline. Default = False.

  • compress_other_kernels (bool) – Whether to compress the names and args of kernels that do not belong to any frequent pattern, in order to reduce the size of the overlaid trace file. Default = True.

Returns

pd.DataFrame

A dataframe with frequent CUDA kernel sequences and their frequencies.
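
A sketch of mining the patterns launched from one operator. The operator name and both paths are placeholders; use names that actually occur in your trace:

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="/path/to/traces")  # placeholder path
    patterns_df = analyzer.get_frequent_cuda_kernel_sequences(
        operator_name="aten::linear",  # hypothetical operator name
        output_dir="/path/to/output",  # folder for the overlaid trace file
        min_pattern_len=3,
        rank=0,
        top_k=5,
    )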

get_gpu_kernel_breakdown(visualize: bool = True, duration_ratio: float = 0.8, num_kernels: int = 10, include_memory_kernels: bool = True, image_renderer: str = 'notebook') Tuple[DataFrame, DataFrame][source]

Summarizes the time spent by each kernel and by kernel type. Outputs the following graphs:

  1. Pie chart indicating the percentage of time taken by each kernel type.

  2. Pie charts showing the most time consuming kernels for each rank for each kernel type.

  3. Bar graphs showing the average duration for the most time consuming kernels for each rank and each kernel type.

Parameters
  • visualize (bool) – Set to True to display the graphs. Default = True.

  • duration_ratio (float) – Floating point value between 0 and 1 specifying the ratio of time taken by top COMM/COMP/MEMORY kernels. Default = 0.8.

  • num_kernels (int) – Maximum number of COMM/COMP/MEMORY kernels to show. Default = 10.

  • include_memory_kernels (bool) – Whether to include MEMORY kernels in the analysis. Default = True.

  • image_renderer (str) – Set to notebook when using Jupyter and to jupyterlab when using JupyterLab. To see all available options execute import plotly; plotly.io.renderers in a Python shell.

Returns

Tuple[pd.DataFrame, pd.DataFrame]

Returns two dataframes. The first dataframe shows the percentage of time spent by kernel type. The second dataframe shows the min, max, mean, standard deviation, and total time taken by each kernel on each rank. This dataframe is summarized based on the values of duration_ratio and num_kernels. If both duration_ratio and num_kernels are specified, num_kernels takes precedence.
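
A sketch of computing the breakdown without rendering the plots (the trace path is a placeholder):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="/path/to/traces")  # placeholder path
    kernel_type_df, kernel_df = analyzer.get_gpu_kernel_breakdown(
        visualize=False,
        duration_ratio=0.8,
        num_kernels=10,
        include_memory_kernels=True,
    )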

get_idle_time_breakdown(ranks: Optional[List[int]] = None, streams: Optional[List[int]] = None, visualize: bool = True, visualize_pctg: bool = True, show_idle_interval_stats=False, consecutive_kernel_delay: int = 30) Tuple[DataFrame, Optional[DataFrame]][source]

The GPU is considered idle when no kernel is running on it. Idle time is broken down into three categories:

  1. Host wait time: A GPU or stream is idle because the CPU thread has not enqueued enough kernels to keep it occupied.

  2. Kernel wait time: This is the duration between kernels and is considered an overhead of launching multiple small kernels. We use the following heuristic to classify the duration as kernel wait: duration between consecutive kernels < consecutive_kernel_delay.

  3. Other wait time: In this case the idle time is attributed to an unknown cause. For example, a compute kernel could be waiting for a CUDA event from a communication kernel to complete.

Parameters
  • ranks (List[int]) – List of ranks for which idle time breakdown is computed. Default = [0].

  • streams (List[int]) – List of streams to provide analysis for. Defaults to all streams.

  • visualize (bool) – Set to True to show the graph. Default = True.

  • visualize_pctg (bool) – Show relative percentage across streams. Default = True.

  • show_idle_interval_stats (bool) – Set to True to also return statistics of the idle intervals, such as the min, max and median of idle intervals between kernels on a CUDA stream, broken down by idleness category. Default = False.

  • consecutive_kernel_delay (int) – Configures the threshold under which we consider gaps between kernels to be due to realistic delays in launching back-to-back kernels on the GPU. Default = 30 nanoseconds.

Returns

Tuple[pd.DataFrame, Optional[pd.DataFrame]]

A tuple of dataframes. The first dataframe contains the idle time category and duration for each stream on each rank. The second dataframe contains the summary statistics (count, min, max, mean, standard deviation, 25th, 50th, 75th percentile) for each idle category for each stream on each rank.
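
A sketch that also requests the idle interval statistics (the trace path is a placeholder):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="/path/to/traces")  # placeholder path
    idle_df, interval_stats_df = analyzer.get_idle_time_breakdown(
        ranks=[0],
        visualize=False,
        show_idle_interval_stats=True,  # otherwise the second element is None
    )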

get_memory_bw_summary(ranks: Optional[List[int]] = None) DataFrame[source]

Summarizes the memory bandwidth statistics for memory copy and memset operations, including Device to Host, Host to Device and Device to Device transfers. Note that this does not include memory bandwidth used by compute/communication kernels.

Parameters

ranks (List[int]) – List of ranks for which memory bandwidth is calculated. Default = [0].

Returns

pd.DataFrame or None

A dataframe containing the summary statistics. The dataframe includes count, min, max, standard deviation, 25th, 50th and 75th percentiles of memory copy/memset operations. The function returns None when the dataframe is empty.
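
A sketch that guards against an empty result (the trace path is a placeholder):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="/path/to/traces")  # placeholder path
    bw_df = analyzer.get_memory_bw_summary(ranks=[0])
    if bw_df is not None:  # None when no memcpy/memset events are found
        print(bw_df)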

get_memory_bw_time_series(ranks: Optional[List[int]] = None) Dict[int, DataFrame][source]

Calculates the time series for memory copy bandwidth used by memcpy and memset operations in GB/s. The memory bandwidth is calculated for host to device, device to host and device to device copies. Note that this does not include memory bandwidth used by computation or communication kernels.

Parameters

ranks (List[int]) – List of ranks for which the memory bandwidth time series is generated. Default = [0].

Returns

Dict[int, pd.DataFrame]

Returns a dictionary whose key is the rank and value is a dataframe of memory bandwidth counter events. The following fields are in each row of the dataframe: ts (timestamp), pid (process id), tid (thread id), name (memcpy/memset), and memory bandwidth in GB/s.
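
A sketch of retrieving the per-rank time series (the trace path is a placeholder):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="/path/to/traces")  # placeholder path
    bw_series = analyzer.get_memory_bw_time_series(ranks=[0])
    rank0_series = bw_series[0]  # ts, pid, tid, name, bandwidth in GB/s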

get_queue_length_summary(ranks: Optional[List[int]] = None) Optional[DataFrame][source]

Queue length is defined as the number of outstanding CUDA operations on a stream. This function calculates the summary statistics for the queue length on each CUDA stream for the specified ranks.

Parameters

ranks (List[int]) – List of ranks for which the queue length summary is calculated. Default = [0].

Returns

pd.DataFrame or None

A dataframe summarizing the queue length statistics. The dataframe contains count, min, max, standard deviation, 25th, 50th and 75th percentiles. The function returns None when the dataframe is empty.
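
A sketch, again guarding against an empty result (the trace path is a placeholder):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="/path/to/traces")  # placeholder path
    qlen_df = analyzer.get_queue_length_summary(ranks=[0])
    if qlen_df is not None:  # None when the dataframe is empty
        print(qlen_df)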

get_queue_length_time_series(ranks: Optional[List[int]] = None) Dict[int, DataFrame][source]

Queue length is defined as the number of outstanding CUDA operations on a stream. This function calculates the time series for the queue length on each CUDA stream for the specified ranks.

Parameters

ranks (List[int]) – List of ranks for which the queue length time series is generated. Default = [0].

Returns

Dict[int, pd.DataFrame]

Returns a dictionary whose key is the rank and value is a dataframe of queue length counter events. The following fields are in each row of the dataframe: ts (timestamp), pid (process id), tid (thread id), stream, and queue length.
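
A sketch of retrieving the queue length series per rank (the trace path is a placeholder):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="/path/to/traces")  # placeholder path
    qlen_series = analyzer.get_queue_length_time_series(ranks=[0])
    rank0_series = qlen_series[0]  # ts, pid, tid, stream, queue length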

get_temporal_breakdown(visualize: bool = True) DataFrame[source]

Compute the idle time, compute time and non-compute time for each rank. Time is measured in nanoseconds (ns). Non-compute time is defined as the total time the GPU spends on operations other than computation, such as data transfers, memory copies and communication collectives. (In the strictest sense communication collectives do some compute, but we classify them as communication operations.)

Parameters

visualize (bool) – Set to True to display the graphs. Default = True.

Returns

pd.DataFrame

A dataframe containing the raw value and percentage of idle time, compute time and non-compute time for each rank.
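
A closing sketch (the trace path is a placeholder):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="/path/to/traces")  # placeholder path
    breakdown_df = analyzer.get_temporal_breakdown(visualize=False)
    print(breakdown_df)  # idle/compute/non-compute time and percentages per rank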