Trace Analysis API

class hta.trace_analysis.TraceAnalysis(trace_files: Optional[Dict[int, str]] = None, trace_dir: str = '/tmp/trace', include_last_profiler_step: Optional[bool] = False)[source]
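
A minimal construction sketch; the trace directory below is a hypothetical placeholder for a folder of collected PyTorch Profiler traces:

    from hta.trace_analysis import TraceAnalysis

    # Hypothetical directory containing the collected profiler trace files.
    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")
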
critical_path_analysis(rank: int, annotation: str, instance_id: Union[int, None, Tuple[int, int]]) Tuple[CPGraph, bool][source]

Perform critical path analysis for trace events within a rank. The region of interest is further narrowed by selecting a trace annotation and an instance id, which limits the analysis to events within the time range of that annotation, including GPU kernels launched by CPU operators in that time window. For example, you can restrict the analysis to one iteration by passing annotation=’ProfilerStep’. See the notes below for how to pick the iteration.

Parameters
  • t (Trace) – Input trace data structure.

  • rank (int) – rank to analyze for the critical path.

  • annotation (str) – a trace annotation to limit the analysis to; for example, “ProfilerStep” matches all annotations starting with this string (ProfilerStep#100, ProfilerStep#101, etc.).

  • instance_id – can be either of the following:

    (int) – specify which instance of the annotation to consider. Defaults to the first instance.

    (Tuple[int, int]) – consider a range of annotation instances from start to end, inclusive of both the start and end instance.

Returns

Tuple[CPGraph, bool]

A tuple of a CPGraph object and a boolean indicating success or failure; True indicates that the critical path analysis algorithm succeeded.

The CPGraph object can be used to obtain statistics and to further visualize the critical path. CPGraph is also a subclass of networkx.DiGraph. Run ‘CPGraph?’ for more info and APIs.

Notes:

  1. Avoid using the first step / iteration in a trace as it usually has some missing events.

  2. The analysis requires CUDA synchronization events in the GPU trace, which were added in https://github.com/pytorch/pytorch/pull/105187. Please see the documentation of that PR for how to enable CUDA sync events in the trace.
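
A minimal usage sketch; the trace directory and the instance id below are illustrative placeholders:

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path

    # Analyze the second ProfilerStep instance on rank 0 (the first step is
    # usually incomplete, see note 1 above).
    cp_graph, success = analyzer.critical_path_analysis(
        rank=0, annotation="ProfilerStep", instance_id=1
    )
    if success:
        # CPGraph subclasses networkx.DiGraph, so standard graph APIs apply.
        print(cp_graph.number_of_nodes(), cp_graph.number_of_edges())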

generate_trace_with_counters(time_series: Optional[TimeSeriesTypes] = None, ranks: Optional[List[int]] = None, output_suffix: str = '_with_counters') None[source]

Adds a set of time series to the trace to aid in debugging. Creates a new trace file for each requested rank with the suffix ‘_with_counters.json’. The following time series are available through the TimeSeriesTypes flag type.

  1. Queue length - adds a time series to the trace indicating the size of the queue at any given time on each CUDA stream.

  2. Memory copy bandwidth - adds a time series to the trace indicating the memory bandwidth used for device to host, host to device and device to device operations.

Either or both of the above can be enabled.

Parameters
  • time_series (Flag) – Used to set the requested time series. Available values are TimeSeriesTypes.QUEUE_LENGTH and TimeSeriesTypes.MEMCPY_BANDWIDTH. By default both time series are added to the trace.

  • ranks (List[int]) – List of ranks to generate the counters for. Default = [0].

  • output_suffix (str) – Suffix to add to the trace file. Default = ‘_with_counters.json.gz’

Returns

None
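
A sketch of a typical call, relying on the default of adding both counter types; the trace directory and ranks are placeholders:

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path

    # Writes one new trace file per requested rank, with queue length and
    # memory copy bandwidth counters overlaid (both are added by default).
    analyzer.generate_trace_with_counters(ranks=[0, 1])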

get_comm_comp_overlap(visualize: bool = True) DataFrame[source]

Compute the communication-computation overlap percentage for each rank.

Parameters

visualize (bool) – Set to True to display the graph. Default = True.

Returns

pd.DataFrame

A dataframe containing the communication-computation overlap percentage for each rank.
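
For example (the trace directory is a hypothetical placeholder):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path
    overlap_df = analyzer.get_comm_comp_overlap(visualize=False)
    print(overlap_df)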

get_cuda_kernel_launch_stats(ranks: Optional[List[int]] = None, runtime_cutoff: int = 50, launch_delay_cutoff: int = 100, include_memory_events: bool = True, visualize: bool = True) Dict[int, DataFrame][source]

For each event launched on the GPU there is a corresponding scheduling event on the CPU. These events are linked by a common correlation id. This feature calculates the duration of the CPU op, the GPU kernel, and the launch delay (the difference between the GPU kernel start and the CPU op end) for each correlation id on the specified rank(s). This function finds:

  1. GPU events with a shorter duration than the corresponding CPU events.

  2. CPU runtime events with a large duration i.e. outliers. The outliers are defined using the runtime_cutoff value.

  3. CPU events which have a large launch delay i.e. launch delay outliers. The launch delay outliers are defined using the launch_delay_cutoff value.

Parameters
  • ranks (List[int]) – List of ranks on which to run the analysis. Default = [0].

  • runtime_cutoff (int) – Duration in microseconds to determine outliers for cuda runtime events. Default = 50 microseconds.

  • launch_delay_cutoff (int) – Duration in microseconds to determine outliers for launch delay. Default value is 100 microseconds.

  • include_memory_events (bool) – Toggle to include cudaMemcpyAsync and cudaMemsetAsync events. Default = True.

  • visualize (bool) – Toggle to display the generated graphs. Default = True.

Returns

Dict[int, pd.DataFrame]

The function returns a dictionary of dataframes. The key corresponds to the rank and the value is a dataframe containing the cpu_duration, gpu_duration and launch_delay for each correlation id.
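
A usage sketch with the default cutoffs; the trace directory is a placeholder:

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path
    launch_stats = analyzer.get_cuda_kernel_launch_stats(ranks=[0], visualize=False)
    # One dataframe per rank with cpu_duration, gpu_duration and launch_delay.
    print(launch_stats[0].head())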

get_cupti_counter_data_with_operators(ranks: Optional[List[int]] = None) List[DataFrame][source]

Performance counters provide insights on how to speed up GPU kernels. The PyTorch Profiler has a lightweight API based on the CUPTI Range Profiler API (https://docs.nvidia.com/cupti/r_main.html#r_profiler) that enables users to monitor performance counters from the device.

When the CUPTI Profiler mode is enabled, PyTorch emits the performance counters and annotates them in the trace:

  • The events are logged under the cuda_profiler_range category.

  • Counter values are logged in the args section of the trace.

This API lets you investigate performance measurements per kernel and associate them with the operators the kernel belongs to. A single kernel can map to multiple levels of operators (as operators can be nested). To represent this, we provide a list column called op_stack. For further convenience, we also add columns for the top and bottom level operators.

Parameters

ranks (List[int]) – List of ranks on which to run the analysis. Default = [0].

Returns

List[pd.DataFrame]

A list of dataframes, one per rank, containing kernel name, op_stack (operator stack), top and bottom level op, and columns for individual performance counters.
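
If the traces were collected with CUPTI Profiler mode enabled, a call might look as follows (the trace directory is a placeholder):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/cupti_profiler_job/")  # hypothetical path
    counter_dfs = analyzer.get_cupti_counter_data_with_operators(ranks=[0])
    # One dataframe per requested rank; op_stack holds the nested operators.
    print(counter_dfs[0].head())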

get_frequent_cuda_kernel_sequences(operator_name: str, output_dir: str, min_pattern_len: int = 3, rank: int = 0, top_k: int = 5, visualize: bool = False, compress_other_kernels: bool = True) DataFrame[source]

Computes the most frequent CUDA kernel sequences originating from the CPU op with name operator_name. Generates a dataframe summarizing the sequence of kernels, their frequency and total time taken. Additionally, writes a new trace file to output_dir with the top_k frequent patterns overlaid on top of the original trace file.

Parameters
  • operator_name (str) – Name of the operator from which the CUDA kernels are launched.

  • output_dir (str) – Output folder path containing the new trace file with overlaid top k frequent patterns.

  • min_pattern_len (int) – Minimum length of the CUDA kernel sequences that should be identified. Default = 3.

  • rank (int) – Rank on which the analysis is performed. Default = 0.

  • top_k (int) – top_k patterns in terms of frequency to be visualized and overlaid. Default = 5.

  • visualize (bool) – Whether to show the histogram of the top_k frequent patterns inline. Default = False.

  • compress_other_kernels (bool) – Whether to compress the names and args of kernels that do not belong to any frequent pattern, in order to reduce the size of the overlaid trace file. Default = True.

Returns

pd.DataFrame

A dataframe with frequent cuda kernel sequences and their frequencies.
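
A usage sketch; the operator name and paths are illustrative placeholders:

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path
    patterns_df = analyzer.get_frequent_cuda_kernel_sequences(
        operator_name="aten::linear",   # hypothetical operator of interest
        output_dir="traces/overlaid/",  # hypothetical output folder
        rank=0,
        top_k=5,
    )
    print(patterns_df)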

get_gpu_kernel_breakdown(visualize: bool = True, duration_ratio: float = 0.8, num_kernels: int = 10, include_memory_kernels: bool = True, image_renderer: str = 'notebook') Tuple[DataFrame, DataFrame][source]

Summarizes the time spent by each kernel and by kernel type. Outputs the following graphs:

  1. Pie chart indicating the percentage of time taken by each kernel type.

  2. Pie charts showing the most time consuming kernels for each rank for each kernel type.

  3. Bar graphs showing the average duration for the most time consuming kernels for each rank and each kernel type.

Parameters
  • visualize (bool) – Set to True to display the graphs. Default = True.

  • duration_ratio (float) – Floating point value between 0 and 1 specifying the ratio of time taken by top COMM/COMP/MEMORY kernels. Default = 0.8.

  • num_kernels (int) – Maximum number of COMM/COMP/MEMORY kernels to show. Default = 10.

  • include_memory_kernels (bool) – Whether to include MEMORY kernels in the analysis. Default = True.

  • image_renderer (str) – Set to notebook when using Jupyter and to jupyterlab when using JupyterLab. To see all available options, execute import plotly; plotly.io.renderers in a Python shell.

Returns

Tuple[pd.DataFrame, pd.DataFrame]

Returns two dataframes. The first dataframe shows the percentage of time spent by kernel type. The second dataframe shows the min, max, mean, standard deviation, and total time taken by each kernel on each rank. This dataframe is summarized based on the values of duration_ratio and num_kernels. If both duration_ratio and num_kernels are specified, num_kernels takes precedence.
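
A usage sketch with inline visualization disabled (hypothetical trace directory):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path
    kernel_type_df, kernel_df = analyzer.get_gpu_kernel_breakdown(
        visualize=False, num_kernels=5, include_memory_kernels=True
    )
    print(kernel_type_df)    # percentage of time per kernel type
    print(kernel_df.head())  # per-kernel statistics on each rank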

get_idle_time_breakdown(ranks: Optional[List[int]] = None, streams: Optional[List[int]] = None, visualize: bool = True, visualize_pctg: bool = True, show_idle_interval_stats=False, consecutive_kernel_delay: int = 30) Tuple[DataFrame, Optional[DataFrame]][source]

The GPU is considered idle when no kernel is running on it. Idle time is broken down into 3 categories:

  1. Host wait time: a GPU or stream is idle because the CPU thread has not enqueued enough kernels to keep it occupied.

  2. Kernel wait time: This is the duration between kernels and is considered an overhead of launching multiple small kernels. We use the following heuristic to classify the duration as kernel wait: duration between consecutive kernels < consecutive_kernel_delay.

  3. Other wait time: In this case the idle time is attributed to an unknown cause. For example, a compute kernel could be waiting for a CUDA event from a communication kernel to complete.

Parameters
  • ranks (List[int]) – List of ranks for which idle time breakdown is computed. Default = [0].

  • streams (List[int]) – List of streams to provide analysis for. Defaults to all streams.

  • visualize (bool) – Set to True to show the graph. Default = True.

  • visualize_pctg (bool) – Show relative percentage across streams. Default = True.

  • show_idle_interval_stats (bool) – Returns statistics of the idle intervals like the min, max and median of idle intervals between kernels on a CUDA stream, also broken down by the idleness category. Default = False.

  • consecutive_kernel_delay (int) – Configures the threshold under which we consider gaps between kernels to be due to realistic delays in launching back to back kernels on the GPU. Default = 30 nanoseconds.

Returns

Tuple[pd.DataFrame, Optional[pd.DataFrame]]

A tuple of dataframes. The first dataframe contains the idle time category and duration for each stream on each rank. The second dataframe contains the summary statistics (count, min, max, mean, standard deviation, 25th, 50th, 75th percentile) for each idle category for each stream on each rank.
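
A usage sketch that also requests the idle interval statistics (hypothetical trace directory):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path
    idle_df, interval_df = analyzer.get_idle_time_breakdown(
        ranks=[0], visualize=False, show_idle_interval_stats=True
    )
    print(idle_df)      # idle time per category, stream and rank
    print(interval_df)  # idle interval statistics, only when requested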

get_memory_bw_summary(ranks: Optional[List[int]] = None) DataFrame[source]

Summarizes the memory bandwidth statistics for memory copy and memset operations. This includes memory bandwidth for copies from Device to Host, Host to Device and Device to Device transfers. Note, this does not include memory bandwidth used by compute/communication kernels.

Parameters

ranks (List[int]) – List of ranks for which memory bandwidth is calculated. Default = [0].

Returns

pd.DataFrame or None

A dataframe containing the summary statistics. The dataframe includes count, min, max, standard deviation, 25th, 50th and 75th percentiles of memory copy/memset operations. The function returns None when the dataframe is empty.
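
For example (hypothetical trace directory):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path
    mem_bw_df = analyzer.get_memory_bw_summary(ranks=[0])
    if mem_bw_df is not None:  # None when there are no memcpy/memset events
        print(mem_bw_df)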

get_memory_bw_time_series(ranks: Optional[List[int]] = None) Dict[int, DataFrame][source]

Calculates the time series for memory copy bandwidth used by memcpy and memset operations in GB/s. The memory bandwidth is calculated for host to device, device to host and device to device copies. Note, this does not include memory bandwidth used by computation or communication kernels.

Parameters

ranks (List[int]) – List of ranks for which the memory bandwidth time series is generated. Default = [0].

Returns

Dict[int, pd.DataFrame]

Returns a dictionary whose key is the rank and value is a dataframe of memory bandwidth counter events. The following fields are in each row of the dataframe: ts (timestamp), pid (process id), tid (thread id), name (memcpy/memset), and memory bandwidth in GB/s.
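
A short sketch of retrieving the per-rank time series (hypothetical trace directory):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path
    mem_bw_series = analyzer.get_memory_bw_time_series(ranks=[0])
    # Dictionary keyed by rank; each value is a dataframe of bandwidth counters.
    print(mem_bw_series[0].head())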

get_queue_length_summary(ranks: Optional[List[int]] = None) Optional[DataFrame][source]

Queue length is defined as the number of outstanding CUDA operations on a stream. This function calculates the summary statistics for the queue length on each CUDA stream for the specified ranks.

Parameters

ranks (List[int]) – List of ranks for which the queue length summary is calculated. Default = [0].

Returns

pd.DataFrame or None

A dataframe summarizing the queue length statistics. The dataframe contains count, min, max, standard deviation, 25th, 50th and 75th percentiles. The function returns None when the dataframe is empty.
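
For example (hypothetical trace directory):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path
    queue_len_df = analyzer.get_queue_length_summary(ranks=[0])
    if queue_len_df is not None:  # None when no queue length data is found
        print(queue_len_df)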

get_queue_length_time_series(ranks: Optional[List[int]] = None) Dict[int, DataFrame][source]

Queue length is defined as the number of outstanding CUDA operations on a stream. This function calculates the time series for the queue length on each CUDA stream for the specified ranks.

Parameters

ranks (List[int]) – List of ranks for which the queue length time series is generated. Default = [0].

Returns

Dict[int, pd.DataFrame]

Returns a dictionary whose key is the rank and value is a dataframe of queue length counter events. The following fields are in each row of the dataframe: ts (timestamp), pid (process id), tid (thread id), stream, and queue length.
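
A short sketch of retrieving the per-rank queue length series (hypothetical trace directory):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path
    queue_len_series = analyzer.get_queue_length_time_series(ranks=[0])
    # Dictionary keyed by rank; rows contain ts, pid, tid, stream, queue length.
    print(queue_len_series[0].head())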

get_temporal_breakdown(visualize: bool = True) DataFrame[source]

Compute the idle time, compute time and non-compute time for each rank. Time is measured in nanoseconds (ns). Non-compute time is defined as the total time the GPU spends on operations other than compute, such as data transfers, memory copies and communication collectives. (In the strictest sense communication collectives do some compute, but we classify them as communication operations.)

Parameters

visualize (bool) – Set to True to display the graphs. Default = True.

Returns

pd.DataFrame

A dataframe containing the raw value and percentage of idle time, compute time and non-compute time for each rank.
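
A brief sketch (hypothetical trace directory):

    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path
    temporal_df = analyzer.get_temporal_breakdown(visualize=False)
    # Raw values and percentages of idle, compute and non-compute time per rank.
    print(temporal_df)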

overlay_critical_path_analysis(rank: int, critical_path_graph: CPGraph, output_dir: str, only_show_critical_events: bool = True, show_all_edges: bool = False) str[source]

Overlay the identified critical path on top of the trace file for visualization.

Parameters
  • rank (int) – rank whose trace is overlaid with the critical path.

  • critical_path_graph (CPGraph) – critical path graph object previously generated by critical_path_analysis.

  • output_dir (str) – Output directory to store overlaid trace.

  • only_show_critical_events (bool) – When set, the output trace only contains operators and GPU kernels on the critical path. User annotations are still retained.

  • show_all_edges (bool) – When set this will add edge events for all types of edges in the critical path graph. This is useful for debugging the algorithm.

Returns

str

The path of the overlaid trace file. The generated trace file name has the prefix “overlaid_critical_path_” added to the name of the original trace file.

Note: For kernel launches that are not on the critical path, the graph still has a 0-weight edge between the CUDA runtime event and the kernel. These 0-weight edges are not shown in the overlaid trace by default. Set the environment variable CRITICAL_PATH_SHOW_ZERO_WEIGHT_LAUNCH_EDGE=1 to add them to the overlaid trace, for example by adding os.environ["CRITICAL_PATH_SHOW_ZERO_WEIGHT_LAUNCH_EDGE"] = "1" in your notebook.
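
A sketch that chains this with critical_path_analysis; all paths are placeholders:

    import os
    from hta.trace_analysis import TraceAnalysis

    # Optional: include 0-weight kernel launch edges in the overlaid trace.
    os.environ["CRITICAL_PATH_SHOW_ZERO_WEIGHT_LAUNCH_EDGE"] = "1"

    analyzer = TraceAnalysis(trace_dir="traces/resnet_job/")  # hypothetical path
    cp_graph, success = analyzer.critical_path_analysis(
        rank=0, annotation="ProfilerStep", instance_id=1
    )
    if success:
        overlaid_trace = analyzer.overlay_critical_path_analysis(
            rank=0,
            critical_path_graph=cp_graph,
            output_dir="traces/overlaid/",  # hypothetical output folder
        )
        print(overlaid_trace)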