Holistic Trace Analysis
Holistic Trace Analysis (HTA) is an open source performance analysis and visualization Python library for PyTorch users. HTA takes as input Kineto traces collected by the PyTorch Profiler and up-levels the performance information contained in the traces.
ML researchers and systems engineers often struggle to computationally scale up their models because they are not aware of the performance bottlenecks in their workloads. The resources requested for a job (e.g. GPUs, memory) are often misaligned with the resources actually required due to lack of visibility “under the hood”.
The goal of HTA is to help engineers and researchers achieve the best performance from the hardware stack. For this to happen, it is imperative to understand the resource utilization and bottlenecks of distributed training and inference workloads.
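As a minimal, illustrative sketch of this workflow (the training step, file names, and paths below are placeholders; real distributed jobs typically write one trace file per rank into a shared directory), a Kineto trace can be collected with the PyTorch Profiler and then handed to HTA:

```python
import os

import torch
from torch.profiler import profile, ProfilerActivity

from hta.trace_analysis import TraceAnalysis


def train_step():
    # Placeholder for one iteration of the real workload.
    x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
    (x @ x).sum().backward()


# Collect a Kineto trace with the PyTorch Profiler.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_step()

# HTA analyzes a directory of trace files; the file name here is illustrative.
os.makedirs("./traces", exist_ok=True)
prof.export_chrome_trace("./traces/rank_0.json")

# Construct the analyzer from the trace directory.
analyzer = TraceAnalysis(trace_dir="./traces")
```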
Features in Holistic Trace Analysis
To aid in performance debugging, HTA provides the following features (brief usage sketches follow the list):
Temporal Breakdown: Breakdown of GPU time in terms of time spent in computation, communication, memory events, and idle time on a single node and across all ranks.
Idle Time Breakdown: Breakdown of GPU idle time into time spent waiting for the host, waiting for another kernel, or attributed to an unknown cause.
Kernel Breakdown: Find kernels with the longest duration on each rank.
Kernel Duration Distribution: Distribution of the average time taken by the longest kernels across different ranks.
Communication Computation Overlap: Calculate the percentage of time when communication overlaps computation.
CUDA Kernel Launch Statistics: Distributions of GPU kernels with very small duration, large duration, and excessive launch time.
Augmented Counters (Memory copy bandwidth, Queue length): Augmented trace files that provide insights into memory copy bandwidth and the number of outstanding operations on each CUDA stream.
Frequent CUDA Kernel Patterns: Find the CUDA kernels most frequently launched by any given PyTorch or user-defined operator.
Trace Diff: A trace comparison tool to identify and visualize the differences between traces.
CUPTI Counter Analysis: An experimental API to interpret GPU performance counters. It attributes performance measurements from kernels to PyTorch operators, and can help with kernel optimization and roofline analysis.
Lightweight Critical Path Analysis: An experimental API to compute the critical path in the trace. The critical path can help one understand whether an application is CPU bound, GPU compute bound, or communication bound. The path can be visualized on the original trace as well as manipulated as a directed acyclic graph object.
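As a rough usage sketch for the breakdown features above (method names follow the HTA documentation, but exact signatures and return types may differ across versions, and the trace path is a placeholder), the analyses are exposed as methods on a TraceAnalysis object and return pandas DataFrames:

```python
from hta.trace_analysis import TraceAnalysis

# Point the analyzer at a directory of Kineto trace files (path is illustrative).
analyzer = TraceAnalysis(trace_dir="/path/to/trace/dir")

# Temporal breakdown: computation, communication, memory, and idle time per rank.
temporal_breakdown = analyzer.get_temporal_breakdown()

# Idle time breakdown: waiting for the host, waiting for another kernel, or unknown.
idle_time_breakdown = analyzer.get_idle_time_breakdown()

# Kernel breakdown and duration distribution of the longest kernels on each rank.
kernel_breakdown = analyzer.get_gpu_kernel_breakdown()

# Percentage of time communication overlaps with computation.
comm_comp_overlap = analyzer.get_comm_comp_overlap()

# CUDA kernel launch statistics: launch delays, very short and very long kernels.
kernel_launch_stats = analyzer.get_cuda_kernel_launch_stats()
```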
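The augmented counter and frequent-kernel-pattern features follow the same pattern; in the sketch below the operator name and output directory are placeholders, and the experimental CUPTI counter and critical path APIs are omitted since the section above marks them as experimental:

```python
from hta.trace_analysis import TraceAnalysis

analyzer = TraceAnalysis(trace_dir="/path/to/trace/dir")

# Augmented counters: memory copy bandwidth and the number of outstanding
# operations (queue length) on each CUDA stream, as time series and summaries.
mem_bw_series = analyzer.get_memory_bw_time_series()
mem_bw_summary = analyzer.get_memory_bw_summary()
queue_len_series = analyzer.get_queue_length_time_series()
queue_len_summary = analyzer.get_queue_length_summary()

# Frequent CUDA kernel patterns launched by a given operator; the operator name
# and output directory are illustrative.
frequent_kernel_patterns = analyzer.get_frequent_cuda_kernel_sequences(
    operator_name="aten::linear",
    output_dir="/tmp/hta_output",
)
```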