- https://docs.pytorch.org/tutorials/beginner/hta_intro_tutorial.html
- The goal of HTA (Holistic Trace Analysis) is to help engineers and researchers get the best performance out of the hardware stack. For this to happen, it is imperative to understand the resource utilization and bottlenecks in distributed training and inference workloads.
- HTA aims to give users insight into "what is happening under the hood in a distributed GPU workload?"
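
As a quick orientation before the feature table below, here is a minimal sketch of loading traces into HTA's `TraceAnalysis` entry point, following the tutorial above. The trace directory path is a placeholder for wherever your profiler wrote one Kineto JSON trace per rank.

```python
# Minimal sketch: point HTA at a directory of Kineto traces
# (one JSON file per rank). The path is a placeholder.
from hta.trace_analysis import TraceAnalysis

analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
```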
| Bottleneck dimension | What HTA measures | Why it matters |
|---|---|---|
| Temporal balance | % of time each rank's GPU spends computing, communicating/doing memory operations, or idle (hta.readthedocs.io) | Reveals under-subscription and serialization. |
| Idle-time taxonomy | Splits idle gaps into host wait, kernel wait, and other (docs.pytorch.org) | Tells you whether to optimize the CPU input pipeline, stream concurrency, or synchronization. |
| Communication–compute overlap | Ratio of compute that runs concurrently with communication (docs.pytorch.org) | Low overlap ⇒ GPUs block on data dependencies; overlap ↑ ⇒ TFLOPS/GPU ↑. |
| Kernel signatures | Longest kernels, duration distributions, launch-latency plots (docs.pytorch.org) | Pinpoints hot kernels, launch overhead, and imbalance across ranks. |
| Augmented counters | Memory-bandwidth traces and queue-length time series (docs.pytorch.org) | Quantifies copy pressure and stream back-pressure. |
| Trace diff | Added/removed ops and kernels between two runs, an A/B test for performance (hta.readthedocs.io, docs.pytorch.org) | Proves whether a code change really sped things up. |
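
Each table row maps onto a `TraceAnalysis` method (and, for the last row, the `TraceDiff` class). A hedged sketch, reusing the `analyzer` constructed above; the method names follow the HTA docs, but treat the commented-out `TraceDiff` call's exact argument types as an assumption.

```python
# One TraceAnalysis call per bottleneck dimension in the table above.
# `analyzer` is the TraceAnalysis object constructed earlier.

# Temporal balance: compute / non-compute / idle percentages per rank.
time_spent_df = analyzer.get_temporal_breakdown()

# Idle-time taxonomy: host wait vs. kernel wait vs. other.
idle_time_df = analyzer.get_idle_time_breakdown()

# Communication-compute overlap ratio per rank.
overlap_df = analyzer.get_comm_comp_overlap()

# Kernel signatures: per-kernel-type metrics plus top kernels.
kernel_type_df, kernel_df = analyzer.get_gpu_kernel_breakdown()

# Kernel launch statistics (launch latency, etc.).
launch_stats_df = analyzer.get_cuda_kernel_launch_stats()

# Augmented counters: memory bandwidth and queue-length time series.
mem_bw_series = analyzer.get_memory_bw_time_series()
queue_len_series = analyzer.get_queue_length_time_series()

# Trace diff: A/B-compare two runs. Per the HTA docs this takes two
# parsed traces; the exact signature here is an assumption.
from hta.trace_diff import TraceDiff
# diff_df = TraceDiff.compare_traces(control_trace, test_trace)
```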