- https://docs.pytorch.org/tutorials/beginner/hta_intro_tutorial.html
- The goal of HTA (Holistic Trace Analysis) is to help engineers and researchers get the best performance out of the hardware stack. For this to happen, it is imperative to understand the resource utilization and bottlenecks in distributed training and inference workloads.
- HTA aims to give users insight into "what is happening under the hood in a distributed GPU workload?"
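
As a quick orientation before the feature table below, here is a minimal sketch of loading traces into HTA's `TraceAnalysis` entry point, following the tutorial above. The trace directory path is a placeholder for wherever your profiler wrote one Kineto JSON trace per rank.

```python
# Minimal sketch: point HTA at a directory of Kineto traces
# (one JSON file per rank). The path is a placeholder.
from hta.trace_analysis import TraceAnalysis

analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
```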
| Bottleneck dimension | What HTA measures | Why it matters |
|---|---|---|
| Temporal balance | % of time each rank's GPU spends computing, communicating/doing memory operations, or idle (hta.readthedocs.io) | Reveals under-subscription and serialization. |
| Idle-time taxonomy | Splits idle gaps into host wait, kernel wait, and other (docs.pytorch.org) | Tells you whether to optimize the CPU input pipeline, stream concurrency, or synchronization. |
| Communication–compute overlap | Ratio of compute that runs concurrently with communication (docs.pytorch.org) | Low overlap ⇒ GPUs block on data dependencies; overlap ↑ ⇒ TFLOPS/GPU ↑. |
| Kernel signatures | Longest kernels, duration distributions, launch-latency plots (docs.pytorch.org) | Pinpoints hot kernels, launch overhead, and imbalance across ranks. |
| Augmented counters | Memory-bandwidth traces and queue-length time series (docs.pytorch.org) | Quantifies copy pressure and stream back-pressure. |
| Trace diff | Added/removed ops and kernels between two runs, an A/B test for performance (hta.readthedocs.io, docs.pytorch.org) | Proves whether a code change really sped things up. |
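
Each table row maps onto a `TraceAnalysis` method (and, for the last row, the `TraceDiff` class). A hedged sketch, reusing the `analyzer` constructed above; the method names follow the HTA docs, but treat the commented-out `TraceDiff` call's exact argument types as an assumption.

```python
# One TraceAnalysis call per bottleneck dimension in the table above.
# `analyzer` is the TraceAnalysis object constructed earlier.

# Temporal balance: compute / non-compute / idle percentages per rank.
time_spent_df = analyzer.get_temporal_breakdown()

# Idle-time taxonomy: host wait vs. kernel wait vs. other.
idle_time_df = analyzer.get_idle_time_breakdown()

# Communication-compute overlap ratio per rank.
overlap_df = analyzer.get_comm_comp_overlap()

# Kernel signatures: per-kernel-type metrics plus top kernels.
kernel_type_df, kernel_df = analyzer.get_gpu_kernel_breakdown()

# Kernel launch statistics (launch latency, etc.).
launch_stats_df = analyzer.get_cuda_kernel_launch_stats()

# Augmented counters: memory bandwidth and queue-length time series.
mem_bw_series = analyzer.get_memory_bw_time_series()
queue_len_series = analyzer.get_queue_length_time_series()

# Trace diff: A/B-compare two runs. Per the HTA docs this takes two
# parsed traces; the exact signature here is an assumption.
from hta.trace_diff import TraceDiff
# diff_df = TraceDiff.compare_traces(control_trace, test_trace)
```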