| Bottleneck dimension | What HTA measures | Why it matters |
|---|---|---|
| Temporal balance | % of time GPUs spend computing, communicating/moving memory, or idle on each rank (hta.readthedocs.io) | Reveals under-subscription and serialization. |
| Idle-time taxonomy | Splits idle gaps into host-wait, kernel-gap, and other (docs.pytorch.org) | Tells you whether to optimize the CPU input pipeline, stream concurrency, or synchronization. |
| Communication–compute overlap | Ratio of compute that runs concurrently with communication (docs.pytorch.org) | Low overlap ⇒ GPUs block on data dependencies; higher overlap ⇒ higher TFLOPS/GPU. |
| Kernel signatures | Longest kernels, duration distributions, launch-latency plots (docs.pytorch.org) | Pinpoints hot kernels, launch overhead, and imbalance across ranks. |
| Augmented counters | Memory-bandwidth traces & queue-length time series (docs.pytorch.org) | Quantifies copy pressure and stream back-pressure. |
| Trace diff | Added/removed ops & kernels between two runs, an A/B test for performance (hta.readthedocs.io, docs.pytorch.org) | Proves whether a code change really sped things up. |
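As a minimal sketch of how these dimensions map onto the HolisticTraceAnalysis Python API: the snippet below assumes traces have already been collected into directories of per-rank Kineto JSON files (the `./traces*` paths are placeholders), and the exact return shapes of some calls may differ slightly between HTA versions.

```python
from hta.trace_analysis import TraceAnalysis
from hta.trace_diff import TraceDiff

# Point the analyzer at a directory of per-rank trace files
# (placeholder path; use your own trace directory).
analyzer = TraceAnalysis(trace_dir="./traces")

# Temporal balance: share of time each rank spends computing,
# communicating/copying memory, or sitting idle.
temporal_df = analyzer.get_temporal_breakdown()

# Idle-time taxonomy: attribute idle gaps to host wait, kernel gaps, or other.
idle_breakdown = analyzer.get_idle_time_breakdown()

# Communication-compute overlap per rank.
overlap_df = analyzer.get_comm_comp_overlap()

# Kernel signatures: which kernels dominate and how their durations spread.
kernel_breakdown = analyzer.get_gpu_kernel_breakdown()

# Augmented counters: memory-bandwidth and queue-length time series.
mem_bw_series = analyzer.get_memory_bw_time_series()
queue_len_series = analyzer.get_queue_length_time_series()

# Trace diff: A/B-compare two runs to see which ops/kernels were added,
# removed, or changed in duration (assumes both runs were traced).
control = TraceAnalysis(trace_dir="./traces_before")
test = TraceAnalysis(trace_dir="./traces_after")
diff_df = TraceDiff.compare_traces(control, test)
```

Most of these calls return pandas DataFrames (and, by default, render interactive plots), so they can be filtered or joined per rank to localize a bottleneck before reaching for the raw trace viewer.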