Conversation

aditya-narayan5

I made the following two changes:

trace.py – skip expensive tensor sampling

  • Adds a safeguard in _summarize_value() to avoid sampling tensors larger than 10,000 elements (a minimal sketch follows this list).
  • Sampling large tensors forces GPU-to-CPU transfers during profiling, which can add significant host-side latency and unnecessary device synchronization.
  • For developer experience, future work could integrate lovely‑tensors library.
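
As an illustration, here is a minimal sketch of the safeguard, assuming _summarize_value() has roughly this shape; the 10,000-element cutoff matches the change, but the constant name and the returned fields are illustrative, not the exact code:

```python
import torch

TENSOR_SAMPLE_LIMIT = 10_000  # skip element sampling above this many elements

def _summarize_value(value):
    if isinstance(value, torch.Tensor):
        if value.numel() > TENSOR_SAMPLE_LIMIT:
            # Large tensors: report cheap metadata only, avoiding the
            # GPU-to-CPU copy (and implicit device sync) that sampling causes.
            return {"shape": tuple(value.shape), "dtype": str(value.dtype),
                    "device": str(value.device), "sampled": False}
        # Small tensors are cheap to move, so keep the existing sampling path.
        sample = value.detach().flatten()[:8].cpu().tolist()
        return {"shape": tuple(value.shape), "dtype": str(value.dtype),
                "device": str(value.device), "sample": sample}
    return repr(value)
```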

profiler.py – non‑blocking trace export

  • Replaced the synchronous on_trace_ready=_save_trace callback with an asynchronous export launched via a module‑level ThreadPoolExecutor (a sketch follows this list).
  • torch.profiler.stop() is now lightweight: the heavy export_chrome_trace() + artifact upload are offloaded to background threads.
  • Empirical benchmarks (see the attached Colab notebook) show export_chrome_trace() can take 14–17 s on a simulated 7 B model trace (~1 GB); under the old synchronous flow this blocks the inference thread for the entire export.
  • In the previous synchronous flow, the slowdown wasn’t limited to export_chrome_trace() performing heavy JSON serialization; the subsequent artifact upload could also block the main thread. Once traces grow into the hundreds of MBs or beyond (at roughly 120 MB/s of sustained write throughput on a standard SSD, that is already several seconds of disk I/O alone), both serialization and upload contribute to long stalls during profiling.
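
A rough sketch of the offload under these assumptions: the executor and wrapper names below are mine, and _save_trace() refers to the existing helper (sketched after the summary further down):

```python
from concurrent.futures import ThreadPoolExecutor

# Module-level, persistent pool; two workers are enough since exports are rare.
_TRACE_EXECUTOR = ThreadPoolExecutor(max_workers=2)

def stop_and_export_async(prof, trace_path="trace.json"):
    # stop() stays lightweight: serialization and upload no longer run inline.
    prof.stop()
    # Hand the heavy work (_save_trace: export_chrome_trace + artifact upload)
    # to a background thread; the returned Future can be awaited or ignored.
    return _TRACE_EXECUTOR.submit(_save_trace, prof, trace_path)
```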

A quick summary:

  • Introduces a small, persistent ThreadPoolExecutor(max_workers=2) for asynchronous trace writing (profiler.py).
  • Ensures thread safety: exports are launched only after profiler.stop() completes.
  • Uses the existing _save_trace() + _upload_trace_artifact() with no other API changes.
  • Adds explicit clean‑up (del prof) in _save_trace() once the trace is flushed to disk, to prevent out‑of‑memory (OOM) issues (see the sketch below).
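
And a corresponding sketch of the clean-up inside _save_trace(); _upload_trace_artifact() here is only a stand-in stub for the existing upload helper, and the trace path is illustrative:

```python
import os

def _upload_trace_artifact(trace_path):
    # Stand-in stub for the existing artifact upload helper (unchanged by this PR).
    print(f"uploading {os.path.getsize(trace_path)} bytes from {trace_path}")

def _save_trace(prof, trace_path="trace.json"):
    # Heavy JSON serialization: can take tens of seconds for multi-hundred-MB
    # traces, which is why it now runs on a background thread.
    prof.export_chrome_trace(trace_path)
    _upload_trace_artifact(trace_path)
    # Drop this function's reference to the profiler once the trace is on disk,
    # so the captured events can be garbage-collected instead of lingering in
    # memory (OOM risk).
    del prof
```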
