Conversation

aditya-narayan5

I made the following two changes:

trace.py – skip expensive tensor sampling

  • Adds a safeguard in _summarize_value() to avoid sampling tensors larger than 10,000 elements (a minimal sketch follows this list).
  • Sampling large tensors forces GPU-to-CPU transfers during profiling, which can add significant host-side latency and unnecessary device synchronization.
  • For developer experience, future work could integrate lovely‑tensors library.
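
As an illustration, here is a minimal sketch of the safeguard, assuming _summarize_value() has roughly this shape; the 10,000-element cutoff matches the change, but the constant name and the returned fields are illustrative, not the exact code:

```python
import torch

TENSOR_SAMPLE_LIMIT = 10_000  # skip element sampling above this many elements

def _summarize_value(value):
    if isinstance(value, torch.Tensor):
        if value.numel() > TENSOR_SAMPLE_LIMIT:
            # Large tensors: report cheap metadata only, avoiding the
            # GPU-to-CPU copy (and implicit device sync) that sampling causes.
            return {"shape": tuple(value.shape), "dtype": str(value.dtype),
                    "device": str(value.device), "sampled": False}
        # Small tensors are cheap to move, so keep the existing sampling path.
        sample = value.detach().flatten()[:8].cpu().tolist()
        return {"shape": tuple(value.shape), "dtype": str(value.dtype),
                "device": str(value.device), "sample": sample}
    return repr(value)
```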

profiler.py – non‑blocking trace export

  • Replaced the synchronous on_trace_ready=_save_trace callback with an asynchronous export launched via a module‑level ThreadPoolExecutor (a sketch follows this list).
  • torch.profiler.stop() is now lightweight: the heavy export_chrome_trace() + artifact upload are offloaded to background threads.
  • Empirical benchmarks (see the attached Colab notebook) show export_chrome_trace() can take 14–17 s on a simulated 7 B model trace (~1 GB); under the old synchronous flow this blocks the inference thread for the entire export.
  • In the previous synchronous flow, the slowdown wasn’t limited to export_chrome_trace() performing heavy JSON serialization; the subsequent artifact upload could also block the main thread. Once traces grow into the hundreds of MBs or beyond (at roughly 120 MB/s of sustained write throughput on a standard SSD, that is already several seconds of disk I/O alone), both serialization and upload contribute to long stalls during profiling.
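
A rough sketch of the offload under these assumptions: the executor and wrapper names below are mine, and _save_trace() refers to the existing helper (sketched after the summary further down):

```python
from concurrent.futures import ThreadPoolExecutor

# Module-level, persistent pool; two workers are enough since exports are rare.
_TRACE_EXECUTOR = ThreadPoolExecutor(max_workers=2)

def stop_and_export_async(prof, trace_path="trace.json"):
    # stop() stays lightweight: serialization and upload no longer run inline.
    prof.stop()
    # Hand the heavy work (_save_trace: export_chrome_trace + artifact upload)
    # to a background thread; the returned Future can be awaited or ignored.
    return _TRACE_EXECUTOR.submit(_save_trace, prof, trace_path)
```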

A quick summary:

  • Introduces a small, persistent ThreadPoolExecutor(max_workers=2) for asynchronous trace writing (profiler.py).
  • Ensures thread safety: exports are launched only after profiler.stop() completes.
  • Uses the existing _save_trace() + _upload_trace_artifact() with no other API changes.
  • Adds explicit clean‑up (del prof) in _save_trace() once the trace is flushed to disk, to prevent out‑of‑memory (OOM) issues (see the sketch below).
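
And a corresponding sketch of the clean-up inside _save_trace(); _upload_trace_artifact() here is only a stand-in stub for the existing upload helper, and the trace path is illustrative:

```python
import os

def _upload_trace_artifact(trace_path):
    # Stand-in stub for the existing artifact upload helper (unchanged by this PR).
    print(f"uploading {os.path.getsize(trace_path)} bytes from {trace_path}")

def _save_trace(prof, trace_path="trace.json"):
    # Heavy JSON serialization: can take tens of seconds for multi-hundred-MB
    # traces, which is why it now runs on a background thread.
    prof.export_chrome_trace(trace_path)
    _upload_trace_artifact(trace_path)
    # Drop this function's reference to the profiler once the trace is on disk,
    # so the captured events can be garbage-collected instead of lingering in
    # memory (OOM risk).
    del prof
```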
