
v2026.04.27.00

Update tbe.monitoring import in torchrec/distributed/types.py (meta-pytorch#4170)

Summary:
Pull Request resolved: meta-pytorch#4170

Migrate TBEStatsReporterConfig import from fbgemm_gpu.runtime_monitor to
fbgemm_gpu.tbe.monitoring in torchrec/distributed/types.py.

Reviewed By: cthi

Differential Revision: D99748581

fbshipit-source-id: bf7f88c8c4d9556165e497afe9d6a24a40740593

Apr 25, 2026
8e668c7

v2026.04.20.00

Add logging for inplace_copy_batch size (meta-pytorch#4136)

Summary:
Pull Request resolved: meta-pytorch#4136

This diff adds one-time logging to the `inplace_copy_batch` technique to track how much data (in GB) is being copied to GPU via the in-place path. This gives visibility into the memory footprint of H2D transfers, which serves as a proxy for memory savings enabled by the technique (3-5 GB per rank in production RecSys models).

## Changes
1. **`utils.py`**: Add `_batch_tensor_size(batch)` helper that uses `torch.utils._pytree.tree_flatten` to flatten the batch into leaf tensors, then sums `element_size() * numel()` for each. This leverages existing pytree registrations for KJT, JaggedTensor, KeyedTensor, etc., so it automatically handles any registered batch type. All operations are O(1) tensor metadata reads.
2. **`train_pipelines.py`**: Add flag-gated logging to `TrainPipelineBase._copy_batch_to_gpu` and `TrainPipelineSparseDist.inplace_copy_batch_to_gpu`. Size is computed only on the first batch (via `_inplace_copy_batch_size_logged` flag), then logged once via `one_time_logger` + `LazyStr` for deferred string formatting.
3. **`experimental_pipelines.py`**: Same logging added to `TrainPipelineSparseDistT.inplace_copy_batch_to_gpu` (threaded variant).

**Performance:** Zero steady-state overhead — size computation runs only once (first batch), `LazyStr` defers formatting, and `one_time_logger` caps at 1 emission per call site.
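The size computation and the log-once gate described above can be sketched in plain Python. This is an illustration only: `FakeTensor`, `iter_leaves`, `batch_size_bytes`, and `SizeLoggingCopier` are hypothetical stand-ins, and the actual change uses `torch.utils._pytree.tree_flatten`, `one_time_logger`, and `LazyStr` rather than the hand-rolled pieces shown here.

```python
import logging
from dataclasses import dataclass
from typing import Any, Iterator, List, Tuple

logger = logging.getLogger("inplace_copy_sketch")


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; exposes only the metadata used."""

    shape: Tuple[int, ...]
    itemsize: int

    def numel(self) -> int:
        count = 1
        for dim in self.shape:
            count *= dim
        return count

    def element_size(self) -> int:
        return self.itemsize


def iter_leaves(obj: Any) -> Iterator[Any]:
    """Hypothetical flattener for dicts, lists, and tuples (not pytree)."""
    if isinstance(obj, dict):
        for value in obj.values():
            yield from iter_leaves(value)
    elif isinstance(obj, (list, tuple)):
        for value in obj:
            yield from iter_leaves(value)
    else:
        yield obj


def batch_size_bytes(batch: Any) -> int:
    """Sum element_size() * numel() over every tensor-like leaf."""
    return sum(
        leaf.element_size() * leaf.numel()
        for leaf in iter_leaves(batch)
        if hasattr(leaf, "element_size") and hasattr(leaf, "numel")
    )


class SizeLoggingCopier:
    """Computes and logs the batch size on the first batch only."""

    def __init__(self) -> None:
        self._size_logged = False
        self.logged_gb: List[float] = []  # kept only so the sketch is testable

    def copy(self, batch: Any) -> Any:
        if not self._size_logged:
            size_gb = batch_size_bytes(batch) / 1024**3
            logger.info("in-place copy batch size: %.3f GB", size_gb)
            self.logged_gb.append(size_gb)
            self._size_logged = True
        return batch  # the real pipeline would copy to the GPU here
```

After the first call, `copy` only checks a boolean, which is why the steady-state cost is negligible.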

Reviewed By: TroyGarden

Differential Revision: D101420284

fbshipit-source-id: 9e6a0b96ced2471711fe1c80dfea38df604da3b9

Apr 18, 2026
5315db2

v2026.04.13.00

Replace deprecated is_fx_tracing with is_fx_symbolic_tracing (meta-pytorch#4098)

Summary:
Pull Request resolved: meta-pytorch#4098

## Problem

`is_fx_tracing()` from `torch.fx._symbolic_trace` emits a warning on every call:

```
is_fx_tracing will return true for both fx.symbolic_trace and torch.export. Please use is_fx_tracing_symbolic_tracing() for specifically fx.symbolic_trace or torch.compiler.is_compiling() for specifically torch.export/compile.
```

## Fix

Replace with `is_fx_symbolic_tracing()`, which returns `True` only during `fx.symbolic_trace` (not during `torch.export/compile`). All usages in these files are guarding symbolic-trace-specific behavior (skipping assertions, skipping hooks), so `is_fx_symbolic_tracing` is the correct replacement.

## Prod Safety

[`is_fx_symbolic_tracing`](https://github.com/pytorch/pytorch/blob/main/torch/fx/_symbolic_trace.py#L66-L67) is defined as `return _is_fx_tracing_flag and not torch.compiler.is_compiling()`. Since TorchRec's FX tracing uses `fx.symbolic_trace` (not `torch.export`), the behavior is identical — these code paths were never entered during `torch.export/compile` anyway.
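The equivalence argument can be checked with a small truth table. The function below is a hypothetical model of the quoted definition, taking the two conditions as plain booleans; it does not call PyTorch.

```python
def symbolic_tracing_model(fx_tracing_flag: bool, is_compiling: bool) -> bool:
    """Hypothetical model of: _is_fx_tracing_flag and not is_compiling()."""
    return fx_tracing_flag and not is_compiling


def old_tracing_model(fx_tracing_flag: bool, is_compiling: bool) -> bool:
    """Hypothetical model of the old check, which ignores compile state."""
    return fx_tracing_flag


# The two checks differ only when the flag is set while compiling or exporting.
differing_cases = [
    (flag, compiling)
    for flag in (False, True)
    for compiling in (False, True)
    if symbolic_tracing_model(flag, compiling) != old_tracing_model(flag, compiling)
]
```

The only differing case is `(True, True)`, which per the description above was never reached by these code paths.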

Note: ~40 additional usages in `fb/` files import from Meta internal wrappers (`caffe2.torch.fb.model_transform.symbolic_tracing`, `hpc.models.fx_utils`) and are left to their respective owners.

Reviewed By: spmex

Differential Revision: D100528990

fbshipit-source-id: 554ddbe0a39e79a43f97c8da665724f73dbecf13

Apr 13, 2026
f4d689b

v2026.04.06.00

Remove output_dist_embeddings_requests from TrainPipelineContext (meta-pytorch#4008)

Summary:
Pull Request resolved: meta-pytorch#4008

## 1. Context
`output_dist_embeddings_requests` was introduced in D83789518 to store output dist (all-to-all) awaitables in `TrainPipelineContext`, keyed by module FQN. However, no downstream code ever consumes these awaitables from the context — they are already returned directly from `PipelinedForward.__call__`. Holding references to the embedding awaitables in the context prevents the tensors from being freed when the model no longer needs them, unnecessarily increasing peak memory usage.

## 2. Approach
1. **Remove the field**: Delete `output_dist_embeddings_requests` from the `TrainPipelineContext` dataclass, including its docstring.
2. **Remove the write**: Stop storing the awaitable in the context dict after `compute_and_output_dist` — it's already returned to the caller.
3. **Simplify error message**: Remove the diagnostic `output_dist_names` from the `AssertionError` in `PipelinedForward.__call__`, since the field no longer exists.

## 3. Results
* benchmark
|short name                         |GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)|
|--|--|--|--|--|--|--|--|
|sparse_data_dist_base_before       |6046.93 ms       |5533.60 ms       |**49.06 GB**                |67.47 GB                   |70.05 GB          |0.0 / 0.0 / 0.0              |31.40 GB          |
|sparse_data_dist_base_after        |5794.76 ms       |5371.52 ms       |**43.30 GB**                |67.68 GB                   |70.26 GB          |0.0 / 0.0 / 0.0              |31.39 GB          |

* repro commands
```
python -m torchrec.distributed.benchmark.benchmark_train_pipeline \
    --yaml_config=torchrec/distributed/benchmark/yaml/sparse_data_dist_base.yml \
    --name=sparse_data_dist_base_[before|after]
```

* before: 2.5+3.1 GB of pooled embeddings are not freed until `optimizer.step`

* after: 2.5+3.1 GB of pooled embeddings are freed early in the forward pass

## 4. Analysis
1. **Memory improvement**: GPU Peak Mem allocated dropped from 49.06 GB to 43.30 GB (~12% reduction). The context was holding strong references to output dist awaitables (which contain embedding tensors from all-to-all collectives). These references prevented GC from freeing the tensors even after the model consumed them, increasing peak memory.
2. **No functional impact**: The dict was write-only — no code path ever retrieved values from it. The only read was `.keys()` for a diagnostic error message.
3. **Backward compatibility**: `output_dist_embeddings_requests` is an internal field on `TrainPipelineContext`. It's not part of any public API contract.
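The memory effect in point 1 comes from ordinary Python reference semantics and can be reproduced without any tensors. In this minimal sketch, `Payload` is a hypothetical stand-in for an awaitable holding embedding tensors, and the dict key is a made-up module FQN.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for an awaitable that owns large tensors."""


def is_freed_after_use(store_in_context: bool) -> bool:
    """Return True if the payload is freed once the caller drops it."""
    context = {}  # plays the role of the write-only dict on the context
    payload = Payload()
    probe = weakref.ref(payload)  # observes the object without owning it
    if store_in_context:
        context["sparse_arch.ebc"] = payload  # hypothetical module FQN
    del payload  # the caller no longer needs the value
    gc.collect()
    return probe() is None
```

With `store_in_context=True` the object survives as long as the context does, which mirrors how the removed field extended tensor lifetimes.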

## 5. Changes
1. **`pipeline_context.py`**: Removed `output_dist_embeddings_requests` field definition and its docstring from `TrainPipelineContext`.
2. **`runtime_forwards.py`**: Removed the write to `output_dist_embeddings_requests` after `compute_and_output_dist`, removed the diagnostic `.keys()` read from the error message, and renamed the `record_function` label from `wait_sparse_data_dist` to `runtime_forward_assemble_KJT`.

Reviewed By: spmex

Differential Revision: D98744415

fbshipit-source-id: 1e0f713cd598826185352f7a991725832c50fdb7

Apr 6, 2026
70f6650

v2026.03.30.00

Add logging to DataLoadingThread (meta-pytorch#3947)

Summary:
Pull Request resolved: meta-pytorch#3947

## 1. Context
When debugging data loading issues in pipelined training, there is limited visibility into `DataLoadingThread` lifecycle events. The existing logging in `stop()` uses hardcoded class names, which is unhelpful when subclasses are involved.

## 2. Approach
1. **One-time creation log**: Added a `one_time_rank0_logger.info()` call in `__init__` that logs device, `to_device_non_blocking`, `memcpy_stream_priority`, and whether the memcpy stream was provided, auto-created, or None. This uses the `Cap01Logger` (one-time per location on rank 0 only) to avoid log spam in multi-step training.
2. **Dynamic class name**: All log messages use `self.__class__.__name__` instead of hardcoded strings, so subclasses of `DataLoadingThread` will correctly identify themselves in logs.
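The benefit of the dynamic class name can be seen in a minimal sketch. The class names and message text below are hypothetical and are not the actual `DataLoadingThread` code.

```python
import logging

logger = logging.getLogger("data_loading_sketch")


class BaseLoaderThread:
    """Hypothetical base class standing in for DataLoadingThread."""

    def stop_message(self) -> str:
        # Resolved at call time, so subclasses report their own name.
        return f"{self.__class__.__name__}: stopping"

    def stop(self) -> None:
        logger.info(self.stop_message())


class PrefetchLoaderThread(BaseLoaderThread):
    """Hypothetical subclass; inherits stop() unchanged."""
```

A hardcoded prefix would print the base class name for both classes, hiding which subclass actually emitted the message.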

## 3. Results
No benchmark impact — logging-only change.

## 4. Analysis
1. **Risk**: Low. Only adds logging statements with no behavioral changes.
2. **Backward compatibility**: No API changes. The `one_time_rank0_logger` is already imported and used elsewhere in the file (e.g., `FutureDeque`).

## 5. Changes
1. **`utils.py` (`DataLoadingThread.__init__`)**: Added `one_time_rank0_logger.info()` with creation config details using `self.__class__.__name__`.
2. **`utils.py` (`DataLoadingThread.stop`)**: Updated two existing `logger.info()` calls to prefix messages with `self.__class__.__name__`.

Reviewed By: aporialiao

Differential Revision: D98575560

fbshipit-source-id: 515dbe1b6ae3a4ab874dfbb250693dac65a5bb0e

Mar 30, 2026
b4d622a

v2026.03.23.00

Fix Unused Import issue in fbcode/torchrec/github/contrib/dynamic_embedding/src/torchrec_dynamic_embedding/distributed/__init__.py (meta-pytorch#3896)

Summary: Pull Request resolved: meta-pytorch#3896

Reviewed By: jeffkbkim

Differential Revision: D97469668

fbshipit-source-id: f9ca93982829dd072c7ae529c38ca453e8facc8a

Mar 23, 2026
62d9fbe

v2026.03.16.00

Guard reduce_scatter with self.training to prevent inference deadlock (meta-pytorch#3849)

Summary:
Pull Request resolved: meta-pytorch#3849

Within a single forward pass of a 2D FULLY_SHARDED embedding module, two NCCL collectives are active on different CUDA streams:

- Default stream: collectives on sharding_pg (feature distribution via the outer ShardedModule)
- Async stream: reduce_scatter on replica_pg (weight averaging across replicas)

The reduce_scatter is intentionally async (launched on self._async_stream, forward returns immediately) so it can overlap with downstream computation. During training, the backward pre-hook (_all_gather_table_weights) calls ensure_reduce_scatter_complete() to wait on it before performing the next all-gather. During inference (model.eval()), there is no backward pass, so the reduce_scatter is both unnecessary (no weight modifications to sync) and orphaned (never waited on). But it still fires, and with both streams submitting NCCL operations to different communicators on the same GPU, NCCL's internal cross-communicator serialization can create a circular dependency.

NCCL serializes operations from different communicators sharing a GPU. Each GPU picks one to execute first. When different GPUs pick different operations, a circular dependency forms across the overlapping process groups. With 6 GPUs and world_size_2D=2 (sharding PGs {0,3},{1,4},{2,5}, replica PGs {0,1,2},{3,4,5}):

  GPU 0: NCCL runs sharding_pg collective  -> barrier waiting for GPU 3
  GPU 3: NCCL runs reduce_scatter({3,4,5}) -> barrier waiting for GPU 4
  GPU 4: NCCL runs sharding_pg collective  -> barrier waiting for GPU 1
  GPU 1: NCCL runs reduce_scatter({0,1,2}) -> barrier waiting for GPU 0
  -> DEADLOCK

Technically a race condition (depends on NCCL's cross-communicator scheduling order), but near-deterministic with small PG sizes.
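The circular wait in the example above can be checked by modelling it as a wait-for graph, where each blocked GPU points at the GPU it is waiting on. This is a simplified illustration of the dependency chain only, not a model of NCCL scheduling; `find_wait_cycle` is a hypothetical helper.

```python
from typing import Dict, List, Optional


def find_wait_cycle(waits_for: Dict[int, int], start: int) -> Optional[List[int]]:
    """Follow wait-for edges from `start`; return the cycle if one is reached."""
    path: List[int] = []
    seen_at: Dict[int, int] = {}
    node = start
    while node in waits_for:
        if node in seen_at:
            return path[seen_at[node]:]
        seen_at[node] = len(path)
        path.append(node)
        node = waits_for[node]
    return None  # reached a GPU that is not blocked, so no deadlock


# Edges taken from the scenario above: GPU -> GPU it is waiting for.
inference_unguarded = {0: 3, 3: 4, 4: 1, 1: 0}

# Assumed picture without reduce_scatter: GPUs 3 and 1 are no longer blocked
# on the replica group, so only the sharding_pg waits remain.
inference_guarded = {0: 3, 4: 1}
```

`find_wait_cycle(inference_unguarded, 0)` walks 0, 3, 4, 1 and returns to 0, while the guarded graph terminates at an unblocked GPU.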

The fix adds `if self.training:` guard so reduce_scatter only fires during training, where it's needed for weight sync after gradient updates. Also adds a None filter to get_resize_awaitables() since get_rs_awaitable() returns None during inference when reduce_scatter is skipped.
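A minimal sketch of the guard and the `None` filter follows, under the assumption that the awaitable is simply absent when the collective is skipped. `ShardedModuleSketch` and `collect_awaitables` are hypothetical, and a string stands in for the real awaitable.

```python
from typing import List, Optional


class ShardedModuleSketch:
    """Hypothetical module that launches an async collective in forward."""

    def __init__(self) -> None:
        self.training = True
        self._rs_awaitable: Optional[str] = None

    def forward(self) -> None:
        self._rs_awaitable = None
        if self.training:  # the guard: skip the collective in eval mode
            self._rs_awaitable = "reduce_scatter_handle"

    def get_rs_awaitable(self) -> Optional[str]:
        return self._rs_awaitable


def collect_awaitables(modules: List[ShardedModuleSketch]) -> List[str]:
    """Drop the None entries produced by modules that skipped the collective."""
    handles = (module.get_rs_awaitable() for module in modules)
    return [handle for handle in handles if handle is not None]
```

Without the filter, callers that wait on every returned entry would fail on `None` during inference.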

Reviewed By: kausv, TroyGarden

Differential Revision: D91254757

fbshipit-source-id: afeb48758fe876d4c9c8074b6e03382a2ee2e646

Mar 16, 2026
55aa3ce

v1.6.0-rc1

Mar 15, 2026
725801d

v2026.03.09.00

Add batch input size logging to benchmark_train_pipeline (meta-pytorch#3845)

Summary:
Pull Request resolved: meta-pytorch#3845

## 1. Context
When running benchmark_train_pipeline, there is no visibility into how large the generated model inputs are in memory. This makes it difficult to reason about memory consumption, correlate input sizes with GPU memory stats, or diagnose OOM issues. The `ModelInput` class already provides a `size_in_bytes()` API, but it was not being used in the benchmark runner.

## 2. Approach
1. **Per-batch and total size logging**: After `input_config.generate_batches()` returns, iterate over all batches, call `size_in_bytes()` on each, and log both per-batch and total sizes with human-readable formatting (MB or GB).
2. **All-rank logging**: Log on every rank (not just rank 0) since each rank generates its own inputs independently and sizes could theoretically differ with variable-batch configurations.
3. **Human-readable formatting**: Sizes are displayed in GB when >= 1 GB, otherwise in MB, with 2 decimal places.
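The formatting rule in point 3 might look like the sketch below. `format_size` is a hypothetical name, and the binary unit (1 GB = 1024**3 bytes) is an assumption, since the description does not say whether binary or decimal units are used.

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count as GB when >= 1 GB, otherwise as MB (2 decimals)."""
    gigabytes = num_bytes / 1024**3  # assumption: binary units
    if gigabytes >= 1:
        return f"{gigabytes:.2f} GB"
    return f"{num_bytes / 1024**2:.2f} MB"


def total_size(batch_sizes_in_bytes: list) -> str:
    """Format the sum of per-batch sizes for the 'total input size' line."""
    return format_size(sum(batch_sizes_in_bytes))
```

For example, `format_size(512 * 1024**2)` yields `"512.00 MB"`, while ten batches of 1.5 GiB each total `"15.00 GB"`.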

## 3. Results
* benchmark

|short name                         |GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)|
|--|--|--|--|--|--|--|--|
|sparse_data_dist_base              |3230.91 ms       |3068.12 ms       |49.06 GB                |66.50 GB                   |68.97 GB          |0.0 / 0.0 / 0.0              |31.07 GB          |

New logging output (per rank):
```
Rank 0 batch 0 input size: 1.58 GB
...
Rank 0 total input size: 15.81 GB (10 batches)
Rank 1 batch 0 input size: 1.58 GB
...
Rank 1 total input size: 15.81 GB (10 batches)
```

* repro commands
```bash
# sparse_data_dist_base
buck2 run $GB200 fbcode//torchrec/distributed/benchmark:benchmark_train_pipeline -- \
  --yaml_config=fbcode/torchrec/distributed/benchmark/yaml/sparse_data_dist_base.yml
```

## 4. Analysis
1. **No behavioral impact**: This is a logging-only change. It does not modify any model inputs, sharding, or pipeline behavior. Zero risk to training correctness.
2. **Performance**: The `size_in_bytes()` calls are lightweight (summing `element_size() * numel()` over tensors). Negligible overhead compared to the benchmark itself.
3. **Observability**: With `sparse_data_dist_base.yml` config (batch_size=32768, 90 unweighted + 80 weighted features, pooling_avg=30), each batch is 1.58 GB. Total input across 10 batches is 15.81 GB per rank — this context is valuable when interpreting the 49 GB peak GPU memory allocation.

## 5. Changes
1. **`benchmark_train_pipeline.py`**: Added 17 lines after `generate_batches()` to log per-batch and total input sizes in human-readable format (MB/GB) for all ranks.

Reviewed By: yupadhyay

Differential Revision: D95741625

fbshipit-source-id: 4044f93287c3aa376a4770fe6beba7501242b2de

Mar 9, 2026
a2000f1

v2026.03.02.00

Fix res_store_shards=None propagation causing video_udd_lsr conveyor failure

Summary:
The conveyor mvai/video_udd_lsr has been BLOCKED since R4352 (Feb 28) because
res_store_shards=None propagates through the config chain to a C++ pybind
constructor that expects int64_t, causing RuntimeError.

The None propagation chain:
1. configs.py: res_store_shards defaults to Optional[int] = None
2. model_family.py: unconditionally injects None into fused_params
3. batched_embedding_kernel.py: _populate_res_params() overwrites safe default (1) with None
4. training.py: passes None to SSD/DRAM KV C++ constructors expecting int

Fix A (model_family.py): Only inject res_store_shards into fused_params when
the config value is not None.

Fix B (batched_embedding_kernel.py): In _populate_res_params(), only overwrite
the safe default RESParams.res_store_shards (=1) if the fused_params value is
not None.
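Both fixes follow the same pattern: treat `None` as "not configured" and leave the safe default alone. The sketch below illustrates that pattern with hypothetical names (`ResParamsSketch`, `build_fused_params`, `populate_res_params`); it is not the actual TorchRec code.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ResParamsSketch:
    """Hypothetical stand-in for RESParams with its safe default."""

    res_store_shards: int = 1


def build_fused_params(res_store_shards: Optional[int]) -> Dict[str, Any]:
    """Fix A pattern: only inject the key when the config value is set."""
    fused_params: Dict[str, Any] = {}
    if res_store_shards is not None:
        fused_params["res_store_shards"] = res_store_shards
    return fused_params


def populate_res_params(fused_params: Dict[str, Any]) -> ResParamsSketch:
    """Fix B pattern: only overwrite the default with a non-None value."""
    params = ResParamsSketch()
    value = fused_params.get("res_store_shards")
    if value is not None:
        params.res_store_shards = value
    return params
```

Either guard alone keeps `None` away from an integer-typed constructor; applying both covers callers that bypass one of the two layers.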

SEVs: S625464 (closed), S627484 (in progress)

Reviewed By: catalinii

Differential Revision: D94837597

fbshipit-source-id: 604869e1f4d365345630a653196fc0ee01a4a100

Mar 1, 2026
3791c84

Tags: serimj98/torchrec