Tags: pytorch/pytorch

viable/strict/1761375054

[Pytorch] Enable autovec on aarch64 for type conversion (#166049)

Summary:
Implement an autovec template for type conversions on aarch64 NEON.

Generated code can be seen here: https://godbolt.org/z/1K6T1d9TE
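
For a rough sense of what the autovec path covers, here is a minimal, hypothetical conversion loop (not the actual ATen template): a plain element-wise cast that clang can auto-vectorize for NEON/SVE2 when built with -O3 and -march=armv9-a+sve2.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not the ATen code. With no aliasing (__restrict) and
// no loop-carried dependence, the compiler can emit packed convert/narrow
// instructions for this loop instead of per-element scalar conversions.
void convert_float_to_uint8(const float* __restrict src,
                            uint8_t* __restrict dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<uint8_t>(src[i]);
  }
}
```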

We've seen significant performance improvements for conversions to and from bytes when compiling with clang and -march=armv9-a+sve2:

| Conversion           | Before    | After     | Throughput   |
|----------------------|-----------|-----------|--------------|
| float->uint8->float  | 683.212us | 198.204us | 245% higher  |
| float->int8->float   | 687.846us | 200.241us | 244% higher  |
| int32->uint8->int32  | 497.121us | 197.970us | 151% higher  |
| int32->int8->int32   | 481.889us | 198.206us | 143% higher  |

Test Plan:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Differential Revision: D85213420

Pull Request resolved: #166049
Approved by: https://github.com/ezyang, https://github.com/mcfi, https://github.com/aditew01

viable/strict/1761373581

[MPS] Migrate `angle` to Metal ops (#166210)

Pull Request resolved: #166210
Approved by: https://github.com/Skylion007

viable/strict/1761371385

[dynamo] Remove unnecessary NAME_MATCH guard (#166112)

Pull Request resolved: #166112
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166155

viable/strict/1761369889

[OpenReg] Remove the Unnecessary Fallback Implementation for AutogradPrivateUse1 (#165316)

As the title states.

The fallback for AutogradPrivateUse1 is built into PyTorch, so there is no need to register a general implementation for out-of-tree backends.
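
For reference, a minimal sketch of the kind of catch-all registration this makes redundant (illustrative, not the exact OpenReg source):

```cpp
#include <torch/library.h>

// Register a fallthrough for every op under the AutogradPrivateUse1 key.
// PyTorch now provides this behavior built in, so an out-of-tree backend
// no longer needs to do it itself.
TORCH_LIBRARY_IMPL(_, AutogradPrivateUse1, m) {
  m.fallback(torch::CppFunction::makeFallthrough());
}
```
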
Pull Request resolved: #165316
Approved by: https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #165315

viable/strict/1761367802

Fix race condition and make CUDA kthvalue deterministic (#165762)

The gatherKthValue kernel had a race condition where multiple threads could write to the same output location without synchronization when duplicate k-th values exist, resulting in non-deterministic output.

Changes:
- aten/src/ATen/native/cuda/Sorting.cu: Use atomicMin on a shared-memory slot to deterministically find the minimum index (see the sketch after this list). Add early termination and remove redundant inRange checks. (We have to cast the index to `int32_t`, but it is already assumed to fit earlier in the kernel.)
- aten/src/ATen/native/cuda/Sorting.cpp: Remove non-deterministic alert since kthvalue is now deterministic on CUDA.
- torch/__init__.py: Remove kthvalue from non-deterministic operations list and remove kthvalue example from use_deterministic_algorithms() docstring.
- test/test_torch.py: Remove test_nondeterministic_alert_kthvalue since kthvalue no longer raises alerts on CUDA.
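
A minimal sketch of the shared-memory atomicMin idea (hypothetical single-block kernel, not the actual Sorting.cu code):

```cpp
#include <climits>

// Every thread that finds an element equal to the k-th value proposes its
// index; atomicMin keeps the smallest one regardless of thread scheduling,
// so the winning index is deterministic even with duplicate values.
__global__ void min_index_of_kth_value(const float* data, int n,
                                       float kth_value, int* out_index) {
  __shared__ int s_min_index;
  if (threadIdx.x == 0) s_min_index = INT_MAX;
  __syncthreads();

  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    if (data[i] == kth_value) {
      atomicMin(&s_min_index, i);  // smallest matching index wins
    }
  }
  __syncthreads();

  if (threadIdx.x == 0) {
    *out_index = s_min_index;  // single non-racy write by thread 0
  }
}
```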

Benefits:
- Deterministic: always returns minimum index when duplicates exist
- Potential performance improvement on large arrays with repetitions

Test Results:
- All existing PyTorch tests pass (test_kthvalue)
- Custom determinism tests confirm consistent results
- Custom CUDA vs CPU correctness validated across 50+ scenarios
- Custom performance benchmarks show improvements with no visible regressions

Addresses #165227

Pull Request resolved: #165762
Approved by: https://github.com/ngimel, https://github.com/eqy

viable/strict/1761366338

[10/N] Apply ruff UP035 rule (#165709)

This is a follow-up to #165515. The ruff `UP035` rule is applied to dynamo code to use Python 3.10+ typing (e.g., preferring `collections.abc.Callable` over the deprecated `typing.Callable`).

Pull Request resolved: #165709
Approved by: https://github.com/ezyang

trunk/7924e3aacf49e180a1b8100adef38b0b24b5c22b

Remove likely unnecessary _EXPAND trick for non-windows in HIDDEN_NAMESPACE_BEGIN (#166203)

I've learned that the _EXPAND trick is needed mostly for an MSVC quirk: its traditional preprocessor does not expand macro arguments properly when they are passed to a nested macro. I tested locally on Linux and suspect we don't need _EXPAND on non-Windows. This PR is BE to keep only what we need and remove what we don't, but I'm also okay not landing it if @malfet tells me the quirk goes beyond MSVC.
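
A minimal illustration of the quirk (names are made up, not the actual HIDDEN_NAMESPACE macros): classic MSVC passes `__VA_ARGS__` to a nested macro as a single preprocessing token, so argument-counting macros miscount unless the call is routed through an extra expansion.

```cpp
#define MY_EXPAND(x) x
#define PICK_THIRD(a, b, c, ...) c

// On conforming preprocessors (GCC, Clang, MSVC with /Zc:preprocessor) the
// MY_EXPAND wrapper is redundant; MSVC's traditional preprocessor needs it
// so that __VA_ARGS__ is re-scanned into separate arguments.
#define COUNT_ARGS(...) MY_EXPAND(PICK_THIRD(__VA_ARGS__, 2, 1, 0))

static_assert(COUNT_ARGS(x) == 1, "one argument");
static_assert(COUNT_ARGS(x, y) == 2, "two arguments");
```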

Pull Request resolved: #166203
Approved by: https://github.com/malfet
ghstack dependencies: #166076, #166077, #166078, #166079

trunk/003601a70da8d3e9afec0a84a2c6395990e8ed41

Set prefer_deferred_runtime_asserts_over_guards to True (#165820)

Set prefer_deferred_runtime_asserts_over_guards to True, and add a flag to control the behavior, just in case.

This option enables exporting the gemma3 model with transformers==4.57. I am not sure how best to test it, though.

Pull Request resolved: #165820
Approved by: https://github.com/titaiwangms

trunk/761f946043aa9ca9696da2f0042a3dd36f9e7e1e

[ROCm] new implementation of upsample_bilinear2d_backward (#164572)

Changed the implementation from an output-based approach to an input-based one to remove `atomicAdd` operations, and it appears to deliver at least a 20× speedup.

The changes are from Yu-Yun <[email protected]>.

# Summary: Refactor of the implementation of the `upsample_bilinear2d_backward` operation on MI300X/MI325X
- The original "scatter-add" approach
  - Each thread, representing an output pixel, scattered gradient contributions to four input pixels, using costly atomic operations on MI300X/MI325X GPUs.
- The new "gather-sum" approach
  - Each thread is responsible for a single input pixel and gathers all relevant gradient contributions from a small, calculated region of the output tensor (done by the `compute_output_range` device function).
# Breakdown of the code changes
- Inversion of the parallelization strategy of the kernel function `upsample_bilinear2d_backward_out_frame`
  - Originally, the main kernel loop was parallelized over the number of elements in the output gradient tensor (`const size_t o_numel = nc * width2 * height2;`).
    - Each thread processed one output pixel.
  - The new loop is parallelized over the number of elements in the input gradient tensor (`const size_t i_numel = nc * height1 * width1;`).
    - Each thread is responsible for calculating the final gradient for a single input pixel.
  - The kernel launch changes accordingly in the function `upsample_bilinear2d_backward_out_cuda_template`.
- Added a device function that calculates the range of output pixels that could have used the input pixel (`input_pos`) during the forward-pass interpolation
  - This is essentially the mathematical inverse of the forward pass.
  - This function tries to prune a thread's search space so that it only needs to inspect a small, local window of the output tensor.
- Gradient calculation approach switching from "scatter-add" to "gather-sum" (see the sketch after this list)
  - Scatter-add
    - For each output pixel, the thread calculated 4 gradient contributions and used `fastAtomicAdd` 4 times to add these values to 4 different (and potentially highly contended) memory locations in the input gradient tensor.
  - Gather-sum
    - A thread responsible for one input pixel calls `compute_output_range` to determine the small rectangular region of output pixels that influence the input's final gradient value.
    - The thread iterates through this region, and for each output pixel in the region, it re-calculates the interpolation weights to determine the exact contribution to its specific input pixel.
    - All these contributions are accumulated into a private, per-thread register variable (`accscalar_t grad_sum = 0;`).
      - Without any global memory access, this accumulation is extremely fast.
    - When the loops are done, the thread performs a single, direct (non-atomic) write of the final summed gradient to its designated location in global memory (`idata[index] = static_cast<scalar_t>(grad_sum);`).
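
A simplified 1-D sketch of the gather-sum pattern (hypothetical and illustrative only; the actual kernel is 2-D and handles align_corners and scale offsets):

```cpp
#include <math.h>

// Each thread owns one *input* gradient element, scans the small range of
// output positions that could have sampled it in the forward pass,
// re-derives the interpolation weights, and does one non-atomic write.
__global__ void upsample_linear1d_backward_gather(
    const float* grad_out, float* grad_in,
    int in_size, int out_size,
    float scale /* = in_size / out_size */) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // input index
  if (i >= in_size) return;

  // Inverse of the forward mapping src = o * scale: input i receives
  // contributions only from outputs whose src lies in (i - 1, i + 1).
  int o_begin = max(0, static_cast<int>(ceilf((i - 1) / scale)));
  int o_end   = min(out_size - 1, static_cast<int>(floorf((i + 1) / scale)));

  float grad_sum = 0.0f;  // per-thread register accumulator
  for (int o = o_begin; o <= o_end; ++o) {
    float src = o * scale;            // forward-pass source position
    int i0 = static_cast<int>(src);   // left neighbor
    float w1 = src - i0;              // weight of the right neighbor
    if (i == i0)          grad_sum += (1.0f - w1) * grad_out[o];
    else if (i == i0 + 1) grad_sum += w1 * grad_out[o];
  }
  grad_in[i] = grad_sum;  // single coalesced, non-atomic write
}
```
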
# Why performance gets boosted
- Analysis of the root cause of performance drop
  - Ref. (internal only) - https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1140493327/PyTorch__upsample_bilinear2d_backward
- First and foremost, elimination of the contention of atomic operations
  - Many parallel threads called `atomicAdd` frequently attempting to update the exact same memory location in the input gradient tensor at the same time.
    - The GPU's memory controller has to serialize these operations, effectively nullifying the benefit of parallelism at those contention points.
  - The chiplet-based CDNA 3 architecture of MI300X/MI325X amplified the issue.
    - When contending threads reside on different XCDs, resolving the atomic operation requires high-latency coherence traffic across the Infinity Fabric interconnect.
  - The implementation change eliminates hardware-level serialization and cross-chiplet coherence traffic caused by many `atomicAdd`.
- Improved memory access pattern and locality
  - Write coalescing
    - The regular sum writes `idata[index] = static_cast<scalar_t>(grad_sum);` can be perfectly coalesced by GPUs.
  - Read locality
    - Even though there are many (potentially repeated) reads from the output tensor (`static_cast<accscalar_t>(odata[output_idx])`), these are highly cache-friendly, meaning the data for one thread is likely to be in the L1 or L2 cache already due to an access from a neighboring thread.
- Trade-off: computation for memory synchronization
  - The recalculation of interpolation weights fits well on high-computational-throughput modern GPUs like MI300X/MI325X.
  - Removal of atomic operations avoids expensive memory synchronization.

---

Optimizations of `grid_sampler_2d_backward` will be addressed in a separate PR.
Doc for reference: (internal only) https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1162750701/PyTorch__grid_sampler_2d_backward

Pull Request resolved: #164572
Approved by: https://github.com/jeffdaily

Co-authored-by: Eli Uriegas <[email protected]>

trunk/661a56002f5f747712378752cc172fafab7ee000

[AI Codemod][DevmateFBSourceTestFailureBot] Fix for T241916639 ("Your diff, D84932408, broke one test") (#166168)

Reviewed By: XilunWu

Differential Revision: D84983164

Pull Request resolved: #166168
Approved by: https://github.com/H-Huang, https://github.com/fduwjj