Tags: pytorch/pytorch
[Pytorch] Enable autovec on aarch64 for type conversion (#166049)
Summary: Implementing autovec template for type conversions on aarch64-NEON. Generated code can be seen here: https://godbolt.org/z/1K6T1d9TE
We've seen significant performance improvements for converting to and from bytes, compiling using clang with -march=armv9-a+sve2:
Before:
float->uint8->float ===> 683.212us
float->int8->float ===> 687.846us
int32->uint8->int32 ===> 497.121us
int32->int8->int32 ===> 481.889us
After:
float->uint8->float ===> 198.204us ----> 245% higher throughput
float->int8->float ===> 200.241us ----> 244% higher throughput
int32->uint8->int32 ===> 197.970us ----> 151% higher throughput
int32->int8->int32 ===> 198.206us ----> 143% higher throughput
Test Plan:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Differential Revision: D85213420
Pull Request resolved: #166049
Approved by: https://github.com/ezyang, https://github.com/mcfi, https://github.com/aditew01
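Editorial note: the loops below are a minimal, hypothetical C++ sketch of the kind of scalar conversion code that clang can auto-vectorize into NEON/SVE2 narrowing and widening converts when built with -O3 -march=armv9-a+sve2. The function names are not from the PR; the actual change lives in PyTorch's vectorized conversion templates.

```cpp
#include <cstddef>
#include <cstdint>

// Round trip measured above: float -> uint8 -> float.
// Assumes the float values are representable in uint8_t; out-of-range
// float-to-integer conversion is undefined behavior in C++.
void convert_float_to_uint8(const float* src, uint8_t* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<uint8_t>(src[i]);  // narrowing convert, auto-vectorizable
  }
}

void convert_uint8_to_float(const uint8_t* src, float* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<float>(src[i]);  // widening convert, auto-vectorizable
  }
}
```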
[MPS] Migrate `angle` to Metal ops (#166210) Pull Request resolved: #166210 Approved by: https://github.com/Skylion007
[dynamo] Remove unnecessary NAME_MATCH guard (#166112) Pull Request resolved: #166112 Approved by: https://github.com/Lucaskabela ghstack dependencies: #166155
[OpenReg] Remove the Unnecessary Fallback Implementation for AutogradPrivateUse1 (#165316) As the title states. The fallback for AutogradPrivateUse1 is built into PyTorch, so there is no need to register a general implementation for an out-of-tree backend. Pull Request resolved: #165316 Approved by: https://github.com/ezyang, https://github.com/albanD ghstack dependencies: #165315
Fix race condition and make CUDA kthvalue deterministic (#165762)
The gatherKthValue kernel had a race condition where multiple threads could write to the same output location without synchronization when duplicate k-th values exist, resulting in non-deterministic output.
Changes:
- aten/src/ATen/native/cuda/Sorting.cu: Use atomicMin with shared memory to deterministically find the minimum index. Add early termination and remove redundant inRange checks. (We have to cast the index to `int32_t`, but this is already assumed to fit earlier in the kernel.)
- aten/src/ATen/native/cuda/Sorting.cpp: Remove the non-deterministic alert since kthvalue is now deterministic on CUDA.
- torch/__init__.py: Remove kthvalue from the non-deterministic operations list and remove the kthvalue example from the use_deterministic_algorithms() docstring.
- test/test_torch.py: Remove test_nondeterministic_alert_kthvalue since kthvalue no longer raises alerts on CUDA.
Benefits:
- Deterministic: always returns the minimum index when duplicates exist
- Potential performance improvement on large arrays with repetitions
Test Results:
- All existing PyTorch tests pass (test_kthvalue)
- Custom determinism tests confirm consistent results
- Custom CUDA vs CPU correctness validated across 50+ scenarios
- Custom performance benchmarks show improvements with no visible regressions
Addresses #165227
Pull Request resolved: #165762
Approved by: https://github.com/ngimel, https://github.com/eqy
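Editorial note: not the CUDA kernel itself, but a minimal host-side C++ analogue of the "atomicMin on the candidate index" idea (the function name is made up). Each worker that finds a position holding the k-th value proposes its index, and the smallest index wins via an atomic min emulated with a compare-and-swap loop, so the result no longer depends on which thread happens to write last.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <thread>
#include <vector>

// Returns the smallest index i with data[i] == kth_value, scanned by
// num_workers threads; the answer is deterministic regardless of scheduling.
int32_t min_matching_index(const std::vector<float>& data, float kth_value,
                           int num_workers) {
  std::atomic<int32_t> best(std::numeric_limits<int32_t>::max());
  std::vector<std::thread> workers;
  for (int w = 0; w < num_workers; ++w) {
    workers.emplace_back([&, w] {
      for (std::size_t i = w; i < data.size(); i += num_workers) {
        if (data[i] == kth_value) {
          int32_t idx = static_cast<int32_t>(i);
          // Emulate atomicMin with compare-and-swap (std::atomic has no
          // fetch_min before C++26).
          int32_t cur = best.load();
          while (idx < cur && !best.compare_exchange_weak(cur, idx)) {
          }
        }
      }
    });
  }
  for (auto& t : workers) {
    t.join();
  }
  return best.load();
}
```

In the CUDA kernel the commit describes, the same idea is applied with atomicMin on a shared-memory slot, so threads in a block agree on the minimum matching index before it is written out.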
[10/N] Apply ruff UP035 rule (#165709) This is a follow-up of #165515. ruff `UP035` rules are applied to dynamo code to use Py 3.10+ typing. Pull Request resolved: #165709 Approved by: https://github.com/ezyang
Remove likely unnecessary _EXPAND trick for non-windows in HIDDEN_NAMESPACE_BEGIN (#166203) I've learned that the EXPAND trick is needed mostly for an MSVC quirk to properly expand arguments. I tested on Linux locally and suspect that we don't need the _EXPAND for non-windows. This PR is BE to minimize what we need and remove what we don't, but I'm also okay not landing this if @malfet tells me that this quirk goes beyond MSVC. Pull Request resolved: #166203 Approved by: https://github.com/malfet ghstack dependencies: #166076, #166077, #166078, #166079
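Editorial note: a self-contained illustration of the MSVC quirk being referenced, using hypothetical macro names rather than the HIDDEN_NAMESPACE_BEGIN machinery itself. MSVC's traditional preprocessor forwards __VA_ARGS__ to an inner macro as a single argument, so argument-dispatching macros need an extra identity-macro expansion pass; GCC, Clang, and MSVC with /Zc:preprocessor do not.

```cpp
#include <iostream>

#define EXPAND(x) x
#define GET_SECOND_IMPL(a, b, ...) b

// Portable form: the EXPAND wrapper forces a re-scan so the traditional MSVC
// preprocessor splits __VA_ARGS__ back into separate arguments.
#define GET_SECOND(...) EXPAND(GET_SECOND_IMPL(__VA_ARGS__, 0))

// On GCC/Clang (and MSVC with /Zc:preprocessor) this simpler form suffices:
// #define GET_SECOND(...) GET_SECOND_IMPL(__VA_ARGS__, 0)

int main() {
  std::cout << GET_SECOND(10, 20, 30) << "\n";  // prints 20
  return 0;
}
```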
Set prefer_deferred_runtime_asserts_over_guards to True (#165820) Set prefer_deferred_runtime_asserts_over_guards to True and allow a flag to control the behavior, just in case. This option enables the gemma3 model export with transformers==4.57. I am not sure how best to test it though. Pull Request resolved: #165820 Approved by: https://github.com/titaiwangms
[ROCm] new implementation of upsample_bilinear2d_backward (#164572)
Changed the implementation from an output-based approach to an input-based one to remove `atomicAdd` operations, and it appears to deliver at least a 20× speedup. The changes are from Yu-Yun <[email protected]>.
# Summary: Refactor of the implementation of the `upsample_bilinear2d_backward` operation on MI300X/MI325X
- The original "scatter-add" approach
  - Each thread, representing an output pixel, scattered gradient contributions to four input pixels, using costly atomic operations on MI300X/MI325X GPUs.
- The new "gather-sum" approach
  - Each thread is responsible for a single input pixel and gathers all relevant gradient contributions from a small, calculated region of the output tensor (done by the `compute_output_range` device function).
# Breakdown of the code changes
- Inversion of the parallelization strategy of the kernel function `upsample_bilinear2d_backward_out_frame`
  - Originally, the main kernel loop was parallelized over the number of elements in the output gradient tensor (`const size_t o_numel = nc * width2 * height2;`). Each thread processed one output pixel.
  - The new loop is parallelized over the number of elements in the input gradient tensor (`const size_t i_numel = nc * height1 * width1;`). Each thread is responsible for calculating the final gradient for a single input pixel.
  - The kernel launch changes accordingly in the function `upsample_bilinear2d_backward_out_cuda_template`.
- Added a device function that calculates the range of output pixels that could have used the input pixel (`input_pos`) during the forward-pass interpolation (see the sketch after this entry)
  - This is essentially the mathematical inverse of the forward pass.
  - This function prunes a thread's search space so that it only needs to inspect a small, local window of the output tensor.
- Gradient calculation switches from "scatter-add" to "gather-sum"
  - Scatter-add: for each output pixel, the thread calculated 4 gradient contributions and used `fastAtomicAdd` 4 times to add these values to 4 different (and potentially highly contended) memory locations in the input gradient tensor.
  - Gather-sum: a thread responsible for one input pixel calls `compute_output_range` to determine the small rectangular region of output pixels that influence that input pixel's final gradient value. The thread iterates through this region and, for each output pixel in it, re-calculates the interpolation weights to determine the exact contribution to its specific input pixel. All these contributions are accumulated into a private, per-thread register variable (`accscalar_t grad_sum = 0;`); without any global memory access, this accumulation is extremely fast. When the loops are done, the thread performs a single, direct (non-atomic) write of the final summed gradient to its designated location in global memory (`idata[index] = static_cast<scalar_t>(grad_sum);`).
# Why performance gets boosted
- Analysis of the root cause of the performance drop
  - Ref. (internal only): https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1140493327/PyTorch__upsample_bilinear2d_backward
- First and foremost, elimination of the contention of atomic operations
  - Many parallel threads called `atomicAdd`, frequently attempting to update the exact same memory location in the input gradient tensor at the same time. The GPU's memory controller has to serialize these operations, effectively nullifying the benefit of parallel execution at those contention points.
  - The chiplet-based CDNA 3 architecture of MI300X/MI325X amplified the issue: when contending threads reside on different XCDs, resolving the atomic operation requires high-latency coherence traffic across the Infinity Fabric interconnect.
  - The implementation change eliminates the hardware-level serialization and cross-chiplet coherence traffic caused by many `atomicAdd` operations.
- Improved memory access pattern and locality
  - Write coalescing: the regular sum writes `idata[index] = static_cast<scalar_t>(grad_sum);` can be perfectly coalesced by the GPU.
  - Read locality: even though there are many (potentially repeated) reads from the output tensor (`static_cast<accscalar_t>(odata[output_idx])`), these are highly cache-friendly; the data one thread needs is likely already in the L1 or L2 cache due to an access from a neighboring thread.
- Trade-off: computation for memory synchronization
  - The recalculation of interpolation weights fits well on high-computational-throughput modern GPUs like MI300X/MI325X, and removing the atomic operations avoids expensive memory synchronization.
---
Optimizations of `grid_sampler_2d_backward` will be addressed in a separate PR. Doc for reference (internal only): https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1162750701/PyTorch__grid_sampler_2d_backward
Pull Request resolved: #164572
Approved by: https://github.com/jeffdaily
Co-authored-by: Eli Uriegas <[email protected]>
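Editorial note: to make the inverse mapping concrete, here is a minimal, 1-D, host-side C++ sketch of what a `compute_output_range`-style helper has to compute. It is not the PR's device function; it assumes align_corners=false, the source-index formula src(o) = scale * (o + 0.5) - 0.5 with scale = input_size / output_size, and it omits the clamping of negative source coordinates that the real kernel handles.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Bilinear interpolation at output index o reads input pixels floor(src(o))
// and floor(src(o)) + 1, so input pixel `input_pos` is touched exactly by the
// output pixels whose src(o) lies in (input_pos - 1, input_pos + 1). The
// backward "gather" loop only needs to visit that window.
void compute_output_range_1d(int64_t input_pos, double scale,
                             int64_t output_size,
                             int64_t& out_begin, int64_t& out_end) {
  // Solve scale * (o + 0.5) - 0.5 > input_pos - 1 and < input_pos + 1 for o.
  const double lo = (input_pos - 0.5) / scale - 0.5;  // exclusive lower bound
  const double hi = (input_pos + 1.5) / scale - 0.5;  // exclusive upper bound
  out_begin = std::max<int64_t>(0, static_cast<int64_t>(std::floor(lo)) + 1);
  out_end = std::min<int64_t>(output_size - 1,
                              static_cast<int64_t>(std::ceil(hi)) - 1);
}
```

Per the commit description, the thread for `input_pos` then iterates over this per-dimension range, recomputes the interpolation weights for each output pixel, accumulates into a register, and finally performs one non-atomic store.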
[AI Codemod][DevmateFBSourceTestFailureBot] Fix for T241916639 ("Your diff, D84932408, broke one test") (#166168)
Reviewed By: XilunWu
Differential Revision: D84983164
Pull Request resolved: #166168
Approved by: https://github.com/H-Huang, https://github.com/fduwjj