Checklist for Release #5


Closed
56 of 87 tasks
soumith opened this issue Aug 26, 2016 · 4 comments

Comments

@soumith
Member

soumith commented Aug 26, 2016

Core

Core Framework code

  • optim + trainer + dataset objects
  • Sharing CPU tensors
  • Add all operations to autograd
  • Free GIL when processing big tensors
  • Custom CUDA memory allocator (Sam Gross)
  • multi-GPU functions
  • nccl integration
  • finish legacy.nn
  • refactor C API for extensions
  • create an example extension with TH/pytorch C API
  • Checkpointing and improving torch.save / torch.load to use the same byte order (see the sketch after this list)
  • implement keyword arguments in cwrap
  • go over TH and try to make error messages more descriptive (e.g. if the sizes don't match)
  • Sparse Tensors on CPU and GPU (Zeming)
  • improve tensor printing
  • functional API for autograd variables
  • Finish multiple CUDA types (Soumith)
  • Add stochastic nodes
  • Add all modules to nn
  • Improved error messages (Error messages to improve #39)
  • sync legacy.nn with Lua nn
  • move Trainer to torch.experimental
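
For the checkpointing item above, a minimal sketch of the intended torch.save / torch.load round trip (the file name is illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# save: state dicts are plain dicts of tensors
torch.save({"model": model.state_dict(), "optim": optimizer.state_dict()}, "checkpoint.pt")

# load back into freshly constructed objects
state = torch.load("checkpoint.pt")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optim"])
```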

Operations

  • Integrate CuDNN
  • Write nn.LSTM*, nn.GRU* etc. to integrate CuDNN RNNs
  • Rewrite LookupTable and SparseLinear to use SparseTensors

FBCode stuff

  • Import into FBCode

Open Source stuff

  • Binary builds
  • Continuous builds for CUDA
  • MNIST and ResNet18 continuous builds
  • pip wheels

Backward Compatibility

Lua Bridge

  • integrate lutorpy into pytorch (either as an optional package or by default)
    • change TH indexing to 0-based and add the 1-based offset subtraction/addition to cwrap

Model Loading

Framework Integration

  • Caffe2 Integration
    • Modify TH / THC / THNN / THCUNN to integrate them
    • Have a converter that takes in a (Module and input) or (output) and auto-converts it to a Caffe model
      • and vice versa: take a Caffe protobuf and codegen a Python class that loads the weights
  • Keras Integration
    • Have a keras backend. Send in a Pull Request to fchollet/keras
  • Converting models between TF and Pytorch
    • Torch2TF: pretty much the same as the Caffe converter!
    • TF2Torch: same as the Caffe converter, but also cover ops like tf.if and tf.while

Website

  • Find someone to design and code it
  • Getting Started
    • Binary installs
      • Anaconda-based, which links automatically against MKL
      • Each of them for different CUDA versions: 7.0, 7.5, 8.0
    • Source-based installs
  • Showcase Demos / Examples / ModelZoo elegantly
  • Tutorials
  • Look at gym.openai.com (http://gym.openai.com/)
  • Developer docs

Documentation, Demos, Examples, Tutorials, ModelZoo

Demos / Examples / ModelZoo

  • Pre-trained models for each demo (in the model zoo)
    • Create a Python wrapper that allows searching for and downloading models (like NLTK); see the sketch after this list
  • Simple API for retraining / using pre-trained models on custom datasets
  • Documentation on how to modify the example for one's own experiments
  • Most or all of them should be multi-GPU ready
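
As an illustration of the model-download API this list asks for, the call pattern that torchvision's model zoo later provided looks like this (torchvision is used here only as an example of the target experience, not as part of this checklist):

```python
import torch
import torchvision.models as models

# download (and cache) pretrained weights, then swap the head to retrain on a custom dataset
model = models.resnet18(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # e.g. a 10-class custom task
```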

Demos + Examples

  • Basic
  • Vision
    • Supervised
      • fb.resnet.torch / googlenet for image classification (sam)
      • fastrcnn (francisco)
      • Video Classification
      • NeuralTalk2 (paszke)
      • Visual Q&A (paszke)
    • Unsupervised
      • Image super-resolution (waifu2x) (soumith)
      • DCGANs + Improved Training for GANs + InfoGAN
      • Text 2 Image (soumith)
      • Pixel RNNs (soumith)
      • Variational AutoEncoders (joost)
  • Games / RL (ludc)
  • NLP / Text
  • Metalearning
    • Neural Turing Machine
    • Learning to Learn by Gradient Descent by Gradient Descent
    • Decoupled Neural Interfaces using Synthetic Gradients https://arxiv.org/abs/1608.05343
  • ConvNet-Benchmarks / DeepMark scripts

Tutorials

Documentation

  • Auto-generate from source / docstrings

Links

Postponed for next release

  • lazy forward execution engine
  • double backprop
  • Sharing CUDA tensors
  • look into Cython
  • a built-in profiler for forward/backward (with automatic hints for speeding up the execution?); see the sketch below
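
For the profiler item above, a rough sketch of the kind of forward/backward profiling being proposed, written against the torch.profiler API that eventually shipped (used here purely for illustration, not as the design in this checklist):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(128, 64)
x = torch.randn(32, 128)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    y = model(x).sum()   # forward
    y.backward()         # backward
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```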

AIViz Integration

  • Have an initial attempt, and talk to Allan

  • Serveable via some Python REST API

  • figure out details for images and videos

  • Audio

  • wav2letter

  • DeepSpeech2 for maybe Switchboard or something (Ask Gabriel)

  • Sparse Models (ads?)

Distributed Training

  • simple distributed trainer like torch-distlearn / torch-ipc / torch-thrift

  • Synchronous, asynchronous and Elastic SGD

  • Integrate with Andrew / Yangqing's MPI library when that's ready

  • Port image classification and seq2seq to this

  • error handling

    • create a dict for translating exceptions and adding some PyTorch-specific info, sort of like in Elm [1,2]
    • make sure there's a clear error message when multiprocessing runs out of file descriptors (see the sketch below)
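
Related to the multiprocessing item above and the "Sharing CPU tensors" item in the Core list, a minimal sketch of CPU tensor sharing with torch.multiprocessing:

```python
import torch
import torch.multiprocessing as mp

def worker(shared):
    shared += 1  # in-place update is visible to the parent: the storage is shared, not copied

if __name__ == "__main__":
    t = torch.zeros(4)
    t.share_memory_()                      # move the tensor's storage into shared memory
    p = mp.Process(target=worker, args=(t,))
    p.start()
    p.join()
    print(t)                               # tensor([1., 1., 1., 1.])
```
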
@pltrdy

pltrdy commented Apr 24, 2017

@soumith I see that Neural Turing Machine was on the list, anything planned about it atm?

@soumith
Member Author

soumith commented Apr 24, 2017

no, nothing planned.

@jekbradbury
Contributor

there's a dynamic neural computer implementation that may or may not be complete/fully functional, but it looks like a good guide to implementing NTM-like models: https://github.com/ypxie/pytorch-NeuCom

@pltrdy

pltrdy commented Apr 28, 2017

@jekbradbury thx for the link

kwen2501 added a commit that referenced this issue Nov 4, 2024
…g timeout"


### Motivation
Today, the watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with the user's program or library stack.

### This PR
This PR gives the watchdog the ability to report the call-time stack of the collective, so that it is easier to trace the error back to the program's behavior.

The call-time stack is recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by zdevito). In `ProcessGroupNCCL`, we only track / report the Python part, so that it fits most PyTorch users.

### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).

```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
Setting `TORCH_NCCL_TRACE_BUFFER_SIZE` turns on the Flight Recorder.

Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: 
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: 
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```

From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks diverge.
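
For reference, a minimal sketch of a mismatched-collective program along these lines (the actual stack_demo.py gist may differ; the helpers `bar`/`baz`/`foo` here are hypothetical and just mirror the stack traces above):

```python
# repro_sketch.py -- hypothetical sketch, not the exact gist contents
import datetime
import torch
import torch.distributed as dist

def bar(t):
    dist.all_reduce(t)                       # what rank 0 ends up calling

def baz(out, t):
    dist.all_gather_into_tensor(out, t)      # what rank 1 ends up calling

def foo(rank, t, out):
    if rank == 0:
        bar(t)
    else:
        baz(out, t)                          # mismatched collectives -> watchdog timeout

def main():
    dist.init_process_group("nccl", timeout=datetime.timedelta(seconds=5))
    rank = dist.get_rank()
    torch.cuda.set_device(rank)              # assumes one GPU per rank on a single node
    t = torch.ones(200, device="cuda")
    out = torch.empty(200 * dist.get_world_size(), device="cuda")
    foo(rank, t, out)

if __name__ == "__main__":
    main()
```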

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this issue Nov 5, 2024
…g timeout"
kwen2501 added a commit that referenced this issue Nov 5, 2024
…g timeout"
pytorchmergebot pushed a commit that referenced this issue Nov 5, 2024
…139659)

Pull Request resolved: #139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
atalman pushed a commit to atalman/pytorch that referenced this issue Nov 11, 2024
…ytorch#139659)
pytorchmergebot pushed a commit that referenced this issue Nov 22, 2024
See #140725 (comment)
Running `torch.mps.synchronize()` after a Metal kernel resulted in an infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`:
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```
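
A minimal sketch of the call pattern involved (assuming the fix is in place; this is not the exact custom-kernel repro from the linked comment):

```python
import torch

x = torch.randn(1024, device="mps")
y = (x * 2).sum()          # enqueue work on the MPS command buffer
torch.mps.synchronize()    # previously this wait could hang in [_MTLCommandBuffer waitUntilCompleted]
print(y.item())
```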

Pull Request resolved: #141296
Approved by: https://github.com/huydhn
youssef62 pushed a commit to youssef62/pytorch that referenced this issue Nov 23, 2024
See pytorch#140725 (comment)
gglin001 added a commit to gglin001/pytorch that referenced this issue Nov 27, 2024
Ryo-not-rio pushed a commit to Ryo-not-rio/pytorch that referenced this issue Dec 2, 2024
…ytorch#139659)
Ryo-not-rio pushed a commit to Ryo-not-rio/pytorch that referenced this issue Dec 2, 2024
See pytorch#140725 (comment)
pytorch-bot bot pushed a commit that referenced this issue Dec 5, 2024
pobin6 pushed a commit to pobin6/pytorch that referenced this issue Dec 5, 2024
…ytorch#139659)
pobin6 pushed a commit to pobin6/pytorch that referenced this issue Dec 5, 2024
See pytorch#140725 (comment)
pytorchmergebot pushed a commit that referenced this issue Dec 23, 2024
# Motivation
Fix #143543

# Solution
We should raise a Python exception instead of aborting.

# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
  what():  device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
    return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```
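
With the fix, the failure surfaces as an ordinary Python exception, so callers can catch it; a small usage sketch (the out-of-range index here is deliberate and illustrative):

```python
import torch

idx = torch.accelerator.device_count()  # one past the last valid index
try:
    stream = torch.accelerator.current_stream(idx)
except RuntimeError as err:
    print(f"invalid device index {idx}: {err}")
```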

Pull Request resolved: #143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
drisspg added a commit that referenced this issue Jan 15, 2025
…ention"


Thanks to manman-ren, who verified that triton-lang/triton#4247 fixes this issue as well. This is not currently cherry-picked into pytorch-triton.

========= COMPUTE-SANITIZER
Test completed successfully!
========= ERROR SUMMARY: 0 errors
## NOTE:
Hmm, very interestingly: if the og_head_dim is odd, this works as expected. However, when the og_head_dim is a multiple of 2, it segfaults here:
```Shell
(lldb) bt
* thread #67, name = 'pt_autograd_0', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x10)
  * frame #0: 0x00007ffed327fbfe libtriton.so`scheduleRemainingToLastStage(forOp=ForOp @ 0x00007ffcafdfd658, schedule=0x00007ffcafdfd9e0, afterPrologue=<unavailable>, numStages=2) at MatmulLoopPipeline.cpp:893:9
    frame #1: 0x00007ffed328d970 libtriton.so`mlir::triton::preProcessLoopAndGetSchedule(forOp=0x00007ffcafdfddc0, numStages=2, options=0x00007ffcafdfde80) at MatmulLoopPipeline.cpp:1230:31
    frame #2: 0x00007ffed32a6a43 libtriton.so`mlir::triton::gpu::PipelinePass::runOnOperation() [inlined] pipelineLoop(numStages=2, forOp=ForOp @ 0x00007ffcafdfddc0) at SoftwarePipeliner.cpp:79:47
    frame #3: 0x00007ffed32a6998 libtriton.so`mlir::triton::gpu::PipelinePass::runOnOperation(this=0x00007ffc54767f10) at SoftwarePipeliner.cpp:125:36
    frame #4: 0x00007ffed385147c libtriton.so`mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 700
    frame #5: 0x00007ffed3851df2 libtriton.so`mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 354
    frame #6: 0x00007ffed385481c libtriton.so`mlir::PassManager::run(mlir::Operation*) + 876
    frame #7: 0x00007ffed3542bad libtriton.so`<lambda(mlir::PassManager&, mlir::ModuleOp&)>::operator(self=<unavailable>, mod=0x00007ffc54579280, __closure=<unavailable>)(mlir::PassManager &, mlir::ModuleOp &) at ir.cc:1625:19
    frame #8: 0x00007ffed3560108 libtriton.so`_FUN [inlined] operator(this=0x0000000000000000, call=0x00007ffcafdfe6e0) at cast.h:1480:37
    frame #9: 0x00007ffed35600f0 libtriton.so`_FUN((null)=0x00007ffcafdfe6e0) at pybind11.h:224:21
    frame #10: 0x00007ffed9ae5590 libtriton.so`typeinfo for pybind11::handle + 24
    frame #11: 0x00007ffed9ae5590 libtriton.so`typeinfo for pybind11::handle + 24
```




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
drisspg added a commit that referenced this issue Jan 15, 2025
c-p-i-o added a commit to c-p-i-o/pytorch that referenced this issue Jan 23, 2025
Summary:
Fix memory leak on shutdown when socket is closed.
We still need to free the buffer to make valgrind happy.

Test Plan:
Use `mtiavm`.
Repro steps provided by cristianlume.

1. Build
```
buck2 run //mtia/vm:athena-amodel-usd-owl-rank-
```
2. Run 2 VMs

on window 1:
```
mtiavm ssh --vm=0 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2
```
on window 2:
```
mtiavm ssh --vm=1 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 --rank=1 --store_host=172.16.1.1
```


without the fix:
```
==8766==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 8000 byte(s) in 2 object(s) allocated from:
    #0 0x5696fe in malloc (/data/users/cpio/fbsource/buck-out/v2/gen/fbcode/d4f2c81239ceac96/neteng/ai/rdma_gen/__rdma_gen__/rdma_gen+0x5696fe)
    pytorch#1 0x7faa8d40c47b in c10d::detail::UvTcpSocket::alloc_buffer(uv_handle_s*, unsigned long, uv_buf_t*) fbcode/caffe2/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:121
    pytorch#2 0x7faa6f62316d in uv__read /home/engshare/third-party2/libuv/1.34.2/src/libuv-v1.34.2/src/unix/stream.c:1143:5
    pytorch#3 0x7faa6f6239ef in uv__stream_io /home/engshare/third-party2/libuv/1.34.2/src/libuv-v1.34.2/src/unix/stream.c:1306:5
    pytorch#4 0x7faa6f62941f in uv__io_poll /home/engshare/third-party2/libuv/1.34.2/src/libuv-v1.34.2/src/unix/linux-core.c:431:11
    pytorch#5 0x7faa6f618629 in uv_run /home/engshare/third-party2/libuv/1.34.2/src/libuv-v1.34.2/src/unix/core.c:375:5
    pytorch#6 0x7faa8d3e7320 in c10d::detail::LibUVStoreDaemon::run() fbcode/caffe2/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:1216
    pytorch#7 0x7faa8d3bc933 in void std::__invoke_impl<void, void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>(std::__invoke_memfun_deref, void (c10d::detail::BackgroundThread::*&&)(), c10d::detail::BackgroundThread*&&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:74
    pytorch#8 0x7faa8d3bc80c in std::__invoke_result<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>::type std::__invoke<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>(void (c10d::detail::BackgroundThread::*&&)(), c10d::detail::BackgroundThread*&&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:96
    pytorch#9 0x7faa8d3bc7e1 in void std::thread::_Invoker<std::tuple<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>>::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:253
    pytorch#10 0x7faa8d3bc7a4 in std::thread::_Invoker<std::tuple<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>>::operator()() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:260
    pytorch#11 0x7faa8d3bc608 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:211
    pytorch#12 0x7faa436df5b4 in execute_native_thread_routine (/usr/local/fbcode/platform010/lib/libstdc++.so.6+0xdf5b4) (BuildId: 14a4eafe0cdc86af9a949a6c0c27bf21a033e047)
    pytorch#13 0x56744a in asan_thread_start(void*) ubsan.c
    pytorch#14 0x7faa43b2cf5b in __GI___clone3 (/usr/local/fbcode/platform010/lib/libc.so.6+0x12cf5b) (BuildId: 93cdceeb8322234c38e1f2c93ad0ff10c7632fa6)
```
With the fix, there is no leak.

Differential Revision: D68566104
akashveramd pushed a commit to akashveramd/pytorch that referenced this issue Apr 9, 2025
akashveramd pushed a commit to akashveramd/pytorch that referenced this issue Apr 9, 2025
akashveramd pushed a commit to akashveramd/pytorch that referenced this issue Apr 9, 2025
akashveramd pushed a commit to akashveramd/pytorch that referenced this issue Apr 9, 2025
* Squashed 'src/composable_kernel/' content from commit f6edda6

git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda6

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from f6edda6..5781adf

5781adf Update develop (pytorch#5) (pytorch#6)
97e6d51 Merge pull request pytorch#4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41 refactor
49c33aa refactor
54b3e73 rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf

* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert changes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make inner product compatible on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

Co-authored-by: JD <[email protected]>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix

Co-authored-by: JD <[email protected]>
akashveramd pushed a commit to akashveramd/pytorch that referenced this issue Apr 9, 2025
…duction (pytorch#1156)

* Squashed 'src/composable_kernel/' content from commit f6edda6

git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda6

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from f6edda6..5781adf

5781adf Update develop (pytorch#5) (pytorch#6)
97e6d51 Merge pull request pytorch#4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41 refactor
49c33aa refactor
54b3e73 rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf

* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* tidy

* add tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert changes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make inner product compatible on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

Co-authored-by: JD <[email protected]>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix

* Squashed 'src/composable_kernel/' changes from 5781adf..31b4035

31b4035 Merge pull request pytorch#16 from ROCmSoftwarePlatform/develop
b62bf8c Merge pull request pytorch#14 from ROCmSoftwarePlatform/miopen_downstream_init_integration
ccc4a1d Merge pull request pytorch#8 from ROCmSoftwarePlatform/miopen_downstream_init_integration
67ad47e refactor
16effa7 refactor
a91b68d DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
2cbabbb use int instead of index_t in kernel wrapper
0834bc7 compiler parameter use stream
f2ac783 make innner product compatiable on gfx900
4e57b30 rename
c03045c rename
b258995 update CK build script
2c48039 fix kernel filename
d626dcc fix enum issue
643ebd4 tidy
ddd49ec fix clang warning suppression
4f566c6 vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
172036d add c-style pointer cast
76f3131 tidy
d184289 tidy
f885c13 tidy
80120f0 tidy
c3efeb5 tidy
56fc084 tidy
54fba51 tidy
e62bae7 tidy
24c8728 add tidy
61487e0 fix
ae98b52 remove online compilation from CK
cb95421 refactor
73ca970 Merge commit '437cc595c6e206dfebb118985b5171bbc1e29eab' into composable_kernel_init_integration_v3
3b86646 Merge pull request pytorch#7 from ROCmSoftwarePlatform/master
d09ea4f Update develop (pytorch#5)
3d32ae9 add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

git-subtree-dir: src/composable_kernel
git-subtree-split: 31b4035

* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel

* Fix with regard to implementing GetZeroVal() in both kernel and host

* Avoid converting to compType from dstDataType before writing the output value

* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator

* Add CONSTANT decorator for descriptor read buffer

* Use get_thread_local_1d_id() for thread local Id

* Rename GetZeroVal() to GetReductionZeroVal() in the kernels

* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp

* Occasional tiny simplification and update in the kernel files

* Update in src/reducetensor.cpp for consistent IDs passing to the kernel

* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers

* Update to remove OpenCL tidy checking failures

* Small updates in src/reducetensor.cpp

* Update for better readability

* Remove unused codes and not-needed template parameters in the kernel wrappers

Co-authored-by: Chao Liu <[email protected]>
Co-authored-by: JD <[email protected]>