Checklist for Release #5


Closed
56 of 87 tasks
soumith opened this issue Aug 26, 2016 · 4 comments

Comments

@soumith
Member

soumith commented Aug 26, 2016

Core

Core Framework code

  • optim + trainer + dataset objects
  • Sharing CPU tensors
  • Add all operations to autograd
  • Free GIL when processing big tensors
  • Custom CUDA memory allocator (Sam Gross)
  • multi-GPU functions
  • nccl integration
  • finish legacy.nn
  • refactor C API for extensions
  • create an example extension with TH/pytorch C API
  • Checkpointing and improving torch.save / torch.load to use the same byte order (see the sketch after this list)
  • implement keyword arguments in cwrap
  • go over TH and try to make error messages more descriptive (e.g. if the sizes don't match)
  • Sparse Tensors on CPU and GPU (Zeming)
  • improve tensor printing
  • functional API for autograd variables
  • Finish multiple CUDA types (Soumith)
  • Add stochastic nodes
  • Add all modules to nn
  • Improved error messages (Error messages to improve #39)
  • sync legacy.nn with Lua nn
  • move Trainer to torch.experimental
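
For the checkpointing item above, a minimal sketch of the intended torch.save / torch.load round trip (the file name is illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# save: state dicts are plain dicts of tensors
torch.save({"model": model.state_dict(), "optim": optimizer.state_dict()}, "checkpoint.pt")

# load back into freshly constructed objects
state = torch.load("checkpoint.pt")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optim"])
```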

Operations

  • Integrate CuDNN
  • Write nn.LSTM*, nn.GRU* etc. to integrate CuDNN RNNs
  • Rewrite LookupTable and SparseLinear to use SparseTensors

FBCode stuff

  • Import into FBCode

Open Source stuff

  • Binary builds
  • Continuous builds for CUDA
  • MNIST and ResNet18 continuous builds
  • pip wheels

Backward Compatibility

Lua Bridge

  • integrate lutorpy into pytorch (either as an optional package or by default)
    • change TH indexing to 0-based and add the 1-based offset subtraction/addition to cwrap

Model Loading

Framework Integration

  • Caffe2 Integration
    • Modify TH / THC / THNN / THCUNN to integrate them
    • Have a converter that takes in a (Module and input) or (output) and auto-converts it to a Caffe model
      • and vice versa: take a Caffe protobuf and codegen a Python class that loads the weights
  • Keras Integration
    • Have a keras backend. Send in a Pull Request to fchollet/keras
  • Converting models between TF and Pytorch
    • Torch2TF: pretty much the same as the Caffe converter!
    • TF2Torch: same as the Caffe converter, but also cover ops like tf.if and tf.while

Website

  • Find someone to design and code it
  • Getting Started
    • Binary installs
      • Anaconda-based, which links automatically against MKL
      • Each of them for different CUDA versions: 7.0, 7.5, 8.0
    • Source-based installs
  • Showcase Demos / Examples / ModelZoo elegantly
  • Tutorials
  • Look at gym.openai.com (http://gym.openai.com/)
  • Developer docs

Documentation, Demos, Examples, Tutorials, ModelZoo

Demos / Examples / ModelZoo

  • Pre-trained models for each demo (in the model zoo)
    • Create a Python wrapper that allows searching for and downloading models (like NLTK); see the sketch after this list
  • Simple API for retraining / using pre-trained models on custom datasets
  • Documentation on how to modify the example for one's own experiments
  • Most or all of them should be multi-GPU ready
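
As an illustration of the model-download API this list asks for, the call pattern that torchvision's model zoo later provided looks like this (torchvision is used here only as an example of the target experience, not as part of this checklist):

```python
import torch
import torchvision.models as models

# download (and cache) pretrained weights, then swap the head to retrain on a custom dataset
model = models.resnet18(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # e.g. a 10-class custom task
```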

Demos + Examples

  • Basic
  • Vision
    • Supervised
      • fb.resnet.torch / googlenet for image classification (sam)
      • fastrcnn (francisco)
      • Video Classification
      • NeuralTalk2 (paszke)
      • Visual Q&A (paszke)
    • Unsupervised
      • Image super-resolution (waifu2x) (soumith)
      • DCGANs + Improved Training for GANs + InfoGAN
      • Text 2 Image (soumith)
      • Pixel RNNs (soumith)
      • Variational AutoEncoders (joost)
  • Games / RL (ludc)
  • NLP / Text
  • Metalearning
    • Neural Turing Machine
    • Learning to Learn by Gradient Descent by Gradient Descent
    • Decoupled Neural Interfaces using Synthetic Gradients https://arxiv.org/abs/1608.05343
  • ConvNet-Benchmarks / DeepMark scripts

Tutorials

Documentation

  • Auto-generate from source / docstrings

Links

Postponed for next release

  • lazy forward execution engine
  • double backprop
  • Sharing CUDA tensors
  • look into Cython
  • a built-in profiler for forward/backward (with automatic hints for speeding up the execution?); see the sketch below
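
For the profiler item above, a rough sketch of the kind of forward/backward profiling being proposed, written against the torch.profiler API that eventually shipped (used here purely for illustration, not as the design in this checklist):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(128, 64)
x = torch.randn(32, 128)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    y = model(x).sum()   # forward
    y.backward()         # backward
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```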

AIViz Integration

  • Have an initial attempt, and talk to Allan

  • Serveable via some Python REST API

  • figure out details for images and videos

  • Audio

  • wav2letter

  • DeepSpeech2 for maybe Switchboard or something (Ask Gabriel)

  • Sparse Models (ads?)

Distributed Training

  • simple distributed trainer like torch-distlearn / torch-ipc / torch-thrift

  • Synchronous, asynchronous and Elastic SGD

  • Integrate with Andrew / Yangqing's MPI library when that's ready

  • Port image classification and seq2seq to this

  • error handling

    • create a dict for translating exceptions and adding some PyTorch-specific info, sort of like in Elm [1,2]
    • make sure there's a clear error message when multiprocessing runs out of file descriptors (see the sketch below)
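
Related to the multiprocessing item above and the "Sharing CPU tensors" item in the Core list, a minimal sketch of CPU tensor sharing with torch.multiprocessing:

```python
import torch
import torch.multiprocessing as mp

def worker(shared):
    shared += 1  # in-place update is visible to the parent: the storage is shared, not copied

if __name__ == "__main__":
    t = torch.zeros(4)
    t.share_memory_()                      # move the tensor's storage into shared memory
    p = mp.Process(target=worker, args=(t,))
    p.start()
    p.join()
    print(t)                               # tensor([1., 1., 1., 1.])
```
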
@pltrdy

pltrdy commented Apr 24, 2017

@soumith I see that Neural Turing Machine was on the list, anything planned about it atm?

@soumith
Member Author

soumith commented Apr 24, 2017

no, nothing planned.

@jekbradbury
Contributor

there's a dynamic neural computer implementation that may or may not be complete/fully functional, but it looks like a good guide to implementing NTM-like models: https://github.com/ypxie/pytorch-NeuCom

@pltrdy

pltrdy commented Apr 28, 2017

@jekbradbury thx for the link

kwen2501 added a commit that referenced this issue Nov 4, 2024
…g timeout"


### Motivation
Today, the watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with the user's program or library stack.

### This PR
This PR gives the watchdog the ability to report the call-time stack of the collective, so that it is easier to trace the error back to the program's behavior.

The call-time stack is recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by zdevito). In `ProcessGroupNCCL`, we only track / report the Python part, so that it fits most PyTorch users.

### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).

```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
Setting `TORCH_NCCL_TRACE_BUFFER_SIZE` turns on the Flight Recorder.

Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: 
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: 
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```

From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks diverge.
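
For reference, a minimal sketch of a mismatched-collective program along these lines (the actual stack_demo.py gist may differ; the helpers `bar`/`baz`/`foo` here are hypothetical and just mirror the stack traces above):

```python
# repro_sketch.py -- hypothetical sketch, not the exact gist contents
import datetime
import torch
import torch.distributed as dist

def bar(t):
    dist.all_reduce(t)                       # what rank 0 ends up calling

def baz(out, t):
    dist.all_gather_into_tensor(out, t)      # what rank 1 ends up calling

def foo(rank, t, out):
    if rank == 0:
        bar(t)
    else:
        baz(out, t)                          # mismatched collectives -> watchdog timeout

def main():
    dist.init_process_group("nccl", timeout=datetime.timedelta(seconds=5))
    rank = dist.get_rank()
    torch.cuda.set_device(rank)              # assumes one GPU per rank on a single node
    t = torch.ones(200, device="cuda")
    out = torch.empty(200 * dist.get_world_size(), device="cuda")
    foo(rank, t, out)

if __name__ == "__main__":
    main()
```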

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this issue Nov 5, 2024
…g timeout"
kwen2501 added a commit that referenced this issue Nov 5, 2024
…g timeout"
pytorchmergebot pushed a commit that referenced this issue Nov 5, 2024
…139659)

Pull Request resolved: #139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
atalman pushed a commit to atalman/pytorch that referenced this issue Nov 11, 2024
…ytorch#139659)
pytorchmergebot pushed a commit that referenced this issue Nov 22, 2024
See #140725 (comment)
Running `torch.mps.synchronize()` after a Metal kernel resulted in an infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`:
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```
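
A minimal sketch of the call pattern involved (assuming the fix is in place; this is not the exact custom-kernel repro from the linked comment):

```python
import torch

x = torch.randn(1024, device="mps")
y = (x * 2).sum()          # enqueue work on the MPS command buffer
torch.mps.synchronize()    # previously this wait could hang in [_MTLCommandBuffer waitUntilCompleted]
print(y.item())
```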

Pull Request resolved: #141296
Approved by: https://github.com/huydhn
youssef62 pushed a commit to youssef62/pytorch that referenced this issue Nov 23, 2024
See pytorch#140725 (comment)
gglin001 added a commit to gglin001/pytorch that referenced this issue Nov 27, 2024
Ryo-not-rio pushed a commit to Ryo-not-rio/pytorch that referenced this issue Dec 2, 2024
…ytorch#139659)
Ryo-not-rio pushed a commit to Ryo-not-rio/pytorch that referenced this issue Dec 2, 2024
See pytorch#140725 (comment)
pytorch-bot bot pushed a commit that referenced this issue Dec 5, 2024
pobin6 pushed a commit to pobin6/pytorch that referenced this issue Dec 5, 2024
…ytorch#139659)
pobin6 pushed a commit to pobin6/pytorch that referenced this issue Dec 5, 2024
See pytorch#140725 (comment)
pytorchmergebot pushed a commit that referenced this issue Dec 23, 2024
# Motivation
Fix #143543

# Solution
We should raise a Python exception instead of aborting.

# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
  what():  device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
    return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```
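
With the fix, the failure surfaces as an ordinary Python exception, so callers can catch it; a small usage sketch (the out-of-range index here is deliberate and illustrative):

```python
import torch

idx = torch.accelerator.device_count()  # one past the last valid index
try:
    stream = torch.accelerator.current_stream(idx)
except RuntimeError as err:
    print(f"invalid device index {idx}: {err}")
```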

Pull Request resolved: #143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
drisspg added a commit that referenced this issue Jan 15, 2025
…ention"


Thanks to manman-ren, who verified that triton-lang/triton#4247 fixes this issue as well. This is not currently cherry-picked into pytorch-triton.

========= COMPUTE-SANITIZER
Test completed successfully!
========= ERROR SUMMARY: 0 errors
## NOTE:
Hmm, very interestingly: if the og_head_dim is odd, this works as expected. However, when the og_head_dim is a multiple of 2, it segfaults here:
```Shell
(lldb) bt
* thread #67, name = 'pt_autograd_0', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x10)
  * frame #0: 0x00007ffed327fbfe libtriton.so`scheduleRemainingToLastStage(forOp=ForOp @ 0x00007ffcafdfd658, schedule=0x00007ffcafdfd9e0, afterPrologue=<unavailable>, numStages=2) at MatmulLoopPipeline.cpp:893:9
    frame #1: 0x00007ffed328d970 libtriton.so`mlir::triton::preProcessLoopAndGetSchedule(forOp=0x00007ffcafdfddc0, numStages=2, options=0x00007ffcafdfde80) at MatmulLoopPipeline.cpp:1230:31
    frame #2: 0x00007ffed32a6a43 libtriton.so`mlir::triton::gpu::PipelinePass::runOnOperation() [inlined] pipelineLoop(numStages=2, forOp=ForOp @ 0x00007ffcafdfddc0) at SoftwarePipeliner.cpp:79:47
    frame #3: 0x00007ffed32a6998 libtriton.so`mlir::triton::gpu::PipelinePass::runOnOperation(this=0x00007ffc54767f10) at SoftwarePipeliner.cpp:125:36
    frame #4: 0x00007ffed385147c libtriton.so`mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 700
    frame #5: 0x00007ffed3851df2 libtriton.so`mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 354
    frame #6: 0x00007ffed385481c libtriton.so`mlir::PassManager::run(mlir::Operation*) + 876
    frame #7: 0x00007ffed3542bad libtriton.so`<lambda(mlir::PassManager&, mlir::ModuleOp&)>::operator(self=<unavailable>, mod=0x00007ffc54579280, __closure=<unavailable>)(mlir::PassManager &, mlir::ModuleOp &) at ir.cc:1625:19
    frame #8: 0x00007ffed3560108 libtriton.so`_FUN [inlined] operator(this=0x0000000000000000, call=0x00007ffcafdfe6e0) at cast.h:1480:37
    frame #9: 0x00007ffed35600f0 libtriton.so`_FUN((null)=0x00007ffcafdfe6e0) at pybind11.h:224:21
    frame #10: 0x00007ffed9ae5590 libtriton.so`typeinfo for pybind11::handle + 24
    frame #11: 0x00007ffed9ae5590 libtriton.so`typeinfo for pybind11::handle + 24
```




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
drisspg added a commit that referenced this issue Jan 15, 2025
c-p-i-o added a commit to c-p-i-o/pytorch that referenced this issue Jan 23, 2025
Summary:
Fix memory leak on shutdown when socket is closed.
We still need to free the buffer to make valgrind happy.

Test Plan:
Use `mtiavm`.
Repro steps provided by cristianlume.

1. Build
```
buck2 run //mtia/vm:athena-amodel-usd-owl-rank-
```
2. Run 2 VMs

on window 1:
```
mtiavm ssh --vm=0 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2
```
on window 2:
```
mtiavm ssh --vm=1 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 --rank=1 --store_host=172.16.1.1
```


without the fix:
```
==8766==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 8000 byte(s) in 2 object(s) allocated from:
    #0 0x5696fe in malloc (/data/users/cpio/fbsource/buck-out/v2/gen/fbcode/d4f2c81239ceac96/neteng/ai/rdma_gen/__rdma_gen__/rdma_gen+0x5696fe)
    pytorch#1 0x7faa8d40c47b in c10d::detail::UvTcpSocket::alloc_buffer(uv_handle_s*, unsigned long, uv_buf_t*) fbcode/caffe2/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:121
    pytorch#2 0x7faa6f62316d in uv__read /home/engshare/third-party2/libuv/1.34.2/src/libuv-v1.34.2/src/unix/stream.c:1143:5
    pytorch#3 0x7faa6f6239ef in uv__stream_io /home/engshare/third-party2/libuv/1.34.2/src/libuv-v1.34.2/src/unix/stream.c:1306:5
    pytorch#4 0x7faa6f62941f in uv__io_poll /home/engshare/third-party2/libuv/1.34.2/src/libuv-v1.34.2/src/unix/linux-core.c:431:11
    pytorch#5 0x7faa6f618629 in uv_run /home/engshare/third-party2/libuv/1.34.2/src/libuv-v1.34.2/src/unix/core.c:375:5
    pytorch#6 0x7faa8d3e7320 in c10d::detail::LibUVStoreDaemon::run() fbcode/caffe2/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:1216
    pytorch#7 0x7faa8d3bc933 in void std::__invoke_impl<void, void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>(std::__invoke_memfun_deref, void (c10d::detail::BackgroundThread::*&&)(), c10d::detail::BackgroundThread*&&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:74
    pytorch#8 0x7faa8d3bc80c in std::__invoke_result<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>::type std::__invoke<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>(void (c10d::detail::BackgroundThread::*&&)(), c10d::detail::BackgroundThread*&&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:96
    pytorch#9 0x7faa8d3bc7e1 in void std::thread::_Invoker<std::tuple<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>>::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:253
    pytorch#10 0x7faa8d3bc7a4 in std::thread::_Invoker<std::tuple<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>>::operator()() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:260
    pytorch#11 0x7faa8d3bc608 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:211
    pytorch#12 0x7faa436df5b4 in execute_native_thread_routine (/usr/local/fbcode/platform010/lib/libstdc++.so.6+0xdf5b4) (BuildId: 14a4eafe0cdc86af9a949a6c0c27bf21a033e047)
    pytorch#13 0x56744a in asan_thread_start(void*) ubsan.c
    pytorch#14 0x7faa43b2cf5b in __GI___clone3 (/usr/local/fbcode/platform010/lib/libc.so.6+0x12cf5b) (BuildId: 93cdceeb8322234c38e1f2c93ad0ff10c7632fa6)
```
With the fix, there is no leak.

Differential Revision: D68566104
akashveramd pushed a commit to akashveramd/pytorch that referenced this issue Apr 9, 2025
akashveramd pushed a commit to akashveramd/pytorch that referenced this issue Apr 9, 2025
akashveramd pushed a commit to akashveramd/pytorch that referenced this issue Apr 9, 2025
akashveramd pushed a commit to akashveramd/pytorch that referenced this issue Apr 9, 2025
* Squashed 'src/composable_kernel/' content from commit f6edda6

git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda6

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from f6edda6..5781adf

5781adf Update develop (pytorch#5) (pytorch#6)
97e6d51 Merge pull request pytorch#4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41 refactor
49c33aa refactor
54b3e73 rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf

* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert changes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make inner product compatible on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

Co-authored-by: JD <[email protected]>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix

Co-authored-by: JD <[email protected]>
akashveramd pushed a commit to akashveramd/pytorch that referenced this issue Apr 9, 2025
…duction (pytorch#1156)

* Squashed 'src/composable_kernel/' content from commit f6edda6

git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda6

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from f6edda6..5781adf

5781adf Update develop (pytorch#5) (pytorch#6)
97e6d51 Merge pull request pytorch#4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41 refactor
49c33aa refactor
54b3e73 rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf

* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* tidy

* add tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert changes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make inner product compatible on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

Co-authored-by: JD <[email protected]>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix

* Squashed 'src/composable_kernel/' changes from 5781adf..31b4035

31b4035 Merge pull request pytorch#16 from ROCmSoftwarePlatform/develop
b62bf8c Merge pull request pytorch#14 from ROCmSoftwarePlatform/miopen_downstream_init_integration
ccc4a1d Merge pull request pytorch#8 from ROCmSoftwarePlatform/miopen_downstream_init_integration
67ad47e refactor
16effa7 refactor
a91b68d DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
2cbabbb use int instead of index_t in kernel wrapper
0834bc7 compiler parameter use stream
f2ac783 make innner product compatiable on gfx900
4e57b30 rename
c03045c rename
b258995 update CK build script
2c48039 fix kernel filename
d626dcc fix enum issue
643ebd4 tidy
ddd49ec fix clang warning suppression
4f566c6 vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
172036d add c-style pointer cast
76f3131 tidy
d184289 tidy
f885c13 tidy
80120f0 tidy
c3efeb5 tidy
56fc084 tidy
54fba51 tidy
e62bae7 tidy
24c8728 add tidy
61487e0 fix
ae98b52 remove online compilation from CK
cb95421 refactor
73ca970 Merge commit '437cc595c6e206dfebb118985b5171bbc1e29eab' into composable_kernel_init_integration_v3
3b86646 Merge pull request pytorch#7 from ROCmSoftwarePlatform/master
d09ea4f Update develop (pytorch#5)
3d32ae9 add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

git-subtree-dir: src/composable_kernel
git-subtree-split: 31b4035

* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel

* Fix with regard to implementing GetZeroVal() in both kernel and host

* Avoid converting to compType from dstDataType before writing the output value

* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator

* Add CONSTANT decorator for descriptor read buffer

* Use get_thread_local_1d_id() for thread local Id

* Rename GetZeroVal() to GetReductionZeroVal() in the kernels

* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp

* Occasional tiny simplification and update in the kernel files

* Update in src/reducetensor.cpp for consistent IDs passing to the kernel

* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers

* Update to remove OpenCL tidy checking failures

* Small updates in src/reducetensor.cpp

* Update for better readability

* Remove unused codes and not-needed template parameters in the kernel wrappers

Co-authored-by: Chao Liu <[email protected]>
Co-authored-by: JD <[email protected]>