Releases: alessblaze/rocm-gfx1100
pytorch_2.4.0_nightly_beta_MPI_MKL-rocm6.1.1
This is a build using a Docker container, with code pulled on JUL 26, 2024 at 9:30 PM IST from:
https://github.com/pytorch/pytorch.git
https://github.com/openai/triton.git
https://github.com/openxla/xla.git
https://github.com/open-mpi/ompi.git
https://github.com/openucx/ucc.git
https://github.com/openucx/ucx.git
https://github.com/FFmpeg/FFmpeg.git
https://github.com/RadeonOpenCompute/ROCm.git
https://github.com/llvm/llvm-project.git
https://github.com/gcc-mirror/gcc.git
https://vulkan.lunarg.com/
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
https://github.com/anishsheikh/onnxruntime.git
source:
https://github.com/anishsheikh/torch-xla.git
https://github.com/anishsheikh/xla.git
Device Support : GFX1100, GFX1101, GFX1102, GFX1030
HSA_OVERRIDE_GFX_VERSION is not needed for supported devices; native support has been added.
It is recommended to use an Anaconda or Miniconda environment.
You will need to install:
conda install -c conda-forge libstdcxx-ng=13
and set
export LD_LIBRARY_PATH=<your_conda_environment>/lib:$LD_LIBRARY_PATH
What's new:
PyTorch:
From the page-migrating hip/cudaMallocManaged behavior and its relentless memcpys,
we moved to a unified memory structure implemented via hip/cudaHostMalloc with fine-grained zero-copy coherence.
It is easier to maintain the managed memory this way; the one thing I can't check is multi-device allocation.
There isn't much performance benefit in doing this; let's see.
We might change it to a coarse-grained allocation with a device sync before allocation
to replicate the hip/cudaMalloc behavior.
All of this just to save money.
PYTORCH_HIP_ALLOC_CONF=expandable_segments:True -> Doesn't work in official torch because the VMEM addressing code is missing. It is implemented here, but try not to use it: the API is still in beta and may crash.
PYTORCH_HIP_ALLOC_CONF=backend:hipMallocAsync -> Has no support in official PyTorch (macroed out); works here.
PYTORCH_HIP_ALLOC_CONF=backend:native -> Has a special optimization for low-VRAM devices to produce 720p-1080p images in optimal time; you can't generate 720p images on official PyTorch. A usage sketch follows this list.
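For reference, a minimal sketch (not from this repo's code) of how these allocator settings are applied; the variable must be set before torch initializes, and the tensor shape is a placeholder:

import os
os.environ["PYTORCH_HIP_ALLOC_CONF"] = "backend:hipMallocAsync"  # or "backend:native" / "expandable_segments:True" (beta, may crash)

import torch

x = torch.randn(1, 4, 96, 96, device="cuda")   # ROCm builds expose HIP devices as "cuda"
print(torch.cuda.get_allocator_backend())       # reports which allocator backend is active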
Torch-XLA: (doesn't exist for ROCm upstream; it does here)
PJRT_DEVICE=CUDA -> should work.
PJRT_ALLOCATOR_CUDA_ASYNC=1 -> should work.
PJRT_GPU_ASYNC_CLIENT=1 -> should work.
PJRT_ALLOCATOR_FRACTION=<unified_memory_fraction> -> should work with the BFC allocator.
Automatic mixed precision should work too.
Unified memory should also work with the CUDA variables.
Special note: see the disclaimer on Torch-XLA when using CudaAsync. A short usage sketch follows.
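A hedged usage sketch of the PJRT variables above; the fraction value is a placeholder, and the torch_xla calls are the standard public API:

import os
os.environ["PJRT_DEVICE"] = "CUDA"              # routes torch-xla's PJRT client to the GPU backend
os.environ["PJRT_GPU_ASYNC_CLIENT"] = "1"
os.environ["PJRT_ALLOCATOR_CUDA_ASYNC"] = "1"   # see the CudaAsync disclaimer above
os.environ["PJRT_ALLOCATOR_FRACTION"] = "0.75"  # placeholder unified-memory fraction for the BFC allocator

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                         # first XLA device
y = (torch.randn(2, 3, device=device) ** 2).sum()
xm.mark_step()                                   # flush the lazy graph to the device
print(y)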
OpenMPI support added.
JUL 27:
torch-xla updated to openxla/xla@0a90c5f.
Fixed a torch-xla race condition that caused crashes with dynamic modules.
pytorch_2.4.0_nightly_beta_MPI_MKL-rocm6.1.1
This is a build using a Docker container, with code pulled on MAY 15, 2024 at 9:30 PM IST from:
https://github.com/pytorch/pytorch.git
https://github.com/openai/triton.git
https://github.com/openxla/xla.git
https://github.com/open-mpi/ompi.git
https://github.com/openucx/ucc.git
https://github.com/openucx/ucx.git
https://github.com/FFmpeg/FFmpeg.git
https://github.com/RadeonOpenCompute/ROCm.git
https://github.com/llvm/llvm-project.git
https://github.com/gcc-mirror/gcc.git
https://vulkan.lunarg.com/
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
https://github.com/anishsheikh/onnxruntime.git
source:
https://github.com/anishsheikh/torch-xla.git
https://github.com/anishsheikh/xla.git
Device Support : GFX1100, GFX1101, GFX1102, GFX1030
HSA_OVERRIDE_GFX_VERSION is not needed for supported devices; native support has been added.
It is recommended to use an Anaconda or Miniconda environment.
You will need to install:
conda install -c conda-forge libstdcxx-ng=13
and set
export LD_LIBRARY_PATH=<your_conda_environment>/lib:$LD_LIBRARY_PATH
What's new:
PyTorch:
From the page-migrating hip/cudaMallocManaged behavior and its relentless memcpys,
we moved to a unified memory structure implemented via hip/cudaHostMalloc with fine-grained zero-copy coherence.
It is easier to maintain the managed memory this way; the one thing I can't check is multi-device allocation.
There isn't much performance benefit in doing this; let's see.
We might change it to a coarse-grained allocation with a device sync before allocation
to replicate the hip/cudaMalloc behavior.
All of this just to save money.
PYTORCH_HIP_ALLOC_CONF=expandable_segments:True -> Doesn't work in official torch because the VMEM addressing code is missing. It is implemented here, but try not to use it: the API is still in beta and may crash.
PYTORCH_HIP_ALLOC_CONF=backend:hipMallocAsync -> Has no support in official PyTorch (macroed out); works here.
PYTORCH_HIP_ALLOC_CONF=backend:native -> Has a special optimization for low-VRAM devices to produce 720p-1080p images in optimal time; you can't generate 720p images on official PyTorch.
Torch-XLA: (doesn't exist for ROCm upstream; it does here)
PJRT_DEVICE=CUDA -> should work.
PJRT_ALLOCATOR_CUDA_ASYNC=1 -> should work.
PJRT_GPU_ASYNC_CLIENT=1 -> should work.
PJRT_ALLOCATOR_FRACTION=<unified_memory_fraction> -> should work with the BFC allocator.
Automatic mixed precision should work too.
Unified memory should also work with the CUDA variables.
Special note: see the disclaimer on Torch-XLA when using CudaAsync.
OpenMPI support added.
MAY 15: We are updating core PyTorch and onnxruntime for the archs above, along with some memcpy fixes.
torch-xla will probably be posted tomorrow; I'm too lazy to do it today.
All releases support unified memory by default.
XLA fixes: torch IFRT HloProgram name change;
convolution fix pull;
ArgMinMax change;
TypedKernel in the redzone allocator;
Triton operand passes and the TritonToTritonGPU pass;
--xla_gpu_simplify_all_fp_conversions flag deprecation.
XLA commit: openxla/xla@a0f5d76
Do not be fooled by the XLA version name.
All connections drop here.
MAY 16: As stated yesterday, the creation of the ConvertTritonToTritonGPUPass is fixed, in
pm.addPass(mt::createConvertTritonToTritonGPUPass());
pytorch_2.4.0_nightly_beta_MPI_MKL-rocm6.1.0
This is a build using a Docker container, with code pulled on MAR 23, 2024 at 9:30 PM IST from:
https://github.com/pytorch/pytorch.git
https://github.com/openai/triton.git
https://github.com/openxla/xla.git
https://github.com/open-mpi/ompi.git
https://github.com/openucx/ucc.git
https://github.com/openucx/ucx.git
https://github.com/FFmpeg/FFmpeg.git
https://github.com/RadeonOpenCompute/ROCm.git
https://github.com/llvm/llvm-project.git
https://github.com/gcc-mirror/gcc.git
https://vulkan.lunarg.com/
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
https://github.com/anishsheikh/onnxruntime.git
source:
https://github.com/anishsheikh/torch-xla.git
https://github.com/anishsheikh/xla.git
Device Support : GFX1100, GFX1101, GFX1102, GFX1030
HSA_OVERRIDE_GFX_VERSION is not needed for supported devices; native support has been added.
It is recommended to use an Anaconda or Miniconda environment.
You will need to install:
conda install -c conda-forge libstdcxx-ng=13
and set
export LD_LIBRARY_PATH=<your_conda_environment>/lib:$LD_LIBRARY_PATH
What's new:
PyTorch:
From the page-migrating hip/cudaMallocManaged behavior and its relentless memcpys,
we moved to a unified memory structure implemented via hip/cudaHostMalloc with fine-grained zero-copy coherence.
It is easier to maintain the managed memory this way; the one thing I can't check is multi-device allocation.
There isn't much performance benefit in doing this; let's see.
We might change it to a coarse-grained allocation with a device sync before allocation
to replicate the hip/cudaMalloc behavior.
All of this just to save money.
PYTORCH_HIP_ALLOC_CONF=expandable_segments:True -> Doesn't work in official torch because the VMEM addressing code is missing. It is implemented here, but try not to use it: the API is still in beta and may crash.
PYTORCH_HIP_ALLOC_CONF=backend:hipMallocAsync -> Has no support in official PyTorch (macroed out); works here.
PYTORCH_HIP_ALLOC_CONF=backend:native -> Has a special optimization for low-VRAM devices to produce 720p-1080p images in optimal time; you can't generate 720p images on official PyTorch.
Torch-XLA: (doesn't exist for ROCm upstream; it does here)
PJRT_DEVICE=CUDA -> should work.
PJRT_ALLOCATOR_CUDA_ASYNC=1 -> should work.
PJRT_GPU_ASYNC_CLIENT=1 -> should work.
PJRT_ALLOCATOR_FRACTION=<unified_memory_fraction> -> should work with the BFC allocator.
Automatic mixed precision should work too.
Unified memory should also work with the CUDA variables.
Special note: see the disclaimer on Torch-XLA when using CudaAsync.
OpenMPI support added.
MAR 17: Added onnxruntime along with a new PyTorch build using coarse-grained unified memory, which is 2-3x faster than before.
PYTORCH_NO_HIP_MEMORY_CACHING=1 -> Can give some benefit in certain circumstances (a small sketch follows).
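A small, hypothetical A/B check of that variable (shapes and loop count are made up; caching must be disabled before torch is imported):

import os
os.environ["PYTORCH_NO_HIP_MEMORY_CACHING"] = "1"   # comment out to re-enable the caching allocator

import time
import torch

x = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()
t0 = time.time()
for _ in range(20):
    y = x @ x
torch.cuda.synchronize()
print(f"20 matmuls took {time.time() - t0:.3f}s")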
MAR 23: Bug fixes for memcpy in onnx, and also in PyTorch.
The code was written purely for device-to-device copies, but we are using host memory.
In onnxruntime the GPU memcpy path is too many if/elses.
We are in the beta phase: the IPC communication handle used while exporting to ONNX via optimum-cli is broken.
Somehow the interprocess communication handle is encoded as an std::string and then recast to a hipIpcMemHandle.
No idea whether it's just a datatype-handling error or a HIP runtime bug; it seems to be documented in the HIP docs, so let them fix it.
We fix what we need.
Another open question is how the HIP runtime handles native host allocations. It seems we need to register them
to avoid segfaults and page faults, which is why one line of onnxruntime's CPU-to-CPU memcpy is in question: is it
handled purely on the CPU, or is it sometimes copied back to the GPU for some procedure? So I took the safe side, registered
the memory pointers in question, and used at least 1-2 GB less RAM than the last build.
APR 17 -> Moving forward, the same repo is updated for ROCm 6.1.0; this is not the latest upstream build.
The TensorFlow frontend has entered EOL; only PyTorch and the related XLA will be provided.
PyTorch will probably move to stable tags only in the next build, as we are now quite stable with our modifications.
Enabled Composable Kernel for onnxruntime, as we merged the latest CK for ROCm 6.1.0 with gfx11 support.
pytorch_2.4.0_nightly_beta_MPI_MKL-rocm6.0.3
This is a build using a Docker container, with code pulled on MAR 23, 2024 at 9:30 PM IST from:
https://github.com/pytorch/pytorch.git
https://github.com/openai/triton.git
https://github.com/openxla/xla.git
https://github.com/open-mpi/ompi.git
https://github.com/openucx/ucc.git
https://github.com/openucx/ucx.git
https://github.com/FFmpeg/FFmpeg.git
https://github.com/RadeonOpenCompute/ROCm.git
https://github.com/llvm/llvm-project.git
https://github.com/gcc-mirror/gcc.git
https://vulkan.lunarg.com/
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
source:
https://github.com/anishsheikh/torch-xla.git
https://github.com/anishsheikh/xla.git
Device Support : GFX1100, GFX1101, GFX1102, GFX1030
HSA_OVERRIDE_GFX_VERSION is not needed for supported devices; native support has been added.
It is recommended to use an Anaconda or Miniconda environment.
You will need to install:
conda install -c conda-forge libstdcxx-ng=13
and set
export LD_LIBRARY_PATH=<your_conda_environment>/lib:$LD_LIBRARY_PATH
What's new:
PyTorch:
From the page-migrating hip/cudaMallocManaged behavior and its relentless memcpys,
we moved to a unified memory structure implemented via hip/cudaHostMalloc with fine-grained zero-copy coherence.
It is easier to maintain the managed memory this way; the one thing I can't check is multi-device allocation.
There isn't much performance benefit in doing this; let's see.
We might change it to a coarse-grained allocation with a device sync before allocation
to replicate the hip/cudaMalloc behavior.
All of this just to save money.
PYTORCH_HIP_ALLOC_CONF=expandable_segments:True -> Doesn't work in official torch because the VMEM addressing code is missing. It is implemented here, but try not to use it: the API is still in beta and may crash.
PYTORCH_HIP_ALLOC_CONF=backend:hipMallocAsync -> Has no support in official PyTorch (macroed out); works here.
PYTORCH_HIP_ALLOC_CONF=backend:native -> Has a special optimization for low-VRAM devices to produce 720p-1080p images in optimal time; you can't generate 720p images on official PyTorch.
Torch-XLA: (doesn't exist for ROCm upstream; it does here)
PJRT_DEVICE=CUDA -> should work.
PJRT_ALLOCATOR_CUDA_ASYNC=1 -> should work.
PJRT_GPU_ASYNC_CLIENT=1 -> should work.
PJRT_ALLOCATOR_FRACTION=<unified_memory_fraction> -> should work with the BFC allocator.
Automatic mixed precision should work too.
Unified memory should also work with the CUDA variables.
Special note: see the disclaimer on Torch-XLA when using CudaAsync.
OpenMPI support added.
MAR 17: Added onnxruntime along with a new PyTorch build using coarse-grained unified memory, which is 2-3x faster than before.
PYTORCH_NO_HIP_MEMORY_CACHING=1 -> Can give some benefit in certain circumstances.
MAR 23: Bug fixes for memcpy in onnx, and also in PyTorch.
The code was written purely for device-to-device copies, but we are using host memory.
In onnxruntime the GPU memcpy path is too many if/elses.
We are in the beta phase: the IPC communication handle used while exporting to ONNX via optimum-cli is broken.
Somehow the interprocess communication handle is encoded as an std::string and then recast to a hipIpcMemHandle.
No idea whether it's just a datatype-handling error or a HIP runtime bug; it seems to be documented in the HIP docs, so let them fix it.
We fix what we need.
Another open question is how the HIP runtime handles native host allocations. It seems we need to register them
to avoid segfaults and page faults, which is why one line of onnxruntime's CPU-to-CPU memcpy is in question: is it
handled purely on the CPU, or is it sometimes copied back to the GPU for some procedure? So I took the safe side, registered
the memory pointers in question, and used at least 1-2 GB less RAM than the last build.
APR 1 2:49 PM -> Some fp4 types would work if you try them. hipBLASLt has issues that I didn't fix; maybe I'll fix them later if I get bored.
We didn't port it to ROCm by changing all the variables and APIs from CUDA to ROCm; rather, we are CUDA now. XD
int8 calculations are done with cuBLASLt; it would need to be rewritten in rocBLAS or hipBLAS for it to work. WMMA can also be worked on, but the data type and compute type need accumulation in float rather than float16 (half).
APR 1 11:28 -> Some linking can still have issues, though. The new version supports all stated ROCm archs.
APR 3 7:10 -> I was on a chill break yesterday. Spm_Coo tests would pass with some parameters along with hipBLASLt.
The latest hipBLASLt merged gfx1100 support for perf kernels, so let's wait and then fix the 100-gender row/column problem.
APR 7 5:24 -> Fixes some crashes in torch.
pytorch_2.4.0_nightly_beta_MPI_MKL-rocm6.0.2
This is a build using a Docker container, with code pulled on MAR 23, 2024 at 9:30 PM IST from:
https://github.com/pytorch/pytorch.git
https://github.com/openai/triton.git
https://github.com/openxla/xla.git
https://github.com/open-mpi/ompi.git
https://github.com/openucx/ucc.git
https://github.com/openucx/ucx.git
https://github.com/FFmpeg/FFmpeg.git
https://github.com/RadeonOpenCompute/ROCm.git
https://github.com/llvm/llvm-project.git
https://github.com/gcc-mirror/gcc.git
https://vulkan.lunarg.com/
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
source:
https://github.com/anishsheikh/torch-xla.git
https://github.com/anishsheikh/xla.git
Device Support : GFX1100, GFX1101, GFX1102, GFX1030
HSA_OVERRIDE_GFX_VERSION is not needed for supported devices; native support has been added.
It is recommended to use an Anaconda or Miniconda environment.
You will need to install:
conda install -c conda-forge libstdcxx-ng=13
and set
export LD_LIBRARY_PATH=<your_conda_environment>/lib:$LD_LIBRARY_PATH
What's new:
PyTorch:
From the page-migrating hip/cudaMallocManaged behavior and its relentless memcpys,
we moved to a unified memory structure implemented via hip/cudaHostMalloc with fine-grained zero-copy coherence.
It is easier to maintain the managed memory this way; the one thing I can't check is multi-device allocation.
There isn't much performance benefit in doing this; let's see.
We might change it to a coarse-grained allocation with a device sync before allocation
to replicate the hip/cudaMalloc behavior.
All of this just to save money.
PYTORCH_HIP_ALLOC_CONF=expandable_segments:True -> Doesn't work in official torch because the VMEM addressing code is missing. It is implemented here, but try not to use it: the API is still in beta and may crash.
PYTORCH_HIP_ALLOC_CONF=backend:hipMallocAsync -> Has no support in official PyTorch (macroed out); works here.
PYTORCH_HIP_ALLOC_CONF=backend:native -> Has a special optimization for low-VRAM devices to produce 720p-1080p images in optimal time; you can't generate 720p images on official PyTorch.
Torch-XLA: (doesn't exist for ROCm upstream; it does here)
PJRT_DEVICE=CUDA -> should work.
PJRT_ALLOCATOR_CUDA_ASYNC=1 -> should work.
PJRT_GPU_ASYNC_CLIENT=1 -> should work.
PJRT_ALLOCATOR_FRACTION=<unified_memory_fraction> -> should work with the BFC allocator.
Automatic mixed precision should work too.
Unified memory should also work with the CUDA variables.
Special note: see the disclaimer on Torch-XLA when using CudaAsync.
OpenMPI support added.
MAR 17: Added onnxruntime along with a new PyTorch build using coarse-grained unified memory, which is 2-3x faster than before.
PYTORCH_NO_HIP_MEMORY_CACHING=1 -> Can give some benefit in certain circumstances.
MAR 23: Bug fixes for memcpy in onnx, and also in PyTorch.
The code was written purely for device-to-device copies, but we are using host memory.
In onnxruntime the GPU memcpy path is too many if/elses.
We are in the beta phase: the IPC communication handle used while exporting to ONNX via optimum-cli is broken.
Somehow the interprocess communication handle is encoded as an std::string and then recast to a hipIpcMemHandle.
No idea whether it's just a datatype-handling error or a HIP runtime bug; it seems to be documented in the HIP docs, so let them fix it.
We fix what we need.
Another open question is how the HIP runtime handles native host allocations. It seems we need to register them
to avoid segfaults and page faults, which is why one line of onnxruntime's CPU-to-CPU memcpy is in question: is it
handled purely on the CPU, or is it sometimes copied back to the GPU for some procedure? So I took the safe side, registered
the memory pointers in question, and used at least 1-2 GB less RAM than the last build.
tensorflow-2.17-rocm610-python3.11-experimental-gfx11
Built from:
https://github.com/anishsheikh/tensorflow
(with super extra enhanced fixes)
My repo contains only build fixes; I didn't change the gfx1100 support itself.
The build is basically broken in the mainline Google TensorFlow repo at this point in time.
WARNING: ONLY for Ryzen 5000-series CPUs, or CPUs with similar instruction sets.
Supported Devices : gfx1100, gfx1101, gfx1102, gfx1030
HSA_OVERRIDE_GFX_VERSION is not needed for supported devices; native support has been added.
It is recommended to use an Anaconda or Miniconda environment.
You will need to install:
conda install -c conda-forge libstdcxx-ng=13
and set
export LD_LIBRARY_PATH=<your_conda_environment>/lib:$LD_LIBRARY_PATH
What's new:
TF_GPU_ALLOCATOR=cuda_malloc_async / cuda_malloc -> works here.
TF_FORCE_UNIFIED_MEMORY='true' -> works too.
XLA_PYTHON_CLIENT_MEM_FRACTION=<fraction> -> works here. A short usage sketch follows this list.
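A hedged sketch of setting those variables for this TensorFlow build; the fraction and matrix size are placeholders:

import os
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"   # or "cuda_malloc"
os.environ["TF_FORCE_UNIFIED_MEMORY"] = "true"
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.8"   # placeholder fraction

import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))
with tf.device("/GPU:0"):
    a = tf.random.normal((2048, 2048))
    b = tf.matmul(a, a)
print(float(tf.reduce_sum(b)))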
Added more BLAS fixes.
APR 18 --> Final release by me.
I'm kind of pulling EOL because of some problems; it is getting increasingly complicated to maintain so many projects.
It might not work correctly either, because I'm too lazy to find out.
pytorch_2.4.0_nightly_experimental_MPI_MKL-rocm6.0.2
This is a build using a Docker container, with code pulled on MAR 16, 2024 at 3:30 PM IST from:
https://github.com/pytorch/pytorch.git
https://github.com/openai/triton.git
https://github.com/openxla/xla.git
https://github.com/open-mpi/ompi.git
https://github.com/openucx/ucc.git
https://github.com/openucx/ucx.git
https://github.com/FFmpeg/FFmpeg.git
https://github.com/RadeonOpenCompute/ROCm.git
https://github.com/llvm/llvm-project.git
https://github.com/gcc-mirror/gcc.git
https://vulkan.lunarg.com/
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
source:
https://github.com/anishsheikh/torch-xla.git
https://github.com/anishsheikh/xla.git
Device Support : GFX1100, GFX1101, GFX1102, GFX1030
HSA_OVERRIDE_GFX_VERSION is not needed for supported devices; native support has been added.
It is recommended to use an Anaconda or Miniconda environment.
You will need to install:
conda install -c conda-forge libstdcxx-ng=13
and set
export LD_LIBRARY_PATH=<your_conda_environment>/lib:$LD_LIBRARY_PATH
What's new:
PyTorch:
From the page-migrating hip/cudaMallocManaged behavior and its relentless memcpys,
we moved to a unified memory structure implemented via hip/cudaHostMalloc with fine-grained zero-copy coherence.
It is easier to maintain the managed memory this way; the one thing I can't check is multi-device allocation.
There isn't much performance benefit in doing this; let's see.
We might change it to a coarse-grained allocation with a device sync before allocation
to replicate the hip/cudaMalloc behavior.
All of this just to save money.
PYTORCH_HIP_ALLOC_CONF=expandable_segments:True -> Doesn't work in official torch because the VMEM addressing code is missing. It is implemented here, but try not to use it: the API is still in beta and may crash.
PYTORCH_HIP_ALLOC_CONF=backend:hipMallocAsync -> Has no support in official PyTorch (macroed out); works here.
PYTORCH_HIP_ALLOC_CONF=backend:native -> Has a special optimization for low-VRAM devices to produce 720p-1080p images in optimal time; you can't generate 720p images on official PyTorch.
Torch-XLA: (doesn't exist for ROCm upstream; it does here)
PJRT_DEVICE=CUDA -> should work.
PJRT_ALLOCATOR_CUDA_ASYNC=1 -> should work.
PJRT_GPU_ASYNC_CLIENT=1 -> should work.
PJRT_ALLOCATOR_FRACTION=<unified_memory_fraction> -> should work with the BFC allocator.
Automatic mixed precision should work too.
Unified memory should also work with the CUDA variables.
Special note: see the disclaimer on Torch-XLA when using CudaAsync.
OpenMPI support added.
MAR 17: Added onnxruntime along with a new PyTorch build using coarse-grained unified memory, which is 2-3x faster than before.
PYTORCH_NO_HIP_MEMORY_CACHING=1 -> Can give some benefit in certain circumstances.
pytorch_2.3.0_nightly_alpha_MPI_MKL-rocm6.0.2
This is a build using a Docker container, with code pulled on MAR 9, 2024 at 2:30 AM IST from:
https://github.com/pytorch/pytorch.git
https://github.com/openai/triton.git
https://github.com/openxla/xla.git
https://github.com/open-mpi/ompi.git
https://github.com/openucx/ucc.git
https://github.com/openucx/ucx.git
https://github.com/FFmpeg/FFmpeg.git
https://github.com/RadeonOpenCompute/ROCm.git
https://github.com/llvm/llvm-project.git
https://github.com/gcc-mirror/gcc.git
https://vulkan.lunarg.com/
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
source:
https://github.com/anishsheikh/torch-xla.git
https://github.com/anishsheikh/xla.git
Device Support : GFX1100, GFX1101, GFX1102, GFX1030
HSA_OVERRIDE_GFX_VERSION is not needed for supported devices; native support has been added.
It is recommended to use an Anaconda or Miniconda environment.
You will need to install:
conda install -c conda-forge libstdcxx-ng=13
and set
export LD_LIBRARY_PATH=<your_conda_environment>/lib:$LD_LIBRARY_PATH
What's new:
PyTorch:
PYTORCH_HIP_ALLOC_CONF=expandable_segments:True -> Doesn't work in official torch because the VMEM addressing code is missing. It is implemented here, but try not to use it: the API is still in beta and may crash.
PYTORCH_HIP_ALLOC_CONF=backend:hipMallocAsync -> Has no support in official PyTorch (macroed out); works here.
PYTORCH_HIP_ALLOC_CONF=backend:native -> Has a special optimization for low-VRAM devices to produce 720p-1080p images in optimal time; you can't generate 720p images on official PyTorch.
Torch-XLA: (doesn't exist for ROCm upstream; it does here)
PJRT_DEVICE=CUDA -> should work.
PJRT_ALLOCATOR_CUDA_ASYNC=1 -> should work.
PJRT_GPU_ASYNC_CLIENT=1 -> should work.
PJRT_ALLOCATOR_FRACTION=<unified_memory_fraction> -> should work with the BFC allocator.
Automatic mixed precision should work too.
Unified memory should also work with the CUDA variables.
Special note: see the disclaimer on Torch-XLA when using CudaAsync.
OpenMPI support added.
MAR 13 2:06 PM -> Added onnxruntime_training without MIGraphX for converting models, built for Python 3.11 (HAHAHAHA).
MAR 14 8:05 PM -> Added onnxruntime_rocm with MIGraphX for inference of models, built for Python 3.11. It includes a memory optimization
to use unified memory if the language model is large and not optimized or quantized.
(Tested with a Mistral 7B Instruct v0.2 unoptimized ONNX export; the 1e-5 tolerance passed; model size ~28 GB.) (HAHAHAHA)
NOTE: I did change huggingface/optimum's optimum-cli to work with ROCm, and converted GPT-2 to ONNX. Will surely fork and push it soon, probably. A small inference sketch follows.
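For reference, a minimal inference sketch for one of those exported ONNX models with this onnxruntime build; the model path is a placeholder, and the provider names are the standard onnxruntime ROCm/MIGraphX providers:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",                                   # placeholder path to an optimum-cli export
    providers=["MIGraphXExecutionProvider", "ROCMExecutionProvider", "CPUExecutionProvider"],
)

feeds = {}
for inp in sess.get_inputs():                       # build dummy feeds; symbolic dims become 1
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    dtype = np.int64 if "int64" in inp.type else np.float32
    feeds[inp.name] = np.ones(shape, dtype=dtype)

outputs = sess.run(None, feeds)
print([o.shape for o in outputs])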
pytorch_2.3.0_nightly_alpha_MPI_MKL-rocm6.0.2
This is a build using a Docker container, with code pulled on MAR 6, 2024 at 2:30 AM IST from:
https://github.com/pytorch/pytorch.git
https://github.com/openai/triton.git
https://github.com/openxla/xla.git
https://github.com/open-mpi/ompi.git
https://github.com/openucx/ucc.git
https://github.com/openucx/ucx.git
https://github.com/FFmpeg/FFmpeg.git
https://github.com/RadeonOpenCompute/ROCm.git
https://github.com/llvm/llvm-project.git
https://github.com/gcc-mirror/gcc.git
https://vulkan.lunarg.com/
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
source:
https://github.com/anishsheikh/torch-xla.git
https://github.com/anishsheikh/xla.git
This uses the latest Magma available at this point in time.
This has MPI with UCX and UCC.
This is an alpha build, meaning it might not work, even on a stable channel, on officially unsupported hardware.
To get torchvision and torchaudio, use
The CudaAsync allocator works here too, with the official torch-xla environment variables.
An OpenMP install might be needed.
Both torchvision and torchaudio may need libjpeg, libpng, and ffmpeg.
Also, for XLA, either install Python from source, or export the conda env's lib directory as LD_LIBRARY_PATH on Ubuntu and delete or move the offending libs in
WARNING: Ryzen 5000 native CPU build.
Changes:
Torch memory optimization for unified memory.
PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
now probably works; added support for the VMEM functions.
PYTORCH_HIP_ALLOC_CONF=backend:hipMallocAsync
also works with mallocAsync, up to a point.
rocm-smi is used instead of nvlink for GPU memory queries.
More fixes for smi.
Fixed asyncmalloc again.
Fixed torch-xla autocast.
This uses the Intel MKL 2024.0 static libraries and Magma libraries.
This could also be done with OpenBLAS, but it isn't needed anyway.
bugs:
For gfx1102:
HSA_OVERRIDE_GFX_VERSION=11.0.2 will be needed to run torch,
and HSA_OVERRIDE_GFX_VERSION=11.0.0 for XLA.
TESTS(RX7600 8GB):
Stable Diffusion on the automatic1111 fork now works far faster with torch-xla and cudaMallocAsync; no HIP out-of-memory errors for image generation up to at least 1080p (works),
sd.next up to 720p,
with 5x faster post-processing of pixels.
lol, it's still experimental, so I will keep the cute bugs to fix later.
MPI will be available later.
pytorch_2.3.0_nightly_alpha_MPI_MKL-rocm6.0.2
This is a build using a Docker container, with code pulled on MAR 5, 2024 at 2:30 AM IST from:
https://github.com/pytorch/pytorch.git
https://github.com/openai/triton.git
https://github.com/openxla/xla.git
https://github.com/open-mpi/ompi.git
https://github.com/openucx/ucc.git
https://github.com/openucx/ucx.git
https://github.com/FFmpeg/FFmpeg.git
https://github.com/RadeonOpenCompute/ROCm.git
https://github.com/llvm/llvm-project.git
https://github.com/gcc-mirror/gcc.git
https://vulkan.lunarg.com/
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
source:
https://github.com/anishsheikh/torch-xla.git
https://github.com/anishsheikh/xla.git
This uses the latest Magma available at this point in time.
This has MPI with UCX and UCC.
This is an alpha build, meaning it might not work, even on a stable channel, on officially unsupported hardware.
To get torchvision and torchaudio, use
The CudaAsync allocator works here too, with the official torch-xla environment variables.
An OpenMP install might be needed.
Both torchvision and torchaudio may need libjpeg, libpng, and ffmpeg.
Also, for XLA, either install Python from source, or export the conda env's lib directory as LD_LIBRARY_PATH on Ubuntu and delete or move the offending libs in
WARNING: Ryzen 5000 native CPU build.
Changes:
Torch memory optimization for unified memory.
PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
now probably works; added support for the VMEM functions.
PYTORCH_HIP_ALLOC_CONF=backend:hipMallocAsync
also works with mallocAsync, up to a point.
rocm-smi is used instead of nvlink for GPU memory queries.
More fixes for smi.
Fixed asyncmalloc again.
This uses the Intel MKL 2024.0 static libraries and Magma libraries.
This could also be done with OpenBLAS, but it isn't needed anyway.
bugs:
For gfx1102:
HSA_OVERRIDE_GFX_VERSION=11.0.2 will be needed to run torch,
and HSA_OVERRIDE_GFX_VERSION=11.0.0 for XLA.
TESTS(RX7600 8GB):
Stable Diffusion on the automatic1111 fork now works far faster with torch-xla and cudaMallocAsync; no HIP out-of-memory errors for image generation up to at least 1080p (works),
sd.next up to 720p,
with 5x faster post-processing of pixels.
lol, it's still experimental, so I will keep the cute bugs to fix later.
MPI will be available later.