pytorch_2.4.0_nightly_beta_MPI_MKL-rocm6.0.3

@alessblaze released this 29 Mar 15:05 · 1 commit to main since this release · e0c1124

This build was produced in a Docker container; the code was pulled on MAR23/2023 at 9:30 PM IST from:
https://github.com/pytorch/pytorch.git
https://github.com/openai/triton.git
https://github.com/openxla/xla.git
https://github.com/open-mpi/ompi.git
https://github.com/openucx/ucc.git
https://github.com/openucx/ucx.git
https://github.com/FFmpeg/FFmpeg.git
https://github.com/RadeonOpenCompute/ROCm.git
https://github.com/llvm/llvm-project.git
https://github.com/gcc-mirror/gcc.git
https://vulkan.lunarg.com/
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html

source:
https://github.com/anishsheikh/torch-xla.git
https://github.com/anishsheikh/xla.git

Device Support: GFX1100, GFX1101, GFX1102, GFX1030
HSA_OVERRIDE_GFX_VERSION is not needed for the supported devices; native support has been added.
Recommended to use an Anaconda or Miniconda environment.
You will need to install:

conda install -c conda-forge libstdcxx-ng=13

and set

export LD_LIBRARY_PATH=<your_conda_environment>/lib:$LD_LIBRARY_PATH
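
A quick, minimal sanity check (illustrative only, not part of the release; it assumes a single GPU at index 0) to confirm the ROCm build sees your device once the environment is set up:

    # sanity check -- assumes one GPU at index 0; run inside the conda environment above
    import torch

    print(torch.__version__)          # should report a ROCm/HIP build
    print(torch.version.hip)          # HIP runtime version; None on CUDA builds
    print(torch.cuda.is_available())  # ROCm devices are exposed through the cuda API
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))
        x = torch.randn(1024, 1024, device="cuda")
        print((x @ x).sum().item())   # trivial matmul to exercise the device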

What's new:
PyTorch:
Moving away from the page-migrating hip/cudaMallocManaged behavior and its relentless memcpys,
we implemented the unified memory structure via hip/cudaHostMalloc with fine-grain zero-copy coherence.
It is easier to maintain the managed memory this way; the one thing I can't check is multi-device allocation.
There isn't much performance benefit in doing this. Let's see..
We might change it into a coarse-grain allocation with a device sync before allocation
to replicate the hip/cudaMalloc behavior.

All of this just to save money.

   PYTORCH_HIP_ALLOC_CONF=expandable_segments:True -> Doesn't work in official torch because it lacks the VMEM addressing code. It is implemented here, but try not to use it: the API is still in BETA and may crash.
   PYTORCH_HIP_ALLOC_CONF=backend:hipMallocAsync -> Has no support in official PyTorch and is macroed out; it works here.
   PYTORCH_HIP_ALLOC_CONF=backend:native -> Has special optimizations for low-VRAM devices to produce 720p-1080p images in optimal time. You can't generate 720p images on official PyTorch.
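
A usage sketch for the allocator config (the backend value and matrix sizes are illustrative; set the variable before torch is imported so the allocator picks it up):

    # allocator config sketch -- illustrative only; pick one of the backends listed above
    import os
    os.environ["PYTORCH_HIP_ALLOC_CONF"] = "backend:native"  # or "backend:hipMallocAsync"

    import torch  # import after setting the env var

    x = torch.randn(2048, 2048, device="cuda")
    y = x @ x
    torch.cuda.synchronize()
    # inspect what the caching allocator is doing under the chosen backend
    print(torch.cuda.memory_allocated() / 2**20, "MiB allocated")
    print(torch.cuda.memory_reserved() / 2**20, "MiB reserved")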

Torch-XLA: (Doesn't exist for ROCm officially; it does here)

 PJRT_DEVICE=CUDA -> would work.
 PJRT_ALLOCATOR_CUDA_ASYNC=1 -> would work.
 PJRT_GPU_ASYNC_CLIENT=1 -> would work.
 PJRT_ALLOCATOR_FRACTION=<unifiedmemory_percentage> -> would work with the BFC allocator.
 Automatic Mixed Precision would work too.
 Unified Memory would also work with the CUDA variables.
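
A minimal PJRT usage sketch (assuming the torch-xla wheel from this build is installed; the allocator fraction value is illustrative):

    # torch-xla PJRT sketch -- values are illustrative
    import os
    os.environ["PJRT_DEVICE"] = "CUDA"               # CUDA naming maps onto ROCm in this build
    os.environ["PJRT_ALLOCATOR_FRACTION"] = "0.8"    # example fraction for the BFC allocator

    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()                         # XLA GPU device
    a = torch.randn(512, 512, device=device)
    b = torch.randn(512, 512, device=device)
    c = a @ b
    xm.mark_step()                                   # flush the lazy graph and execute
    print(c.device, float(c.sum()))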

A special note/disclaimer applies to Torch-XLA when using CudaAsync.
OpenMPI support added.

MAR 17 : Added onnxruntime along with a new PyTorch build using coarse-grained unified memory, which is 2-3x faster than before (see the sketch below).
PYTORCH_NO_HIP_MEMORY_CACHING=1 -> Can give some benefit in certain circumstances.
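
A hedged onnxruntime usage sketch ("model.onnx", the input shape, and ROCMExecutionProvider being the provider name built here are assumptions):

    # onnxruntime sketch -- "model.onnx" and the input shape are placeholders
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession(
        "model.onnx",
        providers=["ROCMExecutionProvider", "CPUExecutionProvider"],
    )
    input_name = sess.get_inputs()[0].name
    dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
    outputs = sess.run(None, {input_name: dummy})
    print([o.shape for o in outputs])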
MAR 23 : Bug fixes for memcpy in onnx and also in pytorch.
The code was written for purely device-to-device copies, but we are using host memory.
In onnxruntime the GPU memcpy path is too many if/else's.
We are in the beta phase; the IPC communication handle while exporting to onnx via optimum-cli is broken.
Somehow the interprocess communication handle uses std::string to encode the handle and then recasts it to the hip IPC handle type.
No idea whether it is just a datatype-handling error or a HIP runtime bug; it seems to be documented in the HIP docs, so let them fix it.
We fix what we need.
Another question is that I have little idea how the HIP runtime handles native host allocations. It seems we need to register them
to avoid segfaults and page faults, which is why one line of onnxruntime's CPU-to-CPU memcpy is in question: is it purely
handled on the CPU, or will it sometimes be copied back to the GPU for some procedure? So I took the safe side, registered
the memory pointers in question, and got at least 1-2 GB less RAM usage than in the last build.

APR 1 2:49PM -> Try some fp4 types; they should work. hipblaslt has issues which I didn't fix; maybe I'll fix them later if I get bored.
We didn't make this ROCm-native by changing all the variables and APIs from cuda to rocm; rather, we are cuda now. XD
int8 calculations are done with cublaslt; it would need rewriting in rocblas or hipblas for that to work natively. wmma can also be worked on, but the datatype and compute type need accumulation in float rather than float16, aka half.
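
To illustrate the float-accumulation point, upstream PyTorch already exposes a switch that makes fp16 matmuls reduce in float32 instead of reduced precision (whether this routes through the wmma path in this build is an assumption):

    # force float32 accumulation for half-precision matmuls
    import torch

    # when False, fp16 GEMMs accumulate in float32 rather than reduced precision
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

    a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    c = a @ b   # half inputs, float accumulation under the flag above
    print(c.dtype, c.abs().mean().item())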
APR 1 11:28 -> Some linking can still have issues, though. The new version supports all of the stated ROCm archs.
APR 3 7:10 -> I was on a chill break yesterday. Spm_Coo tests pass with some parameters, along with hipblaslt.
The latest hipblaslt merged gfx1100 support for perf kernels, so let's wait and then fix the 100 gender row/column problem.
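
A minimal sparse COO sketch (standard PyTorch sparse API; the shapes and values are illustrative, not taken from the test suite mentioned above):

    # sparse COO matmul sketch -- illustrative shapes and values
    import torch

    indices = torch.tensor([[0, 1, 2], [2, 0, 1]])   # row, col coordinates
    values = torch.tensor([3.0, 4.0, 5.0])
    sp = torch.sparse_coo_tensor(indices, values, size=(3, 3), device="cuda")

    dense = torch.randn(3, 3, device="cuda")
    out = torch.sparse.mm(sp, dense)                 # sparse x dense matmul
    print(out)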
APR 7 5:24 -> Fixed some crashes in torch.