
[ROCm] PyTorch slow on TTS #150168


Open
winstonma opened this issue Mar 28, 2025 · 16 comments
Labels
module: rocm AMD GPU support for Pytorch triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@winstonma

winstonma commented Mar 28, 2025

πŸ› Describe the bug

I installed Kokoro TTS and PyTorch on my machine, an AMD Ryzen 7 6800U with Radeon 680M graphics.

# Installed PyTorch ROCm already

# Install Kokoro
pip install -q "kokoro>=0.9.2" soundfile
apt-get -qq -y install espeak-ng > /dev/null 2>&1

# Create a sample TTS output
echo 'Hello! How are you today?' | kokoro -o output.wav

I then got many warning messages:

MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 11325440, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 11325440, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 11325440, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 11325440, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 4853760, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 4853760, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 4853760, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 4853760, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 17797120, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 17797120, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 17797120, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 17797120, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 53396992, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 53396992, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 53396992, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 53396992, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 14562816, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 14562816, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 14562816, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 14562816, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 33979904, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 33979904, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 33979904, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 33979904, provided ptr: 0 size: 0

After I reported this to the MIOpen team, they suggested tuning the performance database. After tuning the database with a text file, I ran the TTS on that same text file (this ensures no new entries are needed and all results come from the database). However, on my machine the fully-tuned PyTorch ROCm build and the PyTorch CPU build take the same amount of time.

I used the following commands to run with the tuned performance database, using the text in result.txt:

# Installed PyTorch ROCm and Kokoro already

export MIOPEN_FIND_MODE=FAST

# Download a small paragraph
wget https://github.com/user-attachments/files/19500860/result.txt

# Create a sample TTS output
kokoro -i result.txt -o output.wav

Also, since the CPU version doesn't require any tuning, I wonder whether tuning the performance database should be necessary at all.
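For completeness, here is a small check I can paste into Python to confirm which build is active (a minimal sketch; `rocm_status` is just a throwaway helper, and it degrades gracefully when torch is missing):

```python
# Minimal sketch: report whether the installed PyTorch is the ROCm (HIP) build
# and whether a GPU device is visible. Degrades gracefully if torch is absent.
def rocm_status():
    try:
        import torch
    except ImportError:
        return {"torch": False, "hip_build": False, "gpu_available": False}
    hip = getattr(torch.version, "hip", None)  # populated only on ROCm builds
    return {
        "torch": True,
        "hip_build": hip is not None,
        "gpu_available": torch.cuda.is_available(),  # ROCm reuses the cuda API
    }

print(rocm_status())
```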

Versions

PyTorch CPU version

Collecting environment information...
PyTorch version: 2.6.0+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.10 (x86_64)
GCC version: (Ubuntu 14.2.0-4ubuntu2) 14.2.0
Clang version: 19.1.7 (1ubuntu2~kisak~o)
CMake version: version 3.30.3
Libc version: glibc-2.40

Python version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.14.0-061400-generic-x86_64-with-glibc2.40
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               16
On-line CPU(s) list:                  0-15
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen 7 6800U with Radeon Graphics
CPU family:                           25
Model:                                68
Thread(s) per core:                   2
Core(s) per socket:                   8
Socket(s):                            1
Stepping:                             1
Frequency boost:                      enabled
CPU(s) scaling MHz:                   39%
CPU max MHz:                          4769.0000
CPU min MHz:                          400.0000
BogoMIPS:                             5390.09
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca debug_swap
Virtualization:                       AMD-V
L1d cache:                            256 KiB (8 instances)
L1i cache:                            256 KiB (8 instances)
L2 cache:                             4 MiB (8 instances)
L3 cache:                             16 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-15
Vulnerability Gather data sampling:   Not affected
Vulnerability Ghostwrite:             Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] flake8==7.1.1
[pip3] mypy==1.14.1
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.1.2
[pip3] numpydoc==1.7.0
[pip3] pytorch-triton-rocm==3.2.0
[pip3] torch==2.6.0+cpu
[pip3] torchaudio==2.6.0+cpu
[pip3] torchsde==0.2.6
[pip3] torchvision==0.21.0+cpu
[conda] _anaconda_depends         2025.03             py312_mkl_0  
[conda] blas                      1.0                         mkl  
[conda] mkl                       2023.1.0         h213fc3f_46344  
[conda] mkl-service               2.4.0           py312h5eee18b_2  
[conda] mkl_fft                   1.3.11          py312h5eee18b_0  
[conda] mkl_random                1.2.8           py312h526ad5a_0  
[conda] numpy                     1.26.4          py312hc5e2394_0  
[conda] numpy-base                1.26.4          py312h0da6c21_0  
[conda] numpydoc                  1.7.0           py312h06a4308_0  
[conda] pytorch-triton-rocm       3.2.0                    pypi_0    pypi
[conda] torch                     2.6.0+cpu                pypi_0    pypi
[conda] torchaudio                2.6.0+cpu                pypi_0    pypi
[conda] torchvision               0.21.0+cpu               pypi_0    pypi

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot pytorch-bot bot added the module: rocm AMD GPU support for Pytorch label Mar 28, 2025
@winstonma winstonma changed the title [ROCm] PyTorch build is slow [ROCm] PyTorch slow on TTS Mar 29, 2025
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Mar 31, 2025
@naromero77amd
Collaborator

Just to double-check: are you interested in running this workload on the CPU or the GPU?

If it's the GPU, what is the output of rocminfo?

@winstonma
Author

winstonma commented Apr 3, 2025

I would like to run the TTS on my Radeon 680M GPU. I ran the CPU PyTorch just to see if the ROCm version would provide any improvement.

Here's the output of rocminfo:

$ rocminfo
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 6800U with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 6800U with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   4769                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    15065140(0xe5e034) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    15065140(0xe5e034) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    15065140(0xe5e034) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1030                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 5761(0x1681)                       
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2200                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 122                                
  SDMA engine uCode::      47                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    14680064(0xe00000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    14680064(0xe00000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1030         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Other similar issues across GitHub:

@naromero77amd
Collaborator

This issue bears some resemblance to #146998 (comment)

Can you see if the steps in the comment make any difference in your case?

@winstonma
Author

winstonma commented Apr 4, 2025

Thank you. After upgrading to the PyTorch 2.8.0 nightly build and setting

torch.backends.cudnn.benchmark=True

, the ROCm version starts building a database, similar to when I used

export MIOPEN_FIND_MODE=3
export MIOPEN_FIND_ENFORCE=3

with PyTorch 2.6. However, the performance remains equivalent to the CPU version. I'm confused by the following:

  • Why is PyTorch ROCm the only version that requires time-consuming database building?
  • TTS performance with PyTorch ROCm is similar to PyTorch CPU on my AMD 6800U. Should I expect PyTorch ROCm to be faster than the CPU build, as it is for Stable Diffusion/Ollama?
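For reference, this is roughly what my run script sets up now (a minimal sketch; the env var has to be in place before MIOpen initializes, and the torch import is guarded so the snippet stands alone):

```python
import os

# MIOPEN_FIND_MODE must be in the environment before MIOpen initializes,
# i.e., before the first convolution runs.
os.environ["MIOPEN_FIND_MODE"] = "FAST"

try:
    import torch
    # On ROCm builds this flag drives MIOpen benchmarking, not cuDNN's.
    torch.backends.cudnn.benchmark = True
except ImportError:
    pass  # torch not installed; the env var alone is still set

print("MIOPEN_FIND_MODE =", os.environ["MIOPEN_FIND_MODE"])
```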

Thanks

@naromero77amd
Collaborator

Two questions:

  1. Did you disable your integrated GPU? (BIOS setting)
  2. Did setting benchmark = True improve performance over the original?

@winstonma
Author

winstonma commented Apr 21, 2025

Thanks

Two questions:

1. Did you disable your integrated GPU? (BIOS setting)

I didn't disable my integrated GPU (it is the only GPU on my laptop). A custom build of Ollama can detect and use my integrated GPU for acceleration (here are the custom build instructions).

2. Did setting benchmark = True improve performance over the original?

Maybe let me write down the sequence:

  • Run the TTS with torch.backends.cudnn.benchmark=True. The first run is very slow.
  • Run the TTS again with export MIOPEN_FIND_MODE=FAST. The second run is faster. However, if I uninstall PyTorch ROCm, install PyTorch CPU, and run the TTS program, the CPU version's performance is similar to the ROCm version with the database.
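The pattern in that sequence (the first run pays a one-time tuning cost, later runs hit the database) can be mimicked with a toy stdlib sketch; `tuned_kernel` is a made-up stand-in, not MIOpen's actual API:

```python
import time
from functools import lru_cache

# Toy stand-in for "tune once, then hit the db": the first call for a given
# shape pays a one-time cost; repeat calls reuse the cached result.
@lru_cache(maxsize=None)
def tuned_kernel(shape):
    time.sleep(0.05)  # pretend this is the expensive MIOpen find/tune step
    return f"solver-for-{shape}"

t0 = time.perf_counter(); tuned_kernel((1, 80, 512)); cold = time.perf_counter() - t0
t0 = time.perf_counter(); tuned_kernel((1, 80, 512)); warm = time.perf_counter() - t0
print(f"cold={cold:.3f}s warm={warm:.6f}s")  # warm run skips the tuning cost
```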

@naromero77amd
Collaborator

If it is running faster the second time, then it is working as expected. It is unfortunate that the GPU performance is similar to the CPU performance, but that is indeed its current state.

@winstonma
Author

Thanks for the reply

There is one additional point I think is worth a look. With the PyTorch ROCm version, I can speed up the TTS by prebuilding the database, while the PyTorch CPU version doesn't require any database building. Would it be possible for the PyTorch ROCm version to run like the CPU version, without database building?

The first post references other users hitting the same issue when they run TTS (not only Kokoro TTS but other TTS engines as well) through PyTorch ROCm.

@naromero77amd
Collaborator

The main issue that you are reporting is that MIOpen on your gfx1030 has similar or worse performance than on your CPU, correct?

@winstonma
Author

winstonma commented Apr 24, 2025

I downloaded an article (as an example) and used the first half to build the database. Because of my slow GPU, it took three days to build, resulting in a database of about 300MB. Then, I attempted to process the second half of the article under two conditions:

  • Using the database built from the first half
  • Deleting the database file before processing

In both cases, the performance improvement was minimal.

I was advised by the MIOpen team to consult the framework support team. Since the warning message appears not only in Kokoro TTS but also in Alltalk TTS, I believe it is appropriate to reach out to the PyTorch ROCm team for assistance.

@phil2sat

phil2sat commented Apr 25, 2025

Found this thread; I had similar issues.
What worked for me:
export MIOPEN_USER_DB_PATH=~/tts/miopen_cache # specify a folder for your specific task/model/app/TTS
export MIOPEN_FIND_MODE=1 # training into the DB

Run your app, feeding it data like:
The quick brown fox jumps over the lazy dog. # At first that's enough, but maybe you'd like some more, like I did.

After the first run you can test hybrid or dynamic-hybrid mode so further training gets a little faster.

Artificial intelligence is transforming the world at an unprecedented pace.
How much wood would a woodchuck chuck if a woodchuck could chuck wood?
She sells seashells by the seashore, and the shells she sells are surely seashells.
The rain in Spain stays mainly in the plains—what a fascinating phenomenon!
Hey buddy, could you pass me the salt? Thanks a million!
Beware the Jabberwock, my son! The jaws that bite, the claws that catch!

After that you can see the DB file growing, or see that it was freshly modified by your training data:

export MIOPEN_USER_DB_PATH=~/tts/miopen_cache
export MIOPEN_FIND_MODE=fast # fast (or 2) uses the created DB and is what it says

These are the other possibilities:
#export MIOPEN_FIND_MODE=1 # train
#export MIOPEN_FIND_MODE=3 # hybrid
#export MIOPEN_FIND_MODE=5 # dynamic_hybrid
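Putting the two phases above into one wrapper script might look like this (a sketch; the paths and the `your-tts` warm-up command are placeholders, not real binaries):

```shell
#!/bin/sh
# Sketch of the two-phase workflow: tune into a per-app DB, then reuse it.
DB_DIR="$HOME/tts/miopen_cache"   # example path; use one folder per task/model
mkdir -p "$DB_DIR"
export MIOPEN_USER_DB_PATH="$DB_DIR"

export MIOPEN_FIND_MODE=1         # phase 1: exhaustive find, populates the DB
# your-tts -i warmup.txt -o /dev/null    # warm-up run with representative text

export MIOPEN_FIND_MODE=FAST      # phase 2: reuse the DB, skip searching
# your-tts -i real_input.txt -o output.wav

echo "db=$MIOPEN_USER_DB_PATH mode=$MIOPEN_FIND_MODE"
```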

I know most of you use Docker; I don't, I won't, and I've no clue about Docker.
I'm using the coqui-ai fork 0.26, pytorch-rocm 2.7.0, and a Vega 448sp iGPU from a Ryzen 4700U, and get around 0.4x realtime for XTTS-v2 with a Speaker.wav.

BTW: 53 KB Fri Apr 25 09:06:20 2025 gfx900_xnack_7.HIP.3_3_0_a85ca8a54-dirty.ufdb.txt

@naromero77amd
Collaborator

@winstonma Have you tried profiling your workload with the Torch profiler?

If you believe that your workload is bound by MIOpen performance on the GPU, then there is not much that can be done at the moment. We are aware of the MIOpen performance gap on our consumer grade GPUs, see e.g.
#146998 (comment)

@winstonma
Author

winstonma commented Apr 26, 2025

@phil2sat Thanks for the tips. Using export MIOPEN_FIND_MODE=FAST to skip training feels like a secret trick to me; I had assumed the TTS would work smoothly without extra configuration. It seems TTS users who notice slow performance and then search for solutions or related bug reports are facing challenges beyond typical usage.

@naromero77amd Thanks for the comment. Would it be a good idea for PyTorch ROCm to make FAST the default find mode in a future release?

@phil2sat

@winstonma Have you tried profiling your workload with the Torch profiler?

If you believe that your workload is bound by MIOpen performance on the GPU, then there is not much that can be done at the moment. We are aware of the MIOpen performance gap on our consumer grade GPUs, see e.g. #146998 (comment)

I can't prove it, but I guess our GPUs (in my case an unsupported gfx90c iGPU) are so slow that MIOpen's default mode 5 (dynamic hybrid) makes models run faster on the CPU. A modern GPU with sufficient VRAM and horsepower should have plenty of room to compute and train.

But reading, testing, and figuring out what all the MIOpen modes do was fun. So FAST is fast: even if I reach only 0.4x realtime, it's faster than on my CPU.

At least for me it runs stably enough and I can learn something, even on my 4-year-old laptop.

@winstonma
Author

I can't prove it, but I guess our GPUs (in my case an unsupported gfx90c iGPU) are so slow that MIOpen's default mode 5 (dynamic hybrid) makes models run faster on the CPU. A modern GPU with sufficient VRAM and horsepower should have plenty of room to compute and train.

I've noticed that some people are using external GPUs, and their CPUs are also significantly faster (see this issue). Therefore, I don't believe the slowdown is due to unsupported integrated GPU (iGPU) drivers alone.

I am here because the MIOpen team (see this issue) suggested that I contact the PyTorch ROCm team to see if they can assist.

But reading, testing, and figuring out what all the MIOpen modes do was fun. So FAST is fast: even if I reach only 0.4x realtime, it's faster than on my CPU.

My GPU achieves 2x real-time speed just by setting export MIOPEN_FIND_MODE=FAST. My CPU performance is roughly the same, but since we're using different TTS engines, comparing 0.4x vs. 2x isn't very meaningful in this case.

@MarArMar

MarArMar commented May 4, 2025

Best settings for me were also :

export MIOPEN_FIND_MODE=FAST
export MIOPEN_USER_DB_PATH="~/tts/miopen_cache"

The speedup was around 10x versus running without those params.

Using ROCm 6.3 & torch-2.7.0.
