[ROCm] PyTorch slow on TTS #150168
Comments
Just to double-check: are you interested in PyTorch running this workload on the CPU or the GPU? If it's the GPU, what is the output of rocminfo?
I would like to run the TTS on my Radeon 680M GPU. I ran the CPU build of PyTorch just to see whether the ROCm version would provide any improvement. Here's the output of $ rocminfo:
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 6800U with Radeon Graphics
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 6800U with Radeon Graphics
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4769
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 15065140(0xe5e034) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 15065140(0xe5e034) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 15065140(0xe5e034) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1030
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 2048(0x800) KB
Chip ID: 5761(0x1681)
ASIC Revision: 2(0x2)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2200
BDFID: 768
Internal Node ID: 1
Compute Unit: 12
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 122
SDMA engine uCode:: 47
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 14680064(0xe00000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS:
Size: 14680064(0xe00000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1030
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Other similar issues across GitHub:
This issue bears some resemblance to #146998 (comment). Can you see if the steps in that comment make any difference in your case?
Thank you. After upgrading to the PyTorch 2.8.0 nightly build and setting the suggested variable, the ROCm version started building a database, similar to when I used it with PyTorch 2.6. However, the performance remains equivalent to the CPU version. I'm confused by the following:
Thanks
Two questions:
Thanks
I didn't disable my integrated GPU (it is the only GPU on my laptop). A custom build of ollama can detect and use my integrated GPU for acceleration (here is the custom build instruction).
Let me write down the sequence:
If it is running faster the second time, then it is working as expected. It is unfortunate that the GPU performance is similar to the CPU performance, but that is indeed its current state.
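As a rough illustration of that expected behavior, here is a minimal sketch (a small Conv1d as a stand-in for the TTS model, not the actual workload): the first forward pass pays the MIOpen find/tuning cost, and the second pass should be faster once the solution is cached.

```python
import time

import torch

# ROCm builds expose the GPU through the "cuda" device name.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv1d(80, 512, kernel_size=5, padding=2).to(device).eval()
x = torch.randn(1, 80, 1000, device=device)

with torch.no_grad():
    for i in range(2):
        start = time.perf_counter()
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for the kernel so the timing is meaningful
        print(f"run {i + 1}: {time.perf_counter() - start:.3f}s")
```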
Thanks for the reply. There is one additional point that I think is worth a look. With the PyTorch ROCm version I could speed up the TTS by prebuilding the database, while the PyTorch CPU version doesn't require any database building. Would it be possible for the PyTorch ROCm version to run like the CPU version, without building a database? From the first post, there are references to other users hitting the same issue when they run TTS (not only Kokoro TTS but also other TTS engines) through PyTorch ROCm.
The main issue that you are reporting is that MIOpen on your gfx1030 has similar or worse performance than your CPU, correct?
I downloaded an article (as an example) and used the first half to build the database. Because of my slow GPU, it took three days to build, resulting in a database of about 300MB. Then, I attempted to process the second half of the article under two conditions:
In both cases, the performance improvement was minimal. I was advised by the MIOpen team to consult the framework support team. Since the warning message appears not only in Kokoro TTS but also in Alltalk TTS, I believe it is appropriate to reach out to the PyTorch ROCm team for assistance.
Found this thread; I had similar issues. Run your app fed with data like: "Artificial intelligence is transforming the world at an unprecedented pace." After the first run you could test hybrid or dynamic-hybrid mode so further tuning gets a little bit faster. After that you can see the db file growing, or see that it was freshly modified by your tuning data.
Those are the other possibilities I know of. Most of you use Docker; I don't, I won't, and I have no clue about Docker. BTW:
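On the "watch the db file growing" check mentioned above, a minimal sketch; the default ~/.config/miopen location is an assumption, and MIOPEN_USER_DB_PATH takes precedence if it is set.

```python
import os
from pathlib import Path

# Assumed default user-db location; MIOPEN_USER_DB_PATH overrides it if you set it.
db_dir = Path(os.environ.get("MIOPEN_USER_DB_PATH", str(Path.home() / ".config" / "miopen")))

for entry in sorted(db_dir.glob("*")):
    size_kb = entry.stat().st_size / 1024
    print(f"{entry.name}: {size_kb:.1f} KB")
```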
@winstonma Have you tried profiling your workload with the Torch profiler? If you believe that your workload is bound by MIOpen performance on the GPU, then there is not much that can be done at the moment. We are aware of the MIOpen performance gap on our consumer-grade GPUs, see e.g.
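A minimal profiling sketch along the lines of the suggestion above; the small Conv1d and the random input are placeholders for the real TTS module and sample batch.

```python
import torch
from torch.profiler import ProfilerActivity, profile

device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder model/input; substitute the actual TTS module and a real sample.
model = torch.nn.Conv1d(80, 256, kernel_size=5, padding=2).to(device).eval()
inputs = torch.randn(1, 80, 500, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    with torch.no_grad():
        model(inputs)

# Sort by GPU time (CPU time on CPU-only runs) to see whether MIOpen conv kernels dominate.
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=15))
```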
@phil2sat Thanks for the tips. @naromero77amd Thanks for the comment. Would it be a good idea for PyTorch ROCm to set the default MIOpen find mode to FAST in a future release?
I can't prove it, but I guess our GPUs (in my case an unsupported gfx90c iGPU) are so slow that the MIOpen default mode 5 (dynamic hybrid) makes models run faster on the CPU. On a modern GPU with sufficient VRAM and horsepower there should be plenty of room to compute and train. But reading, testing, and figuring out what all the MIOpen modes do was fun. So FAST it is; even if I reach only 0.4x realtime, it is faster than on my CPU. At least for me it runs stably enough and I can learn something, even on my 4-year-old laptop.
I've noticed that some people are using external GPUs, and their CPUs are also significantly faster (see this issue). Therefore, I don't believe the slowdown is due to unsupported integrated GPU (iGPU) drivers alone. I am here because the MIOpen team (see this issue) suggested I reach out to the PyTorch ROCm team to see if they can assist.
My GPU achieves 2x real-time speed just by setting export MIOPEN_FIND_MODE=FAST. My CPU performance is roughly the same, but since we're using different TTS engines, comparing 0.4x vs. 2x isn't very meaningful in this case.
Best settings for me were also:
export MIOPEN_FIND_MODE=FAST
export MIOPEN_USER_DB_PATH="~/tts/miopen_cache"
Speedup was around 10x vs. without those params. Using ROCm 6.3 & torch 2.7.0.
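For anyone launching from Python rather than the shell, a minimal sketch assuming the same two variables reported above (the ~/tts/miopen_cache path is just the example from this comment): set them before importing torch so MIOpen sees them when it is first loaded.

```python
import os

# Assumed values from the comment above; adjust the cache path to taste.
os.environ.setdefault("MIOPEN_FIND_MODE", "FAST")
os.environ.setdefault("MIOPEN_USER_DB_PATH", os.path.expanduser("~/tts/miopen_cache"))
os.makedirs(os.environ["MIOPEN_USER_DB_PATH"], exist_ok=True)

import torch  # imported after the variables are set

print("GPU available:", torch.cuda.is_available())
print("HIP version:", torch.version.hip)  # None on non-ROCm builds
```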
🐛 Describe the bug
I installed Kokoro TTS and PyTorch on my machine, which has an AMD Ryzen 7 6800U with Radeon 680M graphics.
I then got a lot of warning messages:
After reporting to the MIOpen team, they suggested tuning the performance database. After tuning the database with a text file, I executed the TTS on the same text file (this ensures that no new entries are needed and results are taken from the database). However, on my machine the fully tuned PyTorch ROCm build and the PyTorch CPU build take the same amount of time to execute.
I use the following command to run with the tuned performance database and the text in result.txt:
Also, since the CPU version doesn't require any tuning, I wonder whether tuning the performance database should be needed at all.
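As a sanity check on the "GPU no faster than CPU" observation, a minimal sketch under the assumption that convolutions dominate the model (this is not the reporter's actual command): time the same conv workload on both devices after a warm-up pass.

```python
import time

import torch


def time_conv(device: str, iters: int = 20) -> float:
    """Time a convolution stand-in on the given device, excluding warm-up."""
    model = torch.nn.Conv1d(80, 512, kernel_size=5, padding=2).to(device).eval()
    x = torch.randn(1, 80, 2000, device=device)
    with torch.no_grad():
        model(x)  # warm-up: MIOpen find / allocator setup
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


print(f"cpu: {time_conv('cpu') * 1000:.2f} ms/iter")
if torch.cuda.is_available():
    print(f"gpu: {time_conv('cuda') * 1000:.2f} ms/iter")
```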
Versions
PyTorch CPU version
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd