Conversation

@gaetansnl
Contributor

@gaetansnl gaetansnl commented Oct 18, 2022

This PR requires a full test run because it modifies the replacement.
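For context, the PR title refers to an RMS-style layer normalization for T5. The snippet below is a minimal pure-Python sketch of that operation, not the kernel from this PR (which is not shown in the thread). The function name `rms_norm` and the default `eps` value are assumptions made for illustration.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale `x` by the reciprocal of its root mean square, then by `weight`.

    Unlike standard layer normalization, no mean is subtracted and no bias
    is added. `eps` guards against division by zero (its default here is
    an assumption, not taken from the PR).
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

# Hypothetical input: one hidden vector of size 2 with unit weights.
hidden = [3.0, 4.0]
print(rms_norm(hidden, weight=[1.0, 1.0]))
```

With unit weights and a negligible `eps`, the mean of the squared outputs is close to 1, which is a quick way to sanity-check any replacement against a reference.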

@github-actions github-actions bot added feature and removed feature labels Oct 18, 2022
@pommedeterresautee
Member

Can you check whether there is an error in the reference implementation?

@gaetansnl gaetansnl marked this pull request as ready for review October 25, 2022 09:05
@gaetansnl
Contributor Author

IMO this needs a full test run and a benchmark comparison on a 3090. I will post the A10G results.

@gaetansnl gaetansnl changed the title feat: rms replacement base feat: layernorm rms replacement for T5 Oct 25, 2022
@github-actions github-actions bot added feature and removed feature labels Oct 25, 2022
@gaetansnl
Contributor Author

A10G, without replacement:

test/test_torchdynamo.py ................................ [100%]
shape=(1, 128) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x128-t5-small]                      10.8974 (1.0)    11.1234 (1.0)  10.8347 (1.0)  12.2346 (1.0)  12.8946 (1.0)  13.2912 (1.0)  11.5108 (1.0)  15.1023 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-t5-small]  1.7125 (6.36)    1.7125 (6.5)   1.7101 (6.34)  1.7148 (7.13)  1.7771 (7.26)  1.7833 (7.45)  1.7644 (6.52)  1.8999 (7.95)

shape=(1, 16) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
--------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x16-t5-small]                      10.2884 (1.0)    10.3485 (1.0)  10.1981 (1.0)  10.8974 (1.0)  10.7813 (1.0)  11.0701 (1.0)  10.7284 (1.0)  12.5391 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-t5-small]  1.2064 (8.53)    1.2064 (8.58)  1.2045 (8.47)  1.2083 (9.02)  1.266 (8.52)   1.2699 (8.72)  1.2596 (8.52)  1.3924 (9.01)

shape=(1, 256) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x256-t5-small]                      10.2764 (1.0)    10.6193 (1.0)  10.2149 (1.0)  11.9115 (1.0)  10.8126 (1.0)  10.9122 (1.0)  10.6578 (1.0)  11.9531 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-t5-small]  2.434 (4.22)     2.4342 (4.36)  2.4318 (4.2)   2.4367 (4.89)  2.4996 (4.33)  2.5131 (4.34)  2.4898 (4.28)  2.6649 (4.49)

shape=(1, 33) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)     Median         Mean           Min            Max
--------------------------------------------------------------------------  ---------------  -------------  ------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x33-t5-small]                      9.8122 (1.0)     9.8136 (1.0)   9.7389 (1.0)  9.8744 (1.0)   10.3948 (1.0)  10.4958 (1.0)  10.3239 (1.0)  11.3421 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-t5-small]  1.2999 (7.55)    1.2999 (7.55)  1.2981 (7.5)  1.3025 (7.58)  1.3595 (7.65)  1.3627 (7.7)   1.3517 (7.64)  1.4769 (7.68)

shape=(1, 384) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x384-t5-small]                      10.4739 (1.0)    10.5035 (1.0)  10.3692 (1.0)  10.6571 (1.0)  10.9789 (1.0)  11.1108 (1.0)  10.9487 (1.0)  11.9647 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-t5-small]  3.0922 (3.39)    3.0926 (3.4)   3.0873 (3.36)  3.0972 (3.44)  3.1538 (3.48)  3.1633 (3.51)  3.1445 (3.48)  3.2996 (3.63)

shape=(1, 512) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512-t5-small]                      10.4758 (1.0)    10.4756 (1.0)  10.3837 (1.0)  10.602 (1.0)   10.9657 (1.0)  11.1505 (1.0)  10.9168 (1.0)  12.5448 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-t5-small]  3.8435 (2.73)    3.8435 (2.73)  3.8371 (2.71)  3.8477 (2.76)  3.9102 (2.8)   3.9153 (2.85)  3.8989 (2.8)   4.0407 (3.1)

shape=(32, 128) reference_fp32=t5-small
Name                                                                          Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128-t5-small]                      24.2754 (1.0)    24.274 (1.0)    24.2648 (1.0)   24.2801 (1.0)   24.3513 (1.0)   24.4583 (1.0)   24.3388 (1.0)   24.7558 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-t5-small]  17.7034 (1.37)   17.7038 (1.37)  17.6949 (1.37)  17.7172 (1.37)  17.7806 (1.37)  17.8018 (1.37)  17.7695 (1.37)  17.8853 (1.38)

shape=(32, 16) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16-t5-small]                      11.1029 (1.0)    11.3133 (1.0)  10.9093 (1.0)  12.162 (1.0)   11.3779 (1.0)  11.5764 (1.0)  11.3415 (1.0)  12.5898 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-t5-small]  2.127 (5.22)     2.1272 (5.32)  2.1229 (5.14)  2.1313 (5.71)  2.1855 (5.21)  2.1901 (5.29)  2.1795 (5.2)   2.3096 (5.45)

shape=(32, 256) reference_fp32=t5-small
Name                                                                          Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256-t5-small]                      55.8947 (1.0)    55.8947 (1.0)   55.8947 (1.0)   55.8947 (1.0)   56.7266 (1.0)   56.7266 (1.0)   56.7266 (1.0)   56.7266 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-t5-small]  39.1213 (1.43)   39.1244 (1.43)  39.1213 (1.43)  39.1274 (1.43)  39.3141 (1.44)  39.3228 (1.44)  39.3141 (1.44)  39.3315 (1.44)

shape=(32, 33) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean          Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  ------------  -------------  -------------
test_benchmark_implementations[baseline-32x33-t5-small]                      10.3974 (1.0)    10.5949 (1.0)  10.301 (1.0)   12.04 (1.0)    10.8182 (1.0)  10.921 (1.0)  10.7937 (1.0)  11.504 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-t5-small]  3.6937 (2.81)    3.6939 (2.87)  3.6882 (2.79)  3.6981 (3.26)  3.7561 (2.88)  3.7643 (2.9)  3.7509 (2.88)  3.8989 (2.95)

shape=(8, 128) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x128-t5-small]                      10.3795 (1.0)    10.4239 (1.0)  10.2703 (1.0)  10.8493 (1.0)  11.1293 (1.0)  11.2902 (1.0)  10.9973 (1.0)  12.1218 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-t5-small]  4.1269 (2.52)    4.1267 (2.53)  4.1203 (2.49)  4.1326 (2.63)  4.1976 (2.65)  4.2113 (2.68)  4.1896 (2.62)  4.3273 (2.8)

shape=(8, 16) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
--------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x16-t5-small]                      11.3719 (1.0)    11.4684 (1.0)  11.3335 (1.0)  11.8419 (1.0)  12.0279 (1.0)  12.2447 (1.0)  11.9642 (1.0)  13.3369 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-t5-small]  1.4376 (7.91)    1.4375 (7.98)  1.4347 (7.9)   1.4404 (8.22)  1.497 (8.03)   1.5005 (8.16)  1.491 (8.02)   1.6273 (8.2)

shape=(8, 256) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -----------
test_benchmark_implementations[baseline-8x256-t5-small]                      14.5639 (1.0)    14.5639 (1.0)  14.5597 (1.0)  14.567 (1.0)   14.9526 (1.0)  15.0748 (1.0)  14.9329 (1.0)  15.54 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-t5-small]  9.7372 (1.5)     9.7379 (1.5)   9.728 (1.5)    9.7488 (1.49)  9.8034 (1.53)  9.8208 (1.53)  9.7916 (1.53)  9.96 (1.56)

shape=(8, 33) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median         Mean           Min            Max
--------------------------------------------------------------------------  ---------------  -------------  -------------  ------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x33-t5-small]                      11.1895 (1.0)    11.2395 (1.0)  11.082 (1.0)   11.604 (1.0)  11.7356 (1.0)  11.8823 (1.0)  11.6374 (1.0)  12.6045 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-t5-small]  1.9298 (5.8)     1.9299 (5.82)  1.9281 (5.75)  1.932 (6.01)  1.9875 (5.9)   1.9925 (5.96)  1.9826 (5.87)  2.1161 (5.96)

shape=(8, 384) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
---------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  -------------
test_benchmark_implementations[baseline-8x384-t5-small]                      27.929 (1.0)     27.919 (1.0)    27.8952 (1.0)   27.9327 (1.0)   28.0908 (1.0)   28.2274 (1.0)   28.0879 (1.0)   28.5035 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-t5-small]  17.6612 (1.58)   17.6622 (1.58)  17.6372 (1.58)  17.6796 (1.58)  17.7526 (1.58)  17.7631 (1.59)  17.7249 (1.58)  17.8466 (1.6)

shape=(8, 512) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean           Min             Max
---------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  -------------  --------------  --------------
test_benchmark_implementations[baseline-8x512-t5-small]                      47.5101 (1.0)    47.6515 (1.0)   47.5101 (1.0)   47.7929 (1.0)   47.9705 (1.0)   48.2465 (1.0)  47.9705 (1.0)   48.5224 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-t5-small]  26.7525 (1.78)   26.7271 (1.78)  26.6523 (1.78)  26.7764 (1.78)  26.8328 (1.79)  26.84 (1.8)    26.8281 (1.79)  26.8589 (1.81)

@gaetansnl
Contributor Author

A10G, with RMS replacement:

shape=(1, 128) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x128-t5-small]                      11.029 (1.0)     11.0453 (1.0)  10.9686 (1.0)  11.1269 (1.0)  11.5238 (1.0)  12.3192 (1.0)  11.4882 (1.0)  15.8571 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-t5-small]  1.4235 (7.75)    1.4236 (7.76)  1.4214 (7.72)  1.426 (7.8)    1.4896 (7.74)  1.5029 (8.2)   1.4765 (7.78)  1.5986 (9.92)

shape=(1, 16) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)     Median          Mean            Min             Max
--------------------------------------------------------------------------  ---------------  --------------  --------------  -------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-1x16-t5-small]                      10.5959 (1.0)    10.5788 (1.0)   10.3835 (1.0)   10.946 (1.0)   11.0055 (1.0)   11.1147 (1.0)   10.9274 (1.0)   11.774 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-t5-small]  0.9577 (11.06)   0.9577 (11.05)  0.9548 (10.88)  0.9598 (11.4)  1.0146 (10.85)  1.0181 (10.92)  1.0086 (10.83)  1.1328 (10.39)

shape=(1, 256) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x256-t5-small]                      10.7452 (1.0)    10.8366 (1.0)  10.3772 (1.0)  12.067 (1.0)   11.1614 (1.0)  11.5054 (1.0)  10.9191 (1.0)  13.2872 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-t5-small]  2.1578 (4.98)    2.1577 (5.02)  2.1557 (4.81)  2.1605 (5.59)  2.2151 (5.04)  2.2199 (5.18)  2.2112 (4.94)  2.3451 (5.67)

shape=(1, 33) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
--------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x33-t5-small]                      10.0617 (1.0)    10.0719 (1.0)  10.0033 (1.0)  10.1398 (1.0)  10.5688 (1.0)  10.7208 (1.0)  10.507 (1.0)   11.4625 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-t5-small]  1.0553 (9.53)    1.0554 (9.54)  1.0521 (9.51)  1.0588 (9.58)  1.1139 (9.49)  1.1188 (9.58)  1.1042 (9.52)  1.2549 (9.13)

shape=(1, 384) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x384-t5-small]                      10.6152 (1.0)    10.6547 (1.0)  10.5708 (1.0)  10.8305 (1.0)  11.201 (1.0)   11.3768 (1.0)  11.142 (1.0)   12.0004 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-t5-small]  2.778 (3.82)     2.7777 (3.84)  2.7732 (3.81)  2.7818 (3.89)  2.8404 (3.94)  2.8441 (4.0)   2.8275 (3.94)  2.9572 (4.06)

shape=(1, 512) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512-t5-small]                      10.6596 (1.0)    10.6931 (1.0)  10.5691 (1.0)  10.908 (1.0)   11.2008 (1.0)  11.4652 (1.0)  11.0623 (1.0)  12.7114 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-t5-small]  3.4962 (3.05)    3.4965 (3.06)  3.4911 (3.03)  3.5024 (3.11)  3.5558 (3.15)  3.5601 (3.22)  3.5503 (3.12)  3.6653 (3.47)

shape=(32, 128) reference_fp32=t5-small
Name                                                                          Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128-t5-small]                      24.2781 (1.0)    24.2795 (1.0)   24.2635 (1.0)   24.29 (1.0)     24.3616 (1.0)   24.4732 (1.0)   24.3556 (1.0)   24.8117 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-t5-small]  13.2036 (1.84)   13.2028 (1.84)  13.1913 (1.84)  13.2106 (1.84)  13.2764 (1.83)  13.2893 (1.84)  13.2639 (1.84)  13.3791 (1.85)

shape=(32, 16) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16-t5-small]                      11.0888 (1.0)    11.1226 (1.0)  11.0654 (1.0)  11.2716 (1.0)  11.8778 (1.0)  12.2726 (1.0)  11.6101 (1.0)  15.0215 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-t5-small]  1.7764 (6.24)    1.7763 (6.26)  1.7724 (6.24)  1.7792 (6.34)  1.8328 (6.48)  1.8397 (6.67)  1.8274 (6.35)  1.9587 (7.67)

shape=(32, 256) reference_fp32=t5-small
Name                                                                          Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
----------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256-t5-small]                      55.9114 (1.0)    55.9114 (1.0)   55.9114 (1.0)   55.9114 (1.0)   56.553 (1.0)    56.553 (1.0)    56.553 (1.0)    56.553 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-t5-small]  30.487 (1.83)    30.4806 (1.83)  30.4509 (1.84)  30.5037 (1.83)  30.5607 (1.85)  30.5848 (1.85)  30.5551 (1.85)  30.6387 (1.85)

shape=(32, 33) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x33-t5-small]                      10.6136 (1.0)    10.6388 (1.0)  10.5168 (1.0)  10.8556 (1.0)  11.1354 (1.0)  11.1911 (1.0)  11.0141 (1.0)  11.8491 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-t5-small]  3.1233 (3.4)     3.1233 (3.41)  3.1189 (3.37)  3.128 (3.47)   3.1838 (3.5)   3.1875 (3.51)  3.1768 (3.47)  3.301 (3.59)

shape=(8, 128) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x128-t5-small]                      10.5112 (1.0)    10.5188 (1.0)  10.364 (1.0)   10.7777 (1.0)  10.9929 (1.0)  11.1348 (1.0)  10.9541 (1.0)  12.1555 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-t5-small]  3.5589 (2.95)    3.5586 (2.96)  3.5511 (2.92)  3.5644 (3.02)  3.6237 (3.03)  3.6302 (3.07)  3.612 (3.03)   3.7298 (3.26)

shape=(8, 16) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min            Max
--------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  -------------  -------------
test_benchmark_implementations[baseline-8x16-t5-small]                      11.6463 (1.0)    11.6482 (1.0)   11.5398 (1.0)   11.8025 (1.0)   12.0958 (1.0)   12.2655 (1.0)   11.9916 (1.0)  13.0905 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-t5-small]  1.1495 (10.13)   1.1495 (10.13)  1.1468 (10.06)  1.1515 (10.25)  1.2064 (10.03)  1.2097 (10.14)  1.1996 (10.0)  1.3237 (9.89)

shape=(8, 256) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean          Min            Max
---------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  ------------  -------------  -------------
test_benchmark_implementations[baseline-8x256-t5-small]                      14.5856 (1.0)    14.586 (1.0)   14.5731 (1.0)  14.601 (1.0)   15.1165 (1.0)  15.168 (1.0)  14.9603 (1.0)  15.5511 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-t5-small]  7.9708 (1.83)    7.9715 (1.83)  7.9562 (1.83)  7.9966 (1.83)  8.0344 (1.88)  8.045 (1.89)  8.0172 (1.87)  8.1483 (1.91)

shape=(8, 33) reference_fp32=t5-small
Name                                                                        Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
--------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x33-t5-small]                      11.4275 (1.0)    11.5003 (1.0)  11.3355 (1.0)  11.7752 (1.0)  11.9373 (1.0)  12.1501 (1.0)  11.8968 (1.0)  13.5944 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-t5-small]  1.6373 (6.98)    1.6375 (7.02)  1.6356 (6.93)  1.6396 (7.18)  1.6983 (7.03)  1.7026 (7.14)  1.6896 (7.04)  1.8334 (7.41)

shape=(8, 384) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
---------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x384-t5-small]                      27.9311 (1.0)    27.9374 (1.0)   27.9285 (1.0)   27.9526 (1.0)   28.3195 (1.0)   28.3491 (1.0)   28.1371 (1.0)   28.5907 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-t5-small]  14.2856 (1.96)   14.2857 (1.96)  14.2685 (1.96)  14.3028 (1.95)  14.3354 (1.98)  14.3622 (1.97)  14.3136 (1.97)  14.4246 (1.98)

shape=(8, 512) reference_fp32=t5-small
Name                                                                         Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
---------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512-t5-small]                      47.4965 (1.0)    47.6494 (1.0)   47.4965 (1.0)   47.8023 (1.0)   48.1978 (1.0)   48.4258 (1.0)   48.1978 (1.0)   48.6537 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-t5-small]  22.015 (2.16)    22.0358 (2.16)  21.9808 (2.16)  22.0826 (2.16)  22.0265 (2.19)  22.0977 (2.19)  22.0088 (2.19)  22.3052 (2.18)
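To compare the two A10G runs directly, the sketch below divides the `dynamo_optimized_cuda_graphs` Median (CUDA) values without the replacement by the values with it, for a few shapes copied from the tables above. The helper name `relative_gain` is hypothetical, and the time unit is left as printed because the thread does not state it. The numbers in parentheses in the tables appear to be the same kind of ratio taken against the baseline row (for example 10.8974 / 1.7125 ≈ 6.36).

```python
# Median (CUDA) timings of dynamo_optimized_cuda_graphs on t5-small,
# copied from the two A10G tables in this thread.
medians = {
    # shape: (without replacement, with rms)
    (1, 128): (1.7125, 1.4235),
    (8, 128): (4.1269, 3.5589),
    (32, 128): (17.7034, 13.2036),
    (32, 256): (39.1213, 30.487),
}

def relative_gain(before, after):
    """Return how many times faster `after` is compared to `before`."""
    return before / after

for shape, (before, after) in medians.items():
    print(f"{shape}: {relative_gain(before, after):.2f}x faster with rms")
```

Only four shapes are included here, so this is a spot check rather than a summary of the full benchmark.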


@github-actions github-actions bot added feature and removed feature labels Oct 25, 2022
@gaetansnl
Contributor Author

gaetansnl commented Oct 25, 2022

A10G, BERT on the feat/rms-replacement branch, to check for regressions:

shape=(1, 128) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x128-bert-base-uncased]                      6.3768 (1.0)     6.9185 (1.0)   6.2511 (1.0)   8.3863 (1.0)   6.8144 (1.0)   6.9155 (1.0)   6.7677 (1.0)   8.009 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-bert-base-uncased]  1.6157 (3.95)    1.6156 (4.28)  1.6137 (3.87)  1.6178 (5.18)  1.6695 (4.08)  1.6734 (4.13)  1.6628 (4.07)  1.7979 (4.45)

shape=(1, 16) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x16-bert-base-uncased]                      6.1886 (1.0)     6.2148 (1.0)   6.1249 (1.0)   6.4814 (1.0)   6.6933 (1.0)   6.7711 (1.0)   6.6501 (1.0)   7.4266 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-bert-base-uncased]  0.7631 (8.11)    0.7632 (8.14)  0.7608 (8.05)  0.7665 (8.46)  0.8148 (8.21)  0.8175 (8.28)  0.8093 (8.22)  0.9219 (8.06)

shape=(1, 256) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  ------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x256-bert-base-uncased]                      6.2659 (1.0)     6.2773 (1.0)   6.1705 (1.0)   6.4937 (1.0)  6.812 (1.0)    7.0869 (1.0)   6.7092 (1.0)   8.4995 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-bert-base-uncased]  1.9217 (3.26)    1.9217 (3.27)  1.9181 (3.22)  1.925 (3.37)  1.9767 (3.45)  1.9811 (3.58)  1.9697 (3.41)  2.0793 (4.09)

shape=(1, 33) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean          Min            Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  ------------  -------------  -------------
test_benchmark_implementations[baseline-1x33-bert-base-uncased]                      6.3605 (1.0)     6.3749 (1.0)   6.333 (1.0)    6.4586 (1.0)   6.95 (1.0)     7.1235 (1.0)  6.87 (1.0)     8.7455 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-bert-base-uncased]  0.8979 (7.08)    0.8979 (7.1)   0.8956 (7.07)  0.9007 (7.17)  0.9495 (7.32)  0.954 (7.47)  0.9436 (7.28)  1.0593 (8.26)

shape=(1, 384) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x384-bert-base-uncased]                      6.3904 (1.0)     6.4845 (1.0)   6.3237 (1.0)   7.2649 (1.0)   6.8681 (1.0)   6.9416 (1.0)   6.8218 (1.0)   7.7629 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-bert-base-uncased]  2.1564 (2.96)    2.1567 (3.01)  2.1528 (2.94)  2.1602 (3.36)  2.2105 (3.11)  2.2162 (3.13)  2.2063 (3.09)  2.3393 (3.32)

shape=(1, 512) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512-bert-base-uncased]                      6.032 (1.0)      6.0298 (1.0)   5.9541 (1.0)   6.133 (1.0)    6.601 (1.0)    6.8261 (1.0)   6.5124 (1.0)   8.4258 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-bert-base-uncased]  3.0147 (2.0)     3.0151 (2.0)   3.0112 (1.98)  3.0203 (2.03)  3.0723 (2.15)  3.0772 (2.22)  3.0663 (2.12)  3.1782 (2.65)

shape=(32, 128) reference_fp32=bert-base-uncased
Name                                                                                   Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
-------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x128-bert-base-uncased]                      26.7683 (1.0)    26.7678 (1.0)   26.7457 (1.0)   26.7895 (1.0)   26.9032 (1.0)   26.9335 (1.0)   26.8283 (1.0)   27.0688 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-bert-base-uncased]  14.2598 (1.88)   14.2607 (1.88)  14.2554 (1.88)  14.2687 (1.88)  14.3311 (1.88)  14.3507 (1.88)  14.3261 (1.87)  14.4651 (1.87)

shape=(32, 16) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  ------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16-bert-base-uncased]                      6.4059 (1.0)     6.4324 (1.0)   6.3587 (1.0)   6.6748 (1.0)  6.9709 (1.0)   7.2847 (1.0)   6.8709 (1.0)   9.3627 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-bert-base-uncased]  2.6115 (2.45)    2.6115 (2.46)  2.6085 (2.44)  2.616 (2.55)  2.6687 (2.61)  2.6731 (2.73)  2.6593 (2.58)  2.7939 (3.35)

shape=(32, 256) reference_fp32=bert-base-uncased
Name                                                                                   Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median          Mean            Min             Max
-------------------------------------------------------------------------------------  ---------------  -------------  -------------  ------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256-bert-base-uncased]                      59.766 (1.0)     59.766 (1.0)   59.766 (1.0)   59.766 (1.0)  60.678 (1.0)    60.678 (1.0)    60.678 (1.0)    60.678 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-bert-base-uncased]  28.4519 (2.1)    28.4496 (2.1)  28.4429 (2.1)  28.454 (2.1)  28.3177 (2.14)  28.3557 (2.14)  28.3151 (2.14)  28.4343 (2.13)

shape=(32, 33) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median        Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  ------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x33-bert-base-uncased]                      7.4187 (1.0)     7.4177 (1.0)   7.4107 (1.0)   7.4234 (1.0)   7.7245 (1.0)  7.7755 (1.0)   7.7064 (1.0)   8.3104 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-bert-base-uncased]  4.9207 (1.51)    4.921 (1.51)   4.9152 (1.51)  4.9261 (1.51)  4.98 (1.55)   4.9858 (1.56)  4.9735 (1.55)  5.0863 (1.63)

shape=(8, 128) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  ------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x128-bert-base-uncased]                      7.8429 (1.0)     7.8438 (1.0)   7.839 (1.0)   7.8521 (1.0)   8.1116 (1.0)   8.1607 (1.0)   8.1048 (1.0)   8.5234 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-bert-base-uncased]  4.807 (1.63)     4.8067 (1.63)  4.803 (1.63)  4.8089 (1.63)  4.8633 (1.67)  4.8688 (1.68)  4.8541 (1.67)  4.9637 (1.72)

shape=(8, 16) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  ------------
test_benchmark_implementations[baseline-8x16-bert-base-uncased]                      7.2084 (1.0)     7.7711 (1.0)   6.6948 (1.0)   9.7749 (1.0)   7.1956 (1.0)   7.2967 (1.0)   7.1063 (1.0)   8.5743 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-bert-base-uncased]  1.5118 (4.77)    1.5117 (5.14)  1.5083 (4.44)  1.5145 (6.45)  1.5616 (4.61)  1.5637 (4.67)  1.5583 (4.56)  1.669 (5.14)

shape=(8, 256) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x256-bert-base-uncased]                      17.603 (1.0)     17.6065 (1.0)  17.5904 (1.0)  17.6345 (1.0)  17.7363 (1.0)  17.7816 (1.0)  17.7234 (1.0)  17.9846 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-bert-base-uncased]  7.954 (2.21)     7.9557 (2.21)  7.9492 (2.21)  7.9723 (2.21)  8.0174 (2.21)  8.0355 (2.21)  8.0111 (2.21)  8.1668 (2.2)

shape=(8, 33) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median        Mean           Min            Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  ------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x33-bert-base-uncased]                      6.4695 (1.0)     6.4796 (1.0)   6.3939 (1.0)   6.6359 (1.0)   7.0264 (1.0)  7.0725 (1.0)   6.9219 (1.0)   7.6823 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-bert-base-uncased]  1.8465 (3.5)     1.8467 (3.51)  1.8424 (3.47)  1.8515 (3.58)  1.8987 (3.7)  1.9013 (3.72)  1.8919 (3.66)  2.0079 (3.83)

shape=(8, 384) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean           Min             Max
------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  -------------  --------------  --------------
test_benchmark_implementations[baseline-8x384-bert-base-uncased]                      26.6721 (1.0)    26.6725 (1.0)   26.6649 (1.0)   26.6805 (1.0)   26.748 (1.0)    26.8092 (1.0)  26.6933 (1.0)   26.9863 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-bert-base-uncased]  11.3699 (2.35)   11.3739 (2.35)  11.3636 (2.35)  11.4051 (2.34)  11.4345 (2.34)  11.464 (2.34)  11.4248 (2.34)  11.6134 (2.32)

shape=(8, 512) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean            Min             Max
------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512-bert-base-uncased]                      40.1401 (1.0)    40.1565 (1.0)   40.1401 (1.0)   40.1729 (1.0)   40.2056 (1.0)   40.3196 (1.0)   40.2056 (1.0)   40.4336 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-bert-base-uncased]  15.1022 (2.66)   15.1223 (2.66)  15.0967 (2.66)  15.2206 (2.64)  15.1669 (2.65)  15.2034 (2.65)  15.1637 (2.65)  15.3519 (2.63)

@gaetansnl
Contributor Author

gaetansnl commented Oct 25, 2022

A10G, BERT, current main branch (for regression comparison)

shape=(1, 128) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)     Median        Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  ------------  -------------  ------------  -------------  -------------  ------------
test_benchmark_implementations[baseline-1x128-bert-base-uncased]                      6.4745 (1.0)     6.5347 (1.0)   6.4249 (1.0)  6.8636 (1.0)   7.0102 (1.0)  7.073 (1.0)    6.9157 (1.0)   7.9081 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x128-bert-base-uncased]  1.6136 (4.01)    1.6137 (4.05)  1.611 (3.99)  1.6166 (4.25)  1.6692 (4.2)  1.6725 (4.23)  1.6633 (4.16)  1.771 (4.47)

shape=(1, 16) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x16-bert-base-uncased]                      6.3166 (1.0)     6.3661 (1.0)   6.2621 (1.0)   6.6915 (1.0)   6.9763 (1.0)   7.0185 (1.0)   6.777 (1.0)    7.5827 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x16-bert-base-uncased]  0.763 (8.28)     0.763 (8.34)   0.7612 (8.23)  0.7649 (8.75)  0.8172 (8.54)  0.8222 (8.54)  0.8113 (8.35)  0.9263 (8.19)

shape=(1, 256) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x256-bert-base-uncased]                      6.4301 (1.0)     6.4301 (1.0)   6.3726 (1.0)   6.5064 (1.0)   6.8834 (1.0)   6.9537 (1.0)   6.8297 (1.0)   7.7801 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x256-bert-base-uncased]  1.9245 (3.34)    1.9244 (3.34)  1.9198 (3.32)  1.9274 (3.38)  1.9832 (3.47)  1.9931 (3.49)  1.9739 (3.46)  2.0981 (3.71)

shape=(1, 33) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)      Median         Mean           Min            Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  --------------  -------------  -------------  -------------  ------------
test_benchmark_implementations[baseline-1x33-bert-base-uncased]                      7.3055 (1.0)     7.5118 (1.0)   6.5675 (1.0)   10.5772 (1.0)   7.0373 (1.0)   7.0921 (1.0)   6.9857 (1.0)   7.7935 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x33-bert-base-uncased]  0.8978 (8.14)    0.8977 (8.37)  0.8951 (7.34)  0.9001 (11.75)  0.9564 (7.36)  0.9592 (7.39)  0.9462 (7.38)  1.06 (7.35)

shape=(1, 384) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean          Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  ------------  -------------  -------------
test_benchmark_implementations[baseline-1x384-bert-base-uncased]                      6.5506 (1.0)     6.7648 (1.0)   6.4632 (1.0)   8.3688 (1.0)   7.0997 (1.0)   7.2304 (1.0)  7.0263 (1.0)   7.9431 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x384-bert-base-uncased]  2.1527 (3.04)    2.1526 (3.14)  2.1493 (3.01)  2.1551 (3.88)  2.2093 (3.21)  2.217 (3.26)  2.2028 (3.19)  2.3563 (3.37)

shape=(1, 512) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  ------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-1x512-bert-base-uncased]                      6.1592 (1.0)     6.183 (1.0)    6.1239 (1.0)   6.2917 (1.0)  6.7093 (1.0)   6.9006 (1.0)   6.6471 (1.0)   8.5638 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-1x512-bert-base-uncased]  3.0143 (2.04)    3.0142 (2.05)  3.0116 (2.03)  3.018 (2.08)  3.0832 (2.18)  3.0863 (2.24)  3.0631 (2.17)  3.1934 (2.68)

shape=(32, 128) reference_fp32=bert-base-uncased
Name                                                                                   Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median          Mean           Min             Max
-------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  --------------  -------------  --------------  --------------
test_benchmark_implementations[baseline-32x128-bert-base-uncased]                      26.787 (1.0)     26.788 (1.0)    26.7816 (1.0)   26.7953 (1.0)   26.9505 (1.0)   26.9582 (1.0)  26.8213 (1.0)   27.1029 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x128-bert-base-uncased]  14.2549 (1.88)   14.2576 (1.88)  14.2537 (1.88)  14.2668 (1.88)  14.3316 (1.88)  14.351 (1.88)  14.3231 (1.87)  14.4744 (1.87)

shape=(32, 16) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x16-bert-base-uncased]                      6.5405 (1.0)     6.5564 (1.0)   6.4965 (1.0)   6.6395 (1.0)   7.0734 (1.0)   7.2476 (1.0)   7.0176 (1.0)   9.3778 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x16-bert-base-uncased]  2.6129 (2.5)     2.6132 (2.51)  2.6103 (2.49)  2.6183 (2.54)  2.6704 (2.65)  2.6739 (2.71)  2.6638 (2.63)  2.7846 (3.37)

shape=(32, 256) reference_fp32=bert-base-uncased
Name                                                                                   Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median          Mean            Min             Max
-------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  --------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-32x256-bert-base-uncased]                      59.7668 (1.0)    59.7668 (1.0)  59.7668 (1.0)  59.7668 (1.0)  60.705 (1.0)    60.705 (1.0)    60.705 (1.0)    60.705 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x256-bert-base-uncased]  28.4928 (2.1)    28.4276 (2.1)  28.264 (2.11)  28.5261 (2.1)  28.4379 (2.13)  28.4149 (2.14)  28.2548 (2.15)  28.5519 (2.13)

shape=(32, 33) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)    Max (CUDA)    Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  ------------  ------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-32x33-bert-base-uncased]                      7.4211 (1.0)     7.4207 (1.0)   7.4165 (1.0)  7.4247 (1.0)  7.7488 (1.0)   7.7872 (1.0)   7.7255 (1.0)   8.3273 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-32x33-bert-base-uncased]  4.951 (1.5)      4.9513 (1.5)   4.9477 (1.5)  4.9562 (1.5)  5.0112 (1.55)  5.0182 (1.55)  5.0035 (1.54)  5.1176 (1.63)

shape=(8, 128) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x128-bert-base-uncased]                      7.8344 (1.0)     7.8348 (1.0)   7.829 (1.0)    7.8419 (1.0)   8.1249 (1.0)   8.1592 (1.0)   8.1012 (1.0)   8.5067 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x128-bert-base-uncased]  4.8204 (1.63)    4.8202 (1.63)  4.8149 (1.63)  4.8251 (1.63)  4.8849 (1.66)  4.9037 (1.66)  4.8748 (1.66)  4.9854 (1.71)

shape=(8, 16) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)    Median         Mean           Min           Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  ------------  -------------  -------------  ------------  -------------
test_benchmark_implementations[baseline-8x16-bert-base-uncased]                      6.8668 (1.0)     6.9322 (1.0)   6.7848 (1.0)   7.2879 (1.0)  7.4401 (1.0)   7.5709 (1.0)   7.3452 (1.0)  8.9539 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x16-bert-base-uncased]  1.5163 (4.53)    1.5162 (4.57)  1.5135 (4.48)  1.5192 (4.8)  1.5694 (4.74)  1.5724 (4.81)  1.5638 (4.7)  1.6774 (5.34)

shape=(8, 256) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min            Max
------------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
test_benchmark_implementations[baseline-8x256-bert-base-uncased]                      17.682 (1.0)     17.681 (1.0)   17.6727 (1.0)  17.6869 (1.0)  17.8151 (1.0)  17.863 (1.0)   17.7961 (1.0)  18.0392 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x256-bert-base-uncased]  7.9689 (2.22)    7.9689 (2.22)  7.9604 (2.22)  7.9799 (2.22)  8.0328 (2.22)  8.0464 (2.22)  8.0255 (2.22)  8.1786 (2.21)

shape=(8, 33) reference_fp32=bert-base-uncased
Name                                                                                 Median (CUDA)    Mean (CUDA)    Min (CUDA)     Max (CUDA)     Median         Mean           Min           Max
-----------------------------------------------------------------------------------  ---------------  -------------  -------------  -------------  -------------  -------------  ------------  -------------
test_benchmark_implementations[baseline-8x33-bert-base-uncased]                      6.5898 (1.0)     6.6062 (1.0)   6.5476 (1.0)   6.6883 (1.0)   7.1271 (1.0)   7.2005 (1.0)   7.0563 (1.0)  7.7295 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x33-bert-base-uncased]  1.8494 (3.56)    1.8495 (3.57)  1.8473 (3.54)  1.8528 (3.61)  1.9043 (3.74)  1.9088 (3.77)  1.898 (3.72)  2.0123 (3.84)

shape=(8, 384) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median         Mean            Min             Max
------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  -------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x384-bert-base-uncased]                      26.6523 (1.0)    26.6516 (1.0)   26.6452 (1.0)   26.6573 (1.0)   26.7573 (1.0)  26.8159 (1.0)   26.7286 (1.0)   26.9619 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x384-bert-base-uncased]  11.3882 (2.34)   11.3907 (2.34)  11.3846 (2.34)  11.3986 (2.34)  11.455 (2.34)  11.4695 (2.34)  11.4457 (2.34)  11.5579 (2.33)

shape=(8, 512) reference_fp32=bert-base-uncased
Name                                                                                  Median (CUDA)    Mean (CUDA)     Min (CUDA)      Max (CUDA)      Median         Mean            Min             Max
------------------------------------------------------------------------------------  ---------------  --------------  --------------  --------------  -------------  --------------  --------------  --------------
test_benchmark_implementations[baseline-8x512-bert-base-uncased]                      40.1021 (1.0)    40.1257 (1.0)   40.1021 (1.0)   40.1493 (1.0)   40.1615 (1.0)  40.2904 (1.0)   40.1615 (1.0)   40.4193 (1.0)
test_benchmark_implementations[dynamo_optimized_cuda_graphs-8x512-bert-base-uncased]  15.1029 (2.66)   15.1062 (2.66)  15.0979 (2.66)  15.1165 (2.66)  15.188 (2.64)  15.2321 (2.65)  15.1684 (2.65)  15.3538 (2.63)
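To read these tables for regressions: the number in parentheses is the baseline timing divided by the optimized timing, and a regression check compares the same shape across the two runs. Below is a minimal sketch using Median (CUDA) values copied from the `bert-base-uncased` tables in this thread. The helper names (`speedup`, `relative_change`) and the dictionary labels are hypothetical, and which run the preceding set of tables belongs to is an assumption, since its label is not shown in this part of the thread.

```python
def speedup(baseline: float, optimized: float) -> float:
    """Ratio shown in parentheses next to each optimized timing."""
    return round(baseline / optimized, 2)


def relative_change(candidate: float, reference: float) -> float:
    """Signed relative difference of candidate against reference."""
    return (candidate - reference) / reference


# dynamo_optimized_cuda_graphs, Median (CUDA), copied from the tables above.
# "earlier_run" is the preceding set of BERT tables (run label assumed unknown).
earlier_run = {"1x256": 1.9217, "1x512": 3.0147, "32x128": 14.2598,
               "32x33": 4.9207, "8x512": 15.1022}
# "main_run" is the "current main branch" set of tables.
main_run = {"1x256": 1.9245, "1x512": 3.0143, "32x128": 14.2549,
            "32x33": 4.9510, "8x512": 15.1029}

changes = {shape: relative_change(earlier_run[shape], main_run[shape])
           for shape in main_run}
worst = max(abs(change) for change in changes.values())

if __name__ == "__main__":
    # 1x128 row on main: baseline 6.4745 vs optimized 1.6136
    print("speedup 1x128:", speedup(6.4745, 1.6136))
    for shape, change in changes.items():
        print(f"{shape}: {change:+.2%}")
    print(f"largest absolute difference: {worst:.2%}")
```

For the shapes sampled here, the two runs differ by less than 1% on the optimized path, which is consistent with no regression on BERT; the sample is partial and does not replace reading the full tables.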

@github-actions github-actions bot added feature and removed feature labels Oct 25, 2022
@pommedeterresautee
Member

Tests pass:

=========================================================================================================== warnings summary ===========================================================================================================
conftest.py:41
  /mnt/workspace/kernl/conftest.py:41: PytestDeprecationWarning: The hookimpl pytest_configure uses old-style configuration options (marks or attributes).
  Please use the pytest.hookimpl(trylast=True) decorator instead
   to configure the hooks.
   See https://docs.pytest.org/en/latest/deprecations.html#configuring-hook-specs-impls-using-markers
    @pytest.mark.trylast

test/test_debugger.py::test_matmul
  /mnt/workspace/kernl/test/test_debugger.py:172: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
    group_id = pid // num_pid_in_group

test/test_debugger.py::test_matmul
  /mnt/workspace/kernl/test/test_debugger.py:176: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
    pid_n = (pid % num_pid_in_group) // group_size_m

test/test_torchdynamo.py::test_t5
  /home/geantvert/.local/share/virtualenvs/kernl/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5_fast.py:156: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
  For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
  - Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
  - If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
  - To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================================== 2461 passed, 356 skipped, 4 warnings in 6722.78s (1:52:02) ======================================================================================
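The two `__floordiv__` warnings in the log above come down to truncation versus floor on negative operands. A minimal pure-Python illustration follows (it is not PyTorch code; `trunc_div` and `floor_div` are hypothetical helper names). For non-negative values, which is presumably the case for program ids such as `pid`, both modes agree.

```python
import math


def trunc_div(a: int, b: int) -> int:
    """Round the quotient toward zero (the 'trunc' behaviour the warning describes)."""
    return math.trunc(a / b)


def floor_div(a: int, b: int) -> int:
    """Round the quotient toward negative infinity (true floor division)."""
    return a // b


if __name__ == "__main__":
    for a, b in [(7, 2), (-7, 2)]:
        print(f"{a} / {b}: trunc={trunc_div(a, b)}, floor={floor_div(a, b)}")
```

The modes only diverge when the exact quotient is negative and not an integer, so the warning is harmless for index arithmetic on non-negative ids, but passing an explicit `rounding_mode` to `torch.div` (as the message suggests) silences it.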

Member

@pommedeterresautee pommedeterresautee left a comment

Please fix the imports.


import torch

from src.kernl.implementations.layer_norm import _layer_norm_fwd_fused_single_pass
Member

Remove the `src.` prefix from this import.


import torch

from src.kernl.optimizer.layer_norm import replace_layer_norm_rms
Member

Remove the `src.` prefix from this import.
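For context on what `replace_layer_norm_rms` targets, here is a rough pure-Python reference of RMS layer norm. It assumes the usual T5 formulation (scale by the reciprocal root mean square, with no mean subtraction and no bias). It is only a sketch for checking numerics by hand, not the Triton kernel from this PR; `rms_layer_norm` and its parameters are hypothetical names.

```python
import math
from typing import List, Sequence


def rms_layer_norm(x: Sequence[float], weight: Sequence[float],
                   eps: float = 1e-6) -> List[float]:
    """Normalize x by its root mean square, then scale element-wise by weight."""
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


if __name__ == "__main__":
    print(rms_layer_norm([3.0, 4.0], [1.0, 1.0]))
```

With unit weights and `eps=0`, the output has a mean square of 1, which makes a quick sanity check against any fused implementation.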

Member

@pommedeterresautee pommedeterresautee left a comment

LGTM.
Checked: good end-to-end speedup.

@gaetansnl gaetansnl merged commit 1463e39 into main Oct 27, 2022
@gaetansnl gaetansnl deleted the feat/rms-replacement branch October 27, 2022 16:06