[1/n][Optimus][Auto-AC] Support activation quantization without scaling #148380


Open
wants to merge 1 commit into main
Conversation

Contributor

@mengluy0125 mengluy0125 commented Mar 3, 2025

Summary: We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.
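At a high level, the pass rewrites the forward graph so that activations saved for the backward pass are stored in a low-precision dtype and cast back before use. A minimal standalone sketch of that idea (illustrative only; the actual pass rewrites the traced graph rather than using an autograd.Function):

```python
import torch

class FP8SavedActivation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = torch.relu(x)
        # Store the activation needed by backward in fp8 instead of fp32/bf16.
        ctx.save_for_backward(y.to(torch.float8_e5m2))
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y_q,) = ctx.saved_tensors
        y = y_q.to(grad_out.dtype)  # dequantize before use
        return grad_out * (y > 0)

x = torch.randn(8, 8, requires_grad=True)
FP8SavedActivation.apply(x).sum().backward()
```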

Test Plan:

unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017
Network: Up: 4.3MiB Down: 42MiB (reSessionID-fef7e727-68b1-4645-a519-5652854df38d)
Executing actions. Remaining 0/4 6.7s exec time total
Command: test. Finished 2 local
Time elapsed: 3:11.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

E2E

how to enable (you can override the dtype; if nothing is given, the default is fp8)

```
post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
        },
```
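A minimal sketch of where this option plugs in, assuming it is set on the inductor config before compiling (the model, shapes, and device here are illustrative, not from this PR):

```python
import torch

# Enable the post-grad pass; the value mirrors the snippet above.
torch._inductor.config.post_grad_fusion_options = {
    "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
}

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU()).cuda()
compiled = torch.compile(model)
out = compiled(torch.randn(2048, 4096, device="cuda", requires_grad=True))
out.sum().backward()  # activations saved for backward are quantized in forward
```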

Differential Revision: D70522237

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov


pytorch-bot bot commented Mar 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148380

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 17 Pending

As of commit 7e92ff3 with merge base d2ee606:
💚 Looks good so far! There are no failures yet. 💚

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D70522237

pytorch-bot bot pushed a commit that referenced this pull request Mar 6, 2025
Summary:

We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/fc8469a3-54f7-425d-9b1f-e54840c0793a
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3377699989853946
Network: Up: 10KiB  Down: 0B  (reSessionID-ab248457-6ac0-4b72-96da-0d3c427e260a)
Executing actions. Remaining     0/3                                                                                        0.7s exec time total
Command: test.     Finished 2 local                                                                                                                                                              
Time elapsed: 1:03.1s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to enable (you can override the dtype and clamp range; if nothing is given, the default is fp8)

```
post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": torch.float8_e5m2, "clamp_min": -57344.0, "clamp_max": 57344.0}
        },
```
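A hedged sketch of the clamp-then-cast behavior these options describe: values are clamped to the finite range of the target dtype before casting, and ±57344.0 is the largest finite value of torch.float8_e5m2 (the graph rewrite itself is not shown):

```python
import torch

x = torch.randn(1024, 1024, dtype=torch.bfloat16)
clamp_min, clamp_max = -57344.0, 57344.0  # torch.finfo(torch.float8_e5m2).max
x_q = x.clamp(clamp_min, clamp_max).to(torch.float8_e5m2)  # quantize for storage
x_dq = x_q.to(torch.bfloat16)  # dequantize before the backward pass uses it
```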

Differential Revision: D70522237
pytorch-bot bot pushed a commit that referenced this pull request Apr 9, 2025
Summary:

We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719
Network: Up: 62KiB  Down: 81KiB  (reSessionID-913ca82d-c395-4492-818e-6e004df37f87)
Executing actions. Remaining     0/4                                                                                                   6.1s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:22.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to enable

```
post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
        },
```
See D51860030 for how to set the config under dynamo_config_map.

Note: you can change the quant_type; if nothing is given, the default type torch.float8_e5m2 will be used to quantize.

#### If you use FSDP

- You may also need to set inline_inbuilt_nn_modules to True for models that use FSDP (see D70023488 for the config setting)
- Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/); a sketch of both settings follows this list
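A minimal sketch of applying these two settings, assuming the config is set in code (UNSAFE_SKIP_FSDP_MODULE_GUARDS is an environment variable, so it is unset rather than configured):

```python
import os
import torch

# Inline built-in nn.Module calls so Dynamo can trace through FSDP wrappers.
torch._dynamo.config.inline_inbuilt_nn_modules = True

# Make sure the unsafe FSDP guard-skipping escape hatch is not set.
os.environ.pop("UNSAFE_SKIP_FSDP_MODULE_GUARDS", None)
```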

```
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization]
```

aps-512_8_remove_fsdp_guards-92ae3972ba

Differential Revision: D70522237
mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request Apr 9, 2025
Summary:
X-link: pytorch/pytorch#148380

We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Differential Revision: D70522237
mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request Apr 9, 2025
mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request Apr 9, 2025
mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request Apr 10, 2025
mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request Apr 10, 2025
mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request Apr 10, 2025
mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request Apr 10, 2025
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2025
[1/n][Optimus][Auto-AC] Support activation quantization without scaling (#148380)

Summary:
X-link: pytorch/benchmark#2607

Pull Request resolved: #148380

We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719
Network: Up: 62KiB  Down: 81KiB  (reSessionID-913ca82d-c395-4492-818e-6e004df37f87)
Executing actions. Remaining     0/4                                                                                                   6.1s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:22.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to enable

```
post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
        },
```
See D51860030 for how to set the config under dynamo_config_map.

Note: you can change the quant_type; if nothing is given, the default type torch.float8_e5m2 will be used to quantize.

#### If you use FSDP

- You may also need to set inline_inbuilt_nn_modules to True for models that use FSDP (see D70023488 for the config setting)
- Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1
(context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/)

```
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization]
```

aps-512_8_remove_fsdp_guards-92ae3972ba

tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

baseline w/o fp8 quantization: aps-mengluy_remove_fsdp-ce75b306fa

w/ fp8 quantization: aps-mengluy_remove_fsdp_fp8-96541deec4

### QPS

 {F1977040587}

### memory

baseline
{F1977040640}

memory snapshot:
https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=1767027467197075&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot

with fp8
{F1977040641}

memory snapshot:
https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=639378375763157&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot

### conclusion:

- ~9% QPS improvement; peak memory reduced from 82.01 to 78.97.

- For NE, we need longer verification; WIP with the scaling version.

Differential Revision: D70522237
mengluy0125 pushed a commit to mengluy0125/pytorch that referenced this pull request Apr 22, 2025
Contributor

@eellison eellison left a comment

Haven't done a full review yet. I see that _inductor/pattern_matcher.py was moved to functorch/pattern_matcher.py.

For one, we haven't been especially dogmatic about dependencies from one subdirectory going into the other, so I don't know if we need the move. If we do it, we need to import torch/_functorch/pattern_matcher.py into torch/_inductor/pattern_matcher.py to avoid breaking BC for anyone relying on the old file. I know users such as vLLM rely on it.

If we do want to do it, could we stack this as a separate refactoring PR? Since additional code was added to pattern_matcher.py, it makes the change harder to review.

Still have not done a full review; cc @bdhirsh as well.

@@ -4588,3 +4588,7 @@ def maybe_disable_inference_mode_for_fake_prop() -> Generator[None, None, None]:
yield
else:
yield


def is_node_meta_valid(node: Optional[torch.fx.Node]):
Contributor

nit: -> bool
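A minimal sketch of the suggested annotation; the body is an assumption about how node metadata is typically checked, not necessarily the exact implementation:

```python
from typing import Optional

import torch

def is_node_meta_valid(node: Optional[torch.fx.Node]) -> bool:
    # A missing node is trivially valid; otherwise require metadata that
    # lets callers inspect the value's shape and dtype.
    return node is None or "example_value" in node.meta or "val" in node.meta
```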

Contributor Author

I had to move some of the functions in torch/_inductor/pattern_matcher.py to torch/_functorch/pattern_matcher.py due to the circular dependency problem. For any existing dependency on torch/_inductor/pattern_matcher.py, I will redirect it to torch/_functorch/pattern_matcher.py by adding "dummy" module imports there, to avoid breaking current dependencies.

Contributor

What exactly is the circular dependency?

Contributor

@eellison eellison Apr 30, 2025

```python
@register_graph_pattern(
    Placeholder("*", users=MULTIPLE),
    pass_dict=construct_pattern_matcher_pass("activation_quantization_fwd_aten_pass"),
)
```

As mentioned earlier, this should just be graph.find_nodes, which will be O(1) instead of O(n), where n is the number of nodes in the graph.

If the only reason for all the churn is to do the above, which is slower, I would really prefer not to.

Moving the file makes all the blame harder to follow, which is a real cost, and the use case here for refactoring is not real.
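A minimal sketch of the suggested alternative, assuming a pass that receives the fx.Graph directly (the pass name and body are illustrative):

```python
import torch

def activation_quantization_fwd_aten_pass(graph: torch.fx.Graph) -> None:
    # find_nodes uses the graph's per-op index, so this is O(#placeholders)
    # rather than matching Placeholder("*") against every node.
    for node in graph.find_nodes(op="placeholder"):
        if len(node.users) > 1:  # the users=MULTIPLE condition above
            ...  # decide whether to quantize this saved activation
```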

Contributor

you can also do something like

```python
def foo():
    # defer the import to call time to break the circular dependency
    import ....

foo()
```

to register and avoid the refactoring

Contributor Author

Sure, I can do graph.find_nodes instead; then we don't need to do the code refactoring @eellison

):
GraphTransformObserver = functools.partial(
torch.fx.passes.graph_transform_observer.GraphTransformObserver,
subsystem="default_partition_passes",
Contributor

The subsystem is primarily to bisect application of passes with CompilerBisector.

```python
BisectSubsystem("joint_graph_passes"),  # passes applied on joint graph
BisectSubsystem("post_grad_passes"),  # passes applied individually on forward and backward in inductor
```

Should we add it there?

Contributor Author

SG

subsystem="default_partition_passes",
)

if inductor_config.pattern_matcher:
Contributor

nit:

```python
if not inductor_config.pattern_matcher:
    return
```

to reduce nesting



def should_quantize(node: torch.fx.Node) -> bool:
allowed_dtypes = [
Contributor

could also do: dtype.is_floating_point and dtype.itemsize > torch.float8_e5m2.itemsize

Comment on lines 2294 to 2295
and len(shape) > 1
and shape[0] >= 1024
Contributor

Is there a reason we only quantize if batch size >= 1024, as opposed to looking at total numel or something?

Contributor Author

It's model dependent; this heuristic was borrowed from experiments on IGCTR. In my follow-up PR, I will give customers the flexibility to set the quantization constraints themselves instead of hardcoding the requirements, so they can tune the quantization condition to their needs (a sketch of this check follows below).
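A minimal sketch of the kind of check under discussion, combining the dtype suggestion above with the shape heuristic; names and thresholds are illustrative rather than the PR's exact code:

```python
import torch

def should_quantize(node: torch.fx.Node) -> bool:
    # Quantize only sizeable floating-point activations that are wider
    # than the fp8 target dtype.
    val = node.meta.get("val")
    if not isinstance(val, torch.Tensor):
        return False
    return (
        val.dtype.is_floating_point
        and val.dtype.itemsize > torch.float8_e5m2.itemsize
        and len(val.shape) > 1
        and val.shape[0] >= 1024  # batch-size heuristic from IGCTR experiments
    )
```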

Comment on lines 1896 to 1928
@functools.lru_cache(None)
def _warn_complex_not_supported():
warnings.warn(
"Torchinductor does not support code generation for complex operators. Performance may be worse than eager."
)


# There are some types (CPU) which we accept as input but not as
# output.
def unsupported_input_tensor(t: torch.Tensor, parent=None, node=None):
"Do not support reading or writing to this tensor"
if t.is_complex():
# Complex views are supported with IR ComplexView
if parent and parent.target in (
torch.ops.aten.view.dtype,
torch.ops.prims.convert_element_type.default,
):
return False
_warn_complex_not_supported()
return True

if t.is_meta:
return True

if t.dtype == torch.float8_e8m0fnu:
if not node:
return True

# allow bitcast, views, memory movement, but not arithmetic
# TODO: delete once triton adds native support
return not (
isinstance(parent.target, torch._ops.OpOverload)
and parent.target
in (
aten.view.dtype,
aten.cat.default,
aten._scaled_mm.default,
)
or (isinstance(node.target, torch._ops.OpOverload) and is_view(node.target))
)

return False


def unsupported_output_tensor(t: torch.Tensor, parent=None, node=None):
"Do not support writing tensor but can read from it"
if unsupported_input_tensor(t, parent):
return True
return t.is_cpu and config.disable_cpp_codegen


def fallback_node_due_to_unsupported_type(node: torch.fx.Node, allow_cpu_inputs=True):
Contributor

Reason why this was deleted?

Contributor Author

It is not deleted; it was moved to torch/_inductor/pattern_matcher.py due to the circular dependency problem.

Comment on lines +2173 to +2308
static_lifetime_input_indices=static_lifetime_input_indices,
static_lifetime_input_nodes=node_info.static_lifetime_input_nodes,
Contributor

It's kind of strange to me that we have two different ways of representing the same thing here. Is there a reason for that? cc @bdhirsh

Contributor

I agree, cc @mengluy0125 . Why do we need a separate concept here?

Contributor Author

I defined my enable-quantization function as follows, where I use static_lifetime_input_nodes to check whether a saved node from auto_ac should be included in or excluded from quantization @bdhirsh @eellison

```python
def enable_activation_quantization(
    saved_values: list[fx.Node],
    fwd_module: fx.GraphModule,
    bwd_module: fx.GraphModule,
    static_lifetime_input_nodes: Optional[OrderedSet[fx.Node]] = None,
):
```

pytorch-bot bot pushed a commit that referenced this pull request Apr 24, 2025
mengluy0125 pushed a commit to mengluy0125/pytorch that referenced this pull request Apr 24, 2025
mengluy0125 pushed a commit to mengluy0125/pytorch that referenced this pull request Apr 24, 2025
@eellison eellison requested review from eellison and bdhirsh April 24, 2025 22:43
Contributor

@eellison eellison left a comment

Existing comments not addressed.

bwd_module: fx.GraphModule,
static_lifetime_input_nodes: Optional[OrderedSet[fx.Node]] = None,
):
GraphTransformObserver = functools.partial(
Contributor

nit: this is a function, not a class - I would stick with function/variable naming, e.g. graph_transform_observer?

Contributor Author

I followed the naming convention in the codebase to keep it consistent lol @bdhirsh

GraphTransformObserver = functools.partial(

@mengluy0125
Contributor Author

To improve readability, I will make a separate PR to refactor the code and resolve the circular dependency problem @eellison @bdhirsh

pytorch-bot bot pushed a commit that referenced this pull request May 1, 2025
mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request May 1, 2025
mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request May 1, 2025
mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request May 1, 2025
mengluy0125 pushed a commit to mengluy0125/pytorch that referenced this pull request May 1, 2025
mengluy0125 pushed a commit to mengluy0125/pytorch that referenced this pull request May 1, 2025
mengluy0125 pushed a commit to mengluy0125/pytorch that referenced this pull request May 1, 2025
…ng (pytorch#148380)

Summary:
X-link: pytorch/benchmark#2607

Pull Request resolved: pytorch#148380

Reviewed By: avicizhu

Differential Revision: D70522237
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D70522237

mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request May 7, 2025
Summary:
X-link: pytorch/pytorch#148380

Reviewed By: avicizhu

Differential Revision: D70522237
mengluy0125 pushed a commit to mengluy0125/pytorch that referenced this pull request May 7, 2025
…ng (pytorch#148380)

Summary:
X-link: pytorch/benchmark#2607

Pull Request resolved: pytorch#148380

Differential Revision: D70522237