ENH Add array API for PolynomialFeatures #31580
Conversation
Some benchmarks (Kaggle notebook):

Avg fit time for numpy: 0.007334113121032715
Avg fit time for torch cuda: 0.050480985641479494

from time import time

import numpy as np
import torch as xp
from tqdm import tqdm

from sklearn._config import config_context
from sklearn.preprocessing._polynomial import PolynomialFeatures

X_np = np.random.rand(100000, 100)
X_xp_cuda = xp.asarray(X_np, device="cuda")

# Numpy benchmarks
fit_times = []
transform_times = []
for _ in tqdm(range(10), desc="Numpy Flow"):
    start = time()
    pf_np = PolynomialFeatures(degree=2)
    pf_np.fit(X_np)
    fit_times.append(time() - start)

    start = time()
    pf_np.transform(X_np)
    transform_times.append(time() - start)

avg_fit_time = sum(fit_times) / 10
avg_transform_time = sum(transform_times) / 10
print(f"Avg fit time for numpy: {avg_fit_time}")
print(f"Avg transform time for numpy: {avg_transform_time}")

# Torch cuda benchmarks
fit_times = []
transform_times = []
for _ in tqdm(range(10), desc="Torch cuda Flow"):
    with config_context(array_api_dispatch=True):
        start = time()
        pf_xp = PolynomialFeatures(degree=2)
        pf_xp.fit(X_xp_cuda)
        fit_times.append(time() - start)

        start = time()
        pf_xp.transform(X_xp_cuda)
        transform_times.append(time() - start)

avg_fit_time = sum(fit_times) / 10
avg_transform_time = sum(transform_times) / 10
print(f"Avg fit time for torch cuda: {avg_fit_time}")
print(f"Avg transform time for torch cuda: {avg_transform_time}")

Local system with MPS (just changed the device and the dtype to float32 in the above code):

Avg fit time for numpy: 0.0025035619735717775
Avg fit time for torch mps: 0.16063039302825927

I don't think we can expect any improvement (and actually some slowdown) in the fit time, because I did not have to change anything in the fit part to support the array API, which means fit can't really benefit from it. The transform times, however, are significantly better.

CC: @ogrisel @lesteve @lucyleeow @StefanieSenger for reviews
The code coverage warning can be ignored because it is related to a special case for MPS devices.
Thanks @OmarManzoor. This looks good to me.
I edited the benchmark as follows to insert a blocking call on the resulting array to force CUDA synchronization. We still get a 36x speed-up on this data and runtime!
from time import time

import numpy as np
import torch as xp
from tqdm import tqdm

from sklearn._config import config_context
from sklearn.preprocessing._polynomial import PolynomialFeatures

X_np = np.random.rand(100000, 100)
X_xp_cuda = xp.asarray(X_np, device="cuda")

# Numpy benchmarks
fit_times = []
transform_times = []
n_iter = 10
for _ in tqdm(range(n_iter), desc="Numpy Flow"):
    start = time()
    pf_np = PolynomialFeatures(degree=2)
    pf_np.fit(X_np)
    fit_times.append(time() - start)

    start = time()
    pf_np.transform(X_np)
    transform_times.append(time() - start)

avg_fit_time_numpy = sum(fit_times) / n_iter
avg_transform_time_numpy = sum(transform_times) / n_iter
print(f"Avg fit time for numpy: {avg_fit_time_numpy:.3f}")
print(f"Avg transform time for numpy: {avg_transform_time_numpy:.3f}")

# Torch cuda benchmarks
fit_times = []
transform_times = []
for _ in tqdm(range(n_iter), desc="Torch cuda Flow"):
    with config_context(array_api_dispatch=True):
        start = time()
        pf_xp = PolynomialFeatures(degree=2)
        pf_xp.fit(X_xp_cuda)
        fit_times.append(time() - start)

        start = time()
        # Reading one scalar blocks until the CUDA kernels finish, so the
        # measured transform time includes device synchronization.
        float(pf_xp.transform(X_xp_cuda)[0, 0])
        transform_times.append(time() - start)

avg_fit_time_cuda = sum(fit_times) / n_iter
avg_transform_time_cuda = sum(transform_times) / n_iter
print(
    f"Avg fit time for torch cuda: {avg_fit_time_cuda:.3f}, "
    f"speed-up: {avg_fit_time_numpy / avg_fit_time_cuda:.1f}x"
)
print(
    f"Avg transform time for torch cuda: {avg_transform_time_cuda:.3f} "
    f"speed-up: {avg_transform_time_numpy / avg_transform_time_cuda:.1f}x"
)
Numpy Flow: 100%|██████████| 10/10 [00:37<00:00, 3.70s/it]
Avg fit time for numpy: 0.008
Avg transform time for numpy: 3.695
Torch cuda Flow: 100%|██████████| 10/10 [00:01<00:00, 9.76it/s]
Avg fit time for torch cuda: 0.001, speed-up: 6.8x
Avg transform time for torch cuda: 0.100 speed-up: 36.8x
I think the supported_float_dtypes function could be simplified by leveraging the new inspection API. Otherwise, +1 for merge.
I also get a 5.5x speed-up over numpy using the MPS GPU on my M1 laptop (compared to your 25x speed-up on your MPS GPU).
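Regarding the simplification, one possible shape for supported_float_dtypes that leans on the standard __array_namespace_info__() inspection API could look roughly like the sketch below (illustrative only, not this PR's actual code; it assumes the namespace implements the 2023.12 inspection API):

def supported_float_dtypes(xp, device=None):
    # The inspection API only reports the dtypes this namespace provides
    # for the given device, so float16 is included or skipped automatically
    # instead of being special-cased per namespace.
    real_float_dtypes = xp.__array_namespace_info__().dtypes(
        device=device, kind="real floating"
    )
    # Keep the usual preference order: wider dtypes first.
    return tuple(
        real_float_dtypes[name]
        for name in ("float64", "float32", "float16")
        if name in real_float_dtypes
    )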
One more follow-up comment below.
Besides, the __sklearn_tags__ method should be updated to declare that this transformer supports array API inputs.
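A minimal sketch of what that could look like, assuming the estimator tags dataclass with an array_api_support flag available in recent scikit-learn versions (the subclass name is only for illustration; in the PR the method would live on PolynomialFeatures itself):

from sklearn.preprocessing import PolynomialFeatures

class ArrayAPIPolynomialFeatures(PolynomialFeatures):
    def __sklearn_tags__(self):
        # Start from the parent tags and flag array API support.
        tags = super().__sklearn_tags__()
        tags.array_api_support = True
        return tags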
        return (xp.float64, xp.float32, xp.float16)
    else:
        return (xp.float64, xp.float32)

    valid_float_dtypes.append(xp.float16)
Is this still needed? I think it can be wrong: some devices might not support float16 even when the namespace exposes it.
The array API specification does not include float16, which is why we have this condition: https://data-apis.org/array-api/latest/API_specification/data_types.html
kind="real floating", device=device | ||
) | ||
valid_float_dtypes = [] | ||
for dtype_key in ("float64", "float32"): |
Suggested change:
-    for dtype_key in ("float64", "float32"):
+    for dtype_key in ("float64", "float32", "float16"):
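Tying this to the float16 question above, the loop could also stay device-aware; a rough, illustrative sketch (not necessarily the final code in this PR), using numpy as a stand-in namespace so the snippet runs on its own:

import numpy as xp  # stand-in; any array API namespace would work here
device = None

device_dtypes = xp.__array_namespace_info__().dtypes(
    kind="real floating", device=device
)
valid_float_dtypes = []
for dtype_key in ("float64", "float32", "float16"):
    # Keep only the dtypes the namespace actually supports on this device,
    # so float16 is added where available and skipped otherwise.
    if dtype_key in device_dtypes:
        valid_float_dtypes.append(device_dtypes[dtype_key])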
Reference Issues/PRs
Towards #26024
What does this implement/fix? Explain your changes.
Adds array API support to preprocessing.PolynomialFeatures.
Any other comments?