
Conversation

@OmarManzoor
Contributor

Reference Issues/PRs

Towards: #32611

What does this implement/fix? Explain your changes.

  • Adds array API support for LogisticRegression with the LBFGS method

Any other comments?

@github-actions

github-actions bot commented Nov 4, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: df00875. Link to the linter CI: here

Member

@ogrisel ogrisel left a comment

Could you also please profile a run on MPS or CUDA using py-spy?
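
For reference, a typical py-spy invocation that produces such an SVG flamegraph looks something like the following; the script name here is a placeholder, and --native additionally samples native (Cython/C) frames:

py-spy record -o profile.svg --native -- python benchmark_logreg.py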

@OmarManzoor
Contributor Author

Some Benchmarks

from time import time

import numpy as np
import torch as xp
from tqdm import tqdm

from sklearn import config_context
from sklearn.linear_model import LogisticRegression

n_samples, n_features, n_classes = 1000000, 300, 20
device = "cuda"
n_iter = 10

X_np = np.random.rand(n_samples, n_features)
y_np = np.random.randint(0, 10, n_samples)  # note: labels span 10 classes, not the n_classes = 20 defined above
numpy_fit_times = []
numpy_predict_times = []
for _ in tqdm(range(n_iter), desc="Numpy"):
    lr = LogisticRegression(C=0.8, solver="lbfgs", max_iter=200)
    start = time()
    lr.fit(X_np, y_np)
    numpy_fit_times.append(round(time() - start, 3))
    start = time()
    pred = lr.predict_proba(X_np)
    numpy_predict_times.append(round(time() - start, 3))

avg_numpy_fit = round(sum(numpy_fit_times) / n_iter, 3)
avg_numpy_predict = round(sum(numpy_predict_times) / n_iter, 3)

torch_fit_times = []
torch_predict_times = []
X_xp = xp.asarray(X_np, device=device)
y_xp = xp.asarray(y_np, device=device)
for _ in tqdm(range(n_iter), desc=f"Torch {device}"):
    with config_context(array_api_dispatch=True):
        lr = LogisticRegression(C=0.8, solver="lbfgs", max_iter=200)
        start = time()
        lr.fit(X_xp, y_xp)
        torch_fit_times.append(round(time() - start, 3))
        start = time()
        pred = lr.predict_proba(X_xp)
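        # Converting one element to a Python float forces a device sync, so the
        # queued GPU work is included in the measured predict time.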
        first = float(pred[0, 0])
        torch_predict_times.append(round(time() - start, 3))

avg_torch_fit = round(sum(torch_fit_times) / n_iter, 3)
avg_torch_predict = round(sum(torch_predict_times) / n_iter, 3)

print(f"Average fit time numpy: {avg_numpy_fit}")
print(f"Average fit time torch {device}: {avg_torch_fit}")
print(f"Torch {device} fit speedup: {round(avg_numpy_fit / avg_torch_fit, 2)}X")


print(f"Average predict time numpy: {avg_numpy_predict}")
print(f"Average predict time torch {device}: {avg_torch_predict}")
print(
    f"Torch {device} predict speedup: {round(avg_numpy_predict / avg_torch_predict, 2)}"
    "X"
)

Results with n_samples, n_features, n_classes = 1000000, 300, 20:

Average fit time numpy: 23.526
Average fit time torch cuda: 8.104
Torch cuda fit speedup: 2.9X

Average predict time numpy: 1.133
Average predict time torch cuda: 0.17
Torch cuda predict speedup: 6.66X

@ogrisel
Member

ogrisel commented Nov 4, 2025

It's nice to get a speed-up with CUDA despite the conversion of the raw predictions and pointwise gradient values of the loss at each iteration.

Can you post the SVGs of the py-spy profiling results for both the PyTorch/CUDA and the NumPy/CPU runs?

If the conversions of the raw predictions / pointwise gradients are significant, I think we should try to implement an array API alternative to the Cython gradient function as part of this PR, so that those conversions can be skipped.
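
For context, the pointwise gradient is mathematically simple in the binary (half-binomial) case: it is sigmoid(z) - y per sample, for raw prediction z and label y in {0, 1}. A minimal NumPy/SciPy sketch of that computation (illustrative only, not this PR's code; sample weights omitted):

from scipy.special import expit  # numerically stable sigmoid

def half_binomial_pointwise_gradient(z, y):
    # Per-sample derivative of the logistic loss log(1 + exp(z)) - y * z
    # with respect to the raw prediction z.
    return expit(z) - y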

Member

@ogrisel ogrisel left a comment

Some more feedback.

@OmarManzoor
Contributor Author

CUDA py-spy flamegraph

[attached image: profile — py-spy flamegraph SVG]

@OmarManzoor
Contributor Author

OmarManzoor commented Nov 5, 2025

NumPy py-spy flamegraph

[attached image: profile (1) — py-spy flamegraph SVG]

Member

@ogrisel ogrisel left a comment

Another pass of feedback:

@ogrisel
Member

ogrisel commented Nov 12, 2025

For the record, I observe a speed-up of 2x when using PyTorch/MPS vs NumPy/CPU (with OpenBLAS) on an Apple M4 laptop using a 50-class classification problem. I see no significant speed-up for binary classification.

EDIT: I increased the dataset size and now also observe a 2x speed-up when using PyTorch/MPS vs NumPy/CPU (with OpenBLAS) for binary classification.

@ogrisel
Member

ogrisel commented Nov 12, 2025

While evaluating locally, I found a bug for a specific dataset size:

import os

os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
os.environ["SCIPY_ARRAY_API"] = "1"


import torch
import numpy as np
from sklearn import set_config
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import minmax_scale

# Uncomment when using the new Accelerate BLAS on macOS.
# import warnings
# warnings.filterwarnings("ignore", category=RuntimeWarning)


set_config(array_api_dispatch=True)

X, y = make_classification(
    n_samples=int(1e6), n_classes=2, n_features=100, n_informative=90, random_state=0
)
X = X.astype("float32")
# X = minmax_scale(X)

X_mps = torch.from_numpy(X).to("mps")
y_mps = torch.from_numpy(y).to("mps")

clf = LogisticRegression(max_iter=1000)

clf.fit(X, y).n_iter_, clf.score(X, y)

outputs:

(array([37], dtype=int32), 0.851278)

while:

clf.fit(X_mps, y_mps).n_iter_, clf.score(X_mps, y_mps)

outputs a bad model without any error message or ConvergenceWarning:

(tensor([1], device='mps:0', dtype=torch.int32), 0.4999749958515167)

@OmarManzoor can you reproduce?

The fact that the model converges to a chance-level prediction function after only 1 iteration without any error message sounds like a bug to me.

Note that the problem goes away when:

  • reducing the number of data points or the number of informative features in the training set;
  • scaling the features.

@ogrisel
Member

ogrisel commented Nov 12, 2025

Since the problem happens at the first iteration, we could debug by printing (parts of) the sample-wise and parameter-wise gradient vectors for each run with max_iter=1 along with the values for the convergence criterion.
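
One quick way to start, reusing X, y, X_mps, y_mps from the script above (a sketch that only compares the resulting coefficients rather than the internal gradients, which would require temporary prints inside the solver; it assumes the fitted coef_ stays a torch tensor on the input device, as is the convention for array API support):

# Run a single LBFGS iteration on both backends and compare the coefficients,
# which reflect the first gradient step (a ConvergenceWarning is expected).
clf_np = LogisticRegression(max_iter=1).fit(X, y)
clf_mps = LogisticRegression(max_iter=1).fit(X_mps, y_mps)
print("numpy coef[0, :5]:", clf_np.coef_[0, :5])
print("mps coef[0, :5]:  ", clf_mps.coef_[0, :5].to("cpu"))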

@OmarManzoor
Contributor Author

I updated loss_array_api for the half-binomial case to take some important additional conditions into account. I think using a plain xp.log1p is not accurate enough.

This is the output of the above script which was causing issues:
tensor([38], device='mps:0', dtype=torch.int32) 0.8512759804725647
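
For readers following along: a plain xp.log1p(xp.exp(raw)) overflows for large positive raw predictions, which is presumably the kind of condition that needs special handling. A minimal NumPy sketch of the usual stable formulation (illustrative only, not the actual loss_array_api code):

import numpy as np

def stable_half_binomial_loss(raw, y):
    # Per-sample logistic loss log(1 + exp(raw)) - y * raw for y in {0, 1}.
    # log(1 + exp(raw)) is computed as max(raw, 0) + log1p(exp(-|raw|)),
    # which cannot overflow, unlike a naive log1p(exp(raw)).
    return np.maximum(raw, 0.0) + np.log1p(np.exp(-np.abs(raw))) - y * raw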

Member

@ogrisel ogrisel left a comment

I started to take a deeper look at the actual changes, and here is a first pass of feedback.

Member

@ogrisel ogrisel left a comment

Some more feedback:

@betatim
Member

betatim commented Nov 14, 2025

I've not yet looked at the diff, but I have run the benchmark script from #32644 (comment) with the following setting: n_samples, n_features, n_classes = 1_000_000, 1_000, 2

The results are:

Average fit time numpy: 6.502
Average fit time torch cuda: 3.51
Torch cuda fit speedup: 1.85X
Average predict time numpy: 0.38
Average predict time torch cuda: 0.133
Torch cuda predict speedup: 2.86X

I've not systematically tried different shapes/sizes, but from trying smaller n_features it seems like the speed-ups are roughly the same.

The GPU is an A6000; CPU details are in the fold-out below.

Details

processor	: 63
vendor_id	: AuthenticAMD
cpu family	: 25
model		: 24
model name	: AMD Ryzen Threadripper PRO 7975WX 32-Cores
stepping	: 1
microcode	: 0xa108108
cpu MHz		: 2194.000
cache size	: 1024 KB
physical id	: 0
siblings	: 64
core id		: 31
cpu cores	: 32
apicid		: 63
initial apicid	: 63
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips	: 7988.57
TLB size	: 3584 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 52 bits physical, 57 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
System:
    python: 3.13.9 | packaged by conda-forge | (main, Oct 22 2025, 23:33:35) [GCC 14.3.0]
executable: /home/thead/miniforge3/envs/sklearn-20251114/bin/python3.13
   machine: Linux-6.14.0-27-generic-x86_64-with-glibc2.39

Python dependencies:
      sklearn: 1.8.dev0
          pip: 25.3
   setuptools: 80.9.0
        numpy: 2.3.4
        scipy: 1.16.3
       Cython: 3.2.1
       pandas: 2.3.3
   matplotlib: 3.10.8
       joblib: 1.5.2
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: mkl
    num_threads: 32
         prefix: libmkl_rt
       filepath: /home/thead/miniforge3/envs/sklearn-20251114/lib/libmkl_rt.so.2
        version: 2025.3-Product
threading_layer: intel

       user_api: openmp
   internal_api: openmp
    num_threads: 64
         prefix: libomp
       filepath: /home/thead/miniforge3/envs/sklearn-20251114/lib/libomp.so
        version: None

@OmarManzoor
Contributor Author

OmarManzoor commented Nov 14, 2025

@betatim This seems to be a lot slower than what I observed on a Colab T4 GPU, maybe because of the higher n_samples.

@ogrisel
Member

ogrisel commented Nov 17, 2025

@betatim This seems to be a lot slower than what I observed on a Colab T4 GPU, maybe because of the higher n_samples.

It's also possible that the CPUs on the T4 instance of Colab are particularly slow and therefore inflate the impact of using CUDA on that machine.

@OmarManzoor
Contributor Author

OmarManzoor commented Nov 17, 2025

I checked the CPU on Colab. I think the CPU might explain the difference in the observed timings:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) CPU @ 2.00GHz
stepping	: 3
microcode	: 0xffffffff
cpu MHz		: 2000.180
cache size	: 39424 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed bhi its
bogomips	: 4000.36
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) CPU @ 2.00GHz
stepping	: 3
microcode	: 0xffffffff
cpu MHz		: 2000.180
cache size	: 39424 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed bhi its
bogomips	: 4000.36
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

With n_samples, n_features, n_classes = 1000000, 300, 2:

Average fit time numpy: 22.72
Average fit time torch cuda: 4.321
Torch cuda fit speedup: 5.26X

Average predict time numpy: 1.082
Average predict time torch cuda: 0.186
Torch cuda predict speedup: 5.82X

@betatim
Member

betatim commented Nov 17, 2025

I think seeing (quite) different speed-ups depending on which combination of CPU and GPU you use is expected, or at least not too surprising. For me the precise speed-up is less important than seeing a general trend. It seems we see speed improvements across different CPU/GPU combinations and also across different choices of n_samples and n_features (we don't have to cherry-pick any to demonstrate a speed-up).
