
Conversation

@OmarManzoor
Contributor

Reference Issues/PRs

Towards: #32611

What does this implement/fix? Explain your changes.

  • Adds array API support for LogisticRegression with the LBFGS method

Any other comments?

@github-actions

github-actions bot commented Nov 4, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: df00875. Link to the linter CI: here

Member

@ogrisel ogrisel left a comment

Could you also please profile a run on MPS or CUDA using py-spy?
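
For reference, a typical py-spy invocation that produces such an SVG flamegraph looks something like the following; the script name here is a placeholder, and --native additionally samples native (Cython/C) frames:

py-spy record -o profile.svg --native -- python benchmark_logreg.py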

@OmarManzoor
Contributor Author

Some Benchmarks

from time import time

import numpy as np
import torch as xp
from tqdm import tqdm

from sklearn import config_context
from sklearn.linear_model import LogisticRegression

n_samples, n_features, n_classes = 1000000, 300, 20
device = "cuda"
n_iter = 10

X_np = np.random.rand(n_samples, n_features)
y_np = np.random.randint(0, 10, n_samples)  # note: labels span 10 classes, not the n_classes = 20 defined above
numpy_fit_times = []
numpy_predict_times = []
for _ in tqdm(range(n_iter), desc="Numpy"):
    lr = LogisticRegression(C=0.8, solver="lbfgs", max_iter=200)
    start = time()
    lr.fit(X_np, y_np)
    numpy_fit_times.append(round(time() - start, 3))
    start = time()
    pred = lr.predict_proba(X_np)
    numpy_predict_times.append(round(time() - start, 3))

avg_numpy_fit = round(sum(numpy_fit_times) / n_iter, 3)
avg_numpy_predict = round(sum(numpy_predict_times) / n_iter, 3)

torch_fit_times = []
torch_predict_times = []
X_xp = xp.asarray(X_np, device=device)
y_xp = xp.asarray(y_np, device=device)
for _ in tqdm(range(n_iter), desc=f"Torch {device}"):
    with config_context(array_api_dispatch=True):
        lr = LogisticRegression(C=0.8, solver="lbfgs", max_iter=200)
        start = time()
        lr.fit(X_xp, y_xp)
        torch_fit_times.append(round(time() - start, 3))
        start = time()
        pred = lr.predict_proba(X_xp)
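        # Converting one element to a Python float forces a device sync, so the
        # queued GPU work is included in the measured predict time.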
        first = float(pred[0, 0])
        torch_predict_times.append(round(time() - start, 3))

avg_torch_fit = round(sum(torch_fit_times) / n_iter, 3)
avg_torch_predict = round(sum(torch_predict_times) / n_iter, 3)

print(f"Average fit time numpy: {avg_numpy_fit}")
print(f"Average fit time torch {device}: {avg_torch_fit}")
print(f"Torch {device} fit speedup: {round(avg_numpy_fit / avg_torch_fit, 2)}X")


print(f"Average predict time numpy: {avg_numpy_predict}")
print(f"Average predict time torch {device}: {avg_torch_predict}")
print(
    f"Torch {device} predict speedup: {round(avg_numpy_predict / avg_torch_predict, 2)}"
    "X"
)

Results with n_samples, n_features, n_classes = 1000000, 300, 20:

Average fit time numpy: 23.526
Average fit time torch cuda: 8.104
Torch cuda fit speedup: 2.9X

Average predict time numpy: 1.133
Average predict time torch cuda: 0.17
Torch cuda predict speedup: 6.66X

@ogrisel
Member

ogrisel commented Nov 4, 2025

It's nice to get a speed-up with CUDA despite the conversion of the raw predictions and pointwise gradient values of the loss at each iteration.

Can you post the SVGs of the py-spy profiling results for both the PyTorch/CUDA and the NumPy/CPU runs?

If the conversions of the raw predictions / pointwise gradients are significant, I think we should try to implement an array API alternative to the Cython gradient function as part of this PR, so that those conversions can be skipped.
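
For context, the pointwise gradient is mathematically simple in the binary (half-binomial) case: it is sigmoid(z) - y per sample, for raw prediction z and label y in {0, 1}. A minimal NumPy/SciPy sketch of that computation (illustrative only, not this PR's code; sample weights omitted):

from scipy.special import expit  # numerically stable sigmoid

def half_binomial_pointwise_gradient(z, y):
    # Per-sample derivative of the logistic loss log(1 + exp(z)) - y * z
    # with respect to the raw prediction z.
    return expit(z) - y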

Member

@ogrisel ogrisel left a comment

Some more feedback.

@OmarManzoor
Contributor Author

CUDA py-spy flamegraph

[attached image: profile — py-spy flamegraph SVG]

@OmarManzoor
Contributor Author

OmarManzoor commented Nov 5, 2025

NumPy py-spy flamegraph

[attached image: profile (1) — py-spy flamegraph SVG]

Member

@ogrisel ogrisel left a comment

Another pass of feedback:

@ogrisel
Member

ogrisel commented Nov 12, 2025

For the record, I observe a speed-up of 2x when using PyTorch/MPS vs NumPy/CPU (with OpenBLAS) on an Apple M4 laptop using a 50-class classification problem. I see no significant speed-up for binary classification.

EDIT: I increased the dataset size and now also observe a 2x speed-up when using PyTorch/MPS vs NumPy/CPU (with OpenBLAS) for binary classification.

@ogrisel
Member

ogrisel commented Nov 12, 2025

While evaluating locally, I found a bug for a specific dataset size:

import os

os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
os.environ["SCIPY_ARRAY_API"] = "1"


import torch
import numpy as np
from sklearn import set_config
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import minmax_scale

# Uncomment when using the new Accelerate BLAS on macOS.
# import warnings
# warnings.filterwarnings("ignore", category=RuntimeWarning)


set_config(array_api_dispatch=True)

X, y = make_classification(
    n_samples=int(1e6), n_classes=2, n_features=100, n_informative=90, random_state=0
)
X = X.astype("float32")
# X = minmax_scale(X)

X_mps = torch.from_numpy(X).to("mps")
y_mps = torch.from_numpy(y).to("mps")

clf = LogisticRegression(max_iter=1000)

clf.fit(X, y).n_iter_, clf.score(X, y)

outputs:

(array([37], dtype=int32), 0.851278)

while:

clf.fit(X_mps, y_mps).n_iter_, clf.score(X_mps, y_mps)

outputs a bad model without any error message or ConvergenceWarning:

(tensor([1], device='mps:0', dtype=torch.int32), 0.4999749958515167)

@OmarManzoor can you reproduce?

The fact that the model converges to a chance-level prediction function after only 1 iteration without any error message sounds like a bug to me.

Note that the problem goes away when:

  • reducing the number of data points or the number of informative features in the training set;
  • scaling the features.

@ogrisel
Member

ogrisel commented Nov 12, 2025

Since the problem happens at the first iteration, we could debug by printing (parts of) the sample-wise and parameter-wise gradient vectors for each run with max_iter=1 along with the values for the convergence criterion.
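
One quick way to start, reusing X, y, X_mps, y_mps from the script above (a sketch that only compares the resulting coefficients rather than the internal gradients, which would require temporary prints inside the solver; it assumes the fitted coef_ stays a torch tensor on the input device, as is the convention for array API support):

# Run a single LBFGS iteration on both backends and compare the coefficients,
# which reflect the first gradient step (a ConvergenceWarning is expected).
clf_np = LogisticRegression(max_iter=1).fit(X, y)
clf_mps = LogisticRegression(max_iter=1).fit(X_mps, y_mps)
print("numpy coef[0, :5]:", clf_np.coef_[0, :5])
print("mps coef[0, :5]:  ", clf_mps.coef_[0, :5].to("cpu"))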

@OmarManzoor
Contributor Author

I updated loss_array_api for the half-binomial case to take some important additional conditions into account. I think using a plain xp.log1p is not accurate enough.

This is the output of the above script which was causing issues:
tensor([38], device='mps:0', dtype=torch.int32) 0.8512759804725647
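
For readers following along: a plain xp.log1p(xp.exp(raw)) overflows for large positive raw predictions, which is presumably the kind of condition that needs special handling. A minimal NumPy sketch of the usual stable formulation (illustrative only, not the actual loss_array_api code):

import numpy as np

def stable_half_binomial_loss(raw, y):
    # Per-sample logistic loss log(1 + exp(raw)) - y * raw for y in {0, 1}.
    # log(1 + exp(raw)) is computed as max(raw, 0) + log1p(exp(-|raw|)),
    # which cannot overflow, unlike a naive log1p(exp(raw)).
    return np.maximum(raw, 0.0) + np.log1p(np.exp(-np.abs(raw))) - y * raw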

Member

@ogrisel ogrisel left a comment

I started to take a deeper look at the actual changes, and here is a first pass of feedback.

Member

@ogrisel ogrisel left a comment

Some more feedback:

@betatim
Member

betatim commented Nov 14, 2025

I've not yet looked at the diff, but I have run the benchmark script from #32644 (comment) with the following setting: n_samples, n_features, n_classes = 1_000_000, 1_000, 2

The results are:

Average fit time numpy: 6.502
Average fit time torch cuda: 3.51
Torch cuda fit speedup: 1.85X
Average predict time numpy: 0.38
Average predict time torch cuda: 0.133
Torch cuda predict speedup: 2.86X

I've not systematically tried different shapes/sizes, but from trying smaller n_features it seems like the speed-ups are roughly the same.

The GPU is an A6000; CPU details are in the fold-out below.

Details

processor	: 63
vendor_id	: AuthenticAMD
cpu family	: 25
model		: 24
model name	: AMD Ryzen Threadripper PRO 7975WX 32-Cores
stepping	: 1
microcode	: 0xa108108
cpu MHz		: 2194.000
cache size	: 1024 KB
physical id	: 0
siblings	: 64
core id		: 31
cpu cores	: 32
apicid		: 63
initial apicid	: 63
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips	: 7988.57
TLB size	: 3584 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 52 bits physical, 57 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
System:
    python: 3.13.9 | packaged by conda-forge | (main, Oct 22 2025, 23:33:35) [GCC 14.3.0]
executable: /home/thead/miniforge3/envs/sklearn-20251114/bin/python3.13
   machine: Linux-6.14.0-27-generic-x86_64-with-glibc2.39

Python dependencies:
      sklearn: 1.8.dev0
          pip: 25.3
   setuptools: 80.9.0
        numpy: 2.3.4
        scipy: 1.16.3
       Cython: 3.2.1
       pandas: 2.3.3
   matplotlib: 3.10.8
       joblib: 1.5.2
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: mkl
    num_threads: 32
         prefix: libmkl_rt
       filepath: /home/thead/miniforge3/envs/sklearn-20251114/lib/libmkl_rt.so.2
        version: 2025.3-Product
threading_layer: intel

       user_api: openmp
   internal_api: openmp
    num_threads: 64
         prefix: libomp
       filepath: /home/thead/miniforge3/envs/sklearn-20251114/lib/libomp.so
        version: None

@OmarManzoor
Contributor Author

OmarManzoor commented Nov 14, 2025

@betatim This seems to be a lot slower than what I observed on a Colab T4 GPU, maybe because of the higher n_samples.

@ogrisel
Member

ogrisel commented Nov 17, 2025

@betatim This seems to be a lot slower than what I observed on a Colab T4 GPU, maybe because of the higher n_samples.

It's also possible that the CPUs on the T4 instance of Colab are particularly slow and therefore inflate the impact of using CUDA on that machine.

@OmarManzoor
Contributor Author

OmarManzoor commented Nov 17, 2025

I checked the CPU on Colab. I think the CPU might explain the difference in the observed timings:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) CPU @ 2.00GHz
stepping	: 3
microcode	: 0xffffffff
cpu MHz		: 2000.180
cache size	: 39424 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed bhi its
bogomips	: 4000.36
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) CPU @ 2.00GHz
stepping	: 3
microcode	: 0xffffffff
cpu MHz		: 2000.180
cache size	: 39424 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed bhi its
bogomips	: 4000.36
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

With n_samples, n_features, n_classes = 1000000, 300, 2:

Average fit time numpy: 22.72
Average fit time torch cuda: 4.321
Torch cuda fit speedup: 5.26X

Average predict time numpy: 1.082
Average predict time torch cuda: 0.186
Torch cuda predict speedup: 5.82X

@betatim
Member

betatim commented Nov 17, 2025

I think seeing (quite) different speed-ups depending on which combination of CPU and GPU you use is expected, or at least not too surprising. For me the precise speed-up is less important than seeing a general trend. It seems we see speed improvements across different CPU/GPU combinations and also across different choices of n_samples and n_features (we don't have to cherry-pick any to demonstrate a speed-up).
