@khmyznikov

Fixes #32491, a segmentation fault occurring in sklearn/tree/_utils on Windows ARM64 during decision tree training with the MAE criterion.

Root Cause
The crash was caused by an out-of-bounds memory access in precompute_absolute_errors.

The WeightedFenwickTree.search method is designed to return an index used to retrieve values from sorted_y. However, under certain conditions (likely due to floating-point precision issues or edge cases where target_weight >= total_weight), search could return an index equal to self.size (i.e., n).

Since sorted_y is a 0-indexed array of size n, accessing sorted_y[n] resulted in an invalid read, triggering the crash.
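To make the failure mode concrete, here is a minimal Python sketch of a prefix-weight search on a Fenwick tree. It is not the Cython code from _utils.pyx: the class name FenwickSketch, its methods, and the `<=` comparison are assumptions chosen for illustration. It only shows how such a search can return an index equal to size once target_weight reaches total_weight.

```python
class FenwickSketch:
    """Hypothetical weighted Fenwick tree (illustration only, not sklearn's code)."""

    def __init__(self, size):
        self.size = size
        self.tree = [0.0] * (size + 1)  # internal storage is 1-indexed
        # Largest power of two that is <= size, used by the search descent.
        self.max_pow2 = 1
        while self.max_pow2 * 2 <= size:
            self.max_pow2 *= 2

    def add(self, idx, weight):
        """Add `weight` at 0-based position `idx`."""
        i = idx + 1
        while i <= self.size:
            self.tree[i] += weight
            i += i & -i

    def search(self, target_weight):
        """Return the number of leading positions whose cumulative weight
        is <= target_weight. No clamping is applied, so the result can be
        equal to self.size."""
        pos = 0
        step = self.max_pow2
        while step:
            nxt = pos + step
            if nxt <= self.size and self.tree[nxt] <= target_weight:
                pos = nxt
                target_weight -= self.tree[nxt]
            step >>= 1
        return pos


sorted_y = [10.0, 20.0, 30.0, 40.0]
tree = FenwickSketch(len(sorted_y))
for i in range(len(sorted_y)):
    tree.add(i, 1.0)  # unit weights, total_weight == 4.0

median_idx = tree.search(2.0)    # half of the total weight: a valid index
overflow_idx = tree.search(4.0)  # target_weight == total_weight
print(median_idx, overflow_idx, len(sorted_y))
# sorted_y[overflow_idx] would be out of bounds: Python raises IndexError,
# whereas unchecked C-level indexing reads past the end of the buffer.
```

In this sketch the out-of-range index only appears when the target reaches the full total weight, which is consistent with the edge case named above.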

The Fix
The fix involves clamping the indices returned by WeightedFenwickTree.search to ensure they never exceed self.size - 1. This change was applied in three specific locations within _utils.pyx to cover all exit paths of the function:

  • Standard Search Path: When no exact match for the weight is found, current_idx is now clamped before returning.
  • Exact Match Path (Upper Bound): When an exact match is found, current_idx is clamped.
  • Exact Match Path (Lower Bound): The prev_idx (returned via pointer) is also clamped, as it is used to calculate the median in the exact match scenario.

This ensures that the returned rank is always a valid index for the sorted_y array, preventing the segmentation fault.
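The clamping idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patch itself: search_clamped is a hypothetical helper, and it uses a plain prefix-sum array with bisect instead of a Fenwick tree, since only the final clamp matters here.

```python
from bisect import bisect_right
from itertools import accumulate


def search_clamped(weights, target_weight):
    """Hypothetical helper: find the rank for target_weight, then clamp it
    to the last valid index so that it can always be used to index sorted_y."""
    if not weights:
        raise ValueError("weights must not be empty")
    prefix = list(accumulate(weights))
    # Number of positions whose cumulative weight is <= target_weight.
    # Without clamping, this equals len(weights) when target_weight >= total.
    idx = bisect_right(prefix, target_weight)
    return min(idx, len(weights) - 1)


sorted_y = [10.0, 20.0, 30.0, 40.0]
weights = [1.0, 1.0, 1.0, 1.0]

idx = search_clamped(weights, 4.0)  # target_weight == total_weight
print(idx, sorted_y[idx])           # the clamped rank is a valid index
```

Clamping makes the read safe but does not say whether the clamped rank is the statistically intended one, so the edge case itself may still deserve a dedicated check.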

Exception analysis via WinDbg:
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************


KEY_VALUES_STRING: 1

    Key  : AV.Fault
    Value: Read

    Key  : Analysis.CPU.mSec
    Value: 796

    Key  : Analysis.Elapsed.mSec
    Value: 1786

    Key  : Analysis.IO.Other.Mb
    Value: 0

    Key  : Analysis.IO.Read.Mb
    Value: 0

    Key  : Analysis.IO.Write.Mb
    Value: 0

    Key  : Analysis.Init.CPU.mSec
    Value: 3531

    Key  : Analysis.Init.Elapsed.mSec
    Value: 43456

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 129

    Key  : Failure.Bucket
    Value: INVALID_POINTER_READ_c0000005__utils.cp314-win_arm64.pyd!Unknown

    Key  : Failure.Hash
    Value: {94d1b53b-d1cb-71c1-a5eb-cc5cc4c74f4f}

    Key  : Timeline.OS.Boot.DeltaSec
    Value: 1632856

    Key  : Timeline.Process.Start.DeltaSec
    Value: 16

    Key  : WER.OS.Branch
    Value: ge_release

    Key  : WER.OS.Version
    Value: 10.0.26100.1

    Key  : WER.Process.Version
    Value: 3.14.150.1013


FILE_IN_CAB:  crash_41.dmp

NTGLOBALFLAG:  70

APPLICATION_VERIFIER_FLAGS:  0

CONTEXT:  (.ecxr)
 x0=000001a33a491a40   x1=ffffffffffffffff   x2=0000000000000000   x3=0000000000000001
 x4=0000000000000000   x5=0000000000000001   x6=0000000000000000   x7=0000000040004000
 x8=000001a331a90000   x9=fffffffffffffff0  x10=000001a331a9c000  x11=00007ffd0db800c8
x12=000001a33b3b3040  x13=000001a33940e9e0  x14=0000000000000001  x15=0000000000000000
x16=0000af0515adfa72  x17=000001a33940e9c8  x18=0000000000000000  x19=000001a33a91dad0
x20=0000005d12cdb0d0  x21=0000000000000006  x22=0000000000000000  x23=00000000000003de
x24=00000000000003dd  x25=000001a33a857980  x26=000001a33b1d9c30  x27=0000000000000000
x28=0000000000000001   fp=0000005d12cdb140   lr=00007ffd0719b9d8   sp=0000005d12cda8e0
 pc=00007ffd0719b6c8  psr=60001040 -ZC- EL0
_utils_cp314_win_arm64!PyInit__utils+0x70b0:
00007ffd`0719b6c8 fc696900 ldr         d0,[x8,x9]
Resetting default scope

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00007ffd0719b6c8 (_utils_cp314_win_arm64!PyInit__utils+0x00000000000070b0)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: 000001a331a8fff0
Attempt to read from address 000001a331a8fff0

PROCESS_NAME:  python.exe

READ_ADDRESS:  000001a331a8fff0 

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR:  c0000005

EXCEPTION_PARAMETER1:  0000000000000000

EXCEPTION_PARAMETER2:  000001a331a8fff0

STACK_TEXT:  
0000005d`12cda8e0 00007ffd`0719b9d8     : 000001a3`3940e380 00007ffd`1307ff74 3fe08072`24a5a584 3fd91a0f`f25f82e0 : _utils_cp314_win_arm64!PyInit__utils+0x70b0
0000005d`12cda8e0 00007ffd`1307ff74     : 000001a3`3940e380 00007ffd`1307ff74 3fe08072`24a5a584 3fd91a0f`f25f82e0 : _utils_cp314_win_arm64!PyInit__utils+0x73c0
0000005d`12cda900 00007ffd`0db4cda4     : 0000005d`12cdab50 00007ffd`0db4cda4 00000000`00000000 00000000`00000000 : _criterion_cp314_win_arm64!PyInit__criterion+0xb1e4
0000005d`12cda920 00007ffd`0db4e1f0     : 3ff00000`00000000 0000005d`12cdab50 00000000`00000001 00000000`00000000 : _tree_cp314_win_arm64+0x1cda4
0000005d`12cda9f0 00007ffd`0db368ec     : 00000000`00000003 0000005d`12cdb0d0 0000005d`12cdab50 00000000`00000000 : _tree_cp314_win_arm64+0x1e1f0
0000005d`12cdb1b0 00007ffd`0db42820     : 00000000`00000020 000001a3`26aa7c00 00000000`00000000 00000009`00000021 : _tree_cp314_win_arm64+0x68ec
0000005d`12cdb790 00007ffc`cd88e370     : 00007ffc`cd88e370 000001a3`3a2ed620 0000005d`12cdbf00 00007ffc`cd8916ac : _tree_cp314_win_arm64+0x12820
0000005d`12cdb7a0 00007ffc`cd8916ac     : 0000005d`12cdbf00 00007ffc`cd8916ac 000001a3`3a8b0e51 000001a3`25dd1b30 : python314!PyObject_Vectorcall+0xc0
0000005d`12cdb810 00007ffc`cd88cfd8     : 000001a3`25dd19e0 00007ffc`cdf28d10 00007ffc`00000001 00000000`00000000 : python314!PyEval_EvalFrameDefault+0xee4
0000005d`12cdbf70 00007ffc`cd8f23a8     : 000001a3`27b1f100 000001a3`3a8cf250 00007ffc`cd88ce18 000001a3`3a80f290 : python314!PyEvalFramePushAndInit+0x318
0000005d`12cdc020 00007ffc`cd8f3fc0     : 0000005d`12cdc0a0 00007ffc`cd8f3fc0 000001a3`3a2aabc0 000001a3`00000001 : python314!PyObject_VectorcallDict+0xe8
0000005d`12cdc070 00007ffc`cd8f3f14     : 000001a3`27b1f100 00007ffc`cd9ad178 000001a3`3a6c05c0 000001a3`277a69f0 : python314!PyWrapper_New+0x4a8
0000005d`12cdc0f0 00007ffc`cd88e8a0     : 0000005d`12cdc140 00007ffc`cd88e8a0 000001a3`274c8040 00000000`00000002 : python314!PyWrapper_New+0x3fc
0000005d`12cdc140 00007ffc`cd896d24     : 0000005d`12cdc8a0 00007ffc`cd896d24 000001a3`268e1c10 000001a3`25dd1558 : python314!PyObject_Vectorcall+0x5f0
0000005d`12cdc1b0 00007ffc`cd88cfd8     : 000001a3`25dd13d0 00007ffc`cdf28d10 00007ffc`00000002 00000000`00000000 : python314!PyEval_EvalFrameDefault+0x655c
0000005d`12cdc910 00007ffc`cd8f23a8     : 000001a3`27b1f420 000001a3`3a8cf250 00007ffc`cd88ce18 000001a3`3a80f7f0 : python314!PyEvalFramePushAndInit+0x318
0000005d`12cdc9c0 00007ffc`cd8f3fc0     : 0000005d`12cdca40 00007ffc`cd8f3fc0 000001a3`38717e80 000001a3`2a6286c0 : python314!PyObject_VectorcallDict+0xe8
0000005d`12cdca10 00007ffc`cd8f3f14     : 000001a3`27b1f420 00007ffc`cd9ad178 0000005d`12cdca70 00007ffc`cd85cc18 : python314!PyWrapper_New+0x4a8
0000005d`12cdca90 00007ffc`cd8559a8     : 0000005d`12cdcae0 00007ffc`cd8559a8 000001a3`274c8040 00000000`00000002 : python314!PyWrapper_New+0x3fc
0000005d`12cdcae0 00007ffc`cd895fd8     : 0000005d`12cdd220 00007ffc`cd895fd8 000001a3`25dd10e8 000001a3`278873b4 : python314!PyObject_Call+0xb0
0000005d`12cdcb30 00007ffc`cd88cfd8     : 000001a3`25dd1058 00007ffc`cdf28d10 0000005d`00000002 000001a3`25d08e00 : python314!PyEval_EvalFrameDefault+0x5810
0000005d`12cdd290 00007ffc`cd8f23a8     : 000001a3`27b1f5b0 000001a3`3a8cf250 000001a3`3a8cf350 000001a3`3a2f6710 : python314!PyEvalFramePushAndInit+0x318
0000005d`12cdd340 00007ffc`cd8f3fc0     : 0000005d`12cdd3c0 00007ffc`cd8f3fc0 000001a3`3a511980 00007ffc`00000001 : python314!PyObject_VectorcallDict+0xe8
0000005d`12cdd390 00007ffc`cd8f3f14     : 000001a3`27b1f5b0 00007ffc`cd9ad178 000001a3`3a2e0e40 000001a3`277a6eb0 : python314!PyWrapper_New+0x4a8
0000005d`12cdd410 00007ffc`cd88e8a0     : 0000005d`12cdd460 00007ffc`cd88e8a0 000001a3`274c8040 00000000`00000002 : python314!PyWrapper_New+0x3fc
0000005d`12cdd460 00007ffc`cd896d24     : 0000005d`12cddbc0 00007ffc`cd896d24 000001a3`268e1c10 000001a3`25dd0ac8 : python314!PyObject_Vectorcall+0x5f0
0000005d`12cdd4d0 00007ffc`cd88cfd8     : 000001a3`25dd0a50 00007ffc`cdf28d10 0000005d`00000002 00000000`00000000 : python314!PyEval_EvalFrameDefault+0x655c
0000005d`12cddc30 00007ffc`cd8f23a8     : 000001a3`27b1f6a0 000001a3`27b8d9d0 00007ffc`cd88ce18 000001a3`3a2f6a70 : python314!PyEvalFramePushAndInit+0x318
0000005d`12cddce0 00007ffc`cd8f3fc0     : 0000005d`12cddd60 00007ffc`cd8f3fc0 000001a3`3a267c00 00007ffc`cd9b1598 : python314!PyObject_VectorcallDict+0xe8
0000005d`12cddd30 00007ffc`cd8f3f14     : 000001a3`27b1f6a0 00007ffc`cd9ad178 000001a3`386f8300 00007ffc`cdeed2f0 : python314!PyWrapper_New+0x4a8
0000005d`12cdddb0 00007ffc`cd88e8a0     : 0000005d`12cdde00 00007ffc`cd88e8a0 000001a3`274c8040 00000000`00000002 : python314!PyWrapper_New+0x3fc
0000005d`12cdde00 00007ffc`cd898d84     : 0000005d`12cde560 00007ffc`cd898d84 000001a3`268e1c10 000001a3`25dd07d0 : python314!PyObject_Vectorcall+0x5f0
0000005d`12cdde70 00007ffc`cd88cfd8     : 000001a3`25dd0768 00007ffc`cdf28d10 0000005d`00000001 0000005d`12cde2a0 : python314!PyEval_EvalFrameDefault+0x85bc
0000005d`12cde5d0 00007ffc`cd8f23a8     : 000001a3`27b1de90 000001a3`27a5f620 00007ffc`cd88ce18 000001a3`27b40410 : python314!PyEvalFramePushAndInit+0x318
0000005d`12cde680 00007ffc`cd8f3fc0     : 0000005d`12cde700 00007ffc`cd8f3fc0 000001a3`27c4e6c0 0000005d`12cde760 : python314!PyObject_VectorcallDict+0xe8
0000005d`12cde6d0 00007ffc`cd8f3f14     : 000001a3`27b1de90 00007ffc`cd9ad178 000001a3`00000000 000001a3`25d7be40 : python314!PyWrapper_New+0x4a8
0000005d`12cde750 00007ffc`cd88e8a0     : 0000005d`12cde7a0 00007ffc`cd88e8a0 000001a3`274c8040 00000000`00000002 : python314!PyWrapper_New+0x3fc
0000005d`12cde7a0 00007ffc`cd898d84     : 0000005d`12cdef00 00007ffc`cd898d84 000001a3`268e1c10 000001a3`25dd0388 : python314!PyObject_Vectorcall+0x5f0
0000005d`12cde810 00007ffc`cd8f2654     : 000001a3`25dd02d0 00007ffc`cdf28d10 0000005d`00000001 0000005d`12cdec40 : python314!PyEval_EvalFrameDefault+0x85bc
0000005d`12cdef70 00007ffc`cd920d60     : 0000005d`12cdef90 0000005d`12cdeff0 0000005d`12cdefa0 00007ffc`cd86ff7c : python314!PyTuple_FromArray+0x14c
0000005d`12cdf010 00007ffc`cd91fb00     : 0000005d`12cdf0b0 00007ffc`cd91fb00 000001a3`27036400 000001a3`25d773c0 : python314!PyEval_EvalCode+0xa8
0000005d`12cdf090 00007ffc`cd91f970     : 00000000`00000001 00000000`00000001 00007ffc`cdf28d10 00000000`00000000 : python314!PyDict_GetItemStringRef+0x688
0000005d`12cdf110 00007ffc`cd888010     : 00007ffc`cdf28d10 00007ffc`cd91d680 0000005d`12cdf170 00007ffc`cd8ff760 : python314!PyDict_GetItemStringRef+0x4f8
0000005d`12cdf190 00007ffc`cd88e370     : 0000005d`12cdf1c0 00007ffc`cd88e370 0000005d`12cdf7c0 000001a3`272074ae : python314!PyObject_RichCompare+0x1720
0000005d`12cdf1c0 00007ffc`cd8922d8     : 0000005d`12cdf920 00007ffc`cd8922d8 000001a3`25d773c0 000001a3`26153cf9 : python314!PyObject_Vectorcall+0xc0
0000005d`12cdf230 00007ffc`cd88cfd8     : 000001a3`25dd00e0 00007ffc`cdf28d10 00000000`00000002 0000005d`12cdf660 : python314!PyEval_EvalFrameDefault+0x1b10
0000005d`12cdf990 00007ffc`cd85596c     : 000001a3`270a4b10 00007ffc`cdde4ad1 0000005d`12cdfa40 00007ffc`00000000 : python314!PyEvalFramePushAndInit+0x318
0000005d`12cdfa40 00007ffc`cd9fb93c     : 0000005d`12cdfa90 00007ffc`cd9fb93c 00000000`00000001 000001a3`270a4b10 : python314!PyObject_Call+0x74
0000005d`12cdfa90 00007ffc`cd94a08c     : 0000005d`12cdfad0 00007ffc`cd94a08c 000001a3`27036470 00007ffc`cdef3de8 : python314!PyByteArray_Concat+0x5e4
0000005d`12cdfad0 00007ffc`cd949ec4     : 0000005d`12cdfb40 00007ffc`cd949ec4 000001a3`27036470 00000000`00000000 : python314!Py_RunMain+0x1ec
0000005d`12cdfb40 00007ffc`cd919884     : 0000005d`12cdfb70 00007ffc`cd919884 00000001`00000000 00000000`00000000 : python314!Py_RunMain+0x24
0000005d`12cdfb70 00007ffc`cd918fc0     : 0000005d`12cdfba0 00007ffc`cd918fc0 00000000`00000000 00000000`00000000 : python314!PyConfig_InitCompatConfig+0x2cc
0000005d`12cdfba0 00007ff6`dcf013a4     : 0000005d`12cdfbd0 00007ff6`dcf013a4 00000000`00000004 3a047ff6`00000000 : python314!Py_Main+0x20
0000005d`12cdfbd0 00007ff6`dcf0143c     : 0000005d`12cdfc10 37537ff6`dcf0143c 00000000`00000000 00000000`00000000 : python+0x13a4
0000005d`12cdfc10 00007ffd`97738740     : 0000005d`12cdfc20 0d427ffd`97738740 0000005d`12cdfc30 3301fffd`9b4843b4 : python+0x143c
0000005d`12cdfc20 00007ffd`9b4843b4     : 0000005d`12cdfc30 3301fffd`9b4843b4 00000000`00000000 5e418000`00000000 : kernel32!BaseThreadInitThunk+0x40
0000005d`12cdfc30 00000000`00000000     : 00000000`00000000 5e418000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x44


STACK_COMMAND:  ~0s; .ecxr ; kb

SYMBOL_NAME:  _utils_cp314_win_arm64+70b0

MODULE_NAME: _utils_cp314_win_arm64

IMAGE_NAME:  _utils.cp314-win_arm64.pyd

FAILURE_BUCKET_ID:  INVALID_POINTER_READ_c0000005__utils.cp314-win_arm64.pyd!Unknown

OS_VERSION:  10.0.26100.1

BUILDLAB_STR:  ge_release

OSPLATFORM_TYPE:  arm64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {94d1b53b-d1cb-71c1-a5eb-cc5cc4c74f4f}

Followup:     MachineOwner
---------

0:000> lmvm _utils_cp314_win_arm64
Browse full module list
start             end                 module name
00007ffd`07190000 00007ffd`071ac000   _utils_cp314_win_arm64   (export symbols)       _utils.cp314-win_arm64.pyd
    Loaded symbol image file: _utils.cp314-win_arm64.pyd
    Image path: X:\GitHub\scikit-learn\build\cp314\sklearn\tree\_utils.cp314-win_arm64.pyd
    Image name: _utils.cp314-win_arm64.pyd
    Browse all global symbols  functions  data
    Timestamp:        Tue Oct 28 19:44:20 2025 (69017F84)
    CheckSum:         0001B704
    ImageSize:        0001C000
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4
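As a cross-check of the dump above, the faulting instruction is ldr d0,[x8,x9], which loads a 64-bit value from the address x8 + x9. The short Python calculation below, using only the register values printed in the CONTEXT block, confirms that this sum matches the reported READ_ADDRESS. Treating x8 as an array base and x9 as a signed byte offset is an assumption made for illustration; the dump has no symbols, so this is a reading of the registers rather than a confirmed diagnosis.

```python
MASK64 = (1 << 64) - 1

# Register values copied from the CONTEXT block of the dump.
x8 = 0x000001A331A90000
x9 = 0xFFFFFFFFFFFFFFF0
read_address = 0x000001A331A8FFF0  # READ_ADDRESS reported by the analysis

# Address computed by `ldr d0,[x8,x9]`, wrapping at 64 bits.
effective = (x8 + x9) & MASK64

# Interpret x9 as a two's-complement signed 64-bit integer.
signed_x9 = x9 - (1 << 64) if x9 >> 63 else x9

# Assumption: elements are 8-byte doubles (d0 is a 64-bit FP register).
element_offset = signed_x9 // 8

print(hex(effective), effective == read_address)
print(signed_x9, element_offset)
```

Under these assumptions the read lands 16 bytes before x8, that is, two 8-byte elements below the presumed base rather than one element past the end of an array.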

@github-actions

github-actions bot commented Nov 21, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 56b33da. Link to the linter CI: here

@lesteve
Member

lesteve commented Nov 24, 2025

@khmyznikov would you be kind enough to explain how you managed to reproduce it? I guess this is something like #32491 (comment), i.e. running the scikit-learn tests in many parallel processes?

I am going to post your original script as a snippet below for further reference (easier to access than a .zip), please post an updated snippet in case you have tweaked your approach 🙏.

Details
$workingDir = Get-Location
$logDir = Join-Path $workingDir "test_logs"

# Create log directory if it doesn't exist
if (-not (Test-Path $logDir)) {
    New-Item -ItemType Directory -Path $logDir | Out-Null
}

$allResults = @()
$totalRuns = 0

for ($batch = 1; $batch -le 5; $batch++) {
    Write-Host "`n=== Starting Batch $batch of 5 ===" -ForegroundColor Magenta
    $jobs = @()
    
    for ($i = 1; $i -le 10; $i++) {
        $runNumber = ($batch - 1) * 10 + $i
        $totalRuns++
        Write-Host "Starting run $runNumber" -ForegroundColor Cyan
        $job = Start-Job -ScriptBlock {
            param($runNumber, $dir, $logPath)
            Set-Location $dir
            & "C:\Program Files (x86)\Windows Kits\10\Debuggers\arm64\cdb.exe" -g -G -o -v python -m pytest sklearn\tree\tests\test_monotonic_tree.py *>&1 | Tee-Object -FilePath $logPath
            return @{
                Run = $runNumber
                ExitCode = $LASTEXITCODE
                LogFile = $logPath
            }
        } -ArgumentList $runNumber, $workingDir, (Join-Path $logDir "run_$runNumber.log")
        $jobs += $job
    }

    Write-Host "Waiting for batch $batch to complete..." -ForegroundColor Yellow
    $jobs | Wait-Job | Out-Null

    foreach ($job in $jobs) {
        $jobResult = Receive-Job -Job $job
        $status = if ($jobResult.ExitCode -eq 0) { "PASSED" } else { "FAILED" }
        $allResults += [PSCustomObject]@{
            Run = $jobResult.Run
            Status = $status
            ExitCode = $jobResult.ExitCode
            LogFile = $jobResult.LogFile
        }
        Remove-Job -Job $job
    }
}

$allResults = $allResults | Sort-Object Run

Write-Host "`n=== Test Results Summary ===" -ForegroundColor Yellow
$allResults | Format-Table -AutoSize
$passed = ($allResults | Where-Object { $_.Status -eq "PASSED" }).Count
$failed = ($allResults | Where-Object { $_.Status -eq "FAILED" }).Count
Write-Host "Total: $totalRuns | Passed: $passed | Failed: $failed" -ForegroundColor Cyan
Write-Host "Logs saved to: $logDir" -ForegroundColor Cyan

@lesteve
Member

lesteve commented Nov 24, 2025

@cakedev0 would you mind having a look at this PR 🙏?

Ideally it would be nice to have a non-regression test, but given that the bug was only noticed on Windows ARM and was non-deterministic, I don't know how easy this is ...

@cakedev0
Contributor

cakedev0 commented Nov 24, 2025

The description of the bug makes a lot of sense, I'll look into it today!

@cakedev0
Contributor

cakedev0 commented Nov 24, 2025

It can't really be fixing #32491: precompute_absolute_errors was merged into main about a month after the issue was opened (in PR #32100).

Also, while I understand the bug, I don't see how it could happen in the current code. I can see it happening once we've extended this to the quantile loss, with quantile=1, so still a valid thing we want to fix I'd say, or enforce quantile < 1 when extending to the quantile loss.

@khmyznikov Can you share the stacktrace and/or a reproducer as well as your sklearn.show_versions() please?

@khmyznikov
Author

@cakedev0 @lesteve

System:
    python: 3.14.0 (tags/v3.14.0:ebf955d, Oct  7 2025, 10:57:41) [MSC v.1944 64 bit (ARM64)]
executable: X:\GitHub\scikit-learn\.env\tests\Scripts\python.exe
   machine: Windows-11-10.0.26200-SP0

Python dependencies:
      sklearn: 1.8.dev0
          pip: 25.2
   setuptools: None
        numpy: 2.3.4
        scipy: 1.16.3
       Cython: 3.1.6
       pandas: None
   matplotlib: None
       joblib: 1.5.2
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: vcomp
       filepath: C:\Windows\System32\vcomp140.dll
        version: None

The script may be a bit buggy when showing the results in the console, but the actual crash logs are correct.

$workingDir = Get-Location
$logDir = Join-Path $workingDir "test_logs"

# Create log directory if it doesn't exist
if (-not (Test-Path $logDir)) {
    New-Item -ItemType Directory -Path $logDir | Out-Null
}

$allResults = @()
$totalRuns = 0

for ($batch = 1; $batch -le 5; $batch++) {
    Write-Host "`n=== Starting Batch $batch of 5 ===" -ForegroundColor Magenta
    $jobs = @()
    
    for ($i = 1; $i -le 12; $i++) {
        $runNumber = ($batch - 1) * 12 + $i
        $totalRuns++
        Write-Host "Starting run $runNumber" -ForegroundColor Cyan
        $job = Start-Job -ScriptBlock {
            param($runNumber, $dir, $logPath)
            Set-Location $dir
            $dumpPath = Join-Path $dir "crash_$runNumber.dmp"
            $cdbOutput = & "C:\Program Files (x86)\Windows Kits\10\Debuggers\arm64\cdb.exe" -g -G -o -v -c ".dump /ma $dumpPath;g" python -m pytest sklearn\tree\tests\test_monotonic_tree.py *>&1
            $cdbOutput | Tee-Object -FilePath $logPath
            $exitCode = $LASTEXITCODE
            
            if ($exitCode -ne 0) {
                Write-Host "Test failed, retrieving call stack..." -ForegroundColor Red
                $crashAnalysis = & "C:\Program Files (x86)\Windows Kits\10\Debuggers\arm64\cdb.exe" -z $dumpPath -c "kb;q" *>&1
                $crashAnalysis | Out-File -Append -FilePath $logPath
            }
            
            return @{
                Run = $runNumber
                ExitCode = $exitCode
                LogFile = $logPath
            }
        } -ArgumentList $runNumber, $workingDir, (Join-Path $logDir "run_$runNumber.log")
        $jobs += $job
    }

    Write-Host "Waiting for batch $batch to complete..." -ForegroundColor Yellow
    $jobs | Wait-Job | Out-Null

    foreach ($job in $jobs) {
        $jobResult = Receive-Job -Job $job
        $status = if ($jobResult.ExitCode -eq 0) { "PASSED" } else { "FAILED" }
        $allResults += [PSCustomObject]@{
            Run = $jobResult.Run
            Status = $status
            ExitCode = $jobResult.ExitCode
            LogFile = $jobResult.LogFile
        }
        Remove-Job -Job $job
    }
}

$allResults = $allResults | Sort-Object Run

Write-Host "`n=== Test Results Summary ===" -ForegroundColor Yellow
$allResults | Format-Table -AutoSize
$passed = ($allResults | Where-Object { $_.Status -eq "PASSED" }).Count
$failed = ($allResults | Where-Object { $_.Status -eq "FAILED" }).Count
Write-Host "Total: $totalRuns | Passed: $passed | Failed: $failed" -ForegroundColor Cyan
Write-Host "Logs saved to: $logDir" -ForegroundColor Cyan

The way I do it is simple. The script launches the test in 12 parallel processes (the number of cores I have) and does so in 5 batches (each batch starts after the previous one finishes). I usually get a crash every time I run the script, but only 1-2 of the 60 test runs crash. After I applied the fix and rebuilt scikit-learn, I could no longer reproduce the crash. I tried about 10 times.
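For readers not on PowerShell, the same batching idea can be sketched in Python. This is a simplified illustration, not a port of the script above: run_batches and PYTEST_CMD are hypothetical names, and the cdb.exe wrapper and crash-dump capture are left out, so only exit codes are collected.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Assumed command, mirroring the test module used in the PowerShell script.
PYTEST_CMD = [
    sys.executable, "-m", "pytest",
    "sklearn/tree/tests/test_monotonic_tree.py",
]


def run_batches(cmd, n_parallel=12, n_batches=5):
    """Run `cmd` in `n_parallel` concurrent processes, `n_batches` times
    in a row, and return the exit code of every run."""
    exit_codes = []
    for _ in range(n_batches):
        # Each batch starts only after the previous one has fully finished,
        # because leaving the `with` block waits for all submitted runs.
        with ThreadPoolExecutor(max_workers=n_parallel) as pool:
            futures = [
                pool.submit(subprocess.run, cmd, capture_output=True)
                for _ in range(n_parallel)
            ]
            exit_codes.extend(f.result().returncode for f in futures)
    return exit_codes


def summarize(exit_codes):
    """Return (total, passed, failed) counts; nonzero exit codes are failures."""
    failed = sum(1 for code in exit_codes if code != 0)
    return len(exit_codes), len(exit_codes) - failed, failed

# Example call (not executed here):
#   print(summarize(run_batches(PYTEST_CMD)))
```

With the default arguments this gives 60 runs in total, matching the numbers described above. Threads are enough here because each worker only waits on a child process.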

@khmyznikov
Author

@lesteve @cakedev0 OK, I've just reverted my changes locally and pulled all the new changes from main, and I don't see any crashes anymore. Maybe something was changed last week, but it looks like this fix is no longer needed. Feel free to close it if you feel the same.

@lesteve
Member

lesteve commented Nov 25, 2025

@lesteve @cakedev0 Ok I've just reverted my changes locally and pulled all new changes from main, I don't see any crashes anymore

OK this is weird, for further reference, do you happen to be able to find which commit you were on before (when the bug was reproducible) and which commit you are on now (when the bug doesn't happen any more)?

Feel free to close it if you feel the same.

Yep, I feel the same: another unsolved mystery, which I guess is a shame, but I can live with it.

Closing this one, happy to reopen in case the segmentation fault re-appears. Thanks a lot @khmyznikov for your help on this weird issue 🙏!

@lesteve lesteve closed this Nov 25, 2025

Development

Successfully merging this pull request may close these issues.

CI Intermittent segmentation fault in Windows arm64 wheels test (vanilla CPython)