Fix test_monotonic_tree CI segfault on win_arm64 #32754
|
@khmyznikov would you be kind enough to explain how you managed to reproduce. I guess this is something like #32491 (comment), i.e. running the scikit-learn tests in many parallel processes? I am going to post your original script as a snippet below for further reference (easier to access than a .zip), please post an updated snippet in case you have tweaked your approach 🙏.
$workingDir = Get-Location
$logDir = Join-Path $workingDir "test_logs"
# Create log directory if it doesn't exist
if (-not (Test-Path $logDir)) {
    New-Item -ItemType Directory -Path $logDir | Out-Null
}
$allResults = @()
$totalRuns = 0
for ($batch = 1; $batch -le 5; $batch++) {
    Write-Host "`n=== Starting Batch $batch of 5 ===" -ForegroundColor Magenta
    $jobs = @()
    for ($i = 1; $i -le 10; $i++) {
        $runNumber = ($batch - 1) * 10 + $i
        $totalRuns++
        Write-Host "Starting run $runNumber" -ForegroundColor Cyan
        $job = Start-Job -ScriptBlock {
            param($runNumber, $dir, $logPath)
            Set-Location $dir
            & "C:\Program Files (x86)\Windows Kits\10\Debuggers\arm64\cdb.exe" -g -G -o -v python -m pytest sklearn\tree\tests\test_monotonic_tree.py *>&1 | Tee-Object -FilePath $logPath
            return @{
                Run = $runNumber
                ExitCode = $LASTEXITCODE
                LogFile = $logPath
            }
        } -ArgumentList $runNumber, $workingDir, (Join-Path $logDir "run_$runNumber.log")
        $jobs += $job
    }
    Write-Host "Waiting for batch $batch to complete..." -ForegroundColor Yellow
    $jobs | Wait-Job | Out-Null
    foreach ($job in $jobs) {
        $jobResult = Receive-Job -Job $job
        $status = if ($jobResult.ExitCode -eq 0) { "PASSED" } else { "FAILED" }
        $allResults += [PSCustomObject]@{
            Run = $jobResult.Run
            Status = $status
            ExitCode = $jobResult.ExitCode
            LogFile = $jobResult.LogFile
        }
        Remove-Job -Job $job
    }
}
$allResults = $allResults | Sort-Object Run
Write-Host "`n=== Test Results Summary ===" -ForegroundColor Yellow
$allResults | Format-Table -AutoSize
$passed = ($allResults | Where-Object { $_.Status -eq "PASSED" }).Count
$failed = ($allResults | Where-Object { $_.Status -eq "FAILED" }).Count
Write-Host "Total: $totalRuns | Passed: $passed | Failed: $failed" -ForegroundColor Cyan
Write-Host "Logs saved to: $logDir" -ForegroundColor Cyan |
|
@cakedev0 would you mind having a look at this PR 🙏? Ideally it would be nice to have a non-regression test, but given that the nature of the bug was only noticed on Windows arm and was non-deterministic, I don't know how easy this is ... |
|
The description of the bug makes a lot of sense, I'll look into it today! |
|
It can't really be fixing #32491: Also, while I understand the bug, I don't see how it could happen in the current code. I can see it happening once we've extended this to the quantile loss, with quantile=1, so it is still a valid thing we want to fix, I'd say, or we could enforce quantile < 1 when extending to the quantile loss. @khmyznikov Can you share the stacktrace and/or a reproducer as well as your
The script may be a bit buggy when showing the results in the console, but the actual crash logs are right.
$workingDir = Get-Location
$logDir = Join-Path $workingDir "test_logs"
# Create log directory if it doesn't exist
if (-not (Test-Path $logDir)) {
    New-Item -ItemType Directory -Path $logDir | Out-Null
}
$allResults = @()
$totalRuns = 0
for ($batch = 1; $batch -le 5; $batch++) {
    Write-Host "`n=== Starting Batch $batch of 5 ===" -ForegroundColor Magenta
    $jobs = @()
    for ($i = 1; $i -le 12; $i++) {
        $runNumber = ($batch - 1) * 12 + $i
        $totalRuns++
        Write-Host "Starting run $runNumber" -ForegroundColor Cyan
        $job = Start-Job -ScriptBlock {
            param($runNumber, $dir, $logPath)
            Set-Location $dir
            $dumpPath = Join-Path $dir "crash_$runNumber.dmp"
            $cdbOutput = & "C:\Program Files (x86)\Windows Kits\10\Debuggers\arm64\cdb.exe" -g -G -o -v -c ".dump /ma $dumpPath;g" python -m pytest sklearn\tree\tests\test_monotonic_tree.py *>&1
            $cdbOutput | Tee-Object -FilePath $logPath
            $exitCode = $LASTEXITCODE
            if ($exitCode -ne 0) {
                Write-Host "Test failed, retrieving call stack..." -ForegroundColor Red
                $crashAnalysis = & "C:\Program Files (x86)\Windows Kits\10\Debuggers\arm64\cdb.exe" -z $dumpPath -c "kb;q" *>&1
                $crashAnalysis | Out-File -Append -FilePath $logPath
            }
            return @{
                Run = $runNumber
                ExitCode = $exitCode
                LogFile = $logPath
            }
        } -ArgumentList $runNumber, $workingDir, (Join-Path $logDir "run_$runNumber.log")
        $jobs += $job
    }
    Write-Host "Waiting for batch $batch to complete..." -ForegroundColor Yellow
    $jobs | Wait-Job | Out-Null
    foreach ($job in $jobs) {
        $jobResult = Receive-Job -Job $job
        $status = if ($jobResult.ExitCode -eq 0) { "PASSED" } else { "FAILED" }
        $allResults += [PSCustomObject]@{
            Run = $jobResult.Run
            Status = $status
            ExitCode = $jobResult.ExitCode
            LogFile = $jobResult.LogFile
        }
        Remove-Job -Job $job
    }
}
$allResults = $allResults | Sort-Object Run
Write-Host "`n=== Test Results Summary ===" -ForegroundColor Yellow
$allResults | Format-Table -AutoSize
$passed = ($allResults | Where-Object { $_.Status -eq "PASSED" }).Count
$failed = ($allResults | Where-Object { $_.Status -eq "FAILED" }).Count
Write-Host "Total: $totalRuns | Passed: $passed | Failed: $failed" -ForegroundColor Cyan
Write-Host "Logs saved to: $logDir" -ForegroundColor Cyan
The way I do it is simple. The script launches the test in 12 parallel processes (the number of cores I have) and repeats this in 5 batches (the next batch launches after the previous one finishes). I usually get the crash each time I launch the script, but only 1-2 of the 60 test runs crash. After I applied the fix and rebuilt scikit-learn, I can't reproduce this crash anymore. I've tried about 10 times. |
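For readers who are not on Windows, the batching approach described above (several identical test runs in parallel, repeated over a few batches, then a count of non-zero exit codes) can be sketched in Python. This is only an illustration: it does not attach `cdb.exe` or collect crash dumps, and the names `run_once`, `run_batches` and `PYTEST_COMMAND` are made up for this sketch rather than taken from the scripts in this thread.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Assumed command: the same test module as in the PowerShell script,
# but launched directly instead of under a debugger.
PYTEST_COMMAND = [
    sys.executable, "-m", "pytest",
    "sklearn/tree/tests/test_monotonic_tree.py",
]


def run_once(command):
    """Run one command to completion and return its exit code."""
    completed = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    return completed.returncode


def run_batches(command, batches=5, parallel=12):
    """Run `command` `parallel` times at once, repeated for `batches` batches.

    A new batch only starts once every run of the previous batch has ended.
    Returns the list of exit codes, one per run.
    """
    exit_codes = []
    for _ in range(batches):
        # Threads are enough here: each one only waits on a child process.
        with ThreadPoolExecutor(max_workers=parallel) as pool:
            exit_codes.extend(pool.map(run_once, [command] * parallel))
    return exit_codes
```

Calling `run_batches(PYTEST_COMMAND)` from a scikit-learn checkout would start 60 runs in total; counting the non-zero entries of the returned list gives the number of failed or crashed runs.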
OK this is weird, for further reference, do you happen to be able to find which commit you were on before (when the bug was reproducible) and which commit you are on now (when the bug doesn't happen any more)?
Yep, I feel the same: another unsolved mystery, which I guess is a shame, but I can live with it. Closing this one, happy to reopen in case the segmentation fault re-appears. Thanks a lot @khmyznikov for your help on this weird issue 🙏! |
Fixes #32491, a segmentation fault occurring in sklearn/tree/_utils on Windows ARM64 during decision tree training with the MAE criterion.

Root Cause

The crash was caused by an out-of-bounds memory access in precompute_absolute_errors.

The WeightedFenwickTree.search method is designed to return an index used to retrieve values from sorted_y. However, under certain conditions (likely due to floating-point precision issues or edge cases where target_weight >= total_weight), search could return an index equal to self.size (i.e., n).

Since sorted_y is a 0-indexed array of size n, accessing sorted_y[n] resulted in an invalid read, triggering the crash.

The Fix

The fix involves clamping the indices returned by WeightedFenwickTree.search to ensure they never exceed self.size - 1. This change was applied in three specific locations within _utils.pyx to cover all exit paths of the function:

- current_idx is now clamped before returning.
- current_idx is clamped.
- prev_idx (returned via pointer) is also clamped, as it is used to calculate the median in the exact match scenario.

This ensures that the returned rank is always a valid index for the sorted_y array, preventing the segmentation fault.

Exception analysis via WinDbg:
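To make the root cause and the clamping described above concrete, here is a minimal pure-Python sketch of a weighted Fenwick tree. It is not the scikit-learn implementation: the class `WeightedFenwick` and its methods `add`, `search_unclamped` and `search` are hypothetical names written for this illustration, and the search rule (smallest index whose cumulative weight exceeds the target) is an assumption. It shows how a binary-descent search can return `size` when the target weight reaches the total weight, and how clamping keeps the result a valid index.

```python
class WeightedFenwick:
    """Toy weighted Fenwick tree over `size` slots (hypothetical sketch)."""

    def __init__(self, size):
        if size < 1:
            raise ValueError("size must be at least 1")
        self.size = size
        self.tree = [0.0] * (size + 1)  # 1-based internal storage
        self.total_weight = 0.0

    def add(self, idx, weight):
        """Add `weight` at 0-based position `idx`."""
        i = idx + 1
        while i <= self.size:
            self.tree[i] += weight
            i += i & -i
        self.total_weight += weight

    def search_unclamped(self, target_weight):
        """Smallest 0-based index whose cumulative weight exceeds the target.

        Returns `self.size` (one past the end) when no such index exists,
        i.e. when target_weight >= total_weight.
        """
        pos = 0
        remaining = target_weight
        step = 1 << (self.size.bit_length() - 1)  # largest power of 2 <= size
        while step:
            nxt = pos + step
            if nxt <= self.size and self.tree[nxt] <= remaining:
                pos = nxt
                remaining -= self.tree[nxt]
            step >>= 1
        return pos

    def search(self, target_weight):
        """Same search, clamped so the result is always a valid index."""
        return min(self.search_unclamped(target_weight), self.size - 1)


sorted_y = [10.0, 20.0, 30.0, 40.0]
tree = WeightedFenwick(len(sorted_y))
for position, weight in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.add(position, weight)

half = tree.total_weight / 2                       # 5.0
median_idx = tree.search_unclamped(half)           # 2, a valid index
past_end = tree.search_unclamped(tree.total_weight)  # 4 == len(sorted_y)
clamped = tree.search(tree.total_weight)           # 3, last valid index
```

In Python, `sorted_y[past_end]` would raise an `IndexError`; in compiled code that skips bounds checks, the same access is an invalid read, which is consistent with the crash described in this PR.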