
Segmentation error for torch==2.2.1 on MacOs #121101

Open
CloseChoice opened this issue Mar 3, 2024 · 16 comments
Labels
module: crash - Problem manifests as a hard crash, as opposed to a RuntimeError
module: intel - Specific to x86 architecture
module: macos - Mac OS related issues
module: openmp - Related to OpenMP (omp) support in PyTorch
needs reproduction - Someone else needs to try reproducing the issue given the instructions. No action needed from user
triaged - This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments


CloseChoice commented Mar 3, 2024

🐛 Describe the bug

At shap, we have run into problems with our CI jobs on macOS (e.g. see here). I tracked this down to an issue with torch==2.2.1.

Here is code to reproduce the issue (this works on torch==2.2.0):

import time

import torch
from sklearn.datasets import fetch_california_housing


def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)

(execute with python -m pytest <filename>)

Stacktrace:

bash-3.2$ python -m pytest tests/explainers/test_segfault_minimal_example2.py                                                                                                                               
=========================================================================================== test session starts ============================================================================================
platform darwin -- Python 3.11.8, pytest-8.1.0, pluggy-1.4.0
Matplotlib: 3.8.3
Freetype: 2.6.1
rootdir: /Users/runner/work/shap/shap
configfile: pyproject.toml
plugins: cov-4.1.0, mpl-0.17.0
collected 1 item                                                                                                                                                                                           

tests/explainers/test_segfault_minimal_example2.py Fatal Python error: Segmentation fault

Thread 0x00000001140ad600 (most recent call first):
  File "/Users/runner/work/shap/shap/tests/explainers/test_segfault_minimal_example2.py", line 8 in test_something
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/python.py", line 1769 in runtest
Segmentation fault: 11

Versions

PyTorch version: 2.2.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 12.7.3 (x86_64)
GCC version: Could not collect
Clang version: 14.0.0 (clang-1400.0.29.202)
CMake version: version 3.28.3
Libc version: N/A

Python version: 3.11.8 (v3.11.8:db85d51d3e, Feb  6 2024, 18:02:37) [Clang 13.0.0 (clang-1300.0.29.30)] (64-bit runtime)
Python platform: macOS-12.7.3-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.1
[pip3] torchvision==0.17.0
[conda] No relevant packages

cc @malfet @albanD @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

malfet added the module: crash, module: macos, and triage review labels on Mar 5, 2024
Contributor

malfet commented Mar 5, 2024

Is this reproducible if one uses Apple Silicon M1 runners? (Though Torch-2.2 is the last release to support Intel Macs per #114602.)

At least, I cannot reproduce it on M1, nor in an x86 Rosetta environment:

arch -arch x86_64 "/Applications/Python 3.11//IDLE.app/Contents/MacOS/Python" -mpytest ~/test/bug-121101.py

Nor can I repro in GitHub CI: https://github.com/malfet/deleteme/actions/runs/8150940508/job/22278030319?pr=79

malfet added the needs reproduction, triaged, and module: intel labels and removed the triage review label on Mar 5, 2024

connortann commented Mar 12, 2024

I can reproduce in GitHub CI (over in the shap repo) with a slightly different setup:

I'll see if I can identify what the relevant difference is between that job and your run above; perhaps it's related to having different dependencies installed.


connortann commented Mar 12, 2024

Reproducing the issue on GitHub Actions

I can reproduce the minimal reproducible example above on GitHub Actions, with the environment below.

The test snippet passes in an environment created with pip install pytest torch scikit-learn, but fails if the env also includes lightgbm.

The examples below ran on GitHub Actions with macos-latest, python=3.11.8, torch 2.2.1.

Reproducible example

As above:

import time

import torch
from sklearn.datasets import fetch_california_housing


def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)

Passing run

Example passing run: https://github.com/shap/shap/actions/runs/8248044359/job/22557508223
Output of pip list:

Package           Version
----------------- -----------
certifi           2024.2.2
filelock          3.13.1
fsspec            2024.2.0
iniconfig         2.0.0
Jinja2            3.1.3
joblib            1.3.2
MarkupSafe        2.1.5
mpmath            1.3.0
networkx          3.2.1
numpy             1.26.4
packaging         24.0
pip               24.0
pluggy            1.4.0
pytest            8.1.1
scikit-learn      1.4.1.post1
scipy             1.12.0
setuptools        65.5.0
sympy             1.12
threadpoolctl     3.3.0
torch             2.2.1
typing_extensions 4.10.0

Failing run

Example failing run: https://github.com/shap/shap/actions/runs/8248015803/job/22557423230
Output of pip list (identical apart from lightgbm):

Package           Version
----------------- -----------
certifi           2024.2.2
filelock          3.13.1
fsspec            2024.2.0
iniconfig         2.0.0
Jinja2            3.1.3
joblib            1.3.2
lightgbm          4.3.0
MarkupSafe        2.1.5
mpmath            1.3.0
networkx          3.2.1
numpy             1.26.4
packaging         24.0
pip               24.0
pluggy            1.4.0
pytest            8.1.1
scikit-learn      1.4.1.post1
scipy             1.12.0
setuptools        65.5.0
sympy             1.12
threadpoolctl     3.3.0
torch             2.2.1
typing_extensions 4.10.0

malfet added the module: openmp label on May 28, 2024
@MarcBresson

Any news on this issue? We are having the same problem.


connortann commented Aug 7, 2024

Over at the "shap" project we are still seeing this issue on CI, and it's preventing us from testing against the latest PyTorch on macOS. Example failing run here. We still see the issue with torch==2.4.0.

@malfet to help the investigation progress, here's a full minimal GitHub Actions workflow to reproduce the error:

# run_tests.yml
# The `on:` trigger below is an assumption; the snippet as posted omitted it.
on: push
jobs:
  run_tests:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"  # quoted so YAML does not parse it as a float
      - run: brew install libomp
      - run: pip install pytest torch scikit-learn lightgbm
      - run: pip list
      - run: pytest --noconftest test_bug.py

# test_bug.py
import time

import lightgbm  # unused, but importing it loads the system libomp
import torch
from sklearn.datasets import fetch_california_housing


def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)

Leads to Fatal Python error: Segmentation fault. Full output:

Run pytest --noconftest tests/test_bug121101.py
============================= test session starts ==============================
platform darwin -- Python 3.11.9, pytest-8.3.2, pluggy-1.5.0
rootdir: /Users/runner/work/shap/shap
configfile: pyproject.toml
collected 1 item

Fatal Python error: Segmentation fault

Thread 0x0000000204c1cc00 (most recent call first):
tests/test_bug121101.py 
  File "/Users/runner/work/shap/shap/tests/test_bug121101.py", line 12 in test_something
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/python.py", line 159 in pytest_pyfunc_call
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/python.py", line 1627 in runtest
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 174 in pytest_runtest_call
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 242 in <lambda>
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 341 in from_call
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 241 in call_and_report
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 132 in runtestprotocol
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 113 in pytest_runtest_protocol
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/main.py", line 362 in pytest_runtestloop
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/main.py", line Fatal Python error: Segmentation fault

337 in _main
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/main.py", line 283 in wrap_session
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/main.py", line 330 in pytest_cmdline_main
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/config/__init__.py", line 175 in main
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/config/__init__.py", line 201 in console_main
  File "/Users/runner/hostedtoolcache/Python/3.11.9/arm64/bin/pytest", line 8 in <module>

Extension modules: numpy._core._multiarray_umath, numpy._core._multiarray_tests, numpy.linalg._umath_linalg, scipy._lib._ccallback_c, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator
Extension modules: , numpy._core._multiarray_umathscipy.sparse._sparsetools, numpy._core._multiarray_tests, _csparsetools, numpy.linalg._umath_linalg, scipy.sparse._csparsetools, scipy._lib._ccallback_c, scipy.linalg._fblas, numpy.random._common, scipy.linalg._flapack, numpy.random.bit_generator, , scipy.linalg.cython_lapacknumpy.random._bounded_integers, , scipy.linalg._cythonized_array_utilsnumpy.random._mt19937, , scipy.linalg._solve_toeplitznumpy.random.mtrand, , numpy.random._philoxscipy.linalg._decomp_lu_cython, numpy.random._pcg64, scipy.linalg._matfuncs_sqrtm_triu, numpy.random._sfc64, scipy.linalg.cython_blas, numpy.random._generator, scipy.linalg._matfuncs_expm, scipy.sparse._sparsetools, scipy.linalg._decomp_update, _csparsetools, , scipy.sparse._csparsetoolsscipy.sparse.linalg._dsolve._superlu, , scipy.linalg._fblasscipy.sparse.linalg._eigen.arpack._arpack, scipy.linalg._flapack, , scipy.linalg.cython_lapackscipy.sparse.linalg._propack._spropack, scipy.linalg._cythonized_array_utils, scipy.sparse.linalg._propack._dpropack, scipy.linalg._solve_toeplitz, scipy.sparse.linalg._propack._cpropack, scipy.linalg._decomp_lu_cython, scipy.sparse.linalg._propack._zpropack, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.sparse.csgraph._tools, scipy.linalg._matfuncs_expm, scipy.sparse.csgraph._shortest_path, scipy.linalg._decomp_update, scipy.sparse.csgraph._traversal, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, , scipy.sparse.csgraph._min_spanning_treescipy.sparse.linalg._propack._spropack, , scipy.sparse.csgraph._flowscipy.sparse.linalg._propack._dpropack, , scipy.sparse.csgraph._matchingscipy.sparse.linalg._propack._cpropack, , scipy.sparse.csgraph._reorderingscipy.sparse.linalg._propack._zpropack, , scipy.sparse.csgraph._toolssklearn.__check_build._check_build, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, 
scipy.sparse.csgraph._matching, , scipy.sparse.csgraph._reorderingscipy.special._ufuncs_cxx, , sklearn.__check_build._check_buildscipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ufuncs_cxx, scipy.special._ellip_harm_2, scipy.special._ufuncs, scipy.spatial._ckdtree, scipy.special._specfun, scipy._lib.messagestream, scipy.special._comb, scipy.spatial._qhull, scipy.special._ellip_harm_2, scipy.spatial._voronoi, scipy.spatial._ckdtree, , scipy.spatial._distance_wrapscipy._lib.messagestream, , scipy.spatial._hausdorffscipy.spatial._qhull, scipy.spatial._voronoi, , scipy.spatial._distance_wrapscipy.spatial.transform._rotation, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.special.cython_special, scipy.stats._stats, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, scipy.stats._sobol, scipy.stats._qmc_cy, , scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNCscipy.stats._mvn, scipy.optimize._cobyla, 
scipy.stats._rcont.rcont, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.stats._unuran.unuran_wrapper, scipy.optimize._lsq.givens_elimination, , scipy.optimize._zeros, scipy.ndimage._nd_imagescipy.optimize._highs.cython.src._highs_wrapper, , scipy.optimize._highs._highs_wrapper_ni_label, , scipy.optimize._highs.cython.src._highs_constantsscipy.ndimage._ni_label, scipy.optimize._highs._highs_constants, sklearn.utils._isfinite, scipy.linalg._interpolative, sklearn.utils.sparsefuncs_fast, scipy.optimize._bglu_dense, sklearn.utils.murmurhash, scipy.optimize._lsap, , sklearn.utils._openmp_helpersscipy.optimize._direct, scipy.integrate._odepack, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics.cluster._expected_mutual_info_fast, scipy.integrate._quadpack, sklearn.metrics._dist_metrics, scipy.integrate._vode, sklearn.metrics._pairwise_distances_reduction._datasets_pair, scipy.integrate._dop, scipy.integrate._lsoda, sklearn.utils._cython_blas, scipy.interpolate._fitpack, sklearn.metrics._pairwise_distances_reduction._base, scipy.interpolate._dfitpack, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, scipy.interpolate._bspl, sklearn.utils._heap, scipy.interpolate._ppoly, sklearn.utils._sorting, scipy.interpolate.interpnd, sklearn.metrics._pairwise_distances_reduction._argkmin, scipy.interpolate._rbfinterp_pythran, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, scipy.interpolate._rgi_cython, scipy.special.cython_special, sklearn.utils._vector_sentinel, scipy.stats._stats, , sklearn.metrics._pairwise_distances_reduction._radius_neighborsscipy.stats._biasedurn, , sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmodescipy.stats._levy_stable.levyst, , scipy.stats._stats_pythransklearn.metrics._pairwise_fast, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, sklearn.utils._random, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, 
torch._C, scipy.stats._rcont.rcont, , scipy.stats._unuran.unuran_wrappertorch._C._fft, , scipy.ndimage._nd_imagetorch._C._linalg, , _ni_labeltorch._C._nested, , scipy.ndimage._ni_labeltorch._C._nn, , sklearn.utils._isfinitetorch._C._sparse, , sklearn.utils.sparsefuncs_fasttorch._C._special, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sklearn.utils._random, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, , scipy.io.matlab._mio_utilstorch._C._nn, torch._C._sparse, scipy.io.matlab._streams, torch._C._special, scipy.io.matlab._mio5_utils, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, , sklearn.datasets._svmlight_format_fastscipy.io.matlab._mio5_utils, sklearn.datasets._svmlight_format_fast, sklearn.feature_extraction._hashing_fast (total: 130, )sklearn.feature_extraction._hashing_fast
 (total: 130)
/Users/runner/work/_temp/7013399c-b6ff-43a4-b289-cc08191dbadb.sh: line 1:  2783 Segmentation fault: 11  pytest --noconftest tests/test_bug121101.py

Result of pip list:

Package           Version
----------------- --------
certifi           2024.7.4
filelock          3.15.4
fsspec            2024.6.1
iniconfig         2.0.0
Jinja2            3.1.4
joblib            1.4.2
lightgbm          4.5.0
MarkupSafe        2.1.5
mpmath            1.3.0
networkx          3.3
numpy             2.0.1
packaging         24.1
pip               24.2
pluggy            1.5.0
pytest            8.3.2
scikit-learn      1.5.1
scipy             1.14.0
setuptools        65.5.0
sympy             1.13.1
threadpoolctl     3.5.0
torch             2.4.0

Contributor

malfet commented Aug 7, 2024

@connortann thank you for the reproducer. Crash is due to multiple OpenMP runtimes loaded into the process address space:

$ lldb -- python bug-121101.py
(lldb) r
Process 16319 launched: '/Users/malfet/py3.12-torch2.4/bin/python' (arm64)
Process 16319 stopped
* thread #2, stop reason = exec
    frame #0: 0x0000000100014b70 dyld`_dyld_start
dyld`_dyld_start:
->  0x100014b70 <+0>:  mov    x0, sp
    0x100014b74 <+4>:  and    sp, x0, #0xfffffffffffffff0
    0x100014b78 <+8>:  mov    x29, #0x0 ; =0 
    0x100014b7c <+12>: mov    x30, #0x0 ; =0 
(lldb) c
Process 16319 resuming
Process 16319 stopped
* thread #3, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
->  0x106428cf0 <+48>: ldr    x19, [x8, w0, sxtw #3]
    0x106428cf4 <+52>: mov    x0, x19
    0x106428cf8 <+56>: bl     0x106428434    ; __kmp_suspend_initialize_thread
    0x106428cfc <+60>: mov    x0, x19
  thread #4, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
    frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
->  0x106428cf0 <+48>: ldr    x19, [x8, w0, sxtw #3]
    0x106428cf4 <+52>: mov    x0, x19
    0x106428cf8 <+56>: bl     0x106428434    ; __kmp_suspend_initialize_thread
    0x106428cfc <+60>: mov    x0, x19
  thread #5, stop reason = EXC_BAD_ACCESS (code=1, address=0x18)
    frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
->  0x106428cf0 <+48>: ldr    x19, [x8, w0, sxtw #3]
    0x106428cf4 <+52>: mov    x0, x19
    0x106428cf8 <+56>: bl     0x106428434    ; __kmp_suspend_initialize_thread
    0x106428cfc <+60>: mov    x0, x19
  thread #6, stop reason = EXC_BAD_ACCESS (code=1, address=0x20)
    frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
->  0x106428cf0 <+48>: ldr    x19, [x8, w0, sxtw #3]
    0x106428cf4 <+52>: mov    x0, x19
    0x106428cf8 <+56>: bl     0x106428434    ; __kmp_suspend_initialize_thread
    0x106428cfc <+60>: mov    x0, x19
  thread #8, stop reason = EXC_BAD_ACCESS (code=1, address=0x30)
    frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
->  0x106428cf0 <+48>: ldr    x19, [x8, w0, sxtw #3]
    0x106428cf4 <+52>: mov    x0, x19
    0x106428cf8 <+56>: bl     0x106428434    ; __kmp_suspend_initialize_thread
    0x106428cfc <+60>: mov    x0, x19
(lldb) image list libomp.dylib
[  0] E3A31AB3-3AE5-3371-87D0-7FD870A41A0D 0x00000001034f4000 /Users/malfet/py3.12-torch2.4/lib/python3.12/site-packages/sklearn/.dylibs/libomp.dylib 
[  1] ACB8253B-DF8F-36C8-8100-C896CD3382ED 0x00000001063d4000 /opt/homebrew/Cellar/libomp/18.1.4/lib/libomp.dylib 
[  2] F53B1E01-AF16-30FC-8690-F7B131EB6CE5 0x0000000106744000 /Users/malfet/py3.12-torch2.4/lib/python3.12/site-packages/torch/lib/libomp.dylib 
(lldb) 
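
The image list output above shows three separate copies of libomp.dylib mapped into one process. As a rough sketch (not part of the original investigation), a listing like that can be scanned for duplicate OpenMP runtimes with a few lines of Python. The function name openmp_runtimes and the set of library basenames it recognises are assumptions made for illustration, and paths containing spaces are not handled.

```python
import re

# Basenames of common OpenMP runtimes: LLVM (libomp), Intel (libiomp5), GNU (libgomp).
# This list is an assumption for illustration, not an exhaustive catalogue.
_OPENMP_NAME = re.compile(r"^lib(omp|iomp5|gomp)(\.\d+)*\.(dylib|so)(\.\d+)*$")


def openmp_runtimes(listing: str) -> list[str]:
    """Return the distinct OpenMP runtime paths mentioned in a loaded-library listing."""
    found: list[str] = []
    for line in listing.splitlines():
        for token in line.split():
            basename = token.rsplit("/", 1)[-1]
            if token.startswith("/") and _OPENMP_NAME.match(basename) and token not in found:
                found.append(token)
    return found


# The three entries reported by `image list libomp.dylib` in the comment above.
listing = """\
[  0] E3A31AB3-3AE5-3371-87D0-7FD870A41A0D 0x00000001034f4000 /Users/malfet/py3.12-torch2.4/lib/python3.12/site-packages/sklearn/.dylibs/libomp.dylib
[  1] ACB8253B-DF8F-36C8-8100-C896CD3382ED 0x00000001063d4000 /opt/homebrew/Cellar/libomp/18.1.4/lib/libomp.dylib
[  2] F53B1E01-AF16-30FC-8690-F7B131EB6CE5 0x0000000106744000 /Users/malfet/py3.12-torch2.4/lib/python3.12/site-packages/torch/lib/libomp.dylib
"""

runtimes = openmp_runtimes(listing)
if len(runtimes) > 1:
    print(f"{len(runtimes)} OpenMP runtimes loaded:")
    for path in runtimes:
        print("  " + path)
```

The same function should also work on other whitespace-separated listings of loaded libraries, since it only looks at tokens that are absolute paths ending in a recognised runtime name.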


connortann commented Aug 7, 2024

If I comment out the brew install libomp step on CI, we get a different error: Library not loaded: @rpath/libomp.dylib.
According to this comment, microsoft/LightGBM#6262 (comment), that error apparently occurs when OpenMP is not installed.

Full traceback if brew install libomp is commented out:

Run pytest --noconftest tests/test_bug121101.py
============================= test session starts ==============================
platform darwin -- Python 3.11.9, pytest-8.3.2, pluggy-1.5.0
rootdir: /Users/runner/work/shap/shap
configfile: pyproject.toml
collected 0 items / 1 error

==================================== ERRORS ====================================
___________________ ERROR collecting tests/test_bug121101.py ___________________
tests/test_bug121101.py:5: in <module>
    import lightgbm
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/__init__.py:9: in <module>
    from .basic import Booster, Dataset, Sequence, register_logger
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/basic.py:281: in <module>
    _LIB = _load_lib()
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/basic.py:265: in _load_lib
    lib = ctypes.cdll.LoadLibrary(lib_path[0])
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ctypes/__init__.py:454: in LoadLibrary
    return self._dlltype(name)
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ctypes/__init__.py:376: in __init__
    self._handle = _dlopen(self._name, mode)
E   OSError: dlopen(/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.dylib, 0x0006): Library not loaded: @rpath/libomp.dylib
E     Referenced from: <D3923ACB-D836-32D3-A031-CF91999FDAFC> /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.dylib
E     Reason: tried: '/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/opt/local/lib/libomp/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/local/lib/libomp/libomp.dylib' (no such file), '/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/opt/local/lib/libomp/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/local/lib/libomp/libomp.dylib' (no such file)
=========================== short test summary info ============================
ERROR tests/test_bug121101.py - OSError: dlopen(/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.dylib, 0x0006): Library not loaded: @rpath/libomp.dylib
  Referenced from: <D3923ACB-D836-32D3-A031-CF91999FDAFC> /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.dylib
  Reason: tried: '/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/opt/local/lib/libomp/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/local/lib/libomp/libomp.dylib' (no such file), '/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/opt/local/lib/libomp/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/local/lib/libomp/libomp.dylib' (no such file)
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 0.95s ===============================
Error: Process completed with exit code 2.

Contributor

malfet commented Aug 7, 2024

To be frank, I'm unsure if the problem lies solely with PyTorch at this point, as two other libraries are also loading libomp, and there isn't much one can do short of disabling OpenMP (which one can do programmatically by calling torch.set_num_threads(1)).

Contributor

malfet commented Aug 7, 2024

@connortann can you please try adding torch.set_num_threads(1) at the start of your test and let me know whether it fixes the problem? (It works for me locally.)


connortann commented Aug 7, 2024

Yep certainly: the tests do indeed pass with torch.set_num_threads(1).
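
For projects where torch is an optional test dependency, the workaround confirmed above could be wrapped in a small helper. This is only a sketch: the helper name limit_torch_threads is hypothetical, and it only avoids the crash by turning off PyTorch's intra-op parallelism, as suggested earlier in the thread. It does not remove the duplicate libomp copies from the process.

```python
import importlib.util


def limit_torch_threads(num_threads: int = 1) -> bool:
    """Cap PyTorch's intra-op thread count.

    Returns False (and does nothing) when torch is not installed, so the
    helper can be called unconditionally from a shared test module.
    """
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the helper also works without torch

    # Workaround from this thread: with a single thread, PyTorch does not
    # spin up the OpenMP worker threads that were crashing.
    torch.set_num_threads(num_threads)
    return True
```

Calling limit_torch_threads(1) as the first statement of test_something (or once at module import time) mirrors what was tested here. The cost is that CPU tensor operations in that process run single-threaded.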

I'm unsure if problem lies solely with PyTorch at this point

Indeed, the segfault only occurs when lightgbm is imported first. Possibly relevant: we had a separate segfault issue when torch was imported before lightgbm, as described in this comment: shap/shap#3092 (comment)

I hope that collectively we can find a fix: torch and lightgbm are both extremely popular libraries, so it is quite common for them to be installed in the same environment.


connortann commented Aug 7, 2024

I cross-posted to LightGBM because, as you say, the problem doesn't seem to lie solely with PyTorch: microsoft/LightGBM#6595


yuygfgg commented Aug 31, 2024

I'd add that this PyTorch segmentation fault on macOS does not necessarily require LightGBM. Other packages, such as vapoursynth, can cause the same problem.

@lorentzenchr

As this issue requires a community effort, it is maybe best to centralize the discussion.
@malfet would you be willing to join microsoft/LightGBM#6595 (comment)?

@starteleport

I am having this problem as well.

My objective is to run the https://github.com/black-forest-labs/flux demo with PyTorch 2.4.1 on an Intel MacBook Pro's Radeon 5500M.

What I've done so far:

  • Installed Anaconda
  • Built a PyTorch wheel from tag 2.4.1 with pytorch/builder called the way it used to be called from CircleCI before x64 was dropped
  • Verified it works with MPS with a small smoke test: python -c "import torch; print(torch.backends.mps.is_available())"
  • Created a new Conda env in order to run flux
  • Tried installing my own PyTorch wheel and discovered that I need to build torchvision myself as well, because it references torch package and would otherwise conflict
  • Built torchvision and installed the wheel into venv
  • Got my first segfault, similar to Segmentation error for torch==2.2.1 on MacOs #121101 (comment)
  • Ran the flux script with DYLD_PRINT_LIBRARIES=1 and noticed that libiomp5.dylib is being loaded both from torch and functorch
  • Built functorch with my torch wheel

After all that the segfault wouldn't go away.

I'm ready to dig into the issue, but I need some guidance/fresh ideas to facilitate the investigation.

@lorentzenchr

@gchanan @dzhulgakov @ezyang @malfet If you could have a look and participate in the discussion in microsoft/LightGBM#6595, that would be highly appreciated. I consider those kinds of bugs among the worst for users.

This issue is mainly caused by PyTorch. The short summary of microsoft/LightGBM#6595 (comment) is:

torch vendors a libomp.dylib (without library or symbol name mangling) and always prefers that vendored copy to a system installation.

lightgbm searches for a system installation.

As a result, if you've installed both these libraries via wheels on macOS, loading both will result in 2 copies of libomp.dylib being loaded. This may or may not show up as runtime issues... unpredictable, because symbol resolution is lazy by default and therefore depends on the code paths used.

Even if all copies of libomp.dylib loaded into the process are ABI-compatible with each other, there can still be runtime segfaults as a result of mixing symbols from libraries loaded at different memory addresses, I think.
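
To check whether an environment is exposed to this, one option is to look for OpenMP dylibs bundled inside installed wheels. The sketch below is an illustration under stated assumptions: the function name vendored_openmp and the filename patterns are made up for this example, and the layout it expects (for instance torch/lib/libomp.dylib and sklearn/.dylibs/libomp.dylib) is taken from the lldb output earlier in this thread.

```python
import sysconfig
from pathlib import Path

# Filename patterns for bundled OpenMP runtimes (assumed, not exhaustive).
OPENMP_PATTERNS = ("libomp*.dylib", "libiomp5*.dylib", "libgomp*.dylib")


def vendored_openmp(site_packages: Path) -> dict[str, list[Path]]:
    """Map each top-level package directory to the OpenMP dylibs found inside it."""
    bundled: dict[str, list[Path]] = {}
    for pattern in OPENMP_PATTERNS:
        # rglob also descends into dot-directories such as sklearn/.dylibs.
        for lib in sorted(site_packages.rglob(pattern)):
            package = lib.relative_to(site_packages).parts[0]
            bundled.setdefault(package, []).append(lib)
    return bundled


if __name__ == "__main__":
    root = Path(sysconfig.get_paths()["purelib"])
    for package, libs in vendored_openmp(root).items():
        for lib in libs:
            print(f"{package}: {lib}")
```

More than one package in the result, or one package plus a system libomp from Homebrew, means several copies can end up loaded together once those packages are imported in the same process.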

Author

CloseChoice commented May 6, 2025

Any progress here? With Python 3.12 the error seems to be thrown with torch 2.2.0, 2.2.1 and 2.7.0.

7 participants