
MAINT Remove -Wcpp warnings when compiling sklearn.svm._liblinear #25112


Merged
merged 19 commits into scikit-learn:main from the cython_liblinear branch on Jan 9, 2023

Conversation

OmarManzoor
Contributor

Reference Issues/PRs

Towards #24875

What does this implement/fix? Explain your changes.

  • Used memory views to replace the deprecated cnp.ndarray in sklearn.svm._liblinear
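In Cython, this fix amounts to replacing buffer-typed `cnp.ndarray` arguments with typed memoryviews. A minimal sketch of the pattern (hypothetical function name, not the actual signature in sklearn/svm/_liblinear.pyx):

```cython
cimport numpy as cnp

# Before: buffer-typed cnp.ndarray arguments rely on the deprecated
# NumPy C API and trigger -Wcpp warnings when compiling:
#     def fit_dense(cnp.ndarray[cnp.float64_t, ndim=2] X): ...

# After: a typed memoryview compiles cleanly against the NumPy 1.7+
# C API and still gives fast, GIL-releasable element access.
def fit_dense(const double[:, ::1] X):
    cdef Py_ssize_t i, j
    cdef double total = 0.0
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            total += X[i, j]
    return total
```

Typed memoryviews also accept any object exposing the buffer protocol, not just NumPy arrays, which is why they are the recommended replacement in the Cython 3.0 migration guide.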

Any other comments?

@OmarManzoor OmarManzoor marked this pull request as draft December 5, 2022 10:06
@OmarManzoor OmarManzoor marked this pull request as ready for review December 5, 2022 11:21
@OmarManzoor OmarManzoor requested a review from jjerphan December 5, 2022 11:21
Member

@jjerphan jjerphan left a comment


You found an even better solution. 👍

This LGTM given that compilation passes when "sklearn.svm._liblinear" is added to USE_NEWEST_NUMPY_C_API:

scikit-learn/setup.py

Lines 64 to 70 in cbfb6ab

# XXX: add new extensions to this list when they
# are not using the old NumPy C API (i.e. version 1.7)
# TODO: when Cython>=3.0 is used, make sure all Cython extensions
# use the newest NumPy C API by `#defining` `NPY_NO_DEPRECATED_API` to be
# `NPY_1_7_API_VERSION`, and remove this list.
# See: https://github.com/cython/cython/blob/1777f13461f971d064bd1644b02d92b350e6e7d1/docs/src/userguide/migrating_to_cy30.rst#numpy-c-api # noqa
USE_NEWEST_NUMPY_C_API = (
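For a single extension, the equivalent opt-in can also be expressed at the top of the .pyx file with a distutils directive comment (a per-file sketch; scikit-learn instead centralizes this in setup.py via USE_NEWEST_NUMPY_C_API):

```cython
# distutils: define_macros=NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION

# With this macro defined, any remaining use of a deprecated NumPy C API
# construct becomes a compile-time error rather than a -Wcpp warning.
cimport numpy as cnp

cnp.import_array()
```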

Member

@jjerphan jjerphan left a comment


LGTM!

@ogrisel
Member

ogrisel commented Dec 6, 2022

Just to make sure my comment #25112 (comment) above is not missed: the current state of this PR introduces an additional memory copy of the training set, which we want to avoid in this wrapper.

@jjerphan
Member

jjerphan commented Dec 8, 2022

I've opened OmarManzoor#1 to propose a resolution.

jjerphan and others added 3 commits December 8, 2022 09:04
* Use separate memory views for float64 and float32 to handle the possible dtypes of X
* Separate the functionality to get the bytes of X into functions for sparse and dense ndarrays
@jjerphan
Member

jjerphan commented Dec 8, 2022

I've merged main into this branch to fix the failures on the CI (they were addressed by #25136, which was merged into main earlier today).

@OmarManzoor
Contributor Author

I've merged main into this branch to fix the failures on the CI (they were addressed by #25136, which was merged into main earlier today).

The CI failures were caused by an actual error. I had split the code out into two functions, but that seems to cause some kind of memory issue, as the code crashes when running the common tests. I have pushed the latest changes so that all the functionality remains in the train_wrap function.

@OmarManzoor
Contributor Author

@jjerphan Just for my understanding, could you kindly clarify why using separate functions leads to a memory leak here (407f456), while implementing everything directly in the train_wrap function seems to work fine?

cdef parameter *param
cdef problem *problem
cdef model *model
cdef char_const_ptr error_msg
cdef int len_w
cdef bint x_has_type_float64 = X.dtype == np.float64
Member


Even though variables in Python are conventionally snake-cased, in scikit-learn we do capitalize X and related variables. In this case, we would have:

Suggested change
cdef bint x_has_type_float64 = X.dtype == np.float64
cdef bint X_has_type_float64 = X.dtype == np.float64

This suggestion also applies to similar variables hereinafter.
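The `X_has_type_float64` flag then drives a dtype dispatch along these lines (a simplified sketch; `dense_to_sparse` and `dense_to_sparse32` are hypothetical stand-ins for the liblinear conversion helpers, not the PR's exact code):

```cython
cdef const double[::1] X_data64
cdef const float[::1] X_data32

if X_has_type_float64:
    # dtype already matches: the memoryview wraps X's buffer, no copy
    X_data64 = X.reshape(-1)
    problem.x = dense_to_sparse(&X_data64[0], n_samples, n_features)
else:
    X_data32 = X.reshape(-1)
    problem.x = dense_to_sparse32(&X_data32[0], n_samples, n_features)
```

Keeping two explicitly typed memoryviews avoids both a dtype-converting copy and any reinterpretation through untyped byte buffers.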

@jjerphan
Member

jjerphan commented Dec 8, 2022

@jjerphan Just for my understanding, could you kindly clarify why using separate functions leads to a memory leak here (407f456), while implementing everything directly in the train_wrap function seems to work fine?

Have you observed memory leaks or other errors due to memory management?

Intuitively, if you wrap the logic in functions, the memoryviews are local, stack-allocated structs in that context, and the returned pointer points to an invalid region of memory after the function returns.

@OmarManzoor
Contributor Author

OmarManzoor commented Dec 9, 2022

@jjerphan Just for my understanding, could you kindly clarify why using separate functions leads to a memory leak here (407f456), while implementing everything directly in the train_wrap function seems to work fine?

Have you observed memory leaks or other errors due to memory management?

Intuitively, if you wrap the logic in functions, the memoryviews are local, stack-allocated structs in that context, and the returned pointer points to an invalid region of memory after the function returns.

Actually I am not sure whether it was a memory leak or some other memory-related issue, since all I observed was a Python segmentation fault when I ran test_common.py with pytest.

Fatal Python error: Segmentation fault
Current thread 0x00000001052c8580 (most recent call first):
  File "/Users/omarsalman/Projects/scikit-learn/sklearn/svm/_base.py", line 1224 in _fit_liblinear

Thank you for the intuitive explanation.

@jjerphan
Member

jjerphan commented Dec 9, 2022

OK, the "segmentation fault" is a (usually fatal) C-level error which, in your case, is caught and reported back by Python.

This error indicates that the process accessed an invalid segment of memory during its execution. If this error were not fatal (due to some handling done by CPython), memory-management problems (like memory leaks) might occur.

I think the remark that I gave in #25112 (comment) above holds in this case: you are returning addresses of local variables (i.e. variables that are temporarily allocated in one of the process' memory segments: the stack) via pointers. If such a pointer is then dereferenced to access the value it points to, a segmentation fault is raised, because the local variable has been deallocated on return.

This does not appear in the inline boilerplate I propose, because those local variables are still accessible via pointers as they exist in the same scope.

Is this remark understandable?
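A minimal sketch of the failure mode described here (illustrative names, not the PR's exact code):

```cython
import numpy as np

cdef double * get_X_ptr(object X):
    # Broken: ascontiguousarray may create a *temporary* array whose
    # only reference is this local memoryview. When the function
    # returns, the memoryview is destroyed, the temporary is freed,
    # and the returned pointer dangles.
    cdef const double[::1] data = np.ascontiguousarray(
        X, dtype=np.float64
    ).reshape(-1)
    return <double *> &data[0]

# Safe: declare the memoryview as a local of train_wrap itself, so the
# buffer stays alive for the whole call into liblinear:
#     cdef const double[::1] data = ...
#     problem.x = <void *> &data[0]
```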

@OmarManzoor
Contributor Author

OK, the "segmentation fault" is a (usually fatal) C-level error which, in your case, is caught and reported back by Python.

This error indicates that the process accessed an invalid segment of memory during its execution. If this error were not fatal (due to some handling done by CPython), memory-management problems (like memory leaks) might occur.

I think the remark that I gave in #25112 (comment) above holds in this case: you are returning addresses of local variables (i.e. variables that are temporarily allocated in one of the process' memory segments: the stack) via pointers. If such a pointer is then dereferenced to access the value it points to, a segmentation fault is raised, because the local variable has been deallocated on return.

This does not appear in the inline boilerplate I propose, because those local variables are still accessible via pointers as they exist in the same scope.

Is this remark understandable?

Yes, thank you very much for the explanation!

@OmarManzoor OmarManzoor requested a review from ogrisel December 15, 2022 08:56
@OmarManzoor
Contributor Author

@ogrisel Could you kindly have a look at this PR again when you get the time?

Member

@ogrisel ogrisel left a comment


This looks good to me. The build log is clean, +1 for merge!

Thanks for the contribution @OmarManzoor!

@ogrisel ogrisel enabled auto-merge (squash) January 6, 2023 13:57
@OmarManzoor
Contributor Author

@ogrisel, @jjerphan
I think this got stuck and did not complete. Should I merge the latest main into the branch to trigger the CI again?

@jjerphan
Member

jjerphan commented Jan 9, 2023

The pending runs for Circle CI are independent of those changes and also happen in other PRs.

I am merging this manually and will take responsibility if this merge creates any problems (this is very unlikely in my opinion, as Linux ARM64 is the only untested configuration and all the other configurations' test suites pass).

@jjerphan jjerphan merged commit 0a36bd8 into scikit-learn:main Jan 9, 2023
@OmarManzoor
Contributor Author

The pending runs for Circle CI are independent of those changes and also happen in other PRs.

I am merging this manually and will take responsibility if this merge creates any problems (this is very unlikely in my opinion, as Linux ARM64 is the only untested configuration and all the other configurations' test suites pass).

Thank you!

@OmarManzoor OmarManzoor deleted the cython_liblinear branch January 9, 2023 08:06
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
…ikit-learn#25112)

* MAINT Remove -Wcpp warnings when compiling sklearn.svm._liblinear

* Convert the required data in X to bytes

* Add NULL check for class_weight_label

* Add NULL check for class_weight

* Add sklearn.svm._liblinear in setup.py

* Use intermediate memoryviews for static reinterpretation of dtypes

* * Remove the usage of tobytes()
* Use separate memory views for float64 and float32 to handle the possible dtypes of X
* Separate the functionality to get the bytes of X into functions for sparse and dense ndarrays

* Remove the use of functions and implement the functionality directly in train_wrap

* Minor refactor

* Refactor variables names involving X to use capital X

* Define y, class_weight and sample_weight as const memory views

* Use const with X_indices and X_indptr memory view declarations

Co-authored-by: Julien Jerphanion <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>