
MAINT Remove -Wcpp warnings when compiling sklearn.svm._liblinear #25112


Merged
merged 19 commits into scikit-learn:main from the cython_liblinear branch on Jan 9, 2023

Conversation

OmarManzoor
Contributor

Reference Issues/PRs

Towards #24875

What does this implement/fix? Explain your changes.

  • Used memory views to replace the deprecated cnp.ndarray in sklearn.svm._liblinear
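In Cython, this fix amounts to replacing buffer-typed `cnp.ndarray` arguments with typed memoryviews. A minimal sketch of the pattern (hypothetical function name, not the actual signature in sklearn/svm/_liblinear.pyx):

```cython
cimport numpy as cnp

# Before: buffer-typed cnp.ndarray arguments rely on the deprecated
# NumPy C API and trigger -Wcpp warnings when compiling:
#     def fit_dense(cnp.ndarray[cnp.float64_t, ndim=2] X): ...

# After: a typed memoryview compiles cleanly against the NumPy 1.7+
# C API and still gives fast, GIL-releasable element access.
def fit_dense(const double[:, ::1] X):
    cdef Py_ssize_t i, j
    cdef double total = 0.0
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            total += X[i, j]
    return total
```

Typed memoryviews also accept any object exposing the buffer protocol, not just NumPy arrays, which is why they are the recommended replacement in the Cython 3.0 migration guide.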

Any other comments?

@OmarManzoor OmarManzoor marked this pull request as draft December 5, 2022 10:06
@OmarManzoor OmarManzoor marked this pull request as ready for review December 5, 2022 11:21
@OmarManzoor OmarManzoor requested a review from jjerphan December 5, 2022 11:21
Member

@jjerphan jjerphan left a comment


You found an even better solution. 👍

This LGTM given that compilation passes when "sklearn.svm._liblinear" is added to USE_NEWEST_NUMPY_C_API:

scikit-learn/setup.py

Lines 64 to 70 in cbfb6ab

# XXX: add new extensions to this list when they
# are not using the old NumPy C API (i.e. version 1.7)
# TODO: when Cython>=3.0 is used, make sure all Cython extensions
# use the newest NumPy C API by `#defining` `NPY_NO_DEPRECATED_API` to be
# `NPY_1_7_API_VERSION`, and remove this list.
# See: https://github.com/cython/cython/blob/1777f13461f971d064bd1644b02d92b350e6e7d1/docs/src/userguide/migrating_to_cy30.rst#numpy-c-api # noqa
USE_NEWEST_NUMPY_C_API = (
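For a single extension, the equivalent opt-in can also be expressed at the top of the .pyx file with a distutils directive comment (a per-file sketch; scikit-learn instead centralizes this in setup.py via USE_NEWEST_NUMPY_C_API):

```cython
# distutils: define_macros=NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION

# With this macro defined, any remaining use of a deprecated NumPy C API
# construct becomes a compile-time error rather than a -Wcpp warning.
cimport numpy as cnp

cnp.import_array()
```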

Member

@jjerphan jjerphan left a comment


LGTM!

@ogrisel
Member

ogrisel commented Dec 6, 2022

Just to make sure my comment #25112 (comment) above is not missed: the current state of this PR introduces an additional memory copy of the training set, which we want to avoid in this wrapper.

@jjerphan
Member

jjerphan commented Dec 8, 2022

I've opened OmarManzoor#1 to propose a resolution.

jjerphan and others added 3 commits December 8, 2022 09:04
* Use separate memory views for float64 and float32 to handle the possible dtypes of X
* Separate the functionality to get the bytes of X into functions for sparse and dense ndarrays
@jjerphan
Member

jjerphan commented Dec 8, 2022

I've merged main into this branch to fix the failures on the CI (they were addressed by #25136, which was merged into main earlier today).

@OmarManzoor
Contributor Author

I've merged main into this branch to fix the failures on the CI (they were addressed by #25136, which was merged into main earlier today).

The CI failures were caused by an actual error. I had split the code out into two functions, but that seems to cause some kind of memory issue, as the code crashes when running the common tests. I have pushed the latest changes so that all the functionality remains in the train_wrap function.

@OmarManzoor
Contributor Author

@jjerphan Just for my understanding, could you kindly clarify why using separate functions leads to a memory leak here (407f456), while implementing everything directly in the train_wrap function seems to work fine?

cdef parameter *param
cdef problem *problem
cdef model *model
cdef char_const_ptr error_msg
cdef int len_w
cdef bint x_has_type_float64 = X.dtype == np.float64
Member


Even though variables in Python are conventionally snake-cased, in scikit-learn we do capitalize X and related variables. In this case, we would have:

Suggested change
cdef bint x_has_type_float64 = X.dtype == np.float64
cdef bint X_has_type_float64 = X.dtype == np.float64

This suggestion also applies to similar variables hereinafter.
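The `X_has_type_float64` flag then drives a dtype dispatch along these lines (a simplified sketch; `dense_to_sparse` and `dense_to_sparse32` are hypothetical stand-ins for the liblinear conversion helpers, not the PR's exact code):

```cython
cdef const double[::1] X_data64
cdef const float[::1] X_data32

if X_has_type_float64:
    # dtype already matches: the memoryview wraps X's buffer, no copy
    X_data64 = X.reshape(-1)
    problem.x = dense_to_sparse(&X_data64[0], n_samples, n_features)
else:
    X_data32 = X.reshape(-1)
    problem.x = dense_to_sparse32(&X_data32[0], n_samples, n_features)
```

Keeping two explicitly typed memoryviews avoids both a dtype-converting copy and any reinterpretation through untyped byte buffers.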

@jjerphan
Member

jjerphan commented Dec 8, 2022

@jjerphan Just for my understanding, could you kindly clarify why using separate functions leads to a memory leak here (407f456), while implementing everything directly in the train_wrap function seems to work fine?

Have you observed memory leaks or other errors due to memory management?

Intuitively, if you wrap the logic in functions, the memoryviews are local, stack-allocated structs in that context, and the returned pointer points to an invalid region of memory after the function returns.

@OmarManzoor
Contributor Author

OmarManzoor commented Dec 9, 2022

@jjerphan Just for my understanding, could you kindly clarify why using separate functions leads to a memory leak here (407f456), while implementing everything directly in the train_wrap function seems to work fine?

Have you observed memory leaks or other errors due to memory management?

Intuitively, if you wrap the logic in functions, the memoryviews are local, stack-allocated structs in that context, and the returned pointer points to an invalid region of memory after the function returns.

Actually I am not sure whether it was a memory leak or some other memory-related issue, since all I observed was a Python segmentation fault when I ran test_common.py with pytest.

Fatal Python error: Segmentation fault
Current thread 0x00000001052c8580 (most recent call first):
  File "/Users/omarsalman/Projects/scikit-learn/sklearn/svm/_base.py", line 1224 in _fit_liblinear

Thank you for the intuitive explanation.

@jjerphan
Member

jjerphan commented Dec 9, 2022

OK, the "segmentation fault" is a (usually fatal) C-level error which, in your case, is caught and reported back by Python.

This error indicates that the process accessed an invalid segment of memory during its execution. If this error were not fatal (due to some handling done by CPython), memory-management problems (like memory leaks) might occur.

I think the remark that I gave in #25112 (comment) above holds in this case: you are returning addresses of local variables (i.e. variables that are temporarily allocated in one of the process' memory segments: the stack) via pointers. If such a pointer is then dereferenced to access the value it points to, a segmentation fault is raised, because the local variable has been deallocated on return.

This does not appear in the inline boilerplate I propose, because those local variables are still accessible via pointers as they exist in the same scope.

Is this remark understandable?
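A minimal sketch of the failure mode described here (illustrative names, not the PR's exact code):

```cython
import numpy as np

cdef double * get_X_ptr(object X):
    # Broken: ascontiguousarray may create a *temporary* array whose
    # only reference is this local memoryview. When the function
    # returns, the memoryview is destroyed, the temporary is freed,
    # and the returned pointer dangles.
    cdef const double[::1] data = np.ascontiguousarray(
        X, dtype=np.float64
    ).reshape(-1)
    return <double *> &data[0]

# Safe: declare the memoryview as a local of train_wrap itself, so the
# buffer stays alive for the whole call into liblinear:
#     cdef const double[::1] data = ...
#     problem.x = <void *> &data[0]
```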

@OmarManzoor
Contributor Author

OK, the "segmentation fault" is a (usually fatal) C-level error which, in your case, is caught and reported back by Python.

This error indicates that the process accessed an invalid segment of memory during its execution. If this error were not fatal (due to some handling done by CPython), memory-management problems (like memory leaks) might occur.

I think the remark that I gave in #25112 (comment) above holds in this case: you are returning addresses of local variables (i.e. variables that are temporarily allocated in one of the process' memory segments: the stack) via pointers. If such a pointer is then dereferenced to access the value it points to, a segmentation fault is raised, because the local variable has been deallocated on return.

This does not appear in the inline boilerplate I propose, because those local variables are still accessible via pointers as they exist in the same scope.

Is this remark understandable?

Yes, thank you very much for the explanation!

@OmarManzoor OmarManzoor requested a review from ogrisel December 15, 2022 08:56
@OmarManzoor
Contributor Author

@ogrisel Could you kindly have a look at this PR again when you get the time?

Member

@ogrisel ogrisel left a comment


This looks good to me. The build log is clean, +1 for merge!

Thanks for the contribution @OmarManzoor!

@ogrisel ogrisel enabled auto-merge (squash) January 6, 2023 13:57
@OmarManzoor
Contributor Author

@ogrisel, @jjerphan
I think this got stuck and did not complete. Should I merge the latest main into the branch to trigger the CI again?

@jjerphan
Member

jjerphan commented Jan 9, 2023

The pending runs for Circle CI are independent of those changes and also happen in other PRs.

I am merging this manually and will take responsibility if this merge creates any problems (this is very unlikely in my opinion, as Linux ARM64 is the only untested configuration and all the other configurations' test suites pass).

@jjerphan jjerphan merged commit 0a36bd8 into scikit-learn:main Jan 9, 2023
@OmarManzoor
Contributor Author

The pending runs for Circle CI are independent of those changes and also happen in other PRs.

I am merging this manually and will take responsibility if this merge creates any problems (this is very unlikely in my opinion, as Linux ARM64 is the only untested configuration and all the other configurations' test suites pass).

Thank you!

@OmarManzoor OmarManzoor deleted the cython_liblinear branch January 9, 2023 08:06
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
…ikit-learn#25112)

* MAINT Remove -Wcpp warnings when compiling sklearn.svm._liblinear

* Convert the required data in X to bytes

* Add NULL check for class_weight_label

* Add NULL check for class_weight

* Add sklearn.svm._liblinear in setup.py

* Use intermediate memoryviews for static reinterpretation of dtypes

* * Remove the usage of tobytes()
* Use separate memory views for float64 and float32 to handle the possible dtypes of X
* Separate the functionality to get the bytes of X into functions for sparse and dense ndarrays

* Remove the use of functions and implement the functionality directly in train_wrap

* Minor refactor

* Refactor variables names involving X to use capital X

* Define y, class_weight and sample_weight as const memory views

* Use const with X_indices and X_indptr memory view declarations

Co-authored-by: Julien Jerphanion <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>