ENH Allow for appropriate dtype use in preprocessing.PolynomialFeatures for sparse matrices #23731
Merged
125 commits
7eef7ad
[WIP] FIX index overflow error in sparse matrix polynomial expansion …
niuk-a 4adbf38
Merge branch 'main' into csr_polynomial
Micky774 baa98a2
Reconciled with main
Micky774 2b9187d
Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…
Micky774 55424a0
Merge branch 'main' into csr_polynomial
Micky774 69438dc
Removed extra `total_nnz` assignment
Micky774 9ecbf8a
Added fused type
Micky774 345e043
Added clarifying comment
Micky774 1d23b1d
Merge branch 'main' into csr_polynomial
Micky774 cc6a548
Added changelog entry
Micky774 8b189bb
Merge branch 'main' into csr_polynomial
Micky774 15b00fd
Fixed PR tag in changelog entry
Micky774 ee8a3ba
Apply suggestions from code review
Micky774 cd346f1
Merge branch 'main' into csr_polynomial
Micky774 0a17dee
Streamlined logic and improved tests
Micky774 b118a3c
Added test depending on scipy version
Micky774 fa1ecf2
Clarified breaking and renamed types
Micky774 d735c8f
Merge branch 'main' into csr_polynomial
Micky774 f3bb5cd
Merge branch 'main' into csr_polynomial
Micky774 0c9a563
Merge branch 'main' into csr_polynomial
Micky774 a006bf0
Merge branch 'main' into csr_polynomial
Micky774 96259d7
Apply suggestions from code review
Micky774 2e44f39
Merge branch 'main' into csr_polynomial
Micky774 377f6a9
Apply suggestions from code review
Micky774 8a77b66
Improved tests
Micky774 d2e6339
Merge branch 'csr_polynomial' of https://github.com/Micky774/scikit-l…
Micky774 c70c216
Merge branch 'main' into csr_polynomial
Micky774 a9d39a7
Initial addition -- fails with segfault
Micky774 e1262f9
Improved documentation
Micky774 5ee0d96
Added license information
Micky774 ceca8ed
Updated fused-type name to hopefully clarify purpose
Micky774 9bd99ca
Used vectors and updated implementation
Micky774 8943412
Removed accidentally-added file
Micky774 5be9a13
Simplified and cleaned up implementation
Micky774 27974ba
Slightly better formatting and variable name
Micky774 cec3005
Fixed dtype bug and added testing
Micky774 5116a1d
Merge branch 'main' into csr_polynomial
Micky774 764d8bd
Updated test to verify nnz count and indices
Micky774 102e2fa
Improved dtype resolution and clarified with comments
Micky774 8430c3f
Fixed inexact index error
Micky774 db78c7e
Updated formatting
Micky774 057a4f5
Cleaner diff and blame history
Micky774 23e9acf
Merge branch 'main' into csr_polynomial
Micky774 46745a8
Fixed overflow bug in expanded index calculation
Micky774 9ff8413
Fix intermediate calculation overflow and refactor tests
Micky774 5a221f2
Merge branch 'main' into csr_polynomial
Micky774 baea39e
Fixed duplicated changelog entries
Micky774 ff3d050
Merge branch 'main' into csr_polynomial
Micky774 54c7d2e
Apply suggestions from code review
Micky774 1c8a98b
Update comment for scipy min version (new backport)
Micky774 016ae5b
Removed vendored csr_hstack and instead error where appropriate
Micky774 0e14c8d
Merge branch 'main' into csr_polynomial
Micky774 a5c17dc
Updated error message
Micky774 34e7d2a
CLN Add authorship and delete cosmetic changes
Micky774 e82a9f9
Update sklearn/preprocessing/tests/test_polynomial.py
Micky774 0ef1b95
Revert
Micky774 0d7be70
Merge branch 'main' into csr_polynomial
Micky774 56e8c34
Moved calculation of number of non-zero elements to Cython
Micky774 ac07342
Merge branch 'main' into csr_polynomial
Micky774 c973d64
Addressed misc review feedback
Micky774 86da1a0
Merge branch 'main' into csr_polynomial
Micky774 4ac5deb
Added format specification
Micky774 ce2308b
Added explicit equation used to generate constants
Micky774 0543ccd
Merge branch 'main' into csr_polynomial
Micky774 a97c526
Improved documentation and introduced error
Micky774 391d049
Improved wording
Micky774 68615c8
Apply suggestions from code review
Micky774 66738fa
Merge branch 'main' into csr_polynomial
Micky774 e03f689
Overhauled tests
Micky774 c35cf2b
Improved cython routines adressed feedback
Micky774 447296b
Improved code organization
Micky774 9c93e46
Merge branch 'main' into csr_polynomial
Micky774 8809a94
Opted for explicit `cnp.*` typing for `DATA_t`
Micky774 b0c7bf5
Reverted extraneous change
Micky774 b13a13c
Adjusted tests for un-indexable values on 32bit systems
Micky774 4215b0b
Apply suggestions from code review
Micky774 e767c14
Merge branch 'main' into csr_polynomial
Micky774 1793791
Addressed Cython bug
Micky774 9144db7
Added documentation for secondary checks
Micky774 e14f3e4
Update sklearn/preprocessing/_csr_polynomial_expansion.pyx
Micky774 ec3e9ed
Formatting
Micky774 29a1cb8
Merge branch 'csr_polynomial' of https://github.com/Micky774/scikit-l…
Micky774 ddcc960
Merge branch 'main' into csr_polynomial
Micky774 56301a8
Update sklearn/preprocessing/_csr_polynomial_expansion.pyx
Micky774 209e511
Merge branch 'main' into csr_polynomial
Micky774 25b2f60
Factored equations to mitigate overflow risks
Micky774 b77f071
Merge branch 'main' into csr_polynomial
Micky774 9757640
Added `__int128` support when available
Micky774 0f96d48
Added back python computation for overflow protection
Micky774 6d1a9f1
Corrected for linux
Micky774 fa91e4e
Added support for CLANG and improved documentation
Micky774 3e9238a
Merge branch 'main' into csr_polynomial
Micky774 62d9979
Removed unreachable code
Micky774 39dab07
Slight change in equation form
Micky774 d8bdec3
Merge branch 'main' into csr_polynomial
Micky774 4eb98ca
Updated to include test to confirm expected integer width
Micky774 4195a4d
Updated fused types
Micky774 40cd06d
Updated typing
Micky774 f1dc6dc
Included feedback, and caught/handled old scipy bug
Micky774 51ea18c
Merge branch 'main' into csr_polynomial
Micky774 4189339
Apply suggestions from code review
Micky774 6d9f698
Fixed typo
Micky774 c06e26f
Merge branch 'main' into csr_polynomial
ogrisel e1a0725
Merge branch 'main' into csr_polynomial
ogrisel ca5c4d5
Update sklearn/preprocessing/_polynomial.py
Micky774 8e218d1
Merge branch 'main' into csr_polynomial
Micky774 2d2124d
Improved tests
Micky774 b00d149
Improved tests
Micky774 c16e33c
Clean paranthesis
Micky774 af934a8
Updated test to account for 32 bit systems
Micky774 c4eeb7c
Updated ValueError match string
Micky774 6576bdd
Fixed overflow bug in tests for Windows
Micky774 99eabab
Adopted review feedback
Micky774 9a90a59
Improved constant documentation
Micky774 9d9d21b
Improved variable names
Micky774 7ac1a41
Update sklearn/preprocessing/_polynomial.py
Micky774 1ebfbbe
Merge branch 'main' into csr_polynomial
Micky774 7756f62
Incorporated typedef changes
Micky774 6112cf8
Merge branch 'main' into csr_polynomial
Micky774 cf3c00c
Added check for 32bit-ness for clang
Micky774 810fb3b
Improved documentation
Micky774 ff36da1
Merge branch 'main' into csr_polynomial
Micky774 3033e4b
Merge branch 'main' into csr_polynomial
Micky774 ce6b72a
Updated for emscripten edge-case
Micky774 bcdee5d
Removed extraneous assertion
Micky774
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,73 +1,178 @@ | ||
# Author: Andrew nystrom <[email protected]> | ||
# Authors: Andrew nystrom <[email protected]> | ||
# Meekail Zain <[email protected]> | ||
from ..utils._typedefs cimport uint8_t, int64_t, intp_t | ||
|
||
from scipy.sparse import csr_matrix | ||
cimport numpy as cnp | ||
import numpy as np | ||
ctypedef uint8_t FLAG_t | ||
|
||
# We use the following verbatim block to determine whether the current | ||
# platform's compiler supports 128-bit integer values intrinsically. | ||
# This should work for GCC and CLANG on 64-bit architectures, but doesn't for | ||
# MSVC on any architecture. We prefer to use 128-bit integers when possible | ||
# because the intermediate calculations have a non-trivial risk of overflow. It | ||
# is, however, very unlikely to come up on an average use case, hence 64-bit | ||
# integers (i.e. `long long`) are "good enough" for most common cases. There is | ||
# not much we can do to efficiently mitigate the overflow risk on the Windows | ||
# platform at this time. Consider this a "best effort" design decision that | ||
# could be revisited later in case someone comes up with a safer option that | ||
# does not hurt the performance of the common cases. | ||
# See `test_sizeof_LARGEST_INT_t()` for more information on exact type expectations. | ||
cdef extern from *: | ||
""" | ||
#ifdef __SIZEOF_INT128__ | ||
typedef __int128 LARGEST_INT_t; | ||
#elif (__clang__ || __EMSCRIPTEN__) && !__i386__ | ||
typedef _BitInt(128) LARGEST_INT_t; | ||
#else | ||
typedef long long LARGEST_INT_t; | ||
#endif | ||
""" | ||
ctypedef long long LARGEST_INT_t | ||
|
||
|
||
# Determine the size of `LARGEST_INT_t` at runtime. | ||
# Used in `test_sizeof_LARGEST_INT_t`. | ||
def _get_sizeof_LARGEST_INT_t(): | ||
return sizeof(LARGEST_INT_t) | ||
|
||
cnp.import_array() | ||
|
||
# TODO: use `cnp.{int,float}{32,64}` when cython#5230 is resolved: | ||
# TODO: use `{int,float}{32,64}_t` when cython#5230 is resolved: | ||
# https://github.com/cython/cython/issues/5230 | ||
ctypedef fused DATA_T: | ||
ctypedef fused DATA_t: | ||
float | ||
double | ||
int | ||
long | ||
long long | ||
# INDEX_{A,B}_t are defined to generate a proper Cartesian product | ||
# of types through Cython fused-type expansion. | ||
ctypedef fused INDEX_A_t: | ||
signed int | ||
signed long long | ||
ctypedef fused INDEX_B_t: | ||
signed int | ||
signed long long | ||
|
||
|
||
cdef inline cnp.int32_t _deg2_column( | ||
cnp.int32_t d, | ||
cnp.int32_t i, | ||
cnp.int32_t j, | ||
cnp.int32_t interaction_only, | ||
) noexcept nogil: | ||
cdef inline int64_t _deg2_column( | ||
LARGEST_INT_t n_features, | ||
LARGEST_INT_t i, | ||
LARGEST_INT_t j, | ||
FLAG_t interaction_only | ||
) nogil: | ||
"""Compute the index of the column for a degree 2 expansion | ||
|
||
d is the dimensionality of the input data, i and j are the indices | ||
n_features is the dimensionality of the input data, i and j are the indices | ||
for the columns involved in the expansion. | ||
""" | ||
if interaction_only: | ||
return d * i - (i**2 + 3 * i) / 2 - 1 + j | ||
return n_features * i - i * (i + 3) / 2 - 1 + j | ||
else: | ||
return d * i - (i**2 + i) / 2 + j | ||
return n_features * i - i* (i + 1) / 2 + j | ||
|
||
|
||
cdef inline cnp.int32_t _deg3_column( | ||
cnp.int32_t d, | ||
cnp.int32_t i, | ||
cnp.int32_t j, | ||
cnp.int32_t k, | ||
cnp.int32_t interaction_only | ||
) noexcept nogil: | ||
cdef inline int64_t _deg3_column( | ||
LARGEST_INT_t n_features, | ||
LARGEST_INT_t i, | ||
LARGEST_INT_t j, | ||
LARGEST_INT_t k, | ||
FLAG_t interaction_only | ||
) nogil: | ||
"""Compute the index of the column for a degree 3 expansion | ||
|
||
d is the dimensionality of the input data, i, j and k are the indices | ||
n_features is the dimensionality of the input data, i, j and k are the indices | ||
for the columns involved in the expansion. | ||
""" | ||
if interaction_only: | ||
return ((3 * d**2 * i - 3 * d * i**2 + i**3 | ||
+ 11 * i - 3 * j**2 - 9 * j) / 6 | ||
+ i**2 - 2 * d * i + d * j - d + k) | ||
return ( | ||
( | ||
(3 * n_features) * (n_features * i - i**2) | ||
+ i * (i**2 + 11) - (3 * j) * (j + 3) | ||
) / 6 + i**2 + n_features * (j - 1 - 2 * i) + k | ||
) | ||
else: | ||
return ( | ||
( | ||
(3 * n_features) * (n_features * i - i**2) | ||
+ i ** 3 - i - (3 * j) * (j + 1) | ||
) / 6 + n_features * j + k | ||
) | ||
|
||
|
||
def py_calc_expanded_nnz_deg2(n, interaction_only): | ||
return n * (n + 1) // 2 - interaction_only * n | ||
|
||
|
||
def py_calc_expanded_nnz_deg3(n, interaction_only): | ||
return n * (n**2 + 3 * n + 2) // 6 - interaction_only * n**2 | ||
|
||
|
||
cpdef int64_t _calc_expanded_nnz( | ||
LARGEST_INT_t n, | ||
FLAG_t interaction_only, | ||
LARGEST_INT_t degree | ||
): | ||
""" | ||
Calculates the number of non-zero interaction terms generated by the | ||
non-zero elements of a single row. | ||
""" | ||
# This is the maximum value before the intermediate computation | ||
# d**2 + d overflows | ||
# Solution to d**2 + d = maxint64 | ||
# SymPy: solve(x**2 + x - int64_max, x) | ||
cdef int64_t MAX_SAFE_INDEX_CALC_DEG2 = 3037000499 | ||
|
||
# This is the maximum value before the intermediate computation | ||
# d**3 + 3 * d**2 + 2*d overflows | ||
# Solution to d**3 + 3 * d**2 + 2*d = maxint64 | ||
# SymPy: solve(x * (x**2 + 3 * x + 2) - int64_max, x) | ||
cdef int64_t MAX_SAFE_INDEX_CALC_DEG3 = 2097151 | ||
|
||
if degree == 2: | ||
# Only need to check when not using 128-bit integers | ||
if sizeof(LARGEST_INT_t) < 16 and n <= MAX_SAFE_INDEX_CALC_DEG2: | ||
return n * (n + 1) / 2 - interaction_only * n | ||
return <int64_t> py_calc_expanded_nnz_deg2(n, interaction_only) | ||
else: | ||
return ((3 * d**2 * i - 3 * d * i**2 + i ** 3 - i | ||
- 3 * j**2 - 3 * j) / 6 | ||
+ d * j + k) | ||
|
||
|
||
def _csr_polynomial_expansion( | ||
const DATA_T[:] data, | ||
const cnp.int32_t[:] indices, | ||
const cnp.int32_t[:] indptr, | ||
cnp.int32_t d, | ||
cnp.int32_t interaction_only, | ||
cnp.int32_t degree | ||
# Only need to check when not using 128-bit integers | ||
if sizeof(LARGEST_INT_t) < 16 and n <= MAX_SAFE_INDEX_CALC_DEG3: | ||
return n * (n**2 + 3 * n + 2) / 6 - interaction_only * n**2 | ||
return <int64_t> py_calc_expanded_nnz_deg3(n, interaction_only) | ||
|
||
cpdef int64_t _calc_total_nnz( | ||
INDEX_A_t[:] indptr, | ||
FLAG_t interaction_only, | ||
int64_t degree, | ||
): | ||
""" | ||
Perform a second-degree polynomial or interaction expansion on a scipy | ||
Calculates the number of non-zero interaction terms generated by the | ||
non-zero elements across all rows for a single degree. | ||
""" | ||
cdef int64_t total_nnz=0 | ||
cdef intp_t row_idx | ||
for row_idx in range(len(indptr) - 1): | ||
total_nnz += _calc_expanded_nnz( | ||
indptr[row_idx + 1] - indptr[row_idx], | ||
interaction_only, | ||
degree | ||
) | ||
return total_nnz | ||
|
||
|
||
cpdef void _csr_polynomial_expansion( | ||
const DATA_t[:] data, # IN READ-ONLY | ||
const INDEX_A_t[:] indices, # IN READ-ONLY | ||
const INDEX_A_t[:] indptr, # IN READ-ONLY | ||
INDEX_A_t n_features, | ||
DATA_t[:] result_data, # OUT | ||
INDEX_B_t[:] result_indices, # OUT | ||
INDEX_B_t[:] result_indptr, # OUT | ||
FLAG_t interaction_only, | ||
FLAG_t degree | ||
) nogil: | ||
""" | ||
Perform a second or third degree polynomial or interaction expansion on a | ||
compressed sparse row (CSR) matrix. The method used only takes products of | ||
non-zero features. For a matrix with density d, this results in a speedup | ||
on the order of d^k where k is the degree of the expansion, assuming all | ||
rows are of similar density. | ||
non-zero features. For a matrix with density :math:`d`, this results in a | ||
speedup on the order of :math:`(1/d)^k` where :math:`k` is the degree of | ||
the expansion, assuming all rows are of similar density. | ||
|
||
Parameters | ||
---------- | ||
|
@@ -80,9 +185,21 @@ def _csr_polynomial_expansion( | |
indptr : memory view on nd-array | ||
The "indptr" attribute of the input CSR matrix. | ||
|
||
d : int | ||
n_features : int | ||
The dimensionality of the input CSR matrix. | ||
|
||
result_data : nd-array | ||
The output CSR matrix's "data" attribute. | ||
It is modified by this routine. | ||
|
||
result_indices : nd-array | ||
The output CSR matrix's "indices" attribute. | ||
It is modified by this routine. | ||
|
||
result_indptr : nd-array | ||
The output CSR matrix's "indptr" attribute. | ||
It is modified by this routine. | ||
|
||
interaction_only : int | ||
0 for a polynomial expansion, 1 for an interaction expansion. | ||
|
||
|
@@ -95,47 +212,11 @@ def _csr_polynomial_expansion( | |
Matrices Using K-Simplex Numbers" by Andrew Nystrom and John Hughes. | ||
""" | ||
|
||
assert degree in (2, 3) | ||
|
||
if degree == 2: | ||
expanded_dimensionality = int((d**2 + d) / 2 - interaction_only*d) | ||
else: | ||
expanded_dimensionality = int((d**3 + 3*d**2 + 2*d) / 6 | ||
- interaction_only*d**2) | ||
if expanded_dimensionality == 0: | ||
return None | ||
assert expanded_dimensionality > 0 | ||
|
||
cdef cnp.int32_t total_nnz = 0, row_i, nnz | ||
|
||
# Count how many nonzero elements the expanded matrix will contain. | ||
for row_i in range(indptr.shape[0]-1): | ||
# nnz is the number of nonzero elements in this row. | ||
nnz = indptr[row_i + 1] - indptr[row_i] | ||
if degree == 2: | ||
total_nnz += (nnz ** 2 + nnz) / 2 - interaction_only * nnz | ||
else: | ||
total_nnz += ((nnz ** 3 + 3 * nnz ** 2 + 2 * nnz) / 6 | ||
- interaction_only * nnz ** 2) | ||
|
||
# Make the arrays that will form the CSR matrix of the expansion. | ||
cdef: | ||
DATA_T[:] expanded_data = np.empty( | ||
shape=total_nnz, dtype=data.base.dtype | ||
) | ||
cnp.int32_t[:] expanded_indices = np.empty( | ||
shape=total_nnz, dtype=np.int32 | ||
) | ||
cnp.int32_t num_rows = indptr.shape[0] - 1 | ||
cnp.int32_t[:] expanded_indptr = np.empty( | ||
shape=num_rows + 1, dtype=np.int32 | ||
) | ||
|
||
cnp.int32_t expanded_index = 0, row_starts, row_ends | ||
cnp.int32_t i, j, k, i_ptr, j_ptr, k_ptr, num_cols_in_row | ||
|
||
cdef INDEX_A_t row_i, row_starts, row_ends, i, j, k, i_ptr, j_ptr, k_ptr | ||
cdef INDEX_B_t expanded_index=0, num_cols_in_row, col | ||
with nogil: | ||
expanded_indptr[0] = indptr[0] | ||
result_indptr[0] = indptr[0] | ||
for row_i in range(indptr.shape[0]-1): | ||
row_starts = indptr[row_i] | ||
row_ends = indptr[row_i + 1] | ||
|
@@ -145,24 +226,32 @@ def _csr_polynomial_expansion( | |
for j_ptr in range(i_ptr + interaction_only, row_ends): | ||
j = indices[j_ptr] | ||
if degree == 2: | ||
col = _deg2_column(d, i, j, interaction_only) | ||
expanded_indices[expanded_index] = col | ||
expanded_data[expanded_index] = ( | ||
data[i_ptr] * data[j_ptr]) | ||
col = <INDEX_B_t> _deg2_column( | ||
n_features, | ||
i, j, | ||
interaction_only | ||
) | ||
result_indices[expanded_index] = col | ||
result_data[expanded_index] = ( | ||
data[i_ptr] * data[j_ptr] | ||
) | ||
expanded_index += 1 | ||
num_cols_in_row += 1 | ||
else: | ||
# degree == 3 | ||
for k_ptr in range(j_ptr + interaction_only, row_ends): | ||
k = indices[k_ptr] | ||
col = _deg3_column(d, i, j, k, interaction_only) | ||
expanded_indices[expanded_index] = col | ||
expanded_data[expanded_index] = ( | ||
data[i_ptr] * data[j_ptr] * data[k_ptr]) | ||
col = <INDEX_B_t> _deg3_column( | ||
n_features, | ||
i, j, k, | ||
interaction_only | ||
) | ||
result_indices[expanded_index] = col | ||
result_data[expanded_index] = ( | ||
data[i_ptr] * data[j_ptr] * data[k_ptr] | ||
) | ||
expanded_index += 1 | ||
num_cols_in_row += 1 | ||
|
||
expanded_indptr[row_i+1] = expanded_indptr[row_i] + num_cols_in_row | ||
|
||
return csr_matrix((expanded_data, expanded_indices, expanded_indptr), | ||
shape=(num_rows, expanded_dimensionality)) | ||
result_indptr[row_i+1] = result_indptr[row_i] + num_cols_in_row | ||
return |
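The closed-form expressions in the diff above (`_deg2_column`, `_deg3_column`, the expanded non-zero counts, and the two `MAX_SAFE_INDEX_CALC_*` constants) can be sanity-checked outside Cython. The following is a minimal pure-Python sketch, not the code from this pull request: the function names `deg2_column`, `deg3_column`, `expanded_nnz`, `fits_int64`, and `check` are hypothetical ports written for illustration. It assumes that output columns follow the lexicographic order of the index tuples, which is what the brute-force comparison tests.

```python
from itertools import combinations, combinations_with_replacement

INT64_MAX = 2**63 - 1


def deg2_column(n_features, i, j, interaction_only):
    """Hypothetical Python port of the degree-2 column formula in the diff."""
    if interaction_only:
        return n_features * i - i * (i + 3) // 2 - 1 + j
    return n_features * i - i * (i + 1) // 2 + j


def deg3_column(n_features, i, j, k, interaction_only):
    """Hypothetical Python port of the factored degree-3 column formula."""
    n = n_features
    if interaction_only:
        return (
            (3 * n * (n * i - i**2) + i * (i**2 + 11) - 3 * j * (j + 3)) // 6
            + i**2
            + n * (j - 1 - 2 * i)
            + k
        )
    return (3 * n * (n * i - i**2) + i**3 - i - 3 * j * (j + 1)) // 6 + n * j + k


def expanded_nnz(n, interaction_only, degree):
    """Number of expansion terms produced by n non-zero entries in one row."""
    if degree == 2:
        return n * (n + 1) // 2 - interaction_only * n
    return n * (n**2 + 3 * n + 2) // 6 - interaction_only * n**2


def fits_int64(n, degree):
    """True if the intermediate product from the diff's comments fits in int64."""
    value = n**2 + n if degree == 2 else n * (n**2 + 3 * n + 2)
    return value <= INT64_MAX


def check(n_features, degree, interaction_only):
    """Compare the closed forms against brute-force enumeration of index tuples."""
    enumerate_terms = (
        combinations if interaction_only else combinations_with_replacement
    )
    column = deg2_column if degree == 2 else deg3_column
    terms = list(enumerate_terms(range(n_features), degree))
    cols = [column(n_features, *term, interaction_only) for term in terms]
    # Columns should be consecutive integers starting at 0, and the number of
    # terms should match the closed-form count.
    return cols == list(range(len(terms))) and len(terms) == expanded_nnz(
        n_features, interaction_only, degree
    )
```

Because Python integers have arbitrary precision, this sketch cannot reproduce the overflow that motivates `LARGEST_INT_t` in the Cython code; `fits_int64` only checks where the 64-bit intermediate would stop being representable.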