FEA Calculate Explained Variance Ratio for _PLS Models
#32722
+126
−1
Reference Issues/PRs
This PR targets issue #32675, which asks for the attributes explained_variance_ratio_x_ and explained_variance_ratio_y_ to be added to the _PLS models.
Fixes #32675
Fixes #19896
Fixes #30470
What does this implement/fix? Explain your changes.
This PR implements explained_variance_ratio_x_ and explained_variance_ratio_y_ for _PLS models, analogous to how PCA exposes explained_variance_ratio_. Not having access to it in PLSRegression makes it difficult to:
In the current PR, I propose adding two new attributes that are calculated during fitting:
explained_variance_ratio_x_: ndarray of shape (n_components)
Fraction of variance explained in X-space for each component.
explained_variance_ratio_y_: ndarray of shape (n_components)
Fraction of variance explained in Y-space for each component.
In the initial issue (#32675) I suggested a solution for the PLSRegression model that calculated the explained variance ratio for both matrices right after fitting the superclass.
However, while working on it, I realized that a more elegant and extensible solution is to calculate the variance ratio directly in the parent class during fit. This gives:
Extensibility: the suggested solution directly extends the functionality to the other models inheriting from _PLS (PLSRegression, PLSCanonical and CCA) without having to handle additional logic (i.e., deflation_mode canonical vs regression (symmetric vs asymmetric)), as this logic is already handled during super().fit().
Consistency: all subclasses expose the same attributes without custom logic.
Performance: the suggested approach is more performant because it does not need to deflate the matrices a second time after fitting. The explained variances are calculated in each iteration during the fitting process.
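The idea of recording the explained variance inside the deflation loop can be sketched with a toy PLS in plain NumPy. This is an illustration under simplifying assumptions, not the _PLS code: pls_regression_variance_ratios is a hypothetical name, the data are centered but not scaled, the weights come from an SVD of the cross-covariance instead of the iterative NIPALS inner loop, and only the regression (asymmetric) deflation is shown.

```python
import numpy as np


def pls_regression_variance_ratios(X, Y, n_components):
    """Toy PLS with regression deflation that tracks variance per component.

    X has shape (n_samples, n_features), Y has shape (n_samples, n_targets).
    """
    Xk = np.asarray(X, dtype=float) - np.mean(X, axis=0)
    Yk = np.asarray(Y, dtype=float) - np.mean(Y, axis=0)
    total_x, total_y = np.sum(Xk**2), np.sum(Yk**2)
    ratio_x = np.empty(n_components)
    ratio_y = np.empty(n_components)

    for k in range(n_components):
        # X weights: first left singular vector of the cross-covariance.
        U, _, _ = np.linalg.svd(Xk.T @ Yk, full_matrices=False)
        w = U[:, 0]
        t = Xk @ w              # X scores
        tt = t @ t
        p = Xk.T @ t / tt       # X loadings
        q = Yk.T @ t / tt       # Y loadings (Y regressed on the X scores)

        # Deflate, and record how much sum of squares this component removed.
        ss_x_before, ss_y_before = np.sum(Xk**2), np.sum(Yk**2)
        Xk = Xk - np.outer(t, p)
        Yk = Yk - np.outer(t, q)
        ratio_x[k] = (ss_x_before - np.sum(Xk**2)) / total_x
        ratio_y[k] = (ss_y_before - np.sum(Yk**2)) / total_y

    return ratio_x, ratio_y


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
Y_demo = X_demo @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(40, 2))
rx, ry = pls_regression_variance_ratios(X_demo, Y_demo, n_components=3)
print(rx.cumsum())
print(ry.cumsum())
```

Because the ratios are read off the matrices that the loop already deflates, no second pass over the data is needed after fitting, which is the point made under Performance above.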
To ensure correctness, the implementation was compared with literature benchmark values; additional details are provided in the Testing section.
I think the superclass implementation is the better design choice, but I’m happy to adopt the earlier proposal if the maintainers prefer that direction.
Testing
The following test actions have been done:
Added tests
The following test was added to sklearn/cross_decomposition/tests/test_pls.py:
test_pls_variance_ratio_X_y()
This test runs on all PLS models that inherit from _PLS (PLSRegression, PLSCanonical and CCA). A description of the test is provided below:
For PLSRegression, PLSCanonical and CCA:
explained_variance_ratio_x_ and explained_variance_ratio_y_ is the same as the nr_components in the model.
X approaches 1 when using the maximum number of components. (This holds for the synthetic test data; symmetric-deflation models may vary depending on the ranks of X and y.)
For PLSRegression:
X and the y matrices [1].
y matrix is not larger than 1 variance when the max number of components is used (due to asymmetric deflation).
For PLSCanonical and CCA it is expected that the cumulative variance explained in the y matrix adds to 1 when the max number of components is used.
[1] Abdi, H. (2003). Partial Least Squares (PLS) Regression. In Lewis-Beck M., Bryman A., Futing T. (Eds.), Encyclopedia of Social Sciences Research Methods. Thousand Oaks (CA): Sage.
Run the following test suite:
pytest sklearn/cross_decomposition/tests/test_pls.py [69/69 passing]
pytest sklearn/tests/test_common.py -k PLSRegression -v [65/66 passing, 1 skipped]*
pytest sklearn/tests/test_common.py -k PLSCanonical -v [65/66 passing, 1 skipped]*
pytest sklearn/tests/test_common.py -k CCA -v [65/66 passing, 1 skipped]*
(*Skipped tests are unrelated and also skipped on main.)
Documentation
Added docstrings
PLSRegression, PLSCanonical and CCA.
NOTE: other attributes include the release version at which the attribute became available (e.g., .. versionadded:: 1.0). I have not added version tags yet; I will include them once maintainers confirm the target release.
_calculate_variance_xy()
test_pls_variance_ratio_X_y()
A reference to the literature values is included in this reference.
Building documentation
The documentation was built as described on the contributor documentation page. As suggested in the guide, the generated HTML files were inspected to verify that the build succeeded. This was done for all three models in scope, and an example is shown in the image below.
PLSRegression.
PLSCanonical.
CCA.
Examples
Right now, this is only exemplified in the documentation. Since it is a rather small addition, I am not sure what would be best:
Happy to add an example if desired 😄
Performance
The computation is integrated into the existing iterative deflation performed during fit, so the overhead is minimal. No additional matrix factorizations or extra passes over the data are introduced.
There is no impact on estimator instantiation time, or on .transform() or .predict().