ENH Enable the "sufficient stats" mode of LARS #11699

Merged: 66 commits into scikit-learn:master on Mar 6, 2019

Conversation

@yukuairoy (Contributor) commented Jul 27, 2018

What does this implement/fix? Explain your changes.

We'd like to enable a "Gram and covariance matrix" based mode of the LARS algorithm in lars_path(...). As the original paper by B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani (2004) documents, the LARS algorithm can run from the sufficient statistics alone: the Gram matrix, the covariance vector (Xy), and the sample size.

We'd like to add a lars_path_gram(...) function so that users can run LARS when they know only these sufficient statistics and not the original data X and y.

Additional tests have been added to ensure the new lars_path_gram(...) function works as intended.
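
To make the intended usage concrete, here is a minimal sketch; keyword arguments are used so it does not depend on the exact positional order of the new function's signature discussed below:

```python
# Minimal usage sketch of the proposed sufficient-statistics mode.
# Keyword arguments are used so the sketch does not depend on the exact
# positional order of the lars_path_gram signature discussed in this thread.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path_gram  # the function added by this PR

X, y = make_regression(n_samples=100, n_features=10, random_state=0)

# The sufficient statistics: Gram matrix, covariance vector, sample size.
Gram = X.T @ X          # shape (n_features, n_features)
Xy = X.T @ y            # shape (n_features,)
n_samples = X.shape[0]

# No X or y needed once the sufficient statistics are known.
alphas, active, coefs = lars_path_gram(Xy=Xy, Gram=Gram, n_samples=n_samples)
```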

@agramfort (Member)

@yukuairoy you can already pass Gram and Xy as parameters to lars_path. Why is that not enough?

Also, I see a number of cosmetic changes to existing code in this PR. Avoid this when possible so that the diff is limited to the new feature code.

@agramfort (Member)

@yukuairoy can you answer my question above?

@yukuairoy (Contributor, Author) commented Jul 29, 2018

@agramfort Thank you for looking at this PR. Currently, users have to pass non-None X and y for lars_path() to work. The "sufficient stats" mode lets users skip providing X and y (passing None as placeholders) and supply the sufficient statistics instead. In addition to the Gram matrix and the covariance vector (Xy), n_samples is needed to complete the set of sufficient statistics. Since n_samples was not part of the lars_path() signature, we also need to add it for this mode to work.

I've sent a separate PR (#11703) to fix the cosmetic issues and I'll push another commit to make sure this current PR is only about the change in functionality.

@jnothman (Member) commented Jul 29, 2018 via email

@yukuairoy (Contributor, Author)

@jnothman Thank you for the comment. Could you clarify what you mean by "n_samples = Xy.shape[0]???"? I cannot find this line in my diff. Or are you suggesting that n_samples can be inferred from Xy?

Please correct me if I'm wrong, but I think Xy.shape[0] gives us n_features rather than n_samples. In fact, n_samples cannot be inferred from either Gram or Xy; it has to be supplied explicitly by the user.
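
A quick sketch of the point about shapes (illustrative only):

```python
# Illustration: Xy has length n_features, so neither its shape nor Gram's
# reveals n_samples.
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 200, 5
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)

Gram = X.T @ X   # shape (5, 5)  -> (n_features, n_features)
Xy = X.T @ y     # shape (5,)    -> (n_features,)

print(Xy.shape[0])  # prints 5, i.e. n_features, not n_samples
# n_samples (200) is not recoverable from Gram or Xy; it must be supplied
# explicitly.
```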

@agramfort (Member)

Thinking about it, I think it would be cleaner to have a new lars_path_gram function that takes Gram and Xy as input (no X or y) and to deprecate the option to pass Gram and Xy to lars_path. That would simplify the API of lars_path.

@jnothman (Member) commented Jul 31, 2018 via email

@yukuairoy (Contributor, Author)

Hi @jnothman, thanks very much for bringing this up. It looks like there is indeed a documentation bug in the parameter description of Xy on the master branch. After this PR is approved and merged, I can open another PR to fix that documentation issue if it helps the community.

@yukuairoy (Contributor, Author)

Hi @agramfort, thanks very much for the suggestion. For a bit more context: in many existing client codebases we know of (and in several cases exemplified by the unit tests), users invoke the "precomputed" mode of lars_path(...), i.e. they pass X and y alongside Gram and Xy, which are literally precomputed. Under the hood, knowing X directly may let the solver use slightly more efficient numerical linear-algebra routines than when X is unknown. This is consistent with the fact that the numerical output of the "precomputed" mode is not always exactly equal to that of the raw mode (some examples can be found in the test cases; more exist in client codebases, though those are not visible from GitHub). If we deprecate the option to pass Gram and Xy to lars_path(...), the concern is that it will likely break many of those existing client codebases, which does not seem necessary.
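
For illustration, a minimal sketch of the "precomputed" call pattern described above, using the existing lars_path signature quoted later in this thread:

```python
# Sketch of the "precomputed" mode: X and y are passed together with
# precomputed Gram and Xy so lars_path does not recompute them.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.RandomState(0)
X = rng.randn(50, 8)
y = rng.randn(50)

Gram = X.T @ X
Xy = X.T @ y

# Raw mode: Gram and Xy are computed internally.
alphas_raw, _, coefs_raw = lars_path(X, y, method='lasso')

# Precomputed mode: the same data plus the precomputed statistics.
alphas_pre, _, coefs_pre = lars_path(X, y, Xy=Xy, Gram=Gram, method='lasso')

# The two modes agree to numerical precision, though as noted above they
# are not guaranteed to be bitwise identical.
print(np.allclose(coefs_raw, coefs_pre))
```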

@agramfort (Member)

@yukuairoy you will have to change the code of your clients anyway to support what you are aiming for.

When parameters become required or optional depending on other parameters, documenting the API starts to become a mess. With the current master you always need X and y, and you can pass precomputed values to avoid recomputation. With what you propose, X and y can be None, but then we need to pass n_samples. It starts to be a mess, I think.

I would prefer to have

lars_path(X, y, precompute='auto' | True | False), which would follow the Lars API: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lars.html

and

lars_path_gram(Gram, Xy, n_samples)

Of course, we should do this without code duplication, via a private function.
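
One possible way to structure that split without duplicating code (a sketch only; the helper name below is illustrative, not a claim about the final implementation):

```python
# Sketch of the proposed split: two thin public entry points delegating to
# one shared private solver. The name _lars_path_solver is illustrative.

def _lars_path_solver(X, y, Xy=None, Gram=None, n_samples=None, **kwargs):
    """Shared LARS solver, working from either (X, y) or (Gram, Xy, n_samples)."""
    ...  # the actual LARS iterations live here

def lars_path(X, y, Xy=None, Gram=None, **kwargs):
    """Raw-data entry point; n_samples is read off X."""
    return _lars_path_solver(X, y, Xy=Xy, Gram=Gram,
                             n_samples=X.shape[0], **kwargs)

def lars_path_gram(Gram, Xy, n_samples, **kwargs):
    """Sufficient-statistics entry point; no X or y required."""
    return _lars_path_solver(None, None, Xy=Xy, Gram=Gram,
                             n_samples=n_samples, **kwargs)
```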

@yukuairoy (Contributor, Author)

@agramfort I agree that we should keep the two modes separate. My only concern with your suggestion of lars_path(X, y, precompute='auto' | True | False) is that, to support the precompute mode, lars_path() would still need to accept Gram and Xy in addition to the precompute parameter, which makes the precompute parameter redundant.

How about we keep the original

lars_path(X, y, Xy=None, Gram=None, max_iter=500, alpha_min=0, method='lar', copy_X=True, eps=np.finfo(np.float).eps, copy_Gram=True, verbose=0, return_path=True, return_n_iter=False, positive=False)

intact and add an additional

lars_path_gram(Gram, Xy, n_samples, max_iter=500, alpha_min=0, method='lar', copy_X=True, eps=np.finfo(np.float).eps, copy_Gram=True, verbose=0, return_path=True, return_n_iter=False, positive=False)?

This way we get to keep backward compatibility. Of course we'll use a private function to avoid code duplication.
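
A minimal sketch (not the PR's actual test code) of the equivalence between the two entry points, assuming the lars_path_gram signature above and using keyword arguments:

```python
# Sketch of the equivalence check the added tests aim for: running LARS on
# (X, y) and on the sufficient statistics should trace the same path.
import numpy as np
from sklearn.linear_model import lars_path, lars_path_gram

rng = np.random.RandomState(42)
X = rng.randn(60, 6)
y = rng.randn(60)

alphas1, active1, coefs1 = lars_path(X, y, method='lasso')
alphas2, active2, coefs2 = lars_path_gram(Xy=X.T @ y, Gram=X.T @ X,
                                          n_samples=X.shape[0],
                                          method='lasso')

# Agreement up to numerical precision, per the discussion above.
assert np.allclose(alphas1, alphas2)
assert np.allclose(coefs1, coefs2)
```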

@yukuairoy (Contributor, Author)

Thanks @agramfort and @jnothman for the comments. I've updated the code. Please take a look.

@jnothman changed the title from Enable the "sufficient stats" mode of LARS to ENH Enable the "sufficient stats" mode of LARS on Jan 8, 2019
@yukuairoy (Contributor, Author)

@agramfort Do you have further comments?

@yukuairoy (Contributor, Author)

@agramfort Friendly ping

@yukuairoy (Contributor, Author)

@agramfort can you please review the current version?

@yukuairoy (Contributor, Author) commented Feb 4, 2019

@jnothman Thanks for reviewing this pull request. Is there anything we can do to make sure @agramfort reviews the latest changes?

@agramfort (Member)

@yukuairoy we need a what's new update before merging.

@yukuairoy (Contributor, Author)

@agramfort @jnothman thanks for the LGTM. I've updated the What's New entry.

@jnothman (Member) commented Mar 6, 2019

Thanks @yukuairoy!

@jnothman merged commit fec7670 into scikit-learn:master on Mar 6, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019