[MRG] GSoC 2014: Standard Extreme Learning Machines #3306
Conversation
Training data, where n_samples is the number of samples
and n_features is the number of features.

y : numpy array of shape (n_samples)
y should be an "array-like" and be validated as such.
Thanks for bringing this up. I made the changes in the multi-layer perceptron as well.
Hi, I am wondering what extreme learning machines should display in verbose mode. Any ideas? Thanks
Travis is acting strange, in that it raises an error. Is there a chance that Travis uses libraries different from (or modified versions of) the local ones for testing?
It might be worth having a look at https://github.com/dclambert/Python-ELM.
Training squared error loss would seem appropriate for verbose output. Not every estimator has verbose output, though (naive Bayes doesn't because it runs instantly on typical problem sizes).
Thanks, displaying the training error in verbose mode is a useful idea.
The versions of numpy / scipy used by the various Travis workers are given in the environment variables of each build. You can see the exact setup in:
@ogrisel thanks, I will dig deeper to see where multi-label classification is being affected.
# compute regularized output coefficients using eq. 3 in reference [1]
left_part = pinv2(
    safe_sparse_dot(H.T, H_tmp) + identity(self.n_hidden) / self.C)
You should use the ridge implementation here.
Hi @agramfort, isn't this technically ridge regression? I am minimizing the L2 norm of the coefficients in the objective function, like in the equation below. Or do you mean I should use the scikit-learn implementation of ridge? Thanks.
This does not look like ridge, but you seem to compute (H'H + (1/C) I)^{-1} H', which is really a ridge solution where H is X, y is y, and C = 1/alpha.
Sorry, the equation I gave is for weighted ELMs, as it contains the weight term W, which is not part of ridge. However, the implementation contains both versions, with W and without W. The version without W computes the formula you mentioned, (H'H + (1/C) I)^{-1} H'y. Thanks.
Without W, it is then a ridge.
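To make the ridge equivalence discussed above concrete, here is a minimal sketch (all variable names are illustrative, not taken from the PR) showing that the closed-form ELM output weights match scikit-learn's Ridge fitted on the hidden activations with alpha = 1/C:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = rng.randn(200)

# random hidden layer: projection followed by a tanh activation
n_hidden, C = 50, 10.0
W = rng.randn(10, n_hidden)
b = rng.randn(n_hidden)
H = np.tanh(X @ W + b)

# closed-form ELM output weights: (H'H + I/C)^{-1} H'y
beta = np.linalg.solve(H.T @ H + np.identity(n_hidden) / C, H.T @ y)

# Ridge on the hidden activations with alpha = 1/C gives the same solution
ridge = Ridge(alpha=1.0 / C, fit_intercept=False).fit(H, y)
print(np.allclose(beta, ridge.coef_))  # True
```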
Pushed a lot of improvements.
Created two examples. I will leave the documentation until the end, after I implement the remaining part, which is kernel support, and after the code is reviewed. Thanks.
plot_decision_function(
    clf_weightless, axes[0], 'ELM(class_weight=None, C=10e5)')
plot_decision_function(
    clf_weight_auto, axes[1], 'ELM(class_weight=\'auto\', C=10e5)')
Rather than using ' use " to define the string: 'ELM(class_weight="auto", C=10e5)'
@agramfort thanks for your comments. I pushed the updated code.
Updates:
There are kernels in
It would be good to have an empirical case where the partial fit actually helps. The incremental fitting is what really makes this code non-trivial. If this is helpful, I'd say merge this with renaming / reassignment of credit, and later refactor into Ridge. If not, maybe just add a transformer and an example?
For a stateless transformer, I presume the fit is mostly needed for input checking and setting the rng? The rng could be set in the first call to transform, although this might break your common tests. In any case, you can just call fit once outside of the for loop.

So, indeed, the incremental fitting is useful in the n_features < n_hidden < n_samples regime. But this is the usual out-of-core learning setting: your features are too big, so you need to build them on the fly and call partial_fit on small batches. @agramfort had an example using polynomial features in his pydata talk :)

If we really want to go the estimator way (rather than the transformer way), there is actually a more elegant and concise way to solve the problem using conjugate gradient with a LinearOperator. This technique can be used to solve the system of linear equations without ever materializing the transformed features of size n_samples x n_hidden. This is because conjugate gradient only needs to compute products between the n_hidden x n_hidden matrix and a vector. This should be like 10 lines of code. See https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py#L63 for an example of how this works.
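Below is a rough sketch of that conjugate-gradient idea, under the assumption that the random projection + activation transform can be recomputed batch by batch (the transform and all names here are made up for illustration, not the PR's code): the n_samples x n_hidden matrix is never stored, and CG only ever sees products of the n_hidden x n_hidden Gram matrix with a vector.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.RandomState(0)
X = rng.randn(10000, 20)
y = rng.randn(10000)

n_hidden, C, batch_size = 100, 10.0, 1000
W = rng.randn(20, n_hidden)
b = rng.randn(n_hidden)

def hidden(X_batch):
    # hypothetical stateless random projection + activation
    return np.tanh(X_batch @ W + b)

def matvec(v):
    # computes (H'H + I/C) v one batch at a time, never materializing H
    out = v / C
    for start in range(0, X.shape[0], batch_size):
        H_batch = hidden(X[start:start + batch_size])
        out += H_batch.T @ (H_batch @ v)
    return out

# right-hand side H'y, accumulated the same way
Hty = np.zeros(n_hidden)
for start in range(0, X.shape[0], batch_size):
    Hty += hidden(X[start:start + batch_size]).T @ y[start:start + batch_size]

A = LinearOperator((n_hidden, n_hidden), matvec=matvec)
beta, info = cg(A, Hty)  # info == 0 means the solver converged
```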
Yeah, stateless transformers and the common tests don't work well together; they are currently manually excluded, and that is something we / I need to fix.
For better ways to solve the problem: well, it could be that n_features is small enough that you could fit n_samples x n_features into RAM, but not n_samples x n_hidden. Not sure what typical n_features and n_hidden are. You are the expert in solving linear problems, I am certainly not, so if there are smarter ways to solve this, then we should go for those.
Sorry, when I was talking about generating features on the fly, I was referring to the features generated by the random projection + activation transformer. This is the same setting as with polynomial features: your original features fit in memory but not the combination features obtained by PolynomialFeatures. But the principle is the same even if you start from your raw data, as long as the transformer used is stateless (e.g., FeatureHasher). In all cases, we loop over small batches of data, transform them and call partial_fit.
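For illustration, a minimal out-of-core loop along those lines might look like the following (the random-feature transform here is a stand-in I made up, not code from this PR):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(5000, 20)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# fixed random hidden layer, applied to one small batch at a time
n_hidden, batch_size = 200, 500
W = rng.randn(20, n_hidden)
b = rng.randn(n_hidden)
hidden = lambda X_batch: np.tanh(X_batch @ W + b)

clf = SGDClassifier(random_state=0)
classes = np.unique(y)
for start in range(0, X.shape[0], batch_size):
    batch = slice(start, start + batch_size)
    # transform the batch on the fly, then update the linear model incrementally
    clf.partial_fit(hidden(X[batch]), y[batch], classes=classes)
```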
+1: let's tackle the simple things first.
@amueller: What's the plan? Will you need to detect whether a transformer is stateless?
Did I say I have a plan? ;)
Can we decide what to do with this PR first?
+1 for a transformer on my side. Instead of using a pipeline of two transformers as I initially suggested, we can maybe create just one transformer that does the random projection and applies the activation function. This should be fairly straightforward to implement. For the examples, showing how to do grid search with a pipeline would be nice. For the name of the transformer, maybe
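A possible sketch of that single transformer and the grid-searched pipeline (the class name, its parameters, and the whole example are hypothetical, just to illustrate the proposal):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.utils import check_random_state
from sklearn.utils.validation import check_array


class RandomActivation(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: random projection followed by a tanh activation."""

    def __init__(self, n_hidden=100, random_state=None):
        self.n_hidden = n_hidden
        self.random_state = random_state

    def fit(self, X, y=None):
        X = check_array(X)
        rng = check_random_state(self.random_state)
        self.weights_ = rng.randn(X.shape[1], self.n_hidden)
        self.biases_ = rng.randn(self.n_hidden)
        return self

    def transform(self, X):
        X = check_array(X)
        return np.tanh(X @ self.weights_ + self.biases_)


pipe = Pipeline([("features", RandomActivation(random_state=0)),
                 ("ridge", Ridge())])
param_grid = {"features__n_hidden": [50, 200],
              "ridge__alpha": [1e-3, 1e-1, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=3)

rng = np.random.RandomState(0)
X = rng.randn(300, 10)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(300)
search.fit(X, y)
```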
Do you think the iterative ridge regression here has value, or are there better ways to
The idea of accumulating the n_hidden x n_hidden matrix is nice but this won't scale if n_hidden is large. If we implement a general partial_fit out of this algorithm, this will crash when people try it on high dimensional data like bag of words. We can add it and recommend not to use it when n_features is large. This would still be useful in some settings where n_samples is huge but n_features is reasonably small. For n_features large, I guess one should use SGD's partial_fit.
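The accumulation scheme under discussion would look roughly like this (a sketch with made-up names, not the PR's code): keep running sums of H'H and H'y across batches and solve the small n_hidden x n_hidden system at the end, which only pays off while n_hidden stays moderate.

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, n_hidden, C = 20, 100, 10.0
W = rng.randn(n_features, n_hidden)
b = rng.randn(n_hidden)

HtH = np.zeros((n_hidden, n_hidden))
Hty = np.zeros(n_hidden)

# each partial_fit call would update the running sums from one batch
for _ in range(20):
    X_batch = rng.randn(500, n_features)
    y_batch = rng.randn(500)
    H = np.tanh(X_batch @ W + b)
    HtH += H.T @ H
    Hty += H.T @ y_batch

# final output weights: (H'H + I/C)^{-1} H'y
beta = np.linalg.solve(HtH + np.identity(n_hidden) / C, Hty)
```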
Ok. So let's do the transformer? @IssamLaradji do you want to do that? I'm not sure about
I don't like
Aside, @amueller, re stateless transformers: for this purpose, transformers
@amueller yeah sure! I can do the transformer, so I will open a new pull request for this. Should the file containing the algorithm be under the scikit-learn main directory? Would the parameters be something like,
PS: I think for ridge regression there is also feature-wise batch support, which scales with
One possible place would be the pre-processing module.
Actually, how about putting it in the neural_network module?
Sounds good.
Right, it's only used by neural network algorithms as far as I know, so having it in
#4703: this is a rough implementation of the
How about "RandomBasisFunction" as a name?
The decoupled approach is the same as the one in https://github.com/dclambert/Python-ELM. Moreover, that project also includes a MELM-GRBF implementation.
Thanks, I added the project to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets
Just FYI:
I don't think this will ever get merged. Closing. Feel free to reopen if you disagree.
Finished implementing the standard extreme learning machines (ELMs). I am getting the following results with 550 hidden neurons on the digits dataset:
Training accuracy using the logistic activation function: 0.999444
Training accuracy using the tanh activation function: 1.000000
Fortunately, this algorithm is much easier to implement and debug than the multi-layer perceptron :).
I will push a test file soon.
@ogrisel, @larsmans