[MRG] GSoC 2014: Standard Extreme Learning Machines #3306


IssamLaradji
Contributor

Finished implementing the standard extreme learning machines (ELMs). I am getting the following results with 550 hidden neurons on the digits dataset:

Training accuracy using the logistic activation function: 0.999444
Training accuracy using the tanh activation function: 1.000000

Fortunately, this algorithm is much easier to implement and debug than the multi-layer perceptron :).
I will push a test file soon.

@ogrisel, @larsmans
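
For reference, here is a minimal ELM-style sketch (plain NumPy with a Ridge readout, not the PR's ELMClassifier API) showing how this kind of training-accuracy measurement can be set up on the digits dataset with 550 random tanh hidden units:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import Ridge
from sklearn.preprocessing import LabelBinarizer

X, y = load_digits(return_X_y=True)
X = X / 16.0                              # scale pixels so tanh does not saturate
rng = np.random.RandomState(0)

n_hidden = 550
W = rng.uniform(-1, 1, size=(X.shape[1], n_hidden))   # random input weights
b = rng.uniform(-1, 1, size=n_hidden)                  # random biases
H = np.tanh(X @ W + b)                                 # hidden-layer activations

Y = LabelBinarizer().fit_transform(y)                  # one-hot targets
readout = Ridge(alpha=1e-3, fit_intercept=False).fit(H, Y)
y_pred = readout.predict(H).argmax(axis=1)
print("Training accuracy: %f" % np.mean(y_pred == y))
```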

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling e5e363d on IssamLaradji:Extreme-Learning-Machines into 68b0a28 on scikit-learn:master.

Training data, where n_samples is the number of samples
and n_features is the number of features.

y : numpy array of shape (n_samples)
Member

y should be an "array-like" and be validated as such.

Contributor Author

Thanks for bringing this up. I made the changes in multi-layer perceptron as well.

@IssamLaradji
Contributor Author

Hi, I am wondering what extreme learning machines should display in verbose mode. Any ideas?

Thanks

@IssamLaradji
Contributor Author

Travis is acting strange: it raises an error for test_multilabel_classification(), although on my local machine the test_multilabel_classification() method in test_elm runs correctly with 1000 different seeds. The pull request also passed locally after running make test on the whole library.

Is there a chance that Travis tests against library versions different from (or modified relative to) my local setup?

@arjoly
Member

arjoly commented Jun 30, 2014

It might be worth having a look at https://github.com/dclambert/Python-ELM.

@larsmans
Member

Training squared error loss would seem appropriate for verbose output. Not every estimator has verbose output, though (naive Bayes doesn't because it runs instantly on typical problem sizes).

@coveralls

Coverage Status

Coverage increased (+0.07%) when pulling 2be2941 on IssamLaradji:Extreme-Learning-Machines into 68b0a28 on scikit-learn:master.

@IssamLaradji
Contributor Author

Thanks, displaying the training error in verbose mode is a useful idea.

@ogrisel
Member

ogrisel commented Jul 1, 2014

However, Travis raises an error for test_multilabel_classification(). Is there a chance that Travis tests against library versions different from my local setup?

The versions of numpy / scipy used by the various Travis workers are given in the environment variables of each build. You can see the exact setup in:

@IssamLaradji
Contributor Author

@ogrisel thanks, I will dig deeper to see where multi-label classification is being affected.

@IssamLaradji
Contributor Author

Hi guys, I implemented weighted and regularized ELMs - here are their awesome results on an imbalanced dataset. :) :)

Non-Regularized ELMs (Large C): [plot: non_regularized_elm]

Regularized ELMs (Small C): [plot: regularized_elm]


# compute regularized output coefficients using eq. 3 in reference [1]
left_part = pinv2(
    safe_sparse_dot(H.T, H_tmp) + identity(self.n_hidden) / self.C)
Member

you should use the ridge implementation here.

Contributor Author

Hi @agramfort, isn't this technically ridge regression? I am minimizing the L2 norm of the coefficients in the objective function - as in the equation below. Or do you mean I should use the scikit-learn implementation of ridge? Thanks.

[image: weighted regularized ELM objective (l_elm)]

Member

this does not look like ridge, but you seem to compute

(H'H + (1/C) Id)^{-1} H'

and this is really a ridge solution, where H plays the role of X, y is y, and C = 1/alpha
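
A minimal sketch of this equivalence, assuming random tanh hidden units: the closed-form solution (H'H + (1/C) Id)^{-1} H'y matches scikit-learn's Ridge with alpha = 1/C fitted on the activations H, so the pseudo-inverse code above could reuse the existing solver.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X, y = rng.randn(200, 20), rng.randn(200)

n_hidden, C = 50, 10.0
W = rng.randn(20, n_hidden)               # random input-to-hidden weights
b = rng.randn(n_hidden)                   # random biases
H = np.tanh(X @ W + b)                    # hidden-layer activations

# closed form: beta = (H'H + (1/C) I)^{-1} H'y
beta_closed = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ y)

# same solution via Ridge with alpha = 1/C and no intercept
beta_ridge = Ridge(alpha=1.0 / C, fit_intercept=False).fit(H, y).coef_
assert np.allclose(beta_closed, beta_ridge, atol=1e-6)
```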

Contributor Author

Sorry, the equation I gave is for weighted ELMs, as it contains the weight term W, which is not part of ridge. However, the implementation contains both versions - with W and without W.
The version without W computes the formula you mentioned, (H'H + (1/C) Id)^{-1} H'y.
Thanks.

Member

without W, then it is ridge

@IssamLaradji
Contributor Author

Pushed a lot of improvements.

  1. Added sequential ELM support - with partial_fit
  2. Added relevant tests for sequential ELM and weighted ELM

Created two examples.

  1. Weighted ELM plot
    [plot: plot_weighted]

  2. Training vs. testing with respect to hidden neurons
    [plot: plot_testing_training]

I will leave the documentation until the end - after I implement the remaining part (kernel support) and after the code has been reviewed. Thanks.

plot_decision_function(
    clf_weightless, axes[0], 'ELM(class_weight=None, C=10e5)')
plot_decision_function(
    clf_weight_auto, axes[1], 'ELM(class_weight=\'auto\', C=10e5)')
Member

rather than using ' use " to define the string: 'ELM(class_weight="auto", C=10e5)'

@IssamLaradji
Contributor Author

@agramfort thanks for your comments. I pushed the updated code.

@IssamLaradji
Contributor Author

Updates:

  1. ELM now uses ridge regression as an off-the-shelf solver to compute its solutions.
  2. Added support for kernels - linear, poly, rbf, sigmoid.
    Is there a way we could reuse the fast, efficient SVM kernel methods?
    Thanks.

@larsmans
Member

There are kernels in sklearn.metrics. The ones in sklearn.svm are buried deep down in the C++ code for LibSVM.
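
As a hedged sketch of how those kernels could be reused for the kernel variant (data and parameters here are purely illustrative): with a kernel matrix K in place of HH', the output coefficients are alpha = (K + (1/C) I)^{-1} y and predictions are K(X_new, X_train) @ alpha.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

rng = np.random.RandomState(0)
X_train, y_train = rng.randn(100, 5), rng.randn(100)
X_test = rng.randn(10, 5)

C = 10.0
K = pairwise_kernels(X_train, metric='rbf', gamma=0.1)        # reuse sklearn kernels
alpha = np.linalg.solve(K + np.eye(len(K)) / C, y_train)      # (K + (1/C) I)^{-1} y
y_pred = pairwise_kernels(X_test, X_train, metric='rbf', gamma=0.1) @ alpha
```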

@amueller
Member

amueller commented May 7, 2015

It would be good to have an empirical case where partial_fit actually helps. The incremental fitting is what really makes this code non-trivial. If it is helpful, I'd say merge this with renaming / reassignment of credit, and later refactor it into Ridge.

If not, maybe just add a transformer and an example?

@mblondel
Member

mblondel commented May 7, 2015

For a stateless transformer, I presume the fit is mostly needed for input checking and setting the rng? The rng could be set in the first call to transform, although this might break your common tests. In any case, you can just call fit once outside of the for loop.

So, indeed, the incremental fitting is useful in the n_features < n_hidden < n_samples regime. But this is the usual out-of-core learning setting: your features are too big, so you need to build them on the fly and call partial_fit on small batches. @agramfort had an example using polynomial features in his PyData talk :)

If we really want to go the estimator way (rather than the transformer way), there is actually a more elegant and concise way to solve the problem using conjugate gradient with a LinearOperator. This technique can be used to solve the system of linear equations without ever materializing the transformed features of size n_samples x n_hidden. This is because conjugate gradient only needs to compute products between the n_hidden x n_hidden matrix and a vector. This should be like 10 lines of code. See https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py#L63 for an example of how this works.
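
A hedged sketch of that idea (function and variable names are illustrative, not the PR's code): solve (H'H + alpha I) beta = H'y with SciPy's conjugate gradient and a LinearOperator whose matvec streams over the data in batches, so the n_samples x n_hidden matrix H is never materialized.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def elm_cg_fit(X, y, W, b, alpha=1.0, batch_size=1000):
    n_samples = X.shape[0]
    n_hidden = W.shape[1]

    def hidden(X_batch):
        return np.tanh(X_batch @ W + b)            # activations for one batch

    def matvec(v):
        out = alpha * v                            # regularization term
        for start in range(0, n_samples, batch_size):
            H_batch = hidden(X[start:start + batch_size])
            out = out + H_batch.T @ (H_batch @ v)  # accumulate H'H v batch-wise
        return out

    rhs = np.zeros(n_hidden)                       # right-hand side H'y
    for start in range(0, n_samples, batch_size):
        H_batch = hidden(X[start:start + batch_size])
        rhs += H_batch.T @ y[start:start + batch_size]

    A = LinearOperator((n_hidden, n_hidden), matvec=matvec)
    beta, info = cg(A, rhs)
    return beta
```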

@amueller
Member

amueller commented May 7, 2015

Yeah, stateless transformers and the common tests don't work well together; they are currently manually excluded, and that is something we / I need to fix.
The problem with setting the random weights on the first transform is that this would break if people want to use a transformer object on two different datasets. n_features is usually inferred in fit, and if you use a different one in transform that is an error. I guess you could set it in transform unless it was set in fit, and if you explicitly want to use it on another dataset, you have to call fit. That is slightly magic, though.

@amueller
Member

amueller commented May 7, 2015

For better ways to solve the problem: well it could be that n_features is small enough that you could fit n_samples x n_features into ram, but not n_samples x n_hidden. Not sure what typical n_features and n_hidden are.

You are the expert in solving linear problems, I am certainly not, so if there are smarter ways to solve this, then we should go for them.
I didn't mentor this GSoC, I just heard multiple times "this just needs a final review".

@mblondel
Member

mblondel commented May 8, 2015

For better ways to solve the problem: well it could be that n_features is small enough that you could fit n_samples x n_features into ram, but not n_samples x n_hidden. Not sure what typical n_features and n_hidden are.

Sorry, when I was talking about generating features on the fly, I was referring to the features generated by the random projection + activation transformer. This is the same setting with polynomial features as well: your original features fit in memory but not the combination features obtained by PolynomialFeatures. But the principle is the same even if you start from your raw data, as long as the transformer used is stateless (e.g., FeatureHasher). In all cases, we loop over small batches of data, transform them and call partial_fit.
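
A minimal sketch of that pattern, with a toy batch generator standing in for the real data source: a stateless transformer (FeatureHasher here) builds the features for each small batch on the fly and the estimator is updated with partial_fit, so the full feature matrix is never built.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

def iter_batches(n_batches=5, batch_size=100, seed=0):
    # toy out-of-core data source: yields (raw_samples, labels) chunks
    rng = np.random.RandomState(seed)
    words = ['spam', 'ham', 'eggs', 'foo', 'bar']
    for _ in range(n_batches):
        raw = [[str(w) for w in rng.choice(words, size=10)]
               for _ in range(batch_size)]
        y = rng.randint(0, 2, size=batch_size)
        yield raw, y

hasher = FeatureHasher(input_type='string')    # stateless: no fit needed
clf = SGDClassifier()

for raw_batch, y_batch in iter_batches():
    X_batch = hasher.transform(raw_batch)      # transform one batch at a time
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])
```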

@GaelVaroquaux
Member

For partial fit pipelines, I would only support stateless transformers for the moment.

+1: let's tackle the simple things first.

@mblondel
Member

mblondel commented May 8, 2015 via email

@amueller
Member

amueller commented May 8, 2015

Did I say I have a plan? ;)
Three possible ways?

  1. detect (I have no idea how)
  2. annotate as stateless
  3. via the API: stateless transformers don't need to call fit. Then we don't need to call fit in partial_fit. This would also allow users to provide a stateful transformer fit on a subset of the data.

@amueller
Member

amueller commented May 8, 2015

Can we decide what to do with this PR first?
@IssamLaradji put a lot of work into it, and it has been sitting around for way too long. If we feel that the algorithm isn't suited for a classifier / regressor class, we should see what we can salvage and add transformers / examples etc.

@mblondel
Member

mblondel commented May 8, 2015

+1 for a transformer on my side. Instead of using a pipeline of two transformers as I initially suggested, we can maybe create just one transformer that does the random projection and applies the activation function. This should be fairly straightforward to implement. For the examples, showing how to do grid search with a pipeline would be nice. For the name of the transformer, maybe RandomActivationTransformer?
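
A hedged sketch of such a transformer (the name and parameters just follow the suggestions in this thread, not a final API): fix a random projection in fit, then apply it plus a nonlinearity in transform, and pipeline the result into a ridge readout.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.utils import check_random_state
from sklearn.utils.validation import check_array

class RandomActivationTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_hidden=100, weight_scale=1.0, random_state=None):
        self.n_hidden = n_hidden
        self.weight_scale = weight_scale
        self.random_state = random_state

    def fit(self, X, y=None):
        X = check_array(X)
        rng = check_random_state(self.random_state)
        # fixed random projection, drawn once in fit
        self.coef_ = rng.uniform(-self.weight_scale, self.weight_scale,
                                 size=(X.shape[1], self.n_hidden))
        self.intercept_ = rng.uniform(-self.weight_scale, self.weight_scale,
                                      size=self.n_hidden)
        return self

    def transform(self, X):
        X = check_array(X)
        return np.tanh(X @ self.coef_ + self.intercept_)

# usage as discussed: random features followed by a ridge readout
model = make_pipeline(RandomActivationTransformer(n_hidden=200, random_state=0),
                      Ridge(alpha=1e-2))
```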

@amueller
Member

amueller commented May 8, 2015

Do you think the iterative ridge regression here has value, or are there better ways to partial_fit ridge regression?

@mblondel
Member

mblondel commented May 8, 2015

The idea of accumulating the n_hidden x n_hidden matrix is nice, but this won't scale if n_hidden is large. If we turn this algorithm into a general partial_fit for ridge, it will crash when people try it on high-dimensional data like bags of words. We can add it and recommend not to use it when n_features is large. This would still be useful in settings where n_samples is huge but n_features is reasonably small. For large n_features, I guess one should use SGD's partial_fit.
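
A hedged sketch (assumed names) of the accumulation scheme under discussion: partial_fit keeps the running H'H and H'y statistics and re-solves the ridge system after each batch, so past batches never need to be stored; memory stays at O(n_hidden ** 2) regardless of n_samples, which is also why it does not scale to very large n_hidden.

```python
import numpy as np

class IncrementalRidgeSolver:
    def __init__(self, n_hidden, C=1.0):
        self.HtH = np.zeros((n_hidden, n_hidden))   # running H'H
        self.Hty = np.zeros(n_hidden)               # running H'y
        self.C = C
        self.n_hidden = n_hidden

    def partial_fit(self, H_batch, y_batch):
        self.HtH += H_batch.T @ H_batch
        self.Hty += H_batch.T @ y_batch
        # (H'H + (1/C) I) beta = H'y with the updated sufficient statistics
        self.beta_ = np.linalg.solve(
            self.HtH + np.eye(self.n_hidden) / self.C, self.Hty)
        return self
```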

@amueller
Member

amueller commented May 8, 2015

OK. So let's do the transformer? @IssamLaradji do you want to do that?
Or do you think you don't have time?

I'm not sure about RandomActivationTransformer. Maybe NonlinearProjection, though projection kind of implies mapping to a lower-dimensional space. NonlinearRandomFeatures? RandomFeatures?

@vene
Member

vene commented May 8, 2015

I don't like RandomFeatures, it's way too generic. From the name I'd expect it to simply ignore X and return random features. Out of all the names here, it seems to me like RandomActivation is the most specific (it best conveys what the object does). (I'd remove the Transformer suffix).

@jnothman
Member

jnothman commented May 9, 2015

aside @amueller re stateless transformers: for this purpose, transformers that depend only on the type or number of columns of the input should also be acceptable, just to make things tricky!


@IssamLaradji
Contributor Author

@amueller yeah, sure! I can do the transformer.

So I will open a new pull request for this.

Should the file containing the algorithm be under the scikit-learn main directory?
I mean, would it be something like from sklearn import RandomActivation?

Would the parameters be something like:

  • weight_scale, which sets the range of values for the uniform random sampling.
  • activation_function, which could be identity, relu, logistic and so on.

PS: I think ridge regression also has feature-wise batch support, which scales with n_features rather than n_samples.

@mblondel
Member

mblondel commented May 9, 2015

One possible place would be the preprocessing module.

@mblondel
Member

mblondel commented May 9, 2015

Actually how about putting it in the neural_network module?

@IssamLaradji
Contributor Author

sounds good.

@IssamLaradji
Contributor Author

Right, it's only used by neural network algorithms as far as I know, so having it in the neural_network module is better imo.

@IssamLaradji
Contributor Author

#4703 is a rough implementation of the RandomActivation algorithm.

@jnothman jnothman mentioned this pull request May 11, 2015
@amueller
Member

How about "RandomBasisFunction" as a name?

@ekerazha

The decoupled approach is the same as the one in https://github.com/dclambert/Python-ELM:
it had a "random_layer" that you could also pipeline before a Ridge regression.

Moreover, it also included a MELM-GRBF implementation.

@mblondel
Member

mblondel commented May 13, 2015 via email

@ProfFan

ProfFan commented Jan 8, 2016

Just FYI:
Recently Anton Akusok et al. have implemented ELM in Python with MAGMA-based acceleration, under the name hpelm (PyPI: https://pypi.python.org/pypi/hpelm).

@agramfort
Member

I don't think this will ever get merged. Closing. Feel free to reopen if you disagree.

@agramfort agramfort closed this Feb 25, 2019