[WIP] Self organising map #2996
Conversation
'''
The pseudo F statistic:

    pseudo F = [(T - PG) / (G - 1)] / [PG / (n - G)]
Please define T, G, PG and n. People should not have to read the reference to understand the documentation.
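For reference, a minimal sketch of how these quantities could be computed, assuming the standard definitions (none of which appear in the patch itself): T is the total sum of squared deviations from the grand mean, PG the pooled within-cluster sum of squares, G the number of clusters, and n the number of samples. Under those definitions this is essentially the Calinski-Harabasz index.

```python
import numpy as np

def pseudo_f(X, labels):
    """Pseudo-F statistic, using the standard definitions assumed here:
    T  - total sum of squared deviations from the grand mean,
    PG - pooled within-cluster sum of squares,
    G  - number of clusters,
    n  - number of samples."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    clusters = np.unique(labels)
    G = len(clusters)
    # T: total sum of squares around the grand mean of the whole data set
    T = ((X - X.mean(axis=0)) ** 2).sum()
    # PG: within-cluster sum of squares, pooled over all clusters
    PG = sum(((X[labels == g] - X[labels == g].mean(axis=0)) ** 2).sum()
             for g in clusters)
    return ((T - PG) / (G - 1)) / (PG / (n - G))
```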
This should probably be moved somewhere else, or removed entirely; I don't yet know what it is or how useful it is. It is currently only used in the examples. Suggestions for where to put it are welcome :)
I am fine with adding a new scoring metric as long as it's used in the literature and properly documented (both in the docstring and in the narrative doc).
Right, but where should it be added?
I moved the function into examples/cluster/som_digits.py and explained what's happening and why it's useful in commit 367c44b.
import pylab as pl
from matplotlib.colors import ListedColormap, NoNorm, rgb2hex
import numpy as np
from scikits.learn.cluster import SelfOrganizingMap
Oh gosh… This is old.
I will try to update the examples tomorrow.
The broken tests reported by Travis are real failures.
@ogrisel: it's a problem of convergence. The tests converge eventually, but it takes so many iterations that the tests become slow. It's possible that the generated data sets are pathological for this particular type of model. I plan to look into that more.
plot(init)
plt.title('Initial map')

som = SelfOrganizingMap(affinity=(16,16), n_iterations=1024,
This example doesn't appear to be working at the moment. Might need to look into it more. It's very similar to the example at http://www.pymvpa.org/examples/som.html
@NelleV Now would be a good time to review the code. I've been over it in depth with the debugger, and it looks like it's doing the right thing, but it isn't converging very well in some circumstances (in particular, the colormap example). I've tried a couple of modifications, for which I will add comments in a minute; some of these make the colormap example converge somewhat, but still really slowly, and they make one or both of the passing tests fail.

The main difference between this algorithm and Kohonen (1990) is that we're using an arbitrary graph for the SOM map (which is a grid by default), so we can't use the L2 metric and must use the L1 metric (post office/shortest path). This affects both the neighbourhood and radius functions. The neighbourhood, radius, and alpha (learning rate) functions are all somewhat arbitrary - Kohonen doesn't specify particular functions, just that they are monotonically decreasing as functions of time (iteration), or, in the case of the neighbourhood, of the distance between map nodes. It's not obvious to me that this shouldn't work in any particular case.
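To make the last point concrete, here is a minimal sketch of the kind of schedules being described. These particular functional forms are my own illustration, not necessarily what the branch implements; the only property relied on above is that they decrease monotonically with the iteration number (or, for the neighbourhood weight, with the shortest-path distance between map nodes).

```python
import numpy as np

# Illustrative schedules only - Kohonen (1990) just requires monotonic decay.
def learning_rate(iteration, n_iterations, alpha_0=0.5):
    """Learning rate alpha(t), decaying linearly towards zero."""
    return alpha_0 * (1.0 - iteration / float(n_iterations))

def neighbourhood_radius(iteration, n_iterations, radius_0=4.0):
    """Neighbourhood radius, shrinking exponentially over the run."""
    return radius_0 * np.exp(-iteration / float(n_iterations))

def neighbourhood_weight(graph_distance, radius):
    """Strength of the pull on a map node, as a function of its
    shortest-path (L1) distance from the winning node on the map graph."""
    return np.exp(-(np.asarray(graph_distance, dtype=float) ** 2)
                  / (2.0 * radius ** 2))
```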
winner = self.best_matching_center(x)
radius = self.radius_of_the_neighborhood(iteration)
updatable = self.cluster_centers_in_radius(winner, radius)
distances = np.sum((self.cluster_centers_[winner] - self.cluster_centers_[updatable])**2, axis=1)
This should be:
distances = self.distance_matrix[winner][updatable]
but if I make that change, the currently passing tests start failing. Not sure why. It DOES make the plot example start to converge, but only really slowly (increase n_iterations in the plot example to try it).
OK, the updater function was wrong (it was pulling using the distance from x to the winning node, instead of the distance to the node being pulled). It's working now, and converging well. I'll see if I can improve the radius and alpha functions, but it actually works quite well now. One of the cluster comparison data sets (the one with three clusters) is pathological for a 2x2 SOM. It works fine with a 3x1 grid, but then the others don't work so well.
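For readers following along, this is roughly what the corrected update amounts to, written as a standalone sketch. The argument names mirror the quoted code; the Gaussian weighting is the same illustrative choice as in the sketch above, not necessarily what the branch uses.

```python
import numpy as np

def som_update_step(centers, distance_matrix, x, winner, updatable,
                    alpha, radius):
    """Pull every centre in the winner's neighbourhood towards the sample x.

    The strength of each pull is a decreasing function of the node's graph
    distance from the *winning* node - not of the distance from x to the
    winner, which was the bug described above."""
    graph_distances = distance_matrix[winner][updatable]
    weights = np.exp(-(graph_distances ** 2) / (2.0 * radius ** 2))
    centers[updatable] += alpha * weights[:, np.newaxis] * (x - centers[updatable])
    return centers
```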
Can someone please have a look at this? It's been nearly two weeks since I got it working. The test failure appears to be something to do with estimator cloning - something that I don't touch. I have tried doing manually what I think the test is doing, and I can't replicate the problem (I don't know how
@@ -84,6 +84,8 @@
average_linkage = cluster.AgglomerativeClustering(linkage="average",
                                                  affinity="cityblock", n_clusters=2,
                                                  connectivity=connectivity)
som = cluster.SelfOrganizingMap(adjacency=(2, 2),
                                n_iterations= 1000)
pep8
Also, I am still not convinced about the SOM's usefulness. One of the items in the todo list has been ticked, but there is only one example, on the digits data, and the CH score there is lower (which I think means that the clustering quality is not as good as k-means' according to the CH assumption). This does not really highlight the SOM method, especially as the fitting time is significantly higher. However, when using other metrics the SOM solution is not that bad: for instance, the Adjusted Rand Score computed against the true digits labels says that the quality of the SOM solution is on average similar to the k-means solution. Please also include the SOM method in the cluster comparison example.

Similarly, the other TODO item has also been ticked, but from the narrative documentation and the examples I still don't get any intuition as to when one would rather use a SOM than another clustering method. What are the known applications where SOMs have been shown to work well in practice? Also, out of curiosity, what is your personal interest in using a SOM rather than simpler methods such as k-means?

Please untick those TODO items and address those comments first. As I said earlier, there is no point in implementing, reviewing and maintaining methods in scikit-learn if they cannot be clearly demonstrated, in the documentation and the examples, to be practically useful and better than other, simpler methods (at least for some problems). To clarify my point, I do not claim that SOMs are practically useless (I don't know); I just want to be convinced that they are not, as would the reader of the scikit-learn documentation.
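As a rough illustration of the kind of comparison being asked for, something along these lines could go in the digits example. This is only a sketch: SelfOrganizingMap and its adjacency/n_iterations parameters exist only on this branch, the 4x4 grid is an arbitrary choice made to match 16 k-means clusters, and the metric names assume the current sklearn.metrics API rather than what was available at the time.

```python
from sklearn import cluster, datasets, metrics

digits = datasets.load_digits()
X, y = digits.data, digits.target

estimators = [
    ('k-means', cluster.KMeans(n_clusters=16)),
    # SelfOrganizingMap comes from this branch; a 4x4 grid gives the same
    # number of clusters as the k-means run above.
    ('SOM', cluster.SelfOrganizingMap(adjacency=(4, 4), n_iterations=1000)),
]

for name, est in estimators:
    labels = est.fit(X).labels_
    print(name,
          metrics.calinski_harabasz_score(X, labels),  # CH / pseudo-F
          metrics.adjusted_rand_score(y, labels))      # vs. true digit labels
```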
This is a difficulty: the SOM theoretically should always perform worse than k-means on such scores. As discussed in #2892, it isn't the simple fitting that a SOM is useful for; rather, it is the fact that the SOM grid adds a semantic layer to the cluster layout. SOMs are basically useful for (a) dimensionality reduction where there are strong non-linear patterns in the data, and (b) using the semantic grid as a basis for creating small-multiple plots from the data in each cluster.

I will see if I can come up with some examples that show these uses.
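As an illustration of use case (b), a small-multiples layout could look something like the sketch below. It is purely illustrative: it assumes a fitted SOM whose labels_ are node indices laid out row-major on a known grid shape, and 8x8 digit-style images so each panel can show the cluster mean as a picture.

```python
import matplotlib.pyplot as plt
import numpy as np

def som_small_multiples(X, labels, grid_shape, image_shape=(8, 8)):
    """One panel per SOM node, laid out according to the node's position on
    the map grid, so neighbouring panels correspond to similar clusters."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    rows, cols = grid_shape
    fig, axes = plt.subplots(rows, cols, squeeze=False,
                             figsize=(2 * cols, 2 * rows))
    for node, ax in enumerate(axes.ravel()):
        members = X[labels == node]
        if len(members):
            # Each panel shows the cluster mean as an image here, but any
            # per-cluster summary plot would serve the same purpose.
            ax.imshow(members.mean(axis=0).reshape(image_shape), cmap='gray')
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title('node %d (n=%d)' % (node, len(members)), fontsize=8)
    return fig
```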
Thanks!
I think this should go into contrib (looks like @naught101 gave up and then decided to not give up?).
Closing for now; feel free to reopen and argue. I think it would be nice to have in contrib, but not for master.
I've been busy with other stuff, and haven't needed this module. Sorry that I haven't had time to fill out the examples. I was kind of thinking a code review beforehand would be useful, as well as some indication that this would be useful in sklearn. I'm not too hung up on it being in the core package, but I'm worried that I wouldn't be able to adequately maintain it in contrib (due to a lack of expertise in the theory, as well as in code efficiency). But I guess that's as good an argument for keeping it out of core, too :)
I don't think having an unmaintained extension is much of an issue.
Update of Sebastien Campion's pull request, as discussed in #2892
TODO: