Merged

Commits (38)
a25522e  ENH: MultiOutputTree (wip)  (glouppe, Jun 25, 2012)
4728c79  Merge branch 'master' of github.com:scikit-learn/scikit-learn into tr…  (glouppe, Jun 27, 2012)
eac35cc  ENH: Multi-output decision trees  (glouppe, Jun 28, 2012)
064a48c  ENH: Regenerate .c file  (glouppe, Jun 29, 2012)
74bf03c  FIX: graphviz test  (glouppe, Jun 29, 2012)
55dbb49  Merge branch 'master' of github.com:scikit-learn/scikit-learn into tr…  (glouppe, Jun 29, 2012)
be8ea69  FIX: test_classification_toy  (glouppe, Jun 29, 2012)
afacf44  TEST: test_multioutput (1)  (glouppe, Jun 29, 2012)
6cf4d26  TEST: test_multioutput  (glouppe, Jul 2, 2012)
b22b1f6  ENH: make forests support multi-output  (glouppe, Jul 2, 2012)
7b6ef37  TEST: test_multioutput  (glouppe, Jul 2, 2012)
5ee718c  ENH: Patch GradientBoosting  (glouppe, Jul 2, 2012)
41cd38f  ENH: Patch GradientBoosting (2)  (glouppe, Jul 2, 2012)
b4131f9  FIX: log_proba + DOC  (glouppe, Jul 2, 2012)
d560372  DOC: What's new  (glouppe, Jul 2, 2012)
9f7a0dd  PEP8  (glouppe, Jul 2, 2012)
0cae649  ENH: graphviz  (glouppe, Jul 2, 2012)
e00d789  DOC: narrative documentation  (glouppe, Jul 2, 2012)
358884a  DOC: typo  (glouppe, Jul 2, 2012)
c549cb6  DOC: Scikit-Learn -> scikit-learn  (glouppe, Jul 2, 2012)
0d4719e  ENH: Cython improved code  (glouppe, Jul 2, 2012)
18a2e23  ENH: Cython improved code (2)  (glouppe, Jul 2, 2012)
5333afa  DOC: narrative documentation  (glouppe, Jul 3, 2012)
f178fe6  FIX: use and modify own y  (glouppe, Jul 3, 2012)
b14c23a  COSMIT  (glouppe, Jul 3, 2012)
f1bdd99  FIX: segfault  (glouppe, Jul 4, 2012)
f11ff94  DOC: Example  (glouppe, Jul 4, 2012)
264737e  DOC: typo  (glouppe, Jul 4, 2012)
386631e  DOC: example  (glouppe, Jul 4, 2012)
91963b8  DOC: typo  (glouppe, Jul 4, 2012)
a08a910  DOC: narrative documentation  (glouppe, Jul 4, 2012)
637ab82  added multi-ouput tree example  (bdholt1, Jul 9, 2012)
81a1f90  updated documentation to reflect multi-output DT regression  (bdholt1, Jul 9, 2012)
e5a61dc  Merge branch 'master' of github.com:scikit-learn/scikit-learn into tr…  (glouppe, Jul 9, 2012)
94a5f3f  added link  (bdholt1, Jul 9, 2012)
dc8e65a  Merge pull request #3 from bdholt1/glouppe-tree-mo  (glouppe, Jul 9, 2012)
532c54c  Merge branch 'master' of github.com:scikit-learn/scikit-learn into tr…  (glouppe, Jul 9, 2012)
f14601a  DOC: format  (glouppe, Jul 9, 2012)
8 changes: 8 additions & 0 deletions doc/modules/ensemble.rst
@@ -50,6 +50,10 @@ target values (class labels) for the training samples::
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf = clf.fit(X, Y)

Like :ref:`decision trees <tree>`, forests of trees also extend
to :ref:`multi-output problems <tree_multioutput>` (if Y is an array of size
``[n_samples, n_outputs]``).
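
As an illustrative aside (not part of this diff), fitting a forest on a two-dimensional Y might look like the following minimal sketch; the data here are synthetic and the parameter values arbitrary:

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> X = np.random.rand(100, 4)             # 100 samples, 4 features
>>> Y = np.random.randint(0, 2, (100, 3))  # 3 binary outputs per sample
>>> clf = RandomForestClassifier(n_estimators=10).fit(X, Y)
>>> clf.predict(X[:2]).shape               # one predicted column per output
(2, 3)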

Member: Please add a link to the face completion example here.

Contributor Author: Done.


Random Forests
--------------
@@ -161,6 +165,8 @@ amount of time (e.g., on large datasets).

* :ref:`example_ensemble_plot_forest_iris.py`
* :ref:`example_ensemble_plot_forest_importances_faces.py`
* :ref:`example_ensemble_plot_forest_multioutput.py`


.. topic:: References

@@ -210,6 +216,7 @@ the matching feature to the prediction function.
* :ref:`example_ensemble_plot_forest_importances_faces.py`
* :ref:`example_ensemble_plot_forest_importances.py`


.. _gradient_boosting:

Gradient Tree Boosting
@@ -471,6 +478,7 @@ can be controlled via the ``max_features`` parameter.
* :ref:`example_ensemble_plot_gradient_boosting_regression.py`
* :ref:`example_ensemble_plot_gradient_boosting_regularization.py`


.. topic:: References

.. [F2001] J. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine",
73 changes: 71 additions & 2 deletions doc/modules/tree.rst
@@ -38,6 +38,8 @@ Some advantages of decision trees are:
of variable. See :ref:`algorithms <tree_algorithms>` for more
information.

- Able to handle multi-output problems.

- Uses a white box model. If a given situation is observable in a model,
the explanation for the condition is easily explained by boolean logic.
By contrast, in a black box model (e.g., in an artificial neural
@@ -49,6 +51,7 @@ Some advantages of decision trees are:
- Performs well even if its assumptions are somewhat violated by
the true model from which the data were generated.


The disadvantages of decision trees include:

- Decision-tree learners can create over-complex trees that do not
@@ -78,6 +81,7 @@ The disadvantages of decision trees include:
It is therefore recommended to balance the dataset prior to fitting
with the decision tree.


.. _tree_classification:

Classification
@@ -87,8 +91,8 @@ Classification
classification on a dataset.

As other classifiers, :class:`DecisionTreeClassifier` takes as input two
arrays: an array X of size [n_samples, n_features] holding the training
samples, and an array Y of integer values, size [n_samples], holding
arrays: an array X of size ``[n_samples, n_features]`` holding the training
samples, and an array Y of integer values, size ``[n_samples]``, holding
the class labels for the training samples::

>>> from sklearn import tree
@@ -147,6 +151,7 @@ After being fitted, the model can then be used to predict new values::

* :ref:`example_tree_plot_iris.py`


.. _tree_regression:

Regression
@@ -177,6 +182,67 @@ instead of integer values::

* :ref:`example_tree_plot_tree_regression.py`


.. _tree_multioutput:

Multi-output problems
=====================

A multi-output problem is a supervised learning problem with several outputs
to predict, that is, when Y is a 2d array of size ``[n_samples, n_outputs]``.

When there is no correlation between the outputs, a very simple way to solve
this kind of problem is to build n independent models, i.e. one for each
output, and then to use those models to independently predict each one of the n
outputs. However, because it is likely that the output values related to the
same input are themselves correlated, an often better way is to build a single
model capable of predicting simultaneously all n outputs. First, it requires
lower training time since only a single estimator is built. Second, the
generalization accuracy of the resulting estimator may often be increased.
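
The following minimal sketch (not part of this diff; synthetic data, arbitrary sizes) contrasts the two strategies on a regression problem:

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeRegressor
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(100, 3)                  # 100 samples, 3 features
>>> Y = rng.rand(100, 2)                  # 2 outputs per sample
>>> # Strategy 1: one independent tree per output
>>> trees = [DecisionTreeRegressor().fit(X, Y[:, k]) for k in range(Y.shape[1])]
>>> pred_independent = np.column_stack([t.predict(X) for t in trees])
>>> # Strategy 2: a single tree predicting all outputs at once
>>> pred_joint = DecisionTreeRegressor().fit(X, Y).predict(X)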

With regard to decision trees, this strategy can readily be used to support
multi-output problems. This requires the following changes:

- Store n output values in leaves, instead of 1;
- Use splitting criteria that compute the average reduction across all
n outputs.

This module offers support for multi-output problems by implementing this
strategy in both :class:`DecisionTreeClassifier` and
:class:`DecisionTreeRegressor`. If a decision tree is fit on an output array Y
of size ``[n_samples, n_outputs]`` then the resulting estimator will:

* Output n_outputs values upon ``predict``;

* Output a list of n_outputs arrays of class probabilities upon
  ``predict_proba``.
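
As a small illustrative sketch (not part of this diff) of these return values, on a tiny synthetic dataset:

>>> from sklearn.tree import DecisionTreeClassifier
>>> X = [[0, 0], [1, 1], [2, 2]]
>>> Y = [[0, 1], [1, 0], [1, 1]]            # two outputs per sample
>>> clf = DecisionTreeClassifier().fit(X, Y)
>>> clf.predict([[2, 2]]).shape             # one predicted value per output
(1, 2)
>>> len(clf.predict_proba([[2, 2]]))        # one probability array per output
2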


The use of multi-output trees for regression is demonstrated in
:ref:`example_tree_plot_tree_regression_multioutput.py`. In this example, the input
X is a single real value and the outputs Y are the sine and cosine of X.

.. figure:: ../auto_examples/tree/images/plot_tree_regression_multioutput_1.png
:target: ../auto_examples/tree/plot_tree_regression_multioutput.html
:scale: 75
:align: center

The use of multi-output trees for classification is demonstrated in
:ref:`example_ensemble_plot_forest_multioutput.py`. In this example, the inputs
X are the pixels of the upper half of faces and the outputs Y are the pixels of
the lower half of those faces.

.. figure:: ../auto_examples/ensemble/images/plot_forest_multioutput_1.png
:target: ../auto_examples/ensemble/plot_forest_multioutput.html
:scale: 75
:align: center

.. topic:: Examples:

* :ref:`example_tree_plot_tree_regression_multioutput.py`
* :ref:`example_ensemble_plot_forest_multioutput.py`

Member: Rather than just linking here, please include an inline plot + a small paragraph explaining what are the inputs and the outputs for this example.

Contributor Author: Done :)


.. _tree_complexity:

Complexity
@@ -228,6 +294,7 @@ slowing down the algorithm significantly.

Tips on practical use
=====================

* Decision trees tend to overfit on data with a large number of features.
Getting the right ratio of samples to number of features is important, since
a tree with few samples in high dimensional space is very likely to overfit.
@@ -259,6 +326,7 @@ Tips on practical use
* All decision trees use Fortran ordered ``np.float32`` arrays internally.
If training data is not in this format, a copy of the dataset will be made.
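
As an editorial aside (not part of this diff), a one-line sketch of how training data could be converted up front so the estimator may not need to make that extra copy at fit time:

>>> import numpy as np
>>> X = np.random.rand(100, 10)                 # any training data
>>> X = np.asfortranarray(X, dtype=np.float32)  # Fortran-ordered float32, the trees' internal layout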


.. _tree_algorithms:

Tree algorithms: ID3, C4.5, C5.0 and CART
@@ -297,6 +365,7 @@ scikit-learn uses an optimised version of the CART algorithm.
.. _ID3: http://en.wikipedia.org/wiki/ID3_algorithm
.. _CART: http://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees


.. _tree_mathematical_formulation:

Mathematical formulation
3 changes: 3 additions & 0 deletions doc/whats_new.rst
@@ -30,6 +30,9 @@ Changelog

- A common testing framework for all estimators was added.

- Decision trees and forests of randomized trees now support multi-output
classification and regression problems, by `Gilles Louppe`

API changes summary
-------------------

2 changes: 1 addition & 1 deletion examples/ensemble/plot_forest_importances_faces.py
@@ -21,7 +21,7 @@
# Number of cores to use to perform parallel fitting of the forest model
n_jobs = 1

# Loading the digits dataset
# Load the faces dataset
data = fetch_olivetti_faces()
X = data.images.reshape((len(data.images), -1))
y = data.target
70 changes: 70 additions & 0 deletions examples/ensemble/plot_forest_multioutput.py
@@ -0,0 +1,70 @@
"""
=========================================
Face completion with multi-output forests
=========================================

This example shows the use of multi-output forests to complete images.
The goal is to predict the lower half of a face given its upper half.

The first row of images shows true faces. The second row illustrates
how the forest completes the lower half of those faces.

"""
print __doc__

import numpy as np
import pylab as pl

from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor


# Load the faces datasets
data = fetch_olivetti_faces()
targets = data.target

data = data.images.reshape((len(data.images), -1))
train = data[targets < 30]
test = data[targets >= 30] # Test on independent people
n_pixels = data.shape[1]

X_train = train[:, :int(0.5 * n_pixels)] # Upper half of the faces
Y_train = train[:, int(0.5 * n_pixels):] # Lower half of the faces
X_test = test[:, :int(0.5 * n_pixels)]
Y_test = test[:, int(0.5 * n_pixels):]

# Build a multi-output forest
forest = ExtraTreesRegressor(n_estimators=10,
max_features=32,
random_state=0)

forest.fit(X_train, Y_train)
Y_test_predict = forest.predict(X_test)

# Plot the completed faces
n_faces = 5
image_shape = (64, 64)

pl.figure(figsize=(2. * n_faces, 2.26 * 2))
pl.suptitle("Face completion with multi-output forests", size=16)

for i in xrange(1, 1 + n_faces):
face_id = np.random.randint(X_test.shape[0])

true_face = np.hstack((X_test[face_id], Y_test[face_id]))
completed_face = np.hstack((X_test[face_id], Y_test_predict[face_id]))

pl.subplot(2, n_faces, i)
pl.axis("off")
pl.imshow(true_face.reshape(image_shape),
cmap=pl.cm.gray,
interpolation="nearest")

pl.subplot(2, n_faces, n_faces + i)
pl.axis("off")
pl.imshow(completed_face.reshape(image_shape),
cmap=pl.cm.gray,
interpolation="nearest")

pl.show()
55 changes: 55 additions & 0 deletions examples/tree/plot_tree_regression_multioutput.py
@@ -0,0 +1,55 @@
"""
===================================================================
Multi-output Decision Tree Regression
===================================================================

Multi-output regression with :ref:`decision trees <tree>`: the decision tree
is used to predict simultaneously the noisy x and y observations of a circle
given a single underlying feature. As a result, it learns local linear
regressions approximating the circle.

We can see that if the maximum depth of the tree (controlled by the
`max_depth` parameter) is set too high, the decision trees learn overly fine
details of the training data and learn from the noise, i.e. they overfit.
"""
print __doc__

import numpy as np

# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(100, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y[::5,:] += (0.5 - rng.rand(20,2))

# Fit regression model
from sklearn.tree import DecisionTreeRegressor

clf_1 = DecisionTreeRegressor(max_depth=2)
clf_2 = DecisionTreeRegressor(max_depth=5)
clf_3 = DecisionTreeRegressor(max_depth=8)
clf_1.fit(X, y)
clf_2.fit(X, y)
clf_3.fit(X, y)

# Predict
X_test = np.arange(-100.0, 100.0, 0.01)[:, np.newaxis]
y_1 = clf_1.predict(X_test)
y_2 = clf_2.predict(X_test)
y_3 = clf_3.predict(X_test)

# Plot the results
import pylab as pl

pl.figure()
pl.scatter(y[:,0], y[:,1], c="k", label="data")
pl.scatter(y_1[:,0], y_1[:,1], c="g", label="max_depth=2")
pl.scatter(y_2[:,0], y_2[:,1], c="r", label="max_depth=5")
pl.scatter(y_3[:,0], y_3[:,1], c="b", label="max_depth=8")
pl.xlim([-6, 6])
pl.ylim([-6, 6])
pl.xlabel("data")
pl.ylabel("target")
pl.title("Multi-output Decision Tree Regression")
pl.legend()
pl.show()