diff --git a/doc/modules/classes.rst b/doc/modules/classes.rst index 78c2e1333d2eb..3aee8f258b9d1 100644 --- a/doc/modules/classes.rst +++ b/doc/modules/classes.rst @@ -322,7 +322,6 @@ Samples generator decomposition.PCA decomposition.IncrementalPCA - decomposition.ProjectedGradientNMF decomposition.KernelPCA decomposition.FactorAnalysis decomposition.FastICA @@ -1058,7 +1057,7 @@ See the :ref:`metrics` section of the user guide for further details. neighbors.DistanceMetric neighbors.KernelDensity neighbors.LocalOutlierFactor - + .. autosummary:: :toctree: generated/ :template: function.rst diff --git a/doc/modules/decomposition.rst b/doc/modules/decomposition.rst index 5b05beb098c5a..a473b31dd812f 100644 --- a/doc/modules/decomposition.rst +++ b/doc/modules/decomposition.rst @@ -648,27 +648,26 @@ components with some sparsity: Non-negative matrix factorization (NMF or NNMF) =============================================== -:class:`NMF` is an alternative approach to decomposition that assumes that the +NMF with the Frobenius norm +--------------------------- + +:class:`NMF` [1]_ is an alternative approach to decomposition that assumes that the data and the components are non-negative. :class:`NMF` can be plugged in instead of :class:`PCA` or its variants, in the cases where the data matrix -does not contain negative values. -It finds a decomposition of samples :math:`X` -into two matrices :math:`W` and :math:`H` of non-negative elements, -by optimizing for the squared Frobenius norm: +does not contain negative values. It finds a decomposition of samples +:math:`X` into two matrices :math:`W` and :math:`H` of non-negative elements, +by optimizing the distance :math:`d` between :math:`X` and the matrix product +:math:`WH`. The most widely used distance function is the squared Frobenius +norm, which is an obvious extension of the Euclidean norm to matrices: .. math:: - \arg\min_{W,H} \frac{1}{2} ||X - WH||_{Fro}^2 = \frac{1}{2} \sum_{i,j} (X_{ij} - {WH}_{ij})^2 - -This norm is an obvious extension of the Euclidean norm to matrices. (Other -optimization objectives have been suggested in the NMF literature, in -particular Kullback-Leibler divergence, but these are not currently -implemented.) + d_{Fro}(X, Y) = \frac{1}{2} ||X - Y||_{Fro}^2 = \frac{1}{2} \sum_{i,j} (X_{ij} - {Y}_{ij})^2 Unlike :class:`PCA`, the representation of a vector is obtained in an additive fashion, by superimposing the components, without subtracting. Such additive models are efficient for representing images and text. -It has been observed in [Hoyer, 04] that, when carefully constrained, +It has been observed in [Hoyer, 2004] [2]_ that, when carefully constrained, :class:`NMF` can produce a parts-based representation of the dataset, resulting in interpretable models. The following example displays 16 sparse components found by :class:`NMF` from the images in the Olivetti @@ -686,8 +685,8 @@ faces dataset, in comparison with the PCA eigenfaces. The :attr:`init` attribute determines the initialization method applied, which -has a great impact on the performance of the method. :class:`NMF` implements -the method Nonnegative Double Singular Value Decomposition. NNDSVD is based on +has a great impact on the performance of the method. :class:`NMF` implements the +method Nonnegative Double Singular Value Decomposition. NNDSVD [4]_ is based on two SVD processes, one approximating the data matrix, the other approximating positive sections of the resulting partial SVD factors utilizing an algebraic property of unit rank matrices. 
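As a minimal sketch (the data here is arbitrary and only illustrates the parameter), the initialization scheme is selected through the :attr:`init` parameter of :class:`NMF`::

    >>> import numpy as np
    >>> from sklearn.decomposition import NMF
    >>> X = np.abs(np.random.RandomState(0).randn(6, 4))
    >>> W = NMF(n_components=2, init='nndsvd', random_state=0).fit_transform(X)
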
The basic NNDSVD algorithm is better fit for @@ -696,6 +695,11 @@ the mean of all elements of the data), and NNDSVDar (in which the zeros are set to random perturbations less than the mean of the data divided by 100) are recommended in the dense case. +Note that the Multiplicative Update ('mu') solver cannot update zeros present in +the initialization, so it leads to poorer results when used jointly with the +basic NNDSVD algorithm which introduces a lot of zeros; in this case, NNDSVDa or +NNDSVDar should be preferred. + :class:`NMF` can also be initialized with correctly scaled random non-negative matrices by setting :attr:`init="random"`. An integer seed or a ``RandomState`` can also be passed to :attr:`random_state` to control @@ -716,7 +720,7 @@ and the intensity of the regularization with the :attr:`alpha` and the regularized objective function is: .. math:: - \frac{1}{2}||X - WH||_{Fro}^2 + d_{Fro}(X, WH) + \alpha \rho ||W||_1 + \alpha \rho ||H||_1 + \frac{\alpha(1-\rho)}{2} ||W||_{Fro} ^ 2 + \frac{\alpha(1-\rho)}{2} ||H||_{Fro} ^ 2 @@ -725,35 +729,100 @@ and the regularized objective function is: :func:`non_negative_factorization` allows a finer control through the :attr:`regularization` attribute, and may regularize only W, only H, or both. +NMF with a beta-divergence +-------------------------- + +As described previously, the most widely used distance function is the squared +Frobenius norm, which is an obvious extension of the Euclidean norm to +matrices: + +.. math:: + d_{Fro}(X, Y) = \frac{1}{2} ||X - Y||_{Fro}^2 = \frac{1}{2} \sum_{i,j} (X_{ij} - {Y}_{ij})^2 + +Other distance functions can be used in NMF as, for example, the (generalized) +Kullback-Leibler (KL) divergence, also referred as I-divergence: + +.. math:: + d_{KL}(X, Y) = \sum_{i,j} (X_{ij} log(\frac{X_{ij}}{Y_{ij}}) - X_{ij} + Y_{ij}) + +Or, the Itakura-Saito (IS) divergence: + +.. math:: + d_{IS}(X, Y) = \sum_{i,j} (\frac{X_{ij}}{Y_{ij}} - log(\frac{X_{ij}}{Y_{ij}}) - 1) + +These three distances are special cases of the beta-divergence family, with +:math:`\beta = 2, 1, 0` respectively [6]_. The beta-divergence are +defined by : + +.. math:: + d_{\beta}(X, Y) = \sum_{i,j} \frac{1}{\beta(\beta - 1)}(X_{ij}^\beta + (\beta-1)Y_{ij}^\beta - \beta X_{ij} Y_{ij}^{\beta - 1}) + +.. figure:: ../auto_examples/decomposition/images/sphx_glr_plot_beta_divergence_001.png + :target: ../auto_examples/decomposition/plot_beta_divergence.html + :align: center + :scale: 75% + +Note that this definition is not valid if :math:`\beta \in (0; 1)`, yet it can +be continously extended to the definitions of :math:`d_{KL}` and :math:`d_{IS}` +respectively. + +:class:`NMF` implements two solvers, using Coordinate Descent ('cd') [5]_, and +Multiplicative Update ('mu') [6]_. The 'mu' solver can optimize every +beta-divergence, including of course the Frobenius norm (:math:`\beta=2`), the +(generalized) Kullback-Leibler divergence (:math:`\beta=1`) and the +Itakura-Saito divergence (:math:`\beta=0`). Note that for +:math:`\beta \in (1; 2)`, the 'mu' solver is significantly faster than for other +values of :math:`\beta`. Note also that with a negative (or 0, i.e. +'itakura-saito') :math:`\beta`, the input matrix cannot contain zero values. + +The 'cd' solver can only optimize the Frobenius norm. Due to the +underlying non-convexity of NMF, the different solvers may converge to +different minima, even when optimizing the same distance function. + +NMF is best used with the ``fit_transform`` method, which returns the matrix W. 
+The matrix H is stored into the fitted model in the ``components_`` attribute; +the method ``transform`` will decompose a new matrix X_new based on these +stored components:: + + >>> import numpy as np + >>> X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]]) + >>> from sklearn.decomposition import NMF + >>> model = NMF(n_components=2, init='random', random_state=0) + >>> W = model.fit_transform(X) + >>> H = model.components_ + >>> X_new = np.array([[1, 0], [1, 6.1], [1, 0], [1, 4], [3.2, 1], [0, 4]]) + >>> W_new = model.transform(X_new) + .. topic:: Examples: * :ref:`sphx_glr_auto_examples_decomposition_plot_faces_decomposition.py` * :ref:`sphx_glr_auto_examples_applications_topics_extraction_with_nmf_lda.py` + * :ref:`sphx_glr_auto_examples_decomposition_plot_beta_divergence.py` .. topic:: References: - * `"Learning the parts of objects by non-negative matrix factorization" + .. [1] `"Learning the parts of objects by non-negative matrix factorization" `_ D. Lee, S. Seung, 1999 - * `"Non-negative Matrix Factorization with Sparseness Constraints" + .. [2] `"Non-negative Matrix Factorization with Sparseness Constraints" `_ P. Hoyer, 2004 - * `"Projected gradient methods for non-negative matrix factorization" - `_ - C.-J. Lin, 2007 - - * `"SVD based initialization: A head start for nonnegative + .. [4] `"SVD based initialization: A head start for nonnegative matrix factorization" `_ C. Boutsidis, E. Gallopoulos, 2008 - * `"Fast local algorithms for large scale nonnegative matrix and tensor + .. [5] `"Fast local algorithms for large scale nonnegative matrix and tensor factorizations." `_ A. Cichocki, P. Anh-Huy, 2009 + .. [6] `"Algorithms for nonnegative matrix factorization with the beta-divergence" + `_ + C. Fevotte, J. Idier, 2011 + .. _LatentDirichletAllocation: diff --git a/doc/whats_new.rst b/doc/whats_new.rst index 02f09d885cb14..4028f40cc32d4 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -35,6 +35,12 @@ New features detection based on nearest neighbors. :issue:`5279` by `Nicolas Goix`_ and `Alexandre Gramfort`_. + - The new solver ``mu`` implements a Multiplicate Update in + :class:`decomposition.NMF`, allowing the optimization of all + beta-divergences, including the Frobenius norm, the generalized + Kullback-Leibler divergence and the Itakura-Saito divergence. + By `Tom Dupre la Tour`_. + Enhancements ............ @@ -152,7 +158,7 @@ Bug fixes with SVD and Eigen solver are now of the same length. :issue:`7632` by :user:`JPFrancoia ` - - Fixes issue in :ref:`univariate_feature_selection` where score + - Fixes issue in :ref:`univariate_feature_selection` where score functions were not accepting multi-label targets. :issue:`7676` by `Mohammed Affan`_ @@ -382,7 +388,7 @@ Other estimators - New :class:`mixture.GaussianMixture` and :class:`mixture.BayesianGaussianMixture` replace former mixture models, employing faster inference - for sounder results. :issue:`7295` by :user:`Wei Xue ` and + for sounder results. :issue:`7295` by :user:`Wei Xue ` and :user:`Thierry Guillemot `. - Class :class:`decomposition.RandomizedPCA` is now factored into :class:`decomposition.PCA` @@ -505,7 +511,7 @@ Decomposition, manifold learning and clustering - :class:`cluster.KMeans` and :class:`cluster.MiniBatchKMeans` now works with ``np.float32`` and ``np.float64`` input data without converting it. This allows to reduce the memory consumption by using ``np.float32``. - :issue:`6846` by :user:`Sebastian Säger ` and + :issue:`6846` by :user:`Sebastian Säger ` and :user:`YenChen Lin `. 
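As a minimal sketch of the new solver (reusing the small example matrix from the documentation above; the parameter values are only illustrative), the loss is selected with ``beta_loss`` together with ``solver='mu'``::

    >>> import numpy as np
    >>> from sklearn.decomposition import NMF
    >>> X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
    >>> model = NMF(n_components=2, solver='mu', beta_loss='kullback-leibler',
    ...             init='random', random_state=0, max_iter=1000)
    >>> W = model.fit_transform(X)
    >>> H = model.components_
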
Preprocessing and feature selection @@ -514,7 +520,7 @@ Preprocessing and feature selection :issue:`5929` by :user:`Konstantin Podshumok `. - :class:`feature_extraction.FeatureHasher` now accepts string values. - :issue:`6173` by :user:`Ryad Zenine ` and + :issue:`6173` by :user:`Ryad Zenine ` and :user:`Devashish Deshpande `. - Keyword arguments can now be supplied to ``func`` in @@ -528,7 +534,7 @@ Preprocessing and feature selection Model evaluation and meta-estimators - :class:`multiclass.OneVsOneClassifier` and :class:`multiclass.OneVsRestClassifier` - now support ``partial_fit``. By :user:`Asish Panda ` and + now support ``partial_fit``. By :user:`Asish Panda ` and :user:`Philipp Dowling `. - Added support for substituting or disabling :class:`pipeline.Pipeline` @@ -556,7 +562,7 @@ Metrics - Added ``labels`` flag to :class:`metrics.log_loss` to to explicitly provide the labels when the number of classes in ``y_true`` and ``y_pred`` differ. - :issue:`7239` by :user:`Hong Guangguo ` with help from + :issue:`7239` by :user:`Hong Guangguo ` with help from :user:`Mads Jensen ` and :user:`Nelson Liu `. - Support sparse contingency matrices in cluster evaluation @@ -676,7 +682,7 @@ Decomposition, manifold learning and clustering - Fixed incorrect initialization of :func:`utils.arpack.eigsh` on all occurrences. Affects :class:`cluster.bicluster.SpectralBiclustering`, :class:`decomposition.KernelPCA`, :class:`manifold.LocallyLinearEmbedding`, - and :class:`manifold.SpectralEmbedding` (:issue:`5012`). By + and :class:`manifold.SpectralEmbedding` (:issue:`5012`). By :user:`Peter Fischer `. - Attribute ``explained_variance_ratio_`` calculated with the SVD solver @@ -959,7 +965,7 @@ New features :class:`cross_validation.LabelShuffleSplit` generate train-test folds, respectively similar to :class:`cross_validation.KFold` and :class:`cross_validation.ShuffleSplit`, except that the folds are - conditioned on a label array. By `Brian McFee`_, :user:`Jean + conditioned on a label array. By `Brian McFee`_, :user:`Jean Kossaifi ` and `Gilles Louppe`_. - :class:`decomposition.LatentDirichletAllocation` implements the Latent @@ -1049,7 +1055,7 @@ Enhancements By `Trevor Stephens`_. - Provide an option for sparse output from - :func:`sklearn.metrics.pairwise.cosine_similarity`. By + :func:`sklearn.metrics.pairwise.cosine_similarity`. By :user:`Jaidev Deshpande `. - Add :func:`minmax_scale` to provide a function interface for @@ -1260,7 +1266,7 @@ Bug fixes By `Tom Dupre la Tour`_. - Fixed bug :issue:`5495` when - doing OVR(SVC(decision_function_shape="ovr")). Fixed by + doing OVR(SVC(decision_function_shape="ovr")). Fixed by :user:`Elvis Dohmatob `. diff --git a/examples/applications/topics_extraction_with_nmf_lda.py b/examples/applications/topics_extraction_with_nmf_lda.py index d5b9cbfc5af44..d4ed9607073c7 100644 --- a/examples/applications/topics_extraction_with_nmf_lda.py +++ b/examples/applications/topics_extraction_with_nmf_lda.py @@ -9,6 +9,10 @@ The output is a list of topics, each represented as a list of terms (weights are not shown). +Non-negative Matrix Factorization is applied with two different objective +functions: the Frobenius norm, and the generalized Kullback-Leibler divergence. +The latter is equivalent to Probabilistic Latent Semantic Indexing. + The default parameters (n_samples / n_features / n_topics) should make the example runnable in a couple of tens of seconds. 
You can try to increase the dimensions of the problem, but be aware that the time @@ -36,9 +40,10 @@ def print_top_words(model, feature_names, n_top_words): for topic_idx, topic in enumerate(model.components_): - print("Topic #%d:" % topic_idx) - print(" ".join([feature_names[i] - for i in topic.argsort()[:-n_top_words - 1:-1]])) + message = "Topic #%d: " % topic_idx + message += " ".join([feature_names[i] + for i in topic.argsort()[:-n_top_words - 1:-1]]) + print(message) print() @@ -71,9 +76,10 @@ def print_top_words(model, feature_names, n_top_words): t0 = time() tf = tf_vectorizer.fit_transform(data_samples) print("done in %0.3fs." % (time() - t0)) +print() # Fit the NMF model -print("Fitting the NMF model with tf-idf features, " +print("Fitting the NMF model (Frobenius norm) with tf-idf features, " "n_samples=%d and n_features=%d..." % (n_samples, n_features)) t0 = time() @@ -81,7 +87,20 @@ def print_top_words(model, feature_names, n_top_words): alpha=.1, l1_ratio=.5).fit(tfidf) print("done in %0.3fs." % (time() - t0)) -print("\nTopics in NMF model:") +print("\nTopics in NMF model (Frobenius norm):") +tfidf_feature_names = tfidf_vectorizer.get_feature_names() +print_top_words(nmf, tfidf_feature_names, n_top_words) + +# Fit the NMF model +print("Fitting the NMF model (generalized Kullback-Leibler divergence) with " + "tf-idf features, n_samples=%d and n_features=%d..." + % (n_samples, n_features)) +t0 = time() +nmf = NMF(n_components=n_topics, random_state=1, beta_loss='kullback-leibler', + solver='mu', max_iter=1000, alpha=.1, l1_ratio=.5).fit(tfidf) +print("done in %0.3fs." % (time() - t0)) + +print("\nTopics in NMF model (generalized Kullback-Leibler divergence):") tfidf_feature_names = tfidf_vectorizer.get_feature_names() print_top_words(nmf, tfidf_feature_names, n_top_words) diff --git a/examples/decomposition/plot_beta_divergence.py b/examples/decomposition/plot_beta_divergence.py new file mode 100644 index 0000000000000..f5029ffcf5001 --- /dev/null +++ b/examples/decomposition/plot_beta_divergence.py @@ -0,0 +1,29 @@ +""" +============================== +Beta-divergence loss functions +============================== + +A plot that compares the various Beta-divergence loss functions supported by +the Multiplicative-Update ('mu') solver in :class:`sklearn.decomposition.NMF`. 
+""" +import numpy as np +import matplotlib.pyplot as plt +from sklearn.decomposition.nmf import _beta_divergence + +print(__doc__) + +x = np.linspace(0.001, 4, 1000) +y = np.zeros(x.shape) + +colors = 'mbgyr' +for j, beta in enumerate((0., 0.5, 1., 1.5, 2.)): + for i, xi in enumerate(x): + y[i] = _beta_divergence(1, xi, 1, beta) + name = "beta = %1.1f" % beta + plt.plot(x, y, label=name, color=colors[j]) + +plt.xlabel("x") +plt.title("beta-divergence(1, x)") +plt.legend(loc=0) +plt.axis([0, 4, 0, 3]) +plt.show() diff --git a/sklearn/decomposition/nmf.py b/sklearn/decomposition/nmf.py index 3b71079d995fe..63026e3ad43bd 100644 --- a/sklearn/decomposition/nmf.py +++ b/sklearn/decomposition/nmf.py @@ -4,9 +4,6 @@ # Lars Buitinck # Mathieu Blondel # Tom Dupre la Tour -# Author: Chih-Jen Lin, National Taiwan University (original projected gradient -# NMF implementation) -# Author: Anthony Di Franco (Projected gradient, Python and NumPy port) # License: BSD 3 clause @@ -15,6 +12,7 @@ from math import sqrt import warnings import numbers +import time import numpy as np import scipy.sparse as sp @@ -22,22 +20,16 @@ from ..base import BaseEstimator, TransformerMixin from ..utils import check_random_state, check_array from ..utils.extmath import randomized_svd, safe_sparse_dot, squared_norm -from ..utils.extmath import fast_dot +from ..utils.extmath import fast_dot, safe_min from ..utils.validation import check_is_fitted, check_non_negative from ..exceptions import ConvergenceWarning from .cdnmf_fast import _update_cdnmf_fast +EPSILON = np.finfo(np.float32).eps INTEGER_TYPES = (numbers.Integral, np.integer) -def safe_vstack(Xs): - if any(sp.issparse(X) for X in Xs): - return sp.vstack(Xs) - else: - return np.vstack(Xs) - - def norm(x): """Dot product-based Euclidean norm implementation @@ -61,16 +53,181 @@ def _check_init(A, shape, whom): raise ValueError('Array passed to %s is full of zeros.' % whom) -def _safe_compute_error(X, W, H): - """Frobenius norm between X and WH, safe for sparse array""" +def _beta_divergence(X, W, H, beta, square_root=False): + """Compute the beta-divergence of X and dot(W, H). + + Parameters + ---------- + X : float or array-like, shape (n_samples, n_features) + + W : float or dense array-like, shape (n_samples, n_components) + + H : float or dense array-like, shape (n_components, n_features) + + beta : float, string in {'frobenius', 'kullback-leibler', 'itakura-saito'} + Parameter of the beta-divergence. + If beta == 2, this is half the Frobenius *squared* norm. + If beta == 1, this is the generalized Kullback-Leibler divergence. + If beta == 0, this is the Itakura-Saito divergence. + Else, this is the general beta-divergence. + + square_root : boolean, default False + If True, return np.sqrt(2 * res) + For beta == 2, it corresponds to the Frobenius norm. + + Returns + ------- + res : float + Beta divergence of X and np.dot(X, H) + """ + beta = _beta_loss_to_float(beta) + + # The method can be called with scalars if not sp.issparse(X): - error = norm(X - np.dot(W, H)) + X = np.atleast_2d(X) + W = np.atleast_2d(W) + H = np.atleast_2d(H) + + # Frobenius norm + if beta == 2: + # Avoid the creation of the dense np.dot(W, H) if X is sparse. + if sp.issparse(X): + norm_X = np.dot(X.data, X.data) + norm_WH = trace_dot(np.dot(np.dot(W.T, W), H), H) + cross_prod = trace_dot((X * H.T), W) + res = (norm_X + norm_WH - 2. * cross_prod) / 2. + else: + res = squared_norm(X - np.dot(W, H)) / 2. 
+ + if square_root: + return np.sqrt(res * 2) + else: + return res + + if sp.issparse(X): + # compute np.dot(W, H) only where X is nonzero + WH_data = _special_sparse_dot(W, H, X).data + X_data = X.data + else: + WH = fast_dot(W, H) + WH_data = WH.ravel() + X_data = X.ravel() + + # do not affect the zeros: here 0 ** (-1) = 0 and not infinity + WH_data = WH_data[X_data != 0] + X_data = X_data[X_data != 0] + + # used to avoid division by zero + WH_data[WH_data == 0] = EPSILON + + # generalized Kullback-Leibler divergence + if beta == 1: + # fast and memory efficient computation of np.sum(np.dot(W, H)) + sum_WH = np.dot(np.sum(W, axis=0), np.sum(H, axis=1)) + # computes np.sum(X * log(X / WH)) only where X is nonzero + div = X_data / WH_data + res = np.dot(X_data, np.log(div)) + # add full np.sum(np.dot(W, H)) - np.sum(X) + res += sum_WH - X_data.sum() + + # Itakura-Saito divergence + elif beta == 0: + div = X_data / WH_data + res = np.sum(div) - np.product(X.shape) - np.sum(np.log(div)) + + # beta-divergence, beta not in (0, 1, 2) + else: + if sp.issparse(X): + # slow loop, but memory efficient computation of : + # np.sum(np.dot(W, H) ** beta) + sum_WH_beta = 0 + for i in range(X.shape[1]): + sum_WH_beta += np.sum(fast_dot(W, H[:, i]) ** beta) + + else: + sum_WH_beta = np.sum(WH ** beta) + + sum_X_WH = np.dot(X_data, WH_data ** (beta - 1)) + res = (X_data ** beta).sum() - beta * sum_X_WH + res += sum_WH_beta * (beta - 1) + res /= beta * (beta - 1) + + if square_root: + return np.sqrt(2 * res) else: - norm_X = np.dot(X.data, X.data) - norm_WH = trace_dot(np.dot(np.dot(W.T, W), H), H) - cross_prod = trace_dot((X * H.T), W) - error = sqrt(norm_X + norm_WH - 2. * cross_prod) - return error + return res + + +def _special_sparse_dot(W, H, X): + """Computes np.dot(W, H), only where X is non zero.""" + if sp.issparse(X): + ii, jj = X.nonzero() + dot_vals = np.multiply(W[ii, :], H.T[jj, :]).sum(axis=1) + WH = sp.coo_matrix((dot_vals, (ii, jj)), shape=X.shape) + return WH.tocsr() + else: + return fast_dot(W, H) + + +def _compute_regularization(alpha, l1_ratio, regularization): + """Compute L1 and L2 regularization coefficients for W and H""" + alpha_H = 0. + alpha_W = 0. + if regularization in ('both', 'components'): + alpha_H = float(alpha) + if regularization in ('both', 'transformation'): + alpha_W = float(alpha) + + l1_reg_W = alpha_W * l1_ratio + l1_reg_H = alpha_H * l1_ratio + l2_reg_W = alpha_W * (1. - l1_ratio) + l2_reg_H = alpha_H * (1. - l1_ratio) + return l1_reg_W, l1_reg_H, l2_reg_W, l2_reg_H + + +def _check_string_param(solver, regularization, beta_loss, init): + allowed_solver = ('cd', 'mu') + if solver not in allowed_solver: + raise ValueError( + 'Invalid solver parameter: got %r instead of one of %r' % + (solver, allowed_solver)) + + allowed_regularization = ('both', 'components', 'transformation', None) + if regularization not in allowed_regularization: + raise ValueError( + 'Invalid regularization parameter: got %r instead of one of %r' % + (regularization, allowed_regularization)) + + # 'mu' is the only solver that handles other beta losses than 'frobenius' + if solver != 'mu' and beta_loss not in (2, 'frobenius'): + raise ValueError( + 'Invalid beta_loss parameter: solver %r does not handle beta_loss' + ' = %r' % (solver, beta_loss)) + + if solver == 'mu' and init == 'nndsvd': + warnings.warn("The multiplicative update ('mu') solver cannot update " + "zeros present in the initialization, and so leads to " + "poorer results when used jointly with init='nndsvd'. 
" + "You may try init='nndsvda' or init='nndsvdar' instead.", + UserWarning) + + beta_loss = _beta_loss_to_float(beta_loss) + return beta_loss + + +def _beta_loss_to_float(beta_loss): + """Convert string beta_loss to float""" + allowed_beta_loss = {'frobenius': 2, + 'kullback-leibler': 1, + 'itakura-saito': 0} + if isinstance(beta_loss, str) and beta_loss in allowed_beta_loss: + beta_loss = allowed_beta_loss[beta_loss] + + if not isinstance(beta_loss, numbers.Number): + raise ValueError('Invalid beta_loss parameter: got %r instead ' + 'of one of %r, or a float.' % + (beta_loss, allowed_beta_loss.keys())) + return beta_loss def _initialize_nmf(X, n_components, init=None, eps=1e-6, @@ -90,7 +247,7 @@ def _initialize_nmf(X, n_components, init=None, eps=1e-6, init : None | 'random' | 'nndsvd' | 'nndsvda' | 'nndsvdar' Method used to initialize the procedure. - Default: 'nndsvdar' if n_components < n_features, otherwise 'random'. + Default: 'nndsvd' if n_components < n_features, otherwise 'random'. Valid options: - 'random': non-negative random matrices, scaled with: @@ -209,121 +366,6 @@ def _initialize_nmf(X, n_components, init=None, eps=1e-6, return W, H -def _nls_subproblem(V, W, H, tol, max_iter, alpha=0., l1_ratio=0., - sigma=0.01, beta=0.1): - """Non-negative least square solver - - Solves a non-negative least squares subproblem using the projected - gradient descent algorithm. - - Parameters - ---------- - V : array-like, shape (n_samples, n_features) - Constant matrix. - - W : array-like, shape (n_samples, n_components) - Constant matrix. - - H : array-like, shape (n_components, n_features) - Initial guess for the solution. - - tol : float - Tolerance of the stopping condition. - - max_iter : int - Maximum number of iterations before timing out. - - alpha : double, default: 0. - Constant that multiplies the regularization terms. Set it to zero to - have no regularization. - - l1_ratio : double, default: 0. - The regularization mixing parameter, with 0 <= l1_ratio <= 1. - For l1_ratio = 0 the penalty is an L2 penalty. - For l1_ratio = 1 it is an L1 penalty. - For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. - - sigma : float - Constant used in the sufficient decrease condition checked by the line - search. Smaller values lead to a looser sufficient decrease condition, - thus reducing the time taken by the line search, but potentially - increasing the number of iterations of the projected gradient - procedure. 0.01 is a commonly used value in the optimization - literature. - - beta : float - Factor by which the step size is decreased (resp. increased) until - (resp. as long as) the sufficient decrease condition is satisfied. - Larger values allow to find a better step size but lead to longer line - search. 0.1 is a commonly used value in the optimization literature. - - Returns - ------- - H : array-like, shape (n_components, n_features) - Solution to the non-negative least squares problem. - - grad : array-like, shape (n_components, n_features) - The gradient. - - n_iter : int - The number of iterations done by the algorithm. - - References - ---------- - C.-J. Lin. Projected gradient methods for non-negative matrix - factorization. Neural Computation, 19(2007), 2756-2779. 
- http://www.csie.ntu.edu.tw/~cjlin/nmf/ - """ - WtV = safe_sparse_dot(W.T, V) - WtW = fast_dot(W.T, W) - - # values justified in the paper (alpha is renamed gamma) - gamma = 1 - for n_iter in range(1, max_iter + 1): - grad = np.dot(WtW, H) - WtV - if alpha > 0 and l1_ratio == 1.: - grad += alpha - elif alpha > 0: - grad += alpha * (l1_ratio + (1 - l1_ratio) * H) - - # The following multiplication with a boolean array is more than twice - # as fast as indexing into grad. - if norm(grad * np.logical_or(grad < 0, H > 0)) < tol: - break - - Hp = H - - for inner_iter in range(20): - # Gradient step. - Hn = H - gamma * grad - # Projection step. - Hn *= Hn > 0 - d = Hn - H - gradd = np.dot(grad.ravel(), d.ravel()) - dQd = np.dot(np.dot(WtW, d).ravel(), d.ravel()) - suff_decr = (1 - sigma) * gradd + 0.5 * dQd < 0 - if inner_iter == 0: - decr_gamma = not suff_decr - - if decr_gamma: - if suff_decr: - H = Hn - break - else: - gamma *= beta - elif not suff_decr or (Hp == Hn).all(): - H = Hp - break - else: - gamma /= beta - Hp = Hn - - if n_iter == max_iter: - warnings.warn("Iteration limit reached in nls subproblem.") - - return H, grad, n_iter - - def _update_coordinate_descent(X, W, Ht, l1_reg, l2_reg, shuffle, random_state): """Helper function for _fit_coordinate_descent @@ -355,8 +397,8 @@ def _update_coordinate_descent(X, W, Ht, l1_reg, l2_reg, shuffle, return _update_cdnmf_fast(W, HHt, XHt, permutation) -def _fit_coordinate_descent(X, W, H, tol=1e-4, max_iter=200, alpha=0.001, - l1_ratio=0., regularization=None, update_H=True, +def _fit_coordinate_descent(X, W, H, tol=1e-4, max_iter=200, l1_reg_W=0, + l1_reg_H=0, l2_reg_W=0, l2_reg_H=0, update_H=True, verbose=0, shuffle=False, random_state=None): """Compute Non-negative Matrix Factorization (NMF) with Coordinate Descent @@ -381,18 +423,17 @@ def _fit_coordinate_descent(X, W, H, tol=1e-4, max_iter=200, alpha=0.001, max_iter : integer, default: 200 Maximum number of iterations before timing out. - alpha : double, default: 0. - Constant that multiplies the regularization terms. + l1_reg_W : double, default: 0. + L1 regularization parameter for W. - l1_ratio : double, default: 0. - The regularization mixing parameter, with 0 <= l1_ratio <= 1. - For l1_ratio = 0 the penalty is an L2 penalty. - For l1_ratio = 1 it is an L1 penalty. - For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. + l1_reg_H : double, default: 0. + L1 regularization parameter for H. - regularization : 'both' | 'components' | 'transformation' | None - Select whether the regularization affects the components (H), the - transformation (W), both or none of them. + l2_reg_W : double, default: 0. + L2 regularization parameter for W. + + l2_reg_H : double, default: 0. + L2 regularization parameter for H. update_H : boolean, default: True Set to True, both W and H will be estimated from initial guesses. @@ -429,29 +470,18 @@ def _fit_coordinate_descent(X, W, H, tol=1e-4, max_iter=200, alpha=0.001, Ht = check_array(H.T, order='C') X = check_array(X, accept_sparse='csr') - # L1 and L2 regularization - l1_H, l2_H, l1_W, l2_W = 0, 0, 0, 0 - if regularization in ('both', 'components'): - alpha = float(alpha) - l1_H = l1_ratio * alpha - l2_H = (1. - l1_ratio) * alpha - if regularization in ('both', 'transformation'): - alpha = float(alpha) - l1_W = l1_ratio * alpha - l2_W = (1. - l1_ratio) * alpha - rng = check_random_state(random_state) for n_iter in range(max_iter): violation = 0. 
# Update W - violation += _update_coordinate_descent(X, W, Ht, l1_W, l2_W, - shuffle, rng) + violation += _update_coordinate_descent(X, W, Ht, l1_reg_W, + l2_reg_W, shuffle, rng) # Update H if update_H: - violation += _update_coordinate_descent(X.T, Ht, W, l1_H, l2_H, - shuffle, rng) + violation += _update_coordinate_descent(X.T, Ht, W, l1_reg_H, + l2_reg_H, shuffle, rng) if n_iter == 0: violation_init = violation @@ -470,9 +500,307 @@ def _fit_coordinate_descent(X, W, H, tol=1e-4, max_iter=200, alpha=0.001, return W, Ht.T, n_iter +def _multiplicative_update_w(X, W, H, beta_loss, l1_reg_W, l2_reg_W, gamma, + H_sum=None, HHt=None, XHt=None, update_H=True): + """update W in Multiplicative Update NMF""" + if beta_loss == 2: + # Numerator + if XHt is None: + XHt = safe_sparse_dot(X, H.T) + if update_H: + # avoid a copy of XHt, which will be re-computed (update_H=True) + numerator = XHt + else: + # preserve the XHt, which is not re-computed (update_H=False) + numerator = XHt.copy() + + # Denominator + if HHt is None: + HHt = fast_dot(H, H.T) + denominator = fast_dot(W, HHt) + + else: + # Numerator + # if X is sparse, compute WH only where X is non zero + WH_safe_X = _special_sparse_dot(W, H, X) + if sp.issparse(X): + WH_safe_X_data = WH_safe_X.data + X_data = X.data + else: + WH_safe_X_data = WH_safe_X + X_data = X + # copy used in the Denominator + WH = WH_safe_X.copy() + if beta_loss - 1. < 0: + WH[WH == 0] = EPSILON + + # to avoid taking a negative power of zero + if beta_loss - 2. < 0: + WH_safe_X_data[WH_safe_X_data == 0] = EPSILON + + if beta_loss == 1: + np.divide(X_data, WH_safe_X_data, out=WH_safe_X_data) + else: + WH_safe_X_data **= beta_loss - 2 + # element-wise multiplication + WH_safe_X_data *= X_data + + # here numerator = dot(X * (dot(W, H) ** (beta_loss - 2)), H.T) + numerator = safe_sparse_dot(WH_safe_X, H.T) + + # Denominator + if beta_loss == 1: + if H_sum is None: + H_sum = np.sum(H, axis=1) # shape(n_components, ) + denominator = H_sum[np.newaxis, :] + + else: + # computation of WHHt = dot(dot(W, H) ** beta_loss - 1, H.T) + if sp.issparse(X): + # memory efficient computation + # (compute row by row, avoiding the dense matrix WH) + WHHt = np.empty(W.shape) + for i in range(X.shape[0]): + WHi = fast_dot(W[i, :], H) + if beta_loss - 1 < 0: + WHi[WHi == 0] = EPSILON + WHi **= beta_loss - 1 + WHHt[i, :] = fast_dot(WHi, H.T) + else: + WH **= beta_loss - 1 + WHHt = fast_dot(WH, H.T) + denominator = WHHt + + # Add L1 and L2 regularization + if l1_reg_W > 0: + denominator += l1_reg_W + if l2_reg_W > 0: + denominator = denominator + l2_reg_W * W + denominator[denominator == 0] = EPSILON + + numerator /= denominator + delta_W = numerator + + # gamma is in ]0, 1] + if gamma != 1: + delta_W **= gamma + + return delta_W, H_sum, HHt, XHt + + +def _multiplicative_update_h(X, W, H, beta_loss, l1_reg_H, l2_reg_H, gamma): + """update H in Multiplicative Update NMF""" + if beta_loss == 2: + numerator = safe_sparse_dot(W.T, X) + denominator = fast_dot(fast_dot(W.T, W), H) + + else: + # Numerator + WH_safe_X = _special_sparse_dot(W, H, X) + if sp.issparse(X): + WH_safe_X_data = WH_safe_X.data + X_data = X.data + else: + WH_safe_X_data = WH_safe_X + X_data = X + # copy used in the Denominator + WH = WH_safe_X.copy() + if beta_loss - 1. < 0: + WH[WH == 0] = EPSILON + + # to avoid division by zero + if beta_loss - 2. 
< 0: + WH_safe_X_data[WH_safe_X_data == 0] = EPSILON + + if beta_loss == 1: + np.divide(X_data, WH_safe_X_data, out=WH_safe_X_data) + else: + WH_safe_X_data **= beta_loss - 2 + # element-wise multiplication + WH_safe_X_data *= X_data + + # here numerator = dot(W.T, (dot(W, H) ** (beta_loss - 2)) * X) + numerator = safe_sparse_dot(W.T, WH_safe_X) + + # Denominator + if beta_loss == 1: + W_sum = np.sum(W, axis=0) # shape(n_components, ) + W_sum[W_sum == 0] = 1. + denominator = W_sum[:, np.newaxis] + + # beta_loss not in (1, 2) + else: + # computation of WtWH = dot(W.T, dot(W, H) ** beta_loss - 1) + if sp.issparse(X): + # memory efficient computation + # (compute column by column, avoiding the dense matrix WH) + WtWH = np.empty(H.shape) + for i in range(X.shape[1]): + WHi = fast_dot(W, H[:, i]) + if beta_loss - 1 < 0: + WHi[WHi == 0] = EPSILON + WHi **= beta_loss - 1 + WtWH[:, i] = fast_dot(W.T, WHi) + else: + WH **= beta_loss - 1 + WtWH = fast_dot(W.T, WH) + denominator = WtWH + + # Add L1 and L2 regularization + if l1_reg_H > 0: + denominator += l1_reg_H + if l2_reg_H > 0: + denominator = denominator + l2_reg_H * H + denominator[denominator == 0] = EPSILON + + numerator /= denominator + delta_H = numerator + + # gamma is in ]0, 1] + if gamma != 1: + delta_H **= gamma + + return delta_H + + +def _fit_multiplicative_update(X, W, H, beta_loss='frobenius', + max_iter=200, tol=1e-4, + l1_reg_W=0, l1_reg_H=0, l2_reg_W=0, l2_reg_H=0, + update_H=True, verbose=0): + """Compute Non-negative Matrix Factorization with Multiplicative Update + + The objective function is _beta_divergence(X, WH) and is minimized with an + alternating minimization of W and H. Each minimization is done with a + Multiplicative Update. + + Parameters + ---------- + X : array-like, shape (n_samples, n_features) + Constant input matrix. + + W : array-like, shape (n_samples, n_components) + Initial guess for the solution. + + H : array-like, shape (n_components, n_features) + Initial guess for the solution. + + beta_loss : float or string, default 'frobenius' + String must be in {'frobenius', 'kullback-leibler', 'itakura-saito'}. + Beta divergence to be minimized, measuring the distance between X + and the dot product WH. Note that values different from 'frobenius' + (or 2) and 'kullback-leibler' (or 1) lead to significantly slower + fits. Note that for beta_loss <= 0 (or 'itakura-saito'), the input + matrix X cannot contain zeros. + + max_iter : integer, default: 200 + Number of iterations. + + tol : float, default: 1e-4 + Tolerance of the stopping condition. + + l1_reg_W : double, default: 0. + L1 regularization parameter for W. + + l1_reg_H : double, default: 0. + L1 regularization parameter for H. + + l2_reg_W : double, default: 0. + L2 regularization parameter for W. + + l2_reg_H : double, default: 0. + L2 regularization parameter for H. + + update_H : boolean, default: True + Set to True, both W and H will be estimated from initial guesses. + Set to False, only W will be estimated. + + verbose : integer, default: 0 + The verbosity level. + + Returns + ------- + W : array, shape (n_samples, n_components) + Solution to the non-negative least squares problem. + + H : array, shape (n_components, n_features) + Solution to the non-negative least squares problem. + + n_iter : int + The number of iterations done by the algorithm. + + References + ---------- + Fevotte, C., & Idier, J. (2011). Algorithms for nonnegative matrix + factorization with the beta-divergence. Neural Computation, 23(9). 
+ """ + start_time = time.time() + + beta_loss = _beta_loss_to_float(beta_loss) + + # gamma for Maximization-Minimization (MM) algorithm [Fevotte 2011] + if beta_loss < 1: + gamma = 1. / (2. - beta_loss) + elif beta_loss > 2: + gamma = 1. / (beta_loss - 1.) + else: + gamma = 1. + + # used for the convergence criterion + error_at_init = _beta_divergence(X, W, H, beta_loss, square_root=True) + previous_error = error_at_init + + H_sum, HHt, XHt = None, None, None + for n_iter in range(1, max_iter + 1): + # update W + # H_sum, HHt and XHt are saved and reused if not update_H + delta_W, H_sum, HHt, XHt = _multiplicative_update_w( + X, W, H, beta_loss, l1_reg_W, l2_reg_W, gamma, + H_sum, HHt, XHt, update_H) + W *= delta_W + + # necessary for stability with beta_loss < 1 + if beta_loss < 1: + W[W < np.finfo(np.float64).eps] = 0. + + # update H + if update_H: + delta_H = _multiplicative_update_h(X, W, H, beta_loss, l1_reg_H, + l2_reg_H, gamma) + H *= delta_H + + # These values will be recomputed since H changed + H_sum, HHt, XHt = None, None, None + + # necessary for stability with beta_loss < 1 + if beta_loss <= 1: + H[H < np.finfo(np.float64).eps] = 0. + + # test convergence criterion every 10 iterations + if tol > 0 and n_iter % 10 == 0: + error = _beta_divergence(X, W, H, beta_loss, square_root=True) + + if verbose: + iter_time = time.time() + print("Epoch %02d reached after %.3f seconds, error: %f" % + (n_iter, iter_time - start_time, error)) + + if (previous_error - error) / error_at_init < tol: + break + previous_error = error + + # do not print if we have already printed in the convergence test + if verbose and (tol == 0 or n_iter % 10 != 0): + end_time = time.time() + print("Epoch %02d reached after %.3f seconds." % + (n_iter, end_time - start_time)) + + return W, H, n_iter + + def non_negative_factorization(X, W=None, H=None, n_components=None, init='random', update_H=True, solver='cd', - tol=1e-4, max_iter=200, alpha=0., l1_ratio=0., + beta_loss='frobenius', tol=1e-4, + max_iter=200, alpha=0., l1_ratio=0., regularization=None, random_state=None, verbose=0, shuffle=False): """Compute Non-negative Matrix Factorization (NMF) @@ -494,6 +822,10 @@ def non_negative_factorization(X, W=None, H=None, n_components=None, ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm) ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm) + For multiplicative-update ('mu') solver, the Frobenius norm + (0.5 * ||X - WH||_Fro^2) can be changed into another beta-divergence loss, + by changing the beta_loss parameter. + The objective function is minimized with an alternating minimization of W and H. If H is given and update_H=False, it solves for W only. @@ -537,9 +869,26 @@ def non_negative_factorization(X, W=None, H=None, n_components=None, Set to True, both W and H will be estimated from initial guesses. Set to False, only W will be estimated. - solver : 'cd' + solver : 'cd' | 'mu' Numerical solver to use: 'cd' is a Coordinate Descent solver. + 'mu' is a Multiplicative Update solver. + + .. versionadded:: 0.17 + Coordinate Descent solver. + + .. versionadded:: 0.19 + Multiplicative Update solver. + + beta_loss : float or string, default 'frobenius' + String must be in {'frobenius', 'kullback-leibler', 'itakura-saito'}. + Beta divergence to be minimized, measuring the distance between X + and the dot product WH. Note that values different from 'frobenius' + (or 2) and 'kullback-leibler' (or 1) lead to significantly slower + fits. 
Note that for beta_loss <= 0 (or 'itakura-saito'), the input + matrix X cannot contain zeros. Used only in 'mu' solver. + + .. versionadded:: 0.19 tol : float, default: 1e-4 Tolerance of the stopping condition. @@ -570,7 +919,6 @@ def non_negative_factorization(X, W=None, H=None, n_components=None, shuffle : boolean, default: False If true, randomize the order of coordinates in the CD solver. - Returns ------- W : array-like, shape (n_samples, n_components) @@ -582,20 +930,33 @@ def non_negative_factorization(X, W=None, H=None, n_components=None, n_iter : int Actual number of iterations. + Examples + -------- + >>> import numpy as np + >>> X = np.array([[1,1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]]) + >>> from sklearn.decomposition import non_negative_factorization + >>> W, H, n_iter = non_negative_factorization(X, n_components=2, \ + init='random', random_state=0) + References ---------- - C.-J. Lin. Projected gradient methods for non-negative matrix - factorization. Neural Computation, 19(2007), 2756-2779. - http://www.csie.ntu.edu.tw/~cjlin/nmf/ - Cichocki, Andrzej, and P. H. A. N. Anh-Huy. "Fast local algorithms for large scale nonnegative matrix and tensor factorizations." IEICE transactions on fundamentals of electronics, communications and computer sciences 92.3: 708-721, 2009. + + Fevotte, C., & Idier, J. (2011). Algorithms for nonnegative matrix + factorization with the beta-divergence. Neural Computation, 23(9). """ X = check_array(X, accept_sparse=('csr', 'csc')) check_non_negative(X, "NMF (input X)") + beta_loss = _check_string_param(solver, regularization, beta_loss, init) + + if safe_min(X) == 0 and beta_loss <= 0: + raise ValueError("When beta_loss <= 0 and X contains zeros, " + "the solver may diverge. Please add small values to " + "X, or use a positive beta_loss.") n_samples, n_features = X.shape if n_components is None: @@ -605,8 +966,8 @@ def non_negative_factorization(X, W=None, H=None, n_components=None, raise ValueError("Number of components must be a positive integer;" " got (n_components=%r)" % n_components) if not isinstance(max_iter, INTEGER_TYPES) or max_iter < 0: - raise ValueError("Maximum number of iterations must be a positive integer;" - " got (max_iter=%r)" % max_iter) + raise ValueError("Maximum number of iterations must be a positive " + "integer; got (max_iter=%r)" % max_iter) if not isinstance(tol, numbers.Number) or tol < 0: raise ValueError("Tolerance for stopping criteria must be " "positive; got (tol=%r)" % tol) @@ -617,24 +978,37 @@ def non_negative_factorization(X, W=None, H=None, n_components=None, _check_init(W, (n_samples, n_components), "NMF (input W)") elif not update_H: _check_init(H, (n_components, n_features), "NMF (input H)") - W = np.zeros((n_samples, n_components)) + # 'mu' solver should not be initialized by zeros + if solver == 'mu': + avg = np.sqrt(X.mean() / n_components) + W = avg * np.ones((n_samples, n_components)) + else: + W = np.zeros((n_samples, n_components)) else: W, H = _initialize_nmf(X, n_components, init=init, random_state=random_state) + l1_reg_W, l1_reg_H, l2_reg_W, l2_reg_H = _compute_regularization( + alpha, l1_ratio, regularization) + if solver == 'cd': - W, H, n_iter = _fit_coordinate_descent(X, W, H, tol, - max_iter, - alpha, l1_ratio, - regularization, + W, H, n_iter = _fit_coordinate_descent(X, W, H, tol, max_iter, + l1_reg_W, l1_reg_H, + l2_reg_W, l2_reg_H, update_H=update_H, verbose=verbose, shuffle=shuffle, random_state=random_state) + elif solver == 'mu': + W, H, n_iter = _fit_multiplicative_update(X, 
W, H, beta_loss, max_iter, + tol, l1_reg_W, l1_reg_H, + l2_reg_W, l2_reg_H, update_H, + verbose) + else: raise ValueError("Invalid solver parameter '%s'." % solver) - if n_iter == max_iter: + if n_iter == max_iter and tol > 0: warnings.warn("Maximum number of iteration %d reached. Increase it to" " improve convergence." % max_iter, ConvergenceWarning) @@ -661,6 +1035,10 @@ class NMF(BaseEstimator, TransformerMixin): ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm) ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm) + For multiplicative-update ('mu') solver, the Frobenius norm + (0.5 * ||X - WH||_Fro^2) can be changed into another beta-divergence loss, + by changing the beta_loss parameter. + The objective function is minimized with an alternating minimization of W and H. @@ -674,7 +1052,7 @@ class NMF(BaseEstimator, TransformerMixin): init : 'random' | 'nndsvd' | 'nndsvda' | 'nndsvdar' | 'custom' Method used to initialize the procedure. - Default: 'nndsvdar' if n_components < n_features, otherwise random. + Default: 'nndsvd' if n_components < n_features, otherwise random. Valid options: - 'random': non-negative random matrices, scaled with: @@ -692,21 +1070,32 @@ class NMF(BaseEstimator, TransformerMixin): - 'custom': use custom matrices W and H - solver : 'cd' + solver : 'cd' | 'mu' Numerical solver to use: 'cd' is a Coordinate Descent solver. + 'mu' is a Multiplicative Update solver. .. versionadded:: 0.17 Coordinate Descent solver. - .. versionchanged:: 0.17 - Deprecated Projected Gradient solver. + .. versionadded:: 0.19 + Multiplicative Update solver. + + beta_loss : float or string, default 'frobenius' + String must be in {'frobenius', 'kullback-leibler', 'itakura-saito'}. + Beta divergence to be minimized, measuring the distance between X + and the dot product WH. Note that values different from 'frobenius' + (or 2) and 'kullback-leibler' (or 1) lead to significantly slower + fits. Note that for beta_loss <= 0 (or 'itakura-saito'), the input + matrix X cannot contain zeros. Used only in 'mu' solver. - tol : double, default: 1e-4 - Tolerance value used in stopping conditions. + .. versionadded:: 0.19 + + tol : float, default: 1e-4 + Tolerance of the stopping condition. max_iter : integer, default: 200 - Number of iterations to compute. + Maximum number of iterations before timing out. random_state : integer seed, RandomState instance, or None (default) Random number generator seed control. @@ -735,16 +1124,15 @@ class NMF(BaseEstimator, TransformerMixin): .. versionadded:: 0.17 *shuffle* parameter used in the Coordinate Descent solver. - Attributes ---------- components_ : array, [n_components, n_features] - Non-negative components of the data. + Factorization matrix, sometimes called 'dictionary'. reconstruction_err_ : number - Frobenius norm of the matrix difference between - the training data and the reconstructed data from - the fit produced by the model. ``|| X - WH ||_2`` + Frobenius norm of the matrix difference, or beta-divergence, between + the training data ``X`` and the reconstructed data ``WH`` from + the fitted model. n_iter_ : int Actual number of iterations. 
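A minimal sketch of the function-level API with the multiplicative-update solver (arbitrary non-negative data, illustrative parameter values)::

    >>> import numpy as np
    >>> from sklearn.decomposition import non_negative_factorization
    >>> X = np.abs(np.random.RandomState(0).randn(10, 5))
    >>> W, H, n_iter = non_negative_factorization(
    ...     X, n_components=3, init='random', solver='mu',
    ...     beta_loss='kullback-leibler', max_iter=500, random_state=0)
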
@@ -752,38 +1140,30 @@ class NMF(BaseEstimator, TransformerMixin): Examples -------- >>> import numpy as np - >>> X = np.array([[1,1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]]) + >>> X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]]) >>> from sklearn.decomposition import NMF >>> model = NMF(n_components=2, init='random', random_state=0) - >>> model.fit(X) #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE - NMF(alpha=0.0, init='random', l1_ratio=0.0, max_iter=200, - n_components=2, random_state=0, shuffle=False, - solver='cd', tol=0.0001, verbose=0) - - >>> model.components_ - array([[ 2.09783018, 0.30560234], - [ 2.13443044, 2.13171694]]) - >>> model.reconstruction_err_ #doctest: +ELLIPSIS - 0.00115993... + >>> W = model.fit_transform(X) + >>> H = model.components_ References ---------- - C.-J. Lin. Projected gradient methods for non-negative matrix - factorization. Neural Computation, 19(2007), 2756-2779. - http://www.csie.ntu.edu.tw/~cjlin/nmf/ - Cichocki, Andrzej, and P. H. A. N. Anh-Huy. "Fast local algorithms for large scale nonnegative matrix and tensor factorizations." IEICE transactions on fundamentals of electronics, communications and computer sciences 92.3: 708-721, 2009. - """ - def __init__(self, n_components=None, init=None, solver='cd', tol=1e-4, - max_iter=200, random_state=None, alpha=0., l1_ratio=0., - verbose=0, shuffle=False): + Fevotte, C., & Idier, J. (2011). Algorithms for nonnegative matrix + factorization with the beta-divergence. Neural Computation, 23(9). + """ + def __init__(self, n_components=None, init=None, solver='cd', + beta_loss='frobenius', tol=1e-4, max_iter=200, + random_state=None, alpha=0., l1_ratio=0., verbose=0, + shuffle=False): self.n_components = n_components self.init = init self.solver = solver + self.beta_loss = beta_loss self.tol = tol self.max_iter = max_iter self.random_state = random_state @@ -816,14 +1196,15 @@ def fit_transform(self, X, y=None, W=None, H=None): X = check_array(X, accept_sparse=('csr', 'csc')) W, H, n_iter_ = non_negative_factorization( - X=X, W=W, H=H, n_components=self.n_components, - init=self.init, update_H=True, solver=self.solver, + X=X, W=W, H=H, n_components=self.n_components, init=self.init, + update_H=True, solver=self.solver, beta_loss=self.beta_loss, tol=self.tol, max_iter=self.max_iter, alpha=self.alpha, l1_ratio=self.l1_ratio, regularization='both', random_state=self.random_state, verbose=self.verbose, shuffle=self.shuffle) - self.reconstruction_err_ = _safe_compute_error(X, W, H) + self.reconstruction_err_ = _beta_divergence(X, W, H, self.beta_loss, + square_root=True) self.n_components_ = H.shape[0] self.components_ = H @@ -864,8 +1245,8 @@ def transform(self, X): W, _, n_iter_ = non_negative_factorization( X=X, W=None, H=self.components_, n_components=self.n_components_, init=self.init, update_H=False, solver=self.solver, - tol=self.tol, max_iter=self.max_iter, alpha=self.alpha, - l1_ratio=self.l1_ratio, regularization='both', + beta_loss=self.beta_loss, tol=self.tol, max_iter=self.max_iter, + alpha=self.alpha, l1_ratio=self.l1_ratio, regularization='both', random_state=self.random_state, verbose=self.verbose, shuffle=self.shuffle) diff --git a/sklearn/decomposition/tests/test_nmf.py b/sklearn/decomposition/tests/test_nmf.py index bb93ed94f3df5..6254c147d45a5 100644 --- a/sklearn/decomposition/tests/test_nmf.py +++ b/sklearn/decomposition/tests/test_nmf.py @@ -1,4 +1,7 @@ import numpy as np +import scipy.sparse as sp +import numbers + from scipy import linalg from sklearn.decomposition import 
NMF, non_negative_factorization from sklearn.decomposition import nmf # For testing internals @@ -7,18 +10,21 @@ from sklearn.utils.testing import assert_true from sklearn.utils.testing import assert_false from sklearn.utils.testing import assert_raise_message, assert_no_warnings +from sklearn.utils.testing import assert_array_equal from sklearn.utils.testing import assert_array_almost_equal from sklearn.utils.testing import assert_almost_equal from sklearn.utils.testing import assert_less +from sklearn.utils.testing import assert_greater +from sklearn.utils.testing import ignore_warnings +from sklearn.utils.extmath import squared_norm, fast_dot from sklearn.base import clone - - -random_state = np.random.mtrand.RandomState(0) +from sklearn.exceptions import ConvergenceWarning def test_initialize_nn_output(): # Test that initialization does not return negative values - data = np.abs(random_state.randn(10, 10)) + rng = np.random.mtrand.RandomState(42) + data = np.abs(rng.randn(10, 10)) for init in ('random', 'nndsvd', 'nndsvda', 'nndsvdar'): W, H = nmf._initialize_nmf(data, 10, init=init, random_state=0) assert_false((W < 0).any() or (H < 0).any()) @@ -27,10 +33,17 @@ def test_initialize_nn_output(): def test_parameter_checking(): A = np.ones((2, 2)) name = 'spam' - msg = "Invalid solver parameter 'spam'" + msg = "Invalid solver parameter: got 'spam' instead of one of" assert_raise_message(ValueError, msg, NMF(solver=name).fit, A) msg = "Invalid init parameter: got 'spam' instead of one of" assert_raise_message(ValueError, msg, NMF(init=name).fit, A) + msg = "Invalid beta_loss parameter: got 'spam' instead of one" + assert_raise_message(ValueError, msg, NMF(solver='mu', + beta_loss=name).fit, A) + msg = "Invalid beta_loss parameter: solver 'cd' does not handle " + msg += "beta_loss = 1.0" + assert_raise_message(ValueError, msg, NMF(solver='cd', + beta_loss=1.0).fit, A) msg = "Negative values in data passed to" assert_raise_message(ValueError, msg, NMF().fit, -A) @@ -44,7 +57,8 @@ def test_initialize_close(): # Test NNDSVD error # Test that _initialize_nmf error is less than the standard deviation of # the entries in the matrix. - A = np.abs(random_state.randn(10, 10)) + rng = np.random.mtrand.RandomState(42) + A = np.abs(rng.randn(10, 10)) W, H = nmf._initialize_nmf(A, 10, init='nndsvd') error = linalg.norm(np.dot(W, H) - A) sdev = linalg.norm(A - A.mean()) @@ -55,7 +69,8 @@ def test_initialize_variants(): # Test NNDSVD variants correctness # Test that the variants 'nndsvda' and 'nndsvdar' differ from basic # 'nndsvd' only where the basic version has zeros. 
- data = np.abs(random_state.randn(10, 10)) + rng = np.random.mtrand.RandomState(42) + data = np.abs(rng.randn(10, 10)) W0, H0 = nmf._initialize_nmf(data, 10, init='nndsvd') Wa, Ha = nmf._initialize_nmf(data, 10, init='nndsvda') War, Har = nmf._initialize_nmf(data, 10, init='nndsvdar', @@ -65,50 +80,46 @@ def test_initialize_variants(): assert_almost_equal(evl[ref != 0], ref[ref != 0]) +# ignore UserWarning raised when both solver='mu' and init='nndsvd' +@ignore_warnings(category=UserWarning) def test_nmf_fit_nn_output(): # Test that the decomposition does not contain negative values A = np.c_[5 * np.ones(5) - np.arange(1, 6), 5 * np.ones(5) + np.arange(1, 6)] - for init in (None, 'nndsvd', 'nndsvda', 'nndsvdar'): - model = NMF(n_components=2, init=init, random_state=0) - transf = model.fit_transform(A) - assert_false((model.components_ < 0).any() or - (transf < 0).any()) + for solver in ('cd', 'mu'): + for init in (None, 'nndsvd', 'nndsvda', 'nndsvdar', 'random'): + model = NMF(n_components=2, solver=solver, init=init, + random_state=0) + transf = model.fit_transform(A) + assert_false((model.components_ < 0).any() or + (transf < 0).any()) def test_nmf_fit_close(): + rng = np.random.mtrand.RandomState(42) # Test that the fit is not too far away - pnmf = NMF(5, init='nndsvd', random_state=0) - X = np.abs(random_state.randn(6, 5)) - assert_less(pnmf.fit(X).reconstruction_err_, 0.05) - - -def test_nls_nn_output(): - # Test that NLS solver doesn't return negative values - A = np.arange(1, 5).reshape(1, -1) - Ap, _, _ = nmf._nls_subproblem(np.dot(A.T, -A), A.T, A, 0.001, 100) - assert_false((Ap < 0).any()) - - -def test_nls_close(): - # Test that the NLS results should be close - A = np.arange(1, 5).reshape(1, -1) - Ap, _, _ = nmf._nls_subproblem(np.dot(A.T, A), A.T, np.zeros_like(A), - 0.001, 100) - assert_true((np.abs(Ap - A) < 0.01).all()) + for solver in ('cd', 'mu'): + pnmf = NMF(5, solver=solver, init='nndsvdar', random_state=0, + max_iter=600) + X = np.abs(rng.randn(6, 5)) + assert_less(pnmf.fit(X).reconstruction_err_, 0.1) def test_nmf_transform(): # Test that NMF.transform returns close values - A = np.abs(random_state.randn(6, 5)) - m = NMF(n_components=4, init='nndsvd', random_state=0) - ft = m.fit_transform(A) - t = m.transform(A) - assert_array_almost_equal(ft, t, decimal=2) + rng = np.random.mtrand.RandomState(42) + A = np.abs(rng.randn(6, 5)) + for solver in ['cd', 'mu']: + m = NMF(solver=solver, n_components=3, init='random', + random_state=0, tol=1e-5) + ft = m.fit_transform(A) + t = m.transform(A) + assert_array_almost_equal(ft, t, decimal=2) def test_nmf_transform_custom_init(): # Smoke test that checks if NMF.transform works with custom initialization + random_state = np.random.RandomState(0) A = np.abs(random_state.randn(6, 5)) n_components = 4 avg = np.sqrt(A.mean() / n_components) @@ -125,29 +136,34 @@ def test_nmf_inverse_transform(): # Test that NMF.inverse_transform returns close values random_state = np.random.RandomState(0) A = np.abs(random_state.randn(6, 4)) - m = NMF(n_components=4, init='random', random_state=0) - m.fit_transform(A) - t = m.transform(A) - A_new = m.inverse_transform(t) - assert_array_almost_equal(A, A_new, decimal=2) + for solver in ('cd', 'mu'): + m = NMF(solver=solver, n_components=4, init='random', random_state=0, + max_iter=1000) + ft = m.fit_transform(A) + A_new = m.inverse_transform(ft) + assert_array_almost_equal(A, A_new, decimal=2) def test_n_components_greater_n_features(): # Smoke test for the case of more components than features. 
-    A = np.abs(random_state.randn(30, 10))
+    rng = np.random.mtrand.RandomState(42)
+    A = np.abs(rng.randn(30, 10))
     NMF(n_components=15, random_state=0, tol=1e-2).fit(A)


-def test_sparse_input():
+def test_nmf_sparse_input():
     # Test that sparse matrices are accepted as input
     from scipy.sparse import csc_matrix

-    A = np.abs(random_state.randn(10, 10))
+    rng = np.random.mtrand.RandomState(42)
+    A = np.abs(rng.randn(10, 10))
     A[:, 2 * np.arange(5)] = 0
     A_sparse = csc_matrix(A)

-    est1 = NMF(n_components=5, init='random', random_state=0, tol=1e-2)
-    est2 = clone(est1)
+    for solver in ('cd', 'mu'):
+        est1 = NMF(solver=solver, n_components=5, init='random',
+                   random_state=0, tol=1e-2)
+        est2 = clone(est1)

         W1 = est1.fit_transform(A)
         W2 = est2.fit_transform(A_sparse)
@@ -158,34 +174,39 @@
         assert_array_almost_equal(H1, H2)


-def test_sparse_transform():
+def test_nmf_sparse_transform():
     # Test that transform works on sparse data. Issue #2124
-
-    A = np.abs(random_state.randn(3, 2))
-    A[A > 1.0] = 0
+    rng = np.random.mtrand.RandomState(42)
+    A = np.abs(rng.randn(3, 2))
+    A[1, 1] = 0
     A = csc_matrix(A)

-    model = NMF(random_state=0, tol=1e-4, n_components=2)
-    A_fit_tr = model.fit_transform(A)
-    A_tr = model.transform(A)
-    assert_array_almost_equal(A_fit_tr, A_tr, decimal=1)
+    for solver in ('cd', 'mu'):
+        model = NMF(solver=solver, random_state=0, n_components=2,
+                    max_iter=400)
+        A_fit_tr = model.fit_transform(A)
+        A_tr = model.transform(A)
+        assert_array_almost_equal(A_fit_tr, A_tr, decimal=1)


 def test_non_negative_factorization_consistency():
     # Test that the function is called in the same way, either directly
     # or through the NMF class
-    A = np.abs(random_state.randn(10, 10))
+    rng = np.random.mtrand.RandomState(42)
+    A = np.abs(rng.randn(10, 10))
     A[:, 2 * np.arange(5)] = 0

-    W_nmf, H, _ = non_negative_factorization(A, random_state=1, tol=1e-2)
-    W_nmf_2, _, _ = non_negative_factorization(
-        A, H=H, update_H=False, random_state=1, tol=1e-2)
+    for solver in ('cd', 'mu'):
+        W_nmf, H, _ = non_negative_factorization(
+            A, solver=solver, random_state=1, tol=1e-2)
+        W_nmf_2, _, _ = non_negative_factorization(
+            A, H=H, update_H=False, solver=solver, random_state=1, tol=1e-2)

-    model_class = NMF(random_state=1, tol=1e-2)
-    W_cls = model_class.fit_transform(A)
-    W_cls_2 = model_class.transform(A)
-    assert_array_almost_equal(W_nmf, W_cls, decimal=10)
-    assert_array_almost_equal(W_nmf_2, W_cls_2, decimal=10)
+        model_class = NMF(solver=solver, random_state=1, tol=1e-2)
+        W_cls = model_class.fit_transform(A)
+        W_cls_2 = model_class.transform(A)
+        assert_array_almost_equal(W_nmf, W_cls, decimal=10)
+        assert_array_almost_equal(W_nmf_2, W_cls_2, decimal=10)


 def test_non_negative_factorization_checking():
@@ -205,16 +226,256 @@
     assert_raise_message(ValueError, msg, nnmf, A, -A, A, 2, 'custom')
     msg = "Array passed to NMF (input H) is full of zeros"
     assert_raise_message(ValueError, msg, nnmf, A, A, 0 * A, 2, 'custom')
-
-
-def test_safe_compute_error():
-    A = np.abs(random_state.randn(10, 10))
-    A[:, 2 * np.arange(5)] = 0
-    A_sparse = csc_matrix(A)
-
-    W, H = nmf._initialize_nmf(A, 5, init='random', random_state=0)
-
-    error = nmf._safe_compute_error(A, W, H)
-    error_sparse = nmf._safe_compute_error(A_sparse, W, H)
-
-    assert_almost_equal(error, error_sparse)
+    msg = "Invalid regularization parameter: got 'spam' instead of one of"
+    assert_raise_message(ValueError, msg, nnmf, A, A, 0 * A, 2, 'custom', True,
+                         'cd', 2., 1e-4, 200, 0., 0., 'spam')
+
+
+def _beta_divergence_dense(X, W, H, beta):
+    """Compute the beta-divergence of X and W.H for dense array only.
+
+    Used as a reference for testing nmf._beta_divergence.
+    """
+    if isinstance(X, numbers.Number):
+        W = np.array([[W]])
+        H = np.array([[H]])
+        X = np.array([[X]])
+
+    WH = fast_dot(W, H)
+
+    if beta == 2:
+        return squared_norm(X - WH) / 2
+
+    WH_Xnonzero = WH[X != 0]
+    X_nonzero = X[X != 0]
+    np.maximum(WH_Xnonzero, 1e-9, out=WH_Xnonzero)
+
+    if beta == 1:
+        res = np.sum(X_nonzero * np.log(X_nonzero / WH_Xnonzero))
+        res += WH.sum() - X.sum()
+
+    elif beta == 0:
+        div = X_nonzero / WH_Xnonzero
+        res = np.sum(div) - X.size - np.sum(np.log(div))
+    else:
+        res = (X_nonzero ** beta).sum()
+        res += (beta - 1) * (WH ** beta).sum()
+        res -= beta * (X_nonzero * (WH_Xnonzero ** (beta - 1))).sum()
+        res /= beta * (beta - 1)
+
+    return res
+
+
+def test_beta_divergence():
+    # Compare _beta_divergence with the reference _beta_divergence_dense
+    n_samples = 20
+    n_features = 10
+    n_components = 5
+    beta_losses = [0., 0.5, 1., 1.5, 2.]
+
+    # initialization
+    rng = np.random.mtrand.RandomState(42)
+    X = rng.randn(n_samples, n_features)
+    X[X < 0] = 0.
+    X_csr = sp.csr_matrix(X)
+    W, H = nmf._initialize_nmf(X, n_components, init='random', random_state=42)
+
+    for beta in beta_losses:
+        ref = _beta_divergence_dense(X, W, H, beta)
+        loss = nmf._beta_divergence(X, W, H, beta)
+        loss_csr = nmf._beta_divergence(X_csr, W, H, beta)
+
+        assert_almost_equal(ref, loss, decimal=7)
+        assert_almost_equal(ref, loss_csr, decimal=7)
+
+
+def test_special_sparse_dot():
+    # Test the function that computes np.dot(W, H), only where X is non zero.
+    n_samples = 10
+    n_features = 5
+    n_components = 3
+    rng = np.random.mtrand.RandomState(42)
+    X = rng.randn(n_samples, n_features)
+    X[X < 0] = 0.
+    X_csr = sp.csr_matrix(X)
+
+    W = np.abs(rng.randn(n_samples, n_components))
+    H = np.abs(rng.randn(n_components, n_features))
+
+    WH_safe = nmf._special_sparse_dot(W, H, X_csr)
+    WH = nmf._special_sparse_dot(W, H, X)
+
+    # test that both results have same values, in X_csr nonzero elements
+    ii, jj = X_csr.nonzero()
+    WH_safe_data = np.asarray(WH_safe[ii, jj]).ravel()
+    assert_array_almost_equal(WH_safe_data, WH[ii, jj], decimal=10)
+
+    # test that WH_safe and X_csr have the same sparse structure
+    assert_array_equal(WH_safe.indices, X_csr.indices)
+    assert_array_equal(WH_safe.indptr, X_csr.indptr)
+    assert_array_equal(WH_safe.shape, X_csr.shape)
+
+
+@ignore_warnings(category=ConvergenceWarning)
+def test_nmf_multiplicative_update_sparse():
+    # Compare sparse and dense input in multiplicative update NMF
+    # Also test continuity of the results with respect to beta_loss parameter
+    n_samples = 20
+    n_features = 10
+    n_components = 5
+    alpha = 0.1
+    l1_ratio = 0.5
+    n_iter = 20
+
+    # initialization
+    rng = np.random.mtrand.RandomState(1337)
+    X = rng.randn(n_samples, n_features)
+    X = np.abs(X)
+    X_csr = sp.csr_matrix(X)
+    W0, H0 = nmf._initialize_nmf(X, n_components, init='random',
+                                 random_state=42)
+
+    for beta_loss in (-1.2, 0, 0.2, 1., 2., 2.5):
+        # Reference with dense array X
+        W, H = W0.copy(), H0.copy()
+        W1, H1, _ = non_negative_factorization(
+            X, W, H, n_components, init='custom', update_H=True,
+            solver='mu', beta_loss=beta_loss, max_iter=n_iter, alpha=alpha,
+            l1_ratio=l1_ratio, regularization='both', random_state=42)
+
+        # Compare with sparse X
+        W, H = W0.copy(), H0.copy()
+        W2, H2, _ = non_negative_factorization(
+            X_csr, W, H, n_components, init='custom', update_H=True,
+            solver='mu', beta_loss=beta_loss, max_iter=n_iter, alpha=alpha,
+            l1_ratio=l1_ratio, regularization='both', random_state=42)
+
+        assert_array_almost_equal(W1, W2, decimal=7)
+        assert_array_almost_equal(H1, H2, decimal=7)
+
+        # Compare with almost same beta_loss, since some values have a specific
+        # behavior, but the results should be continuous w.r.t beta_loss
+        beta_loss -= 1.e-5
+        W, H = W0.copy(), H0.copy()
+        W3, H3, _ = non_negative_factorization(
+            X_csr, W, H, n_components, init='custom', update_H=True,
+            solver='mu', beta_loss=beta_loss, max_iter=n_iter, alpha=alpha,
+            l1_ratio=l1_ratio, regularization='both', random_state=42)
+
+        assert_array_almost_equal(W1, W3, decimal=4)
+        assert_array_almost_equal(H1, H3, decimal=4)
+
+
+def test_nmf_negative_beta_loss():
+    # Test that an error is raised if beta_loss < 0 and X contains zeros.
+    # Test that the output has no NaN values when the input contains zeros.
+    n_samples = 6
+    n_features = 5
+    n_components = 3
+
+    rng = np.random.mtrand.RandomState(42)
+    X = rng.randn(n_samples, n_features)
+    X[X < 0] = 0
+    X_csr = sp.csr_matrix(X)
+
+    def _assert_nmf_no_nan(X, beta_loss):
+        W, H, _ = non_negative_factorization(
+            X, n_components=n_components, solver='mu', beta_loss=beta_loss,
+            random_state=0, max_iter=1000)
+        assert_false(np.any(np.isnan(W)))
+        assert_false(np.any(np.isnan(H)))
+
+    msg = "When beta_loss <= 0 and X contains zeros, the solver may diverge."
+    for beta_loss in (-0.6, 0.):
+        assert_raise_message(ValueError, msg, _assert_nmf_no_nan, X, beta_loss)
+        _assert_nmf_no_nan(X + 1e-9, beta_loss)
+
+    for beta_loss in (0.2, 1., 1.2, 2., 2.5):
+        _assert_nmf_no_nan(X, beta_loss)
+        _assert_nmf_no_nan(X_csr, beta_loss)
+
+
+def test_nmf_regularization():
+    # Test the effect of L1 and L2 regularizations
+    n_samples = 6
+    n_features = 5
+    n_components = 3
+    rng = np.random.mtrand.RandomState(42)
+    X = np.abs(rng.randn(n_samples, n_features))
+
+    # L1 regularization should increase the number of zeros
+    l1_ratio = 1.
+    for solver in ['cd', 'mu']:
+        regul = nmf.NMF(n_components=n_components, solver=solver,
+                        alpha=0.5, l1_ratio=l1_ratio, random_state=42)
+        model = nmf.NMF(n_components=n_components, solver=solver,
+                        alpha=0., l1_ratio=l1_ratio, random_state=42)
+
+        W_regul = regul.fit_transform(X)
+        W_model = model.fit_transform(X)
+
+        H_regul = regul.components_
+        H_model = model.components_
+
+        W_regul_n_zeros = W_regul[W_regul == 0].size
+        W_model_n_zeros = W_model[W_model == 0].size
+        H_regul_n_zeros = H_regul[H_regul == 0].size
+        H_model_n_zeros = H_model[H_model == 0].size
+
+        assert_greater(W_regul_n_zeros, W_model_n_zeros)
+        assert_greater(H_regul_n_zeros, H_model_n_zeros)
+
+    # L2 regularization should decrease the mean of the coefficients
+    l1_ratio = 0.
+    for solver in ['cd', 'mu']:
+        regul = nmf.NMF(n_components=n_components, solver=solver,
+                        alpha=0.5, l1_ratio=l1_ratio, random_state=42)
+        model = nmf.NMF(n_components=n_components, solver=solver,
+                        alpha=0., l1_ratio=l1_ratio, random_state=42)
+
+        W_regul = regul.fit_transform(X)
+        W_model = model.fit_transform(X)
+
+        H_regul = regul.components_
+        H_model = model.components_
+
+        assert_greater(W_model.mean(), W_regul.mean())
+        assert_greater(H_model.mean(), H_regul.mean())
+
+
+@ignore_warnings(category=ConvergenceWarning)
+def test_nmf_decreasing():
+    # test that the objective function is decreasing at each iteration
+    n_samples = 20
+    n_features = 15
+    n_components = 10
+    alpha = 0.1
+    l1_ratio = 0.5
+    tol = 0.
+
+    # initialization
+    rng = np.random.mtrand.RandomState(42)
+    X = rng.randn(n_samples, n_features)
+    np.abs(X, X)
+    W0, H0 = nmf._initialize_nmf(X, n_components, init='random',
+                                 random_state=42)
+
+    for beta_loss in (-1.2, 0, 0.2, 1., 2., 2.5):
+        for solver in ('cd', 'mu'):
+            if solver != 'mu' and beta_loss != 2:
+                # not implemented
+                continue
+            W, H = W0.copy(), H0.copy()
+            previous_loss = None
+            for _ in range(30):
+                # one more iteration starting from the previous results
+                W, H, _ = non_negative_factorization(
+                    X, W, H, beta_loss=beta_loss, init='custom',
+                    n_components=n_components, max_iter=1, alpha=alpha,
+                    solver=solver, tol=tol, l1_ratio=l1_ratio, verbose=0,
+                    regularization='both', random_state=0, update_H=True)
+
+                loss = nmf._beta_divergence(X, W, H, beta_loss)
+                if previous_loss is not None:
+                    assert_greater(previous_loss, loss)
+                previous_loss = loss
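For reference, the helper ``_beta_divergence_dense`` added above follows the usual element-wise definition of the beta-divergence; in the general case (``beta`` outside {0, 1}) it computes

.. math::

    d_\beta(X, Y) = \sum_{i,j} \frac{X_{ij}^\beta + (\beta - 1) Y_{ij}^\beta
                                     - \beta X_{ij} Y_{ij}^{\beta - 1}}
                                    {\beta (\beta - 1)}

with the Kullback-Leibler (``beta = 1``), Itakura-Saito (``beta = 0``) and squared Frobenius (``beta = 2``) losses recovered as limits or special cases, which is what the corresponding branches of the helper compute, up to its handling of zero entries in ``X``.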
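The new tests exercise the multiplicative-update solver with several beta-divergences, as well as the L1/L2 penalties controlled by ``alpha`` and ``l1_ratio``. A minimal usage sketch of these parameters (assuming a scikit-learn version that ships the ``'mu'`` solver and ``beta_loss``; the data and settings below are arbitrary)::

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.RandomState(42)
    X = np.abs(rng.randn(20, 10))      # NMF requires non-negative input

    # Multiplicative updates with a Kullback-Leibler loss; only solver='mu'
    # handles beta_loss values other than 'frobenius' (beta_loss=2).
    model_kl = NMF(n_components=5, solver='mu', beta_loss='kullback-leibler',
                   max_iter=1000, random_state=0)
    W_kl = model_kl.fit_transform(X)
    H_kl = model_kl.components_

    # An L1 penalty (l1_ratio=1) promotes zeros in W and H, while a pure L2
    # penalty (l1_ratio=0) shrinks the coefficients instead, which is the
    # behaviour test_nmf_regularization checks for both solvers.
    model_l1 = NMF(n_components=5, solver='cd', alpha=0.5, l1_ratio=1.,
                   random_state=0)
    W_l1 = model_l1.fit_transform(X)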
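Similarly, ``test_non_negative_factorization_consistency`` asserts that the function and estimator APIs stay in sync for both solvers. A sketch of the two equivalent call paths it covers (same assumptions as above; identical parameters are passed on both sides so the results match)::

    import numpy as np
    from sklearn.decomposition import NMF, non_negative_factorization

    rng = np.random.RandomState(0)
    X = np.abs(rng.randn(10, 10))

    # Function API: returns W, H and the number of iterations performed.
    W, H, n_iter = non_negative_factorization(
        X, n_components=5, init='random', solver='mu',
        random_state=1, tol=1e-2)

    # Estimator API: fit_transform should return the same W, with H stored
    # as the components_ attribute.
    model = NMF(n_components=5, init='random', solver='mu',
                random_state=1, tol=1e-2)
    W_cls = model.fit_transform(X)
    H_cls = model.components_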