diff --git a/doc/modules/feature_extraction.rst b/doc/modules/feature_extraction.rst
index 2f506dcf7be07..9dfcaa4549f08 100644
--- a/doc/modules/feature_extraction.rst
+++ b/doc/modules/feature_extraction.rst
@@ -436,11 +436,12 @@ Using the ``TfidfTransformer``'s default settings,
 the term frequency, the number of times a term occurs in a given document,
 is multiplied with idf component, which is computed as
 
-:math:`\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1`,
+:math:`\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1`,
 
-where :math:`n_d` is the total number of documents, and :math:`\text{df}(d,t)`
-is the number of documents that contain term :math:`t`. The resulting tf-idf
-vectors are then normalized by the Euclidean norm:
+where :math:`n` is the total number of documents in the document set, and
+:math:`\text{df}(t)` is the number of documents in the document set that
+contain term :math:`t`. The resulting tf-idf vectors are then normalized by the
+Euclidean norm:
 
 :math:`v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 +
 v{_2}^2 + \dots + v{_n}^2}}`.
@@ -455,14 +456,14 @@ computed in scikit-learn's :class:`TfidfTransformer` and
 :class:`TfidfVectorizer` differ slightly from the standard textbook notation
 that defines the idf as
 
-:math:`\text{idf}(t) = log{\frac{n_d}{1+\text{df}(d,t)}}.`
+:math:`\text{idf}(t) = \log{\frac{n}{1+\text{df}(t)}}.`
 
 In the :class:`TfidfTransformer` and :class:`TfidfVectorizer`
 with ``smooth_idf=False``, the "1" count is added to the idf instead of the
 idf's denominator:
 
-:math:`\text{idf}(t) = log{\frac{n_d}{\text{df}(d,t)}} + 1`
+:math:`\text{idf}(t) = \log{\frac{n}{\text{df}(t)}} + 1`
 
 This normalization is implemented by the
 :class:`TfidfTransformer` class::
@@ -509,21 +510,21 @@ v{_2}^2 + \dots + v{_n}^2}}`
 For example, we can compute the tf-idf of the first term in the first
 document in the `counts` array as follows:
 
-:math:`n_{d} = 6`
+:math:`n = 6`
 
-:math:`\text{df}(d, t)_{\text{term1}} = 6`
+:math:`\text{df}(t)_{\text{term1}} = 6`
 
-:math:`\text{idf}(d, t)_{\text{term1}} =
-log \frac{n_d}{\text{df}(d, t)} + 1 = log(1)+1 = 1`
+:math:`\text{idf}(t)_{\text{term1}} =
+\log \frac{n}{\text{df}(t)} + 1 = \log(1)+1 = 1`
 
 :math:`\text{tf-idf}_{\text{term1}} = \text{tf} \times \text{idf} = 3 \times 1 = 3`
 
 Now, if we repeat this computation for the remaining 2 terms in the document,
 we get
 
-:math:`\text{tf-idf}_{\text{term2}} = 0 \times (log(6/1)+1) = 0`
+:math:`\text{tf-idf}_{\text{term2}} = 0 \times (\log(6/1)+1) = 0`
 
-:math:`\text{tf-idf}_{\text{term3}} = 1 \times (log(6/2)+1) \approx 2.0986`
+:math:`\text{tf-idf}_{\text{term3}} = 1 \times (\log(6/2)+1) \approx 2.0986`
 
 and the vector of raw tf-idfs:
 
@@ -540,12 +541,12 @@ Furthermore, the default parameter ``smooth_idf=True`` adds "1" to the
 numerator and denominator as if an extra document was seen containing every
 term in the collection exactly once, which prevents zero divisions:
 
-:math:`\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1`
+:math:`\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1`
 
 Using this modification, the tf-idf of the third term in document 1 changes to
 1.8473:
 
-:math:`\text{tf-idf}_{\text{term3}} = 1 \times log(7/3)+1 \approx 1.8473`
+:math:`\text{tf-idf}_{\text{term3}} = 1 \times \log(7/3)+1 \approx 1.8473`
 
 And the L2-normalized tf-idf changes to
 
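(Reviewer note, not part of the patch: the arithmetic in the worked example above can be
cross-checked with a few lines of NumPy. This is a minimal sketch; the ``counts`` matrix
below is an assumption chosen to match the figures quoted in the hunks, i.e. :math:`n = 6`,
document frequencies ``[6, 1, 2]`` and a first row of ``[3, 0, 1]``; the actual example
array is the one defined earlier in the documentation page.)::

    import numpy as np

    # Toy document-term counts, consistent with the worked example above.
    counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [3, 0, 0],
                       [4, 0, 0],
                       [3, 2, 0],
                       [3, 0, 2]])

    n = counts.shape[0]                    # total number of documents: 6
    df = np.count_nonzero(counts, axis=0)  # document frequencies: [6, 1, 2]

    # smooth_idf=False: idf(t) = log(n / df(t)) + 1
    tfidf_raw = counts[0] * (np.log(n / df) + 1)            # [3.0, 0.0, 2.0986...]

    # smooth_idf=True (default): idf(t) = log((1 + n) / (1 + df(t))) + 1
    tfidf_smooth = counts[0] * (np.log((1 + n) / (1 + df)) + 1)  # third term: 1.8473...

    # L2 normalization, as applied with norm='l2'
    tfidf_l2 = tfidf_raw / np.linalg.norm(tfidf_raw)

The same raw values can be obtained with
``TfidfTransformer(smooth_idf=False, norm=None).fit_transform(counts)``.
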
diff --git a/sklearn/feature_extraction/text.py b/sklearn/feature_extraction/text.py
index d705a060e7588..d06e4c7fd483e 100644
--- a/sklearn/feature_extraction/text.py
+++ b/sklearn/feature_extraction/text.py
@@ -1146,17 +1146,18 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
     informative than features that occur in a small fraction of the training
     corpus.
 
-    The formula that is used to compute the tf-idf of term t is
-    tf-idf(d, t) = tf(t) * idf(d, t), and the idf is computed as
-    idf(d, t) = log [ n / df(d, t) ] + 1 (if ``smooth_idf=False``),
-    where n is the total number of documents and df(d, t) is the
-    document frequency; the document frequency is the number of documents d
-    that contain term t. The effect of adding "1" to the idf in the equation
-    above is that terms with zero idf, i.e., terms that occur in all documents
-    in a training set, will not be entirely ignored.
-    (Note that the idf formula above differs from the standard
-    textbook notation that defines the idf as
-    idf(d, t) = log [ n / (df(d, t) + 1) ]).
+    The formula that is used to compute the tf-idf for a term t of a document d
+    in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is
+    computed as idf(t) = log [ n / df(t) ] + 1 (if ``smooth_idf=False``), where
+    n is the total number of documents in the document set and df(t) is the
+    document frequency of t; the document frequency is the number of documents
+    in the document set that contain the term t. The effect of adding "1" to
+    the idf in the equation above is that terms with zero idf, i.e., terms
+    that occur in all documents in a training set, will not be entirely
+    ignored.
+    (Note that the idf formula above differs from the standard textbook
+    notation that defines the idf as
+    idf(t) = log [ n / (df(t) + 1) ]).
 
     If ``smooth_idf=True`` (the default), the constant "1" is added to the
     numerator and denominator of the idf as if an extra document was seen
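
(Reviewer note, not part of the patch: the docstring's formulas can be sanity-checked against
the ``idf_`` vector that :class:`TfidfTransformer` actually learns. The toy matrix ``X`` below
is an arbitrary assumption; any non-negative count matrix without all-zero columns works.)::

    import numpy as np
    from sklearn.feature_extraction.text import TfidfTransformer

    X = np.array([[3, 1, 0],
                  [2, 0, 1],
                  [3, 0, 2]])
    n = X.shape[0]                    # total number of documents in the document set
    df = np.count_nonzero(X, axis=0)  # document frequency of each term

    # smooth_idf=False: idf(t) = log(n / df(t)) + 1
    t = TfidfTransformer(smooth_idf=False).fit(X)
    assert np.allclose(t.idf_, np.log(n / df) + 1)

    # smooth_idf=True (the default): idf(t) = log((1 + n) / (1 + df(t))) + 1
    t = TfidfTransformer(smooth_idf=True).fit(X)
    assert np.allclose(t.idf_, np.log((1 + n) / (1 + df)) + 1)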