added text classification in nlp_apps #1043

Merged 2 commits on May 9, 2019
updated as per the changes suggested.
thesagarsehgal committed Mar 15, 2019
commit a81a43e41a96366df0c57a8c1910c0deb0b51e03
51 changes: 22 additions & 29 deletions nlp_apps.ipynb
@@ -786,12 +786,12 @@
"source": [
"**Text Classification** is assigning a category to a document based on the content of the document. Text Classification is one of the most popular and fundamental tasks of Natural Language Processing. Text classification can be applied on a variety of texts like *Short Documents* (like tweets, customer reviews, etc.) and *Long Document* (like emails, media articles, etc.).\n",
"\n",
-"We already have seen an example of Text Classification in the above tasks like Language Identification and Author Identification and Federalist Paper Identification.\n",
+"We already have seen an example of Text Classification in the above tasks like Language Identification, Author Recognition and Federalist Paper Identification.\n",
"\n",
"### Applications\n",
"Some of the broad applications of Text Classification are:-\n",
"- Language Identification\n",
-"- Author Identification\n",
+"- Author Recognition\n",
"- Sentiment Analysis\n",
"- Spam Mail Detection\n",
"- Topic Labelling \n",
@@ -803,14 +803,14 @@
"- Brand Monitoring\n",
"- Auto-tagging of user queries\n",
"\n",
-"For Text Classification, we would be using Naive Bayes Classifier. The reason for using Naive Bayes Classifier is:-\n",
-"- Being a probabilistic classifier, therefore will calculate the probability of each category\n",
+"For Text Classification, we would be using the Naive Bayes Classifier. The reasons for using the Naive Bayes Classifier are:\n",
+"- Being a probabilistic classifier, it will calculate the probability of each category\n",
"- It is fast, reliable and accurate \n",
-"- Naive Bayes Classifiers have already been used to solve many Natural Language Processing(NLP) applications.\n",
+"- Naive Bayes Classifiers have already been used to solve many Natural Language Processing (NLP) applications.\n",
"\n",
-"Here we would here be covering an example of **Word Sense Disambiguation** as an application of Text Classification. It is used to remove the ambiquity of a given word, if the word has 2 different meanings.\n",
+"Here we will be covering an example of **Word Sense Disambiguation** as an application of Text Classification. It is used to remove the ambiguity of a given word if the word has two different meanings.\n",
"\n",
-"As we know that we would be working on determining weather the word *apple* in a sentence reffers to `fruit` or to a `company`.\n"
+"We will be working on determining whether the word *apple* in a sentence refers to `fruit` or to a `company`."
]
},
{
@@ -819,7 +819,7 @@
"source": [
"**Step 1:- Defining the dataset** \n",
"\n",
-"The dataset has been defined here itself so that everything is clear and can be tested with other things as well."
+"The dataset has been defined here so that everything is clear and can be tested with other things as well."
]
},
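The dataset step above can be sketched as a plain list of labeled sentences. All sentences here are illustrative stand-ins, not the notebook's actual data:

```python
# Hypothetical training data: (sentence, class) pairs labeling the sense
# in which the word "apple" is used. The notebook defines its own sentences.
train_data = [
    ("apple released a new iphone", "company"),
    ("apple shares rose after the earnings call", "company"),
    ("the apple fell from the tree", "fruit"),
    ("she ate a ripe apple for breakfast", "fruit"),
]

# Unlabeled sentences to classify later.
test_data = [
    "apple stock hit a record high",
    "a juicy apple makes a good snack",
]
```

Keeping the dataset as simple `(sentence, tag)` tuples makes it easy to swap in other sentences for testing.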
{
@@ -883,7 +883,7 @@
"\n",
"Now we would be extracting features from the text like extracting the set of words used in both the categories i.e. `company` and `fruit`.\n",
"\n",
-"This frequency of a word would help in calculating the probability of that word being in a particular class. "
+"The frequency of a word would help in calculating the probability of that word being in a particular class. "
]
},
{
@@ -910,28 +910,28 @@
" elif(tag == class_1):\n",
" words_1 += sent\n",
" \n",
-"print(\"Number of words in `\" + class_0 + \"` class:\", len(words_0))\n",
-"print(\"Number of words in `\" + class_1 + \"` class:\", len(words_1))"
+"print(\"Number of words in `{}` class: {}\".format(class_0, len(words_0)))\n",
+"print(\"Number of words in `{}` class: {}\".format(class_1, len(words_1)))"
]
},
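The word-collection step in this cell can be sketched without the notebook's helpers. The data below is a hypothetical stand-in, assuming `(sentence, tag)` pairs as in Step 1:

```python
# Split the training sentences into per-class word lists.
train_data = [
    ("apple released a new iphone", "company"),
    ("the apple fell from the tree", "fruit"),
]
class_0, class_1 = "company", "fruit"

words_0, words_1 = [], []
for sentence, tag in train_data:
    words = sentence.split()
    if tag == class_0:
        words_0 += words
    elif tag == class_1:
        words_1 += words

print("Number of words in `{}` class: {}".format(class_0, len(words_0)))
print("Number of words in `{}` class: {}".format(class_1, len(words_1)))
```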
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"As you might have observed that our dataset is equally balanced i.e. we have an equal number of words in both the classes."
+"As you might have observed, our dataset is balanced, i.e. we have an equal number of words in both the classes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"**Step 4:-Making the Naive Bayes Model**\n",
+"**Step 4:- Building the Naive Bayes Model**\n",
"\n",
-"Using Naive Bayes classifier we can calculate the probability of a word in `company` and `fruit` class and then multiplying all of them to get the probability of that sentence belonging each of the given classes. But if a word is not in our dictionary then this leads to the probability of that class becoming zero. For eg:- the word *Foxconn* is not in the dictionary of any of the classes. Due to this the \n",
+"Using the Naive Bayes classifier we can calculate the probability of a word belonging to the `company` or the `fruit` class, and then multiply all of them to get the probability of a sentence belonging to each of the given classes. But if a word is not in our dictionary, the probability of that word belonging to that class becomes zero. For example, the word *Foxconn* is not in the dictionary of either class. Due to this, the probability of the word *Foxconn* being in any of these classes becomes zero, and since all the probabilities are multiplied, the probability of the sentence belonging to any of the classes becomes zero. \n",
"\n",
"To solve the problem we need to use **smoothing**, i.e. providing a minimum non-zero threshold probability to every word that we come across.\n",
"\n",
-"The `UnigramWordModel` class has implemented smoothing by taking an additional argument from the user i.e. the minimum frequency that we would be giving to every word even if it is new to the dictionary."
+"The `UnigramWordModel` class has implemented smoothing by taking an additional argument from the user, i.e. the minimum frequency that we would be giving to every word even if it is new to the dictionary."
]
},
{
@@ -948,7 +948,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Now we would be making the Naive Bayes model. For that, we would be making `dist` as we had done earlier in the Authorship Recognition Task."
+"Now we would be building the Naive Bayes model. For that, we would be making `dist` as we had done earlier in the Authorship Recognition Task."
]
},
{
@@ -961,16 +961,16 @@
"\n",
"dist = {('company', 1): model_words_0, ('fruit', 1): model_words_1}\n",
"\n",
-"nBS = NaiveBayesLearner(dist, simple = True)"
+"nBS = NaiveBayesLearner(dist, simple=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"**Step 5:- Predict the class of the label**\n",
+"**Step 5:- Predict the class of a sentence**\n",
"\n",
-"Now we would be making a function that does pre-procesing of the sentences which we have taken for testing. And then predicting the class of every sentence in the document."
+"Now we will write a function that pre-processes the sentences taken for testing, and then predicts the class of every sentence in the document."
]
},
{
@@ -999,7 +999,7 @@
}
],
"source": [
-"# prediction the class of every sentence in the test set\n",
+"# predicting the class of sentences in the test set\n",
"for i in test_data:\n",
" print(i + \"\\t-\" + recognize(i, nBS))"
]
@@ -1008,17 +1008,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"You might have observed that our predictions are correct. Though they might not give correct results because of lack of data. And we are clearly able to differentiate between sentences in a much better way. \n",
+"You might have observed that the predictions made by the model are correct and we are able to differentiate between sentences of different classes. You can try more sentences on your own. Unfortunately though, since the datasets are pretty small, chances are the guesses will not always be correct.\n",
"\n",
-"As you might have observed that the above method is very much similar to the Authorship prediction, which is also a type of Text Classification. Like this most of Text Classification have the same underlying structure and follow a similar procedure."
+"As you might have observed, the above method is very similar to Author Recognition, which is also a type of Text Classification. Like this, most Text Classification tasks have the same underlying structure and follow a similar procedure."
]
},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
-}
],
"metadata": {