Extracting, Cleaning and Pre-processing Text
• Need for it
Need for Tokenization
• In order to get our computer to understand any text, we need to
break that text down in a way that our machine can understand.
That is where the concept of tokenization in Natural Language
Processing (NLP) comes in.
• Tokenization is a way of separating a piece of text into smaller units
called tokens. Here, tokens can be either words, characters, subwords
or sentences. Hence, tokenization can be broadly classified into 3
types – word, character, and subword (n-gram characters)
tokenization.
How is Tokenization done
• The most common way of forming tokens is to split on whitespace.
Assuming space as the delimiter, tokenizing the phrase "Natural
Language Processing" results in 3 tokens – Natural, Language and
Processing. As each token is a word, this is an example of word
tokenization.
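• As a minimal illustration (assuming Python with the NLTK library, which the
slides do not name), word tokenization can be done either by splitting on
spaces or with NLTK's tokenizer:

    # Word tokenization sketch: split on whitespace, or use NLTK's tokenizer.
    # Assumes nltk is installed and the 'punkt' tokenizer models are downloaded.
    import nltk

    text = "Natural Language Processing"

    tokens_by_space = text.split()          # ['Natural', 'Language', 'Processing']
    tokens_nltk = nltk.word_tokenize(text)  # also handles punctuation

    print(tokens_by_space)
    print(tokens_nltk)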
Bigrams, Trigrams & Ngrams
• N-grams are simply all combinations of adjacent words or letters of
length n that you can find in your source text.
• The basic point of n-grams is that they capture language structure
from a statistical point of view, such as which letter or word is likely
to follow a given one. The longer the n-gram (the higher the n), the
more context you have to work with. The optimum length really depends
on the application: if your n-grams are too short, you may fail to
capture important differences; if they are too long, you may fail to
capture the "general knowledge" and only stick to particular cases.
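• A small sketch of extracting bigrams, trigrams and general n-grams
(assuming Python with NLTK; the sentence is illustrative):

    # nltk.ngrams slides a window of length n over the token list.
    from nltk import ngrams, word_tokenize

    sentence = "Natural Language Processing helps machines understand text"
    tokens = word_tokenize(sentence)

    bigrams = list(ngrams(tokens, 2))   # pairs of adjacent words
    trigrams = list(ngrams(tokens, 3))  # triples of adjacent words

    print(bigrams[:3])   # [('Natural', 'Language'), ('Language', 'Processing'), ...]
    print(trigrams[:2])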
Frequency Distribution
• A frequency distribution records the number of times each outcome
of an experiment has occurred. For example, a frequency distribution
could be used to record the frequency of each word type in a
document
• To fully understand what is happening within our data, we can use a
frequency distribution function that will display the most frequent
tokens and words.
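• A minimal sketch with NLTK's FreqDist (an illustrative choice; the sample
text is made up):

    from nltk import FreqDist, word_tokenize

    text = "the cat sat on the mat and the cat slept"
    tokens = word_tokenize(text)

    fdist = FreqDist(tokens)     # counts how often each token occurs
    print(fdist.most_common(3))  # e.g. [('the', 3), ('cat', 2), ('sat', 1)]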
Cleaning up your data
• While working with frequency distributions, we noticed that many
words were identified as "different" tokens because some instances
were capitalized while others were not. In order to combine these
groups, we must convert all the characters into their lowercase forms.
• Next, we must remove meaningless punctuation. This allows our model
to focus entirely on the important words and phrases in the text
instead of on punctuation marks such as periods.
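• A minimal cleaning sketch (plain Python standard library; the sample string
is illustrative):

    import string

    text = "Natural Language Processing. natural language processing!"

    lowered = text.lower()  # merge case variants into one token form
    cleaned = lowered.translate(str.maketrans("", "", string.punctuation))  # drop punctuation

    print(cleaned)  # natural language processing natural language processing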
Stemming
• Stemming is the process of reducing a word to its word stem by
chopping off affixes (suffixes and prefixes), approximating the root
form of the word, sometimes referred to as the lemma.
• The input to the stemmer is tokenized words
• Stemming is used in information retrieval systems like search engines.
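• A minimal sketch with NLTK's Porter stemmer (one of several stemming
algorithms; the token list is illustrative). As noted above, the input is a
list of already-tokenized words:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    tokens = ["connects", "connected", "connecting", "connection"]

    stems = [stemmer.stem(t) for t in tokens]
    print(stems)  # ['connect', 'connect', 'connect', 'connect']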
Lemmatization
• Lemmatization is the process of grouping together the different
inflected forms of a word so they can be analyzed as a single item.
Lemmatization is similar to stemming but it brings context to the
words. So it links words with similar meanings to one word.
• To make lemmatization work, we have to supply a dictionary so that
words can be linked to their root lemma.
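• A minimal sketch with NLTK's WordNet lemmatizer, which uses the WordNet
dictionary as its lookup (an illustrative choice; requires the 'wordnet' data
to be downloaded):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    print(lemmatizer.lemmatize("mice"))              # mouse
    print(lemmatizer.lemmatize("running", pos="v"))  # run (needs the verb POS hint)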
Difference between Stemming and Lemmatization
• In stemming, a part of the word is just chopped off at the tail end to arrive at the stem of
the word. There are different algorithms used to find out how many characters
have to be chopped off, but the algorithms don't actually know the meaning of the word
in the language it belongs to. In lemmatization, on the other hand, the algorithms have
this knowledge. In fact, you can even say that these algorithms refer to a dictionary to
understand the meaning of the word before reducing it to its root word, or lemma.
• So, a lemmatization algorithm would know that the word better is derived from the
word good, and hence, the lemma is good. But a stemming algorithm wouldn't be able to
do the same. There could be over-stemming or under-stemming, and the
word better could be reduced to either bet, or bett, or just retained as better. But there is
no way in stemming that it could be reduced to its root word good. This, basically, is the
difference between stemming and lemmatization.
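• A quick comparison sketch (assuming NLTK's Porter stemmer and WordNet
lemmatizer, which the slides do not prescribe): the lemmatizer maps "better"
to "good" when told it is an adjective, while the stemmer cannot:

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("better"))                   # better (no route to 'good')
    print(lemmatizer.lemmatize("better", pos="a"))  # good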
Stopwords
• Stop words are a set of commonly used words in a language. Examples of stop
words in English are "a", "the", "is" and "are". Stop word lists are commonly
used in Text Mining and Natural Language Processing (NLP) to eliminate words
that occur so frequently that they carry very little useful information.
• This can be done by maintaining a list of stop words (which can be manually or
automatically curated) and preventing all words from your stop word list from
being analyzed.
• On removing stopwords, dataset size decreases and the time to train the model
also decreases
• Removing stopwords can potentially help improve the performance as there are
fewer and only meaningful tokens left. Thus, it could increase classification
accuracy
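• A minimal sketch of stopword removal using NLTK's built-in English stopword
list (an illustrative choice; requires the 'stopwords' data to be downloaded):

    from nltk.corpus import stopwords
    from nltk import word_tokenize

    stop_words = set(stopwords.words("english"))

    text = "This is a simple example of removing the stop words from a sentence"
    tokens = word_tokenize(text.lower())

    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)  # ['simple', 'example', 'removing', 'stop', 'words', 'sentence']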
Part Of Speech Tagging
• It is the process of converting a sentence into a list of words and then
into a list of tuples, where each tuple has the form (word, tag). The tag
here is a part-of-speech tag, and it signifies whether the word is a noun,
adjective, verb, and so on.
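• A minimal sketch with NLTK's pos_tag (an illustrative choice; requires the
'averaged_perceptron_tagger' data to be downloaded):

    from nltk import pos_tag, word_tokenize

    sentence = "The quick brown fox jumps over the lazy dog"
    tagged = pos_tag(word_tokenize(sentence))

    print(tagged)  # [('The', 'DT'), ('quick', 'JJ'), ...]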
Named Entity Recognition