README

NOTE: This is a work in progress. I haven't been able to achieve a decent level of classification accuracy yet. Suggestions or contributions are welcome.

A machine learning (supervised classification) application that learns from WhatsApp chat history used as training data and, given a message from an unknown user, predicts which user most likely sent it.

TODO

Handle the case where a user sends a forwarded message. Such messages should not be counted as part of that user's data.
Show the unique users after the data is scanned, so that messages from the same user under different names can be taken care of.
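
A minimal sketch of the second TODO item, assuming the parser yields (sender, text) pairs. That shape, the function names, and the alias map are hypothetical and not taken from parse_input.py.

```python
from collections import Counter


def sender_counts(parsed_messages):
    """Return a Counter mapping each sender name to its message count.

    parsed_messages is assumed to be an iterable of (sender, text) pairs.
    """
    return Counter(sender for sender, _ in parsed_messages)


def merge_aliases(parsed_messages, aliases):
    """Rewrite sender names using a hand-written alias -> canonical name map."""
    return [(aliases.get(sender, sender), text) for sender, text in parsed_messages]


# Hypothetical example data: the same person appears under two names.
example = [("Alice", "hi"), ("alice k", "on my way"), ("Bob", "ok")]
for name, count in sender_counts(example).most_common():
    print(f"{name}: {count} messages")
```

Printing the counts first makes duplicate names visible, after which the alias map can be filled in by hand and applied before training.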

Current implementation

Bag-of-words technique for feature vectorization (CountVectorizer with n-grams in the range 1-5)
Classification Algorithm - Stochastic Gradient Descent
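
The setup above could look roughly like the following with scikit-learn. This is a sketch, not the code in train_classifier.py: it assumes word-level n-grams (the CountVectorizer default), default SGDClassifier settings, and made-up toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline


def build_model():
    """Bag of words over word 1- to 5-grams, feeding a linear model trained with SGD."""
    return make_pipeline(
        CountVectorizer(ngram_range=(1, 5)),
        # random_state is fixed only so that repeated runs give the same result.
        SGDClassifier(random_state=0),
    )


# Hypothetical toy data standing in for the parsed chat history.
messages = [
    "are we still on for lunch",
    "lol ok see you",
    "running late sorry",
    "lol sure thing",
]
senders = ["alice", "bob", "alice", "bob"]

model = build_model()
model.fit(messages, senders)
prediction = model.predict(["lol ok"])[0]  # one of the sender names seen in training
```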

Current testing

Training data - 9570 messages from 8 unique users
Accuracy from K-Fold (K=6) method ~ 40.57% :(
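
For scale: with K=6, each fold holds out 9570 / 6 = 1595 messages and trains on the other 7975. Guessing uniformly at random among 8 users would score 1/8 = 12.5% on average; a majority-class baseline may be higher if the users are unevenly represented (per-user counts are not listed here). The sketch below shows the fold arithmetic with a hand-rolled, unshuffled split. The function name is hypothetical and this is not necessarily how the project splits its data.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds.

    When n_samples is not divisible by k, the first n_samples % k folds
    get one extra sample each.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(kfold_indices(9570, 6))
held_out_sizes = [len(test) for _, test in folds]
uniform_guess_accuracy = 1 / 8  # expected accuracy of uniform random guessing
```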

Possible improvements

Other Naive Bayes implementations (like Bernoulli)
Use N-grams
Try SVMs (sklearn.linear_model.SGDClassifier) ?
TF-IDF transforms?
Create and analyze confusion matrix
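
For the confusion matrix item, a small dependency-free sketch is given below. Rows are the true sender and columns the predicted sender; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def build_confusion_matrix(true_labels, predicted_labels):
    """Return (labels, matrix) where matrix[i][j] counts messages whose true
    sender is labels[i] and whose predicted sender is labels[j]."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    pair_counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


# Hypothetical labels: one of alice's messages is misattributed to bob.
true_senders = ["alice", "alice", "bob", "bob"]
predicted_senders = ["alice", "bob", "bob", "bob"]
labels, matrix = build_confusion_matrix(true_senders, predicted_senders)
for label, row in zip(labels, matrix):
    print(label, row)
```

Large off-diagonal entries point at pairs of users whose messages the classifier confuses, which is one way to decide which extra features are worth trying.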

Possible features in case feature extraction has to be done manually

Number of words in message
Number of consecutive emojis in message
Type of emojis entered
Message endings
Number of non-dictionary words
Number of capitalized words
Total number of messages sent by a person
Non-alphabet characters in message
Time of day when a user sends the message
Vowel Frequency
Consonant Frequency
Digit Frequency
Punctuation Frequency
Spacing Frequency
Special Character Frequency
Word Count
Characters Per Word
Words Per Sentence
Preposition Frequency
Pronoun Frequency
Determiner Frequency
Conjunction Frequency
Attribution Frequency
Link Frequency
1 Letter Word
2 Letter Word
3 Letter Word
4 Letter Word
5 Letter Word
6 Letter Word
7 Letter Word
8-10 Letter Word
11-20 Letter Word
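
A few of the features listed above can be computed with the standard library alone, as sketched below. The definitions are assumptions: "frequency" is taken as a fraction of all characters in the message, word length is measured after stripping surrounding punctuation, and the function and key names are hypothetical.

```python
import re
import string

VOWELS = set("aeiou")


def length_bucket(n):
    """Map a word length to one of the word-length feature names, or None."""
    if 1 <= n <= 7:
        return f"{n}_letter_words"
    if 8 <= n <= 10:
        return "8_10_letter_words"
    if 11 <= n <= 20:
        return "11_20_letter_words"
    return None


def extract_features(message):
    """Return a dict of simple stylometric features for one message."""
    words = message.split()
    total_chars = len(message) or 1  # avoid division by zero on empty messages
    lowered = message.lower()
    features = {
        "word_count": len(words),
        "capitalized_words": sum(1 for w in words if w[0].isupper()),
        "chars_per_word": sum(len(w) for w in words) / (len(words) or 1),
        "vowel_freq": sum(c in VOWELS for c in lowered) / total_chars,
        "digit_freq": sum(c.isdigit() for c in message) / total_chars,
        "punctuation_freq": sum(c in string.punctuation for c in message) / total_chars,
        "spacing_freq": sum(c.isspace() for c in message) / total_chars,
        "link_count": len(re.findall(r"https?://\S+", message)),
    }
    # Fixed set of bucket keys so every message yields the same feature names.
    for n in range(1, 8):
        features[f"{n}_letter_words"] = 0
    features["8_10_letter_words"] = 0
    features["11_20_letter_words"] = 0
    for word in words:
        bucket = length_bucket(len(word.strip(string.punctuation)))
        if bucket is not None:
            features[bucket] += 1
    return features


features = extract_features("Hello World 42!")
```

Because every message produces the same keys, the resulting dicts can be turned into fixed-width numeric vectors for a classifier.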

Please refer to: https://www.quora.com/Anonymity-Quora-feature/How-vulnerable-are-Quora-answers-to-automated-writing-style-analysis

vipulchaskar/IMLearner
