Analyze SPAM data

Motivation

Preprocessing
- Read the data with UTF-8 encoding
- Split each comment on spaces into tokens to feed Doc2Vec (D2V)
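
The two preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `spamClassifier.py`: the helper name `load_comments` is hypothetical, and the column names `CONTENT` and `CLASS` are assumptions about the CSV layout.

```python
import csv

def load_comments(path, text_col="CONTENT", label_col="CLASS"):
    """Read a CSV as UTF-8 and return (token lists, integer labels).

    text_col and label_col are assumed column names; adjust them
    to match the actual header of the data file.
    """
    tokens, labels = [], []
    # newline="" is the recommended way to open files for the csv module.
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f):
            # str.split() splits on runs of whitespace and drops empty tokens.
            tokens.append(row[text_col].split())
            labels.append(int(row[label_col]))
    return tokens, labels
```

Opening the file with an explicit `encoding="utf-8"` avoids depending on the platform default, which matters for comments that contain emoji or non-Latin text.
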
Doc2Vec
- Input: comments of varying length
- Output: fixed-length vectors, e.g., $V_{100\times1}$
- The vector size is chosen mainly according to the maximum comment length
DNN
- Input: comment vectors from Doc2Vec
- Output: spam or ham
- Loss function: sigmoid cross-entropy
- Optimizer: Adaptive Moment Estimation (Adam)
- Tuned hyperparameters: e.g., batch size, learning rate
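
To make the loss and optimizer above concrete, here is a small pure-Python sketch of sigmoid cross-entropy and a single Adam update, applied to one scalar weight. It is not the network in this repository: the function names, the toy data point, and the hyperparameter values (including `lr=0.1`) are assumptions chosen for illustration.

```python
import math

def sigmoid_cross_entropy(logit, label):
    """Numerically stable binary cross-entropy computed from a raw logit.

    Equals -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Toy example: one weight w, one input x labeled as spam (y = 1).
w, m, v = 0.0, 0.0, 0.0
x, y = 1.0, 1
for t in range(1, 101):
    logit = w * x
    # Gradient of the loss with respect to w is (sigmoid(logit) - y) * x.
    grad = (1.0 / (1.0 + math.exp(-logit)) - y) * x
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)

final_loss = sigmoid_cross_entropy(w * x, y)
```

Computing the loss from the logit rather than from the sigmoid output avoids taking `log(0)` when the prediction saturates.
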

Demo

$ python spamClassifier.py [--data=Youtube03-LMFAO.csv] [--text="I am spam"]
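
The square brackets in the command mark both flags as optional. A command line of this shape could be parsed along the following lines. This is a hypothetical sketch, not the parser in `spamClassifier.py`: the help texts describe an assumed meaning of each flag, and the `None` defaults are placeholders.

```python
import argparse

def build_parser():
    """Build a parser that accepts the two optional flags shown in the demo."""
    parser = argparse.ArgumentParser(prog="spamClassifier.py")
    # Assumed meaning: path to a CSV file of labeled comments.
    parser.add_argument("--data", default=None, help="CSV file of comments")
    # Assumed meaning: a single comment to classify as spam or ham.
    parser.add_argument("--text", default=None, help="comment text to classify")
    return parser

# Parse the same arguments as in the demo command.
args = build_parser().parse_args(["--data=Youtube03-LMFAO.csv", "--text=I am spam"])
print(args.data, args.text)
```
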

Performance
Furthermore
- Compare the accuracy (ACC) across datasets using an identical model
- The LMFAO dataset is close to the Eminem and Shakira datasets
