Blog Posts Classification

Implementations of several machine learning algorithms and deep learning models that predict the age group associated with a blog post.

Data source: Blogger, train_posts.csv

Setup

Dependencies:

python3, tensorflow, keras, nltk, textstat, tqdm, scikit-learn, etc.

To install the dependencies:

pip install -r requirements.txt

Usage

Basic Usage

Copy your training and test datasets into the default data directory ../data
Run one of the scripts below with the arguments you need

Validation script

cat ../data/file_name | python3 validation.py [-h] <arguments>

Example:

cat ../data/train_posts.csv | python3 validation.py --classifier lr \
                                                    --C 0.01 0.05 0.10 \
                                                    --solver newton-cg \
                                                    --max_iter 200
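The scripts take their training data on standard input, which is why the commands pipe the CSV through cat. Below is a minimal sketch of that pattern, assuming the file is a comma-separated CSV with a header row. The helper name read_rows and the column names text and label are hypothetical; the layout of train_posts.csv is not documented here.

```python
import csv
import io


def read_rows(stream):
    """Parse CSV text from any file-like object into a list of dicts."""
    # DictReader takes column names from the first row (assumption: a header exists).
    return list(csv.DictReader(stream))


# In a piped script the stream would be sys.stdin: rows = read_rows(sys.stdin)
# An in-memory sample with hypothetical columns stands in for the real file here.
sample = io.StringIO("text,label\nfirst post,a\nsecond post,b\n")
rows = read_rows(sample)
print(len(rows), rows[0]["label"])
```

Passing the stream as a parameter keeps the parsing logic independent of where the data comes from, so the same function works with a pipe, an open file, or a test fixture.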

Arguments

--classifier: Classifier to use. Choices: nb (Naive Bayes), lr (Logistic Regression), rf (Random Forest). Default: nb
--alpha: Alpha values for classifier=nb. Default: [0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7]
--C: C values for classifier=lr. Default: [0.001, 0.1, 1.0, 2.0, 3.0, 5.0, 6.0, 7.0, 7.5, 10.0]
--test_size: Fraction of the data held out as the test set. Default: 0.10
--max_iter: Maximum number of iterations for classifier=lr. Default: 100
--solver: Solver for classifier=lr. Default: lbfgs. Choices: ['lbfgs', 'newton-cg', 'sag', 'saga']
--n_estimators: Number of estimators for classifier=rf. Default: 300
--max_depth: Max depth for classifier=rf. Default: 3
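The argument list above can be expressed as an argparse parser. This is only a sketch built from the documented names, choices, and defaults; the parser inside validation.py may be organised differently, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented validation arguments (illustrative)."""
    parser = argparse.ArgumentParser(description="Validation arguments (sketch)")
    parser.add_argument("--classifier", choices=["nb", "lr", "rf"], default="nb")
    # nargs="+" lets one flag carry several candidate values.
    parser.add_argument("--alpha", type=float, nargs="+",
                        default=[0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.2,
                                 0.25, 0.3, 0.4, 0.5, 0.6, 0.7])
    parser.add_argument("--C", type=float, nargs="+",
                        default=[0.001, 0.1, 1.0, 2.0, 3.0, 5.0, 6.0, 7.0, 7.5, 10.0])
    parser.add_argument("--test_size", type=float, default=0.10)
    parser.add_argument("--max_iter", type=int, default=100)
    parser.add_argument("--solver", choices=["lbfgs", "newton-cg", "sag", "saga"],
                        default="lbfgs")
    parser.add_argument("--n_estimators", type=int, default=300)
    parser.add_argument("--max_depth", type=int, default=3)
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(
    ["--classifier", "lr", "--C", "0.01", "0.05", "0.10",
     "--solver", "newton-cg", "--max_iter", "200"]
)
print(args.classifier, args.C, args.solver, args.max_iter)
```

Arguments that are not passed keep their documented defaults, so args.test_size is still 0.10 in this run.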

Prediction script

cat ../data/file_name | python3 predictions.py [-h] <arguments>

Example:

cat ../data/train_posts.csv | python3 predictions.py --classifier rf \
                                                     --n_estimators 200 \
                                                     --max_depth 5 \
                                                     --test_files test_split01.csv test_split02.csv

Arguments

--test_data_dir: Directory path to the test datasets. Default: ../data
--test_files: Names of the test files in test_data_dir. Default: 'test_split01.csv', 'test_split02.csv', 'test_split03.csv', 'test_split04.csv', 'test_split05.csv', 'test_split06.csv', 'test_split07.csv', 'test_split08.csv', 'test_split09.csv', 'test_split10.csv', 'test_split11.csv'
--classifier: Classifier to use. Choices: nb (Naive Bayes), lr (Logistic Regression), rf (Random Forest). Default: nb
--alpha: Alpha value for classifier=nb. Default: 0.005
--C: C value for classifier=lr. Default: 7.0
--max_iter: Maximum number of iterations for classifier=lr. Default: 100
--solver: Solver for classifier=lr. Default: lbfgs. Choices: ['lbfgs', 'newton-cg', 'sag', 'saga']
--n_estimators: Number of estimators for classifier=rf. Default: 300
--max_depth: Max depth for classifier=rf. Default: 3
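The two options --test_data_dir and --test_files combine into full file paths. A minimal sketch of that combination, assuming each file name is simply joined onto the directory (resolve_test_paths and DEFAULT_TEST_FILES are hypothetical names, not taken from predictions.py):

```python
from pathlib import Path

# The documented default: test_split01.csv through test_split11.csv.
DEFAULT_TEST_FILES = [f"test_split{i:02d}.csv" for i in range(1, 12)]


def resolve_test_paths(test_data_dir="../data", test_files=None):
    """Join each test file name onto the test data directory."""
    names = DEFAULT_TEST_FILES if test_files is None else test_files
    return [Path(test_data_dir) / name for name in names]


paths = resolve_test_paths(test_files=["test_split01.csv", "test_split02.csv"])
print([p.as_posix() for p in paths])
```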

LSTM/BiLSTM train script

cat ../data/file_name | python3 lstm_train.py [-h] <arguments>

Example:

cat ../data/train_posts.csv | python3 lstm_train.py --bilstm \
                                                    --units 32 \
                                                    --spatial_dropout 0.5 \
                                                    --dropout 0.5 \
                                                    --recurrent_dropout 0.5 \
                                                    --batch_size 256 \
                                                    --epochs 5

Arguments

--test_size: Fraction of the data held out as the test set. Default: 0.10
--vocab_size: Vocabulary size for the embedding layer. Default: 10000
--max_words: Maximum number of words per blog post. Default: 250
--embedding_dim: Dimension of the embedding. Default: 100
--bilstm: Flag. When set, a BiLSTM is used instead of an LSTM.
--units: Number of units in the LSTM layer. Default: 100
--spatial_dropout: Spatial dropout 1D. Default: 0.4
--dropout: Dropout of the LSTM layer. Default: 0.4
--recurrent_dropout: Recurrent dropout of the LSTM layer. Default: 0.4
--batch_size: Batch size. Default: 1000
--epochs: Number of epochs. Default: 10
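To illustrate what --vocab_size and --max_words control, here is a pure-Python sketch of turning posts into fixed-length integer sequences for an embedding layer. It is an illustration only: lstm_train.py presumably relies on Keras utilities, whose tokenization, padding, and truncation conventions may differ. The function names, the whitespace tokenizer, and the choice to reserve id 0 for padding are all assumptions.

```python
from collections import Counter


def build_vocab(texts, vocab_size):
    """Map the most frequent words to ids 1..vocab_size-1; id 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(vocab_size - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}


def encode(text, vocab, max_words):
    """Convert a post to exactly max_words ids, skipping out-of-vocabulary words."""
    ids = [vocab[w] for w in text.lower().split() if w in vocab]
    ids = ids[:max_words]                      # truncate long posts
    return ids + [0] * (max_words - len(ids))  # right-pad short posts with 0


posts = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocab(posts, vocab_size=3)  # keeps the 2 most frequent words
encoded = [encode(p, vocab, max_words=5) for p in posts]
print(vocab)
print(encoded)
```

A larger --vocab_size keeps rarer words at the cost of a bigger embedding matrix, while --max_words fixes the sequence length that every post is cut or padded to.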

LSTM/BiLSTM prediction script

cat ../data/file_name | python3 lstm_predict.py [-h] <arguments>

Example:

cat ../data/train_posts.csv | python3 lstm_predict.py --bilstm \
                                                      --units 32 \
                                                      --spatial_dropout 0.5 \
                                                      --dropout 0.5 \
                                                      --recurrent_dropout 0.5 \
                                                      --batch_size 256 \
                                                      --epochs 5 \
                                                      --test_files test_split01.csv test_split02.csv

Arguments

--test_data_dir: Directory path to the test datasets. Default: ../data
--test_files: Names of the test files in test_data_dir. Default: 'test_split01.csv', 'test_split02.csv', 'test_split03.csv', 'test_split04.csv', 'test_split05.csv', 'test_split06.csv', 'test_split07.csv', 'test_split08.csv', 'test_split09.csv', 'test_split10.csv', 'test_split11.csv'
--vocab_size: Vocabulary size for the embedding layer. Default: 10000
--max_words: Maximum number of words per blog post. Default: 250
--embedding_dim: Dimension of the embedding. Default: 100
--bilstm: Flag. When set, a BiLSTM is used instead of an LSTM.
--units: Number of units in the LSTM layer. Default: 100
--spatial_dropout: Spatial dropout 1D. Default: 0.4
--dropout: Dropout of the LSTM layer. Default: 0.4
--recurrent_dropout: Recurrent dropout of the LSTM layer. Default: 0.4
--batch_size: Batch size. Default: 1000
--epochs: Number of epochs. Default: 1

Dummy classifier

Create the model

cat ../data/train_posts.csv | python3 dummy_model.py

Prediction

cat ../data/test_split01.csv | python3 dummy_predict.py models/dummy-most.clf

Evaluate

cat ../data/test_split01.csv | python3 dummy_eval.py out/dummy-most.clf.out
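The file name dummy-most.clf suggests a baseline that always predicts the most frequent class seen in training. That reading is an assumption, and the sketch below is not the code in dummy_model.py; the class name, the accuracy helper, and the toy labels are hypothetical.

```python
from collections import Counter


class MostFrequentClassifier:
    """Baseline that ignores the input and always predicts the majority label."""

    def fit(self, labels):
        if not labels:
            raise ValueError("cannot fit on an empty label list")
        # most_common(1) returns [(label, count)] for the top label.
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_samples):
        return [self.majority_] * n_samples


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


train_labels = ["a", "b", "b", "c", "b"]  # toy labels, not real data
clf = MostFrequentClassifier().fit(train_labels)
test_labels = ["b", "a", "b"]
predictions = clf.predict(len(test_labels))
print(predictions, accuracy(test_labels, predictions))
```

A baseline like this gives the accuracy floor that the other classifiers need to beat on an imbalanced label distribution.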

Universal Sentence Encoder with Google Colab IPython Notebook

Open IFT6285-Dev1.ipynb in Google Colab
Upload the test splits to the My Drive/Colab Notebooks/ folder
(Optional) Enable the GPU by selecting Runtime > Change runtime type > Hardware accelerator > GPU, then Save
Run the notebook

Authors

Thach Jean-Pierre - University of Montreal

Wong Leo - University of Montreal
