
573

D4 Instructions

Command to test the model (Fine-tuned BERT):

condor_submit D4.cmd

The above command runs the system for the adaptation task on its evaltest data.

We use the test dataset (/data/test/task_b_labels.csv, /data/test/test_b_tweets.tsv) to test our system. The model's predictions will be stored in outputs/D4/adaptation/vinai/bertweet-base_1000/evaltest/predictions.csv, and the confusion matrix and performance metrics will be stored in results/D4/adaptation/vinai/bertweet-base_1000/evaltest/D4_scores.out.
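The confusion matrix and metrics in the scores file can be computed from the gold and predicted labels. A minimal sketch (assumed function name and I/O shapes, not the project's actual evaluation code) for the two-class OFF/NOT task:

```python
# Build a 2x2 confusion matrix and per-class precision/recall/F1
# from parallel lists of gold and predicted labels.
from collections import Counter

def confusion_and_metrics(gold, pred, labels=("OFF", "NOT")):
    counts = Counter(zip(gold, pred))  # (gold_label, pred_label) -> count
    matrix = [[counts[(g, p)] for p in labels] for g in labels]
    metrics = {}
    for label in labels:
        tp = counts[(label, label)]
        fp = sum(counts[(g, label)] for g in labels if g != label)
        fn = sum(counts[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return matrix, metrics

# Toy example: two OFF and two NOT gold labels, one OFF missed.
matrix, metrics = confusion_and_metrics(
    ["OFF", "OFF", "NOT", "NOT"], ["OFF", "NOT", "NOT", "NOT"]
)
```

Rows of the matrix are gold labels and columns are predictions, so off-diagonal cells show which class the errors fall into.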

The model is too large to upload without Git LFS, which isn't installed on patas, so we use a hardcoded path to the model (/home2/chiragms/573/models/adaptation/vinai/bertweet-base_1000) on the cluster.

D3 Instructions

Command to test the model (Fine-tuned BERT):

condor_submit D3.cmd

We use the validation dataset (/data/dev/task_a_distant.tsv) to test our system. The model's predictions will be stored in outputs/D3/bert-base-cased_2000/dev/predictions.csv, and the confusion matrix and performance metrics will be stored in results/D3/bert-base-cased_2000/dev/D3_scores.out.

The model is too large to upload without Git LFS, which isn't installed on patas, so we use a hardcoded path to the model (/home2/chiragms/573/models/bert-base-cased_2000) on the cluster.

D2 Instructions

Command to test the model (tfidf + svm):

condor_submit D2.cmd

We use the validation dataset (/data/dev/task_a_distant.tsv) to test our system. The model's predictions will be stored in /home2/chiragms/573/outputs/tfidf_svm/dev/predictions.csv, and the confusion matrix and performance metrics will be stored in /home2/chiragms/573/results/tfidf_svm/dev/D2_scores.out.

Primary task - Offense detection

We classify tweets into two categories: OFF (offensive) and NOT (not offensive).

Data format

We start with a semi-supervised dataset of 9 million tweets annotated with average affect scores. The affect score ranges from 0 to 1, with 0 being not offensive and 1 being the offensive extreme.

So far we have been able to download only 7 million of these tweets due to Twitter's rate limit on its API, and we used only a subset of the dataset (around 64,000 tweets) for this assignment. We split this subset 80:20 into train and dev sets, so we trained our model on around 51,000 tweets and still observed decent results. We plan to use a larger subset of the data in the next assignments.
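The 80:20 split can be sketched as follows (hypothetical function, not necessarily how split.py does it; the fixed seed keeps the split reproducible across runs):

```python
# Shuffle the rows with a seeded RNG, then cut off the last 20% as dev.
import random

def train_dev_split(rows, dev_ratio=0.2, seed=573):
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - dev_ratio))
    return rows[:cut], rows[cut:]

# 64,000 rows -> 51,200 train and 12,800 dev examples.
train, dev = train_dev_split(range(64000))
```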

We did no preprocessing on the data in this assignment because we wanted a baseline without preprocessing, so that the effect of preprocessing can be measured in the next assignments.
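For later assignments, sanitization could look something like the sketch below (hypothetical steps, not what preprocess.py currently does): lowercase the text, mask URLs and user mentions, and collapse whitespace.

```python
# Hypothetical tweet sanitizer: lowercase, mask URLs and @-mentions,
# and normalize runs of whitespace to single spaces.
import re

def sanitize_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "URL", text)   # mask links
    text = re.sub(r"@\w+", "@USER", text)         # mask mentions
    return re.sub(r"\s+", " ", text).strip()

cleaned = sanitize_tweet("Check this  http://t.co/abc @Bob!!")
```

Masking rather than deleting URLs and mentions keeps the token positions available as features while removing user-identifying content.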

Model

We used tf-idf (term frequency - inverse document frequency) to transform the raw text into vector representations and then trained an SVM regression (SVR) model on the training dataset.

We used a threshold of 0.5 to map the predicted affect score to a class label: any prediction greater than 0.5 was marked OFF, and the rest were marked NOT.
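The tf-idf + SVR pipeline with the 0.5 threshold can be sketched with scikit-learn (toy data and default hyperparameters shown for illustration; the real system trains on the ~51,000-tweet subset):

```python
# Vectorize tweets with tf-idf, regress on affect scores with SVR,
# then threshold the predicted score at 0.5 to get OFF/NOT labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

texts = ["you are awful", "have a nice day", "awful person", "nice weather"]
scores = [0.9, 0.1, 0.8, 0.2]  # average affect scores in [0, 1]

model = make_pipeline(TfidfVectorizer(), SVR())
model.fit(texts, scores)

preds = model.predict(["awful awful person", "nice day"])
labels = ["OFF" if p > 0.5 else "NOT" for p in preds]
```

Because SVR predicts a continuous score rather than a class, the threshold is a separate, tunable step; 0.5 mirrors the midpoint of the affect-score range.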

Directory structure

data: contains the datasets
    |- /dev: validation dataset
    |- /test: test dataset from the conference
    |- /train: training dataset
    |- /SOLID*: Original datasets from the conference
    |- /subsets*: Subset of the whole dataset
    |- /tweets*: Downloaded tweets
doc: reports, presentations
models: saved models in pickle binary format
outputs: predictions of the model
    |- /model_name: folder corresponding to the model
        |- /dev: predictions on dev dataset for this model
results: performance metrics such as precision, accuracy, recall, f1 score
    |- /model_name: folder corresponding to the model
        |- /dev: results on dev dataset for this model
src: all the source code
    |- /data: for crunching the datasets
        |- create.py: creates subsets of the dataset
        |- preprocess.py: sanitizes the data
        |- split.py: splits the data into train:dev ratio
    |- /models: code for training and testing the model
        |- model_name: folder for corresponding model
            |- train.py: to train and save the model
            |- dev.py: to run the model on dev dataset
            |- test.py: to run the model on test dataset
        |- /twitter: code related to Twitter APIs
            |- tweet_downloader.py: to download the tweets referenced in the dataset

*These folders are not included in the repository because of large size. They can however be found at /home2/chiragms/573/data on the cluster.

About

Repository for LING 573 course
