
573

D4 Instructions

Command to test the model (Fine-tuned BERT):

condor_submit D4.cmd

The above command runs the system for the adaptation task on its evaltest data.

We use the test dataset (/data/test/task_b_labels.csv, /data/test/test_b_tweets.tsv) to test our system. The model's predictions will be stored in outputs/D4/adaptation/vinai/bertweet-base_1000/evaltest/predictions.csv, and the confusion matrix and performance metrics will be stored in results/D4/adaptation/vinai/bertweet-base_1000/evaltest/D4_scores.out.
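The confusion matrix and metrics in the scores file can be computed from the gold and predicted labels. A minimal sketch (assumed function name and I/O shapes, not the project's actual evaluation code) for the two-class OFF/NOT task:

```python
# Build a 2x2 confusion matrix and per-class precision/recall/F1
# from parallel lists of gold and predicted labels.
from collections import Counter

def confusion_and_metrics(gold, pred, labels=("OFF", "NOT")):
    counts = Counter(zip(gold, pred))  # (gold_label, pred_label) -> count
    matrix = [[counts[(g, p)] for p in labels] for g in labels]
    metrics = {}
    for label in labels:
        tp = counts[(label, label)]
        fp = sum(counts[(g, label)] for g in labels if g != label)
        fn = sum(counts[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return matrix, metrics

# Toy example: two OFF and two NOT gold labels, one OFF missed.
matrix, metrics = confusion_and_metrics(
    ["OFF", "OFF", "NOT", "NOT"], ["OFF", "NOT", "NOT", "NOT"]
)
```

Rows of the matrix are gold labels and columns are predictions, so off-diagonal cells show which class the errors fall into.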

The model is too large to upload without Git LFS, which isn't installed on patas, so we use a hardcoded path to the model (/home2/chiragms/573/models/adaptation/vinai/bertweet-base_1000) on the cluster.

D3 Instructions

Command to test the model (Fine-tuned BERT):

condor_submit D3.cmd

We use the validation dataset (/data/dev/task_a_distant.tsv) to test our system. The model's predictions will be stored in outputs/D3/bert-base-cased_2000/dev/predictions.csv, and the confusion matrix and performance metrics will be stored in results/D3/bert-base-cased_2000/dev/D3_scores.out.

The model is too large to upload without Git LFS, which isn't installed on patas, so we use a hardcoded path to the model (/home2/chiragms/573/models/bert-base-cased_2000) on the cluster.

D2 Instructions

Command to test the model (tfidf + svm):

condor_submit D2.cmd

We use the validation dataset (/data/dev/task_a_distant.tsv) to test our system. The model's predictions will be stored in /home2/chiragms/573/outputs/tfidf_svm/dev/predictions.csv, and the confusion matrix and performance metrics will be stored in /home2/chiragms/573/results/tfidf_svm/dev/D2_scores.out.

Primary task - Offense detection

We classify tweets into two categories: OFF (offensive) and NOT (not offensive).

Data format

We start with a semi-supervised dataset of 9 million tweets annotated with average affect scores. The affect score ranges from 0 to 1, with 0 being not offensive and 1 being the offensive extreme.

So far we have been able to download only 7 million of these tweets due to Twitter's rate limit on its API, and we used only a subset of the dataset (around 64,000 tweets) for this assignment. We split this subset 80:20 into train and dev sets, so we trained our model on around 51,000 tweets and still observed decent results. We plan to use a larger subset of the data in the next assignments.
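The 80:20 split can be sketched as follows (hypothetical function, not necessarily how split.py does it; the fixed seed keeps the split reproducible across runs):

```python
# Shuffle the rows with a seeded RNG, then cut off the last 20% as dev.
import random

def train_dev_split(rows, dev_ratio=0.2, seed=573):
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - dev_ratio))
    return rows[:cut], rows[cut:]

# 64,000 rows -> 51,200 train and 12,800 dev examples.
train, dev = train_dev_split(range(64000))
```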

We did no preprocessing on the data in this assignment because we wanted a baseline without preprocessing, so that the effect of preprocessing can be measured in the next assignments.
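For later assignments, sanitization could look something like the sketch below (hypothetical steps, not what preprocess.py currently does): lowercase the text, mask URLs and user mentions, and collapse whitespace.

```python
# Hypothetical tweet sanitizer: lowercase, mask URLs and @-mentions,
# and normalize runs of whitespace to single spaces.
import re

def sanitize_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "URL", text)   # mask links
    text = re.sub(r"@\w+", "@USER", text)         # mask mentions
    return re.sub(r"\s+", " ", text).strip()

cleaned = sanitize_tweet("Check this  http://t.co/abc @Bob!!")
```

Masking rather than deleting URLs and mentions keeps the token positions available as features while removing user-identifying content.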

Model

We used tf-idf (term frequency - inverse document frequency) to transform the raw text into vector representations and then trained an SVM regression (SVR) model on the training dataset.

We used a threshold of 0.5 to map the predicted affect score to a class label: any prediction greater than 0.5 was marked OFF, and the rest were marked NOT.
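The tf-idf + SVR pipeline with the 0.5 threshold can be sketched with scikit-learn (toy data and default hyperparameters shown for illustration; the real system trains on the ~51,000-tweet subset):

```python
# Vectorize tweets with tf-idf, regress on affect scores with SVR,
# then threshold the predicted score at 0.5 to get OFF/NOT labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

texts = ["you are awful", "have a nice day", "awful person", "nice weather"]
scores = [0.9, 0.1, 0.8, 0.2]  # average affect scores in [0, 1]

model = make_pipeline(TfidfVectorizer(), SVR())
model.fit(texts, scores)

preds = model.predict(["awful awful person", "nice day"])
labels = ["OFF" if p > 0.5 else "NOT" for p in preds]
```

Because SVR predicts a continuous score rather than a class, the threshold is a separate, tunable step; 0.5 mirrors the midpoint of the affect-score range.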

Directory structure

data: contains the datasets
    |- /dev: validation dataset
    |- /test: test dataset from the conference
    |- /train: training dataset
    |- /SOLID*: Original datasets from the conference
    |- /subsets*: Subset of the whole dataset
    |- /tweets*: Downloaded tweets
doc: reports, presentations
models: saved models in pickle binary format
outputs: predictions of the model
    |- /model_name: folder corresponding to the model
        |- /dev: predictions on dev dataset for this model
results: performance metrics such as precision, accuracy, recall, f1 score
    |- /model_name: folder corresponding to the model
        |- /dev: results on dev dataset for this model
src: all the source code
    |- /data: for crunching the datasets
        |- create.py: creates subsets of the dataset
        |- preprocess.py: sanitizes the data
        |- split.py: splits the data into train:dev ratio
    |- /models: code for training and testing the model
        |- model_name: folder for corresponding model
            |- train.py: to train and save the model
            |- dev.py: to run the model on dev dataset
            |- test.py: to run the model on test dataset
        |- /twitter: code related to Twitter APIs
            |- tweet_downloader.py: to download the tweets referenced in the dataset

*These folders are not included in the repository because of large size. They can however be found at /home2/chiragms/573/data on the cluster.

About

Repository for LING 573 course
