


Text Classification in the Romanian Language

This repository contains the code for our team's submission to a Natural Language Processing (NLP) competition (Nitro) hosted on Kaggle. The competition challenged participants to develop a pipeline for sexism text identification in the Romanian language.

Competition Details

The task in this competition was to classify each text into one of five possible categories: (0) Sexist Direct, (1) Sexist Descriptive, (2) Sexist Reporting, (3) Non-sexist Offensive, and (4) Non-sexist Non-offensive.

  • Sexist:
    • Direct: The post contains sexist elements and is directly addressed to a specific gender, usually women.
    • Descriptive: The post describes one or more individuals, usually a woman or a group of women, in a sexist manner without directly addressing them.
    • Reporting: The post reports a witnessed or heard sexist act from other sources.
  • Non-sexist:
    • Offensive: The post does not contain sexist connotations but includes offensive language.
    • Non-offensive: There are no sexist or offensive elements in the post.
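
Expressed as code, the numeric ids above map naturally to a small Python dictionary. This is only a convenience sketch; the string names below are our own shorthand, not official competition identifiers.

ID2LABEL = {
    0: "sexist-direct",
    1: "sexist-descriptive",
    2: "sexist-reporting",
    3: "non-sexist-offensive",
    4: "non-sexist-non-offensive",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}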

The data for this competition has been collected from a variety of sources, including social media networks such as Facebook, Twitter, and Reddit, web articles, and books.

Disclaimer

The data used in this project includes instances of sexism and hate speech; therefore, reader discretion is strongly advised. The contributors to this project strongly oppose discrimination based on gender, religion, race, or any other grounds. One of the goals of this project is to raise awareness about gender bias online.

Training Data

The training dataset provided for this competition consists of 40,000 text samples from CoRoSeOf: An annotated Corpus of Romanian Sexist and Offensive Language, while the test set comprises 3,130 samples.

Participants were expected to use the training data to build a pipeline that can accurately classify the text documents in the test set into the appropriate category.

Submissions were evaluated on balanced accuracy, with ties broken by the number of false negatives in identifying offensive language.
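
For local evaluation, the metric can be approximated with scikit-learn. This is a sketch that assumes the competition metric matches scikit-learn's definition of balanced accuracy (the unweighted mean of per-class recall); the label arrays below are placeholders.

from sklearn.metrics import balanced_accuracy_score

# Placeholder arrays; replace with real ground-truth and predicted label ids (0-4).
y_true = [0, 1, 2, 3, 4, 4, 4, 4]
y_pred = [0, 1, 2, 4, 4, 4, 4, 3]

# Mean of per-class recall, so rare classes count as much as frequent ones.
print(balanced_accuracy_score(y_true, y_pred))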

Our Approach

Our team's approach consisted of the following steps:

  • Data sanitization: We removed any irrelevant information from the dataset, ensuring that it only contained data that was relevant for text classification.

  • Fine-tuning Romanian BERT: We fine-tuned the Romanian BERT model using the training data to improve its performance on the downstream task.

  • Class-weighted cross-entropy loss: We tackled the imbalanced dataset by weighting the cross-entropy loss according to the class frequencies. This gives more weight to underrepresented categories and improved the overall performance of our model (a minimal sketch follows this list).
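
The class-weighted loss lives in src/trainers/imbalanced_dataset_trainer.py. The snippet below is only a minimal sketch of the idea, assuming the Hugging Face Trainer API and precomputed (e.g. inverse-frequency) class weights; it is not the exact implementation.

import torch
from transformers import Trainer

class ClassWeightedTrainer(Trainer):
    """Trainer whose cross-entropy loss is weighted per class."""

    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Tensor of shape (num_labels,), e.g. normalized inverse class frequencies.
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Pop the labels so the model does not compute its own (unweighted) loss.
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        weights = None
        if self.class_weights is not None:
            weights = self.class_weights.to(logits.device)
        loss_fct = torch.nn.CrossEntropyLoss(weight=weights)
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss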

Results

Our team achieved 4th place out of 46 teams in the competition, obtaining 56.84% balanced accuracy (the competition's chosen metric) on the private test set. Our approach proved effective on this challenging task.

| Ground Truth | Predicted | Text |
| --- | --- | --- |
| non-offensive | descriptive | Bărbații sunt de obicei mai înalți decât femeile, în medie |
| non-offensive | non-offensive | Toate genurile merită să se simtă în siguranță în comunitățile lor |
| non-offensive | non-offensive | Am o nevastă și trei copii. |
| offensive | offensive | Sa mi bag p**a in toate numerel de inmatriculare de pe planeta asta |
| offensive | offensive | Învață bă unde sunt tastele a**l-f**e |
| offensive | direct | Sa ma sugi si de p**a si de p**s |
| descriptive | descriptive | Care e diferența dintre o femeie și o vacă? Vacii nu se enervează când le spui că au nevoie de mai mult lapte. |
| descriptive | descriptive | Bărbații sunt mai potriviți pentru poziții de putere. |
| descriptive | descriptive | Femeile au locul în bucătărie |
| direct | direct | Sa iti dau cu p**a peste c**r pana iti raman urme rosii |
| direct | direct | Nu vezi fă cât ești de grasă că te scoate cu elicopterul dacă ai căzut în gaura de canalizare |
| direct | direct | Sunt 20 de grade afară, dar p**a ta are mai multe grade de alcoolemie după ce am stropit-o pe față cu șampanie |
| reporting | reporting | Normalizarea hărțuirii și a agresiunii sexuale, adesea prin prezentarea personajelor feminine ca fiind dispuse sau meritând un astfel de tratament |
| reporting | reporting | O tanara a fost violata de catre un fost iubit. |
| reporting | descriptive | Femeilor li se refuză dreptul de a deține proprietate sau de a avea controlul asupra propriilor finanțe în multe societăți |

As the table above shows, the model is correct most of the time, but, because of the competition's metric, it tends to produce false positives (predicting sexist or offensive instead of the far more common non-offensive label). In practice, more data would be needed and a higher confidence threshold would be required before flagging a comment as sexist or offensive; a sketch of such a threshold follows.
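
One way to apply such a threshold at inference time is to flag a text only when the model assigns enough probability mass to the sexist and offensive classes. This is a sketch of the idea, not code from this repository; the 0.8 cutoff is arbitrary and would need tuning.

import torch

def should_flag(logits, non_offensive_id=4, threshold=0.8):
    """Flag a text only if the combined probability of all labels other
    than non-offensive exceeds the chosen threshold."""
    probs = torch.softmax(logits, dim=-1)
    flagged_prob = 1.0 - probs[non_offensive_id].item()
    return flagged_prob >= threshold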

Other Attempts

We have also attempted to improve our results through ensemble techniques and backtranslation.

  • Regarding ensembles, we found that our single fine-tuned Romanian BERT with the adjusted CE loss still provided the best results.

  • Regarding back-translation, although we tried to augment the dataset this way, we ran into difficulties because the texts are sexist and offensive: the back-translated phrases no longer contained the profanity, which resulted in limited improvement (a sketch of the back-translation step follows this list).
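
For reference, the back-translation step we experimented with can be sketched with the transformers translation pipeline. The checkpoint names below are assumptions; substitute whichever Romanian-English translation models are available to you.

from transformers import pipeline

# Checkpoint names are assumptions; swap in any ro->en and en->ro models you prefer.
ro_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-ro-en")
en_to_ro = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ro")

def backtranslate(text: str) -> str:
    """Romanian -> English -> Romanian; yields a paraphrase of the input,
    but profanity is often lost in the round trip."""
    english = ro_to_en(text)[0]["translation_text"]
    return en_to_ro(english)[0]["translation_text"]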

Future Approaches

  • We believe that further improvements could be made by sanitizing the dataset more thoroughly.

  • We believe that using a model that has been specifically trained on similar types of text could be beneficial.

  • Additionally, one could explore data augmentation to improve the results. An approach similar to Easy Data Augmentation (EDA) could be implemented and evaluated; a rough sketch follows this list.
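
As an illustration of that direction (a sketch only; it is not part of this repository), two of the four EDA operations, random swap and random deletion, require no Romanian synonym resources and could be prototyped as follows.

import random

def random_swap(words, n=1):
    """Swap two random word positions n times."""
    words = list(words)
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Drop each word with probability p, always keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(list(words))]

def augment(text, n_aug=2):
    """Return n_aug lightly perturbed variants of a sentence."""
    words = text.split()
    return [" ".join(random_deletion(random_swap(words))) for _ in range(n_aug)]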

Project Structure

.
├── .devcontainer                           <- Dev Container Setup
├── .github                                 <- Github Workflows
├── .project-root                           <- Used to identify the project root
├── .vscode                                 <- Visual Studio Code settings
├── configs                                 <- Hydra configs
│   ├── data                                    <- Dataset configs
│   ├── hparams_search                          <- Hyperparameter Search configs
│   ├── hydra                                   <- Hydra runtime configs
│   ├── model                                   <- Model configs
│   ├── paths                                   <- Commonly used project paths
│   ├── predict.yaml                            <- predict.py configs
│   ├── test.yaml                               <- test.py configs
│   ├── train.yaml                              <- train.py configs
│   └── trainer                                 <- Transformers trainer configs
├── data                                    <- Datasets
│   └── ro                                      <- Romanian language
│       ├── predict_example.txt                     <- Small sample to be used with predict.py
│       ├── test_data.csv                           <- CoRoSeOf Test Data
│       └── train_data.csv                          <- CoRoSeOf Training Data
├── experiments                             <- Experiments directory
│   └── train
│       ├── multiruns                           <- Hyperparameter search
│       └── runs                                <- Single experiments
├── notebooks
│   └── hackathon_notebook.ipynb            <- The original hackathon notebook
├── predictions                             <- Results from predict.py
├── src                                     <- Source code
│   ├── __init__.py
│   ├── data                                    <- Dataset related
│   │   └── coroseof_datamodule.py
│   ├── predict.py
│   ├── test.py
│   ├── train.log
│   ├── train.py
│   ├── trainers
│   │   └── imbalanced_dataset_trainer.py       <- Custom trainer with class weights
│   └── utils
│       └── config.py                           <- Custom Omegaconf resolvers
├── submissions                             <- Results from test.py
└── tests                                   <- Tests directory

Getting Started

Thanks to our devcontainer setup, you can run our model right here on GitHub: just create a new Codespace and follow the steps below. Keep in mind that training requires a GPU, which is not available in GitHub Codespaces, so you may want to use Visual Studio Code locally for that.

Warning: if running in GitHub Codespaces (or without an available GPU), you need to comment out the '"runArgs": ["--gpus", "all"]' line in the .devcontainer/devcontainer.json file. Otherwise Docker will raise an error and your container will not start.

Predict with a pretrained model

Keep in mind that a prediction will be made for each line of text.

# predict using text from a file
cat data/ro/predict_example.txt | python src/predict.py --models cosminc98/sexism-identification-coroseof

# see the results
cat predictions/prediction.tsv

# predict from stdin; after entering the command write however many sentences
# you want and end with the [EOF] marker:
#   Femeile au locul în bucătărie
#   [EOF]
python src/predict.py --models cosminc98/sexism-identification-coroseof

Training a new model

# run a single training run with CoRoSeOf dataset and default hyperparameters
# the model will be available in experiments/train/runs/
python src/train.py

# run hyperparameter search with the Optuna plugin from Hydra
# the model will be available in experiments/train/multiruns/
python src/train.py -m hparams_search=optuna

The resulting model is available on the Hugging Face Hub as cosminc98/sexism-identification-coroseof.

Creating a new Kaggle Submission

# predict on the test set
python src/test.py --models cosminc98/sexism-identification-coroseof

Now all you need to do is upload "submissions/submission.csv" to Kaggle.

Uploading to Hugging Face

A pretrained model for the Romanian language is already available on the Hugging Face Hub. To push your own model, configure the trainer as shown below.

pip install huggingface_hub

# log in with your Hugging Face write token
huggingface-cli login

echo "
push_to_hub: True
hub_model_id: \"<model-name>\"
" >> configs/trainer/default.yaml

Contact

If you have any questions about our approach or our code, please feel free to contact us at:

Contributors ✨

  • Ștefan-Cosmin Ciocan: 💻 📖 🔬
  • Iulian Taiatu: 💻 📖 🔬
  • AndreiDumitrescu99: 💻 🔬

This project follows the all-contributors specification. Contributions of any kind welcome!
