CultureBank

Quick Links: [Paper] [Project Page] [dataset-tiktok] [dataset-reddit] [Models]


We provide:

  • a data processing pipeline to extract cultural knowledge from online communities
  • the resulting CultureBank datasets (built from TikTok and Reddit)
  • evaluation and fine-tuning scripts for culturally aware language models, along with released models

Setup

  1. Set up the environment

conda env create -f environment.yml

  2. Set up the API keys
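The GPT-based components (e.g., topic normalization with gpt-3.5-turbo-1106) need an OpenAI API key. Here is a minimal sketch of one common way to supply it; the exact variable or config entry this repo reads is an assumption, so check the provided config files:

```python
import os

# Assumption: the GPT-based components read the key from the environment;
# adjust to however this repo's configs expect the key to be provided.
os.environ["OPENAI_API_KEY"] = "sk-..."  # your OpenAI API key
```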

Data process pipeline


The pipeline contains 9 components (see data_process_pipeline/pipeline/main_pipeline.py).

  1. data_process_pipeline/pipeline/component_0_culture_relevance_classifier.py: classify if a comment is related to culture

    • Uses fine-tuned model: SALT-NLP/CultureBank-Relevance-Classifier (based on distilbert-base-uncased); a usage sketch follows this list
  2. data_process_pipeline/pipeline/component_1_knowledge_extractor.py: extract cultural information from the comment

    • Uses base model: mistralai/Mistral-7B-Instruct-v0.2 (vanilla) or fine-tuned model: SALT-NLP/CultureBank-Extractor (with adapters)
  3. data_process_pipeline/pipeline/component_2_negation_converter.py: convert positive sentences to negative forms

    • Uses spacy with en_core_web_sm model
  4. data_process_pipeline/pipeline/component_3_clustering.py: perform clustering

    • Uses sentence-transformers with the all-MiniLM-L6-v2 model; see the clustering sketch after this list
  5. data_process_pipeline/pipeline/component_4_cluster_summarizer.py: summarize the clusters

    • Uses base model: mistralai/Mistral-7B-Instruct-v0.2 (vanilla) or fine-tuned model: SALT-NLP/CultureBank-Summarizer (with adapters)
  6. data_process_pipeline/pipeline/component_5_topic_normalization.py: normalize the cultural groups and topics

    • Uses sentence-transformers with all-MiniLM-L6-v2 for clustering
    • Uses gpt-3.5-turbo-1106 for topic normalization
  7. data_process_pipeline/pipeline/component_6_agreement_calculator.py: calculate the agreement values

    • No models used, pure calculation
  8. data_process_pipeline/pipeline/component_7_content_moderation.py: identify potentially controversial and PII data for annotation

    • Uses fine-tuned model: SALT-NLP/CultureBank-Controversial-Classifier
    • Uses presidio_analyzer for PII detection; see the PII sketch after this list
    • Uses keyword filtering
  9. data_process_pipeline/pipeline/component_8_final_formatter.py: format the final data

    • No models used, pure formatting
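As rough illustrations of a few of the components above, here are minimal sketches. They show typical usage of the named libraries and models, not this repo's actual code or settings.

Component 0 is a standard text classifier, so it can be exercised with the transformers pipeline API (the example comment is made up):

```python
from transformers import pipeline

# Load the released relevance classifier named above.
clf = pipeline("text-classification", model="SALT-NLP/CultureBank-Relevance-Classifier")
print(clf("In my culture we take our shoes off before entering a home."))
```

Component 3's embedding-based clustering typically looks like the following; the clustering algorithm and threshold here are assumptions, not the pipeline's settings:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

comments = [
    "In Japan, people bow when greeting.",
    "Bowing is a common greeting in Japan.",
    "Tipping is uncommon in Japan.",
]

# Embed with the encoder the pipeline names (all-MiniLM-L6-v2).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(comments, normalize_embeddings=True)

# Assumed setup: agglomerative clustering over cosine distance
# (scikit-learn >= 1.2; older versions take affinity= instead of metric=).
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5, metric="cosine", linkage="average"
)
print(clustering.fit_predict(embeddings))  # e.g., [0 0 1]
```

And component 7's PII detection with presidio_analyzer is typically as simple as:

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
results = analyzer.analyze(
    text="Contact me at john.doe@example.com or 212-555-0199.", language="en"
)
for r in results:
    print(r.entity_type, r.start, r.end, r.score)
```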

Note on Model Usage:

  • The pipeline can run in two modes:
    1. Vanilla mode: Uses base models without fine-tuning
    2. Fine-tuned mode: Uses specialized fine-tuned models with adapters
  • Configuration files:
    • config_dummy_data_vanilla_mistral.yaml: Uses vanilla models (lighter on GPU memory)
    • config_dummy_data_finetuned_mixtral.yaml: Uses fine-tuned models with adapters (requires ~27GB GPU memory)
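In fine-tuned mode, the released adapters are loaded on top of a base model. Here is a minimal sketch with peft; which base checkpoint each released adapter attaches to is an assumption, so verify against the fine-tuned config:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed base for the extractor adapter
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Attach the released extractor adapter on top of the base weights.
model = PeftModel.from_pretrained(base, "SALT-NLP/CultureBank-Extractor")
```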

How to run the pipeline

  1. Prepare a data file, e.g., the provided dummy data file

  2. Set up the paths in the config, e.g., the provided config_dummy_data_vanilla_mistral.yaml

  3. Run the following command to execute the components with indices 0,1,3,4,5,6,7,8 in order, using the config

python data_process_pipeline/main.py -i 0,1,3,4,5,6,7,8 -c ./data_process_pipeline/configs/config_dummy_data_vanilla_mistral.yaml

  4. The final output will be at data_process_pipeline/results/8_final_formatter/output.csv, as specified in config_dummy_data_vanilla_mistral.yaml.

How to run individual components

We can also run individual components on their own, as long as the expected input file exists.

# run component 0, the relevance classifier
python data_process_pipeline/main.py -i 0 -c ./data_process_pipeline/configs/config_dummy_data_vanilla_mistral.yaml

Some notes

  • The pipeline also generates a file of potentially controversial data for human annotation (output_file_for_manual_annotation); annotate it and place the result at the path given by controversial_annotation_file
  • We provide two sample configs: config_dummy_data_vanilla_mistral.yaml (vanilla models) and config_dummy_data_finetuned_mixtral.yaml (fine-tuned models with adapters)

Evaluation scripts


  1. evaluation/convert_to_desc.py: concatenates the fields in CultureBank data and translates them into free-text paragraphs of cultural descriptors (a sketch follows this list).
  2. evaluation/generate_questions.py: generates questions for grounded evaluation based on the cultural descriptors. The released adapter is here.
  3. evaluation/generate_questions_aug.py: generates questions for grounded evaluation based on the cultural descriptors with a self-refinement method (very similar to evaluation/generate_questions.py; the only difference is that GPT-4 scores the generated question until max trials are reached or the result is good). The released adapter is here.
  4. evaluation/grounded_eval.py: performs grounded evaluation of language models on the generated cultural questions. If -aug (augmentation) is turned on, the golden cultural descriptor is included in the input for evaluation, and the golden-knowledge-augmented responses from GPT models can be used for further SFT training.
  5. evaluation/knowledge_entailment.py: computes the knowledge entailment scores of models' generated responses in the grounded evaluations.
  6. evaluation/direct_eval.py: performs direct evaluation of language models on CultureBank data.
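For intuition, here is a hypothetical sketch of the field-to-paragraph conversion in evaluation/convert_to_desc.py; the field names and template are assumptions, not the script's actual schema:

```python
# Hypothetical field names and template; see the released datasets for the real schema.
def to_descriptor(row: dict) -> str:
    """Concatenate structured CultureBank fields into one free-text cultural descriptor."""
    return (
        f"For {row['cultural_group']}, in the context of {row['context']}, "
        f"{row['behavior']}."
    )

print(to_descriptor({
    "cultural_group": "people in Japan",
    "context": "greeting someone",
    "behavior": "it is common to bow rather than shake hands",
}))
```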

Evaluation on two downstream tasks

  1. evaluation/downstream_tasks/cultural_nli.py: evaluates on cultural NLI.
  2. evaluation/downstream_tasks/world_value_survey.py: evaluates on the World Values Survey based on methods in this paper.

Fine-tuning scripts

  1. finetuning/sft_mixtral.py: a sample script to supervised fine-tune a Mixtral model on various tasks (extractor, summarizer, culturally aware model, etc.) with proper data preparation; see the trl sketch below.
  2. finetuning/dpo_mixtral.py: a sample script to train a Mixtral model with DPO on various tasks (culturally aware model, etc.) with proper data preparation.
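Here is a minimal sketch of the kind of supervised fine-tuning setup such a script builds on, using trl; the model choice, data file, and hyperparameters are placeholders, and the SFTTrainer API varies across trl versions:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder data file; the real scripts prepare task-specific training data,
# typically with a "text" field per example.
dataset = load_dataset("json", data_files="sft_data.json", split="train")

trainer = SFTTrainer(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder base model
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft_out", per_device_train_batch_size=1),
)
trainer.train()
```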

Released models

  1. Knowledge extractor
  2. Cluster summarizer
  3. Evaluation question generator
  4. Llama2-7B SFT fine-tuned on CultureBank-TikTok
  5. Mixtral-8X7B SFT fine-tuned on CultureBank-TikTok
  6. Mixtral-8X7B DPO fine-tuned on CultureBank-TikTok

Acknowledgement

The codebase is adapted from Candle (paper), which is released under this license. Thanks for the amazing work!

If you find our work helpful, please consider citing our paper:

@misc{shi2024culturebank,
    title={CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies},
    author={Weiyan Shi and Ryan Li and Yutong Zhang and Caleb Ziems and Chunhua Yu and Raya Horesh and Rogério Abreu de Paula and Diyi Yang},
    year={2024},
    eprint={2404.15238},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

We welcome all kinds of contributions. If you have any questions, feel free to leave issues or email us.
