Quick Links: [Paper] [Project Page] [dataset-tiktok] [dataset-reddit] [Models]
We provide:
- an easy-to-use and generalizable pipeline to construct a cultural knowledge bank from online communities
- two cultural knowledge datasets, [CultureBank-TikTok] and [CultureBank-Reddit]
- grounded cultural evaluation and fine-tuning scripts
- Set up the environment:

  ```bash
  conda env create -f environment.yml
  ```

- Set up the API keys:
  - OpenAI: `os.getenv("OPENAI_API_KEY")`
  - Perspective API: `os.getenv("PERSPECTIVE_API")`
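As a minimal sketch of how the pipeline might read these keys (the helper name `load_api_key` is ours, not the repo's), failing fast when a variable is unset:

```python
import os

def load_api_key(name: str) -> str:
    """Read an API key (e.g. OPENAI_API_KEY or PERSPECTIVE_API) from the
    environment, raising a clear error if it was never exported."""
    key = os.getenv(name)
    if not key:
        raise RuntimeError(f"missing environment variable: {name}")
    return key
```

Export both variables in your shell before launching the pipeline.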
The pipeline contains 9 components (see `data_process_pipeline/pipeline/main_pipeline.py`).
- `data_process_pipeline/pipeline/component_0_culture_relevance_classifier.py`: classify if a comment is related to culture
  - Uses fine-tuned model: `SALT-NLP/CultureBank-Relevance-Classifier` (based on `distilbert-base-uncased`)
- `data_process_pipeline/pipeline/component_1_knowledge_extractor.py`: extract cultural information from the comment
  - Uses base model: `mistralai/Mistral-7B-Instruct-v0.2` (vanilla) or fine-tuned model: `SALT-NLP/CultureBank-Extractor` (with adapters)
- `data_process_pipeline/pipeline/component_2_negation_converter.py`: convert positive sentences to negative forms
  - Uses `spacy` with the `en_core_web_sm` model
- `data_process_pipeline/pipeline/component_3_clustering.py`: perform clustering
  - Uses `sentence-transformers` with the `all-MiniLM-L6-v2` model
- `data_process_pipeline/pipeline/component_4_cluster_summarizer.py`: summarize the clusters
  - Uses base model: `mistralai/Mistral-7B-Instruct-v0.2` (vanilla) or fine-tuned model: `SALT-NLP/CultureBank-Summarizer` (with adapters)
- `data_process_pipeline/pipeline/component_5_topic_normalization.py`: normalize the cultural groups and topics
  - Uses `sentence-transformers` with `all-MiniLM-L6-v2` for clustering
  - Uses `gpt-3.5-turbo-1106` for topic normalization
- `data_process_pipeline/pipeline/component_6_agreement_calculator.py`: calculate the agreement values
  - No models used, pure calculation
- `data_process_pipeline/pipeline/component_7_content_moderation.py`: identify potentially controversial and PII data for annotation
  - Uses fine-tuned model: `SALT-NLP/CultureBank-Controversial-Classifier`
  - Uses `presidio_analyzer` for PII detection
  - Uses keyword filtering
- `data_process_pipeline/pipeline/component_8_final_formatter.py`: format the final data
  - No models used, pure formatting
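Component 6 is pure calculation. As an illustrative sketch only (the exact formula lives in `component_6_agreement_calculator.py`), a cluster-level agreement value could be the mean stance of the comments in the cluster:

```python
def agreement_score(stances):
    """Mean stance over a cluster's comments.

    Hypothetical encoding (not necessarily the repo's): +1 agree,
    0 neutral, -1 disagree with the cluster's summarized descriptor.
    Returns 0.0 for an empty cluster.
    """
    if not stances:
        return 0.0
    return sum(stances) / len(stances)
```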
Note on Model Usage:
- The pipeline can run in two modes:
- Vanilla mode: Uses base models without fine-tuning
- Fine-tuned mode: Uses specialized fine-tuned models with adapters
- Configuration files:
  - `config_dummy_data_vanilla_mistral.yaml`: Uses vanilla models (lighter on GPU memory)
  - `config_dummy_data_finetuned_mixtral.yaml`: Uses fine-tuned models with adapters (requires ~27GB of GPU memory)
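The two modes boil down to whether an adapter is loaded on top of the base model. A hedged sketch of that decision (the keys `mode`, `base_model`, and `adapter` are placeholders; see the shipped YAML files for the real schema):

```python
def resolve_models(cfg: dict):
    """Pick (base_model, adapter) from a config-like dict.

    Vanilla mode loads only the base model; fine-tuned mode additionally
    attaches an adapter checkpoint. Key names here are assumptions.
    """
    base = cfg.get("base_model", "mistralai/Mistral-7B-Instruct-v0.2")
    adapter = cfg.get("adapter") if cfg.get("mode") == "finetuned" else None
    return base, adapter
```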
- Prepare a data file, e.g., the provided dummy data file
- Set up the paths in the config, e.g., the provided `config_dummy_data_vanilla_mistral.yaml`
- Run this command to run the components with indices 0,1,3,4,5,6,7,8 in order with the config:

  ```bash
  python data_process_pipeline/main.py -i 0,1,3,4,5,6,7,8 -c ./data_process_pipeline/configs/config_dummy_data_vanilla_mistral.yaml
  ```

- The final output will be at `data_process_pipeline/results/8_final_formatter/output.csv`, as specified in `config_dummy_data_vanilla_mistral.yaml`.
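For clarity on how the `-i` flag is read: its value is a comma-separated list of component indices, run in the order given. A minimal parser equivalent to that interpretation (illustrative; the real argument handling is in `data_process_pipeline/main.py`):

```python
def parse_component_indices(arg: str):
    """Turn a '-i' value like '0,1,3,4,5,6,7,8' into an ordered index list."""
    return [int(tok) for tok in arg.split(",") if tok.strip()]
```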
How to run individual components
You can also run individual components, but you need to make sure the component's input file exists.

```bash
# load the 0th component, relevance_classifier
python data_process_pipeline/main.py -i 0 -c ./data_process_pipeline/configs/config_dummy_data_vanilla_mistral.yaml
```
Some notes
- The pipeline will also generate a file with controversial data for human annotation, `output_file_for_manual_annotation`; you need to annotate it and put it in `controversial_annotation_file`
- We prepare two sample configs:
  - `config_dummy_data_vanilla_mistral.yaml`: uses vanilla Mistral models as the extractor and summarizer, lightweight
  - `config_dummy_data_finetuned_mixtral.yaml`: uses vanilla Mixtral models plus our fine-tuned adapters on Reddit as the extractor and summarizer; requires more GPU memory (at least ~27GB)
- `evaluation/convert_to_desc.py`: concatenates the fields in CultureBank data and translates them into free-text paragraphs of cultural descriptors.
- `evaluation/generate_questions.py`: generates questions for grounded evaluation based on the cultural descriptors. The released adapter is here.
- `evaluation/generate_questions_aug.py`: generates questions for grounded evaluation based on the cultural descriptors with a self-refinement method (very similar to `evaluation/generate_questions.py`; the only difference is that GPT-4 scores the generated question until max trials or a good result). The released adapter is here.
- `evaluation/grounded_eval.py`: performs grounded evaluation of language models on the generated cultural questions. If `-aug` (augmentation) is turned on, the golden cultural descriptor is included in the input for the evaluation; the golden-knowledge-augmented responses from GPTs can then be used for further SFT training steps.
- `evaluation/knowledge_entailment.py`: computes the knowledge entailment scores of models' generated responses in the grounded evaluations.
- `evaluation/direct_eval.py`: performs direct evaluation of language models on CultureBank data.
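The descriptor conversion in `evaluation/convert_to_desc.py` amounts to joining a row's non-empty fields into one free-text paragraph. A hedged sketch of that idea (the field names and the dummy row below are placeholders; the released datasets define the real schema):

```python
def to_descriptor(row: dict, fields=("cultural group", "context", "goal")) -> str:
    """Join a record's non-empty fields, in order, into one free-text
    cultural descriptor. Field names are assumptions for illustration."""
    return " ".join(str(row[f]).strip() for f in fields if row.get(f))
```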
Evaluation on two downstream tasks
- `evaluation/downstream_tasks/cultural_nli.py`: evaluate on cultural NLI.
- `evaluation/downstream_tasks/world_value_survey.py`: evaluate on the World Values Survey based on methods in this paper.
- `finetuning/sft_mixtral.py`: a sample script to supervised-finetune a Mixtral model on various tasks (extractor, summarizer, culturally-aware model, etc.) with proper data preparation.
- `finetuning/dpo_mixtral.py`: a sample script to train a Mixtral model with DPO on various tasks (culturally-aware model, etc.) with proper data preparation.
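For DPO, "proper data preparation" means shaping each example as a preference record. A minimal sketch of that shape, using the `prompt`/`chosen`/`rejected` keys that DPO training setups such as trl's `DPOTrainer` commonly expect (the helper itself is ours, not the repo's):

```python
def build_dpo_example(prompt: str, chosen: str, rejected: str) -> dict:
    """Shape one preference record for DPO training.

    'chosen' is the preferred response to the prompt, 'rejected' the
    dispreferred one; the pair must actually differ to carry a signal.
    """
    assert chosen != rejected, "preference pair must differ"
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```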
- Knowledge extractor
- Cluster summarizer
- Evaluation question generator
- Llama2-7B SFT fine-tuned on CultureBank-TikTok
- Mixtral-8X7B SFT fine-tuned on CultureBank-TikTok
- Mixtral-8X7B DPO fine-tuned on CultureBank-TikTok
The codebase is adapted from Candle (paper), which is under this license. Thanks for the amazing work!
If you find our work helpful, please consider citing our paper:
@misc{shi2024culturebank,
title={CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies},
author={Weiyan Shi and Ryan Li and Yutong Zhang and Caleb Ziems and Chunhua Yu and Raya Horesh and Rogério Abreu de Paula and Diyi Yang},
year={2024},
eprint={2404.15238},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
We welcome all kinds of contributions. If you have any questions, feel free to leave issues or email us.