Quick Links: [Paper] [Project Page] [dataset-tiktok] [dataset-reddit] [Models]
We provide:
- an easy-to-use and generalizable pipeline to construct a cultural knowledge bank from online communities
- two cultural knowledge datasets, [CultureBank-TikTok] and [CultureBank-Reddit]
- grounded cultural evaluation and fine-tuning scripts
- Set up the environment:

  ```bash
  conda env create -f environment.yml
  ```

- Set up the API keys:
  - OpenAI: `os.getenv("OPENAI_API_KEY")`
  - Perspective API: `os.getenv("PERSPECTIVE_API")`
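As a minimal sketch of how the pipeline might read these keys (the helper name `load_api_key` is ours, not the repo's), failing fast when a variable is unset:

```python
import os

def load_api_key(name: str) -> str:
    """Read an API key (e.g. OPENAI_API_KEY or PERSPECTIVE_API) from the
    environment, raising a clear error if it was never exported."""
    key = os.getenv(name)
    if not key:
        raise RuntimeError(f"missing environment variable: {name}")
    return key
```

Export both variables in your shell before launching the pipeline.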
The pipeline contains 9 components (see `data_process_pipeline/pipeline/main_pipeline.py`).
- `data_process_pipeline/pipeline/component_0_culture_relevance_classifier.py`: classify if a comment is related to culture
  - Uses fine-tuned model: `SALT-NLP/CultureBank-Relevance-Classifier` (based on `distilbert-base-uncased`)
- `data_process_pipeline/pipeline/component_1_knowledge_extractor.py`: extract cultural information from the comment
  - Uses base model: `mistralai/Mistral-7B-Instruct-v0.2` (vanilla) or fine-tuned model: `SALT-NLP/CultureBank-Extractor` (with adapters)
- `data_process_pipeline/pipeline/component_2_negation_converter.py`: convert positive sentences to negative forms
  - Uses `spacy` with the `en_core_web_sm` model
- `data_process_pipeline/pipeline/component_3_clustering.py`: perform clustering
  - Uses `sentence-transformers` with the `all-MiniLM-L6-v2` model
- `data_process_pipeline/pipeline/component_4_cluster_summarizer.py`: summarize the clusters
  - Uses base model: `mistralai/Mistral-7B-Instruct-v0.2` (vanilla) or fine-tuned model: `SALT-NLP/CultureBank-Summarizer` (with adapters)
- `data_process_pipeline/pipeline/component_5_topic_normalization.py`: normalize the cultural groups and topics
  - Uses `sentence-transformers` with `all-MiniLM-L6-v2` for clustering
  - Uses `gpt-3.5-turbo-1106` for topic normalization
- `data_process_pipeline/pipeline/component_6_agreement_calculator.py`: calculate the agreement values
  - No models used, pure calculation
- `data_process_pipeline/pipeline/component_7_content_moderation.py`: identify potentially controversial and PII data for annotation
  - Uses fine-tuned model: `SALT-NLP/CultureBank-Controversial-Classifier`
  - Uses `presidio_analyzer` for PII detection
  - Uses keyword filtering
- `data_process_pipeline/pipeline/component_8_final_formatter.py`: format the final data
  - No models used, pure formatting
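Component 6 is pure calculation. As an illustrative sketch only (the exact formula lives in `component_6_agreement_calculator.py`), a cluster-level agreement value could be the mean stance of the comments in the cluster:

```python
def agreement_score(stances):
    """Mean stance over a cluster's comments.

    Hypothetical encoding (not necessarily the repo's): +1 agree,
    0 neutral, -1 disagree with the cluster's summarized descriptor.
    Returns 0.0 for an empty cluster.
    """
    if not stances:
        return 0.0
    return sum(stances) / len(stances)
```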
Note on Model Usage:
- The pipeline can run in two modes:
- Vanilla mode: Uses base models without fine-tuning
- Fine-tuned mode: Uses specialized fine-tuned models with adapters
- Configuration files:
  - `config_dummy_data_vanilla_mistral.yaml`: Uses vanilla models (lighter on GPU memory)
  - `config_dummy_data_finetuned_mixtral.yaml`: Uses fine-tuned models with adapters (requires ~27GB of GPU memory)
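The two modes boil down to whether an adapter is loaded on top of the base model. A hedged sketch of that decision (the keys `mode`, `base_model`, and `adapter` are placeholders; see the shipped YAML files for the real schema):

```python
def resolve_models(cfg: dict):
    """Pick (base_model, adapter) from a config-like dict.

    Vanilla mode loads only the base model; fine-tuned mode additionally
    attaches an adapter checkpoint. Key names here are assumptions.
    """
    base = cfg.get("base_model", "mistralai/Mistral-7B-Instruct-v0.2")
    adapter = cfg.get("adapter") if cfg.get("mode") == "finetuned" else None
    return base, adapter
```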
- Prepare a data file, e.g., the provided dummy data file
- Set up the paths in the config, e.g., the provided `config_dummy_data_vanilla_mistral.yaml`
- Run this command to run the components with indices 0,1,3,4,5,6,7,8 in order with the config:

  ```bash
  python data_process_pipeline/main.py -i 0,1,3,4,5,6,7,8 -c ./data_process_pipeline/configs/config_dummy_data_vanilla_mistral.yaml
  ```

- The final output will be at `data_process_pipeline/results/8_final_formatter/output.csv`, as specified in `config_dummy_data_vanilla_mistral.yaml`.
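For clarity on how the `-i` flag is read: its value is a comma-separated list of component indices, run in the order given. A minimal parser equivalent to that interpretation (illustrative; the real argument handling is in `data_process_pipeline/main.py`):

```python
def parse_component_indices(arg: str):
    """Turn a '-i' value like '0,1,3,4,5,6,7,8' into an ordered index list."""
    return [int(tok) for tok in arg.split(",") if tok.strip()]
```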
How to run individual components
You can also run individual components, but you need to make sure the component's input file exists.

```bash
# load the 0th component, relevance_classifier
python data_process_pipeline/main.py -i 0 -c ./data_process_pipeline/configs/config_dummy_data_vanilla_mistral.yaml
```
Some notes
- The pipeline will also generate a file with controversial data for human annotation, `output_file_for_manual_annotation`; you need to annotate it and put it in `controversial_annotation_file`
- We prepare two sample configs:
  - `config_dummy_data_vanilla_mistral.yaml`: uses vanilla Mistral models as the extractor and summarizer, lightweight
  - `config_dummy_data_finetuned_mixtral.yaml`: uses vanilla Mixtral models plus our fine-tuned adapters on Reddit as the extractor and summarizer; requires more GPU memory (at least ~27GB)
- `evaluation/convert_to_desc.py`: concatenates the fields in CultureBank data and translates them into free-text paragraphs of cultural descriptors.
- `evaluation/generate_questions.py`: generates questions for grounded evaluation based on the cultural descriptors. The released adapter is here.
- `evaluation/generate_questions_aug.py`: generates questions for grounded evaluation based on the cultural descriptors with a self-refinement method (very similar to `evaluation/generate_questions.py`; the only difference is that GPT-4 scores the generated question until max trials or a good result). The released adapter is here.
- `evaluation/grounded_eval.py`: performs grounded evaluation of language models on the generated cultural questions. If `-aug` (augmentation) is turned on, the golden cultural descriptor is included in the input for the evaluation; the golden-knowledge-augmented responses from GPTs can then be used for further SFT training steps.
- `evaluation/knowledge_entailment.py`: computes the knowledge entailment scores of models' generated responses in the grounded evaluations.
- `evaluation/direct_eval.py`: performs direct evaluation of language models on CultureBank data.
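The descriptor conversion in `evaluation/convert_to_desc.py` amounts to joining a row's non-empty fields into one free-text paragraph. A hedged sketch of that idea (the field names and the dummy row below are placeholders; the released datasets define the real schema):

```python
def to_descriptor(row: dict, fields=("cultural group", "context", "goal")) -> str:
    """Join a record's non-empty fields, in order, into one free-text
    cultural descriptor. Field names are assumptions for illustration."""
    return " ".join(str(row[f]).strip() for f in fields if row.get(f))
```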
Evaluation on two downstream tasks
- `evaluation/downstream_tasks/cultural_nli.py`: evaluate on cultural NLI.
- `evaluation/downstream_tasks/world_value_survey.py`: evaluate on the World Values Survey based on methods in this paper.
- `finetuning/sft_mixtral.py`: a sample script to supervised-finetune a Mixtral model on various tasks (extractor, summarizer, culturally-aware model, etc.) with proper data preparation.
- `finetuning/dpo_mixtral.py`: a sample script to train a Mixtral model with DPO on various tasks (culturally-aware model, etc.) with proper data preparation.
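For DPO, "proper data preparation" means shaping each example as a preference record. A minimal sketch of that shape, using the `prompt`/`chosen`/`rejected` keys that DPO training setups such as trl's `DPOTrainer` commonly expect (the helper itself is ours, not the repo's):

```python
def build_dpo_example(prompt: str, chosen: str, rejected: str) -> dict:
    """Shape one preference record for DPO training.

    'chosen' is the preferred response to the prompt, 'rejected' the
    dispreferred one; the pair must actually differ to carry a signal.
    """
    assert chosen != rejected, "preference pair must differ"
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```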
- Knowledge extractor
- Cluster summarizer
- Evaluation question generator
- Llama2-7B SFT fine-tuned on CultureBank-TikTok
- Mixtral-8X7B SFT fine-tuned on CultureBank-TikTok
- Mixtral-8X7B DPO fine-tuned on CultureBank-TikTok
The codebase is adapted from Candle (paper), which is under this license. Thanks for the amazing work!
If you find our work helpful, please consider citing our paper:
@misc{shi2024culturebank,
title={CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies},
author={Weiyan Shi and Ryan Li and Yutong Zhang and Caleb Ziems and Chunhua Yu and Raya Horesh and Rogério Abreu de Paula and Diyi Yang},
year={2024},
eprint={2404.15238},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
We welcome all kinds of contributions. If you have any questions, feel free to leave issues or email us.