The first step is to run spaCy on your input corpus of choice. The
script candle/run_spacy.py can be used for this
purpose. For example, to run this script on the dummy files in
the candle/data/input_corpus directory,
run the following command:
cd candle
python run_spacy.py \
-i data/input_corpus/dummy-000.jsonl \
-o data/spacy/dummy-000.spacy
The input file should be a JSONL file, where each line is a JSON object with the following fields:
text: The text of the document (required).
timestamp: The timestamp of the document (optional).
url: The URL of the document (optional).
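As a rough illustration of the input format described above, the following sketch writes a small JSONL file with one JSON object per line. The helper name write_jsonl, the output path, and the sample values (including the timestamp format) are assumptions made for this example and are not part of the repository.

```python
import json
from pathlib import Path


def write_jsonl(docs, path):
    """Write one JSON object per line; each document must contain a 'text' field."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for doc in docs:
            if "text" not in doc:
                raise ValueError("each document needs a 'text' field")
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    # Hypothetical sample documents; 'timestamp' and 'url' are optional.
    sample_docs = [
        {"text": "A first example document.", "url": "https://example.com/1"},
        {"text": "A second example document.", "timestamp": "2023-01-01"},
        {"text": "A third document with only the required field."},
    ]
    write_jsonl(sample_docs, "data/input_corpus/my-corpus-000.jsonl")
```

The resulting file could then be passed to run_spacy.py through the -i option in the same way as the dummy file.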
After running spaCy on all the input files, create a file that lists
the paths to all the spaCy output files (see, for example,
candle/data/spacy/dummy.txt).
Pass this file to the next steps using the spacy_file_list
argument (see below).
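The file list can also be generated with a short script. The sketch below assumes that the list contains one path per line and that the spaCy outputs use the .spacy extension, as in the command above. Compare the result with candle/data/spacy/dummy.txt before relying on it. The function name write_spacy_file_list and the output file name are hypothetical.

```python
from pathlib import Path


def write_spacy_file_list(spacy_dir, list_path, pattern="*.spacy"):
    """Collect spaCy output files and write their paths, one per line (assumed format)."""
    paths = sorted(str(p) for p in Path(spacy_dir).glob(pattern))
    with open(list_path, "w", encoding="utf-8") as f:
        for p in paths:
            f.write(p + "\n")
    return paths


if __name__ == "__main__":
    # Run from the directory where main.py is invoked so the relative paths resolve.
    found = write_spacy_file_list("data/spacy", "data/spacy/my-corpus.txt")
    print(f"Listed {len(found)} spaCy files")
```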
The pipeline has 6 components
(see candle/pipeline/pipeline.py):
candle/pipeline/component_people_group_matcher.py
candle/pipeline/component_generic_sentence_filter.py
candle/pipeline/component_culture_classifier.py
candle/pipeline/component_clustering.py
candle/pipeline/component_rep_generator.py
candle/pipeline/component_ranking.py
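The commands later in this README select components by number with the --components argument. The sketch below assumes that the numbers 1 to 6 follow the order of the files listed above. This mapping is inferred from the README and has not been checked against candle/pipeline/pipeline.py. The names COMPONENTS and describe_components are hypothetical.

```python
# Assumed mapping from --components numbers to module names,
# following the order in which the component files are listed.
COMPONENTS = {
    1: "component_people_group_matcher",
    2: "component_generic_sentence_filter",
    3: "component_culture_classifier",
    4: "component_clustering",
    5: "component_rep_generator",
    6: "component_ranking",
}


def describe_components(numbers):
    """Return the assumed module names for the given component numbers."""
    return [COMPONENTS[n] for n in numbers]


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, describe_components([n])[0])
```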
For example, to run the pipeline for the religions domain (see also
candle/config_religions.yaml), follow these
steps:
Start your local MongoDB instance:
cd /path/to/mongodb/folder
bin/mongod --dbpath /folder/to/save/the/database --bind_ip_all
Run the first 3 components:
cd candle/candle
python main.py \
--config config_religions.yaml \
--people_group religions \
--spacy_file_list data/spacy/dummy.txt \
--components 1 2 3
Run the last 3 components:
for facet in "food" "drink" "ritual"
do
python main.py \
--config config_religions.yaml \
--people_group religions \
--components 4 5 6 \
--cluster_facet $facet \
--cluster_nid data/religions/religion_ids.txt \
--domain religions \
--output_file _outputs/religions_$facet.jsonl
done
If you use this code or our datasets, please cite the following paper:
@inproceedings{candle2023,
title={Extracting Cultural Commonsense Knowledge at Scale},
author={Nguyen, Tuan-Phong and Razniewski, Simon and Varde, Aparna and Weikum, Gerhard},
booktitle={Proceedings of the ACM Web Conference},
year={2023}
}
More information is available at: https://candle.mpi-inf.mpg.de/