You have to install several dependencies before running the script:
$ pip install -r scripts/requirements.txt
Run the following command to process documents:
$ python3 scripts/preprocess_data_for_megatron.py \
--input=PATH_TO_JSON_FILE \
--json-key=JSON_KEY \
--tokenizer-library=LIBRARY \
--tokenizer-type=MODEL_NAME \
--output-prefix=PATH_TO_OUTPUT_DIR \
--max-seq-len=MAX_SEQ_LEN
To preprocess RefinedWeb for Llama 3.1 8B with a context length of 8192, run the following command:
$ python3 scripts/preprocess_data_for_megatron.py \
--input=PATH_TO_JSON_FILE \
--json-key=content \
--tokenizer-library=huggingface \
--tokenizer-type=meta-llama/Llama-3.1-8B \
--output-prefix=PATH_TO_OUTPUT_DIR \
--max-seq-len=8192
Caution
This script assumes that the dataset files are merged into a single .jsonl file.
The script produces a .bin file containing the indexed dataset and an .idx file containing its byte offsets.
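Since the script expects a single .jsonl file, a dataset that is split across several shards has to be merged first. The following is a minimal sketch of one way to do that, not part of this repository: the function name merge_jsonl, the "*.jsonl" pattern, and the directory layout are assumptions made for illustration.

```python
import json
from pathlib import Path


def merge_jsonl(input_dir, output_path, pattern="*.jsonl"):
    """Concatenate all shards matching `pattern` in `input_dir` into one .jsonl file.

    Hypothetical helper: the name, arguments, and shard layout are assumptions.
    Returns the number of records written.
    """
    output_path = Path(output_path).resolve()
    # Collect shards up front in a stable order, skipping the output file itself
    # in case it lives in the same directory as the shards.
    shards = sorted(
        p for p in Path(input_dir).glob(pattern) if p.resolve() != output_path
    )
    count = 0
    with output_path.open("w", encoding="utf-8") as out:
        for shard in shards:
            with shard.open("r", encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue  # ignore blank lines between records
                    json.loads(line)  # raise early on a malformed record
                    out.write(line + "\n")
                    count += 1
    return count
```

For example, calling merge_jsonl("shards", "merged.jsonl") (both paths are placeholders) writes every record to merged.jsonl, which can then be passed to --input. Parsing each line costs some time on large datasets but catches truncated records before tokenization starts.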