README.md

Prerequisites

Install the required dependencies before running the script:

$ pip install -r scripts/requirements.txt

Data preprocessing

Run the following command to process documents:

$ python3 scripts/preprocess_data_for_megatron.py \
    --input=PATH_TO_JSON_FILE \
    --json-key=JSON_KEY \
    --tokenizer-library=LIBRARY \
    --tokenizer-type=MODEL_NAME \
    --output-prefix=PATH_TO_OUTPUT_DIR \
    --max-seq-len=MAX_SEQ_LEN

For example, to preprocess RefinedWeb for Llama 3.1 8B with a context length of 8192, run the following command:

$ python3 scripts/preprocess_data_for_megatron.py \
    --input=PATH_TO_JSON_FILE \
    --json-key=content \
    --tokenizer-library=huggingface \
    --tokenizer-type=meta-llama/Llama-3.1-8B \
    --output-prefix=PATH_TO_OUTPUT_DIR \
    --max-seq-len=8192
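
Before running a command like the one above, it can help to confirm that every record in the input actually carries the field passed as --json-key. The following is a minimal sketch, not part of the repository: the helper name count_missing_key is hypothetical, and it assumes the input is a UTF-8 .jsonl file with one JSON object per line.

```python
import json
from pathlib import Path


def count_missing_key(jsonl_path, json_key):
    """Return (total_records, records_without_a_string_value_for_json_key).

    Hypothetical helper: assumes one JSON object per non-blank line.
    """
    total = 0
    missing = 0
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            total += 1
            record = json.loads(line)
            # Count records that are not objects or lack a string under the key.
            if not isinstance(record, dict) or not isinstance(record.get(json_key), str):
                missing += 1
    return total, missing
```

Calling count_missing_key(PATH_TO_JSON_FILE, "content") and finding a non-zero second value suggests the file needs cleaning, or that a different key should be used.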

Caution

This script assumes that the dataset files are merged into a single .jsonl file.

On success, the script writes a .bin file containing the indexed dataset and an .idx file containing its byte offsets.
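
If the dataset is distributed as several .jsonl shards, they need to be concatenated first to satisfy the single-file assumption noted above. The following is a minimal sketch under that assumption. The helper name merge_jsonl and the paths in the usage note are hypothetical, and shards stored in another format (for example Parquet) would need converting to .jsonl beforehand.

```python
from pathlib import Path


def merge_jsonl(shard_paths, merged_path):
    """Concatenate .jsonl shards into one file; return the number of records written.

    Hypothetical helper: assumes each non-blank line is one JSON record.
    """
    written = 0
    with Path(merged_path).open("w", encoding="utf-8") as out:
        for shard in shard_paths:
            with Path(shard).open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue  # drop blank lines between records
                    # Re-add the newline so a shard without a trailing
                    # newline does not fuse with the next shard's first record.
                    out.write(line + "\n")
                    written += 1
    return written
```

A typical call might be merge_jsonl(sorted(Path("shards").glob("*.jsonl")), "merged.jsonl"), with merged.jsonl then passed as --input. Keep the output file outside the shard directory so it is not picked up as a shard on a later run.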