scripts

Prerequisites

Install the dependencies before running the script:

$ pip install -r scripts/requirements.txt

Data preprocessing

Run the following command to process documents:

$ python3 scripts/preprocess_data_for_megatron.py \
    --input=PATH_TO_JSON_FILE \
    --json-key=JSON_KEY \
    --tokenizer-library=LIBRARY \
    --tokenizer-type=MODEL_NAME \
    --output-prefix=PATH_TO_OUTPUT_DIR \
    --max-seq-len=MAX_SEQ_LEN

For example, to preprocess RefinedWeb for Llama 3.1 8B with a context length of 8192, run:

$ python3 scripts/preprocess_data_for_megatron.py \
    --input=PATH_TO_JSON_FILE \
    --json-key=content \
    --tokenizer-library=huggingface \
    --tokenizer-type=meta-llama/Llama-3.1-8B \
    --output-prefix=PATH_TO_OUTPUT_DIR \
    --max-seq-len=8192
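Before starting a long preprocessing run, it can help to confirm that every record actually carries the field passed as --json-key. The sketch below is not part of this repository: the function name count_missing_key is hypothetical, and it assumes the input holds one JSON object per line.

```python
import json


def count_missing_key(jsonl_path, json_key="content"):
    """Return (total_records, records_without_key) for a JSON Lines file.

    Hypothetical helper, not part of scripts/; assumes one JSON object per line.
    """
    total = 0
    missing = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            total += 1
            record = json.loads(line)
            # A record counts as missing if it is not an object or lacks the key.
            if not isinstance(record, dict) or json_key not in record:
                missing += 1
    return total, missing
```

Calling count_missing_key("PATH_TO_JSON_FILE", "content") returns the number of records and how many of them lack the key; a nonzero second value suggests the --json-key setting or the data needs a closer look.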

Caution

This script assumes that the dataset files are merged into a single .jsonl file.

The script produces a .bin file containing the indexed dataset and an .idx file containing its byte offsets.
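Because the script expects a single .jsonl file, a dataset split across several shards has to be merged first. A minimal sketch of one way to do that follows; merge_jsonl and its parameters are hypothetical names, not part of this repository, and it assumes every shard is UTF-8 JSON Lines sitting in one directory.

```python
from pathlib import Path


def merge_jsonl(input_dir, output_path, pattern="*.jsonl"):
    """Concatenate all files matching `pattern` in `input_dir` into one file.

    Hypothetical helper, not part of scripts/. Returns the number of
    records written. Blank lines are dropped.
    """
    output_path = Path(output_path)
    # Collect shards in sorted order, excluding the output file itself
    # in case it lives in the same directory.
    sources = sorted(
        p for p in Path(input_dir).glob(pattern)
        if p.resolve() != output_path.resolve()
    )
    n_lines = 0
    with output_path.open("w", encoding="utf-8") as out:
        for src in sources:
            with src.open("r", encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    # Re-add the newline so a shard without a trailing
                    # newline does not fuse with the next shard.
                    out.write(line + "\n")
                    n_lines += 1
    return n_lines
```

The merged file can then be passed to --input. Shards are streamed line by line, so memory use stays small even for large datasets.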
