mlcommons/r2-infra
#48
Description
I downloaded the raw (non-preprocessed) C4 train and eval datasets (~841GB) from https://training.mlcommons-storage.org. The train split is supposed to consist of 1024 parts, but only 1023 are present, both in the archive and in the checksum file c4-train-and-eval-datasets.md5. The following loop prints every part index that does not appear in the checksum file:
for i in $(seq -f "%05g" 0 1023); do \
if [[ -z $(grep "train.$i" c4-train-and-eval-datasets.md5) ]]; then \
echo $i;\
fi;\
done
The expected but missing part is en_json/3.0.1/c4-train.00018-of-01024.json
anton.u5bz@login44:/projects/u5bz/llama3.1_405b/c4/original> for i in $(seq -f "%05g" 0 1023); do \
> if [[ -z $(grep "train.$i" c4-train-and-eval-datasets.md5) ]]; then \
> echo $i;\
> fi;\
> done
00018
anton.u5bz@login44:/projects/u5bz/llama3.1_405b/c4/original> sort -k2 c4-train-and-eval-datasets.md5 \
| head -n 22 | tail -n 5
9659014d466f74d905257908cb8560d3 en_json/3.0.1/c4-train.00016-of-01024.json
dd0f638d8eb910091729ebb2b515794f en_json/3.0.1/c4-train.00017-of-01024.json
c717b028a4931fa938954e21c68c3578 en_json/3.0.1/c4-train.00019-of-01024.json
52e03521e6a0fac74c29b8b204205e5b en_json/3.0.1/c4-train.00020-of-01024.json
3dbc03d06e4f3fb9c5003fa01a6bda50 en_json/3.0.1/c4-train.00021-of-01024.json
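For anyone who wants to repeat the check, here is a single-pass variant of the loop as a rough sketch. It reads the checksum file once with awk instead of running grep 1024 times. The function name `find_missing_parts` is made up for illustration, and the pattern assumes the train parts are named `c4-train.NNNNN-of-...` as in the listing above.

```shell
# Hypothetical helper: print the zero-padded indices of train parts that
# are missing from an md5 manifest.
# Usage: find_missing_parts MANIFEST [TOTAL]   (TOTAL defaults to 1024)
find_missing_parts() {
    manifest=$1
    total=${2:-1024}
    awk -v total="$total" '
        # Record the 5-digit index of every c4-train part in the manifest.
        # "c4-train." is 9 characters, so the index starts at RSTART + 9.
        match($0, /c4-train\.[0-9][0-9][0-9][0-9][0-9]-of-/) {
            seen[substr($0, RSTART + 9, 5)] = 1
        }
        # After reading the whole file, report each expected index not seen.
        END {
            for (i = 0; i < total; i++) {
                idx = sprintf("%05d", i)
                if (!(idx in seen)) print idx
            }
        }
    ' "$manifest"
}
```

Running `find_missing_parts c4-train-and-eval-datasets.md5 1024` in the download directory should list the same indices as the loop. Matching on the full `c4-train.NNNNN-of-` prefix also avoids accidental hits from the unescaped dot in the `train.$i` pattern.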