C4 train download misses one part #837

@psyhtest

When I downloaded the raw (non-preprocessed) C4 train and eval datasets (~841GB) from https://training.mlcommons-storage.org, I expected to find 1024 train parts. However, only 1023 are present, both in the archive and in the checksum file c4-train-and-eval-datasets.md5. The following loop prints the index of every train part that has no entry in the checksum file:

for i in $(seq -f "%05g" 0 1023); do \
  if [[ -z $(grep "train.$i" c4-train-and-eval-datasets.md5) ]]; then \
    echo $i;\
  fi;\
done
The loop prints a single index, 00018, so the expected but missing part is en_json/3.0.1/c4-train.00018-of-01024.json. Full terminal output:
anton.u5bz@login44:/projects/u5bz/llama3.1_405b/c4/original> for i in $(seq -f "%05g" 0 1023); do \
>   if [[ -z $(grep "train.$i" c4-train-and-eval-datasets.md5) ]]; then \
>     echo $i;\
>   fi;\
> done
00018
anton.u5bz@login44:/projects/u5bz/llama3.1_405b/c4/original> sort -k2 c4-train-and-eval-datasets.md5 \
| head -n 22 | tail -n 5
9659014d466f74d905257908cb8560d3  en_json/3.0.1/c4-train.00016-of-01024.json
dd0f638d8eb910091729ebb2b515794f  en_json/3.0.1/c4-train.00017-of-01024.json
c717b028a4931fa938954e21c68c3578  en_json/3.0.1/c4-train.00019-of-01024.json
52e03521e6a0fac74c29b8b204205e5b  en_json/3.0.1/c4-train.00020-of-01024.json
3dbc03d06e4f3fb9c5003fa01a6bda50  en_json/3.0.1/c4-train.00021-of-01024.json
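
For anyone who wants to repeat the check against another copy of the checksum file, the loop above can be wrapped in a small function. This is only a sketch: the function name `missing_from_manifest` and its two arguments are made up for illustration, and it assumes that every train part is listed as en_json/3.0.1/c4-train.NNNNN-of-TTTTT.json, as in the output above.

```shell
# Hypothetical helper, not part of any MLCommons tooling.
# Usage: missing_from_manifest MANIFEST TOTAL
# Prints the zero-padded index of every train part that has no entry
# in the md5 manifest. The body runs in a subshell so its variables
# do not leak into the calling shell.
missing_from_manifest() (
  manifest=$1
  total=$2
  for i in $(seq -f "%05g" 0 $((total - 1))); do
    # Build the expected file name, e.g. c4-train.00018-of-01024.json
    name=$(printf 'c4-train.%s-of-%05d.json' "$i" "$total")
    # -F matches the path literally, -q suppresses grep's output
    if ! grep -qF "en_json/3.0.1/$name" "$manifest"; then
      echo "$i"
    fi
  done
)

# Example call (commented out so the snippet can be sourced safely):
# missing_from_manifest c4-train-and-eval-datasets.md5 1024
```

Matching the full path with `grep -F` avoids the regex wildcard in "train.$i" and cannot accidentally match an eval part. Note that this only checks the manifest. Whether the listed files are intact on disk would be a separate check, for example with `md5sum -c c4-train-and-eval-datasets.md5` run from the download directory.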
