Contributor

@lingwei-gu lingwei-gu commented Nov 20, 2025

This PR adds FineWebCollection, so that we can create an index from FineWeb Parquet files.

Command to create an index:

bin/run.sh io.anserini.index.IndexCollection \
 -collection FineWebCollection \
 -input /home/tardis/shared/llms/hub/datasets--karpathy--fineweb-edu-100b-shuffle/snapshots \
 -index /home/tardis/shared/llms/hub/datasets--karpathy--fineweb-edu-100b-shuffle/index \
 -generator DefaultLuceneDocumentGenerator \
 -threads 16
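
For scripted runs, the same invocation can be assembled programmatically. The following is only a minimal sketch, not part of this PR: `build_index_command` is a hypothetical helper name, and the flags and paths are copied verbatim from the command above. It only builds and prints the command; nothing is executed.

```python
import shlex


def build_index_command(input_dir, index_dir, threads=16):
    """Return the IndexCollection invocation shown above as an argument list.

    Hypothetical helper for illustration; the flags mirror the PR's command.
    """
    return [
        "bin/run.sh", "io.anserini.index.IndexCollection",
        "-collection", "FineWebCollection",
        "-input", input_dir,
        "-index", index_dir,
        "-generator", "DefaultLuceneDocumentGenerator",
        "-threads", str(threads),
    ]


# Paths taken from the command in this PR description.
base = "/home/tardis/shared/llms/hub/datasets--karpathy--fineweb-edu-100b-shuffle"
cmd = build_index_command(f"{base}/snapshots", f"{base}/index")

# Print a shell-quoted form for review before running it.
print(shlex.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` from the root of an Anserini checkout, assuming `bin/run.sh` is present there.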

@lintool lintool merged commit b3ab9aa into castorini:master Dec 10, 2025
1 check passed