Tags: allenai/dolma

v1.2.1

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Adding tool to reshard npy files based on maximum desired size. (#269)

* resharding

* style

* Enhance S3 utility functions with improved logging and worker management. Set default max_workers based on CPU count, and update logging messages to include worker counts and file statistics for better traceability during downloads, merges, and uploads. Change argument name from --seed to --random-seed for clarity.

* style

* increased version number

* fixed f link

* initial tests

* formatting

* unit test

* removed unused imports!

* Refactor main function to support optional local temporary directory. Replace TemporaryDirectory with mkdtemp and ensure cleanup with shutil.rmtree. Update argument parser to include --local-tempdir option.

* Add usage instructions and contact information to reshard.py script

* api change, upsampling, max count

* Update s5cmd command in ReshardingPrefixConfig to include -sp flag for improved performance and modify subprocess.run to capture stdout and stderr separately.

* fixed bug in case max per file is very small

* Refactor ReshardingPrefixConfig to improve path sampling logic and update unit tests for enhanced sequence handling and validation.

* Remove unnecessary blank line in MemMapParallelWriter class to improve code readability.

* stack-edu

* Update destination paths in stack-edu configuration files to new S3 structure for improved organization and consistency.

* Add local_tempdir configuration for stack-edu files to specify temporary storage paths for each language.

* Enhance stack-edu configuration by adding support for language-specific temporary storage paths and updating argument parser to include --local-tempdir option for improved flexibility.

* Refactor run.sh script in stack-edu to simplify command for resharding by removing the unnecessary -c flag, enhancing clarity and usability.

* Add weighted sampling function to resharding logic for improved bucket distribution. Update group_paths_by_max_num_files to utilize weighted sampling, ensuring more balanced allocation of elements across buckets and removing empty buckets for cleaner output.

* fixing paths

* all_dressed

* mo snaz

* Remove deprecated configuration files and associated vigintiles for the 'snazzy' category in the dolma2-resharding project, streamlining the overall structure and eliminating unused resources.

* Update resharding script paths in snazzy1.sh and snazzy2.sh to point to the correct configuration directories, ensuring proper execution of language-specific tokenization.

* snazzy2

* Refactor s2orc generation script to load pstar values from JSON file and utilize tqdm for progress tracking. Update size calculations for languages based on new pstar data, improving output clarity and ensuring accurate token size reporting.

* s2orc and s2pdf

* Enhance s2orc and s2pdf functionality by integrating new features for improved output clarity and performance. Update scripts to ensure accurate token size reporting and streamline processing.

* removed since 0 sampling rate

* Improve path sampling in ReshardingPrefixConfig by ensuring at least one path is selected when the sampling rate allows. Added logging for clarity on the number of paths taken.

* Remove obsolete configuration files for geography, linguistics, and sociology in the dolma2-resharding project to streamline the project structure and eliminate unused resources.

* Update generate.log files for s2orc and s2pdf to include final size and off-by metrics for various languages, enhancing clarity and accuracy in token size reporting.

* skip logs

* pagination

* Refactor S3 size calculation in generate.py scripts across multiple configurations to implement pagination for large object lists, enhancing efficiency and accuracy in size reporting.

* Update generate.log for dolma2-resharding to reflect new natural token count and adjusted sampling rate, improving accuracy in token size reporting.

* fix paths

* Update generate.log for dolma2-resharding to include final size and off-by metrics for finemath-3plus, wikipedia, and arxiv subsets, enhancing accuracy in token size reporting.

* final mix

* Refactor ReshardingConfig initialization in from_dict method for improved clarity and type safety, and update path counting to use a dictionary comprehension for better readability.

* fixing ring issues.
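
The temporary-directory change described in the v1.2.1 message above (replacing TemporaryDirectory with mkdtemp, cleaning up with shutil.rmtree, and adding a --local-tempdir option) can be illustrated with a minimal sketch. This is not the code from reshard.py: the function names `parse_args` and `run_with_tempdir` and the placeholder file name are assumptions made for illustration, and only the `--local-tempdir` flag name is taken from the commit message.

```python
import argparse
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def parse_args(argv=None) -> argparse.Namespace:
    """Hypothetical parser; only the --local-tempdir flag name comes from the commit message."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local-tempdir", type=Path, default=None)
    return parser.parse_args(argv)


def run_with_tempdir(local_tempdir: Optional[Path] = None) -> Path:
    """Create a scratch directory, do some work in it, and always remove it afterwards."""
    if local_tempdir is not None:
        # Make sure the requested parent directory exists before creating the scratch dir.
        local_tempdir.mkdir(parents=True, exist_ok=True)

    # mkdtemp does not clean up after itself, unlike TemporaryDirectory,
    # so cleanup is done explicitly in the finally block below.
    parent = str(local_tempdir) if local_tempdir is not None else None
    workdir = Path(tempfile.mkdtemp(dir=parent))
    try:
        # Placeholder for the real work (download, merge, upload in the actual tool).
        (workdir / "placeholder.bin").write_bytes(b"")
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
    return workdir
```

When `--local-tempdir` is omitted, the sketch falls back to the system default temporary location, which is the usual behavior of mkdtemp with `dir=None`.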

v1.2.0

Verified

Tokenizer over custom fields and w/o IDs; BOS/EOS tokens. (#266)

* pass type and name

* new tests

* adding tests

* more PRers

* tests

* Refactor tokenizer functions to improve type annotations and enhance tokenization output. Updated `make_spec_from_fields` and `recursively_make_struct` to return `type[msgspec.Struct]`. Modified `tokenize_file` to yield `TokenizerOutput` with dtype parameter.

* Refactor tokenizer initialization to use `make_tokenizer` for improved dtype validation. Added a new test case to check for dtype mismatch errors during tokenization.

* documentation.

* Update tests/python/test_tokenizer.py

Co-authored-by: Copilot <[email protected]>

* Update CI workflow to use `uv` for environment management and command execution. Refactor type annotations in tokenizer-related files to use `Optional` for nullable fields. Enhance S3 utility functions to improve type safety.

* Add `uv venv` command to CI workflow for environment setup

* Update dependencies in pyproject.toml, enhance CI workflow with UV logging format, and modify record_info.py to handle optional fastwarc import with error handling.

* Enhance CI workflow by adding a step to install the latest version of the toolkit, ensuring up-to-date dependencies are used during the build process.

* removed unnecessary deps + better rust caching

Refactor dependency management in pyproject.toml by removing unnecessary cached-path entry and updating PII detection comments. Enhance CI workflow to cache Rust targets alongside the virtual environment for improved build efficiency. Update imports in various modules to use the new cached_path location.

* one final thing whatever

* Refactor type annotations in test files for improved clarity

Updated type annotations in `test_tokenizer.py` to specify the type of `extracted_sequences` as `list[list[int]]`. Removed unnecessary type ignore comment in `test_nested_struct.py` for better code readability.

* sorting

* typo

* style

* mypy madness

* Disable test for CodeProseCompositionClassifier until path issue is resolved

* sorting

* Update Python version in CI workflow from 3.9 to 3.10

* Refactor error handling in tokenize_file function to improve logging and maintainability. Moved try-except blocks to streamline error management and added logging for line processing errors.

* Remove unused import of 'exception' from the logging module in tokenizer.py to clean up the code.

* 3.10 doesnt delete=false

* removed older langid

---------

Co-authored-by: Copilot <[email protected]>

v1.2.0-dev8

Verified

Use original s3 path to delete local cache (#257)

1.2.0.dev7

Bump artifacts version

v1.1.2

Verified

Bump version to 1.1.2 for release (#243)

v1.1.1.post3

Verified

Pattern match for all artifacts (#239)

* Pattern match for all artifacts

* Bump package

* filter at build command

v1.2.0-dev6

pattern matching

v1.2.0-dev5

onward and upward

v1.2.0-dev4

allow builds on learn2code

v1.2.0-dev3

v4 upload artifacts is very broken, turning most of this off