A fast, efficient tokenizer library for natural language processing tasks, built in Python with an optimized C backend.
- High Performance: Fast tokenization powered by optimized C libraries
- Multiple Encodings: Support for various tokenization models and vocabularies
- Flexible API: Easy-to-use Python interface with comprehensive functionality
- Special Tokens: Built-in support for special tokens and custom vocabularies
- Fallback Mechanisms: Robust error handling with fallback tokenization
- BPE Support: Byte Pair Encoding implementation for subword tokenization
```bash
pip install shredword
```

```python
from shred import load_encoding

# Load a tokenizer
tokenizer = load_encoding("pre_16k")
# Encode text to tokens
tokens = tokenizer.encode("Hello, world!")
print(tokens) # [10478, 10408, 10416, 10416, ...]
# Decode tokens back to text
text = tokenizer.decode(tokens)
print(text) # "Hello, world!"
# Get vocabulary information
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Special tokens: {tokenizer.special_tokens}")For detailed usage instructions, API reference, and examples, please see our User Documentation.
Shredword supports various pre-trained tokenization models. The library automatically downloads vocabulary files from the official repository when needed.
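As a rough sketch of what "downloads when needed" usually involves, the cache-then-fetch pattern below is illustrative only; the cache directory, file name, and URL are hypothetical and do not reflect Shredword's actual paths.

```python
# Hypothetical cache-then-download sketch; paths and URL are placeholders.
import os
import urllib.request

def fetch_vocab(name, base_url="https://example.com/vocabs"):
    cache_dir = os.path.expanduser("~/.cache/shredword")  # hypothetical cache dir
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{name}.model")
    if not os.path.exists(path):
        # Download only on a cache miss.
        urllib.request.urlretrieve(f"{base_url}/{name}.model", path)
    return path
```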
We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.
- Clone the repository
- Install development dependencies: `pip install -r requirements.txt`
- Run tests: `python -m pytest`
- Follow PEP 8 style guidelines
- Add tests for new features
- Update documentation as needed
- Ensure all tests pass before submitting PRs
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
- Issues: Report bugs or request features on GitHub Issues
- Discussions: Join community discussions on GitHub Discussions
Built with performance and simplicity in mind for the NLP community.
Note: This library requires a C/C++ compiler for optimal performance. Pure-Python fallback implementations are used when the C/C++ extensions are unavailable.
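The fallback typically follows the standard try/except import pattern shown below; the module and function names are hypothetical and only illustrate the approach, not Shredword's actual package layout.

```python
# Illustrative import-fallback pattern. Module names are hypothetical.
try:
    from shred._c_backend import bpe_encode  # compiled C extension
    BACKEND = "c"
except ImportError:
    from shred._py_backend import bpe_encode  # pure-Python fallback
    BACKEND = "python"

print(f"tokenizer backend: {BACKEND}")
```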