A minimal PyTorch implementation of the BERT architecture proposed by Devlin et al. This implementation focuses on simplicity and readability, so the model code is not optimized for training or inference efficiency. BabyBERT can be fine-tuned for downstream tasks such as named-entity recognition (NER), sentiment classification, or question answering (QA).
See the roadmap below for my future plans for this library!
```bash
pip install babybert
```

The following example demonstrates how to tokenize text, instantiate a BabyBERT model, and obtain contextual embeddings:
```python
from babybert.tokenizer import WordPieceTokenizer
from babybert.model import BabyBERTConfig, BabyBERT

# Load a pretrained tokenizer and encode a text
tokenizer = WordPieceTokenizer.from_pretrained("toy-tokenizer")
encoded = tokenizer.batch_encode(["Hello, world!"])

# Initialize an untrained BabyBERT model
model_cfg = BabyBERTConfig.from_preset(
    "tiny", vocab_size=tokenizer.vocab_size, block_size=len(encoded["token_ids"][0])
)
model = BabyBERT(model_cfg)

# Obtain contextual embeddings
hidden = model(**encoded)
print(hidden)
```

> [!TIP]
> For more usage examples, check out the `examples/` directory!
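Because the model returns contextual embeddings, a task-specific head can be trained on top of them for the downstream tasks mentioned above. The snippet below is a minimal, illustrative fine-tuning loop for sentiment classification; it assumes `model(**encoded)` returns a `(batch, seq_len, hidden_dim)` tensor as in the quickstart, and the linear head, first-token pooling, toy labels, and hyperparameters are assumptions for the sake of the sketch rather than part of the BabyBERT API.

```python
import torch
import torch.nn as nn

from babybert.tokenizer import WordPieceTokenizer
from babybert.model import BabyBERTConfig, BabyBERT

# Encode a tiny labeled batch (the texts and labels are made up for illustration)
tokenizer = WordPieceTokenizer.from_pretrained("toy-tokenizer")
texts = ["What a great movie!", "That was a waste of time."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative
encoded = tokenizer.batch_encode(texts)

model_cfg = BabyBERTConfig.from_preset(
    "tiny", vocab_size=tokenizer.vocab_size, block_size=len(encoded["token_ids"][0])
)
model = BabyBERT(model_cfg)

# Hypothetical classification head: project a pooled embedding to two classes
hidden = model(**encoded)                 # assumed shape: (batch, seq_len, hidden_dim)
classifier = nn.Linear(hidden.size(-1), 2)

optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(classifier.parameters()), lr=3e-5
)

for step in range(3):                     # a few toy optimization steps
    hidden = model(**encoded)
    logits = classifier(hidden[:, 0, :])  # use the first token as a pooled summary
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss = {loss.item():.4f}")
```

In a real fine-tuning run you would of course iterate over a `DataLoader` of labeled examples and start from pretrained weights instead of a randomly initialized model.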
The following diagram is a simplified representation of BabyBERT's architecture.
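To complement the diagram, the sketch below shows the same structure in plain PyTorch: token and position embeddings feeding a stack of Transformer encoder blocks (multi-head self-attention plus a feed-forward network, each with a residual connection and layer normalization), following the standard BERT recipe from Devlin et al. This is not BabyBERT's actual implementation; the class names, default sizes, and post-LN layout are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ToyEncoderBlock(nn.Module):
    """One Transformer encoder block: self-attention and feed-forward sublayers,
    each followed by a residual connection and layer normalization (post-LN)."""

    def __init__(self, hidden_dim: int, n_heads: int, ff_dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(hidden_dim)
        self.ff = nn.Sequential(
            nn.Linear(hidden_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, hidden_dim)
        )
        self.ln2 = nn.LayerNorm(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        x = self.ln1(x + attn_out)
        x = self.ln2(x + self.ff(x))
        return x


class ToyBERT(nn.Module):
    """Token + position embeddings followed by a stack of encoder blocks."""

    def __init__(self, vocab_size: int, block_size: int,
                 hidden_dim: int = 128, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden_dim)
        self.pos_emb = nn.Embedding(block_size, hidden_dim)
        self.blocks = nn.ModuleList(
            [ToyEncoderBlock(hidden_dim, n_heads, 4 * hidden_dim) for _ in range(n_layers)]
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)
        return x  # contextual embeddings, shape (batch, seq_len, hidden_dim)
```

For example, `ToyBERT(vocab_size=1000, block_size=16)(torch.randint(0, 1000, (2, 16)))` produces a `(2, 16, 128)` tensor of contextual embeddings.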
- Build initial model implementation
- Write trainer class
- Create custom WordPiece tokenizer
- Introduce more parameter configurations
- Set up pretrained model checkpoints
- Pretraining
- Sentiment classification
- Named entity recognition
- Question answering
- Attention visualization
Contributions to this project are welcome! See the contributing page for more details.
This project is licensed under the MIT license.