Minimal BERT implementation in PyTorch

Project description

Minimal implementation of the BERT architecture proposed by Devlin et al. using the PyTorch library. This implementation focuses on simplicity and readability, so the model code is not optimized for inference or training efficiency. BabyBERT can be fine-tuned for downstream tasks such as named-entity recognition (NER), sentiment classification, or question answering (QA).

See the roadmap below for my future plans for this library!

📦 Installation

pip install babybert

🚀 Quickstart

The following example demonstrates how to tokenize text, instantiate a BabyBERT model, and obtain contextual embeddings:

from babybert.tokenizer import WordPieceTokenizer
from babybert.model import BabyBERTConfig, BabyBERT

# Load a pretrained tokenizer and encode a text
tokenizer = WordPieceTokenizer.from_pretrained("toy-tokenizer")
encoded = tokenizer.batch_encode(["Hello, world!"])

# Initialize an untrained BabyBERT model
model_cfg = BabyBERTConfig.from_preset(
    "tiny", vocab_size=tokenizer.vocab_size, block_size=len(encoded["token_ids"][0])
)
model = BabyBERT(model_cfg)

# Obtain contextual embeddings
hidden = model(**encoded)
print(hidden)
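The contextual embeddings returned above can then be fed into a small task-specific head for fine-tuning. The sketch below is illustrative only: it assumes the model output is a tensor of shape `(batch, seq_len, hidden_dim)` and uses plain PyTorch with a stand-in tensor, since the head class (`SentimentHead`) and the dimensions are not part of the BabyBERT API.

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """Hypothetical classification head over BabyBERT hidden states."""

    def __init__(self, hidden_dim: int, num_classes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Mean-pool over the sequence dimension, then project to class logits
        pooled = hidden.mean(dim=1)
        return self.classifier(pooled)

head = SentimentHead(hidden_dim=128)
dummy_hidden = torch.randn(4, 16, 128)  # stand-in for BabyBERT output
logits = head(dummy_hidden)
print(logits.shape)  # → torch.Size([4, 2])
```

In practice you would replace `dummy_hidden` with the model output and train the head (and optionally the encoder) with a standard cross-entropy loss.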

[!TIP] For more usage examples, check out the examples/ directory!

🗺️ Roadmap

Model Implementation

  • Build initial model implementation
  • Write trainer class
  • Create custom WordPiece tokenizer
  • Introduce more parameter configurations
  • Set up pretrained model checkpoints

Usage Examples

  • Pretraining
  • Sentiment classification
  • Named entity recognition
  • Question answering

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

babybert-0.1.0.tar.gz (280.1 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

babybert-0.1.0-py3-none-any.whl (28.3 kB)

Uploaded Python 3

File details

Details for the file babybert-0.1.0.tar.gz.

File metadata

  • Download URL: babybert-0.1.0.tar.gz
  • Upload date:
  • Size: 280.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for babybert-0.1.0.tar.gz
  • SHA256: 5bb5b7ff378829603a93188fd963280b108162fa13cb391fd8c673aefd1fef81
  • MD5: 37a963644586fcd2ff725f5e9a72985c
  • BLAKE2b-256: 65cd903a2b8a22b7944f9023d66d7068926c7325be1dd6251acf3afb540795f9

See more details on using hashes here.
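Verifying a published digest locally takes only the standard-library `hashlib` module. The helper below streams the downloaded archive in chunks (so large files are not read into memory at once) and returns the hex digest for comparison against the SHA256 value listed above; the function name is just for illustration.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 8192) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the digest published in the table above:
# expected = "5bb5b7ff378829603a93188fd963280b108162fa13cb391fd8c673aefd1fef81"
# assert sha256_of_file("babybert-0.1.0.tar.gz") == expected
```

pip can also enforce this automatically via its hash-checking mode when installing from a requirements file.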

Provenance

The following attestation bundles were made for babybert-0.1.0.tar.gz:

Publisher: publish-to-pypi.yml on dross20/babybert

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file babybert-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: babybert-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 28.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for babybert-0.1.0-py3-none-any.whl
  • SHA256: c44a6689fbcbd8dddc6fd81832e60362e946d85e5a7344477151fbd68c30cb5f
  • MD5: 3e1cb1a406336ae974fed25117e0e1b6
  • BLAKE2b-256: 13d1fd0a674fe4d37f88e0ea535aa03f9cff96d8c116d75c156be3084e44dcb5

See more details on using hashes here.

Provenance

The following attestation bundles were made for babybert-0.1.0-py3-none-any.whl:

Publisher: publish-to-pypi.yml on dross20/babybert

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
