simple-bert-pytorch
A very simple BERT implementation in PyTorch, which only depends on PyTorch itself. Includes pre-trained models, tokenizers, and usage examples.
Coincidentally, the Tokenizer implementation in this project is 6-7x faster than the one in the transformers library! There is a lot of unneeded complexity/overhead in transformers, which is why I created this project in the first place.
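If you want to check that claim on your own machine, a rough benchmark sketch is below. It is not part of the package: it assumes transformers is installed, uses its slow (pure-Python) tokenizer for an apples-to-apples comparison, and the exact speedup will vary with hardware and inputs.
# Rough benchmark sketch -- not part of simple-bert-pytorch.
# Assumes `transformers` is installed; numbers vary by machine and input.
import time

from transformers import AutoTokenizer
from simple_bert_pytorch.tokenizer import Tokenizer

texts = ["Lorem ipsum dolor sit amet, consectetur adipiscing elit."] * 1000

simple_tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# use_fast=False selects the pure-Python tokenizer in `transformers`.
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

start = time.perf_counter()
simple_tokenizer(texts)
simple_time = time.perf_counter() - start

start = time.perf_counter()
hf_tokenizer(texts)
hf_time = time.perf_counter() - start

print(f"simple-bert-pytorch: {simple_time:.3f}s, transformers: {hf_time:.3f}s")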
Install
From PyPI:
pip install simple-bert-pytorch
From source:
pip install "simple-bert-pytorch @ git+ssh://git@github.com/fkodom/simple-bert-pytorch.git"
For contributors:
# Install all dev dependencies (tests etc.)
pip install "simple-bert-pytorch[test] @ git+ssh://git@github.com/fkodom/simple-bert-pytorch.git"
# Setup pre-commit hooks
pre-commit install
Usage
from simple_bert_pytorch.tokenizer import Tokenizer
from simple_bert_pytorch.models.bert import Bert
# You can also load a Tokenizer by passing the `lower_case` argument. Essentially
# all BERT models use one of two vocabularies (cased or uncased). If you know whether
# your model is cased or uncased, this is equivalent to loading by name.
# tokenizer = Tokenizer.from_pretrained(lower_case=True)
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Similar to `transformers`, pretrained models are loaded using the `from_pretrained`
# method. But you can also instantiate models directly! We keep it simple by using
# keyword arguments, rather than config objects, so you can easily see what you're
# passing in.
# model = Bert(
#     vocab_size=tokenizer.vocab_size,
#     num_layers=6,
#     dim=512,
#     num_heads=8,
#     intermediate_size=2048,
# )
model = Bert.from_pretrained("bert-base-uncased")
texts = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.",
]
# The tokenizer accepts a sequence of text strings. If you don't provide any other
# arguments, the returned "input_ids" and "attention_mask" are lists of lists. You
# can also truncate, pad, and convert them to tensors in one step:
# tokenized = tokenizer(texts, max_length=128, padding=True, return_tensors=True)
tokenized = tokenizer(texts)
print(tokenized)
# {
#     'input_ids': [[101, 9850, 24727, ...], [101, 1736, 2079, ...]],
#     'attention_mask': [[1, 1, 1, ...], [1, 1, 1, ...]]
# }
# Check that decoding the tokenized inputs works as expected! This model is uncased,
# so the texts will be lowercased.
decoded = [
    tokenizer.decode(input_ids, skip_special_tokens=True)
    for input_ids in tokenized["input_ids"]
]
print(decoded)
# [
#     "lorem ipsum dolor sit amet, consectetur adipiscing elit.",
#     "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.",
# ]
# The model arguments have the same names as the tokenizer keys! You can implicitly
# pass keyword arguments using the `**` operator, but I prefer to be explicit.
# NOTE: For this example, let's use `padding=True` and `return_tensors=True` so we
# don't have to worry about padding the input tensors ourselves.
tokenized = tokenizer(texts, padding=True, return_tensors=True)
# Pretrained BERT is a masked language model. It returns logits for each input token.
# NOTE: This is a significant difference from the `transformers` library! Rather than
# returning a dict or tuple of outputs, the model returns a single Tensor. This library
# is simple enough that, if you need different outputs, you can easily modify the model
# code to make that happen.
logits = model(input_ids=tokenized["input_ids"], attention_mask=tokenized["attention_mask"])
print(logits)
# tensor([[[ -7.5046,  -7.4059,  -7.4317, ...,  -6.8106,  -6.6921,  -4.7529],
#          ...,
#          [-13.6784, -13.2786, -13.8128, ..., -11.6681, -12.6777,  -7.2111]]],
#        grad_fn=<ViewBackward0>)
print(logits.shape)
# torch.Size([2, 25, 30522])
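As a quick sanity check, you can greedily decode the logits back into tokens. This is an illustrative sketch built only from the objects defined above, not a documented library API:
# Greedy decode of the MLM logits (illustrative sketch, not a library API).
# For unmasked inputs, BERT mostly reconstructs the original tokens; padded
# positions may decode to arbitrary tokens.
predicted_ids = logits.argmax(dim=-1)
predictions = [
    tokenizer.decode(ids.tolist(), skip_special_tokens=True)
    for ids in predicted_ids
]
print(predictions)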
Each pretrained model may return slightly different outputs, depending on the model architecture:
Model Type | Outputs | Example Model |
---|---|---|
Masked Language Model | logits for each input token | Bert |
Embedding Model | single embedding vector for each input sequence | BGE |
Sequence Classification Model (AKA "rerankers") | classification score for each input sequence | CrossEncoder |
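The sketch below shows what using those other model types could look like. The module paths and pretrained model names are assumptions made for illustration (only the class names BGE and CrossEncoder come from the table above); check the package source for the real ones.
# Hypothetical sketch -- module paths and pretrained names below are assumptions,
# not documented here. Verify against the package source before using.
from simple_bert_pytorch.models.bge import BGE  # assumed module path
from simple_bert_pytorch.models.cross_encoder import CrossEncoder  # assumed module path

# Embedding model: one vector per input sequence.
embedder = BGE.from_pretrained("bge-small-en-v1.5")  # hypothetical name
embeddings = embedder(
    input_ids=tokenized["input_ids"], attention_mask=tokenized["attention_mask"]
)
print(embeddings.shape)  # e.g. (batch_size, embedding_dim)

# Reranker: one relevance score per input sequence.
reranker = CrossEncoder.from_pretrained("ms-marco-MiniLM-L-6-v2")  # hypothetical name
scores = reranker(
    input_ids=tokenized["input_ids"], attention_mask=tokenized["attention_mask"]
)
print(scores.shape)  # e.g. (batch_size,)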
File details
Details for the file simple_bert_pytorch-0.2.0.tar.gz.
File metadata
- Download URL: simple_bert_pytorch-0.2.0.tar.gz
- Upload date:
- Size: 407.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.6
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 3cbe92eb5d454b8cf635642e6876bbf055c0f0f49362f53a09d5e21ab737c332 |
MD5 | d452f07f72bcaa3af0a4d6fad8040adc |
BLAKE2b-256 | c8709933e98f39a9d5872e585ad6f645648107d18f3f9dfee5775bacd242f3d7 |
File details
Details for the file simple_bert_pytorch-0.2.0-py3-none-any.whl.
File metadata
- Download URL: simple_bert_pytorch-0.2.0-py3-none-any.whl
- Upload date:
- Size: 408.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.6
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 43746207ef6eb0d59ca5facedefc161852ff6e2ec010d251568beed15031615c |
MD5 | 908c04c9f59229c393491a4663c4fe88 |
BLAKE2b-256 | 33ef531bb073d99753bd948f5178bfe164ba9dcb95285efdcad01180a0c63ede |