
Swarmauri Bert Embedding Parser



Swarmauri Parser Bert Embedding

A parser that converts text into embeddings using a Hugging Face BERT encoder. It produces Document objects whose metadata carries the averaged token embedding, so downstream Swarmauri pipelines can work with dense vectors.
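
To make "averaged token embedding" concrete, here is a minimal sketch of mean pooling in plain Python. It only illustrates the arithmetic; the helper name mean_pool and the toy 4-dimensional vectors are assumptions for illustration, not the package's own code (bert-base-uncased produces 768-dimensional token vectors).

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average a sequence of per-token vectors into one fixed-size vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must share one dimension")
    count = len(token_vectors)
    # Average each dimension independently across all tokens.
    return [sum(v[i] for v in token_vectors) / count for i in range(dim)]


# Three hypothetical token vectors standing in for encoder output.
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
pooled = mean_pool(tokens)
print(pooled)  # [2.0, 1.0, 2.0, 2.0]
```

The result has the same length regardless of how many tokens the input text contains, which is what lets texts of different lengths be compared as vectors.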

Features

  • Uses transformers.BertModel + BertTokenizer (default bert-base-uncased).
  • Accepts single strings or lists of strings and emits Document instances with original text and embedding metadata.
  • Runs in inference (eval) mode with automatic torch.no_grad() handling.
  • Works on CPU by default; configure PyTorch device settings to leverage GPU.

Prerequisites

  • Python 3.10 or newer.
  • PyTorch compatible with your hardware (installed automatically via transformers if not present; install CUDA-enabled wheels manually when needed).
  • Internet access on first run so Hugging Face downloads tokenizer/model weights (or warm the cache ahead of time).

Installation

# pip
pip install swarmauri_parser_bertembedding

# poetry
poetry add swarmauri_parser_bertembedding

# uv (pyproject-based projects)
uv add swarmauri_parser_bertembedding

Quickstart

from swarmauri_parser_bertembedding import BERTEmbeddingParser

parser = BERTEmbeddingParser(parser_model_name="bert-base-uncased")

documents = parser.parse([
    "Swarmauri agents cooperate over shared memory.",
    "Dense embeddings power semantic search.",
])

for doc in documents:
    vector = doc.metadata["embedding"]
    print(doc.content)
    print(len(vector), vector[:5])
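
Once each document carries a vector, a common next step is to compare vectors. The sketch below ranks candidates by cosine similarity using only the standard library. It assumes each embedding is a one-dimensional sequence of numbers; the function names and toy vectors are hypothetical and not part of this package.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[int]:
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy 2-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(rank_by_similarity(query, candidates))  # [1, 2, 0]
```

With the Quickstart output, the same call would look like cosine_similarity(documents[0].metadata["embedding"], documents[1].metadata["embedding"]), provided the stored vectors behave like flat numeric sequences.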

Custom Models & Devices

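import torch
from swarmauri_parser_bertembedding import BERTEmbeddingParser
from transformers import BertModel

# Fall back to CPU when no CUDA device is available.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

class GPUParser(BERTEmbeddingParser):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Reload the encoder on the target device and keep it in inference mode.
        # Note: PyTorch requires input tensors to live on the same device as the model.
        self._model = BertModel.from_pretrained(self.parser_model_name).to(DEVICE)
        self._model.eval()

parser = GPUParser(parser_model_name="bert-base-multilingual-cased")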

Batch Embeddings at Scale

from tqdm import tqdm
from swarmauri_parser_bertembedding import BERTEmbeddingParser

texts = [f"Paragraph {i}" for i in range(1000)]
parser = BERTEmbeddingParser()

batched_docs = []
batch_size = 32
for start in tqdm(range(0, len(texts), batch_size)):
    batch = texts[start:start + batch_size]
    batched_docs.extend(parser.parse(batch))

Persist the resulting vectors to a Swarmauri vector store (Redis, Qdrant, etc.) by reading them from each document's metadata["embedding"] field.
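
Most stores and wire formats expect plain lists of floats rather than array objects. The sketch below converts an array-like embedding into a JSON-serializable record. The record schema and the helper names are assumptions for illustration; they are not a Swarmauri vector store format, and the exact type stored under metadata["embedding"] should be checked against your installed version.

```python
import json
from typing import Any, Dict, List


def embedding_to_list(vector: Any) -> List[float]:
    """Convert an array-like embedding into a plain list of Python floats."""
    # NumPy arrays and torch tensors expose tolist(); other sequences are iterated.
    values = vector.tolist() if hasattr(vector, "tolist") else list(vector)
    return [float(v) for v in values]


def to_record(doc_id: str, content: str, vector: Any) -> Dict[str, Any]:
    """Build a JSON-serializable record (hypothetical schema)."""
    return {"id": doc_id, "content": content, "embedding": embedding_to_list(vector)}


# A short tuple stands in for a real embedding here.
record = to_record("doc-0", "Dense embeddings power semantic search.", (0.25, -0.5, 1))
payload = json.dumps(record)
print(payload)
```

Applied to the loop above, each item in batched_docs would contribute doc.content and doc.metadata["embedding"] to one such record.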

Tips

  • Preprocess text to match model expectations (lowercase for uncased BERT, language-specific cleanup for multilingual models).
  • For extremely long documents, chunk the text before calling parse so each piece stays within BERT's 512-token limit.
  • Use PyTorch's to("cuda") or to("mps") to execute on GPUs or Apple silicon accelerators.
  • Cache Hugging Face weights in CI/CD environments (HF_HOME=/cache/hf) to avoid repeated downloads.
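
The chunking tip can be sketched as follows. This is a rough approach under stated assumptions: it splits on whitespace and uses word count as a proxy for token count, which undercounts because BERT's WordPiece tokenizer may split one word into several tokens. The defaults of 200 words and 20 words of overlap are conservative guesses, not values taken from this package.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping chunks of at most max_words whitespace-separated words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk can then be passed to parse as its own string. For a stricter bound, count tokens with the model's tokenizer instead of counting words.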

Want to help?

If you want to contribute to swarmauri-sdk, read our contributing guidelines to get started.


Source Distribution

swarmauri_parser_bertembedding-0.8.3.dev19.tar.gz (8.2 kB view details)

Uploaded Source

Built Distribution


File details

Details for the file swarmauri_parser_bertembedding-0.8.3.dev19.tar.gz.

File metadata

  • Download URL: swarmauri_parser_bertembedding-0.8.3.dev19.tar.gz
  • Upload date:
  • Size: 8.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.12 {"installer":{"name":"uv","version":"0.10.12","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for swarmauri_parser_bertembedding-0.8.3.dev19.tar.gz
Algorithm Hash digest
SHA256 4739d32b2e9336088c0795fde6f71289f907bbc8410d24e947f9240d1cf9981e
MD5 e56b1daf782df100a8cea9b6a9f8fffa
BLAKE2b-256 53109a1464f22c9cfc2aba3c2f9cca57b6cfbc76ad70675244ee29ca09cbb128


File details

Details for the file swarmauri_parser_bertembedding-0.8.3.dev19-py3-none-any.whl.

File metadata

  • Download URL: swarmauri_parser_bertembedding-0.8.3.dev19-py3-none-any.whl
  • Upload date:
  • Size: 9.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.12 {"installer":{"name":"uv","version":"0.10.12","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for swarmauri_parser_bertembedding-0.8.3.dev19-py3-none-any.whl
Algorithm Hash digest
SHA256 8e317194fc4117b8c2a689cf5ada3faa703e2760620e82d0182a6cb0f9c51ff4
MD5 4c56982ee629fe77b282148b8dc3dcc5
BLAKE2b-256 82e0297efacc44a5e689b4c500c7e43f716296a5100d4b225e5032b1a3951700
