Trainable Swarmauri embedding component for Hugging Face masked language models, tokenizer adaptation, local fine-tuning, and Vector output.

Swarmauri MLM Embedding

swarmauri_embedding_mlm provides MlmEmbedding, a Swarmauri embedding component built on Hugging Face transformers and PyTorch masked language models. It can fine-tune a masked language model on local text, optionally expand the tokenizer vocabulary, and return Swarmauri Vector objects for retrieval, clustering, similarity search, and downstream agent memory workflows.

Why Swarmauri MLM Embedding?

Use this package when you want a trainable embedding adapter inside the Swarmauri component system instead of a fixed API-only embedding provider. MlmEmbedding keeps model loading, masking, fine-tuning, pooling, vector wrapping, and save/load behavior behind the shared EmbeddingBase interface so it can plug into Swarmauri vector stores and retrieval pipelines.

FAQ

Q: Which models can this package load?

A: MlmEmbedding uses AutoTokenizer.from_pretrained() and AutoModelForMaskedLM.from_pretrained(), so use a Hugging Face model compatible with masked language modeling, such as BERT-style or DistilBERT-style models.

Q: Does fit() train a complete embedding model from scratch?

A: No. Each fit() call fine-tunes an existing masked language model for one pass, using masked-token loss, the AdamW optimizer, and the configured batch size and learning rate.
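
To make the masked-token objective concrete, here is a minimal standard-library sketch of how a masking step could pick positions to predict. It is not the package's implementation. The function name mask_tokens, the "[MASK]" placeholder, and the reading of randomness_ratio as the chance that a selected position receives a random vocabulary token instead of the mask token are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder mask token; real tokenizers define their own


def mask_tokens(tokens, vocab, masking_ratio=0.15, randomness_ratio=0.1, seed=0):
    """Return (corrupted_tokens, labels) for one masked-LM training example.

    labels[i] holds the original token at every selected position and None
    elsewhere, so a loss would only be computed on the selected positions.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Select at least one position so every example contributes to the loss.
    n_select = max(1, round(len(tokens) * masking_ratio))
    selected = set(rng.sample(range(len(tokens)), n_select))

    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if i not in selected:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        # Assumed meaning of randomness_ratio: chance of a random replacement.
        if rng.random() < randomness_ratio:
            corrupted.append(rng.choice(vocab))
        else:
            corrupted.append(MASK)
    return corrupted, labels


tokens = "masked language models adapt to domain terminology".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, masking_ratio=0.3)
print(corrupted)
print(labels)
```

With seven tokens and a ratio of 0.3, two positions are selected. A training loop would feed the corrupted sequence to the model and score its predictions only where labels is not None.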

Q: What does transform() return?

A: It returns a list of Swarmauri Vector objects. The current implementation mean-pools model outputs and falls back to mean-pooled logits when the masked-language-model output does not expose last_hidden_state.
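
As an illustration of what mean pooling does, the sketch below averages per-token vectors into a single fixed-length vector using plain Python lists. The helper name mean_pool is hypothetical, and the sketch takes an unweighted mean; whether the package weights tokens by the attention mask is not stated here, so treat that as an open detail.

```python
def mean_pool(token_vectors):
    """Average per-token vectors (one list of floats per token) into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same length")
    n = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(v[d] for v in token_vectors) / n for d in range(dim)]


# Three tokens, each with a 4-dimensional representation.
token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
pooled = mean_pool(token_vectors)
print(pooled)  # [2.0, 1.0, 2.0, 2.0]
```

The pooled vector has the same length as each token vector. When logits are pooled instead of hidden states, that length is the vocabulary size, not the hidden size, so vectors from the two paths are not interchangeable.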

Q: Can I persist a tuned model?

A: Yes. Use save_model(path) to write the model and tokenizer, then load_model(path) to restore them later.

Features

  • MlmEmbedding registered under the swarmauri.embeddings entry point.
  • Hugging Face masked-language-model loading through AutoTokenizer and AutoModelForMaskedLM.
  • PyTorch training loop with automatic CUDA or CPU selection.
  • Configurable embedding_name, batch_size, learning_rate, masking_ratio, and randomness_ratio.
  • Optional tokenizer vocabulary expansion with add_new_tokens=True.
  • fit(), transform(), fit_transform(), and infer_vector() workflows.
  • save_model() and load_model() helpers for model reuse.
  • Swarmauri Vector outputs for vector stores and retrieval pipelines.
  • Python 3.10, 3.11, 3.12, 3.13, and 3.14 support.

Prerequisites

  • PyTorch installed for your target CPU or GPU environment.
  • Network access or a local Hugging Face cache for the selected model.
  • Enough disk and memory for the selected masked language model.
  • Training data as a list of text strings.

Installation

Install with uv:

uv add swarmauri_embedding_mlm

Install with pip:

pip install swarmauri_embedding_mlm

Usage

Fine-tune a masked language model and embed documents:

from swarmauri_embedding_mlm import MlmEmbedding

documents = [
    "Swarmauri components compose agents, memory, and tools.",
    "Masked language models can adapt to domain terminology.",
]

embedder = MlmEmbedding(
    embedding_name="distilbert-base-uncased",
    batch_size=8,
    learning_rate=3e-5,
)

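# Each fit() call runs one fine-tuning pass with masked-token loss.
embedder.fit(documents)

# transform() returns a list of Swarmauri Vector objects.
vectors = embedder.transform(
    [
        "Agents retrieve context from vector stores.",
        "Domain adaptation improves local vocabulary coverage.",
    ]
)

print(len(vectors))  # number of returned vectors
print(vectors[0].value[:5])  # first five components of the first vector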
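
Once you have vector values, a common next step is similarity search. The sketch below ranks a few stand-in vectors by cosine similarity using only the standard library. The helper cosine_similarity and the sample data are hypothetical; with this package you would pass each Vector's value list instead, and a Swarmauri vector store would normally handle this for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Stand-in values; with the package you would use vectors[i].value instead.
query = [1.0, 0.0, 1.0]
docs = {
    "doc-a": [1.0, 0.0, 1.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [1.0, 1.0, 0.0],
}

# Rank document names from most to least similar to the query.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)  # ['doc-a', 'doc-c', 'doc-b']
```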

Expand the tokenizer vocabulary before fine-tuning:

from swarmauri_embedding_mlm import MlmEmbedding

corpus = [
    "Swarmauri pipelines use composable intelligence infrastructure.",
    "Qdrant and Redis vector stores support retrieval workflows.",
]

embedder = MlmEmbedding(add_new_tokens=True)
embedder.fit(corpus)

print(embedder.epochs)
print(len(embedder.extract_features()))

Save and reload a tuned model:

from pathlib import Path

from swarmauri_embedding_mlm import MlmEmbedding

model_dir = Path("models/domain-mlm")

embedder = MlmEmbedding(embedding_name="distilbert-base-uncased")
embedder.fit(["short adaptation corpus"])
# Write the tuned model and tokenizer to disk.
embedder.save_model(model_dir.as_posix())

# Point embedding_name at the saved directory to load the tuned weights.
restored = MlmEmbedding(embedding_name=model_dir.as_posix())
vector = restored.infer_vector("reuse the tuned model")

print(len(vector.value))

Best Practices

  • Pin embedding_name to a model you have tested in your deployment environment.
  • Pre-download or cache Hugging Face model weights for offline or repeatable builds.
  • Use smaller models and smaller batches on memory-constrained machines.
  • Save tuned models after adaptation so workers do not repeat fine-tuning.
  • Pair generated vectors with a Swarmauri vector store for retrieval-augmented workflows.

License

Apache-2.0
