Trainable Swarmauri embedding component for Hugging Face masked language models, tokenizer adaptation, local fine-tuning, and Vector output.
Swarmauri MLM Embedding
swarmauri_embedding_mlm provides MlmEmbedding, a Swarmauri embedding component built on Hugging Face transformers and PyTorch masked language models. It can fine-tune a masked language model on local text, optionally expand the tokenizer vocabulary, and return Swarmauri Vector objects for retrieval, clustering, similarity search, and downstream agent memory workflows.
Why Swarmauri MLM Embedding?
Use this package when you want a trainable embedding adapter inside the Swarmauri component system instead of a fixed API-only embedding provider. MlmEmbedding keeps model loading, masking, fine-tuning, pooling, vector wrapping, and save/load behavior behind the shared EmbeddingBase interface so it can plug into Swarmauri vector stores and retrieval pipelines.
FAQ
Q: Which models can this package load?
A: MlmEmbedding uses AutoTokenizer.from_pretrained() and AutoModelForMaskedLM.from_pretrained(), so use a Hugging Face model compatible with masked language modeling, such as BERT-style or DistilBERT-style models.
Q: Does fit() train a complete embedding model from scratch?
A: No. It fine-tunes an existing masked language model for one pass per fit() call using masked-token loss, AdamW, and the configured batch size and learning rate.
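The masked-token objective mentioned above can be illustrated without loading a model. The sketch below assumes a BERT-style reading of masking_ratio (fraction of positions selected for prediction) and randomness_ratio (fraction of selected positions replaced by a random token instead of the mask token). That interpretation is an assumption, not taken from the package source, and mask_tokens, MASK_ID, and IGNORE_INDEX are hypothetical names.

```python
import random

MASK_ID = 103        # id of [MASK] in the bert-base-uncased vocabulary
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_tokens(token_ids, vocab_size, masking_ratio=0.15,
                randomness_ratio=0.1, seed=0):
    """Return (corrupted_ids, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    n_select = max(1, round(len(token_ids) * masking_ratio))
    selected = set(rng.sample(range(len(token_ids)), n_select))

    corrupted = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for pos in selected:
        labels[pos] = token_ids[pos]  # loss is computed only at these positions
        if rng.random() < randomness_ratio:
            corrupted[pos] = rng.randrange(vocab_size)  # random replacement
        else:
            corrupted[pos] = MASK_ID                    # standard masking
    return corrupted, labels


ids = list(range(1000, 1020))  # 20 made-up token ids
corrupted, labels = mask_tokens(ids, vocab_size=30522)
```

Positions that were not selected keep their original id and carry the ignore label, so only the selected positions would contribute to the loss.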
Q: What does transform() return?
A: It returns a list of Swarmauri Vector objects. The current implementation mean-pools model outputs and falls back to mean-pooled logits when the masked-language-model output does not expose last_hidden_state.
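Mean pooling collapses one vector per token into a single fixed-size vector per text. The sketch below shows the idea on plain Python lists. Skipping padded positions through an attention mask is an assumption added for illustration (the description above only says the outputs are mean-pooled), and mean_pool is a hypothetical helper, not the package's function.

```python
def mean_pool(token_vectors, attention_mask=None):
    """Average per-token vectors into one fixed-size vector.

    token_vectors: list of equal-length float lists, one per token.
    attention_mask: optional list of 0/1 flags; 0 marks padding to skip.
    """
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


hidden = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row stands in for padding
pooled = mean_pool(hidden, attention_mask=[1, 1, 0])
print(pooled)  # [2.0, 3.0]
```

When logits are pooled instead of hidden states, each vector would have one entry per vocabulary item rather than per hidden unit, which is worth checking before sizing a vector store.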
Q: Can I persist a tuned model?
A: Yes. Use save_model(path) to write the model and tokenizer, then load_model(path) to restore them later.
Features
- MlmEmbedding registered under the swarmauri.embeddings entry point.
- Hugging Face masked-language-model loading through AutoTokenizer and AutoModelForMaskedLM.
- PyTorch training loop with automatic CUDA or CPU selection.
- Configurable embedding_name, batch_size, learning_rate, masking_ratio, and randomness_ratio.
- Optional tokenizer vocabulary expansion with add_new_tokens=True.
- fit(), transform(), fit_transform(), and infer_vector() workflows.
- save_model() and load_model() helpers for model reuse.
- Swarmauri Vector outputs for vector stores and retrieval pipelines.
- Python 3.10, 3.11, 3.12, 3.13, and 3.14 support.
Prerequisites
- PyTorch installed for your target CPU or GPU environment.
- Network access or a local Hugging Face cache for the selected model.
- Enough disk and memory for the selected masked language model.
- Training data as a list of text strings.
Installation
Install with uv:
uv add swarmauri_embedding_mlm
Install with pip:
pip install swarmauri_embedding_mlm
Usage
Fine-tune a masked language model and embed documents:
from swarmauri_embedding_mlm import MlmEmbedding

documents = [
    "Swarmauri components compose agents, memory, and tools.",
    "Masked language models can adapt to domain terminology.",
]

embedder = MlmEmbedding(
    embedding_name="distilbert-base-uncased",
    batch_size=8,
    learning_rate=3e-5,
)

# One fine-tuning pass over the documents using masked-token loss.
embedder.fit(documents)

# transform() returns a list of Swarmauri Vector objects, one per input text.
vectors = embedder.transform(
    [
        "Agents retrieve context from vector stores.",
        "Domain adaptation improves local vocabulary coverage.",
    ]
)
print(len(vectors))
print(vectors[0].value[:5])
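Vectors like these are usually compared with cosine similarity. The sketch below ranks candidates against a query using plain float lists, assuming each Vector's value behaves like a list of floats (as the slicing above suggests). The cosine and rank helpers are hypothetical and stand in for what a Swarmauri vector store would normally do.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length float sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat zero vectors as dissimilar


def rank(query, candidates):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Stand-ins for [v.value for v in vectors]
doc_values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order = rank([0.9, 0.1], doc_values)
print(order)  # [0, 2, 1]
```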
Expand the tokenizer vocabulary before fine-tuning:
from swarmauri_embedding_mlm import MlmEmbedding

corpus = [
    "Swarmauri pipelines use composable intelligence infrastructure.",
    "Qdrant and Redis vector stores support retrieval workflows.",
]

# Enable tokenizer vocabulary expansion for this embedder.
embedder = MlmEmbedding(add_new_tokens=True)
embedder.fit(corpus)

print(embedder.epochs)
print(len(embedder.extract_features()))
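How the package selects new tokens is not described here. One plausible approach, shown as a sketch only, is to collect corpus words that are missing from the tokenizer vocabulary. In Hugging Face terms, such a list would typically be passed to tokenizer.add_tokens(...) and followed by model.resize_token_embeddings(len(tokenizer)). The find_new_tokens helper and the tiny vocabulary below are hypothetical.

```python
import re


def find_new_tokens(corpus, known_vocab):
    """Collect lowercase words from the corpus that the vocabulary lacks."""
    seen = set()
    new_tokens = []
    for text in corpus:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in known_vocab and word not in seen:
                seen.add(word)
                new_tokens.append(word)  # keep first-seen order
    return new_tokens


# Tiny made-up vocabulary; a real tokenizer vocabulary is far larger.
vocab = {"and", "vector", "stores", "support", "retrieval", "workflows"}
corpus = ["Qdrant and Redis vector stores support retrieval workflows."]
new_tokens = find_new_tokens(corpus, vocab)
print(new_tokens)  # ['qdrant', 'redis']
```

Newly added tokens start with untrained embeddings, so a fine-tuning pass on text that contains them is what gives them useful values.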
Save and reload a tuned model:
from pathlib import Path

from swarmauri_embedding_mlm import MlmEmbedding

model_dir = Path("models/domain-mlm")

embedder = MlmEmbedding(embedding_name="distilbert-base-uncased")
embedder.fit(["short adaptation corpus"])

# Write the tuned model and tokenizer to disk.
embedder.save_model(model_dir.as_posix())

# Point embedding_name at the saved directory to reuse the tuned model.
restored = MlmEmbedding(embedding_name=model_dir.as_posix())
vector = restored.infer_vector("reuse the tuned model")
print(len(vector.value))
Related Packages
Embedding and vector packages:
- swarmauri_embedding_doc2vec
- swarmauri_embedding_nmf
- swarmauri_vectorstore_mlm
- swarmauri_vectorstore_qdrant
- swarmauri_vectorstore_redis
- swarmauri_vectorstore_pinecone
Foundational packages:
- swarmauri_core defines embedding interfaces.
- swarmauri_base provides EmbeddingBase.
- swarmauri_standard provides Vector.
- swarmauri provides namespace imports and plugin discovery.
Best Practices
- Pin embedding_name to a model you have tested in your deployment environment.
- Pre-download or cache Hugging Face model weights for offline or repeatable builds.
- Use smaller models and smaller batches on memory-constrained machines.
- Save tuned models after adaptation so workers do not repeat fine-tuning.
- Pair generated vectors with a Swarmauri vector store for retrieval-augmented workflows.
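The advice to save tuned models so workers do not repeat fine-tuning can be sketched as a check-then-train helper. The example assumes the saved directory contains a config.json file, which is what Hugging Face save_pretrained writes. Whether save_model produces the same layout is an assumption, and ensure_tuned_model and fake_train_and_save are hypothetical stand-ins so the sketch runs without model weights.

```python
from pathlib import Path
import tempfile


def ensure_tuned_model(model_dir, train_and_save):
    """Run train_and_save(model_dir) only if no saved model is present.

    Returns True when training ran, False when the cached copy was reused.
    """
    model_dir = Path(model_dir)
    if (model_dir / "config.json").is_file():
        return False
    model_dir.mkdir(parents=True, exist_ok=True)
    train_and_save(model_dir)
    return True


def fake_train_and_save(path):
    # Stand-in for embedder.fit(...) followed by embedder.save_model(...)
    (path / "config.json").write_text("{}")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "domain-mlm"
    first = ensure_tuned_model(target, fake_train_and_save)
    second = ensure_tuned_model(target, fake_train_and_save)

print(first, second)  # True False
```

This version does not guard against two workers starting at the same moment; a shared deployment would likely need a lock or a pre-built artifact.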
License
Apache-2.0