A unified inference library for transformer-based pre-trained multilingual embedding models
Text Embedder
`text_embedder` is a flexible Python library for generating and managing text embeddings using pre-trained, transformer-based multilingual embedding models. It supports multiple pooling strategies, similarity functions, and quantization techniques, making it useful for a range of NLP tasks, including similarity search, clustering, and retrieval.
🚀 Features
- Model Integration: Wraps 🤗 Transformers to leverage state-of-the-art pre-trained embedding models.
- Pooling Strategies: Choose from multiple pooling methods, such as CLS-token, max, and mean pooling, to tailor embeddings to your needs.
- Flexible Similarity Metrics: Compute similarity scores between embeddings using cosine, dot-product, Euclidean, and Manhattan metrics.
- Quantization Support: Reduce memory usage and improve performance by quantizing embeddings to multiple precision levels, with support for automatic mixed-precision quantization.
- Prompt Support: Optionally include a custom prompt for contextualized representations.
- Configurable Options: Tune embedding generation with options for batch size, sequence length, normalization, and more.
🛠 Installation
Install `text_embedder` from PyPI using pip:

```bash
pip install text_embedder
```
📖 Usage
Initialization
Initialize the `TextEmbedder` with your desired configuration:
```python
from text_embedder import TextEmbedder

embedder = TextEmbedder(
    model="BAAI/bge-small-en",
    sim_fn="cosine",
    pooling_strategy=["cls"],
    device="cuda",  # specify the device if needed
)
```
Generating Embeddings
Generate embeddings for a list of texts:
```python
embeddings = embedder.embed(["Hello world", "Transformers are amazing!"])
print(embeddings)
```
Computing Similarity
Compute similarity between two embeddings:
```python
embedding1 = embedder.embed(["Cat jumped from a chair"])
embedding2 = embedder.embed(["Mamba architecture is better than transformers tho, ngl."])
similarity_score = embedder.get_similarity(embedding1, embedding2)
print(f"Similarity Score: {similarity_score}")
```
Advanced Usage
Pooling Strategies
You can choose from various pooling strategies (see the sketch after this list):
- `"cls"`: Use the CLS token embedding.
- `"max"`: Take the maximum value across tokens.
- `"mean"`: Compute the mean of token embeddings.
- `"mean_sqrt_len"`: Compute the mean divided by the square root of the token length.
- `"weightedmean"`: Compute a weighted mean of token embeddings.
- `"lasttoken"`: Use the last token's embedding.
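To make these concrete, here is a minimal sketch of masked mean pooling in plain PyTorch. It is for intuition only and is not `text_embedder`'s internal implementation; `mean_pool` is a hypothetical helper.

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions (illustrative sketch)."""
    # token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # sum over non-padding tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per row
    return summed / counts                           # (batch, hidden)
```

The attention mask matters: without it, padding tokens in a batch would drag the mean of shorter texts toward the padding embedding.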
Similarity Functions
Supported similarity functions:
- Cosine Similarity: Measures the cosine of the angle between two vectors.
- Dot Product: Measures the dot product between two vectors.
- Euclidean Distance: Measures the straight-line distance between two vectors (L2).
- Manhattan Distance: Measures the sum of absolute differences between two vectors (L1).
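For reference, the four metrics can be written in a few lines of NumPy; this is an illustrative sketch, not `text_embedder`'s internal code. Note that cosine and dot product grow with similarity, while the two distances shrink.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # higher = more similar
dot = a @ b                                               # higher = more similar
euclidean = np.linalg.norm(a - b)                         # L2 distance, lower = more similar
manhattan = np.abs(a - b).sum()                           # L1 distance, lower = more similar
print(cosine, dot, euclidean, manhattan)
```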
Quantization
Quantize embeddings to lower precision:
- float32: 32-bit floating-point precision.
- float16: 16-bit floating-point precision.
- int8: 8-bit integer precision.
- uint8: 8-bit unsigned integer precision.
- binary: Binary quantization.
- ubinary: Unsigned binary quantization.
- 2bit: 2-bit quantization.
- 4bit: 4-bit quantization.
- 8bit: 8-bit quantization.
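For intuition, here is a rough NumPy sketch of int8 and binary quantization of embeddings. The per-dimension min/max calibration below is an assumption made for illustration, not necessarily how `text_embedder` quantizes internally.

```python
import numpy as np

emb = np.random.randn(4, 384).astype(np.float32)  # stand-in for float32 embeddings

# int8: linearly map each dimension's observed [min, max] range onto [-128, 127]
lo, hi = emb.min(axis=0), emb.max(axis=0)
scale = np.maximum((hi - lo) / 255.0, 1e-9)       # guard against zero-width ranges
emb_int8 = np.clip((emb - lo) / scale - 128, -128, 127).astype(np.int8)

# binary: keep only the sign of each dimension, packing 8 dimensions per byte
emb_binary = np.packbits(emb > 0, axis=-1)        # (4, 384) bools -> (4, 48) bytes
```

Binary quantization cuts storage by 32x relative to float32 at some cost in accuracy, which is why it is popular for large-scale retrieval.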
Future Work
- Additional Pooling Strategies: Implement more advanced pooling methods (e.g., attention-based), and add an `auto` option to `pooling_strategy` that selects a suitable pooling method from the model config.
- Custom Quantization Methods: Add new quantization techniques for further improvements.
- Similarity Functions: Add more similarity metrics.
🤝 Contributing
Contributions are welcome! Please follow these steps to get started:
- Fork the repository.
- Create a new branch (`git checkout -b feature/your-feature`).
- Make your changes.
- Commit your changes (`git commit -am 'Add new feature'`).
- Push to the branch (`git push origin feature/your-feature`).
- Create a new Pull Request.
📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgement
Special thanks to the developers of the Sentence-Transformers library.