A unified inference library for transformer-based pre-trained multilingual embedding models
Text Embedder
text_embedder is a powerful and flexible Python library for generating and managing text embeddings with pre-trained, transformer-based multilingual embedding models. It supports multiple pooling strategies, similarity functions, and quantization techniques, making it a versatile tool for NLP tasks such as embedding generation, similarity search, and clustering.
🚀 Features
- Model Integration: Wraps 🤗 transformers to leverage state-of-the-art pre-trained embedding models.
- Pooling Strategies: Choose from multiple pooling methods such as CLS token, max/mean pooling, and more, tailored to your needs.
- Flexible Similarity Metrics: Compute similarity scores between embeddings using cosine, dot, euclidean, and manhattan metrics.
- Quantization Support: Reduce memory usage and improve performance by quantizing embeddings to multiple precision levels, with support for automatic mixed-precision quantization.
- Prompt Support: Optionally include a custom prompt in embeddings for contextualized representations.
- Configurable Options: Tune embedding generation with options for batch size, sequence length, normalization, and more.
🛠 Installation
Install text_embedder from PyPI using pip:
pip install text_embedder
📖 Usage
Initialization
Initialize the TextEmbedder with your desired configuration:
from text_embedder import TextEmbedder
embedder = TextEmbedder(
    model="BAAI/bge-small-en",
    sim_fn="cosine",
    pooling_strategy=["cls"],
    device="cuda",      # specify device if needed
    precision="int8",   # optional: for quantization
)
Generating Embeddings
Generate embeddings for a list of texts:
embeddings = embedder.embed(["Hello world", "Transformers are amazing!"])
print(embeddings)
Computing Similarity
Compute similarity between two embeddings:
embedding1 = embedder.embed(["Cat jumped from a chair"])
embedding2 = embedder.embed(["Mamba architecture is better than transformers tho, ngl."])
similarity_score = embedder.get_similarity(embedding1, embedding2)
print(f"Similarity Score: {similarity_score}")
Advanced Usage
Pooling Strategies
You can choose from various pooling strategies:
"cls"
: Use the CLS token embedding."max"
: Take the maximum value across tokens."mean"
: Compute the mean of token embeddings."mean_sqrt_len"
: Compute the mean divided by the square root of token length."weightedmean"
: Compute a weighted mean of token embeddings."lasttoken"
: Use the last token embedding.
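For example, switching to mean pooling only changes the pooling_strategy argument (a minimal sketch reusing the constructor arguments shown in the Initialization example; omitted options are assumed to fall back to their defaults):
from text_embedder import TextEmbedder

embedder = TextEmbedder(
    model="BAAI/bge-small-en",
    sim_fn="cosine",
    pooling_strategy=["mean"],  # average token embeddings instead of using the CLS token
)
embeddings = embedder.embed(["Mean pooling averages the token vectors of each sentence."])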
Similarity Functions
Supported similarity functions:
- Cosine Similarity: Measures the cosine of the angle between two vectors.
- Dot Product: Measures the dot product between two vectors.
- Euclidean Distance: Measures the straight-line (L2) distance between two vectors.
- Manhattan Distance: Measures the sum of absolute differences (L1 distance) between two vectors.
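For intuition, here is how these four metrics are typically computed over two embedding vectors (an illustrative NumPy sketch, not text_embedder's internal implementation):
import numpy as np

def cosine_sim(a, b):
    # cosine of the angle between the two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_sim(a, b):
    # unnormalized dot product
    return float(np.dot(a, b))

def euclidean_dist(a, b):
    # straight-line (L2) distance
    return float(np.linalg.norm(a - b))

def manhattan_dist(a, b):
    # sum of absolute differences (L1)
    return float(np.sum(np.abs(a - b)))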
Quantization
Quantize embeddings to lower precision:
- float32: 32-bit floating-point precision.
- float16: 16-bit floating-point precision.
- int8: 8-bit integer precision.
- uint8: 8-bit unsigned integer precision.
- binary: Binary quantization.
- ubinary: Unsigned binary quantization.
- 2bit: 2-bit quantization.
- 4bit: 4-bit quantization.
- 8bit: 8-bit quantization.
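To illustrate what the lower-precision levels do to an embedding (an illustrative NumPy sketch with a hypothetical 384-dimensional vector; the exact scaling scheme used by text_embedder may differ):
import numpy as np

emb = np.random.randn(384).astype(np.float32)  # example float32 embedding

# float16: halve the storage by downcasting
emb_f16 = emb.astype(np.float16)

# int8: map values into [-127, 127] with a per-vector scale, then round
scale = np.abs(emb).max() / 127.0
emb_int8 = np.round(emb / scale).astype(np.int8)

# binary: keep only the sign of each dimension, packed to 1 bit per value
emb_bin = np.packbits(emb > 0)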
Future Work
- Additional Pooling Strategies: Implement more advanced pooling methods (e.g., attention-based) and add an auto option to pooling_strategy that selects a suitable pooling method based on the model config.
- Custom Quantization Methods: Add new quantization techniques for further improvement.
- Similarity Functions: Add more similarity metrics.
🤝 Contributing
Contributions are welcome! Please follow these steps to get started with your contribution:
- Fork the repository.
- Create a new branch (git checkout -b feature/your-feature).
- Make your changes.
- Commit your changes (git commit -am 'Add new feature').
- Push to the branch (git push origin feature/your-feature).
- Create a new Pull Request.
📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgement
Special thanks to the developers of the Sentence-Transformers library.