
A unified inference library for transformer-based pre-trained multilingual embedding models

Project description

Text Embedder

text_embedder is a powerful and flexible Python library for generating and managing text embeddings using pre-trained, transformer-based multilingual embedding models. It supports various pooling strategies, similarity functions, and quantization techniques, making it a versatile tool for NLP tasks such as embedding generation, similarity search, and clustering.

🚀 Features

  • Model Integration: Wraps 🤗 transformers to leverage state-of-the-art pre-trained embedding models.
  • Pooling Strategies: Choose from multiple pooling methods, such as CLS token and max/mean pooling, to tailor embeddings to your needs.
  • Flexible Similarity Metrics: Compute similarity scores between embeddings using cosine, dot, euclidean, and manhattan metrics.
  • Quantization Support: Reduce memory usage and improve performance by quantizing embeddings to multiple precision levels, with support for automatic mixed-precision quantization.
  • Prompt Support: Optionally include a custom prompt when generating embeddings for contextualized representations.
  • Configurable Options: Tune embedding generation with options for batch size, sequence length, normalization, and more.

🛠 Installation

Install text_embedder from PyPI using pip:

pip install text_embedder

📖 Usage

Initialization

Initialize the TextEmbedder with your desired configuration:

from text_embedder import TextEmbedder

embedder = TextEmbedder(
    model="BAAI/bge-small-en",
    sim_fn="cosine",
    pooling_strategy=["cls"],
    device="cuda",  # Specify device if needed
)

Generating Embeddings

Generate embeddings for a list of texts:

embeddings = embedder.embed(["Hello world", "Transformers are amazing!"])
print(embeddings)

Computing Similarity

Compute similarity between two embeddings:

embedding1 = embedder.embed(["Cat jumped from a chair"])
embedding2 = embedder.embed(["Mamba architecture is better than transformers tho, ngl."])
similarity_score = embedder.get_similarity(embedding1, embedding2)
print(f"Similarity Score: {similarity_score}")

Advanced Usage

Pooling Strategies

You can choose from various pooling strategies; a short sketch of how they are typically computed follows this list:

  • "cls": Use the CLS token embedding.
  • "max": Take the maximum value across tokens.
  • "mean": Compute the mean of token embeddings.
  • "mean_sqrt_len": Compute the mean divided by the square root of token length.
  • "weightedmean": Compute a weighted mean of token embeddings.
  • "lasttoken": Use the last token embedding.

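To make these concrete, here is a minimal, library-agnostic sketch of how cls, max, mean, and mean_sqrt_len pooling are typically computed from a model's token embeddings and attention mask. This is plain PyTorch for illustration, not text_embedder's internal API:

# Illustrative only: generic pooling over token embeddings (not text_embedder internals).
import torch

def pool(token_embeddings, attention_mask, strategy="mean"):
    # token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()               # (batch, seq_len, 1)
    if strategy == "cls":
        return token_embeddings[:, 0]                         # embedding of the first ([CLS]) token
    if strategy == "max":
        masked = token_embeddings.masked_fill(mask == 0, float("-inf"))
        return masked.max(dim=1).values                       # element-wise max over real tokens
    summed = (token_embeddings * mask).sum(dim=1)             # sum over real (non-padding) tokens
    lengths = mask.sum(dim=1).clamp(min=1e-9)                 # number of real tokens per example
    if strategy == "mean":
        return summed / lengths
    if strategy == "mean_sqrt_len":
        return summed / lengths.sqrt()                        # sum divided by sqrt(token count)
    raise ValueError(f"Unknown pooling strategy: {strategy}")

# Toy check with random data
emb = torch.randn(2, 5, 8)
msk = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
print(pool(emb, msk, "mean").shape)  # torch.Size([2, 8])
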
Similarity Functions

Supported similarity functions (each is sketched in code after this list):

  • Cosine Similarity: Measures the cosine of the angle between two vectors.
  • Dot Product: Measures the dot product between two vectors.
  • Euclidean Distance (L2): Measures the straight-line distance between two vectors.
  • Manhattan Distance (L1): Measures the sum of absolute differences between two vectors.

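For reference, here is a minimal NumPy sketch of how these four metrics are commonly computed for a pair of embedding vectors. It is illustrative only, not necessarily how text_embedder implements get_similarity:

# Illustrative only: standard formulas for the four metrics on 1-D vectors.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot(a, b):
    return float(np.dot(a, b))

def euclidean(a, b):                    # L2 distance
    return float(np.linalg.norm(a - b))

def manhattan(a, b):                    # L1 distance
    return float(np.abs(a - b).sum())

a, b = np.random.rand(384), np.random.rand(384)
print(f"cosine={cosine(a, b):.4f}  euclidean={euclidean(a, b):.4f}")

Note that cosine and dot are similarities (higher means more similar), while euclidean and manhattan are distances (lower means more similar).
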
Quantization

Quantize embeddings to lower precision; a sketch of common quantization schemes follows this list:

  • float32: 32-bit floating-point precision.
  • float16: 16-bit floating-point precision.
  • int8: 8-bit integer precision.
  • uint8: 8-bit unsigned integer precision.
  • binary: Binary quantization.
  • ubinary: Unsigned binary quantization.
  • 2bit: 2-bit quantization.
  • 4bit: 4-bit quantization.
  • 8bit: 8-bit quantization.

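To illustrate the memory savings, the sketch below shows two common schemes: int8 quantization via per-dimension range calibration and binary quantization via sign bits, similar in spirit to the approach popularized by Sentence-Transformers. The exact scheme text_embedder uses may differ:

# Illustrative only: common embedding quantization schemes (assumed, not text_embedder's exact method).
import numpy as np

def quantize_int8(embeddings):
    # Map each dimension's observed [min, max] range onto [-128, 127]
    lo, hi = embeddings.min(axis=0), embeddings.max(axis=0)
    scale = (hi - lo) / 255.0 + 1e-12
    return ((embeddings - lo) / scale - 128).astype(np.int8)

def quantize_binary(embeddings):
    # Keep only the sign of each dimension, packing 8 dimensions per byte
    return np.packbits(embeddings > 0, axis=-1)

emb = np.random.randn(4, 384).astype(np.float32)
print(emb.nbytes, quantize_int8(emb).nbytes, quantize_binary(emb).nbytes)
# 6144 1536 192  -> roughly 4x and 32x smaller than float32
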
Future Work

  • Additional Pooling Strategies: Implement more advanced pooling methods (e.g., attention-based), and add an auto option to pooling_strategy that picks an appropriate pooling method from the model config.
  • Custom Quantization Methods: Add new quantization techniques for further memory and speed improvements.
  • Similarity Functions: Add more similarity metrics.

🤝 Contributing

Contributions are welcome! Please follow these steps to get started with your contribution:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature/your-feature).
  3. Make your changes.
  4. Commit your changes (git commit -am 'Add new feature').
  5. Push to the branch (git push origin feature/your-feature).
  6. Create a new Pull Request.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgement

Special thanks to the developers of the Sentence-Transformers library.



Download files

Download the file for your platform.

Source Distribution

text_embedder-0.1.2.tar.gz (13.5 kB)


Built Distribution

text_embedder-0.1.2-py3-none-any.whl (12.1 kB)


File details

Details for the file text_embedder-0.1.2.tar.gz.

File metadata

  • Download URL: text_embedder-0.1.2.tar.gz
  • Upload date:
  • Size: 13.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for text_embedder-0.1.2.tar.gz

  • SHA256: bda2085a59268da3be1c8bf7cfa6600483d2112db0b246fb83b526d824219919
  • MD5: 6cec24325e1cc436cce51260a7b7f3e8
  • BLAKE2b-256: 692ef9565b3ff447f1a7898c5ab9238af98014c8e06445c64ed8d53c0b1d9b0e


File details

Details for the file text_embedder-0.1.2-py3-none-any.whl.

File metadata

File hashes

Hashes for text_embedder-0.1.2-py3-none-any.whl

  • SHA256: d281ade1dd0ac133d31f8a274b6c3e77010c0f72535a4ff4dcf748ed13a2437f
  • MD5: d6069ae0d986f8bb12ab948c5b3afa39
  • BLAKE2b-256: 56e0e41d53817909ebb06bda360b03c7f303d9f9bf518b2daed69d42fc6e5647

