Embed anything at lightning speed

Inference, ingestion, and indexing – supercharged by Rust 🦀

EmbedAnything is a minimalist, lightweight, high-performance embedding pipeline built in Rust. It is multisource, multimodal, and runs locally. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything streamlines generating embeddings from various sources and streaming them to a vector database for memory-efficient indexing. It supports dense, sparse, ONNX, and late-interaction embeddings, offering flexibility for a wide range of use cases.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina.
  • ONNX Models: Works with ONNX models for BERT and ColPali.
  • ColPali: Supported in the GPU version.
  • Splade: Support for sparse embeddings for hybrid search.
  • ReRankers: Support for reranking models for better RAG.
  • Cloud Embedding Models: Supports OpenAI and Cohere.
  • MultiModality: Works with text sources such as PDF, .txt, and .md files, JPG images, and .wav audio.
  • Rust: All file processing is done in Rust for speed and efficiency.
  • GPU Support: Hardware acceleration on GPU is taken care of as well.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Vector Streaming: Continuously create and stream embeddings when resources are limited.

💡What is Vector Streaming

Vector Streaming lets you generate embeddings for files and stream them as they are produced. If you have a 10 GB file, for example, it generates embeddings chunk by chunk (chunks can be segmented semantically) and stores them in the vector database of your choice, so the full set of embeddings never has to sit in RAM at once. The embedding process runs separately from the main process, using Rust's MPSC (multi-producer, single-consumer) channels to maintain high performance.
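
The idea can be illustrated with a minimal Python sketch. This is not EmbedAnything's implementation (which is written in Rust); it only mimics the pattern with a bounded queue standing in for an MPSC channel. The names `read_chunks`, `toy_embed`, and `stream_embeddings` are hypothetical, and `toy_embed` is a placeholder for a real model.

```python
import queue
import threading


def read_chunks(text, chunk_size):
    """Yield fixed-size character chunks lazily, one at a time."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def toy_embed(chunk):
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 97)]


def producer(chunks, channel):
    """Embed chunks and push them into the bounded channel."""
    for chunk in chunks:
        # put() blocks while the queue is full, which caps memory use.
        channel.put((chunk, toy_embed(chunk)))
    channel.put(None)  # sentinel: no more embeddings


def stream_embeddings(text, chunk_size=16, buffer_size=4):
    """Consume embeddings as they arrive instead of collecting them in bulk."""
    channel = queue.Queue(maxsize=buffer_size)
    worker = threading.Thread(
        target=producer, args=(read_chunks(text, chunk_size), channel)
    )
    worker.start()
    stored = []  # stands in for writes to a vector database
    while True:
        item = channel.get()
        if item is None:
            break
        stored.append(item)
    worker.join()
    return stored
```

At most `buffer_size` embeddings wait in the queue at any moment, so memory use depends on the buffer size rather than on the size of the input.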

🦀 Why Embed Anything

➡️Faster execution.
➡️Memory management: Rust's ownership model manages memory at compile time, helping prevent the memory leaks and crashes that can plague other languages.
➡️True multithreading.
➡️Running embedding models locally and efficiently.
➡️Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️Low memory usage.
➡️Supports a range of models: dense, sparse, late-interaction, reranker, and ModernBERT.

⭐ Supported Models

We support any Hugging Face model on Candle. We also support ONNX Runtime for BERT and ColPali.

How to add a custom model on Candle: from_pretrained_hf

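import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

# Load a Candle-backed model from the Hugging Face Hub.
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
# Set the chunk size and batch size used when embedding the file.
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embedder=model, config=config)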

Model    Custom link
Jina     jinaai/jina-embeddings-v2-base-en
         jinaai/jina-embeddings-v2-small-en
Bert     sentence-transformers/all-MiniLM-L6-v2
         sentence-transformers/all-MiniLM-L12-v2
         sentence-transformers/paraphrase-MiniLM-L6-v2
Clip     openai/clip-vit-base-patch32
Whisper  Most OpenAI Whisper models from Hugging Face are supported.
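
To give a rough feel for how `chunk_size` and `batch_size` interact, here is a small sketch. It counts words per chunk, which is an assumption made for illustration; the unit EmbedAnything uses for `chunk_size` is not stated here. The helpers `split_into_chunks` and `batched` are hypothetical and not part of the library.

```python
def split_into_chunks(words, chunk_size):
    """Group a list of words into consecutive chunks of at most chunk_size."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def batched(items, batch_size):
    """Yield consecutive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


words = ("lorem ipsum " * 500).split()             # 1000 words
chunks = split_into_chunks(words, chunk_size=200)  # 5 chunks of 200 words
large_batches = list(batched(chunks, batch_size=32))  # all 5 chunks fit in 1 batch
small_batches = list(batched(chunks, batch_size=2))   # batches of 2, 2, and 1 chunks
```

A larger batch means fewer model calls but more memory per call; a smaller chunk size means more, shorter pieces to embed.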

Splade Models:

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.SparseBert, "prithivida/Splade_PP_en_v1"
)
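
For intuition on how sparse embeddings feed into hybrid search, the sketch below scores two sparse vectors stored as `{token_id: weight}` dictionaries and blends the result with a dense score. The dictionary layout, the helper names, and the weighted-sum blend with `alpha` are illustrative assumptions, not EmbedAnything's API or output format.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {token_id: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b[token] for token, weight in a.items() if token in b)


def hybrid_score(dense_score, sparse_score, alpha=0.5):
    """Blend dense and sparse scores; alpha=1.0 uses only the dense score."""
    return alpha * dense_score + (1 - alpha) * sparse_score


query = {1: 0.5, 7: 2.0}
document = {7: 1.5, 9: 3.0}
overlap = sparse_dot(query, document)  # only token 7 is shared: 2.0 * 1.5
```

Only tokens present in both vectors contribute, which is why sparse scoring rewards exact term overlap while dense scoring captures broader semantic similarity.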

ONNX-Runtime: from_pretrained_onnx

BERT

model = EmbeddingModel.from_pretrained_onnx(
  WhichModel.Bert, model_id="onnx_model_link"
)

ColPali

model: ColpaliModel = ColpaliModel.from_pretrained_onnx("starlight-ai/colpali-v1.2-merged-onnx", None)
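
ColPali is a late-interaction model: it produces one vector per query token and one per document patch rather than a single vector each. Such models are commonly scored with MaxSim, sketched below in plain Python. This is an illustration of the scoring rule, not EmbedAnything's code, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def maxsim_score(query_vectors, doc_vectors):
    """For each query vector take its best-matching document vector, then sum."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0]]

score_a = maxsim_score(query, doc_a)  # max(1.0, 0.5) + max(0.0, 0.5)
score_b = maxsim_score(query, doc_b)  # 0.0 + 1.0
```

Because every query vector is matched independently, a document scores well when it covers each part of the query somewhere, even if no single vector matches everything.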

ModernBERT

model = EmbeddingModel.from_pretrained_onnx(
    WhichModel.Bert, ONNXModel.ModernBERTBase, dtype = Dtype.Q4F16
)

ReRankers

reranker = Reranker.from_pretrained("jinaai/jina-reranker-v1-turbo-en", dtype=Dtype.F16)

results: list[RerankerResult] = reranker.rerank(
    ["What is the capital of France?"],
    ["France is a country in Europe.", "Paris is the capital of France."],
    2,
)

For Semantic Chunking

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)

# with semantic encoder
semantic_encoder = EmbeddingModel.from_pretrained_hf(WhichModel.Jina, model_id = "jinaai/jina-embeddings-v2-small-en")
config = TextEmbedConfig(chunk_size=256, batch_size=32, splitting_strategy = "semantic", semantic_encoder=semantic_encoder)
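
As a rough intuition for what a semantic splitting strategy does, the sketch below starts a new chunk whenever adjacent sentences stop being similar. It is a toy under stated assumptions: `bag_of_words` is a placeholder for a real semantic encoder, the threshold is arbitrary, and EmbedAnything's actual algorithm may differ.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Toy encoder: word counts stand in for a real semantic embedding."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Append a sentence to the current chunk only if it resembles the previous one."""
    chunks = []
    for sentence in sentences:
        if chunks and cosine(
            bag_of_words(chunks[-1][-1]), bag_of_words(sentence)
        ) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks


sentences = [
    "rust is fast",
    "rust is memory safe",
    "cats like milk",
    "cats like naps",
]
groups = semantic_chunks(sentences)
```

Here the two Rust sentences end up in one chunk and the two cat sentences in another, because the similarity drops to zero at the topic change.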

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

For GPU support and special models like ColPali:

pip install embed-anything-gpu

Usage

➡️ Usage for version 0.3 and later

To use local embedding, we support BERT and Jina:

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embedder=model)

For multimodal embedding, we support CLIP:

Requirements: a directory containing the pictures you want to search. For example, test_files contains images of cats, dogs, and so on.

import numpy as np
from PIL import Image

import embed_anything
from embed_anything import EmbedData

# Load a CLIP model, which embeds images and text into the same vector space.
model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
# Embed every image in the directory.
data: list[EmbedData] = embed_anything.embed_directory("test_files", embedder=model)
embeddings = np.array([item.embedding for item in data])

# Embed the text query and score it against every image embedding.
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embedder=model)[0].embedding
)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
# Open the best match; this relies on `text` holding the image's file path.
Image.open(data[max_index].text).show()
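
The snippet above ranks images with a raw dot product, which also rewards vectors with large magnitude. Whether the returned embeddings are already normalized is not stated here, so if in doubt you can rank by cosine similarity instead. A minimal, dependency-free sketch follows; `normalize` and `top_k` are hypothetical helpers, not part of embed_anything.

```python
import math


def normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_k(query, vectors, k=1):
    """Return (index, cosine similarity) pairs for the k best matches."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(v))) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


vectors = [[10.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
query = [1.0, 1.0]
best = top_k(query, vectors, k=1)
```

With a raw dot product the first vector would win purely because of its length; after normalization the second vector, which points in the same direction as the query, ranks first.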

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embedder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embedder=embedder,
    text_embed_config=config,
)
print(data[0].metadata)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you contribute effectively. They are meant as guidelines, not strict rules. We encourage you to use your best judgment and to feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

🏎️ RoadMap

    Accomplishments

    One of the aims of EmbedAnything is to let AI engineers easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished: the formats described here are supported today, and a few more are still to be done.

    Adding Fine-tuning

    One of our major goals for this year is to add fine-tuning of these models on your own data, similar to what a simple sentence transformer offers.

    🖼️ Modalities and Source

    We’re excited to share that we've expanded our platform to support multiple modalities, including:

    • Audio files
    • Markdown files
    • Websites
    • Images
    • Videos
    • Graphs

    This gives you the flexibility to work with various data types all in one place! 🌐

    💜 Product

    We’ve rolled out some major updates in version 0.3 to improve both functionality and performance. Here’s what’s new:

    • Semantic Chunking: Optimized chunking strategy for better Retrieval-Augmented Generation (RAG) workflows.

    • Streaming for Efficient Indexing: We’ve introduced streaming for memory-efficient indexing in vector databases. Want to know more? Check out our article on this feature here: https://www.analyticsvidhya.com/blog/2024/09/vector-streaming/

    • Zero-Shot Applications: Explore our zero-shot application demos to see the power of these updates in action.

    • Intuitive Functions: Version 0.3 includes a complete refactor for more intuitive functions, making the platform easier to use.

    • Chunkwise Streaming: Instead of file-by-file streaming, we now support chunkwise streaming, allowing for more flexible and efficient data processing.

    Check out the latest release and see how these features can supercharge your generative AI pipeline! ✨

    🚀Coming Soon

    ⚙️ Performance

    We've received quite a few questions about why we're using Candle, so here's a quick explanation:

    One of the main reasons is that Candle doesn't require any specific ONNX format models, which means it can work seamlessly with any Hugging Face model. This flexibility has been a key factor for us. However, we also recognize that we’ve been compromising a bit on speed in favor of that flexibility.

    What’s Next? To address this, we’re excited to announce that we’re introducing Candle-ONNX alongside our existing Hugging Face framework, bringing:

    ➡️ Support for GGUF models
    ➡️ Significantly faster performance

    Stay tuned for these exciting updates! 🚀

    🫐Embeddings:

    Our infrastructure has been multimodal from day one. We already support websites, images, and audio, and we want to expand further to:

    ☑️ Graph embedding: build DeepWalk embeddings using depth-first walks and word2vec
    ☑️ Video embedding
    ☑️ YOLO CLIP

    🌊Expansion to other Vector Adapters

    We currently support a wide range of vector databases for streaming embeddings, including:

    • Elastic: thanks to the amazing and active Elastic team for the contribution
    • Weaviate
    • Pinecone

    But we're not stopping there! We're actively working to expand this list.

    Want to contribute? If you’d like to add support for your favorite vector database, we’d love to have your help! Check out our contribution.md for guidelines, or feel free to reach out directly at starlight-search@proton.me. Let's build something amazing together! 💡
