
Embed anything at lightning speed


Generate and stream embeddings with a minimalist and lightning-fast framework built in Rust 🦀
Explore the docs »

View Demo · Examples · Vector Streaming Adapters · Search in Audio Space

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multi-source, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies generating embeddings from various sources and streaming them to a vector database. We support dense, sparse, and late-interaction embeddings.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina
  • ColPali: Support for ColPali in the GPU version
  • Splade: Support for sparse embeddings for hybrid search
  • Cloud Embedding Models: Supports OpenAI and Cohere.
  • Multimodality: Works with text sources (PDF, TXT, MD), images (JPG), and audio (WAV)
  • Rust: All file processing is done in Rust for speed and efficiency
  • Candle: Hardware acceleration is handled with Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Vector Streaming: Continuously create and stream embeddings when resources are limited.

💡 What is Vector Streaming

Vector Streaming enables you to process files, generate embeddings chunk by chunk, segment them semantically, and stream them to the vector database of your choice. So even with 10 GB of files, embeddings are produced and stored continuously rather than held in RAM all at once.
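As a rough sketch of the idea (the helper below is hypothetical; in practice the library's own vector-streaming adapters, described in the Vector Streaming article linked in the roadmap below, handle the database side for you):

# Illustrative only: process files one at a time and push each file's
# chunks to a vector database instead of holding everything in RAM.
# `upsert_to_vector_db` is a hypothetical stand-in for your DB client or
# an EmbedAnything adapter.
import embed_anything
from embed_anything import EmbeddingModel, WhichModel, TextEmbedConfig

def upsert_to_vector_db(chunks):
    ...  # write EmbedData (text, embedding, metadata) to your store

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L6-v2"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)

for path in ["docs/part1.pdf", "docs/part2.pdf"]:
    data = embed_anything.embed_file(path, embeder=model, config=config)
    upsert_to_vector_db(data)  # stream this file's chunks out before the next one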


🦀 Why Embed Anything

➡️ Faster execution.
➡️ Memory management: Rust's ownership model prevents the memory leaks and crashes that can plague other languages.
➡️ True multithreading.
➡️ Runs language and embedding models locally and efficiently.
➡️ Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️ Decreased memory usage.

⭐ Supported Models

We support a range of models: essentially anything that Candle can run. The list below covers tested models; if you have a specific use case, please mention it in an issue.

How to add a custom model, chunk size, and semantic chunking

import embed_anything
from embed_anything import EmbeddingModel, WhichModel, TextEmbedConfig

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)
Model    Custom link
Jina     jinaai/jina-embeddings-v2-base-en
         jinaai/jina-embeddings-v2-small-en
Bert     sentence-transformers/all-MiniLM-L6-v2
         sentence-transformers/all-MiniLM-L12-v2
         sentence-transformers/paraphrase-MiniLM-L6-v2
Clip     openai/clip-vit-base-patch32
Whisper  Most OpenAI Whisper models on Hugging Face are supported.

Splade Models:


model = EmbeddingModel.from_pretrained_hf(
    WhichModel.SparseBert, "prithivida/Splade_PP_en_v1"
)

ColPali models (only run with embed-anything-gpu):

model: ColpaliModel = ColpaliModel.from_pretrained("vidore/colpali-v1.2-merged", None)
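As a hedged sketch of what comes next (the method name and arguments here are assumptions, not confirmed API; check the docs for the exact call), embedding a PDF with the loaded ColPali model might look like:

# Hypothetical sketch: method name and arguments are assumptions.
docs = model.embed_file("test_files/attention.pdf", batch_size=1)
print(len(docs), "page embeddings")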

For Semantic Chunking

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)

# with semantic encoder
semantic_encoder = EmbeddingModel.from_pretrained_hf(
    WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en"
)
config = TextEmbedConfig(
    chunk_size=256, batch_size=32,
    splitting_strategy="semantic", semantic_encoder=semantic_encoder,
)
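The config is then passed to embed_file just as before (the file path here is only a placeholder):

data = embed_anything.embed_file("test_files/test.pdf", embeder=model, config=config)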

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

For GPUs and special models like ColPali:

pip install embed-anything-gpu

Usage

➡️ Usage for versions 0.3 and later

To use local embeddings (we support Bert and Jina):

import embed_anything
from embed_anything import EmbeddingModel, WhichModel

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)
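Each element of data is an EmbedData object, so you can inspect a chunk's text, embedding, and metadata directly (a small sketch using the same attributes as the examples below):

print(data[0].text)            # chunk text
print(len(data[0].embedding))  # embedding dimension
print(data[0].metadata)        # source metadata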

For multimodal embedding, we support CLIP.

Requirements: a directory with the pictures you want to search. For example, we use test_files, which contains images of cats, dogs, etc.

import embed_anything
import numpy as np
from PIL import Image
from embed_anything import EmbedData

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([d.embedding for d in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
Image.open(data[max_index].text).show()

Audio Embedding using Whisper

Requirements: audio .wav files.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)
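To search the resulting transcript embeddings, you can reuse the embed_query pattern from the CLIP example above (a sketch; the query string is only a placeholder):

import numpy as np

embeddings = np.array([d.embedding for d in data])
query_embedding = np.array(
    embed_anything.embed_query(["what is the speaker describing?"], embeder=embeder)[0].embedding
)
similarities = np.dot(embeddings, query_embedding)
best = int(np.argmax(similarities))
print(data[best].text)  # transcript chunk most similar to the query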

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines
    🏎️ RoadMap

    Accomplishments

    One of the aims of EmbedAnything is to allow AI engineers to easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished: the formats listed below are supported right now, and a few more are still to come.

    🖼️ Modalities and Source

    We’re excited to share that we've expanded our platform to support multiple modalities, including:

    • Audio files

    • Markdowns

    • Websites

    • Images

    • Videos

    • Graph

    This gives you the flexibility to work with various data types all in one place! 🌐

    💜 Product

    We’ve rolled out some major updates in version 0.3 to improve both functionality and performance. Here’s what’s new:

    • Semantic Chunking: Optimized chunking strategy for better Retrieval-Augmented Generation (RAG) workflows.

    • Streaming for Efficient Indexing: We’ve introduced streaming for memory-efficient indexing in vector databases. Want to know more? Check out our article on this feature here: https://www.analyticsvidhya.com/blog/2024/09/vector-streaming/

    • Zero-Shot Applications: Explore our zero-shot application demos to see the power of these updates in action.

    • Intuitive Functions: Version 0.3 includes a complete refactor for more intuitive functions, making the platform easier to use.

    • Chunkwise Streaming: Instead of file-by-file streaming, we now support chunkwise streaming, allowing for more flexible and efficient data processing.

    Check out the latest release and see how these features can supercharge your generative-AI pipeline! ✨

    🚀 Coming Soon

    ⚙️ Performance

    We've received quite a few questions about why we're using Candle, so here's a quick explanation:

    One of the main reasons is that Candle doesn't require any specific ONNX format models, which means it can work seamlessly with any Hugging Face model. This flexibility has been a key factor for us. However, we also recognize that we’ve been compromising a bit on speed in favor of that flexibility.

    What’s next? To address this, we’re excited to announce that we’re introducing Candle-ONNX alongside our existing Hugging Face-based framework, bringing:

    ➡️ Support for GGUF models
    ➡️ Significantly faster performance

    Stay tuned for these exciting updates! 🚀

    🫐 Embeddings:

    We have had multimodality in our infrastructure from day one. We already support websites, images, and audio, but we want to expand further to:

    ☑️ Graph embeddings: build DeepWalk embeddings (depth-first) and Word2Vec
    ☑️ Video embeddings
    ☑️ YOLO + CLIP

    🌊Expansion to other Vector Adapters

    We currently support a wide range of vector databases for streaming embeddings, including:

    • Elastic: thanks to the amazing and active Elastic team for the contribution
    • Weaviate
    • Pinecone
    • Qdrant

    But we're not stopping there! We're actively working to expand this list.

    Want to contribute? If you’d like to add support for your favorite vector database, we’d love to have your help! Check out our contribution.md for guidelines, or feel free to reach out directly at starlight-search@proton.me. Let's build something amazing together! 💡

