
Embed anything at lightning speed


Generate and stream embeddings with a minimalist and lightning-fast framework built in Rust 🦀
Explore the docs »

View Demo · Examples · Vector Streaming Adapters · Search in Audio Space

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multisource, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies the process of generating embeddings from various sources and streaming them to a vector database. We support dense, sparse, and late-interaction embeddings.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina
  • ColPali: Support for ColPali in the GPU version
  • Splade: Support for sparse embeddings for hybrid search
  • Cloud Embedding Models: Supports OpenAI and Cohere
  • Multimodality: Works with text sources (PDF, TXT, MD), images (JPG), and audio (WAV)
  • Rust: All file processing is done in Rust for speed and efficiency
  • Candle: Hardware acceleration is handled via Candle
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects
  • Vector Streaming: Continuously create and stream embeddings when working with limited memory

💡 What is Vector Streaming

Vector Streaming lets you process files and generate embeddings chunk by chunk, streaming them out as they are produced. Even for a 10 GB file, embeddings are created continuously, chunked (optionally with semantic splitting), and pushed to the vector database of your choice, so the full set of embeddings never has to sit in RAM at once.
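The built-in streaming adapters (Weaviate, Elastic, Pinecone, Qdrant; see the adapter examples) handle this hand-off for you. Conceptually the flow looks like the sketch below; it only uses the embed_file call documented later in this README, the file path is illustrative, and upsert_batch is a hypothetical stand-in for your vector-database client:

import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel


def upsert_batch(records):
    # Hypothetical stand-in for your vector-database client's upsert call;
    # with the streaming adapters this push happens automatically, chunk by chunk.
    print(f"upserting {len(records)} records")


model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)
config = TextEmbedConfig(chunk_size=256, batch_size=32)

# Embeddings for a large file are generated chunk by chunk rather than
# materialised for the whole file up front.
data = embed_anything.embed_file("big_document.pdf", embeder=model, config=config)

for chunk in data:
    upsert_batch([{"text": chunk.text, "vector": chunk.embedding}])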


🦀 Why Embed Anything

➡️ Faster execution
➡️ Memory management: Rust enforces strict memory management, preventing the memory leaks and crashes that can plague other languages
➡️ True multithreading
➡️ Run language models or embedding models locally and efficiently
➡️ Candle allows inference on CUDA-enabled GPUs right out of the box
➡️ Reduced memory usage

⭐ Supported Models

We support a range of models that can run on Candle. Below is a set of tested models; if you have a specific use case, please mention it in an issue.

How to add a custom model, chunk size, and semantic chunking

import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

# Load any Candle-compatible model from the Hugging Face Hub
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)
Model   | Hugging Face model id
--------|--------------------------------------------------------
Jina    | jinaai/jina-embeddings-v2-base-en
        | jinaai/jina-embeddings-v2-small-en
Bert    | sentence-transformers/all-MiniLM-L6-v2
        | sentence-transformers/all-MiniLM-L12-v2
        | sentence-transformers/paraphrase-MiniLM-L6-v2
Clip    | openai/clip-vit-base-patch32
Whisper | Most OpenAI Whisper models on Hugging Face are supported.

Splade Models:


model = EmbeddingModel.from_pretrained_hf(
    WhichModel.SparseBert, "prithivida/Splade_PP_en_v1"
)
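
Once loaded, the sparse model plugs into the same query API as the dense models. A minimal sketch continuing from the snippet above (it assumes the imports from the earlier examples, and the query string is purely illustrative):

# Sparse query embedding via the same embed_query API used for dense models
sparse = embed_anything.embed_query(["what is vector streaming?"], embeder=model)
print(sparse[0].embedding)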

ColPali Models (only available with embed-anything-gpu)

model: ColpaliModel = ColpaliModel.from_pretrained("vidore/colpali-v1.2-merged", None)

For Semantic Chunking

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)

# with semantic encoder
semantic_encoder = EmbeddingModel.from_pretrained_hf(
    WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en"
)
config = TextEmbedConfig(
    chunk_size=256,
    batch_size=32,
    splitting_strategy="semantic",
    semantic_encoder=semantic_encoder,
)
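
The resulting config is then passed to embed_file exactly as in the chunk-size example above; a minimal usage sketch (the file path is illustrative):

data = embed_anything.embed_file("test_files/test.pdf", embeder=model, config=config)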

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

For GPUs and special models like ColPali:

pip install embed-anything-gpu

Usage

➡️ Usage for versions 0.3 and later

To use local embeddings (we support Bert and Jina):

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)
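
embed_file returns a list of EmbedData objects. A minimal sketch of inspecting the output (the attribute names follow the other examples in this README):

for item in data[:3]:
    print(item.text)      # text of the chunk
    print(item.metadata)  # metadata attached to the chunk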

For multimodal embedding: we support CLIP

Requirements: a directory with the pictures you want to search. For example, we have test_files with images of cats, dogs, etc.

import embed_anything
import numpy as np
from PIL import Image
from embed_anything import EmbedData

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([d.embedding for d in data])

query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)

# Rank images by similarity to the query and open the best match
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
Image.open(data[max_index].text).show()

Audio Embedding using Whisper

Requirements: audio .wav files.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)
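
To search within the audio ("Search in Audio Space" above), you can compare a query embedding against the segment embeddings. A hedged sketch, assuming each returned EmbedData carries its transcribed segment in .text as in the image example (the query string is illustrative):

import numpy as np

query_embedding = np.array(
    embed_anything.embed_query(["car engine starting"], embeder=embeder)[0].embedding
)
segment_embeddings = np.array([d.embedding for d in data])

# Pick the transcribed segment whose embedding is closest to the query
best = int(np.argmax(np.dot(segment_embeddings, query_embedding)))
print(data[best].text)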

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you to contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

🏎️ Roadmap

    Accomplishments

    One of the aims of EmbedAnything is to let AI engineers easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished; these are the formats we support right now, with a few more still to come.

    🖼️ Modalities and Source

    We’re excited to share that we've expanded our platform to support multiple modalities, including:

    • Audio files

    • Markdowns

    • Websites

    • Images

    • Videos

    • Graph

    This gives you the flexibility to work with various data types all in one place! 🌐

    💜 Product

    We’ve rolled out some major updates in version 0.3 to improve both functionality and performance. Here’s what’s new:

    • Semantic Chunking: Optimized chunking strategy for better Retrieval-Augmented Generation (RAG) workflows.

    • Streaming for Efficient Indexing: We’ve introduced streaming for memory-efficient indexing in vector databases. Want to know more? Check out our article on this feature here: https://www.analyticsvidhya.com/blog/2024/09/vector-streaming/

    • Zero-Shot Applications: Explore our zero-shot application demos to see the power of these updates in action.

    • Intuitive Functions: Version 0.3 includes a complete refactor for more intuitive functions, making the platform easier to use.

    • Chunkwise Streaming: Instead of file-by-file streaming, we now support chunkwise streaming, allowing for more flexible and efficient data processing.

    Check out the latest release and see how these features can supercharge your generative AI pipeline! ✨

    🚀 Coming Soon

    ⚙️ Performance

    We've received quite a few questions about why we're using Candle, so here's a quick explanation:

    One of the main reasons is that Candle doesn't require any specific ONNX format models, which means it can work seamlessly with any Hugging Face model. This flexibility has been a key factor for us. However, we also recognize that we’ve been compromising a bit on speed in favor of that flexibility.

    What's next? To address this, we're excited to announce that we're introducing Candle-ONNX alongside our existing Hugging Face-based framework:

    • Support for GGUF models
    • Significantly faster performance

    Stay tuned for these exciting updates! 🚀

    🫐 Embeddings

    Our infrastructure has supported multimodality from day one. We have already included websites, images, and audio, and we want to expand further to:

    ☑️ Graph embeddings -- DeepWalk embeddings built depth-first, plus word2vec
    ☑️ Video embeddings
    ☑️ YOLO + CLIP

    🌊Expansion to other Vector Adapters

    We currently support a wide range of vector databases for streaming embeddings, including:

    • Elastic: thanks to the amazing and active Elastic team for the contribution
    • Weaviate
    • Pinecone
    • Qdrant

    But we're not stopping there! We're actively working to expand this list.

    Want to contribute? If you'd like to add support for your favorite vector database, we'd love to have your help! Check out our contribution.md for guidelines, or feel free to reach out directly at starlight-search@proton.me. Let's build something amazing together! 💡
