Embed anything at lightning speed

Generate and stream embeddings with a minimalist, lightning-fast framework built in Rust 🦀
EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multisource, multimodal, local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies generating embeddings from various sources and streaming them to a vector database. We support dense, sparse and late-interaction embeddings.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models such as BERT and Jina.
  • ColPali: Supported in the GPU version.
  • Splade: Sparse embeddings for hybrid search.
  • Cloud Embedding Models: Supports OpenAI and Cohere.
  • MultiModality: Works with text sources such as PDF, .txt and .md files, with JPG images, and with .wav audio.
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is handled through Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Vector Streaming: Continuously create and stream embeddings when resources are limited.

💡What is Vector Streaming

Vector Streaming lets you generate embeddings for files and stream them as they are produced. If you have a 10 GB file, it generates embeddings continuously, chunk by chunk (chunks can be segmented semantically), and stores them in the vector database of your choice. This avoids holding all of the embeddings in RAM at once.
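
To make the idea concrete, here is a minimal sketch of chunk-by-chunk streaming in plain Python. It is not EmbedAnything's implementation: chunk_text, toy_embed, stream_embeddings and the list used as a sink are all hypothetical stand-ins, chosen only to show that each batch is handed off as soon as it is full.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[str, List[float]]  # (chunk text, embedding)


def chunk_text(text: str, chunk_size: int) -> Iterator[str]:
    """Lazily yield fixed-size character chunks of the input."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def toy_embed(chunk: str) -> List[float]:
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 97)]


def stream_embeddings(
    chunks: Iterable[str],
    embed: Callable[[str], List[float]],
    sink: Callable[[List[Record]], None],
    batch_size: int = 32,
) -> int:
    """Embed chunks batch by batch, handing each batch to `sink` right away.

    Only one batch of embeddings is held at a time. Returns the number of
    chunks that were embedded.
    """
    batch: List[Record] = []
    total = 0
    for chunk in chunks:
        batch.append((chunk, embed(chunk)))
        if len(batch) == batch_size:
            sink(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        sink(batch)
        total += len(batch)
    return total


stored: List[Record] = []  # stands in for a vector database
count = stream_embeddings(
    chunk_text("a" * 450, 200), toy_embed, stored.extend, batch_size=2
)
print(count, len(stored))
```

In this toy version the whole input string is already in memory; for a genuinely large file the chunk generator would read from disk incrementally, and the sink would be a database client's upsert call.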

🦀 Why Embed Anything

➡️ Faster execution.
➡️ Memory management: Rust enforces memory safety at compile time, helping prevent the memory bugs and crashes that can plague other languages.
➡️ True multithreading.
➡️ Runs language models and embedding models locally and efficiently.
➡️ Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️ Lower memory usage.

⭐ Supported Models

We support a range of models that can run on Candle. We have listed a set of tested models, but if you have a specific use case, please mention it in an issue.

How to add a custom model, chunk size, and semantic chunking

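import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

# Load a model from Hugging Face by its model id.
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
# chunk_size and batch_size control how the text is split and batched.
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)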

Model     Custom link
Jina      jinaai/jina-embeddings-v2-base-en
          jinaai/jina-embeddings-v2-small-en
Bert      sentence-transformers/all-MiniLM-L6-v2
          sentence-transformers/all-MiniLM-L12-v2
          sentence-transformers/paraphrase-MiniLM-L6-v2
Clip      openai/clip-vit-base-patch32
Whisper   Most OpenAI Whisper models from Hugging Face are supported.

Splade Models:


model = EmbeddingModel.from_pretrained_hf(
    WhichModel.SparseBert, "prithivida/Splade_PP_en_v1"
)
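
For readers new to sparse embeddings, the sketch below shows one common way sparse and dense scores are combined for hybrid search. It is an illustration only: the dictionary representation, sparse_dot, hybrid_score and the alpha weight are assumptions made for this example, not part of EmbedAnything's API.

```python
from typing import Dict

SparseVector = Dict[int, float]  # vocabulary index -> weight (assumed layout)


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product over the indices that both sparse vectors share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b[index] for index, weight in a.items() if index in b)


def hybrid_score(dense: float, sparse: float, alpha: float = 0.5) -> float:
    """Weighted blend of a dense and a sparse similarity score."""
    return alpha * dense + (1.0 - alpha) * sparse


query = {1: 0.5, 7: 2.0}
doc = {7: 1.5, 9: 1.0}
sparse = sparse_dot(query, doc)  # only index 7 overlaps
print(hybrid_score(dense=0.8, sparse=sparse, alpha=0.5))
```

In practice the two scores usually live on different scales, so they are often normalized before blending; the right alpha depends on the data.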

ColPali models (these only run with embed-anything-gpu)

model: ColpaliModel = ColpaliModel.from_pretrained("vidore/colpali-v1.2-merged", None)
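
ColPali is a late-interaction model: a query and a document are each represented by many vectors rather than one. A rough sketch of the MaxSim scoring typically used with such models follows. The function names and toy vectors are hypothetical, and this does not show how EmbedAnything itself scores ColPali output.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(
    query_vectors: Sequence[Vector], doc_vectors: Sequence[Vector]
) -> float:
    """For each query vector take its best match in the document, then sum."""
    if not doc_vectors:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Two query token vectors scored against two document patch vectors.
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_vectors = [[0.5, 0.5], [1.0, 0.0]]
print(maxsim_score(query_vectors, doc_vectors))
```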

For Semantic Chunking

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)

# with semantic encoder
semantic_encoder = EmbeddingModel.from_pretrained_hf(WhichModel.Jina, model_id = "jinaai/jina-embeddings-v2-small-en")
config = TextEmbedConfig(chunk_size=256, batch_size=32, splitting_strategy = "semantic", semantic_encoder=semantic_encoder)
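
As background, one common way to chunk semantically is to start a new chunk wherever adjacent sentences stop being similar. The sketch below illustrates that idea in plain Python; the library's "semantic" splitting strategy may work differently, and toy_encode, semantic_chunks and the 0.5 threshold are assumptions made for this example.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    sentences: List[str],
    encode: Callable[[str], Sequence[float]],
    threshold: float = 0.5,
) -> List[str]:
    """Group consecutive sentences, splitting where similarity drops below threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    previous = encode(sentences[0])
    for sentence in sentences[1:]:
        current = encode(sentence)
        if cosine(previous, current) < threshold:
            groups.append([sentence])  # topic changed: start a new chunk
        else:
            groups[-1].append(sentence)
        previous = current
    return [" ".join(group) for group in groups]


def toy_encode(sentence: str) -> List[float]:
    """Hypothetical encoder: counts two keywords instead of running a model."""
    lowered = sentence.lower()
    return [float(lowered.count("cat")), float(lowered.count("stock"))]


sentences = ["Cats purr.", "A cat naps.", "Stocks fell.", "The stock market closed."]
print(semantic_chunks(sentences, toy_encode))
```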

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

For GPU support and special models such as ColPali:

pip install embed-anything-gpu

Usage

➡️ Usage for version 0.3 and later

Local embedding: we support Bert and Jina

import embed_anything
from embed_anything import EmbeddingModel, WhichModel

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

Multimodal embedding: we support CLIP

Requirements: a directory containing the pictures you want to search. For example, we use test_files, which holds images of cats, dogs, and so on.

import numpy as np
from PIL import Image

import embed_anything
from embed_anything import EmbedData

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
# Embed every image in the directory, then embed the text query.
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([item.embedding for item in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
# Score each image against the query and open the best match.
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
Image.open(data[max_index].text).show()
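
A plain dot product ranks by cosine similarity only when the vectors have unit length, and whether the embeddings returned here are normalized is not stated above. The sketch below normalizes explicitly and returns the top k matches instead of a single best one. It uses only the standard library, and normalize and top_k are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) pairs for the k best matches."""
    unit_query = normalize(query)
    scored = [
        (index, sum(a * b for a, b in zip(unit_query, normalize(vector))))
        for index, vector in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D vectors; the second one points in the same direction as the query.
matches = top_k([0.0, 1.0], [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], k=2)
print(matches)
```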

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you to contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

🏎️ RoadMap

    Accomplishments

    One of the aims of EmbedAnything is to let AI engineers easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished; these are the formats we support right now, with a few more still to come.

    🖼️ Modalities and Source

    We’re excited to share that we've expanded our platform to support multiple modalities, including:

    • Audio files

    • Markdowns

    • Websites

    • Images

    • Videos

    • Graph

    This gives you the flexibility to work with various data types all in one place! 🌐

    💜 Product

    We’ve rolled out some major updates in version 0.3 to improve both functionality and performance. Here’s what’s new:

    • Semantic Chunking: Optimized chunking strategy for better Retrieval-Augmented Generation (RAG) workflows.

    • Streaming for Efficient Indexing: We’ve introduced streaming for memory-efficient indexing in vector databases. Want to know more? Check out our article on this feature here: https://www.analyticsvidhya.com/blog/2024/09/vector-streaming/

    • Zero-Shot Applications: Explore our zero-shot application demos to see the power of these updates in action.

    • Intuitive Functions: Version 0.3 includes a complete refactor for more intuitive functions, making the platform easier to use.

    • Chunkwise Streaming: Instead of file-by-file streaming, we now support chunkwise streaming, allowing for more flexible and efficient data processing.

    Check out the latest release and see how these features can supercharge your generative AI pipeline! ✨

    🚀Coming Soon

    ⚙️ Performance

    We've received quite a few questions about why we're using Candle, so here's a quick explanation:

    One of the main reasons is that Candle doesn't require any specific ONNX format models, which means it can work seamlessly with any Hugging Face model. This flexibility has been a key factor for us. However, we also recognize that we’ve been compromising a bit on speed in favor of that flexibility.

    What’s next? To address this, we’re introducing Candle-ONNX alongside our existing Hugging Face framework.

    ➡️ Support for GGUF models
    ➡️ Significantly faster performance

    Stay tuned for these exciting updates! 🚀

    🫐Embeddings:

    Our infrastructure has been multimodal from day one. We already support websites, images and audio, and we want to expand further to:

    ☑️ Graph embedding: build DeepWalk embeddings, depth first, with word2vec
    ☑️ Video embedding
    ☑️ Yolo Clip

    🌊Expansion to other Vector Adapters

    We currently support a wide range of vector databases for streaming embeddings, including:

    • Elastic: thanks to the amazing and active Elastic team for the contribution
    • Weaviate
    • Pinecone
    • Qdrant

    But we're not stopping there! We're actively working to expand this list.

    Want to contribute? If you’d like to add support for your favorite vector database, we’d love to have your help! Check out our contribution.md for guidelines, or feel free to reach out directly at starlight-search@proton.me. Let's build something amazing together! 💡
