
Embed anything at lightning speed


Generate and stream embeddings with a minimalist, lightning-fast framework built in Rust 🦀
Explore the docs »

View Demo · Examples · Vector Streaming Adapters · Search in Audio Space

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multisource, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies generating embeddings from various sources and streaming them to a vector database. We support dense, sparse, and late-interaction embeddings.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina.
  • ColPali: Support for ColPali in the GPU version.
  • Splade: Support for sparse embeddings for hybrid search.
  • Cloud Embedding Models: Supports OpenAI and Cohere.
  • Multimodality: Works with text sources (PDF, TXT, MD), images (JPG), and audio (WAV).
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is taken care of as well, with Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Vector Streaming: Continuously create and stream embeddings when resources are limited.

💡 What is Vector Streaming?

Vector Streaming lets you process files and generate embeddings chunk by chunk, optionally segmenting the chunks semantically, and stream them to the vector database of your choice. If you have 10 GB of files, embeddings are produced and shipped continuously instead of being held in RAM all at once.
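As a rough mental model of the streaming idea (plain Python, not the library's actual API; `chunk_text`, `stream_embeddings`, and the toy `embed`/`sink` are all hypothetical names for illustration), only one chunk and one vector live in memory at a time:

```python
from typing import Iterator


def chunk_text(text: str, chunk_size: int) -> Iterator[str]:
    """Yield fixed-size character chunks lazily instead of materializing them all."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def stream_embeddings(text: str, chunk_size: int, embed, sink) -> int:
    """Embed chunk by chunk and hand each vector to the sink (e.g. a vector DB writer)."""
    count = 0
    for chunk in chunk_text(text, chunk_size):
        sink(embed(chunk))  # write immediately instead of buffering everything
        count += 1
    return count


# Toy embed function and in-memory sink, just to show the flow.
stored = []
n = stream_embeddings("a" * 1000, chunk_size=200,
                      embed=lambda c: [float(len(c))],
                      sink=stored.append)
```

The key point is that the sink is called per chunk, so peak memory is bounded by one chunk's embeddings rather than the whole corpus.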


🦀 Why Embed Anything

➡️ Faster execution.
➡️ Memory management: Rust enforces memory safety at compile time, preventing the memory leaks and crashes that can plague other languages.
➡️ True multithreading.
➡️ Runs language and embedding models locally and efficiently.
➡️ Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️ Reduced memory usage.

⭐ Supported Models

We support a range of models that Candle can run. Below is a set of tested models, but if you have a specific use case, please mention it in an issue.

How to add a custom model, chunk size, and semantic chunking

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)
| Model | Custom link |
| --- | --- |
| Jina | jinaai/jina-embeddings-v2-base-en, jinaai/jina-embeddings-v2-small-en |
| Bert | sentence-transformers/all-MiniLM-L6-v2, sentence-transformers/all-MiniLM-L12-v2, sentence-transformers/paraphrase-MiniLM-L6-v2 |
| Clip | openai/clip-vit-base-patch32 |
| Whisper | Most OpenAI Whisper models from Hugging Face are supported. |

Splade Models:


model = EmbeddingModel.from_pretrained_hf(
    WhichModel.SparseBert, "prithivida/Splade_PP_en_v1"
)

ColPali models (only run with embed-anything-gpu)

model: ColpaliModel = ColpaliModel.from_pretrained("vidore/colpali-v1.2-merged", None)

For Semantic Chunking

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)

# with semantic encoder
semantic_encoder = EmbeddingModel.from_pretrained_hf(WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en")
config = TextEmbedConfig(chunk_size=256, batch_size=32, splitting_strategy="semantic", semantic_encoder=semantic_encoder)
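To build intuition for what semantic splitting does, here is a toy illustration of the underlying idea in plain Python (not the library's actual algorithm; `semantic_chunks`, the threshold value, and the hand-made 2-D "embeddings" are all illustrative assumptions): start a new chunk whenever consecutive sentences drift apart semantically.

```python
def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)


def semantic_chunks(sentences, embed, threshold=0.5):
    """Group consecutive sentences; open a new chunk on a semantic break."""
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for s in sentences[1:]:
        vec = embed(s)
        if cosine(prev, vec) < threshold:
            chunks.append([s])       # similarity dropped: new chunk
        else:
            chunks[-1].append(s)     # still on-topic: extend current chunk
        prev = vec
    return [" ".join(c) for c in chunks]


# Toy 2-D "embeddings": the first two sentences point one way, the third another.
toy = {"Cats purr.": [1.0, 0.1], "Kittens nap.": [0.9, 0.2], "GDP rose.": [0.0, 1.0]}
chunks = semantic_chunks(list(toy), embed=toy.get)
```

With these toy vectors, the two cat sentences stay in one chunk and the economics sentence starts a new one.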

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

For GPU support and special models like ColPali:

pip install embed-anything-gpu

Usage

➡️ Usage for versions 0.3 and later

To use a local embedding model (we support Bert and Jina):

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

For multimodal embedding (we support CLIP):

Requirements: a directory with the pictures you want to search, for example test_files with images of cats, dogs, etc.

import numpy as np
from PIL import Image

import embed_anything
from embed_anything import EmbedData

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([d.embedding for d in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
Image.open(data[max_index].text).show()
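Note that the dot product above ranks by cosine similarity only when the embeddings are unit-normalized. If you are unsure whether a given model returns normalized vectors, normalize first; this sketch is pure NumPy and independent of embed_anything (the `normalize` helper and the stand-in vectors are illustrative):

```python
import numpy as np


def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis so dot products equal cosine similarities."""
    norms = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.clip(norms, 1e-12, None)  # clip guards against zero vectors


embeddings = np.array([[3.0, 4.0], [1.0, 0.0]])  # stand-ins for image embeddings
query = np.array([0.0, 2.0])                     # stand-in for a query embedding

similarities = normalize(embeddings) @ normalize(query)
best = int(np.argmax(similarities))
```

Without normalization, longer vectors would dominate the ranking regardless of direction.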

Audio Embedding using Whisper

Requirements: audio .wav files.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you to contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

    🏎️ RoadMap

    One of the aims of EmbedAnything is to let AI engineers easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished; these are the formats we support right now, with a few more still to come.

    🖼️ Modalities and Source

    We’re excited to share that we've expanded our platform to support multiple modalities, including:

    • Audio files
    • Markdowns
    • Websites
    • Images
    • Custom model uploads

    This gives you the flexibility to work with various data types all in one place! 🌐

    💜 Product

    We’ve rolled out some major updates in version 0.3 to improve both functionality and performance. Here’s what’s new:

    • Semantic Chunking: Optimized chunking strategy for better Retrieval-Augmented Generation (RAG) workflows.

    • Streaming for Efficient Indexing: We’ve introduced streaming for memory-efficient indexing in vector databases. Want to know more? Check out our article on this feature here: https://www.analyticsvidhya.com/blog/2024/09/vector-streaming/

    • Zero-Shot Applications: Explore our zero-shot application demos to see the power of these updates in action.

    • Intuitive Functions: Version 0.3 includes a complete refactor for more intuitive functions, making the platform easier to use.

    • Chunkwise Streaming: Instead of file-by-file streaming, we now support chunkwise streaming, allowing for more flexible and efficient data processing.

    Check out the latest release and see how these features can supercharge your generative-AI pipeline! ✨

    🚀Where are we heading

    ⚙️ Performance

    We've received quite a few questions about why we're using Candle, so here's a quick explanation:

    One of the main reasons is that Candle doesn't require any specific ONNX format models, which means it can work seamlessly with any Hugging Face model. This flexibility has been a key factor for us. However, we also recognize that we’ve been compromising a bit on speed in favor of that flexibility.

    What’s next? To address this, we’re excited to announce that we’re introducing ONNX Runtime (ORT) support alongside our existing Hugging Face/Candle backend, bringing:

    ➡️ Significantly faster performance

    Stay tuned for these exciting updates! 🚀

    🫐 Embeddings:

    Our infrastructure has been multimodal from day one. We already support websites, images, and audio, but we want to expand further to:

    ☑️ Graph embeddings -- building DeepWalk embeddings depth-first with word2vec
    ☑️ Video embeddings
    ☑️ YOLO + CLIP

    🌊Expansion to other Vector Adapters

    We currently support a wide range of vector databases for streaming embeddings, including:

    • Elastic: thanks to the amazing and active Elastic team for the contribution
    • Weaviate
    • Pinecone

    But we're not stopping there! We're actively working to expand this list.

    Want to contribute? If you’d like to add support for your favorite vector database, we’d love to have your help! Check out our contribution.md for guidelines, or feel free to reach out directly at starlight-search@proton.me. Let's build something amazing together! 💡

