
Embed anything at lightning speed

Project description


Generate and stream embeddings with a minimalist and lightning-fast framework built in Rust 🦀
Explore the docs »

View Demo · Examples · Vector Streaming Adapters · Search in Audio Space

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multisource, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies the process of generating embeddings from various sources and streaming them to a vector database. We support dense, sparse, and late-interaction embeddings.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add a custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina
  • ColPali: Support for ColPali in the GPU version
  • Splade: Support for sparse embeddings for hybrid retrieval
  • Cloud Embedding Models: Supports OpenAI and Cohere
  • Multimodality: Works with text (PDF, TXT, MD), images (JPG), and audio (WAV)
  • Rust: All file processing is done in Rust for speed and efficiency
  • Candle: Hardware acceleration is taken care of with Candle
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects
  • Vector Streaming: Continuously create and stream embeddings even on low-resource machines

💡 What is Vector Streaming?

Vector Streaming lets you process files and generate their embeddings as a stream: even with 10 GB of files, embeddings are produced chunk by chunk (with optional semantic segmentation) and stored in the vector database of your choice, so the full set of embeddings never has to sit in RAM at once.
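
As a rough sketch of what this looks like in code (hedged: the adapter object and the adapter keyword below are assumptions based on the Vector Streaming Adapters examples linked above, and the file path is a placeholder; check those examples for the exact wiring):

import embed_anything
from embed_anything import EmbeddingModel, WhichModel, TextEmbedConfig

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L6-v2"
)
config = TextEmbedConfig(chunk_size=256, batch_size=32)

# Hypothetical adapter: in practice you construct one of the provided
# vector-database adapters (e.g. Weaviate, Pinecone, Elastic) from the examples.
adapter = ...

# With an adapter attached, embeddings are pushed to the database chunk by
# chunk as they are generated instead of being accumulated in RAM.
embed_anything.embed_file(
    "large_corpus.pdf", embeder=model, config=config, adapter=adapter
)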

[Demo: EmbedAnything × Weaviate]

🦀 Why Embed Anything

➡️ Faster execution
➡️ Memory safety: Rust's ownership model prevents the memory leaks and crashes that can plague other languages
➡️ True multithreading
➡️ Runs language models and embedding models locally and efficiently
➡️ Candle allows inference on CUDA-enabled GPUs right out of the box
➡️ Lower memory usage

⭐ Supported Models

We support any model that Candle can run. A set of tested models is listed below; if you have a specific use case, please mention it in an issue.

How to add a custom model, chunk size, and semantic chunking

import embed_anything
from embed_anything import EmbeddingModel, WhichModel, TextEmbedConfig

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)
Model   | Custom link (Hugging Face model ID)
Jina    | jinaai/jina-embeddings-v2-base-en, jinaai/jina-embeddings-v2-small-en
Bert    | sentence-transformers/all-MiniLM-L6-v2, sentence-transformers/all-MiniLM-L12-v2, sentence-transformers/paraphrase-MiniLM-L6-v2
Clip    | openai/clip-vit-base-patch32
Whisper | Most OpenAI Whisper models on Hugging Face are supported.

Splade Models:

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.SparseBert, "prithivida/Splade_PP_en_v1"
)
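
For example, you can query with the sparse model the same way as with dense models (a minimal sketch; it assumes embed_query works with sparse models as well, and the sample text is arbitrary):

data = embed_anything.embed_query(["what is vector streaming?"], embeder=model)
print(data[0].embedding)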

ColPali models (only run with embed-anything-gpu):

model: ColpaliModel = ColpaliModel.from_pretrained("vidore/colpali-v1.2-merged", None)
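
A usage sketch (assuming ColpaliModel exposes an embed_file method that takes a PDF path and a batch size; the path is a placeholder, so check the docs for the exact signature):

# Embeds each page of the PDF as an image with late-interaction embeddings
data = model.embed_file("test_files/test.pdf", batch_size=1)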

For semantic chunking:

from embed_anything import EmbeddingModel, WhichModel, TextEmbedConfig

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)
# A second model acts as the semantic encoder that decides chunk boundaries
semantic_encoder = EmbeddingModel.from_pretrained_hf(
    WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en"
)
config = TextEmbedConfig(
    chunk_size=256, batch_size=32,
    splitting_strategy="semantic", semantic_encoder=semantic_encoder,
)

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

For GPUs and special models like ColPali:

pip install embed-anything-gpu

Usage

➡️ Usage for version 0.3 and later

To use local embeddings (we support Bert and Jina):

import embed_anything
from embed_anything import EmbeddingModel, WhichModel

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)
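
Each returned item is an EmbedData, so you can inspect the chunk text, its embedding, and the metadata directly:

# Peek at the first chunk
print(data[0].text)
print(len(data[0].embedding))
print(data[0].metadata)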

For multimodal embedding, we support CLIP.

Requirements: a directory with the images you want to search over; for example, test_files contains images of cats, dogs, etc.

import numpy as np
from PIL import Image

import embed_anything
from embed_anything import EmbedData

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([d.embedding for d in data])

query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)

# Rank images by similarity to the query and show the best match
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
Image.open(data[max_index].text).show()  # .text holds the image file path
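
If the embeddings are not unit-normalized, you may want to L2-normalize them before the dot product so the scores are true cosine similarities (an optional tweak to the snippet above):

# Optional: normalize so np.dot yields cosine similarity
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
query_embedding = query_embedding / np.linalg.norm(query_embedding)
similarities = np.dot(embeddings, query_embedding)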

Audio Embedding using Whisper

Requirements: audio .wav files.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you contribute effectively. These are meant to serve as guidelines, not strict rules; we encourage you to use your best judgment and to propose changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

    🏎️ RoadMap

    One of the aims of EmbedAnything is to allow AI engineers to easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished, and these are the formats we support right now; a few more are still to come.

    🖼️ Modalities and Source

    We’re excited to share that we've expanded our platform to support multiple modalities, including:

    • Audio files
    • Markdowns
    • Websites
    • Images
    • Custom model uploads

    This gives you the flexibility to work with various data types all in one place! 🌐

    💜 Product

    We’ve rolled out some major updates in version 0.3 to improve both functionality and performance. Here’s what’s new:

    • Semantic Chunking: Optimized chunking strategy for better Retrieval-Augmented Generation (RAG) workflows.

    • Streaming for Efficient Indexing: We’ve introduced streaming for memory-efficient indexing in vector databases. Want to know more? Check out our article on this feature here: https://www.analyticsvidhya.com/blog/2024/09/vector-streaming/

    • Zero-Shot Applications: Explore our zero-shot application demos to see the power of these updates in action.

    • Intuitive Functions: Version 0.3 includes a complete refactor for more intuitive functions, making the platform easier to use.

    • Chunkwise Streaming: Instead of file-by-file streaming, we now support chunkwise streaming, allowing for more flexible and efficient data processing.

    Check out the latest release and see how these features can supercharge your GenerativeAI pipeline! ✨

    🚀 Where are we heading

    ⚙️ Performance

    We've received quite a few questions about why we're using Candle, so here's a quick explanation:

    One of the main reasons is that Candle doesn't require any specific ONNX format models, which means it can work seamlessly with any Hugging Face model. This flexibility has been a key factor for us. However, we also recognize that we’ve been compromising a bit on speed in favor of that flexibility.

    What’s next? To address this, we’re excited to announce that we’re introducing ONNX Runtime (ORT) support alongside our existing Hugging Face + Candle backend, bringing:

    ➡️ Significantly faster performance

    Stay tuned for these exciting updates! 🚀

    🫐 Embeddings

    Our infrastructure has been multimodal from day one. We already support websites, images, and audio, and we want to expand further to:

    ☑️ Graph embeddings -- build DeepWalk embeddings (depth-first) and word2vec
    ☑️ Video embeddings
    ☑️ YOLO + CLIP

    🌊 Expansion to other Vector Adapters

    We currently support a wide range of vector databases for streaming embeddings, including:

    • Elastic: thanks to the amazing and active Elastic team for the contribution
    • Weaviate
    • Pinecone

    But we're not stopping there! We're actively working to expand this list.

    Want to contribute? If you’d like to add support for your favorite vector database, we’d love to have your help! Check out our contribution.md for guidelines, or feel free to reach out directly at starlight-search@proton.me. Let’s build something amazing together! 💡

    Project details


    Download files

    Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

    Source Distribution

    embed_anything-0.4.11.tar.gz | 934.2 kB | Source | uploaded with maturin/1.7.4 via Trusted Publishing

    Built Distributions

    File | Size | Python | Platform
    embed_anything-0.4.11-cp312-none-win_amd64.whl | 13.7 MB | CPython 3.12 | Windows x86-64
    embed_anything-0.4.11-cp312-cp312-manylinux_2_34_x86_64.whl | 17.8 MB | CPython 3.12 | manylinux: glibc 2.34+ x86-64
    embed_anything-0.4.11-cp312-cp312-macosx_11_0_arm64.whl | 10.1 MB | CPython 3.12 | macOS 11.0+ ARM64
    embed_anything-0.4.11-cp312-cp312-macosx_10_12_x86_64.whl | 10.6 MB | CPython 3.12 | macOS 10.12+ x86-64
    embed_anything-0.4.11-cp311-none-win_amd64.whl | 13.7 MB | CPython 3.11 | Windows x86-64
    embed_anything-0.4.11-cp311-cp311-manylinux_2_34_x86_64.whl | 17.6 MB | CPython 3.11 | manylinux: glibc 2.34+ x86-64
    embed_anything-0.4.11-cp311-cp311-macosx_11_0_arm64.whl | 10.1 MB | CPython 3.11 | macOS 11.0+ ARM64
    embed_anything-0.4.11-cp311-cp311-macosx_10_12_x86_64.whl | 10.4 MB | CPython 3.11 | macOS 10.12+ x86-64
    embed_anything-0.4.11-cp310-none-win_amd64.whl | 13.7 MB | CPython 3.10 | Windows x86-64
    embed_anything-0.4.11-cp310-cp310-manylinux_2_34_x86_64.whl | 17.6 MB | CPython 3.10 | manylinux: glibc 2.34+ x86-64
    embed_anything-0.4.11-cp310-cp310-macosx_11_0_arm64.whl | 10.1 MB | CPython 3.10 | macOS 11.0+ ARM64
    embed_anything-0.4.11-cp39-none-win_amd64.whl | 13.7 MB | CPython 3.9 | Windows x86-64
    embed_anything-0.4.11-cp39-cp39-manylinux_2_34_x86_64.whl | 17.6 MB | CPython 3.9 | manylinux: glibc 2.34+ x86-64
    embed_anything-0.4.11-cp39-cp39-macosx_11_0_arm64.whl | 10.1 MB | CPython 3.9 | macOS 11.0+ ARM64
    embed_anything-0.4.11-cp38-none-win_amd64.whl | 13.7 MB | CPython 3.8 | Windows x86-64

