
Embed anything at lightning speed


Generate and stream embeddings with a minimalist and lightning-fast framework built in Rust 🦀
Explore the docs »

View Demo · Examples · Vector Streaming Adapters · Search in Audio Space

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multi-source, multimodal, local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies the process of generating embeddings from various sources and streaming them to a vector database. We support dense, sparse, and late-interaction embeddings.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina.
  • ColPali: Supported in the GPU version.
  • Splade: Sparse embeddings for hybrid retrieval.
  • Cloud Embedding Models: Supports OpenAI and Cohere.
  • Multimodality: Works with text sources (PDF, TXT, MD), images (JPG), and audio (WAV).
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is taken care of with Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Vector Streaming: Continuously create and stream embeddings when resources are limited.

💡 What is Vector Streaming?

Vector Streaming lets you process a file and generate its embeddings chunk by chunk, streaming each chunk (segmented semantically if you choose) to the vector database of your choice. Even a 10 GB file can be embedded continuously this way, eliminating the need to hold all embeddings in RAM at once.
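The idea can be sketched in plain Python, independent of any library. Here `embed` and `sink` are placeholders, not EmbedAnything APIs: `embed` stands in for an embedding model and `sink` for an upsert into your vector database. Only one chunk's embedding lives in memory at a time.

```python
from typing import Callable, Iterator, List


def chunk_text(text: str, chunk_size: int) -> Iterator[str]:
    """Split text into fixed-size chunks (a stand-in for semantic segmentation)."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def stream_embeddings(
    text: str,
    chunk_size: int,
    embed: Callable[[str], List[float]],
    sink: Callable[[List[float]], None],
) -> int:
    """Embed chunk by chunk, handing each vector to `sink` immediately
    instead of accumulating all embeddings in memory."""
    count = 0
    for chunk in chunk_text(text, chunk_size):
        sink(embed(chunk))  # e.g. an upsert into Weaviate, Qdrant, etc.
        count += 1
    return count
```

With a real model and adapter plugged in for `embed` and `sink`, this is the chunkwise streaming pattern described above.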


🦀 Why Embed Anything

➡️ Faster execution.
➡️ Memory management: Rust enforces memory safety, preventing the memory leaks and crashes that can plague other languages.
➡️ True multithreading.
➡️ Runs language and embedding models locally and efficiently.
➡️ Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️ Lower memory usage.

⭐ Supported Models

We support any model that runs on Candle. A set of tested models is listed below; if you have a specific use case, please mention it in an issue.

How to add a custom model, chunk size, and semantic chunking

import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)
| Model   | Custom link |
|---------|-------------|
| Jina    | jinaai/jina-embeddings-v2-base-en, jinaai/jina-embeddings-v2-small-en |
| Bert    | sentence-transformers/all-MiniLM-L6-v2, sentence-transformers/all-MiniLM-L12-v2, sentence-transformers/paraphrase-MiniLM-L6-v2 |
| Clip    | openai/clip-vit-base-patch32 |
| Whisper | Most OpenAI Whisper models on Hugging Face are supported. |

Splade Models:


model = EmbeddingModel.from_pretrained_hf(
    WhichModel.SparseBert, "prithivida/Splade_PP_en_v1"
)

ColPali Models: only runs with embed-anything-gpu

model: ColpaliModel = ColpaliModel.from_pretrained("vidore/colpali-v1.2-merged", None)

For Semantic Chunking

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)

# with semantic encoder
semantic_encoder = EmbeddingModel.from_pretrained_hf(
    WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en"
)
config = TextEmbedConfig(
    chunk_size=256,
    batch_size=32,
    splitting_strategy="semantic",
    semantic_encoder=semantic_encoder,
)

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

For GPUs and special models like ColPali:

pip install embed-anything-gpu

Usage

➡️ Usage for version 0.3 and later

For local embedding, we support Bert and Jina:

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

For multimodal embedding, we support CLIP.

Requirements: a directory with the images you want to search over. For example, test_files contains images of cats, dogs, etc.

import embed_anything
import numpy as np
from PIL import Image
from embed_anything import EmbedData

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([d.embedding for d in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
# for images, EmbedData.text holds the source file path
Image.open(data[max_index].text).show()
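A note on the retrieval step above: `np.dot` on raw embeddings ranks by dot product. If your model's embeddings are not already L2-normalized, normalize them first so the same dot product ranks by cosine similarity. A small sketch (the function name is ours, not an EmbedAnything API):

```python
import numpy as np


def cosine_similarities(embeddings: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine similarity of each row of `embeddings` against `query`.

    The dot product of L2-normalized vectors equals cosine similarity.
    """
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    return embeddings @ query
```

With this, `np.argmax(cosine_similarities(embeddings, query_embedding))` picks the best match regardless of each embedding's magnitude.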

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

    🏎️ RoadMap

    Accomplishments

    One of the aims of EmbedAnything is to let AI engineers easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished here; these are the formats we support right now, and a few more are still to come.

    🖼️ Modalities and Source

    We’re excited to share that we've expanded our platform to support multiple modalities, including:

    • Audio files

    • Markdowns

    • Websites

    • Images

    • Videos

    • Graph

    This gives you the flexibility to work with various data types all in one place! 🌐

    💜 Product

    We’ve rolled out some major updates in version 0.3 to improve both functionality and performance. Here’s what’s new:

    • Semantic Chunking: Optimized chunking strategy for better Retrieval-Augmented Generation (RAG) workflows.

    • Streaming for Efficient Indexing: We’ve introduced streaming for memory-efficient indexing in vector databases. Want to know more? Check out our article on this feature here: https://www.analyticsvidhya.com/blog/2024/09/vector-streaming/

    • Zero-Shot Applications: Explore our zero-shot application demos to see the power of these updates in action.

    • Intuitive Functions: Version 0.3 includes a complete refactor for more intuitive functions, making the platform easier to use.

    • Chunkwise Streaming: Instead of file-by-file streaming, we now support chunkwise streaming, allowing for more flexible and efficient data processing.

    Check out the latest release and see how these features can supercharge your Generative AI pipeline! ✨

    🚀Coming Soon

    ⚙️ Performance

    We've received quite a few questions about why we're using Candle, so here's a quick explanation:

    One of the main reasons is that Candle doesn't require any specific ONNX format models, which means it can work seamlessly with any Hugging Face model. This flexibility has been a key factor for us. However, we also recognize that we’ve been compromising a bit on speed in favor of that flexibility.

    What’s Next? To address this, we’re excited to announce that we’re introducing Candle-ONNX alongside our existing Hugging Face framework, bringing:

    ➡️ Support for GGUF models
    ➡️ Significantly faster performance

    Stay tuned for these exciting updates! 🚀

    🫐Embeddings:

    We've had multimodality in our infrastructure from day one. We already support websites, images, and audio, but we want to expand further to:

    ☑️ Graph embedding: build DeepWalk embeddings (depth-first random walks) with word2vec
    ☑️ Video embedding
    ☑️ YOLO + CLIP
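The graph-embedding roadmap item above follows the DeepWalk recipe: sample truncated random walks over the graph, then feed the walks to a word2vec-style model as if they were sentences. Here is a minimal, library-free sketch of the walk-sampling half (not EmbedAnything code; the function name is ours):

```python
import random
from typing import Dict, List


def random_walks(
    graph: Dict[str, List[str]],
    walks_per_node: int,
    walk_len: int,
    seed: int = 0,
) -> List[List[str]]:
    """Generate truncated random walks over an adjacency-list graph.

    Each walk is a 'sentence' of node ids that a word2vec implementation
    (e.g. gensim's Word2Vec) could consume to learn node embeddings.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in graph:
            walk = [start]
            while len(walk) < walk_len:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break  # dead end: truncate the walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

Feeding the resulting walks to a skip-gram model yields one embedding per node, which is the DeepWalk idea in a nutshell.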

    🌊Expansion to other Vector Adapters

    We currently support a wide range of vector databases for streaming embeddings, including:

    • Elastic: thanks to the amazing and active Elastic team for the contribution
    • Weaviate
    • Pinecone
    • Qdrant

    But we're not stopping there! We're actively working to expand this list.

    Want to Contribute? If you’d like to add support for your favorite vector database, we’d love to have your help! Check out our contribution.md for guidelines, or feel free to reach out directly at starlight-search@proton.me. Let's build something amazing together! 💡
