Embed anything at lightning speed

Generate and stream embeddings with a minimalist, lightning-fast framework built in Rust 🦀

EmbedAnything is a minimalist, lightweight, high-performance embedding pipeline built in Rust. It is multisource, multimodal, and runs locally. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies generating embeddings from various sources and streaming them to a vector database. We support dense, sparse, and late-interaction embeddings.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina
  • ColPali: Supported in the GPU version
  • Splade: Sparse embeddings for hybrid search
  • Cloud Embedding Models: Supports OpenAI and Cohere
  • MultiModality: Works with text sources such as PDF, .txt, and .md files, JPG images, and .wav audio
  • Rust: All file processing is done in Rust for speed and efficiency
  • Candle: Hardware acceleration is handled by Candle
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects
  • Vector Streaming: Continuously create and stream embeddings when resources are limited

💡What is Vector Streaming

Vector Streaming lets you generate embeddings for files and stream them as they are produced. If you have a 10 GB file, for example, it generates embeddings chunk by chunk (chunks can be segmented semantically) and stores them in the vector database of your choice. This avoids holding all the embeddings in RAM at once.
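
The idea can be sketched in plain Python, independent of EmbedAnything's actual API. In this minimal sketch, `chunk_words`, `fake_embed`, `stream_embeddings`, and the list used as a sink are all hypothetical stand-ins: a real pipeline would call an embedding model and a vector database client instead. The point is only that one batch of chunks is held in memory at a time.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Vector = List[float]


def chunk_words(lines: Iterable[str], chunk_size: int) -> Iterator[str]:
    """Yield chunks of at most `chunk_size` words without loading the whole input."""
    buffer: List[str] = []
    for line in lines:
        for word in line.split():
            buffer.append(word)
            if len(buffer) == chunk_size:
                yield " ".join(buffer)
                buffer = []
    if buffer:
        yield " ".join(buffer)


def fake_embed(chunks: List[str]) -> List[Vector]:
    """Hypothetical stand-in for a real model: two toy features per chunk."""
    return [[float(len(c)), float(len(c.split()))] for c in chunks]


def stream_embeddings(
    lines: Iterable[str],
    sink: Callable[[List[Tuple[str, Vector]]], None],
    chunk_size: int = 4,
    batch_size: int = 2,
) -> int:
    """Embed `batch_size` chunks at a time and hand each batch to `sink`.

    Only one batch is buffered at any moment. Returns the number of chunks sent.
    """
    batch: List[str] = []
    sent = 0
    for chunk in chunk_words(lines, chunk_size):
        batch.append(chunk)
        if len(batch) == batch_size:
            sink(list(zip(batch, fake_embed(batch))))
            sent += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        sink(list(zip(batch, fake_embed(batch))))
        sent += len(batch)
    return sent


# Usage: a plain list stands in for a vector database client.
store: List[Tuple[str, Vector]] = []
total = stream_embeddings(
    ["one two three four five", "six seven eight nine"], store.extend
)
```

Because `lines` can be any iterable, passing an open file object would process a large file line by line rather than reading it all at once.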

🦀 Why Embed Anything

➡️ Faster execution.
➡️ Memory management: Rust enforces memory safety at compile time, preventing many of the memory bugs and crashes that can plague other languages.
➡️ True multithreading.
➡️ Runs language models and embedding models locally and efficiently.
➡️ Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️ Low memory footprint.

⭐ Supported Models

We support a range of models that Candle can run. A set of tested models is listed below; if you have a specific use case, please mention it in an issue.

How to add a custom model, chunk size, and semantic chunking

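import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

# Load any supported architecture by its Hugging Face model ID
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)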

| Model | Custom link |
| --- | --- |
| Jina | jinaai/jina-embeddings-v2-base-en, jinaai/jina-embeddings-v2-small-en |
| Bert | sentence-transformers/all-MiniLM-L6-v2, sentence-transformers/all-MiniLM-L12-v2, sentence-transformers/paraphrase-MiniLM-L6-v2 |
| Clip | openai/clip-vit-base-patch32 |
| Whisper | Most OpenAI Whisper models from Hugging Face are supported. |

Splade Models:


model = EmbeddingModel.from_pretrained_hf(
    WhichModel.SparseBert, "prithivida/Splade_PP_en_v1"
)
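
Sparse embeddings such as SPLADE assign weights to vocabulary terms, and hybrid search combines that sparse score with a dense one. The sketch below is a generic illustration, not EmbedAnything's implementation: the dictionary representation, the helper names, and the `alpha` weighting are all assumptions made for the example.

```python
import math
from typing import Dict, List

SparseVector = Dict[int, float]  # token id -> weight; absent ids count as zero


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product over the token ids that both sparse vectors contain."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[t] for t, w in a.items() if t in b)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def hybrid_score(
    dense_q: List[float],
    dense_d: List[float],
    sparse_q: SparseVector,
    sparse_d: SparseVector,
    alpha: float = 0.5,
) -> float:
    """Weighted sum of dense and sparse scores; `alpha` is an illustrative knob."""
    return alpha * cosine(dense_q, dense_d) + (1 - alpha) * sparse_dot(sparse_q, sparse_d)


# Toy vectors, not real model output.
query_sparse = {101: 1.0, 205: 0.5}
doc_sparse = {205: 2.0, 999: 0.3}
score = hybrid_score([1.0, 0.0], [1.0, 0.0], query_sparse, doc_sparse, alpha=0.5)
```

In practice the two scores live on different scales, so they are usually normalized (or fused by rank) before being combined.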

ColPali models: these run only with embed-anything-gpu

model: ColpaliModel = ColpaliModel.from_pretrained("vidore/colpali-v1.2-merged", None)
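
ColPali is a late-interaction model: a query and a page are each represented by many vectors, and relevance is scored by matching every query vector to its best page vector (the MaxSim operation introduced by ColBERT). The following is a minimal pure-Python sketch of that scoring step with toy data. The function names are hypothetical and nothing here calls the EmbedAnything API.

```python
from typing import List

Matrix = List[List[float]]  # one embedding (row) per query token or image patch


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query: Matrix, document: Matrix) -> float:
    """Sum, over query vectors, of the best dot product against any document vector."""
    return sum(max(dot(q, d) for d in document) for q in query)


# Toy multi-vector embeddings, not real ColPali output.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
best = max(scores, key=scores.get)
```

Here `page_a` wins because each query vector finds a close match among its patches, even though no single patch matches the whole query.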

For Semantic Chunking

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)

# with semantic encoder
semantic_encoder = EmbeddingModel.from_pretrained_hf(
    WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en"
)
config = TextEmbedConfig(
    chunk_size=256, batch_size=32,
    splitting_strategy="semantic", semantic_encoder=semantic_encoder,
)
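
To give an intuition for what a semantic splitting strategy does, here is a generic sketch: consecutive sentences stay in the same chunk while their embeddings remain similar, and a new chunk starts when similarity drops. This is one common approach, not necessarily the algorithm EmbedAnything uses. `semantic_chunks`, `toy_encoder`, and the `threshold` parameter are hypothetical names for illustration.

```python
import math
from typing import Callable, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def semantic_chunks(
    sentences: List[str],
    encode: Callable[[str], Vector],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Group consecutive sentences; open a new chunk when similarity falls below `threshold`."""
    chunks: List[List[str]] = []
    previous: Vector = []
    for sentence in sentences:
        current = encode(sentence)
        if chunks and cosine(previous, current) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
        previous = current
    return chunks


def toy_encoder(sentence: str) -> Vector:
    """Hypothetical encoder: counts of two topic keywords instead of a real model."""
    text = sentence.lower()
    return [float(text.count("cat")), float(text.count("rust"))]


chunks = semantic_chunks(
    ["Cats sleep a lot.", "A cat purrs.", "Rust is fast.", "Rust has no GC."],
    toy_encoder,
)
```

With the toy encoder, the two cat sentences form one chunk and the two Rust sentences form another. A real splitter would also cap each chunk at a maximum size, which this sketch omits.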

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

For GPU support and special models such as ColPali:

pip install embed-anything-gpu
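
If you are unsure which of the two distributions is present in an environment, the standard library can tell you. This is a small sketch using `importlib.metadata`; the helper name `installed_versions` is made up for the example, and it only assumes the two distribution names shown in the pip commands above.

```python
from importlib import metadata
from typing import Dict, Optional


def installed_versions() -> Dict[str, Optional[str]]:
    """Return the installed version of each distribution, or None if it is absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in ("embed-anything", "embed-anything-gpu"):
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = installed_versions()
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```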

Usage

➡️ Usage for version 0.3 and later

To use local embedding models (we support Bert and Jina):

import embed_anything
from embed_anything import EmbeddingModel, WhichModel

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

For multimodal embedding, we support CLIP.

Requirements: a directory with the pictures you want to search. For example, test_files contains images of cats, dogs, etc.

import numpy as np
from PIL import Image

import embed_anything
from embed_anything import EmbedData

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
# Embed every image in the directory
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([item.embedding for item in data])

# Embed the text query with the same model
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)

# Rank the images by dot-product similarity and open the best match
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
# The text field is used as the image path here
Image.open(data[max_index].text).show()
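
The example above opens only the single best match using a raw dot product. If the embeddings are not unit length, a raw dot product favours vectors with larger norms, so a common variation is to normalize first and return the top few results. Below is a minimal pure-Python sketch of that ranking step. The `top_k` helper, the toy vectors, and the file names are assumptions for illustration, not part of the EmbedAnything API.

```python
import math
from typing import List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def top_k(
    query: Vector, items: List[Tuple[str, Vector]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the `k` item names with the highest cosine similarity to `query`."""
    q = normalize(query)
    scored = [
        (name, sum(a * b for a, b in zip(q, normalize(vec))))
        for name, vec in items
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings standing in for CLIP output; the file names are made up.
images = [
    ("cat.jpg", [0.9, 0.1, 0.0]),
    ("dog.jpg", [0.1, 0.9, 0.0]),
    ("monkey.jpg", [0.0, 0.2, 5.0]),
]
results = top_k([0.0, 0.0, 1.0], images, k=2)
```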

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you to contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

    🏎️ RoadMap

    One of the aims of EmbedAnything is to allow AI engineers to easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished: the formats listed below are supported today, and a few more are still to come.

    🖼️ Modalities and Source

    We’re excited to share that we've expanded our platform to support multiple modalities, including:

    • Audio files
    • Markdown files
    • Websites
    • Images
    • Custom model uploads

    This gives you the flexibility to work with various data types all in one place! 🌐

    💜 Product

    We’ve rolled out some major updates in version 0.3 to improve both functionality and performance. Here’s what’s new:

    • Semantic Chunking: Optimized chunking strategy for better Retrieval-Augmented Generation (RAG) workflows.

    • Streaming for Efficient Indexing: We’ve introduced streaming for memory-efficient indexing in vector databases. Want to know more? Check out our article on this feature here: https://www.analyticsvidhya.com/blog/2024/09/vector-streaming/

    • Zero-Shot Applications: Explore our zero-shot application demos to see the power of these updates in action.

    • Intuitive Functions: Version 0.3 includes a complete refactor for more intuitive functions, making the platform easier to use.

    • Chunkwise Streaming: Instead of file-by-file streaming, we now support chunkwise streaming, allowing for more flexible and efficient data processing.

    Check out the latest release and see how these features can supercharge your generative AI pipeline! ✨

    🚀Where are we heading

    ⚙️ Performance

    We've received quite a few questions about why we're using Candle, so here's a quick explanation:

    One of the main reasons is that Candle doesn't require any specific ONNX format models, which means it can work seamlessly with any Hugging Face model. This flexibility has been a key factor for us. However, we also recognize that we’ve been compromising a bit on speed in favor of that flexibility.

    What’s next? To address this, we’re excited to announce that we’re introducing ONNX Runtime (ORT) support alongside our existing Candle-based Hugging Face framework, which should bring:

    ➡️ Significantly faster performance

    Stay tuned for these exciting updates! 🚀

    🫐Embeddings:

    We have had multimodality in our infrastructure from day one. We already support websites, images, and audio, but we want to expand further to:

    ☑️ Graph embedding: build DeepWalk embeddings (depth-first walks and word2vec)
    ☑️ Video embedding
    ☑️ Yolo Clip

    🌊Expansion to other Vector Adapters

    We currently support a wide range of vector databases for streaming embeddings, including:

    • Elastic: thanks to the amazing and active Elastic team for the contribution
    • Weaviate
    • Pinecone

    But we're not stopping there! We're actively working to expand this list.

    Want to contribute? If you’d like to add support for your favorite vector database, we’d love to have your help! Check out our contribution.md for guidelines, or feel free to reach out directly at starlight-search@proton.me. Let's build something amazing together! 💡
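
To make the shape of a vector adapter concrete, here is a hedged sketch of what such a component might look like. The `VectorAdapter` interface, its method names, and the in-memory backend are hypothetical and written only for illustration; the project's real adapter base class may differ, so consult contribution.md and the existing adapters before writing one.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

Record = Tuple[str, List[float]]  # (chunk text, embedding)


class VectorAdapter(ABC):
    """Hypothetical adapter interface; the real base class may differ."""

    @abstractmethod
    def create_index(self, dimension: int) -> None:
        """Prepare storage for vectors of the given dimension."""

    @abstractmethod
    def upsert(self, records: List[Record]) -> None:
        """Insert or overwrite a batch of records."""


class InMemoryAdapter(VectorAdapter):
    """Toy backend keyed by chunk text, useful for tests."""

    def __init__(self) -> None:
        self.dimension = 0
        self.rows: Dict[str, List[float]] = {}

    def create_index(self, dimension: int) -> None:
        self.dimension = dimension
        self.rows.clear()

    def upsert(self, records: List[Record]) -> None:
        for text, vector in records:
            if len(vector) != self.dimension:
                raise ValueError(
                    f"expected {self.dimension} dimensions, got {len(vector)}"
                )
            self.rows[text] = vector


adapter = InMemoryAdapter()
adapter.create_index(dimension=2)
adapter.upsert([("hello", [0.1, 0.2]), ("world", [0.3, 0.4])])
adapter.upsert([("hello", [0.5, 0.6])])  # same key: overwritten, not duplicated
```

Accepting records in batches is what lets an adapter plug into streaming: each batch of freshly embedded chunks can be written and then dropped from memory.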
