Embed anything at lightning speed


Inference, ingestion, and indexing – supercharged by Rust 🦀
Explore the docs »

View Demo · Benches · Vector Streaming Adapters · Search in Audio Space

EmbedAnything is a minimalist, lightning-fast, lightweight, multi-source, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything streamlines the process of generating embeddings from varied sources and streaming them to a vector database for memory-efficient indexing. It supports dense, sparse, and late-interaction embeddings, with both Candle and ONNX backends, offering flexibility for a wide range of use cases.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add a custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models such as BERT and Jina.
  • ONNX Models: Works with ONNX models for BERT and ColPali.
  • ColPali: Supported in the GPU version.
  • Splade: Sparse embeddings for hybrid search.
  • Cloud Embedding Models: Supports OpenAI and Cohere.
  • Multimodality: Works with text (PDF, TXT, MD), images (JPG), and audio (WAV).
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is taken care of as well, with Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Vector Streaming: Continuously create and stream embeddings when resources are limited.

💡 What is Vector Streaming?

Vector Streaming lets you generate embeddings for files and stream them out as they are produced. If you have 10 GB of files, embeddings are created chunk by chunk (with optional semantic segmentation) and stored directly in the vector database of your choice, so the full set of embeddings never has to sit in RAM at once. The embedding work runs separately from the main process, communicating over Rust's MPSC channels to maintain high performance.
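The flow described above can be sketched in plain Python. This is an illustrative producer/consumer sketch, not the EmbedAnything API: a bounded queue stands in for Rust's MPSC channel, and `chunk_text`/`embed_chunk` are toy stand-ins for the real chunker and embedder.

```python
import threading
import queue

def chunk_text(text, chunk_size):
    """Split text into fixed-size character chunks (toy chunker)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed_chunk(chunk):
    """Stand-in embedder: a toy 2-dimensional vector per chunk."""
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 97)]

def stream_embeddings(text, chunk_size, sink, max_in_flight=4):
    """Producer thread embeds chunks; the consumer writes each one to `sink`
    as it arrives, so at most `max_in_flight` embeddings are buffered."""
    q = queue.Queue(maxsize=max_in_flight)

    def producer():
        for chunk in chunk_text(text, chunk_size):
            q.put((chunk, embed_chunk(chunk)))
        q.put(None)  # sentinel: no more chunks

    t = threading.Thread(target=producer)
    t.start()
    while (item := q.get()) is not None:
        sink.append(item)  # in practice: write to the vector database
    t.join()

db = []  # stand-in for a vector database adapter
stream_embeddings("the quick brown fox jumps over the lazy dog", 10, db)
```

The bounded queue is the key design point: the producer blocks once `max_in_flight` embeddings are pending, which is what keeps memory flat regardless of corpus size.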


🦀 Why Embed Anything

➡️ Faster execution.
➡️ Memory safety: Rust's ownership model prevents the memory leaks and crashes that can plague other languages.
➡️ True multithreading.
➡️ Runs language and embedding models locally and efficiently.
➡️ Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️ Low memory footprint.

⭐ Supported Models

We support any Hugging Face model that runs on Candle, and we also support ONNX Runtime for BERT and ColPali.

How to add a custom model on Candle: from_pretrained_hf

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embedder=model, config=config)
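As an illustration of what `chunk_size` and `batch_size` control, here is a minimal sketch assuming word-based chunks; the library itself may count tokens or characters, and `make_chunks`/`make_batches` are hypothetical helpers, not part of the API.

```python
def make_chunks(words, chunk_size):
    """Group words into chunks of at most `chunk_size` words each."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

def make_batches(chunks, batch_size):
    """Group chunks into batches so the model embeds up to
    `batch_size` chunks per forward pass."""
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]

words = [f"w{i}" for i in range(1000)]   # a 1000-word document
chunks = make_chunks(words, 200)          # 5 chunks of 200 words
batches = make_batches(chunks, 32)        # all 5 chunks fit in one batch
```

Larger batches generally improve throughput at the cost of peak memory; the chunk size trades retrieval granularity against per-chunk context.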
Model     Custom link
Jina      jinaai/jina-embeddings-v2-base-en, jinaai/jina-embeddings-v2-small-en
Bert      sentence-transformers/all-MiniLM-L6-v2, sentence-transformers/all-MiniLM-L12-v2, sentence-transformers/paraphrase-MiniLM-L6-v2
Clip      openai/clip-vit-base-patch32
Whisper   Most OpenAI Whisper models on Hugging Face are supported.

Splade Models:

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.SparseBert, "prithivida/Splade_PP_en_v1"
)

ONNX-Runtime: from_pretrained_onnx

BERT

model = EmbeddingModel.from_pretrained_onnx(
  WhichModel.Bert, model_id="onnx_model_link"
)

ColPali

model: ColpaliModel = ColpaliModel.from_pretrained_onnx("starlight-ai/colpali-v1.2-merged-onnx", None)

For Semantic Chunking

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)

# with semantic encoder
semantic_encoder = EmbeddingModel.from_pretrained_hf(
    WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en"
)
config = TextEmbedConfig(
    chunk_size=256,
    batch_size=32,
    splitting_strategy="semantic",
    semantic_encoder=semantic_encoder,
)
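To give an intuition for semantic splitting, here is a rough sketch that starts a new chunk wherever consecutive sentence embeddings are least similar. The threshold rule and the `semantic_split` helper are illustrative assumptions; EmbedAnything's actual strategy may differ.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_split(sentences, embeddings, threshold=0.5):
    """Start a new chunk whenever similarity to the previous
    sentence drops below `threshold` (i.e., at a topic shift)."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks

# Toy embeddings: two sentences on one topic, then a topic shift.
sents = ["Cats purr.", "Cats meow.", "GDP rose 3%."]
embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
chunks = semantic_split(sents, embs)
```

In the real pipeline, the `semantic_encoder` model supplies the sentence embeddings used for these boundary decisions.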

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

For GPUs and special models such as ColPali:

pip install embed-anything-gpu

Usage

➡️ Usage for versions 0.3 and later

For local embedding, we support BERT and Jina:

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embedder=model)

For multimodal embedding, we support CLIP.

Requirements: a directory with the pictures you want to search. For example, test_files contains images of cats, dogs, etc.

import numpy as np
from PIL import Image

import embed_anything
from embed_anything import EmbedData

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embedder=model)
embeddings = np.array([d.embedding for d in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embedder=model)[0].embedding
)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
Image.open(data[max_index].text).show()  # the matched image's path is stored in .text
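The snippet above ranks images by raw dot product, which matches cosine similarity only if the embeddings are unit-normalized. A small, generic sketch of cosine ranking (`rank_by_cosine` is a hypothetical helper, not part of the API):

```python
import numpy as np

def rank_by_cosine(embeddings, query_embedding):
    """Return row indices sorted by cosine similarity to the query, best first."""
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    return np.argsort(embs @ q)[::-1]

embs = np.array([[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
order = rank_by_cosine(embs, np.array([1.0, 0.0]))
```

With normalization, a long vector pointing slightly off-query can no longer outrank a short vector pointing straight at it.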

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embedder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embedder=embedder,
    text_embed_config=config,
)
print(data[0].metadata)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you contribute effectively. These are meant to serve as guidelines, not strict rules; use your best judgment, and feel free to propose changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

    🏎️ RoadMap

    Accomplishments

    One of the aims of EmbedAnything is to let AI engineers easily use state-of-the-art embedding models on typical files and documents. Much has already been accomplished here; the formats below are supported today, and a few more are on the way.

    🖼️ Modalities and Sources

    We’re excited to share that we've expanded our platform to support multiple modalities, including:

    • Audio files

    • Markdowns

    • Websites

    • Images

    • Videos

    • Graph

    This gives you the flexibility to work with various data types all in one place! 🌐

    💜 Product

    We’ve rolled out some major updates in version 0.3 to improve both functionality and performance. Here’s what’s new:

    • Semantic Chunking: Optimized chunking strategy for better Retrieval-Augmented Generation (RAG) workflows.

    • Streaming for Efficient Indexing: We’ve introduced streaming for memory-efficient indexing in vector databases. Want to know more? Check out our article on this feature here: https://www.analyticsvidhya.com/blog/2024/09/vector-streaming/

    • Zero-Shot Applications: Explore our zero-shot application demos to see the power of these updates in action.

    • Intuitive Functions: Version 0.3 includes a complete refactor for more intuitive functions, making the platform easier to use.

    • Chunkwise Streaming: Instead of file-by-file streaming, we now support chunkwise streaming, allowing for more flexible and efficient data processing.

    Check out the latest release and see how these features can supercharge your generative AI pipeline! ✨

    🚀 Coming Soon

    ⚙️ Performance

    We've received quite a few questions about why we're using Candle, so here's a quick explanation:

    One of the main reasons is that Candle doesn't require any specific ONNX format models, which means it can work seamlessly with any Hugging Face model. This flexibility has been a key factor for us. However, we also recognize that we’ve been compromising a bit on speed in favor of that flexibility.

    What’s Next? To address this, we’re excited to announce that we’re introducing Candle-ONNX alongside our existing Candle backend:

    • Support for GGUF models
    • Significantly faster performance

    Stay tuned for these exciting updates! 🚀

    🫐 Embeddings

    We have had multimodality in our infrastructure from day one. Websites, images, and audio are already covered, and we want to expand it further:

    ☑️ Graph embedding: build DeepWalk embeddings with depth-first random walks and word2vec
    ☑️ Video embedding
    ☑️ YOLO + CLIP

    🌊Expansion to other Vector Adapters

    We currently support a wide range of vector databases for streaming embeddings, including:

    • Elastic: thanks to the amazing and active Elastic team for the contribution
    • Weaviate
    • Pinecone

    But we're not stopping there! We're actively working to expand this list.

    Want to Contribute? If you’d like to add support for your favorite vector database, we’d love your help! Check out our contribution.md for guidelines, or feel free to reach out directly at starlight-search@proton.me. Let's build something amazing together! 💡


