
Embed anything at lightning speed



Generate and stream your embeddings with a minimalist and lightning-fast framework built in Rust 🦀
Explore the docs »

View Demo · Examples · Vector Streaming Adapters · Search in Audio Space

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multi-source, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies the process of generating embeddings from various sources and streaming them to a vector database.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina.
  • Cloud Embedding Models: Supports OpenAI and Cohere.
  • Multimodality: Works with text sources (PDF, TXT, MD), images (JPG), and audio (WAV).
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is handled with Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Scalable: Store embeddings in a vector database for easy retrieval and scalability.
  • Vector Streaming: Continuously create and stream embeddings, useful when resources are limited.

💡 What is Vector Streaming

Vector Streaming lets you process files and generate embeddings chunk by chunk, streaming them out as they are produced. If you have 10 GB of files, embeddings are generated continuously, chunk by chunk (with optional semantic segmentation), and stored in the vector database of your choice, so the full set of embeddings never has to sit in RAM at once.
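
The consumer side of this pattern can be sketched roughly as follows; this is a conceptual illustration only, not the library's actual adapter API, and VectorDBAdapter, stream_embeddings, and embedded_chunks are hypothetical placeholders:

# Conceptual sketch of the vector-streaming idea (not EmbedAnything's real adapter API).
class VectorDBAdapter:
    def upsert(self, batch):
        # Push one batch of (embedding, text, metadata) records to your vector database.
        ...

def stream_embeddings(embedded_chunks, adapter, batch_size=32):
    # `embedded_chunks` stands in for embeddings arriving one chunk at a time.
    buffer = []
    for chunk in embedded_chunks:
        buffer.append(chunk)
        if len(buffer) >= batch_size:
            adapter.upsert(buffer)   # flush to the database
            buffer.clear()           # only one batch ever sits in RAM
    if buffer:
        adapter.upsert(buffer)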


🦀 Why Embed Anything

➡️ Faster execution.
➡️ Memory management: Rust's ownership model prevents the memory leaks and crashes that can plague other languages.
➡️ True multithreading.
➡️ Runs language and embedding models locally and efficiently.
➡️ Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️ Lower memory usage.

⭐ Supported Models

We support a range of models that Candle can run. Below is a set of tested models, but if you have a specific use case, please mention it in an issue.

How to add a custom model, chunk size, and semantic chunking

import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)
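
Each item in data is an EmbedData carrying the chunk text, its embedding vector, and metadata, so you can inspect the output directly, for example:

for chunk in data[:3]:
    print(len(chunk.embedding), chunk.metadata)
    print(chunk.text)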
Model     Custom link
Jina      jinaai/jina-embeddings-v2-base-en
          jinaai/jina-embeddings-v2-small-en
Bert      sentence-transformers/all-MiniLM-L6-v2
          sentence-transformers/all-MiniLM-L12-v2
          sentence-transformers/paraphrase-MiniLM-L6-v2
Clip      openai/clip-vit-base-patch32
Whisper   Most OpenAI Whisper models from Hugging Face are supported.

For Semantic Chunking

from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)

# with semantic encoder
semantic_encoder = EmbeddingModel.from_pretrained_hf(WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en")
config = TextEmbedConfig(chunk_size=256, batch_size=32, splitting_strategy="semantic", semantic_encoder=semantic_encoder)
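
The semantic config is then passed to embed_file exactly like the plain chunking config above (the file path here is just illustrative):

import embed_anything

data = embed_anything.embed_file("test_files/test.pdf", embeder=model, config=config)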

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

For GPUs and special models like ColPali:

pip install embed-anything-gpu

Usage

➡️ Usage for version 0.3 and later

To use local embeddings (we support Bert and Jina):

import embed_anything
from embed_anything import EmbeddingModel, WhichModel

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

For multimodal embedding: we support CLIP

Requirements: a directory with the pictures you want to search, for example test_files with images of cats, dogs, etc.

import embed_anything
import numpy as np
from PIL import Image
from embed_anything import EmbedData

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([d.embedding for d in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
# embed_directory stores each image's path in `text`, so the best match can be opened directly
Image.open(data[max_index].text).show()

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)
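
Following on from the block above, the embedded transcript chunks can be searched like any other text embeddings. A small illustrative follow-up (the query string is made up, and this assumes the chunk text is exposed on .text as in the other examples):

import numpy as np

embeddings = np.array([d.embedding for d in data])
query_embedding = np.array(
    embed_anything.embed_query(["coffee shop conversation"], embeder=embeder)[0].embedding
)
best = int(np.argmax(np.dot(embeddings, query_embedding)))
print(data[best].text)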

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

    🏎️ RoadMap

    One of the aims of EmbedAnything is to allow AI engineers to easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished here; these are the formats we support right now, and a few more are still to come.

    🖼️ Modalities and Source

    We’re excited to share that we've expanded our platform to support multiple modalities, including:

    • Audio files
    • Markdowns
    • Websites
    • Images
    • Custom model uploads

    This gives you the flexibility to work with various data types all in one place! 🌐

    💜 Product

    We’ve rolled out some major updates in version 0.3 to improve both functionality and performance. Here’s what’s new:

    • Semantic Chunking: Optimized chunking strategy for better Retrieval-Augmented Generation (RAG) workflows.

    • Streaming for Efficient Indexing: We’ve introduced streaming for memory-efficient indexing in vector databases. Want to know more? Check out our article on this feature here: https://www.analyticsvidhya.com/blog/2024/09/vector-streaming/

    • Zero-Shot Applications: Explore our zero-shot application demos to see the power of these updates in action.

    • Intuitive Functions: Version 0.3 includes a complete refactor for more intuitive functions, making the platform easier to use.

    • Chunkwise Streaming: Instead of file-by-file streaming, we now support chunkwise streaming, allowing for more flexible and efficient data processing.

    Check out the latest release and see how these features can supercharge your generative AI pipeline! ✨

    🚀 Where are we heading?

    ⚙️ Performance

    We've received quite a few questions about why we're using Candle, so here's a quick explanation:

    One of the main reasons is that Candle doesn't require any specific ONNX format models, which means it can work seamlessly with any Hugging Face model. This flexibility has been a key factor for us. However, we also recognize that we’ve been compromising a bit on speed in favor of that flexibility.

    What’s next? To address this, we’re excited to announce that we’re introducing ORT (ONNX Runtime) support alongside our existing Candle-based Hugging Face backend, bringing:

    ➡️ Significantly faster performance

    Stay tuned for these exciting updates! 🚀

    🫐 Embeddings

    We have had multimodality in our infrastructure from day one. We already support websites, images, and audio, and we want to expand further to:

    ☑️ Graph embeddings -- build DeepWalk embeddings depth-first and word2vec
    ☑️ Video embeddings
    ☑️ Yolo Clip

    🌊 Expansion to other Vector Adapters

    We currently support a wide range of vector databases for streaming embeddings, including:

    • Elastic: thanks to the amazing and active Elastic team for the contribution
    • Weaviate
    • Pinecone
    • Qdrant

    But we're not stopping there! We're actively working to expand this list.

    Want to contribute? If you’d like to add support for your favorite vector database, we’d love to have your help! Check out our contribution.md for guidelines, or feel free to reach out directly at starlight-search@proton.me. Let's build something amazing together! 💡
