
Embed anything at lightning speed



Generate and stream your embeddings with a minimalist and lightning-fast framework built in Rust 🦀
Explore the docs »

View Demo · Examples · Vector Streaming Adapters · Search in Audio Space

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multi-source, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies the process of generating embeddings from various sources and streaming them to a vector database.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add a custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina.
  • Cloud Embedding Models: Supports OpenAI and Cohere.
  • Multimodality: Works with text sources (PDF, TXT, MD), images (JPG), and audio (WAV).
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is taken care of with Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Scalable: Store embeddings in a vector database for easy retrieval and scalability.
  • Vector Streaming: Continuously create and stream embeddings, even on low-resource machines.

💡What is Vector Streaming

Vector Streaming lets you generate embeddings for files and stream them as they are produced. Even with 10 GB of files, embeddings are created chunk by chunk, with optional semantic segmentation, and sent straight to the vector database of your choice, so the full set of embeddings never has to sit in RAM at once.
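For illustration, here is a minimal sketch of the chunk-level output you would stream; with a vector-streaming adapter the hand-off to the database happens continuously instead of after the fact. The file path is a placeholder, and the print call stands in for an upsert into your own vector-database client or one of the streaming adapters:

import embed_anything
from embed_anything import EmbeddingModel, WhichModel, TextEmbedConfig

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L6-v2"
)
config = TextEmbedConfig(chunk_size=256, batch_size=32)

# Each returned item is one chunk with its own embedding.
data = embed_anything.embed_file("large_report.pdf", embeder=model, config=config)

# Hand the chunks to your vector database in small batches rather than
# keeping one giant embedding matrix in memory.
BATCH = 64
for i in range(0, len(data), BATCH):
    batch = data[i : i + BATCH]
    print(f"upserting {len(batch)} chunks, dim={len(batch[0].embedding)}")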

[EmbedAnything × Weaviate]

🦀 Why Embed Anything

➡️ Faster execution.
➡️ Memory management: Rust enforces memory safety at compile time, preventing the memory leaks and crashes that plague other languages.
➡️ True multithreading.
➡️ Runs language and embedding models locally and efficiently.
➡️ Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️ Reduced memory usage.

⭐ Supported Models

We support a range of models that can run on Candle. Below is a set of tested models; if you have a specific use case, please mention it in an issue.

How to add a custom model, chunk size, and semantic chunking

import embed_anything
from embed_anything import EmbeddingModel, WhichModel, TextEmbedConfig

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)
Model     Custom link
Jina      jinaai/jina-embeddings-v2-base-en
          jinaai/jina-embeddings-v2-small-en
Bert      sentence-transformers/all-MiniLM-L6-v2
          sentence-transformers/all-MiniLM-L12-v2
          sentence-transformers/paraphrase-MiniLM-L6-v2
Clip      openai/clip-vit-base-patch32
Whisper   Most OpenAI Whisper models on Hugging Face are supported.

For Semantic Chunking

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)

# with semantic encoder
semantic_encoder = EmbeddingModel.from_pretrained_hf(
    WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en"
)
config = TextEmbedConfig(
    chunk_size=256, batch_size=32, splitting_strategy="semantic", semantic_encoder=semantic_encoder
)
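The semantic config is then passed to embed_file in the same way as the default config above; the file path below is a placeholder:

data = embed_anything.embed_file("path/to/file.pdf", embeder=model, config=config)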

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

For GPU support and special models like ColPali:

pip install embed-anything-gpu

Usage

➡️ Usage for version 0.3 and later

To use local embedding models (we support Bert and Jina):

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)
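Each item in data is one embedded chunk; a quick way to sanity-check the output (a small sketch, assuming the embedding and text fields used in the CLIP example below):

for chunk in data[:3]:
    print(len(chunk.embedding), chunk.text[:80])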

For multimodal embedding: we support CLIP

Requirements: a directory with the pictures you want to search. For example, we have test_files with images of cats, dogs, etc.

import embed_anything
from embed_anything import EmbedData
import numpy as np
from PIL import Image

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([d.embedding for d in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
# The path of each embedded image is stored in the `text` field.
Image.open(data[max_index].text).show()

Audio Embedding using Whisper

Requirements: audio .wav files.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)
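To search within the transcribed audio, you can reuse the query-embedding pattern from the CLIP example above (a small sketch; the numpy import and the assumption that each chunk's text field holds the transcribed segment are carried over from the examples above):

import numpy as np

# Embed a text query with the same Bert embeder and score it against the
# audio-chunk embeddings produced above.
query_embedding = np.array(
    embed_anything.embed_query(["topic to look for"], embeder=embeder)[0].embedding
)
audio_embeddings = np.array([chunk.embedding for chunk in data])
best = int(np.argmax(audio_embeddings @ query_embedding))
print(data[best].text)  # assumed to hold the transcribed chunk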

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

    🏎️ RoadMap

    One of the aims of EmbedAnything is to allow AI engineers to easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished here; these are the formats we support right now, with a few more still to come.

    🖼️ Modalities and Source

    We’re excited to share that we've expanded our platform to support multiple modalities, including:

    • Audio files
    • Markdowns
    • Websites
    • Images
    • Custom model uploads

    This gives you the flexibility to work with various data types all in one place! 🌐

    💜 Product

    We’ve rolled out some major updates in version 0.3 to improve both functionality and performance. Here’s what’s new:

    • Semantic Chunking: Optimized chunking strategy for better Retrieval-Augmented Generation (RAG) workflows.

    • Streaming for Efficient Indexing: We’ve introduced streaming for memory-efficient indexing in vector databases. Want to know more? Check out our article on this feature here: https://www.analyticsvidhya.com/blog/2024/09/vector-streaming/

    • Zero-Shot Applications: Explore our zero-shot application demos to see the power of these updates in action.

    • Intuitive Functions: Version 0.3 includes a complete refactor for more intuitive functions, making the platform easier to use.

    • Chunkwise Streaming: Instead of file-by-file streaming, we now support chunkwise streaming, allowing for more flexible and efficient data processing.

    Check out the latest release and see how these features can supercharge your generative AI pipeline! ✨

    🚀Where are we heading

    ⚙️ Performance

    We've received quite a few questions about why we're using Candle, so here's a quick explanation:

    One of the main reasons is that Candle doesn't require any specific ONNX format models, which means it can work seamlessly with any Hugging Face model. This flexibility has been a key factor for us. However, we also recognize that we’ve been compromising a bit on speed in favor of that flexibility.

    What’s next? To address this, we’re excited to announce that we’re introducing ONNX Runtime (ORT) support alongside our existing Candle-based Hugging Face support, bringing:

    ➡️ Significantly faster performance

    Stay tuned for these exciting updates! 🚀

    🫐 Embeddings

    We have had multimodality in our infrastructure from day one. We already support websites, images, and audio, but we want to expand further to:

    ☑️ Graph embeddings -- build DeepWalk embeddings (depth-first) and word2vec
    ☑️ Video embeddings
    ☑️ YOLO + CLIP

    🌊Expansion to other Vector Adapters

    We currently support a wide range of vector databases for streaming embeddings, including:

    • Elastic: thanks to the amazing and active Elastic team for the contribution
    • Weaviate
    • Pinecone
    • Qdrant

    But we're not stopping there! We're actively working to expand this list.

    Want to Contribute? If you’d like to add support for your favorite vector database, we’d love to have your help! Check out our contribution.md for guidelines, or feel free to reach out directly at starlight-search@proton.me. Let's build something amazing together! 💡
