Embed anything at lightning speed


Generate and stream your embeddings with a minimalist, lightning-fast framework built in Rust 🦀

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multisource, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies generating embeddings from various sources and streaming them to a vector database.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina.
  • Cloud Embedding Models: Supports OpenAI. Mistral and Cohere support is coming soon.
  • MultiModality: Works with text sources such as PDF, .txt, and .md files, with JPG images, and with .wav audio.
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is handled through Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Scalable: Store embeddings in a vector database for easy retrieval and scalability.
  • Vector Streaming: Continuously create and stream embeddings when resources are limited.

💡What is Vector Streaming

Vector Streaming lets you generate embeddings for files and stream them as they are produced. If you have 10 GB of files, it generates embeddings file by file (chunk by chunk in the future) and stores them in the vector database of your choice, so the full set of embeddings never has to be held in RAM at once.
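
As a rough illustration of the idea, the sketch below embeds documents one at a time and hands each document's vectors to a store before moving on, so only one document's embeddings are in memory at once. `fake_embed`, `stream_embeddings`, and `InMemoryStore` are hypothetical stand-ins and not part of EmbedAnything's API; a real pipeline would use the library's embedding functions and a vector database adapter instead.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Vector = List[float]


def fake_embed(text: str) -> List[Vector]:
    """Hypothetical stand-in embedder: one tiny vector per non-empty line."""
    return [
        [float(len(line)), float(line.count(" "))]
        for line in text.splitlines()
        if line.strip()
    ]


def stream_embeddings(
    documents: Iterable[Tuple[str, str]],
    embed: Callable[[str], List[Vector]],
) -> Iterator[Tuple[str, List[Vector]]]:
    """Yield (name, vectors) one document at a time instead of building one big list."""
    for name, text in documents:
        yield name, embed(text)


class InMemoryStore:
    """Hypothetical stand-in for a vector database client."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, Vector]] = []

    def upsert(self, name: str, vectors: List[Vector]) -> None:
        self.rows.extend((name, vector) for vector in vectors)


docs = [("a.txt", "hello world\nsecond line"), ("b.txt", "one more file")]
store = InMemoryStore()
for name, vectors in stream_embeddings(docs, fake_embed):
    # Each document's vectors are written out before the next document is embedded.
    store.upsert(name, vectors)

print(len(store.rows))
```

Because `stream_embeddings` is a generator, peak memory is bounded by the largest single document rather than by the whole corpus.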

🦀 Why Embed Anything

➡️Faster execution.
➡️Memory management: Rust enforces memory safety at compile time, preventing many of the memory bugs and crashes that can plague other languages.
➡️True multithreading.
➡️Running language models or embedding models locally and efficiently.
➡️Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️Lower memory usage.

⭐ Supported Models

We support a range of models that Candle can run. We have listed a set of tested models below, but if you have a specific use case, please mention it in an issue.

How to add a custom model and chunk size

import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)

| Model   | Custom link                                                  |
|---------|--------------------------------------------------------------|
| Jina    | jinaai/jina-embeddings-v2-base-en                            |
|         | jinaai/jina-embeddings-v2-small-en                           |
| Bert    | sentence-transformers/all-MiniLM-L6-v2                       |
|         | sentence-transformers/all-MiniLM-L12-v2                      |
|         | sentence-transformers/paraphrase-MiniLM-L6-v2                |
| Clip    | openai/clip-vit-base-patch32                                 |
| Whisper | Most OpenAI Whisper models from Hugging Face are supported.  |
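
To give a feel for what `chunk_size` and `batch_size` control, here is a minimal sketch of splitting text into chunks and grouping the chunks into batches. How EmbedAnything actually splits text (by characters, words, or tokens) is not stated here, so the word-based splitting below is an assumption, and `chunk_words` and `batched` are hypothetical helpers rather than library functions.

```python
from typing import Iterator, List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


text = " ".join(f"w{i}" for i in range(10))    # 10 dummy words
chunks = chunk_words(text, chunk_size=4)        # chunks of 4, 4, and 2 words
batches = list(batched(chunks, batch_size=2))   # a batch of 2 chunks, then a batch of 1

print(chunks)
print([len(batch) for batch in batches])
```

Smaller chunks give more, finer-grained embeddings per file; larger batches mean fewer calls into the model at the cost of more memory per call.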

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything
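
Since the usage examples differ between 0.2 and 0.3 and later, it can help to check which version is installed first. The sketch below uses the standard library's `importlib.metadata`; `api_style` and `installed_version` are hypothetical helper names, and the parsing assumes a version string of the form `major.minor[.patch]`.

```python
from importlib import metadata
from typing import Optional


def api_style(version: str) -> str:
    """Return "0.3+" or "0.2" for a version string such as "0.3.1"."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return "0.3+" if (major, minor) >= (0, 3) else "0.2"


def installed_version(package: str = "embed-anything") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("embed-anything is not installed")
else:
    print(f"embed-anything {version}: use the {api_style(version)} examples")
```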

Usage

➡️ Usage for version 0.3 and later

To use local embedding, we support Bert and Jina:

import embed_anything
from embed_anything import EmbeddingModel, WhichModel

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

For multimodal embedding: we support CLIP

Requirements: a directory with the pictures you want to search. For example, we have test_files with images of cats, dogs, etc.

import numpy as np
from PIL import Image

import embed_anything
from embed_anything import EmbedData

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
# Embed every image in the directory.
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([item.embedding for item in data])

# Embed the text query with the same model.
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)

# Rank the images by dot-product similarity and open the best match.
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
Image.open(data[max_index].text).show()
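
The example above ranks images by raw dot product. A dot product matches cosine similarity only when the vectors have unit length, and whether the returned embeddings are normalized is not stated here. The sketch below shows how the two rankings can differ, using toy 2-D vectors instead of real embeddings. It uses plain Python so that it runs without dependencies, and `dot` and `cosine` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: the dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy vectors standing in for image embeddings and a query embedding.
embeddings = [[10.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
query = [1.0, 1.0]

dot_best = argmax([dot(e, query) for e in embeddings])        # rewards the long first vector
cosine_best = argmax([cosine(e, query) for e in embeddings])  # rewards matching direction

print(dot_best, cosine_best)
```

With NumPy arrays, the same normalization can be done by dividing each row by `np.linalg.norm(embeddings, axis=1, keepdims=True)` before taking the dot product.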

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)
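
Before embedding audio, it can be useful to confirm that a file really is a readable .wav and to look at its basic properties. Whisper models operate on 16 kHz audio; whether EmbedAnything resamples other rates is not stated here, so treat the check below as an optional sanity step. It is a sketch using only the standard library's `wave` module, and `describe_wav` and `write_silence` are hypothetical helpers.

```python
import os
import tempfile
import wave
from typing import Dict


def describe_wav(path: str) -> Dict[str, float]:
    """Read basic header fields from a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


def write_silence(path: str, seconds: int = 1, rate: int = 16000) -> None:
    """Write a mono, 16-bit silent .wav file (demo helper only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * (rate * seconds))


# Demo on a generated file so the sketch runs without any sample audio.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "silence.wav")
    write_silence(demo_path)
    info = describe_wav(demo_path)

print(info)
```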

➡️ Usage for version 0.2

To use local embedding, we support Bert and Jina:

import numpy as np

import embed_anything

data = embed_anything.embed_file("file_path.pdf", embeder="Bert")
embeddings = np.array([item.embedding for item in data])

For multimodal embedding: we support CLIP

Requirements: a directory with the pictures you want to search. For example, we have test_files with images of cats, dogs, etc.

import numpy as np
from PIL import Image

import embed_anything

data = embed_anything.embed_directory("directory_path", embeder="Clip")
embeddings = np.array([item.embedding for item in data])

query = ["photo of a dog"]
query_embedding = np.array(embed_anything.embed_query(query, embeder="Clip")[0].embedding)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
Image.open(data[max_index].text).show()
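
The search above returns only the single best match. If you want the few closest images instead, one option is to sort the similarity scores and keep the top k. The sketch below uses made-up scores and a hypothetical `top_k` helper in plain Python.

```python
from typing import List, Sequence, Tuple


def top_k(similarities: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return the (index, score) pairs of the k highest scores, best first."""
    ranked = sorted(enumerate(similarities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up similarity scores, one per embedded image.
scores = [0.12, 0.87, 0.45, 0.91]
best_two = top_k(scores, k=2)

print(best_two)
```

Each returned index can be used to look up the matching entry in `data`, in the same way `max_index` is used above.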

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import JinaConfig, EmbedConfig, AudioDecoderConfig
import time

start_time = time.time()

# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder_config = AudioDecoderConfig(
    decoder_model_id="openai/whisper-tiny.en",
    decoder_revision="main",
    model_type="tiny-en",
    quantized=False,
)
jina_config = JinaConfig(
    model_id="jinaai/jina-embeddings-v2-small-en", revision="main", chunk_size=100
)

config = EmbedConfig(jina=jina_config, audio_decoder=audio_decoder_config)
data = embed_anything.embed_file(
    "test_files/audio/samples_hp0.wav", embeder="Audio", config=config
)
print(data[0].metadata)
end_time = time.time()
print("Time taken: ", end_time - start_time)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you to contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

Roadmap

One of the aims of EmbedAnything is to allow AI engineers to easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished. These are the formats and features we support right now, with a few more still to come.

✅ Markdown, PDFs, and websites
✅ WAV files
✅ JPG, PNG, and WebP
✅ Whisper for audio embeddings
✅ Custom model upload for anything available in Candle
✅ Custom chunk size
✅ Pinecone adapter, to save embeddings to it directly
✅ Zero-shot application
✅ Vector database integration via streaming adapters
✅ Refactoring for intuitive functions

Yet to be done

☑️ Chunk-wise streaming instead of file-wise
☑️ Graph embedding: build DeepWalk embeddings (depth first) and word2vec
☑️ Video embedding
☑️ YOLO CLIP
☑️ More vector database adapters