
Embed anything at lightning speed


Generate and stream your embeddings with a minimalist, lightning-fast framework built in Rust 🦀
Explore the docs »

View Demo · Examples · Vector Streaming Adapters · Search in Audio Space

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multisource, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies the process of generating embeddings from various sources and streaming them to a vector database.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina.
  • Cloud Embedding Models: Supports OpenAI. Mistral and Cohere support coming soon.
  • Multimodality: Works with text sources such as PDF, TXT, and MD files, JPG images, and WAV audio.
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is handled with Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Scalable: Store embeddings in a vector database for easy retrieval and scalability.
  • Vector Streaming: Continuously create and stream embeddings if you have limited resources.

💡 What is Vector Streaming?

Vector Streaming lets you generate embeddings for files and stream them as you go. If you have 10 GB of files, it continuously generates embeddings file by file (or, in the future, chunk by chunk) and stores them in the vector database of your choice. This avoids holding all the embeddings in RAM at once.
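
The idea can be sketched in plain Python. This is an illustration only, not EmbedAnything's implementation: `fake_embed`, `stream_embeddings`, and `InMemorySink` are hypothetical names, and the "files" are in-memory (name, content) pairs so the example runs on its own.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Vector = List[float]


def fake_embed(text: str) -> Vector:
    """Hypothetical stand-in for a real embedding model: a tiny 2-d vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def stream_embeddings(
    files: Iterable[Tuple[str, str]],
    embed: Callable[[str], Vector],
) -> Iterator[Tuple[str, Vector]]:
    """Yield one (name, vector) pair at a time instead of building a full list."""
    for name, content in files:
        yield name, embed(content)


class InMemorySink:
    """Hypothetical stand-in for a vector database adapter."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, Vector]] = []

    def upsert(self, name: str, vector: Vector) -> None:
        self.rows.append((name, vector))


files = [("a.txt", "hello"), ("b.txt", "embed anything")]
sink = InMemorySink()
for name, vector in stream_embeddings(files, fake_embed):
    # Each vector is handed to the sink as soon as it is produced.
    sink.upsert(name, vector)

print(len(sink.rows))
```

Because `stream_embeddings` is a generator, only the embedding currently being written needs to be held by the pipeline; a real sink would write to a database rather than to a Python list.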

EmbedAnythingXWeaviate

🦀 Why Embed Anything

➡️Faster execution.
➡️Memory management: Rust enforces memory safety at compile time, helping prevent the memory leaks and crashes that can plague other languages.
➡️True multithreading.
➡️Running language models or embedding models locally and efficiently.
➡️Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️Lower memory usage.

⭐ Supported Models

We support a range of models that Candle can run. We list a set of tested models below; if you have a specific use case, please mention it in an issue.

How to add a custom model and chunk size

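import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

# Replace the placeholder with a Hugging Face model ID from the table below.
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)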

| Model   | Custom link                                         |
|---------|-----------------------------------------------------|
| Jina    | jinaai/jina-embeddings-v2-base-en                   |
|         | jinaai/jina-embeddings-v2-small-en                  |
| Bert    | sentence-transformers/all-MiniLM-L6-v2              |
|         | sentence-transformers/all-MiniLM-L12-v2             |
|         | sentence-transformers/paraphrase-MiniLM-L6-v2       |
| Clip    | openai/clip-vit-base-patch32                        |
| Whisper | Most OpenAI Whisper models from Hugging Face are supported. |
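
To give a feel for what a chunk size and a batch size control, here is a rough sketch in plain Python. It is not EmbedAnything's chunker: the helpers `chunk_words` and `batched` are hypothetical, and counting whitespace-separated words is an assumption made for illustration (the library may count characters or tokens instead).

```python
from typing import Iterator, List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words.

    Assumption: words are used as the unit purely for illustration.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def batched(chunks: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size chunks."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for i in range(0, len(chunks), batch_size):
        yield chunks[i : i + batch_size]


text = "one two three four five six seven"
chunks = chunk_words(text, chunk_size=3)
batches = list(batched(chunks, batch_size=2))
print(chunks)   # three chunks; the last one holds the leftover word
print(batches)  # two batches: a full one and a partial one
```

Smaller chunks give more, finer-grained embeddings per file; larger batches mean fewer model calls at the cost of more memory per call.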

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

Usage

➡️ Usage for version 0.3 and later

To use local embeddings: we support Bert and Jina

import embed_anything
from embed_anything import EmbeddingModel, WhichModel

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

For multimodal embeddings: we support CLIP

Requirements: a directory with the pictures you want to search. For example, test_files contains images of cats, dogs, etc.

import embed_anything
import numpy as np
from embed_anything import EmbedData
from PIL import Image

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([item.embedding for item in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
# Score every image against the query and open the best match.
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
# The text field is used here as the path of the matched image.
Image.open(data[max_index].text).show()
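
The search step above ranks images by a raw dot product, which matches cosine similarity only when the vectors are unit-length. Whether the returned embeddings are already normalized is not stated here, so the following sketch shows the difference on toy vectors. It is written in plain Python so it runs without NumPy; `cosine` and `best_match` are hypothetical helper names, not part of EmbedAnything.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(embeddings: List[Sequence[float]], query: Sequence[float]) -> int:
    """Index of the embedding with the highest cosine similarity to the query."""
    scores = [cosine(e, query) for e in embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: the first vector is long but points away from the query.
embeddings = [[10.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
query = [1.0, 1.0]

raw_dots = [sum(x * y for x, y in zip(e, query)) for e in embeddings]
raw_best = max(range(len(raw_dots)), key=raw_dots.__getitem__)
cos_best = best_match(embeddings, query)
print(raw_best, cos_best)
```

On this toy data the raw dot product favors the longest vector, while cosine similarity picks the vector that points in the same direction as the query. If your embeddings are already normalized, the two rankings agree.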

Audio Embedding using Whisper

Requirements: audio .wav files.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)

➡️ Usage for version 0.2

To use local embeddings: we support Bert and Jina

import embed_anything
import numpy as np

data = embed_anything.embed_file("file_path.pdf", embeder="Bert")
embeddings = np.array([item.embedding for item in data])

For multimodal embeddings: we support CLIP

Requirements: a directory with the pictures you want to search. For example, test_files contains images of cats, dogs, etc.

import embed_anything
import numpy as np
from PIL import Image

data = embed_anything.embed_directory("directory_path", embeder="Clip")
embeddings = np.array([item.embedding for item in data])

query = ["photo of a dog"]
query_embedding = np.array(embed_anything.embed_query(query, embeder="Clip")[0].embedding)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
Image.open(data[max_index].text).show()

Audio Embedding using Whisper

Requirements: audio .wav files.

import embed_anything
from embed_anything import JinaConfig, EmbedConfig, AudioDecoderConfig
import time

start_time = time.time()

# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder_config = AudioDecoderConfig(
    decoder_model_id="openai/whisper-tiny.en",
    decoder_revision="main",
    model_type="tiny-en",
    quantized=False,
)
jina_config = JinaConfig(
    model_id="jinaai/jina-embeddings-v2-small-en", revision="main", chunk_size=100
)

config = EmbedConfig(jina=jina_config, audio_decoder=audio_decoder_config)
data = embed_anything.embed_file(
    "test_files/audio/samples_hp0.wav", embeder="Audio", config=config
)
print(data[0].metadata)
end_time = time.time()
print("Time taken: ", end_time - start_time)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you to contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

Roadmap

One of the aims of EmbedAnything is to allow AI engineers to easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished; these are the formats and features supported right now, with a few more still to be done.
✅ Markdown, PDFs, and websites
✅ WAV files
✅ JPG, PNG, WebP
✅ Whisper for audio embeddings
✅ Custom model upload: anything that is available in Candle
✅ Custom chunk size
✅ Pinecone adapter, to save embeddings directly to Pinecone
✅ Zero-shot application
✅ Vector database integration via streaming adapters
✅ Refactoring for intuitive functions

Yet to be done
☑️ Chunk-wise streaming instead of file-wise streaming
☑️ Graph embedding: build DeepWalk embeddings (depth-first walks and word2vec)
☑️ Video embedding
☑️ YOLO CLIP
☑️ More vector database adapters
