Embed anything at lightning speed

Generate and stream your embeddings with a minimalist, lightning-fast framework built in Rust 🦀

EmbedAnything is a minimalist, lightweight, high-performance embedding pipeline built in Rust. It is multisource, multimodal, and runs locally. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies generating embeddings from these sources and streaming them to a vector database.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina.
  • Cloud Embedding Models: Supports OpenAI. Mistral and Cohere support is coming soon.
  • Multimodality: Works with text sources (PDF, TXT, MD), images (JPG), and audio (WAV).
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is handled by Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Scalable: Store embeddings in a vector database for easy retrieval and scalability.
  • Vector Streaming: Continuously create and stream embeddings when resources are limited.

💡What is Vector Streaming

Vector Streaming lets you generate embeddings for files and stream them as they are produced. If you have 10 GB of files, it generates embeddings file by file (chunk by chunk in the future) and stores them in the vector database of your choice, so the full set of embeddings never has to be held in RAM at once.
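
The idea can be sketched in plain Python. This is a minimal illustration of the pattern, not EmbedAnything's implementation: `fake_embed` and `ListSink` are hypothetical stand-ins for an embedding model and a vector database adapter. It shows that only one file's embeddings need to be in memory at a time.

```python
from typing import Callable, Iterable, Iterator

Vector = list[float]


def fake_embed(text: str) -> Vector:
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


class ListSink:
    """Hypothetical stand-in for a vector database adapter."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, Vector]] = []

    def upsert(self, batch: list[tuple[str, Vector]]) -> None:
        self.rows.extend(batch)


def stream_embeddings(
    files: Iterable[tuple[str, list[str]]],
    embed: Callable[[str], Vector],
) -> Iterator[list[tuple[str, Vector]]]:
    """Yield one file's embeddings at a time instead of collecting them all."""
    for name, chunks in files:
        yield [(f"{name}#{i}", embed(chunk)) for i, chunk in enumerate(chunks)]


# Toy input: (file name, text chunks). Real files would be read lazily.
files = [("a.txt", ["hello", "world"]), ("b.txt", ["embed anything"])]
sink = ListSink()
for batch in stream_embeddings(files, fake_embed):
    sink.upsert(batch)  # after this call the batch is no longer needed

print(len(sink.rows))
```

Because `stream_embeddings` is a generator, the next file is only embedded once the previous batch has been handed to the sink.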

EmbedAnythingXWeaviate

🦀 Why Embed Anything

➡️Faster execution.
➡️Memory management: Rust manages memory at compile time through ownership, which prevents many of the memory bugs and crashes that can plague other languages.
➡️True multithreading.
➡️Runs language models and embedding models locally and efficiently.
➡️Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️Lower memory usage.

⭐ Supported Models

We support a range of models that can run on Candle. A set of tested models is listed below; if you have a specific use case, please mention it in an issue.

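How to add a custom model and chunk size

import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

# Load a supported model by its Hugging Face model ID.
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
# chunk_size and batch_size control how text is split and batched.
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)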

Model     Custom link
Jina      jinaai/jina-embeddings-v2-base-en
          jinaai/jina-embeddings-v2-small-en
Bert      sentence-transformers/all-MiniLM-L6-v2
          sentence-transformers/all-MiniLM-L12-v2
          sentence-transformers/paraphrase-MiniLM-L6-v2
Clip      openai/clip-vit-base-patch32
Whisper   Most OpenAI Whisper models from Hugging Face are supported.
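
To make the two `TextEmbedConfig` parameters more concrete, here is a rough sketch of what chunking and batching mean in general. It assumes chunks are measured in whitespace-separated words, which may not match how EmbedAnything measures `chunk_size`. `chunk_words` and `batched` are hypothetical helpers, not library functions.

```python
from typing import Iterator


def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size words (assumed unit)."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Group chunks so that batch_size chunks are embedded together."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]


text = "one two three four five six seven"
chunks = chunk_words(text, chunk_size=3)
batches = list(batched(chunks, batch_size=2))
print(chunks)
print(batches)
```

In this sketch, smaller chunks produce more embeddings per document, while a larger batch size means fewer, larger groups are passed to the model.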

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

Usage

➡️ Usage for version 0.3 and later

To use local embedding: we support Bert and Jina

import embed_anything
from embed_anything import EmbeddingModel, WhichModel

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

For multimodal embedding: we support CLIP

Requirements: a directory of pictures you want to search. For example, test_files contains images of cats, dogs, etc.

import embed_anything
import numpy as np
from embed_anything import EmbedData
from PIL import Image

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([item.embedding for item in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
# Rank the images by dot product with the query and open the best match.
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
Image.open(data[max_index].text).show()
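
The example above ranks images by raw dot product, which matches cosine similarity only when the embeddings are unit-normalized. Whether a given model's output is normalized is not stated here, so the following sketch shows an explicit cosine ranking as an alternative. It uses only the standard library and toy 2-D vectors; `cosine_similarity` and `top_k` are hypothetical helpers, not part of EmbedAnything.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    scores = [cosine_similarity(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]


# Toy vectors standing in for image embeddings. The first has a large norm,
# so it wins on raw dot product even though it points away from the query.
vectors = [[10.0, 3.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.0, 2.0]
best = top_k(query, vectors, k=2)
print(best)
```

With these toy vectors, a raw dot product would rank the first vector highest because of its length, while the cosine ranking prefers the vectors that point in the query's direction.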

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)
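
Since the audio pipeline expects .wav input, it can help to check a file before embedding it. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files, to report basic properties. `wav_info` and `write_silence` are hypothetical helpers for illustration; they are not part of EmbedAnything, and this does not reflect any specific sample-rate requirement of the library.

```python
import os
import tempfile
import wave


def wav_info(path: str) -> dict:
    """Return basic properties of a PCM WAV file; raises wave.Error otherwise."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate if rate else 0.0,
        }


def write_silence(path: str, seconds: int = 1, rate: int = 16000) -> None:
    """Write a mono, 16-bit silent WAV file to use as a test input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "silence.wav")
    write_silence(path)
    info = wav_info(path)

print(info)
```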

➡️ Usage for version 0.2

To use local embedding: we support Bert and Jina

import embed_anything
import numpy as np

data = embed_anything.embed_file("file_path.pdf", embeder="Bert")
embeddings = np.array([item.embedding for item in data])

For multimodal embedding: we support CLIP

Requirements: a directory of pictures you want to search. For example, test_files contains images of cats, dogs, etc.

import embed_anything
import numpy as np
from PIL import Image

data = embed_anything.embed_directory("directory_path", embeder="Clip")
embeddings = np.array([item.embedding for item in data])

query = ["photo of a dog"]
query_embedding = np.array(embed_anything.embed_query(query, embeder="Clip")[0].embedding)
# Rank the images by dot product with the query and open the best match.
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
Image.open(data[max_index].text).show()

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import JinaConfig, EmbedConfig, AudioDecoderConfig
import time

start_time = time.time()

# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder_config = AudioDecoderConfig(
    decoder_model_id="openai/whisper-tiny.en",
    decoder_revision="main",
    model_type="tiny-en",
    quantized=False,
)
jina_config = JinaConfig(
    model_id="jinaai/jina-embeddings-v2-small-en", revision="main", chunk_size=100
)

config = EmbedConfig(jina=jina_config, audio_decoder=audio_decoder_config)
data = embed_anything.embed_file(
    "test_files/audio/samples_hp0.wav", embeder="Audio", config=config
)
print(data[0].metadata)
end_time = time.time()
print("Time taken: ", end_time - start_time)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you to contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

Roadmap

One of the aims of EmbedAnything is to let AI engineers easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished. The formats and features below are supported today, and a few more remain to be done.
✅ Markdown, PDFs, and websites
✅ WAV files
✅ JPG, PNG, WebP
✅ Whisper for audio embeddings
✅ Custom model upload for anything available in Candle
✅ Custom chunk size
✅ Pinecone adapter, to save embeddings directly to Pinecone
✅ Zero-shot application
✅ Vector database integration via streaming adapters
✅ Refactoring for intuitive functions

Yet to be done
☑️ Chunk-wise streaming instead of file-wise streaming
☑️ Graph embedding: build DeepWalk embeddings (depth first) and word2vec
☑️ Video embedding
☑️ YOLO CLIP
☑️ More vector database adapters