Embed anything at lightning speed

Generate and stream your embeddings with a minimalist, lightning-fast framework built in Rust 🦀

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multisource, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies the process of generating embeddings from various sources and streaming them to a vector database.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and JINA.
  • Cloud Embedding Models: Supports OpenAI. Mistral and Cohere support is coming soon.
  • Multimodality: Works with text sources such as PDF, TXT, and MD files, images (JPG), and audio (.WAV).
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is handled through Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Scalable: Store embeddings in a vector database for easy retrieval and scalability.
  • Vector Streaming: Continuously create and stream embeddings when resources are limited.

💡What is Vector Streaming

Vector Streaming lets you generate embeddings for files and stream them as you go. If you have 10 GB of files, it can generate embeddings continuously, file by file (or chunk by chunk in the future), and store them in the vector database of your choice. This avoids holding all the embeddings in RAM at once.
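
The idea can be sketched in a few lines of plain Python. This is not EmbedAnything's adapter API: `fake_embed`, `stream_embeddings`, and `InMemoryStore` are hypothetical stand-ins. The sketch only illustrates that each file's embedding is handed to the store before the next file is processed, so the full set is never collected in one list by the embedding step.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Vector = List[float]


def fake_embed(text: str) -> Vector:
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def stream_embeddings(
    files: Iterable[Tuple[str, str]],
    embed: Callable[[str], Vector],
) -> Iterator[Tuple[str, Vector]]:
    """Yield (file_name, embedding) pairs one file at a time."""
    for name, content in files:
        # Only the current file's embedding is produced at this point.
        yield name, embed(content)


class InMemoryStore:
    """Hypothetical stand-in for a vector database client."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, Vector]] = []

    def upsert(self, name: str, vector: Vector) -> None:
        self.rows.append((name, vector))


files = [("a.txt", "hello"), ("b.txt", "vector streaming")]
store = InMemoryStore()
for name, vector in stream_embeddings(files, fake_embed):
    store.upsert(name, vector)  # written out before the next file is embedded

print(len(store.rows))
```

With a real database client in place of `InMemoryStore`, the embedding side only ever holds one file's worth of vectors.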

🦀 Why Embed Anything

➡️Faster execution.
➡️Memory management: Rust enforces memory safety at compile time, preventing many of the memory errors and crashes that can plague other languages.
➡️True multithreading
➡️Running language models or embedding models locally and efficiently
➡️Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️Lower memory usage.

⭐ Supported Models

We support a range of models that can run on Candle. We have listed a set of tested models below; if you have a specific use case, please mention it in an issue.

How to add custom model and chunk size

from embed_anything import EmbedConfig, JinaConfig

# model_id takes one of the custom links listed in the table below.
jina_config = JinaConfig(
    model_id="Custom link given below", revision="main", chunk_size=100
)
embed_config = EmbedConfig(jina=jina_config)

Model     Custom link
Jina      jinaai/jina-embeddings-v2-base-en
          jinaai/jina-embeddings-v2-small-en
Bert      sentence-transformers/all-MiniLM-L6-v2
          sentence-transformers/all-MiniLM-L12-v2
          sentence-transformers/paraphrase-MiniLM-L6-v2
Clip      openai/clip-vit-base-patch32
Whisper   Most OpenAI Whisper models from Hugging Face are supported.
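
The unit that `chunk_size` counts is not stated here, so treat the following as an illustration only. It assumes, purely for the sake of the example, that a chunk is a fixed number of whitespace-separated words. `chunk_words` is a hypothetical helper and not part of EmbedAnything.

```python
from typing import List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size words (assumed unit)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


chunks = chunk_words("one two three four five", chunk_size=2)
print(chunks)
```

Smaller chunks produce more embeddings per document, each covering less context; larger chunks do the opposite.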

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

Usage

To use local embedding (we support Bert and Jina):

import numpy as np
import embed_anything

# Embed each chunk of the PDF with the local Bert model.
data = embed_anything.embed_file("file_path.pdf", embeder="Bert")
embeddings = np.array([item.embedding for item in data])

For multimodal embedding (we support CLIP):

Requirements: a directory of pictures to search. For example, test_files contains images of cats, dogs, etc.

import numpy as np
from PIL import Image
import embed_anything

# Embed every image in the directory with CLIP.
data = embed_anything.embed_directory("directory_path", embeder="Clip")
embeddings = np.array([item.embedding for item in data])

# Embed the text query into the same space and pick the closest image.
query = ["photo of a dog"]
query_embedding = np.array(embed_anything.embed_query(query, embeder="Clip")[0].embedding)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
# This example treats `text` as the path of the matching image file.
Image.open(data[max_index].text).show()
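
A plain dot product ranks results by cosine similarity only when the vectors are unit-length. Whether the embeddings returned by the library are already normalized is not stated here, so a defensive variant normalizes before comparing. The following is a dependency-free sketch with made-up vectors; `cosine_similarity` and `best_match` are hypothetical helpers, not library functions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def best_match(embeddings: Sequence[Sequence[float]], query: Sequence[float]) -> int:
    """Return the index of the embedding most similar to the query."""
    if not embeddings:
        raise ValueError("embeddings must not be empty")
    scores = [cosine_similarity(e, query) for e in embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up vectors: the first is long but points away from the query.
embeddings = [[10.0, 0.0], [0.6, 0.8]]
query = [0.5, 0.9]

raw_dots = [sum(x * y for x, y in zip(e, query)) for e in embeddings]
raw_best = max(range(len(raw_dots)), key=raw_dots.__getitem__)
cos_best = best_match(embeddings, query)
print(raw_best, cos_best)
```

Here the raw dot product favors the long first vector, while cosine similarity picks the second, which points in nearly the same direction as the query.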

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import JinaConfig, EmbedConfig, AudioDecoderConfig
import time

start_time = time.time()

# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder_config = AudioDecoderConfig(
    decoder_model_id="openai/whisper-tiny.en",
    decoder_revision="main",
    model_type="tiny-en",
    quantized=False,
)
jina_config = JinaConfig(
    model_id="jinaai/jina-embeddings-v2-small-en", revision="main", chunk_size=100
)

config = EmbedConfig(jina=jina_config, audio_decoder=audio_decoder_config)
data = embed_anything.embed_file(
    "test_files/audio/samples_hp0.wav", embeder="Audio", config=config
)
print(data[0].metadata)
end_time = time.time()
print("Time taken: ", end_time - start_time)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you to contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

    Roadmap

    One of the aims of EmbedAnything is to let AI engineers easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished. These are the formats and features we support right now; a few more remain to be done.
    ✅ Markdown, PDFs, and websites
    ✅ WAV files
    ✅ JPG, PNG, WebP
    ✅ Whisper for audio embeddings
    ✅ Custom model upload, for anything that is available in Candle
    ✅ Custom chunk size
    ✅ Pinecone adapter, to save embeddings directly to Pinecone
    ✅ Zero-shot application
    ✅ Vector database integration via streaming adapters

    Yet to be done
    ☑️ Chunk-wise streaming instead of file-wise streaming
    ☑️ Graph embedding: build DeepWalk embeddings (depth-first) and word2vec

    ✔️ Code of Conduct:

    Please read our [Code of Conduct] to understand the expectations we have for all contributors participating in this project. By participating, you agree to abide by our Code of Conduct.

    Quick Start

    You can quickly get started with contributing by searching for issues with the labels "Good First Issue" or "Help Needed" in the [Issues Section]. If you think you can contribute, comment on the issue and we will assign it to you.

    To set up your development environment, please follow the steps below:

    1. Fork the repository from the dev branch. We don't allow direct contributions to main.

    Contributing Guidelines

    🔍 Reporting Bugs

    1. Write a title that describes the issue clearly and concisely, and add relevant labels.
    2. Provide a detailed description of the problem and the necessary steps to reproduce the issue.
    3. Include any relevant logs, screenshots, or other helpful information supporting the issue.
