Embed anything at lightning speed

Generate and stream your embeddings with a minimalist, lightning-fast framework built in Rust 🦀
EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multisource, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies the process of generating embeddings from various sources and streaming them to a vector database.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina.
  • Cloud Embedding Models: Supports OpenAI. Mistral and Cohere support is coming soon.
  • Multimodality: Works with text sources (PDF, txt, md), images (JPG), and audio (.wav).
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is handled through Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Scalable: Store embeddings in a vector database for easy retrieval and scalability.
  • Vector Streaming: Continuously create and stream embeddings when resources are limited.

💡What is Vector Streaming

Vector Streaming lets you generate embeddings for files and stream them as they are produced. If you have 10 GB of files, it continuously generates embeddings file by file (or chunk by chunk in the future) and stores them in the vector database of your choice, so the full set of embeddings never has to be held in RAM at once.
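The idea can be illustrated with a minimal, library-independent sketch. Everything named here (`toy_embed`, `InMemoryStore`, `run_pipeline`) is a hypothetical stand-in, not part of the EmbedAnything API: a generator yields one embedding per file, and each embedding is written to the store as soon as it is produced.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Vector = List[float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def stream_embeddings(
    files: Iterable[Tuple[str, str]],
    embed: Callable[[str], Vector],
) -> Iterator[Tuple[str, Vector]]:
    """Yield (file name, embedding) one file at a time instead of building a list."""
    for name, content in files:
        yield name, embed(content)


class InMemoryStore:
    """Hypothetical stand-in for a vector database client."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, Vector]] = []

    def upsert(self, name: str, vector: Vector) -> None:
        self.rows.append((name, vector))


def run_pipeline(files: Iterable[Tuple[str, str]], store: InMemoryStore) -> int:
    """Write each embedding to the store as soon as it is produced."""
    count = 0
    for name, vector in stream_embeddings(files, toy_embed):
        store.upsert(name, vector)  # only one embedding is held at a time
        count += 1
    return count
```

Because `stream_embeddings` is a generator, peak memory is bounded by a single file's embedding rather than by the whole corpus, which is the property the paragraph above describes.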

🦀 Why Embed Anything

➡️Faster execution.
➡️Memory management: Rust enforces memory safety at compile time, preventing many of the memory bugs and crashes that can plague other languages.
➡️True multithreading.
➡️Running language models or embedding models locally and efficiently.
➡️Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️Lower memory usage.

⭐ Supported Models

We support a range of models that Candle supports. A set of tested models is listed below; if you have a specific use case, please mention it in an issue.

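How to add a custom model and chunk size

from embed_anything import EmbedConfig, JinaConfig

# Set model_id to one of the Hugging Face model IDs listed below.
jina_config = JinaConfig(
    model_id="Custom link given below", revision="main", chunk_size=100
)
embed_config = EmbedConfig(jina=jina_config)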

Model    Custom link
Jina     jinaai/jina-embeddings-v2-base-en
         jinaai/jina-embeddings-v2-small-en
Bert     sentence-transformers/all-MiniLM-L6-v2
         sentence-transformers/all-MiniLM-L12-v2
         sentence-transformers/paraphrase-MiniLM-L6-v2
Clip     openai/clip-vit-base-patch32
Whisper  Most OpenAI Whisper models from Hugging Face are supported.
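
To give a feel for what a chunk size setting controls, here is a small standalone sketch. It is not the library's implementation: this README does not say which unit `chunk_size` counts, so splitting on whitespace-separated words is purely an assumption, and `chunk_words` is a hypothetical helper.

```python
from typing import List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words.

    Assumption: words are the unit. The real library may count something else.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


chunks = chunk_words("one two three four five", chunk_size=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Each chunk would then be embedded separately, so a smaller chunk size yields more, finer-grained vectors per document.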

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

Usage

To use local embedding (we support Bert and Jina):

import embed_anything
import numpy as np

# Embed the file, then stack the resulting vectors into one array.
data = embed_anything.embed_file("file_path.pdf", embeder="Bert")
embeddings = np.array([item.embedding for item in data])

For multimodal embedding, we support CLIP.

Requirements: a directory containing the pictures you want to search. For example, we have test_files with images of cats, dogs, etc.

import embed_anything
import numpy as np
from PIL import Image

data = embed_anything.embed_directory("directory_path", embeder="Clip")
embeddings = np.array([item.embedding for item in data])

# Embed the text query and score every image by dot product.
query = ["photo of a dog"]
query_embedding = np.array(embed_anything.embed_query(query, embeder="Clip")[0].embedding)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
# The text field of the best match is used here as the image file path.
Image.open(data[max_index].text).show()
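
The search above ranks images by raw dot product, which matches cosine similarity only when the vectors have unit length. Whether the returned embeddings are normalized is not stated here, so the following standalone sketch (toy vectors, hypothetical helper names, no EmbedAnything calls) shows how to rank by cosine similarity explicitly.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int = 1,
) -> List[Tuple[int, float]]:
    """Return the k best (index, score) pairs, highest cosine similarity first."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors: the first candidate is longer, the second points the same way as the query.
candidates = [[3.0, 0.0], [1.0, 1.0]]
best_index, best_score = top_k([1.0, 1.0], candidates, k=1)[0]
print(best_index)  # 1
```

With these toy vectors a raw dot product would favor the first candidate (3 versus 2) simply because it is longer, while cosine similarity picks the second, which points in the same direction as the query.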

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import JinaConfig, EmbedConfig, AudioDecoderConfig
import time

start_time = time.time()

# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder_config = AudioDecoderConfig(
    decoder_model_id="openai/whisper-tiny.en",
    decoder_revision="main",
    model_type="tiny-en",
    quantized=False,
)
jina_config = JinaConfig(
    model_id="jinaai/jina-embeddings-v2-small-en", revision="main", chunk_size=100
)

config = EmbedConfig(jina=jina_config, audio_decoder=audio_decoder_config)
data = embed_anything.embed_file(
    "test_files/audio/samples_hp0.wav", embeder="Audio", config=config
)
print(data[0].metadata)
end_time = time.time()
print("Time taken: ", end_time - start_time)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you to contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

    Roadmap

    One of the aims of EmbedAnything is to allow AI engineers to easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished; the formats and features below are supported today, and a few more are still to come.
    ✅ Markdown, PDFs, and websites
    ✅ WAV files
    ✅ JPG, PNG, WebP
    ✅ Whisper for audio embeddings
    ✅ Custom model upload for anything available in Candle
    ✅ Custom chunk size
    ✅ Pinecone adapter, to save embeddings directly to Pinecone
    ✅ Zero-shot application
    ✅ Vector database integration via streaming adapters

    Yet to be done
    ☑️ Chunk-wise streaming instead of file-wise streaming
    ☑️ Graph embedding: build DeepWalk embeddings depth-first, and word2vec
    ✔️ Code of Conduct:

    Please read our [Code of Conduct] to understand the expectations we have for all contributors participating in this project. By participating, you agree to abide by our Code of Conduct.

    Quick Start

    You can quickly get started with contributing by searching for issues with the labels "Good First Issue" or "Help Needed" in the [Issues Section]. If you think you can contribute, comment on the issue and we will assign it to you.

    To set up your development environment, please follow the steps below:

    1. Fork the repository from the dev branch. We don't allow direct contributions to main.

    Contributing Guidelines

    🔍 Reporting Bugs

    1. Write a title that describes the issue clearly and concisely, and add relevant labels.
    2. Provide a detailed description of the problem and the necessary steps to reproduce the issue.
    3. Include any relevant logs, screenshots, or other helpful information supporting the issue.

    Download files

    Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

    Source Distribution

    embed_anything-0.2.3.tar.gz (880.3 kB view details)

    Uploaded Source

    Built Distributions

    embed_anything-0.2.3-cp312-none-win_amd64.whl (11.0 MB view details)

    Uploaded CPython 3.12 Windows x86-64

    embed_anything-0.2.3-cp312-cp312-manylinux_2_34_x86_64.whl (14.3 MB view details)

    Uploaded CPython 3.12 manylinux: glibc 2.34+ x86-64

    embed_anything-0.2.3-cp312-cp312-macosx_11_0_arm64.whl (7.3 MB view details)

    Uploaded CPython 3.12 macOS 11.0+ ARM64

    embed_anything-0.2.3-cp312-cp312-macosx_10_12_x86_64.whl (7.5 MB view details)

    Uploaded CPython 3.12 macOS 10.12+ x86-64

    embed_anything-0.2.3-cp311-none-win_amd64.whl (11.0 MB view details)

    Uploaded CPython 3.11 Windows x86-64

    embed_anything-0.2.3-cp311-cp311-manylinux_2_34_x86_64.whl (14.4 MB view details)

    Uploaded CPython 3.11 manylinux: glibc 2.34+ x86-64

    embed_anything-0.2.3-cp311-cp311-macosx_11_0_arm64.whl (7.3 MB view details)

    Uploaded CPython 3.11 macOS 11.0+ ARM64

    embed_anything-0.2.3-cp311-cp311-macosx_10_12_x86_64.whl (7.6 MB view details)

    Uploaded CPython 3.11 macOS 10.12+ x86-64

    embed_anything-0.2.3-cp310-none-win_amd64.whl (11.0 MB view details)

    Uploaded CPython 3.10 Windows x86-64

    embed_anything-0.2.3-cp310-cp310-manylinux_2_34_x86_64.whl (14.4 MB view details)

    Uploaded CPython 3.10 manylinux: glibc 2.34+ x86-64

    embed_anything-0.2.3-cp310-cp310-macosx_11_0_arm64.whl (7.3 MB view details)

    Uploaded CPython 3.10 macOS 11.0+ ARM64

    embed_anything-0.2.3-cp39-none-win_amd64.whl (11.0 MB view details)

    Uploaded CPython 3.9 Windows x86-64

    embed_anything-0.2.3-cp39-cp39-manylinux_2_34_x86_64.whl (14.4 MB view details)

    Uploaded CPython 3.9 manylinux: glibc 2.34+ x86-64

    embed_anything-0.2.3-cp39-cp39-macosx_11_0_arm64.whl (7.3 MB view details)

    Uploaded CPython 3.9 macOS 11.0+ ARM64

    embed_anything-0.2.3-cp38-none-win_amd64.whl (11.0 MB view details)

    Uploaded CPython 3.8 Windows x86-64

    File details

    Details for the file embed_anything-0.2.3.tar.gz.

    File metadata

    • Download URL: embed_anything-0.2.3.tar.gz
    • Size: 880.3 kB
    • Tags: Source
    • Uploaded using Trusted Publishing? Yes
    • Uploaded via: maturin/1.7.1

    File hashes

    Hashes for embed_anything-0.2.3.tar.gz
    Algorithm Hash digest
    SHA256 d9479e19d0a9773d783a965372df1d388637d3b1db20b70e83baa3ac89bd8d72
    MD5 4670aa27bc80b8465db39d27d9b4a3d6
    BLAKE2b-256 2239a91f572ebc95d6c6dc27a737559c8b81bc55dcd34f00dd38c391ef761f0f

    See more details on using hashes here.

    Hashes for embed_anything-0.2.3-cp312-none-win_amd64.whl
    Algorithm Hash digest
    SHA256 278eaf25f5c18c1d69007a9ea00e1c4c0640b339230e6d4cfe524c3a241e54ae
    MD5 f364872b27a1882f819ee45420477b28
    BLAKE2b-256 10bdf15c2d6422adad434b95236976258d68575db262a7d02c7646483d5d15b1

    Hashes for embed_anything-0.2.3-cp312-cp312-manylinux_2_34_x86_64.whl
    Algorithm Hash digest
    SHA256 02e22fdd9cd7c701fe2eb6893d30ae4f2217819e240a00b32346a391e0872573
    MD5 32a0d40db7a995d3d346f00128747e0b
    BLAKE2b-256 f06d388a39c35a9a304b4bd0be31290a208e1b575c765262330b5cb543e0c3a6

    Hashes for embed_anything-0.2.3-cp312-cp312-macosx_11_0_arm64.whl
    Algorithm Hash digest
    SHA256 dc4f17b0d9356c56e356f28bd83db4fa788c164f4456e05d5854f479ed9fc050
    MD5 c7668393a337461de131a00a6abc8692
    BLAKE2b-256 ba9efefa71c19d78baf1c3e5891c30aaddb4fe2c4976c5c87ccd998a5ef2e29e

    Hashes for embed_anything-0.2.3-cp312-cp312-macosx_10_12_x86_64.whl
    Algorithm Hash digest
    SHA256 86d4de82643c9ce4ee0ebb4f7314e4e1345ab948a0268ab560e9411748e20299
    MD5 7121ae483573019590f8eba76c5e9eec
    BLAKE2b-256 473bf6f1996c6806dab64d8bb006425f6696f4b913dc327d896074b52ecc60e7

    Hashes for embed_anything-0.2.3-cp311-none-win_amd64.whl
    Algorithm Hash digest
    SHA256 200135dbb41c24c6eb7bf541107ebcbdb2e959bcf87ffa7b9aae55ee60379e3b
    MD5 cdcabde6e992ab6112c3b7d9f23ef6c0
    BLAKE2b-256 0e0dd95faea8ba2ba3deebc950fb31367e751655e59fbe02f81084123d53bbc3

    Hashes for embed_anything-0.2.3-cp311-cp311-manylinux_2_34_x86_64.whl
    Algorithm Hash digest
    SHA256 a8fa3d42d711955f1996d3c84efbfcfa92090db7c5c993a439ea6a700a652e29
    MD5 e781699e78a4e2c00db2cfdb78f1cad5
    BLAKE2b-256 58ead92e4b6be80f4efc4c8f28f455f810bbf1b59e1639ef046a3a9337a54d5a

    Hashes for embed_anything-0.2.3-cp311-cp311-macosx_11_0_arm64.whl
    Algorithm Hash digest
    SHA256 ba2040ab895a976b5f9389f5c4993df5982049f06bca9edaaff311588e351947
    MD5 edc425d44ee69aa7ab11ed7ccce2a354
    BLAKE2b-256 6b4cdff2f2ebaa537e6162edf7abc7a6f2d4c84cc52f11bc6b41dde29dac6a3a

    Hashes for embed_anything-0.2.3-cp311-cp311-macosx_10_12_x86_64.whl
    Algorithm Hash digest
    SHA256 2b7882f985b6f9fe13ec23ba2cd0b590a0b14ba082a15900438be83b98840dcf
    MD5 82e1fd6836949c93a540bde5978228c2
    BLAKE2b-256 90972c81c6f0eabe43ce7974375d2e70fec38efa33f23500db3b30a45c71fbee

    Hashes for embed_anything-0.2.3-cp310-none-win_amd64.whl
    Algorithm Hash digest
    SHA256 96255faf5c22a2f0f5efb74ef220dbba5d6b197ffa4c86a7de41442e70dfa1ba
    MD5 2d5970c9dba5c0322d474c2798489700
    BLAKE2b-256 a55fb0c7c39e5ea16f510b1c0d731edf0672b5019f9e41d12fdd8d9de11b1420

    Hashes for embed_anything-0.2.3-cp310-cp310-manylinux_2_34_x86_64.whl
    Algorithm Hash digest
    SHA256 24af594a4daada3ea7dc9b3119a6b2cae05f54051b9cc6b539605e3c38cca9f1
    MD5 1c4a3aa2c643d2d6ee25e45fe8a07532
    BLAKE2b-256 04b7b8581c261041549999af7bb02c1c3cfdb46ad47ff22de3011e560540491c

    Hashes for embed_anything-0.2.3-cp310-cp310-macosx_11_0_arm64.whl
    Algorithm Hash digest
    SHA256 e33887c2f9408e0ec3ce64394ad8d05dbfb5815df431f72e8ac09269cc5f0715
    MD5 5a05b72c09a818394a0be82bba1306e9
    BLAKE2b-256 af1941e3e2766d10a13bad2c2bd03b90ee864d65086ad0063c28ec0c9e20b7b1

    Hashes for embed_anything-0.2.3-cp39-none-win_amd64.whl
    Algorithm Hash digest
    SHA256 acb81d818a314cd4f39cf207b6bb97f4a538eb990048e317ea521bf108ccb564
    MD5 a0e2c01168458968ab8d70e45b5b0fb5
    BLAKE2b-256 137c5377cba829545def9e7f9b242b5e18126310b2c476ab593bd96b7674cb4d

    Hashes for embed_anything-0.2.3-cp39-cp39-manylinux_2_34_x86_64.whl
    Algorithm Hash digest
    SHA256 c18400e39fbbeebfe9efc7d51663b79bafafce26f87621d483edbbcda5531198
    MD5 4068867dff8e64f94605a1d6ad809741
    BLAKE2b-256 9d41f3b433f7e67c54cf9b7ca3ce036331da0718bd3249efbd43b2184f2a21b4

    Hashes for embed_anything-0.2.3-cp39-cp39-macosx_11_0_arm64.whl
    Algorithm Hash digest
    SHA256 cbf3b53caca473c1275271456a0b048b4f594811094b9c22a0f7719bfffe6aff
    MD5 188be8360630ef03a6a2109450e84e03
    BLAKE2b-256 b7a34dfefb7a201bc4adf2203d73baadb8007aaec22edfbef3600cdc24508672

    Hashes for embed_anything-0.2.3-cp38-none-win_amd64.whl
    Algorithm Hash digest
    SHA256 72efbad77be48dec6719e4f28b2a5f6b1727b72ffa79c05a38bdf6f5136c717d
    MD5 b6cc5de64370a4741c3ad43d3e6d72b7
    BLAKE2b-256 7efe2a2c22cd0fc42384cf04e2358308bece94d5d115116b3a101601604750e8
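
    A downloaded file can be checked against the SHA256 digests listed above with Python's standard library. This is a minimal sketch; `sha256_of_file` and `matches_expected` are hypothetical helper names, and the file path in the usage note is an assumption about where the file was saved.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path, block_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected_hex: str) -> bool:
    """Compare the file's digest with a published hex digest."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

    For example, `matches_expected("embed_anything-0.2.3.tar.gz", "<SHA256 from the list above>")` should return True for an intact download of the source distribution.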
