
Embed anything at lightning speed


Generate and stream your embeddings with a minimalist, lightning-fast framework built in Rust 🦀

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multisource, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies generating embeddings from various sources and streaming them to a vector database.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina.
  • Cloud Embedding Models: Supports OpenAI. Mistral and Cohere support is coming soon.
  • Multimodality: Works with text sources (PDF, .txt, .md), images (JPG), and audio (.wav).
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is handled by Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Scalable: Store embeddings in a vector database for easy retrieval and scalability.
  • Vector Streaming: Continuously create and stream embeddings when resources are limited.
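
The idea behind vector streaming can be sketched in plain Python: embed a small batch, hand it to the store, then move on, so that only one batch is held in memory at a time. This is only an illustration of the pattern, not EmbedAnything's implementation or API; `stream_embeddings`, `toy_embed`, and the in-memory `store` are hypothetical names invented for this sketch.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

Vector = List[float]


def stream_embeddings(
    chunks: Iterable[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 2,
) -> Iterator[List[Tuple[str, Vector]]]:
    """Yield (chunk, vector) pairs one batch at a time."""
    batch: List[str] = []
    for chunk in chunks:
        batch.append(chunk)
        if len(batch) == batch_size:
            yield list(zip(batch, embed_batch(batch)))
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield list(zip(batch, embed_batch(batch)))


def toy_embed(texts: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in for a real model: one 2-d vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


# A plain list stands in for a vector database in this sketch.
store: List[Tuple[str, Vector]] = []
for records in stream_embeddings(["a b", "cd", "e f g"], toy_embed):
    store.extend(records)  # a real adapter would upsert the batch here
```

Because `stream_embeddings` is a generator, the next batch is not embedded until the previous one has been consumed, which is what keeps peak memory bounded by the batch size.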

🦀 Why Embed Anything

➡️ Faster execution.
➡️ Memory management: Rust enforces memory safety at compile time, preventing many of the memory errors and crashes that can plague other languages.
➡️ True multithreading.
➡️ Runs language models and embedding models locally and efficiently.
➡️ Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️ Lower memory usage.

⭐ Supported Models

We support a range of models that Candle can run. The models listed below have been tested; if you have a specific use case, please mention it in an issue.

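How to add a custom model and chunk size

from embed_anything import JinaConfig, EmbedConfig

# model_id takes one of the custom links from the table below
jina_config = JinaConfig(
    model_id="Custom link given below", revision="main", chunk_size=100
)
embed_config = EmbedConfig(jina=jina_config)
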
Model     Custom link
Jina      jinaai/jina-embeddings-v2-base-en
          jinaai/jina-embeddings-v2-small-en
Bert      sentence-transformers/all-MiniLM-L6-v2
          sentence-transformers/all-MiniLM-L12-v2
          sentence-transformers/paraphrase-MiniLM-L6-v2
Clip      openai/clip-vit-base-patch32
Whisper   Most OpenAI Whisper models from Hugging Face are supported.
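
To give a feel for what a chunk size controls, here is a minimal sketch of fixed-size chunking. It assumes that `chunk_size` bounds the number of whitespace-separated words per chunk; the source does not say which unit the library actually counts, so treat that as an assumption. `chunk_words` is a hypothetical helper, not part of EmbedAnything.

```python
from typing import List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Assumption: words are the unit; the library may count something else.
chunks = chunk_words("one two three four five", chunk_size=2)
```

Smaller chunks give more, finer-grained vectors per document; larger chunks give fewer vectors that each summarize more text.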

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

Usage

To use local embedding (Bert and Jina are supported):

import numpy as np
import embed_anything

data = embed_anything.embed_file("file_path.pdf", embeder="Bert")
# Stack the per-chunk vectors into one array
embeddings = np.array([item.embedding for item in data])

For multimodal embedding (CLIP is supported):

Requirements: a directory containing the pictures you want to search. For example, test_files contains images of cats, dogs, etc.

import numpy as np
from PIL import Image
import embed_anything

# Embed every image in the directory with CLIP
data = embed_anything.embed_directory("directory_path", embeder="Clip")
embeddings = np.array([item.embedding for item in data])

# Embed the text query with the same model
query = ["photo of a dog"]
query_embedding = np.array(embed_anything.embed_query(query, embeder="Clip")[0].embedding)

# Score every image against the query and open the best match
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
Image.open(data[max_index].text).show()
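
The ranking step above uses a raw dot product, which matches cosine similarity only when the vectors have unit length; whether the returned embeddings are normalized is not stated here. The following minimal sketch, using only the standard library and made-up 2-d vectors, shows how the two scores can disagree and how to rank by cosine similarity instead. `cosine_similarity` and `top_k` are hypothetical helpers written for this illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scores = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors standing in for real image embeddings.
vectors = [[10.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 1.0]

ranked = top_k(query, vectors, k=2)
# Index of the winner under a raw dot product, for comparison.
dot_winner = max(
    range(len(vectors)),
    key=lambda i: sum(x * y for x, y in zip(query, vectors[i])),
)
```

Here the long vector at index 0 wins under the raw dot product, while the vector at index 1, which points in the same direction as the query, wins under cosine similarity.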

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import JinaConfig, EmbedConfig, AudioDecoderConfig
import time

start_time = time.time()

# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder_config = AudioDecoderConfig(
    decoder_model_id="openai/whisper-tiny.en",
    decoder_revision="main",
    model_type="tiny-en",
    quantized=False,
)
jina_config = JinaConfig(
    model_id="jinaai/jina-embeddings-v2-small-en", revision="main", chunk_size=100
)

config = EmbedConfig(jina=jina_config, audio_decoder=audio_decoder_config)
data = embed_anything.embed_file(
    "test_files/audio/samples_hp0.wav", embeder="Audio", config=config
)
print(data[0].metadata)
end_time = time.time()
print("Time taken: ", end_time - start_time)
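
The example above measures elapsed time with `time.time()`. If you time several stages, a small context manager keeps the measurement code out of the way. This is a sketch using only the standard library; `timed` and `timings` are hypothetical names, and `time.perf_counter()` is used because it is a monotonic clock intended for measuring short durations.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, results: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed stages are timed too.
        results[label] = time.perf_counter() - start


timings: Dict[str, float] = {}
with timed("embed", timings):
    sum(range(1000))  # placeholder for the embed_file call
```

After the block runs, `timings["embed"]` holds the duration in seconds.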

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you to contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

    Roadmap

    One of the aims of EmbedAnything is to let AI engineers easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished. These are the formats and features supported right now:
    ✅ Markdown, PDFs, and websites
    ✅ WAV files
    ✅ JPG, PNG, WebP
    ✅ Whisper for audio embeddings
    ✅ Custom model upload: anything that is available in Candle
    ✅ Custom chunk size
    ✅ Pinecone adapter, to save embeddings directly to Pinecone
    ✅ Zero-shot application

    Yet to be done
    ☑️ Vector database: add functionality to integrate with any vector database
    ☑️ Graph embedding: build DeepWalk embeddings (depth-first) with word2vec

    ✔️ Code of Conduct:

    Please read our [Code of Conduct] to understand the expectations we have for all contributors participating in this project. By participating, you agree to abide by our Code of Conduct.

    Quick Start

    You can quickly get started with contributing by searching for issues with the labels "Good First Issue" or "Help Needed" in the [Issues Section]. If you think you can contribute, comment on the issue and we will assign it to you.

    To set up your development environment, please follow the steps below:

    1. Fork the repository from the dev branch. We don't allow direct contributions to main.

    Contributing Guidelines

    🔍 Reporting Bugs

    1. Use a title that describes the issue clearly and concisely, and add relevant labels.
    2. Provide a detailed description of the problem and the necessary steps to reproduce the issue.
    3. Include any relevant logs, screenshots, or other helpful information supporting the issue.
