Embed anything at lightning speed

Generate and stream your embeddings with a minimalist, lightning-fast framework built in Rust 🦀
Explore the docs »

View Demo · Examples · Vector Streaming Adapters · Search in Audio Space

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multisource, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies generating embeddings from various sources and streaming them to a vector database.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina.
  • Cloud Embedding Models: Supports OpenAI and Cohere.
  • Multimodality: Works with text sources such as PDF, .txt, and .md files, images (.jpg), and audio (.wav).
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is handled through Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Scalable: Store embeddings in a vector database for easy retrieval and scalability.
  • Vector Streaming: Continuously create and stream embeddings when resources are limited.

💡What is Vector Streaming

Vector Streaming lets you generate embeddings for files and stream them as they are produced. For example, given a 10 GB file, it generates embeddings chunk by chunk (chunks can be segmented semantically) and stores them in the vector database of your choice. This avoids holding all of the embeddings in RAM at once.
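
To make the idea concrete, here is a minimal pure-Python sketch of chunk-by-chunk streaming. It does not use EmbedAnything's API: `chunk_text`, `stream_embeddings`, and `toy_embed` are hypothetical names introduced only for illustration, and `toy_embed` is a placeholder rather than a real model. The point it shows is that only one batch of chunks and vectors is alive at any time.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Vector = List[float]


def chunk_text(lines: Iterable[str], chunk_size: int) -> Iterator[str]:
    """Lazily yield chunks of at most `chunk_size` words."""
    buffer: List[str] = []
    for line in lines:
        for word in line.split():
            buffer.append(word)
            if len(buffer) == chunk_size:
                yield " ".join(buffer)
                buffer = []
    if buffer:  # flush the final, possibly shorter, chunk
        yield " ".join(buffer)


def stream_embeddings(
    chunks: Iterable[str],
    embed: Callable[[List[str]], List[Vector]],
    batch_size: int,
) -> Iterator[List[Tuple[str, Vector]]]:
    """Embed chunks batch by batch, yielding each batch as soon as it is ready."""
    batch: List[str] = []
    for chunk in chunks:
        batch.append(chunk)
        if len(batch) == batch_size:
            yield list(zip(batch, embed(batch)))
            batch = []
    if batch:
        yield list(zip(batch, embed(batch)))


def toy_embed(texts: List[str]) -> List[Vector]:
    """Placeholder embedder (assumption): word count and character count."""
    return [[float(len(t.split())), float(len(t))] for t in texts]


# Stand-in for a vector database: each batch is written as it arrives.
store: List[Tuple[str, Vector]] = []
lines = ["one two three four five", "six seven eight nine ten eleven"]
for batch in stream_embeddings(chunk_text(lines, chunk_size=4), toy_embed, batch_size=2):
    store.extend(batch)
```

In a real pipeline, `lines` would be a lazily read file handle and `store.extend` would be replaced by a write to the vector database, so memory use is bounded by the batch size rather than the file size.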

🦀 Why Embed Anything

➡️ Faster execution.
➡️ Memory management: Rust enforces memory safety at compile time, preventing many of the memory bugs and crashes that can plague other languages.
➡️ True multithreading.
➡️ Runs language models and embedding models locally and efficiently.
➡️ Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️ Lower memory usage.

⭐ Supported Models

We support a range of models that Candle can run. The models below have been tested; if you have a specific use case, please mention it in an issue.

How to add a custom model, chunk size, and semantic chunking

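import embed_anything
from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

# Load a model from Hugging Face by its model id.
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
# chunk_size and batch_size control how the text is split and batched.
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)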

Model     Custom link
Jina      jinaai/jina-embeddings-v2-base-en
          jinaai/jina-embeddings-v2-small-en
Bert      sentence-transformers/all-MiniLM-L6-v2
          sentence-transformers/all-MiniLM-L12-v2
          sentence-transformers/paraphrase-MiniLM-L6-v2
Clip      openai/clip-vit-base-patch32
Whisper   Most OpenAI Whisper models from Hugging Face are supported.

For Semantic Chunking

from embed_anything import EmbeddingModel, TextEmbedConfig, WhichModel

# Model used to embed the final chunks.
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L12-v2"
)

# Separate encoder used to decide where the chunk boundaries fall.
semantic_encoder = EmbeddingModel.from_pretrained_hf(
    WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en"
)
config = TextEmbedConfig(
    chunk_size=256,
    batch_size=32,
    splitting_strategy="semantic",
    semantic_encoder=semantic_encoder,
)
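
For readers new to the idea, the sketch below shows one common way semantic chunking can work: embed each sentence and start a new chunk wherever the similarity between neighbouring sentences drops below a threshold. This is an illustration of the general technique only, not necessarily how EmbedAnything implements `splitting_strategy="semantic"`. The names `semantic_chunks` and `toy_encode`, and the threshold of 0.5, are assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: List[str],
    encode: Callable[[List[str]], List[Vector]],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Group consecutive sentences; split where adjacent similarity < threshold."""
    if not sentences:
        return []
    vectors = encode(sentences)
    chunks: List[List[str]] = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])  # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return chunks


def toy_encode(sentences: List[str]) -> List[Vector]:
    """Placeholder encoder (assumption): counts two keywords per sentence."""
    return [
        [float(s.lower().count("cat")), float(s.lower().count("rust"))]
        for s in sentences
    ]


sentences = ["The cat sleeps.", "A cat purrs.", "Rust is fast.", "Rust is safe."]
chunks = semantic_chunks(sentences, toy_encode)
```

With a real semantic encoder in place of `toy_encode`, the threshold would need tuning against the encoder's similarity distribution.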

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

Usage

➡️ Usage for version 0.3 and later

Local embedding: Bert and Jina are supported.

import embed_anything
from embed_anything import EmbeddingModel, WhichModel

model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

For multimodal embedding: we support CLIP

Requirements: a directory containing the pictures you want to search. For example, test_files contains images of cats, dogs, etc.

import embed_anything
import numpy as np
from embed_anything import EmbedData
from PIL import Image

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
# Embed every image in the directory.
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([item.embedding for item in data])
# Embed the text query with the same model.
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
# Score each image against the query and open the best match.
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
# `text` is used as the image path here.
Image.open(data[max_index].text).show()
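
The example above ranks images with a raw dot product. That matches cosine similarity only when the vectors are unit-length, and whether the returned embeddings are already normalized is not stated here. As a hedged, library-independent sketch, the snippet below normalizes vectors before scoring. `normalize` and `rank_by_cosine` are hypothetical helper names, and the toy vectors are made up to show how the two scores can disagree.

```python
import math
from typing import List, Sequence


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def rank_by_cosine(
    embeddings: Sequence[Sequence[float]], query: Sequence[float]
) -> List[int]:
    """Return indices of `embeddings`, sorted from most to least similar."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(normalize(e), q)) for e in embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data (assumption): the first vector is long but points away from the query.
embeddings = [[10.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
query = [1.0, 1.0]

# Raw dot product rewards vector length, so index 0 wins.
raw_scores = [sum(a * b for a, b in zip(e, query)) for e in embeddings]
raw_best = max(range(len(raw_scores)), key=lambda i: raw_scores[i])

# Cosine similarity rewards direction only, so index 1 wins.
ranking = rank_by_cosine(embeddings, query)
```

If the embeddings are already unit-length, both approaches give the same ordering and the plain dot product is sufficient.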

Audio Embedding using Whisper

Requirements: audio files in .wav format.

import embed_anything
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
print(data[0].metadata)

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you contribute effectively. They are meant as guidelines, not strict rules. Use your best judgment, and feel free to propose changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

    🏎️ RoadMap

    One of the aims of EmbedAnything is to let AI engineers easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished; the formats listed below are supported today, and a few more remain to be done.

    🖼️ Modalities and Source

    We’re excited to share that we've expanded our platform to support multiple modalities, including:

    • Audio files
    • Markdowns
    • Websites
    • Images
    • Custom model uploads

    This gives you the flexibility to work with various data types all in one place! 🌐

    💜 Product

    We’ve rolled out some major updates in version 0.3 to improve both functionality and performance. Here’s what’s new:

    • Semantic Chunking: Optimized chunking strategy for better Retrieval-Augmented Generation (RAG) workflows.

    • Streaming for Efficient Indexing: We’ve introduced streaming for memory-efficient indexing in vector databases. Want to know more? Check out our article on this feature here: https://www.analyticsvidhya.com/blog/2024/09/vector-streaming/

    • Zero-Shot Applications: Explore our zero-shot application demos to see the power of these updates in action.

    • Intuitive Functions: Version 0.3 includes a complete refactor for more intuitive functions, making the platform easier to use.

    • Chunkwise Streaming: Instead of file-by-file streaming, we now support chunkwise streaming, allowing for more flexible and efficient data processing.

    Check out the latest release and see how these features can supercharge your generative AI pipeline! ✨

    🚀Where are we heading

    ⚙️ Performance

    We've received quite a few questions about why we're using Candle, so here's a quick explanation:

    One of the main reasons is that Candle doesn't require any specific ONNX format models, which means it can work seamlessly with any Hugging Face model. This flexibility has been a key factor for us. However, we also recognize that we’ve been compromising a bit on speed in favor of that flexibility.

    What’s next? To address this, we’re excited to announce ORT support alongside our existing Hugging Face framework, which brings:

    ➡️ Significantly faster performance

    • Stay tuned for these exciting updates! 🚀

    🫐Embeddings:

    Our infrastructure has supported multimodality from day one. It already covers websites, images, and audio, and we want to expand it further to:

    ☑️ Graph embedding: build DeepWalk embeddings (depth-first walks and word2vec)
    ☑️ Video embedding
    ☑️ YOLO CLIP

    🌊Expansion to other Vector Adapters

    We currently support a wide range of vector databases for streaming embeddings, including:

    • Elastic: thanks to the amazing and active Elastic team for the contribution
    • Weaviate
    • Pinecone
    • Qdrant

    But we're not stopping there! We're actively working to expand this list.

    Want to contribute? If you’d like to add support for your favorite vector database, we’d love to have your help! Check out our contribution.md for guidelines, or feel free to reach out directly at starlight-search@proton.me. Let's build something amazing together! 💡
