Embed anything at lightning speed

Supercharge your embedding pipeline with a minimalist and lightning-fast framework built in Rust 🦀
Explore the docs »

View Demo · Examples · Request Feature · Search in Audio Space

EmbedAnything is a minimalist yet highly performant, lightweight, lightning-fast, multisource, multimodal, and local embedding pipeline built in Rust. Whether you're working with text, images, audio, PDFs, websites, or other media, EmbedAnything simplifies the process of generating embeddings from various sources and storing them in a vector database.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. How to add custom model and chunk size

🚀 Key Features

  • Local Embedding: Works with local embedding models like BERT and Jina.
  • Cloud Embedding Models: Supports OpenAI. Mistral and Cohere support is coming soon.
  • MultiModality: Works with text sources (PDF, TXT, MD), images (JPG), and audio (WAV).
  • Rust: All file processing is done in Rust for speed and efficiency.
  • Candle: Hardware acceleration is handled through Candle.
  • Python Interface: Packaged as a Python library for seamless integration into your existing projects.
  • Scalable: Store embeddings in a vector database for easy retrieval and scalability.

🦀 Why Embed Anything

➡️Faster execution.
➡️Memory management: Rust enforces memory safety at compile time, helping prevent the memory errors and crashes that can plague other languages.
➡️True multithreading.
➡️Runs language models and embedding models locally and efficiently.
➡️Candle allows inference on CUDA-enabled GPUs right out of the box.
➡️Lower memory usage.

⭐ Supported Models

We support a range of models that can run on Candle. The models listed below have been tested; if you have a specific use case, please mention it in an issue.

How to add custom model and chunk size

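from embed_anything import EmbedConfig, JinaConfig

# Replace model_id with one of the model IDs from the table below.
jina_config = JinaConfig(
    model_id="Custom link given below", revision="main", chunk_size=100
)
embed_config = EmbedConfig(jina=jina_config)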

Model     Custom link
Jina      jinaai/jina-embeddings-v2-base-en
          jinaai/jina-embeddings-v2-small-en
Bert      sentence-transformers/all-MiniLM-L6-v2
          sentence-transformers/all-MiniLM-L12-v2
          sentence-transformers/paraphrase-MiniLM-L6-v2
Clip      openai/clip-vit-base-patch32
Whisper   Most OpenAI Whisper models on Hugging Face are supported.
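
To make the chunk_size setting more concrete, here is a minimal sketch of fixed-size chunking in plain Python. It is not EmbedAnything's implementation: the helper `chunk_words` is hypothetical, and it assumes that chunk_size counts whitespace-separated words, whereas the library may count characters or tokens instead.

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words.

    Hypothetical helper for illustration; assumes chunk_size is a word count.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


if __name__ == "__main__":
    sample = "EmbedAnything splits long documents into chunks before embedding them"
    for i, chunk in enumerate(chunk_words(sample, chunk_size=4)):
        print(i, chunk)
```

Smaller chunks produce more, finer-grained embeddings per document; larger chunks produce fewer embeddings that each cover more context.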

🧑‍🚀 Getting Started

💚 Installation

pip install embed-anything

Usage

To use local embedding: we support Bert and Jina

import embed_anything
import numpy as np

data = embed_anything.embed_file("file_path.pdf", embeder="Bert")
# Stack the embeddings of all returned items into one array.
embeddings = np.array([item.embedding for item in data])
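
The call above returns a list of items that expose an `.embedding` attribute; the later examples in this README also read `.text` and `.metadata` from them. As a sketch of how such results could be serialized before loading them into a vector database, the following standard-library example writes them as JSON Lines. `StandInEmbedData` and `to_jsonl` are hypothetical names that only mirror those three attributes; they are not part of the library, and the real attribute types may differ.

```python
import json
from dataclasses import dataclass, field


@dataclass
class StandInEmbedData:
    """Stand-in for one result item; mirrors only the attributes used in this README."""
    embedding: list[float]
    text: str
    metadata: dict = field(default_factory=dict)


def to_jsonl(items) -> str:
    """Serialize items to JSON Lines, one record per line."""
    lines = []
    for item in items:
        record = {
            "embedding": list(item.embedding),
            "text": item.text,
            "metadata": item.metadata,
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)


if __name__ == "__main__":
    items = [
        StandInEmbedData([0.1, 0.2], "first chunk", {"page": 1}),
        StandInEmbedData([0.3, 0.4], "second chunk"),
    ]
    print(to_jsonl(items))
```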

For multimodal embedding: we support CLIP

Requirements: a directory containing the pictures you want to search. For example, we have test_files with images of cats, dogs, etc.

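import embed_anything
import numpy as np
from PIL import Image

# Embed every image in the directory with CLIP.
data = embed_anything.embed_directory("directory_path", embeder="Clip")
embeddings = np.array([item.embedding for item in data])

# Embed the text query and score it against each image embedding.
query = ["photo of a dog"]
query_embedding = np.array(embed_anything.embed_query(query, embeder="Clip")[0].embedding)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)

# Open the best match; this example treats `text` as the image's file path.
Image.open(data[max_index].text).show()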
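
The search above ranks images by raw dot product, which matches cosine similarity only when the embeddings have unit length. This README does not say whether the CLIP embeddings are normalized, so the following standard-library sketch ranks by cosine similarity explicitly. The toy vectors and the helper names `cosine_similarity` and `rank_by_similarity` are illustrative assumptions, not part of the library.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings"; real CLIP vectors are much longer.
    image_embeddings = [[10.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
    query_embedding = [0.5, 0.5, 0.0]
    print(rank_by_similarity(query_embedding, image_embeddings))
```

In this toy data the first vector has the largest dot product with the query only because of its length; cosine similarity ranks the second vector first.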

For audio embedding: we support OpenAI Whisper

Requirements: audio files in .wav format.

import embed_anything
import time

start_time = time.time()
data = embed_anything.embed_file(
    "file_path.wav", embeder="Whisper-Bert"
)
print(data[0].metadata)
end_time = time.time()
print("Time taken: ", end_time - start_time)
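
The timing pattern above can be wrapped in a small reusable helper. This is a sketch using only the standard library: `timed` is a hypothetical name, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring short intervals. The `sum(range(...))` call is a stand-in workload where an `embed_file` call would go.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    """Print how long the enclosed block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f} s")


if __name__ == "__main__":
    with timed("stand-in workload"):
        total = sum(range(1_000_000))
```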

🚧 Contributing to EmbedAnything

First of all, thank you for taking the time to contribute to this project. We truly appreciate your contributions, whether it's bug reports, feature suggestions, or pull requests. Your time and effort are highly valued in this project. 🚀

This document provides guidelines and best practices to help you to contribute effectively. These are meant to serve as guidelines, not strict rules. We encourage you to use your best judgment and feel comfortable proposing changes to this document through a pull request.

  • Roadmap
  • Quick Start
  • Guidelines

    RoadMap

    One of the aims of EmbedAnything is to allow AI engineers to easily use state-of-the-art embedding models on typical files and documents. A lot has already been accomplished: the formats and features below are supported today, and a few more remain to be done.
    ✅ Markdown, PDFs, and websites
    ✅ WAV files
    ✅ JPG, PNG, WebP
    ✅ Whisper for audio embeddings
    ✅ Custom model upload, for anything that is available in Candle
    ✅ Custom chunk size
    ✅ Pinecone adapter, to save embeddings directly to Pinecone
    ✅ Zero-shot application

    Yet to be done
    ☑️ Vector database: add functionality to integrate with any vector database
    ☑️ Graph embedding: build DeepWalk embeddings (depth-first walks and word2vec)
    ☑️ Asynchronous chunks training

    ✔️ Code of Conduct:

    Please read our [Code of Conduct] to understand the expectations we have for all contributors participating in this project. By participating, you agree to abide by our Code of Conduct.

    Quick Start

    You can quickly get started with contributing by searching for issues with the labels "Good First Issue" or "Help Needed" in the [Issues Section]. If you think you can contribute, comment on the issue and we will assign it to you.

    To set up your development environment, follow the steps below:

    1. Fork the repository from the dev branch. We don't allow direct contributions to main.

    Contributing Guidelines

    🔍 Reporting Bugs

    1. Give the issue a clear, concise title and add relevant labels.
    2. Provide a detailed description of the problem and the necessary steps to reproduce the issue.
    3. Include any relevant logs, screenshots, or other helpful information supporting the issue.
