Generic Vectorizer

Generic Vectorizer is a high-performance, distributed text embedding and reranking system built with Python, gRPC, and ZeroMQ, based on FlagEmbedding. It provides efficient processing of text embedding and reranking tasks using state-of-the-art models.
Table of Contents

- Features
- Architecture
- Installation
- Usage
- Server Configuration
- Launching the Server
- Docker Support
- Contributing
- License
Features
- Distributed processing of text embedding and reranking tasks
- Support for multiple embedding and reranking models
- High-performance communication using gRPC and ZeroMQ
- Easy scaling of workers for different model types
- Configurable settings for fine-tuning performance
- Support for both dense and sparse embeddings
- Efficient chunking and aggregation of long texts
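The last feature above deserves a quick illustration. The README does not specify how long texts are split and aggregated internally, so the following is only a minimal sketch of the general idea; the whitespace chunking and mean pooling are assumptions made for clarity, not the library's actual strategy:

```python
# Illustrative sketch of chunk-and-aggregate embedding. NOT the library's
# actual implementation: whitespace chunking and mean pooling below are
# assumptions made for illustration.
from typing import Callable, List
import numpy as np

def embed_long_text(text: str,
                    embed_fn: Callable[[str], np.ndarray],
                    chunk_size: int = 512) -> np.ndarray:
    # Split the text into fixed-size chunks (a real tokenizer-aware
    # splitter would count tokens, not words).
    words = text.split()
    chunks: List[str] = [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ] or [text]
    # Embed each chunk independently, then mean-pool into a single vector.
    vectors = np.stack([embed_fn(chunk) for chunk in chunks])
    return vectors.mean(axis=0)
```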
Architecture
Generic Vectorizer consists of three main components: the gRPC server, the background workers, and the client. Here's a high-level overview of the architecture:
```mermaid
graph TD
    A[Client] -->|gRPC| B[gRPC Server]
    B -->|ZeroMQ| C[Broker]
    C -->|ZeroMQ| D[Embedding Router]
    C -->|ZeroMQ| E[Reranker Router]
    D -->|ZeroMQ| F[Embedding Worker 1]
    D -->|ZeroMQ| G[Embedding Worker 2]
    E -->|ZeroMQ| H[Reranker Worker]
    F -->|Process| I[Embedding Model 1]
    G -->|Process| J[Embedding Model 2]
    H -->|Process| K[Reranking Model]
```
- The client sends requests to the gRPC server.
- The gRPC server forwards requests to the broker using ZeroMQ.
- The broker distributes tasks to appropriate routers based on the task type.
- Routers manage communication with multiple workers.
- Workers process tasks using various models (embedding, reranking, etc.).
- Results are sent back through the same path to the client.
This architecture allows for efficient distribution of work and easy scaling of workers for different model types.
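To make the broker stage concrete, here is a minimal ZeroMQ proxy in the style of the diagram above. It is a sketch only: the socket types, addresses, and wiring are illustrative assumptions, not Generic Vectorizer's actual internals.

```python
# Minimal sketch of a ZeroMQ broker like the one in the diagram.
# Socket types and addresses are illustrative assumptions.
import zmq

def run_broker(frontend_addr: str = "tcp://*:5559",
               backend_addr: str = "tcp://*:5560") -> None:
    ctx = zmq.Context.instance()
    frontend = ctx.socket(zmq.ROUTER)  # receives tasks from the gRPC server
    backend = ctx.socket(zmq.DEALER)   # fans tasks out to model routers/workers
    frontend.bind(frontend_addr)
    backend.bind(backend_addr)
    # zmq.proxy shuttles messages in both directions until the context closes.
    zmq.proxy(frontend, backend)

if __name__ == "__main__":
    run_broker()
```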
Installation
- Clone the repository:

```bash
git clone https://github.com/yourusername/generic-vectorizer.git
cd generic-vectorizer
```

- Install the required dependencies:

```bash
pip install -r requirements.txt
```

- Compile the gRPC protobuf files:

```bash
./compile_grpc.sh
```
Usage
AsyncEmbeddingClient

The `AsyncEmbeddingClient` class provides the following methods:
- `get_embedding(text: str, embed_strategy: str, target_topic: str, chunk_size: int = 512, return_dense: bool = True, return_sparse: bool = False) -> Dict`
  Get the embedding for a single text.
- `get_batch_embedding(texts: List[str], embed_strategy: str, target_topic: str, chunk_size: int = 512, return_dense: bool = True, return_sparse: bool = False) -> List[Dict]`
  Get embeddings for a batch of texts.
- `get_rerank_scores(query: str, corpus: List[str], target_topic: str, normalize: bool = True) -> List[float]`
  Get rerank scores for a query and a list of documents.
Each method returns a dictionary or list of dictionaries containing the requested embeddings or scores.
For more detailed API information, including the structure of the request and response objects, please refer to the source code and comments in the `generic_vectorizer/client/client.py` file.
Using the Generic Vectorizer Client
Here's a basic example of how to use the Generic Vectorizer client:
```python
import asyncio
from generic_vectorizer.client import AsyncEmbeddingClient

async def main():
    client = AsyncEmbeddingClient(grpc_server_address="localhost:1200")

    # Get embedding for a single text
    embedding = await client.get_embedding(
        text="Your text here",
        target_topic='bge_m3',
        chunk_size=512,
        return_dense=True,
        return_sparse=True
    )
    print("Single text embedding:", embedding)

    # Get embeddings for a batch of texts
    batch_embeddings = await client.get_batch_embedding(
        texts=["Text 1", "Text 2", "Text 3"],
        target_topic='bge_m3',
        chunk_size=512,
        return_dense=True,
        return_sparse=True
    )
    print("Batch embeddings:", batch_embeddings)

    # Get rerank scores
    rerank_scores = await client.get_rerank_scores(
        query="Your query here",
        corpus=["Document 1", "Document 2", "Document 3"],
        target_topic='bge_reranker',
        normalize=True
    )
    print("Rerank scores:", rerank_scores)

asyncio.run(main())
```
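Because the client is asynchronous, independent requests can also be issued concurrently. The sketch below uses only the methods documented above together with `asyncio.gather`; the address and topic are placeholders carried over from the earlier example:

```python
import asyncio
from generic_vectorizer.client import AsyncEmbeddingClient

async def embed_many(texts):
    client = AsyncEmbeddingClient(grpc_server_address="localhost:1200")
    # Fire all requests concurrently instead of awaiting them one by one.
    tasks = [
        client.get_embedding(
            text=t,
            target_topic='bge_m3',
            chunk_size=512,
            return_dense=True
        )
        for t in texts
    ]
    # gather preserves input order and raises the first exception it sees.
    return await asyncio.gather(*tasks)

embeddings = asyncio.run(embed_many(["First text", "Second text", "Third text"]))
```

For texts that arrive together, `get_batch_embedding` is usually the better choice (a single round trip); `asyncio.gather` is useful when requests target different topics or arrive independently.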
Server Configuration
Generic Vectorizer can be configured using the `EmbedderModelConfig` class. Here's an example configuration:
```python
from generic_vectorizer import Vectorizer
from generic_vectorizer.typing import EmbedderModelConfig, EmbedderModelType

embedder_model_configs = [
    EmbedderModelConfig(
        embedder_model_type=EmbedderModelType.BGE_M3_EMBEDDING_MODEL,
        target_topic='bge_m3',
        nb_instances=3,
        options={
            'model_name_or_path': 'BAAI/bge-m3',
            'device': 'cuda:0'
        }
    ),
    EmbedderModelConfig(
        embedder_model_type=EmbedderModelType.BGE_RERANKER_MODEL,
        target_topic='bge_reranker',
        nb_instances=1,
        options={
            'model_name_or_path': 'BAAI/bge-reranker-v2-m3',
            'device': 'cpu'
        },
        zmq_tcp_address='tcp://*:8500'
    )
]

vectorizer = Vectorizer(
    grpc_server_address='[::]:1200',
    embedder_model_configs=embedder_model_configs,
    max_concurrent_requests=1024,
    request_timeout=30
)
vectorizer.listen()
```
Launching the Server
Generic Vectorizer now supports configuration via a JSON file. This allows for easy customization of server settings and model configurations.
Configuration File Structure
Create a `config.json` file with the following structure:
```json
{
    "grpc_server_address": "[::]:5000",
    "max_concurrent_requests": 1024,
    "request_timeout": 30,
    "embedder_model_configs": [
        {
            "embedder_model_type": "BGE_M3_EMBEDDING_MODEL",
            "target_topic": "bge_m3",
            "nb_instances": 3,
            "options": {
                "model_name_or_path": "BAAI/bge-m3",
                "device": "cuda:0"
            }
        },
        {
            "embedder_model_type": "BGE_RERANKER_MODEL",
            "target_topic": "bge_reranker",
            "nb_instances": 1,
            "options": {
                "model_name_or_path": "BAAI/bge-reranker-v2-m3",
                "device": "cpu"
            },
            "zmq_tcp_address": "tcp://*:8500"
        }
    ]
}
```
To launch the Generic Vectorizer server with your configuration:
```bash
python -m generic_vectorizer launch-engine --config path/to/your/config.json
```
This command will read the configuration from the specified JSON file and start the server with the provided settings.
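If you want to embed the server in your own process rather than using the CLI, the same JSON file can be turned into the objects from the configuration section above. This is a sketch under two assumptions: that `EmbedderModelConfig` accepts exactly the keys shown in `config.json`, and that `EmbedderModelType` members can be looked up by the names used in the file.

```python
# Sketch: building a Vectorizer from config.json programmatically.
# Assumes EmbedderModelConfig accepts the keys shown above and that
# EmbedderModelType members match the strings in the file.
import json
from generic_vectorizer import Vectorizer
from generic_vectorizer.typing import EmbedderModelConfig, EmbedderModelType

with open("config.json") as f:
    cfg = json.load(f)

model_configs = []
for mc in cfg["embedder_model_configs"]:
    kwargs = dict(mc)
    # Map the string in the file to the corresponding enum member.
    kwargs["embedder_model_type"] = EmbedderModelType[mc["embedder_model_type"]]
    model_configs.append(EmbedderModelConfig(**kwargs))

vectorizer = Vectorizer(
    grpc_server_address=cfg["grpc_server_address"],
    embedder_model_configs=model_configs,
    max_concurrent_requests=cfg["max_concurrent_requests"],
    request_timeout=cfg["request_timeout"],
)
vectorizer.listen()
```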
Docker Support
Generic Vectorizer can be run in Docker containers, with support for both GPU and CPU environments.
Building Docker Images
- For GPU support:

```bash
docker build -f Dockerfile.gpu -t generic-vectorizer:gpu .
```

- For CPU-only:

```bash
docker build -f Dockerfile.cpu -t generic-vectorizer:cpu .
```
Running Docker Containers
To run the Generic Vectorizer in a Docker container, you need to mount your configuration file as a volume.
- For GPU support:

```bash
docker run --gpus all -v /path/to/your/config.json:/home/solver/config.json -p 5000:5000 generic-vectorizer:gpu launch-engine --config /home/solver/config.json
```

- For CPU-only:

```bash
docker run -v /path/to/your/config.json:/home/solver/config.json -p 5000:5000 generic-vectorizer:cpu launch-engine --config /home/solver/config.json
```
Replace `/path/to/your/config.json` with the actual path to your configuration file on the host machine.
Note: The GPU version requires the NVIDIA Container Toolkit to be installed on your host system.
Docker Compose (Optional)
For easier management, you can use Docker Compose. Create a `docker-compose.yml` file:
```yaml
version: '3'
services:
  generic-vectorizer:
    image: generic-vectorizer:gpu  # or generic-vectorizer:cpu for CPU-only
    ports:
      - "5000:5000"
    volumes:
      - ./config.json:/home/solver/config.json
    command: launch-engine --config /home/solver/config.json
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
Then run:

```bash
docker-compose up
```
This setup assumes your `config.json` is in the same directory as the `docker-compose.yml` file.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.