A Python library for building RAG pipelines with Clickhouse.

ClickhouseRAG

ClickhouseRAG is a Python package for efficient data access and management in Clickhouse. It provides an easy-to-use interface for connecting to Clickhouse, executing queries, and managing tables, with support for vectorizers and Retrieval-Augmented Generation (RAG) operations.

Features

  • Easy Clickhouse Connection: Seamlessly connect to your Clickhouse database.
  • Table Management: Effortlessly manage tables with CRUD operations.
  • Vectorization: Integrate with vectorizers for text and data embedding.
  • RAG Operations: Perform Retrieval-Augmented Generation tasks.
  • Backup and Restore: Back up your database to a file and restore it easily.
  • Cosine Similarity Search: Search data based on cosine similarity.

Installation

You can install ClickhouseRAG via pip:

pip install clickhouserag

Usage

Connecting to Clickhouse

Create a client to connect to your Clickhouse database.

from clickhouserag.data_access.clickhouse_client import ClickhouseConnectClient

client = ClickhouseConnectClient(
    host="localhost",
    port=9000,
    username="default",
    password="",
    database="default"
)
client.connect()

Defining Table Schema

Define the schema for your table in Clickhouse.

table_schema = {
    "id": "UInt32",
    "title": "String",
    "vector": "Array(Float64)"
}
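The vector column stores the embeddings; with the distilbert-base-uncased model used below, it will hold 768-dimensional vectors.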

Managing Tables

Create an instance of RAGManager to manage your table with the specified engine and schema.

from clickhouserag.rag.manager import RAGManager

rag_manager = RAGManager(client, "rag_table", table_schema, engine="MergeTree", order_by="id")

Creating and Adding a Vectorizer

Create and add a Transformers vectorizer to the RAGManager.

from typing import Any, List

import torch
from transformers import AutoModel, AutoTokenizer

from clickhouserag.vectorizers.base import VectorizerBase

class TransformersVectorizer(VectorizerBase):
    """Vectorizer that uses a Transformers model to convert text to vectors."""
    
    def __init__(self, model_name: str) -> None:
        """Initialize the TransformersVectorizer."""
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
    
    def vectorize(self, data: Any) -> List[float]:
        """Convert text data into a vector representation using a Transformers model."""
        if not isinstance(data, str):
            raise ValueError("Data should be a string for text vectorization.")
        
        inputs = self.tokenizer(data, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
            vector = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
        
        return vector

    def bulk_vectorize(self, data: Any) -> List[List[float]]:
        """Convert a list of text strings into vector representations using a Transformers model."""
        if not isinstance(data, list) or not all(isinstance(item, str) for item in data):
            raise ValueError("Data should be a list of strings for text vectorization.")

        inputs = self.tokenizer(
            data, return_tensors="pt", truncation=True, padding=True
        )
        with torch.no_grad():
            outputs = self.model(**inputs)
            # Mean-pool token embeddings per input, keeping one vector per text in the batch.
            vectors = outputs.last_hidden_state.mean(dim=1).tolist()

        return vectors

transformers_vectorizer = TransformersVectorizer(model_name="distilbert-base-uncased")
rag_manager.add_vectorizer("transformers", transformers_vectorizer)
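If you want to sanity-check the embeddings before inserting data, you can call the vectorizer directly; distilbert-base-uncased produces 768-dimensional vectors, matching the Array(Float64) column defined above.

sample_vector = transformers_vectorizer.vectorize("Sample text data for transformers")
print("Vector length:", len(sample_vector))  # 768 for distilbert-base-uncased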

Adding Data with Vectorization

Add individual data records with vectorization through Transformers.

data = {"id": 1, "title": "Sample text data for transformers"}
rag_manager.add_data(data, vectorizer_name="transformers")

Bulk Adding Data with Vectorization

Add multiple data records with vectorization through Transformers.

bulk_data = [
    {"id": 2, "title": "Sample text data 1 for transformers"},
    {"id": 3, "title": "Sample text data 2 for transformers"},
    {"id": 4, "title": "Sample text data 3 for transformers"}
]
rag_manager.add_bulk_data(bulk_data, vectorizer_name="transformers")

Retrieving Data by ID

Retrieve data from the RAG by ID.

data = rag_manager.get_data(1)
print("Data with ID 1:", data)

Updating Data with Vectorization

Update data with vectorization through Transformers.

updated_data = {"id": 1, "title": "Updated text data for transformers"}
rag_manager.update_data(1, updated_data, vectorizer_name="transformers")

Executing Text Search

Perform a text search on the RAG table with a SQL query.

query = "SELECT * FROM rag_table WHERE title LIKE '%Sample%'"
search_results = rag_manager.search(query)
print("Search results:", search_results)

Executing Cosine Similarity Search

Perform a cosine similarity search on the RAG.

import numpy as np

embedding = np.random.rand(768)  # Example random 768-dimensional vector (distilbert-base-uncased hidden size)
similarity_results = rag_manager.similarity_search(embedding, top_k=2, columns=["id", "title"])
print("Similarity search results:", similarity_results)

Deleting Data

Delete data from the RAG by ID.

rag_manager.delete_data(1)

Backing Up the Database

Back up the database to a JSON file.

rag_manager.backup_database("backup.json")

Resetting and Restoring the Database

Reset and restore the database from a backup file.

rag_manager.reset_database()
rag_manager.restore_database("backup.json", table_schema=table_schema)
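To confirm the restore worked, you can read a record back by ID (the record with ID 1 was deleted earlier in this walkthrough, so it is not in the backup):

restored = rag_manager.get_data(2)
print("Restored record with ID 2:", restored)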

Closing the Database Connection

Close the connection to the Clickhouse database.

client.close()

Contributing

Contributions are welcome! Please read the contribution guidelines first.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any questions or inquiries, please contact Leonid Chesnikov at leonid.chesnikov@gmail.com.

Project Structure

  • clickhouserag.data_access: Contains modules for managing Clickhouse connections and tables.
  • clickhouserag.rag: Contains modules for RAG operations and vectorizer integration.
  • clickhouserag.vectorizers: Contains the VectorizerBase interface for implementing custom vectorizers.

Requirements

  • clickhouse-driver
  • numpy

These dependencies are installed automatically when you install the package via pip. The Transformers vectorizer example above additionally requires torch and transformers, which are not installed with the package.

Development

To contribute to this project, follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Make your changes and commit them (git commit -am 'Add new feature').
  4. Push to the branch (git push origin feature-branch).
  5. Create a new Pull Request.

We appreciate your contributions and efforts in improving this project!

Keywords

  • Clickhouse
  • Data Access
  • Table Management
  • Vectorizer
  • RAG (Retrieval-Augmented Generation)

