A Python library for building RAG pipelines with Clickhouse.

ClickhouseRAG

ClickhouseRAG is a Python package for efficient data access and management in Clickhouse. It provides an easy-to-use interface for connecting to Clickhouse, executing queries, and managing tables, with support for vectorizers and Retrieval-Augmented Generation (RAG) operations.

Features

  • Easy Clickhouse Connection: Seamlessly connect to your Clickhouse database.
  • Table Management: Effortlessly manage tables with CRUD operations.
  • Vectorization: Integrate with vectorizers for text and data embedding.
  • RAG Operations: Perform Retrieval-Augmented Generation tasks.
  • Backup and Restore: Back up your database to a file and restore it easily.
  • Cosine Similarity Search: Search data based on cosine similarity.

Installation

You can install ClickhouseRAG via pip:

pip install clickhouserag

Usage

Connecting to Clickhouse

Create a client to connect to your Clickhouse database.

from clickhouserag.data_access.clickhouse_client import ClickhouseConnectClient

client = ClickhouseConnectClient(
    host="localhost",
    port=9000,
    username="default",
    password="",
    database="default"
)
client.connect()

Defining Table Schema

Define the schema for your table in Clickhouse.

table_schema = {
    "id": "UInt32",
    "title": "String",
    "vector": "Array(Float64)"
}
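The vector column stores the embeddings; with the distilbert-base-uncased model used below, it will hold 768-dimensional vectors.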

Managing Tables

Create an instance of RAGManager to manage your table with the specified engine and schema.

from clickhouserag.rag.manager import RAGManager

rag_manager = RAGManager(client, "rag_table", table_schema, engine="MergeTree", order_by="id")

Creating and Adding a Vectorizer

Create and add a Transformers vectorizer to the RAGManager.

from typing import Any, List

import torch
from transformers import AutoModel, AutoTokenizer

from clickhouserag.vectorizers.base import VectorizerBase

class TransformersVectorizer(VectorizerBase):
    """Vectorizer that uses a Transformers model to convert text to vectors."""
    
    def __init__(self, model_name: str) -> None:
        """Initialize the TransformersVectorizer."""
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
    
    def vectorize(self, data: Any) -> List[float]:
        """Convert text data into a vector representation using a Transformers model."""
        if not isinstance(data, str):
            raise ValueError("Data should be a string for text vectorization.")
        
        inputs = self.tokenizer(data, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
            vector = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
        
        return vector

    def bulk_vectorize(self, data: Any) -> List[List[float]]:
        """Convert a list of text strings into vector representations using a Transformers model."""
        if not isinstance(data, list) or not all(isinstance(item, str) for item in data):
            raise ValueError("Data should be a list of strings for text vectorization.")

        inputs = self.tokenizer(
            data, return_tensors="pt", truncation=True, padding=True
        )
        with torch.no_grad():
            outputs = self.model(**inputs)
            # Mean-pool token embeddings per input, keeping one vector per text in the batch.
            vectors = outputs.last_hidden_state.mean(dim=1).tolist()

        return vectors

transformers_vectorizer = TransformersVectorizer(model_name="distilbert-base-uncased")
rag_manager.add_vectorizer("transformers", transformers_vectorizer)
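If you want to sanity-check the embeddings before inserting data, you can call the vectorizer directly; distilbert-base-uncased produces 768-dimensional vectors, matching the Array(Float64) column defined above.

sample_vector = transformers_vectorizer.vectorize("Sample text data for transformers")
print("Vector length:", len(sample_vector))  # 768 for distilbert-base-uncased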

Adding Data with Vectorization

Add individual data records with vectorization through Transformers.

data = {"id": 1, "title": "Sample text data for transformers"}
rag_manager.add_data(data, vectorizer_name="transformers")

Bulk Adding Data with Vectorization

Add multiple data records with vectorization through Transformers.

bulk_data = [
    {"id": 2, "title": "Sample text data 1 for transformers"},
    {"id": 3, "title": "Sample text data 2 for transformers"},
    {"id": 4, "title": "Sample text data 3 for transformers"}
]
rag_manager.add_bulk_data(bulk_data, vectorizer_name="transformers")

Retrieving Data by ID

Retrieve data from the RAG by ID.

data = rag_manager.get_data(1)
print("Data with ID 1:", data)

Updating Data with Vectorization

Update data with vectorization through Transformers.

updated_data = {"id": 1, "title": "Updated text data for transformers"}
rag_manager.update_data(1, updated_data, vectorizer_name="transformers")

Executing Text Search

Perform a text search on the RAG table with a SQL query.

query = "SELECT * FROM rag_table WHERE title LIKE '%Sample%'"
search_results = rag_manager.search(query)
print("Search results:", search_results)

Executing Cosine Similarity Search

Perform a cosine similarity search on the RAG.

import numpy as np

embedding = np.random.rand(768)  # Example random 768-dimensional vector (distilbert-base-uncased hidden size)
similarity_results = rag_manager.similarity_search(embedding, top_k=2, columns=["id", "title"])
print("Similarity search results:", similarity_results)

Deleting Data

Delete data from the RAG by ID.

rag_manager.delete_data(1)

Backing Up the Database

Back up the database to a JSON file.

rag_manager.backup_database("backup.json")

Resetting and Restoring the Database

Reset and restore the database from a backup file.

rag_manager.reset_database()
rag_manager.restore_database("backup.json", table_schema=table_schema)
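To confirm the restore worked, you can read a record back by ID (the record with ID 1 was deleted earlier in this walkthrough, so it is not in the backup):

restored = rag_manager.get_data(2)
print("Restored record with ID 2:", restored)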

Closing the Database Connection

Close the connection to the Clickhouse database.

client.close()

Contributing

Contributions are welcome! Please read the contribution guidelines first.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any questions or inquiries, please contact Leonid Chesnikov at leonid.chesnikov@gmail.com.

Project Structure

  • clickhouserag.data_access: Contains modules for managing Clickhouse connections and tables.
  • clickhouserag.rag: Contains modules for RAG operations and vectorizer integration.
  • clickhouserag.vectorizers: Contains the VectorizerBase interface for implementing custom vectorizers.

Requirements

  • clickhouse-driver
  • numpy

These dependencies are installed automatically when you install the package via pip. The Transformers vectorizer example above additionally requires torch and transformers, which are not installed with the package.

Development

To contribute to this project, follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Make your changes and commit them (git commit -am 'Add new feature').
  4. Push to the branch (git push origin feature-branch).
  5. Create a new Pull Request.

We appreciate your contributions and efforts in improving this project!

Keywords

  • Clickhouse
  • Data Access
  • Table Management
  • Vectorizer
  • RAG (Retrieval-Augmented Generation)

