A Python library for creating RAG with Clickhouse.
Project description
ClickhouseRAG
ClickhouseRAG is a Python package designed for efficient data access and management in Clickhouse. It provides an easy-to-use interface for connecting to Clickhouse, executing queries, and managing tables with support for Vectorizers and Retrieval-Augmented Generation (RAG) operations.
Features
- Easy Clickhouse Connection: Seamlessly connect to your Clickhouse database.
- Table Management: Effortlessly manage tables with CRUD operations.
- Vectorization: Integrate with vectorizers for text and data embedding.
- RAG Operations: Perform Retrieval-Augmented Generation tasks.
- Backup and Restore: Backup your database to a file and restore it easily.
- Cosine Similarity Search: Search data based on cosine similarity.
Installation
You can install ClickhouseRAG via pip:
pip install clickhouserag
Usage
Connecting to Clickhouse
Create a client to connect to your Clickhouse database.
from clickhouserag.data_access.clickhouse_client import ClickhouseConnectClient
client = ClickhouseConnectClient(
host="localhost",
port=9000,
username="default",
password="",
database="default"
)
client.connect()
Defining Table Schema
Define the schema for your table in Clickhouse.
table_schema = {
"id": "UInt32",
"title": "String",
"vector": "Array(Float64)"
}
Managing Tables
Create an instance of RAGManager
to manage your table with the specified engine and schema.
from clickhouserag.rag.manager import RAGManager
rag_manager = RAGManager(client, "rag_table", table_schema, engine="MergeTree", order_by="id")
Creating and Adding Vectorizer
Create and add a Transformers vectorizer to the RAGManager.
import torch
from transformers import AutoModel, AutoTokenizer
from clickhouserag.vectorizers.base import VectorizerBase
class TransformersVectorizer(VectorizerBase):
"""Vectorizer that uses a Transformers model to convert text to vectors."""
def __init__(self, model_name: str) -> None:
"""Initialize the TransformersVectorizer."""
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModel.from_pretrained(model_name)
def vectorize(self, data: Any) -> List[float]:
"""Convert text data into a vector representation using a Transformers model."""
if not isinstance(data, str):
raise ValueError("Data should be a string for text vectorization.")
inputs = self.tokenizer(data, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
outputs = self.model(**inputs)
vector = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
return vector
def bulk_vectorize(self, data: Any) -> List[List[float]]:
"""Convert listed text data into a vector representation using a Transformers model."""
if not isinstance(data, List[str]):
raise ValueError("Data should be a list of a strings for text vectorization.")
inputs = self.tokenizer(
data, return_tensors="pt", truncation=True, padding=True
)
with torch.no_grad():
outputs = self.model(**inputs)
vector = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
return vector
transformers_vectorizer = TransformersVectorizer(model_name="distilbert-base-uncased")
rag_manager.add_vectorizer("transformers", transformers_vectorizer)
Adding Data with Vectorization
Add individual data records with vectorization through Transformers.
data = {"id": 1, "title": "Sample text data for transformers"}
rag_manager.add_data(data, vectorizer_name="transformers")
Bulk Adding Data with Vectorization
Add multiple data records with vectorization through Transformers.
bulk_data = [
{"id": 2, "title": "Sample text data 1 for transformers"},
{"id": 3, "title": "Sample text data 2 for transformers"},
{"id": 4, "title": "Sample text data 3 for transformers"}
]
rag_manager.add_bulk_data(bulk_data, vectorizer_name="transformers")
Retrieving Data by ID
Retrieve data from the RAG by ID.
data = rag_manager.get_data(1)
print("Data with ID 1:", data)
Updating Data with Vectorization
Update data with vectorization through Transformers.
updated_data = {"id": 1, "title": "Updated text data for transformers"}
rag_manager.update_data(1, updated_data, vectorizer_name="transformers")
Executing Text Search
Perform a text search on the RAG.
query = "SELECT * FROM rag_table WHERE title LIKE '%Sample%'"
search_results = rag_manager.search(query)
print("Search results:", search_results)
Executing Cosine Similarity Search
Perform a cosine similarity search on the RAG.
import numpy as np
embedding = np.random.rand(768) # Example random vector
similarity_results = rag_manager.similarity_search(embedding, top_k=2, columns=["id", "title"])
print("Similarity search results:", similarity_results)
Deleting Data
Delete data from the RAG by ID.
rag_manager.delete_data(1)
Backing Up the Database
Backup the database to a JSON file.
rag_manager.backup_database("backup.json")
Resetting and Restoring the Database
Reset and restore the database from a backup file.
rag_manager.reset_database()
rag_manager.restore_database("backup.json", table_schema=table_schema)
Closing the Database Connection
Close the connection to the Clickhouse database.
client.close()
Contributing
Contributions are welcome! Please read the contribution guidelines first.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
For any questions or inquiries, please contact Leonid Chesnikov at leonid.chesnikov@gmail.com.
Project Structure
- clickhouserag.data_access: Contains modules for managing Clickhouse connections and tables.
- clickhouserag.rag: Contains modules for RAG operations and vectorizers.
Requirements
clickhouse-driver
numpy
These dependencies are automatically installed when you install the package via pip.
Development
To contribute to this project, follow these steps:
- Fork the repository.
- Create a new branch (
git checkout -b feature-branch
). - Make your changes and commit them (
git commit -am 'Add new feature'
). - Push to the branch (
git push origin feature-branch
). - Create a new Pull Request.
We appreciate your contributions and efforts in improving this project!
Keywords
- Clickhouse
- Data Access
- Table Management
- Vectorizer
- RAG (Retrieval-Augmented Generation)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file clickhouserag-0.2.0.tar.gz
.
File metadata
- Download URL: clickhouserag-0.2.0.tar.gz
- Upload date:
- Size: 11.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.9 Linux/6.5.0-1023-azure
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | a57ede469004c48305e38dfb01fd0371675e6d8a3cf6cf71bf7ca080032a5346 |
|
MD5 | afbbb9b9b85a1ce1e1db6ca6a8ee859d |
|
BLAKE2b-256 | 95ce72c94b2e14739d2171e1253fa018757a13b7e36d7440c7ec6fa42e4c3000 |
File details
Details for the file clickhouserag-0.2.0-py3-none-any.whl
.
File metadata
- Download URL: clickhouserag-0.2.0-py3-none-any.whl
- Upload date:
- Size: 14.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.9 Linux/6.5.0-1023-azure
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | a6d9ad0affe27a9c1150da907adec6404e402586d4de5cf299a3cf2461d3f0ca |
|
MD5 | dba71daa3a0704f92cb6a7159c10f030 |
|
BLAKE2b-256 | b67792d1f266e329c2b0091c9a55e3c02f6c043acda00f864ac8e4306ced7b5f |