An integration package connecting ClickZetta and LangChain
LangChain ClickZetta
🚀 Enterprise-grade LangChain integration for ClickZetta - Unlock the power of cloud-native lakehouse with AI-driven SQL queries, high-performance vector search, and intelligent full-text retrieval in a unified platform.
📖 Table of Contents
- Why ClickZetta + LangChain?
- Core Features
- Installation
- Quick Start
- Storage Services
- Comparison with Alternatives
- Advanced Usage
- Testing
- Development
- Contributing
🚀 Why ClickZetta + LangChain?
🏆 Unique Advantages
1. Native Lakehouse Architecture
- ClickZetta's cloud-native lakehouse is reported to deliver up to 10x the performance of traditional Spark-based architectures
- Unified storage and compute for all data types (structured, semi-structured, unstructured)
- Real-time incremental processing capabilities
2. True Hybrid Search in Single Table
- Industry-first single-table hybrid search combining vector and full-text indexes
- No complex joins or multiple tables needed - everything in one place
- Atomic MERGE operations for consistent data updates
3. Enterprise-Grade Storage Services
- Complete LangChain BaseStore implementation with sync/async support
- Native Volume integration for binary file storage (models, embeddings)
- SQL-queryable document storage with JSON metadata
- Atomic UPSERT operations using ClickZetta's MERGE INTO
4. Advanced Chinese Language Support
- Built-in Chinese text analyzers (IK, standard, keyword)
- Optimized for bilingual (Chinese/English) AI applications
- DashScope integration for state-of-the-art Chinese embeddings
5. Production-Ready Features
- Connection pooling and query optimization
- Comprehensive error handling and logging
- Full test coverage (unit + integration)
- Type-safe operations throughout
🛠️ Core Features
🧠 AI-Powered Query Interface
- Natural Language to SQL: Convert questions to optimized ClickZetta SQL
- Context-Aware: Understands table schemas and relationships
- Bilingual Support: Works seamlessly with Chinese and English queries
🔍 Advanced Search Capabilities
- Vector Search: High-performance embedding-based similarity search
- Full-Text Search: Enterprise-grade inverted index with multiple analyzers
- True Hybrid Search: Single-table combined vector + text search (industry first)
- Metadata Filtering: Complex filtering with JSON metadata support
💾 Enterprise Storage Solutions
- ClickZettaStore: High-performance key-value storage using SQL tables
- ClickZettaDocumentStore: Structured document storage with queryable metadata
- ClickZettaFileStore: Binary file storage using native ClickZetta Volume
- ClickZettaVolumeStore: Direct Volume integration for maximum performance
🔄 Production-Grade Operations
- Atomic UPSERT: MERGE INTO operations for data consistency
- Batch Processing: Efficient bulk operations for large datasets
- Connection Management: Pooling and automatic reconnection
- Type Safety: Full type annotations and runtime validation
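To make the "Atomic UPSERT" idea concrete, here is a minimal sketch of insert-or-update semantics. It does not use ClickZetta at all: it assumes SQLite (from the Python standard library) as a stand-in, and SQLite's `INSERT ... ON CONFLICT DO UPDATE` as a substitute for `MERGE INTO`. The table name `kv` and the helper `upsert` are hypothetical.

```python
import sqlite3

# Stand-in database: SQLite in memory, NOT ClickZetta.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")


def upsert(rows):
    """Insert new keys and overwrite existing ones in a single transaction."""
    with conn:  # commits on success, rolls back if any row fails
        conn.executemany(
            "INSERT INTO kv (key, value) VALUES (?, ?) "
            "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
            rows,
        )


upsert([("a", "1"), ("b", "2")])
upsert([("b", "20"), ("c", "3")])  # "b" is updated, "c" is inserted
rows = conn.execute("SELECT key, value FROM kv ORDER BY key").fetchall()
print(rows)
```

The point of doing this in one statement is that readers never observe a half-applied batch, which is the property the MERGE INTO bullet is describing.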
🎯 LangChain Compatibility
- BaseStore Interface: 100% compatible with LangChain storage standards
- Async Support: Full async/await pattern implementation
- Chain Integration: Seamless integration with LangChain chains and agents
- Memory Systems: Persistent chat history and conversation memory
Installation
From PyPI (Recommended)
pip install langchain-clickzetta
Development Installation
git clone https://github.com/yunqiqiliang/langchain-clickzetta.git
cd langchain-clickzetta/libs/clickzetta
pip install -e ".[dev]"
Local Installation from Source
git clone https://github.com/yunqiqiliang/langchain-clickzetta.git
cd langchain-clickzetta/libs/clickzetta
pip install .
Quick Start
Basic Setup
from langchain_clickzetta import ClickZettaEngine, ClickZettaSQLChain, ClickZettaVectorStore
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_community.llms import Tongyi
# Initialize ClickZetta engine
# ClickZetta requires exactly 7 connection parameters
engine = ClickZettaEngine(
    service="your-service",
    instance="your-instance",
    workspace="your-workspace",
    schema="your-schema",
    username="your-username",
    password="your-password",
    vcluster="your-vcluster"
)
# Initialize embeddings (DashScope recommended for Chinese/English support)
embeddings = DashScopeEmbeddings(
    dashscope_api_key="your-dashscope-api-key",
    model="text-embedding-v4"
)
# Initialize LLM
llm = Tongyi(dashscope_api_key="your-dashscope-api-key")
SQL Queries
# Create SQL chain
sql_chain = ClickZettaSQLChain.from_engine(
    engine=engine,
    llm=llm,
    return_sql=True
)
# Ask questions in natural language
result = sql_chain.invoke({
    "query": "How many users do we have in the database?"
})
print(result["result"]) # Natural language answer
print(result["sql_query"]) # Generated SQL query
Vector Storage
from langchain_core.documents import Document
# Create vector store
vector_store = ClickZettaVectorStore(
    engine=engine,
    embeddings=embeddings,
    table_name="my_vectors",
    vector_element_type="float"  # Options: float, int, tinyint
)
# Add documents
documents = [
    Document(
        page_content="ClickZetta is a high-performance analytics database.",
        metadata={"category": "database", "type": "analytics"}
    ),
    Document(
        page_content="LangChain enables building applications with LLMs.",
        metadata={"category": "framework", "type": "ai"}
    )
]
vector_store.add_documents(documents)
# Search for similar documents
results = vector_store.similarity_search(
    "What is ClickZetta?",
    k=2
)
for doc in results:
    print(doc.page_content)
Full-text Search
from langchain_clickzetta.retrievers import ClickZettaFullTextRetriever
# Create full-text retriever
retriever = ClickZettaFullTextRetriever(
    engine=engine,
    table_name="my_documents",
    search_type="phrase",
    k=5
)
# Add documents to search index
retriever.add_documents(documents)
# Search documents
results = retriever.invoke("ClickZetta database")
for doc in results:
    print(f"Score: {doc.metadata.get('relevance_score', 'N/A')}")
    print(f"Content: {doc.page_content}")
True Hybrid Search (Single Table)
from langchain_clickzetta import ClickZettaHybridStore, ClickZettaUnifiedRetriever
# Create true hybrid store (single table with both vector + inverted indexes)
hybrid_store = ClickZettaHybridStore(
    engine=engine,
    embeddings=embeddings,
    table_name="hybrid_docs",
    text_analyzer="ik",  # Chinese text analyzer
    distance_metric="cosine"
)
# Add documents to hybrid store
documents = [
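    # Chinese sample text (describes Yunqi Lakehouse and its incremental compute engine), kept in Chinese to exercise the "ik" analyzer
    Document(page_content="云器 Lakehouse 是由云器科技完全自主研发的新一代云湖仓。使用增量计算的数据计算引擎,性能可以提升至传统开源架构例如Spark的 10倍,实现了海量数据的全链路-低成本-实时化处理,为AI 创新提供了支持全类型数据整合、存储与计算的平台,帮助企业从传统的开源 Spark 体系升级到 AI 时代的数据基础设施。"),
    Document(page_content="LangChain enables building LLM applications")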
]
hybrid_store.add_documents(documents)
# Create unified retriever for hybrid search
retriever = ClickZettaUnifiedRetriever(
    hybrid_store=hybrid_store,
    search_type="hybrid",  # "vector", "fulltext", or "hybrid"
    alpha=0.5,  # Balance between vector and full-text search
    k=5
)
# Search using hybrid approach
results = retriever.invoke("analytics database")
for doc in results:
    print(f"Content: {doc.page_content}")
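The `alpha` parameter is described only as a balance between vector and full-text search. One common way such a weight is applied is a linear blend of the two scores, sketched below in plain Python. This is an assumption for illustration, not the package's confirmed formula: the function `blend_scores`, the direction of the weighting (`alpha` on the vector side), and the premise that both scores are normalized to [0, 1] are all hypothetical.

```python
def blend_scores(vector_scores, text_scores, alpha=0.5):
    """Combine two {doc_id: score} maps into one ranking.

    Assumes both score sets are already normalized to [0, 1] and that a
    document missing from one map scores 0.0 there.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    doc_ids = set(vector_scores) | set(text_scores)
    blended = {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1.0 - alpha) * text_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest blended score first; ties broken by doc_id for determinism.
    return sorted(blended.items(), key=lambda item: (-item[1], item[0]))


ranking = blend_scores({"a": 0.9, "b": 0.2}, {"b": 0.8, "c": 0.6}, alpha=0.5)
print([doc_id for doc_id, _ in ranking])
```

Under this reading, `alpha=1.0` would rank purely by vector similarity and `alpha=0.0` purely by full-text relevance.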
Chat Message History
from langchain_clickzetta import ClickZettaChatMessageHistory
from langchain_core.messages import HumanMessage, AIMessage
# Create chat history
chat_history = ClickZettaChatMessageHistory(
    engine=engine,
    session_id="user_123",
    table_name="chat_sessions"
)
# Add messages
chat_history.add_message(HumanMessage(content="Hello!"))
chat_history.add_message(AIMessage(content="Hi there! How can I help you?"))
# Retrieve conversation history
messages = chat_history.messages
for message in messages:
    print(f"{message.__class__.__name__}: {message.content}")
Configuration
Environment Variables
You can configure the ClickZetta connection using environment variables:
export CLICKZETTA_SERVICE="your-service"
export CLICKZETTA_INSTANCE="your-instance"
export CLICKZETTA_WORKSPACE="your-workspace"
export CLICKZETTA_SCHEMA="your-schema"
export CLICKZETTA_USERNAME="your-username"
export CLICKZETTA_PASSWORD="your-password"
export CLICKZETTA_VCLUSTER="your-vcluster" # Required
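This section does not say whether `ClickZettaEngine` picks these variables up on its own. If you need to pass them explicitly, a minimal sketch is to collect them into keyword arguments first. The helper `clickzetta_kwargs_from_env` is hypothetical and not part of the package; only the seven variable names come from the list above.

```python
import os

# Field names mirror the seven connection parameters shown in this README.
REQUIRED_FIELDS = (
    "service", "instance", "workspace", "schema",
    "username", "password", "vcluster",
)


def clickzetta_kwargs_from_env(environ=None):
    """Collect CLICKZETTA_* variables into a dict of connection keyword arguments."""
    environ = os.environ if environ is None else environ
    kwargs = {}
    missing = []
    for field in REQUIRED_FIELDS:
        name = f"CLICKZETTA_{field.upper()}"
        value = environ.get(name)
        if value:
            kwargs[field] = value
        else:
            missing.append(name)
    if missing:
        # Fail early with the full list rather than on the first gap.
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return kwargs
```

The resulting dict could then be unpacked into the constructor, for example `ClickZettaEngine(**clickzetta_kwargs_from_env())`, assuming the constructor accepts exactly these keyword names as in the Quick Start.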
Connection Options
engine = ClickZettaEngine(
    service="your-service",
    instance="your-instance",
    workspace="your-workspace",
    schema="your-schema",
    username="your-username",
    password="your-password",
    vcluster="your-vcluster",  # Required parameter
    connection_timeout=30,  # Connection timeout in seconds
    query_timeout=300,  # Query timeout in seconds
    hints={  # Custom query hints
        "sdk.job.timeout": 600,
        "query_tag": "My Application"
    }
)
Advanced Usage
Custom SQL Prompts
from langchain_core.prompts import PromptTemplate
custom_prompt = PromptTemplate(
    input_variables=["input", "table_info", "dialect"],
    template="""
You are a ClickZetta SQL expert. Given the input question and table information,
write a syntactically correct {dialect} query.
Tables: {table_info}
Question: {input}
SQL Query:"""
)
sql_chain = ClickZettaSQLChain(
    engine=engine,
    llm=llm,
    sql_prompt=custom_prompt
)
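To see what the LLM would receive from a prompt like the one above, the sketch below fills the same template with plain `str.format`, used here as a stand-in for `PromptTemplate.format`. The values substituted for `dialect`, `table_info`, and `input` are hypothetical; the actual strings the chain supplies at run time are not documented here.

```python
# Same template text as the custom prompt; the sample values are illustrative only.
TEMPLATE = """
You are a ClickZetta SQL expert. Given the input question and table information,
write a syntactically correct {dialect} query.
Tables: {table_info}
Question: {input}
SQL Query:"""

rendered = TEMPLATE.format(
    dialect="ClickZetta SQL",
    table_info="users(id BIGINT, name STRING, created_at TIMESTAMP)",
    input="How many users do we have in the database?",
)
print(rendered)
```

Rendering the prompt by hand like this is a quick way to check that every placeholder is declared in `input_variables` before wiring the prompt into a chain.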
Vector Store with Custom Distance Metrics
vector_store = ClickZettaVectorStore(
    engine=engine,
    embeddings=embeddings,
    distance_metric="euclidean",  # or "cosine", "manhattan"
    vector_dimension=1536,
    vector_element_type="float"  # or "int", "tinyint"
)
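For readers choosing between the three metric names, here are their textbook definitions in plain Python. This is only a reference sketch: it assumes "cosine" means one minus cosine similarity, and it says nothing about how ClickZetta computes or indexes these distances internally.

```python
import math


def _check_same_length(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 for vectors pointing the same way."""
    _check_same_length(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line (L2) distance."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1)."""
    _check_same_length(a, b)
    return sum(abs(x - y) for x, y in zip(a, b))
```

Cosine distance ignores vector length and compares direction only, which is why it is a common default for text embeddings; Euclidean and Manhattan distances are sensitive to magnitude.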
Metadata Filtering
# Search with metadata filters
results = vector_store.similarity_search(
    "machine learning",
    k=5,
    filter={"category": "tech", "year": 2024}
)
# Full-text search with metadata
retriever = ClickZettaFullTextRetriever(
    engine=engine,
    table_name="research_docs"
)
results = retriever.get_relevant_documents(
    "artificial intelligence",
    filter={"type": "research"}
)
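The filter dictionaries above are not specified in detail. A plausible reading, sketched below, is that every key must match exactly (a logical AND of equality checks). This is an assumption about the semantics, and `matches_filter` is a hypothetical helper, not a function from the package.

```python
_MISSING = object()  # sentinel so a missing key never equals a real value


def matches_filter(metadata, filter_spec):
    """Return True when every key in filter_spec equals the metadata value."""
    return all(
        metadata.get(key, _MISSING) == expected
        for key, expected in filter_spec.items()
    )


docs = [
    {"category": "tech", "year": 2024},
    {"category": "tech", "year": 2023},
    {"category": "finance", "year": 2024},
]
selected = [d for d in docs if matches_filter(d, {"category": "tech", "year": 2024})]
print(selected)
```

Under this interpretation an empty filter matches everything, and a document lacking one of the filtered keys is excluded.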
Testing
Run the test suite:
# Navigate to package directory
cd libs/clickzetta
# Install test dependencies
pip install -e ".[dev]"
# Run unit tests
make test-unit
# Run integration tests
make test-integration
# Run all tests
make test
Integration Tests
To run integration tests against a real ClickZetta instance:
- Configure your connection in ~/.clickzetta/connections.json with a UAT connection
- Add DashScope API key to the configuration
- Run integration tests:
cd libs/clickzetta
make integration
make integration-dashscope
Development
Setup Development Environment
# Clone the repository
git clone https://github.com/yunqiqiliang/langchain-clickzetta.git
cd langchain-clickzetta/libs/clickzetta
# Install in development mode
pip install -e ".[dev]"
# Install pre-commit hooks (if configured)
pre-commit install
Code Quality
# Navigate to the package directory
cd libs/clickzetta
# Format code (auto-fixes many issues)
make format
# Linting (significantly improved)
make lint # ✅ Reduced from 358 to 65 errors - 82% improvement!
# Core functionality testing
# Use project virtual environment for best results:
source .venv/bin/activate
make test-unit # ✅ Core unit tests (LangChain compatibility verified)
make test-integration # Integration tests
# Type checking (in progress)
make typecheck # Some LangChain compatibility issues being resolved
Recent Improvements ✨:
- ✅ Ruff configuration updated to modern format
- ✅ 155 typing issues auto-fixed (Dict→dict, Optional→|None)
- ✅ Method signatures fixed for LangChain BaseStore compatibility
- ✅ Bare except clauses improved with proper exception handling
- ✅ Code formatting standardized with black
Current Status: Core functionality fully working with significantly improved code quality (82% reduction in lint errors). All LangChain BaseStore compatibility tests pass.
📦 Storage Services
LangChain ClickZetta provides comprehensive storage services that implement the LangChain BaseStore interface with enterprise-grade features:
🔑 Key Advantages of ClickZetta Storage
🚀 Performance Benefits
- Up to 10x Faster: ClickZetta's optimized lakehouse architecture, compared with traditional Spark-based architectures
- Atomic Operations: MERGE INTO for consistent UPSERT operations
- Batch Processing: Efficient handling of large datasets
- Connection Pooling: Optimized database connections
🏗️ Architecture Benefits
- Native Integration: Direct ClickZetta Volume support for binary data
- SQL Queryability: Full SQL access to stored documents and metadata
- Unified Storage: Single platform for all data types
- Schema Evolution: Flexible metadata storage with JSON support
🔒 Enterprise Features
- ACID Compliance: Full transaction support
- Type Safety: Runtime validation and type checking
- Error Handling: Comprehensive error recovery and logging
- Monitoring: Built-in query performance tracking
Key-Value Store
from langchain_clickzetta import ClickZettaStore
# Basic key-value storage
store = ClickZettaStore(engine=engine, table_name="cache")
store.mset([("key1", b"value1"), ("key2", b"value2")])
values = store.mget(["key1", "key2"])
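The `mset`/`mget` calls above follow the LangChain BaseStore contract, where `mget` returns one entry per requested key and `None` for keys that are absent. The dict-backed class below is a minimal sketch of that contract for local experimentation; `DictByteStore` is a hypothetical stand-in and does not reflect how `ClickZettaStore` is implemented.

```python
from typing import Iterator, Optional, Sequence


class DictByteStore:
    """Dict-backed stand-in that mirrors the BaseStore method names."""

    def __init__(self) -> None:
        self._data = {}

    def mset(self, key_value_pairs: Sequence[tuple[str, bytes]]) -> None:
        for key, value in key_value_pairs:
            self._data[key] = value  # last write wins, like an upsert

    def mget(self, keys: Sequence[str]) -> list[Optional[bytes]]:
        # One result per key, in order; missing keys come back as None.
        return [self._data.get(key) for key in keys]

    def mdelete(self, keys: Sequence[str]) -> None:
        for key in keys:
            self._data.pop(key, None)

    def yield_keys(self, prefix: Optional[str] = None) -> Iterator[str]:
        for key in self._data:
            if prefix is None or key.startswith(prefix):
                yield key


store = DictByteStore()
store.mset([("key1", b"value1"), ("key2", b"value2")])
values = store.mget(["key1", "missing"])
print(values)
```

Code written against these four methods should be able to swap in a table-backed store later without changing call sites.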
Document Store
from langchain_clickzetta import ClickZettaDocumentStore
# Document storage with metadata
doc_store = ClickZettaDocumentStore(engine=engine, table_name="documents")
doc_store.store_document("doc1", "content", {"author": "user"})
content, metadata = doc_store.get_document("doc1")
File Store
from langchain_clickzetta import ClickZettaFileStore
# Binary file storage using ClickZetta Volume
file_store = ClickZettaFileStore(
    engine=engine,
    volume_type="user",
    subdirectory="models"
)
binary_data = b"\x00\x01\x02"  # Placeholder bytes; substitute your real file contents
file_store.store_file("model.bin", binary_data, "application/octet-stream")
content, mime_type = file_store.get_file("model.bin")
Volume Store (Native ClickZetta Volume)
from langchain_clickzetta import ClickZettaUserVolumeStore
# Native Volume integration
volume_store = ClickZettaUserVolumeStore(engine=engine, subdirectory="data")
volume_store.mset([("config.json", b'{"key": "value"}')])
config = volume_store.mget(["config.json"])[0]
📊 Comparison with Alternatives
ClickZetta vs. Traditional Vector Databases
| Feature | ClickZetta + LangChain | Pinecone/Weaviate | Chroma/FAISS |
|---|---|---|---|
| Hybrid Search | ✅ Single table | ❌ Multiple systems | ❌ Separate tools |
| SQL Queryability | ✅ Full SQL support | ❌ Limited | ❌ No SQL |
| Lakehouse Integration | ✅ Native | ❌ External | ❌ External |
| Chinese Language | ✅ Optimized | ⚠️ Basic | ⚠️ Basic |
| Enterprise Features | ✅ ACID, Transactions | ⚠️ Limited | ❌ Basic |
| Storage Services | ✅ Full LangChain API | ❌ Custom | ❌ Limited |
| Performance | ✅ 10x improvement | ⚠️ Variable | ⚠️ Memory limited |
ClickZetta vs. Other LangChain Integrations
| Integration | Vector Search | Full-Text | Hybrid | Storage API | SQL Queries |
|---|---|---|---|---|---|
| ClickZetta | ✅ | ✅ | ✅ | ✅ | ✅ |
| Elasticsearch | ✅ | ✅ | ⚠️ | ❌ | ❌ |
| PostgreSQL/pgvector | ✅ | ⚠️ | ❌ | ⚠️ | ✅ |
| MongoDB | ✅ | ⚠️ | ❌ | ⚠️ | ❌ |
| Redis | ✅ | ❌ | ❌ | ✅ | ❌ |
Key Differentiators
🎯 Single Platform Solution
- No need to manage multiple systems (vector DB + full-text + SQL + storage)
- Unified data governance and security model
- Simplified architecture and reduced operational complexity
🚀 Performance at Scale
- ClickZetta's incremental computing engine
- Optimized for both analytical and operational workloads
- Native lakehouse storage with separation of compute and storage
🌏 Chinese Market Focus
- Deep integration with Chinese AI ecosystem (DashScope, Tongyi)
- Optimized text processing for Chinese language
- Compliance with Chinese data regulations
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for your changes
- Ensure all tests pass (pytest)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Create a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
- Documentation: [Link to detailed docs]
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Acknowledgments
- LangChain for the foundational framework
- ClickZetta for the powerful analytics lakehouse