openGauss Vector Store for LangChain

License: MIT

A LangChain integration that provides scalable vector storage and similarity search backed by openGauss.

Features

  • 🚀 Multi-Index Support - HNSW and IVFFLAT vector indexing algorithms
  • 📐 Multiple Distance Metrics - EUCLIDEAN/COSINE/MANHATTAN/NEGATIVE_INNER_PRODUCT
  • 🔧 Auto-Schema Management - Automatic table creation and validation
  • 🧮 Dimension Validation - Type-safe dimension constraints for different vector types
  • 🛡️ ACID Compliance - Transaction-safe operations with connection pooling
  • 🔀 Hybrid Search - Combine vector similarity with metadata filtering
  • 😀 openGauss AGE Graph Support - Graph store implementation for openGauss AGE

Installation

pip install langchain-opengauss

Prerequisites:

  • openGauss >= 7.0.0
  • Python 3.8+
  • psycopg2-binary

Quick Start

1. Start openGauss Container

docker run --name opengauss \
  --privileged=true \
  -d \
  -e GS_PASSWORD=MyStrongPass@123 \
  -p 8888:5432 \
  opengauss/opengauss-server:latest

2. Basic Usage

from langchain_opengauss import OpenGauss, OpenGaussSettings
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

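# Configuration with validation
config = OpenGaussSettings(
    password="MyStrongPass@123",  # Matches GS_PASSWORD from step 1
    table_name="research_papers",
    embedding_dimension=1536,
    index_type="HNSW",
    distance_strategy="COSINE",
)

# Initialize with OpenAI embeddings.
# text-embedding-3-large returns 3072 dimensions by default, so request 1536
# to match embedding_dimension (the vector type allows at most 2000).
embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1536)
vector_store = OpenGauss(embedding=embeddings, config=config)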

# Insert documents
docs = [
    Document(page_content="Quantum computing basics", metadata={"field": "physics"}),
    Document(page_content="Neural network architectures", metadata={"field": "ai"})
]
vector_store.add_documents(docs)

# Semantic search
results = vector_store.similarity_search("deep learning models", k=1)
print(f"Found {len(results)} relevant documents")

Configuration Guide

Connection Settings

| Parameter | Default | Description |
| --- | --- | --- |
| host | localhost | Database server address |
| port | 8888 | Database connection port |
| user | gaussdb | Database username |
| password | - | Complex password string |
| database | postgres | Default database name |
| min_connections | 1 | Connection pool minimum size |
| max_connections | 5 | Connection pool maximum size |
| table_name | langchain_docs | Name of the table for storing vector data and metadata |
| index_type | IndexType.HNSW | Vector index algorithm type. Options: HNSW or IVFFLAT. Default is HNSW. |
| vector_type | VectorType.vector | Type of vector representation to use. Default is vector. |
| distance_strategy | DistanceStrategy.COSINE | Vector similarity metric to use for retrieval. Options: euclidean (L2 distance), cosine (angular distance, ideal for text embeddings), manhattan (L1 distance for sparse data), negative_inner_product (dot product for normalized vectors). Default is cosine. |
| embedding_dimension | 1536 | Dimensionality of the vector embeddings. |
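
To make the four distance_strategy options concrete, here is a minimal pure-Python sketch of the underlying formulas. The function names are illustrative and not part of langchain-opengauss. It also assumes the cosine option is computed as a distance (1 minus cosine similarity), which is the common convention but is not confirmed by this README.

```python
import math
from typing import Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed convention: 1 - cosine similarity (0 means same direction)."""
    norm_a = math.sqrt(_dot(a, a))
    norm_b = math.sqrt(_dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - _dot(a, b) / (norm_a * norm_b)


def negative_inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Negated dot product, so that smaller values mean more similar."""
    return -_dot(a, b)
```

For all four functions a smaller value means a closer match, which is why the inner product is negated.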

Vector Configuration

class OpenGaussSettings(BaseModel):
    index_type: IndexType = IndexType.HNSW  # HNSW or IVFFLAT
    vector_type: VectorType = VectorType.vector  # Currently supports float vectors
    distance_strategy: DistanceStrategy = DistanceStrategy.COSINE
    embedding_dimension: int = 1536  # Max 2000 for vector type

Supported Combinations

| Vector Type | Dimensions | Index Types | Supported Distance Strategies |
| --- | --- | --- | --- |
| vector | ≤2000 | HNSW/IVFFLAT | COSINE/EUCLIDEAN/MANHATTAN/NEGATIVE_INNER_PRODUCT |
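
The constraints in this table can be checked before connecting to the database. The sketch below is a standalone pre-flight check written for illustration. `check_vector_config` is a hypothetical helper, not a function of langchain-opengauss, and the library's own validation may behave differently.

```python
from typing import List

# Values taken from the "Supported Combinations" table for the `vector` type.
MAX_VECTOR_DIMENSION = 2000
INDEX_TYPES = {"HNSW", "IVFFLAT"}
DISTANCE_STRATEGIES = {"EUCLIDEAN", "COSINE", "MANHATTAN", "NEGATIVE_INNER_PRODUCT"}


def check_vector_config(
    dimension: int, index_type: str, distance_strategy: str
) -> List[str]:
    """Return a list of human-readable problems (empty if the config is valid)."""
    problems = []
    if not 1 <= dimension <= MAX_VECTOR_DIMENSION:
        problems.append(
            f"embedding_dimension must be between 1 and {MAX_VECTOR_DIMENSION}, "
            f"got {dimension}"
        )
    if index_type.upper() not in INDEX_TYPES:
        problems.append(f"unsupported index_type: {index_type}")
    if distance_strategy.upper() not in DISTANCE_STRATEGIES:
        problems.append(f"unsupported distance_strategy: {distance_strategy}")
    return problems


if __name__ == "__main__":
    # A 3072-dimensional embedding exceeds the limit for the vector type.
    for problem in check_vector_config(3072, "HNSW", "COSINE"):
        print(problem)
```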

Advanced Usage

Hybrid Search with Metadata

# Filter by metadata with vector search
results = vector_store.similarity_search(
    query="machine learning",
    k=3,
    filter={"publish_year": 2023, "category": "research"},
)
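
Conceptually, a hybrid search narrows the candidates by metadata and then ranks the survivors by vector distance. The in-memory sketch below illustrates that idea only. It assumes exact-match filter semantics and cosine distance, and `hybrid_search` is a hypothetical function, not the SQL that the library actually generates.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple

# (document id, embedding, metadata)
Row = Tuple[str, Sequence[float], Dict[str, Any]]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hybrid_search(
    rows: List[Row],
    query_vector: Sequence[float],
    k: int,
    metadata_filter: Dict[str, Any],
) -> List[str]:
    """Keep rows whose metadata matches every filter key, then rank by distance."""
    candidates = [
        row
        for row in rows
        if all(row[2].get(key) == value for key, value in metadata_filter.items())
    ]
    candidates.sort(key=lambda row: cosine_distance(row[1], query_vector))
    return [row[0] for row in candidates[:k]]


if __name__ == "__main__":
    sample_rows: List[Row] = [
        ("a", [1.0, 0.0], {"publish_year": 2023, "category": "research"}),
        ("b", [0.9, 0.1], {"publish_year": 2022, "category": "research"}),
        ("c", [0.0, 1.0], {"publish_year": 2023, "category": "research"}),
    ]
    print(hybrid_search(sample_rows, [1.0, 0.0], 2, {"publish_year": 2023}))
```

Row "b" is close to the query vector but is excluded because its publish_year does not match the filter.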

Index Management

# Create optimized HNSW index
vector_store.create_hnsw_index(
    m=24,  # Number of bi-directional links
    ef_construction=128,  # Search scope during build
    ef=64,  # Search scope during queries
)

API Reference

Core Methods

| Method | Description |
| --- | --- |
| add_documents | Insert documents with automatic embedding |
| similarity_search | Basic vector similarity search |
| similarity_search_with_score | Return (document, similarity_score) tuples |
| delete | Remove documents by ID list |
| drop_table | Delete entire collection |

Performance Tips

1. Index Tuning

HNSW Index Optimization

  • m (max connections per layer)

    • Default: 16
    • Range: 2~100
    • Tradeoff: Higher values improve recall but increase index build time and memory usage
  • ef_construction (construction search scope)

    • Default: 64
    • Range: 4~1000 (must be ≥ 2*m)

# Example HNSW configuration
vector_store.create_hnsw_index(
    m=16,  # Balance between recall and performance
    ef_construction=64,  # Must be ≥ 2*m (32) and greater than ef_search
)
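
The ranges listed above can be checked before an index build is attempted. The following sketch encodes only the limits stated in this section. `validate_hnsw_params` is a hypothetical helper, not part of the library, and openGauss itself remains the authority on which values it accepts.

```python
def validate_hnsw_params(m: int, ef_construction: int) -> None:
    """Raise ValueError if the values fall outside the ranges documented above."""
    if not 2 <= m <= 100:
        raise ValueError(f"m must be in the range 2~100, got {m}")
    if not 4 <= ef_construction <= 1000:
        raise ValueError(
            f"ef_construction must be in the range 4~1000, got {ef_construction}"
        )
    if ef_construction < 2 * m:
        raise ValueError(
            f"ef_construction ({ef_construction}) must be at least 2*m ({2 * m})"
        )


if __name__ == "__main__":
    validate_hnsw_params(m=16, ef_construction=64)   # documented defaults pass
    validate_hnsw_params(m=24, ef_construction=128)  # values from Index Management pass
```

Calling the helper just before `create_hnsw_index` turns a late database error into an early, readable one.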

IVFFLAT Index Optimization

  • lists
    • Calculation:
      # Recommended formula
      import math

      lists = min(
          int(math.sqrt(total_rows)) if total_rows > 1e6 else int(total_rows / 1000),
          2000,  # openGauss maximum
      )

    • Adjustment Guide:
      • Start with 1000 lists for 1M vectors
      • 2000 lists for 10M+ vectors
      • Monitor recall rate and adjust
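
The formula and adjustment guide above can be wrapped in a small function. This sketch adds one assumption that is not in the original formula: a floor of 1, so that tables with fewer than 1000 rows do not produce zero lists. `recommended_lists` is an illustrative name, not a library function.

```python
import math

OPENGAUSS_MAX_LISTS = 2000  # Upper bound stated in the formula above


def recommended_lists(total_rows: int) -> int:
    """Suggest an IVFFLAT `lists` value for a table with `total_rows` vectors."""
    if total_rows > 1_000_000:
        estimate = int(math.sqrt(total_rows))
    else:
        estimate = total_rows // 1000
    # Assumption: clamp to at least 1 so small tables still get a usable value.
    return max(1, min(estimate, OPENGAUSS_MAX_LISTS))


if __name__ == "__main__":
    for rows in (500, 50_000, 1_000_000, 10_000_000):
        print(rows, recommended_lists(rows))
```

The results for 1M and 10M rows agree with the adjustment guide: 1000 and 2000 lists respectively.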

2. Connection Pooling

OpenGaussSettings(
    min_connections=3,
    max_connections=20,
)

Limitations

  • The bit and sparsevec vector types are still under development

3. Start with openGaussAGEGraph

3.1. Create extension age in openGauss

# Enter the Docker container
docker exec -it opengauss bash

# Switch to the omm user
su omm

# Connect to the database (the omm database is used by default)
gsql -r

# Create the age extension in the omm database
create extension age;

# Exit the database session
\q

3.2. Basic Usage

from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_opengauss import openGaussAGEGraph, OpenGaussSettings
from langchain_community.llms import Tongyi
from langchain_core.prompts import PromptTemplate
from langchain.chains import GraphCypherQAChain
from langchain_core.output_parsers import StrOutputParser
import os

# Set the API key
os.environ["DASHSCOPE_API_KEY"] = "sk-**"
graph_llm = Tongyi(model="qwen-plus", temperature=0, base_url="https://dashscope.aliyuncs.com/compatible-mode/v1")

llm_transformer = LLMGraphTransformer(
    llm=graph_llm,
    allowed_nodes=["Person", "Organization", "Location", "Award", "ResearchField"],
    allowed_relationships = ["SPOUSE", "AWARD", "FIELD_OF_RESEARCH", "WORKS_AT", "IN_LOCATION"],
)

text = """
Marie Curie, 7 November 1867 – 4 July 1934, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.
She was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.
Her husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.
She was, in 1906, the first woman to become a professor at the University of Paris.
"""

documents = [Document(page_content=text)]
graph_documents = llm_transformer.convert_to_graph_documents(documents)

conf = OpenGaussSettings(
    database="omm",           # Default database name
    user="gaussdb",           # Database username
    password="YourPassword",  # Password with complexity requirements
    host="Your IP",           # Database server address
    port=8888,                # Database server port
)
graph = openGaussAGEGraph(graph_name="graphtest", conf=conf, create=True)
graph.add_graph_documents(graph_documents)
graph.refresh_schema()

cypher_prompt = PromptTemplate(
    template="""You are an expert in generating AGE Cypher queries. Use the following schema to generate a Cypher query to answer the given question. Do not include name, properties, or cypher.
    Schema: {schema}
    Question: {question}
    Cypher Query:""",
    input_variables=["schema", "question"],
)

chain = GraphCypherQAChain.from_llm(
    graph_llm,
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,
    cypher_validation=True,
    return_intermediate_steps=True,
    cypher_prompt=cypher_prompt,
)

question = "Who won the Nobel Prize?"
result = chain.invoke({"query": question})

prompt = PromptTemplate(
    template="""You are an assistant for question-answering tasks. 
    Use the following pieces of retrieved context from a graph database to answer the question. If you don't know the answer, just say that you don't know. 
    Use two sentences maximum and keep the answer concise:
    Question: {question} 
    Graph Context: {graph_context}
    Answer: 
    """,
    input_variables=["question", "graph_context"],
)

composite_chain = prompt | graph_llm | StrOutputParser()

answer = composite_chain.invoke(
    {"question": question, "graph_context": result}
)
print(answer)

3.3. API Reference

Core Methods

| Method | Description |
| --- | --- |
| __init__(graph_name, conf, create) | Create an openGaussAGEGraph object |
| _wrap_query(query: str, graph_name: str) | Convert a Cypher query to an openGauss AGE compatible SQL query |
| add_graph_documents(graph_documents, include_source) | Insert a list of graph documents into the graph |
| refresh_schema() | Refresh the graph schema information by updating the available labels, relationships, and properties |
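
To show what converting a Cypher query to SQL can look like, the sketch below follows the Apache AGE convention of calling `cypher()` with a dollar-quoted query and typing each returned column as `agtype`. It assumes openGauss AGE accepts the same form. `wrap_cypher` is a hypothetical helper, and the library's own `_wrap_query` may name columns and handle edge cases differently.

```python
import re
from typing import Sequence

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def wrap_cypher(
    query: str, graph_name: str, columns: Sequence[str] = ("result",)
) -> str:
    """Wrap a Cypher query in an AGE-style SQL SELECT (illustrative only)."""
    # Reject names that could break out of the quoted graph name or column list.
    for name in (graph_name, *columns):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    # The query is dollar-quoted, so it must not contain the delimiter itself.
    if "$$" in query:
        raise ValueError("query must not contain '$$'")
    column_defs = ", ".join(f"{name} agtype" for name in columns)
    return (
        f"SELECT * FROM ag_catalog.cypher('{graph_name}', $$\n"
        f"{query.strip()}\n"
        f"$$) AS ({column_defs});"
    )


if __name__ == "__main__":
    print(wrap_cypher("MATCH (p:Person) RETURN p.id", "graphtest", ["person_id"]))
```

Each column in the `AS (...)` clause corresponds to one expression in the Cypher `RETURN` clause.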
