LangChain VectorStore for hybrid search with pgvector (dense) and pg_textsearch (BM25)

PGVecTextSearch

LangChain VectorStore implementation for hybrid search combining pgvector (dense vectors) and pg_textsearch (BM25 sparse search).

Features

  • Dense Search: pgvector HNSW index for semantic similarity search
  • Sparse Search: pg_textsearch BM25 index for keyword-based search
  • Hybrid Search: Combines dense and sparse results using RRF (Reciprocal Rank Fusion)
  • Type-safe Filtering: LlamaIndex-style MetadataFilter/MetadataFilters for metadata filtering
  • Multiple Distance Strategies: Cosine distance, Euclidean distance, Inner product

Installation

pip install langchain-pgvec-textsearch

Database Requirements

  • PostgreSQL 17 or 18
  • pgvector extension
  • pg_textsearch extension (for BM25 support)

Quick Start

import asyncio
from langchain_pgvec_textsearch import (
    PGVecTextSearchStore,
    PGVecTextSearchEngine,
    HybridSearchConfig,
    DistanceStrategy,
    HNSWIndex,
    BM25Index,
    Column,
)
from langchain_openai import OpenAIEmbeddings

DATABASE_URL = "postgresql+asyncpg://user:password@localhost:5432/dbname"

async def main():
    # Create engine
    engine = PGVecTextSearchEngine.from_connection_string_async(DATABASE_URL)

    # Create table with indexes
    await engine.ainit_hybrid_vectorstore_table(
        table_name="documents",
        vector_size=1536,
        metadata_columns=[
            Column("category", "TEXT"),
            Column("year", "INTEGER"),
        ],
        hnsw_index=HNSWIndex(
            name="idx_documents_hnsw",
            distance_strategy=DistanceStrategy.COSINE_DISTANCE,
        ),
        bm25_index=BM25Index(
            name="idx_documents_bm25",
            text_config="english",  # or "public.korean" for Korean
        ),
    )

    # Create vectorstore
    embeddings = OpenAIEmbeddings()
    store = await PGVecTextSearchStore.create(
        engine=engine,
        embedding_service=embeddings,
        table_name="documents",
        metadata_columns=["category", "year"],
        hybrid_search_config=HybridSearchConfig(
            enable_dense=True,
            enable_sparse=True,
        ),
    )

    # Add documents
    from langchain_core.documents import Document
    docs = [
        Document(page_content="AI is transforming industries", metadata={"category": "tech", "year": 2024}),
        Document(page_content="Machine learning models require data", metadata={"category": "tech", "year": 2023}),
    ]
    await store.aadd_documents(docs)

    # Search
    results = await store.asimilarity_search("artificial intelligence", k=5)
    for doc in results:
        print(doc.page_content)

asyncio.run(main())

Search Modes

Dense Search Only (Semantic Similarity)

Uses pgvector HNSW index for embedding-based similarity search.

store = await PGVecTextSearchStore.create(
    engine=engine,
    embedding_service=embeddings,
    table_name="documents",
    hybrid_search_config=HybridSearchConfig(
        enable_dense=True,
        enable_sparse=False,  # Disable BM25
    ),
)

# Search by semantic similarity
results = await store.asimilarity_search("artificial intelligence", k=5)

Sparse Search Only (BM25 Keyword Search)

Uses pg_textsearch BM25 index for keyword-based search.

store = await PGVecTextSearchStore.create(
    engine=engine,
    embedding_service=embeddings,
    table_name="documents",
    hybrid_search_config=HybridSearchConfig(
        enable_dense=False,  # Disable vector search
        enable_sparse=True,
        bm25_index_name="idx_documents_bm25",  # Must specify BM25 index name
    ),
)

# Search by keywords (BM25)
results = await store.asimilarity_search("machine learning data", k=5)

Hybrid Search (Dense + Sparse with RRF)

Combines both search methods using Reciprocal Rank Fusion.

from langchain_pgvec_textsearch import reciprocal_rank_fusion, weighted_sum_ranking

store = await PGVecTextSearchStore.create(
    engine=engine,
    embedding_service=embeddings,
    table_name="documents",
    hybrid_search_config=HybridSearchConfig(
        enable_dense=True,
        enable_sparse=True,
        dense_top_k=20,   # Fetch top 20 from dense search
        sparse_top_k=20,  # Fetch top 20 from sparse search
        fusion_function=reciprocal_rank_fusion,  # Default
        fusion_function_parameters={"rrf_k": 60},
        bm25_index_name="idx_documents_bm25",
    ),
)

# Hybrid search combines semantic and keyword matching
results = await store.asimilarity_search_with_score("AI machine learning", k=10)
for doc, score in results:
    print(f"Score: {score:.4f}, Content: {doc.page_content}")
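
As a rough illustration of what Reciprocal Rank Fusion computes, the sketch below fuses two ranked ID lists in plain Python. It is not the library's implementation: the function name rrf_fuse and the sample rankings are hypothetical, and it assumes that ranks start at 1 and that rrf_k acts as the usual smoothing constant.

```python
from collections import defaultdict


def rrf_fuse(dense_ids, sparse_ids, rrf_k=60):
    """Fuse two ranked ID lists with Reciprocal Rank Fusion.

    Each document receives 1 / (rrf_k + rank) from every list it appears in,
    where rank starts at 1 (an assumption). Higher fused scores rank first.
    """
    scores = defaultdict(float)
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Sort by fused score, highest first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["a", "b", "c"]   # hypothetical dense (vector) ranking
sparse = ["b", "d", "a"]  # hypothetical sparse (BM25) ranking
fused = rrf_fuse(dense, sparse)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

With these sample rankings, b and a appear in both lists and therefore outrank d and c, which each appear in only one.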

Custom Fusion Function

You can use weighted sum ranking instead of RRF:

from langchain_pgvec_textsearch import weighted_sum_ranking

store = await PGVecTextSearchStore.create(
    engine=engine,
    embedding_service=embeddings,
    table_name="documents",
    hybrid_search_config=HybridSearchConfig(
        enable_dense=True,
        enable_sparse=True,
        fusion_function=weighted_sum_ranking,
        fusion_function_parameters={
            "dense_weight": 0.7,
            "sparse_weight": 0.3,
        },
    ),
)
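
The idea behind a weighted sum can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the library's weighted_sum_ranking: the helper names are hypothetical, scores are assumed to be "higher is better", and min-max normalization is an assumed way of putting dense and sparse scores on a common scale.

```python
def min_max(scores):
    """Scale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_sum_sketch(dense_scores, sparse_scores,
                        dense_weight=0.7, sparse_weight=0.3):
    """Combine normalized dense and sparse scores with fixed weights."""
    dense_n, sparse_n = min_max(dense_scores), min_max(sparse_scores)
    fused = {}
    for doc_id in set(dense_n) | set(sparse_n):
        # A document missing from one result list contributes 0 for that list.
        fused[doc_id] = (dense_weight * dense_n.get(doc_id, 0.0)
                         + sparse_weight * sparse_n.get(doc_id, 0.0))
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


dense_scores = {"a": 0.9, "b": 0.5, "c": 0.1}  # hypothetical similarities
sparse_scores = {"b": 12.0, "d": 4.0}          # hypothetical BM25 scores
ranked = weighted_sum_sketch(dense_scores, sparse_scores)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.3f}")
```

Unlike RRF, which looks only at rank positions, a weighted sum depends on the score magnitudes, so the choice of normalization matters.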

Table Structure

Each table has only 4 columns:

Column | Type | Description
langchain_id | UUID | Document ID
content | TEXT | Document content
embedding | vector | Dense vector embedding
langchain_metadata | JSON | All document metadata

Metadata Storage

All Document metadata is stored in the langchain_metadata JSON column. This provides a simple and flexible storage model:

# Document with any metadata
doc = Document(
    page_content="AI is transforming industries",
    metadata={"category": "tech", "year": 2024, "tags": ["ai", "ml"]}
)
await store.aadd_documents([doc])

# Metadata is stored as JSON:
# langchain_metadata = {"category": "tech", "year": 2024, "tags": ["ai", "ml"]}

Metadata Filtering

PGVecTextSearch uses type-safe MetadataFilter and MetadataFilters classes for filtering. All filters query the langchain_metadata JSON column.

Basic Filter (Single Condition)

from langchain_pgvec_textsearch import MetadataFilter, FilterOperator

# Filter: category == "tech"
filter_obj = MetadataFilter(
    key="category",
    value="tech",
    operator=FilterOperator.EQ
)
results = await store.asimilarity_search("AI", k=5, filter=filter_obj)

Filter Operators

Operator | Description | Example Value
EQ | Equals | "tech"
NE | Not equals | "tech"
GT | Greater than | 4.5
GTE | Greater than or equal | 2024
LT | Less than | 2024
LTE | Less than or equal | 2023
IN | Value in list | ["tech", "science"]
NIN | Value not in list | ["sports", "food"]
BETWEEN | Value between range | [2020, 2024]
TEXT_MATCH | LIKE pattern (case-sensitive) | "data%"
TEXT_MATCH_INSENSITIVE | ILIKE pattern (case-insensitive) | "%search%"
EXISTS | Field exists (not null) | True
IS_EMPTY | Field is null or empty | True
ANY | Array contains any | ["a", "b"]
ALL | Array contains all | ["a", "b"]
CONTAINS | Array contains value | "tag"
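
To make the operator semantics concrete, here is a small in-memory evaluator that applies each operator to a metadata dict. It is a sketch for reasoning about filters, not how the library executes them (the library queries the langchain_metadata column in PostgreSQL). The function names, the string operator names, the inclusive BETWEEN bounds, and the handling of missing fields are all assumptions.

```python
import re


def like_to_regex(pattern, ignore_case=False):
    """Translate a SQL LIKE pattern (% and _) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL | (re.IGNORECASE if ignore_case else 0)
    return re.compile("".join(parts), flags)


def matches(metadata, key, operator, value):
    """Return True if metadata[key] satisfies the named operator."""
    field = metadata.get(key)
    if operator == "EXISTS":
        return field is not None
    if operator == "IS_EMPTY":
        return field is None or field == "" or field == []
    if field is None:
        return False  # assumption: a missing field fails every other operator
    checks = {
        "EQ": lambda: field == value,
        "NE": lambda: field != value,
        "GT": lambda: field > value,
        "GTE": lambda: field >= value,
        "LT": lambda: field < value,
        "LTE": lambda: field <= value,
        "IN": lambda: field in value,
        "NIN": lambda: field not in value,
        "BETWEEN": lambda: value[0] <= field <= value[1],  # inclusive bounds
        "TEXT_MATCH": lambda: like_to_regex(value).fullmatch(field) is not None,
        "TEXT_MATCH_INSENSITIVE":
            lambda: like_to_regex(value, True).fullmatch(field) is not None,
        "ANY": lambda: any(item in field for item in value),
        "ALL": lambda: all(item in field for item in value),
        "CONTAINS": lambda: value in field,
    }
    return checks[operator]()


meta = {"category": "tech", "year": 2024,
        "title": "Data Search", "tags": ["ai", "ml"]}
print(matches(meta, "year", "BETWEEN", [2020, 2024]))
print(matches(meta, "title", "TEXT_MATCH", "data%"))
```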

Multiple Filters with AND

from langchain_pgvec_textsearch import MetadataFilters, FilterCondition

# Filter: category == "tech" AND year >= 2024
filter_obj = MetadataFilters(
    filters=[
        MetadataFilter(key="category", value="tech", operator=FilterOperator.EQ),
        MetadataFilter(key="year", value=2024, operator=FilterOperator.GTE),
    ],
    condition=FilterCondition.AND
)
results = await store.asimilarity_search("AI", k=5, filter=filter_obj)

Multiple Filters with OR

# Filter: category == "tech" OR category == "science"
filter_obj = MetadataFilters(
    filters=[
        MetadataFilter(key="category", value="tech", operator=FilterOperator.EQ),
        MetadataFilter(key="category", value="science", operator=FilterOperator.EQ),
    ],
    condition=FilterCondition.OR
)
results = await store.asimilarity_search("research", k=5, filter=filter_obj)

NOT Condition

# Filter: NOT (category == "sports")
filter_obj = MetadataFilters(
    filters=[
        MetadataFilter(key="category", value="sports", operator=FilterOperator.EQ),
    ],
    condition=FilterCondition.NOT
)
results = await store.asimilarity_search("news", k=5, filter=filter_obj)

Nested Filters (Complex Logic)

# Filter: (category == "tech" AND rating >= 4.5) OR category == "science"
filter_obj = MetadataFilters(
    filters=[
        MetadataFilters(
            filters=[
                MetadataFilter(key="category", value="tech", operator=FilterOperator.EQ),
                MetadataFilter(key="rating", value=4.5, operator=FilterOperator.GTE),
            ],
            condition=FilterCondition.AND
        ),
        MetadataFilter(key="category", value="science", operator=FilterOperator.EQ),
    ],
    condition=FilterCondition.OR
)
results = await store.asimilarity_search("research", k=5, filter=filter_obj)
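
The way nested conditions combine can be sketched with a small recursive evaluator over an in-memory filter tree. The Leaf and Group classes below are hypothetical stand-ins for MetadataFilter and MetadataFilters, and treating NOT as the negation of all children taken together is an assumption about the library's behavior.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, List, Union


@dataclass
class Leaf:
    key: str
    value: Any
    test: Callable[[Any, Any], bool]  # e.g. operator.eq or operator.ge


@dataclass
class Group:
    filters: List[Union["Group", Leaf]]
    condition: str  # "AND", "OR", or "NOT"


def evaluate(node, metadata):
    """Recursively evaluate a filter tree against one metadata dict."""
    if isinstance(node, Leaf):
        field = metadata.get(node.key)
        return field is not None and node.test(field, node.value)
    results = [evaluate(child, metadata) for child in node.filters]
    if node.condition == "AND":
        return all(results)
    if node.condition == "OR":
        return any(results)
    if node.condition == "NOT":
        return not all(results)  # assumption: NOT negates the whole group
    raise ValueError(f"unknown condition: {node.condition}")


# (category == "tech" AND rating >= 4.5) OR category == "science"
tree = Group(
    filters=[
        Group(
            filters=[
                Leaf("category", "tech", operator.eq),
                Leaf("rating", 4.5, operator.ge),
            ],
            condition="AND",
        ),
        Leaf("category", "science", operator.eq),
    ],
    condition="OR",
)
print(evaluate(tree, {"category": "tech", "rating": 4.8}))
print(evaluate(tree, {"category": "tech", "rating": 4.0}))
```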

Filter with IN Operator

# Filter: category IN ["tech", "science", "health"]
filter_obj = MetadataFilter(
    key="category",
    value=["tech", "science", "health"],
    operator=FilterOperator.IN
)
results = await store.asimilarity_search("AI", k=5, filter=filter_obj)

Filter with BETWEEN Operator

# Filter: year BETWEEN 2020 AND 2024
filter_obj = MetadataFilter(
    key="year",
    value=[2020, 2024],
    operator=FilterOperator.BETWEEN
)
results = await store.asimilarity_search("AI", k=5, filter=filter_obj)

Filter with Text Pattern Matching

# Filter: title LIKE "Introduction%"
filter_obj = MetadataFilter(
    key="title",
    value="Introduction%",
    operator=FilterOperator.TEXT_MATCH
)

# Case-insensitive: title ILIKE "%machine learning%"
filter_obj = MetadataFilter(
    key="title",
    value="%machine learning%",
    operator=FilterOperator.TEXT_MATCH_INSENSITIVE
)

Distance Strategies

from langchain_pgvec_textsearch import DistanceStrategy

# Cosine Distance (default) - good for normalized embeddings
DistanceStrategy.COSINE_DISTANCE

# Euclidean Distance - L2 distance
DistanceStrategy.EUCLIDEAN_DISTANCE

# Inner Product - dot product similarity
DistanceStrategy.INNER_PRODUCT

Configure in HNSWIndex:

hnsw_index = HNSWIndex(
    name="idx_documents_hnsw",
    distance_strategy=DistanceStrategy.COSINE_DISTANCE,
    m=16,                # Max connections per layer
    ef_construction=64,  # Size of the dynamic candidate list used while building the graph
)
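
For reference, the three measures can be computed by hand as in the sketch below (plain Python, hypothetical function names). Note that pgvector's inner-product operator returns the negative inner product so that smaller values still mean "closer"; the sketch returns the plain dot product.

```python
import math


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 for same direction, 1 for orthogonal vectors."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)


u = [1.0, 0.0]
v = [0.0, 1.0]  # orthogonal to u
print(cosine_distance(u, v))
print(euclidean_distance(u, v))
print(inner_product(u, v))
```

Cosine distance ignores vector length, which is why it suits embeddings whose magnitude carries no meaning; for unit-length vectors all three measures produce the same ranking.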

Index Configuration

Index Naming

When table names contain special characters (hyphens, spaces) or uppercase letters, the index names are automatically sanitized:

  • Hyphens and spaces are replaced with underscores
  • All letters are lowercased

This is required for pg_textsearch's internal index lookup mechanism, which is case-sensitive.

Table Name | Auto-generated Index Names
documents | idx_documents_hnsw, idx_documents_bm25
Ko-StrategyQA-docs | idx_ko_strategyqa_docs_hnsw, idx_ko_strategyqa_docs_bm25
my table name | idx_my_table_name_hnsw, idx_my_table_name_bm25
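
The sanitization rule described above can be sketched as a small helper. The function name auto_index_name is hypothetical and the library's actual code may differ; the sketch simply reproduces the rows of the table.

```python
import re


def auto_index_name(table_name, suffix):
    """Build an index name from a table name (illustrative helper).

    Hyphens and whitespace become underscores and all letters are lowercased,
    then the result is wrapped as idx_<table>_<suffix>.
    """
    cleaned = re.sub(r"[-\s]", "_", table_name).lower()
    return f"idx_{cleaned}_{suffix}"


for table in ("documents", "Ko-StrategyQA-docs", "my table name"):
    print(auto_index_name(table, "hnsw"), auto_index_name(table, "bm25"))
```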

You can also specify explicit index names:

# Explicit index names (recommended for special table names)
await engine.ainit_hybrid_vectorstore_table(
    table_name="Ko-StrategyQA-documents",
    vector_size=1536,
    hnsw_index=HNSWIndex(
        name="idx_ko_strategyqa_hnsw",  # Explicit name
        distance_strategy=DistanceStrategy.COSINE_DISTANCE,
    ),
    bm25_index=BM25Index(
        name="idx_ko_strategyqa_bm25",  # Explicit name
        text_config="public.korean",
    ),
)

# When using explicit BM25 index name, also specify in HybridSearchConfig
store = await PGVecTextSearchStore.create(
    engine=engine,
    embedding_service=embeddings,
    table_name="Ko-StrategyQA-documents",
    hybrid_search_config=HybridSearchConfig(
        enable_dense=True,
        enable_sparse=True,
        bm25_index_name="idx_ko_strategyqa_bm25",  # Must match!
    ),
)

HNSW Index (Dense Vectors)

hnsw_index = HNSWIndex(
    name="idx_documents_hnsw",
    distance_strategy=DistanceStrategy.COSINE_DISTANCE,
    m=16,               # Max connections per layer (default: 16)
    ef_construction=64, # Construction time quality (default: 64)
)

HNSW Query Parameters

Configure search-time parameters for better recall or performance:

from langchain_pgvec_textsearch import HybridSearchConfig, IterativeScanMode

store = await PGVecTextSearchStore.create(
    engine=engine,
    embedding_service=embeddings,
    table_name="documents",
    hybrid_search_config=HybridSearchConfig(
        enable_dense=True,
        enable_sparse=True,
        # HNSW query parameters
        ef_search=100,  # Higher = better recall, slower (default: 40)
        iterative_scan=IterativeScanMode.RELAXED_ORDER,  # For filtered queries (pgvector 0.8.0+)
    ),
)

Parameter | Default | Description
ef_search | 40 | Size of dynamic candidate list. Higher values improve recall at cost of speed.
iterative_scan | None | Iterative scan mode for filtered queries. Options: OFF, RELAXED_ORDER, STRICT_ORDER

When to increase ef_search:

  • When recall is more important than speed
  • When using filters that may reduce result count
  • Recommended range: 40-200

Iterative scan modes (pgvector 0.8.0+):

  • OFF: Disabled (default)
  • RELAXED_ORDER: Better performance, may slightly reorder results
  • STRICT_ORDER: Maintains strict distance ordering
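
In pgvector itself, these options correspond to the session settings hnsw.ef_search and hnsw.iterative_scan. The sketch below builds the matching SET statements; the helper name is hypothetical, and whether the library issues exactly these statements internally is an assumption.

```python
def hnsw_session_settings(ef_search=None, iterative_scan=None):
    """Return the pgvector SET statements for the given query-time options.

    Options left as None are skipped, so PostgreSQL keeps its current values.
    """
    allowed_modes = {"off", "relaxed_order", "strict_order"}
    statements = []
    if ef_search is not None:
        if int(ef_search) < 1:
            raise ValueError("ef_search must be a positive integer")
        statements.append(f"SET hnsw.ef_search = {int(ef_search)}")
    if iterative_scan is not None:
        mode = iterative_scan.lower()
        if mode not in allowed_modes:
            raise ValueError(f"unknown iterative scan mode: {iterative_scan}")
        statements.append(f"SET hnsw.iterative_scan = {mode}")
    return statements


for statement in hnsw_session_settings(ef_search=100,
                                       iterative_scan="RELAXED_ORDER"):
    print(statement)
```

Running the same statements in psql before a manual query is a convenient way to experiment with recall and latency outside the library.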

IVFFlat Index (Alternative Dense Index)

from langchain_pgvec_textsearch import IVFFlatIndex

ivfflat_index = IVFFlatIndex(
    name="idx_documents_ivfflat",
    distance_strategy=DistanceStrategy.COSINE_DISTANCE,
    lists=100,  # Number of clusters
)
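
When choosing lists, the pgvector documentation suggests starting from rows / 1000 for tables of up to one million rows and sqrt(rows) for larger tables. A small helper along those lines (the function name is hypothetical, and the result is only a starting point for tuning):

```python
import math


def suggested_ivfflat_lists(row_count):
    """Starting value for `lists`, following pgvector's rule of thumb."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return int(math.sqrt(row_count))


for rows in (500, 100_000, 4_000_000):
    print(rows, suggested_ivfflat_lists(rows))
```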

BM25 Index (Sparse Search)

bm25_index = BM25Index(
    name="idx_documents_bm25",
    text_config="english",  # PostgreSQL text search config
    k1=1.2,                 # Term frequency saturation (default: 1.2)
    b=0.75,                 # Length normalization (default: 0.75)
)

For Korean text:

bm25_index = BM25Index(
    name="idx_documents_bm25",
    text_config="public.korean",  # Korean text search config
)
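
To see what k1 and b control, the sketch below scores a single query term with the textbook BM25 formula. It is illustrative only: the function name is hypothetical, and the exact IDF variant and statistics used by pg_textsearch may differ.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document's score.

    tf        -- occurrences of the term in the document
    doc_len   -- document length in tokens
    doc_freq  -- number of documents containing the term
    k1        -- term-frequency saturation (higher = slower saturation)
    b         -- length normalization (0 = none, 1 = full)
    """
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


# Repeating a term helps less and less (saturation controlled by k1).
for tf in (1, 2, 3):
    print(tf, round(bm25_term_score(tf, 100, 100, 1000, 10), 4))

# Longer documents score lower for the same tf (controlled by b).
print(round(bm25_term_score(1, 200, 100, 1000, 10), 4))
```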

Additional Operations

Delete Documents

# Delete by IDs
await store.adelete(ids=["id1", "id2", "id3"])

# Delete by filter
filter_obj = MetadataFilter(key="category", value="outdated", operator=FilterOperator.EQ)
await store.adelete(filter=filter_obj)

MMR Search (Maximal Marginal Relevance)

# Get diverse results
results = await store.amax_marginal_relevance_search(
    query="machine learning",
    k=5,
    fetch_k=20,
    lambda_mult=0.5,  # 0 = max diversity, 1 = max relevance
    filter=filter_obj,
)
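
The effect of lambda_mult can be illustrated with a minimal MMR selection loop over in-memory vectors. This is a sketch of the general technique with hypothetical names and sample vectors, not the library's implementation; it assumes cosine similarity for both relevance and redundancy.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Greedily pick k candidate indices, trading relevance for diversity."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in remaining:
            relevance = cosine_similarity(query_vec, candidate_vecs[idx])
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine_similarity(candidate_vecs[idx], candidate_vecs[j])
                 for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.0]
candidates = [
    [0.9, 0.10],   # 0: most relevant
    [0.9, 0.12],   # 1: near-duplicate of 0
    [0.5, -0.5],   # 2: less relevant but different
]
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))
print(mmr_select(query, candidates, k=2, lambda_mult=0.5))
```

With lambda_mult=1.0 the near-duplicate is selected second because only relevance counts; with lambda_mult=0.5 the redundancy penalty lets the more distinct candidate take its place.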

Get Documents by IDs

docs = await store.aget_by_ids(["id1", "id2", "id3"])

API Reference

PGVecTextSearchEngine

Method | Description
from_connection_string_async(url) | Create engine from connection string
ainit_hybrid_vectorstore_table(...) | Create table with indexes
adrop_table(table_name) | Drop a table

PGVecTextSearchStore

Method | Description
create(engine, embedding_service, table_name, ...) | Create store instance
aadd_documents(documents, ids) | Add documents
aadd_texts(texts, metadatas, ids) | Add texts with metadata
asimilarity_search(query, k, filter) | Search by query
asimilarity_search_with_score(query, k, filter) | Search with scores
amax_marginal_relevance_search(query, k, fetch_k, lambda_mult, filter) | MMR search
adelete(ids, filter) | Delete documents
aget_by_ids(ids) | Get documents by IDs

HybridSearchConfig

Parameter | Type | Default | Description
enable_dense | bool | True | Enable vector search
enable_sparse | bool | True | Enable BM25 search
dense_top_k | int | 20 | Top K for dense search
sparse_top_k | int | 20 | Top K for sparse search
fusion_function | callable | reciprocal_rank_fusion | Fusion function
fusion_function_parameters | dict | {} | Fusion function params
bm25_index_name | str | None | BM25 index name (auto-sanitized: hyphens/spaces become underscores, lowercased)
ef_search | int | None | HNSW ef_search parameter. Higher = better recall, slower. (default: 40 in pgvector)
iterative_scan | IterativeScanMode | None | HNSW iterative scan mode for filtered queries (pgvector 0.8.0+)

License

MIT License
