LangChain VectorStore for hybrid search with pgvector (dense) and pg_textsearch (BM25)
PGVecTextSearch
LangChain VectorStore implementation for hybrid search combining pgvector (dense vectors) and pg_textsearch (BM25 sparse search).
Features
- Dense Search: pgvector HNSW index for semantic similarity search
- Sparse Search: pg_textsearch BM25 index for keyword-based search
- Hybrid Search: Combines dense and sparse results using RRF (Reciprocal Rank Fusion)
- Type-safe Filtering: LlamaIndex-style MetadataFilter/MetadataFilters for metadata filtering
- Multiple Distance Strategies: Cosine distance, Euclidean distance, Inner product
Installation
pip install langchain-pgvec-textsearch
Database Requirements
- PostgreSQL 17 or 18
- pgvector extension
- pg_textsearch extension (for BM25 support)
Quick Start
import asyncio
from langchain_pgvec_textsearch import (
PGVecTextSearchStore,
PGVecTextSearchEngine,
HybridSearchConfig,
DistanceStrategy,
HNSWIndex,
BM25Index,
Column,
)
from langchain_openai import OpenAIEmbeddings
DATABASE_URL = "postgresql+asyncpg://user:password@localhost:5432/dbname"
async def main():
    # Create engine
    engine = PGVecTextSearchEngine.from_connection_string_async(DATABASE_URL)
    # Create table with indexes
    await engine.ainit_hybrid_vectorstore_table(
        table_name="documents",
        vector_size=1536,
        metadata_columns=[
            Column("category", "TEXT"),
            Column("year", "INTEGER"),
        ],
        hnsw_index=HNSWIndex(
            name="idx_documents_hnsw",
            distance_strategy=DistanceStrategy.COSINE_DISTANCE,
        ),
        bm25_index=BM25Index(
            name="idx_documents_bm25",
            text_config="english",  # or "public.korean" for Korean
        ),
    )
    # Create vectorstore
    embeddings = OpenAIEmbeddings()
    store = await PGVecTextSearchStore.create(
        engine=engine,
        embedding_service=embeddings,
        table_name="documents",
        metadata_columns=["category", "year"],
        hybrid_search_config=HybridSearchConfig(
            enable_dense=True,
            enable_sparse=True,
        ),
    )
    # Add documents
    from langchain_core.documents import Document
    docs = [
        Document(page_content="AI is transforming industries", metadata={"category": "tech", "year": 2024}),
        Document(page_content="Machine learning models require data", metadata={"category": "tech", "year": 2023}),
    ]
    await store.aadd_documents(docs)
    # Search
    results = await store.asimilarity_search("artificial intelligence", k=5)
    for doc in results:
        print(doc.page_content)
asyncio.run(main())
Search Modes
Dense Search Only (Semantic Similarity)
Uses pgvector HNSW index for embedding-based similarity search.
store = await PGVecTextSearchStore.create(
engine=engine,
embedding_service=embeddings,
table_name="documents",
hybrid_search_config=HybridSearchConfig(
enable_dense=True,
enable_sparse=False, # Disable BM25
),
)
# Search by semantic similarity
results = await store.asimilarity_search("artificial intelligence", k=5)
Sparse Search Only (BM25 Keyword Search)
Uses pg_textsearch BM25 index for keyword-based search.
store = await PGVecTextSearchStore.create(
engine=engine,
embedding_service=embeddings,
table_name="documents",
hybrid_search_config=HybridSearchConfig(
enable_dense=False, # Disable vector search
enable_sparse=True,
bm25_index_name="idx_documents_bm25", # Must specify BM25 index name
),
)
# Search by keywords (BM25)
results = await store.asimilarity_search("machine learning data", k=5)
Hybrid Search (Dense + Sparse with RRF)
Combines both search methods using Reciprocal Rank Fusion.
from langchain_pgvec_textsearch import reciprocal_rank_fusion, weighted_sum_ranking
store = await PGVecTextSearchStore.create(
engine=engine,
embedding_service=embeddings,
table_name="documents",
hybrid_search_config=HybridSearchConfig(
enable_dense=True,
enable_sparse=True,
dense_top_k=20, # Fetch top 20 from dense search
sparse_top_k=20, # Fetch top 20 from sparse search
fusion_function=reciprocal_rank_fusion, # Default
fusion_function_parameters={"rrf_k": 60},
bm25_index_name="idx_documents_bm25",
),
)
# Hybrid search combines semantic and keyword matching
results = await store.asimilarity_search_with_score("AI machine learning", k=10)
for doc, score in results:
    print(f"Score: {score:.4f}, Content: {doc.page_content}")
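For intuition about what the fusion step does, here is a minimal standalone sketch of Reciprocal Rank Fusion in plain Python. It is not the library's `reciprocal_rank_fusion` implementation; the helper name `rrf_fuse` and the sample document IDs are hypothetical. Each document receives `1 / (rrf_k + rank)` from every ranked list it appears in, and the contributions are summed.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, rrf_k=60, k=10):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Hypothetical helper for illustration only.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the top hit contributes 1 / (rrf_k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:k]


# Made-up result lists standing in for the dense and sparse searches.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([dense_hits, sparse_hits], rrf_k=60)
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) rise above documents that appear in only one. A larger `rrf_k` flattens the difference between neighboring ranks.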
Custom Fusion Function
You can use weighted sum ranking instead of RRF:
store = await PGVecTextSearchStore.create(
engine=engine,
embedding_service=embeddings,
table_name="documents",
hybrid_search_config=HybridSearchConfig(
enable_dense=True,
enable_sparse=True,
fusion_function=weighted_sum_ranking,
fusion_function_parameters={
"dense_weight": 0.7,
"sparse_weight": 0.3,
},
),
)
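As a rough illustration of weighted-sum fusion, the sketch below normalizes each score set to the range 0 to 1 and combines them with the two weights. This README does not say how `weighted_sum_ranking` normalizes scores, so the min-max normalization, the "higher is better" convention, and the helper names here are assumptions, not the library's behavior.

```python
def min_max_normalize(scores):
    """Scale a {doc_id: score} mapping to [0, 1] (assumes higher is better)."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores equal: treat every document as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def weighted_sum_fuse(dense_scores, sparse_scores,
                      dense_weight=0.7, sparse_weight=0.3):
    """Hypothetical weighted-sum fusion over two score mappings."""
    dense = min_max_normalize(dense_scores)
    sparse = min_max_normalize(sparse_scores)
    fused = {}
    for doc_id in set(dense) | set(sparse):
        # A document missing from one list contributes 0 for that list.
        fused[doc_id] = (dense_weight * dense.get(doc_id, 0.0)
                         + sparse_weight * sparse.get(doc_id, 0.0))
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Made-up scores: similarities for dense, BM25 scores for sparse.
ranking = weighted_sum_fuse({"a": 0.9, "b": 0.5}, {"b": 12.0, "c": 3.0})
```

Unlike RRF, this approach depends on the score magnitudes, so the choice of normalization matters as much as the weights.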
Table Structure
Each table has only 4 columns:
| Column | Type | Description |
|---|---|---|
| langchain_id | UUID | Document ID |
| content | TEXT | Document content |
| embedding | vector | Dense vector embedding |
| langchain_metadata | JSON | All document metadata |
Metadata Storage
All Document metadata is stored in the langchain_metadata JSON column. This provides a simple and flexible storage model:
# Document with any metadata
doc = Document(
page_content="AI is transforming industries",
metadata={"category": "tech", "year": 2024, "tags": ["ai", "ml"]}
)
await store.aadd_documents([doc])
# Metadata is stored as JSON:
# langchain_metadata = {"category": "tech", "year": 2024, "tags": ["ai", "ml"]}
Metadata Filtering
PGVecTextSearch uses type-safe MetadataFilter and MetadataFilters classes for filtering.
All filters query the langchain_metadata JSON column.
Basic Filter (Single Condition)
from langchain_pgvec_textsearch import MetadataFilter, FilterOperator
# Filter: category == "tech"
filter_obj = MetadataFilter(
key="category",
value="tech",
operator=FilterOperator.EQ
)
results = await store.asimilarity_search("AI", k=5, filter=filter_obj)
Filter Operators
| Operator | Description | Example Value |
|---|---|---|
| EQ | Equals | "tech" |
| NE | Not equals | "tech" |
| GT | Greater than | 4.5 |
| GTE | Greater than or equal | 2024 |
| LT | Less than | 2024 |
| LTE | Less than or equal | 2023 |
| IN | Value in list | ["tech", "science"] |
| NIN | Value not in list | ["sports", "food"] |
| BETWEEN | Value between range | [2020, 2024] |
| TEXT_MATCH | LIKE pattern (case-sensitive) | "data%" |
| TEXT_MATCH_INSENSITIVE | ILIKE pattern (case-insensitive) | "%search%" |
| EXISTS | Field exists (not null) | True |
| IS_EMPTY | Field is null or empty | True |
| ANY | Array contains any | ["a", "b"] |
| ALL | Array contains all | ["a", "b"] |
| CONTAINS | Array contains value | "tag" |
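To make the operator semantics concrete, the sketch below evaluates most of them against a plain metadata dict in Python. It only mirrors the descriptions in the table; the real filters run as SQL against the langchain_metadata column, and the `matches` helper, the string operator names, and the handling of missing fields are assumptions made for illustration.

```python
import operator as op

# Comparison operators from the table mapped to Python equivalents.
_COMPARE = {
    "EQ": op.eq, "NE": op.ne,
    "GT": op.gt, "GTE": op.ge,
    "LT": op.lt, "LTE": op.le,
}


def matches(metadata, key, operator, value):
    """Return True if metadata[key] satisfies the operator (hypothetical helper)."""
    field = metadata.get(key)
    if operator == "EXISTS":
        return field is not None
    if operator == "IS_EMPTY":
        return field is None or field == "" or field == []
    if field is None:
        # Assumption: a missing field fails every remaining operator.
        return False
    if operator in _COMPARE:
        return _COMPARE[operator](field, value)
    if operator == "IN":
        return field in value
    if operator == "NIN":
        return field not in value
    if operator == "BETWEEN":
        low, high = value
        return low <= field <= high  # inclusive, like SQL BETWEEN
    if operator == "ANY":
        return any(item in field for item in value)
    if operator == "ALL":
        return all(item in field for item in value)
    if operator == "CONTAINS":
        return value in field
    raise ValueError(f"Unsupported operator: {operator}")


meta = {"category": "tech", "year": 2024, "tags": ["ai", "ml"]}
```

For example, `matches(meta, "year", "BETWEEN", [2020, 2024])` is true because both bounds are inclusive, while `matches(meta, "tags", "ALL", ["ai", "x"])` is false because `"x"` is not among the tags.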
Multiple Filters with AND
from langchain_pgvec_textsearch import MetadataFilters, FilterCondition
# Filter: category == "tech" AND year >= 2024
filter_obj = MetadataFilters(
filters=[
MetadataFilter(key="category", value="tech", operator=FilterOperator.EQ),
MetadataFilter(key="year", value=2024, operator=FilterOperator.GTE),
],
condition=FilterCondition.AND
)
results = await store.asimilarity_search("AI", k=5, filter=filter_obj)
Multiple Filters with OR
# Filter: category == "tech" OR category == "science"
filter_obj = MetadataFilters(
filters=[
MetadataFilter(key="category", value="tech", operator=FilterOperator.EQ),
MetadataFilter(key="category", value="science", operator=FilterOperator.EQ),
],
condition=FilterCondition.OR
)
results = await store.asimilarity_search("research", k=5, filter=filter_obj)
NOT Condition
# Filter: NOT (category == "sports")
filter_obj = MetadataFilters(
filters=[
MetadataFilter(key="category", value="sports", operator=FilterOperator.EQ),
],
condition=FilterCondition.NOT
)
results = await store.asimilarity_search("news", k=5, filter=filter_obj)
Nested Filters (Complex Logic)
# Filter: (category == "tech" AND rating >= 4.5) OR category == "science"
filter_obj = MetadataFilters(
filters=[
MetadataFilters(
filters=[
MetadataFilter(key="category", value="tech", operator=FilterOperator.EQ),
MetadataFilter(key="rating", value=4.5, operator=FilterOperator.GTE),
],
condition=FilterCondition.AND
),
MetadataFilter(key="category", value="science", operator=FilterOperator.EQ),
],
condition=FilterCondition.OR
)
results = await store.asimilarity_search("research", k=5, filter=filter_obj)
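The nested filter above forms a small boolean tree. The sketch below evaluates such a tree against a metadata dict, using plain dicts in place of MetadataFilter and MetadataFilters. The `evaluate` function is hypothetical, supports only EQ and GTE leaves, and assumes that NOT negates the conjunction of its children; with a single child, as in the NOT example earlier, other readings give the same result.

```python
def evaluate(node, metadata):
    """Recursively evaluate a filter tree against a metadata dict."""
    if "condition" in node:
        results = [evaluate(child, metadata) for child in node["filters"]]
        if node["condition"] == "AND":
            return all(results)
        if node["condition"] == "OR":
            return any(results)
        if node["condition"] == "NOT":
            # Assumption: NOT negates "all children match".
            return not all(results)
        raise ValueError(f"Unknown condition: {node['condition']}")
    # Leaf node: a single key/operator/value comparison.
    field = metadata.get(node["key"])
    if field is None:
        return False
    if node["operator"] == "EQ":
        return field == node["value"]
    if node["operator"] == "GTE":
        return field >= node["value"]
    raise ValueError(f"Unsupported operator: {node['operator']}")


# (category == "tech" AND rating >= 4.5) OR category == "science"
nested = {
    "condition": "OR",
    "filters": [
        {
            "condition": "AND",
            "filters": [
                {"key": "category", "operator": "EQ", "value": "tech"},
                {"key": "rating", "operator": "GTE", "value": 4.5},
            ],
        },
        {"key": "category", "operator": "EQ", "value": "science"},
    ],
}
```

A tech document rated 4.8 and any science document pass this filter; a tech document rated 3.0 does not.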
Filter with IN Operator
# Filter: category IN ["tech", "science", "health"]
filter_obj = MetadataFilter(
key="category",
value=["tech", "science", "health"],
operator=FilterOperator.IN
)
results = await store.asimilarity_search("AI", k=5, filter=filter_obj)
Filter with BETWEEN Operator
# Filter: year BETWEEN 2020 AND 2024
filter_obj = MetadataFilter(
key="year",
value=[2020, 2024],
operator=FilterOperator.BETWEEN
)
results = await store.asimilarity_search("AI", k=5, filter=filter_obj)
Filter with Text Pattern Matching
# Filter: title LIKE "Introduction%"
filter_obj = MetadataFilter(
key="title",
value="Introduction%",
operator=FilterOperator.TEXT_MATCH
)
# Case-insensitive: title ILIKE "%machine learning%"
filter_obj = MetadataFilter(
key="title",
value="%machine learning%",
operator=FilterOperator.TEXT_MATCH_INSENSITIVE
)
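If the LIKE wildcards are unfamiliar, this small Python sketch emulates them with regular expressions: `%` matches any run of characters and `_` matches exactly one. The `like` helper is hypothetical and ignores backslash escapes; it only demonstrates the case-sensitive versus case-insensitive behavior of the two operators.

```python
import re


def like_to_regex(pattern, case_insensitive=False):
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any sequence of characters, possibly empty
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.compile("".join(parts), flags)


def like(value, pattern, case_insensitive=False):
    """Emulate LIKE (or ILIKE when case_insensitive=True) on one string."""
    # LIKE must match the whole value, hence fullmatch rather than search.
    return like_to_regex(pattern, case_insensitive).fullmatch(value) is not None
```

With these helpers, `like("Introduction to AI", "Introduction%")` matches, while a lowercase `"introduction to AI"` matches only when `case_insensitive=True`.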
Distance Strategies
from langchain_pgvec_textsearch import DistanceStrategy
# Cosine Distance (default) - good for normalized embeddings
DistanceStrategy.COSINE_DISTANCE
# Euclidean Distance - L2 distance
DistanceStrategy.EUCLIDEAN_DISTANCE
# Inner Product - dot product similarity
DistanceStrategy.INNER_PRODUCT
Configure in HNSWIndex:
hnsw_index = HNSWIndex(
name="idx_documents_hnsw",
distance_strategy=DistanceStrategy.COSINE_DISTANCE,
m=16, # Max connections per layer (default: 16)
ef_construction=64, # Size of the dynamic candidate list used while building the graph (default: 64)
)
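The three strategies differ only in how two vectors are compared. The plain-Python sketch below computes each measure directly; the function names are illustrative. In pgvector the corresponding operators are `<=>` (cosine distance), `<->` (Euclidean distance), and `<#>`, which returns the negative inner product so that smaller values still mean "closer".

```python
import math


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For unit-length embeddings the three measures produce the same ranking, so the choice mostly matters when embeddings are not normalized.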
Index Configuration
Index Naming
When table names contain special characters (hyphens, spaces) or uppercase letters, the index names are automatically sanitized:
- Hyphens and spaces are replaced with underscores
- All letters are lowercased
This is required for pg_textsearch's internal index lookup mechanism, which is case-sensitive.
| Table Name | Auto-generated Index Names |
|---|---|
| documents | idx_documents_hnsw, idx_documents_bm25 |
| Ko-StrategyQA-docs | idx_ko_strategyqa_docs_hnsw, idx_ko_strategyqa_docs_bm25 |
| my table name | idx_my_table_name_hnsw, idx_my_table_name_bm25 |
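The sanitization rules can be reproduced in a few lines. The sketch below is not the library's code; the helper names are hypothetical and only the two documented rules (hyphens and spaces to underscores, then lowercase) are applied, using the `idx_<table>_hnsw` and `idx_<table>_bm25` pattern seen in the table.

```python
import re


def sanitize_identifier(table_name):
    """Replace hyphens and spaces with underscores, then lowercase."""
    return re.sub(r"[- ]", "_", table_name).lower()


def default_index_names(table_name):
    """Return the (hnsw, bm25) index names following the documented pattern."""
    base = sanitize_identifier(table_name)
    return f"idx_{base}_hnsw", f"idx_{base}_bm25"


names = default_index_names("Ko-StrategyQA-docs")
```

Running the helper on the table names listed above reproduces the auto-generated names, which is a quick way to predict the value to pass as `bm25_index_name`.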
You can also specify explicit index names:
# Explicit index names (recommended for special table names)
await engine.ainit_hybrid_vectorstore_table(
table_name="Ko-StrategyQA-documents",
vector_size=1536,
hnsw_index=HNSWIndex(
name="idx_ko_strategyqa_hnsw", # Explicit name
distance_strategy=DistanceStrategy.COSINE_DISTANCE,
),
bm25_index=BM25Index(
name="idx_ko_strategyqa_bm25", # Explicit name
text_config="public.korean",
),
)
# When using explicit BM25 index name, also specify in HybridSearchConfig
store = await PGVecTextSearchStore.create(
engine=engine,
embedding_service=embeddings,
table_name="Ko-StrategyQA-documents",
hybrid_search_config=HybridSearchConfig(
enable_dense=True,
enable_sparse=True,
bm25_index_name="idx_ko_strategyqa_bm25", # Must match!
),
)
HNSW Index (Dense Vectors)
hnsw_index = HNSWIndex(
name="idx_documents_hnsw",
distance_strategy=DistanceStrategy.COSINE_DISTANCE,
m=16, # Max connections per layer (default: 16)
ef_construction=64, # Construction time quality (default: 64)
)
HNSW Query Parameters
Configure search-time parameters for better recall or performance:
from langchain_pgvec_textsearch import HybridSearchConfig, IterativeScanMode
store = await PGVecTextSearchStore.create(
engine=engine,
embedding_service=embeddings,
table_name="documents",
hybrid_search_config=HybridSearchConfig(
enable_dense=True,
enable_sparse=True,
# HNSW query parameters
ef_search=100, # Higher = better recall, slower (default: 40)
iterative_scan=IterativeScanMode.RELAXED_ORDER, # For filtered queries (pgvector 0.8.0+)
),
)
| Parameter | Default | Description |
|---|---|---|
| ef_search | 40 | Size of dynamic candidate list. Higher values improve recall at cost of speed. |
| iterative_scan | None | Iterative scan mode for filtered queries. Options: OFF, RELAXED_ORDER, STRICT_ORDER |
When to increase ef_search:
- When recall is more important than speed
- When using filters that may reduce result count
- Recommended range: 40-200
Iterative scan modes (pgvector 0.8.0+):
- OFF: Disabled (default)
- RELAXED_ORDER: Better performance, may slightly reorder results
- STRICT_ORDER: Maintains strict distance ordering
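In pgvector itself these options are session settings named `hnsw.ef_search` and `hnsw.iterative_scan`. How this library applies them is not shown in this README, so the following is only a sketch of the SQL statements one would expect to run before a query; the `hnsw_session_settings` helper is hypothetical.

```python
def hnsw_session_settings(ef_search=None, iterative_scan=None):
    """Build SET LOCAL statements for pgvector's HNSW query-time settings.

    Hypothetical helper: SET LOCAL limits the setting to the current transaction.
    """
    allowed_modes = {"off", "relaxed_order", "strict_order"}
    statements = []
    if ef_search is not None:
        statements.append(f"SET LOCAL hnsw.ef_search = {int(ef_search)}")
    if iterative_scan is not None:
        mode = iterative_scan.lower()
        if mode not in allowed_modes:
            raise ValueError(f"Unknown iterative scan mode: {iterative_scan}")
        statements.append(f"SET LOCAL hnsw.iterative_scan = {mode}")
    return statements


statements = hnsw_session_settings(ef_search=100, iterative_scan="RELAXED_ORDER")
```

Leaving both arguments as `None` yields no statements, which corresponds to using pgvector's defaults.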
IVFFlat Index (Alternative Dense Index)
from langchain_pgvec_textsearch import IVFFlatIndex
ivfflat_index = IVFFlatIndex(
name="idx_documents_ivfflat",
distance_strategy=DistanceStrategy.COSINE_DISTANCE,
lists=100, # Number of clusters
)
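Choosing `lists` depends on table size. The pgvector documentation suggests starting with rows / 1000 for up to 1M rows and sqrt(rows) above that; the sketch below encodes that rule of thumb. The `suggested_lists` helper is hypothetical and is a starting point to tune from, not a setting this library computes.

```python
import math


def suggested_lists(row_count):
    """Starting value for IVFFlat `lists`, following pgvector's rule of thumb."""
    if row_count <= 1_000_000:
        # rows / 1000, but never fewer than one list.
        return max(1, row_count // 1000)
    return math.isqrt(row_count)
```

For example, a table of 100,000 rows gives `lists=100`, matching the value used in the example above.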
BM25 Index (Sparse Search)
bm25_index = BM25Index(
name="idx_documents_bm25",
text_config="english", # PostgreSQL text search config
k1=1.2, # Term frequency saturation (default: 1.2)
b=0.75, # Length normalization (default: 0.75)
)
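To see what `k1` and `b` control, here is the textbook BM25 score for a single query term in plain Python. The exact variant used by pg_textsearch (in particular its IDF formula) is not stated in this README, so treat this as a generic sketch; the function and argument names are illustrative.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document's score.

    tf: term frequency in the document
    doc_len / avg_doc_len: document length and corpus average length
    n_docs / doc_freq: corpus size and number of documents containing the term
    """
    # Rarer terms get a larger inverse document frequency.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b scales how strongly longer-than-average documents are penalized.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences stop adding score.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
```

Raising `k1` lets repeated occurrences of a term keep adding score for longer, and setting `b=0` disables length normalization entirely.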
For Korean text:
bm25_index = BM25Index(
name="idx_documents_bm25",
text_config="public.korean", # Korean text search config
)
Additional Operations
Delete Documents
# Delete by IDs
await store.adelete(ids=["id1", "id2", "id3"])
# Delete by filter
filter_obj = MetadataFilter(key="category", value="outdated", operator=FilterOperator.EQ)
await store.adelete(filter=filter_obj)
MMR Search (Maximal Marginal Relevance)
# Get diverse results
results = await store.amax_marginal_relevance_search(
query="machine learning",
k=5,
fetch_k=20,
lambda_mult=0.5, # 0 = max diversity, 1 = max relevance
filter=filter_obj,
)
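The effect of `lambda_mult` is easiest to see in a small standalone version of the MMR selection loop. The sketch below is a generic greedy MMR over in-memory vectors, not this library's implementation; `mmr_select` and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidates, k=5, lambda_mult=0.5):
    """Greedily pick k (doc_id, vector) candidates balancing relevance and diversity."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for candidate in remaining:
            _, vec = candidate
            relevance = cosine_similarity(query_vec, vec)
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine_similarity(vec, chosen_vec) for _, chosen_vec in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = candidate, score
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]


# "b" is a near-duplicate of "a"; "c" is less relevant but different.
candidates = [("a", [1.0, 0.0]), ("b", [0.99, 0.1]), ("c", [0.6, 0.8])]
```

With `lambda_mult=1.0` the loop reduces to a plain relevance ranking and returns the near-duplicates `a` and `b`; lowering it to 0.3 makes the second pick `c`, because `b` is penalized for being almost identical to `a`.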
Get Documents by IDs
docs = await store.aget_by_ids(["id1", "id2", "id3"])
API Reference
PGVecTextSearchEngine
| Method | Description |
|---|---|
| from_connection_string_async(url) | Create engine from connection string |
| ainit_hybrid_vectorstore_table(...) | Create table with indexes |
| adrop_table(table_name) | Drop a table |
PGVecTextSearchStore
| Method | Description |
|---|---|
| create(engine, embedding_service, table_name, ...) | Create store instance |
| aadd_documents(documents, ids) | Add documents |
| aadd_texts(texts, metadatas, ids) | Add texts with metadata |
| asimilarity_search(query, k, filter) | Search by query |
| asimilarity_search_with_score(query, k, filter) | Search with scores |
| amax_marginal_relevance_search(query, k, fetch_k, lambda_mult, filter) | MMR search |
| adelete(ids, filter) | Delete documents |
| aget_by_ids(ids) | Get documents by IDs |
HybridSearchConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_dense | bool | True | Enable vector search |
| enable_sparse | bool | True | Enable BM25 search |
| dense_top_k | int | 20 | Top K for dense search |
| sparse_top_k | int | 20 | Top K for sparse search |
| fusion_function | callable | reciprocal_rank_fusion | Fusion function |
| fusion_function_parameters | dict | {} | Fusion function params |
| bm25_index_name | str | None | BM25 index name (auto-sanitized: hyphens/spaces become underscores, lowercased) |
| ef_search | int | None | HNSW ef_search parameter. Higher = better recall, slower. (default: 40 in pgvector) |
| iterative_scan | IterativeScanMode | None | HNSW iterative scan mode for filtered queries (pgvector 0.8.0+) |
License
MIT License