High accuracy agentic search on scientific documents with citations
Vigyan — SDK for agentic search on scientific documents with citations
Overview
Vigyan provides a small, clean Python SDK to parse scientific PDFs, embed the content, index it in a vector database, and answer research questions with citation-aware metadata (paper, page range, paragraph ids, etc.).
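To make "citation-aware metadata" concrete, here is a minimal sketch of how a paper, page range, and paragraph ids could be carried alongside a passage and rendered as a citation string. The CitationMeta class, its field names, and format_citation are hypothetical and chosen for illustration; they are not Vigyan's actual models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationMeta:
    """Hypothetical citation record; field names are illustrative only."""
    paper: str
    page_start: int
    page_end: int
    paragraph_ids: tuple[str, ...]


def format_citation(meta: CitationMeta) -> str:
    """Render 'Paper, p. 3' or 'Paper, pp. 3-4', plus paragraph ids if any."""
    if meta.page_start == meta.page_end:
        pages = f"p. {meta.page_start}"
    else:
        pages = f"pp. {meta.page_start}-{meta.page_end}"
    suffix = f" ({', '.join(meta.paragraph_ids)})" if meta.paragraph_ids else ""
    return f"{meta.paper}, {pages}{suffix}"
```

Keeping the metadata structured and formatting it only at display time means the same record can back both a printed citation and a filter on page or paragraph.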
Design Principles
- Clear interfaces: VectorStore and DocumentParser decouple concerns.
- Storage-agnostic domain models from vigyan.models: Document, Chunk, and QueryHit.
- Adapter implementations: LanceDB vector store with built-in embedding, GROBID parser.
- Domain-named Corpus modules: CorpusIngestor, CorpusRetriever, and run_research_query orchestrate ingestion, retrieval, and cited answers.
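The decoupling described above can be sketched with typing.Protocol. This is an illustration of the pattern only: the method names (parse, add, search), the Chunk fields, and the toy PlainTextParser and KeywordStore adapters are assumptions, not Vigyan's real definitions.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class Chunk:
    """Hypothetical storage-agnostic chunk; not Vigyan's real model."""
    doc_id: str
    text: str
    page: int


class DocumentParser(Protocol):
    def parse(self, pdf_bytes: bytes) -> list[Chunk]: ...


class VectorStore(Protocol):
    def add(self, chunks: list[Chunk]) -> None: ...
    def search(self, query: str, top_k: int) -> list[Chunk]: ...


class PlainTextParser:
    """Toy parser: treats the bytes as UTF-8 text, one chunk per paragraph."""

    def parse(self, pdf_bytes: bytes) -> list[Chunk]:
        paragraphs = [p.strip() for p in pdf_bytes.decode("utf-8").split("\n\n")]
        return [Chunk("doc-1", p, page=1) for p in paragraphs if p]


@dataclass
class KeywordStore:
    """Toy store: ranks chunks by how many query words they contain."""
    chunks: list[Chunk] = field(default_factory=list)

    def add(self, chunks: list[Chunk]) -> None:
        self.chunks.extend(chunks)

    def search(self, query: str, top_k: int) -> list[Chunk]:
        words = set(query.lower().split())
        scored = [(len(words & set(c.text.lower().split())), c) for c in self.chunks]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [c for score, c in ranked[:top_k] if score > 0]


def ingest(parser: DocumentParser, store: VectorStore, pdf_bytes: bytes) -> int:
    """Depends only on the two protocols, so either adapter can be swapped."""
    chunks = parser.parse(pdf_bytes)
    store.add(chunks)
    return len(chunks)
```

Because ingest only sees the two protocols, replacing the toy adapters with a real parser or vector store would not require changing the orchestration code.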
Install
Requires Python 3.12+.
Dependencies include lancedb, httpx, lxml, and pydantic (declared in pyproject.toml).
Quick Start
from vigyan.corpus import CorpusIngestor, CorpusRetriever
from vigyan.parsers import GrobidParser
from vigyan.vectordb import LanceDBVectorStore
from vigyan.agent import run_research_query
# Configure adapters
store = LanceDBVectorStore(embedding_model="text-embedding-3-small")
parser = GrobidParser(server_url="http://localhost:8070") # GROBID must be running
# Ingest a PDF with automatic metadata via GROBID
with open("paper.pdf", "rb") as f:  # close the file handle once the bytes are read
    pdf_bytes = f.read()
ingestor = CorpusIngestor(parser=parser, store=store)
ingestor.ingest_pdf(pdf_bytes, meta=None)
# Retrieve relevant passages directly
retriever = CorpusRetriever(store=store)
hits = retriever.retrieve("protein folding with attention", top_k=5)
for h in hits:
    print(h.citation, "-", h.title)
    print(h.text)
# Or run the research agent for a cited answer
answer = run_research_query(
    "What does this corpus say about protein folding with attention?",
    db_uri="./vigyan_db",
    embed_model="text-embedding-3-small",
)
print(answer.answer)
for citation in answer.citations:
    print(f"[{citation.index}] {citation.citation}")
clai Web Agent
clai web cannot pass Pydantic AI deps directly, so Vigyan's importable
agent resolves vector-store deps from environment variables when explicit SDK
deps are not provided:
export VIGYAN_DB_URI=./vigyan_db
export VIGYAN_EMBED_MODEL=text-embedding-3-small
# Optional:
# export VIGYAN_TOP_K=8
# export VIGYAN_FILTERS="year >= 2020"
uv run clai web --agent src.vigyan.agent.research_agent:agent
The normal SDK path still uses explicit deps via run_research_query(...).
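As a rough illustration of the environment-variable fallback, the sketch below reads the four variables listed above into a settings object. The StoreSettings class and settings_from_env function are hypothetical, as are the choices to treat VIGYAN_DB_URI and VIGYAN_EMBED_MODEL as required and to default VIGYAN_TOP_K to 5; Vigyan's own resolution logic may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class StoreSettings:
    """Hypothetical container for the values the agent needs."""
    db_uri: str
    embed_model: str
    top_k: int
    filters: Optional[str]


def settings_from_env(env: Mapping[str, str] = os.environ) -> StoreSettings:
    """Build settings from environment variables, failing fast on missing ones."""
    try:
        db_uri = env["VIGYAN_DB_URI"]
        embed_model = env["VIGYAN_EMBED_MODEL"]
    except KeyError as missing:
        raise RuntimeError(f"required variable {missing} is not set") from None
    return StoreSettings(
        db_uri=db_uri,
        embed_model=embed_model,
        top_k=int(env.get("VIGYAN_TOP_K", "5")),  # default of 5 is an assumption
        filters=env.get("VIGYAN_FILTERS") or None,  # empty string means no filter
    )
```

Accepting the mapping as a parameter keeps the function easy to exercise with a plain dict instead of the real process environment.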
Notes
- An OpenAI-compatible API key must be available in the environment for embedding.
- GROBID must be running for parsing and metadata extraction. You can swap in a different DocumentParser implementation if preferred.
- The LanceDB store uses auto-embedding via the LanceDB registry, supporting OpenAI and other providers.
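The first two notes can be checked before ingestion with a small preflight script. The sketch below uses only the standard library and GROBID's /api/isalive health endpoint. The function names are hypothetical, and OPENAI_API_KEY is an assumed variable name: an OpenAI-compatible provider may expect a different one.

```python
import os
import urllib.request


def grobid_is_alive(server_url: str, timeout: float = 2.0) -> bool:
    """Return True if GROBID's /api/isalive endpoint answers with HTTP 200."""
    url = server_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # connection refused, timeout, HTTP error, DNS failure
        return False


def preflight(server_url: str, key_var: str = "OPENAI_API_KEY") -> list[str]:
    """Collect human-readable problems; an empty list means ready to ingest."""
    problems = []
    if not os.environ.get(key_var):  # key_var is an assumed variable name
        problems.append(f"{key_var} is not set")
    if not grobid_is_alive(server_url):
        problems.append(f"GROBID is not reachable at {server_url}")
    return problems
```

Calling preflight("http://localhost:8070") before running the Quick Start would surface a missing key or a stopped GROBID server as a readable message rather than a failure partway through ingestion.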