
High-accuracy agentic search on scientific documents with citations


Vigyan — SDK for agentic search on scientific documents with citations

Overview

Vigyan provides a small, clean Python SDK to parse scientific PDFs, embed the content, index it in a vector database, and answer research questions with citation-aware metadata (paper, page range, paragraph ids, etc.).
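To make "citation-aware metadata" concrete, here is a minimal sketch of what such a record could look like and how it might be rendered. The class name `PassageCitation`, its fields, and the output format are assumptions made for illustration; they are not Vigyan's actual schema, and the sample values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassageCitation:
    """Hypothetical citation record; field names are illustrative only."""

    paper_title: str
    page_start: int
    page_end: int
    paragraph_ids: tuple[str, ...]

    def render(self) -> str:
        # Use "p." for a single page and "pp." for a page range.
        if self.page_start == self.page_end:
            pages = f"p. {self.page_start}"
        else:
            pages = f"pp. {self.page_start}-{self.page_end}"
        paragraphs = ", ".join(self.paragraph_ids)
        return f"{self.paper_title}, {pages} (paragraphs: {paragraphs})"


# Placeholder values, not taken from a real paper.
example = PassageCitation("Example Paper", 3, 4, ("p12", "p13"))
print(example.render())
```

Keeping the page range and paragraph ids as structured fields, rather than a pre-formatted string, lets a caller decide later how a citation should be displayed.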

Design Principles

  • Clear interfaces: VectorStore and DocumentParser decouple concerns.
  • Storage-agnostic domain models from vigyan.models: Document, Chunk, and QueryHit.
  • Adapter implementations: LanceDB vector store with built-in embedding, GROBID parser.
  • Domain-named Corpus modules: CorpusIngestor, CorpusRetriever, and run_research_query orchestrate ingestion, retrieval, and cited answers.
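
The decoupling described above can be sketched with `typing.Protocol`. Everything below is an assumption for illustration: the method names (`parse`, `add`, `search`), the simplified `Document` and `Chunk` fields, and the toy `PlainTextParser` and `KeywordStore` adapters are hypothetical and do not reproduce Vigyan's real interfaces. The point is only that ingestion code can depend on the two interfaces rather than on GROBID or LanceDB.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Chunk:
    """Simplified stand-in for a chunk model (fields are assumptions)."""

    chunk_id: str
    text: str
    page: int


@dataclass
class Document:
    """Simplified stand-in for a document model (fields are assumptions)."""

    title: str
    chunks: list[Chunk] = field(default_factory=list)


class DocumentParser(Protocol):
    def parse(self, pdf_bytes: bytes) -> Document: ...


class VectorStore(Protocol):
    def add(self, chunks: list[Chunk]) -> None: ...

    def search(self, query: str, top_k: int) -> list[Chunk]: ...


class PlainTextParser:
    """Toy parser: decodes UTF-8 text and emits one chunk per paragraph."""

    def parse(self, pdf_bytes: bytes) -> Document:
        text = pdf_bytes.decode("utf-8")
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks = [Chunk(f"c{i}", p, page=1) for i, p in enumerate(paragraphs)]
        return Document(title="untitled", chunks=chunks)


class KeywordStore:
    """Toy store: ranks chunks by shared query words (no embeddings)."""

    def __init__(self) -> None:
        self._chunks: list[Chunk] = []

    def add(self, chunks: list[Chunk]) -> None:
        self._chunks.extend(chunks)

    def search(self, query: str, top_k: int) -> list[Chunk]:
        words = set(query.lower().split())
        scored = [(len(words & set(c.text.lower().split())), c) for c in self._chunks]
        matches = [pair for pair in scored if pair[0] > 0]
        ranked = sorted(matches, key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in ranked[:top_k]]


def ingest(parser: DocumentParser, store: VectorStore, raw: bytes) -> int:
    """Parse raw bytes and index the resulting chunks; returns the chunk count."""
    doc = parser.parse(raw)
    store.add(doc.chunks)
    return len(doc.chunks)


store = KeywordStore()
count = ingest(
    PlainTextParser(),
    store,
    b"Attention helps protein folding.\n\nGraphs model molecules.",
)
top = store.search("protein folding", top_k=1)
print(count, top[0].chunk_id)
```

Because `ingest` only relies on the two protocols, swapping the toy adapters for real ones would not require changing it.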

Install

Requires Python 3.12+.

Dependencies include lancedb, httpx, lxml, and pydantic (declared in pyproject.toml).

Quick Start

from vigyan.corpus import CorpusIngestor, CorpusRetriever
from vigyan.parsers import GrobidParser
from vigyan.vectordb import LanceDBVectorStore
from vigyan.agent import run_research_query

# Configure adapters
store = LanceDBVectorStore(embedding_model="text-embedding-3-small")
parser = GrobidParser(server_url="http://localhost:8070")  # GROBID must be running

# Ingest a PDF with automatic metadata via GROBID
with open("paper.pdf", "rb") as f:  # close the file handle promptly
    pdf_bytes = f.read()
ingestor = CorpusIngestor(parser=parser, store=store)
ingestor.ingest_pdf(pdf_bytes, meta=None)

# Retrieve relevant passages directly
retriever = CorpusRetriever(store=store)
hits = retriever.retrieve("protein folding with attention", top_k=5)
for h in hits:
    print(h.citation, "-", h.title)
    print(h.text)

# Or run the research agent for a cited answer
answer = run_research_query(
    "What does this corpus say about protein folding with attention?",
    db_uri="./vigyan_db",
    embed_model="text-embedding-3-small",
)
print(answer.answer)
for citation in answer.citations:
    print(f"[{citation.index}] {citation.citation}")

clai Web Agent

clai web cannot pass Pydantic AI deps directly, so Vigyan's importable agent resolves vector-store deps from environment variables when explicit SDK deps are not provided:

export VIGYAN_DB_URI=./vigyan_db
export VIGYAN_EMBED_MODEL=text-embedding-3-small
# Optional:
# export VIGYAN_TOP_K=8
# export VIGYAN_FILTERS="year >= 2020"

uv run clai web --agent src.vigyan.agent.research_agent:agent

The normal SDK path still uses explicit deps via run_research_query(...).

Notes

  • An OpenAI-compatible API key must be available in the environment for embedding.
  • GROBID must be running for parsing and metadata extraction. You can swap in a different DocumentParser implementation if preferred.
  • The LanceDB store uses auto-embedding via the LanceDB registry, supporting OpenAI and other providers.
