LangChain retriever for Ask AI over published Sourcey docs sites.

langchain-sourcey

Implement Ask AI over a published Sourcey docs site.

langchain-sourcey reads Sourcey's generated search and LLM artefacts and returns canonical page URLs for citation.

Sourcey already emits the files a retriever needs:

  • search-index.json for candidate discovery
  • llms-full.txt for full-page hydration
  • canonical page URLs for citations

No hosted index is required. Point site_url at the docs root and use it.
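
As a rough illustration of what "point site_url at the docs root" implies, the sketch below derives the two artefact URLs from a docs root. It assumes both files are served directly under the root, which this README implies but does not state outright. `artefact_urls` is a hypothetical helper, not part of langchain-sourcey.

```python
from urllib.parse import urljoin


def artefact_urls(site_url: str) -> dict[str, str]:
    """Build artefact URLs for a docs root (assumed layout, hypothetical helper)."""
    # A trailing slash makes urljoin append to the path instead of
    # replacing its last segment.
    root = site_url.rstrip("/") + "/"
    return {
        "search_index": urljoin(root, "search-index.json"),
        "llms_full": urljoin(root, "llms-full.txt"),
    }


urls = artefact_urls("https://sourcey.com/docs")
print(urls["search_index"])
print(urls["llms_full"])
```

Fetching either URL in a browser or with curl is a quick way to check that a published site exposes what the retriever needs.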

Install

pip install -U langchain-sourcey

Point site_url at the root of a published Sourcey build:

  • https://sourcey.com/docs
  • https://sourcey.com/cheesestore
  • https://cheesestore.github.io

Quickstart

from langchain_sourcey import SourceyRetriever

retriever = SourceyRetriever(
    site_url="https://sourcey.com/docs",
    top_k=3,
)

docs = retriever.invoke("mcp integration")

for doc in docs:
    print(doc.metadata["title"])
    print(doc.metadata["source"])
    print(doc.page_content[:160])
    print()

For a runnable script, see examples/live_quickstart.py.

More context: Sourcey guide

Implement Ask AI

Install a chat model package. This example uses OpenAI:

pip install -U langchain-openai

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

from langchain_sourcey import SourceyRetriever

retriever = SourceyRetriever(site_url="https://sourcey.com/docs", top_k=3)

prompt = ChatPromptTemplate.from_template(
    """Answer the question using the documentation context below.

{context}

Question: {question}"""
)

chain = (
    RunnablePassthrough.assign(context=(lambda x: x["question"]) | retriever)
    | prompt
    | ChatOpenAI(model="gpt-4.1-mini")
    | StrOutputParser()
)

answer = chain.invoke({"question": "How does Sourcey document MCP servers?"})
print(answer)

For a fuller example, see examples/rag_chain.py.
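
The chain above hands the retrieved list of documents straight to {context}. A common refinement is to join them into one labelled string first, so the model sees each page's title and canonical URL next to its text. The sketch below shows one way to do that. `format_docs` is a hypothetical helper, not part of langchain-sourcey, and the SimpleNamespace objects are placeholders standing in for LangChain Document objects.

```python
from types import SimpleNamespace


def format_docs(docs) -> str:
    """Join documents into a single context string with numbered source labels."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        title = doc.metadata.get("title", "Untitled")
        source = doc.metadata.get("source", "")
        blocks.append(f"[{i}] {title} ({source})\n{doc.page_content}")
    return "\n\n".join(blocks)


# Placeholder data: these objects only mimic the attributes of a Document.
sample = [
    SimpleNamespace(
        page_content="Example page text.",
        metadata={"title": "Page A", "source": "https://example.com/docs/a.html"},
    ),
    SimpleNamespace(
        page_content="More example text.",
        metadata={"title": "Page B", "source": "https://example.com/docs/b.html"},
    ),
]

print(format_docs(sample))
```

In the chain, this would most likely slot in as `(lambda x: x["question"]) | retriever | format_docs`, since LangChain coerces plain functions into runnables when they are piped.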

Sourcey Output Contract

For predictable retrieval, configure the Sourcey build to:

  • publish search-index.json
  • publish llms-full.txt
  • set siteUrl in sourcey.config.ts so citations are canonical

search-index.json is required.

llms-full.txt is strongly recommended. If it is missing, the retriever falls back to the matched page HTML.
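
The fallback can be pictured with the sketch below. It is not the library's implementation: `hydrate_page`, its parameters, and the idea of a URL-to-text mapping parsed from llms-full.txt are all assumptions made for illustration. It also falls back when a page is absent from the mapping, which goes beyond what this README states.

```python
from typing import Callable, Optional


def hydrate_page(
    page_url: str,
    full_text_pages: Optional[dict[str, str]],
    fetch_html: Callable[[str], str],
) -> str:
    """Prefer text from llms-full.txt; otherwise fetch the matched page HTML."""
    # full_text_pages is None when llms-full.txt could not be loaded.
    if full_text_pages is not None and page_url in full_text_pages:
        return full_text_pages[page_url]
    return fetch_html(page_url)


def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP request, so the sketch runs offline.
    return f"<html>{url}</html>"


page = "https://example.com/docs/a.html"
print(hydrate_page(page, {page: "Full page text."}, fake_fetch))
print(hydrate_page(page, None, fake_fetch))
```

Passing the fetcher in as an argument keeps the decision logic testable without network access.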

Returned Metadata

Each returned Document includes:

  • source: canonical page URL used for citations
  • matched_url: original matched URL, including anchors when relevant
  • matched_title: matched search entry title
  • title: hydrated page title
  • path: Sourcey output path such as guides/search.html
  • anchor: matched fragment, if any
  • tab: Sourcey tab label
  • category: Sourcey search category
  • site_url: docs root used for retrieval
  • score: retriever ranking score
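
Several results can point at the same page through different anchors, so an Ask AI interface will usually want one citation per canonical URL. The sketch below collapses results by source and orders them by score. `unique_citations` is a hypothetical helper, the sample objects are placeholders, and the code assumes a higher score means a better match, which this README does not specify.

```python
from types import SimpleNamespace


def unique_citations(docs) -> list[tuple[str, str]]:
    """Return (title, source) pairs, one per canonical URL, best score first."""
    best = {}
    for doc in docs:
        meta = doc.metadata
        source = meta["source"]
        score = meta.get("score", 0.0)
        # Keep only the highest-scoring hit for each canonical URL.
        if source not in best or score > best[source][0]:
            best[source] = (score, meta.get("title", source))
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(title, source) for source, (_, title) in ranked]


# Placeholder results: two hits on page A (different anchors), one on page B.
results = [
    SimpleNamespace(metadata={"source": "https://example.com/docs/a.html",
                              "title": "Page A", "anchor": "intro", "score": 0.4}),
    SimpleNamespace(metadata={"source": "https://example.com/docs/a.html",
                              "title": "Page A", "anchor": "setup", "score": 0.9}),
    SimpleNamespace(metadata={"source": "https://example.com/docs/b.html",
                              "title": "Page B", "anchor": None, "score": 0.7}),
]

for title, source in unique_citations(results):
    print(f"{title}: {source}")
```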

Development

python -m pip install -e ".[dev]" build twine
PYTHONPATH=src pytest -q
SOURCEY_TEST_SITE_URL=https://sourcey.com/docs PYTHONPATH=src pytest tests/integration_tests/test_live_retriever.py -q
python -m build
python -m twine check dist/*

See CONTRIBUTING.md for the release and verification flow.

LangChain Submission Assets

This repo includes draft docs ready to turn into a LangChain docs PR.

JavaScript Package

This repo also contains the JavaScript package in js.

Scope

This package ships SourceyRetriever only. No loader yet.
