langchain-sourcey

langchain-sourcey is the native LangChain retriever for Sourcey-generated documentation sites.

It turns a published Sourcey docs root into a LangChain knowledge source without a private indexing service or ingestion pipeline. The retriever works directly against Sourcey's public artefacts:

  • search-index.json for candidate discovery
  • llms-full.txt for full-page hydration
  • canonical page URLs for citations

Why this integration is a good LangChain fit

  • No credentials required for public docs sites
  • Retrieval works against static hosting, subpath deployments, and GitHub Pages
  • Returned Document objects carry canonical metadata["source"] URLs
  • llms-full.txt gives cleaner full-page content than scraping rendered HTML
  • If llms-full.txt is missing, the retriever falls back to page HTML

Install

pip install -U langchain-sourcey

Point site_url at the root of a published Sourcey build, for example https://sourcey.com/docs or https://sourcey.com/cheesestore.
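The root matters because relative URL resolution drops the last path segment when the base has no trailing slash. The sketch below illustrates that pitfall for a subpath deployment. It assumes the two artefacts sit directly under the docs root, and `artefact_urls` is a hypothetical helper for illustration, not part of the package's API.

```python
from urllib.parse import urljoin


def artefact_urls(site_url: str) -> dict[str, str]:
    """Resolve the public artefact URLs under a docs root (illustrative only)."""
    # Normalise to exactly one trailing slash so urljoin keeps the subpath.
    root = site_url.rstrip("/") + "/"
    return {
        # Assumption: both files are published directly under the root.
        "search_index": urljoin(root, "search-index.json"),
        "llms_full": urljoin(root, "llms-full.txt"),
    }


# Without the trailing slash, urljoin replaces the final "docs" segment.
naive = urljoin("https://sourcey.com/docs", "search-index.json")
resolved = artefact_urls("https://sourcey.com/docs")["search_index"]
print(naive)
print(resolved)
```

How SourceyRetriever normalises site_url internally is not shown here; the sketch only demonstrates why a URL that is not the build root would resolve to the wrong files.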

Quickstart

from langchain_sourcey import SourceyRetriever

retriever = SourceyRetriever(
    site_url="https://sourcey.com/docs",
    top_k=3,
)

docs = retriever.invoke("mcp integration")

for doc in docs:
    print(doc.metadata["title"])
    print(doc.metadata["source"])
    print(doc.page_content[:160])
    print()

For a runnable script, see examples/live_quickstart.py.

Use In A LangChain Chain

Install a chat model integration of your choice. This example uses OpenAI:

pip install -U langchain-openai

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

from langchain_sourcey import SourceyRetriever

retriever = SourceyRetriever(site_url="https://sourcey.com/docs", top_k=3)

prompt = ChatPromptTemplate.from_template(
    """Answer the question using the documentation context below.

{context}

Question: {question}"""
)

chain = (
    RunnablePassthrough.assign(context=(lambda x: x["question"]) | retriever)
    | prompt
    | ChatOpenAI(model="gpt-4.1-mini")
    | StrOutputParser()
)

answer = chain.invoke({"question": "How does Sourcey document MCP servers?"})
print(answer)

For a fuller example, see examples/rag_chain.py.
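In the chain above, the retriever's output is a list of Document objects, so the prompt receives that list's string form. One possible refinement is to format the documents into numbered, cited blocks first. The sketch below uses a hypothetical `format_docs` helper (not part of langchain-sourcey) and relies only on the `page_content` attribute and the `title` and `source` metadata keys described in this README. `SimpleNamespace` stands in for a real Document so the snippet runs without LangChain installed.

```python
from types import SimpleNamespace


def format_docs(docs) -> str:
    """Join retrieved documents into numbered blocks with their citation URL."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        title = doc.metadata.get("title", "Untitled")
        source = doc.metadata.get("source", "")
        blocks.append(f"[{i}] {title} ({source})\n{doc.page_content}")
    return "\n\n".join(blocks)


# Stand-in for a retrieved Document; the values are made up for the demo.
sample = SimpleNamespace(
    page_content="Example page body.",
    metadata={"title": "Example page", "source": "https://example.com/docs/page.html"},
)
print(format_docs([sample]))
```

Assuming LangChain's usual coercion of plain callables into runnables, the helper could be appended to the retrieval step as `(lambda x: x["question"]) | retriever | format_docs`.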

Sourcey Site Contract

For best results, the published Sourcey site should:

  • publish search-index.json
  • publish llms-full.txt
  • set siteUrl in sourcey.config.ts so citations are canonical

search-index.json is required. llms-full.txt is strongly recommended because it lets the retriever return full page content instead of HTML-derived fallback text.
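A site can be checked against this contract before wiring it into a chain. The following is a minimal sketch, assuming both files are served directly under the docs root and that the host answers HEAD requests. `url_exists` and `check_contract` are hypothetical helpers, not functions shipped by the package.

```python
from typing import Callable
from urllib.request import Request, urlopen


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to url succeeds with a 2xx status."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and timeouts; ValueError covers bad URLs.
        return False


def check_contract(
    site_url: str, exists: Callable[[str], bool] = url_exists
) -> dict[str, bool]:
    """Report which contract artefacts are reachable under the docs root."""
    root = site_url.rstrip("/") + "/"
    # Assumption: both artefacts live directly under the root.
    return {name: exists(root + name) for name in ("search-index.json", "llms-full.txt")}
```

Calling `check_contract("https://sourcey.com/docs")` performs two network requests. A result with `search-index.json` set to False means the required file was not found at that root, while a False for `llms-full.txt` suggests the retriever would use its HTML fallback.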

Returned Metadata

Each returned Document includes:

  • source: canonical page URL used for citations
  • matched_url: original matched URL, including anchors when relevant
  • matched_title: matched search entry title
  • title: hydrated page title
  • path: Sourcey output path such as guides/search.html
  • anchor: matched fragment, if any
  • tab: Sourcey tab label
  • category: Sourcey search category
  • site_url: docs root used for retrieval
  • score: retriever ranking score
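Because source is the canonical page URL while matched_url may carry an anchor, a citation list is most naturally keyed on source. The sketch below builds such a list with a hypothetical `citation_list` helper. It assumes several results could share one page, which may not occur in practice, and it preserves the order in which the retriever returned them.

```python
from types import SimpleNamespace


def citation_list(docs) -> list[str]:
    """Return one 'title: url' line per distinct source, in first-seen order."""
    seen: dict[str, str] = {}
    for doc in docs:
        source = doc.metadata["source"]
        # Keep the first title seen for each page; fall back to the URL itself.
        seen.setdefault(source, doc.metadata.get("title") or source)
    return [f"{title}: {url}" for url, title in seen.items()]


# Stand-ins for retrieved Documents; the values are made up for the demo.
hits = [
    SimpleNamespace(metadata={"source": "https://example.com/docs/a.html", "title": "Page A"}),
    SimpleNamespace(metadata={"source": "https://example.com/docs/b.html", "title": "Page B"}),
    SimpleNamespace(metadata={"source": "https://example.com/docs/a.html", "title": "Page A"}),
]
print("\n".join(citation_list(hits)))
```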

Development

python -m pip install -e ".[dev]" build twine
PYTHONPATH=src pytest -q
SOURCEY_TEST_SITE_URL=https://sourcey.com/docs PYTHONPATH=src pytest tests/integration_tests/test_live_retriever.py -q
python -m build
python -m twine check dist/*

See CONTRIBUTING.md for the release and verification flow.

LangChain Submission Assets

This repo includes draft docs ready to turn into a LangChain docs PR.

Scope

This package intentionally ships SourceyRetriever only. A document loader is deferred until the retriever proves enough demand to justify the maintenance surface.
