Native LangChain retriever for Sourcey-generated documentation sites.

langchain-sourcey

Your docs retriever should not depend on somebody else's SaaS either.

langchain-sourcey reads a published Sourcey docs site directly.

Sourcey already ships the files a retriever needs. This package uses them:

  • search-index.json for candidate discovery
  • llms-full.txt for full-page hydration
  • canonical page URLs for citations

If llms-full.txt is missing, it falls back to the matched page HTML.
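The hydration order described above can be sketched roughly as follows. This is an illustration, not the package's actual code: `hydrate_page`, `full_text_pages`, and `fetch_html_text` are hypothetical names, and the sketch also assumes that a page absent from llms-full.txt is treated the same way as a missing file.

```python
from typing import Callable, Mapping, Optional


def hydrate_page(
    page_url: str,
    full_text_pages: Optional[Mapping[str, str]],
    fetch_html_text: Callable[[str], str],
) -> str:
    """Return page content, preferring llms-full.txt text over HTML-derived text.

    full_text_pages is a hypothetical mapping of canonical page URL to page
    text parsed from llms-full.txt, or None when that file is missing.
    fetch_html_text is a hypothetical callable that downloads a page and
    extracts readable text from its HTML.
    """
    if full_text_pages is not None and page_url in full_text_pages:
        return full_text_pages[page_url]
    # Assumption: fall back to the matched page's HTML in every other case.
    return fetch_html_text(page_url)


# Usage with stand-in data, so nothing is fetched over the network.
pages = {"https://example.com/docs/a.html": "Full text of page A"}
fake_fetch = lambda url: f"HTML-derived text for {url}"
print(hydrate_page("https://example.com/docs/a.html", pages, fake_fetch))
print(hydrate_page("https://example.com/docs/a.html", None, fake_fetch))
```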

Install

pip install -U langchain-sourcey

Point site_url at the root of a published Sourcey build:

  • https://sourcey.com/docs
  • https://sourcey.com/cheesestore
  • https://cheesestore.github.io
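One detail worth knowing when working with a docs root like these: joining a file name onto a root that lacks a trailing slash can silently drop the last path segment. The sketch below shows one way to build file URLs safely. `site_file_url` is a hypothetical helper, and it assumes the Sourcey files sit directly under the site root.

```python
from urllib.parse import urljoin


def site_file_url(site_url: str, filename: str) -> str:
    """Join a docs root and a file name without losing the last path segment."""
    # urljoin treats "https://sourcey.com/docs" as a file named "docs" and
    # would replace it, so normalize the root to end with a slash first.
    root = site_url.rstrip("/") + "/"
    return urljoin(root, filename)


print(site_file_url("https://sourcey.com/docs", "search-index.json"))
print(site_file_url("https://cheesestore.github.io", "llms-full.txt"))
```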

Quickstart

from langchain_sourcey import SourceyRetriever

retriever = SourceyRetriever(
    site_url="https://sourcey.com/docs",
    top_k=3,
)

docs = retriever.invoke("mcp integration")

for doc in docs:
    print(doc.metadata["title"])
    print(doc.metadata["source"])
    print(doc.page_content[:160])
    print()

For a runnable script, see examples/live_quickstart.py.

Sourcey guide: https://sourcey.com/docs/guides/guide-langchain-retriever.html

Use In A LangChain Chain

Install a chat model integration of your choice. This example uses OpenAI:

pip install -U langchain-openai

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

from langchain_sourcey import SourceyRetriever

retriever = SourceyRetriever(site_url="https://sourcey.com/docs", top_k=3)

prompt = ChatPromptTemplate.from_template(
    """Answer the question using the documentation context below.

{context}

Question: {question}"""
)

chain = (
    RunnablePassthrough.assign(context=(lambda x: x["question"]) | retriever)
    | prompt
    | ChatOpenAI(model="gpt-4.1-mini")
    | StrOutputParser()
)

answer = chain.invoke({"question": "How does Sourcey document MCP servers?"})
print(answer)

For a fuller example, see examples/rag_chain.py.
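In the chain above, the retriever's list of documents is formatted into the prompt as-is. A common refinement is a small helper that joins the pages into one readable context string. The sketch below is one possible shape: `format_docs` is a hypothetical helper, and `Doc` is a stand-in for LangChain's Document so the snippet runs without LangChain installed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable


@dataclass
class Doc:
    """Stand-in for a LangChain Document (page_content plus metadata)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_docs(docs: Iterable[Doc]) -> str:
    """Join retrieved pages into one context string with a source line per page."""
    blocks = []
    for doc in docs:
        title = doc.metadata.get("title", "Untitled")
        source = doc.metadata.get("source", "")
        blocks.append(f"# {title}\nSource: {source}\n\n{doc.page_content}")
    return "\n\n---\n\n".join(blocks)


sample = [Doc("MCP servers are documented as ...", {"title": "MCP", "source": "https://example.com/docs/mcp.html"})]
print(format_docs(sample))
```

Because LCEL coerces plain functions into runnables, a helper like this could be piped after the retriever, for example (lambda x: x["question"]) | retriever | format_docs.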

Sourcey Contract

To work with this package, a published Sourcey site should:

  • publish search-index.json
  • publish llms-full.txt
  • set siteUrl in sourcey.config.ts so citations are canonical

search-index.json is required. llms-full.txt is strongly recommended because it gives the retriever full page content instead of HTML-derived fallback text.
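Before pointing the retriever at a site, it can help to confirm that both files are actually published. The following is a minimal sketch of such a check, not part of the package: `url_exists` and `check_contract` are hypothetical helpers, the files are assumed to live directly under the site root, and some hosts may not answer HEAD requests.

```python
from typing import Callable, Dict
from urllib.request import Request, urlopen


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to url succeeds with a 2xx status."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers HTTPError, URLError, and socket timeouts.
        return False


def check_contract(
    site_url: str, exists: Callable[[str], bool] = url_exists
) -> Dict[str, bool]:
    """Report which of the two Sourcey files are reachable under site_url."""
    root = site_url.rstrip("/") + "/"
    return {
        "search-index.json": exists(root + "search-index.json"),
        "llms-full.txt": exists(root + "llms-full.txt"),
    }
```

Calling check_contract("https://sourcey.com/docs") would issue two HEAD requests. Passing a custom exists callable keeps the check testable offline.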

Returned Metadata

Each returned Document includes:

  • source: canonical page URL used for citations
  • matched_url: original matched URL, including anchors when relevant
  • matched_title: matched search entry title
  • title: hydrated page title
  • path: Sourcey output path such as guides/search.html
  • anchor: matched fragment, if any
  • tab: Sourcey tab label
  • category: Sourcey search category
  • site_url: docs root used for retrieval
  • score: retriever ranking score
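As an illustration of how these fields might be used, the sketch below turns a list of metadata dicts into numbered citation lines, one per distinct source, in case several hits point at the same page under different anchors. `build_citations` is a hypothetical helper, and the sample values are made up.

```python
from typing import Any, Dict, Iterable, List


def build_citations(metadatas: Iterable[Dict[str, Any]]) -> List[str]:
    """Return one numbered citation per distinct source URL, in retrieval order."""
    seen = set()
    lines: List[str] = []
    for meta in metadatas:
        source = meta.get("source")
        if not source or source in seen:
            continue  # skip entries without a source and repeated pages
        seen.add(source)
        # Prefer the hydrated page title, then the matched search entry title.
        title = meta.get("title") or meta.get("matched_title") or source
        lines.append(f"[{len(lines) + 1}] {title} - {source}")
    return lines


sample = [
    {"source": "https://example.com/docs/guides/search.html", "title": "Search"},
    {"source": "https://example.com/docs/guides/search.html", "title": "Search", "anchor": "ranking"},
]
for line in build_citations(sample):
    print(line)
```

With real results, the input would be [doc.metadata for doc in docs].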

Development

python -m pip install -e ".[dev]" build twine
PYTHONPATH=src pytest -q
SOURCEY_TEST_SITE_URL=https://sourcey.com/docs PYTHONPATH=src pytest tests/integration_tests/test_live_retriever.py -q
python -m build
python -m twine check dist/*

See CONTRIBUTING.md for the release and verification flow.

LangChain Submission Assets

This repo includes draft docs ready to turn into a LangChain docs PR.

Scope

This package ships SourceyRetriever only. No loader yet.
