Native LangChain retriever for Sourcey-generated documentation sites.
langchain-sourcey
langchain-sourcey is the native LangChain retriever for Sourcey-generated
documentation sites.
It turns a published Sourcey docs root into a LangChain knowledge source without a private indexing service or ingestion pipeline. The retriever works directly against Sourcey's public artefacts:
- search-index.json for candidate discovery
- llms-full.txt for full-page hydration
- canonical page URLs for citations
Why this integration is a good LangChain fit
- No credentials required for public docs sites
- Retrieval works against static hosting, subpath deployments, and GitHub Pages
- Returned Document objects carry canonical metadata["source"] URLs
- llms-full.txt gives cleaner full-page content than scraping rendered HTML
- If llms-full.txt is missing, the retriever falls back to page HTML
Install
pip install -U langchain-sourcey
Point site_url at the root of a published Sourcey build, for example
https://sourcey.com/docs or https://sourcey.com/cheesestore.
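Because the retriever reads static artefacts relative to the docs root, a subpath deployment only resolves correctly if the root is joined with care. The sketch below is illustrative and is not the package's actual implementation: `artefact_urls` is a hypothetical helper, and it assumes both artefacts sit directly under the site root.

```python
from urllib.parse import urljoin


def artefact_urls(site_url: str) -> dict[str, str]:
    """Resolve Sourcey artefact URLs relative to a docs root (hypothetical helper)."""
    # urljoin drops the last path segment unless the base ends with "/",
    # so normalise the root first to keep subpaths such as /docs intact.
    root = site_url.rstrip("/") + "/"
    return {
        "search_index": urljoin(root, "search-index.json"),
        "full_text": urljoin(root, "llms-full.txt"),
    }


urls = artefact_urls("https://sourcey.com/docs")
print(urls["search_index"])
print(urls["full_text"])
```

Without the trailing-slash normalisation, `urljoin("https://sourcey.com/docs", "search-index.json")` would resolve against the host root and lose the `/docs` subpath.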
Quickstart
from langchain_sourcey import SourceyRetriever

# Point the retriever at the root of a published Sourcey site.
retriever = SourceyRetriever(
    site_url="https://sourcey.com/docs",
    top_k=3,
)

docs = retriever.invoke("mcp integration")

# Print the title, canonical URL, and a short preview of each hit.
for doc in docs:
    print(doc.metadata["title"])
    print(doc.metadata["source"])
    print(doc.page_content[:160])
    print()
For a runnable script, see examples/live_quickstart.py.
Use In A LangChain Chain
Install a chat model integration of your choice. This example uses OpenAI:
pip install -U langchain-openai
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from langchain_sourcey import SourceyRetriever

retriever = SourceyRetriever(site_url="https://sourcey.com/docs", top_k=3)

prompt = ChatPromptTemplate.from_template(
    """Answer the question using the documentation context below.
{context}
Question: {question}"""
)

chain = (
    RunnablePassthrough.assign(context=(lambda x: x["question"]) | retriever)
    | prompt
    | ChatOpenAI(model="gpt-4.1-mini")
    | StrOutputParser()
)

answer = chain.invoke({"question": "How does Sourcey document MCP servers?"})
print(answer)
For a fuller example, see examples/rag_chain.py.
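In the chain above, {context} receives the retriever's list of Document objects as-is, so the prompt sees the string form of that list. A common LangChain pattern is to pipe the retriever into a small formatting function first. The following is a minimal sketch of that pattern: `format_docs` is a hypothetical helper name and is not part of langchain-sourcey, and `SimpleNamespace` stand-ins replace real Document objects so the snippet runs without network access.

```python
from types import SimpleNamespace


def format_docs(docs) -> str:
    """Join retrieved documents into one context string with source URLs."""
    blocks = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown source")
        blocks.append(f"Source: {source}\n{doc.page_content}")
    return "\n\n".join(blocks)


# Stand-ins exposing the same attributes as LangChain Document objects.
sample_docs = [
    SimpleNamespace(
        page_content="First page text.",
        metadata={"source": "https://example.com/a.html"},
    ),
    SimpleNamespace(
        page_content="Second page text.",
        metadata={"source": "https://example.com/b.html"},
    ),
]
context = format_docs(sample_docs)
print(context)
```

To use it in the chain, the context step would become `(lambda x: x["question"]) | retriever | format_docs`, which keeps each passage next to its citable URL.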
Sourcey Site Contract
For best results, the published Sourcey site should:
- publish search-index.json
- publish llms-full.txt
- set siteUrl in sourcey.config.ts so citations are canonical
search-index.json is required. llms-full.txt is strongly recommended because
it lets the retriever return full page content instead of HTML-derived fallback
text.
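The split between a required and a recommended artefact can be pictured as a two-step lookup. This is a sketch under stated assumptions rather than the package's real code: `hydrate_page`, the `full_pages` mapping, and the `fetch_html_text` callable are all hypothetical, and a missing llms-full.txt is modelled as an empty mapping.

```python
from typing import Callable, Mapping


def hydrate_page(
    url: str,
    full_pages: Mapping[str, str],
    fetch_html_text: Callable[[str], str],
) -> tuple[str, str]:
    """Return (content, origin) for a page, preferring llms-full.txt content."""
    content = full_pages.get(url)
    if content:
        return content, "llms-full.txt"
    # No entry for this page: fall back to text derived from rendered HTML.
    return fetch_html_text(url), "html"


# Hypothetical data: one page present in llms-full.txt, one absent.
full_pages = {"https://example.com/docs/intro.html": "Intro page text."}


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP fetch plus HTML-to-text conversion."""
    return f"HTML-derived text for {url}"


print(hydrate_page("https://example.com/docs/intro.html", full_pages, fake_fetch))
print(hydrate_page("https://example.com/docs/other.html", full_pages, fake_fetch))
```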
Returned Metadata
Each returned Document includes:
- source: canonical page URL used for citations
- matched_url: original matched URL, including anchors when relevant
- matched_title: matched search entry title
- title: hydrated page title
- path: Sourcey output path such as guides/search.html
- anchor: matched fragment, if any
- tab: Sourcey tab label
- category: Sourcey search category
- site_url: docs root used for retrieval
- score: retriever ranking score
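Several hits can point at different anchors of the same page, so a citation list usually needs one entry per source URL. The sketch below shows one way to build that list from the metadata fields; `unique_citations` is a hypothetical helper and the sample values are made up for illustration.

```python
def unique_citations(metadatas: list[dict]) -> list[str]:
    """Return one 'title (source)' string per distinct source URL, in input order."""
    seen = set()
    citations = []
    for meta in metadatas:
        source = meta["source"]
        if source in seen:
            continue
        seen.add(source)
        # Fall back to the URL itself when no title is present.
        citations.append(f'{meta.get("title", source)} ({source})')
    return citations


# Made-up metadata using the field names listed above.
hits = [
    {"source": "https://example.com/docs/guides/search.html", "title": "Search", "anchor": "ranking"},
    {"source": "https://example.com/docs/guides/search.html", "title": "Search", "anchor": "indexing"},
    {"source": "https://example.com/docs/intro.html", "title": "Introduction", "anchor": ""},
]
for line in unique_citations(hits):
    print(line)
```

With real results, the input would be `[doc.metadata for doc in docs]`.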
Development
python -m pip install -e .[dev] build twine
PYTHONPATH=src pytest -q
SOURCEY_TEST_SITE_URL=https://sourcey.com/docs PYTHONPATH=src pytest tests/integration_tests/test_live_retriever.py -q
python -m build
python -m twine check dist/*
See CONTRIBUTING.md for the release and verification flow.
LangChain Submission Assets
This repo includes draft docs ready to turn into a LangChain docs PR.
Scope
This package intentionally ships SourceyRetriever only. A document loader is
deferred until the retriever proves enough demand to justify the maintenance
surface.
File details
Details for the file langchain_sourcey-0.1.2.tar.gz.
File metadata
- Download URL: langchain_sourcey-0.1.2.tar.gz
- Upload date:
- Size: 36.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 72bc1d0fa285c5ae91601b33feb87599861ab2c654326b2eddaeef56e2be23be |
| MD5 | 34b0098436c98cae0db17e5a4814efb5 |
| BLAKE2b-256 | c4c53146673a411e6c1955e6b8ad01e527782aaee8c46ffa84a683aa673c14a2 |
File details
Details for the file langchain_sourcey-0.1.2-py3-none-any.whl.
File metadata
- Download URL: langchain_sourcey-0.1.2-py3-none-any.whl
- Upload date:
- Size: 31.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 24a8cba940dd529e4cde968d5c926c6cb305a27308db43f3559b5128c6fa1838 |
| MD5 | 2fd653feb06f6ca731745cd978d7b7d5 |
| BLAKE2b-256 | 0a9acf91010d1013fd1cfeedcd02ad06ec3c42ccd00b1a78f9c387e7ce6e633f |