LangChain retriever for Ask AI over published Sourcey docs sites.

langchain-sourcey

Build your own Ask AI on top of a published Sourcey docs site.

langchain-sourcey is the retrieval layer behind that feature.

Sourcey already emits the files a retriever needs:

  • search-index.json for candidate discovery
  • llms-full.txt for full-page hydration
  • canonical page URLs for citations

No hosted index is required. Point site_url at the docs root and use it.
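
As a rough illustration of what pointing site_url at the docs root implies, the sketch below derives the two artifact URLs from a docs root. It assumes both files are served directly under that root. The file locations and the helper name artifact_urls are assumptions made for illustration, not documented behavior of langchain-sourcey.

```python
from urllib.parse import urljoin


def artifact_urls(site_url: str) -> dict[str, str]:
    """Hypothetical helper: guess where a Sourcey build serves its retrieval files.

    Assumption: search-index.json and llms-full.txt sit directly under the docs root.
    """
    # urljoin replaces the last path segment unless the base ends with "/",
    # so normalize the root to exactly one trailing slash first.
    base = site_url.rstrip("/") + "/"
    return {
        "search_index": urljoin(base, "search-index.json"),
        "llms_full": urljoin(base, "llms-full.txt"),
    }
```

Calling artifact_urls("https://sourcey.com/docs") and opening the resulting URLs in a browser is a quick way to confirm that a site publishes what the retriever expects.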

Install

pip install -U langchain-sourcey

Point site_url at the root of a published Sourcey build:

  • https://sourcey.com/docs
  • https://sourcey.com/cheesestore
  • https://cheesestore.github.io

Quickstart

from langchain_sourcey import SourceyRetriever

retriever = SourceyRetriever(
    site_url="https://sourcey.com/docs",
    top_k=3,
)

docs = retriever.invoke("mcp integration")

for doc in docs:
    print(doc.metadata["title"])
    print(doc.metadata["source"])
    print(doc.page_content[:160])
    print()

For a runnable script, see examples/live_quickstart.py.

More context: https://sourcey.com/docs/guides/guide-langchain-retriever

Implement Ask AI

Install a chat model package. This example uses OpenAI:

pip install -U langchain-openai

Then wire the retriever into a chain:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

from langchain_sourcey import SourceyRetriever

retriever = SourceyRetriever(site_url="https://sourcey.com/docs", top_k=3)

prompt = ChatPromptTemplate.from_template(
    """Answer the question using the documentation context below.

{context}

Question: {question}"""
)

chain = (
    RunnablePassthrough.assign(context=(lambda x: x["question"]) | retriever)
    | prompt
    | ChatOpenAI(model="gpt-4.1-mini")
    | StrOutputParser()
)

answer = chain.invoke({"question": "How does Sourcey document MCP servers?"})
print(answer)

For a fuller example, see examples/rag_chain.py.
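
The chain above passes the retrieved documents to the prompt unchanged, so the template receives the string form of a list. One common refinement is to join the documents into a single labelled text block first. The following is a minimal sketch, assuming each document exposes page_content plus a metadata dict with the title and source keys described under Returned Metadata. format_docs and the Doc stand-in class are hypothetical names used here for illustration; neither is part of langchain-sourcey.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document so the sketch runs without extra installs."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, max_chars: int = 2000) -> str:
    """Join retrieved documents into one context string, labelled by title and source."""
    blocks = []
    for doc in docs:
        title = doc.metadata.get("title", "Untitled")
        source = doc.metadata.get("source", "")
        # Truncate each page so one long document cannot crowd out the others.
        body = doc.page_content[:max_chars]
        blocks.append(f"[{title}]({source})\n{body}")
    return "\n\n".join(blocks)
```

In the chain, the helper could be piped after the retriever, for example context=(lambda x: x["question"]) | retriever | format_docs, since LangChain coerces plain callables placed in a pipeline into runnables.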

What Has To Exist

For clean retrieval, the published Sourcey site should:

  • publish search-index.json
  • publish llms-full.txt
  • set siteUrl in sourcey.config.ts so citations are canonical

search-index.json is required.

llms-full.txt is strongly recommended. If it is missing, the retriever falls back to the matched page HTML.
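
The required-versus-recommended distinction can be turned into a small preflight check before wiring a site into a chain. This is a sketch under stated assumptions: it presumes both files are served directly under the docs root, and check_site and url_exists are hypothetical helpers, not functions provided by langchain-sourcey. The existence check is passed in as a callable so the decision logic can be exercised without network access.

```python
import urllib.request
from typing import Callable


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to url succeeds with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and timeouts are all OSError subclasses.
        return False


def check_site(site_url: str, exists: Callable[[str], bool] = url_exists) -> str:
    """Classify a docs root by which retrieval files it appears to publish.

    Assumption: both files live directly under the docs root.
    """
    base = site_url.rstrip("/")
    if not exists(f"{base}/search-index.json"):
        return "unusable: search-index.json is missing"
    if not exists(f"{base}/llms-full.txt"):
        return "degraded: llms-full.txt is missing, expect the page HTML fallback"
    return "ok"
```

Running check_site("https://sourcey.com/docs") performs the live check. Passing a custom exists callable, such as lambda url: True, keeps the function usable in offline tests.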

Returned Metadata

Each returned Document includes:

  • source: canonical page URL used for citations
  • matched_url: original matched URL, including anchors when relevant
  • matched_title: matched search entry title
  • title: hydrated page title
  • path: Sourcey output path such as guides/search.html
  • anchor: matched fragment, if any
  • tab: Sourcey tab label
  • category: Sourcey search category
  • site_url: docs root used for retrieval
  • score: retriever ranking score
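
These fields are enough to build a citation list for an answer. Because several search hits can resolve to the same page, the sketch below deduplicates on source while keeping the order in which the results arrived. It works on plain metadata dicts (for example [doc.metadata for doc in docs]); build_citations is a hypothetical helper, not part of the package.

```python
def build_citations(metadatas: list[dict]) -> list[str]:
    """Return one "title <url>" line per distinct source, in first-seen order."""
    seen: set[str] = set()
    citations: list[str] = []
    for meta in metadatas:
        source = meta.get("source")
        # Skip entries without a canonical URL and repeats of a page already cited.
        if not source or source in seen:
            continue
        seen.add(source)
        # Fall back to the URL itself when no hydrated title is available.
        title = meta.get("title") or source
        citations.append(f"{title} <{source}>")
    return citations
```

When a link should jump to the matched section rather than the top of the page, matched_url could be used in place of source, at the cost of citing the same page more than once.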

Development

python -m pip install -e ".[dev]" build twine
PYTHONPATH=src pytest -q
SOURCEY_TEST_SITE_URL=https://sourcey.com/docs PYTHONPATH=src pytest tests/integration_tests/test_live_retriever.py -q
python -m build
python -m twine check dist/*

See CONTRIBUTING.md for the release and verification flow.

LangChain Submission Assets

This repo includes draft docs ready to turn into a LangChain docs PR.

JavaScript Package

This repo also contains the JavaScript package in the js directory.

Scope

This package ships SourceyRetriever only. No loader yet.
