# langchain-crawlbase

LangChain primitives backed by the Crawlbase Crawling API. Three drop-in classes:

- `CrawlbaseLoader`: a `BaseLoader` that fetches a list of URLs and returns clean Markdown `Document`s ready for chunking.
- `CrawlbaseTool`: a `BaseTool` that lets an LLM agent fetch a live web page mid-conversation.
- `CrawlbaseRetriever`: a `BaseRetriever` that fetches a fixed set of seed URLs and filters them by query.
All three return GitHub-flavored Markdown via Crawlbase's `format=md` parameter, so you skip the HTML-stripping step entirely.
## Installation

```shell
pip install langchain-crawlbase
```
## Setup

Get a token from your Crawlbase dashboard. Use your normal token for static pages, or your JavaScript token for SPA / JS-rendered pages; Crawlbase routes the request automatically based on which token you send.

```shell
export CRAWLBASE_TOKEN=your_token
```
## Usage

### Document loader

```python
import os

from langchain_crawlbase import CrawlbaseLoader

loader = CrawlbaseLoader(
    token=os.environ["CRAWLBASE_TOKEN"],
    urls=[
        "https://en.wikipedia.org/wiki/Large_language_model",
        "https://en.wikipedia.org/wiki/Retrieval-augmented_generation",
    ],
)

docs = loader.load()
print(docs[0].page_content[:500])
print(docs[0].metadata)  # {'source': '...', 'pc_status': 200, ...}
```
For SPA pages, just use your JavaScript token instead; the interface is the same:

```python
loader = CrawlbaseLoader(
    token=os.environ["CRAWLBASE_JS_TOKEN"],
    urls=["https://some-spa-site.com/page"],
)
```
### Agent tool

```python
import os

from langchain_crawlbase import CrawlbaseTool

tool = CrawlbaseTool(token=os.environ["CRAWLBASE_TOKEN"])

# Use directly:
markdown = tool.invoke({"url": "https://example.com"})

# Or bind to an LLM:
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-opus-4-7")
agent_llm = llm.bind_tools([tool])
```
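Note that `bind_tools` only advertises the tool's schema to the model; your code still has to execute any tool calls the model emits and feed the results back. A minimal sketch of that dispatch loop, with a hypothetical `fetch_page` stand-in in place of `tool.invoke` and a hypothetical `"crawlbase_fetch"` tool name:

```python
# Hypothetical stand-in for: tool.invoke({"url": args["url"]})
def fetch_page(args: dict) -> str:
    return f"# Markdown for {args['url']}"


TOOLS = {"crawlbase_fetch": fetch_page}  # name -> callable


def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Execute each requested call and package results as tool messages."""
    return [
        {"tool_call_id": call["id"], "content": TOOLS[call["name"]](call["args"])}
        for call in tool_calls
    ]


# Shape of `tool_calls` on a LangChain AIMessage after bind_tools:
calls = [{"id": "call_1", "name": "crawlbase_fetch", "args": {"url": "https://example.com"}}]
print(run_tool_calls(calls)[0]["content"])  # -> "# Markdown for https://example.com"
```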
### Retriever

```python
import os

from langchain_crawlbase import CrawlbaseRetriever

retriever = CrawlbaseRetriever(
    token=os.environ["CRAWLBASE_TOKEN"],
    urls=[
        "https://crawlbase.com/docs/crawling-api",
        "https://crawlbase.com/docs/crawling-api#parameters",
    ],
)

docs = retriever.invoke("how do I render JavaScript pages")
```
v0.1 uses simple substring matching. For semantic retrieval, pair `CrawlbaseLoader` with a vector store of your choice.
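For intuition, "simple substring matching" might look something like the filter below. This is an illustration only, not the package's actual implementation: it keeps documents whose text contains any word of the query, case-insensitively.

```python
# Illustration of substring-style retrieval (not the package's actual code):
# keep any document containing at least one query word, case-insensitively.
def substring_filter(query: str, docs: list[str]) -> list[str]:
    words = [w.lower() for w in query.split()]
    return [d for d in docs if any(w in d.lower() for w in words)]


docs = [
    "Use the JavaScript token to render JS pages.",
    "Billing and account settings.",
]
print(substring_filter("render JavaScript pages", docs))  # keeps only the first doc
```

The obvious limitation, and the reason the README suggests a vector store for semantic retrieval: a query phrased with different words ("show dynamic content") would match nothing here.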
## Extra Crawlbase parameters

Pass any Crawlbase API parameter via `extra_params`:

```python
loader = CrawlbaseLoader(
    token=token,
    urls=["https://example.com"],
    extra_params={"country": "US", "device": "mobile"},
)
```
## Development

```shell
pip install -e ".[dev]"
pytest tests/unit
ruff check .
```

Integration tests are gated on `CRAWLBASE_TOKEN`:

```shell
CRAWLBASE_TOKEN=xxx pytest tests/integration
```
## License

MIT. © Crawlbase Team. Contact: support@crawlbase.com