# langchain-crw

LangChain document loader for CRW — a high-performance, Firecrawl-compatible web scraper written in Rust.
## Installation

```bash
pip install langchain-crw
# or
uv add langchain-crw
```

That's it. No server to install, no `cargo install`, no Docker. The `crw` SDK automatically downloads and manages the CRW binary for you.
## Quick Start — Zero Config (Subprocess Mode)

```python
from langchain_crw import CrwLoader

# Just works — crw SDK handles everything locally
loader = CrwLoader(url="https://example.com", mode="scrape")
docs = loader.load()
print(docs[0].page_content)  # clean markdown
```
## Cloud Mode (fastcrw.com)

No local binary needed. Sign up at fastcrw.com and get 500 free credits:

```python
from langchain_crw import CrwLoader

loader = CrwLoader(
    url="https://example.com",
    mode="scrape",
    api_url="https://fastcrw.com/api",
    api_key="crw_live_...",  # or set the CRW_API_KEY env var
)
docs = loader.load()
```
## Advanced: Self-hosted Server

If you prefer running a persistent CRW server (e.g., shared across services):

```bash
# Option A: Install binary
curl -fsSL https://raw.githubusercontent.com/us/crw/main/install.sh | sh
crw  # starts on http://localhost:3000

# Option B: Docker
docker run -d -p 3000:3000 ghcr.io/us/crw:latest
```

```python
loader = CrwLoader(url="https://example.com", api_url="http://localhost:3000")
```
## Usage

### Scrape a single page

```python
loader = CrwLoader(url="https://example.com", mode="scrape")
docs = loader.load()
print(docs[0].page_content)  # clean markdown
print(docs[0].metadata)      # {'title': '...', 'sourceURL': '...', 'statusCode': 200}
```
### Crawl an entire site

```python
loader = CrwLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={"max_depth": 3, "max_pages": 50},
)
docs = loader.load()
print(f"Crawled {len(docs)} pages")
```
### Discover URLs (map mode)

```python
loader = CrwLoader(url="https://example.com", mode="map")
urls = [doc.page_content for doc in loader.load()]
```
### Search the web (Cloud Only)

Search is cloud-only: it requires a fastcrw.com API key or a CRW server with search configured.

```python
from langchain_crw import CrwLoader

loader = CrwLoader(
    query="web scraping tools 2026",
    mode="search",
    api_url="https://fastcrw.com/api",
    api_key="YOUR_KEY",
    params={"limit": 5},
)
docs = loader.load()
for doc in docs:
    print(doc.metadata["title"], doc.metadata["url"])
    print(doc.page_content[:200])
```
### Scrape with JS rendering

```python
loader = CrwLoader(
    url="https://spa-app.example.com",
    mode="scrape",
    params={
        "render_js": True,
        "wait_for": 3000,
        "css_selector": "article.main-content",
    },
)
docs = loader.load()
```
### RAG pipeline

```python
from langchain_crw import CrwLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Crawl docs (self-hosted or cloud — same code)
loader = CrwLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={"max_depth": 3, "max_pages": 50},
)
docs = loader.load()

# Split and embed
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Query
results = vectorstore.similarity_search("how to authenticate")
```
## Configuration

### Constructor

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | `""` | URL to scrape, crawl, or map. Not required for `search` mode |
| `api_key` | `str \| None` | `None` | Bearer token. Falls back to the `CRW_API_KEY` env var |
| `api_url` | `str \| None` | `None` | CRW server URL. Falls back to `CRW_API_URL`. If unset, uses subprocess mode (no server needed) |
| `mode` | `"scrape" \| "crawl" \| "map" \| "search"` | `"scrape"` | Operation mode |
| `query` | `str \| None` | `None` | Search query string. Required for `search` mode |
| `params` | `dict \| None` | `None` | Additional API parameters |
### Params (snake_case, auto-converted to camelCase)

| Param | Modes | Description |
|---|---|---|
| `render_js` | scrape | Enable JavaScript rendering |
| `wait_for` | scrape | Wait time in ms after page load |
| `css_selector` | scrape | CSS selector to extract |
| `only_main_content` | scrape, crawl | Extract main content only |
| `max_depth` | crawl, map | Maximum crawl depth |
| `max_pages` | crawl | Maximum pages to crawl |
| `use_sitemap` | map | Use sitemap for URL discovery |
| `poll_interval` | crawl | Poll interval in seconds (default: 2) |
| `timeout` | crawl | Crawl timeout in seconds (default: 300) |
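The snake_case-to-camelCase conversion mentioned in the heading works like the generic helper below. This is an illustrative sketch of the technique, not the library's internal code:

```python
def to_camel_case(name: str) -> str:
    # Generic snake_case -> camelCase conversion: keep the first
    # segment lowercase, capitalize the rest, and join them.
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)

params = {"render_js": True, "wait_for": 3000, "max_depth": 3}
api_params = {to_camel_case(k): v for k, v in params.items()}
print(api_params)  # {'renderJs': True, 'waitFor': 3000, 'maxDepth': 3}
```

So you can pass the snake_case names shown in the table and they reach the API in camelCase form.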
## Migrating from FireCrawlLoader

CrwLoader supports the same scrape, crawl, and map modes, plus a search mode. Note that CrwLoader defaults to `mode="scrape"` while FireCrawlLoader defaults to `mode="crawl"` — set the mode explicitly when migrating.

```python
# Before
from langchain_community.document_loaders import FireCrawlLoader
loader = FireCrawlLoader(url="https://example.com", api_key="fc-...", mode="scrape")

# After — pip install langchain-crw, zero config, no server needed
from langchain_crw import CrwLoader
loader = CrwLoader(url="https://example.com", mode="scrape")
```
## License

MIT