# langchain-crw

LangChain document loader for CRW — a high-performance, Firecrawl-compatible web scraper written in Rust.
## Installation

```bash
pip install langchain-crw
# or
uv add langchain-crw
```
You also need a CRW backend:

```bash
# Self-hosted (free)
curl -fsSL https://raw.githubusercontent.com/us/crw/main/install.sh | bash
crw  # starts on http://localhost:3000

# Or use fastCRW cloud: https://fastcrw.com
```
## Quick Start

### Scrape a single page

```python
from langchain_crw import CrwLoader

loader = CrwLoader(url="https://example.com", mode="scrape")
docs = loader.load()
print(docs[0].page_content)  # clean markdown
print(docs[0].metadata)      # title, sourceURL, statusCode
```
### Crawl an entire site

```python
loader = CrwLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={"max_depth": 3, "max_pages": 50},
)
docs = loader.load()
print(f"Crawled {len(docs)} pages")
```
### Discover URLs (map mode)

```python
loader = CrwLoader(url="https://example.com", mode="map")
urls = [doc.page_content for doc in loader.load()]
```
### Cloud mode (fastCRW)

```python
loader = CrwLoader(
    url="https://example.com",
    api_key="your-key",                 # or set CRW_API_KEY env var
    api_url="https://fastcrw.com/api",  # or set CRW_API_URL env var
)
docs = loader.load()
```
### RAG pipeline

```python
from langchain_crw import CrwLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Crawl docs
loader = CrwLoader(url="https://docs.example.com", mode="crawl", params={"max_depth": 3, "max_pages": 50})
docs = loader.load()

# Split and embed
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Query
results = vectorstore.similarity_search("how to authenticate")
```
## Configuration

### Constructor

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | required | URL to scrape, crawl, or map |
| `api_key` | `str \| None` | `None` | Bearer token. Falls back to `CRW_API_KEY` env var |
| `api_url` | `str \| None` | `None` | CRW server URL. Falls back to `CRW_API_URL`, then `http://localhost:3000` |
| `mode` | `"scrape" \| "crawl" \| "map"` | `"scrape"` | Operation mode |
| `params` | `dict \| None` | `None` | Additional API parameters |
### Params (snake_case, auto-converted to camelCase)

| Param | Modes | Description |
|---|---|---|
| `render_js` | scrape | Enable JavaScript rendering |
| `wait_for` | scrape | Wait time in ms after page load |
| `css_selector` | scrape | CSS selector to extract |
| `only_main_content` | scrape, crawl | Extract main content only |
| `max_depth` | crawl, map | Maximum crawl depth |
| `max_pages` | crawl | Maximum pages to crawl |
| `use_sitemap` | map | Use sitemap for URL discovery |
| `poll_interval` | crawl | Poll interval in seconds (default: 2) |
| `timeout` | crawl | Crawl timeout in seconds (default: 300) |
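The snake_case-to-camelCase conversion the loader applies to params can be sketched as below. This is a minimal illustration of the naming rule, not the library's actual implementation:

```python
def to_camel_case(name: str) -> str:
    """Convert a snake_case parameter name to camelCase."""
    head, *rest = name.split("_")
    return head + "".join(word.capitalize() for word in rest)

# A params dict as you would pass it to CrwLoader...
params = {"max_depth": 3, "render_js": True}

# ...and the shape it takes on the wire to the CRW API
api_params = {to_camel_case(k): v for k, v in params.items()}
# {"maxDepth": 3, "renderJs": True}
```

You always write snake_case in Python; the loader handles the translation, so the camelCase forms only appear in the raw HTTP payload.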
## Migrating from FireCrawlLoader

CrwLoader supports the same scrape, crawl, and map modes. Note that CrwLoader defaults to `mode="scrape"` while FireCrawlLoader defaults to `mode="crawl"` — set the mode explicitly when migrating.
```python
# Before
from langchain_community.document_loaders import FireCrawlLoader
loader = FireCrawlLoader(url="https://example.com", api_key="fc-...", mode="scrape")

# After — similar interface, self-hosted, no SDK needed
from langchain_crw import CrwLoader
loader = CrwLoader(url="https://example.com", mode="scrape")
```
## License

MIT