LangChain integration for CRW — high-performance web scraping document loader
langchain-crw
LangChain document loader for CRW — a high-performance, Firecrawl-compatible web scraper written in Rust.
Installation
pip install langchain-crw
# or
uv add langchain-crw
You also need a CRW backend:
# Self-hosted (free)
curl -fsSL https://raw.githubusercontent.com/us/crw/main/install.sh | bash
crw # starts on http://localhost:3000
# Or use fastCRW cloud: https://fastcrw.com
Quick Start
Scrape a single page
from langchain_crw import CrwLoader
loader = CrwLoader(url="https://example.com", mode="scrape")
docs = loader.load()
print(docs[0].page_content) # clean markdown
print(docs[0].metadata) # title, sourceURL, statusCode
Crawl an entire site
loader = CrwLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={"max_depth": 3, "max_pages": 50},
)
docs = loader.load()
print(f"Crawled {len(docs)} pages")
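Crawl mode accepts `poll_interval` and `timeout` params (see Configuration), which suggests the loader starts a crawl job and polls it until it finishes. The sketch below illustrates that general pattern only. `poll_until_done`, `fetch_status`, and the `"status"` values are hypothetical names chosen for illustration; the actual CRW response shape and the loader's internals may differ.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    poll_interval: float = 2.0,
    timeout: float = 300.0,
) -> dict:
    """Call fetch_status() until it reports completion or the timeout elapses.

    fetch_status is a hypothetical callable returning a dict with a
    "status" key; this is an assumed shape, not the documented CRW API.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job.get("status") == "completed":
            return job
        if job.get("status") == "failed":
            raise RuntimeError("crawl job failed")
        # Give up if the next sleep would carry us past the deadline.
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"crawl did not finish within {timeout} seconds")
        time.sleep(poll_interval)
```

A shorter `poll_interval` returns results sooner at the cost of more requests to the backend, while `timeout` bounds how long a slow or stuck crawl can block `load()`.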
Discover URLs (map mode)
loader = CrwLoader(url="https://example.com", mode="map")
urls = [doc.page_content for doc in loader.load()]
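The map example treats each document's `page_content` as one discovered URL, so the resulting list often needs filtering before the URLs are passed back to scrape mode. A minimal sketch using only the standard library follows; `filter_urls` and the `/docs/` prefix are hypothetical and not part of `langchain-crw`.

```python
from urllib.parse import urlsplit


def filter_urls(urls: list[str], host: str, path_prefix: str = "/") -> list[str]:
    """Keep unique URLs on the given host whose path starts with path_prefix."""
    seen: set[str] = set()
    kept: list[str] = []
    for url in urls:
        parts = urlsplit(url)
        path = parts.path or "/"  # treat a bare host as the root path
        if parts.hostname != host or not path.startswith(path_prefix):
            continue
        # Drop the fragment so "#section" variants count as the same page.
        normalized = parts._replace(fragment="").geturl()
        if normalized not in seen:
            seen.add(normalized)
            kept.append(normalized)
    return kept
```

For example, `filter_urls(urls, "example.com", "/docs/")` keeps only documentation pages on `example.com` and preserves the order in which they were discovered.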
Cloud mode (fastCRW)
loader = CrwLoader(
    url="https://example.com",
    api_key="your-key",  # or set CRW_API_KEY env var
    api_url="https://fastcrw.com/api",  # or set CRW_API_URL env var
)
docs = loader.load()
RAG pipeline
from langchain_crw import CrwLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Crawl docs
loader = CrwLoader(url="https://docs.example.com", mode="crawl", params={"max_depth": 3, "max_pages": 50})
docs = loader.load()
# Split and embed
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
# Query
results = vectorstore.similarity_search("how to authenticate")
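With `chunk_size=1000` and `chunk_overlap=200`, consecutive chunks can share up to 200 characters, so text near a boundary appears in both neighbouring chunks. The fixed-window sketch below shows only that arithmetic. It is a simplification: `RecursiveCharacterTextSplitter` splits on separators such as paragraph breaks, so its real boundaries will differ, and `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window advances by `chunk_size - chunk_overlap` characters, so a larger overlap produces more chunks and therefore more embeddings for the same text.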
Configuration
Constructor
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | required | URL to scrape, crawl, or map |
| `api_key` | `str \| None` | `None` | Bearer token. Falls back to the `CRW_API_KEY` env var |
| `api_url` | `str \| None` | `None` | CRW server URL. Falls back to `CRW_API_URL`, then `http://localhost:3000` |
| `mode` | `"scrape" \| "crawl" \| "map"` | `"scrape"` | Operation mode |
| `params` | `dict \| None` | `None` | Additional API parameters |
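The fallback order described for `api_key` and `api_url` (explicit argument, then environment variable, then the local default) can be sketched as a small helper. `resolve_connection` is a hypothetical name used for illustration; the loader's internals may be organized differently.

```python
import os

DEFAULT_API_URL = "http://localhost:3000"


def resolve_connection(api_key=None, api_url=None, env=None):
    """Return (api_key, api_url): explicit argument first, then env var, then default."""
    env = os.environ if env is None else env
    key = api_key or env.get("CRW_API_KEY")
    url = api_url or env.get("CRW_API_URL") or DEFAULT_API_URL
    return key, url


# With nothing set, the self-hosted default applies and no key is sent.
print(resolve_connection(env={}))
# (None, 'http://localhost:3000')
```

Passing `env` explicitly is only a convenience for testing the sketch; in normal use the values would come from `os.environ`.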
Params (snake_case, auto-converted to camelCase)
| Param | Modes | Description |
|---|---|---|
| `render_js` | scrape | Enable JavaScript rendering |
| `wait_for` | scrape | Wait time in ms after page load |
| `css_selector` | scrape | CSS selector to extract |
| `only_main_content` | scrape, crawl | Extract main content only |
| `max_depth` | crawl, map | Maximum crawl depth |
| `max_pages` | crawl | Maximum pages to crawl |
| `use_sitemap` | map | Use sitemap for URL discovery |
| `poll_interval` | crawl | Poll interval in seconds (default: 2) |
| `timeout` | crawl | Crawl timeout in seconds (default: 300) |
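The snake_case to camelCase conversion applied to `params` can be pictured with a short sketch. `to_camel` and `convert_params` are hypothetical helpers, not the package's API, and the exact casing the loader produces for a given key is an assumption here.

```python
def to_camel(name: str) -> str:
    """Convert a snake_case name to camelCase, e.g. max_depth -> maxDepth."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def convert_params(params: dict) -> dict:
    """Return a copy of params with every top-level key converted to camelCase."""
    return {to_camel(key): value for key, value in params.items()}


print(convert_params({"max_depth": 3, "only_main_content": True}))
# {'maxDepth': 3, 'onlyMainContent': True}
```

Keeping the Python-facing names in snake_case lets call sites follow PEP 8 while the HTTP payload matches the camelCase convention of a Firecrawl-compatible API.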
Migrating from FireCrawlLoader
# Before
from langchain_community.document_loaders import FireCrawlLoader
loader = FireCrawlLoader(url="https://example.com", api_key="fc-...", mode="scrape")
# After — same interface, self-hosted, no SDK needed
from langchain_crw import CrwLoader
loader = CrwLoader(url="https://example.com", mode="scrape")
License
MIT