
langchain-crw

LangChain document loader for CRW — a high-performance, Firecrawl-compatible web scraper written in Rust.

Installation

pip install langchain-crw
# or
uv add langchain-crw

That's it. No server to install, no cargo install, no Docker. The crw SDK automatically downloads and manages the CRW binary for you.

Quick Start — Zero Config (Subprocess Mode)

from langchain_crw import CrwLoader

# Just works — crw SDK handles everything locally
loader = CrwLoader(url="https://example.com", mode="scrape")
docs = loader.load()
print(docs[0].page_content)  # clean markdown

Cloud Mode (fastcrw.com)

No local binary needed. Sign up at fastcrw.com and get 500 free credits:

from langchain_crw import CrwLoader

loader = CrwLoader(
    url="https://example.com",
    mode="scrape",
    api_url="https://fastcrw.com/api",
    api_key="crw_live_...",  # or set CRW_API_KEY env var
)
docs = loader.load()

Advanced: Self-hosted Server

If you prefer running a persistent CRW server (e.g., shared across services):

# Option A: Install binary
curl -fsSL https://raw.githubusercontent.com/us/crw/main/install.sh | sh
crw  # starts on http://localhost:3000

# Option B: Docker
docker run -d -p 3000:3000 ghcr.io/us/crw:latest

Then point the loader at the server:

loader = CrwLoader(url="https://example.com", api_url="http://localhost:3000")

Usage

Scrape a single page

loader = CrwLoader(url="https://example.com", mode="scrape")
docs = loader.load()

print(docs[0].page_content)    # clean markdown
print(docs[0].metadata)        # {'title': '...', 'sourceURL': '...', 'statusCode': 200}

Crawl an entire site

loader = CrwLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={"max_depth": 3, "max_pages": 50},
)
docs = loader.load()
print(f"Crawled {len(docs)} pages")
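Crawl jobs run asynchronously, and the loader polls until the crawl finishes or times out (the `poll_interval` and `timeout` params in the table below control this). As a rough sketch of how those two knobs interact — a generic polling loop, not the loader's actual implementation:

```python
import time

def poll_until_done(check, poll_interval=2.0, timeout=300.0):
    """Call `check()` every `poll_interval` seconds until it returns a
    non-None result; give up after `timeout` seconds total."""
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"crawl did not finish within {timeout}s")
        time.sleep(poll_interval)
```

A shorter `poll_interval` surfaces results faster at the cost of more status requests; `timeout` bounds how long a large crawl can block `load()`.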

Discover URLs (map mode)

loader = CrwLoader(url="https://example.com", mode="map")
urls = [doc.page_content for doc in loader.load()]

Search the web (Cloud Only)

Search requires a fastcrw.com API key or a CRW server with search configured.

from langchain_crw import CrwLoader

loader = CrwLoader(
    query="web scraping tools 2026",
    mode="search",
    api_url="https://fastcrw.com/api",
    api_key="YOUR_KEY",
    params={"limit": 5},
)
docs = loader.load()

for doc in docs:
    print(doc.metadata["title"], doc.metadata["url"])
    print(doc.page_content[:200])

Scrape with JS rendering

loader = CrwLoader(
    url="https://spa-app.example.com",
    mode="scrape",
    params={
        "render_js": True,
        "wait_for": 3000,
        "css_selector": "article.main-content",
    },
)
docs = loader.load()

RAG pipeline

from langchain_crw import CrwLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Crawl docs (self-hosted or cloud — same code)
loader = CrwLoader(url="https://docs.example.com", mode="crawl", params={"max_depth": 3, "max_pages": 50})
docs = loader.load()

# Split and embed
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Query
results = vectorstore.similarity_search("how to authenticate")

Configuration

Constructor

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | `str` | `""` | URL to scrape, crawl, or map. Not required for `search` mode |
| `api_key` | `str \| None` | `None` | Bearer token. Falls back to the `CRW_API_KEY` env var |
| `api_url` | `str \| None` | `None` | CRW server URL. Falls back to `CRW_API_URL`. If unset, uses subprocess mode (no server needed) |
| `mode` | `"scrape" \| "crawl" \| "map" \| "search"` | `"scrape"` | Operation mode |
| `query` | `str \| None` | `None` | Search query string. Required for `search` mode |
| `params` | `dict \| None` | `None` | Additional API parameters |
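Because `api_key` and `api_url` fall back to environment variables, cloud or self-hosted configuration can live entirely outside the code:

```shell
# Equivalent to passing api_key/api_url in the constructor
export CRW_API_KEY="crw_live_..."
export CRW_API_URL="https://fastcrw.com/api"
```

With these set, `CrwLoader(url="https://example.com", mode="scrape")` behaves the same as the cloud-mode example above.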

Params (snake_case, auto-converted to camelCase)

| Param | Modes | Description |
|-------|-------|-------------|
| `render_js` | scrape | Enable JavaScript rendering |
| `wait_for` | scrape | Wait time in ms after page load |
| `css_selector` | scrape | CSS selector to extract |
| `only_main_content` | scrape, crawl | Extract main content only |
| `max_depth` | crawl, map | Maximum crawl depth |
| `max_pages` | crawl | Maximum pages to crawl |
| `use_sitemap` | map | Use sitemap for URL discovery |
| `poll_interval` | crawl | Poll interval in seconds (default: 2) |
| `timeout` | crawl | Crawl timeout in seconds (default: 300) |
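Params are written in Python-style snake_case and converted to the camelCase the API expects, so `max_depth` becomes `maxDepth` on the wire. A minimal sketch of that conversion (illustrative only, not the SDK's actual code):

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case key to camelCase: 'max_depth' -> 'maxDepth'."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)

def camelize_params(params: dict) -> dict:
    """Rewrite all top-level keys of a params dict to camelCase."""
    return {snake_to_camel(k): v for k, v in params.items()}
```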

Migrating from FireCrawlLoader

CrwLoader supports the same scrape, crawl, and map modes, plus a search mode. Note that CrwLoader defaults to mode="scrape" while FireCrawlLoader defaults to mode="crawl" — set the mode explicitly when migrating.

# Before
from langchain_community.document_loaders import FireCrawlLoader
loader = FireCrawlLoader(url="https://example.com", api_key="fc-...", mode="scrape")

# After — pip install langchain-crw, zero config, no server needed
from langchain_crw import CrwLoader
loader = CrwLoader(url="https://example.com", mode="scrape")

License

MIT
