
LangChain integration for CRW — high-performance web scraping document loader

Project description

langchain-crw


LangChain document loader for CRW — a high-performance, Firecrawl-compatible web scraper written in Rust.

Installation

pip install langchain-crw
# or
uv add langchain-crw

That's it. No server to install, no cargo install, no Docker. The crw SDK automatically downloads and manages the CRW binary for you.

Quick Start — Zero Config (Subprocess Mode)

from langchain_crw import CrwLoader

# Just works — crw SDK handles everything locally
loader = CrwLoader(url="https://example.com", mode="scrape")
docs = loader.load()
print(docs[0].page_content)  # clean markdown

Cloud Mode (fastcrw.com)

No local binary needed. Sign up at fastcrw.com and get 500 free credits:

from langchain_crw import CrwLoader

loader = CrwLoader(
    url="https://example.com",
    mode="scrape",
    api_url="https://fastcrw.com/api",
    api_key="crw_live_...",  # or set CRW_API_KEY env var
)
docs = loader.load()

Advanced: Self-hosted Server

If you prefer running a persistent CRW server (e.g., shared across services):

# Option A: Install binary
curl -fsSL https://raw.githubusercontent.com/us/crw/main/install.sh | bash
crw  # starts on http://localhost:3000

# Option B: Docker
docker run -d -p 3000:3000 ghcr.io/us/crw:latest
Then point the loader at your server:

loader = CrwLoader(url="https://example.com", api_url="http://localhost:3000")

Usage

Scrape a single page

loader = CrwLoader(url="https://example.com", mode="scrape")
docs = loader.load()

print(docs[0].page_content)    # clean markdown
print(docs[0].metadata)        # {'title': '...', 'sourceURL': '...', 'statusCode': 200}

Crawl an entire site

loader = CrwLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={"max_depth": 3, "max_pages": 50},
)
docs = loader.load()
print(f"Crawled {len(docs)} pages")

Discover URLs (map mode)

loader = CrwLoader(url="https://example.com", mode="map")
urls = [doc.page_content for doc in loader.load()]

Search the web

Search is not available in subprocess mode: it requires either a fastcrw.com API key or a CRW server with SearXNG configured.

from langchain_crw import CrwLoader

loader = CrwLoader(
    query="web scraping tools 2026",
    mode="search",
    api_url="https://fastcrw.com/api",
    api_key="YOUR_KEY",
    params={"limit": 5},
)
docs = loader.load()

for doc in docs:
    print(doc.metadata["title"], doc.metadata["url"])
    print(doc.page_content[:200])

Scrape with JS rendering

loader = CrwLoader(
    url="https://spa-app.example.com",
    mode="scrape",
    params={
        "render_js": True,
        "wait_for": 3000,
        "css_selector": "article.main-content",
    },
)
docs = loader.load()

RAG pipeline

from langchain_crw import CrwLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Crawl docs (self-hosted or cloud — same code)
loader = CrwLoader(url="https://docs.example.com", mode="crawl", params={"max_depth": 3, "max_pages": 50})
docs = loader.load()

# Split and embed
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Query
results = vectorstore.similarity_search("how to authenticate")

Configuration

Constructor

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | str | "" | URL to scrape, crawl, or map. Not required for search mode |
| api_key | str \| None | None | Bearer token. Falls back to the CRW_API_KEY env var |
| api_url | str \| None | None | CRW server URL. Falls back to CRW_API_URL. If unset, uses subprocess mode (no server needed) |
| mode | "scrape" \| "crawl" \| "map" \| "search" | "scrape" | Operation mode |
| query | str \| None | None | Search query string. Required for search mode |
| params | dict \| None | None | Additional API parameters |
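As the fallbacks above suggest, the API key and server URL can also come from the environment rather than the constructor. A minimal sketch (the values are placeholders):

import os
from langchain_crw import CrwLoader

# With api_key / api_url omitted, the loader falls back to these env vars
os.environ["CRW_API_KEY"] = "crw_live_..."
os.environ["CRW_API_URL"] = "https://fastcrw.com/api"

loader = CrwLoader(url="https://example.com", mode="scrape")
docs = loader.load()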

Params (snake_case, auto-converted to camelCase)

| Param | Modes | Description |
| --- | --- | --- |
| render_js | scrape | Enable JavaScript rendering |
| wait_for | scrape | Wait time in ms after page load |
| css_selector | scrape | CSS selector to extract |
| only_main_content | scrape, crawl | Extract main content only |
| max_depth | crawl, map | Maximum crawl depth |
| max_pages | crawl | Maximum pages to crawl |
| use_sitemap | map | Use sitemap for URL discovery |
| poll_interval | crawl | Poll interval in seconds (default: 2) |
| timeout | crawl | Crawl timeout in seconds (default: 300) |
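The key conversion is mechanical: snake_case keys in params are sent as camelCase. A rough illustration of that mapping (snake_to_camel is a hypothetical helper for this example, not part of the package API):

def snake_to_camel(name: str) -> str:
    first, *rest = name.split("_")
    return first + "".join(part.capitalize() for part in rest)

params = {"render_js": True, "css_selector": "article.main-content"}
print({snake_to_camel(k): v for k, v in params.items()})
# {'renderJs': True, 'cssSelector': 'article.main-content'}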

Migrating from FireCrawlLoader

CrwLoader supports the same scrape, crawl, and map modes, plus a search mode. Note that CrwLoader defaults to mode="scrape" while FireCrawlLoader defaults to mode="crawl" — set the mode explicitly when migrating.

# Before
from langchain_community.document_loaders import FireCrawlLoader
loader = FireCrawlLoader(url="https://example.com", api_key="fc-...", mode="scrape")

# After — pip install langchain-crw, zero config, no server needed
from langchain_crw import CrwLoader
loader = CrwLoader(url="https://example.com", mode="scrape")
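Because the defaults differ, a crawl that relied on FireCrawlLoader's implicit mode needs the mode spelled out. A sketch assuming otherwise equivalent parameters:

# Before: FireCrawlLoader crawls by default
loader = FireCrawlLoader(url="https://docs.example.com", api_key="fc-...")

# After: CrwLoader defaults to scrape, so set the mode explicitly
loader = CrwLoader(url="https://docs.example.com", mode="crawl")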

License

MIT

Download files

Source Distribution

langchain_crw-0.3.0.tar.gz (69.5 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 (Ubuntu 24.04, CI)

Hashes for langchain_crw-0.3.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | aaf0be6787849fc078b2df92555738be3db006839f53f52542cad70436b9fd7c |
| MD5 | b2d9496f6fb60c100f06b40435603b08 |
| BLAKE2b-256 | d05343ca06dc082ce18f7c16eb7d2143df725c4a7fb61458e8d208b6ba25e99c |

Built Distribution

langchain_crw-0.3.0-py3-none-any.whl (7.3 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 (Ubuntu 24.04, CI)

Hashes for langchain_crw-0.3.0-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | da437d14f98c694ee237fcf7ad9102ccaeffcd5ceb793599e0e35bdd4b03f81b |
| MD5 | 95faf31687a826d9285879c4849dc513 |
| BLAKE2b-256 | 9a80655c319d0b55c2fe87c32715871ae758bea723ceb5144dc055262da8bdf8 |
