An integration package connecting TheCrawler and LangChain

langchain-thecrawler

This package contains the LangChain integration with TheCrawler, a web-scraping and structured-extraction API. TheCrawler runs the extraction LLM on its own GPU, so AI extraction is included on every page with no per-call surcharge.

Installation

pip install -U langchain-thecrawler

Set your API key (get one at miaibot.ai):

export THECRAWLER_API_KEY="mai_live_..."

Document Loader

TheCrawlerLoader loads one or more URLs as LangChain Document objects with boilerplate-stripped markdown as page_content and rich page metadata.

from langchain_thecrawler import TheCrawlerLoader

loader = TheCrawlerLoader(
    ["https://example.com"],
    # api_key="mai_live_...",  # or set THECRAWLER_API_KEY
)

docs = loader.load()        # list[Document]
# or stream:
for doc in loader.lazy_load():
    print(doc.metadata["url"], len(doc.page_content))

PDF and DOCX URLs are handled server-side. A per-page failure does not raise an exception. Instead, each failed page comes back as a Document with empty page_content and metadata["status"] == "error" plus a structured error_type, so you can branch on it:

ok = [d for d in docs if d.metadata.get("status") != "error"]
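The filter above keeps only the successful pages. If you also want to see what failed, a minimal sketch along these lines may help. It assumes that error_type and url are stored as metadata keys (the text above does not spell out where error_type lives). The helper name split_by_status and the "not_found" value are hypothetical, and the sample objects are stand-ins shaped like Documents so the snippet runs without the package or an API key.

```python
from collections import defaultdict
from types import SimpleNamespace


def split_by_status(docs):
    """Separate loaded documents from failed ones.

    Returns (ok, failed_by_type), where failed_by_type maps an error type
    to the list of URLs that failed with it.
    """
    ok = []
    failed_by_type = defaultdict(list)
    for doc in docs:
        meta = doc.metadata
        if meta.get("status") == "error":
            # Assumption: error_type and url are plain metadata keys.
            error_type = meta.get("error_type", "unknown")
            failed_by_type[error_type].append(meta.get("url"))
        else:
            ok.append(doc)
    return ok, dict(failed_by_type)


# Stand-in objects shaped like LangChain Documents, for illustration only.
sample_docs = [
    SimpleNamespace(
        page_content="# Home",
        metadata={"url": "https://example.com"},
    ),
    SimpleNamespace(
        page_content="",
        metadata={
            "url": "https://example.com/missing",
            "status": "error",
            "error_type": "not_found",  # hypothetical value
        },
    ),
]

ok, failed = split_by_status(sample_docs)
print(len(ok), failed)
```

In real use you would pass the result of loader.load() instead of sample_docs, then retry or log the URLs grouped under each error type.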

Options

| Arg | Description |
| --- | --- |
| urls | A URL string or list of URLs (required) |
| api_key | TheCrawler key; falls back to THECRAWLER_API_KEY |
| api_url | API base URL (default https://www.miaibot.ai/api/v1) |
| params | Extra options merged into the crawl request (e.g. {"usePlaywright": True}) |
| timeout | Per-request HTTP timeout in seconds (default 120) |
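As a rough illustration of how these options combine, the sketch below gathers them into keyword arguments before constructing a loader. It assumes the argument names in the table are accepted as keyword arguments, as api_key is in the earlier example. The helpers loader_options and build_loader are hypothetical and not part of the package, and the import is deferred so the option-building part runs even where langchain-thecrawler is not installed.

```python
import os


def loader_options(use_playwright=False, timeout=120, api_key=None):
    """Collect keyword arguments for TheCrawlerLoader from the documented options."""
    options = {"timeout": timeout}
    if use_playwright:
        # "usePlaywright" is the example params key given in the options table.
        options["params"] = {"usePlaywright": True}
    # An explicit key wins; otherwise fall back to the environment variable.
    key = api_key or os.environ.get("THECRAWLER_API_KEY")
    if key:
        options["api_key"] = key
    return options


def build_loader(urls, **kwargs):
    """Create a loader; importing here keeps the module loadable without the package."""
    from langchain_thecrawler import TheCrawlerLoader

    # Assumption: params, timeout, and api_key are accepted as keyword arguments.
    return TheCrawlerLoader(urls, **loader_options(**kwargs))


opts = loader_options(use_playwright=True, timeout=60, api_key="mai_live_...")
print(opts)
```

With the package installed and a valid key, build_loader(["https://example.com"], use_playwright=True).load() would then return the documents as described under Document Loader.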

License

MIT
