An integration package connecting TheCrawler and LangChain
langchain-thecrawler
This package contains the LangChain integration with TheCrawler — a web-scraping and structured-extraction API that runs the extraction LLM on its own GPU, so AI extraction is included on every page with no per-call surcharge.
Installation
pip install -U langchain-thecrawler
Set your API key (get one at miaibot.ai):
export THECRAWLER_API_KEY="mai_live_..."
Document Loader
TheCrawlerLoader loads one or more URLs as LangChain Document objects with boilerplate-stripped markdown as page_content and rich page metadata.
from langchain_thecrawler import TheCrawlerLoader

loader = TheCrawlerLoader(
    ["https://example.com"],
    # api_key="mai_live_...",  # or set THECRAWLER_API_KEY
)
docs = loader.load()  # list[Document]

# or stream:
for doc in loader.lazy_load():
    print(doc.metadata["url"], len(doc.page_content))
PDF and DOCX URLs are handled server-side. A per-page failure does not raise — failed pages come back as a Document with empty page_content and metadata["status"] == "error" plus a structured error_type, so you can branch on it:
ok = [d for d in docs if d.metadata.get("status") != "error"]
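The branch above can be extended to keep track of why pages failed. The following is a minimal sketch, not part of the package: `FakeDoc` is a hypothetical stand-in with the same `page_content` and `metadata` attributes as a LangChain Document, `split_by_status` is an illustrative helper name, and it assumes `error_type` is stored in `metadata` next to `status`. The `"not_found"` value is made up for the demo.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_by_status(docs):
    """Return (ok_docs, failures); failures maps error_type -> list of URLs."""
    ok = []
    failures = defaultdict(list)
    for doc in docs:
        if doc.metadata.get("status") == "error":
            # Assumption: error_type sits in metadata alongside status.
            kind = doc.metadata.get("error_type", "unknown")
            failures[kind].append(doc.metadata.get("url"))
        else:
            ok.append(doc)
    return ok, dict(failures)


# Demo data; the error_type value here is invented for illustration.
docs = [
    FakeDoc("# Hello", {"url": "https://example.com"}),
    FakeDoc("", {"url": "https://example.com/missing",
                 "status": "error", "error_type": "not_found"}),
]
ok, failures = split_by_status(docs)
print(len(ok), failures)
```

With real output from `loader.load()`, the same function should work unchanged, since it only reads `metadata`.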
Options
| Arg | Description |
|---|---|
| urls | A URL string or list of URLs (required) |
| api_key | TheCrawler key; falls back to THECRAWLER_API_KEY |
| api_url | API base URL (default https://www.miaibot.ai/api/v1) |
| params | Extra options merged into the crawl request (e.g. {"usePlaywright": True}) |
| timeout | Per-request HTTP timeout in seconds (default 120) |
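As a rough illustration of how these options might be combined, the sketch below builds the keyword arguments separately and only imports the package when a loader is actually created. The helper names `loader_kwargs` and `make_loader` are hypothetical, and passing `api_url`, `params`, and `timeout` as keyword arguments is an assumption based on the argument names in the table.

```python
import os


def loader_kwargs(use_playwright=False, timeout=120):
    """Build keyword arguments named after the entries in the options table."""
    kwargs = {
        "api_url": "https://www.miaibot.ai/api/v1",  # documented default
        "timeout": timeout,                          # seconds per HTTP request
    }
    api_key = os.environ.get("THECRAWLER_API_KEY")
    if api_key:
        # Optional: the loader falls back to the environment variable anyway.
        kwargs["api_key"] = api_key
    if use_playwright:
        # Extra options are merged into the crawl request.
        kwargs["params"] = {"usePlaywright": True}
    return kwargs


def make_loader(urls, **options):
    """Create a loader; imported lazily so this file loads without the package."""
    from langchain_thecrawler import TheCrawlerLoader

    # Assumption: urls is positional, the remaining options are keywords.
    return TheCrawlerLoader(urls, **loader_kwargs(**options))


print(sorted(loader_kwargs(use_playwright=True, timeout=30)))
```

Calling `make_loader(["https://example.com"], use_playwright=True, timeout=30)` would then return a loader configured with a shorter timeout, assuming the package is installed and a key is available.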
License
MIT
File details
Details for the file langchain_thecrawler-0.1.0.tar.gz.
File metadata
- Download URL: langchain_thecrawler-0.1.0.tar.gz
- Upload date:
- Size: 4.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.4.1 CPython/3.13.6 Windows/11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4eb05e3473d1e435140b90ab804d5e574349ab92fb63cbc7e87f5aac9287000f |
| MD5 | 2a16834834778ce7896ce406f63f1ac4 |
| BLAKE2b-256 | c10c57bc58c848484dd376f9a5a0a4461db85a52e3b91c460791271cf07e17b3 |
File details
Details for the file langchain_thecrawler-0.1.0-py3-none-any.whl.
File metadata
- Download URL: langchain_thecrawler-0.1.0-py3-none-any.whl
- Upload date:
- Size: 5.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.4.1 CPython/3.13.6 Windows/11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e6a05eb6fa65fecf8b1303e46c7f87c41974ece0c1eb3abf5a16eb3e6e95255c |
| MD5 | 01ad723a452817ffa894f4c7b80e9fbf |
| BLAKE2b-256 | 1f5ec1c5b03da16f8bb74879aca84ad5b30eaf98aa83fb43dbed9690b523f035 |