LangChain integration for Spidra — AI-native web scraping for LLM workflows
langchain-spidra
Spidra is not a traditional scraper that returns raw HTML. It uses AI to extract exactly the data you describe in natural language — returning clean, structured, LLM-ready content. This package brings that capability directly into LangChain pipelines.
Installation
pip install langchain-spidra
Get your Spidra API key at app.spidra.io, then:
export SPIDRA_API_KEY="spd_your_key_here"
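A missing key may otherwise only surface at the first request. The following is a minimal sketch of an early check, assuming only that the key lives in the `SPIDRA_API_KEY` environment variable as shown above. `require_spidra_api_key` is a hypothetical helper for illustration, not part of langchain-spidra.

```python
import os


def require_spidra_api_key(env_var: str = "SPIDRA_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message.

    Hypothetical helper for illustration; not part of langchain-spidra.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it before creating a loader or tool."
        )
    return key
```

The returned value could also be passed explicitly through the `api_key` parameter, for example `SpidraLoader(url=..., api_key=require_spidra_api_key())`.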
Quick Start
from langchain_spidra import SpidraLoader

loader = SpidraLoader(
    url="https://example.com",
    prompt="Extract the main features and pricing",
    output="markdown",
)
docs = loader.load()
print(docs[0].page_content)
Components
SpidraLoader — Document Loader
A LangChain BaseLoader that supports three scraping modes:
| Mode | Description | Use case |
|---|---|---|
| `scrape` | AI scrape a single URL | Single page Q&A, summarisation |
| `batch` | Scrape multiple URLs in parallel | Compare pages, bulk extraction |
| `crawl` | AI-guided crawl of an entire site | RAG, site-wide analysis |
Scrape mode (default)
from langchain_spidra import SpidraLoader

loader = SpidraLoader(
    url="https://spidra.io/pricing",
    prompt="Extract all pricing plans with their names, prices, and features",
    output="json",  # json | markdown | text | table
)
docs = loader.load()
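With `output="json"`, downstream code will usually want Python objects rather than a string. The following is a minimal sketch, assuming `page_content` holds the extracted data as a JSON string (the exact payload shape is not specified here). `parse_json_content` is a hypothetical helper, `FakeDoc` stands in for a LangChain `Document` so the snippet runs offline, and the sample payload is made up.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeDoc:
    """Stand-in for langchain_core.documents.Document (illustration only)."""
    page_content: str


def parse_json_content(doc: Any) -> Any:
    """Decode doc.page_content as JSON; fall back to the raw string on failure."""
    try:
        return json.loads(doc.page_content)
    except (TypeError, ValueError):
        # Assumption: non-JSON content should be passed through unchanged.
        return doc.page_content


# Made-up sample payload, not real Spidra output.
data = parse_json_content(FakeDoc('{"plans": [{"name": "Example", "price": 0}]}'))
print(data["plans"][0]["name"])
```

With a real loader this would be called as `parse_json_content(docs[0])`.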
Batch mode
loader = SpidraLoader(
    urls=[
        "https://spidra.io",
        "https://spidra.io/blog",
        "https://competitor.com",
    ],
    mode="batch",
    prompt="Extract the main headline and product description",
)
docs = loader.load()  # one Document per URL
Crawl mode
loader = SpidraLoader(
    url="https://spidra.io/blog",
    mode="crawl",
    crawl_instruction="Find all blog posts from 2024 and 2025",
    transform_instruction="Extract the title, publication date, and summary",
    max_pages=20,
)
docs = loader.load()  # one Document per crawled page
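The three modes take different arguments: `url` for scrape and crawl, `urls` for batch. The following is a minimal sketch of a pre-flight check along those lines. `build_loader_kwargs` is a hypothetical helper, and its validation rules are assumptions drawn from the examples above rather than from the library's source.

```python
from typing import Any, Dict, List, Optional

VALID_MODES = ("scrape", "batch", "crawl")


def build_loader_kwargs(
    mode: str = "scrape",
    url: Optional[str] = None,
    urls: Optional[List[str]] = None,
    **extra: Any,
) -> Dict[str, Any]:
    """Validate mode-specific arguments and return kwargs for SpidraLoader."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    if mode == "batch":
        if not urls:
            raise ValueError("batch mode needs a non-empty 'urls' list")
        return {"mode": mode, "urls": list(urls), **extra}
    # scrape and crawl both start from a single URL
    if not url:
        raise ValueError(f"{mode} mode needs 'url'")
    return {"mode": mode, "url": url, **extra}


kwargs = build_loader_kwargs(
    mode="crawl",
    url="https://spidra.io/blog",
    crawl_instruction="Find all blog posts from 2024 and 2025",
    max_pages=20,
)
print(kwargs)
```

The resulting dict can then be unpacked into the loader, as in `SpidraLoader(**kwargs)`.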
Async support
All loaders support async via aload() and alazy_load():
docs = await loader.aload()

async for doc in loader.alazy_load():
    process(doc)
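`await` and `async for` are only valid inside a coroutine, so a standalone script needs an entry point such as `asyncio.run`. The following is a minimal sketch of that wrapper. `StubLoader` is a stand-in that mimics the `aload()` and `alazy_load()` method names so the example runs without an API key; its return values are placeholders, not real Spidra output.

```python
import asyncio
from typing import AsyncIterator, List


class StubLoader:
    """Stand-in exposing the same async method names as SpidraLoader."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = pages

    async def alazy_load(self) -> AsyncIterator[str]:
        for page in self._pages:
            await asyncio.sleep(0)  # yield control, as a real network call would
            yield page

    async def aload(self) -> List[str]:
        return [page async for page in self.alazy_load()]


async def main() -> List[str]:
    loader = StubLoader(["page one", "page two"])
    seen = []
    async for doc in loader.alazy_load():  # stream results one at a time
        seen.append(doc)
    everything = await loader.aload()  # or collect them all at once
    assert seen == everything
    return everything


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Swapping `StubLoader(...)` for a configured `SpidraLoader` should leave the body of `main()` unchanged.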
Full parameter reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | - | URL to scrape (scrape/crawl modes) |
| `urls` | `List[str]` | - | URLs to scrape (batch mode) |
| `api_key` | `str` | `SPIDRA_API_KEY` env | Spidra API key |
| `mode` | `str` | `"scrape"` | `"scrape"`, `"batch"`, or `"crawl"` |
| `prompt` | `str` | `"Extract the main content..."` | What data to extract |
| `output` | `str` | `"markdown"` | `"json"`, `"markdown"`, `"text"`, `"table"` |
| `crawl_instruction` | `str` | - | Which pages to discover (crawl mode) |
| `transform_instruction` | `str` | - | What to extract per page (crawl mode) |
| `max_pages` | `int` | - | Max pages to crawl (crawl mode) |
| `use_proxy` | `bool` | - | Route through residential proxy |
| `proxy_country` | `str` | - | ISO country code for geo-targeted proxy |
| `extract_content_only` | `bool` | - | Strip nav/footer boilerplate |
| `cookies` | `str` | - | Raw cookie header string |
| `poll_options` | `PollOptions` | - | Custom polling timeout/interval |
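The `cookies` parameter takes a raw cookie header string. The following is a minimal sketch of building one from a dict, using the standard `name=value; name=value` Cookie header layout. `to_cookie_header` is a hypothetical helper and the cookie names are placeholders.

```python
from typing import Mapping


def to_cookie_header(cookies: Mapping[str, str]) -> str:
    """Join cookie pairs into a single 'name=value; name=value' string."""
    for name, value in cookies.items():
        if ";" in name or ";" in value or "=" in name:
            raise ValueError(f"cookie {name!r} contains a reserved character")
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


header = to_cookie_header({"session_id": "abc123", "locale": "en-GB"})
print(header)  # session_id=abc123; locale=en-GB
```

The string could then be passed as `SpidraLoader(url=..., cookies=header)`.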
SpidraScrapeTool — LangChain Tool
Use Spidra as a tool in agent workflows. The agent decides when and what to scrape.
from langchain_spidra import SpidraScrapeTool
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, ToolMessage

tool = SpidraScrapeTool()  # reads SPIDRA_API_KEY from env

# Bind to a model
model = ChatOpenAI(model="gpt-4o-mini").bind_tools([tool])
messages = [HumanMessage(
    content="What does Spidra cost? Check https://spidra.io/pricing"
)]
response = model.invoke(messages)

# Execute tool calls and get the final answer
if response.tool_calls:
    messages.append(response)
    for tc in response.tool_calls:
        result = tool.invoke(tc["args"])
        messages.append(ToolMessage(content=result, tool_call_id=tc["id"]))
    final = model.invoke(messages)
    print(final.content)
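The loop above sends every tool call to the same tool, which is fine while only one tool is bound. Once several tools are bound, each call needs routing by name. The following is a minimal sketch, assuming tool calls are dicts with `name`, `args`, and `id` keys (the shape LangChain exposes on `response.tool_calls`). `run_tool_calls` and `EchoTool` are hypothetical, and `"spidra_scrape"` is a placeholder since the tool's registered name is not stated here.

```python
from typing import Any, Dict, List, Tuple


class EchoTool:
    """Stand-in tool with the .name / .invoke(args) surface used here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def invoke(self, args: Dict[str, Any]) -> str:
        return f"{self.name} called with {sorted(args.items())}"


def run_tool_calls(
    tool_calls: List[Dict[str, Any]], tools: List[Any]
) -> List[Tuple[str, str]]:
    """Route each call to the tool with a matching name.

    Returns (tool_call_id, content) pairs, ready to wrap in ToolMessage.
    """
    by_name = {tool.name: tool for tool in tools}
    results = []
    for call in tool_calls:
        tool = by_name.get(call["name"])
        if tool is None:
            # Report unknown tools back to the model instead of crashing.
            content = f"Unknown tool: {call['name']}"
        else:
            content = str(tool.invoke(call["args"]))
        results.append((call["id"], content))
    return results


# Placeholder tool name and call id, for illustration only.
calls = [
    {"name": "spidra_scrape", "args": {"url": "https://spidra.io/pricing"}, "id": "call_1"}
]
print(run_tool_calls(calls, [EchoTool("spidra_scrape")]))
```

Each returned pair maps onto `ToolMessage(content=content, tool_call_id=tool_call_id)` in the message list.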
SpidraRetriever — LangChain Retriever
Drop-in retriever for RAG pipelines. Crawls a site with Spidra and uses your query as the AI extraction instruction.
from langchain_spidra import SpidraRetriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

retriever = SpidraRetriever(
    url="https://spidra.io/docs",
    crawl_instruction="Find all documentation pages",
    max_pages=15,
)

# Use in a RAG chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | ChatPromptTemplate.from_template(
        "Answer based on context:\n{context}\n\nQuestion: {question}"
    )
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)
answer = chain.invoke("How do I authenticate with the Spidra API?")
print(answer)
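In the chain above, the retriever's output reaches the prompt as a Python list of documents, so `{context}` would be filled with that list's string representation. A common LangChain pattern is to join the page contents first. The following is a minimal sketch. `format_docs` is a hypothetical helper, and `FakeDoc` stands in for `Document` so the snippet runs offline.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeDoc:
    """Stand-in for langchain_core.documents.Document (illustration only)."""
    page_content: str


def format_docs(docs: Iterable[FakeDoc], separator: str = "\n\n---\n\n") -> str:
    """Join non-empty page contents into one context string."""
    return separator.join(
        d.page_content.strip() for d in docs if d.page_content.strip()
    )


# Placeholder page texts, not real crawl output.
context = format_docs(
    [FakeDoc("First page text."), FakeDoc("   "), FakeDoc("Second page text.")]
)
print(context)
```

In the chain, this would slot in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, since LCEL coerces plain functions into runnables.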
Scrape + Chat in 10 lines
from langchain_spidra import SpidraLoader
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

docs = SpidraLoader(
    url="https://spidra.io",
    prompt="Extract all key product information",
).load()
answer = ChatOpenAI(model="gpt-4o-mini").invoke([
    HumanMessage(content=f"Summarise in 3 bullets:\n\n{docs[0].page_content}")
])
print(answer.content)
Examples
See the examples/ directory:
| File | What it shows |
|---|---|
| `scrape_and_chat.py` | Scrape a URL → chat with the content |
| `chains.py` | LCEL chain: scrape → extract → format |
| `tool_calling_agent.py` | Agent with SpidraScrapeTool |
| `structured_extraction.py` | Typed Pydantic output from a scraped page |
| `batch_scrape.py` | Scrape multiple URLs at once |
| `rag_pipeline.py` | Full RAG: crawl → embed → vector store → Q&A |
Development
git clone https://github.com/spidra-io/spidra-langchain
cd spidra-langchain
pip install -e ".[dev]"
pytest
License
MIT © Spidra