LangChain integration for Spidra — AI-native web scraping for LLM workflows

langchain-spidra

Spidra is not a traditional scraper that returns raw HTML. It uses AI to extract exactly the data you describe in natural language — returning clean, structured, LLM-ready content. This package brings that capability directly into LangChain pipelines.


Installation

pip install langchain-spidra

Get your Spidra API key at app.spidra.io, then:

export SPIDRA_API_KEY="spd_your_key_here"
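If you would rather fail early when the key is missing, a small check before constructing any component can help. The following is a minimal sketch, not part of langchain-spidra: `resolve_api_key` is a hypothetical helper that relies only on the `SPIDRA_API_KEY` variable shown above and on the `api_key` parameter listed in the parameter reference.

```python
import os
from typing import Optional


def resolve_api_key(explicit: Optional[str] = None) -> str:
    """Return an explicit key if given, else SPIDRA_API_KEY from the environment."""
    # Hypothetical helper for illustration; the package reads the env var itself.
    key = explicit or os.environ.get("SPIDRA_API_KEY", "")
    if not key.strip():
        raise RuntimeError("Set SPIDRA_API_KEY or pass api_key explicitly")
    return key.strip()
```

The returned value could then be passed as `api_key=resolve_api_key()` when creating a loader.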

Quick Start

from langchain_spidra import SpidraLoader

loader = SpidraLoader(
    url="https://example.com",
    prompt="Extract the main features and pricing",
    output="markdown",
)
docs = loader.load()
print(docs[0].page_content)

Components

SpidraLoader — Document Loader

A LangChain BaseLoader that supports three scraping modes:

| Mode | Description | Use case |
|------|-------------|----------|
| scrape | AI scrape a single URL | Single page Q&A, summarisation |
| batch | Scrape multiple URLs in parallel | Compare pages, bulk extraction |
| crawl | AI-guided crawl of an entire site | RAG, site-wide analysis |
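One way to pick a mode programmatically is to derive it from the shape of the input. The sketch below is an illustration under assumptions: `loader_kwargs` is a hypothetical helper (not provided by the package) that only produces the `url`, `urls`, and `mode` keyword arguments documented in this README.

```python
from typing import Any, Dict, List, Union


def loader_kwargs(target: Union[str, List[str]], crawl: bool = False) -> Dict[str, Any]:
    """Map a target to the url/urls/mode keyword arguments listed in this README."""
    if isinstance(target, str):
        # A single URL is either scraped on its own or used as a crawl start page.
        return {"url": target, "mode": "crawl" if crawl else "scrape"}
    if crawl:
        raise ValueError("crawl mode takes a single starting URL")
    # A list of URLs maps to batch mode.
    return {"urls": list(target), "mode": "batch"}
```

The result can be unpacked into the constructor, for example `SpidraLoader(prompt="...", **loader_kwargs(target))`.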

Scrape mode (default)

from langchain_spidra import SpidraLoader

loader = SpidraLoader(
    url="https://spidra.io/pricing",
    prompt="Extract all pricing plans with their names, prices, and features",
    output="json",       # json | markdown | text | table
)
docs = loader.load()
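With `output="json"`, you will probably want a Python object rather than a string. The sketch below assumes that `page_content` carries the JSON as text, which this README does not state explicitly; `parse_page_content` is a hypothetical helper that falls back to the raw string when decoding fails.

```python
import json
from typing import Any


def parse_page_content(page_content: str) -> Any:
    """Decode JSON content, falling back to the raw string if it is not valid JSON."""
    try:
        return json.loads(page_content)
    except (json.JSONDecodeError, TypeError):
        # Not JSON (for example markdown or plain text): return it unchanged.
        return page_content
```

For example, `plans = parse_page_content(docs[0].page_content)`.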

Batch mode

loader = SpidraLoader(
    urls=[
        "https://spidra.io",
        "https://spidra.io/blog",
        "https://competitor.com",
    ],
    mode="batch",
    prompt="Extract the main headline and product description",
)
docs = loader.load()  # one Document per URL
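Because batch mode returns one Document per URL, it can be worth cleaning the list before submitting it. This is a minimal sketch with a hypothetical helper, `unique_urls`, that drops blanks and exact duplicates while preserving order; whether Spidra deduplicates URLs on its side is not covered here.

```python
from typing import Iterable, List


def unique_urls(urls: Iterable[str]) -> List[str]:
    """Drop blank entries and exact duplicates while keeping first-seen order."""
    cleaned = (u.strip() for u in urls)
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return list(dict.fromkeys(u for u in cleaned if u))
```

The cleaned list can be passed as `urls=unique_urls(raw_urls)`.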

Crawl mode

loader = SpidraLoader(
    url="https://spidra.io/blog",
    mode="crawl",
    crawl_instruction="Find all blog posts from 2024 and 2025",
    transform_instruction="Extract the title, publication date, and summary",
    max_pages=20,
)
docs = loader.load()  # one Document per crawled page

Async support

SpidraLoader supports async loading via aload() and alazy_load():

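import asyncio

async def main():
    docs = await loader.aload()            # load everything at once
    async for doc in loader.alazy_load():  # or iterate lazily, one Document at a time
        print(doc.page_content)

asyncio.run(main())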
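When several loaders need to run at the same time, `asyncio.gather` is one option. The sketch below assumes only that each loader exposes `aload()` as described above; `load_all` is a hypothetical helper, and `FakeLoader` is a stand-in so the pattern can be shown without network access.

```python
import asyncio
from typing import Any, List, Sequence


async def load_all(loaders: Sequence[Any]) -> List[Any]:
    """Await aload() on every loader concurrently and flatten the results."""
    batches = await asyncio.gather(*(loader.aload() for loader in loaders))
    return [doc for batch in batches for doc in batch]


class FakeLoader:
    """Stand-in used only to demonstrate the pattern; not a Spidra class."""

    def __init__(self, docs: List[str]) -> None:
        self._docs = docs

    async def aload(self) -> List[str]:
        await asyncio.sleep(0)  # yield control like a real I/O call would
        return self._docs


result = asyncio.run(load_all([FakeLoader(["a"]), FakeLoader(["b", "c"])]))
```

`asyncio.gather` returns results in the order the awaitables were passed, so documents stay grouped by loader.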

Full parameter reference

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| url | str | | URL to scrape (scrape/crawl modes) |
| urls | List[str] | | URLs to scrape (batch mode) |
| api_key | str | SPIDRA_API_KEY env | Spidra API key |
| mode | str | "scrape" | "scrape", "batch", or "crawl" |
| prompt | str | "Extract the main content..." | What data to extract |
| output | str | "markdown" | "json", "markdown", "text", "table" |
| crawl_instruction | str | | Which pages to discover (crawl mode) |
| transform_instruction | str | | What to extract per page (crawl mode) |
| max_pages | int | | Max pages to crawl (crawl mode) |
| use_proxy | bool | | Route through residential proxy |
| proxy_country | str | | ISO country code for geo-targeted proxy |
| extract_content_only | bool | | Strip nav/footer boilerplate |
| cookies | str | | Raw cookie header string |
| poll_options | PollOptions | | Custom polling timeout/interval |
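The `cookies` parameter takes a raw cookie header string. Assuming it follows the usual HTTP `name=value; name=value` form (the exact format Spidra expects is not spelled out here), a string can be built from a dictionary as sketched below; `cookie_header` is a hypothetical helper, not part of the package.

```python
from typing import Mapping


def cookie_header(cookies: Mapping[str, str]) -> str:
    """Join name/value pairs into a single Cookie header string."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

For example, `SpidraLoader(url="https://example.com", cookies=cookie_header({"session": "abc123"}))`.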

SpidraScrapeTool — LangChain Tool

Use Spidra as a tool in agent workflows. The agent decides when and what to scrape.

from langchain_spidra import SpidraScrapeTool
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, ToolMessage

tool = SpidraScrapeTool()  # reads SPIDRA_API_KEY from env

# Bind to a model
model = ChatOpenAI(model="gpt-4o-mini").bind_tools([tool])

messages = [HumanMessage(
    content="What does Spidra cost? Check https://spidra.io/pricing"
)]
response = model.invoke(messages)

# Execute tool calls and get the final answer
if response.tool_calls:
    messages.append(response)
    for tc in response.tool_calls:
        result = tool.invoke(tc["args"])
        messages.append(ToolMessage(content=result, tool_call_id=tc["id"]))
    final = model.invoke(messages)
    print(final.content)
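The loop above passes the tool result straight into `ToolMessage(content=...)`. This README does not say what type the tool returns, so a defensive conversion to text may be useful. The sketch below uses a hypothetical helper, `as_tool_content`, that leaves strings untouched and serialises anything else to JSON.

```python
import json
from typing import Any


def as_tool_content(result: Any) -> str:
    """Return strings unchanged and serialise anything else to JSON text."""
    if isinstance(result, str):
        return result
    # default=str covers values json cannot encode natively, such as datetimes.
    return json.dumps(result, default=str, ensure_ascii=False)
```

It would be used as `ToolMessage(content=as_tool_content(result), tool_call_id=tc["id"])`.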

SpidraRetriever — LangChain Retriever

Drop-in retriever for RAG pipelines. Crawls a site with Spidra and uses your query as the AI extraction instruction.

from langchain_spidra import SpidraRetriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

retriever = SpidraRetriever(
    url="https://spidra.io/docs",
    crawl_instruction="Find all documentation pages",
    max_pages=15,
)

# Use in a RAG chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | ChatPromptTemplate.from_template(
        "Answer based on context:\n{context}\n\nQuestion: {question}"
    )
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

answer = chain.invoke("How do I authenticate with the Spidra API?")
print(answer)
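In the chain above, the retriever's output reaches the prompt as a list of Document objects. A common LCEL pattern is to join their text first. The sketch below shows such a formatter; `format_docs` is a conventional but hypothetical helper name, and `FakeDoc` is a stand-in with the single attribute the helper needs.

```python
from typing import Any, Iterable


def format_docs(docs: Iterable[Any]) -> str:
    """Join each document's page_content with blank lines between pages."""
    return "\n\n".join(doc.page_content for doc in docs)


class FakeDoc:
    """Stand-in for a LangChain Document, used only for this demonstration."""

    def __init__(self, page_content: str) -> None:
        self.page_content = page_content


context = format_docs([FakeDoc("First page."), FakeDoc("Second page.")])
```

In the chain, this would replace the `"context"` entry with `retriever | format_docs`, since LCEL coerces plain callables into runnables.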

Scrape + Chat in 10 lines

from langchain_spidra import SpidraLoader
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

docs = SpidraLoader(
    url="https://spidra.io",
    prompt="Extract all key product information",
).load()

answer = ChatOpenAI(model="gpt-4o-mini").invoke([
    HumanMessage(content=f"Summarise in 3 bullets:\n\n{docs[0].page_content}")
])
print(answer.content)

Examples

See the examples/ directory:

| File | What it shows |
|------|---------------|
| scrape_and_chat.py | Scrape a URL → chat with the content |
| chains.py | LCEL chain: scrape → extract → format |
| tool_calling_agent.py | Agent with SpidraScrapeTool |
| structured_extraction.py | Typed Pydantic output from a scraped page |
| batch_scrape.py | Scrape multiple URLs at once |
| rag_pipeline.py | Full RAG: crawl → embed → vector store → Q&A |

Development

git clone https://github.com/spidra-io/spidra-langchain
cd spidra-langchain
pip install -e ".[dev]"
pytest

License

MIT © Spidra
