LangChain Olostep Integration
The most reliable and cost-effective web search, scraping and crawling API for AI.
Build intelligent agents that can search, scrape, analyze, and structure data from any website. Perfect for LangChain and LangGraph applications.
Features
This integration provides access to all 5 Olostep API endpoints:
1. Scrapes (/v1/scrapes)
Extract content from any single URL in multiple formats (Markdown, HTML, JSON, text). Handles JavaScript rendering, anti-scraping measures, and supports specialized parsers for specific websites.
2. Batches (/v1/batches)
Process up to 10,000 URLs in parallel. Batch jobs typically complete in 5-8 minutes regardless of batch size. Perfect for large-scale data extraction and competitor analysis.
3. Answers (/v1/answers)
AI-powered web search with natural language queries. Ask questions and get structured answers with sources. Ground your AI products on real-world data and facts. Perfect for data enrichment and research.
4. Maps (/v1/maps)
Extract all URLs from a website for site structure analysis and content discovery. Can discover up to ~100,000 URLs in a single call. Perfect for SEO audits and preparing URLs for batch processing.
5. Crawls (/v1/crawls)
Autonomously discover and scrape entire websites by following links. Perfect for documentation sites, blogs, and comprehensive website data extraction.
Core Capabilities
- Multiple Formats: Markdown, HTML, JSON, and plain text
- JavaScript Rendering: Automatic handling of dynamic content
- Specialized Parsers: Custom parsers for Amazon, Google Search, and more
- Location-Specific: Scrape with country-specific settings
- LangGraph Ready: Perfect for building complex AI agent workflows
- Cost-Effective: Pay only for what you use
Installation
pip install langchain-olostep
Setup
Set your Olostep API key:
export OLOSTEP_API_KEY="your_olostep_api_key_here"
Get your API key from https://olostep.com/dashboard
Quick Start
1. Basic Web Scraping
from langchain_olostep import scrape_website
import asyncio
# Scrape a website
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
}))
print(content)
2. Batch Scraping
from langchain_olostep import scrape_batch
import asyncio
# Scrape multiple URLs in parallel
result = asyncio.run(scrape_batch.ainvoke({
    "urls": [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ],
    "format": "markdown"
}))
print(result)
3. AI-Powered Web Search & Q&A
from langchain_olostep import answer_question
import asyncio
# Ask a question and get structured answer
result = asyncio.run(answer_question.ainvoke({
    "task": "What is the latest book by J.K. Rowling?",
    "json_schema": {
        "book_title": "",
        "author": "",
        "release_date": ""
    }
}))
print(result)
# Returns: {"book_title": "The Hallmarked Man", "author": "J.K. Rowling", ...}
# Also includes sources used
4. Extract URLs from Website
from langchain_olostep import extract_urls
import asyncio
# Get all URLs from a website
result = asyncio.run(extract_urls.ainvoke({
    "url": "https://example.com",
    "include_urls": ["/blog/**"],  # Only blog posts
    "top_n": 100
}))
print(result)
5. Crawl Entire Website
from langchain_olostep import crawl_website
import asyncio
# Autonomously crawl and scrape a website
result = asyncio.run(crawl_website.ainvoke({
    "start_url": "https://docs.example.com",
    "max_pages": 100,
    "exclude_urls": ["/admin/**"]
}))
print(result)
Usage with LangChain Agent
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langchain_olostep import (
    scrape_website,
    answer_question,
    extract_urls
)
# Create agent with Olostep tools
tools = [scrape_website, answer_question, extract_urls]
llm = ChatOpenAI(model="gpt-4o-mini")
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
# Use the agent
result = agent.run("""
Research the company at https://company.com:
1. Scrape their about page
2. Search for their latest funding round
3. Extract all their product pages
""")
print(result)
Usage with LangGraph
import json

from langgraph.graph import StateGraph, END
from langchain_olostep import (
    scrape_batch,
    answer_question,
    extract_urls
)
# Build a research agent workflow
def create_research_agent():
    workflow = StateGraph(dict)

    def discover_pages(state):
        # Extract all URLs from target site
        result = extract_urls.invoke({
            "url": state["target_url"],
            "include_urls": ["/product/**"],
            "top_n": 50
        })
        state["urls"] = json.loads(result)["urls"]
        return state

    def scrape_pages(state):
        # Scrape discovered pages in batch
        result = scrape_batch.invoke({
            "urls": state["urls"],
            "format": "markdown"
        })
        state["batch_id"] = json.loads(result)["batch_id"]
        return state

    def answer_questions(state):
        # Use AI to answer questions about the data
        result = answer_question.invoke({
            "task": state["research_question"],
            "json_schema": state["desired_format"]
        })
        state["answer"] = json.loads(result)["answer"]
        return state

    workflow.add_node("discover", discover_pages)
    workflow.add_node("scrape", scrape_pages)
    workflow.add_node("analyze", answer_questions)
    workflow.set_entry_point("discover")
    workflow.add_edge("discover", "scrape")
    workflow.add_edge("scrape", "analyze")
    workflow.add_edge("analyze", END)

    return workflow.compile()
# Use the agent
agent = create_research_agent()
result = agent.invoke({
    "target_url": "https://store.com",
    "research_question": "What are the top 5 most expensive products?",
    "desired_format": {"products": [{"name": "", "price": "", "url": ""}]}
})
Available Tools
scrape_website
Scrape content from a single URL. Supports markdown, HTML, JSON, and text formats.
Parameters:
- url (required): Website URL to scrape
- format: Output format (markdown, html, json, text). Default: "markdown"
- country: Country code for location-specific content (e.g., "US", "GB", "CA")
- wait_before_scraping: Wait time in milliseconds for JavaScript rendering
- parser: Optional parser ID for specialized extraction (e.g., "@olostep/amazon-product")
scrape_batch
Scrape multiple URLs in parallel (up to 10,000 at once).
Parameters:
- urls (required): List of URLs to scrape
- format: Output format for all URLs. Default: "markdown"
- country: Country code for location-specific content
- wait_before_scraping: Wait time in milliseconds for JavaScript rendering
- parser: Optional parser ID for specialized extraction
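Since a single /v1/batches call accepts at most 10,000 URLs, larger lists need to be split client-side before submission. A minimal sketch in plain Python (the `chunk_urls` helper is illustrative, not part of the package):

```python
def chunk_urls(urls, max_batch=10_000):
    # Split a URL list into batches no larger than the documented
    # 10,000-URL limit per /v1/batches call.
    return [urls[i:i + max_batch] for i in range(0, len(urls), max_batch)]

batches = chunk_urls([f"https://example.com/p/{i}" for i in range(25_000)])
print([len(b) for b in batches])  # [10000, 10000, 5000]
```

Each chunk can then be passed as the `urls` argument to `scrape_batch`.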
answer_question
Search the web and get AI-powered answers with sources.
Parameters:
- task (required): Question or task to search for
- json_schema: Optional JSON schema dict/string describing desired output format
Examples:
# Simple question
answer_question.invoke({"task": "What is the capital of France?"})
# With structured output
answer_question.invoke({
    "task": "Find the CEO of Stripe",
    "json_schema": {"ceo_name": "", "founded_year": ""}
})
extract_urls
Extract all URLs from a website for site structure analysis.
Parameters:
- url (required): Website URL to extract URLs from
- search_query: Optional search query to filter URLs
- top_n: Limit number of URLs returned
- include_urls: Glob patterns to include (e.g., ["/blog/**"])
- exclude_urls: Glob patterns to exclude (e.g., ["/admin/**"])
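In patterns like these, `**` conventionally spans multiple path segments while `*` stays within a single segment. The translation below is a local illustration of that common glob convention (Olostep's exact matching rules may differ slightly):

```python
import re

def glob_to_regex(pattern: str) -> str:
    # Illustrative glob translation: "**" crosses path segments,
    # "*" matches within one segment only.
    out, i = [], 0
    while i < len(pattern):
        if pattern[i:i + 2] == "**":
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "^" + "".join(out) + "$"

print(bool(re.match(glob_to_regex("/blog/**"), "/blog/2024/post-1")))  # True
print(bool(re.match(glob_to_regex("/blog/*"), "/blog/2024/post-1")))   # False
```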
crawl_website
Autonomously discover and scrape entire websites.
Parameters:
- start_url (required): Starting URL for the crawl
- max_pages: Maximum number of pages to crawl. Default: 100
- include_urls: Glob patterns to include
- exclude_urls: Glob patterns to exclude
- max_depth: Maximum depth to crawl from start_url
- include_external: Include external URLs. Default: False
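To see how max_pages and max_depth interact, here is a self-contained sketch of a breadth-first crawl over an in-memory link graph (no network calls; crawl_website performs the real traversal server-side):

```python
from collections import deque

def crawl_order(links, start, max_pages=100, max_depth=2):
    # Depth-limited breadth-first walk over a link graph, mirroring
    # the max_pages/max_depth parameters. `links` maps each URL to
    # the URLs it links to.
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth < max_depth:
            for nxt in links.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return order

site = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/install"],
    "/docs/install": ["/docs/install/linux"],
}
print(crawl_order(site, "/", max_depth=2))
# ['/', '/docs', '/blog', '/docs/install']
```

Pages at max_depth are still scraped, but their outgoing links are no longer followed.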
Advanced Examples
Data Enrichment
from langchain_olostep import answer_question
# Enrich company data from spreadsheet
companies = ["Stripe", "Shopify", "Square"]
for company in companies:
    result = answer_question.invoke({
        "task": f"Find information about {company}",
        "json_schema": {
            "ceo": "",
            "headquarters": "",
            "employee_count": "",
            "latest_funding": ""
        }
    })
    print(f"{company}: {result}")
E-commerce Product Scraping
from langchain_olostep import scrape_website
# Scrape Amazon product with specialized parser
result = scrape_website.invoke({
    "url": "https://www.amazon.com/dp/PRODUCT_ID",
    "parser": "@olostep/amazon-product",
    "format": "json"
})
# Returns structured product data: price, title, rating, etc.
SEO Audit
import json

from langchain_olostep import extract_urls, scrape_batch

# 1. Discover all pages
urls_result = extract_urls.invoke({
    "url": "https://yoursite.com",
    "top_n": 1000
})

# 2. Scrape all pages to analyze
urls = json.loads(urls_result)["urls"]
batch_result = scrape_batch.invoke({
    "urls": urls,
    "format": "html"
})
Documentation Scraping
from langchain_olostep import crawl_website
# Crawl entire docs site
result = crawl_website.invoke({
    "start_url": "https://docs.example.com",
    "max_pages": 500,
    "include_urls": ["/docs/**"],
    "exclude_urls": ["/api/**", "/v1/**"]  # Exclude old versions
})
Error Handling
import asyncio

from langchain_core.exceptions import LangChainException
from langchain_olostep import scrape_website

try:
    result = asyncio.run(scrape_website.ainvoke({
        "url": "https://example.com"
    }))
except LangChainException as e:
    print(f"Scraping failed: {e}")
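Transient network errors are common when scraping at scale, so it can also help to wrap tool calls in a retry loop with exponential backoff. A generic sketch (`with_retries` is not part of langchain-olostep), demonstrated here against a deliberately flaky local function:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    # Retry fn up to `attempts` times, doubling the delay each time.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Stand-in for a tool call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

In practice, `fn` would be a lambda wrapping a call such as `scrape_website.invoke(...)`.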
Environment Variables
OLOSTEP_API_KEY: Your Olostep API key (required)
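The key can also be set from Python before invoking any tools, which is convenient in notebooks. A minimal sketch (the placeholder value is obviously not a real key):

```python
import os

# Set the key for this process; the integration reads
# OLOSTEP_API_KEY from the environment at call time.
os.environ["OLOSTEP_API_KEY"] = "your_olostep_api_key_here"

print("OLOSTEP_API_KEY" in os.environ)  # True
```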
Specialized Parsers
Olostep provides pre-built parsers for popular websites:
- @olostep/amazon-product: Amazon product pages
- @olostep/google-search: Google search results
- @olostep/google-maps: Google Maps data
Use them with the parser parameter:
scrape_website.invoke({
    "url": "https://www.amazon.com/dp/PRODUCT_ID",
    "parser": "@olostep/amazon-product"
})
Pricing
Olostep offers competitive pricing:
- Pay only for what you use
- No hidden fees
- Volume discounts available
Visit https://olostep.com/pricing for details.
Support
- Documentation: https://docs.olostep.com
- Issues: GitHub Issues
- Email: info@olostep.com
Links
- PyPI: https://pypi.org/project/langchain-olostep/
- Documentation: https://docs.olostep.com/integrations/langchain
- Dashboard: https://olostep.com/dashboard
- Website: https://olostep.com
License
MIT
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file langchain_olostep-0.3.3.tar.gz.
File metadata
- Download URL: langchain_olostep-0.3.3.tar.gz
- Upload date:
- Size: 20.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d971a3e289a268e6b6e815fd19f2ae1b8cd5d28b7765ee2844eceef8ab4c9baa |
| MD5 | 3528f599f2476cbe21b394ea16cc14cf |
| BLAKE2b-256 | 2b0bda032ae90d9d441563b305167f40357c304539deb86c0da86d7fd16d2079 |
File details
Details for the file langchain_olostep-0.3.3-py3-none-any.whl.
File metadata
- Download URL: langchain_olostep-0.3.3-py3-none-any.whl
- Upload date:
- Size: 14.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 99f0e844c9e2bca12fbb7940ef5732b886f4012a7f73956f3850a7a98d088507 |
| MD5 | fe8a06d7fb114652436c4817c65b6c22 |
| BLAKE2b-256 | 43f00f676e9a8a75646c27330f1595c2751fcb63db280f46fb2872d756bebca1 |