LangChain/LangGraph integration for Olostep - Powerful web scraping tools for AI agents
LangChain Olostep Integration
A powerful LangChain/LangGraph integration for the Olostep web scraping API. Build intelligent agents that can scrape, analyze, and extract data from any website.
Features
- Web Scraping: Extract content from any website with JavaScript rendering support
- Batch Processing: Scrape up to 100,000 URLs in parallel
- AI-Powered Q&A: Ask questions about websites and get intelligent answers
- Data Extraction: Extract specific fields using AI-powered mapping
- Multiple Formats: Support for Markdown, HTML, JSON, and plain text
- Specialized Parsers: Use custom parsers for specific websites (e.g., Amazon, LinkedIn)
- Location-Specific: Scrape with country-specific settings
- LangGraph Ready: Perfect for building complex AI agent workflows
Installation
pip install langchain-olostep
Setup
Set your Olostep API key:
export OLOSTEP_API_KEY="your_olostep_api_key_here"
Get your API key from https://olostep.com/dashboard
Quick Start
Basic Web Scraping
from langchain_olostep import scrape_website
import asyncio
# Scrape a website
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
}))
print(content)
With LangChain Agent
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langchain_olostep import scrape_website, scrape_with_answer
# Create agent with Olostep tools
tools = [scrape_website, scrape_with_answer]
llm = ChatOpenAI(model="gpt-4o-mini")
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
# Use the agent
result = agent.run("""
Scrape https://example.com and tell me:
1. What is the main content about?
2. Extract any contact information
""")
print(result)
With LangGraph
from langgraph.graph import StateGraph
from langchain_olostep import scrape_website, scrape_batch
from langchain_openai import ChatOpenAI
# Build a research agent workflow
workflow = StateGraph(dict)
def scrape_node(state):
    urls = state["urls"]
    result = scrape_batch.invoke({"urls": urls})
    return {"scraped_data": result}
workflow.add_node("scrape", scrape_node)
# ... add more nodes
Available Tools
1. scrape_website
Scrape content from any website.
from langchain_olostep import scrape_website
result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown",                # markdown, html, json, or text
    "country": "US",                     # Optional: country code for location-specific content
    "wait_before_scraping": 2000,        # Optional: wait time in ms for JS rendering
    "parser": "@olostep/amazon-product"  # Optional: specialized parser
})
Perfect for:
- Extracting article content
- Scraping dynamic websites
- Bypassing anti-scraping measures
- Getting clean, formatted content
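Network scrapes can fail transiently (timeouts, rate limits), so a retry wrapper is often worth adding around tool calls. This is a generic asyncio sketch, not part of the package; the `attempts` and `base_delay` parameters are illustrative:

```python
import asyncio

async def scrape_with_retry(tool, params, attempts=3, base_delay=1.0):
    """Invoke a LangChain tool, retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return await tool.ainvoke(params)
        except Exception:
            if attempt == attempts - 1:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts
            await asyncio.sleep(base_delay * 2 ** attempt)
```

Usage: `result = await scrape_with_retry(scrape_website, {"url": "https://example.com"})`.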
2. scrape_batch
Scrape multiple URLs in parallel.
from langchain_olostep import scrape_batch
urls = [
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
]

result = await scrape_batch.ainvoke({
    "urls": urls,
    "format": "markdown"
})
Perfect for:
- Competitive analysis
- Large-scale data collection
- Building datasets
- Monitoring multiple sources
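Batches accept up to 100,000 URLs, but for very large collections you may still want to submit smaller chunks so you can checkpoint progress between submissions. A plain-Python helper (the chunk size of 1,000 is arbitrary):

```python
def chunk_urls(urls, size=1000):
    """Split a URL list into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

# Each chunk can then be passed to scrape_batch.ainvoke({"urls": chunk, ...})
```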
3. scrape_with_answer
Ask questions about website content and get AI-powered answers.
from langchain_olostep import scrape_with_answer
result = await scrape_with_answer.ainvoke({
    "url": "https://company.com",
    "question": "What is the company's main product and its pricing?"
})
Perfect for:
- Research and information extraction
- Competitive intelligence
- Lead generation
- Content analysis
4. scrape_with_map
Extract specific fields using AI-powered mapping.
from langchain_olostep import scrape_with_map
result = await scrape_with_map.ainvoke({
    "url": "https://store.com/product/123",
    "fields": ["product_name", "price", "rating", "description"]
})
Perfect for:
- Structured data extraction
- Product information gathering
- Contact details extraction
- E-commerce data collection
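AI-powered extraction may not find every requested field on every page, so downstream code should handle missing keys. Assuming the tool returns a dict-like mapping of field names to values (the exact return shape is not documented here, so treat this as a sketch):

```python
def normalize_fields(result, fields):
    """Ensure every requested field is present, filling missing ones with None."""
    return {field: result.get(field) for field in fields}
```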
Examples
Example 1: Research Agent
from langchain_olostep import scrape_website, scrape_with_answer
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
tools = [scrape_website, scrape_with_answer]
llm = ChatOpenAI(model="gpt-4o-mini")
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
)
# Research a topic
result = agent.run("""
Research the latest developments in AI by:
1. Scraping https://openai.com/blog
2. Extracting key announcements
3. Summarizing the findings
""")
Example 2: Competitive Analysis
from langchain_olostep import scrape_batch, scrape_with_map
# Scrape competitor websites
competitors = [
    "https://competitor1.com/pricing",
    "https://competitor2.com/pricing",
    "https://competitor3.com/pricing"
]

batch_result = await scrape_batch.ainvoke({"urls": competitors})

# Extract pricing information
for url in competitors:
    pricing = await scrape_with_map.ainvoke({
        "url": url,
        "fields": ["pricing_tiers", "features", "prices"]
    })
    print(f"Competitor: {url}")
    print(f"Pricing: {pricing}")
Example 3: Content Monitoring
from langchain_olostep import scrape_website
import asyncio
import schedule
import time

def monitor_website():
    # schedule calls plain (sync) functions, so run the async tool via asyncio.run
    content = asyncio.run(scrape_website.ainvoke({
        "url": "https://important-site.com",
        "format": "markdown"
    }))
    # Check for changes, send alerts, etc.
    # ... your logic here
# Run every hour
schedule.every().hour.do(monitor_website)
while True:
schedule.run_pending()
time.sleep(1)
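The "check for changes" step in the monitor above can be as simple as comparing a hash of the scraped content against the previous run. A minimal stdlib sketch (the module-level state variable is illustrative; persist the hash to disk for restarts):

```python
import hashlib

_last_hash = None

def content_changed(content: str) -> bool:
    """Return True when the content's SHA-256 differs from the previous call."""
    global _last_hash
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    changed = digest != _last_hash
    _last_hash = digest
    return changed
```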
Example 4: LangGraph Research Workflow
See the complete example in the examples directory.
from langgraph.graph import StateGraph, END
from langchain_olostep import scrape_website, scrape_with_answer
# Define your research workflow
workflow = StateGraph(dict)
# Add nodes for different stages
workflow.add_node("plan", plan_research)
workflow.add_node("scrape", scrape_content)
workflow.add_node("analyze", analyze_data)
workflow.add_node("report", generate_report)
# Connect the nodes
workflow.set_entry_point("plan")
workflow.add_edge("plan", "scrape")
workflow.add_edge("scrape", "analyze")
workflow.add_edge("analyze", "report")
workflow.add_edge("report", END)
# Compile and run
agent = workflow.compile()
result = agent.invoke({"query": "Research AI developments"})
Advanced Features
JavaScript Rendering
Handle dynamic websites that load content via JavaScript:
result = await scrape_website.ainvoke({
    "url": "https://dynamic-site.com",
    "wait_before_scraping": 3000  # Wait 3 seconds
})
Location-Specific Scraping
Get content as it appears in different countries:
result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "country": "GB"  # Scrape as viewed from the UK
})
Specialized Parsers
Use pre-built parsers for specific websites:
# Amazon product parser
product = await scrape_website.ainvoke({
    "url": "https://amazon.com/product/xyz",
    "parser": "@olostep/amazon-product"
})

# LinkedIn profile parser
profile = await scrape_website.ainvoke({
    "url": "https://linkedin.com/in/username",
    "parser": "@olostep/linkedin-profile"
})
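When scraping mixed URL lists, you can route to the specialized parsers above by hostname. The two parser IDs come from the examples on this page; the routing helper itself is an illustrative sketch:

```python
from urllib.parse import urlparse

# Parser IDs from the examples above; extend with other parsers as needed
PARSERS = {
    "amazon.com": "@olostep/amazon-product",
    "linkedin.com": "@olostep/linkedin-profile",
}

def pick_parser(url):
    """Return a specialized parser ID for known hosts, else None."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    return PARSERS.get(host)
```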
Multiple Output Formats
Get content in different formats:
# Get markdown for readability
markdown = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
})

# Get JSON for structured data
json_data = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "json"
})

# Get HTML for full page structure
html = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "html"
})
Configuration
Environment Variables
OLOSTEP_API_KEY: Your Olostep API key (required)
Tool Parameters
All tools accept an optional api_key parameter:
result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "api_key": "your_api_key_here"  # Override environment variable
})
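The precedence implied above (an explicit `api_key` overrides the environment variable) can be expressed as a small resolver. This helper is illustrative, not part of the package:

```python
import os

def resolve_api_key(explicit_key=None):
    """Prefer an explicitly passed key; fall back to OLOSTEP_API_KEY."""
    key = explicit_key or os.environ.get("OLOSTEP_API_KEY")
    if not key:
        raise ValueError("Set OLOSTEP_API_KEY or pass api_key explicitly")
    return key
```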
Use Cases
Research & Analysis
- Market research
- Competitive intelligence
- Academic research
- News monitoring
Data Collection
- Building datasets
- Product information gathering
- Price monitoring
- Contact information extraction
AI Agents
- Research assistants
- Data extraction bots
- Content analyzers
- Web automation agents
Business Intelligence
- Competitor tracking
- Lead generation
- Market analysis
- Trend monitoring
Getting Started
1. Install the package: pip install langchain-olostep
2. Get your API key: sign up at olostep.com and copy the key from the dashboard
3. Set your API key: export OLOSTEP_API_KEY="your_key_here"
4. Try the examples: check out the examples in the repository
Documentation
- Olostep API Documentation: https://docs.olostep.com
- LangChain Documentation: https://python.langchain.com
- LangGraph Documentation: https://langchain-ai.github.io/langgraph/
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details.
Support
- Documentation: docs.olostep.com
- Issues: GitHub Issues
- Email: support@olostep.com
Why Olostep?
- Reliable: Handle JavaScript rendering, anti-scraping measures, and dynamic content
- Fast: Parallel processing for batch operations
- Accurate: AI-powered extraction for precise data gathering
- Flexible: Multiple formats, parsers, and configuration options
- Scalable: From single URLs to 100,000+ URLs in batch
Changelog
0.2.0
- Complete redesign focusing on Olostep's core features
- Added scrape_with_answer for AI-powered Q&A
- Added scrape_with_map for structured data extraction
- Removed confusing document loader terminology
- Improved tool descriptions and examples
- Added comprehensive LangGraph example
0.1.0
- Initial release
Download files
Source Distribution
Built Distribution
File details
Details for the file langchain_olostep-0.2.0.tar.gz.
File metadata
- Download URL: langchain_olostep-0.2.0.tar.gz
- Upload date:
- Size: 17.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0609472d69f51009b400e3dab75fecce0af61031ea79e55271ee989e23e4eae4 |
| MD5 | 22a0589f3686f8efd9feb488cbf45dd3 |
| BLAKE2b-256 | 3d22db1dbe12e86224b88f4d5f76cf9c837c3f7dd041ec6aeddf0f925dc28714 |
File details
Details for the file langchain_olostep-0.2.0-py3-none-any.whl.
File metadata
- Download URL: langchain_olostep-0.2.0-py3-none-any.whl
- Upload date:
- Size: 9.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 51bc33baf6db2fed301c863228669f01eafdb47c9c2c9ab6f152848787f572a3 |
| MD5 | 54fff007abaa419addcb6b432a62c6f1 |
| BLAKE2b-256 | 2014364968c25498e7b73c661437914116fca9b1f2e4f32d3f1ad89e3a91d442 |