

LangChain Olostep Integration

A powerful LangChain/LangGraph integration for the Olostep web scraping API. Build intelligent agents that can scrape, analyze, and extract data from any website.

Features

  • Web Scraping: Extract content from any website with JavaScript rendering support
  • Batch Processing: Scrape up to 100,000 URLs in parallel
  • AI-Powered Q&A: Ask questions about websites and get intelligent answers
  • Data Extraction: Extract specific fields using AI-powered mapping
  • Multiple Formats: Support for Markdown, HTML, JSON, and plain text
  • Specialized Parsers: Use custom parsers for specific websites (e.g., Amazon, LinkedIn)
  • Location-Specific: Scrape with country-specific settings
  • LangGraph Ready: Perfect for building complex AI agent workflows

Installation

pip install langchain-olostep

Setup

Set your Olostep API key:

export OLOSTEP_API_KEY="your_olostep_api_key_here"

Get your API key from https://olostep.com/dashboard
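
If a shell export is inconvenient (for example, in a notebook), you can also set the key from Python before using the tools. This uses only the standard library:

```python
import os

# Set the key programmatically; equivalent to the shell export above
os.environ["OLOSTEP_API_KEY"] = "your_olostep_api_key_here"
```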

Quick Start

Basic Web Scraping

from langchain_olostep import scrape_website
import asyncio

# Scrape a website
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
}))

print(content)
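
To fetch a handful of pages concurrently without the batch endpoint, plain asyncio.gather works with the tool's ainvoke. The helper below is generic asyncio, not a package API:

```python
import asyncio

async def gather_pages(fetch, urls):
    """Run one async fetch per URL concurrently; return {url: result}."""
    results = await asyncio.gather(*(fetch(u) for u in urls))
    return dict(zip(urls, results))

# Usage (assumes OLOSTEP_API_KEY is set):
# pages = asyncio.run(gather_pages(
#     lambda u: scrape_website.ainvoke({"url": u, "format": "markdown"}),
#     ["https://example.com", "https://example.org"]))
```

For more than a few dozen URLs, prefer scrape_batch, which parallelizes on the server side.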

With LangChain Agent

from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langchain_olostep import scrape_website, scrape_with_answer

# Create agent with Olostep tools
tools = [scrape_website, scrape_with_answer]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
result = agent.run("""
Scrape https://example.com and tell me:
1. What is the main content about?
2. Extract any contact information
""")

print(result)

With LangGraph

from langgraph.graph import StateGraph, END
from langchain_olostep import scrape_website, scrape_batch
from langchain_openai import ChatOpenAI

# Build a research agent workflow
workflow = StateGraph(dict)

def scrape_node(state):
    urls = state["urls"]
    result = scrape_batch.invoke({"urls": urls})
    return {"scraped_data": result}

workflow.add_node("scrape", scrape_node)
workflow.set_entry_point("scrape")
workflow.add_edge("scrape", END)
# ... add more nodes and edges, then compile with workflow.compile()

Available Tools

1. scrape_website

Scrape content from any website.

from langchain_olostep import scrape_website

result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown",  # markdown, html, json, or text
    "country": "US",  # Optional: country code for location-specific content
    "wait_before_scraping": 2000,  # Optional: wait time in ms for JS rendering
    "parser": "@olostep/amazon-product"  # Optional: specialized parser
})

Perfect for:

  • Extracting article content
  • Scraping dynamic websites
  • Bypassing anti-scraping measures
  • Getting clean, formatted content
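
Scrapes can fail transiently (timeouts, rate limits), so it is often worth retrying. The wrapper below is a generic asyncio sketch, not part of langchain-olostep:

```python
import asyncio

async def with_retries(coro_factory, attempts=3, delay=1.0):
    """Call an async factory up to `attempts` times, sleeping between tries."""
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(delay)

# Usage (assumes OLOSTEP_API_KEY is set):
# content = asyncio.run(with_retries(
#     lambda: scrape_website.ainvoke({"url": "https://example.com"})))
```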

2. scrape_batch

Scrape multiple URLs in parallel.

from langchain_olostep import scrape_batch

urls = [
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
]

result = await scrape_batch.ainvoke({
    "urls": urls,
    "format": "markdown"
})

Perfect for:

  • Competitive analysis
  • Large-scale data collection
  • Building datasets
  • Monitoring multiple sources
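
For very large URL lists, you may prefer to submit them in fixed-size chunks rather than one giant call. A generic chunking helper (not part of the package) might look like:

```python
def chunked(items, size):
    """Yield fixed-size slices of a list (the last slice may be shorter)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g. submit 1,000 URLs at a time:
# for batch in chunked(urls, 1000):
#     result = await scrape_batch.ainvoke({"urls": batch, "format": "markdown"})
```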

3. scrape_with_answer

Ask questions about website content and get AI-powered answers.

from langchain_olostep import scrape_with_answer

result = await scrape_with_answer.ainvoke({
    "url": "https://company.com",
    "question": "What is the company's main product and its pricing?"
})

Perfect for:

  • Research and information extraction
  • Competitive intelligence
  • Lead generation
  • Content analysis

4. scrape_with_map

Extract specific fields using AI-powered mapping.

from langchain_olostep import scrape_with_map

result = await scrape_with_map.ainvoke({
    "url": "https://store.com/product/123",
    "fields": ["product_name", "price", "rating", "description"]
})

Perfect for:

  • Structured data extraction
  • Product information gathering
  • Contact details extraction
  • E-commerce data collection

Examples

Example 1: Research Agent

from langchain_olostep import scrape_website, scrape_with_answer
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI

tools = [scrape_website, scrape_with_answer]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
)

# Research a topic
result = agent.run("""
Research the latest developments in AI by:
1. Scraping https://openai.com/blog
2. Extracting key announcements
3. Summarizing the findings
""")

Example 2: Competitive Analysis

import asyncio
from langchain_olostep import scrape_batch, scrape_with_map

# Scrape competitor websites
competitors = [
    "https://competitor1.com/pricing",
    "https://competitor2.com/pricing",
    "https://competitor3.com/pricing"
]

async def analyze_competitors():
    batch_result = await scrape_batch.ainvoke({"urls": competitors})

    # Extract pricing information
    for url in competitors:
        pricing = await scrape_with_map.ainvoke({
            "url": url,
            "fields": ["pricing_tiers", "features", "prices"]
        })
        print(f"Competitor: {url}")
        print(f"Pricing: {pricing}")

asyncio.run(analyze_competitors())

Example 3: Content Monitoring

from langchain_olostep import scrape_website
import asyncio
import schedule
import time

def monitor_website():
    # schedule calls synchronous functions, so drive the async tool with asyncio.run
    content = asyncio.run(scrape_website.ainvoke({
        "url": "https://important-site.com",
        "format": "markdown"
    }))
    
    # Check for changes, send alerts, etc.
    # ... your logic here

# Run every hour
schedule.every().hour.do(monitor_website)

while True:
    schedule.run_pending()
    time.sleep(1)
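
The "check for changes" step above can be sketched by hashing each snapshot and comparing it to the previous one. The helper below is a hypothetical sketch using only the standard library, not a package API:

```python
import hashlib

def content_changed(previous_hash, content):
    """Return (changed, new_hash) for a page snapshot versus the last seen hash."""
    new_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    changed = previous_hash is not None and previous_hash != new_hash
    return changed, new_hash

# The first run establishes a baseline; later runs flag differences
changed, last_hash = content_changed(None, "# Page v1")
```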

Example 4: LangGraph Research Workflow

See the complete example in the examples directory.

from langgraph.graph import StateGraph, END
from langchain_olostep import scrape_website, scrape_with_answer

# Define your research workflow
workflow = StateGraph(dict)

# Add nodes for different stages (plan_research, scrape_content, analyze_data,
# and generate_report are node functions defined in the full example)
workflow.add_node("plan", plan_research)
workflow.add_node("scrape", scrape_content)
workflow.add_node("analyze", analyze_data)
workflow.add_node("report", generate_report)

# Connect the nodes
workflow.set_entry_point("plan")
workflow.add_edge("plan", "scrape")
workflow.add_edge("scrape", "analyze")
workflow.add_edge("analyze", "report")
workflow.add_edge("report", END)

# Compile and run
agent = workflow.compile()
result = agent.invoke({"query": "Research AI developments"})

Advanced Features

JavaScript Rendering

Handle dynamic websites that load content via JavaScript:

result = await scrape_website.ainvoke({
    "url": "https://dynamic-site.com",
    "wait_before_scraping": 3000  # Wait 3 seconds
})

Location-Specific Scraping

Get content as it appears in different countries:

result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "country": "GB"  # Scrape as viewed from UK
})

Specialized Parsers

Use pre-built parsers for specific websites:

# Amazon product parser
product = await scrape_website.ainvoke({
    "url": "https://amazon.com/product/xyz",
    "parser": "@olostep/amazon-product"
})

# LinkedIn profile parser
profile = await scrape_website.ainvoke({
    "url": "https://linkedin.com/in/username",
    "parser": "@olostep/linkedin-profile"
})

Multiple Output Formats

Get content in different formats:

# Get markdown for readability
markdown = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
})

# Get JSON for structured data
json_data = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "json"
})

# Get HTML for full page structure
html = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "html"
})

Configuration

Environment Variables

  • OLOSTEP_API_KEY: Your Olostep API key (required)

Tool Parameters

All tools accept an optional api_key parameter:

result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "api_key": "your_api_key_here"  # Override environment variable
})

Use Cases

Research & Analysis

  • Market research
  • Competitive intelligence
  • Academic research
  • News monitoring

Data Collection

  • Building datasets
  • Product information gathering
  • Price monitoring
  • Contact information extraction

AI Agents

  • Research assistants
  • Data extraction bots
  • Content analyzers
  • Web automation agents

Business Intelligence

  • Competitor tracking
  • Lead generation
  • Market analysis
  • Trend monitoring

Getting Started

  1. Install the package

    pip install langchain-olostep
    
  2. Get your API key

    • Sign up at olostep.com
    • Get your API key from the dashboard
  3. Set your API key

    export OLOSTEP_API_KEY="your_key_here"
    
  4. Try the examples: check out the examples directory in the repository


Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.


Why Olostep?

  • Reliable: Handle JavaScript rendering, anti-scraping measures, and dynamic content
  • Fast: Parallel processing for batch operations
  • Accurate: AI-powered extraction for precise data gathering
  • Flexible: Multiple formats, parsers, and configuration options
  • Scalable: From single URLs to 100,000+ URLs in batch

Changelog

0.2.0

  • Complete redesign focusing on Olostep's core features
  • Added scrape_with_answer for AI-powered Q&A
  • Added scrape_with_map for structured data extraction
  • Removed confusing document loader terminology
  • Improved tool descriptions and examples
  • Added comprehensive LangGraph example

0.1.0

  • Initial release
