
LangChain Olostep Integration

The most reliable and cost-effective web search, scraping and crawling API for AI.

Build intelligent agents that can search, scrape, analyze, and structure data from any website. Perfect for LangChain and LangGraph applications.

Features

Web Search

Search the web with natural language queries and get AI-powered answers and data in the JSON shape you want. Ground your applications in real-world data and sources.

Web Scraping

Extract content from any website with JavaScript rendering support. Handles anti-scraping measures and dynamic content automatically.

Web Crawling

Crawl entire websites with customizable depth and filters. Perfect for building comprehensive datasets.

Core Capabilities

  • Batch Processing: Scrape up to 100,000 URLs in parallel
  • AI-Powered Q&A: Ask questions about websites and get intelligent answers
  • Data Extraction: Extract specific fields using AI-powered mapping
  • Multiple Formats: Support for Markdown, HTML, JSON, and plain text
  • Specialized Parsers: Use custom parsers for specific websites (e.g., Amazon, LinkedIn)
  • Location-Specific: Scrape with country-specific settings
  • LangGraph Ready: Perfect for building complex AI agent workflows
  • Cost-Effective: Pay only for what you use with competitive pricing

Installation

pip install langchain-olostep

Setup

Set your Olostep API key:

export OLOSTEP_API_KEY="your_olostep_api_key_here"

Get your API key from https://olostep.com/dashboard
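In notebooks or scripts where a shell export is inconvenient, the same variable can be set from Python before the tools are first used (the placeholder value is yours to substitute):

```python
import os

# Equivalent to the shell export above; set before invoking any Olostep tool.
os.environ["OLOSTEP_API_KEY"] = "your_olostep_api_key_here"

# Sanity check: the key is now visible to this process.
assert os.getenv("OLOSTEP_API_KEY")
```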

Quick Start

Basic Web Scraping

from langchain_olostep import scrape_website
import asyncio

# Scrape a website
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
}))

print(content)

With LangChain Agent

from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langchain_olostep import scrape_website, scrape_with_answer

# Create agent with Olostep tools
tools = [scrape_website, scrape_with_answer]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
result = agent.run("""
Scrape https://example.com and tell me:
1. What is the main content about?
2. Extract any contact information
""")

print(result)

With LangGraph

from langgraph.graph import StateGraph
from langchain_olostep import scrape_website, scrape_batch
from langchain_openai import ChatOpenAI

# Build a research agent workflow
workflow = StateGraph(dict)

def scrape_node(state):
    urls = state["urls"]
    result = scrape_batch.invoke({"urls": urls})
    return {"scraped_data": result}

workflow.add_node("scrape", scrape_node)
# ... add more nodes

Available Tools

1. scrape_website

Scrape content from any website.

from langchain_olostep import scrape_website

result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown",  # markdown, html, json, or text
    "country": "US",  # Optional: country code for location-specific content
    "wait_before_scraping": 2000,  # Optional: wait time in ms for JS rendering
    "parser": "@olostep/amazon-product"  # Optional: specialized parser
})

Perfect for:

  • Extracting article content
  • Scraping dynamic websites
  • Bypassing anti-scraping measures
  • Getting clean, formatted content

2. scrape_batch

Scrape multiple URLs in parallel.

from langchain_olostep import scrape_batch

urls = [
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
]

result = await scrape_batch.ainvoke({
    "urls": urls,
    "format": "markdown"
})

Perfect for:

  • Competitive analysis
  • Large-scale data collection
  • Building datasets
  • Monitoring multiple sources
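If a URL list grows past what you want to submit in a single call, it can be sliced into fixed-size batches first. This is a minimal sketch in plain Python; `chunked` is a hypothetical helper, not part of the package, and the per-call limit is whatever your plan allows (the documented ceiling is 100,000 URLs):

```python
def chunked(urls, batch_size):
    """Yield successive slices of at most batch_size URLs."""
    for i in range(0, len(urls), batch_size):
        yield urls[i:i + batch_size]

# Example: 250 URLs in batches of 100 -> slices of 100, 100, and 50
urls = [f"https://example.com/page/{n}" for n in range(250)]
batches = list(chunked(urls, 100))
```

Each slice can then be passed as the `urls` argument to `scrape_batch`.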

3. scrape_with_answer

Ask questions about website content and get AI-powered answers.

from langchain_olostep import scrape_with_answer

result = await scrape_with_answer.ainvoke({
    "url": "https://company.com",
    "question": "What is the company's main product and its pricing?"
})

Perfect for:

  • Research and information extraction
  • Competitive intelligence
  • Lead generation
  • Content analysis

4. scrape_with_map

Extract specific fields using AI-powered mapping.

from langchain_olostep import scrape_with_map

result = await scrape_with_map.ainvoke({
    "url": "https://store.com/product/123",
    "fields": ["product_name", "price", "rating", "description"]
})

Perfect for:

  • Structured data extraction
  • Product information gathering
  • Contact details extraction
  • E-commerce data collection
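Since AI-powered mapping can leave a field empty when a page does not contain it, it is worth validating the result before storing it. A minimal sketch, assuming the tool returns a dict keyed by the requested field names (check the actual return shape against the package docs); `missing_fields` is a hypothetical helper:

```python
REQUESTED_FIELDS = ["product_name", "price", "rating", "description"]

def missing_fields(result, fields):
    """Return the requested fields that are absent or empty in a mapped result."""
    return [f for f in fields if not result.get(f)]

# Stand-in result illustrating a partially mapped page
result = {"product_name": "Widget", "price": "$9.99", "rating": None, "description": ""}
missing_fields(result, REQUESTED_FIELDS)  # -> ["rating", "description"]
```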

Examples

Example 1: Research Agent

from langchain_olostep import scrape_website, scrape_with_answer
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI

tools = [scrape_website, scrape_with_answer]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
)

# Research a topic
result = agent.run("""
Research the latest developments in AI by:
1. Scraping https://openai.com/blog
2. Extracting key announcements
3. Summarizing the findings
""")

Example 2: Competitive Analysis

from langchain_olostep import scrape_batch, scrape_with_map
import asyncio

# Competitor pricing pages to compare
competitors = [
    "https://competitor1.com/pricing",
    "https://competitor2.com/pricing",
    "https://competitor3.com/pricing"
]

async def analyze_competitors():
    # Scrape all competitor websites in one parallel batch
    batch_result = await scrape_batch.ainvoke({"urls": competitors})

    # Extract structured pricing information from each page
    for url in competitors:
        pricing = await scrape_with_map.ainvoke({
            "url": url,
            "fields": ["pricing_tiers", "features", "prices"]
        })
        print(f"Competitor: {url}")
        print(f"Pricing: {pricing}")

asyncio.run(analyze_competitors())

Example 3: Content Monitoring

from langchain_olostep import scrape_website
import asyncio
import schedule
import time

def monitor_website():
    # schedule calls plain functions, so drive the async tool with asyncio.run
    content = asyncio.run(scrape_website.ainvoke({
        "url": "https://important-site.com",
        "format": "markdown"
    }))

    # Check for changes, send alerts, etc.
    # ... your logic here

# Run every hour
schedule.every().hour.do(monitor_website)

while True:
    schedule.run_pending()
    time.sleep(1)
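The "# Check for changes" placeholder above can be filled with a simple content-hash comparison. A minimal sketch using only the standard library, assuming the scraped content arrives as a string:

```python
import hashlib

_last_hash = None

def content_changed(content: str) -> bool:
    """Return True when the page content differs from the previous run."""
    global _last_hash
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    changed = _last_hash is not None and digest != _last_hash
    _last_hash = digest
    return changed
```

The first call only records a baseline and reports no change; subsequent calls compare against it.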

Example 4: LangGraph Research Workflow

See the complete example in the examples directory.

from langgraph.graph import StateGraph, END
from langchain_olostep import scrape_website, scrape_with_answer

# Define your research workflow
workflow = StateGraph(dict)

# Add nodes for different stages (the node functions are defined in the full example)
workflow.add_node("plan", plan_research)
workflow.add_node("scrape", scrape_content)
workflow.add_node("analyze", analyze_data)
workflow.add_node("report", generate_report)

# Connect the nodes
workflow.set_entry_point("plan")
workflow.add_edge("plan", "scrape")
workflow.add_edge("scrape", "analyze")
workflow.add_edge("analyze", "report")
workflow.add_edge("report", END)

# Compile and run
agent = workflow.compile()
result = agent.invoke({"query": "Research AI developments"})

Advanced Features

JavaScript Rendering

Handle dynamic websites that load content via JavaScript:

result = await scrape_website.ainvoke({
    "url": "https://dynamic-site.com",
    "wait_before_scraping": 3000  # Wait 3 seconds
})

Location-Specific Scraping

Get content as it appears in different countries:

result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "country": "GB"  # Scrape as viewed from UK
})

Specialized Parsers

Use pre-built parsers for specific websites:

# Amazon product parser
product = await scrape_website.ainvoke({
    "url": "https://amazon.com/product/xyz",
    "parser": "@olostep/amazon-product"
})

# LinkedIn profile parser
profile = await scrape_website.ainvoke({
    "url": "https://linkedin.com/in/username",
    "parser": "@olostep/linkedin-profile"
})

Multiple Output Formats

Get content in different formats:

# Get markdown for readability
markdown = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
})

# Get JSON for structured data
json_data = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "json"
})

# Get HTML for full page structure
html = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "html"
})

Configuration

Environment Variables

  • OLOSTEP_API_KEY: Your Olostep API key (required)

Tool Parameters

All tools accept an optional api_key parameter:

result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "api_key": "your_api_key_here"  # Override environment variable
})

Use Cases

Research & Analysis

  • Market research
  • Competitive intelligence
  • Academic research
  • News monitoring

Data Collection

  • Building datasets
  • Product information gathering
  • Price monitoring
  • Contact information extraction

AI Agents

  • Research assistants
  • Data extraction bots
  • Content analyzers
  • Web automation agents

Business Intelligence

  • Competitor tracking
  • Lead generation
  • Market analysis
  • Trend monitoring

Getting Started

  1. Install the package

    pip install langchain-olostep
    
  2. Get your API key

    • Sign up at olostep.com
    • Get your API key from the dashboard
  3. Set your API key

    export OLOSTEP_API_KEY="your_key_here"
    
  4. Try the examples: explore the examples in the repository

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.

Why Olostep?

  • Reliable: Handle JavaScript rendering, anti-scraping measures, and dynamic content
  • Fast: Parallel processing for batch operations
  • Accurate: AI-powered extraction for precise data gathering
  • Flexible: Multiple formats, parsers, and configuration options
  • Scalable: From single URLs to 100,000+ URLs in batch

Changelog

0.2.0

  • Complete redesign focusing on Olostep's core features
  • Added scrape_with_answer for AI-powered Q&A
  • Added scrape_with_map for structured data extraction
  • Removed confusing document loader terminology
  • Improved tool descriptions and examples
  • Added comprehensive LangGraph example

0.1.0

  • Initial release
