LangChain Olostep Integration

The most reliable and cost-effective web search, scraping and crawling API for AI.

Build intelligent agents that can search, scrape, analyze, and structure data from any website. Perfect for LangChain and LangGraph applications.

Features

Web Search

Search the web with natural language and return AI-powered answers and data in the JSON shape you want. Ground your products on real-world data and sources.

Web Scraping

Extract content from any website with JavaScript rendering support. Handles anti-scraping measures and dynamic content automatically.

Web Crawling

Crawl entire websites with customizable depth and filters. Perfect for building comprehensive datasets.

Core Capabilities

  • Batch Processing: Scrape up to 100,000 URLs in parallel
  • AI-Powered Q&A: Ask questions about websites and get intelligent answers
  • Data Extraction: Extract specific fields using AI-powered mapping
  • Multiple Formats: Support for Markdown, HTML, JSON, and plain text
  • Specialized Parsers: Use custom parsers for specific websites (e.g., Amazon, LinkedIn)
  • Location-Specific: Scrape with country-specific settings
  • LangGraph Ready: Perfect for building complex AI agent workflows
  • Cost-Effective: Pay only for what you use with competitive pricing

Installation

pip install langchain-olostep

Setup

Set your Olostep API key:

export OLOSTEP_API_KEY="your_olostep_api_key_here"

Get your API key from https://olostep.com/dashboard
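
Alternatively, the key can be set from Python before any tool is called. This is a minimal sketch using the same environment variable named above; it affects the current process only:

```python
import os

# Set the Olostep API key for this process.
# Replace the placeholder with your real key from the dashboard.
os.environ["OLOSTEP_API_KEY"] = "your_olostep_api_key_here"

print(os.environ["OLOSTEP_API_KEY"])
```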

Quick Start

Basic Web Scraping

from langchain_olostep import scrape_website
import asyncio

# Scrape a website
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
}))

print(content)

With LangChain Agent

from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langchain_olostep import scrape_website, scrape_with_answer

# Create agent with Olostep tools
tools = [scrape_website, scrape_with_answer]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
result = agent.run("""
Scrape https://example.com and tell me:
1. What is the main content about?
2. Extract any contact information
""")

print(result)

With LangGraph

from langgraph.graph import StateGraph
from langchain_olostep import scrape_website, scrape_batch
from langchain_openai import ChatOpenAI

# Build a research agent workflow
workflow = StateGraph(dict)

def scrape_node(state):
    urls = state["urls"]
    result = scrape_batch.invoke({"urls": urls})
    return {"scraped_data": result}

workflow.add_node("scrape", scrape_node)
# ... add more nodes

Available Tools

1. scrape_website

Scrape content from any website.

from langchain_olostep import scrape_website

result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown",  # markdown, html, json, or text
    "country": "US",  # Optional: country code for location-specific content
    "wait_before_scraping": 2000,  # Optional: wait time in ms for JS rendering
    "parser": "@olostep/amazon-product"  # Optional: specialized parser
})

Perfect for:

  • Extracting article content
  • Scraping dynamic websites
  • Bypassing anti-scraping measures
  • Getting clean, formatted content
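
Remote scrapes can still fail transiently (timeouts, rate limits). A small retry wrapper with exponential backoff is one way to harden calls; this is a sketch using a stand-in function rather than the real tool, and the backoff values are illustrative:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# Stand-in for a flaky scrape call: fails twice, then succeeds.
calls = {"n": 0}
def flaky_scrape():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient failure")
    return "# Example Domain"

print(with_retries(flaky_scrape))  # → # Example Domain
```

In real use, `fn` would be a closure over `scrape_website.invoke({...})`.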

2. scrape_batch

Scrape multiple URLs in parallel.

from langchain_olostep import scrape_batch

urls = [
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
]

result = await scrape_batch.ainvoke({
    "urls": urls,
    "format": "markdown"
})

Perfect for:

  • Competitive analysis
  • Large-scale data collection
  • Building datasets
  • Monitoring multiple sources
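
The batch tool takes a list of URLs; for very large jobs (the service advertises up to 100,000 URLs) it can be convenient to submit them in fixed-size chunks. A chunking helper is plain Python and independent of the API:

```python
def chunked(items, size):
    """Yield successive fixed-size chunks from a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

urls = [f"https://example.com/page/{i}" for i in range(10)]

for batch in chunked(urls, 4):
    # Each batch would be passed to scrape_batch, e.g.
    # await scrape_batch.ainvoke({"urls": batch, "format": "markdown"})
    print(len(batch))  # → 4, 4, 2
```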

3. scrape_with_answer

Ask questions about website content and get AI-powered answers.

from langchain_olostep import scrape_with_answer

result = await scrape_with_answer.ainvoke({
    "url": "https://company.com",
    "question": "What is the company's main product and its pricing?"
})

Perfect for:

  • Research and information extraction
  • Competitive intelligence
  • Lead generation
  • Content analysis

4. scrape_with_map

Extract specific fields using AI-powered mapping.

from langchain_olostep import scrape_with_map

result = await scrape_with_map.ainvoke({
    "url": "https://store.com/product/123",
    "fields": ["product_name", "price", "rating", "description"]
})

Perfect for:

  • Structured data extraction
  • Product information gathering
  • Contact details extraction
  • E-commerce data collection

Examples

Example 1: Research Agent

from langchain_olostep import scrape_website, scrape_with_answer
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI

tools = [scrape_website, scrape_with_answer]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
)

# Research a topic
result = agent.run("""
Research the latest developments in AI by:
1. Scraping https://openai.com/blog
2. Extracting key announcements
3. Summarizing the findings
""")

Example 2: Competitive Analysis

from langchain_olostep import scrape_batch, scrape_with_map
import asyncio

# Scrape competitor websites
competitors = [
    "https://competitor1.com/pricing",
    "https://competitor2.com/pricing",
    "https://competitor3.com/pricing"
]

async def analyze_competitors():
    batch_result = await scrape_batch.ainvoke({"urls": competitors})

    # Extract pricing information from each page
    for url in competitors:
        pricing = await scrape_with_map.ainvoke({
            "url": url,
            "fields": ["pricing_tiers", "features", "prices"]
        })
        print(f"Competitor: {url}")
        print(f"Pricing: {pricing}")

asyncio.run(analyze_competitors())

Example 3: Content Monitoring

from langchain_olostep import scrape_website
import schedule
import time

def monitor_website():
    # schedule runs plain functions, so use the synchronous .invoke here
    content = scrape_website.invoke({
        "url": "https://important-site.com",
        "format": "markdown"
    })
    
    # Check for changes, send alerts, etc.
    # ... your logic here

# Run every hour
schedule.every().hour.do(monitor_website)

while True:
    schedule.run_pending()
    time.sleep(1)
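
The "check for changes" step in the sketch above can be implemented by hashing each snapshot and comparing it to the previous one. This is an illustration with plain strings, not part of the library:

```python
import hashlib

def fingerprint(content: str) -> str:
    """Stable digest of a page snapshot."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

previous = fingerprint("# Pricing\nPlan A: $10")
current = fingerprint("# Pricing\nPlan A: $12")

if current != previous:
    print("content changed")  # trigger your alert here
```

Persist the previous digest (file, database, cache) between runs so restarts do not re-trigger alerts.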

Example 4: LangGraph Research Workflow

See the complete example in the examples directory.

from langgraph.graph import StateGraph, END
from langchain_olostep import scrape_website, scrape_with_answer

# Define your research workflow
workflow = StateGraph(dict)

# Add nodes for different stages (these callables are defined in the full example)
workflow.add_node("plan", plan_research)
workflow.add_node("scrape", scrape_content)
workflow.add_node("analyze", analyze_data)
workflow.add_node("report", generate_report)

# Connect the nodes
workflow.set_entry_point("plan")
workflow.add_edge("plan", "scrape")
workflow.add_edge("scrape", "analyze")
workflow.add_edge("analyze", "report")
workflow.add_edge("report", END)

# Compile and run
agent = workflow.compile()
result = agent.invoke({"query": "Research AI developments"})

Advanced Features

JavaScript Rendering

Handle dynamic websites that load content via JavaScript:

result = await scrape_website.ainvoke({
    "url": "https://dynamic-site.com",
    "wait_before_scraping": 3000  # Wait 3 seconds
})

Location-Specific Scraping

Get content as it appears in different countries:

result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "country": "GB"  # Scrape as viewed from UK
})

Specialized Parsers

Use pre-built parsers for specific websites:

# Amazon product parser
product = await scrape_website.ainvoke({
    "url": "https://amazon.com/product/xyz",
    "parser": "@olostep/amazon-product"
})

# LinkedIn profile parser
profile = await scrape_website.ainvoke({
    "url": "https://linkedin.com/in/username",
    "parser": "@olostep/linkedin-profile"
})

Multiple Output Formats

Get content in different formats:

# Get markdown for readability
markdown = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
})

# Get JSON for structured data
json_data = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "json"
})

# Get HTML for full page structure
html = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "html"
})

Configuration

Environment Variables

  • OLOSTEP_API_KEY: Your Olostep API key (required)

Tool Parameters

All tools accept an optional api_key parameter:

result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "api_key": "your_api_key_here"  # Override environment variable
})
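
The precedence can be pictured as "explicit parameter first, environment variable second". This is a sketch of the likely lookup order, not the library's actual code:

```python
import os

def resolve_api_key(explicit_key=None):
    """Prefer an explicitly passed key; fall back to the environment."""
    if explicit_key:
        return explicit_key
    key = os.environ.get("OLOSTEP_API_KEY")
    if not key:
        raise RuntimeError("Set OLOSTEP_API_KEY or pass api_key explicitly.")
    return key

os.environ["OLOSTEP_API_KEY"] = "env_key"
print(resolve_api_key())            # → env_key
print(resolve_api_key("call_key"))  # → call_key
```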

Use Cases

Research & Analysis

  • Market research
  • Competitive intelligence
  • Academic research
  • News monitoring

Data Collection

  • Building datasets
  • Product information gathering
  • Price monitoring
  • Contact information extraction

AI Agents

  • Research assistants
  • Data extraction bots
  • Content analyzers
  • Web automation agents

Business Intelligence

  • Competitor tracking
  • Lead generation
  • Market analysis
  • Trend monitoring

Getting Started

  1. Install the package

    pip install langchain-olostep
    
  2. Get your API key

    • Sign up at olostep.com
    • Get your API key from the dashboard
  3. Set your API key

    export OLOSTEP_API_KEY="your_key_here"
    
  4. Try the examples: check out the examples in the repository

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.

Why Olostep?

  • Reliable: Handle JavaScript rendering, anti-scraping measures, and dynamic content
  • Fast: Parallel processing for batch operations
  • Accurate: AI-powered extraction for precise data gathering
  • Flexible: Multiple formats, parsers, and configuration options
  • Scalable: From single URLs to 100,000+ URLs in batch

Changelog

0.2.0

  • Complete redesign focusing on Olostep's core features
  • Added scrape_with_answer for AI-powered Q&A
  • Added scrape_with_map for structured data extraction
  • Removed confusing document loader terminology
  • Improved tool descriptions and examples
  • Added comprehensive LangGraph example

0.1.0

  • Initial release
