Apache Airflow provider for Olostep web scraping API

Apache Airflow Provider for Olostep

Official Apache Airflow provider for Olostep - the API to search, extract, and structure web data at scale.

Features

  • 🔗 Native Airflow Integration - First-class connection type with UI support
  • 📄 Scrape Operator - Extract content from single URLs
  • 📦 Batch Operator - Process multiple URLs efficiently
  • 🕷️ Crawl Operator - Crawl entire websites
  • 🗺️ Map Operator - Discover all URLs on a website
  • Ask Operator - Get AI-powered answers about web pages
  • ⏱️ Sensors - Wait for async jobs to complete
  • 🎨 Templating Support - Use Jinja templates for dynamic values

Installation

pip install apache-airflow-provider-olostep
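
To confirm that the package landed in the active environment, a check along these lines may help. It is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of the provider.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("apache-airflow-provider-olostep")
    print(found or "apache-airflow-provider-olostep is not installed")
```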

Quick Start

1. Create an Airflow Connection

Via the Airflow UI:

  1. Go to Admin > Connections
  2. Click + Add a new record
  3. Configure:
    • Connection Id: olostep_default
    • Connection Type: Olostep
    • Password: Your Olostep API key

Via CLI:

airflow connections add olostep_default \
    --conn-type olostep \
    --conn-password "your-api-key-here"

Via Environment Variable:

export AIRFLOW_CONN_OLOSTEP_DEFAULT='{"conn_type": "olostep", "password": "your-api-key"}'
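
The JSON value can also be generated rather than typed by hand, which avoids quoting mistakes in the shell. The sketch below mirrors the `export` line above using only the standard library; the helper name `olostep_conn_env` is hypothetical.

```python
import json
import shlex


def olostep_conn_env(api_key: str, conn_id: str = "olostep_default") -> str:
    """Build an `export` line for an Olostep connection in Airflow's JSON format."""
    # Airflow reads connections from AIRFLOW_CONN_<CONN_ID in upper case>.
    env_name = f"AIRFLOW_CONN_{conn_id.upper()}"
    payload = json.dumps({"conn_type": "olostep", "password": api_key})
    # shlex.quote makes the JSON safe to paste into a POSIX shell.
    return f"export {env_name}={shlex.quote(payload)}"


if __name__ == "__main__":
    print(olostep_conn_env("your-api-key"))
```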

2. Get Your API Key

Sign up at olostep.com and get your API key from the dashboard.

3. Create Your First DAG

from datetime import datetime
from airflow import DAG
from airflow_provider_olostep.operators.scrape import OlostepScrapeOperator

with DAG(
    dag_id="olostep_quickstart",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # On Airflow 2.4+ use schedule="@daily"; schedule_interval was removed in Airflow 3
    catchup=False,
) as dag:
    
    scrape = OlostepScrapeOperator(
        task_id="scrape_example",
        url="https://example.com",
        formats=["markdown", "text"],
    )

Available Components

Operators

| Operator | Description |
| --- | --- |
| OlostepScrapeOperator | Scrape a single URL |
| OlostepBatchOperator | Batch scrape multiple URLs |
| OlostepCrawlOperator | Crawl a website |
| OlostepMapOperator | Create a sitemap |
| OlostepAskOperator | Ask questions about a webpage |

Sensors

| Sensor | Description |
| --- | --- |
| OlostepBatchSensor | Wait for batch job completion |
| OlostepCrawlSensor | Wait for crawl job completion |

Hook

| Hook | Description |
| --- | --- |
| OlostepHook | Low-level API access |

Examples

Scrape a Single Page

from airflow_provider_olostep.operators.scrape import OlostepScrapeOperator

scrape = OlostepScrapeOperator(
    task_id="scrape_page",
    url="https://news.ycombinator.com",
    formats=["markdown", "text", "links"],
    wait_for=2000,  # Wait 2 seconds for JS rendering
)

Batch Scrape Multiple URLs

from airflow_provider_olostep.operators.batch import OlostepBatchOperator

batch = OlostepBatchOperator(
    task_id="batch_scrape",
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ],
    formats=["markdown"],
    wait_for_completion=True,  # Block until all pages are scraped
)
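
URL lists gathered from other tasks often contain blanks, duplicates, or non-web links. A small clean-up step before building the `urls` argument might look like the sketch below. The helper `clean_urls` is illustrative and not part of the provider, and whether the Olostep API deduplicates on its own is not stated here.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def clean_urls(urls: Iterable[str]) -> List[str]:
    """Drop blanks, non-HTTP(S) entries, and duplicates while keeping order."""
    seen = set()
    cleaned = []
    for raw in urls:
        url = raw.strip()
        parts = urlsplit(url)
        # Keep only absolute http/https URLs that name a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned
```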

Discover and Scrape a Website

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow_provider_olostep.operators.map import OlostepMapOperator
from airflow_provider_olostep.operators.batch import OlostepBatchOperator
from airflow_provider_olostep.sensors.batch import OlostepBatchSensor

with DAG("discover_and_scrape", ...) as dag:
    
    # Step 1: Discover all product pages
    discover = OlostepMapOperator(
        task_id="discover_urls",
        url="https://shop.example.com",
        include_patterns=["/products/**"],
        max_urls=100,
    )
    
    # Step 2: Start batch scrape
    def start_batch(**context):
        from airflow_provider_olostep.hooks.olostep import OlostepHook
        urls = context["ti"].xcom_pull(task_ids="discover_urls", key="urls")
        hook = OlostepHook()
        result = hook.batch_scrape(urls=urls[:50], formats=["markdown"])
        return result.get("batch_id") or result.get("id")
    
    batch = PythonOperator(
        task_id="start_batch",
        python_callable=start_batch,
    )
    
    # Step 3: Wait for completion
    wait = OlostepBatchSensor(
        task_id="wait_for_batch",
        batch_id="{{ ti.xcom_pull(task_ids='start_batch') }}",
        poke_interval=30,
        timeout=3600,
        mode="reschedule",
    )
    
    discover >> batch >> wait
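
The `urls[:50]` slice in `start_batch` silently drops any URLs past the first 50. If every discovered URL should be scraped, the list could be split into chunks and each chunk submitted as its own batch. The sketch below shows only the splitting; the helper name `chunked` is hypothetical, and the size of 50 mirrors the slice above rather than a documented API limit.

```python
from typing import Iterator, List, Sequence


def chunked(urls: Sequence[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


if __name__ == "__main__":
    discovered = [f"https://shop.example.com/products/{i}" for i in range(120)]
    print([len(chunk) for chunk in chunked(discovered)])  # [50, 50, 20]
```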

Ask Questions About Pages

from airflow_provider_olostep.operators.ask import OlostepAskOperator

ask = OlostepAskOperator(
    task_id="get_pricing",
    url="https://example.com/pricing",
    question="What is the price of the enterprise plan?",
)

Using Dynamic URLs with Templates

from airflow_provider_olostep.operators.scrape import OlostepScrapeOperator

scrape = OlostepScrapeOperator(
    task_id="scrape_dynamic",
    url="{{ var.value.target_url }}",  # From Airflow Variables
    formats="{{ dag_run.conf.get('formats', ['markdown']) }}",
)
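
Jinja renders template fields to strings by default, so the `formats` expression above reaches the operator as the text `['markdown']` unless the DAG is created with `render_template_as_native_obj=True`. Whether the operator converts that text back into a list is not stated here. If it does not, a conversion along these lines could be applied in custom code. The helper `coerce_formats` is hypothetical.

```python
import ast
from typing import List, Union


def coerce_formats(value: Union[str, List[str]]) -> List[str]:
    """Turn a rendered template value into a list of format names."""
    if isinstance(value, list):
        return value
    text = value.strip()
    if text.startswith("["):
        # e.g. "['markdown', 'text']", the string form of a Python list
        parsed = ast.literal_eval(text)
        return [str(item) for item in parsed]
    # Fall back to a comma-separated string such as "markdown,text"
    return [part.strip() for part in text.split(",") if part.strip()]
```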

Using the Hook Directly

For custom logic, use OlostepHook:

from airflow.operators.python import PythonOperator
from airflow_provider_olostep.hooks.olostep import OlostepHook

def custom_scraping(**context):
    hook = OlostepHook(olostep_conn_id="olostep_default")
    
    # Scrape with custom options
    result = hook.scrape(
        url="https://example.com",
        formats=["markdown", "screenshot"],
        wait_for=3000,
        country="US",
    )
    
    # Process results
    markdown = result.get("markdown", "")
    print(f"Scraped {len(markdown)} characters")
    
    return result

task = PythonOperator(
    task_id="custom_scrape",
    python_callable=custom_scraping,
)
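
Whatever a `PythonOperator` callable returns is pushed to XCom, which is intended for small values. Returning a full scrape result, especially one with a screenshot, can be heavy. One option is to return a compact summary instead. The sketch below assumes the result is a dict keyed by format name, as `result.get("markdown", "")` above suggests; the helper `summarize_result` is illustrative only.

```python
from typing import Any, Dict, Iterable


def summarize_result(result: Dict[str, Any], formats: Iterable[str]) -> Dict[str, int]:
    """Map each requested format to the length of its content (0 if missing)."""
    # `or ""` also covers keys that are present but set to None.
    return {name: len(result.get(name) or "") for name in formats}


if __name__ == "__main__":
    fake_result = {"markdown": "# Title\n\nBody text", "screenshot": None}
    print(summarize_result(fake_result, ["markdown", "screenshot"]))
```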

Configuration

Connection Options

| Field | Description |
| --- | --- |
| Password | Your Olostep API key (required) |
| Extra | JSON with optional settings |

Extra JSON options:

{
    "base_url": "https://api.olostep.com/v1",
    "api_key": "alternative-location-for-key"
}
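
As a rough illustration of how these two fields could combine, the sketch below resolves an API key and base URL from a connection's Password and Extra values. It is not the provider's actual implementation. The precedence (Password first, then `api_key` in Extra) and the fallback base URL are assumptions, and the names `resolve_settings` and `ASSUMED_BASE_URL` are hypothetical.

```python
import json
from typing import Optional, Tuple

# Taken from the Extra example above; not confirmed as the provider's default.
ASSUMED_BASE_URL = "https://api.olostep.com/v1"


def resolve_settings(password: Optional[str], extra: Optional[str]) -> Tuple[str, str]:
    """Return (api_key, base_url) from a connection's Password and Extra fields."""
    options = json.loads(extra) if extra else {}
    # Assumption: Password wins; Extra's "api_key" is only a fallback.
    api_key = password or options.get("api_key")
    if not api_key:
        raise ValueError("No API key found in Password or Extra")
    return api_key, options.get("base_url", ASSUMED_BASE_URL)
```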

Operator Common Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| olostep_conn_id | Airflow connection ID | olostep_default |
| formats | Output formats (list) | ["markdown"] |

Development

Local Setup

# Clone the repository
git clone https://github.com/olostep/airflow-provider-olostep.git
cd airflow-provider-olostep

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

Running Airflow Locally

See the examples/local-airflow directory for a Docker Compose setup.

License

Apache License 2.0 - see LICENSE for details.
