Apache Airflow provider for the Olostep web scraping API

Apache Airflow Provider for Olostep

Official Apache Airflow provider for Olostep - the API to search, extract, and structure web data at scale.

Features

  • 🔗 Native Airflow Integration - First-class connection type with UI support
  • 📄 Scrape Operator - Extract content from single URLs
  • 📦 Batch Operator - Process multiple URLs efficiently
  • 🕷️ Crawl Operator - Crawl entire websites
  • 🗺️ Map Operator - Discover all URLs on a website
  • Ask Operator - Get AI-powered answers about web pages
  • ⏱️ Sensors - Wait for async jobs to complete
  • 🎨 Templating Support - Use Jinja templates for dynamic values

Installation

pip install apache-airflow-provider-olostep

Quick Start

1. Create an Airflow Connection

Via the Airflow UI:

  1. Go to Admin > Connections
  2. Click + Add a new record
  3. Configure:
    • Connection Id: olostep_default
    • Connection Type: Olostep
    • Password: Your Olostep API key

Via CLI:

airflow connections add olostep_default \
    --conn-type olostep \
    --conn-password "your-api-key-here"

Via Environment Variable:

export AIRFLOW_CONN_OLOSTEP_DEFAULT='{"conn_type": "olostep", "password": "your-api-key"}'
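
The environment variable value is a JSON document, so hand-written quoting is easy to get wrong. A minimal sketch that builds the same name and value with json.dumps; build_conn_env is a hypothetical helper for illustration, not part of this provider:

```python
import json
import os


def build_conn_env(api_key: str, conn_id: str = "olostep_default") -> tuple[str, str]:
    """Return the (variable name, JSON value) pair for an Airflow connection.

    Hypothetical helper: it mirrors the export line shown above.
    """
    name = f"AIRFLOW_CONN_{conn_id.upper()}"
    value = json.dumps({"conn_type": "olostep", "password": api_key})
    return name, value


name, value = build_conn_env("your-api-key")
os.environ[name] = value  # set this before Airflow looks up the connection
print(name)
```

Serializing with json.dumps also escapes any quotes or backslashes in the key.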

2. Get Your API Key

Sign up at olostep.com and get your API key from the dashboard.

3. Create Your First DAG

from datetime import datetime
from airflow import DAG
from airflow_provider_olostep.operators.scrape import OlostepScrapeOperator

with DAG(
    dag_id="olostep_quickstart",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    
    scrape = OlostepScrapeOperator(
        task_id="scrape_example",
        url="https://example.com",
        formats=["markdown", "text"],
    )

Available Components

Operators

| Operator | Description |
| --- | --- |
| OlostepScrapeOperator | Scrape a single URL |
| OlostepBatchOperator | Batch scrape multiple URLs |
| OlostepCrawlOperator | Crawl a website |
| OlostepMapOperator | Create a sitemap |
| OlostepAskOperator | Ask questions about a webpage |

Sensors

| Sensor | Description |
| --- | --- |
| OlostepBatchSensor | Wait for batch job completion |
| OlostepCrawlSensor | Wait for crawl job completion |

Hook

| Hook | Description |
| --- | --- |
| OlostepHook | Low-level API access |

Examples

Scrape a Single Page

from airflow_provider_olostep.operators.scrape import OlostepScrapeOperator

scrape = OlostepScrapeOperator(
    task_id="scrape_page",
    url="https://news.ycombinator.com",
    formats=["markdown", "text", "links"],
    wait_for=2000,  # Wait 2 seconds for JS rendering
)

Batch Scrape Multiple URLs

from airflow_provider_olostep.operators.batch import OlostepBatchOperator

batch = OlostepBatchOperator(
    task_id="batch_scrape",
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ],
    formats=["markdown"],
    wait_for_completion=True,  # Block until all pages are scraped
)

Discover and Scrape a Website

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow_provider_olostep.operators.map import OlostepMapOperator
from airflow_provider_olostep.operators.batch import OlostepBatchOperator
from airflow_provider_olostep.sensors.batch import OlostepBatchSensor

with DAG("discover_and_scrape", ...) as dag:
    
    # Step 1: Discover all product pages
    discover = OlostepMapOperator(
        task_id="discover_urls",
        url="https://shop.example.com",
        include_patterns=["/products/**"],
        max_urls=100,
    )
    
    # Step 2: Start batch scrape
    def start_batch(**context):
        from airflow_provider_olostep.hooks.olostep import OlostepHook
        urls = context["ti"].xcom_pull(task_ids="discover_urls", key="urls")
        hook = OlostepHook()
        result = hook.batch_scrape(urls=urls[:50], formats=["markdown"])
        return result.get("batch_id") or result.get("id")
    
    batch = PythonOperator(
        task_id="start_batch",
        python_callable=start_batch,
    )
    
    # Step 3: Wait for completion
    wait = OlostepBatchSensor(
        task_id="wait_for_batch",
        batch_id="{{ ti.xcom_pull(task_ids='start_batch') }}",
        poke_interval=30,
        timeout=3600,
        mode="reschedule",
    )
    
    discover >> batch >> wait
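
The start_batch callable above submits only the first 50 discovered URLs, and it returns None when the response contains neither key, in which case the sensor's templated batch_id would render as the string "None". A minimal sketch of two helpers that make both cases explicit. The names chunked and extract_batch_id are hypothetical, and the chunk size of 50 simply mirrors the slice in the example rather than a documented API limit:

```python
from typing import Iterator, Sequence


def chunked(urls: Sequence[str], size: int = 50) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


def extract_batch_id(result: dict) -> str:
    """Return the batch ID from a response dict, or fail loudly.

    Assumes the ID is under "batch_id" or "id", as in the example above.
    """
    batch_id = result.get("batch_id") or result.get("id")
    if not batch_id:
        raise ValueError(f"No batch ID in response; keys were {sorted(result)}")
    return str(batch_id)
```

Inside start_batch, each chunk could be passed to hook.batch_scrape in turn, and extract_batch_id would fail the task immediately instead of leaving the sensor polling an invalid ID.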

Ask Questions About Pages

from airflow_provider_olostep.operators.ask import OlostepAskOperator

ask = OlostepAskOperator(
    task_id="get_pricing",
    url="https://example.com/pricing",
    question="What is the price of the enterprise plan?",
)

Using Dynamic URLs with Templates

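from airflow_provider_olostep.operators.scrape import OlostepScrapeOperator

scrape = OlostepScrapeOperator(
    task_id="scrape_dynamic",
    url="{{ var.value.target_url }}",  # From Airflow Variables
    # Jinja renders to a string by default; set render_template_as_native_obj=True
    # on the DAG if this field should arrive as a real list.
    formats="{{ dag_run.conf.get('formats', ['markdown']) }}",
)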
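
By default Airflow renders templated fields to strings, so a templated formats value arrives as text such as "['markdown']" rather than as a list. Whether the operator converts that text itself is not stated here, so the following is only a sketch of a hypothetical helper, coerce_formats, that custom code could use to accept either form:

```python
import ast


def coerce_formats(value) -> list[str]:
    """Accept a list, a rendered list literal, or a comma-separated string."""
    if isinstance(value, str):
        text = value.strip()
        if text.startswith("["):
            # A rendered Python list literal, e.g. "['markdown', 'text']".
            value = ast.literal_eval(text)
        else:
            # A plain string such as "markdown" or "markdown,text".
            value = [part.strip() for part in text.split(",") if part.strip()]
    return [str(item) for item in value]


print(coerce_formats("['markdown', 'text']"))
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for values that come from dag_run.conf.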

Using the Hook Directly

For custom logic, use OlostepHook:

from airflow.operators.python import PythonOperator
from airflow_provider_olostep.hooks.olostep import OlostepHook

def custom_scraping(**context):
    hook = OlostepHook(olostep_conn_id="olostep_default")
    
    # Scrape with custom options
    result = hook.scrape(
        url="https://example.com",
        formats=["markdown", "screenshot"],
        wait_for=3000,
        country="US",
    )
    
    # Process results
    markdown = result.get("markdown", "")
    print(f"Scraped {len(markdown)} characters")
    
    return result

task = PythonOperator(
    task_id="custom_scrape",
    python_callable=custom_scraping,
)
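
The callable above returns the full result, and Airflow pushes a PythonOperator's return value to XCom, which is intended for small values. A minimal sketch of returning a compact summary instead; to_xcom_summary is a hypothetical helper, and the result shape (a dict with a "markdown" key) is assumed from the example above:

```python
def to_xcom_summary(result: dict, preview_chars: int = 200) -> dict:
    """Reduce a scrape result to a few small fields suitable for XCom."""
    markdown = result.get("markdown") or ""
    return {
        # Keys whose values are non-empty in the response.
        "non_empty_keys": sorted(key for key, val in result.items() if val),
        "markdown_chars": len(markdown),
        "markdown_preview": markdown[:preview_chars],
    }


summary = to_xcom_summary(
    {"markdown": "# Title\nbody", "screenshot": "", "url": "https://example.com"}
)
print(summary["markdown_chars"])
```

Large payloads such as screenshots could be written to object storage inside the callable, with only their location returned.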

Configuration

Connection Options

| Field | Description |
| --- | --- |
| Password | Your Olostep API key (required) |
| Extra | JSON with optional settings |

Extra JSON options:

{
    "base_url": "https://api.olostep.com/v1",
    "api_key": "alternative-location-for-key"
}
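
How the two key locations interact is not spelled out above. As a sketch only, the following shows one plausible resolution order. It rests on two assumptions that are not confirmed by this document: that Password takes precedence over the api_key extra, and that the base_url shown above is the default. resolve_settings is a hypothetical function, not the hook's actual code:

```python
import json
from typing import Optional

# Assumption: the URL shown in the Extra example is also the default.
DEFAULT_BASE_URL = "https://api.olostep.com/v1"


def resolve_settings(password: Optional[str], extra_json: Optional[str]) -> dict:
    """Merge the Password field and Extra JSON into effective settings."""
    extra = json.loads(extra_json) if extra_json else {}
    # Assumption: Password wins; "api_key" in Extra is only a fallback.
    api_key = password or extra.get("api_key")
    if not api_key:
        raise ValueError("No API key found in Password or Extra")
    base_url = extra.get("base_url", DEFAULT_BASE_URL).rstrip("/")
    return {"api_key": api_key, "base_url": base_url}


print(resolve_settings("your-api-key", None)["base_url"])
```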

Operator Common Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| olostep_conn_id | Airflow connection ID | olostep_default |
| formats | Output formats (list) | ["markdown"] |

Development

Local Setup

# Clone the repository
git clone https://github.com/olostep/airflow-provider-olostep.git
cd airflow-provider-olostep

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

Running Airflow Locally

See the examples/local-airflow directory for a Docker Compose setup.

License

Apache License 2.0 - see LICENSE for details.
