Apache Airflow provider for Olostep web scraping API
Apache Airflow Provider for Olostep
Official Apache Airflow provider for Olostep - the API to search, extract, and structure web data at scale.
Features
- 🔗 Native Airflow Integration - First-class connection type with UI support
- 📄 Scrape Operator - Extract content from single URLs
- 📦 Batch Operator - Process multiple URLs efficiently
- 🕷️ Crawl Operator - Crawl entire websites
- 🗺️ Map Operator - Discover all URLs on a website
- ❓ Ask Operator - Get AI-powered answers about web pages
- ⏱️ Sensors - Wait for async jobs to complete
- 🎨 Templating Support - Use Jinja templates for dynamic values
Installation
pip install apache-airflow-provider-olostep
Quick Start
1. Create an Airflow Connection
Via the Airflow UI:
- Go to Admin > Connections
- Click + Add a new record
- Configure:
  - Connection Id: `olostep_default`
  - Connection Type: `Olostep`
  - Password: Your Olostep API key
Via CLI:
airflow connections add olostep_default \
    --conn-type olostep \
    --conn-password "your-api-key-here"
Via Environment Variable:
export AIRFLOW_CONN_OLOSTEP_DEFAULT='{"conn_type": "olostep", "password": "your-api-key"}'
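Hand-writing the JSON value is easy to get wrong once the key contains quotes or backslashes. The following is a minimal sketch, using only the standard library, that builds the same variable name and value as the export above. The helper name `olostep_conn_env` is hypothetical and not part of the provider.

```python
import json
import shlex


def olostep_conn_env(api_key: str, conn_id: str = "olostep_default") -> tuple[str, str]:
    """Return (variable name, JSON value) for an Airflow connection env var."""
    # Airflow looks up connections in variables named AIRFLOW_CONN_<CONN_ID>,
    # with the connection ID in upper case.
    name = f"AIRFLOW_CONN_{conn_id.upper()}"
    # json.dumps takes care of escaping quotes and backslashes in the key.
    value = json.dumps({"conn_type": "olostep", "password": api_key})
    return name, value


name, value = olostep_conn_env("your-api-key")
# shlex.quote makes the value safe to paste into a POSIX shell.
print(f"export {name}={shlex.quote(value)}")
```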
2. Get Your API Key
Sign up at olostep.com and get your API key from the dashboard.
3. Create Your First DAG
from datetime import datetime

from airflow import DAG
from airflow_provider_olostep.operators.scrape import OlostepScrapeOperator

with DAG(
    dag_id="olostep_quickstart",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    scrape = OlostepScrapeOperator(
        task_id="scrape_example",
        url="https://example.com",
        formats=["markdown", "text"],
    )
Available Components
Operators
| Operator | Description |
|---|---|
| `OlostepScrapeOperator` | Scrape a single URL |
| `OlostepBatchOperator` | Batch scrape multiple URLs |
| `OlostepCrawlOperator` | Crawl a website |
| `OlostepMapOperator` | Create a sitemap |
| `OlostepAskOperator` | Ask questions about a webpage |
Sensors
| Sensor | Description |
|---|---|
| `OlostepBatchSensor` | Wait for batch job completion |
| `OlostepCrawlSensor` | Wait for crawl job completion |
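Both sensors wait for an asynchronous job to finish. Airflow sensors do this by checking a condition at a fixed interval until it holds or a timeout expires. The sketch below illustrates that polling pattern with the standard library only; it is not the provider's implementation. The status strings (`"completed"`, `"failed"`) and the `fetch_status` callable are assumptions made for the example.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], str],
    poke_interval: float = 30.0,
    timeout: float = 3600.0,
) -> None:
    """Poll fetch_status until the job completes, fails, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status == "completed":  # assumed success status
            return
        if status == "failed":  # assumed failure status
            raise RuntimeError("job failed")
        # Give up if the next check would fall after the deadline.
        if time.monotonic() + poke_interval > deadline:
            raise TimeoutError("job did not complete in time")
        time.sleep(poke_interval)


# Simulated job that reports completion on the third check.
statuses = iter(["in_progress", "in_progress", "completed"])
calls = []


def fake_status() -> str:
    status = next(statuses)
    calls.append(status)
    return status


wait_for_job(fake_status, poke_interval=0.01, timeout=1.0)
print(f"job finished after {len(calls)} checks")
```

In a real DAG the sensor's `poke_interval`, `timeout`, and `mode` arguments control this behaviour, so a loop like this is only needed outside Airflow.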
Hook
| Hook | Description |
|---|---|
| `OlostepHook` | Low-level API access |
Examples
Scrape a Single Page
from airflow_provider_olostep.operators.scrape import OlostepScrapeOperator

scrape = OlostepScrapeOperator(
    task_id="scrape_page",
    url="https://news.ycombinator.com",
    formats=["markdown", "text", "links"],
    wait_for=2000,  # Wait 2 seconds for JS rendering
)
Batch Scrape Multiple URLs
from airflow_provider_olostep.operators.batch import OlostepBatchOperator

batch = OlostepBatchOperator(
    task_id="batch_scrape",
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ],
    formats=["markdown"],
    wait_for_completion=True,  # Block until all pages are scraped
)
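URL lists assembled from several sources often contain duplicates or malformed entries. A small, standard-library sketch for cleaning a list before passing it to `urls`; `clean_urls` is a hypothetical helper, and the choice to accept only `http` and `https` is an assumption rather than a documented requirement of the API.

```python
from urllib.parse import urlsplit


def clean_urls(urls: list[str]) -> list[str]:
    """Drop duplicates and non-HTTP(S) entries while preserving order."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for raw in urls:
        url = raw.strip()
        parts = urlsplit(url)
        # Keep only absolute http/https URLs with a host part.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned


urls = clean_urls([
    "https://example.com/page1",
    "https://example.com/page1",    # duplicate
    "ftp://example.com/file",       # unsupported scheme
    " https://example.com/page2 ",  # surrounding whitespace
    "not a url",
])
print(urls)
```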
Discover and Scrape Website
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow_provider_olostep.operators.map import OlostepMapOperator
from airflow_provider_olostep.operators.batch import OlostepBatchOperator
from airflow_provider_olostep.sensors.batch import OlostepBatchSensor

with DAG("discover_and_scrape", ...) as dag:
    # Step 1: Discover all product pages
    discover = OlostepMapOperator(
        task_id="discover_urls",
        url="https://shop.example.com",
        include_patterns=["/products/**"],
        max_urls=100,
    )

    # Step 2: Start batch scrape
    def start_batch(**context):
        from airflow_provider_olostep.hooks.olostep import OlostepHook

        urls = context["ti"].xcom_pull(task_ids="discover_urls", key="urls")
        hook = OlostepHook()
        result = hook.batch_scrape(urls=urls[:50], formats=["markdown"])
        return result.get("batch_id") or result.get("id")

    batch = PythonOperator(
        task_id="start_batch",
        python_callable=start_batch,
    )

    # Step 3: Wait for completion
    wait = OlostepBatchSensor(
        task_id="wait_for_batch",
        batch_id="{{ ti.xcom_pull(task_ids='start_batch') }}",
        poke_interval=30,
        timeout=3600,
        mode="reschedule",
    )

    discover >> batch >> wait
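The `start_batch` callable above submits only the first 50 discovered URLs (`urls[:50]`), so any remaining URLs are dropped. If every URL should be scraped, one option is to split the list into fixed-size chunks and start one batch per chunk. A minimal sketch of the chunking step; `chunked` is a hypothetical helper, and the size of 50 simply mirrors the slice in the example rather than a documented API limit.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items, each with at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


discovered = [f"https://shop.example.com/products/{i}" for i in range(120)]
batches = list(chunked(discovered, 50))
print([len(batch) for batch in batches])  # [50, 50, 20]
```

Each chunk could then be passed to `hook.batch_scrape` in turn, with the returned batch IDs collected for the sensor step.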
Ask Questions About Pages
from airflow_provider_olostep.operators.ask import OlostepAskOperator

ask = OlostepAskOperator(
    task_id="get_pricing",
    url="https://example.com/pricing",
    question="What is the price of the enterprise plan?",
)
Using Dynamic URLs with Templates
from airflow_provider_olostep.operators.scrape import OlostepScrapeOperator

scrape = OlostepScrapeOperator(
    task_id="scrape_dynamic",
    url="{{ var.value.target_url }}",  # From Airflow Variables
    formats="{{ dag_run.conf.get('formats', ['markdown']) }}",
)
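By default Airflow renders Jinja templates to strings, so a templated `formats` value arrives as text such as `"['markdown']"` unless the DAG is created with `render_template_as_native_obj=True`. Whether the operator converts that string itself is not stated here. The sketch below shows one defensive way to normalise such a value in custom code; `coerce_formats` is a hypothetical helper, not part of the provider.

```python
import ast


def coerce_formats(value) -> list[str]:
    """Normalise a formats value that may be a list or a rendered template string."""
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    text = str(value).strip()
    try:
        # Safely parse Python literals such as "['markdown', 'text']".
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the text as a single format name.
        return [text]
    if isinstance(parsed, (list, tuple)):
        return [str(item) for item in parsed]
    return [str(parsed)]


print(coerce_formats("['markdown', 'text']"))
print(coerce_formats("markdown"))
```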
Using the Hook Directly
For custom logic, use OlostepHook:
from airflow.operators.python import PythonOperator
from airflow_provider_olostep.hooks.olostep import OlostepHook

def custom_scraping(**context):
    hook = OlostepHook(olostep_conn_id="olostep_default")

    # Scrape with custom options
    result = hook.scrape(
        url="https://example.com",
        formats=["markdown", "screenshot"],
        wait_for=3000,
        country="US",
    )

    # Process results
    markdown = result.get("markdown", "")
    print(f"Scraped {len(markdown)} characters")
    return result

task = PythonOperator(
    task_id="custom_scrape",
    python_callable=custom_scraping,
)
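The callable above returns the whole result, which `PythonOperator` pushes to XCom. Because XCom is meant for small values, it can help to return a compact summary instead. The sketch below assumes, as the example does, that the result is a dict keyed by format name. `summarize_result` and the field names in the summary are illustrative only.

```python
def summarize_result(result: dict, preview_chars: int = 200) -> dict:
    """Reduce a scrape result to a few small, XCom-friendly fields."""
    markdown = result.get("markdown") or ""
    return {
        # Names of the keys that actually carry content.
        "formats_present": sorted(key for key, value in result.items() if value),
        "markdown_chars": len(markdown),
        "preview": markdown[:preview_chars],
    }


# Hand-written sample standing in for a real hook.scrape() result.
sample = {
    "markdown": "# Example Domain\n\nThis domain is for use in examples.",
    "screenshot": None,
}
summary = summarize_result(sample)
print(summary["formats_present"], summary["markdown_chars"])
```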
Configuration
Connection Options
| Field | Description |
|---|---|
| Password | Your Olostep API key (required) |
| Extra | JSON with optional settings |
Extra JSON options:
{
    "base_url": "https://api.olostep.com/v1",
    "api_key": "alternative-location-for-key"
}
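The Extra example suggests the API key can be supplied either in the Password field or under `api_key` in Extra. The precedence the provider applies is not documented here. The sketch below assumes Password wins and Extra is the fallback, and it treats the `base_url` shown above as the default; both are assumptions for illustration, and `resolve_settings` is a hypothetical helper.

```python
import json
from typing import Optional

# Assumed default, taken from the Extra example above.
DEFAULT_BASE_URL = "https://api.olostep.com/v1"


def resolve_settings(password: Optional[str], extra_json: Optional[str]) -> dict:
    """Combine the Password and Extra fields into effective client settings."""
    extra = json.loads(extra_json) if extra_json else {}
    # Assumed precedence: Password first, then Extra["api_key"].
    api_key = password or extra.get("api_key")
    if not api_key:
        raise ValueError("no API key found in Password or Extra")
    return {
        "api_key": api_key,
        "base_url": extra.get("base_url", DEFAULT_BASE_URL),
    }


print(resolve_settings(None, '{"api_key": "key-from-extra"}'))
```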
Operator Common Parameters
| Parameter | Description | Default |
|---|---|---|
| `olostep_conn_id` | Airflow connection ID | `olostep_default` |
| `formats` | Output formats (list) | `["markdown"]` |
Development
Local Setup
# Clone the repository
git clone https://github.com/olostep/airflow-provider-olostep.git
cd airflow-provider-olostep
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
Running Airflow Locally
See the examples/local-airflow directory for a Docker Compose setup.
Resources
Support
- 📧 Email: support@olostep.com
- 🐛 Issues: GitHub Issues
- 📖 Docs: docs.olostep.com
License
Apache License 2.0 - see LICENSE for details.