aioscraper
High-performance asynchronous Python framework for large-scale API data collection.
Beta notice: APIs and behavior may change; expect sharp edges while things settle.
Table of Contents
- What is aioscraper?
- Key Features
- Installation
- Quick Start
- Examples
- Why aioscraper?
- Use Cases
- Performance
- Documentation
- Changelog
- Contributing
- License
What is aioscraper?
aioscraper is an async Python framework designed for mass data collection from APIs and external services at scale.
Built for:
- Fetching data from hundreds/thousands of REST API endpoints concurrently
- Integrating multiple external services (payment gateways, analytics APIs, etc.)
- Building data aggregation pipelines from heterogeneous API sources
- Queue-based scraping workers consuming tasks from Redis/RabbitMQ
- Microservice fan-out requests with automatic rate limiting and retries
NOT built for:
- Parsing HTML/CSS (but nothing stops you from using BeautifulSoup if you want - see examples/quotes.py)
- Single API requests (use httpx or aiohttp directly)
- GraphQL or WebSocket scraping (different paradigm)
Think: "I need to fetch data from 10,000 product API endpoints" or "I need to poll 50 microservices every minute" → aioscraper is for you.
Key Features
- Async-first core with pluggable HTTP backends (aiohttp/httpx) and aiojobs scheduling
- Declarative flow: requests → callbacks → pipelines, with middleware hooks at each stage
- Priority queueing plus configurable concurrency limits per group
- Adaptive rate limiting (EWMA + AIMD) that automatically backs off on server overload
- Small, explicit API that is easy to test and compose with existing async applications
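The EWMA + AIMD combination can be illustrated with a small sketch. This shows the general pattern only, not aioscraper's actual implementation; the class name, constants, and method below are all hypothetical.

```python
class AdaptiveLimiter:
    """Illustrative EWMA + AIMD rate limiter (not aioscraper's internals).

    EWMA smooths the observed error rate; AIMD increases the request
    rate additively on success and halves it when errors spike.
    """

    def __init__(self, rate=10.0, alpha=0.2, threshold=0.1):
        self.rate = rate            # requests per second
        self.alpha = alpha          # EWMA smoothing factor
        self.threshold = threshold  # smoothed error rate that triggers backoff
        self.error_ewma = 0.0

    def record(self, failed: bool) -> float:
        # Fold the latest observation into the smoothed error rate.
        sample = 1.0 if failed else 0.0
        self.error_ewma = self.alpha * sample + (1 - self.alpha) * self.error_ewma
        if self.error_ewma > self.threshold:
            self.rate = max(1.0, self.rate / 2)  # multiplicative decrease
        else:
            self.rate += 0.5                     # additive increase
        return self.rate
```

The effect is that a burst of 429/503 responses throttles quickly, while sustained success recovers throughput gradually.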
Installation
Choose your HTTP backend:
# Option 1: Use aiohttp (recommended for most cases)
pip install "aioscraper[aiohttp]"
# Option 2: Use httpx (if you prefer httpx ecosystem)
pip install "aioscraper[httpx]"
# Option 3: Install both backends for flexibility
pip install "aioscraper[aiohttp,httpx]"
Quick Start
Create scraper.py:
import logging
from dataclasses import dataclass

from aioscraper import AIOScraper, Request, Response, SendRequest, Pipeline

logger = logging.getLogger("github_repos")

scraper = AIOScraper()


@dataclass(slots=True)
class RepoStats:
    """Data model for extracted repository stats."""

    name: str
    stars: int
    language: str


# this decorator registers this pipeline to handle RepoStats items
@scraper.pipeline(RepoStats)
class StatsPipeline:
    """Pipeline for processing extracted repository data."""

    def __init__(self):
        self.total_stars = 0

    async def put_item(self, item: RepoStats) -> RepoStats:
        """
        Called for each extracted item.

        This is where you'd:
        - Save to database
        - Send to message queue
        - Perform validation/transformation
        - Aggregate statistics
        """
        self.total_stars += item.stars
        logger.info("✓ %s: ⭐ %s (%s)", item.name, item.stars, item.language)
        return item

    async def close(self):
        """
        Called when the scraper shuts down.

        Use for:
        - Final aggregations
        - Closing database connections
        - Cleanup operations
        """
        logger.info("Total stars collected: %s", self.total_stars)


# this decorator marks this as the scraper's entry point
@scraper
async def get_repos(send_request: SendRequest):
    """
    Entry point: defines what to scrape.

    Receives send_request - a function to schedule HTTP requests.
    """
    repos = (
        "django/django",
        "fastapi/fastapi",
        "pallets/flask",
        "encode/httpx",
        "aio-libs/aiohttp",
    )
    for repo in repos:
        await send_request(
            Request(
                url=f"https://api.github.com/repos/{repo}",  # API endpoint
                callback=parse_repo,  # Success handler
                errback=on_failure,  # Error handler (network failures, timeouts)
                cb_kwargs={"repo": repo},  # Additional arguments passed to callbacks
                headers={"Accept": "application/vnd.github+json"},  # Required by GitHub API
            )
        )


async def parse_repo(response: Response, pipeline: Pipeline):
    """
    Success callback: parse the response and extract data.

    The `pipeline` dependency is automatically injected by aioscraper.
    """
    data = await response.json()  # Parse JSON response from API
    await pipeline(  # Send extracted item to pipeline
        RepoStats(
            name=data["full_name"],
            stars=data["stargazers_count"],
            language=data.get("language") or "Unknown",  # API returns null for some repos
        )
    )


async def on_failure(exc: Exception, repo: str):
    """
    Error callback: handle request/processing failures.

    Use for:
    - Logging errors
    - Sending alerts
    - Custom retry logic
    """
    logger.error("%s: request failed: %s", repo, exc)
Run it:
aioscraper scraper
What's happening?
- @scraper registers your entry point
- @scraper.pipeline registers a pipeline for processing extracted data
- send_request() schedules multiple API requests concurrently with automatic queuing
- callback=parse_repo processes successful responses, errback=on_failure handles errors
Recommendation:
Configure retries, rate limiting, concurrency via environment variables for production use.
Examples
See the examples/ directory for fully commented code demonstrating these patterns.
Why aioscraper?
vs Scrapy:
- Scrapy is built for HTML scraping with CSS/XPath selectors and website crawling
- aioscraper is optimized for API data collection (JSON, REST, microservices)
- Native asyncio (no Twisted), modern type hints, minimal footprint
- Easily embeds into existing async applications
vs httpx/aiohttp directly:
- Manual approach: you handle rate limiting, retries, queuing, concurrency, backpressure
- aioscraper: adaptive rate limits, priority queues, pipelines, middleware out of the box
- Declarative Request → callback → pipeline instead of imperative control flow
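For comparison, the manual approach looks roughly like the following sketch, where the `fetch` coroutine and all parameters are hypothetical stand-ins. Every piece of plumbing here is something you would otherwise write and maintain yourself:

```python
import asyncio

async def fetch_all(fetch, urls, concurrency=20, retries=3):
    """Hand-rolled request plumbing: the semaphore, retry loop, and
    backoff below are what a framework's queue, rate limiter, and
    middleware would otherwise handle for you."""
    sem = asyncio.Semaphore(concurrency)            # manual concurrency limit

    async def one(url):
        async with sem:                             # manual backpressure
            for attempt in range(retries):          # manual retry loop
                try:
                    return await fetch(url)
                except Exception:
                    await asyncio.sleep(0.1 * attempt)  # manual backoff
            return None                             # manual error handling

    return await asyncio.gather(*(one(u) for u in urls))
```

Each additional concern (prioritization, per-host limits, graceful shutdown) grows this imperative core further, which is the boilerplate the declarative style avoids.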
vs building custom async workers:
- Less boilerplate: focus on business logic, not infrastructure
- Production-ready components: EWMA+AIMD rate limiting, graceful shutdown, dependency injection
- Testable: explicit dependencies, no global state, easy mocking
When to use aioscraper:
- Collecting data from 100+ API endpoints
- Fan-out calls to microservices for data enrichment
- Queue consumers processing API scraping tasks
- API aggregation/monitoring pipelines
- High-throughput data collection jobs
Use Cases
1. E-commerce price monitoring
Poll 10,000 product API endpoints across multiple marketplaces:
- Adaptive rate limiting prevents bans
- Priority queue for trending products
- Pipeline aggregates prices → saves to DB → sends alerts on changes
2. Cryptocurrency data aggregation
Collect real-time prices from 20+ exchange APIs:
- Concurrent requests with per-exchange rate limits
- Built-in retry for transient failures
- Pipeline normalizes data formats → writes to time-series DB
3. Microservice data hydration
Your FastAPI app needs data from 50 internal services:
- Embed aioscraper in your async application
- Fan-out concurrent requests with backpressure control
- Middleware for auth, logging, circuit breaking
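The fan-out pattern itself can be sketched in plain asyncio, assuming hypothetical per-field fetcher coroutines; a framework layers queuing, rate limiting, and middleware on top of this basic shape:

```python
import asyncio

async def hydrate(fetchers, timeout=2.0):
    """Fan out to many services concurrently; tolerate partial failure.

    `fetchers` maps a field name to a coroutine function returning its data.
    """
    async def call(name, fn):
        try:
            # Cap each service call with its own timeout.
            return name, await asyncio.wait_for(fn(), timeout)
        except Exception:
            return name, None  # degrade gracefully on timeout/error

    pairs = await asyncio.gather(*(call(n, f) for n, f in fetchers.items()))
    return dict(pairs)
```

Keeping partial results rather than failing the whole hydration is usually the right call when one slow service should not block the response.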
4. Queue-based scraping workers
Distributed architecture with Redis/RabbitMQ/SQS:
- Message queue publishes scraping tasks (URLs + params)
- aioscraper workers consume queue → fetch data → process
- Pipeline acknowledges messages after successful processing
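In outline, such a worker might look like the following sketch, with `asyncio.Queue` standing in for the real broker and the `fetch` and `store` coroutines as hypothetical placeholders:

```python
import asyncio

async def worker(queue, fetch, store):
    """Consume scraping tasks; acknowledge only after processing succeeds."""
    while True:
        task = await queue.get()     # blocks until a task arrives
        if task is None:             # sentinel: shut down cleanly
            queue.task_done()
            return
        try:
            data = await fetch(task["url"])
            await store(data)
            queue.task_done()        # ack after successful processing
        except Exception:
            queue.task_done()        # a real broker would nack/requeue here
```

With a real broker (Redis, RabbitMQ, SQS), the ack/nack calls replace `task_done()`, so failed tasks are redelivered instead of dropped.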
5. Social media API aggregation
Aggregate user stats from Twitter, LinkedIn, GitHub APIs:
- Different rate limits per platform (adaptive throttling)
- Error callbacks for quota exceeded / auth failures
- Pipeline deduplicates → enriches → stores to database
6. Multi-source data snapshots
Collect point-in-time data from 500+ API sources simultaneously:
- Health monitoring: poll status endpoints of distributed services every minute
- Market data: snapshot prices from 200+ suppliers at exact intervals
- Analytics aggregation: fetch metrics from dozens of analytics APIs on schedule
- Concurrent execution with precise timing and automatic retries for failed sources
Performance
Benchmarks show stable throughput across CPython 3.11–3.14 (see benchmarks).
Documentation
Full documentation at aioscraper.readthedocs.io
Changelog
See CHANGELOG.md for version history and release notes.
Contributing
Please see the Contributing guide for workflow, tooling, and review expectations.
License
MIT License
Copyright (c) 2025 darkstussy