
Onecrawler

An async Python crawling library for discovering URLs, extracting links, and scraping structured content.


Installation · Quick Start · Documentation


Overview

Onecrawler helps you build maintainable crawling and extraction workflows without turning every project into a custom scraping script. It provides a shared configuration model, async execution, sitemap discovery, browser-backed link extraction, heuristic content extraction, and optional GenAI extraction for typed outputs.

Recommended workflow:

  1. Use sitemaps first whenever possible.
  2. Fall back to browser link extraction when sitemap coverage is missing or dynamic.
  3. Scrape the final URL list with heuristic extraction by default.
  4. Use GenAI extraction when you need structured output in a Pydantic schema.

Steps 2 and 3 map directly onto the two engines (step 1 is sketched just below, and the Quick Start shows a complete program):

async with LinkExtractionEngine(settings) as link_engine:
    links = await link_engine.run("https://example.com")

async with ScraperEngine(settings) as scraper_engine:
    records = await scraper_engine.run(links)
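
Step 1 uses sitemap discovery through UniversalSiteMap. This page does not show its exact call signature, so the snippet below is a minimal sketch that assumes it follows the same async context-manager and run() pattern as the engines above:

import asyncio
from onecrawler import CrawlerSettings, UniversalSiteMap


async def discover():
    settings = CrawlerSettings()
    # Assumption: UniversalSiteMap mirrors the engines' async context-manager API.
    async with UniversalSiteMap(settings) as sitemap:
        return await sitemap.run("https://example.com")


if __name__ == "__main__":
    urls = asyncio.run(discover())
    print(f"Discovered {len(urls)} URLs")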

Features

  • Sitemap discovery: resolves robots.txt, common sitemap paths, nested indexes, .xml.gz, feeds, and HTML fallback
  • Browser link extraction: shallow and deep Playwright-backed discovery for JavaScript-rendered or sitemap-poor sites
  • URL filtering: wildcard path filters with include_link_patterns
  • Async performance: tunable concurrency, retries, timeouts, and crawl limits
  • Content extraction: heuristic extraction with trafilatura for fast article-like content
  • GenAI extraction: optional model-assisted extraction for strongly typed Pydantic outputs
  • Output formats: markdown, json, csv, html, python, txt, xml, xmltei (see the example after this list)
  • Proxy support: single proxy or rotating proxy pools for browser and sitemap workflows
  • Browser controls: viewport, user agent, locale, timezone, storage state, and runtime settings
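
As an illustration of the output-format setting, switching a heuristic run from json to markdown only changes one field; the rest of the pipeline is unchanged from the Quick Start below:

from onecrawler import CrawlerSettings

# Same heuristic pipeline as in the Quick Start, but emitting markdown records.
settings = CrawlerSettings(
    scraping_strategy="heuristic",
    scraping_output_format="markdown",
)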

When To Use What

  • Fast URL discovery from a public site: use UniversalSiteMap. It is the simplest, fastest, and least expensive way to collect URLs.
  • Links from one listing page: use a shallow LinkExtractionEngine. It reads direct same-site links from the page.
  • Recursive discovery through navigation: use a deep LinkExtractionEngine. It follows internal links until your configured limit.
  • Bulk article or page text extraction: use the heuristic ScraperEngine. It is deterministic and avoids model cost.
  • Typed fields or semantic normalization: use GenAI extraction. It produces schema-shaped output for downstream systems.

Installation

pip install onecrawler

Install Playwright browser binaries when you use browser-backed crawling or scraping:

python -m playwright install chromium

Install optional GenAI dependencies when you use model-assisted extraction:

pip install "onecrawler[genai]"

[!NOTE] GenAI extraction requires an API key from your chosen provider (OpenAI, Google) or a running Ollama instance. See GenAI Extraction for details.

For local development:

git clone https://github.com/sayedshaun/onecrawler.git
cd onecrawler
python -m pip install -e ".[dev]"
python -m playwright install chromium

Quick Start

import json
from onecrawler import CrawlerSettings, LinkExtractionEngine, ScraperEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="deep",
        link_extraction_limit=10,
        concurrency=7,
        scraping_strategy="heuristic",
        scraping_output_format="json",
        enable_human_behaviors=True,
    )

    async with LinkExtractionEngine(settings) as link_engine:
        links = await link_engine.run("https://www.example.com/")

    async with ScraperEngine(settings) as scraper_engine:
        results = await scraper_engine.run(links)

    with open("output.json", "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())

[!TIP] Always set link_extraction_limit when crawling broad sites. Without it, discovery can run indefinitely on large domains.


Browser Link Extraction

Use browser extraction when sitemaps are incomplete or unavailable, or when the links you need are rendered by JavaScript and never appear in a sitemap.

import asyncio
from onecrawler import CrawlerSettings, LinkExtractionEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="deep",
        link_extraction_limit=250,
        include_link_patterns=["/news/*"],
        concurrency=5,
    )

    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/news")

    print(f"Collected {len(links)} links")


if __name__ == "__main__":
    asyncio.run(main())

[!TIP] Use include_link_patterns to keep discovery focused on relevant paths. For example, ["/blog/*", "/docs/*"] prevents the crawler from wandering into auth pages, admin routes, or unrelated sections.

[!NOTE] Deep extraction follows internal links recursively. Use shallow strategy when you only need links visible on a single listing page — it's significantly faster.
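
A minimal sketch of the shallow variant, assuming "shallow" is the literal value accepted by link_extraction_strategy (mirroring the "deep" value used above):

import asyncio
from onecrawler import CrawlerSettings, LinkExtractionEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="shallow",
        include_link_patterns=["/news/*"],
    )

    async with LinkExtractionEngine(settings) as engine:
        # Only links visible on the single listing page are collected.
        links = await engine.run("https://example.com/news")

    print(f"Collected {len(links)} links")


if __name__ == "__main__":
    asyncio.run(main())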


GenAI Extraction With a Schema

Use GenAI extraction when you need a strongly typed response shape instead of plain content.

pip install "onecrawler[genai]"
import asyncio
from typing import Optional
from pydantic import BaseModel
from onecrawler import CrawlerSettings, GenerativeAISettings, ScraperEngine


class ArticleSummary(BaseModel):
    title: str
    author: Optional[str] = None
    published_at: Optional[str] = None
    summary: str
    topics: list[str]


async def main():
    settings = CrawlerSettings(
        scraping_strategy="genai",
        scraping_output_format="json",
        genai=GenerativeAISettings(
            provider="openai",
            model_name="gpt-4o-mini",
            api_key="YOUR_API_KEY",
            output_schema=ArticleSummary,
        ),
        concurrency=2,
        request_timeout=30,
    )

    async with ScraperEngine(settings) as scraper:
        result = await scraper.run("https://example.com/articles/story")

    print(result.model_dump() if hasattr(result, "model_dump") else result)


if __name__ == "__main__":
    asyncio.run(main())

[!TIP] Keep concurrency low (2–4) for GenAI extraction. Each page triggers a model call; high concurrency can exhaust rate limits quickly and inflate costs.

[!WARNING] Never hardcode your API key in source files. Use environment variables or a secrets manager instead:

import os

api_key = os.environ["OPENAI_API_KEY"]

Supported Providers

  • OpenAI: requires api_key; models include GPT-4o, GPT-4o-mini, etc.
  • Google: requires api_key; supports Gemini models.
  • Ollama: requires base_url (no key needed); supports any locally hosted model.
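
Google Example

The configuration mirrors the OpenAI example above. The provider string and model name below are assumptions for illustration; use whichever Gemini model your account has access to:

import os
from onecrawler import CrawlerSettings, GenerativeAISettings

settings = CrawlerSettings(
    scraping_strategy="genai",
    genai=GenerativeAISettings(
        provider="google",               # assumed provider identifier
        model_name="gemini-1.5-flash",   # placeholder Gemini model name
        api_key=os.environ["GOOGLE_API_KEY"],
        output_schema=ArticleSummary,    # schema defined in the example above
    ),
)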

Ollama Example

settings = CrawlerSettings(
    scraping_strategy="genai",
    genai=GenerativeAISettings(
        provider="ollama",
        model_name="llama3:8b",
        base_url="http://localhost:11434/",
        output_schema=ArticleSummary,
    ),
)

[!NOTE] Ollama requires a running local instance. Install it from ollama.com and pull your model (ollama pull llama3:8b) before running.


Proxy Support

Attach one proxy or a rotating proxy pool directly to CrawlerSettings.

from onecrawler import CrawlerSettings, ProxySettings


settings = CrawlerSettings(
    proxies=[
        ProxySettings(server="http://proxy-1.example:8080"),
        ProxySettings(
            server="http://proxy-2.example:8080",
            username="user",
            password="pass",
        ),
    ],
    proxy_rotation="round_robin",
)

Use proxy=ProxySettings(...) for a single proxy, or proxies=[...] with proxy_rotation for a pool.
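
For a single proxy, the configuration reduces to the proxy field described above:

from onecrawler import CrawlerSettings, ProxySettings

settings = CrawlerSettings(
    proxy=ProxySettings(server="http://proxy.example:8080"),
)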

[!TIP] round_robin rotation distributes requests evenly across your proxy pool. For rate-limited targets, pair this with a modest concurrency value and a request_delay to avoid triggering bans.


Production Tips

[!IMPORTANT] Split URL discovery and scraping into separate pipeline steps. Collecting all URLs first gives you a checkpoint to resume from if scraping fails partway through — without re-running discovery.
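
One way to build that checkpoint is to persist the discovered URLs between the two steps. The sketch below assumes the engine returns a JSON-serializable list of URL strings; the file name is arbitrary:

import asyncio
import json
from pathlib import Path
from onecrawler import CrawlerSettings, LinkExtractionEngine, ScraperEngine

LINKS_FILE = Path("links.json")  # arbitrary checkpoint file


async def main():
    settings = CrawlerSettings(link_extraction_limit=250)

    # Step 1: discover URLs once and checkpoint them to disk.
    if LINKS_FILE.exists():
        links = json.loads(LINKS_FILE.read_text(encoding="utf-8"))
    else:
        async with LinkExtractionEngine(settings) as engine:
            links = await engine.run("https://example.com")
        LINKS_FILE.write_text(json.dumps(links), encoding="utf-8")

    # Step 2: scrape from the checkpoint; re-running the script skips discovery.
    async with ScraperEngine(settings) as scraper:
        return await scraper.run(links)


if __name__ == "__main__":
    results = asyncio.run(main())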

[!TIP] Start with UniversalSiteMap before reaching for browser extraction. Sitemap-based discovery is faster, cheaper, and more complete on well-maintained sites. Fall back to LinkExtractionEngine only when sitemaps are missing or stale.

[!TIP] Use heuristic scraping (scraping_strategy="heuristic") for bulk content extraction. Reserve GenAI extraction for cases where you genuinely need structured, schema-shaped output — it adds latency and cost at scale.

[!CAUTION] Respect robots.txt and a site's terms of service before crawling. Onecrawler does not enforce crawl policies automatically — you are responsible for staying within allowed access patterns.


License

Released under the MIT License. See LICENSE for full terms.

