Onecrawler

An async Python toolkit for sitemap discovery, browser crawling, and structured content extraction.

Installation · Quick Start · Documentation · Development


Overview

Onecrawler helps you build maintainable crawling and extraction workflows without turning every project into a custom scraping script. It gives you a shared configuration model, async execution, sitemap discovery, browser-backed link extraction, heuristic content extraction, and optional GenAI extraction for typed outputs.

The recommended workflow is:

  1. Use sitemaps first whenever possible.
  2. Fall back to browser link extraction when sitemap coverage is missing or dynamic.
  3. Scrape the final URL list with heuristic extraction by default.
  4. Use GenAI extraction when you need structured output in a Pydantic schema.

In code, steps 1 and 3 of that flow look like this (settings is a CrawlerSettings instance, as in the Quick Start below):

sitemap = UniversalSiteMap(settings)
urls = await sitemap.run("https://example.com")

async with ScraperEngine(settings) as scraper:
    records = await scraper.run(urls)

Features

  • Sitemap discovery: Resolves robots.txt, common sitemap paths, nested indexes, .xml.gz, feeds, and HTML fallback
  • Browser link extraction: Shallow and deep Playwright-backed discovery for JavaScript-rendered or sitemap-poor sites
  • URL filtering: Wildcard path filters with include_link_patterns
  • Async performance: Tunable concurrency, retries, timeouts, and crawl limits
  • Content extraction: Heuristic extraction with trafilatura for fast extraction of article-like content
  • GenAI extraction: Optional model-assisted extraction for strongly typed Pydantic outputs
  • Output formats: markdown, json, csv, html, python, txt, xml, xmltei (see the example below)
  • Proxy support: Single proxy or rotating proxy pools for browser and sitemap workflows
  • Browser controls: Viewport, user agent, locale, timezone, storage state, and runtime settings
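
Most of these behaviors are selected through CrawlerSettings fields rather than per-engine code. A minimal sketch that switches the heuristic scraper to Markdown output (the URL is a placeholder; the format values come from the Output formats row above):

import asyncio
from onecrawler import CrawlerSettings, ScraperEngine


async def main():
    settings = CrawlerSettings(
        scraping_strategy="heuristic",
        scraping_output_format="markdown",  # any value from the Output formats row should work here
    )

    async with ScraperEngine(settings) as scraper:
        records = await scraper.run(["https://example.com/articles/intro"])

    print(records)


if __name__ == "__main__":
    asyncio.run(main())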

When To Use What

  • Fast URL discovery from a public site: UniversalSiteMap. It is usually the simplest, fastest, and least expensive way to collect URLs.
  • Links from one listing page: Shallow LinkExtractionEngine. It reads direct same-site links from the page (see the sketch below).
  • Recursive discovery through navigation: Deep LinkExtractionEngine. It follows internal links up to your configured limit.
  • Bulk article or page text extraction: Heuristic ScraperEngine. It is deterministic and avoids model cost.
  • Typed fields or semantic normalization: GenAI extraction. It can produce schema-shaped output for downstream systems.
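
For the shallow case, a minimal sketch (this assumes link_extraction_strategy accepts "shallow" as the counterpart of the "deep" value used later in this README):

import asyncio
from onecrawler import CrawlerSettings, LinkExtractionEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="shallow",  # assumed counterpart of "deep"
        include_link_patterns=["/articles/*"],
    )

    # Shallow extraction reads direct same-site links from the single listing page
    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/articles")

    print(f"Collected {len(links)} links")


if __name__ == "__main__":
    asyncio.run(main())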

Installation

pip install onecrawler

Install Playwright browser binaries when you use browser-backed crawling or scraping:

python -m playwright install chromium

Install optional GenAI dependencies when you use model-assisted extraction:

pip install "onecrawler[genai]"

For local development:

git clone https://github.com/sayedshaun/onecrawler.git
cd onecrawler
python -m pip install -e ".[dev]"
python -m playwright install chromium

Quick Start

This example uses the production-friendly path: discover URLs from the sitemap, then scrape them.

import json
import asyncio
from onecrawler import CrawlerSettings, ScraperEngine, UniversalSiteMap


async def main():
    settings = CrawlerSettings(
        link_extraction_limit=100,
        include_link_patterns=["/articles/*"],
        scraping_strategy="heuristic",
        scraping_output_format="json",
        concurrency=8,
        request_timeout=15,
        max_retries=3,
    )

    sitemap = UniversalSiteMap(settings)
    urls = await sitemap.run("https://example.com")

    async with ScraperEngine(settings) as scraper:
        records = await scraper.run(urls)

    with open("articles.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    asyncio.run(main())

Browser Link Extraction

Use browser extraction when sitemaps are incomplete, unavailable, or do not expose JavaScript-rendered links.

import asyncio
from onecrawler import CrawlerSettings, LinkExtractionEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="deep",
        link_extraction_limit=250,
        include_link_patterns=["/news/*"],
        concurrency=5,
    )

    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/news")

    print(f"Collected {len(links)} links")


if __name__ == "__main__":
    asyncio.run(main())

GenAI Extraction With A Schema

Use GenAI extraction when you need a strongly typed response shape instead of plain content. This requires installing the GenAI dependencies:

pip install "onecrawler[genai]"
import asyncio
from typing import Optional
from pydantic import BaseModel
from onecrawler import CrawlerSettings, GenerativeAISettings, ScraperEngine


class ArticleSummary(BaseModel):
    title: str
    author: Optional[str] = None
    published_at: Optional[str] = None
    summary: str
    topics: list[str]


async def main():
    settings = CrawlerSettings(
        scraping_strategy="genai",  # Required for GenAI extraction
        scraping_output_format="json",  # GenAI only supports JSON
        genai=GenerativeAISettings(
            provider="openai",  # Options: "openai", "google", "ollama"
            model_name="gpt-4o-mini",
            api_key="YOUR_API_KEY",  # Required for OpenAI/Google, optional for Ollama
            output_schema=ArticleSummary,  # Pydantic model for structured output
            # Optional: base_url for custom endpoints (e.g., Ollama)
            # base_url="https://your-ollama-instance.com/",
        ),
        concurrency=2,  # Lower concurrency recommended for GenAI
        request_timeout=30,  # Increase timeout for model responses
    )

    async with ScraperEngine(settings) as scraper:
        result = await scraper.run("https://example.com/articles/story")

    # Convert Pydantic model to dict for JSON serialization
    print(result.model_dump() if hasattr(result, 'model_dump') else result)


if __name__ == "__main__":
    asyncio.run(main())

Supported Providers

  • OpenAI: Requires api_key, supports GPT models
  • Google: Requires api_key, supports Gemini models
  • Ollama: No API key needed, requires base_url, supports local models
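
A Google configuration mirrors the OpenAI example above with the provider and model swapped. The model name below is an assumption; use whichever Gemini model your key can access:

from onecrawler import CrawlerSettings, GenerativeAISettings

settings = CrawlerSettings(
    scraping_strategy="genai",
    scraping_output_format="json",
    genai=GenerativeAISettings(
        provider="google",
        model_name="gemini-1.5-flash",  # assumed model name
        api_key="YOUR_API_KEY",
        output_schema=ArticleSummary,  # the Pydantic model from the example above
    ),
)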

Ollama Example

settings = CrawlerSettings(
    scraping_strategy="genai",
    genai=GenerativeAISettings(
        provider="ollama",
        model_name="llama3:8b",
        base_url="http://localhost:11434/",  # Your Ollama instance
        output_schema=ArticleSummary,
    ),
)

Proxy Support

Attach one proxy or a rotating proxy pool directly to CrawlerSettings.

from onecrawler import CrawlerSettings, ProxySettings


settings = CrawlerSettings(
    proxies=[
        ProxySettings(server="http://proxy-1.example:8080"),
        ProxySettings(
            server="http://proxy-2.example:8080",
            username="user",
            password="pass",
        ),
    ],
    proxy_rotation="round_robin",
)

Use proxy=ProxySettings(...) for one proxy, or proxies=[...] for rotation.
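
A minimal single-proxy sketch, assuming the proxy field takes the same ProxySettings shape as the pool entries above:

from onecrawler import CrawlerSettings, ProxySettings

settings = CrawlerSettings(
    proxy=ProxySettings(
        server="http://proxy.example:8080",
        username="user",  # omit credentials if the proxy is unauthenticated
        password="pass",
    ),
)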


Documentation

The README is the project overview. The full documentation in docs/ contains production guidance, caveats, performance notes, and copy-paste examples.

  • Install the package: Installation
  • Run your first crawl: Quick start
  • Tune crawler settings: Configuration
  • Discover URLs from sitemaps: Sitemap discovery
  • Extract and filter links: Link extraction
  • Scrape page content: Scraping
  • Public classes and exports: API reference
  • Common fixes: Troubleshooting
  • Contribute locally: Contributing
  • Work on the project: Development

See Contributing for how to improve the docs.


Production Tips

  • Prefer UniversalSiteMap before browser crawling.
  • Always set link_extraction_limit for broad jobs.
  • Use include_link_patterns to keep discovery focused.
  • Start with moderate concurrency, then increase gradually.
  • Use heuristic scraping for bulk content extraction.
  • Use GenAI extraction for schema-shaped output, summaries, classification, or field normalization.
  • Split discovery and scraping into separate steps for easier retries (see the sketch after this list).
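
The last tip can be as simple as persisting the discovered URLs between runs. A minimal sketch, assuming the URL list returned by UniversalSiteMap is JSON-serializable (the file name is arbitrary):

import json
import asyncio
from onecrawler import CrawlerSettings, ScraperEngine, UniversalSiteMap

settings = CrawlerSettings(scraping_strategy="heuristic", scraping_output_format="json")


async def discover():
    # Step 1: collect URLs once and keep them on disk so scraping can be retried alone
    urls = await UniversalSiteMap(settings).run("https://example.com")
    with open("urls.json", "w", encoding="utf-8") as f:
        json.dump(urls, f)


async def scrape():
    # Step 2: scrape from the saved list; rerun only this step after a failure
    with open("urls.json", encoding="utf-8") as f:
        urls = json.load(f)
    async with ScraperEngine(settings) as scraper:
        return await scraper.run(urls)


if __name__ == "__main__":
    asyncio.run(discover())
    records = asyncio.run(scrape())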

License

Released under the MIT License. See LICENSE for full terms.


Built by sayedshaun

Download files

Download the file for your platform.

Source Distribution

onecrawler-0.1.1.tar.gz (31.5 kB)

Built Distribution

onecrawler-0.1.1-py3-none-any.whl (31.3 kB)

File details

Details for the file onecrawler-0.1.1.tar.gz.

File metadata

  • Download URL: onecrawler-0.1.1.tar.gz
  • Upload date:
  • Size: 31.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for onecrawler-0.1.1.tar.gz:
  • SHA256: 850dc975b149a798880aef570845c38a8ea6cc0290b4696735d289f09723034d
  • MD5: ad3efb7cd0798fccf513acd94b2cc873
  • BLAKE2b-256: 41bacecf37a4d13726be18d538e7b2ddb64f1d8e57865604df2ae1a21a469190


File details

Details for the file onecrawler-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: onecrawler-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 31.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for onecrawler-0.1.1-py3-none-any.whl:
  • SHA256: 11fadb25aef4736c64d09fc1ee629428ce52cc694314d14d4420153a447d8ce2
  • MD5: 1cbfeced4df2a96b1868798d3316f64a
  • BLAKE2b-256: e99c6c9f4f28b0920831e4b14a02b3134575540797208e6c198ae4fd62a519ff

