Onecrawler

An async Python toolkit for sitemap discovery, browser crawling, and structured content extraction.

Installation · Quick Start · Documentation · Development


Overview

Onecrawler helps you build maintainable crawling and extraction workflows without turning every project into a custom scraping script. It gives you a shared configuration model, async execution, sitemap discovery, browser-backed link extraction, heuristic content extraction, and optional GenAI extraction for typed outputs.

The recommended workflow is:

  1. Use sitemaps first whenever possible.
  2. Fall back to browser link extraction when sitemap coverage is missing or dynamic.
  3. Scrape the final URL list with heuristic extraction by default.
  4. Use GenAI extraction when you need structured output in a Pydantic schema.

In code, steps 1 and 3 of that flow look like this (settings is a CrawlerSettings instance, as in the Quick Start below):

sitemap = UniversalSiteMap(settings)
urls = await sitemap.run("https://example.com")

async with ScraperEngine(settings) as scraper:
    records = await scraper.run(urls)

Features

  • Sitemap discovery: Resolves robots.txt, common sitemap paths, nested indexes, .xml.gz, feeds, and HTML fallback
  • Browser link extraction: Shallow and deep Playwright-backed discovery for JavaScript-rendered or sitemap-poor sites
  • URL filtering: Wildcard path filters with include_link_patterns
  • Async performance: Tunable concurrency, retries, timeouts, and crawl limits
  • Content extraction: Heuristic extraction with trafilatura for fast extraction of article-like content
  • GenAI extraction: Optional model-assisted extraction for strongly typed Pydantic outputs
  • Output formats: markdown, json, csv, html, python, txt, xml, xmltei (see the example below)
  • Proxy support: Single proxy or rotating proxy pools for browser and sitemap workflows
  • Browser controls: Viewport, user agent, locale, timezone, storage state, and runtime settings
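
Most of these behaviors are selected through CrawlerSettings fields rather than per-engine code. A minimal sketch that switches the heuristic scraper to Markdown output (the URL is a placeholder; the format values come from the Output formats row above):

import asyncio
from onecrawler import CrawlerSettings, ScraperEngine


async def main():
    settings = CrawlerSettings(
        scraping_strategy="heuristic",
        scraping_output_format="markdown",  # any value from the Output formats row should work here
    )

    async with ScraperEngine(settings) as scraper:
        records = await scraper.run(["https://example.com/articles/intro"])

    print(records)


if __name__ == "__main__":
    asyncio.run(main())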

When To Use What

  • Fast URL discovery from a public site: UniversalSiteMap. It is usually the simplest, fastest, and least expensive way to collect URLs.
  • Links from one listing page: Shallow LinkExtractionEngine. It reads direct same-site links from the page (see the sketch below).
  • Recursive discovery through navigation: Deep LinkExtractionEngine. It follows internal links up to your configured limit.
  • Bulk article or page text extraction: Heuristic ScraperEngine. It is deterministic and avoids model cost.
  • Typed fields or semantic normalization: GenAI extraction. It can produce schema-shaped output for downstream systems.
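
For the shallow case, a minimal sketch (this assumes link_extraction_strategy accepts "shallow" as the counterpart of the "deep" value used later in this README):

import asyncio
from onecrawler import CrawlerSettings, LinkExtractionEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="shallow",  # assumed counterpart of "deep"
        include_link_patterns=["/articles/*"],
    )

    # Shallow extraction reads direct same-site links from the single listing page
    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/articles")

    print(f"Collected {len(links)} links")


if __name__ == "__main__":
    asyncio.run(main())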

Installation

pip install onecrawler

Install Playwright browser binaries when you use browser-backed crawling or scraping:

python -m playwright install chromium

Install optional GenAI dependencies when you use model-assisted extraction:

pip install "onecrawler[genai]"

For local development:

git clone https://github.com/sayedshaun/onecrawler.git
cd onecrawler
python -m pip install -e ".[dev]"
python -m playwright install chromium

Quick Start

This example uses the production-friendly path: discover URLs from the sitemap, then scrape them.

import json
import asyncio
from onecrawler import CrawlerSettings, ScraperEngine, UniversalSiteMap


async def main():
    settings = CrawlerSettings(
        link_extraction_limit=100,
        include_link_patterns=["/articles/*"],
        scraping_strategy="heuristic",
        scraping_output_format="json",
        concurrency=8,
        request_timeout=15,
        max_retries=3,
    )

    sitemap = UniversalSiteMap(settings)
    urls = await sitemap.run("https://example.com")

    async with ScraperEngine(settings) as scraper:
        records = await scraper.run(urls)

    with open("articles.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    asyncio.run(main())

Browser Link Extraction

Use browser extraction when sitemaps are incomplete, unavailable, or do not expose JavaScript-rendered links.

import asyncio
from onecrawler import CrawlerSettings, LinkExtractionEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="deep",
        link_extraction_limit=250,
        include_link_patterns=["/news/*"],
        concurrency=5,
    )

    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/news")

    print(f"Collected {len(links)} links")


if __name__ == "__main__":
    asyncio.run(main())

GenAI Extraction With A Schema

Use GenAI extraction when you need a strongly typed response shape instead of plain content. This requires installing the GenAI dependencies:

pip install "onecrawler[genai]"
import asyncio
from typing import Optional
from pydantic import BaseModel
from onecrawler import CrawlerSettings, GenerativeAISettings, ScraperEngine


class ArticleSummary(BaseModel):
    title: str
    author: Optional[str] = None
    published_at: Optional[str] = None
    summary: str
    topics: list[str]


async def main():
    settings = CrawlerSettings(
        scraping_strategy="genai",  # Required for GenAI extraction
        scraping_output_format="json",  # GenAI only supports JSON
        genai=GenerativeAISettings(
            provider="openai",  # Options: "openai", "google", "ollama"
            model_name="gpt-4o-mini",
            api_key="YOUR_API_KEY",  # Required for OpenAI/Google, optional for Ollama
            output_schema=ArticleSummary,  # Pydantic model for structured output
            # Optional: base_url for custom endpoints (e.g., Ollama)
            # base_url="https://your-ollama-instance.com/",
        ),
        concurrency=2,  # Lower concurrency recommended for GenAI
        request_timeout=30,  # Increase timeout for model responses
    )

    async with ScraperEngine(settings) as scraper:
        result = await scraper.run("https://example.com/articles/story")

    # Convert Pydantic model to dict for JSON serialization
    print(result.model_dump() if hasattr(result, 'model_dump') else result)


if __name__ == "__main__":
    asyncio.run(main())

Supported Providers

  • OpenAI: Requires api_key, supports GPT models
  • Google: Requires api_key, supports Gemini models
  • Ollama: No API key needed, requires base_url, supports local models
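
A Google configuration mirrors the OpenAI example above with the provider and model swapped. The model name below is an assumption; use whichever Gemini model your key can access:

from onecrawler import CrawlerSettings, GenerativeAISettings

settings = CrawlerSettings(
    scraping_strategy="genai",
    scraping_output_format="json",
    genai=GenerativeAISettings(
        provider="google",
        model_name="gemini-1.5-flash",  # assumed model name
        api_key="YOUR_API_KEY",
        output_schema=ArticleSummary,  # the Pydantic model from the example above
    ),
)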

Ollama Example

settings = CrawlerSettings(
    scraping_strategy="genai",
    genai=GenerativeAISettings(
        provider="ollama",
        model_name="llama3:8b",
        base_url="http://localhost:11434/",  # Your Ollama instance
        output_schema=ArticleSummary,
    ),
)

Proxy Support

Attach one proxy or a rotating proxy pool directly to CrawlerSettings.

from onecrawler import CrawlerSettings, ProxySettings


settings = CrawlerSettings(
    proxies=[
        ProxySettings(server="http://proxy-1.example:8080"),
        ProxySettings(
            server="http://proxy-2.example:8080",
            username="user",
            password="pass",
        ),
    ],
    proxy_rotation="round_robin",
)

Use proxy=ProxySettings(...) for one proxy, or proxies=[...] for rotation.
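
A minimal single-proxy sketch, assuming the proxy field takes the same ProxySettings shape as the pool entries above:

from onecrawler import CrawlerSettings, ProxySettings

settings = CrawlerSettings(
    proxy=ProxySettings(
        server="http://proxy.example:8080",
        username="user",  # omit credentials if the proxy is unauthenticated
        password="pass",
    ),
)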


Documentation

The README is the project overview. The full documentation in docs/ contains production guidance, caveats, performance notes, and copy-paste examples.

  • Install the package: Installation
  • Run your first crawl: Quick start
  • Tune crawler settings: Configuration
  • Discover URLs from sitemaps: Sitemap discovery
  • Extract and filter links: Link extraction
  • Scrape page content: Scraping
  • Public classes and exports: API reference
  • Common fixes: Troubleshooting
  • Contribute locally: Contributing
  • Work on the project: Development

See Contributing for how to improve the docs.


Production Tips

  • Prefer UniversalSiteMap before browser crawling.
  • Always set link_extraction_limit for broad jobs.
  • Use include_link_patterns to keep discovery focused.
  • Start with moderate concurrency, then increase gradually.
  • Use heuristic scraping for bulk content extraction.
  • Use GenAI extraction for schema-shaped output, summaries, classification, or field normalization.
  • Split discovery and scraping into separate steps for easier retries (see the sketch after this list).
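
The last tip can be as simple as persisting the discovered URLs between runs. A minimal sketch, assuming the URL list returned by UniversalSiteMap is JSON-serializable (the file name is arbitrary):

import json
import asyncio
from onecrawler import CrawlerSettings, ScraperEngine, UniversalSiteMap

settings = CrawlerSettings(scraping_strategy="heuristic", scraping_output_format="json")


async def discover():
    # Step 1: collect URLs once and keep them on disk so scraping can be retried alone
    urls = await UniversalSiteMap(settings).run("https://example.com")
    with open("urls.json", "w", encoding="utf-8") as f:
        json.dump(urls, f)


async def scrape():
    # Step 2: scrape from the saved list; rerun only this step after a failure
    with open("urls.json", encoding="utf-8") as f:
        urls = json.load(f)
    async with ScraperEngine(settings) as scraper:
        return await scraper.run(urls)


if __name__ == "__main__":
    asyncio.run(discover())
    records = asyncio.run(scrape())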

License

Released under the MIT License. See LICENSE for full terms.


Built by sayedshaun

Download files

Download the file for your platform.

Source Distribution

onecrawler-0.1.1.tar.gz (31.5 kB)

Built Distribution

onecrawler-0.1.1-py3-none-any.whl (31.3 kB)

File details

Details for the file onecrawler-0.1.1.tar.gz.

File metadata

  • Download URL: onecrawler-0.1.1.tar.gz
  • Upload date:
  • Size: 31.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for onecrawler-0.1.1.tar.gz:
  • SHA256: 850dc975b149a798880aef570845c38a8ea6cc0290b4696735d289f09723034d
  • MD5: ad3efb7cd0798fccf513acd94b2cc873
  • BLAKE2b-256: 41bacecf37a4d13726be18d538e7b2ddb64f1d8e57865604df2ae1a21a469190


File details

Details for the file onecrawler-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: onecrawler-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 31.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for onecrawler-0.1.1-py3-none-any.whl:
  • SHA256: 11fadb25aef4736c64d09fc1ee629428ce52cc694314d14d4420153a447d8ce2
  • MD5: 1cbfeced4df2a96b1868798d3316f64a
  • BLAKE2b-256: e99c6c9f4f28b0920831e4b14a02b3134575540797208e6c198ae4fd62a519ff

