Onecrawler
An async Python toolkit for sitemap discovery, browser crawling, and structured content extraction.
Overview
Onecrawler helps you build maintainable crawling and extraction workflows without turning every project into a custom scraping script. It gives you a shared configuration model, async execution, sitemap discovery, browser-backed link extraction, heuristic content extraction, and optional GenAI extraction for typed outputs.
The recommended workflow is:
- Use sitemaps first whenever possible.
- Fall back to browser link extraction when sitemap coverage is missing or dynamic.
- Scrape the final URL list with heuristic extraction by default.
- Use GenAI extraction when you need structured output in a Pydantic schema.
In code, the sitemap-first path from that workflow looks like this:

```python
sitemap = UniversalSiteMap(settings)
urls = await sitemap.run("https://example.com")

async with ScraperEngine(settings) as scraper:
    records = await scraper.run(urls)
```
Features
| Capability | Details |
|---|---|
| Sitemap discovery | Resolves robots.txt, common sitemap paths, nested indexes, .xml.gz, feeds, and HTML fallback |
| Browser link extraction | Shallow and deep Playwright-backed discovery for JavaScript-rendered or sitemap-poor sites |
| URL filtering | Wildcard path filters with include_link_patterns |
| Async performance | Tunable concurrency, retries, timeouts, and crawl limits |
| Content extraction | Heuristic extraction with trafilatura for fast article-like content extraction |
| GenAI extraction | Optional model-assisted extraction for strongly typed Pydantic outputs |
| Output formats | markdown, json, csv, html, python, txt, xml, xmltei |
| Proxy support | Single proxy or rotating proxy pools for browser and sitemap workflows |
| Browser controls | Viewport, user agent, locale, timezone, storage state, and runtime settings |
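For example, the URL-filtering and crawl-limit settings from the table can be combined in one CrawlerSettings. A minimal sketch, assuming include_link_patterns accepts more than one wildcard entry (the pattern values are placeholders):

```python
from onecrawler import CrawlerSettings

# Keep discovery focused: follow only links matching these wildcard paths
# (placeholder patterns) and stop after 500 discovered links.
settings = CrawlerSettings(
    include_link_patterns=["/articles/*", "/blog/*"],
    link_extraction_limit=500,
)
```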
When To Use What
| Need | Use | Why |
|---|---|---|
| Fast URL discovery from a public site | UniversalSiteMap | It is usually the simplest, fastest, and least expensive way to collect URLs |
| Links from one listing page | Shallow LinkExtractionEngine | It reads direct same-site links from the page |
| Recursive discovery through navigation | Deep LinkExtractionEngine | It follows internal links up to your configured limit |
| Bulk article or page text extraction | Heuristic ScraperEngine | It is deterministic and avoids model cost |
| Typed fields or semantic normalization | GenAI extraction | It can produce schema-shaped output for downstream systems |
Installation
```bash
pip install onecrawler
```

Install Playwright browser binaries when you use browser-backed crawling or scraping:

```bash
python -m playwright install chromium
```

Install optional GenAI dependencies when you use model-assisted extraction:

```bash
pip install "onecrawler[genai]"
```

For local development:

```bash
git clone https://github.com/sayedshaun/onecrawler.git
cd onecrawler
python -m pip install -e ".[dev]"
python -m playwright install chromium
```
Quick Start
This example uses the production-friendly path: discover URLs from the sitemap, then scrape them.
```python
import json
import asyncio

from onecrawler import CrawlerSettings, ScraperEngine, UniversalSiteMap


async def main():
    settings = CrawlerSettings(
        link_extraction_limit=100,
        include_link_patterns=["/articles/*"],
        scraping_strategy="heuristic",
        scraping_output_format="json",
        concurrency=8,
        request_timeout=15,
        max_retries=3,
    )

    sitemap = UniversalSiteMap(settings)
    urls = await sitemap.run("https://example.com")

    async with ScraperEngine(settings) as scraper:
        records = await scraper.run(urls)

    with open("articles.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    asyncio.run(main())
```
Browser Link Extraction
Use browser extraction when sitemaps are incomplete, unavailable, or unable to expose JavaScript-rendered links.
```python
import asyncio

from onecrawler import CrawlerSettings, LinkExtractionEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="deep",
        link_extraction_limit=250,
        include_link_patterns=["/news/*"],
        concurrency=5,
    )

    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/news")

    print(f"Collected {len(links)} links")


if __name__ == "__main__":
    asyncio.run(main())
```
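When this fallback path is the right fit, the collected links can be passed straight into the scraping step, mirroring the sitemap-based Quick Start. A minimal sketch, assuming ScraperEngine.run accepts the list returned by LinkExtractionEngine.run (the URL and pattern are placeholders):

```python
import asyncio

from onecrawler import CrawlerSettings, LinkExtractionEngine, ScraperEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="deep",
        link_extraction_limit=250,
        include_link_patterns=["/news/*"],
        scraping_strategy="heuristic",
        scraping_output_format="json",
    )

    # Step 1: browser-backed link discovery.
    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/news")

    # Step 2: scrape the discovered URLs with the heuristic extractor.
    async with ScraperEngine(settings) as scraper:
        records = await scraper.run(links)

    print(f"Scraped {len(records)} pages")


if __name__ == "__main__":
    asyncio.run(main())
```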
GenAI Extraction With A Schema
Use GenAI extraction when you need a strongly typed response shape instead of plain content. This requires installing the GenAI dependencies:
pip install "onecrawler[genai]"
```python
import asyncio
from typing import Optional

from pydantic import BaseModel

from onecrawler import CrawlerSettings, GenerativeAISettings, ScraperEngine


class ArticleSummary(BaseModel):
    title: str
    author: Optional[str] = None
    published_at: Optional[str] = None
    summary: str
    topics: list[str]


async def main():
    settings = CrawlerSettings(
        scraping_strategy="genai",        # Required for GenAI extraction
        scraping_output_format="json",    # GenAI only supports JSON
        genai=GenerativeAISettings(
            provider="openai",            # Options: "openai", "google", "ollama"
            model_name="gpt-4o-mini",
            api_key="YOUR_API_KEY",       # Required for OpenAI/Google, optional for Ollama
            output_schema=ArticleSummary, # Pydantic model for structured output
            # Optional: base_url for custom endpoints (e.g., Ollama)
            # base_url="https://your-ollama-instance.com/",
        ),
        concurrency=2,                    # Lower concurrency recommended for GenAI
        request_timeout=30,               # Increase timeout for model responses
    )

    async with ScraperEngine(settings) as scraper:
        result = await scraper.run("https://example.com/articles/story")

    # Convert Pydantic model to dict for JSON serialization
    print(result.model_dump() if hasattr(result, "model_dump") else result)


if __name__ == "__main__":
    asyncio.run(main())
```
Supported Providers
- OpenAI: requires api_key, supports GPT models
- Google: requires api_key, supports Gemini models
- Ollama: no API key needed, requires base_url, supports local models
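Google Example

The Google settings mirror the OpenAI example above; a minimal sketch, where the Gemini model name is a placeholder:

```python
settings = CrawlerSettings(
    scraping_strategy="genai",
    genai=GenerativeAISettings(
        provider="google",
        model_name="gemini-1.5-flash",   # Placeholder; use any Gemini model you have access to
        api_key="YOUR_GOOGLE_API_KEY",
        output_schema=ArticleSummary,
    ),
)
```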
Ollama Example
```python
settings = CrawlerSettings(
    scraping_strategy="genai",
    genai=GenerativeAISettings(
        provider="ollama",
        model_name="llama3:8b",
        base_url="http://localhost:11434/",  # Your Ollama instance
        output_schema=ArticleSummary,
    ),
)
```
Proxy Support
Attach one proxy or a rotating proxy pool directly to CrawlerSettings.
```python
from onecrawler import CrawlerSettings, ProxySettings

settings = CrawlerSettings(
    proxies=[
        ProxySettings(server="http://proxy-1.example:8080"),
        ProxySettings(
            server="http://proxy-2.example:8080",
            username="user",
            password="pass",
        ),
    ],
    proxy_rotation="round_robin",
)
```
Use proxy=ProxySettings(...) for one proxy, or proxies=[...] for rotation.
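A minimal single-proxy sketch using that proxy field (server address and credentials are placeholders):

```python
from onecrawler import CrawlerSettings, ProxySettings

# One authenticated proxy applied to browser and sitemap requests.
settings = CrawlerSettings(
    proxy=ProxySettings(
        server="http://proxy.example:8080",
        username="user",   # Omit credentials for an open proxy
        password="pass",
    ),
)
```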
Documentation
The README is the project overview. The full documentation in docs/
contains production guidance, caveats, performance notes, and copy-paste examples.
| Topic | Guide |
|---|---|
| Install the package | Installation |
| Run your first crawl | Quick start |
| Tune crawler settings | Settings |
| Discover URLs from sitemaps | Sitemap discovery |
| Extract and filter links | Link extraction |
| Scrape page content | Scraping |
| Public classes and exports | API reference |
| Common fixes | Troubleshooting |
| Contribute locally | Contributing |
| Work on the project | Development |
See Contributing for how to improve the docs.
Production Tips
- Prefer UniversalSiteMap before browser crawling.
- Always set link_extraction_limit for broad jobs.
- Use include_link_patterns to keep discovery focused.
- Start with moderate concurrency, then increase gradually.
- Use heuristic scraping for bulk content extraction.
- Use GenAI extraction for schema-shaped output, summaries, classification, or field normalization.
- Split discovery and scraping into separate steps for easier retries, as in the sketch below.
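A minimal sketch of that last tip, splitting discovery and scraping into two independently rerunnable phases. The urls.json and records.json file names and the command-line phase switch are placeholders, and it assumes the sitemap run returns a JSON-serializable list of URLs:

```python
import json
import asyncio
import sys

from onecrawler import CrawlerSettings, ScraperEngine, UniversalSiteMap

settings = CrawlerSettings(
    scraping_strategy="heuristic",
    scraping_output_format="json",
)


async def discover(start_url: str) -> None:
    # Phase 1: collect URLs and persist them so scraping can be retried on its own.
    urls = await UniversalSiteMap(settings).run(start_url)
    with open("urls.json", "w", encoding="utf-8") as f:
        json.dump(urls, f, indent=2)


async def scrape() -> None:
    # Phase 2: load the saved URL list and scrape it; rerun only this step on failure.
    with open("urls.json", "r", encoding="utf-8") as f:
        urls = json.load(f)
    async with ScraperEngine(settings) as scraper:
        records = await scraper.run(urls)
    with open("records.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    phase = sys.argv[1] if len(sys.argv) > 1 else "discover"
    asyncio.run(discover("https://example.com") if phase == "discover" else scrape())
```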
License
Released under the MIT License. See LICENSE for full terms.
Built by sayedshaun