Async web content extraction library with three-tier fetch strategy
daz-web-extract
Async Python library that extracts a clean title and body text from a URL. It automatically escalates through up to three fetch strategies to handle everything from simple static pages to JavaScript-rendered content. It never raises exceptions: every call returns a structured result indicating success or failure.
Installation
Requires Python 3.12+.
pip install daz-web-extract
After installing, set up the browser engine for pages that require JavaScript rendering:
playwright install chromium
Usage
Python API
The library exposes a single async function extract and a result type ExtractionResult.
import asyncio
from daz_web_extract import extract, ExtractionResult
result: ExtractionResult = asyncio.run(extract("https://example.com"))
if result.success:
    print(result.title)           # Page title
    print(result.body)            # Clean body text
    print(result.fetch_method)    # Which strategy succeeded
    print(result.content_length)  # Length of body in characters
    print(result.elapsed_ms)      # Total time in milliseconds
    print(result.status_code)     # HTTP status code (if available)
else:
    print(result.error)           # Human-readable error message
Limiting fetch strategies
Use the max_tier parameter to control how far the library escalates:
# Only use fast HTTP fetch (no browser, no trafilatura)
result = await extract("https://example.com", max_tier=1)
# Use HTTP fetch + trafilatura, but skip the browser
result = await extract("https://example.com", max_tier=2)
# Use all strategies including headless browser (default)
result = await extract("https://example.com", max_tier=3)
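To make the effect of max_tier concrete, here is a minimal sketch of how a capped escalation loop could behave. It is not the library's implementation: the three tier functions are hypothetical stand-ins that return canned values, and only the ordering of the tiers (plain HTTP, then trafilatura, then headless browser) is taken from the comments above.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical stand-ins for the three strategies. Each returns extracted
# text on success, or None when that strategy cannot handle the page.
async def tier1_http(url: str) -> Optional[str]:
    return None  # pretend the plain HTTP fetch found no usable text

async def tier2_trafilatura(url: str) -> Optional[str]:
    return None  # pretend the second strategy also failed

async def tier3_browser(url: str) -> Optional[str]:
    return f"rendered text for {url}"  # pretend the browser succeeded

TIERS: list[Callable[[str], Awaitable[Optional[str]]]] = [
    tier1_http,
    tier2_trafilatura,
    tier3_browser,
]

async def escalate(url: str, max_tier: int = 3) -> tuple[Optional[int], Optional[str]]:
    """Try tiers 1..max_tier in order; return (tier, text) for the first success."""
    for tier, fetch in enumerate(TIERS[:max_tier], start=1):
        text = await fetch(url)
        if text is not None:
            return tier, text
    return None, None  # every permitted tier failed

print(asyncio.run(escalate("https://example.com", max_tier=2)))
print(asyncio.run(escalate("https://example.com", max_tier=3)))
```

With these stubs, capping at tier 2 yields no result, while allowing tier 3 succeeds. The trade-off a real caller faces is the same: a lower cap is faster and avoids launching a browser, at the cost of failing on pages that need JavaScript.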
Serialization
Results can be converted to dictionaries or JSON:
result.to_dict() # Returns a plain dict
result.to_json() # Returns a JSON string
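As an illustration of what these two methods could look like, the sketch below defines a hypothetical stand-in class, ResultSketch, whose field names are copied from the raw JSON output shown in the Command Line section. The real ExtractionResult may be defined differently; the defaults and types here are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class ResultSketch:
    """Hypothetical stand-in; field names mirror the documented JSON keys."""
    success: bool
    url: str
    title: Optional[str] = None
    body: Optional[str] = None
    error: Optional[str] = None
    fetch_method: Optional[str] = None
    status_code: Optional[int] = None
    content_length: int = 0
    elapsed_ms: int = 0

    def to_dict(self) -> dict:
        # asdict() converts the dataclass into a plain dict of its fields
        return asdict(self)

    def to_json(self) -> str:
        # None values serialize as JSON null
        return json.dumps(self.to_dict())

failed = ResultSketch(success=False, url="https://example.com", error="timeout")
print(failed.to_json())
```

Because failures are reported as ordinary results rather than exceptions, a failed result serializes the same way as a successful one, which keeps logging and storage code uniform.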
Using in async code
import asyncio
from daz_web_extract import extract
async def main():
    urls = [
        "https://example.com",
        "https://www.iana.org/help/example-domains",
    ]
    results = await asyncio.gather(*[extract(url) for url in urls])
    for r in results:
        print(f"{r.url}: {r.title} ({r.content_length} chars)")
asyncio.run(main())
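For longer URL lists it may be worth capping how many extractions run at once, particularly if some of them end up launching a browser. The sketch below shows one common way to do that with asyncio.Semaphore. So that it runs without the library or network access, it uses fake_extract, a hypothetical stand-in for extract; gather_limited is likewise an illustrative helper, not part of the package.

```python
import asyncio

async def fake_extract(url: str) -> str:
    """Hypothetical stand-in for extract(); sleeps instead of fetching."""
    await asyncio.sleep(0.01)
    return f"done: {url}"

async def gather_limited(urls: list[str], limit: int = 2) -> list[str]:
    semaphore = asyncio.Semaphore(limit)

    async def worker(url: str) -> str:
        async with semaphore:  # at most `limit` calls are in flight at once
            return await fake_extract(url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(worker(u) for u in urls))

urls = [f"https://example.com/{i}" for i in range(5)]
print(asyncio.run(gather_limited(urls)))
```

Swapping fake_extract for the real extract should be a one-line change, since both are awaited with a single URL argument.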
Command Line
Extract content from a URL and print the result:
python run_cli.py extract https://example.com
Output:
Title: Example Domain
Method: httpx
Length: 217 chars
Time: 142ms
Example Domain
This domain is for use in illustrative examples in documents. You may use this domain
in literature without prior coordination or asking for permission.
More information...
Get raw JSON output:
python run_cli.py extract https://example.com --raw
Output:
{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "body": "Example Domain\nThis domain is for use in ...",
  "error": null,
  "fetch_method": "httpx",
  "status_code": 200,
  "content_length": 217,
  "elapsed_ms": 142
}
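The raw output is plain JSON, so other tools can consume it directly. As a small sketch, the snippet below parses a literal string shaped like the output above and builds a one-line summary. In practice the text would come from the --raw command itself; summarize is a hypothetical helper written for this example.

```python
import json

# Sample text in the shape of the --raw output above (body shortened).
raw_output = """
{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "body": "Example Domain\\nThis domain is for use in ...",
  "error": null,
  "fetch_method": "httpx",
  "status_code": 200,
  "content_length": 217,
  "elapsed_ms": 142
}
"""

def summarize(raw: str) -> str:
    """Return a one-line summary of a raw JSON result."""
    data = json.loads(raw)
    if not data["success"]:
        return f"FAILED {data['url']}: {data['error']}"
    return f"{data['title']} via {data['fetch_method']} in {data['elapsed_ms']}ms"

print(summarize(raw_output))
```

Checking the success key first mirrors the Python API, where a failed extraction is reported in the result rather than raised.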
Using the run script
The project includes a run script that automatically activates the virtual environment:
# Extract content
./run extract https://example.com
./run extract https://example.com --raw
# Run tests
./run test src/daz_web_extract/result_test.py
# Run linter
./run lint
# Run full quality checks
./run check
Development
Set up a development environment:
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
playwright install chromium
Run the tests:
pytest -q src/
Download files
File details
Details for the file daz_web_extract-0.4.0.tar.gz.
File metadata
- Download URL: daz_web_extract-0.4.0.tar.gz
- Upload date:
- Size: 27.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 108ecd2fcc2341acfd745b660970f901149c19eed1e5e215bc1edcb626dacdf7 |
| MD5 | b51d9a2ae2d502c03e3381bd21ff1634 |
| BLAKE2b-256 | 8d72f7c35c0977c841e890ae4016b2eede92b2747a2a08f771b402d9d8a54d5a |
File details
Details for the file daz_web_extract-0.4.0-py3-none-any.whl.
File metadata
- Download URL: daz_web_extract-0.4.0-py3-none-any.whl
- Upload date:
- Size: 30.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7eabd311e6cf86b277559705dc3835a434f177fcbc01fecce7e4ec40e69cda78 |
| MD5 | 7de9594029a36135c8fbb14ffd6856e3 |
| BLAKE2b-256 | 5930cda575e6b0acea459656d9833302d8fadb3a55dd818c30b0388f439f93e1 |