Async web content extraction library with three-tier fetch strategy

daz-web-extract

Async Python library that extracts a clean title and body text from a URL. It automatically escalates through three fetch strategies (a plain HTTP fetch, HTTP fetch plus trafilatura, and a headless browser) to handle everything from simple static pages to JavaScript-rendered content. It never raises exceptions: every call returns a structured result indicating success or failure.

Installation

Requires Python 3.12+.

pip install daz-web-extract

After installing, set up the browser engine for pages that require JavaScript rendering:

playwright install chromium

Usage

Python API

The library exposes a single async function extract and a result type ExtractionResult.

import asyncio
from daz_web_extract import extract, ExtractionResult

result: ExtractionResult = asyncio.run(extract("https://example.com"))

if result.success:
    print(result.title)           # Page title
    print(result.body)            # Clean body text
    print(result.fetch_method)    # Which strategy succeeded
    print(result.content_length)  # Length of body in characters
    print(result.elapsed_ms)      # Total time in milliseconds
    print(result.status_code)     # HTTP status code (if available)
else:
    print(result.error)           # Human-readable error message
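
Because failures are reported through the result rather than raised, calling code usually branches on result.success. The following is a minimal sketch of a formatting helper that relies only on the attributes shown above. The name describe is hypothetical, the error text is made up, and the SimpleNamespace objects are stand-ins for real ExtractionResult values so the snippet runs without network access.

```python
from types import SimpleNamespace


def describe(result) -> str:
    """Return a one-line summary for any object shaped like ExtractionResult."""
    if result.success:
        return f"OK [{result.fetch_method}] {result.title} ({result.content_length} chars)"
    return f"FAILED: {result.error}"


# Stand-in objects for illustration only; real values come from extract().
ok = SimpleNamespace(success=True, fetch_method="httpx",
                     title="Example Domain", content_length=217, error=None)
failed = SimpleNamespace(success=False, error="connection timed out")

print(describe(ok))
print(describe(failed))
```

With the real library, the object returned by extract() could be passed to describe directly, assuming it exposes the attributes listed above.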

Limiting fetch strategies

Use the max_tier parameter to control how far the library escalates:

import asyncio
from daz_web_extract import extract

async def main():
    # Tier 1 only: fast HTTP fetch (no trafilatura, no browser)
    result = await extract("https://example.com", max_tier=1)

    # Tiers 1-2: HTTP fetch + trafilatura, but skip the browser
    result = await extract("https://example.com", max_tier=2)

    # Tiers 1-3: all strategies, including the headless browser (default)
    result = await extract("https://example.com", max_tier=3)

asyncio.run(main())
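
To make the idea of escalation concrete, here is a conceptual sketch of a tiered fallback loop capped by max_tier. It is not the library's implementation: the three fetchers, the TIERS list, and the escalate function are hypothetical stand-ins that return canned values, so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, Optional


# Hypothetical fetchers standing in for the three tiers. Each returns
# extracted text, or None when that tier cannot handle the page.
async def plain_http(url: str) -> Optional[str]:
    return None  # pretend the static fetch found no usable text


async def http_with_trafilatura(url: str) -> Optional[str]:
    return None  # pretend extraction still came back empty


async def headless_browser(url: str) -> Optional[str]:
    return f"rendered text for {url}"


TIERS: list[Callable[[str], Awaitable[Optional[str]]]] = [
    plain_http, http_with_trafilatura, headless_browser,
]


async def escalate(url: str, max_tier: int = 3) -> tuple[Optional[int], Optional[str]]:
    """Try tiers 1..max_tier in order; return (tier, text) for the first hit."""
    for tier, fetch in enumerate(TIERS[:max_tier], start=1):
        text = await fetch(url)
        if text is not None:
            return tier, text
    return None, None  # no allowed tier produced text


capped = asyncio.run(escalate("https://example.com", max_tier=2))
full = asyncio.run(escalate("https://example.com", max_tier=3))
print(capped)
print(full)
```

In this toy setup only the third tier produces text, so capping at max_tier=2 yields no result while the default reaches the browser stand-in. The trade-off it illustrates is that a lower cap avoids the slower tiers at the cost of missing pages that need them.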

Serialization

Results can be converted to dictionaries or JSON:

result.to_dict()  # Returns a plain dict
result.to_json()  # Returns a JSON string
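
One common use of dictionary output is writing a batch of results as JSON Lines. The sketch below assumes to_dict() returns keys like those in the JSON shown in the Command Line section. The write_jsonl helper is hypothetical, and the records are hand-written stand-ins rather than real extraction output.

```python
import io
import json


def write_jsonl(records, stream) -> int:
    """Write one JSON object per line; return the number of lines written."""
    count = 0
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# Stand-in dicts shaped like the assumed to_dict() output; values are illustrative.
records = [
    {"success": True, "url": "https://example.com", "title": "Example Domain"},
    {"success": False, "url": "https://example.invalid", "error": "fetch failed"},
]
buffer = io.StringIO()
written = write_jsonl(records, buffer)
print(written)
```

With real results, the records argument could be something like (r.to_dict() for r in results), and the stream an open file instead of the in-memory buffer.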

Using in async code

import asyncio
from daz_web_extract import extract

async def main():
    urls = [
        "https://example.com",
        "https://www.iana.org/help/example-domains",
    ]
    results = await asyncio.gather(*[extract(url) for url in urls])
    for r in results:
        print(f"{r.url}: {r.title} ({r.content_length} chars)")

asyncio.run(main())
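
For larger URL lists, it may be worth capping how many extractions run at once, particularly if the browser tier is in play. A minimal sketch using asyncio.Semaphore follows. The gather_bounded helper and the limit parameter are hypothetical, and fake_extract is a stand-in for extract() so the example runs offline.

```python
import asyncio


async def gather_bounded(fetch, urls, limit=5):
    """Run fetch(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(url):
        async with semaphore:
            return await fetch(url)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(worker(url) for url in urls))


# Stand-in for extract() so the sketch runs without network access.
async def fake_extract(url):
    await asyncio.sleep(0)
    return f"fetched {url}"


results = asyncio.run(gather_bounded(fake_extract, ["a", "b", "c"], limit=2))
print(results)
```

With the real library, extract would be passed in place of fake_extract. Since the description above states that extract never raises, return_exceptions should not be needed here, though that is worth confirming against the installed version.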

Command Line

Extract content from a URL and print the result:

python run_cli.py extract https://example.com

Output:

Title: Example Domain
Method: httpx
Length: 217 chars
Time: 142ms

Example Domain
This domain is for use in illustrative examples in documents. You may use this domain
in literature without prior coordination or asking for permission.
More information...

Get raw JSON output:

python run_cli.py extract https://example.com --raw

Output:

{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "body": "Example Domain\nThis domain is for use in ...",
  "error": null,
  "fetch_method": "httpx",
  "status_code": 200,
  "content_length": 217,
  "elapsed_ms": 142
}
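
The raw output is plain JSON, so another program can consume it with the standard json module. The sketch below parses the sample output shown above from a string. How the text reaches the program (a pipe, a file, a subprocess call) is left open.

```python
import json

# Sample text copied from the --raw output shown above.
raw_output = """
{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "body": "Example Domain\\nThis domain is for use in ...",
  "error": null,
  "fetch_method": "httpx",
  "status_code": 200,
  "content_length": 217,
  "elapsed_ms": 142
}
"""

data = json.loads(raw_output)
if data["success"]:
    print(f'{data["title"]} via {data["fetch_method"]} in {data["elapsed_ms"]}ms')
else:
    print(f'error: {data["error"]}')
```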

Using via the `run` script

The project includes a run script that automatically activates the virtual environment:

# Extract content
./run extract https://example.com
./run extract https://example.com --raw

# Run tests
./run test src/daz_web_extract/result_test.py

# Run linter
./run lint

# Run full quality checks
./run check

Development

Set up a development environment:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
playwright install chromium

Run the tests:

pytest -q src/

Release history

0.4.0 (this version): Feb 6, 2026
0.2.0: Feb 6, 2026

Download files

Source Distribution

daz_web_extract-0.4.0.tar.gz (27.8 kB view details)

Uploaded Feb 6, 2026 Source

Built Distribution

daz_web_extract-0.4.0-py3-none-any.whl (30.9 kB view details)

Uploaded Feb 6, 2026 Python 3

File details

Details for the file daz_web_extract-0.4.0.tar.gz.

File metadata

Download URL: daz_web_extract-0.4.0.tar.gz
Upload date: Feb 6, 2026
Size: 27.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for daz_web_extract-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`108ecd2fcc2341acfd745b660970f901149c19eed1e5e215bc1edcb626dacdf7`
MD5	`b51d9a2ae2d502c03e3381bd21ff1634`
BLAKE2b-256	`8d72f7c35c0977c841e890ae4016b2eede92b2747a2a08f771b402d9d8a54d5a`

File details

Details for the file daz_web_extract-0.4.0-py3-none-any.whl.

File metadata

Download URL: daz_web_extract-0.4.0-py3-none-any.whl
Upload date: Feb 6, 2026
Size: 30.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for daz_web_extract-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7eabd311e6cf86b277559705dc3835a434f177fcbc01fecce7e4ec40e69cda78`
MD5	`7de9594029a36135c8fbb14ffd6856e3`
BLAKE2b-256	`5930cda575e6b0acea459656d9833302d8fadb3a55dd818c30b0388f439f93e1`
