Async web content extraction library with three-tier fetch strategy

daz-web-extract

Async Python library that extracts a clean title and body text from any URL. It automatically escalates through multiple fetch strategies to handle everything from simple static pages to JavaScript-rendered content. It never raises exceptions: every call returns a structured result indicating success or failure.

Installation

Requires Python 3.12+.

pip install daz-web-extract

After installing, set up the browser engine for pages that require JavaScript rendering:

playwright install chromium

Usage

Python API

The library exposes a single async function, extract, and a result type, ExtractionResult.

import asyncio
from daz_web_extract import extract, ExtractionResult

result: ExtractionResult = asyncio.run(extract("https://example.com"))

if result.success:
    print(result.title)           # Page title
    print(result.body)            # Clean body text
    print(result.fetch_method)    # Which strategy succeeded
    print(result.content_length)  # Length of body in characters
    print(result.elapsed_ms)      # Total time in milliseconds
    print(result.status_code)     # HTTP status code (if available)
else:
    print(result.error)           # Human-readable error message

Limiting fetch strategies

Use the max_tier parameter to control how far the library escalates:

# Only use fast HTTP fetch (no browser, no trafilatura)
result = await extract("https://example.com", max_tier=1)

# Use HTTP fetch + trafilatura, but skip the browser
result = await extract("https://example.com", max_tier=2)

# Use all strategies including headless browser (default)
result = await extract("https://example.com", max_tier=3)
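
To make the escalation idea concrete, here is a minimal sketch of a tiered fallback loop that converts every failure into a result instead of raising. It is not the library's implementation: Outcome, escalate, and the three fake fetchers are hypothetical names invented for this illustration, and the tier labels only echo the strategies described above.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class Outcome:
    """Hypothetical stand-in for a structured result (not the library's type)."""
    success: bool
    fetch_method: Optional[str] = None
    body: Optional[str] = None
    error: Optional[str] = None


# A fetcher takes a URL and returns body text, or raises on failure.
Fetcher = Callable[[str], Awaitable[str]]


async def escalate(url: str, tiers: list[tuple[str, Fetcher]], max_tier: int = 3) -> Outcome:
    """Try each tier in order, up to max_tier, and never raise."""
    last_error = "no tiers attempted"
    for name, fetch in tiers[:max_tier]:
        try:
            body = await fetch(url)
        except Exception as exc:  # turn any failure into data
            last_error = f"{name}: {exc}"
            continue
        if body.strip():
            return Outcome(success=True, fetch_method=name, body=body)
        last_error = f"{name}: empty body"
    return Outcome(success=False, error=last_error)


# Fake fetchers (assumptions for the demo): tier 1 fails, tier 2 finds nothing,
# tier 3 succeeds.
async def fake_http(url: str) -> str:
    raise RuntimeError("blocked")


async def fake_trafilatura(url: str) -> str:
    return ""


async def fake_browser(url: str) -> str:
    return "rendered text"


TIERS = [("http", fake_http), ("trafilatura", fake_trafilatura), ("browser", fake_browser)]

full = asyncio.run(escalate("https://example.com", TIERS))
limited = asyncio.run(escalate("https://example.com", TIERS, max_tier=2))
print(full.fetch_method)
print(limited.success, limited.error)
```

With max_tier=2 the browser tier is never reached, so the sketch reports a failure carrying the last tier's error rather than raising.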

Serialization

Results can be converted to dictionaries or JSON:

result.to_dict()  # Returns a plain dict
result.to_json()  # Returns a JSON string
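
For readers who want to see what such a result could look like, the following sketch mirrors the fields shown in this README's JSON output using a frozen dataclass. ResultSketch is a hypothetical name, and the defaults are assumptions; the real ExtractionResult may be defined differently.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ResultSketch:
    """Hypothetical mirror of the fields shown in this README's JSON output."""
    success: bool
    url: str
    title: Optional[str] = None
    body: Optional[str] = None
    error: Optional[str] = None
    fetch_method: Optional[str] = None
    status_code: Optional[int] = None
    content_length: int = 0
    elapsed_ms: int = 0

    def to_dict(self) -> dict:
        # asdict walks the dataclass fields and returns a plain dict
        return asdict(self)

    def to_json(self) -> str:
        return json.dumps(self.to_dict())


sketch = ResultSketch(success=False, url="https://example.com", error="timeout")
round_trip = json.loads(sketch.to_json())
print(round_trip["error"])
```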

Using in async code

import asyncio
from daz_web_extract import extract

async def main():
    urls = [
        "https://example.com",
        "https://www.iana.org/help/example-domains",
    ]
    results = await asyncio.gather(*[extract(url) for url in urls])
    for r in results:
        print(f"{r.url}: {r.title} ({r.content_length} chars)")

asyncio.run(main())
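
When the URL list is long, it may be worth capping how many extractions run at once, especially if the browser tier is involved. The sketch below shows one way to do that with asyncio.Semaphore. extract_many is a hypothetical helper, and fake_extract is a stand-in so the example runs offline; in real use you would pass the library's extract function instead.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def extract_many(
    urls: list[str],
    extract_fn: Callable[[str], Awaitable[Any]],
    limit: int = 5,
) -> list[Any]:
    """Run extract_fn over urls with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> Any:
        async with semaphore:
            return await extract_fn(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


# Counters used only to observe concurrency in this demo.
in_flight = 0
peak = 0


async def fake_extract(url: str) -> str:
    """Stand-in for a real extraction call (assumption for the demo)."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return f"body of {url}"


urls = [f"https://example.com/{i}" for i in range(10)]
bodies = asyncio.run(extract_many(urls, fake_extract, limit=3))
print(len(bodies), peak)
```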

Command Line

Extract content from a URL and print the result:

python run_cli.py extract https://example.com

Output:

Title: Example Domain
Method: httpx
Length: 217 chars
Time: 142ms

Example Domain
This domain is for use in illustrative examples in documents. You may use this domain
in literature without prior coordination or asking for permission.
More information...

Get raw JSON output:

python run_cli.py extract https://example.com --raw

Output:

{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "body": "Example Domain\nThis domain is for use in ...",
  "error": null,
  "fetch_method": "httpx",
  "status_code": 200,
  "content_length": 217,
  "elapsed_ms": 142
}
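
The raw output is plain JSON, so another program can consume it with the standard json module. The sketch below parses a copy of the sample above (the body is truncated there, and is kept truncated here) and branches on the success field; how you capture the CLI's output is left out.

```python
import json

# Sample copied from the output above; the body is truncated in the source.
raw = """
{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "body": "Example Domain\\nThis domain is for use in ...",
  "error": null,
  "fetch_method": "httpx",
  "status_code": 200,
  "content_length": 217,
  "elapsed_ms": 142
}
"""

data = json.loads(raw)
if data["success"]:
    # The JSON escape \n decodes to a real newline in the body
    first_line = data["body"].splitlines()[0]
    print(f'{data["title"]} via {data["fetch_method"]} in {data["elapsed_ms"]}ms')
    print(first_line)
else:
    print(f'failed: {data["error"]}')
```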

Using the run script

The project includes a run script that automatically activates the virtual environment:

# Extract content
./run extract https://example.com
./run extract https://example.com --raw

# Run tests
./run test src/daz_web_extract/result_test.py

# Run linter
./run lint

# Run full quality checks
./run check

Development

Set up a development environment:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
playwright install chromium

Run the tests:

pytest -q src/
