Async web content extraction library with three-tier fetch strategy

daz-web-extract

Async Python library that extracts a clean title and body text from a URL. It automatically escalates through three fetch tiers (a fast HTTP fetch, an HTTP fetch with trafilatura, and a headless browser) to handle everything from simple static pages to JavaScript-rendered content. It never raises exceptions: every call returns a structured result indicating success or failure.

Installation

Requires Python 3.12+.

pip install daz-web-extract

After installing, set up the browser engine for pages that require JavaScript rendering:

playwright install chromium

Usage

Python API

The library exposes a single async function extract and a result type ExtractionResult.

import asyncio
from daz_web_extract import extract, ExtractionResult

result: ExtractionResult = asyncio.run(extract("https://example.com"))

if result.success:
    print(result.title)           # Page title
    print(result.body)            # Clean body text
    print(result.fetch_method)    # Which strategy succeeded
    print(result.content_length)  # Length of body in characters
    print(result.elapsed_ms)      # Total time in milliseconds
    print(result.status_code)     # HTTP status code (if available)
else:
    print(result.error)           # Human-readable error message

Limiting fetch strategies

Use the max_tier parameter to control how far the library escalates. The calls below use await, so they must run inside a coroutine:

# Only use fast HTTP fetch (no browser, no trafilatura)
result = await extract("https://example.com", max_tier=1)

# Use HTTP fetch + trafilatura, but skip the browser
result = await extract("https://example.com", max_tier=2)

# Use all strategies including headless browser (default)
result = await extract("https://example.com", max_tier=3)
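
To make the tier cap concrete, here is a minimal sketch of how an escalation loop of this kind could work. It is an illustration only, not the library's implementation: TierOutcome, escalate, the tier names, and the three stub strategies are all hypothetical and stand in for the real HTTP, trafilatura, and browser tiers.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class TierOutcome:
    """Hypothetical stand-in for a result; not the library's ExtractionResult."""
    success: bool
    fetch_method: Optional[str] = None
    body: Optional[str] = None
    error: Optional[str] = None


# A strategy takes a URL and returns body text, or None if nothing usable was found.
Strategy = Callable[[str], Awaitable[Optional[str]]]


async def escalate(
    url: str,
    strategies: list[tuple[str, Strategy]],
    max_tier: int = 3,
) -> TierOutcome:
    """Try strategies in order, stopping at the first one that yields text."""
    last_error = "no strategy was tried"
    for name, strategy in strategies[:max_tier]:
        try:
            body = await strategy(url)
        except Exception as exc:
            # Failures become data on the result instead of propagating.
            last_error = f"{name}: {exc}"
            continue
        if body:
            return TierOutcome(success=True, fetch_method=name, body=body)
        last_error = f"{name}: empty body"
    return TierOutcome(success=False, error=last_error)


# Stub strategies simulating a page that only renders with JavaScript.
async def fast_http(url: str) -> Optional[str]:
    return None  # the static fetch finds no text


async def http_with_extractor(url: str) -> Optional[str]:
    raise RuntimeError("no main content found")


async def headless_browser(url: str) -> Optional[str]:
    return "rendered text"


TIERS = [
    ("tier1", fast_http),
    ("tier2", http_with_extractor),
    ("tier3", headless_browser),
]

full = asyncio.run(escalate("https://example.com", TIERS))
capped = asyncio.run(escalate("https://example.com", TIERS, max_tier=2))
print(full.fetch_method, capped.success)
```

In this sketch a lower cap trades coverage for speed: the capped call never reaches the browser stub, so it reports a failure rather than paying for a render.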

Serialization

Results can be converted to dictionaries or JSON:

result.to_dict()  # Returns a plain dict
result.to_json()  # Returns a JSON string
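
One common use of to_dict() is storing a batch of results as JSON Lines. The sketch below assumes only that to_dict() returns a JSON-serializable dict; write_jsonl and FakeResult are hypothetical names used for illustration, and FakeResult stands in for ExtractionResult so the example runs without network access.

```python
import io
import json
from typing import Iterable, Protocol, TextIO


class HasToDict(Protocol):
    """Anything exposing to_dict(), such as the results described above."""

    def to_dict(self) -> dict: ...


def write_jsonl(results: Iterable[HasToDict], out: TextIO) -> int:
    """Write one compact JSON object per line and return the number of lines."""
    count = 0
    for result in results:
        # json.dumps without indent emits a single line, as JSON Lines requires.
        out.write(json.dumps(result.to_dict(), ensure_ascii=False) + "\n")
        count += 1
    return count


class FakeResult:
    """Hypothetical stand-in carrying just two fields."""

    def __init__(self, url: str, success: bool) -> None:
        self.url = url
        self.success = success

    def to_dict(self) -> dict:
        return {"url": self.url, "success": self.success}


buffer = io.StringIO()
written = write_jsonl(
    [FakeResult("https://example.com", True), FakeResult("https://example.org", False)],
    buffer,
)
rows = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(written, rows[1]["success"])
```

Because failed extractions are ordinary results rather than exceptions, they land in the same file and can be filtered later on the success field.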

Using in async code

import asyncio
from daz_web_extract import extract

async def main():
    urls = [
        "https://example.com",
        "https://www.iana.org/help/example-domains",
    ]
    results = await asyncio.gather(*[extract(url) for url in urls])
    for r in results:
        print(f"{r.url}: {r.title} ({r.content_length} chars)")

asyncio.run(main())
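
With many URLs, an unbounded asyncio.gather starts every extraction at once, which can be heavy when pages escalate to the browser tier. A minimal sketch of capping concurrency with asyncio.Semaphore follows. gather_bounded is a hypothetical helper, not part of the library, and fake_extract is a stub standing in for extract so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def gather_bounded(
    urls: list[str],
    fetch: Callable[[str], Awaitable[T]],
    limit: int = 5,
) -> list[T]:
    """Run fetch(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(url: str) -> T:
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(url) for url in urls))


# Counters used by the stub to record how many calls overlap.
in_flight = 0
peak = 0


async def fake_extract(url: str) -> str:
    """Stub that simulates a slow fetch and tracks concurrency."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return url.upper()


demo_urls = [f"https://example.com/{i}" for i in range(10)]
demo_results = asyncio.run(gather_bounded(demo_urls, fake_extract, limit=3))
print(len(demo_results), peak)
```

With the real library, passing extract as the fetch argument should work the same way, since it is an async function that takes a URL.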

Command Line

Extract content from a URL and print the result:

python run_cli.py extract https://example.com

Output:

Title: Example Domain
Method: httpx
Length: 217 chars
Time: 142ms

Example Domain
This domain is for use in illustrative examples in documents. You may use this domain
in literature without prior coordination or asking for permission.
More information...

Get raw JSON output:

python run_cli.py extract https://example.com --raw

Output:

{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "body": "Example Domain\nThis domain is for use in ...",
  "error": null,
  "fetch_method": "httpx",
  "status_code": 200,
  "content_length": 217,
  "elapsed_ms": 142
}
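
The raw output is plain JSON, so another program can consume it with the standard json module. The sketch below parses the sample shown above; summarize is a hypothetical helper, and the failure payload in the second call is an invented example of the same shape, not captured output.

```python
import json

# Sample taken from the output above; a raw string keeps the \n escape intact.
RAW_OUTPUT = r"""
{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "body": "Example Domain\nThis domain is for use in ...",
  "error": null,
  "fetch_method": "httpx",
  "status_code": 200,
  "content_length": 217,
  "elapsed_ms": 142
}
"""


def summarize(raw: str) -> str:
    """Turn one raw JSON result into a single-line summary."""
    data = json.loads(raw)
    if not data["success"]:
        # On failure, only the url and error fields are relied upon here.
        return f"FAILED {data['url']}: {data['error']}"
    return (
        f"{data['title']} via {data['fetch_method']} "
        f"({data['content_length']} chars, {data['elapsed_ms']} ms)"
    )


summary = summarize(RAW_OUTPUT)
failure = summarize('{"success": false, "url": "https://example.com", "error": "timeout"}')
print(summary)
print(failure)
```

Checking success before reading title or body mirrors the Python API, where a failed call sets error instead of raising.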

Using via the run script

The project includes a run script that automatically activates the virtual environment:

# Extract content
./run extract https://example.com
./run extract https://example.com --raw

# Run tests
./run test src/daz_web_extract/result_test.py

# Run linter
./run lint

# Run full quality checks
./run check

Development

Set up a development environment:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
playwright install chromium

Run the tests:

pytest -q src/
