Two-stage web article extractor (trafilatura + Playwright) with YouTube transcript fetching.

web-article-extractor

A small, dependency-light toolkit for pulling readable content off the web. It extracts article text from any URL using a two-stage strategy — trafilatura first (fast, no browser), then a headless Playwright Chromium fallback for JavaScript-heavy pages — and fetches transcripts from YouTube videos through a manual-then-automatic subtitle cascade.
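
The two-stage strategy described above can be pictured as a simple fallback chain. The sketch below is not the package's implementation: `extract_with_fallback`, `fast_stage`, `browser_stage`, the `min_words` threshold, and the `"fast"`/`"browser"` labels are all hypothetical, and the stages are passed in as plain callables so the example runs without trafilatura or Playwright installed.

```python
from typing import Callable, Optional

# Hypothetical type: a stage takes a URL and returns extracted text, or None on failure.
Stage = Callable[[str], Optional[str]]


def extract_with_fallback(
    url: str,
    fast_stage: Stage,
    browser_stage: Stage,
    min_words: int = 50,  # assumed threshold for "good enough" output
) -> Optional[dict]:
    """Try the cheap stage first; use the browser stage only if that falls short."""
    for method, stage in (("fast", fast_stage), ("browser", browser_stage)):
        try:
            text = stage(url)
        except Exception:
            text = None  # treat a crashing stage like an empty result
        if text and len(text.split()) >= min_words:
            return {"url": url, "content": text, "source_method": method}
    return None  # both stages failed or returned too little text
```

In a real setup the first callable would presumably wrap trafilatura and the second a headless Chromium page load; ordering them this way means the browser cost is only paid when the fast path comes back empty.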

Naming. The PyPI distribution is harnais-web-extractor (this repository keeps the name web-article-extractor). The Python import package is web_article_extractor.

Install

From PyPI:

pip install harnais-web-extractor
# Playwright also needs a browser binary the first time:
playwright install chromium
# YouTube transcript extraction requires yt-dlp (optional extra):
pip install "harnais-web-extractor[youtube]"

Or from source (GitHub):

pip install git+https://github.com/JohnLinotte/web-article-extractor.git

Usage

Command line

# Extract an article as Markdown (default):
python -m web_article_extractor https://example.com/some-article

# As JSON:
python -m web_article_extractor https://example.com/some-article --format json

# A YouTube URL fetches the transcript instead:
python -m web_article_extractor "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

YouTube transcripts require yt-dlp. Install it via the youtube extra (pip install "harnais-web-extractor[youtube]") or provide any yt-dlp binary on your PATH. Without it, fetch_transcript() cannot run.

yt-dlp tuning (optional env vars, all empty by default so the package works on any machine):

YT_DLP_COOKIES_FROM_BROWSER=firefox — pass --cookies-from-browser firefox to yt-dlp (needed for age-restricted or members-only videos).

YT_DLP_JS_RUNTIME=node — pass --js-runtime node.

YT_DLP_BIN=/path/to/yt-dlp — override the binary location.
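
To make the effect of these variables concrete, here is a minimal sketch of how they could be turned into a yt-dlp argument list. `build_yt_dlp_command` is a hypothetical helper, not part of the package, and the flag spellings are copied from the list above rather than checked against a particular yt-dlp release.

```python
import os
import shutil
from typing import Mapping, Optional


def build_yt_dlp_command(url: str, env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Assemble a yt-dlp command line from the optional tuning variables."""
    env = os.environ if env is None else env

    # An explicit override wins; otherwise look on PATH, else fall back to the bare name.
    binary = env.get("YT_DLP_BIN") or shutil.which("yt-dlp") or "yt-dlp"
    cmd = [binary]

    browser = env.get("YT_DLP_COOKIES_FROM_BROWSER")
    if browser:  # empty or unset means "do not pass the flag"
        cmd += ["--cookies-from-browser", browser]

    runtime = env.get("YT_DLP_JS_RUNTIME")
    if runtime:
        cmd += ["--js-runtime", runtime]  # spelling as given in this README

    cmd.append(url)
    return cmd
```

Because every variable is skipped when empty, an unconfigured machine ends up with just the binary and the URL.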

Python API

from web_article_extractor import extract_article, fetch_transcript, is_youtube_url

result = extract_article("https://example.com/some-article")
if result:
    print(result["title"])
    print(result["content"])      # Markdown
    print(result["word_count"])

url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
if is_youtube_url(url):
    transcript = fetch_transcript(url)
    if transcript:
        print(transcript["text"])

extract_article returns a dict with url, title, content, source_method, extracted_at and word_count, or None when extraction fails entirely.
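
Since the return value is either a dict or None, callers usually want to handle both cases in one place. The sketch below models the documented keys as a `TypedDict` and serializes a result to a JSON line. `ArticleResult` and `to_json_line` are hypothetical names, the value types are assumptions (the description above only lists the keys), and `sample` holds placeholder values rather than real output.

```python
import json
from typing import Optional, TypedDict


class ArticleResult(TypedDict):
    # Keys come from the description above; the value types are assumptions.
    url: str
    title: str
    content: str
    source_method: str
    extracted_at: str
    word_count: int


def to_json_line(result: Optional[ArticleResult]) -> Optional[str]:
    """Serialize a result to one JSON line, or return None if extraction failed."""
    if result is None:
        return None
    return json.dumps(result, ensure_ascii=False)


# Placeholder values only, standing in for a real extract_article() result.
sample: ArticleResult = {
    "url": "https://example.com/some-article",
    "title": "Example",
    "content": "Hello from example.",
    "source_method": "placeholder",
    "extracted_at": "2026-06-18T00:00:00Z",
    "word_count": 3,
}
```

Passing the output of `extract_article(...)` straight into a helper like this keeps the None check out of the calling code.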

License

MIT — see LICENSE.

Project details

Release history

0.1.2 (this version): Jun 18, 2026

0.1.1: Jun 18, 2026

0.1.0: Jun 18, 2026

Download files

Source Distribution

harnais_web_extractor-0.1.2.tar.gz (13.4 kB)

Uploaded Jun 18, 2026 Source

Built Distribution

harnais_web_extractor-0.1.2-py3-none-any.whl (13.9 kB)

Uploaded Jun 18, 2026 Python 3

File details

Details for the file harnais_web_extractor-0.1.2.tar.gz.

File metadata

Download URL: harnais_web_extractor-0.1.2.tar.gz
Upload date: Jun 18, 2026
Size: 13.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for harnais_web_extractor-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`362388526a0cadee01168e5465ba4bbe389449ee5d85184807db02387bc83965`
MD5	`2438e3c37db7dd5d204cd881e3e80434`
BLAKE2b-256	`de32c8fe7eacb006c0dc2ed186282c91571a05597a9b2e883f39cad412ef27c2`

File details

Details for the file harnais_web_extractor-0.1.2-py3-none-any.whl.

File metadata

Download URL: harnais_web_extractor-0.1.2-py3-none-any.whl
Upload date: Jun 18, 2026
Size: 13.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for harnais_web_extractor-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cbbc7b358c8f41598a04df179dd6346d64554965ab62dc82c162a4b826565d09`
MD5	`77a32596f8550204289cbef253309842`
BLAKE2b-256	`421bc3781c34406e2d808e388cc130dace41680cd03a4c6e31884fbcd28d37a9`
