Two-stage web article extractor (trafilatura + Playwright) with YouTube transcript fetching.
web-article-extractor
A small, dependency-light toolkit for pulling readable content off the web. It
extracts article text from any URL using a two-stage strategy — trafilatura
first (fast, no browser), then a headless Playwright Chromium fallback for
JavaScript-heavy pages — and fetches transcripts from YouTube videos through a
manual-then-automatic subtitle cascade.
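The two-stage strategy can be pictured as a simple fallback chain. The sketch below is illustrative only and is not the package's actual code: `extract_two_stage`, the `Extractor` type, and the `min_words` threshold are all hypothetical, and the two stages are passed in as plain callables rather than real trafilatura or Playwright calls.

```python
from typing import Callable, Optional, Tuple

# Hypothetical type: an extractor takes a URL and returns text, or None on failure.
Extractor = Callable[[str], Optional[str]]


def extract_two_stage(
    url: str,
    fast: Extractor,
    fallback: Extractor,
    min_words: int = 50,  # assumed threshold, not taken from the package
) -> Optional[Tuple[str, str]]:
    """Try the cheap extractor first; use the browser fallback only if needed.

    Returns (stage_name, text), or None when both stages fail.
    """
    for name, stage in (("fast", fast), ("fallback", fallback)):
        try:
            text = stage(url)
        except Exception:
            text = None  # treat a crashing stage like an empty result
        # Accept the first stage that yields enough words to look like an article.
        if text and len(text.split()) >= min_words:
            return name, text
    return None
```

Keeping the stages as injected callables means the expensive browser path only runs when the fast path comes back empty or too short, and either stage can be swapped out in tests.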
Naming. The PyPI distribution is harnais-web-extractor (this repository keeps the name web-article-extractor). The Python import package is web_article_extractor.
Install
From PyPI:
pip install harnais-web-extractor
# Playwright also needs a browser binary the first time:
playwright install chromium
# YouTube transcript extraction requires yt-dlp (optional extra):
pip install "harnais-web-extractor[youtube]"
Or from source (GitHub):
pip install git+https://github.com/JohnLinotte/web-article-extractor.git
Usage
Command line
# Extract an article as Markdown (default):
python -m web_article_extractor https://example.com/some-article
# As JSON:
python -m web_article_extractor https://example.com/some-article --format json
# A YouTube URL fetches the transcript instead:
python -m web_article_extractor "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
YouTube transcripts require yt-dlp. Install it via the youtube extra (pip install "harnais-web-extractor[youtube]") or provide any yt-dlp binary on your PATH. Without it, fetch_transcript() cannot run.
yt-dlp tuning (optional env vars, all empty by default so the package works on any machine):
- YT_DLP_COOKIES_FROM_BROWSER=firefox: pass --cookies-from-browser firefox to yt-dlp (needed for age-restricted or members-only videos).
- YT_DLP_JS_RUNTIME=node: pass --js-runtime node.
- YT_DLP_BIN=/path/to/yt-dlp: override the binary location.
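To make the effect of these variables concrete, here is a minimal sketch of how they could be turned into a yt-dlp argument list. It is an assumption about the mapping, not the package's implementation: `build_yt_dlp_command` is a hypothetical helper, and the flag spellings are copied from the list above rather than checked against a particular yt-dlp version.

```python
import os
from typing import List, Mapping, Optional


def build_yt_dlp_command(url: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Assemble a yt-dlp argument list from the optional tuning variables."""
    env = os.environ if env is None else env
    # YT_DLP_BIN overrides the binary; unset or empty falls back to PATH lookup.
    cmd = [env.get("YT_DLP_BIN") or "yt-dlp"]
    browser = env.get("YT_DLP_COOKIES_FROM_BROWSER")
    if browser:
        cmd += ["--cookies-from-browser", browser]
    runtime = env.get("YT_DLP_JS_RUNTIME")
    if runtime:
        # Flag spelling taken from the description above.
        cmd += ["--js-runtime", runtime]
    cmd.append(url)
    return cmd
```

Because unset and empty values are both skipped, the default command is just the binary name followed by the URL, which matches the "works on any machine" default.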
Python API
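from web_article_extractor import extract_article, fetch_transcript, is_youtube_url

# Extract an article; the result is None when extraction fails entirely.
result = extract_article("https://example.com/some-article")
if result:
    print(result["title"])
    print(result["content"])  # Markdown
    print(result["word_count"])

# For a YouTube URL, fetch the transcript instead.
url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
if is_youtube_url(url):
    transcript = fetch_transcript(url)
    if transcript:
        print(transcript["text"])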
extract_article returns a dict with url, title, content,
source_method, extracted_at and word_count, or None when extraction
fails entirely.
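Since the return value is either a dict or None, callers need to handle both cases. The sketch below shows one way to do that with a typed view of the result. `ArticleResult` and `describe` are hypothetical names, and the value types (strings, plus an integer word count) are assumptions; only the key names come from the description above.

```python
from typing import Optional, TypedDict


class ArticleResult(TypedDict):
    """Hypothetical type mirroring the documented keys; value types are assumed."""
    url: str
    title: str
    content: str
    source_method: str
    extracted_at: str
    word_count: int


def describe(result: Optional[ArticleResult]) -> str:
    """Return a one-line summary, covering the None (total failure) case."""
    if result is None:
        return "extraction failed"
    return (
        f'{result["title"]} '
        f'({result["word_count"]} words, via {result["source_method"]})'
    )
```

A call such as `describe(extract_article(url))` would then give a printable summary whether or not extraction succeeded.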
License
MIT — see LICENSE.