Two-stage web article extractor (trafilatura + Playwright) with YouTube transcript fetching.

Project description

web-article-extractor

A small, dependency-light toolkit for pulling readable content off the web. It extracts article text from any URL using a two-stage strategy — trafilatura first (fast, no browser), then a headless Playwright Chromium fallback for JavaScript-heavy pages — and fetches transcripts from YouTube videos through a manual-then-automatic subtitle cascade.
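The two-stage strategy can be pictured as an ordered list of extractors tried until one returns enough text. The sketch below is illustrative only and is not the package's actual implementation: `extract_with_fallback`, `fast_stage`, `browser_stage` and the `min_words` threshold are hypothetical names and values chosen for this example, and the result dict carries only a subset of the fields that `extract_article` is documented to return. The trafilatura and Playwright calls are imported lazily so the browser dependency is only needed if the fallback actually runs.

```python
from typing import Callable, Iterable, Optional, Tuple

Stage = Tuple[str, Callable[[str], Optional[str]]]


def fast_stage(url: str) -> Optional[str]:
    """Stage 1 (assumed shape): plain HTTP fetch plus trafilatura, no browser."""
    import trafilatura  # lazy import: only needed when this stage runs

    html = trafilatura.fetch_url(url)
    if html is None:
        return None
    return trafilatura.extract(html)


def browser_stage(url: str) -> Optional[str]:
    """Stage 2 (assumed shape): render in headless Chromium, then extract."""
    import trafilatura
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            html = page.content()
        finally:
            browser.close()
    return trafilatura.extract(html)


def extract_with_fallback(
    url: str, stages: Iterable[Stage], min_words: int = 50
) -> Optional[dict]:
    """Try each stage in order; return the first result with enough words."""
    for name, stage in stages:
        try:
            text = stage(url)
        except Exception:
            # A failing stage should not abort the cascade; move on to the next.
            text = None
        if text and len(text.split()) >= min_words:
            return {
                "url": url,
                "content": text,
                "source_method": name,
                "word_count": len(text.split()),
            }
    return None  # every stage failed or returned too little text
```

A call such as `extract_with_fallback(url, [("trafilatura", fast_stage), ("playwright", browser_stage)])` would only launch Chromium when the fast path comes back empty or too short.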

Naming. The PyPI distribution is harnais-web-extractor (this repository keeps the name web-article-extractor). The Python import package is web_article_extractor.

Install

From PyPI:

pip install harnais-web-extractor
# Playwright also needs a browser binary the first time:
playwright install chromium
# YouTube transcript extraction requires yt-dlp (optional extra):
pip install "harnais-web-extractor[youtube]"

Or from source (GitHub):

pip install git+https://github.com/JohnLinotte/web-article-extractor.git

Usage

Command line

# Extract an article as Markdown (default):
python -m web_article_extractor https://example.com/some-article

# As JSON:
python -m web_article_extractor https://example.com/some-article --format json

# A YouTube URL fetches the transcript instead:
python -m web_article_extractor "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

YouTube transcripts require yt-dlp. Install it via the youtube extra (pip install "harnais-web-extractor[youtube]") or provide any yt-dlp binary on your PATH. Without it, fetch_transcript() cannot run.
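Routing a URL to the transcript path needs some way of recognising YouTube links. The package exposes `is_youtube_url` for this; its internals are not shown here, so the following is only a rough sketch of how such a check could be written with the standard library. The function name `looks_like_youtube_url` and the host list are assumptions made for this example.

```python
from urllib.parse import urlparse

# Hostnames treated as YouTube in this sketch (an assumed, non-exhaustive list).
_YOUTUBE_HOSTS = {
    "youtube.com",
    "www.youtube.com",
    "m.youtube.com",
    "music.youtube.com",
    "youtu.be",
}


def looks_like_youtube_url(url: str) -> bool:
    """Return True when the URL's hostname is one of the known YouTube hosts."""
    host = (urlparse(url).hostname or "").lower()
    return host in _YOUTUBE_HOSTS
```

Comparing the parsed hostname, rather than searching the whole string for "youtube", avoids false positives on URLs that merely mention YouTube in their path or query.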

yt-dlp tuning (optional env vars, all empty by default so the package works on any machine):

  • YT_DLP_COOKIES_FROM_BROWSER=firefox — pass --cookies-from-browser firefox to yt-dlp (needed for age-restricted or members-only videos).
  • YT_DLP_JS_RUNTIME=node — pass --js-runtime node.
  • YT_DLP_BIN=/path/to/yt-dlp — override the binary location.
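To make the effect of these variables concrete, here is a minimal sketch of how they might be translated into a yt-dlp invocation. It is an illustration under assumptions, not the package's code: `resolve_yt_dlp` and `build_yt_dlp_args` are hypothetical helpers, and the flag spellings are copied from the list above. Empty or unset variables add nothing, which matches the "all empty by default" behaviour described.

```python
import os
import shutil
from typing import List, Mapping, Optional


def resolve_yt_dlp(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the yt-dlp binary: YT_DLP_BIN if set, else whatever is on PATH."""
    env = os.environ if env is None else env
    override = env.get("YT_DLP_BIN", "")
    if override:
        return override
    return shutil.which("yt-dlp")  # None when yt-dlp is not installed


def build_yt_dlp_args(url: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the optional tuning flags, followed by the video URL."""
    env = os.environ if env is None else env
    args: List[str] = []

    browser = env.get("YT_DLP_COOKIES_FROM_BROWSER", "")
    if browser:
        args += ["--cookies-from-browser", browser]

    runtime = env.get("YT_DLP_JS_RUNTIME", "")
    if runtime:
        # Flag spelled as in the list above.
        args += ["--js-runtime", runtime]

    args.append(url)
    return args
```

Passing the environment as a parameter keeps the helpers easy to test; in normal use both fall back to `os.environ`.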

Python API

from web_article_extractor import extract_article, fetch_transcript, is_youtube_url

result = extract_article("https://example.com/some-article")
if result:
    print(result["title"])
    print(result["content"])      # Markdown
    print(result["word_count"])

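# YouTube URLs go through the transcript path instead:
url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
if is_youtube_url(url):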
    transcript = fetch_transcript(url)
    if transcript:
        print(transcript["text"])

extract_article returns a dict with url, title, content, source_method, extracted_at and word_count, or None when extraction fails entirely.
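Because the return value is either a dict or None, callers need to handle both cases. The sketch below spells out the documented keys as a `TypedDict` and formats a one-line summary. The field types (for example, `extracted_at` as a string) and the helper name `describe` are assumptions made for illustration; only the key names come from the description above.

```python
from typing import Optional, TypedDict


class ArticleResult(TypedDict):
    """Keys documented for extract_article; the value types are assumed."""

    url: str
    title: str
    content: str          # Markdown body
    source_method: str    # which stage produced the text
    extracted_at: str     # assumed to be a timestamp string
    word_count: int


def describe(result: Optional[ArticleResult]) -> str:
    """Summarise a result in one line, covering the None (failure) case."""
    if result is None:
        return "extraction failed"
    return (
        f'{result["title"]} '
        f'({result["word_count"]} words, via {result["source_method"]})'
    )
```

With the real package, `describe(extract_article(url))` would give a quick indication of which stage succeeded.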

License

MIT — see LICENSE.

Download files

Source Distribution

harnais_web_extractor-0.1.2.tar.gz (13.4 kB)

Uploaded Source

Built Distribution

harnais_web_extractor-0.1.2-py3-none-any.whl (13.9 kB)

Uploaded Python 3

File details

Details for the file harnais_web_extractor-0.1.2.tar.gz.

File metadata

  • Download URL: harnais_web_extractor-0.1.2.tar.gz
  • Size: 13.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for harnais_web_extractor-0.1.2.tar.gz
Algorithm Hash digest
SHA256 362388526a0cadee01168e5465ba4bbe389449ee5d85184807db02387bc83965
MD5 2438e3c37db7dd5d204cd881e3e80434
BLAKE2b-256 de32c8fe7eacb006c0dc2ed186282c91571a05597a9b2e883f39cad412ef27c2

File details

Details for the file harnais_web_extractor-0.1.2-py3-none-any.whl.

File hashes

Hashes for harnais_web_extractor-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 cbbc7b358c8f41598a04df179dd6346d64554965ab62dc82c162a4b826565d09
MD5 77a32596f8550204289cbef253309842
BLAKE2b-256 421bc3781c34406e2d808e388cc130dace41680cd03a4c6e31884fbcd28d37a9
