
Two-stage web article extractor (trafilatura + Playwright) with YouTube transcript fetching.


web-article-extractor

A small, dependency-light toolkit for pulling readable content off the web. It extracts article text from a URL using a two-stage strategy: trafilatura first (fast, no browser), then a headless Playwright Chromium fallback for JavaScript-heavy pages. It also fetches transcripts from YouTube videos through a manual-then-automatic subtitle cascade.
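
The two-stage strategy can be pictured as a simple fallback cascade. The sketch below is an illustration only, not the package's implementation: `extract_with_fallback` and the two stand-in stage functions are hypothetical names, and the real stages would call trafilatura and Playwright.

```python
from typing import Callable, Iterable, Optional

# A stage takes a URL and returns extracted text, or None if it found nothing.
Stage = Callable[[str], Optional[str]]


def extract_with_fallback(url: str, stages: Iterable[Stage]) -> Optional[str]:
    """Try each stage in order; return the first non-empty result (hypothetical helper)."""
    for stage in stages:
        try:
            text = stage(url)
        except Exception:
            # A failing stage should not abort the cascade; move on to the next one.
            continue
        if text and text.strip():
            return text
    return None  # every stage failed or returned nothing


# Stand-ins for the two stages (assumptions for illustration only).
def fast_stage(url: str) -> Optional[str]:
    # Pretend the fast, browser-free path finds nothing on a JavaScript-heavy page.
    return None


def browser_stage(url: str) -> Optional[str]:
    # Pretend the headless-browser path renders the page and recovers the text.
    return f"Rendered article from {url}"


if __name__ == "__main__":
    print(extract_with_fallback("https://example.com/some-article", [fast_stage, browser_stage]))
```

Ordering the stages from cheapest to most expensive means the browser is only launched when the fast path comes back empty.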

Naming. The PyPI distribution is harnais-web-extractor (this repository keeps the name web-article-extractor). The Python import package is web_article_extractor.

Install

From PyPI:

pip install harnais-web-extractor
# Playwright also needs a browser binary the first time:
playwright install chromium
# YouTube transcript extraction requires yt-dlp (optional extra):
pip install "harnais-web-extractor[youtube]"

Or from source (GitHub):

pip install git+https://github.com/JohnLinotte/web-article-extractor.git

Usage

Command line

# Extract an article as Markdown (default):
python -m web_article_extractor https://example.com/some-article

# As JSON:
python -m web_article_extractor https://example.com/some-article --format json

# A YouTube URL fetches the transcript instead:
python -m web_article_extractor "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

YouTube transcripts require yt-dlp. Install it via the youtube extra (pip install "harnais-web-extractor[youtube]") or provide any yt-dlp binary on your PATH. Without it, fetch_transcript() cannot run.

Optional: for videos without subtitles, a Whisper fallback is available. Install faster-whisper via the whisper extra: pip install "harnais-web-extractor[whisper]".

yt-dlp tuning (optional env vars, all empty by default so the package works on any machine):

  • YT_DLP_COOKIES_FROM_BROWSER=firefox — pass --cookies-from-browser firefox to yt-dlp (needed for age-restricted or members-only videos).
  • YT_DLP_JS_RUNTIME=node — pass --js-runtime node.
  • YT_DLP_BIN=/path/to/yt-dlp — override the binary location.
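
As a rough sketch of how these variables could map onto a yt-dlp invocation, the helper below builds a command prefix from an environment mapping. `build_yt_dlp_command` is a hypothetical name and the package's real logic may differ; the flag names are taken from the list above.

```python
import os
import shutil
from typing import List, Mapping, Optional


def build_yt_dlp_command(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Turn the optional env vars into a yt-dlp command prefix (hypothetical helper)."""
    env = os.environ if env is None else env

    # YT_DLP_BIN overrides the binary; otherwise look for yt-dlp on PATH,
    # and fall back to the bare name if it is not found.
    binary = env.get("YT_DLP_BIN") or shutil.which("yt-dlp") or "yt-dlp"
    command = [binary]

    # Unset or empty variables add nothing, matching the "empty by default" behaviour.
    browser = env.get("YT_DLP_COOKIES_FROM_BROWSER")
    if browser:
        command += ["--cookies-from-browser", browser]

    runtime = env.get("YT_DLP_JS_RUNTIME")
    if runtime:
        command += ["--js-runtime", runtime]

    return command


if __name__ == "__main__":
    print(build_yt_dlp_command({"YT_DLP_BIN": "/path/to/yt-dlp", "YT_DLP_JS_RUNTIME": "node"}))
```

Passing the environment as a mapping rather than reading `os.environ` directly keeps the helper easy to test without touching the real process environment.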

Python API

from web_article_extractor import extract_article, fetch_transcript, is_youtube_url

result = extract_article("https://example.com/some-article")
if result:
    print(result["title"])
    print(result["content"])      # Markdown
    print(result["word_count"])

url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
if is_youtube_url(url):
    transcript = fetch_transcript(url)
    if transcript:
        print(transcript["text"])

extract_article returns a dict with url, title, content, source_method, extracted_at and word_count, or None when extraction fails entirely.
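
Because the return value is either a dict or None, callers need to handle both cases. The sketch below shows one way to do that with a hypothetical `summarize_result` helper. The sample values, including the `source_method` string and the timestamp format, are made up for illustration and are not documented behaviour.

```python
import json
from typing import Any, Mapping, Optional

# Keys listed in the description of extract_article's return value.
EXPECTED_KEYS = ("url", "title", "content", "source_method", "extracted_at", "word_count")


def summarize_result(result: Optional[Mapping[str, Any]]) -> str:
    """Return a one-line JSON summary, or a failure note for None (hypothetical helper)."""
    if result is None:
        return "extraction failed"
    missing = [key for key in EXPECTED_KEYS if key not in result]
    if missing:
        raise KeyError(f"result is missing keys: {missing}")
    # Leave out the full Markdown body so the summary stays short.
    summary = {key: result[key] for key in EXPECTED_KEYS if key != "content"}
    return json.dumps(summary, ensure_ascii=False)


# Sample dict with made-up values; real values come from extract_article().
sample = {
    "url": "https://example.com/some-article",
    "title": "Some Article",
    "content": "# Some Article\n\nBody text.",
    "source_method": "trafilatura",
    "extracted_at": "2025-01-01T00:00:00Z",
    "word_count": 5,
}

print(summarize_result(sample))
print(summarize_result(None))
```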

License

MIT — see LICENSE.
