
Two-stage web article extractor (trafilatura + Playwright) with YouTube transcript fetching.

web-article-extractor

A small, dependency-light toolkit for pulling readable content off the web. It extracts article text from any URL using a two-stage strategy: trafilatura first (fast, no browser), then a headless Playwright Chromium fallback for JavaScript-heavy pages. It also fetches transcripts from YouTube videos through a subtitle cascade that tries manual subtitles before automatic ones.
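
The two-stage strategy can be pictured as a simple fallback loop. The following is a minimal sketch, not the package's actual implementation: `first_successful`, `fast_stage` and `browser_stage` are hypothetical names, and the two stages are stand-in functions rather than real trafilatura or Playwright calls.

```python
from typing import Callable, Iterable, Optional

# A stage takes a URL and returns extracted text, or None if it found nothing.
Stage = Callable[[str], Optional[str]]


def first_successful(url: str, stages: Iterable[Stage]) -> Optional[str]:
    """Try each stage in order and return the first non-empty result."""
    for stage in stages:
        try:
            text = stage(url)
        except Exception:
            # Treat a crashing stage like an empty one and move on.
            text = None
        if text:
            return text
    return None


# Stand-in stages (hypothetical). Real ones would call trafilatura and a
# headless Playwright Chromium, respectively.
def fast_stage(url: str) -> Optional[str]:
    return None  # pretend the page needs JavaScript to render


def browser_stage(url: str) -> Optional[str]:
    return f"rendered text for {url}"


print(first_successful("https://example.com/some-article", [fast_stage, browser_stage]))
```

Ordering the stages from cheapest to most expensive means the browser only starts when the fast path returns nothing.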

Naming. The PyPI distribution is harnais-web-extractor (this repository keeps the name web-article-extractor). The Python import package is web_article_extractor.

Install

From PyPI:

pip install harnais-web-extractor
# Playwright also needs a browser binary the first time:
playwright install chromium
# Optional YouTube transcript support:
pip install "harnais-web-extractor[youtube]"

Or from source (GitHub):

pip install git+https://github.com/JohnLinotte/web-article-extractor.git

Usage

Command line

# Extract an article as Markdown (default):
python -m web_article_extractor https://example.com/some-article

# As JSON:
python -m web_article_extractor https://example.com/some-article --format json

# A YouTube URL fetches the transcript instead:
python -m web_article_extractor "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

Python API

from web_article_extractor import extract_article, fetch_transcript, is_youtube_url

result = extract_article("https://example.com/some-article")
if result:
    print(result["title"])
    print(result["content"])      # Markdown
    print(result["word_count"])

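# Any URL string works here; this one matches the command-line example above.
url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
if is_youtube_url(url):
    transcript = fetch_transcript(url)
    if transcript:
        print(transcript["text"])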

extract_article returns a dict with url, title, content, source_method, extracted_at and word_count, or None when extraction fails entirely.
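
Because the return value is either a dict with those six keys or None, callers will usually want to handle both cases in one place. Here is a minimal sketch that assumes only the key names listed above. `describe_result` is a hypothetical helper, not part of the package, and the sample dict uses made-up values: the `source_method` string and the `extracted_at` format are placeholders, since the real values are not documented here.

```python
from typing import Optional


def describe_result(result: Optional[dict]) -> str:
    """Return a one-line summary of an extract_article-style result."""
    if result is None:
        return "extraction failed"
    # Use .get() so a missing key degrades to a placeholder instead of raising.
    title = result.get("title") or "(untitled)"
    words = result.get("word_count", 0)
    method = result.get("source_method", "unknown")
    return f"{title}: {words} words via {method}"


# Made-up sample shaped like the documented return value.
sample = {
    "url": "https://example.com/some-article",
    "title": "Some Article",
    "content": "# Some Article\n\nBody text.",
    "source_method": "stage-one",  # placeholder value (assumption)
    "extracted_at": "2024-01-01T00:00:00",  # placeholder format (assumption)
    "word_count": 4,
}

print(describe_result(sample))
print(describe_result(None))
```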

License

MIT — see LICENSE.
