Convert an article or web page to Markdown

article-to-md

A CLI tool to extract core content from webpages or local HTML and convert it to Markdown.

╭─ Commands ────────────────────────────────────────────────────────────────────╮
│ --help (-h)  Display this message and exit.                                   │
│ --version    Display application version.                                     │
╰───────────────────────────────────────────────────────────────────────────────╯
╭─ Parameters ──────────────────────────────────────────────────────────────────╮
│ *  SOURCE --source                [required]                                  │
│    --method                       [choices: readability, trafilatura, raw]    │
│                                   [default: readability]                      │
│    --favor                        [choices: recall, precision]                │
│    --remove-ads --no-remove-ads   [default: False]                            │
│    --strip-tag --empty-strip-tag                                              │
╰───────────────────────────────────────────────────────────────────────────────╯

Installation

Installing with uv is recommended, since it places the package in a managed environment:

uv tool install article-to-md

Note: For the readability method to use the original Readability.js, Node.js (v14+) must be installed on your system. Without Node.js, the tool falls back to Python-based extraction.
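
To see in advance which of the two paths applies on a given machine, a check along the following lines may help. It is a minimal sketch, not part of article-to-md: the helper names `parse_node_major` and `node_is_usable` are hypothetical, and it assumes the `node` executable on PATH is the one the tool would pick up.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_node_major(version_output: str) -> Optional[int]:
    """Extract the major version from `node --version` output such as 'v18.17.0'."""
    match = re.match(r"v?(\d+)\.", version_output.strip())
    return int(match.group(1)) if match else None


def node_is_usable(minimum_major: int = 14) -> bool:
    """Return True if a `node` binary on PATH reports at least `minimum_major`."""
    node_path = shutil.which("node")
    if node_path is None:
        return False  # Node.js is not on PATH
    try:
        result = subprocess.run(
            [node_path, "--version"],
            capture_output=True,
            text=True,
            check=False,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    major = parse_node_major(result.stdout)
    return major is not None and major >= minimum_major


if __name__ == "__main__":
    print("Node.js v14+ available:", node_is_usable())
```

If this reports False, the readability method should still run, but through the Python-based fallback described in the note.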

Usage

From a publicly accessible web page:

article-to-md https://example.com/article

From a local HTML file:

article-to-md /path/to/file.html

Advanced options:

  • --remove-ads - Performs basic ad removal from the DOM using generic cosmetic filters from EasyList.
  • --method - Selects how the DOM is pre-processed before conversion to Markdown.
    • readability (default) - Uses ReadabiliPy, which can use the original Readability.js Node package when Node is present on the system.
    • trafilatura - Uses the pure-Python Trafilatura library.
    • raw - Sends the full DOM to be converted.
  • --favor - Only used with --method trafilatura; selects whether extraction favors recall or precision (see the Trafilatura documentation).
  • --strip-tag - An HTML tag to be stripped from the DOM before conversion.
    • This argument can be supplied multiple times.
    • By default, <img> tags are stripped; use --empty-strip-tag to keep them.
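
When calling the tool from a script, the options above can be assembled into an argument list before handing it to `subprocess.run`. The following is a sketch under stated assumptions: `build_command` is a hypothetical helper, the flag spellings are taken from the help output, and the `--flag value` argument form is assumed rather than confirmed. Rejecting `--favor` for other methods is this helper's own choice; the tool itself may simply ignore it.

```python
import shlex
from typing import List, Optional, Sequence

METHODS = ("readability", "trafilatura", "raw")
FAVOR_CHOICES = ("recall", "precision")


def build_command(
    source: str,
    method: str = "readability",
    favor: Optional[str] = None,
    remove_ads: bool = False,
    strip_tags: Sequence[str] = (),
) -> List[str]:
    """Build an argv list for article-to-md (hypothetical helper, not the tool's API)."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    if favor is not None:
        if favor not in FAVOR_CHOICES:
            raise ValueError(f"unknown favor value: {favor!r}")
        if method != "trafilatura":
            # Helper's own policy: --favor is documented as trafilatura-only.
            raise ValueError("--favor is only used with --method trafilatura")

    command = ["article-to-md", source, "--method", method]
    if favor is not None:
        command += ["--favor", favor]
    if remove_ads:
        command.append("--remove-ads")
    # --strip-tag may be supplied multiple times, once per tag.
    for tag in strip_tags:
        command += ["--strip-tag", tag]
    return command


if __name__ == "__main__":
    argv = build_command(
        "https://example.com/article",
        method="trafilatura",
        favor="precision",
        remove_ads=True,
        strip_tags=["img", "nav"],
    )
    print(shlex.join(argv))
```

Passing the resulting list to `subprocess.run(argv, check=True)` avoids shell quoting issues with URLs and file paths.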

Features

  • Stealth Requests: Uses curl_cffi to impersonate a Chrome browser and avoid bot detection.
  • Enhanced Markdown:
    • Converts <var> to italics.
    • Includes <abbr> titles in the text output.
    • Renders Markdown tables from HTML tables.
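
To illustrate the first two Enhanced Markdown items, the sketch below handles <var> and <abbr> with the standard library's HTML parser. It is not the tool's implementation: the class name `InlineSketch`, the function `convert_inline`, and the choice to append the <abbr> title in parentheses after the abbreviation are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import List, Optional


class InlineSketch(HTMLParser):
    """Toy converter: <var> becomes *italics*, <abbr title> is appended in parentheses."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []
        self._abbr_titles: List[Optional[str]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "var":
            self.parts.append("*")
        elif tag == "abbr":
            # Remember the title (may be None) until the closing tag.
            self._abbr_titles.append(dict(attrs).get("title"))

    def handle_endtag(self, tag):
        if tag == "var":
            self.parts.append("*")
        elif tag == "abbr" and self._abbr_titles:
            title = self._abbr_titles.pop()
            if title:
                self.parts.append(f" ({title})")

    def handle_data(self, data):
        self.parts.append(data)


def convert_inline(html: str) -> str:
    """Convert a small HTML fragment using the rules sketched above."""
    parser = InlineSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    fragment = (
        'Let <var>x</var> be the '
        '<abbr title="HyperText Markup Language">HTML</abbr> size.'
    )
    print(convert_inline(fragment))
    # prints: Let *x* be the HTML (HyperText Markup Language) size.
```

The actual output format of article-to-md for <abbr> titles may differ; the point is only that the title text survives conversion instead of being dropped with the tag.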

Download files

Source Distribution

article_to_md-0.4.0.tar.gz (4.7 kB)

Built Distribution

article_to_md-0.4.0-py3-none-any.whl (6.5 kB)