Fetch a web page and convert it into cleaned Markdown.

extract2md

extract2md is all about “HTML in → Markdown out.” You can start from a live URL, a file on disk, or an already-loaded HTML string.

It can be used from the command line or as a Python library.

Installation

pip install extract2md

Prerequisites:

  • Python 3.10+ runtime
  • Node.js (recommended for best results; powers Readability.js content extraction)

CLI usage

1. Fetch a URL and display Markdown

extract2md https://www.iana.org/help/example-domains

2. Fetch and write to a file

extract2md https://www.iana.org/help/example-domains > sample-output.md

3. Convert previously saved HTML (files or stdin)

# convert file
extract2md sample-page.html
# or from stdin
cat sample-page.html | extract2md -

Parameters

Usage: extract2md [OPTIONS] SOURCE

Global

  • source: HTTP(S) URL, filesystem path, or - when reading HTML from stdin.

Fetching (URL sources only)

  • --ignore-robots: skip robots.txt validation (use sparingly).
  • --proxy URL: HTTP(S) proxy forwarded to httpx.
  • --timeout SECONDS: request timeout (default 30 seconds).
  • --user-agent STRING: override the default identifier.
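To illustrate what robots.txt validation guards against, the sketch below evaluates a policy with Python's standard urllib.robotparser module. It is not extract2md's own implementation; the inlined policy text, the is_allowed helper, and the "demo-agent" user agent are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt policy, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/docs/", "https://example.com/private/page"):
        print(url, is_allowed(ROBOTS_TXT, "demo-agent", url))
```

A fetcher that honors robots.txt would refuse the second URL, and --ignore-robots asks the tool to skip this kind of check.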

HTML rewriting

  • --rewrite-relative-urls/--no-rewrite-relative-urls: enable or disable rewriting relative href/src attributes to absolute links (default on).
  • --base-url URL: optional base URL used when rewriting relative URLs (defaults to the source).
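As a rough picture of what rewriting relative href/src values to absolute links means, the sketch below resolves a few links against a base URL with the standard library's urljoin. This shows the general technique only; how extract2md resolves links internally is not documented here, and the absolutize helper is a hypothetical name.

```python
from urllib.parse import urljoin


def absolutize(base_url: str, link: str) -> str:
    """Resolve a possibly relative link against base_url."""
    return urljoin(base_url, link)


if __name__ == "__main__":
    base = "https://example.com/docs/"
    for link in ("guide.html", "/about", "../img/logo.png", "https://other.example/x"):
        print(f"{link!r} -> {absolutize(base, link)}")
```

With urljoin, the trailing slash on the base matters: "https://example.com/docs" (no slash) treats "docs" as a file name, so "guide.html" would resolve to "https://example.com/guide.html".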

Conversion

  • --converter NAME: choose the HTML conversion backend. Defaults to trafilatura; readability (requires Node.js) is also available.

Environment variables

  • EXTRACT2MD_NODE_PATH: path to the Node.js binary (or its directory). Set it if Readability.js cannot find node on your PATH.
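If Node.js lives in a non-standard location, one way to supply the variable from Python is to build an environment mapping and hand it to a child process. The sketch below is a minimal example under that assumption; env_with_node is a hypothetical helper and the Node.js path shown is a placeholder.

```python
import os
from typing import Mapping, Optional


def env_with_node(node_path: str, base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with EXTRACT2MD_NODE_PATH set to node_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["EXTRACT2MD_NODE_PATH"] = node_path
    return env


if __name__ == "__main__":
    # Placeholder path; point this at your own Node.js binary or its directory.
    env = env_with_node("/opt/node/bin/node")
    print(env["EXTRACT2MD_NODE_PATH"])
```

The returned mapping can be passed as the env argument of subprocess.run when invoking the extract2md command with --converter readability.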

Python Library usage

extract2md can also be used as a Python library.

1. Fetch a URL and get Markdown

from extract2md import fetch_to_markdown

markdown = fetch_to_markdown("https://www.iana.org/help/example-domains")

2. Convert a previously saved HTML file

from extract2md import file_to_markdown

markdown_from_file = file_to_markdown("sample-page.html")

3. Convert an HTML string you already have

from extract2md import html_to_markdown

html = "<html><body><h1>Offline HTML</h1></body></html>"
markdown_from_html = html_to_markdown(html)

# Optionally disable replacing relative links with absolute URLs
markdown_custom = html_to_markdown(
    html,
    rewrite_relative_urls=False,
)

# Or rewrite relative links against a custom base URL
markdown_rebased = html_to_markdown(
    html,
    rewrite_relative_urls=True,
    base_url="https://example.com/docs/",
)

# Pick an alternate conversion backend (e.g., Readability)
markdown_readability = html_to_markdown(html, converter="readability")
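Building on the calls above, a batch conversion of saved pages might look like the sketch below. convert_directory is a hypothetical helper, not part of extract2md; it assumes file_to_markdown accepts a path string and returns the Markdown as a str, as the example above suggests. The converter is injectable so the helper can be exercised without the package installed.

```python
from pathlib import Path
from typing import Callable, List, Optional


def convert_directory(
    src_dir: Path,
    out_dir: Path,
    convert: Optional[Callable[[str], str]] = None,
) -> List[Path]:
    """Convert every *.html file in src_dir to a .md file in out_dir."""
    if convert is None:
        # Imported lazily so the helper can be tested with a stub converter.
        from extract2md import file_to_markdown
        convert = file_to_markdown

    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for html_path in sorted(Path(src_dir).glob("*.html")):
        target = out_dir / (html_path.stem + ".md")
        target.write_text(convert(str(html_path)), encoding="utf-8")
        written.append(target)
    return written
```

Calling convert_directory(Path("pages"), Path("markdown")) would then write one Markdown file per saved page, with both directory names chosen here only as examples.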

Additional public functions

Need to store markup or run your own converter? Use fetch and skip the Markdown step entirely:

from extract2md import fetch

raw_html, content_type = fetch("https://example.com/docs")
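Since fetch also returns the content type, a caller running its own converter may want to check that the response is actually HTML first. The sketch below is one way to do that; is_html is a hypothetical helper, and it assumes content_type is a Content-Type header value such as "text/html; charset=utf-8" (the exact format returned by fetch is not specified here).

```python
from typing import Optional

# Media types treated as HTML in this sketch.
HTML_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(content_type: Optional[str]) -> bool:
    """Return True if a Content-Type header value denotes an HTML document."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES


if __name__ == "__main__":
    for value in ("text/html; charset=utf-8", "application/pdf", None):
        print(repr(value), is_html(value))
```

A caller could guard its own conversion step with is_html(content_type) and store anything else (PDFs, images) untouched.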

Notes
