Fetch a web page and convert it into cleaned Markdown.

extract2md

extract2md is all about “HTML in → Markdown out.” You can start from a live URL, a file on disk, or an already-loaded HTML string.

It can be used from the command line or as a Python library.

Installation

pip install extract2md

Prerequisites:

Python 3.10+ runtime
Node.js (recommended for best results; powers Readability.js content extraction)

CLI usage

1. Fetch a URL and display Markdown

extract2md https://www.iana.org/help/example-domains

2. Fetch and write to a file

extract2md https://www.iana.org/help/example-domains > sample-output.md

3. Convert previously saved HTML (files or stdin)

# convert file
extract2md sample-page.html
# or from stdin
cat sample-page.html | extract2md -

Parameters

Usage: extract2md [OPTIONS] SOURCE

Global

source: HTTP(S) URL, filesystem path, or - when reading HTML from stdin.
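
A minimal sketch of how the three kinds of SOURCE could be told apart. `classify_source` is a hypothetical helper written for this illustration, not part of extract2md, and the tool's actual detection logic may differ.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return "stdin", "url", or "file" for a SOURCE argument (illustrative only)."""
    if source == "-":
        return "stdin"
    # Treat only http:// and https:// as URLs; everything else is a path.
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    return "file"


for example in ("https://www.iana.org/help/example-domains", "sample-page.html", "-"):
    print(example, "->", classify_source(example))
```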

Fetching (URL sources only)

--ignore-robots: skip robots.txt validation (use sparingly).
--proxy URL: HTTP(S) proxy forwarded to httpx.
--timeout SECONDS: request timeout (default 30 seconds).
--user-agent STRING: override the default identifier.
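
To illustrate what a robots.txt check involves, here is a sketch using Python's standard `urllib.robotparser`. The rules and the user-agent string are made up for the example, and extract2md's own validation may work differently.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, parsed offline so the example needs no network.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

user_agent = "example-bot"  # hypothetical identifier
allowed = parser.can_fetch(user_agent, "https://example.com/docs/index.html")
blocked = parser.can_fetch(user_agent, "https://example.com/private/data.html")
print("docs allowed:", allowed)
print("private allowed:", blocked)
```

A fetcher that respects robots.txt would skip the second URL; --ignore-robots turns that kind of check off.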

HTML rewriting

--rewrite-relative-urls/--no-rewrite-relative-urls: enable or disable rewriting relative href/src attributes to absolute links (default on).
--base-url URL: optional base URL against which relative URLs are resolved (defaults to the source).
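
What resolving a relative link against a base URL means can be sketched with the standard library's `urljoin`. This shows only the resolution rule for individual links, not extract2md's implementation; the base URL and the links are made up.

```python
from urllib.parse import urljoin

base_url = "https://example.com/docs/"  # hypothetical base URL

links = ["guide.html", "../images/logo.png", "/about", "https://other.example/page"]
resolved = {link: urljoin(base_url, link) for link in links}

for link, absolute in resolved.items():
    print(link, "->", absolute)
```

Links that are already absolute pass through unchanged, which is why rewriting is safe to leave on by default.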

Conversion

--converter NAME: choose the HTML conversion backend. Defaults to trafilatura; readability (requires Node.js) is also available.
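
If you want to drive the CLI from a script, the options above can be assembled into an argument list. `build_command` is a hypothetical helper, not part of extract2md; it uses only flags documented in this section and does not run anything by itself.

```python
import shlex


def build_command(source, *, timeout=None, proxy=None, converter=None, ignore_robots=False):
    """Assemble an extract2md command line from the documented options."""
    command = ["extract2md"]
    if ignore_robots:
        command.append("--ignore-robots")
    if proxy is not None:
        command += ["--proxy", proxy]
    if timeout is not None:
        command += ["--timeout", str(timeout)]
    if converter is not None:
        command += ["--converter", converter]
    command.append(source)  # SOURCE is the single positional argument
    return command


command = build_command(
    "https://www.iana.org/help/example-domains",
    timeout=10,
    converter="readability",
)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, capture_output=True, text=True)` would execute it, provided extract2md is installed.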

Environment variables

EXTRACT2MD_NODE_PATH: path to the Node.js binary (or the directory containing it). Set it if Readability.js cannot find node on your PATH.
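
The "binary or its directory" rule could be resolved along these lines. `resolve_node` is a hypothetical helper for illustration, and the lookup extract2md actually performs may differ.

```python
import os
import shutil


def resolve_node(env=os.environ):
    """Return a path to a node executable, or None if none is found."""
    configured = env.get("EXTRACT2MD_NODE_PATH")
    if configured:
        if os.path.isdir(configured):
            # A directory was given: look for a node executable inside it only.
            return shutil.which("node", path=configured)
        if os.path.isfile(configured) and os.access(configured, os.X_OK):
            # A file was given: accept it if it is executable.
            return configured
        return None
    # Nothing configured: fall back to the normal PATH search.
    return shutil.which("node")


print(resolve_node())
```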

Python Library usage

extract2md can also be used as a Python library.

1. Fetch a URL and get Markdown

from extract2md import fetch_to_markdown

markdown = fetch_to_markdown("https://www.iana.org/help/example-domains")

2. Convert a previously saved HTML file

from extract2md import file_to_markdown

markdown_from_file = file_to_markdown("sample-page.html")

3. Convert an HTML string you already have

from extract2md import html_to_markdown

html = "<html><body><h1>Offline HTML</h1></body></html>"
markdown_from_html = html_to_markdown(html)

# Optionally disable replacing relative links with absolute URLs
markdown_custom = html_to_markdown(
    html,
    rewrite_relative_urls=False,
)

# Or resolve relative links against a custom base URL
markdown_custom = html_to_markdown(
    html,
    rewrite_relative_urls=True,
    base_url="https://example.com/docs/",
)

# Pick an alternate conversion backend (e.g., Readability)
markdown_readability = html_to_markdown(html, converter="readability")

Additional public functions

Need to store markup or run your own converter? Use fetch and skip the Markdown step entirely:

from extract2md import fetch

raw_html, content_type = fetch("https://example.com/docs")
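
As a sketch of running your own converter on raw HTML, the class below turns headings into Markdown using only the standard library. `HeadingExtractor` and `headings_to_markdown` are hypothetical names for this example, and a literal HTML string stands in for the `raw_html` that `fetch` would return, so the snippet runs offline.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collect h1-h6 headings as Markdown lines (illustrative converter)."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._level = None  # heading level currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Collapse whitespace so the heading fits on one Markdown line.
            title = " ".join("".join(self._text).split())
            self.lines.append("#" * self._level + " " + title)
            self._level = None


def headings_to_markdown(raw_html: str) -> str:
    parser = HeadingExtractor()
    parser.feed(raw_html)
    parser.close()
    return "\n".join(parser.lines)


# Stand-in for the first element of the tuple returned by fetch(...).
raw_html = "<html><body><h1>Docs</h1><p>Intro</p><h2>Install <em>now</em></h2></body></html>"
print(headings_to_markdown(raw_html))
```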

Notes

The CLI and library both fetch live webpages from URLs; network availability and site rate limits apply.
Inspired by the Fetch MCP Server.
Thanks go to these libraries for the heavy lifting:
- ReadabiliPy with Mozilla's Readability.js Node.js package
- Markdownify
- Trafilatura

Project details

Maintainers

Wuodan

Release history

0.1.2 (this version): Nov 19, 2025
0.1.1: Nov 18, 2025

Download files

Source Distribution

extract2md-0.1.2.tar.gz (13.0 kB)

Uploaded Nov 19, 2025 Source

Built Distribution

extract2md-0.1.2-py3-none-any.whl (12.6 kB)

Uploaded Nov 19, 2025 Python 3

File details

Details for the file extract2md-0.1.2.tar.gz.

File metadata

Download URL: extract2md-0.1.2.tar.gz
Upload date: Nov 19, 2025
Size: 13.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for extract2md-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`25c6dd8f1e7d07c6e803430fdc3db06371490b712e7b05539440732def42ea9d`
MD5	`f51e96946b6f1bbaf58aee619e1e7d25`
BLAKE2b-256	`d709d7b822410e78a344652d15aad3dd6d8d2a0cac848adbdce3e797de3afb6a`
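
A downloaded file can be checked against the published SHA256 digest with `hashlib`. In this sketch the file path is an assumption (it presumes the archive was saved to the current directory), so the comparison only runs if that file exists.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for extract2md-0.1.2.tar.gz.
EXPECTED_SHA256 = "25c6dd8f1e7d07c6e803430fdc3db06371490b712e7b05539440732def42ea9d"


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = Path("extract2md-0.1.2.tar.gz")  # assumed download location
if archive.exists():
    print("match" if sha256_of(archive) == EXPECTED_SHA256 else "MISMATCH")
else:
    print("archive not found; nothing to verify")
```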

Provenance

The following attestation bundles were made for extract2md-0.1.2.tar.gz:

Publisher: ci.yml on Wuodan/extract2md

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: extract2md-0.1.2.tar.gz
- Subject digest: 25c6dd8f1e7d07c6e803430fdc3db06371490b712e7b05539440732def42ea9d
- Sigstore transparency entry: 708125643
- Sigstore integration time: Nov 19, 2025
Source repository:
- Permalink: Wuodan/extract2md@e48866753d655ea49b8528089677d26667021df1
- Branch / Tag: refs/tags/0.1.2
- Owner: https://github.com/Wuodan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@e48866753d655ea49b8528089677d26667021df1
- Trigger Event: push

File details

Details for the file extract2md-0.1.2-py3-none-any.whl.

File metadata

Download URL: extract2md-0.1.2-py3-none-any.whl
Upload date: Nov 19, 2025
Size: 12.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for extract2md-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f0217dd664b2a3f7436bf66824f3790270174283049539cdab553aab25e3ac98`
MD5	`6d586dc6c5a34ccedf3463123954c896`
BLAKE2b-256	`1fc18c705859b511d2d3c0d345163205c2ce74e41d27337d23dc580cf22277c3`

Provenance

The following attestation bundles were made for extract2md-0.1.2-py3-none-any.whl:

Publisher: ci.yml on Wuodan/extract2md

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: extract2md-0.1.2-py3-none-any.whl
- Subject digest: f0217dd664b2a3f7436bf66824f3790270174283049539cdab553aab25e3ac98
- Sigstore transparency entry: 708125646
- Sigstore integration time: Nov 19, 2025
Source repository:
- Permalink: Wuodan/extract2md@e48866753d655ea49b8528089677d26667021df1
- Branch / Tag: refs/tags/0.1.2
- Owner: https://github.com/Wuodan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@e48866753d655ea49b8528089677d26667021df1
- Trigger Event: push
