Fetch a web page and convert it into cleaned Markdown.
extract2md
extract2md is all about “HTML in → Markdown out.” You can start from a live
URL, a file on disk, or an already-loaded HTML string.
It can be used from the command line or as a Python library.
Installation
pip install extract2md
Prerequisites:
- Python 3.10+ runtime
- Node.js (recommended for best results; powers Readability.js content extraction)
CLI usage
1. Fetch a URL and display Markdown
extract2md https://www.iana.org/help/example-domains
2. Fetch and write to a file
extract2md https://www.iana.org/help/example-domains > sample-output.md
3. Convert previously saved HTML (files or stdin)
# convert file
extract2md sample-page.html
# or from stdin
cat sample-page.html | extract2md -
Parameters
Usage: extract2md [OPTIONS] SOURCE
Global
- source: HTTP(S) URL, filesystem path, or - when reading HTML from stdin.
Fetching (URL sources only)
- --ignore-robots: skip robots.txt validation (use sparingly).
- --proxy URL: HTTP(S) proxy forwarded to httpx.
- --timeout SECONDS: request timeout (default 30 seconds).
- --user-agent STRING: override the default identifier.
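To illustrate what robots.txt validation checks, here is a minimal sketch built on Python's standard urllib.robotparser. It is not extract2md's own implementation; the robots.txt body and the user-agent string example-bot are made up for the example, and the rules are parsed offline so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed_public = is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/docs/")
allowed_private = is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/private/page")
```

A fetcher that honors robots.txt would refuse the second URL; passing --ignore-robots skips that kind of check.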
HTML rewriting
- --rewrite-relative-urls/--no-rewrite-relative-urls: enable or disable rewriting relative href/src attributes to absolute links (default on).
- --base-url URL: optional base URL for rewriting relative URLs (defaults to source).
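As a rough picture of what rewriting relative links means, the sketch below resolves a few href/src values against a base URL with the standard library's urljoin. This shows only the general technique; how extract2md performs the rewrite internally is not stated here, and the helper name absolutize and the sample links are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(links: list[str], base_url: str) -> list[str]:
    """Resolve each link against base_url; already-absolute links pass through unchanged."""
    return [urljoin(base_url, link) for link in links]


resolved = absolutize(
    ["guide.html", "../img/logo.png", "/about", "https://other.example/x"],
    "https://example.com/docs/",
)
```

Note that the trailing slash on the base URL matters: without it, the last path segment is treated as a file name and replaced during resolution.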
Conversion
- --converter NAME: choose the HTML conversion backend. Defaults to trafilatura; readability (requires Node.js) is also available.
Environment variables
- EXTRACT2MD_NODE_PATH: path to the Node.js binary (or its directory). Set it if Readability.js cannot find node on your PATH.
Python Library usage
extract2md can also be used as a Python library.
1. Fetch a URL and get Markdown
from extract2md import fetch_to_markdown
markdown = fetch_to_markdown("https://www.iana.org/help/example-domains")
2. Convert a previously saved HTML file
from extract2md import file_to_markdown
markdown_from_file = file_to_markdown("sample-page.html")
3. Convert an HTML string you already have
from extract2md import html_to_markdown
html = "<html><body><h1>Offline HTML</h1></body></html>"
markdown_from_html = html_to_markdown(html)
# Optionally disable replacing relative links with absolute URLs
markdown_no_rewrite = html_to_markdown(
    html,
    rewrite_relative_urls=False,
)
# Or resolve relative links against a custom base URL
# (rewriting must stay enabled for base_url to take effect)
markdown_custom_base = html_to_markdown(
    html,
    rewrite_relative_urls=True,
    base_url="https://example.com/docs/",
)
# Pick an alternate conversion backend (e.g., Readability)
markdown_readability = html_to_markdown(html, converter="readability")
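For converting many saved pages at once, a small wrapper can loop over a directory. The sketch below is an illustration, not part of extract2md: convert_directory is a hypothetical helper that takes the converter as an argument, and it assumes the converter accepts a path string and returns the Markdown as a string, as the file_to_markdown example above suggests.

```python
from pathlib import Path
from typing import Callable


def convert_directory(
    src_dir: Path,
    out_dir: Path,
    convert: Callable[[str], str],
) -> list[Path]:
    """Convert every *.html file in src_dir and write <name>.md files into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for html_path in sorted(src_dir.glob("*.html")):
        target = out_dir / (html_path.stem + ".md")
        # convert is expected to return the Markdown text for one HTML file.
        target.write_text(convert(str(html_path)), encoding="utf-8")
        written.append(target)
    return written
```

With extract2md installed, it could be called as convert_directory(Path("pages"), Path("markdown"), file_to_markdown), where the directory names are placeholders.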
Additional public methods
Need to store markup or run your own converter? Use fetch and skip the Markdown
step entirely:
from extract2md import fetch
raw_html, content_type = fetch("https://example.com/docs")
Notes
- The CLI and library both fetch live webpages from URLs; network availability and site rate limits apply.
- Inspired by the Fetch MCP Server.
- Thanks go to these libraries for the heavy lifting:
  - ReadabiliPy with Mozilla's Readability.js Node.js package
  - Markdownify
  - Trafilatura