Fetch web pages and convert them to Markdown
markdfetch
A lightweight Python library for fetching web pages and extracting content as Markdown, plain text, or structured links.
Features
- Fetch web pages with a simple API
- Convert HTML to Markdown
- Extract plain text from web pages
- Extract links with URL and anchor text
- Exclude unwanted HTML tags before processing
- Include only specific HTML tags before processing
- Support for custom request headers and timeouts
- Automatic resolution of relative URLs
- CSS selector support
- Optional link deduplication
- Automatic retry handling
Installation
pip install markdfetch
Quick Start
import markdfetch
page = markdfetch.fetch("https://example.com")
print(page.markdown())
Fetch a Page
import markdfetch
page = markdfetch.fetch("https://example.com")
print(page.status_code)
print(page.url)
Convert HTML to Markdown
page = markdfetch.fetch("https://example.com")
markdown = page.markdown()
print(markdown)
Exclude HTML Tags
Remove unwanted sections before converting to Markdown.
page = markdfetch.fetch("https://example.com")
markdown = page.markdown(
    exclude=["nav", "footer"]
)
print(markdown)
Include Specific HTML Tags
Extract content only from selected tags.
page = markdfetch.fetch("https://example.com")
markdown = page.markdown(
    include=["article"]
)
print(markdown)
Combine Include and Exclude
page = markdfetch.fetch("https://example.com")
markdown = page.markdown(
    include=["article"],
    exclude=["nav", "footer"]
)
print(markdown)
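The README does not say how `include` and `exclude` interact when both are given. As a rough illustration only, the sketch below uses the standard library's `html.parser` to keep text that sits inside an included tag and outside every excluded tag. The class `TagFilter` and the function `filter_text` are hypothetical names, not part of markdfetch, and the rule that exclusion wins inside an included tag is an assumption.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Hypothetical helper: collect text inside `include` tags and outside `exclude` tags."""

    def __init__(self, include=None, exclude=None):
        super().__init__()
        self.include = set(include or [])
        self.exclude = set(exclude or [])
        self._include_depth = 0  # how many included tags are currently open
        self._exclude_depth = 0  # how many excluded tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.include:
            self._include_depth += 1
        if tag in self.exclude:
            self._exclude_depth += 1

    def handle_endtag(self, tag):
        if tag in self.include and self._include_depth > 0:
            self._include_depth -= 1
        if tag in self.exclude and self._exclude_depth > 0:
            self._exclude_depth -= 1

    def handle_data(self, data):
        # With no include list, everything is wanted unless excluded.
        wanted = not self.include or self._include_depth > 0
        # Assumption: an excluded tag wins even when nested in an included one.
        if wanted and self._exclude_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def filter_text(html, include=None, exclude=None):
    parser = TagFilter(include, exclude)
    parser.feed(html)
    parser.close()
    return parser.chunks


html = (
    "<nav>Menu</nav>"
    "<article><h1>Title</h1><footer>Credits</footer><p>Body</p></article>"
    "<footer>Site</footer>"
)
print(filter_text(html, include=["article"], exclude=["nav", "footer"]))
# ['Title', 'Body']
```

The sketch returns plain text chunks rather than Markdown, since it only aims to show the filtering step.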
Extract Plain Text
page = markdfetch.fetch("https://example.com")
text = page.text()
print(text)
Extract Links
page = markdfetch.fetch("https://example.com")
links = page.links()
print(links)
Example output:
[
    {
        "url": "https://example.com/about",
        "text": "About Us"
    },
    {
        "url": "https://example.com/contact",
        "text": "Contact"
    }
]
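To make the output shape and the relative-URL resolution listed under Features more concrete, here is a minimal standard-library sketch of the general technique. It is not markdfetch's implementation: `LinkCollector` and `extract_links` are hypothetical names, and the only calls used are `html.parser.HTMLParser` and `urllib.parse.urljoin`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect <a href> links as {"url", "text"} dictionaries."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # resolved href of the <a> currently open, if any
        self._parts = []   # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # Resolve relative URLs against the page URL.
                self._href = urljoin(self.base_url, href)
                self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            text = " ".join("".join(self._parts).split())
            self.links.append({"url": self._href, "text": text})
            self._href = None


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<a href="/about">About Us</a> <a href="contact">Contact</a>'
print(extract_links(sample, "https://example.com/"))
```

Both the root-relative `/about` and the document-relative `contact` resolve to absolute URLs under `https://example.com/`.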
Skip Empty Links
page = markdfetch.fetch("https://example.com")
links = page.links(skip_empty=True)
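The README does not define what counts as an "empty" link. Assuming it means an entry whose URL or anchor text is blank, the same filtering can be sketched over the documented list-of-dictionaries shape. The function `drop_empty_links` is a hypothetical name, not part of markdfetch.

```python
def drop_empty_links(links):
    """Keep only entries whose URL and anchor text are both non-blank (assumed meaning of "empty")."""
    return [
        link for link in links
        if link.get("url", "").strip() and link.get("text", "").strip()
    ]


links = [
    {"url": "https://example.com/about", "text": "About Us"},
    {"url": "https://example.com/logo", "text": ""},
    {"url": "https://example.com/contact", "text": "   "},
]
print(drop_empty_links(links))
# [{'url': 'https://example.com/about', 'text': 'About Us'}]
```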
Extract Content Using CSS Selectors
Target specific elements using CSS selectors.
page = markdfetch.fetch("https://example.com")
markdown = page.markdown(
    selector="article"
)
print(markdown)
You can use any valid CSS selector:
page.markdown(selector=".content")
page.markdown(selector="#main")
page.markdown(selector="article.post")
Extract Text Using CSS Selectors
Extract plain text from specific sections of a page.
page = markdfetch.fetch("https://example.com")
text = page.text(
    selector=".content"
)
print(text)
Extract Unique Links
Remove duplicate URLs from the extracted links.
page = markdfetch.fetch("https://example.com")
links = page.links(
    unique=True
)
print(links)
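Deduplication by URL can be illustrated on the documented output shape. The sketch below keeps the first occurrence of each URL and preserves order; whether markdfetch keeps the first or last duplicate is not stated in the README, so that choice is an assumption, and `unique_links` is a hypothetical name.

```python
def unique_links(links):
    """Drop entries whose URL was already seen, keeping the first occurrence."""
    seen = set()
    result = []
    for link in links:
        if link["url"] not in seen:
            seen.add(link["url"])
            result.append(link)
    return result


links = [
    {"url": "https://example.com/about", "text": "About Us"},
    {"url": "https://example.com/contact", "text": "Contact"},
    {"url": "https://example.com/about", "text": "Learn more"},
]
print(unique_links(links))
# [{'url': 'https://example.com/about', 'text': 'About Us'},
#  {'url': 'https://example.com/contact', 'text': 'Contact'}]
```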
Roadmap
Planned features:
- Async support via httpx
- Proxy support
- Metadata extraction
License
MIT License
File details
Details for the file markdfetch-0.1.0.tar.gz.
File metadata
- Download URL: markdfetch-0.1.0.tar.gz
- Upload date:
- Size: 4.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cdc84afe23d55973656d266fb6dfca591eac49c64f912151d18e3e60f07225c8 |
| MD5 | 9a888b964e151bbd72870bddd6f0c8d6 |
| BLAKE2b-256 | f651aa0e05fbf4ecbf41bf337756e01fa276f66d39a709e45df57ede89b903e5 |
File details
Details for the file markdfetch-0.1.0-py3-none-any.whl.
File metadata
- Download URL: markdfetch-0.1.0-py3-none-any.whl
- Upload date:
- Size: 4.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 959f003027bb7a41205679724899cedbd296696c490d962aef00800a2b453f22 |
| MD5 | 6080fe55acda81599c0dcb406575d5ba |
| BLAKE2b-256 | ad13ae1a02d60cbf0311cc99b7c1a91bc8341040a521681924ecca5f387909bd |