Fetch web pages and convert them to Markdown

markdfetch

A lightweight Python library for fetching web pages and extracting content as Markdown, plain text, or structured links.

Features

  • Fetch web pages with a simple API
  • Convert HTML to Markdown
  • Extract plain text from web pages
  • Extract links with URL and anchor text
  • Exclude unwanted HTML tags before processing
  • Include only specific HTML tags before processing
  • Support for custom request headers and timeouts
  • Automatic resolution of relative URLs
  • CSS selector support
  • Optional link deduplication
  • Automatic retry handling
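To make "automatic resolution of relative URLs" concrete, here is a minimal sketch of that step using the standard library's `urllib.parse.urljoin`. It is not taken from markdfetch's source, and `resolve_links` is a hypothetical helper named only for this illustration.

```python
from urllib.parse import urljoin


def resolve_links(base_url, hrefs):
    """Resolve each href against base_url (hypothetical helper, not markdfetch API)."""
    return [urljoin(base_url, href) for href in hrefs]


resolved = resolve_links(
    "https://example.com/docs/index.html",
    ["/about", "guide.html", "../contact", "https://other.example/x"],
)

for url in resolved:
    print(url)
```

Root-relative, document-relative, and parent-relative hrefs are all resolved against the page URL, while absolute URLs pass through unchanged.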

Installation

pip install markdfetch

Quick Start

import markdfetch

page = markdfetch.fetch("https://example.com")

print(page.markdown())

Fetch a Page

import markdfetch

page = markdfetch.fetch("https://example.com")

print(page.status_code)
print(page.url)

Convert HTML to Markdown

import markdfetch

page = markdfetch.fetch("https://example.com")

markdown = page.markdown()

print(markdown)

Exclude HTML Tags

Remove unwanted sections before converting to Markdown.

import markdfetch

page = markdfetch.fetch("https://example.com")

markdown = page.markdown(
    exclude=["nav", "footer"]
)

print(markdown)

Include Specific HTML Tags

Extract content only from selected tags.

import markdfetch

page = markdfetch.fetch("https://example.com")

markdown = page.markdown(
    include=["article"]
)

print(markdown)

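Combine Include and Exclude

Pass both arguments in one call to keep only the selected tags and remove the unwanted ones.

import markdfetch

page = markdfetch.fetch("https://example.com")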

markdown = page.markdown(
    include=["article"],
    exclude=["nav", "footer"]
)

print(markdown)
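The idea behind tag-based filtering can be sketched with the standard library's `html.parser`. This is an illustration under stated assumptions, not markdfetch's implementation: `TagFilter` and `filter_text` are hypothetical names, the sketch assumes that `exclude` also applies inside included tags, and it emits plain text rather than Markdown.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Keep text inside `include` tags, skipping anything inside `exclude` tags."""

    def __init__(self, include=None, exclude=None):
        super().__init__()
        self.include = set(include or [])
        self.exclude = set(exclude or [])
        self.include_depth = 0  # how many included tags are currently open
        self.exclude_depth = 0  # how many excluded tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.include:
            self.include_depth += 1
        if tag in self.exclude:
            self.exclude_depth += 1

    def handle_endtag(self, tag):
        if tag in self.include and self.include_depth:
            self.include_depth -= 1
        if tag in self.exclude and self.exclude_depth:
            self.exclude_depth -= 1

    def handle_data(self, data):
        # With no include list, everything outside excluded tags is kept.
        included = not self.include or self.include_depth > 0
        if included and self.exclude_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def filter_text(html, include=None, exclude=None):
    """Return the text that survives the include/exclude filters."""
    parser = TagFilter(include, exclude)
    parser.feed(html)
    return " ".join(parser.chunks)


html = (
    "<nav>Menu</nav>"
    "<article><h1>Title</h1><p>Body text.</p><footer>Credits</footer></article>"
    "<footer>Site footer</footer>"
)

print(filter_text(html, include=["article"], exclude=["nav", "footer"]))
print(filter_text(html, include=["article"]))
```

The sketch tracks nesting depth with counters, so it assumes well-formed markup with explicit closing tags for the filtered elements.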

Extract Plain Text

import markdfetch

page = markdfetch.fetch("https://example.com")

text = page.text()

print(text)

Extract Links

import markdfetch

page = markdfetch.fetch("https://example.com")

links = page.links()

print(links)

Example output:

[
    {
        "url": "https://example.com/about",
        "text": "About Us"
    },
    {
        "url": "https://example.com/contact",
        "text": "Contact"
    }
]
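For readers curious how output of this shape can be produced, the following is a rough sketch built on the standard library's `html.parser` and `urljoin`. It is not markdfetch's code: `LinkCollector` is a hypothetical class, and it ignores edge cases such as nested anchors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect <a href> links as {"url", "text"} dicts (illustrative only)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({
                "url": urljoin(self.base_url, self._href),
                "text": "".join(self._text).strip(),
            })
            self._href = None


html = '<p><a href="/about">About Us</a> <a href="/contact">Contact</a></p>'
collector = LinkCollector("https://example.com")
collector.feed(html)
links = collector.links

print(links)
```

Each link is recorded when its closing `</a>` is seen, so the anchor text includes everything between the opening and closing tags.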

Skip Empty Links

Omit empty links from the results.

import markdfetch

page = markdfetch.fetch("https://example.com")

links = page.links(skip_empty=True)
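The README does not define what counts as an "empty" link. As a sketch only, the filter below assumes it means a link whose URL or anchor text is blank; `skip_empty_links` is a hypothetical helper operating on the list-of-dicts shape shown in the example output, and the library's actual rule may differ.

```python
def skip_empty_links(links):
    """Drop links with a blank URL or blank anchor text (assumed meaning of "empty")."""
    return [
        link for link in links
        if link.get("url", "").strip() and link.get("text", "").strip()
    ]


links = [
    {"url": "https://example.com/about", "text": "About Us"},
    {"url": "https://example.com/logo", "text": ""},
    {"url": "", "text": "Broken"},
]
filtered = skip_empty_links(links)

print(filtered)
```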

Extract Content Using CSS Selectors

Target specific elements using CSS selectors.

import markdfetch

page = markdfetch.fetch("https://example.com")

markdown = page.markdown(
    selector="article"
)

print(markdown)

You can use any valid CSS selector:

page.markdown(selector=".content")
page.markdown(selector="#main")
page.markdown(selector="article.post")
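To show what the three selectors above are matching on, here is a small sketch that tests an element's tag name, classes, and id against a simple selector. It covers only the `tag`, `.class`, `#id`, and combined `tag.class` forms; real CSS selectors support far more, and `matches_simple_selector` is a hypothetical helper, not part of markdfetch.

```python
import re

# Optional tag name followed by any number of .class or #id parts.
_SIMPLE_SELECTOR = re.compile(r"([a-zA-Z][\w-]*)?((?:[.#][\w-]+)*)")


def matches_simple_selector(selector, tag, attrs):
    """Return True if an element (tag name plus attribute dict) matches the selector."""
    match = _SIMPLE_SELECTOR.fullmatch(selector)
    if not selector or match is None:
        raise ValueError(f"unsupported selector: {selector!r}")

    wanted_tag, rest = match.group(1), match.group(2)
    if wanted_tag and wanted_tag.lower() != tag.lower():
        return False

    classes = set((attrs.get("class") or "").split())
    for kind, name in re.findall(r"([.#])([\w-]+)", rest):
        if kind == "." and name not in classes:
            return False
        if kind == "#" and attrs.get("id") != name:
            return False
    return True


print(matches_simple_selector(".content", "div", {"class": "content wide"}))
print(matches_simple_selector("#main", "main", {"id": "main"}))
print(matches_simple_selector("article.post", "div", {"class": "post"}))
```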

Extract Text Using CSS Selectors

Extract plain text from specific sections of a page.

import markdfetch

page = markdfetch.fetch("https://example.com")

text = page.text(
    selector=".content"
)

print(text)

Extract Unique Links

Remove duplicate URLs from the extracted links.

import markdfetch

page = markdfetch.fetch("https://example.com")

links = page.links(
    unique=True
)

print(links)
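Deduplication of this kind can be sketched as an order-preserving filter keyed on the URL. The helper below is hypothetical and makes two assumptions the README does not confirm: the first occurrence of each URL is kept, and URLs are compared as exact strings.

```python
def unique_links(links):
    """Keep the first link seen for each URL, preserving the original order."""
    seen = set()
    result = []
    for link in links:
        if link["url"] not in seen:
            seen.add(link["url"])
            result.append(link)
    return result


links = [
    {"url": "https://example.com/about", "text": "About Us"},
    {"url": "https://example.com/contact", "text": "Contact"},
    {"url": "https://example.com/about", "text": "Learn more"},
]
deduped = unique_links(links)

print(deduped)
```

Comparing exact strings means `https://example.com/about` and `https://example.com/about/` would count as different URLs in this sketch.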

Roadmap

Planned features:

  • Async support via httpx
  • Proxy support
  • Metadata extraction

License

MIT License
