pg2md

HTML to Markdown converter with Requests or Playwright backend.

Convert any webpage to clean Markdown. Choose between the fast Requests backend for static pages and the full-browser Playwright backend for JavaScript-rendered pages.

Features

  • Two backends: Pg2MdRequests (fast) or Pg2MdPlaywright (JS support)
  • Browser reuse: Playwright instances share a single browser
  • Proxy support: HTTP/HTTPS proxies with authentication
  • Custom headers & cookies: Full control over requests
  • Clean output: Optional removal of images and links
  • Context manager: Automatic cleanup with the with statement

Installation

pip install pg2md

# For Playwright backend:
pip install pg2md[playwright]
playwright install chromium
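
Because the Playwright backend is an optional extra, a script may want to check whether it is installed before picking a backend. The following is a minimal sketch using only the standard library; playwright_available is a hypothetical helper and not part of pg2md.

```python
from importlib.util import find_spec


def playwright_available() -> bool:
    """Return True if the `playwright` package can be imported.

    Hypothetical helper (not part of pg2md). It only checks that the
    Python package is installed, not that `playwright install chromium`
    has already downloaded a browser.
    """
    return find_spec("playwright") is not None


# Pick a backend name based on what is installed.
backend = "Pg2MdPlaywright" if playwright_available() else "Pg2MdRequests"
print(f"Suggested backend: {backend}")
```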

Quick Start

from pg2md import Pg2MdRequests, Pg2MdPlaywright

# Simple usage with Requests
pg = Pg2MdRequests()
markdown = pg.run("https://example.com")
print(markdown)

# Playwright for JS-heavy sites
pg = Pg2MdPlaywright()
markdown = pg.run("https://spa-example.com")
pg.close()

Usage

Basic Conversion

from pg2md import Pg2MdRequests

pg = Pg2MdRequests(with_image=False, with_link=False)
md = pg.run("https://news.ycombinator.com")

With Proxy

from pg2md import Pg2MdRequests, Pg2MdPlaywright

# Format: http://user:password@host:port
# Or: host:port:user:password
proxy = "http://user:pass@proxy.example.com:8080"

# Requests
pg = Pg2MdRequests()
md = pg.run("https://example.com", proxy=proxy)

# Playwright
pg = Pg2MdPlaywright()
md = pg.run("https://example.com", proxy=proxy)
pg.close()
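
The two proxy formats above carry the same information. As an illustration of how they relate, here is a minimal sketch that rewrites host:port:user:password into URL form. normalize_proxy is a hypothetical helper; how pg2md parses proxies internally is not shown here and may differ.

```python
from urllib.parse import quote


def normalize_proxy(proxy: str, scheme: str = "http") -> str:
    """Convert `host:port:user:password` to `scheme://user:password@host:port`.

    Hypothetical helper for illustration only. Strings that already
    contain a scheme are returned unchanged.
    """
    if "://" in proxy:
        return proxy
    parts = proxy.split(":", 3)  # the password may itself contain ':'
    if len(parts) == 2:
        host, port = parts
        return f"{scheme}://{host}:{port}"
    if len(parts) == 4:
        host, port, user, password = parts
        # Percent-encode credentials so special characters stay valid in a URL.
        return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
    raise ValueError(f"Unrecognized proxy format: {proxy!r}")


print(normalize_proxy("proxy.example.com:8080:user:pass"))
# http://user:pass@proxy.example.com:8080
```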

Custom Headers & User-Agent

from pg2md import Pg2MdRequests

pg = Pg2MdRequests()
md = pg.run(
    "https://api.example.com/data",
    headers={
        "X-API-Key": "secret123",
        "Accept": "application/json",
    },
    user_agent="MyBot/1.0",
)

With Cookies

from pg2md import Pg2MdRequests

pg = Pg2MdRequests()
md = pg.run(
    "https://example.com/dashboard",
    cookies={
        "session": "abc123",
        "auth_token": "xyz789",
    },
)

Save to File

from pg2md import Pg2MdRequests

pg = Pg2MdRequests()
pg.save("output.md", "https://example.com")

# With options
pg.save(
    "article.md",
    "https://blog.example.com/post",
    proxy="http://user:pass@host:port",
    user_agent="MyBot/1.0",
)

Context Manager

from pg2md import Pg2MdPlaywright

with Pg2MdPlaywright() as pg:
    md1 = pg.run("https://site1.com")
    md2 = pg.run("https://site2.com")
    # Browser closed automatically
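
To show what the with statement guarantees, here is a minimal sketch of the context-manager protocol using a stand-in class. FakeConverter is hypothetical and only mimics the run/close shape described in this README; it is not pg2md's implementation.

```python
class FakeConverter:
    """Stand-in for a converter that owns a resource (not pg2md's class)."""

    def __init__(self) -> None:
        self.closed = False

    def run(self, url: str) -> str:
        if self.closed:
            raise RuntimeError("converter is closed")
        return f"# Markdown for {url}"

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "FakeConverter":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Called when the block exits, even if the block raised.
        self.close()


with FakeConverter() as pg:
    md = pg.run("https://site1.com")

print(pg.closed)  # True
```

Without the with statement, the same guarantee needs an explicit try/finally around run() that calls close().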

Multiple Instances

from pg2md import Pg2MdPlaywright

# Both share the same browser (efficient)
pg1 = Pg2MdPlaywright()
pg2 = Pg2MdPlaywright()

md1 = pg1.run("https://site1.com")
md2 = pg2.run("https://site2.com")

Pg2MdPlaywright.close_all()  # Close shared browser
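
One common way to get this kind of sharing is a class-level attribute that is created lazily and reused by every instance. The sketch below illustrates that pattern with stand-in classes; FakeBrowser and SharedBrowserClient are hypothetical names, and pg2md's actual mechanism may differ.

```python
from typing import ClassVar, Optional


class FakeBrowser:
    """Stand-in for a real browser handle."""

    def __init__(self) -> None:
        self.open = True

    def close(self) -> None:
        self.open = False


class SharedBrowserClient:
    """Illustrative only: every instance reuses one class-level browser."""

    _shared: ClassVar[Optional[FakeBrowser]] = None

    def __init__(self) -> None:
        cls = type(self)
        if cls._shared is None:
            cls._shared = FakeBrowser()  # launched once, on first use
        self.browser = cls._shared

    @classmethod
    def close_all(cls) -> None:
        # Close the shared browser and forget it so a later instance relaunches.
        if cls._shared is not None:
            cls._shared.close()
            cls._shared = None


a = SharedBrowserClient()
b = SharedBrowserClient()
print(a.browser is b.browser)  # True
SharedBrowserClient.close_all()
print(a.browser.open)  # False
```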

API Reference

Pg2MdRequests

Pg2MdRequests(with_image=False, with_link=False)
Parameter    Type   Default   Description
with_image   bool   False     Include images in output
with_link    bool   False     Include links in output

Pg2MdPlaywright

Pg2MdPlaywright(
    browser=None,       # Custom Browser instance
    headless=True,      # Headless mode
    with_image=False,
    with_link=False,
)

Methods

run(url, proxy=None, headers=None, cookies=None, user_agent=None, timeout=30)

Fetch URL and convert to Markdown.

Returns: str (Markdown)

fetch(url, proxy=None, headers=None, cookies=None, user_agent=None, timeout=30)

Fetch HTML only.

Returns: str (HTML)

convert(html)

Convert HTML to Markdown.

Returns: str (Markdown)

save(filepath, url, **kwargs)

Fetch, convert, and save to file.

close()

Close browser (Playwright only).

close_all() (classmethod, Playwright only)

Close all shared browsers.
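
The method descriptions suggest that run() combines fetch() and convert(); that reading is an assumption, not something stated above. Calling the two steps separately is useful when you want to keep the raw HTML as well as the Markdown. A minimal sketch, using a hypothetical offline stub in place of a real backend so it runs without network access:

```python
from typing import Protocol, Tuple


class Converter(Protocol):
    """The two methods this sketch relies on."""

    def fetch(self, url: str) -> str: ...
    def convert(self, html: str) -> str: ...


def fetch_and_convert(pg: Converter, url: str) -> Tuple[str, str]:
    """Return (html, markdown), fetching the page only once."""
    html = pg.fetch(url)
    return html, pg.convert(html)


class StubConverter:
    """Offline stand-in (hypothetical); a real backend would go here."""

    def fetch(self, url: str) -> str:
        return "<h1>Hello</h1>"

    def convert(self, html: str) -> str:
        return html.replace("<h1>", "# ").replace("</h1>", "")


html, md = fetch_and_convert(StubConverter(), "https://example.com")
print(md)  # prints: # Hello
```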

When to Use Which Backend?

Use Requests        Use Playwright
Static HTML pages   SPA / JavaScript apps
Speed matters       Need rendered content
Simple scraping     Bypass anti-bot (sometimes)
Low memory          Modern web apps
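
When it is unclear which column a site falls into, one option is to try the fast backend first and fall back to the browser only if the result looks empty. The sketch below shows that idea with plain callables; convert_with_fallback and the min_chars threshold are hypothetical and not part of pg2md.

```python
from typing import Callable


def convert_with_fallback(
    url: str,
    fast: Callable[[str], str],
    full: Callable[[str], str],
    min_chars: int = 200,
) -> str:
    """Try the fast backend first; fall back if the result looks empty.

    `min_chars` is an arbitrary illustrative threshold, not a pg2md setting.
    """
    try:
        md = fast(url)
    except Exception:
        md = ""  # treat a failed fast fetch like an empty result
    if len(md.strip()) >= min_chars:
        return md
    return full(url)


# Demo with stand-in callables instead of real backends.
result = convert_with_fallback(
    "https://spa-example.com",
    fast=lambda url: "",  # e.g. an empty JavaScript shell
    full=lambda url: "# Rendered page",
)
print(result)  # prints: # Rendered page
```

With pg2md, fast and full could be the run methods of a Pg2MdRequests and a Pg2MdPlaywright instance respectively.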

Examples

Scrape Multiple URLs

from pg2md import Pg2MdRequests

urls = [
    "https://blog.example.com/post1",
    "https://blog.example.com/post2",
    "https://blog.example.com/post3",
]

pg = Pg2MdRequests(with_image=False, with_link=False)

for i, url in enumerate(urls, start=1):
    pg.save(f"post_{i}.md", url)
    print(f"Saved: {url}")
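
Index-based names such as post_1.md lose the connection to the source page. As an alternative, the file name can be derived from the URL itself. A minimal sketch; filename_for is a hypothetical helper, not part of pg2md.

```python
import re
from urllib.parse import urlparse


def filename_for(url: str) -> str:
    """Build a filesystem-friendly .md name from a URL's host and path."""
    parsed = urlparse(url)
    # Replace every run of characters other than letters, digits, '.' and '-'.
    slug = re.sub(r"[^A-Za-z0-9.-]+", "_", parsed.netloc + parsed.path).strip("_")
    return f"{slug or 'page'}.md"


print(filename_for("https://blog.example.com/post1"))
# blog.example.com_post1.md
```

The query string is ignored here, so two URLs that differ only in their query would map to the same file.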

Batch with Proxies

from pg2md import Pg2MdRequests

urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
proxies = [
    "http://user1:pass1@proxy1:8080",
    "http://user2:pass2@proxy2:8080",
]

pg = Pg2MdRequests()

for i, url in enumerate(urls):
    proxy = proxies[i % len(proxies)]
    md = pg.run(url, proxy=proxy)
    print(f"[{i+1}] {len(md)} chars")
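
The loop above assigns one proxy per URL and stops at the first error. If a proxy may be down, a small wrapper can try the next one instead. The sketch below assumes only that the callable accepts url and a proxy keyword, as run() does above; run_with_failover and fake_run are hypothetical, and the exceptions pg2md raises are not documented here, so the sketch catches Exception broadly.

```python
from typing import Callable, Optional, Sequence


def run_with_failover(
    run: Callable[..., str], url: str, proxies: Sequence[str]
) -> str:
    """Try each proxy in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for proxy in proxies:
        try:
            return run(url, proxy=proxy)
        except Exception as exc:  # narrow this once the real errors are known
            last_error = exc
    if last_error is None:
        raise ValueError("no proxies given")
    raise last_error


def fake_run(url: str, proxy: str) -> str:
    """Stand-in for a backend's run(): the first proxy always fails."""
    if "proxy1" in proxy:
        raise ConnectionError("proxy1 is down")
    return f"# {url} via {proxy}"


md = run_with_failover(
    fake_run, "https://site1.com", ["http://proxy1:8080", "http://proxy2:8080"]
)
print(md)  # prints: # https://site1.com via http://proxy2:8080
```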

Extract Article Content

from pg2md import Pg2MdPlaywright

with Pg2MdPlaywright() as pg:
    md = pg.run(
        "https://medium.com/some-article",
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    )
    
    # Save the Markdown with an explicit encoding
    with open("article.md", "w", encoding="utf-8") as f:
        f.write(md)

License

MIT
