HTML to Markdown converter with Requests or Playwright backend
pg2md — Page to Markdown
Convert any webpage to clean Markdown. Choose between the fast Requests backend and the full-browser Playwright backend for JavaScript-rendered pages.
Features
- Two backends: `Pg2MdRequests` (fast) or `Pg2MdPlaywright` (JS support)
- Browser reuse: Playwright instances share a single browser
- Proxy support: HTTP/HTTPS proxies with authentication
- Custom headers & cookies: Full control over requests
- Clean output: Optional removal of images and links
- Context manager: Auto-cleanup with the `with` statement
Installation
pip install pg2md
# For Playwright backend:
pip install pg2md[playwright]
playwright install chromium
Quick Start
from pg2md import Pg2MdRequests, Pg2MdPlaywright
# Simple usage with Requests
pg = Pg2MdRequests()
markdown = pg.run("https://example.com")
print(markdown)
# Playwright for JS-heavy sites
pg = Pg2MdPlaywright()
markdown = pg.run("https://spa-example.com")
pg.close()
Usage
Basic Conversion
from pg2md import Pg2MdRequests
pg = Pg2MdRequests(with_image=False, with_link=False)
md = pg.run("https://news.ycombinator.com")
With Proxy
from pg2md import Pg2MdRequests, Pg2MdPlaywright
# Format: http://user:password@host:port
# Or: host:port:user:password
proxy = "http://user:pass@proxy.example.com:8080"
# Requests
pg = Pg2MdRequests()
md = pg.run("https://example.com", proxy=proxy)
# Playwright
pg = Pg2MdPlaywright()
md = pg.run("https://example.com", proxy=proxy)
pg.close()
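The comments above list two proxy spellings. When proxies arrive from a list in `host:port:user:password` form and a single uniform spelling is wanted (for logging or deduplication, say), a small helper can rewrite them to URL form. This is a minimal sketch: `to_proxy_url` is a hypothetical helper and not part of pg2md, and the `http` default scheme is an assumption.

```python
from urllib.parse import quote


def to_proxy_url(proxy: str, scheme: str = "http") -> str:
    """Return the proxy as scheme://user:password@host:port.

    Hypothetical helper, not part of pg2md. Strings that already
    contain "://" are returned unchanged.
    """
    if "://" in proxy:
        return proxy
    # Split at most three times so a password containing ":" stays whole.
    parts = proxy.split(":", 3)
    if len(parts) == 2:
        host, port = parts
        return f"{scheme}://{host}:{port}"
    if len(parts) == 4:
        host, port, user, password = parts
        # Percent-encode credentials so characters like "@" cannot break the URL.
        user, password = quote(user, safe=""), quote(password, safe="")
        return f"{scheme}://{user}:{password}@{host}:{port}"
    raise ValueError(f"unrecognized proxy format: {proxy!r}")
```

The result can be passed as `proxy=` to `run`, `fetch`, or `save`.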
Custom Headers & User-Agent
from pg2md import Pg2MdRequests
pg = Pg2MdRequests()
md = pg.run(
    "https://api.example.com/data",
    headers={
        "X-API-Key": "secret123",
        "Accept": "application/json",
    },
    user_agent="MyBot/1.0",
)
With Cookies
from pg2md import Pg2MdRequests
pg = Pg2MdRequests()
md = pg.run(
    "https://example.com/dashboard",
    cookies={
        "session": "abc123",
        "auth_token": "xyz789",
    },
)
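Cookies are often copied from a browser as a single `Cookie` header string rather than as a dict. The sketch below uses the standard library's `http.cookies.SimpleCookie` to turn such a string into the dict shape shown above. `cookie_header_to_dict` is a hypothetical helper, not part of pg2md.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(header: str) -> dict[str, str]:
    """Parse 'name=value; name2=value2' into a plain dict.

    Hypothetical helper, not part of pg2md.
    """
    jar = SimpleCookie()
    jar.load(header)
    # Each entry is a Morsel; keep only its value.
    return {name: morsel.value for name, morsel in jar.items()}
```

The returned dict can then be passed as `cookies=` to `run`.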
Save to File
from pg2md import Pg2MdRequests
pg = Pg2MdRequests()
pg.save("output.md", "https://example.com")
# With options
pg.save(
    "article.md",
    "https://blog.example.com/post",
    proxy="http://user:pass@host:port",
    user_agent="MyBot/1.0",
)
Context Manager
from pg2md import Pg2MdPlaywright
with Pg2MdPlaywright() as pg:
    md1 = pg.run("https://site1.com")
    md2 = pg.run("https://site2.com")
# Browser closed automatically
Multiple Instances
from pg2md import Pg2MdPlaywright
# Both share the same browser (efficient)
pg1 = Pg2MdPlaywright()
pg2 = Pg2MdPlaywright()
md1 = pg1.run("https://site1.com")
md2 = pg2.run("https://site2.com")
Pg2MdPlaywright.close_all() # Close shared browser
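If one of the `run` calls raises, the `close_all()` line above is never reached. One way to guard against that is a `try`/`finally` wrapper. This is a sketch only: `run_all` is a hypothetical helper, and it relies solely on the `run(url)` and `close_all()` names documented here.

```python
def run_all(converter_cls, urls):
    """Convert each URL with its own instance, then close the shared browser.

    converter_cls is expected to behave like Pg2MdPlaywright: instances
    expose run(url) and the class exposes close_all(). run_all itself is
    a hypothetical helper, not part of pg2md.
    """
    results = {}
    try:
        for url in urls:
            results[url] = converter_cls().run(url)
    finally:
        # Runs even when a page fails, so the browser is not left open.
        converter_cls.close_all()
    return results
```

With pg2md installed, a call might look like `run_all(Pg2MdPlaywright, ["https://site1.com", "https://site2.com"])`.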
API Reference
Pg2MdRequests
Pg2MdRequests(with_image=False, with_link=False)
| Parameter | Type | Default | Description |
|---|---|---|---|
| with_image | bool | False | Include images in output |
| with_link | bool | False | Include links in output |
Pg2MdPlaywright
Pg2MdPlaywright(
    browser=None,    # Custom Browser instance
    headless=True,   # Headless mode
    with_image=False,
    with_link=False,
)
Methods
run(url, proxy=None, headers=None, cookies=None, user_agent=None, timeout=30)
Fetch URL and convert to Markdown.
Returns: str (Markdown)
fetch(url, proxy=None, headers=None, cookies=None, user_agent=None, timeout=30)
Fetch HTML only.
Returns: str (HTML)
convert(html)
Convert HTML to Markdown.
Returns: str (Markdown)
save(filepath, url, **kwargs)
Fetch, convert, and save to file.
close()
Close browser (Playwright only).
close_all() (classmethod, Playwright only)
Close all shared browsers.
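Because `fetch` and `convert` are exposed separately, the raw HTML can be kept next to the Markdown, for example to re-convert later without another request. The sketch below assumes that calling `convert` on the result of `fetch` matches what `run` returns, which the method descriptions suggest but do not state outright. `save_html_and_markdown` and its `stem` parameter are hypothetical.

```python
from pathlib import Path


def save_html_and_markdown(pg, url, stem, **fetch_kwargs):
    """Write <stem>.html and <stem>.md for one URL and return both paths.

    pg is any object exposing fetch(url, **kwargs) and convert(html),
    as both pg2md backends are described to. Hypothetical helper.
    """
    html = pg.fetch(url, **fetch_kwargs)
    markdown = pg.convert(html)
    html_path = Path(f"{stem}.html")
    md_path = Path(f"{stem}.md")
    html_path.write_text(html, encoding="utf-8")
    md_path.write_text(markdown, encoding="utf-8")
    return html_path, md_path
```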
When to Use Which Backend?
| Use Requests | Use Playwright |
|---|---|
| Static HTML pages | SPA / JavaScript apps |
| Speed matters | Need rendered content |
| Simple scraping | Bypass anti-bot (sometimes) |
| Low memory | Modern web apps |
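When it is unclear up front which column a site falls into, one possible approach is to try the fast backend first and fall back to the browser only when the result looks empty. This is a sketch, not a pg2md feature: `run_with_fallback` is hypothetical, the `min_chars` threshold is arbitrary, and it assumes a failed fetch raises an exception, which this README does not state.

```python
def run_with_fallback(fast, full, url, min_chars=200, **kwargs):
    """Try the fast backend; use the browser backend if output is too short.

    fast and full are expected to expose run(url, **kwargs), as
    Pg2MdRequests and Pg2MdPlaywright are described to. Hypothetical helper.
    """
    try:
        markdown = fast.run(url, **kwargs)
    except Exception:
        # Treat any failure of the fast path as "no content".
        markdown = ""
    if len(markdown.strip()) >= min_chars:
        return markdown
    return full.run(url, **kwargs)
```

A short result is only a rough signal that content is rendered by JavaScript, so the threshold may need tuning per site.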
Examples
Scrape Multiple URLs
from pg2md import Pg2MdRequests
urls = [
    "https://blog.example.com/post1",
    "https://blog.example.com/post2",
    "https://blog.example.com/post3",
]
pg = Pg2MdRequests(with_image=False, with_link=False)
for i, url in enumerate(urls):
    pg.save(f"post_{i+1}.md", url)
    print(f"Saved: {url}")
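The loop above names files by position. If file names should instead reflect the page they came from, a slug can be derived from the URL. This is a minimal sketch using only the standard library; `filename_for` is a hypothetical helper, not part of pg2md.

```python
import re
from urllib.parse import urlparse


def filename_for(url: str) -> str:
    """Build a filesystem-friendly .md name from a URL's host and path.

    Hypothetical helper, not part of pg2md. Query strings are ignored.
    """
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return f"{slug or 'index'}.md"
```

The result can replace the f-string in the loop, as in `pg.save(filename_for(url), url)`. Two URLs that differ only in their query string map to the same name, so this suits simple article paths best.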
Batch with Proxies
from pg2md import Pg2MdRequests
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
proxies = [
    "http://user1:pass1@proxy1:8080",
    "http://user2:pass2@proxy2:8080",
]
pg = Pg2MdRequests()
for i, url in enumerate(urls):
    proxy = proxies[i % len(proxies)]
    md = pg.run(url, proxy=proxy)
    print(f"[{i+1}] {len(md)} chars")
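The rotation above assigns one proxy per URL and stops at the first error. A variation is to retry a failing URL on the next proxy. The sketch below assumes a failed fetch raises an exception, which this README does not state. `run_with_rotation` and its `attempts` parameter are hypothetical.

```python
from itertools import cycle


def run_with_rotation(pg, url, proxies, attempts=3):
    """Call pg.run(url, proxy=...) on successive proxies until one succeeds.

    pg is any object exposing run(url, proxy=...). Each call starts again
    from the first proxy in the list. Hypothetical helper, not part of pg2md.
    """
    if not proxies:
        raise ValueError("proxies must not be empty")
    rotation = cycle(proxies)
    last_error = None
    for _ in range(attempts):
        proxy = next(rotation)
        try:
            return pg.run(url, proxy=proxy)
        except Exception as exc:
            # Remember the failure and move on to the next proxy.
            last_error = exc
    raise RuntimeError(f"all {attempts} attempts failed for {url}") from last_error
```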
Extract Article Content
from pg2md import Pg2MdPlaywright
with Pg2MdPlaywright() as pg:
    md = pg.run(
        "https://medium.com/some-article",
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    )
# Save clean text (explicit encoding avoids platform-dependent defaults)
with open("article.md", "w", encoding="utf-8") as f:
    f.write(md)
License
MIT