MarkGrab
Universal web content extraction — any URL to LLM-ready markdown.
import asyncio
from markgrab import extract

async def main():
    result = await extract("https://example.com/article")
    print(result.markdown)    # clean markdown
    print(result.title)       # "Article Title"
    print(result.word_count)  # 1234
    print(result.language)    # "en"

asyncio.run(main())
Features
- HTML — BeautifulSoup + content density filtering (removes nav, sidebar, ads)
- YouTube — transcript extraction with timestamps
- PDF — text extraction with page structure
- DOCX — paragraph and heading extraction
- Auto-fallback — tries lightweight httpx first, falls back to Playwright for JS-heavy pages
- Async-first — built on httpx and Playwright async APIs
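The content density filtering listed above is not spelled out in this README. As a rough illustration of the link-density idea behind it, here is a minimal sketch that uses only the standard library's `html.parser` rather than BeautifulSoup. The names `LinkDensityParser`, `link_density`, and `looks_like_boilerplate`, and the 0.5 threshold, are assumptions made for this example and are not MarkGrab's actual implementation.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts visible text characters inside and outside <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.anchor_depth = 0   # > 0 while inside an <a> element
        self.link_chars = 0     # characters of text inside links
        self.total_chars = 0    # characters of all text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth > 0:
            self.link_chars += n


def link_density(html_fragment: str) -> float:
    """Fraction of text characters that sit inside links (0.0 for empty input)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


def looks_like_boilerplate(html_fragment: str, threshold: float = 0.5) -> bool:
    """Hypothetical rule: mostly-link blocks are treated as navigation or ads."""
    return link_density(html_fragment) > threshold


nav = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
body = '<p>This paragraph has plenty of prose and only <a href="/x">one link</a>.</p>'
print(looks_like_boilerplate(nav), looks_like_boilerplate(body))
```

A navigation block is almost entirely link text, so its density is close to 1.0, while an article paragraph with a single link stays well below the threshold.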
Install
pip install markgrab
Optional extras for specific content types:
pip install "markgrab[browser]" # Playwright for JS-rendered pages
pip install "markgrab[youtube]" # YouTube transcript extraction
pip install "markgrab[pdf]" # PDF text extraction
pip install "markgrab[docx]" # DOCX text extraction
pip install "markgrab[all]" # everything
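To see which optional extras are usable in the current environment, something like the following sketch may help. The mapping from extra name to importable module is an assumption inferred from the libraries named in this README (Playwright, youtube-transcript-api, pdfplumber, python-docx), and `available_extras` is a hypothetical helper, not part of MarkGrab.

```python
from importlib.util import find_spec

# Assumed mapping: extra name -> top-level module installed by that extra.
EXTRA_MODULES = {
    "browser": "playwright",
    "youtube": "youtube_transcript_api",
    "pdf": "pdfplumber",
    "docx": "docx",  # import name of the python-docx package
}


def available_extras(modules: dict = EXTRA_MODULES) -> dict:
    """Return {extra: True/False} depending on whether its module can be imported."""
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


for extra, installed in available_extras().items():
    print(f"{extra}: {'installed' if installed else 'missing'}")
```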
Usage
Python API
import asyncio
from markgrab import extract

async def main():
    # HTML (auto-detects content type)
    result = await extract("https://example.com/article")

    # YouTube transcript
    result = await extract("https://youtube.com/watch?v=dQw4w9WgXcQ")

    # PDF
    result = await extract("https://arxiv.org/pdf/1706.03762")

    # Options
    result = await extract(
        "https://example.com",
        max_chars=30_000,       # limit output length (default: 50K)
        use_browser=True,       # force Playwright rendering
        stealth=True,           # anti-bot stealth scripts (opt-in)
        timeout=60.0,           # request timeout in seconds
        proxy="http://proxy:8080",
    )

asyncio.run(main())
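Because the API is async, several URLs can be processed concurrently. The sketch below shows one possible pattern with a bounded number of in-flight requests. `extract_many` and `max_concurrency` are hypothetical names introduced here, and the extraction coroutine is passed in as a parameter so the example runs without MarkGrab installed.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def extract_many(
    urls: list,
    extract_fn: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 3,
) -> dict:
    """Run extract_fn over urls with at most max_concurrency calls in flight.

    A failed URL maps to its exception object instead of aborting the batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(url: str) -> Any:
        async with semaphore:
            return await extract_fn(url)

    outcomes = await asyncio.gather(*(one(u) for u in urls), return_exceptions=True)
    return dict(zip(urls, outcomes))


async def fake_extract(url: str) -> str:
    """Stand-in for a real extraction coroutine (illustration only)."""
    await asyncio.sleep(0)
    return f"markdown for {url}"


results = asyncio.run(extract_many(["https://example.com/a", "https://example.com/b"], fake_extract))
print(results)
```

With MarkGrab installed, passing `extract` from `markgrab` as `extract_fn` should work the same way, since it takes a URL and is awaited as shown in the usage above.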
CLI
markgrab https://example.com # markdown output
markgrab https://example.com -f text # plain text
markgrab https://example.com -f json # structured JSON
markgrab https://example.com --browser # force browser rendering
markgrab https://example.com --max-chars 10000 # limit output
ExtractResult
result.title # page title
result.text # plain text
result.markdown # LLM-ready markdown
result.word_count # word count
result.language # detected language ("en", "ko", ...)
result.content_type # "article", "video", "pdf", "docx"
result.source_url # final URL (after redirects)
result.metadata # extra metadata (video_id, page_count, etc.)
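As an illustration of how these attributes might be consumed, the sketch below formats a result as a context block for an LLM prompt. `to_prompt_context` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in with made-up values that only mirrors the attribute names listed above.

```python
from types import SimpleNamespace


def to_prompt_context(result, max_chars: int = 4000) -> str:
    """Build a short header plus the (truncated) markdown body."""
    header = (
        f"Title: {result.title}\n"
        f"Source: {result.source_url}\n"
        f"Type: {result.content_type}"
    )
    body = result.markdown[:max_chars]
    return f"{header}\n\n{body}"


# Stand-in for an extraction result; attribute names follow the list above.
fake_result = SimpleNamespace(
    title="Article Title",
    markdown="# Article Title\n\nBody text.",
    content_type="article",
    source_url="https://example.com/article",
)

print(to_prompt_context(fake_result))
```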
How it works
flowchart TD
A["🔗 URL Input"] --> B{"Content\nType?"}
B -->|"HTML"| C["HTTP fetch\n(httpx)"]
C --> D{"JS\nrequired?"}
D -->|"no"| E["HTML Parser\n→ clean markdown"]
D -->|"yes"| F["Playwright\nfallback"]
F --> E
B -->|"YouTube"| G["Transcript API\n→ timestamped markdown"]
B -->|"PDF"| H["PDF Parser\n→ structured markdown"]
B -->|"DOCX"| I["DOCX Parser\n→ markdown"]
E --> J["✅ LLM-ready\nMarkdown"]
G --> J
H --> J
I --> J
For HTML pages, if the initial httpx fetch yields fewer than 50 words, MarkGrab automatically retries with Playwright to handle JavaScript-rendered content.
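The fallback rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not MarkGrab's code: `fetch_with_fallback`, `light_fetch`, and `browser_fetch` are hypothetical names, and both fetchers are injected as coroutines that return extracted text so the example runs without httpx or Playwright.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

MIN_WORDS = 50  # threshold taken from the description above

Fetcher = Callable[[str], Awaitable[str]]


async def fetch_with_fallback(
    url: str,
    light_fetch: Fetcher,
    browser_fetch: Fetcher,
    min_words: int = MIN_WORDS,
) -> Tuple[str, str]:
    """Try the lightweight fetcher first; retry with the browser if text is thin.

    Returns (text, strategy) where strategy is "http" or "browser".
    """
    text = await light_fetch(url)
    if len(text.split()) >= min_words:
        return text, "http"
    # Fewer than min_words words: assume the page needs JavaScript rendering.
    return await browser_fetch(url), "browser"


async def static_page(url: str) -> str:
    return "word " * 60          # enough text, no fallback needed


async def js_shell(url: str) -> str:
    return "Loading..."          # typical empty shell of a JS-rendered page


async def rendered(url: str) -> str:
    return "rendered " * 80      # what a headless browser might return


print(asyncio.run(fetch_with_fallback("https://example.com", js_shell, rendered))[1])
```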
Disclaimer
This software is provided for legitimate purposes only. By using MarkGrab, you agree to the following:
- robots.txt: MarkGrab does not check or enforce robots.txt. Users are solely responsible for checking and respecting robots.txt directives and the terms of service of any website they access.
- Rate limiting: MarkGrab does not include built-in rate limiting or request throttling. Users must implement their own rate limiting to avoid overloading target servers. Abusive request patterns may violate applicable laws and website terms of service.
- YouTube transcripts: YouTube transcript extraction relies on the third-party youtube-transcript-api library, which uses YouTube's internal (unofficial) caption API. This may not comply with YouTube's Terms of Service. Use at your own discretion and risk.
- Stealth mode: The optional stealth=True feature modifies browser fingerprinting signals to reduce bot detection. This feature is intended for legitimate use cases such as testing, research, and accessing content that is publicly available to regular browser users. Users are responsible for ensuring their use complies with applicable laws and the terms of service of target websites.
- Legal compliance: Users are responsible for ensuring that their use of MarkGrab complies with all applicable laws, including but not limited to the Computer Fraud and Abuse Act (CFAA), the Digital Millennium Copyright Act (DMCA), GDPR, and equivalent legislation in their jurisdiction.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND. See the LICENSE file for the full MIT license text.
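Since robots.txt checks and rate limiting are left to the caller, one way to handle both with only the standard library is sketched below. `allowed_by_robots` and `MinIntervalThrottle` are hypothetical helpers written for this example. The robots.txt body is passed in as a string, so fetching it is up to you.

```python
import asyncio
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class MinIntervalThrottle:
    """Keeps at least `interval` seconds between successive wait() returns."""

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self._lock = None   # created lazily inside the running event loop
        self._last = None   # monotonic time of the previous request

    async def wait(self) -> None:
        if self._lock is None:
            self._lock = asyncio.Lock()
        async with self._lock:
            if self._last is not None:
                delay = self._last + self.interval - time.monotonic()
                if delay > 0:
                    await asyncio.sleep(delay)
            self._last = time.monotonic()


ROBOTS = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(ROBOTS, "https://example.com/article"))
print(allowed_by_robots(ROBOTS, "https://example.com/private/page"))
```

In a crawl loop, calling `await throttle.wait()` before each extraction and skipping URLs for which `allowed_by_robots` returns False covers the first two items of the disclaimer.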
Acknowledgments
MarkGrab builds on excellent open-source work and well-established techniques:
- puppeteer-extra-plugin-stealth: stealth evasion patterns (webdriver removal, plugin mocking, WebGL spoofing) that inspired the opt-in anti_bot/stealth.py module
- Mozilla Readability: content area detection priority (article > main > body) and link density filtering concepts used in the density filter
- Boilerpipe (Kohlschütter et al., 2010): the academic origin of link density ratio algorithms for boilerplate removal
- Jina Reader: validated the market need for URL-to-markdown extraction; MarkGrab aims to be a lightweight, self-hosted alternative
Built with httpx, BeautifulSoup, markdownify, Playwright, youtube-transcript-api, pdfplumber, and python-docx.
Used in
- newswatch — RSS news monitoring pipeline (feedkit → markgrab → embgrep → diffgrab)
- watchdeck — Web page monitoring with visual diffs and safety guards
License
MIT
Part of the QuartzUnit ecosystem — composable Python libraries for data collection, extraction, search, and AI agent safety.