docmark
Convert hosted documentation sites to local Markdown — built for feeding LLMs and AI Skills.
Currently optimized for Mintlify-hosted docs (Anthropic, Polymarket, many web3 / crypto sites), which expose the source markdown of any page at <url>.md. The architecture is built around a single downloader strategy, so other doc platforms (Docusaurus, MkDocs, GitBook, ReadMe, generic HTML) can be added without rewriting the rest of the pipeline.
Install
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .
Use
docmark https://docs.polymarket.com/sitemap.xml --output ./output/polymarket
Or without installing:
python -m docmark https://docs.polymarket.com/sitemap.xml --output ./output/polymarket
Options
| Flag | Default | Description |
|---|---|---|
| --output, -o | output | Directory to write markdown files into |
| --concurrency, -c | 10 | Parallel downloads |
| --include-locales | off | Include localized variants (/cn/, /es/, ...). Filtered out by default. |
| --include | none | Only crawl URLs whose path starts with this prefix |
| --exclude | none | Skip URLs whose path starts with this prefix (repeatable) |
| --timeout | 30 | Per-request timeout in seconds |
Examples
Only API reference pages, higher concurrency:
docmark https://docs.polymarket.com/sitemap.xml -o ./out -c 20 --include /api-reference/
Include localized variants (for example the Chinese /cn/ pages) and skip the /builders/ section:
docmark https://docs.polymarket.com/sitemap.xml -o ./out --include-locales --exclude /builders/
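The prefix and locale filters can be pictured roughly as follows. This is a minimal sketch, not docmark's actual code: the `keep_url` function, the `ASSUMED_LOCALES` set, and the `/builders/overview` sample URL are all hypothetical, and the sketch assumes a locale is simply a leading path segment such as /cn/ or /es/.

```python
from urllib.parse import urlparse

# Assumption: a locale is a leading path segment such as /cn/ or /es/.
# The list the real tool uses is not shown in this README.
ASSUMED_LOCALES = {"cn", "es"}


def keep_url(url, include=None, excludes=(), include_locales=False):
    """Return True if the URL passes the locale and prefix filters."""
    path = urlparse(url).path or "/"
    first_segment = path.strip("/").split("/", 1)[0]
    # Localized variants are dropped unless explicitly requested.
    if not include_locales and first_segment in ASSUMED_LOCALES:
        return False
    # --include: keep only paths starting with the given prefix.
    if include is not None and not path.startswith(include):
        return False
    # --exclude (repeatable): drop paths starting with any given prefix.
    return not any(path.startswith(prefix) for prefix in excludes)


urls = [
    "https://docs.polymarket.com/api-reference/trade/cancel-all-orders",
    "https://docs.polymarket.com/builders/overview",  # illustrative URL
    "https://docs.polymarket.com/cn/quickstart",      # illustrative URL
]
kept = [u for u in urls if keep_url(u, excludes=["/builders/"])]
print(kept)
```

With the defaults, only the first URL survives: the second is excluded by prefix and the third is treated as a localized variant.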
How URL paths map to files
https://docs.polymarket.com/ -> output/index.md
https://docs.polymarket.com/quickstart -> output/quickstart.md
https://docs.polymarket.com/api-reference/trade/cancel-all-orders
-> output/api-reference/trade/cancel-all-orders.md
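The mapping above could be implemented along these lines. This is a sketch under the assumption that the output path simply mirrors the URL path; `url_to_relative_path` is a hypothetical helper name, not a function taken from docmark.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_relative_path(url):
    """Map a page URL to a relative .md path that mirrors the URL path."""
    path = urlparse(url).path.strip("/")
    if not path:
        # The site root has no path segment, so it becomes index.md.
        return PurePosixPath("index.md")
    return PurePosixPath(path + ".md")


output_dir = PurePosixPath("output")
for url in (
    "https://docs.polymarket.com/",
    "https://docs.polymarket.com/quickstart",
    "https://docs.polymarket.com/api-reference/trade/cancel-all-orders",
):
    print(url, "->", output_dir / url_to_relative_path(url))
```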
How it works
Mintlify renders HTML for users, but also serves the raw MDX source whenever a request appends .md to a page URL:
https://docs.example.com/quickstart -> rendered HTML
https://docs.example.com/quickstart.md -> raw markdown source
The crawler reads the site's sitemap.xml, requests <url>.md for every entry in parallel, and writes each response to disk preserving the URL path. No HTML parsing, no headless browser, no conversion loss — output matches what the docs author wrote.
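The flow described above (read the sitemap, derive the .md URLs, fetch them with bounded parallelism) might look roughly like this. It is an illustrative sketch only: the function names are hypothetical, `fake_fetch` stands in for a real HTTP request so the example runs offline, and the README does not say how the site root is requested, so root handling is left out.

```python
import asyncio
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def page_urls_from_sitemap(xml_text):
    """Extract every <loc> entry from a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


def markdown_url(page_url):
    """Append .md to a page URL (root-page handling is not covered here)."""
    return page_url.rstrip("/") + ".md"


async def fetch_all(urls, fetch, concurrency=10):
    """Run fetch(url) for every URL, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with semaphore:
            return url, await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url):
    # Stand-in for an HTTP GET so the sketch needs no network access.
    await asyncio.sleep(0)
    return f"# source of {url}"


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/quickstart</loc></url>
  <url><loc>https://docs.example.com/api-reference/trade</loc></url>
</urlset>"""

targets = [markdown_url(u) for u in page_urls_from_sitemap(SAMPLE)]
results = asyncio.run(fetch_all(targets, fake_fetch, concurrency=2))
for url, body in results:
    print(url, "->", body)
```

A semaphore keeps the number of in-flight requests at or below the concurrency limit without having to split the URL list into batches.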
Detecting Mintlify
A site is likely Mintlify if any of these hold:
- <meta name="generator" content="Mintlify"> in the HTML
- Assets served from mintcdn.com
- A llms.txt or llms-full.txt file exists at the site root
- Appending .md to a doc URL returns plain markdown (not HTML)
If .md requests return HTML, the site is not Mintlify and a different strategy is needed.
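The checks listed above can be combined into a small heuristic. The sketch below is not part of docmark: `mintlify_signals` and its parameters are hypothetical, and it works on values already fetched (page HTML, the Content-Type of a .md request, whether llms.txt was found) so it runs without network access.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Record the content of <meta name="generator" ...> if present."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "generator":
            self.generator = attr_map.get("content")


def mintlify_signals(html, md_content_type=None, has_llms_txt=False):
    """Return the list of Mintlify hints found; any one suggests Mintlify."""
    signals = []
    finder = GeneratorFinder()
    finder.feed(html)
    if (finder.generator or "").lower() == "mintlify":
        signals.append("generator meta tag")
    if "mintcdn.com" in html:
        signals.append("assets from mintcdn.com")
    if has_llms_txt:
        signals.append("llms.txt or llms-full.txt at site root")
    # A .md request that does not come back as HTML hints at raw markdown.
    if md_content_type is not None and "text/html" not in md_content_type.lower():
        signals.append(".md URL returns non-HTML")
    return signals


sample_html = (
    '<html><head><meta name="generator" content="Mintlify">'
    '<link rel="icon" href="https://mintcdn.com/example/favicon.png">'
    "</head><body></body></html>"
)
print(mintlify_signals(sample_html, md_content_type="text/markdown"))
```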
Supported platforms
| Platform | Status | Strategy |
|---|---|---|
| Mintlify | Implemented | Append .md to each page URL |
| Docusaurus | Possible | Fetch source .md / .mdx from the docs repo on GitHub |
| MkDocs | Possible | Same — fetch source from the GitHub repo |
| GitBook | Possible | GitBook API (with token), or HTML scrape |
| ReadMe | Possible | ReadMe API (with token), or HTML scrape |
| Generic / custom | Possible | HTML scrape (markdownify or html2text) |
The downloader (src/docmark/downloader.py) is the only piece that knows about a specific platform. Adding a new strategy means writing a small module with a fetch(page_url, client) -> DownloadResult function and wiring it as a --strategy choice in the CLI. Sitemap parsing, filters, file writing, and concurrency stay untouched.
Strategies are added on demand — when a concrete site needs them — not speculatively.
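As a rough illustration of the fetch(page_url, client) -> DownloadResult contract mentioned above, a strategy module might look like the sketch below. Only the function signature comes from this README. The fields of `DownloadResult`, the `Client` interface, the `STRATEGIES` registry, and the synchronous style are all assumptions made for the example; the real definitions live in src/docmark/downloader.py and may differ.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple


@dataclass
class DownloadResult:
    # Hypothetical fields; the real DownloadResult is not shown in the README.
    url: str
    content: Optional[str]
    error: Optional[str] = None


class Client(Protocol):
    # Hypothetical minimal client interface, used only by this sketch.
    def get_text(self, url: str) -> Tuple[int, str]: ...


def fetch(page_url: str, client: Client) -> DownloadResult:
    """Mintlify-style strategy: request <url>.md and return its body."""
    status, body = client.get_text(page_url.rstrip("/") + ".md")
    if status != 200:
        return DownloadResult(page_url, None, f"HTTP {status}")
    return DownloadResult(page_url, body)


# Hypothetical registry that a --strategy CLI choice could look up.
STRATEGIES = {"mintlify": fetch}


class FakeClient:
    """Offline stand-in for an HTTP client so the sketch is runnable."""

    def __init__(self, pages):
        self.pages = pages

    def get_text(self, url):
        if url in self.pages:
            return 200, self.pages[url]
        return 404, ""


client = FakeClient({"https://docs.example.com/quickstart.md": "# Quickstart"})
ok = STRATEGIES["mintlify"]("https://docs.example.com/quickstart", client)
missing = STRATEGIES["mintlify"]("https://docs.example.com/nope", client)
print(ok)
print(missing)
```

Keeping the platform-specific logic behind one function is what lets sitemap parsing, filtering, and file writing stay unchanged when a new platform is added.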
Notes
- Sitemap-driven. URLs not listed in sitemap.xml are not crawled.
- Pages are saved as raw MDX. Mintlify components (<Steps>, <Tabs>, <CardGroup>, ...) are preserved verbatim; Claude and other LLMs read them fine.
- A best-effort fetch of llms.txt and llms-full.txt from the site root is included.
License
MIT — see LICENSE.
File details
Details for the file docmark-0.1.0.tar.gz.
File metadata
- Download URL: docmark-0.1.0.tar.gz
- Upload date:
- Size: 7.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 345bcb0d06c8981ebd9ae576ab8a5e5fcad6a675f8f682bb66d250d228a24105 |
| MD5 | 70c3d39a37c19c7652efb35a9acda048 |
| BLAKE2b-256 | 188207cf0205e4ab84bc02780dec90d29c2f5298116cf9a20cbdab8db6c28882 |
Provenance
The following attestation bundles were made for docmark-0.1.0.tar.gz:
Publisher: publish.yml on eduardodoege/docmark
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docmark-0.1.0.tar.gz
- Subject digest: 345bcb0d06c8981ebd9ae576ab8a5e5fcad6a675f8f682bb66d250d228a24105
- Sigstore transparency entry: 1506868093
- Sigstore integration time:
- Permalink: eduardodoege/docmark@a72a98c502fbbeb054763185567014a3f02a4cdf
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/eduardodoege
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a72a98c502fbbeb054763185567014a3f02a4cdf
- Trigger Event: push
File details
Details for the file docmark-0.1.0-py3-none-any.whl.
File metadata
- Download URL: docmark-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5753b9bffe1d716851a81adccb5e0d060da66783a154def52c805b1ce7a49648 |
| MD5 | 467500cebd409bdfbbadecdbc03fe499 |
| BLAKE2b-256 | 2348fd5ecdd06b5b615525d866947b61d507c2fc8a6e4c70b85e39fbafde6284 |
Provenance
The following attestation bundles were made for docmark-0.1.0-py3-none-any.whl:
Publisher: publish.yml on eduardodoege/docmark
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docmark-0.1.0-py3-none-any.whl
- Subject digest: 5753b9bffe1d716851a81adccb5e0d060da66783a154def52c805b1ce7a49648
- Sigstore transparency entry: 1506868328
- Sigstore integration time:
- Permalink: eduardodoege/docmark@a72a98c502fbbeb054763185567014a3f02a4cdf
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/eduardodoege
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a72a98c502fbbeb054763185567014a3f02a4cdf
- Trigger Event: push