Convert hosted documentation sites to local Markdown — built for feeding LLMs and AI Skills.

docmark

Currently optimized for Mintlify-hosted docs (Anthropic, Polymarket, many web3 / crypto sites), which expose the source markdown of any page at <url>.md. The architecture is built around a single downloader strategy, so other doc platforms (Docusaurus, MkDocs, GitBook, ReadMe, generic HTML) can be added without rewriting the rest of the pipeline.

Install

python -m venv .venv
# Windows PowerShell:
.\.venv\Scripts\Activate.ps1
# macOS / Linux: source .venv/bin/activate
pip install -e .

Use

docmark https://docs.polymarket.com/sitemap.xml --output ./output/polymarket

Or without installing:

python -m docmark https://docs.polymarket.com/sitemap.xml --output ./output/polymarket

Options

Flag               Default  Description
--output, -o       output   Directory to write markdown files into
--concurrency, -c  10       Parallel downloads
--include-locales  off      Include localized variants (/cn/, /es/, ...). Filtered out by default.
--include          none     Only crawl URLs whose path starts with this prefix
--exclude          none     Skip URLs whose path starts with this prefix (repeatable)
--timeout          30       Per-request timeout in seconds
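
The filtering flags can be pictured as a single predicate over URL paths. The following is a minimal sketch, not docmark's actual code: the function name `keep_url`, the `LOCALE_PREFIXES` tuple, and the exact matching rules are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical locale list; the real tool may detect locales differently.
LOCALE_PREFIXES = ("/cn/", "/es/")


def keep_url(url, include=None, exclude=(), include_locales=False):
    """Return True if `url` passes the prefix and locale filters."""
    path = urlparse(url).path or "/"
    # Localized variants are dropped unless --include-locales is set.
    if not include_locales and path.startswith(LOCALE_PREFIXES):
        return False
    # --include keeps only paths under one prefix.
    if include is not None and not path.startswith(include):
        return False
    # --exclude is repeatable, so treat it as a collection of prefixes.
    if any(path.startswith(prefix) for prefix in exclude):
        return False
    return True
```

Under these assumptions, `keep_url(url, include="/api-reference/")` would mirror `--include /api-reference/`, and passing several prefixes in `exclude` would mirror repeating `--exclude`.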

Examples

Only API reference pages, higher concurrency:

docmark https://docs.polymarket.com/sitemap.xml -o ./out -c 20 --include /api-reference/

Include localized variants (such as the Chinese /cn/ pages) and skip the /builders/ section:

docmark https://docs.polymarket.com/sitemap.xml -o ./out --include-locales --exclude /builders/

How URL paths map to files

https://docs.polymarket.com/                           -> output/index.md
https://docs.polymarket.com/quickstart                 -> output/quickstart.md
https://docs.polymarket.com/api-reference/trade/cancel-all-orders
                                                       -> output/api-reference/trade/cancel-all-orders.md
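
The mapping above amounts to taking the URL path, treating the site root as index, and appending .md. A minimal sketch of that rule follows; `url_to_relative_path` is a hypothetical helper name, and the real implementation may handle edge cases (query strings, paths that already end in .md) differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_relative_path(url):
    """Map a page URL to a relative .md file path (hypothetical helper)."""
    path = urlparse(url).path.strip("/")
    if not path:
        # The site root becomes index.md.
        return PurePosixPath("index.md")
    # Every other page keeps its URL path and gains a .md suffix.
    return PurePosixPath(path + ".md")
```

The result is relative, so it would be joined onto whatever directory was passed to --output before writing.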

How it works

Mintlify renders HTML for users, but also serves the raw MDX source whenever a request appends .md to a page URL:

https://docs.example.com/quickstart       -> rendered HTML
https://docs.example.com/quickstart.md    -> raw markdown source

The crawler reads the site's sitemap.xml, requests <url>.md for every entry in parallel, and writes each response to disk preserving the URL path. No HTML parsing, no headless browser, no conversion loss — output matches what the docs author wrote.
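
The first half of that pipeline, turning a sitemap into a list of .md URLs, can be sketched with the standard library alone. This is an illustration rather than the project's code: `markdown_urls` is a hypothetical name, and skipping the bare site root is a simplification of this sketch, since the source does not say how the root page is requested.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def markdown_urls(sitemap_xml):
    """Return the <url>.md address for every <loc> entry in a sitemap."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        page = (loc.text or "").strip().rstrip("/")
        # Skip empty entries and the bare site root (not handled here).
        if urlparse(page).path:
            urls.append(page + ".md")
    return urls
```

Each returned URL would then be fetched in parallel, bounded by --concurrency, and written to disk under the matching path.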

Detecting Mintlify

A site is likely Mintlify if any of these hold:

  • <meta name="generator" content="Mintlify"> in the HTML
  • Assets served from mintcdn.com
  • An llms.txt or llms-full.txt file exists at the site root
  • Appending .md to a doc URL returns plain markdown (not HTML)

If .md requests return HTML, the site is probably not Mintlify and a different strategy is needed.
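
These signals could be combined into a rough heuristic. The sketch below is an assumption-laden illustration, not part of docmark: `looks_like_mintlify` is a hypothetical function, it trusts the Content-Type header of the .md probe, and it leaves out the llms.txt check because that needs a separate request.

```python
def looks_like_mintlify(page_html, md_content_type=None):
    """Heuristic Mintlify check (hypothetical helper).

    page_html: HTML of any rendered docs page.
    md_content_type: Content-Type header of a <url>.md probe, if one was made.
    """
    if md_content_type is not None:
        # Treat the .md probe as decisive: HTML back means a different platform.
        return "text/html" not in md_content_type.lower()
    html = page_html.lower()
    # Fall back to the generator meta tag and the asset host.
    return 'content="mintlify"' in html or "mintcdn.com" in html
```

A stricter variant could require two signals instead of one, since none of them is conclusive on its own.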

Supported platforms

Platform          Status       Strategy
Mintlify          Implemented  Append .md to each page URL
Docusaurus        Possible     Fetch source .md / .mdx from the docs repo on GitHub
MkDocs            Possible     Same: fetch source from the GitHub repo
GitBook           Possible     GitBook API (with token), or HTML scrape
ReadMe            Possible     ReadMe API (with token), or HTML scrape
Generic / custom  Possible     HTML scrape (markdownify or html2text)

The downloader (src/docmark/downloader.py) is the only piece that knows about a specific platform. Adding a new strategy means writing a small module with a fetch(page_url, client) -> DownloadResult function and wiring it as a --strategy choice in the CLI. Sitemap parsing, filters, file writing, and concurrency stay untouched.
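
To make the shape of such a strategy concrete, here is a minimal sketch. Only the `fetch(page_url, client) -> DownloadResult` signature comes from the description above; the fields of `DownloadResult`, the `client.get` method returning body text, and the `STRATEGIES` registry are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class DownloadResult:
    """Hypothetical result type; the real fields are not documented here."""
    url: str
    content: Optional[str] = None
    error: Optional[str] = None


# A strategy is any callable matching fetch(page_url, client) -> DownloadResult.
Strategy = Callable[[str, object], DownloadResult]


def mintlify_fetch(page_url, client):
    """Request <url>.md; `client.get` is assumed to return the body text."""
    try:
        body = client.get(page_url.rstrip("/") + ".md")
        return DownloadResult(page_url, content=body)
    except Exception as exc:  # report the failure instead of aborting the crawl
        return DownloadResult(page_url, error=str(exc))


# Registry that a --strategy option could select from.
STRATEGIES: Dict[str, Strategy] = {"mintlify": mintlify_fetch}
```

With this layout, supporting another platform would mean adding one more function and one more registry entry, while the rest of the pipeline keeps calling the selected strategy the same way.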

Strategies are added on demand — when a concrete site needs them — not speculatively.

Notes

  • Sitemap-driven. URLs not listed in sitemap.xml are not crawled.
  • Pages are saved as raw MDX. Mintlify components (<Steps>, <Tabs>, <CardGroup>, ...) are preserved verbatim — Claude and other LLMs read them fine.
  • The crawler also makes a best-effort attempt to fetch llms.txt and llms-full.txt from the site root.

License

MIT — see LICENSE.

Source Distribution

docmark-0.1.0.tar.gz (7.7 kB)

Uploaded Source

Built Distribution

docmark-0.1.0-py3-none-any.whl (9.1 kB)

Uploaded Python 3

File details

Details for the file docmark-0.1.0.tar.gz.

File metadata

  • Download URL: docmark-0.1.0.tar.gz
  • Upload date:
  • Size: 7.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docmark-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       345bcb0d06c8981ebd9ae576ab8a5e5fcad6a675f8f682bb66d250d228a24105
MD5          70c3d39a37c19c7652efb35a9acda048
BLAKE2b-256  188207cf0205e4ab84bc02780dec90d29c2f5298116cf9a20cbdab8db6c28882

Provenance

The following attestation bundles were made for docmark-0.1.0.tar.gz:

Publisher: publish.yml on eduardodoege/docmark

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file docmark-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: docmark-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 9.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docmark-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       5753b9bffe1d716851a81adccb5e0d060da66783a154def52c805b1ce7a49648
MD5          467500cebd409bdfbbadecdbc03fe499
BLAKE2b-256  2348fd5ecdd06b5b615525d866947b61d507c2fc8a6e4c70b85e39fbafde6284

Provenance

The following attestation bundles were made for docmark-0.1.0-py3-none-any.whl:

Publisher: publish.yml on eduardodoege/docmark
