Crawl API documentation (OpenAPI, Swagger, ReadMe, Mintlify, Fern, llms.txt, plain HTML) into structured, searchable markdown
ApiCrawl
Crawl API documentation into structured, searchable markdown.
Point it at any API docs URL — an OpenAPI/Swagger spec, a Swagger UI / Redoc /
Stoplight / Scalar page, ReadMe / Mintlify / Fern hosted docs, a Postman
collection, a Google Discovery document, an llms.txt index, or plain HTML
docs — and it discovers the underlying spec where one exists, crawls the
pages where one doesn't, classifies the content with an LLM, extracts
authentication instructions, and writes everything as a local markdown tree
you can grep, read, or feed to any tool.
pip install apicrawl
playwright install chromium # used to render JS-heavy docs sites
export GOOGLE_API_KEY=... # LLM access is required (Gemini primary)
export GROQ_API_KEY=... # optional fallback provider
apicrawl https://petstore3.swagger.io --output ./api-docs
Output layout:
api-docs/<catalog_id>/
  index.md              # API name, metadata, description, auth instructions
  manifest.json         # listing of everything ingested (also the completion marker)
  sections/<slug>.md    # docs pages / spec tag groups (markdown + frontmatter)
  endpoints/<slug>.md   # one file per endpoint: parameters, examples, TypeScript types
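The layout above can be walked with nothing but the standard library. The following is a minimal sketch, not part of ApiCrawl itself: the function name `list_catalog` is hypothetical, and it relies only on the directory names shown above (it does not assume anything about the contents of `manifest.json`, only that its presence marks a finished run).

```python
from pathlib import Path


def list_catalog(root: str) -> dict[str, dict[str, list[str]]]:
    """Map each completed catalog under `root` to its section and endpoint slugs."""
    catalogs: dict[str, dict[str, list[str]]] = {}
    for catalog_dir in sorted(Path(root).iterdir()):
        # manifest.json doubles as the completion marker, so skip partial runs.
        if not (catalog_dir / "manifest.json").is_file():
            continue
        slugs: dict[str, list[str]] = {}
        for kind in ("sections", "endpoints"):
            kind_dir = catalog_dir / kind
            # A catalog may lack one of the two folders; treat that as empty.
            files = kind_dir.glob("*.md") if kind_dir.is_dir() else []
            slugs[kind] = sorted(p.stem for p in files)
        catalogs[catalog_dir.name] = slugs
    return catalogs
```

Calling `list_catalog("./api-docs")` after a crawl would return one entry per `<catalog_id>` directory that contains a manifest.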
Library usage:
import asyncio
from apicrawl import ingest_to_dir
result = asyncio.run(ingest_to_dir("https://docs.example.com/api", "./api-docs"))
print(result.entry.name, result.pages_ingested)
Custom storage — implement IngestionSink and receive the parsed catalog
entry, sections, and endpoints as plain pydantic models, streamed in batches:
import asyncio

from apicrawl import IngestionSink, ingest

class MySink(IngestionSink):
    # Each method is called with one batch of parsed pydantic models.
    async def emit_sections(self, sections): ...
    async def emit_endpoints(self, endpoints): ...

asyncio.run(ingest("https://docs.example.com/api", MySink()))
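To illustrate what "streamed in batches" implies for a sink, here is a hedged sketch of a collecting sink. It is deliberately a plain class so it runs without the package installed; in real use it would subclass `IngestionSink`. Only the two method names come from the snippet above. `CollectingSink`, `demo`, and the dict-shaped batches are hypothetical stand-ins (the library passes pydantic models), and the real base class may define further hooks, such as one for the catalog entry, that are not shown here.

```python
import asyncio


class CollectingSink:
    """Stand-in sink that accumulates every batch it receives in memory."""

    def __init__(self) -> None:
        self.sections: list = []
        self.endpoints: list = []

    async def emit_sections(self, sections) -> None:
        # Called once per batch, so extend rather than overwrite.
        self.sections.extend(sections)

    async def emit_endpoints(self, endpoints) -> None:
        self.endpoints.extend(endpoints)


async def demo() -> CollectingSink:
    sink = CollectingSink()
    # Hypothetical batches standing in for what an ingestion run would emit.
    await sink.emit_sections([{"slug": "auth"}])
    await sink.emit_sections([{"slug": "pets"}])
    await sink.emit_endpoints([{"slug": "get-pet"}])
    return sink


sink = asyncio.run(demo())
print(len(sink.sections), len(sink.endpoints))  # prints "2 1"
```

The point of the design is that a sink should never assume a single call per kind: state has to accumulate across calls, or be flushed to storage on each one.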
Notes:
- LLM keys are required: page classification and auth extraction are LLM-powered. Set `GOOGLE_API_KEY` (and optionally `GROQ_API_KEY`) in the environment or a `.env` file.
- Node.js is optional: if a `node` binary is on PATH, endpoint pages include generated TypeScript request/response types (via a bundled openapi-typescript). Without Node, ingestion still works; the TS sections are simply omitted.
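Both notes can be checked before starting a crawl. The sketch below is an assumption-laden helper, not something ApiCrawl ships: `preflight` is a hypothetical name, and it only inspects the process environment, so a key that lives solely in a `.env` file will be reported as unset here even though the tool may still find it.

```python
import os
import shutil


def preflight(env=None, which=shutil.which) -> list[str]:
    """Return human-readable notes about missing keys or tools."""
    env = os.environ if env is None else env
    notes = []
    if not env.get("GOOGLE_API_KEY"):
        notes.append("GOOGLE_API_KEY is not set in the environment (check your .env file)")
    if not env.get("GROQ_API_KEY"):
        notes.append("GROQ_API_KEY is not set; no fallback provider is configured")
    # shutil.which returns None when the binary is not on PATH.
    if which("node") is None:
        notes.append("node was not found on PATH; TypeScript types will be omitted")
    return notes


for note in preflight():
    print(note)
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.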
License: Apache-2.0
File details
Details for the file apicrawl-0.1.0.tar.gz.
File metadata
- Download URL: apicrawl-0.1.0.tar.gz
- Upload date:
- Size: 2.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.1 CPython/3.13.3 Darwin/24.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ba91ddb6ca4f351019ea489af9a1ce343bda2e7918c99452f149eec21e5d9a8 |
| MD5 | df3f01070b6c29b98364d33d99404312 |
| BLAKE2b-256 | ba3d2ecb929da1b02ed9fea1ff467294a43f449446884365792ea43edcc74877 |
File details
Details for the file apicrawl-0.1.0-py3-none-any.whl.
File metadata
- Download URL: apicrawl-0.1.0-py3-none-any.whl
- Upload date:
- Size: 2.4 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.1 CPython/3.13.3 Darwin/24.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7526409910e5fab5c07a26f536f4b3a314496a64a39aa662500eef63fbf3ce97 |
| MD5 | 8f65a8da94702f03bef10faa1294d93c |
| BLAKE2b-256 | 56bd4c765bebf2209c54547737e2a2a1295f898680b1c3617bc50e8f4e6d7f4b |
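The published digests can be used to verify a manually downloaded file. This is a minimal sketch using the standard `hashlib` module; the helper name `sha256_matches` is hypothetical, and the local path you pass in depends on where you saved the download.

```python
import hashlib


def sha256_matches(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest against a published hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

For example, `sha256_matches("apicrawl-0.1.0-py3-none-any.whl", "7526409910e5fab5c07a26f536f4b3a314496a64a39aa662500eef63fbf3ce97")` should return `True` for an intact copy of the wheel listed above.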