Crawl API documentation (OpenAPI, Swagger, ReadMe, Mintlify, Fern, llms.txt, plain HTML) into structured, searchable markdown

ApiCrawl

Crawl API documentation into structured, searchable markdown.

Point it at any API docs URL — an OpenAPI/Swagger spec, a Swagger UI / Redoc / Stoplight / Scalar page, ReadMe / Mintlify / Fern hosted docs, a Postman collection, a Google Discovery document, an llms.txt index, or plain HTML docs — and it discovers the underlying spec where one exists, crawls the pages where one doesn't, classifies the content with an LLM, extracts authentication instructions, and writes everything as a local markdown tree you can grep, read, or feed to any tool.

pip install apicrawl
playwright install chromium   # used to render JS-heavy docs sites

export GOOGLE_API_KEY=...     # LLM access is required (Gemini primary)
export GROQ_API_KEY=...       # optional fallback provider

apicrawl https://petstore3.swagger.io --output ./api-docs

Output layout:

api-docs/<catalog_id>/
  index.md            # API name, metadata, description, auth instructions
  manifest.json       # listing of everything ingested (also the completion marker)
  sections/<slug>.md  # docs pages / spec tag groups (markdown + frontmatter)
  endpoints/<slug>.md # one file per endpoint: parameters, examples, TypeScript types
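
Because the output is plain markdown on disk, it can be searched with a few lines of standard-library Python. The following is a minimal sketch that relies only on the layout shown above; `search_catalog` is a hypothetical helper written for this illustration and is not part of apicrawl.

```python
from pathlib import Path


def search_catalog(catalog_dir, term):
    """Return (relative path, line number, line) for each case-insensitive match.

    Hypothetical helper: scans index.md plus every markdown file under
    sections/ and endpoints/ in one catalog directory.
    """
    root = Path(catalog_dir)
    needle = term.lower()
    hits = []
    for md_file in sorted(root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                rel = md_file.relative_to(root).as_posix()
                hits.append((rel, lineno, line.strip()))
    return hits
```

Calling `search_catalog("./api-docs/<catalog_id>", "oauth")` with a real catalog directory would list every line that mentions OAuth, along with the file it came from.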

Library usage:

import asyncio
from apicrawl import ingest_to_dir

result = asyncio.run(ingest_to_dir("https://docs.example.com/api", "./api-docs"))
print(result.entry.name, result.pages_ingested)

Custom storage — implement IngestionSink and receive the parsed catalog entry, sections, and endpoints as plain pydantic models, streamed in batches:

import asyncio
from apicrawl import IngestionSink, ingest

class MySink(IngestionSink):
    async def emit_sections(self, sections): ...
    async def emit_endpoints(self, endpoints): ...

asyncio.run(ingest("https://docs.example.com/api", MySink()))
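
As a slightly fuller illustration of the sink idea, here is a sketch of a sink that accumulates the streamed batches in memory. Only `emit_sections` and `emit_endpoints` come from the example above. The class name `CollectingSink` is hypothetical, and the real `IngestionSink` base class may require further methods (for instance, one that receives the catalog entry) which this sketch does not cover. The `try`/`except` stand-in exists only so the snippet runs without the package installed.

```python
import asyncio

try:
    from apicrawl import IngestionSink
except ImportError:
    # Stand-in base class (assumption) so the sketch runs without apicrawl.
    class IngestionSink:
        pass


class CollectingSink(IngestionSink):
    """Hypothetical sink that keeps every streamed batch in memory."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self.endpoints = []

    async def emit_sections(self, sections):
        # Each call delivers one batch; append it to the running list.
        self.sections.extend(sections)

    async def emit_endpoints(self, endpoints):
        self.endpoints.extend(endpoints)
```

After `asyncio.run(ingest(url, sink))` returns, `sink.sections` and `sink.endpoints` would hold everything that was emitted, which can be handy in tests or before writing to a database in a single transaction.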

Notes:

  • LLM keys are required — page classification and auth extraction are LLM-powered. Set GOOGLE_API_KEY (and optionally GROQ_API_KEY) in the environment or a .env file.
  • Node.js is optional — if a node binary is on PATH, endpoint pages include generated TypeScript request/response types (via a bundled openapi-typescript). Without Node, ingestion still works; the TS sections are simply omitted.
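
The two notes above can be turned into a quick pre-run check. This is a sketch only: `preflight` is a hypothetical helper, not an apicrawl function, and it inspects only the process environment, so keys supplied through a .env file will be reported as missing.

```python
import os
import shutil


def preflight(env=None):
    """Return human-readable warnings about the prerequisites.

    Hypothetical helper. `env` defaults to os.environ; the node lookup
    always uses the real PATH of the current process.
    """
    env = os.environ if env is None else env
    warnings = []
    if not env.get("GOOGLE_API_KEY"):
        warnings.append("GOOGLE_API_KEY is not set (a .env file is not checked here)")
    if not env.get("GROQ_API_KEY"):
        warnings.append("GROQ_API_KEY is not set; no fallback provider is configured")
    if shutil.which("node") is None:
        warnings.append("node was not found on PATH; TypeScript types will be omitted")
    return warnings
```

Printing the result of `preflight()` before a long crawl makes it clear up front whether the run can proceed and whether the endpoint pages will include TypeScript types.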

License: Apache-2.0

Download files

Source Distribution

apicrawl-0.1.0.tar.gz (2.3 MB)

Uploaded Source

Built Distribution

apicrawl-0.1.0-py3-none-any.whl (2.4 MB)

Uploaded Python 3

File details

Details for the file apicrawl-0.1.0.tar.gz.

File metadata

  • Download URL: apicrawl-0.1.0.tar.gz
  • Size: 2.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.13.3 Darwin/24.6.0

File hashes

Hashes for apicrawl-0.1.0.tar.gz
  • SHA256: 5ba91ddb6ca4f351019ea489af9a1ce343bda2e7918c99452f149eec21e5d9a8
  • MD5: df3f01070b6c29b98364d33d99404312
  • BLAKE2b-256: ba3d2ecb929da1b02ed9fea1ff467294a43f449446884365792ea43edcc74877

File details

Details for the file apicrawl-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: apicrawl-0.1.0-py3-none-any.whl
  • Size: 2.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.13.3 Darwin/24.6.0

File hashes

Hashes for apicrawl-0.1.0-py3-none-any.whl
  • SHA256: 7526409910e5fab5c07a26f536f4b3a314496a64a39aa662500eef63fbf3ce97
  • MD5: 8f65a8da94702f03bef10faa1294d93c
  • BLAKE2b-256: 56bd4c765bebf2209c54547737e2a2a1295f898680b1c3617bc50e8f4e6d7f4b