A lightweight package that converts virtually any file, plus YouTube links, to formatted Markdown

These details have not been verified by PyPI

Project description

any-to-markdown

any-to-markdown is a lightweight Python package and CLI that converts a broad set of local files and YouTube links into Markdown.

It is designed for documentation pipelines, retrieval-augmented generation (RAG) workflows, and any scenario where you need to normalize diverse data sources into clean, structured text.

Author: Sankalp Joshi
License: MIT

Key Features

Broad File Support: Converts PDF, DOCX, PPTX, XLSX, HTML, Jupyter Notebooks (.ipynb), Images (OCR), Audio/Video (Transcription), and many source code file types.
Command-Line Interface: The any-to-markdown command converts files, directories, and YouTube URLs straight from your shell.
Structured Results: Every input yields a ConversionResult with an explicit success / error / skipped status (typed as a Literal for mypy users), the Markdown content, the output path, and a machine-readable error, so you never have to parse error prose out of Markdown.
Batch Resilience: One failed input never aborts the batch. Each failure emits a warning naming the offending file, followed by a batch summary and, where applicable, a suggested alternative function.
Opt-in Heavy Dependencies: PDF, OCR, audio transcription, and YouTube support are pip extras; the core install stays small.
Real HTML Conversion: .html/.htm files are converted into structured Markdown (headings, lists, links, tables, code blocks), not just fenced as code. Scripts and styles are stripped.
Optional PDF Layout Mode: Layout analysis via pymupdf4llm heuristics (table detection and structure-aware Markdown). The default PDF engine uses PyMuPDF with font-size and bold-flag heuristics, plus an OCR fallback for image-heavy pages.
YouTube Integration: Fetches transcripts directly via the YouTube transcript API, or transcribes locally with Whisper via handle_yt_local (audio-only download, no FFmpeg required).
Configurable Whisper: Pick the transcription model size per call (whisper_model="medium") or globally via the ANY_TO_MARKDOWN_WHISPER_MODEL environment variable.
Honest Concurrency: Small files run in parallel, files over 200MB run sequentially, and Whisper transcription jobs are limited to one at a time by default.
Secure & Private: Sanitizes error messages to prevent leaking system paths and sensitive information; URLs in error messages are preserved for context.
No Overwrites: Collision-resistant, race-free output naming.
Typed: Ships a py.typed marker, so mypy and pyright consume the package's annotations.

Supported Formats

This list matches the code's ALLOWED_EXTENSIONS exactly (derived directly from the handler registry):

Documents: .pdf, .docx, .pptx, .txt, .md
Web Pages: .html, .htm (converted to structured Markdown)
Jupyter Notebooks: .ipynb (extracts Markdown and code cells)
Source Code & Markup: .py, .js, .ts, .cpp, .c, .h, .hpp, .rs, .go, .java, .rb, .php, .sh, .sql, .yaml, .yml, .json, .xml, .css
Data: .xlsx, .xls, .csv
Images (OCR): .png, .jpg, .jpeg, .tiff, .tif, .bmp
Multimedia (Transcription): .mp3, .wav, .m4a, .mp4
Web: YouTube URLs (transcripts)
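As a rough illustration of how the groups above can be used to pre-filter inputs, the sketch below copies the listed extensions into a plain dictionary. FORMAT_GROUPS and category_for are hypothetical names for this example, not part of the package (which keeps its own ALLOWED_EXTENSIONS), and the case-insensitive matching is an assumption.

```python
from pathlib import Path

# Extension groups copied from the "Supported Formats" list above.
# FORMAT_GROUPS and category_for are illustrative names, not package API.
FORMAT_GROUPS = {
    "Documents": {".pdf", ".docx", ".pptx", ".txt", ".md"},
    "Web Pages": {".html", ".htm"},
    "Jupyter Notebooks": {".ipynb"},
    "Source Code & Markup": {
        ".py", ".js", ".ts", ".cpp", ".c", ".h", ".hpp", ".rs", ".go", ".java",
        ".rb", ".php", ".sh", ".sql", ".yaml", ".yml", ".json", ".xml", ".css",
    },
    "Data": {".xlsx", ".xls", ".csv"},
    "Images (OCR)": {".png", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp"},
    "Multimedia (Transcription)": {".mp3", ".wav", ".m4a", ".mp4"},
}


def category_for(path):
    """Return the format group for a local path, or None if it is not listed."""
    # Assumption: extensions are compared case-insensitively.
    suffix = Path(path).suffix.lower()
    for group, extensions in FORMAT_GROUPS.items():
        if suffix in extensions:
            return group
    return None
```

A check like this can be handy for reporting which inputs would be skipped before a batch is started.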

Installation

The core install covers text, code, notebooks, CSV/Excel, DOCX, PPTX, HTML, and the CLI:

pip install any-to-markdown

Heavy capabilities are opt-in extras:

pip install any-to-markdown[pdf]      # PDF conversion (PyMuPDF + pymupdf4llm)
pip install any-to-markdown[ocr]      # Image OCR (pytesseract + Pillow)
pip install any-to-markdown[audio]    # Audio/video transcription (faster-whisper)
pip install any-to-markdown[youtube]  # YouTube transcripts + local download
pip install any-to-markdown[all]      # Everything

If you call a handler without its extra installed, you get a clear MissingDependencyError telling you the exact pip install command to run.

External Dependencies

Tesseract OCR: Required for image OCR and the PDF visual fallback (the PDF engine degrades gracefully with a warning if OCR is unavailable).
FFmpeg: Required only for local video files (.mp4). It is not required for handle_yt_local, which downloads an audio-only stream and feeds it directly to Whisper.

Command-Line Usage

Installing the package provides the any-to-markdown command:

# Convert files and YouTube links (writes to ./raw_data by default)
any-to-markdown report.pdf notes.docx "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Convert a whole directory into a custom output directory
any-to-markdown ./my_docs -o converted/

# PDF layout engine + a bigger Whisper model
any-to-markdown report.pdf lecture.mp3 --layout --whisper-model medium

# Show the version
any-to-markdown --version

Options:

| Option | Meaning |
| --- | --- |
| `-o, --output-dir PATH` | Output directory (defaults to `./raw_data`) |
| `--layout` | Use the `pymupdf4llm` layout engine for PDFs |
| `--max-transcriptions N` | Concurrent Whisper jobs (default 1) |
| `--whisper-model SIZE` | Whisper model size (`tiny`, `small`, `medium`, ...) |
| `--version` | Print the version and exit |

The command prints one line per input and exits with code 1 if any input errored (skipped inputs do not affect the exit code).
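If you drive the CLI from a Python script, a small wrapper along these lines may help. It only uses the options documented in the table above; build_command and run_conversion are hypothetical helper names, and the wrapper assumes the any-to-markdown command is on your PATH.

```python
import subprocess


def build_command(inputs, output_dir=None, layout=False,
                  max_transcriptions=None, whisper_model=None):
    """Assemble an argv list for the any-to-markdown CLI from documented options."""
    cmd = ["any-to-markdown", *inputs]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if layout:
        cmd.append("--layout")
    if max_transcriptions is not None:
        cmd += ["--max-transcriptions", str(max_transcriptions)]
    if whisper_model is not None:
        cmd += ["--whisper-model", whisper_model]
    return cmd


def run_conversion(inputs, **options):
    """Run the CLI; True means exit code 0, i.e. no input errored."""
    completed = subprocess.run(build_command(inputs, **options))
    return completed.returncode == 0
```

Because skipped inputs do not affect the exit code, a True return here does not guarantee that every input produced a file.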

Public API

The package exports the following from any_to_markdown:

get_markdown(inputs, use_layout_engine=False, max_transcriptions=1, output_dir=None, whisper_model=None)
get_markdown_directory(directory_path, use_layout_engine=False, max_transcriptions=1, output_dir=None, whisper_model=None)
handle_yt_local(urls, max_transcriptions=1, output_dir=None, whisper_model=None)
handle_yt_local_async(...) (same signature as handle_yt_local, awaitable)
ConversionResult, ConversionStatus, MissingDependencyError, TranscriptUnavailableError

ConversionResult

Every conversion function returns one ConversionResult per input:

| Field | Meaning |
| --- | --- |
| `input` | The original path or URL |
| `status` | `"success"`, `"error"`, or `"skipped"` (typed as `ConversionStatus`) |
| `ok` | Convenience property, `True` on success |
| `content` | The generated Markdown (on success) |
| `output_path` | `Path` to the written `.md` file (default output mode) |
| `message` | Human-readable success message (when you pass `output_dir`) |
| `error` | Sanitized, machine-readable error description |
| `suggestion` | Suggested alternative, e.g. `handle_yt_local` for failed transcripts |
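To make the shape of a result concrete, here is a stand-in dataclass that mirrors the fields in the table. It is a sketch for illustration only: ResultSketch, StatusSketch, and failures are hypothetical names, the Optional defaults are assumptions, and the real ConversionResult may be defined differently.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Literal, Optional

# Stand-in for the package's ConversionStatus type (assumed shape).
StatusSketch = Literal["success", "error", "skipped"]


@dataclass
class ResultSketch:
    """Illustrative mirror of the documented ConversionResult fields."""
    input: str
    status: StatusSketch
    content: Optional[str] = None
    output_path: Optional[Path] = None
    message: Optional[str] = None
    error: Optional[str] = None
    suggestion: Optional[str] = None

    @property
    def ok(self) -> bool:
        # True only for successful conversions, as described in the table.
        return self.status == "success"


def failures(results):
    """Return the results whose status is "error" (skipped inputs are not failures)."""
    return [r for r in results if r.status == "error"]
```

Filtering on status rather than on ok keeps skipped inputs separate from genuine errors.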

PDF Layout Mode

For PDFs with complex tables, enable the pymupdf4llm-based layout analysis:

# Run inside an async function (or wrap the call with asyncio.run).
results = await get_markdown("input.pdf", use_layout_engine=True)

This mode is heuristic-based (not AI-powered): it relies on pymupdf4llm's rules for detecting tables, headings, and document structure.

Choosing a Whisper Model

Audio and video transcription defaults to the small Faster-Whisper model. Override it per call or via an environment variable:

# Run inside an async function (or wrap the call with asyncio.run).
results = await get_markdown("lecture.mp3", whisper_model="medium")

export ANY_TO_MARKDOWN_WHISPER_MODEL=tiny   # fast, lower accuracy

Model instances are cached per size, so mixing sizes in one process is safe.
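The selection and caching described above can be pictured roughly as follows. This is a sketch, not the package's code: resolve_model_size and get_model are hypothetical names, the loader argument stands in for whatever actually builds a Whisper model, and the precedence (per-call argument first, then the environment variable, then small) is an assumption.

```python
import os

DEFAULT_MODEL = "small"
ENV_VAR = "ANY_TO_MARKDOWN_WHISPER_MODEL"

# One cached instance per model size (hypothetical cache for this sketch).
_model_cache = {}


def resolve_model_size(whisper_model=None):
    """Pick a model size: explicit argument, then environment variable, then default."""
    # Assumed precedence; the source only says both overrides exist.
    if whisper_model:
        return whisper_model
    return os.environ.get(ENV_VAR) or DEFAULT_MODEL


def get_model(size, loader):
    """Return the cached model for this size, calling loader(size) only once."""
    if size not in _model_cache:
        _model_cache[size] = loader(size)
    return _model_cache[size]
```

Keying the cache by size is what lets a process mix, say, tiny and medium without one replacing the other.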

Usage Examples

Convert a list of files or URLs

import asyncio
from any_to_markdown import get_markdown

async def main():
    results = await get_markdown([
        "docs/report.pdf",
        "analysis.ipynb",
        "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    ], use_layout_engine=True)

    for result in results:
        if result.ok:
            print(f"Generated: {result.output_path}")  # pathlib.Path
        else:
            print(f"{result.status}: {result.input} -> {result.error}")
            if result.suggestion:
                print(f"  Try: {result.suggestion}()")

if __name__ == "__main__":
    asyncio.run(main())

Custom output directory

# Run inside an async function (or wrap the call with asyncio.run).
results = await get_markdown("docs/report.pdf", output_dir="converted/")
print(results[0].message)
# Success: 'docs/report.pdf' converted and written to 'converted/report_pdf.md'

Convert a directory recursively

import asyncio
from any_to_markdown import get_markdown_directory

async def main():
    # Always returns a list; empty if the directory has no supported files.
    results = await get_markdown_directory("./my_docs", use_layout_engine=True)
    print(f"{sum(r.ok for r in results)} of {len(results)} files converted.")

if __name__ == "__main__":
    asyncio.run(main())

Transcribe YouTube videos locally

Use handle_yt_local() when a YouTube transcript is unavailable or disabled. It downloads the audio-only stream (bestaudio[ext=m4a]/bestaudio/best) and transcribes it locally with Whisper. No FFmpeg is needed for this path.

from any_to_markdown import handle_yt_local

# In-memory by default; pass output_dir=... to also write youtube_<id>.md files.
results = handle_yt_local("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
if results[0].ok:
    print(results[0].content)  # The Markdown transcription string

From async code, await the async variant directly (URLs are processed concurrently, gated by max_transcriptions):

from any_to_markdown import handle_yt_local_async

# Inside an async function; `urls` holds the YouTube URL(s) to transcribe.
results = await handle_yt_local_async(urls, max_transcriptions=2, output_dir="transcripts/")

Output Behavior

Default: files are written to ./raw_data/ in the current working directory (created only when at least one input succeeds), and each successful result carries the Path object in output_path plus the Markdown in content.
With output_dir: files are written to your directory, and each successful result additionally carries a human-readable message.
Naming: <filename>_<extension>.md for local files, youtube_<video_id>.md for videos. Collisions get a numeric suffix (e.g. report_pdf_1.md) using race-free exclusive-create writes.
Failed or skipped inputs never produce output files and never embed error prose into Markdown.
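The naming and collision rules above can be sketched with exclusive-create writes, which is one common way to avoid races between concurrent writers. base_name and write_unique are hypothetical helpers written for this example; the package's own implementation may differ.

```python
from pathlib import Path


def base_name(source):
    """Turn 'docs/report.pdf' into 'report_pdf' (the <filename>_<extension> pattern)."""
    path = Path(source)
    extension = path.suffix.lstrip(".")
    return f"{path.stem}_{extension}" if extension else path.stem


def write_unique(output_dir, stem, markdown):
    """Write markdown to <stem>.md, adding _1, _2, ... if the name is taken."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    counter = 0
    while True:
        name = f"{stem}.md" if counter == 0 else f"{stem}_{counter}.md"
        target = output_dir / name
        try:
            # Mode "x" fails if the file already exists, so the existence
            # check and the creation happen as a single atomic step.
            with open(target, "x", encoding="utf-8") as handle:
                handle.write(markdown)
            return target
        except FileExistsError:
            counter += 1
```

Checking for the file first and then opening it for writing would leave a window in which another process could create the same name; the exclusive mode closes that window.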

Error Handling

Each failed input produces a result with status="error", a sanitized error string, and (where applicable) a suggestion such as handle_yt_local for unavailable YouTube transcripts.
A UserWarning is emitted per failure naming the exact input, plus a batch summary: how many succeeded, failed, and were skipped.
One bad input never aborts the batch.
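The pattern behind these guarantees is roughly "catch per input, warn, keep going". The sketch below shows it with a stand-in converter; convert_batch is a hypothetical function, the tuple results are a simplification of ConversionResult, and the warning wording is invented for the example.

```python
import warnings


def convert_batch(inputs, convert):
    """Convert each input independently; a failure is recorded, never raised."""
    results = []
    for item in inputs:
        try:
            results.append((item, "success", convert(item)))
        except Exception as exc:
            # One warning per failure, naming the exact input.
            warnings.warn(f"Conversion failed for {item!r}: {exc}",
                          UserWarning, stacklevel=2)
            results.append((item, "error", str(exc)))

    succeeded = sum(1 for _, status, _ in results if status == "success")
    failed = len(results) - succeeded
    if failed:
        # Batch summary, emitted once after all inputs were attempted.
        warnings.warn(f"Batch finished: {succeeded} succeeded, {failed} failed",
                      UserWarning, stacklevel=2)
    return results
```

Callers who want to log these warnings instead of printing them can wrap the call in warnings.catch_warnings(record=True).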

Concurrency Model

Up to 10 small files are processed concurrently.
Files larger than 200MB are processed sequentially to avoid out-of-memory errors.
Audio and video files (.mp3, .wav, .m4a, .mp4) always go through a dedicated transcription semaphore so only one Whisper job runs at a time by default. Tune this with max_transcriptions.
handle_yt_local downloads are capped at 200MB (MAX_DOWNLOAD_SIZE), independently of the concurrency threshold.
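One way to picture this scheduling is with two asyncio semaphores and a sequential tail for large files. This is a minimal sketch under stated assumptions, not the package's scheduler: run_batch, the job tuples, and the worker callback are hypothetical, and 200MB is taken here as 200 * 1024 * 1024 bytes.

```python
import asyncio

LARGE_FILE_BYTES = 200 * 1024 * 1024  # assumed byte value for "200MB"
MEDIA_EXTENSIONS = {".mp3", ".wav", ".m4a", ".mp4"}


async def run_batch(jobs, worker, max_small=10, max_transcriptions=1):
    """Schedule (name, size_bytes, extension) jobs according to the rules above."""
    small_gate = asyncio.Semaphore(max_small)
    transcription_gate = asyncio.Semaphore(max_transcriptions)

    async def gated(job, gate):
        # The semaphore caps how many jobs of this kind run at once.
        async with gate:
            return await worker(job)

    concurrent, sequential = [], []
    for job in jobs:
        _, size, extension = job
        if extension in MEDIA_EXTENSIONS:
            concurrent.append(gated(job, transcription_gate))
        elif size > LARGE_FILE_BYTES:
            sequential.append(job)
        else:
            concurrent.append(gated(job, small_gate))

    results = list(await asyncio.gather(*concurrent))
    # Large files run one at a time, after the concurrent work has finished.
    for job in sequential:
        results.append(await worker(job))
    return results
```

Note that this sketch returns results grouped by scheduling class rather than in input order; a real implementation would likely map them back to the original positions.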

Troubleshooting & Tips

OCR Quality: Depends on your local Tesseract installation and image resolution.
Whisper Performance: On the first run, the selected Whisper model is downloaded and cached locally. CPU performance is optimized using int8 quantization.
Encoding: .txt/.md files are decoded as UTF-8 (BOM-aware) with a lossless Latin-1 fallback, so legacy files never fail the batch.
Privacy: Error messages are sanitized to remove absolute local paths before being surfaced. URLs are kept intact for context.
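The Encoding tip above corresponds to a simple two-step decode. The sketch below shows the idea with the standard codecs; decode_text is a hypothetical helper, not the package's function.

```python
def decode_text(raw: bytes) -> str:
    """Decode as UTF-8 (dropping a BOM if present), else fall back to Latin-1."""
    try:
        # "utf-8-sig" behaves like UTF-8 but strips a leading byte-order mark.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never fails and
        # the original bytes can be recovered by re-encoding as Latin-1.
        return raw.decode("latin-1")
```

The fallback may show the wrong characters for files in other legacy encodings, but it cannot raise, which is what keeps one odd file from failing the batch.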

Releasing (maintainers)

Releases are published to PyPI automatically when a version tag (e.g. v0.3.0) is pushed. Publishing uses PyPI Trusted Publishing, so no API token is stored in CI.

One-time setup on PyPI: under the project's Publishing settings, add a GitLab trusted publisher with namespace sankalp-group, project any-to-markdown, and workflow file .gitlab-ci.yml.

Release steps:

Update the version in pyproject.toml and add a CHANGELOG.md entry.
Merge to master, then tag: git tag v0.x.y && git push origin v0.x.y.
The publish CI job builds with uv build and uploads via uv publish.

See CHANGELOG.md for the release history.

License

MIT License. See LICENSE for full terms.


Release history

0.2.0 (this version): Jun 12, 2026
0.1.4: Jun 11, 2026

