A lightweight package to convert virtually any file and YouTube link to formatted Markdown
any-to-markdown
any-to-markdown is a lightweight Python package and CLI that converts a broad set of local files and YouTube links into Markdown.
It is designed for documentation pipelines, retrieval-augmented generation (RAG) workflows, and any scenario where you need to normalize diverse data sources into clean, structured text.
Author: Sankalp Joshi
License: MIT
Key Features
- Broad File Support: Converts PDF, DOCX, PPTX, XLSX, HTML, Jupyter Notebooks (.ipynb), Images (OCR), Audio/Video (Transcription), and many source code file types.
- Command-Line Interface: The any-to-markdown command converts files, directories, and YouTube URLs straight from your shell.
- Structured Results: Every input yields a ConversionResult with an explicit success/error/skipped status (typed as a Literal for mypy users), the Markdown content, the output path, and a machine-readable error. No more parsing error prose out of Markdown.
- Batch Resilience: One failed input never aborts the batch. Failures emit warnings naming the offending file, with a batch summary and a suggested alternative function.
- Opt-in Heavy Dependencies: PDF, OCR, audio transcription, and YouTube support are pip extras; the core install stays small.
- Real HTML Conversion: .html/.htm files are converted into structured Markdown (headings, lists, links, tables, code blocks), not just fenced as code. Scripts and styles are stripped.
- Optional PDF Layout Mode: Layout analysis via pymupdf4llm heuristics (table detection and structure-aware Markdown). The default PDF engine uses PyMuPDF with font-size and bold-flag heuristics, plus an OCR fallback for image-heavy pages.
- YouTube Integration: Fetches transcripts directly via the YouTube transcript API, or transcribes locally with Whisper via handle_yt_local (audio-only download, no FFmpeg required).
- Configurable Whisper: Pick the transcription model size per call (whisper_model="medium") or globally via the ANY_TO_MARKDOWN_WHISPER_MODEL environment variable.
- Honest Concurrency: Small files run in parallel, files over 200MB run sequentially, and Whisper transcription jobs are limited to one at a time by default.
- Secure & Private: Sanitizes error messages to prevent leaking system paths and sensitive information; URLs in error messages are preserved for context.
- No Overwrites: Collision-resistant, race-free output naming.
- Typed: Ships a py.typed marker, so mypy and pyright consume the package's annotations.
Supported Formats
This list matches the code's ALLOWED_EXTENSIONS exactly (derived directly from the handler registry):
- Documents: .pdf, .docx, .pptx, .txt, .md
- Web Pages: .html, .htm (converted to structured Markdown)
- Jupyter Notebooks: .ipynb (extracts Markdown and code cells)
- Source Code & Markup: .py, .js, .ts, .cpp, .c, .h, .hpp, .rs, .go, .java, .rb, .php, .sh, .sql, .yaml, .yml, .json, .xml, .css
- Data: .xlsx, .xls, .csv
- Images (OCR): .png, .jpg, .jpeg, .tiff, .tif, .bmp
- Multimedia (Transcription): .mp3, .wav, .m4a, .mp4
- Web: YouTube URLs (transcripts)
Installation
The core install covers text, code, notebooks, CSV/Excel, DOCX, PPTX, HTML, and the CLI:
pip install any-to-markdown
Heavy capabilities are opt-in extras:
pip install any-to-markdown[pdf] # PDF conversion (PyMuPDF + pymupdf4llm)
pip install any-to-markdown[ocr] # Image OCR (pytesseract + Pillow)
pip install any-to-markdown[audio] # Audio/video transcription (faster-whisper)
pip install any-to-markdown[youtube] # YouTube transcripts + local download
pip install any-to-markdown[all] # Everything
If you call a handler without its extra installed, you get a clear MissingDependencyError telling you the exact pip install command to run.
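An error message of this kind could be produced by a small import guard. The following is a minimal sketch, not the package's implementation: the MissingDependencyError class here is a local stand-in that only borrows the name, and require(), its parameters, and the message wording are assumptions made for illustration.

```python
import importlib


class MissingDependencyError(ImportError):
    """Local stand-in for illustration; not imported from any_to_markdown."""


def require(module_name: str, extra: str, package: str = "any-to-markdown"):
    """Import module_name, or raise an error naming the pip extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Hypothetical wording: point the user at the matching extra.
        raise MissingDependencyError(
            f"'{module_name}' is not installed. Run: pip install {package}[{extra}]"
        ) from exc
```

A handler written this way fails only when it is actually called, which is what keeps the core install small.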
External Dependencies
- Tesseract OCR: Required for image OCR and the PDF visual fallback (the PDF engine degrades gracefully with a warning if OCR is unavailable).
- FFmpeg: Required only for local video files (.mp4). It is not required for handle_yt_local, which downloads an audio-only stream and feeds it directly to Whisper.
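Before running a batch, it can help to check whether these tools are reachable. This is a minimal sketch that assumes the executables are installed under their usual names (tesseract and ffmpeg) and are on PATH; find_external_tools is a hypothetical helper, not part of the package.

```python
import shutil


def find_external_tools(names=("tesseract", "ffmpeg")):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_external_tools().items():
        print(f"{tool}: {path or 'not found'}")
```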
Command-Line Usage
Installing the package provides the any-to-markdown command:
# Convert files and YouTube links (writes to ./raw_data by default)
any-to-markdown report.pdf notes.docx "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
# Convert a whole directory into a custom output directory
any-to-markdown ./my_docs -o converted/
# PDF layout engine + a bigger Whisper model
any-to-markdown report.pdf lecture.mp3 --layout --whisper-model medium
# Show the version
any-to-markdown --version
Options:
| Option | Meaning |
|---|---|
| -o, --output-dir PATH | Output directory (defaults to ./raw_data) |
| --layout | Use the pymupdf4llm layout engine for PDFs |
| --max-transcriptions N | Concurrent Whisper jobs (default 1) |
| --whisper-model SIZE | Whisper model size (tiny, small, medium, ...) |
| --version | Print the version and exit |
The command prints one line per input and exits with code 1 if any input errored (skipped inputs do not affect the exit code).
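The exit-code rule can be stated as a one-line function over result statuses. This is an illustrative sketch of the rule as described, with exit_code as a hypothetical name; it is not taken from the CLI's source.

```python
def exit_code(statuses):
    """Return 1 if any status is "error", otherwise 0; "skipped" is ignored."""
    return 1 if any(status == "error" for status in statuses) else 0
```

In a CI pipeline this means a batch with only skipped inputs still passes, while a single errored input fails the step.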
Public API
The package exports the following from any_to_markdown:
- get_markdown(inputs, use_layout_engine=False, max_transcriptions=1, output_dir=None, whisper_model=None)
- get_markdown_directory(directory_path, use_layout_engine=False, max_transcriptions=1, output_dir=None, whisper_model=None)
- handle_yt_local(urls, max_transcriptions=1, output_dir=None, whisper_model=None)
- handle_yt_local_async(...) (same signature, awaitable)
- ConversionResult, ConversionStatus, MissingDependencyError, TranscriptUnavailableError
ConversionResult
Every conversion function returns one ConversionResult per input:
| Field | Meaning |
|---|---|
| input | The original path or URL |
| status | "success", "error", or "skipped" (typed as ConversionStatus) |
| ok | Convenience property, True on success |
| content | The generated Markdown (on success) |
| output_path | Path to the written .md file (default output mode) |
| message | Human-readable success message (when you pass output_dir) |
| error | Sanitized, machine-readable error description |
| suggestion | Suggested alternative, e.g. handle_yt_local for failed transcripts |
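To make the shape of a result concrete, the fields above can be mirrored in a small dataclass. This is a stand-in for illustration only: the field names and the ok property come from the table, while the field types, the defaults, and the partition helper are assumptions rather than the package's definitions.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Literal, Optional

ConversionStatus = Literal["success", "error", "skipped"]


@dataclass
class ConversionResult:
    """Illustrative stand-in mirroring the documented fields."""

    input: str
    status: ConversionStatus
    content: Optional[str] = None
    output_path: Optional[Path] = None
    message: Optional[str] = None
    error: Optional[str] = None
    suggestion: Optional[str] = None

    @property
    def ok(self) -> bool:
        # True only for successful conversions.
        return self.status == "success"


def partition(results):
    """Split results into (succeeded, failed, skipped) lists by status."""
    succeeded = [r for r in results if r.ok]
    failed = [r for r in results if r.status == "error"]
    skipped = [r for r in results if r.status == "skipped"]
    return succeeded, failed, skipped
```

Branching on status or ok, rather than inspecting content, is the point of the structured result.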
PDF Layout Mode
For PDFs with complex tables, enable the pymupdf4llm-based layout analysis:
results = await get_markdown("input.pdf", use_layout_engine=True)
This mode is heuristic-based (not AI-powered): it relies on pymupdf4llm's rules for detecting tables, headings, and document structure.
Choosing a Whisper Model
Audio and video transcription defaults to the small Faster-Whisper model. Override it per call or via an environment variable:
results = await get_markdown("lecture.mp3", whisper_model="medium")
export ANY_TO_MARKDOWN_WHISPER_MODEL=tiny # fast, lower accuracy
Model instances are cached per size, so mixing sizes in one process is safe.
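One plausible way to combine the two overrides with a per-size cache is sketched below. The precedence (per-call argument, then environment variable, then the small default) is an assumption, and load_model is a placeholder that returns a plain dict instead of loading a real Faster-Whisper model.

```python
import os
from functools import lru_cache

DEFAULT_MODEL = "small"
ENV_VAR = "ANY_TO_MARKDOWN_WHISPER_MODEL"


def resolve_model_size(whisper_model=None):
    """Per-call argument first, then the environment variable, then the default."""
    return whisper_model or os.environ.get(ENV_VAR) or DEFAULT_MODEL


@lru_cache(maxsize=None)
def load_model(size):
    """Placeholder loader: returns one distinct object per size, cached."""
    return {"size": size}
```

Because the cache is keyed by size, asking for tiny and medium in the same process yields two independent instances, and repeated requests for the same size reuse one.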
Usage Examples
Convert a list of files or URLs
import asyncio
from any_to_markdown import get_markdown

async def main():
    results = await get_markdown([
        "docs/report.pdf",
        "analysis.ipynb",
        "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    ], use_layout_engine=True)
    for result in results:
        if result.ok:
            print(f"Generated: {result.output_path}")  # pathlib.Path
        else:
            print(f"{result.status}: {result.input} -> {result.error}")
            if result.suggestion:
                print(f"  Try: {result.suggestion}()")

if __name__ == "__main__":
    asyncio.run(main())
Custom output directory
results = await get_markdown("docs/report.pdf", output_dir="converted/")
print(results[0].message)
# Success: 'docs/report.pdf' converted and written to 'converted/report_pdf.md'
Convert a directory recursively
import asyncio
from any_to_markdown import get_markdown_directory

async def main():
    # Always returns a list; empty if the directory has no supported files.
    results = await get_markdown_directory("./my_docs", use_layout_engine=True)
    print(f"{sum(r.ok for r in results)} of {len(results)} files converted.")

if __name__ == "__main__":
    asyncio.run(main())
Transcribe YouTube videos locally
Use handle_yt_local() when a YouTube transcript is unavailable or disabled. It downloads the audio-only stream (bestaudio[ext=m4a]/bestaudio/best) and transcribes it locally with Whisper. No FFmpeg is needed for this path.
from any_to_markdown import handle_yt_local
# In-memory by default; pass output_dir=... to also write youtube_<id>.md files.
results = handle_yt_local("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
if results[0].ok:
    print(results[0].content)  # The Markdown transcription string
From async code, await the async variant directly (URLs are processed concurrently, gated by max_transcriptions):
from any_to_markdown import handle_yt_local_async
results = await handle_yt_local_async(urls, max_transcriptions=2, output_dir="transcripts/")
Output Behavior
- Default: files are written to ./raw_data/ in the current working directory (created only when at least one input succeeds), and each successful result carries the Path object in output_path plus the Markdown in content.
- With output_dir: files are written to your directory, and each successful result additionally carries a human-readable message.
- Naming: <filename>_<extension>.md for local files, youtube_<video_id>.md for videos. Collisions get a numeric suffix (e.g. report_pdf_1.md) using race-free exclusive-create writes.
- Failed or skipped inputs never produce output files and never embed error prose into Markdown.
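The naming and collision rules can be sketched with Python's exclusive-create file mode. This is a minimal illustration under assumptions: write_unique is a hypothetical helper, and the package's actual suffix scheme may differ beyond the report_pdf_1.md example given above.

```python
from pathlib import Path


def write_unique(output_dir, source_name, markdown):
    """Write markdown to <stem>_<ext>.md, adding _1, _2, ... on collision."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    src = Path(source_name)
    base = f"{src.stem}_{src.suffix.lstrip('.')}" if src.suffix else src.stem
    counter = 0
    while True:
        name = f"{base}.md" if counter == 0 else f"{base}_{counter}.md"
        try:
            # Mode "x" fails if the file exists, so check-and-create is atomic.
            with open(out / name, "x", encoding="utf-8") as fh:
                fh.write(markdown)
            return out / name
        except FileExistsError:
            counter += 1
```

Opening with mode "x" avoids the gap between checking that a name is free and creating the file, which is where two concurrent writers could otherwise overwrite each other.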
Error Handling
- Each failed input produces a result with status="error", a sanitized error string, and (where applicable) a suggestion such as handle_yt_local for unavailable YouTube transcripts.
- A UserWarning is emitted per failure naming the exact input, plus a batch summary: how many succeeded, failed, and were skipped.
- One bad input never aborts the batch.
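The pattern behind these guarantees is to catch failures per input and report them through the warnings module. The sketch below shows the general technique with a hypothetical convert_batch helper and a caller-supplied convert function; it returns plain tuples instead of ConversionResult objects, its warning text is invented for illustration, and its summary omits the skipped count.

```python
import warnings


def convert_batch(inputs, convert):
    """Run convert on each input; failures become warnings, not exceptions."""
    results = []
    for item in inputs:
        try:
            results.append((item, "success", convert(item)))
        except Exception as exc:  # keep going: one bad input must not stop the rest
            warnings.warn(f"Failed to convert {item!r}: {exc}", UserWarning)
            results.append((item, "error", str(exc)))
    failed = sum(1 for _, status, _ in results if status == "error")
    if failed:
        warnings.warn(
            f"Batch summary: {len(results) - failed} succeeded, {failed} failed",
            UserWarning,
        )
    return results
```

Callers who want silence or hard failures can use the standard warnings filters to ignore these UserWarnings or turn them into exceptions.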
Concurrency Model
- Up to 10 small files are processed concurrently.
- Files larger than 200MB are processed sequentially to avoid out-of-memory errors.
- Audio and video files (.mp3, .wav, .m4a, .mp4) always go through a dedicated transcription semaphore so only one Whisper job runs at a time by default. Tune this with max_transcriptions.
- handle_yt_local downloads are capped at 200MB (MAX_DOWNLOAD_SIZE), independently of the concurrency threshold.
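The scheduling rules above can be approximated with two asyncio semaphores. This is a sketch under stated assumptions, not the package's scheduler: run_batch and its parameters are hypothetical, 200MB is taken to mean 200 * 1024 * 1024 bytes, media files are routed to the transcription gate regardless of size, and large files are run after the concurrent group.

```python
import asyncio

LARGE_FILE_BYTES = 200 * 1024 * 1024  # assumed meaning of "200MB"
MEDIA_EXTS = {".mp3", ".wav", ".m4a", ".mp4"}


async def run_batch(jobs, work, max_small=10, max_transcriptions=1):
    """jobs: list of (name, size_bytes). work: async callable taking a name."""
    small_gate = asyncio.Semaphore(max_small)
    media_gate = asyncio.Semaphore(max_transcriptions)

    async def gated(name, gate):
        async with gate:
            return await work(name)

    def is_media(name):
        return any(name.lower().endswith(ext) for ext in MEDIA_EXTS)

    concurrent, large = [], []
    for name, size in jobs:
        if is_media(name):
            concurrent.append(gated(name, media_gate))
        elif size > LARGE_FILE_BYTES:
            large.append(name)
        else:
            concurrent.append(gated(name, small_gate))

    results = list(await asyncio.gather(*concurrent))
    for name in large:  # one at a time, after the concurrent group
        results.append(await work(name))
    return results
```

Raising max_transcriptions widens only the media gate, so transcription throughput can be tuned without changing how ordinary files are scheduled.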
Troubleshooting & Tips
- OCR Quality: Depends on your local Tesseract installation and image resolution.
- Whisper Performance: On the first run, the selected Whisper model is downloaded (cached locally). CPU performance is optimized using int8 quantization.
- Encoding: .txt/.md files are decoded as UTF-8 (BOM-aware) with a lossless Latin-1 fallback, so legacy files never fail the batch.
- Privacy: Error messages are sanitized to remove absolute local paths before being surfaced. URLs are kept intact for context.
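The encoding behavior described above corresponds to a common two-step decode. The sketch below shows that general technique with a hypothetical decode_text helper; whether the package uses exactly these codec names is an assumption.

```python
def decode_text(raw: bytes) -> str:
    """Decode as UTF-8 (dropping a BOM if present), else fall back to Latin-1."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")
```

The fallback is lossless in the sense that every byte survives as a character, although text in another legacy encoding may still display incorrectly.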
Releasing (maintainers)
Releases are published to PyPI automatically when a version tag (e.g. v0.3.0) is pushed, using PyPI Trusted Publishing - no API token is stored in CI.
One-time setup on PyPI: under the project's Publishing settings, add a GitLab trusted publisher with namespace sankalp-group, project any-to-markdown, and workflow file .gitlab-ci.yml.
Release steps:
- Update the version in pyproject.toml and add a CHANGELOG.md entry.
- Merge to master, then tag: git tag v0.x.y && git push origin v0.x.y.
- The publish CI job builds with uv build and uploads via uv publish.
See CHANGELOG.md for the release history.
License
MIT License. See LICENSE for full terms.