A lightweight package that converts a wide range of files and YouTube links to formatted Markdown

any-to-markdown

any-to-markdown is a lightweight Python package that converts a broad set of local files and YouTube links into Markdown.

It is designed for documentation pipelines, retrieval-augmented generation (RAG) workflows, and any scenario where you need to normalize diverse data sources into clean, structured text.

Author: Sankalp Joshi
License: MIT

Key Features

Broad File Support: Converts PDF, DOCX, PPTX, XLSX, Jupyter Notebooks (.ipynb), Images (OCR), Audio/Video (Transcription), and virtually any source code file.
Advanced PDF Engine: Built-in AI-powered layout analysis to remove "noise" (headers, footers, page numbers) and accurately extract tables.
YouTube Integration: Fetches transcripts directly via API or transcribes video locally using Whisper.
Smart Concurrency: Manages resource usage automatically by processing large files (over 200 MB) sequentially and smaller files in parallel.
Secure & Private: Sanitizes error messages to prevent leaking system paths and sensitive information.
No Overwrites: Saves results to a raw_data/ directory with collision-resistant naming.
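The "Smart Concurrency" behavior can be pictured with a minimal sketch. This is not the package's internal scheduler: `split_by_size`, `convert_all`, and the `convert` callback are hypothetical names, and the threshold simply mirrors the 200 MB figure from the Troubleshooting section.

```python
import asyncio
import os

# Assumed threshold, taken from the 200 MB figure in Troubleshooting & Tips.
LARGE_FILE_BYTES = 200 * 1024 * 1024


def split_by_size(paths, threshold=LARGE_FILE_BYTES):
    """Partition paths into (small, large) by their on-disk size."""
    small, large = [], []
    for path in paths:
        (large if os.path.getsize(path) > threshold else small).append(path)
    return small, large


async def convert_all(paths, convert, threshold=LARGE_FILE_BYTES):
    """Run the hypothetical async `convert` callback on every path.

    Small files are converted concurrently; large files one at a time.
    Results are returned with small files first, then large files.
    """
    small, large = split_by_size(paths, threshold)
    results = list(await asyncio.gather(*(convert(p) for p in small)))
    for path in large:
        results.append(await convert(path))
    return results
```

Running the large files one by one bounds peak memory to roughly one large file at a time, at the cost of throughput.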

Supported Formats

Documents: .pdf, .docx, .pptx, .txt, .md
Jupyter Notebooks: .ipynb (Extracts Markdown and Code cells)
Source Code: .py, .js, .ts, .cpp, .c, .rs, .go, .java, .rb, .php, .sh, .sql, .html, .css, .yaml, .json, .xml, etc.
Data: .xlsx, .xls, .csv
Images: .png, .jpg, .jpeg, .tiff, .bmp (via OCR)
Multimedia: .mp3, .mp4, .m4a, .wav (via Transcription)
Web: YouTube URLs (Transcripts)
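To preview how a given input would be categorized under the list above, a small lookup can help. This is an illustrative sketch only: `CATEGORIES`, `YOUTUBE_HOSTS`, and `classify` are hypothetical names, the host list is an assumption, and the package's own dispatch logic may differ.

```python
from pathlib import Path
from urllib.parse import urlparse

# Illustrative mapping copied from the Supported Formats list;
# not the package's internal table.
CATEGORIES = {
    "document": {".pdf", ".docx", ".pptx", ".txt", ".md"},
    "notebook": {".ipynb"},
    "data": {".xlsx", ".xls", ".csv"},
    "image": {".png", ".jpg", ".jpeg", ".tiff", ".bmp"},
    "multimedia": {".mp3", ".mp4", ".m4a", ".wav"},
}

# Assumed set of YouTube host names.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be"}


def classify(source: str) -> str:
    """Return a category label for a local path or a URL."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        return "youtube" if parsed.hostname in YOUTUBE_HOSTS else "unknown"
    suffix = Path(source).suffix.lower()
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category
    # Everything else is treated here as source code or plain text.
    return "source_or_other"
```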

Installation

pip install any-to-markdown

External Dependencies

For full functionality, ensure the following are installed on your system:

FFmpeg: Required for audio/video processing and local YouTube transcription.
Tesseract OCR: Required for image OCR and PDF visual fallback.
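Before running a conversion, it can be useful to confirm that both tools are on the PATH. The sketch below uses only the standard library; it assumes the executables are named `ffmpeg` and `tesseract`, and `missing_tools` is a hypothetical helper, not part of any_to_markdown.

```python
import shutil

# Assumed executable names and the features that need them.
REQUIRED_TOOLS = {
    "ffmpeg": "audio/video processing and local YouTube transcription",
    "tesseract": "image OCR and PDF visual fallback",
}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` whose executables are not on PATH."""
    return {
        name: purpose
        for name, purpose in tools.items()
        if shutil.which(name) is None
    }


if __name__ == "__main__":
    for name, purpose in missing_tools().items():
        print(f"Missing {name}: needed for {purpose}")
```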

Public API

The package exports the following helpers from any_to_markdown:

get_markdown(inputs, use_layout_engine=False)
get_markdown_directory(directory_path, use_layout_engine=False)
handle_yt_local(urls)

Advanced PDF Layout

The advanced layout engine adds smart table detection and header/footer removal to PDF conversion. It is off by default (use_layout_engine=False); enable it per call:

# get_markdown is a coroutine, so this line must run inside an async function
results = await get_markdown("input.pdf", use_layout_engine=True)

Usage Examples

Convert a list of files or URLs

get_markdown() is asynchronous and accepts a single path/URL or a list of them.

import asyncio
from any_to_markdown import get_markdown

async def main():
    outputs = await get_markdown([
        "docs/report.pdf",
        "analysis.ipynb",
        "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    ], use_layout_engine=True)

    for path in outputs:
        print(f"Generated: {path}")

if __name__ == "__main__":
    asyncio.run(main())

Convert a directory recursively

import asyncio
from any_to_markdown import get_markdown_directory

async def main():
    # Automatically finds and processes all supported files in the folder
    outputs = await get_markdown_directory("./my_docs", use_layout_engine=True)
    print(f"Processed {len(outputs)} files.")

if __name__ == "__main__":
    asyncio.run(main())

Transcribe YouTube videos locally

Use handle_yt_local() when a YouTube transcript is unavailable or disabled. This downloads the audio and transcribes it locally using Whisper.

from any_to_markdown import handle_yt_local

# handle_yt_local is synchronous
outputs = handle_yt_local("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(outputs[0])  # The first element is the raw Markdown string

Output Behavior

Generated Markdown files are written to a ./raw_data/ directory in the current working directory.

Local Files: <filename>_<extension>.md (e.g., data.csv -> data_csv.md)
YouTube: youtube_<video_id>.md
Collisions: If a file exists, a numeric suffix is added (e.g., report_pdf_1.md).
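The naming rules above can be sketched in a few lines. This is a reconstruction from the examples in this section, not the package's code: the function names are hypothetical, the handling of files without an extension is an assumption, and only `watch?v=` and `youtu.be/` URL shapes are covered.

```python
from pathlib import Path
from urllib.parse import parse_qs, urlparse


def local_output_name(source: str) -> str:
    """Map data.csv to data_csv.md."""
    path = Path(source)
    ext = path.suffix.lstrip(".")
    # Assumption: a file with no extension keeps just its stem.
    return f"{path.stem}_{ext}.md" if ext else f"{path.stem}.md"


def youtube_output_name(url: str) -> str:
    """Build youtube_<video_id>.md from a watch?v=... or youtu.be/... URL."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        video_id = parsed.path.lstrip("/")
    else:
        video_id = parse_qs(parsed.query)["v"][0]
    return f"youtube_{video_id}.md"


def resolve_collision(name: str, existing: set) -> str:
    """Append _1, _2, ... before .md until the name is unused."""
    if name not in existing:
        return name
    stem = name[: -len(".md")]
    counter = 1
    while f"{stem}_{counter}.md" in existing:
        counter += 1
    return f"{stem}_{counter}.md"
```

Here `existing` stands in for the set of file names already present in raw_data/.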

Troubleshooting & Tips

Large Files: Files larger than 200 MB are automatically processed sequentially to prevent memory issues.
OCR Quality: Depends on your local Tesseract installation and image resolution.
Whisper Performance: On the first run, the Whisper model is downloaded and cached locally. CPU inference is optimized using int8 quantization.
Privacy: Errors caught during processing are sanitized to remove absolute local paths before being written to Markdown.
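To illustrate the kind of sanitization described under Privacy, here is one possible approach using a regular expression. It is a sketch under assumptions, not the package's implementation: `sanitize_error` is a hypothetical name, and the pattern only covers POSIX and Windows drive paths that contain at least one directory component.

```python
import re

# Matches absolute paths such as /home/user/x.pdf or C:\Users\user\x.pdf.
# The lookbehind skips URLs (https://...) and relative paths (docs/x.pdf).
_ABS_PATH = re.compile(
    r"(?<![\w:/\\])(?:[A-Za-z]:\\|/)(?:[^\s\\/]+[\\/])+([^\s\\/]+)"
)


def sanitize_error(message: str) -> str:
    """Replace each absolute path with only its final component."""
    return _ABS_PATH.sub(r"\1", message)
```

Keeping the file name while dropping the directories preserves enough context to debug a failed conversion without exposing the local directory layout.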

License

MIT License. See LICENSE for full terms.

Release history

0.2.0: Jun 12, 2026
0.1.4: Jun 11, 2026 (this version)
