Skip to main content

Bidirectional, Markdown-centric document conversion: reverse (X->Markdown) like markitdown, plus high-fidelity forward export (Markdown->PDF/Word/LaTeX/EPUB/Excel) and optional VLM image recognition.

Project description

DocumentStudio (Python)

A bidirectional, Markdown-centric document converter — a Python library and CLI in the spirit of Microsoft's markitdown, but going both ways:

Direction Formats Notes
Reverse X → Markdown PDF, Word, PPT, Excel, EPUB, HTML, CSV/TSV, JSON, ZIP, images like markitdown; can delegate to markitdown when installed
Forward Markdown → X HTML, PDF, Word (.docx), LaTeX, EPUB, Excel (.xlsx), text high-fidelity export — the part markitdown does not do
AI / VLM image & scanned-PDF recognition, "smart cleanup" any OpenAI-compatible endpoint; vision model optional
AI assistant polish, translate, summarise, expand, continue, grammar, formalise, titles, outline, fix-LaTeX, free-form one-shot ops on a document
Toolbox table of contents, merge PDFs, extract images headless, no browser
Templates academic, techdoc, minutes, readme, weekly, blog ready-to-edit Markdown

The design mirrors markitdown's: a small core, a converter registry that's open for extension, and optional dependency extras so a minimal install still works.

Install

pip install docstudio                 # core: csv/tsv/json/html  +  md→html/latex/text
pip install "docstudio[office]"       # docx, pptx, xlsx, epub
pip install "docstudio[pdf]"          # PDF text extraction (pdfminer.six)
pip install "docstudio[ocr]"          # scanned-PDF / image OCR (PyMuPDF, pytesseract)
pip install "docstudio[llm]"          # AI cleanup + VLM (requests)
pip install "docstudio[markitdown]"   # reuse Microsoft markitdown for the reverse path
pip install "docstudio[all]"

For Markdown → PDF/DOCX/EPUB with the best fidelity, install pandoc plus a TeX engine (xelatex):

sudo apt install pandoc texlive-xetex texlive-latex-recommended fonts-noto-cjk

PDF also has two pure-Python backends as fallbacks: weasyprint (docstudio[pdf-weasy]) and headless-Chrome via playwright (docstudio[pdf-chrome], full KaTeX math).

Library

from docstudio import DocumentStudio
ds = DocumentStudio()                       # use_markitdown=True by default

# anything → Markdown
md = ds.to_markdown("report.pdf")
md = ds.to_markdown("slides.pptx")

# Markdown → anything (non-md inputs are auto-converted first)
ds.convert("paper.md",  to="pdf",   out="paper.pdf")
ds.convert("paper.md",  to="docx",  out="paper.docx")
ds.convert("scan.pdf",  to="docx",  out="scan.docx")   # PDF → md → docx
ds.convert("table.png", to="xlsx",  out="table.xlsx")  # image → md → xlsx

AI + Vision (VLM)

from docstudio import DocumentStudio
from docstudio.llm import LLM

llm = LLM(base_url="https://api.openai.com", api_key="sk-...",
          model="gpt-4o-mini", vlm_model="gpt-4o")

print(LLM.fetch_models("https://api.openai.com", "sk-..."))   # pick from the list

ds = DocumentStudio(llm=llm)
md = ds.to_markdown("photographed_table.jpg")   # recognised by the vision model
md = ds.to_markdown("scanned_book.pdf")         # page-by-page VLM when no text layer
md = llm.cleanup_markdown(rough_text)           # turn messy OCR into clean Markdown

AI assistant (operate on a document)

One-shot AI operations on Markdown/text — the AI Assistant from the web app. Needs an llm (any OpenAI-compatible endpoint).

from docstudio import DocumentStudio
from docstudio.llm import LLM

# any OpenAI-compatible endpoint — OpenAI, DeepSeek, vLLM, Ollama, a gateway…
# you choose base_url + model; nothing is hard-coded to a provider
ds = DocumentStudio(llm=LLM(base_url="https://api.openai.com",
                            api_key="sk-...", model="gpt-4o-mini"))

ds.assist(md, action="polish")     # 润色
ds.assist(md, action="to_en")      # 翻译成英文(to_zh 反之)
ds.assist(md, action="summary")    # 摘要
ds.assist(md, action="outline")    # 生成大纲
ds.assist(md, instruction="把所有表格改成要点列表")   # 自由指令

DocumentStudio.assist_actions()
# polish, to_en, to_zh, summary, expand, condense, continue,
# grammar, formal, titles, outline, fix_latex

Toolbox

ds.generate_toc(md)                              # insert a Markdown table of contents
ds.merge_pdfs(["a.pdf", "b.pdf"], "all.pdf")     # concatenate PDFs (needs pypdf)
ds.extract_images("report.pdf", "./imgs")        # pull embedded images out (PDF/DOCX/PPTX/EPUB)

Templates

Six ready-to-edit Markdown templates: academic, techdoc, minutes, readme, weekly, blog.

ds.templates()                # {slug: (title, description)}
body = ds.template("academic")

CLI

docstudio report.pdf                      # → report.md   (prints to stdout)
docstudio report.pdf -o out.md
cat report.pdf | docstudio                # stdin → stdout
docstudio paper.md --to pdf -o paper.pdf  # Markdown → anything
docstudio scan.pdf --to docx              # PDF → md → docx
docstudio photo.jpg --vlm-model gpt-4o --base-url https://api.openai.com --api-key sk-...
docstudio --list-formats

docstudio paper.md --toc -o paper.md                     # insert a table of contents
docstudio notes.md --assist polish --base-url https://api.openai.com --model gpt-4o-mini --api-key sk-... -o clean.md
docstudio notes.md --instruction "翻译成英文" --base-url https://api.openai.com --model gpt-4o-mini --api-key sk-... -o en.md
docstudio --merge a.pdf b.pdf -o all.pdf                 # merge PDFs
docstudio report.pdf --extract-images ./imgs             # pull out images
docstudio --template academic                            # print a template
docstudio --list-templates

Extending

Register your own converter — exactly how the built-ins are defined:

from docstudio.core import registry

@registry.ingester("rtf")
def rtf_to_md(source, ds=None, **opts):
    ...
    return markdown_text

@registry.exporter("rst")
def md_to_rst(md, out=None, ds=None, **opts):
    ...
    return out

Relationship to markitdown

markitdown is excellent at X → Markdown for LLM pipelines. DocumentStudio reuses it for that direction when present (use_markitdown=True), and adds the missing half: turning Markdown back into polished, human-facing PDF / Word / LaTeX / EPUB / Excel, plus a vision-model path for images and scanned PDFs.

MIT licensed.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

docstudio-0.2.0.tar.gz (22.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

docstudio-0.2.0-py3-none-any.whl (26.6 kB view details)

Uploaded Python 3

File details

Details for the file docstudio-0.2.0.tar.gz.

File metadata

  • Download URL: docstudio-0.2.0.tar.gz
  • Upload date:
  • Size: 22.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for docstudio-0.2.0.tar.gz
Algorithm Hash digest
SHA256 3cebbdbb849894dc46c75df6db0bdb121991a19b3d67b4eed4a37993088228ab
MD5 d984c7c1d6f9a38b1fcf6d18e520ab39
BLAKE2b-256 80171341e692c34e2cfc99ff3885f783cc7a54935e6f9f0a4a58677c2ab90cea

See more details on using hashes here.

File details

Details for the file docstudio-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: docstudio-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 26.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for docstudio-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4204bb5db78e34d6f134dc6d170258a217ce92f96da60c325ef9103fe3b8646f
MD5 89fa284eeceae15d2d0d48c004ca5023
BLAKE2b-256 1f079ba92e2037e65c096f0a554f3921280fab64c6692c1441d73433ffe4a0f8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page