Bidirectional, Markdown-centric document conversion: reverse (X->Markdown) like markitdown, plus high-fidelity forward export (Markdown->PDF/Word/LaTeX/EPUB/Excel) and optional VLM image recognition.
Project description
DocumentStudio (Python)
A bidirectional, Markdown-centric document converter — a Python library and
CLI in the spirit of Microsoft's markitdown,
but going both ways:
| Direction | Formats | Notes |
|---|---|---|
Reverse X → Markdown |
PDF, Word, PPT, Excel, EPUB, HTML, CSV/TSV, JSON, ZIP, images | like markitdown; can delegate to markitdown when installed |
Forward Markdown → X |
HTML, PDF, Word (.docx), LaTeX, EPUB, Excel (.xlsx), text | high-fidelity export — the part markitdown does not do |
| AI / VLM | image & scanned-PDF recognition, "smart cleanup" | any OpenAI-compatible endpoint; vision model optional |
| AI assistant | polish, translate, summarise, expand, continue, grammar, formalise, titles, outline, fix-LaTeX, free-form | one-shot ops on a document |
| Toolbox | table of contents, merge PDFs, extract images | headless, no browser |
| Templates | academic, techdoc, minutes, readme, weekly, blog | ready-to-edit Markdown |
The design mirrors markitdown's: a small core, a converter registry that's open for extension, and optional dependency extras so a minimal install still works.
Install
pip install docstudio # core: csv/tsv/json/html + md→html/latex/text
pip install "docstudio[office]" # docx, pptx, xlsx, epub
pip install "docstudio[pdf]" # PDF text extraction (pdfminer.six)
pip install "docstudio[ocr]" # scanned-PDF / image OCR (PyMuPDF, pytesseract)
pip install "docstudio[llm]" # AI cleanup + VLM (requests)
pip install "docstudio[markitdown]" # reuse Microsoft markitdown for the reverse path
pip install "docstudio[all]"
For Markdown → PDF/DOCX/EPUB with the best fidelity, install
pandoc plus a TeX engine (xelatex):
sudo apt install pandoc texlive-xetex texlive-latex-recommended fonts-noto-cjk
PDF also has two pure-Python backends as fallbacks: weasyprint
(docstudio[pdf-weasy]) and headless-Chrome via playwright
(docstudio[pdf-chrome], full KaTeX math).
Library
from docstudio import DocumentStudio
ds = DocumentStudio() # use_markitdown=True by default
# anything → Markdown
md = ds.to_markdown("report.pdf")
md = ds.to_markdown("slides.pptx")
# Markdown → anything (non-md inputs are auto-converted first)
ds.convert("paper.md", to="pdf", out="paper.pdf")
ds.convert("paper.md", to="docx", out="paper.docx")
ds.convert("scan.pdf", to="docx", out="scan.docx") # PDF → md → docx
ds.convert("table.png", to="xlsx", out="table.xlsx") # image → md → xlsx
AI + Vision (VLM)
from docstudio import DocumentStudio
from docstudio.llm import LLM
llm = LLM(base_url="https://api.openai.com", api_key="sk-...",
model="gpt-4o-mini", vlm_model="gpt-4o")
print(LLM.fetch_models("https://api.openai.com", "sk-...")) # pick from the list
ds = DocumentStudio(llm=llm)
md = ds.to_markdown("photographed_table.jpg") # recognised by the vision model
md = ds.to_markdown("scanned_book.pdf") # page-by-page VLM when no text layer
md = llm.cleanup_markdown(rough_text) # turn messy OCR into clean Markdown
AI assistant (operate on a document)
One-shot AI operations on Markdown/text — the AI Assistant from the web app.
Needs an llm (any OpenAI-compatible endpoint).
from docstudio import DocumentStudio
from docstudio.llm import LLM
# any OpenAI-compatible endpoint — OpenAI, DeepSeek, vLLM, Ollama, a gateway…
# you choose base_url + model; nothing is hard-coded to a provider
ds = DocumentStudio(llm=LLM(base_url="https://api.openai.com",
api_key="sk-...", model="gpt-4o-mini"))
ds.assist(md, action="polish") # 润色
ds.assist(md, action="to_en") # 翻译成英文(to_zh 反之)
ds.assist(md, action="summary") # 摘要
ds.assist(md, action="outline") # 生成大纲
ds.assist(md, instruction="把所有表格改成要点列表") # 自由指令
DocumentStudio.assist_actions()
# polish, to_en, to_zh, summary, expand, condense, continue,
# grammar, formal, titles, outline, fix_latex
Toolbox
ds.generate_toc(md) # insert a Markdown table of contents
ds.merge_pdfs(["a.pdf", "b.pdf"], "all.pdf") # concatenate PDFs (needs pypdf)
ds.extract_images("report.pdf", "./imgs") # pull embedded images out (PDF/DOCX/PPTX/EPUB)
Templates
Six ready-to-edit Markdown templates: academic, techdoc, minutes,
readme, weekly, blog.
ds.templates() # {slug: (title, description)}
body = ds.template("academic")
CLI
docstudio report.pdf # → report.md (prints to stdout)
docstudio report.pdf -o out.md
cat report.pdf | docstudio # stdin → stdout
docstudio paper.md --to pdf -o paper.pdf # Markdown → anything
docstudio scan.pdf --to docx # PDF → md → docx
docstudio photo.jpg --vlm-model gpt-4o --base-url https://api.openai.com --api-key sk-...
docstudio --list-formats
docstudio paper.md --toc -o paper.md # insert a table of contents
docstudio notes.md --assist polish --base-url https://api.openai.com --model gpt-4o-mini --api-key sk-... -o clean.md
docstudio notes.md --instruction "翻译成英文" --base-url https://api.openai.com --model gpt-4o-mini --api-key sk-... -o en.md
docstudio --merge a.pdf b.pdf -o all.pdf # merge PDFs
docstudio report.pdf --extract-images ./imgs # pull out images
docstudio --template academic # print a template
docstudio --list-templates
Extending
Register your own converter — exactly how the built-ins are defined:
from docstudio.core import registry
@registry.ingester("rtf")
def rtf_to_md(source, ds=None, **opts):
...
return markdown_text
@registry.exporter("rst")
def md_to_rst(md, out=None, ds=None, **opts):
...
return out
Relationship to markitdown
markitdown is excellent at X → Markdown for LLM pipelines. DocumentStudio
reuses it for that direction when present (use_markitdown=True), and adds the
missing half: turning Markdown back into polished, human-facing PDF / Word /
LaTeX / EPUB / Excel, plus a vision-model path for images and scanned PDFs.
MIT licensed.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file docstudio-0.2.0.tar.gz.
File metadata
- Download URL: docstudio-0.2.0.tar.gz
- Upload date:
- Size: 22.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3cebbdbb849894dc46c75df6db0bdb121991a19b3d67b4eed4a37993088228ab
|
|
| MD5 |
d984c7c1d6f9a38b1fcf6d18e520ab39
|
|
| BLAKE2b-256 |
80171341e692c34e2cfc99ff3885f783cc7a54935e6f9f0a4a58677c2ab90cea
|
File details
Details for the file docstudio-0.2.0-py3-none-any.whl.
File metadata
- Download URL: docstudio-0.2.0-py3-none-any.whl
- Upload date:
- Size: 26.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4204bb5db78e34d6f134dc6d170258a217ce92f96da60c325ef9103fe3b8646f
|
|
| MD5 |
89fa284eeceae15d2d0d48c004ca5023
|
|
| BLAKE2b-256 |
1f079ba92e2037e65c096f0a554f3921280fab64c6692c1441d73433ffe4a0f8
|