
Incremental Markdown translation with LLMs — track blocks in PO files, only retranslate what changed


mdpo-llm


Translate Markdown with LLMs — and only pay for what changed.

mdpo-llm splits your Markdown into blocks, tracks each one in a PO file, and sends only new or changed blocks to your LLM. Edit one paragraph in a 50-block document? One API call, not fifty.

How It Works

flowchart LR
    A["Markdown\nSource"] --> B["Parse\ninto blocks"]
    B --> C["Track\nin PO file"]
    C --> D{"Changed?"}
    D -- Yes --> E["Send to\nLLM"]
    D -- No --> F["Reuse existing\ntranslation"]
    E --> G["Reconstruct\nMarkdown"]
    F --> G

Each block (heading, paragraph, code block, list, table) is tracked independently. On subsequent runs, only blocks whose source text changed get sent to the LLM — the rest are served from the PO cache.
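
For example, a short hypothetical source like the one below would be tracked as four separate blocks, so editing only the paragraph later flags just that one entry for retranslation:

# Getting Started                → block 1 (heading)

Install the package before use.  → block 2 (paragraph)

    pip install mdpo-llm         → block 3 (code block)

- step one                       → block 4 (list; the whole list is one block)
- step two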

Incremental processing in practice

First run:    8 blocks parsed → 8 API calls → full document translated
Edit source:  change 1 paragraph
Second run:   8 blocks parsed → 1 API call  → only the changed block retranslated

Translation Context

Blocks aren't translated in isolation. As each block is translated, it's added to a reference pool. Subsequent blocks receive the most similar previous translations as few-shot examples, so the LLM maintains consistent tone, terminology, and style across the entire document.

Block 1: "Introduction"     → translated (no context yet)
Block 2: "Getting Started"  → translated with Block 1 as reference
Block 3: "Installation"     → translated with Blocks 1–2 as reference
...

On re-runs, the pool is seeded from all existing translations in the PO file, so even a single changed paragraph benefits from the full document's context.
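
The way reference pairs are chosen can be pictured roughly as follows. This is a conceptual sketch only, assuming a simple word-overlap similarity; it is not mdpo-llm's internal code, and the actual selection strategy may differ.

def pick_references(source_block, pool, max_pairs=5):
    """Return up to max_pairs (source, translation) pairs most similar to source_block."""
    def overlap(a, b):
        words_a, words_b = set(a.lower().split()), set(b.lower().split())
        return len(words_a & words_b) / max(len(words_a | words_b), 1)
    ranked = sorted(pool, key=lambda pair: overlap(source_block, pair[0]), reverse=True)
    return ranked[:max_pairs]

# pool holds previously translated (source, translation) pairs
pool = [("Introduction", "소개"), ("Getting Started", "시작하기")]
print(pick_references("Getting started with the CLI", pool, max_pairs=1))
# [('Getting Started', '시작하기')]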

Installation

pip install mdpo-llm

Quick Start

1. Translate a document

No subclassing, no boilerplate. Pass a model string and go.

from pathlib import Path
from mdpo_llm import MdpoLLM

processor = MdpoLLM(
    model="gpt-4",            # any LiteLLM model string
    target_lang="ko",         # baked into the system prompt
    temperature=0.3,          # forwarded to litellm.completion()
)

result = processor.process_document(
    source_path=Path("docs/README.md"),
    target_path=Path("docs/README_ko.md"),
    # po_path defaults to docs/README_ko.po
)

print(f"Processed {result['translation_stats']['processed']} blocks")
print(f"Coverage: {result['coverage']['coverage_percentage']}%")

Run it again after editing the source — only the changed paragraphs get reprocessed.

2. Process a directory

result = processor.process_directory(
    source_dir=Path("docs/"),
    target_dir=Path("docs_ko/"),
    glob="**/*.md",
    max_workers=4,  # files processed concurrently
    # po_dir defaults to target_dir (PO files next to translated files)
)

print(f"{result['files_processed']} files processed")
print(f"{result['files_skipped']} files unchanged")

The directory structure is mirrored into target_dir. Each file gets its own PO file and its own reference pool. By default, PO files are placed next to the target files; pass po_dir to store them separately.
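
With those defaults, a hypothetical docs/ tree would end up looking roughly like this (file names are illustrative):

docs/
├── index.md
└── guide/
    └── setup.md

docs_ko/
├── index.md        (translated)
├── index.po        (block tracking for index.md)
└── guide/
    ├── setup.md
    └── setup.po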

3. Use any provider

LiteLLM supports 100+ providers. Just change the model string:

# OpenAI
MdpoLLM(model="gpt-4", target_lang="ko")

# Anthropic
MdpoLLM(model="anthropic/claude-sonnet-4-5-20250929", target_lang="ko")

# Google
MdpoLLM(model="gemini/gemini-pro", target_lang="ko")

# Azure OpenAI
MdpoLLM(model="azure/my-deployment", target_lang="ko", api_base="https://...")

Language Handling

target_lang — tell the LLM which language to produce

A BCP 47 locale string (e.g. "ko", "ja", "zh-CN") baked into the system prompt. The source language is auto-detected by the LLM — you only specify the target.

processor = MdpoLLM(model="gpt-4", target_lang="ja")

When target_lang is set, new PO files will include a Language header (e.g. Language: ja).

Glossary

Protect brand names, trademarks, and proper nouns from translation — or force specific translations for them.

Inline glossary

processor = MdpoLLM(
    model="gpt-4",
    target_lang="ko",
    glossary={
        "GitHub": None,                # None = do not translate
        "Markdown": None,
        "pull request": "풀 리퀘스트",  # force specific translation
        "API": "API",
    },
)

JSON glossary file

For multi-locale projects, keep a single glossary.json:

{
  "GitHub": null,
  "Markdown": null,
  "pull request": {
    "ko": "풀 리퀘스트",
    "ja": "プルリクエスト"
  },
  "API": "API"
}
  • null — do not translate (any locale)
  • "string" — use this translation for all locales
  • {"ko": "...", "ja": "..."} — per-locale; if the current locale isn't listed, the term is kept as-is
processor = MdpoLLM(
    model="gpt-4",
    target_lang="ko",
    glossary_path="glossary.json",
)

If both glossary and glossary_path are provided, inline entries override the file.

Only glossary terms that actually appear in each block are injected into the prompt, so a large glossary doesn't waste tokens on irrelevant blocks.
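
Conceptually, that per-block filtering amounts to something like the snippet below. It only illustrates the behaviour described above and is not the library's actual code; details such as case sensitivity are assumptions here.

glossary = {"GitHub": None, "Markdown": None, "pull request": "풀 리퀘스트", "API": "API"}
block = "Open a pull request on GitHub when the docs build passes."

relevant = {term: tr for term, tr in glossary.items() if term.lower() in block.lower()}
print(relevant)  # {'GitHub': None, 'pull request': '풀 리퀘스트'}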

See examples/glossary.json for a full example with brand names, technical terms, and per-locale translations.

Comparison

|                                | mdpo-llm | mdpo | md-translator  | llm-translator |
| ------------------------------ | -------- | ---- | -------------- | -------------- |
| LLM-powered                    | Yes      | No   | Yes            | Yes            |
| Incremental (block-level)      | Yes      | Yes  | No             | No             |
| PO file tracking               | Yes      | Yes  | No             | No             |
| Translation context (few-shot) | Yes      | No   | Partial        | No             |
| LLM-agnostic                   | Yes      | N/A  | Multi-provider | OpenAI only    |
| Batch directory processing     | Yes      | Yes  | No             | No             |

mdpo pioneered PO-based Markdown translation but targets manual/MT workflows, not LLMs. md-translator and llm-translator use LLMs but reprocess entire files on every run. mdpo-llm combines both: PO-tracked incremental processing with LLM-powered translation and cross-block context.

API Reference

MdpoLLM

Constructor:

MdpoLLM(
    model,                     # any LiteLLM model string (required)
    target_lang,               # BCP 47 string, baked into system prompt (required)
    max_reference_pairs=5,     # max similar pairs passed as few-shot context
    extra_instructions=None,   # appended to the built-in translation prompt
    post_process=None,         # Callable[[str], str] applied to every LLM response
    glossary=None,             # dict[str, str | None] — inline glossary
    glossary_path=None,        # path to JSON glossary file (multi-locale)
    **litellm_kwargs,          # temperature, api_key, api_base, etc.
)
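
For example, a sketch that combines several of these options (the instruction text, glossary entry, and post-processing function are illustrative values, not defaults):

processor = MdpoLLM(
    model="gpt-4",
    target_lang="ko",
    max_reference_pairs=3,                       # fewer few-shot pairs per request
    extra_instructions="Keep inline code and CLI flags untranslated.",
    post_process=lambda text: text.strip(),      # run on every LLM response
    glossary={"GitHub": None},
    temperature=0.2,                             # forwarded to litellm.completion()
)
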
| Method | Description |
| --- | --- |
| process_document(source_path, target_path, po_path=None, inplace=False) | Process a single Markdown file. po_path defaults to target_path with .po extension. |
| process_directory(source_dir, target_dir, po_dir=None, glob, inplace, max_workers) | Process a directory tree concurrently. po_dir defaults to target_dir. |
| get_translation_stats(source_path, po_path) | Return coverage and block statistics. |
| export_report(source_path, po_path) | Generate a detailed text report. |
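
A quick usage sketch for the two reporting helpers (paths are illustrative; the return values are printed as-is since their exact shape isn't documented here):

from pathlib import Path

stats = processor.get_translation_stats(
    source_path=Path("docs/README.md"),
    po_path=Path("docs/README_ko.po"),
)
print(stats)

print(processor.export_report(
    source_path=Path("docs/README.md"),
    po_path=Path("docs/README_ko.po"),
))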

Prompts

The Prompts class exposes all built-in prompt templates used by the processor:

from mdpo_llm import Prompts

# See the default translation instruction
print(Prompts.TRANSLATE_INSTRUCTION)

Working with PO Files

PO files (GNU gettext) track the state of each content block:

  • Untranslated — new content, will be sent to the LLM
  • Translated — completed, reused on subsequent runs
  • Fuzzy — source changed since last run, will be retranslated
  • Obsolete — source block was removed, cleaned up automatically

You can inspect and edit PO files with any standard gettext tool (Poedit, Lokalize, etc.).

When target_lang is set, new PO files include a Language metadata header so tools can identify the target language.
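
For illustration, a minimal PO file for a two-block document might look like the generic gettext example below; the exact comments and layout depend on what mdpo-llm writes:

msgid ""
msgstr ""
"Language: ko\n"

msgid "Getting Started"
msgstr "시작하기"

#, fuzzy
msgid "Install the package before running the examples."
msgstr "먼저 패키지를 설치하세요."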

Development

# Install with dev dependencies
uv pip install -e ".[dev]"

# Run tests
pytest tests/

License

MIT
