yadt · PyPI

Yet Another Document Translator

These details have been verified by PyPI

Maintainers

funstoryai

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

Project description

Yet Another Document Translator

PDF scientific paper translation and bilingual comparison library.

Provides a simple command line interface.
Provides a Python API.
Mainly designed to be embedded into other programs, but can also be used directly for simple translation tasks.

Getting Started

Install from PyPI

We recommend using the Tool feature of uv to install yadt.

First, follow the uv installation guide to install uv and set up the PATH environment variable as prompted.
Use the following command to install yadt:

uv tool install --python 3.12 yadt

yadt --help

Use the yadt command. For example:

yadt --bing --files example.pdf

# multiple files
yadt --bing --files example1.pdf --files example2.pdf

Install from Source

We still recommend using uv to manage virtual environments.

First, follow the uv installation guide to install uv and set up the PATH environment variable as prompted.
Use the following command to install yadt:

# clone the project
git clone https://github.com/funstory-ai/yadt

# enter the project directory
cd yadt

# install dependencies and run yadt
uv run yadt --help

Use the uv run yadt command. For example:

uv run yadt --bing --files example.pdf

# multiple files
uv run yadt --bing --files example.pdf --files example2.pdf

[!TIP] Using absolute file paths is recommended.

Advanced Options

Language Options

--lang-in, -li: Source language code (default: en)
--lang-out, -lo: Target language code (default: zh)

[!TIP] Currently, this project mainly focuses on English-to-Chinese translation, and other scenarios have not been tested yet.

PDF Processing Options

--files: One or more file paths to input PDF documents.
--pages, -p: Specify pages to translate (e.g., "1,2,1-,-3,3-5"). If not set, all pages are translated.
--split-short-lines: Force short lines to be split into separate paragraphs (may cause poor typesetting and bugs)
--short-line-split-factor: Split threshold factor (default: 0.8). The actual threshold is the median length of all lines on the current page multiplied by this factor.
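
To make the --pages syntax and the short-line threshold concrete, here is a minimal Python sketch. It is not taken from the yadt source: the function names are hypothetical, and the meaning of open-ended ranges ("1-" runs to the last page, "-3" starts at page 1) is an assumption based on common conventions.

```python
from statistics import median


def parse_pages(spec: str, total_pages: int) -> list[int]:
    """Expand a page spec such as "1,2,1-,-3,3-5" into sorted 1-based page numbers.

    Assumed semantics (not confirmed by the yadt source): "N" is a single page,
    "A-B" is an inclusive range, "A-" runs to the last page, "-B" starts at page 1.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text) if start_text else 1
            end = int(end_text) if end_text else total_pages
        else:
            start = end = int(part)
        # Clamp to the document; a reversed range simply adds nothing.
        pages.update(range(max(start, 1), min(end, total_pages) + 1))
    return sorted(pages)


def short_line_threshold(line_lengths: list[float], factor: float = 0.8) -> float:
    """Median line length on the page multiplied by the split factor."""
    return median(line_lengths) * factor
```

For example, parse_pages("2,4-5", 10) selects pages 2, 4, and 5, and a page whose lines have lengths 10, 20, and 30 gets a threshold of 20 * 0.8.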

Translation Service Options

--qps: QPS (Queries Per Second) limit for translation service (default: 4)
--ignore-cache: Ignore translation cache and force retranslation
--no-dual: Do not output bilingual PDF files
--no-mono: Do not output monolingual PDF files
--openai: Use OpenAI for translation (default: False)
--bing: Use Bing for translation (default: False)
--google: Use Google Translate for translation (default: False)

[!TIP]

You must specify one translation service among --openai, --bing, --google.

Models with an OpenAI-compatible API are recommended, such as glm-4-flash and deepseek-chat.

yadt has not yet been optimized for traditional translation engines such as Bing or Google, so using an LLM is recommended.
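
For readers unfamiliar with QPS limits, the sketch below shows one simple way a client can keep requests under a limit such as --qps 4 by spacing calls evenly. It is an illustration only, not yadt's implementation; the QpsLimiter class and its parameters are hypothetical.

```python
import time


class QpsLimiter:
    """Spaces calls so that at most `qps` of them start per second."""

    def __init__(self, qps: float, clock=time.monotonic, sleep=time.sleep):
        if qps <= 0:
            raise ValueError("qps must be positive")
        self._interval = 1.0 / qps
        self._clock = clock  # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self) -> None:
        """Block until the next request is allowed to start."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval
```

A caller would invoke limiter.wait() immediately before each translation request. With qps=4, consecutive requests start at least 0.25 seconds apart.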

OpenAI Specific Options

--openai-model: OpenAI model to use (default: gpt-4o-mini)
--openai-base-url: Base URL for OpenAI API
--openai-api-key: API key for OpenAI service

Output Control

--output, -o: Output directory for translated files. If not set, use current working directory.
--debug, -d: Enable debug logging level and export detailed intermediate results in ~/.cache/yadt/working.
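
Because yadt is mainly meant to be embedded into other programs, one low-risk way to do that is to call the command line interface from Python. The sketch below only assembles flags documented above; the helper names build_yadt_command and run_yadt are hypothetical, and it assumes the yadt executable is on PATH.

```python
import subprocess


def build_yadt_command(files, service="bing", lang_in="en", lang_out="zh",
                       output_dir=None, pages=None):
    """Return an argv list for the yadt CLI using flags listed in this README."""
    if service not in {"openai", "bing", "google"}:
        raise ValueError("service must be one of: openai, bing, google")
    if not files:
        raise ValueError("at least one PDF path is required")
    cmd = ["yadt", f"--{service}", "--lang-in", lang_in, "--lang-out", lang_out]
    for path in files:
        # --files is repeated once per input document.
        cmd += ["--files", str(path)]
    if pages:
        cmd += ["--pages", pages]
    if output_dir:
        cmd += ["--output", str(output_dir)]
    return cmd


def run_yadt(files, **options):
    """Run yadt and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_yadt_command(files, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.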

Configuration File

--config, -c: Configuration file path. Use the TOML format.

Example Configuration:

[yadt]
debug = true
lang-in = "en-US"
lang-out = "zh-CN"
qps = 20
# this is a comment
# pages = 4
openai = true
openai-model = "SOME_ALSOME_MODEL"
openai-base-url = "https://example.example/v1"
openai-api-key = "[KEY]"
# All other options can also be set in the configuration file.

Python API

You can refer to the example in main.py to use YADT's Python API.

Please note:

Make sure all font files described in main.download_font_assets exist.
The current TranslationConfig does not fully validate its input parameters, so you must ensure that they are valid yourself.
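
Since validation is left to the caller, a small pre-flight check can catch obvious mistakes before any translation starts. The sketch below is a hypothetical helper that does not touch TranslationConfig (its fields are not described here); the parameter names mirror the command line options and the specific checks are assumptions.

```python
from pathlib import Path


def check_translation_inputs(files, qps=4, lang_in="en", lang_out="zh"):
    """Return a list of problems found; an empty list means the inputs look usable."""
    problems = []
    if not files:
        problems.append("no input files given")
    for name in files:
        path = Path(name)
        if path.suffix.lower() != ".pdf":
            problems.append(f"{name}: not a .pdf file")
        elif not path.is_file():
            problems.append(f"{name}: file not found")
    if not isinstance(qps, int) or qps <= 0:
        problems.append("qps must be a positive integer")
    if not lang_in or not lang_out:
        problems.append("lang_in and lang_out must be non-empty")
    return problems
```

Call it with the same values you intend to pass to the API and stop if the returned list is not empty.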

Background

Many projects and teams are working to make document editing and translation easier.

There are also solutions that address specific parts of the problem, such as:

layoutreader: the reading order of the text blocks in a PDF
Surya: the structure of the PDF

This project hopes to promote a standard pipeline and interface to solve the problem.

In fact, there are two main stages of a PDF parser or translator:

Parsing: extract the structure of the PDF, such as text blocks, images, and tables.
Rendering: render that structure into a new PDF or another format.

A service like mathpix parses the PDF into a structure, possibly in an XML format, and then renders it in a single-column reading order, as layoutreader does. The drawback is that the original structure is lost.

Some people use Adobe PDF Parser because it generates a Word document that keeps the original structure, but it is somewhat expensive. Moreover, a PDF or Word document is not a good format for reading on mobile devices.

We offer an intermediate representation of the parser's results that can be rendered into a new PDF or another format. The pipeline is also plugin-based, so anyone can add their own model, OCR, renderer, etc.
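
The idea of a parse stage, an intermediate representation, and pluggable renderers can be sketched in a few lines of Python. Everything below is hypothetical and much simpler than yadt's real data model: TextBlock, Document, Renderer, and PlainTextRenderer are illustrative names, not part of the yadt API.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass
class TextBlock:
    page: int
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 in page units
    text: str


@dataclass
class Document:
    """Hypothetical intermediate representation produced by the parsing stage."""
    blocks: list[TextBlock] = field(default_factory=list)


class Renderer(Protocol):
    """Any object with a matching render() method can act as a renderer plugin."""
    def render(self, doc: Document) -> str: ...


class PlainTextRenderer:
    """Example plugin: emits blocks sorted by page, then by y0 (assumes y grows downward)."""
    def render(self, doc: Document) -> str:
        ordered = sorted(doc.blocks, key=lambda b: (b.page, b.bbox[1]))
        return "\n".join(b.text for b in ordered)


def translate_document(doc: Document, translate: Callable[[str], str]) -> Document:
    """Translate the text while keeping each block's page and position."""
    return Document([TextBlock(b.page, b.bbox, translate(b.text)) for b in doc.blocks])
```

Because the translated Document keeps every block's page and bounding box, a layout-preserving renderer could place the new text where the original was, while a different renderer could emit a mobile-friendly format from the same data.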

Roadmap

Add line support
Add table support
Add cross-page/cross-column paragraph support
More advanced typesetting features
Outline support
...

Our goal for version 1.0 is to translate the PDF Reference, Version 1.7 into the following languages:

Simplified Chinese
Traditional Chinese
Japanese
Spanish

And meet the following requirements:

layout error less than 1%
content loss less than 1%

Known Issues

Parsing errors in the author and reference sections; they get merged into one paragraph after translation.
Lines are not supported.
Does not support drop caps.

How to Contribute

We encourage you to contribute to YADT! Please check out the CONTRIBUTING guide.

Everyone interacting in YADT and its sub-projects' codebases, issue trackers, chat rooms, and mailing lists is expected to follow the YADT Code of Conduct.

Acknowledgements

Release history

0.1.5: Feb 6, 2025 (this version)
0.1.4: Feb 6, 2025
0.1.3: Feb 4, 2025
0.1.2: Feb 3, 2025
0.1.0: Jan 24, 2025
0.1.0rc2 (pre-release): Jan 24, 2025
0.1.0rc1 (pre-release): Jan 24, 2025
0.0.1a28 (pre-release): Jan 23, 2025
0.0.1a27 (pre-release): Jan 23, 2025
0.0.1a26 (pre-release): Jan 21, 2025
0.0.1a25 (pre-release): Jan 21, 2025
0.0.1a24 (pre-release): Jan 21, 2025
0.0.1a23 (pre-release): Jan 21, 2025
0.0.1a22 (pre-release): Jan 20, 2025
0.0.1a21 (pre-release): Jan 20, 2025
0.0.1a20 (pre-release): Jan 20, 2025
0.0.1a19 (pre-release): Jan 17, 2025
0.0.1a18 (pre-release): Jan 17, 2025
0.0.1a17 (pre-release): Jan 17, 2025
0.0.1a16 (pre-release): Jan 17, 2025
0.0.1a15 (pre-release): Jan 16, 2025
0.0.1a14 (pre-release): Jan 16, 2025
0.0.1a13 (pre-release): Jan 15, 2025
0.0.1a12 (pre-release): Jan 15, 2025
0.0.1a11 (pre-release): Jan 15, 2025
0.0.1a10 (pre-release): Jan 15, 2025
0.0.1a9 (pre-release): Jan 15, 2025
0.0.1a7 (pre-release): Jan 15, 2025
0.0.1a6 (pre-release): Jan 15, 2025
0.0.1a5 (pre-release): Jan 15, 2025

Download files

Source Distribution

yadt-0.1.5.tar.gz (4.3 MB)

Uploaded Feb 6, 2025 Source

Built Distribution

yadt-0.1.5-py3-none-any.whl (91.8 kB)

Uploaded Feb 6, 2025 Python 3

File details

Details for the file yadt-0.1.5.tar.gz.

File metadata

Download URL: yadt-0.1.5.tar.gz
Upload date: Feb 6, 2025
Size: 4.3 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for yadt-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`e7087d6166cb6a4931662ed2037b75bbc883d33cde1af5b0d76a4d8ce0bf615e`
MD5	`ee3ff50be2e8a52291078918deea1343`
BLAKE2b-256	`3264c309288fee25e03af721feed056d170be5fdb189a3b5a9c72cc5a9e8d7fe`
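
To check a downloaded file against the SHA256 digest listed above, a short script with Python's standard hashlib module is enough. This is a minimal sketch; the function names are illustrative, and the local file path is whatever location you saved the download to.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for yadt-0.1.5.tar.gz
EXPECTED_SHA256 = "e7087d6166cb6a4931662ed2037b75bbc883d33cde1af5b0d76a4d8ce0bf615e"


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, matches_sha256("yadt-0.1.5.tar.gz") should return True for an intact copy of the source distribution saved in the current directory.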

Provenance

The following attestation bundles were made for yadt-0.1.5.tar.gz:

Publisher: publish-to-pypi.yml on funstory-ai/yadt

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: yadt-0.1.5.tar.gz
- Subject digest: e7087d6166cb6a4931662ed2037b75bbc883d33cde1af5b0d76a4d8ce0bf615e
- Sigstore transparency entry: 169340405
- Sigstore integration time: Feb 6, 2025
Source repository:
- Permalink: funstory-ai/yadt@54dcb6426d6e2e9145ca9f7ade3bebd8ca33be48
- Branch / Tag: refs/heads/main
- Owner: https://github.com/funstory-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@54dcb6426d6e2e9145ca9f7ade3bebd8ca33be48
- Trigger Event: push

File details

Details for the file yadt-0.1.5-py3-none-any.whl.

File metadata

Download URL: yadt-0.1.5-py3-none-any.whl
Upload date: Feb 6, 2025
Size: 91.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for yadt-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2787cda05bcff5ab9c5e0f6d152298e2b8b37f41f4667abca32f159af8088b21`
MD5	`bf5e03d3c12d91c86efa1bc391781a74`
BLAKE2b-256	`72d08b30940005409ff0adca543a4e96de12305a69458c3fc6640a24229f4090`

Provenance

The following attestation bundles were made for yadt-0.1.5-py3-none-any.whl:

Publisher: publish-to-pypi.yml on funstory-ai/yadt

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: yadt-0.1.5-py3-none-any.whl
- Subject digest: 2787cda05bcff5ab9c5e0f6d152298e2b8b37f41f4667abca32f159af8088b21
- Sigstore transparency entry: 169340407
- Sigstore integration time: Feb 6, 2025
Source repository:
- Permalink: funstory-ai/yadt@54dcb6426d6e2e9145ca9f7ade3bebd8ca33be48
- Branch / Tag: refs/heads/main
- Owner: https://github.com/funstory-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@54dcb6426d6e2e9145ca9f7ade3bebd8ca33be48
- Trigger Event: push
