A simple tool to transform PDF and DOCX to Markdown using marker-pdf
NuoYi
A simple tool to transform PDF and DOCX to Markdown.
NuoYi uses marker-pdf for high-quality PDF conversion with OCR and layout detection. All processing is done fully offline after the initial model download.
Features
- PDF to Markdown: High-quality conversion using marker-pdf with surya OCR
- DOCX to Markdown: Native support for Microsoft Word documents
- Automatic GPU/CPU Selection: Detects available VRAM and falls back to CPU if needed
- Batch Processing: Convert entire directories of documents
- GUI Interface: PySide6-based graphical interface for easy batch conversion
- Image Extraction: Automatically extracts and saves images from PDFs
- Multi-language Support: Built-in support for Chinese and English (configurable)
Installation
From PyPI
pip install nuoyi
With GUI support
pip install "nuoyi[gui]"
From source
git clone https://github.com/cycleuser/NuoYi.git
cd NuoYi
pip install -e .
Usage
Command Line Interface
# Convert a single PDF file
nuoyi paper.pdf
# Specify output file
nuoyi paper.pdf -o output/result.md
# Convert a DOCX file
nuoyi document.docx -o document.md
# Batch convert all files in a directory
nuoyi ./papers --batch
# Batch convert with custom output directory
nuoyi ./papers --batch -o ./output
# Force CPU mode (for low VRAM GPUs)
nuoyi paper.pdf --device cpu
# Force OCR even for digital PDFs
nuoyi paper.pdf --force-ocr
# Specify page range
nuoyi paper.pdf --page-range "0-5,10,15-20"
# Specify languages
nuoyi paper.pdf --langs "zh,en,ja"
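The `--page-range` value combines single pages and ranges separated by commas. The sketch below shows one way such a spec could be expanded into page indices. It is an illustration, not NuoYi's own parser: the function name `parse_page_range` is hypothetical, and it assumes zero-based indices with both ends of a range inclusive.

```python
def parse_page_range(spec: str) -> list:
    """Expand a spec such as "0-5,10,15-20" into a sorted list of page indices."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            # Assumption: both ends of the range are included.
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0-5,10,15-20"))
# [0, 1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]
```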
GUI Mode
nuoyi --gui
The GUI provides:
- Directory selection for input/output
- File list with status tracking
- Device selection (auto/CPU/CUDA)
- Force OCR option
- Page range and language configuration
- Real-time progress and logging
Python API
from nuoyi import MarkerPDFConverter, DocxConverter
# Convert PDF
pdf_converter = MarkerPDFConverter(
    force_ocr=False,
    langs="zh,en",
    device="auto",  # or "cpu", "cuda", "mps"
)
markdown_text, images = pdf_converter.convert_file("input.pdf")
# Convert DOCX
docx_converter = DocxConverter()
markdown_text = docx_converter.convert_file("input.docx")
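For batch work from Python, the two converters can be dispatched by file extension. The following is a minimal sketch that relies only on the constructors and `convert_file` calls shown above. The helpers `pick_kind` and `convert_directory` are hypothetical names, and the extracted images are ignored because their structure is not described here.

```python
from pathlib import Path
from typing import List, Optional


def pick_kind(path: Path) -> Optional[str]:
    """Return "pdf", "docx", or None for unsupported files (hypothetical helper)."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".docx":
        return "docx"
    return None  # e.g. legacy .doc files are skipped


def convert_directory(src: Path, dst: Path) -> List[Path]:
    """Convert every PDF/DOCX directly under src and write .md files into dst."""
    # Imported here so pick_kind stays usable without nuoyi installed.
    from nuoyi import DocxConverter, MarkerPDFConverter

    pdf_converter = MarkerPDFConverter(force_ocr=False, langs="zh,en", device="auto")
    docx_converter = DocxConverter()
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.iterdir()):
        kind = pick_kind(path) if path.is_file() else None
        if kind == "pdf":
            markdown_text, _images = pdf_converter.convert_file(str(path))
        elif kind == "docx":
            markdown_text = docx_converter.convert_file(str(path))
        else:
            continue
        out = dst / (path.stem + ".md")
        out.write_text(markdown_text, encoding="utf-8")
        written.append(out)
    return written
```

Calling `convert_directory(Path("./papers"), Path("./output"))` would then roughly mirror `nuoyi ./papers --batch -o ./output`, minus image saving.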
Command Line Options
| Option | Description |
|---|---|
| `input` | Input PDF/DOCX file or directory (with `--batch`) |
| `-o`, `--output` | Output file path (single file) or directory (batch mode) |
| `--force-ocr` | Force OCR even for digital PDFs with embedded text |
| `--page-range` | Page range to convert, e.g. `'0-5,10,15-20'` |
| `--langs` | Comma-separated languages (default: `zh,en`) |
| `--batch` | Process all PDF/DOCX files in the input directory |
| `--device` | Device for model inference: `auto` (default), `cpu`, `cuda`, or `mps` |
| `--gui` | Launch PySide6 GUI mode |
| `-V`, `--version` | Show version and exit |
Memory Management
NuoYi automatically manages GPU memory:
- Auto mode (default): Detects available VRAM and uses GPU if sufficient (>6GB free)
- CPU mode: Forces CPU processing (slower but no VRAM limit)
- CUDA mode: Forces GPU processing (may OOM on large PDFs)
- MPS mode: For Apple Silicon Macs
If a CUDA out-of-memory error occurs during conversion, NuoYi automatically falls back to CPU.
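The device logic described above can be sketched as follows. This is an assumption-laden illustration, not NuoYi's actual code: the names `choose_device`, `query_free_vram`, and `run_with_cpu_fallback` are hypothetical, the threshold is taken to be 6 GiB, and auto mode is assumed to choose only between CUDA and CPU. The VRAM query uses PyTorch's `torch.cuda.mem_get_info()`.

```python
from typing import Optional

MIN_FREE_VRAM_BYTES = 6 * 1024**3  # the ">6GB free" threshold, read here as 6 GiB


def choose_device(requested: str, free_vram_bytes: Optional[int]) -> str:
    """Resolve "auto" to "cuda" or "cpu"; pass explicit choices through unchanged."""
    if requested != "auto":
        return requested
    if free_vram_bytes is not None and free_vram_bytes > MIN_FREE_VRAM_BYTES:
        return "cuda"
    return "cpu"


def query_free_vram() -> Optional[int]:
    """Return free CUDA memory in bytes, or None when no CUDA device is usable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes


def run_with_cpu_fallback(convert, device: str):
    """Call convert(device); retry once on CPU after a CUDA out-of-memory error."""
    try:
        return convert(device)
    except RuntimeError as exc:
        # PyTorch raises its out-of-memory error as a RuntimeError subclass.
        if device == "cuda" and "out of memory" in str(exc).lower():
            return convert("cpu")
        raise
```

A caller would combine these as `run_with_cpu_fallback(convert, choose_device("auto", query_free_vram()))`, where `convert` is any callable that accepts a device name.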
Dependencies
Required
- `marker-pdf>=1.0.0` - PDF conversion engine
- `PyMuPDF>=1.23.0` - PDF page counting
- `python-docx>=0.8.11` - DOCX conversion
- `Pillow>=9.0.0` - Image processing
Optional
- `PySide6>=6.5.0` - GUI support (install with `pip install nuoyi[gui]`)
Model Download
Download Location
Models are downloaded automatically on first run and stored in:
~/.cache/huggingface/hub/
The models are from Hugging Face and include:
- `vikp/surya_det` - Layout detection model
- `vikp/surya_rec` - Text recognition model
- `vikp/surya_order` - Reading order model
- Other marker-pdf related models
Total size: approximately 2-3 GB.
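To check what has already been downloaded, the cache directory can be inspected directly. The sketch below sums the on-disk size of each top-level entry in the hub cache. The function name `cached_models` is hypothetical, and the code assumes the usual Hugging Face cache layout in which snapshot files are symlinks into a `blobs/` folder, so symlinks are skipped to avoid counting a file twice.

```python
import os
from pathlib import Path
from typing import Dict


def cached_models(hub_dir: Path) -> Dict[str, int]:
    """Map each top-level entry in the hub cache to its size on disk in bytes."""
    sizes = {}
    if not hub_dir.is_dir():
        return sizes  # nothing downloaded yet
    for entry in sorted(hub_dir.iterdir()):
        if not entry.is_dir():
            continue
        total = 0
        for root, _dirs, files in os.walk(entry):
            for name in files:
                file_path = Path(root) / name
                if not file_path.is_symlink():  # count real files only
                    total += file_path.stat().st_size
        sizes[entry.name] = total
    return sizes


if __name__ == "__main__":
    hub = Path.home() / ".cache" / "huggingface" / "hub"
    for name, size in cached_models(hub).items():
        print(f"{name}: {size / 1024**3:.2f} GiB")
```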
For Users in China
Hugging Face may be blocked or slow in mainland China because of the Great Firewall (GFW). You can use a mirror:
# Set Hugging Face mirror (add to ~/.bashrc or run before nuoyi)
export HF_ENDPOINT=https://hf-mirror.com
# Then run nuoyi normally
nuoyi paper.pdf
Alternatively, you can download models manually and place them in the cache directory.
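When NuoYi is used through the Python API rather than the shell, the same mirror can be selected from code. The sketch below is one possible approach: `use_hf_mirror` is a hypothetical helper, and it assumes the variable has to be set before `nuoyi` (and with it `huggingface_hub`) is imported, since the endpoint is typically read at import time.

```python
import os


def use_hf_mirror(endpoint: str = "https://hf-mirror.com", env=os.environ) -> str:
    """Set HF_ENDPOINT unless one is already configured; return the value in effect."""
    return env.setdefault("HF_ENDPOINT", endpoint)


# Call this first, then import the converters.
use_hf_mirror()
# from nuoyi import MarkerPDFConverter
```

Using `setdefault` leaves an endpoint that was already exported in the shell untouched.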
Custom Model Path
The current version does not support custom model paths to keep the tool simple and avoid configuration complexity. Models are always stored in the default Hugging Face cache location.
Notes
- After initial model download, everything works fully offline
- Use `--device cpu` if you encounter CUDA out of memory errors
- Legacy `.doc` format is not supported; convert to `.docx` first
License
GPL-3.0 License - see LICENSE for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments
- marker-pdf - The excellent PDF conversion engine
- surya - OCR and layout detection models