Academic paper PDF renaming tool
Chou (瞅) - Academic Paper PDF Renamer
A Python tool to automatically rename academic PDF papers to citation-style filenames by extracting title, author, and year information from the PDF content.
Features
- Extracts title and authors from PDF first page using font size analysis
- OCR support for scanned PDFs (5 OCR backends available)
- Extracts publication year using 10 different strategies (supports English and Chinese)
- Chinese name handling - automatically uses full names for Chinese authors
- Chinese thesis/dissertation support - detects labeled fields like "论文题目" (thesis title) and "作者姓名" (author name)
- Multiple author format options
- Dry-run mode for safe preview
- Handles special characters and Unicode in author names
- Logs all operations and exports results to CSV
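The font-size heuristic in the first feature can be sketched as follows. This is a minimal illustration, not the tool's actual code: `pick_title` and `min_chars` are hypothetical names, and the input is assumed to be a list of (text, font size) pairs. With PyMuPDF, such pairs could be collected from the `"text"` and `"size"` keys of the spans returned by `page.get_text("dict")`.

```python
from collections import defaultdict


def pick_title(spans, min_chars=10):
    """Return the text set in the largest font size that is long enough to be a title.

    `spans` is an iterable of (text, font_size) pairs in reading order.
    Hypothetical helper for illustration only.
    """
    by_size = defaultdict(list)
    for text, size in spans:
        text = text.strip()
        if text:
            # Round so that 17.93 and 17.95 count as the same size.
            by_size[round(size, 1)].append(text)

    # Try sizes from largest to smallest; skip very short runs such as
    # a drop cap or a logo rendered in a huge font.
    for size in sorted(by_size, reverse=True):
        candidate = " ".join(by_size[size])
        if len(candidate) >= min_chars:
            return candidate
    return None


spans = [
    ("Journal of Examples", 9.0),
    ("A Study of Font Size", 18.0),
    ("Heuristics for Titles", 18.0),
    ("Weihao Wang, Rufeng Zhang, Mingyu You", 12.0),
    ("Abstract", 10.0),
]
print(pick_title(spans))  # A Study of Font Size Heuristics for Titles
```

Joining every span of the winning size handles titles that wrap across lines, at the cost of occasionally merging unrelated text that happens to share that size.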
Requirements
- Python >= 3.10
- PyMuPDF (required)
- OCR backend (optional, for scanned PDFs)
Installation
From PyPI
pip install chou
From Source
git clone https://github.com/cycleuser/Chou.git
cd Chou
pip install -e .
With OCR Support
Choose one or more OCR backends based on your needs:
# Install with all OCR backends
pip install -e ".[ocr-surya,ocr-paddle,ocr-rapid,ocr-easy,ocr-tesseract]"
# Or install specific backends:
pip install surya-ocr # Surya - Best accuracy, transformer-based (recommended)
pip install paddleocr paddlepaddle # PaddleOCR - Good for Chinese
pip install rapidocr-onnxruntime # RapidOCR - Lightweight, fast
pip install easyocr # EasyOCR - Easy to use
pip install pytesseract Pillow # Tesseract - Classic OCR
Quick Start
After installation, the chou command is available:
# Preview changes (dry-run mode, default)
chou --dir /path/to/papers --dry-run
# Actually rename files
chou --dir /path/to/papers --execute
# Show version
chou --version
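The split between previewing and executing can be pictured as a two-step process: build a rename plan first, then apply it only when asked. The sketch below is an assumption about how such a flow might look, not the tool's implementation; `plan_renames`, `apply_renames`, and the `new_names` mapping are hypothetical.

```python
import tempfile
from pathlib import Path


def plan_renames(directory, new_names, recursive=True):
    """Yield (source, target) pairs without touching the file system.

    `new_names` maps an original file name to the proposed new name.
    """
    pattern = "**/*.pdf" if recursive else "*.pdf"
    for src in sorted(Path(directory).glob(pattern)):
        proposed = new_names.get(src.name)
        if proposed and proposed != src.name:
            yield src, src.with_name(proposed)


def apply_renames(plan, execute=False):
    """Print each planned rename; perform it only when execute is True."""
    handled = []
    for src, dst in plan:
        if dst.exists():
            # Never overwrite an existing file.
            print(f"SKIP     {src.name}: target already exists")
            continue
        print(f"{'RENAME' if execute else 'DRY-RUN'}  {src.name} -> {dst.name}")
        if execute:
            src.rename(dst)
        handled.append((src, dst))
    return handled


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "1234.pdf").touch()
    names = {"1234.pdf": "Wang et al. (2023) - Title.pdf"}
    apply_renames(plan_renames(tmp, names))                # preview only
    apply_renames(plan_renames(tmp, names), execute=True)  # renames the file
```

Keeping the plan separate from its application means the preview and the real run report exactly the same set of changes.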
Usage
chou [options]
Options
| Option | Short | Description |
|---|---|---|
| --dir DIR | -d | Directory containing PDF files (default: current) |
| --dry-run | -n | Preview without renaming (default: True) |
| --execute | -x | Actually rename files |
| --format FMT | -f | Author name format (see below) |
| --num-authors N | -N | Number of authors for n_* formats (default: 3) |
| --recursive | -r | Process subdirectories recursively (default: True) |
| --no-recursive | | Only process the specified directory |
| --ocr-engine | | Specify OCR engine (default: auto-detect) |
| --no-ocr | | Disable OCR fallback |
| --output FILE | -o | Export results to CSV file |
| --log-file FILE | -l | Log file path |
| --verbose | -v | Verbose output |
Author Format Options (-f)
| Format | Example Output |
|---|---|
| first_surname | Wang et al. (2023) - Title.pdf |
| first_full | Weihao Wang et al. (2023) - Title.pdf |
| all_surnames | Wang, Zhang, You (2023) - Title.pdf |
| all_full | Weihao Wang, Rufeng Zhang, Mingyu You (2023) - Title.pdf |
| n_surnames | Wang, Zhang et al. (2023) - Title.pdf |
| n_full | Weihao Wang, Rufeng Zhang et al. (2023) - Title.pdf |
Note: For Chinese authors, the full name is always used (e.g., 张三 rather than just 张), because a single-character surname on its own is too ambiguous to identify an author.
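The six formats and the Chinese-name rule can be reproduced with a small function. This is a sketch under stated assumptions rather than the tool's own logic: `build_filename` and `display_name` are hypothetical, authors are assumed to arrive as (given, surname) pairs, "et al." is assumed to appear only when authors are omitted, and the title is assumed to be already safe for use in a filename.

```python
import re

# Any character in the basic CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")


def display_name(author, full):
    """Format one (given, surname) pair; Chinese-script names are always full."""
    given, surname = author
    if CJK.search(surname + given):
        return surname + given  # Chinese order: surname first, no space
    return f"{given} {surname}" if full else surname


def build_filename(authors, year, title, fmt="first_surname", n=3):
    """Build 'Authors (Year) - Title.pdf' for one of the six format names."""
    kind, _, style = fmt.rpartition("_")  # "n_full" -> ("n", "_", "full")
    full = style == "full"
    limit = {"first": 1, "all": len(authors), "n": n}[kind]
    joined = ", ".join(display_name(a, full) for a in authors[:limit])
    if len(authors) > limit:
        joined += " et al."  # assumption: only when some authors are omitted
    return f"{joined} ({year}) - {title}.pdf"


authors = [("Weihao", "Wang"), ("Rufeng", "Zhang"), ("Mingyu", "You")]
print(build_filename(authors, 2023, "Title", "first_surname"))
# Wang et al. (2023) - Title.pdf
print(build_filename(authors, 2023, "Title", "n_surnames", n=2))
# Wang, Zhang et al. (2023) - Title.pdf
print(build_filename([("三", "张")], 2023, "Title", "first_surname"))
# 张三 (2023) - Title.pdf
```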
Examples
# Use first author's full name
chou -d /path/to/papers -f first_full --dry-run
# Use first 2 authors' surnames
chou -d /path/to/papers -f n_surnames -N 2 --dry-run
# Process and export results
chou -d /path/to/papers --execute -o results.csv
# Use specific OCR engine
chou -d /path/to/papers --ocr-engine rapidocr --dry-run
# Disable OCR
chou -d /path/to/papers --no-ocr --dry-run
OCR Support
For scanned PDFs without embedded text, the tool automatically uses OCR. Available backends (in priority order):
| Backend | Install Command | Notes |
|---|---|---|
| Surya | pip install surya-ocr | Best accuracy, transformer-based |
| PaddleOCR | pip install paddleocr paddlepaddle | Good for Chinese |
| RapidOCR | pip install rapidocr-onnxruntime | Lightweight, fast |
| EasyOCR | pip install easyocr | Easy to use |
| Tesseract | pip install pytesseract Pillow | Classic OCR |
The tool automatically selects the best available backend. To disable a specific backend:
# Disable Surya OCR (e.g., on low-memory systems)
export CHOU_DISABLE_SURYA=1
chou --dry-run
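Priority-ordered selection with an environment-variable opt-out might look roughly like the sketch below. Only `CHOU_DISABLE_SURYA` and the engine name `rapidocr` appear in this README; the other engine names, the `CHOU_DISABLE_<NAME>` pattern for the remaining backends, the check for the value `1`, and the `select_backend` function itself are assumptions made for illustration.

```python
import importlib.util
import os

# (engine name, importable module) in the priority order listed above.
# Engine names other than "rapidocr" are assumed.
BACKENDS = [
    ("surya", "surya"),
    ("paddleocr", "paddleocr"),
    ("rapidocr", "rapidocr_onnxruntime"),
    ("easyocr", "easyocr"),
    ("tesseract", "pytesseract"),
]


def is_installed(module):
    """Return True if the top-level module can be imported."""
    return importlib.util.find_spec(module) is not None


def select_backend(env=None, installed=is_installed):
    """Return the first backend that is installed and not disabled, else None."""
    env = os.environ if env is None else env
    for name, module in BACKENDS:
        # Assumed convention: CHOU_DISABLE_<NAME>=1 skips that backend.
        if env.get(f"CHOU_DISABLE_{name.upper()}") == "1":
            continue
        if installed(module):
            return name
    return None


print(select_backend())  # depends on which OCR packages are installed
```

Passing `env` and `installed` as parameters keeps the selection logic testable without installing any OCR package.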
Year Extraction Strategies
The tool uses 10 strategies to extract publication year, ranked by confidence:
- Conference + year (100): CVPR 2023, NeurIPS'22, AAAI-23
- Ordinal edition (90): Thirty-Seventh AAAI Conference
- Copyright notice (85): Copyright 2023, (c) 2023
- Publication date (80): Published: 2023, Accepted: Jan 2023
- Chinese year (78): 2023年, 二〇二三年
- arXiv ID (75): arXiv:2301.12345
- DOI with year (75): 10.1109/CVPR.2023.xxx
- Journal volume (70): Vol. 35, 2023
- Date pattern (60-65): March 2023, 2023/03
- Frequent year (20-50): Most common year in text
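A confidence-ranked cascade like this can be sketched with a few regular expressions. The code below covers only four of the ten strategies and is an illustration, not the tool's parser: the function names, the shortened conference list, the exact patterns, and the fixed score of 20 for the frequent-year fallback are all assumptions.

```python
import re
from collections import Counter

# Shortened list for illustration; the tool supports many more venues.
CONFERENCES = r"(?:AAAI|IJCAI|NeurIPS|ICML|ICLR|CVPR|ICCV|ECCV)"


def conference_year(text):
    """Match 'CVPR 2023', "NeurIPS'22", 'AAAI-23'."""
    m = re.search(rf"\b{CONFERENCES}\s*['’-]?\s*(\d{{4}}|\d{{2}})\b", text)
    if not m:
        return None
    digits = m.group(1)
    # Assumption: two-digit years belong to the 2000s.
    return int(digits) if len(digits) == 4 else 2000 + int(digits)


def copyright_year(text):
    """Match 'Copyright 2023', '(c) 2023', '© 2023'."""
    m = re.search(r"(?:Copyright|\(c\)|©)\s*((?:19|20)\d{2})\b", text, re.IGNORECASE)
    return int(m.group(1)) if m else None


def arxiv_year(text):
    """New-style arXiv IDs start with YYMM, e.g. arXiv:2301.12345."""
    m = re.search(r"arXiv:(\d{2})\d{2}\.\d{4,5}", text)
    return 2000 + int(m.group(1)) if m else None


def frequent_year(text):
    """Fall back to the most common four-digit year in the text."""
    years = re.findall(r"\b(?:19|20)\d{2}\b", text)
    return int(Counter(years).most_common(1)[0][0]) if years else None


# (confidence, strategy), highest confidence first.
STRATEGIES = [
    (100, conference_year),
    (85, copyright_year),
    (75, arxiv_year),
    (20, frequent_year),
]


def extract_year(text):
    """Return (year, confidence) from the first strategy that matches, else None."""
    for confidence, strategy in STRATEGIES:
        year = strategy(text)
        if year is not None:
            return year, confidence
    return None


print(extract_year("Accepted at NeurIPS'22"))       # (2022, 100)
print(extract_year("arXiv:2301.12345v2 [cs.CV]"))  # (2023, 75)
```

Stopping at the first match means a low-confidence signal, such as a cited year in the text, is used only when no stronger signal is present.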
Supported Conferences
AAAI, IJCAI, NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, ACL, EMNLP, NAACL, SIGIR, KDD, WWW, CHI, USENIX, and 50+ more.
Project Structure
Chou/
├── chou/ # Main package
│ ├── core/ # Core functionality
│ │ ├── processor.py # PDF processing
│ │ ├── ocr_extractor.py # OCR backends
│ │ ├── author_parser.py # Author name parsing
│ │ ├── year_parser.py # Year extraction
│ │ └── filename_gen.py # Filename generation
│ ├── cli/ # Command-line interface
│ └── gui/ # GUI (optional)
├── tests/ # pytest tests
├── requirements.txt # Dependencies
├── pyproject.toml # Package configuration
├── README.md # This file
└── README_CN.md # Chinese documentation
GUI (Optional)
A graphical user interface is available:
pip install "chou[gui]"
chou-gui
Development
# Install development dependencies
pip install -e ".[test]"
# Run tests
pytest
# Run with verbose output
pytest -v
License
MIT License
File details
Details for the file chou-0.1.0.tar.gz.
File metadata
- Download URL: chou-0.1.0.tar.gz
- Upload date:
- Size: 42.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b18f23350c15b9b8f21a8d4b87459a750bff3fe46fa372e48eb4ae67d3d503ab |
| MD5 | d40f0b59f69f047c95259e24e0dc6ca4 |
| BLAKE2b-256 | 0ddb47fb3136316dc13481636dcc12f9c0551c9a09654ea97b151cfa242a8721 |
File details
Details for the file chou-0.1.0-py3-none-any.whl.
File metadata
- Download URL: chou-0.1.0-py3-none-any.whl
- Upload date:
- Size: 39.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d97c675de7e114a1dde67f4aa5d89dbaddab1c9f0c668fd938ab4902a799a0b5 |
| MD5 | 139fd220e9eef1b6c7ae196dd2f8c6c6 |
| BLAKE2b-256 | ee2b58f5b7437f16fb7df0933e6cd6085c175aa5ed5350f544448f0e3364bc77 |