
Academic paper PDF renaming tool


Chou (瞅) - Academic Paper PDF Renamer

A Python tool to automatically rename academic PDF papers to citation-style filenames by extracting title, author, and year information from the PDF content.

Features

  • Extracts title and authors from PDF first page using font size analysis
  • OCR support for scanned PDFs (5 OCR backends available)
  • Extracts publication year using 10 different strategies (supports English and Chinese)
  • Chinese name handling - automatically uses full names for Chinese authors
  • Chinese thesis/dissertation support - detects labeled fields like "论文题目", "作者姓名"
  • Multiple author format options
  • Dry-run mode for safe preview
  • Handles special characters and Unicode in author names
  • Logs all operations and exports results to CSV
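
As a rough illustration of the font size analysis mentioned above, the following minimal sketch picks the text set in the largest font as the title. It is not the project's actual implementation. The `guess_title` function and the `(text, font_size)` input shape are assumptions made for this example; in practice such spans could be collected from PyMuPDF's `page.get_text("dict")`, whose spans carry `"text"` and `"size"` keys.

```python
from collections import defaultdict


def guess_title(spans, min_chars=5):
    """Return the text set in the largest font, or None if nothing qualifies.

    `spans` is a list of (text, font_size) tuples for the first page.
    This input shape is hypothetical and chosen to keep the sketch
    dependency-free.
    """
    by_size = defaultdict(list)
    for text, size in spans:
        text = text.strip()
        if text:
            # Round so that 17.98 and 18.02 are treated as the same size.
            by_size[round(size, 1)].append(text)

    # Walk sizes from largest to smallest, skipping very short runs
    # such as a decorative drop cap.
    for size in sorted(by_size, reverse=True):
        candidate = " ".join(by_size[size])
        if len(candidate) >= min_chars:
            return candidate
    return None


first_page = [
    ("A", 30.0),                                  # drop cap, too short
    ("A Sample Paper Title", 17.9),
    ("Spanning Two Lines", 17.9),
    ("Weihao Wang, Rufeng Zhang, Mingyu You", 12.0),
    ("Abstract", 12.0),
]
print(guess_title(first_page))
```

A real extractor would also need to handle cases where the largest text is a journal banner or watermark rather than the title.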

Requirements

  • Python >= 3.10
  • PyMuPDF (required)
  • OCR backend (optional, for scanned PDFs)

Installation

From PyPI

pip install chou

From Source

git clone https://github.com/cycleuser/Chou.git
cd Chou
pip install -e .

With OCR Support

Choose one or more OCR backends based on your needs:

# Install with all OCR backends
pip install -e ".[ocr-surya,ocr-paddle,ocr-rapid,ocr-easy,ocr-tesseract]"

# Or install specific backends:
pip install surya-ocr          # Surya - Best accuracy, transformer-based (recommended)
pip install paddleocr paddlepaddle  # PaddleOCR - Good for Chinese
pip install rapidocr-onnxruntime    # RapidOCR - Lightweight, fast
pip install easyocr                 # EasyOCR - Easy to use
pip install pytesseract Pillow      # Tesseract - Classic OCR

Quick Start

After installation, the chou command is available:

# Preview changes (dry-run mode, default)
chou --dir /path/to/papers --dry-run

# Actually rename files
chou --dir /path/to/papers --execute

# Show version
chou --version

Usage

chou [options]

Options

Option             Short   Description
--dir DIR          -d      Directory containing PDF files (default: current)
--dry-run          -n      Preview without renaming (default: True)
--execute          -x      Actually rename files
--format FMT       -f      Author name format (see below)
--num-authors N    -N      Number of authors for n_* formats (default: 3)
--recursive        -r      Process subdirectories recursively (default: True)
--no-recursive             Only process the specified directory
--ocr-engine               Specify OCR engine (default: auto-detect)
--no-ocr                   Disable OCR fallback
--output FILE      -o      Export results to CSV file
--log-file FILE    -l      Log file path
--verbose          -v      Verbose output

Author Format Options (-f)

Format          Example Output
first_surname   Wang et al. (2023) - Title.pdf
first_full      Weihao Wang et al. (2023) - Title.pdf
all_surnames    Wang, Zhang, You (2023) - Title.pdf
all_full        Weihao Wang, Rufeng Zhang, Mingyu You (2023) - Title.pdf
n_surnames      Wang, Zhang et al. (2023) - Title.pdf
n_full          Weihao Wang, Rufeng Zhang et al. (2023) - Title.pdf

Note: For Chinese authors, full names are always used (e.g., 张三 instead of just 张), since a single-character surname is not meaningful on its own.
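
The format names can be read as a scope (`first`, `n`, `all`) plus a name style (`surname(s)` or `full`). The sketch below reproduces the example outputs under that reading. It is an illustration only, not the code in `filename_gen.py`. The function names, the "last word is the surname" rule for non-Chinese names, the `Unknown` fallback, and the set of characters replaced in filenames are all assumptions.

```python
import re


def is_cjk(name):
    """True if the name contains any CJK unified ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in name)


def display_name(full_name, use_full):
    # Chinese names are always kept whole. For other names this sketch
    # assumes "Given Family" order and takes the last word as the surname.
    if use_full or is_cjk(full_name):
        return full_name
    return full_name.split()[-1]


def format_authors(authors, fmt="first_surname", n=3):
    if not authors:
        return "Unknown"  # hypothetical fallback, not documented behavior
    scope, _, style = fmt.partition("_")  # "n_full" -> ("n", "_", "full")
    if scope == "first":
        keep = 1
    elif scope == "n":
        keep = n
    elif scope == "all":
        keep = len(authors)
    else:
        raise ValueError(f"unknown format: {fmt}")
    names = [display_name(a, style == "full") for a in authors[:keep]]
    joined = ", ".join(names)
    if len(authors) > keep:
        joined += " et al."
    return joined


def make_filename(authors, year, title, fmt="first_surname", n=3):
    stem = f"{format_authors(authors, fmt, n)} ({year}) - {title}"
    # Replace characters that are not allowed in filenames on common systems.
    stem = re.sub(r'[\\/:*?"<>|]', "_", stem)
    return stem + ".pdf"


authors = ["Weihao Wang", "Rufeng Zhang", "Mingyu You"]
print(make_filename(authors, 2023, "Title", "n_surnames", n=2))
print(make_filename(["张三", "李四"], 2023, "Title", "first_surname"))
```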

Examples

# Use first author's full name
chou -d /path/to/papers -f first_full --dry-run

# Use first 2 authors' surnames
chou -d /path/to/papers -f n_surnames -N 2 --dry-run

# Process and export results
chou -d /path/to/papers --execute -o results.csv

# Use specific OCR engine
chou -d /path/to/papers --ocr-engine rapidocr --dry-run

# Disable OCR
chou -d /path/to/papers --no-ocr --dry-run

OCR Support

For scanned PDFs without embedded text, the tool automatically uses OCR. Available backends (in priority order):

Backend     Install Command                       Notes
Surya       pip install surya-ocr                 Best accuracy, transformer-based
PaddleOCR   pip install paddleocr paddlepaddle    Good for Chinese
RapidOCR    pip install rapidocr-onnxruntime      Lightweight, fast
EasyOCR     pip install easyocr                   Easy to use
Tesseract   pip install pytesseract Pillow        Classic OCR

The tool automatically selects the best available backend. To disable a specific backend:

# Disable Surya OCR (e.g., on low-memory systems)
export CHOU_DISABLE_SURYA=1
chou --dry-run
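
Priority-based selection of this kind might look like the sketch below, which walks the backends in the order listed above and returns the first one that is installed and not disabled. This is an assumption-laden illustration rather than the contents of `ocr_extractor.py`. Only `CHOU_DISABLE_SURYA` is documented; applying the same `CHOU_DISABLE_<NAME>` pattern to other backends, the backend names other than `rapidocr`, and treating any non-empty value as "disabled" are assumptions.

```python
import importlib.util
import os

# (backend name, Python import name), in the documented priority order.
BACKENDS = [
    ("surya", "surya"),
    ("paddleocr", "paddleocr"),
    ("rapidocr", "rapidocr_onnxruntime"),
    ("easyocr", "easyocr"),
    ("tesseract", "pytesseract"),
]


def is_installed(module_name):
    """Check for a top-level module without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_backend(env=None, installed=is_installed):
    """Return the first usable backend name, or None if there is none.

    `env` and `installed` are injectable so the logic can be tested
    without touching the real environment.
    """
    env = os.environ if env is None else env
    for name, module in BACKENDS:
        # Assumed convention: any non-empty value disables the backend.
        if env.get(f"CHOU_DISABLE_{name.upper()}"):
            continue
        if installed(module):
            return name
    return None


print(pick_backend())  # depends on which OCR packages are installed locally
```

Checking with `find_spec` avoids importing heavy OCR packages just to see whether they are present.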

Year Extraction Strategies

The tool uses 10 strategies to extract publication year, ranked by confidence:

  1. Conference + year (100): CVPR 2023, NeurIPS'22, AAAI-23
  2. Ordinal edition (90): Thirty-Seventh AAAI Conference
  3. Copyright notice (85): Copyright 2023, (c) 2023
  4. Publication date (80): Published: 2023, Accepted: Jan 2023
  5. Chinese year (78): 2023年, 二〇二三年
  6. arXiv ID (75): arXiv:2301.12345
  7. DOI with year (75): 10.1109/CVPR.2023.xxx
  8. Journal volume (70): Vol. 35, 2023
  9. Date pattern (60-65): March 2023, 2023/03
  10. Frequent year (20-50): Most common year in text
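
One way to picture the ranking is as an ordered list of strategies, tried from highest to lowest confidence, where the first hit wins. The sketch below implements five of the ten strategies with simplified regular expressions. The patterns, the short conference list, the mapping of two-digit years to 20xx, and the flat confidence of 20 for the fallback are assumptions for this example, not the rules in `year_parser.py`.

```python
import re
from collections import Counter

CN_DIGITS = dict(zip("〇零一二三四五六七八九", "00123456789"))


def _conference(text):
    m = re.search(r"\b(?:CVPR|NeurIPS|AAAI|ICML|ICLR)[\s'’-]*(\d{4}|\d{2})\b", text)
    if not m:
        return None
    y = m.group(1)
    # Assumption: two-digit years belong to the 2000s.
    return int(y) if len(y) == 4 else 2000 + int(y)


def _copyright(text):
    m = re.search(r"(?:Copyright|\(c\)|©)\s*((?:19|20)\d{2})\b", text, re.IGNORECASE)
    return int(m.group(1)) if m else None


def _chinese_year(text):
    m = re.search(r"([0-9〇零一二三四五六七八九]{4})年", text)
    if not m:
        return None
    year = int("".join(CN_DIGITS.get(ch, ch) for ch in m.group(1)))
    return year if 1900 <= year <= 2100 else None


def _arxiv(text):
    # New-style arXiv identifiers start with YYMM.
    m = re.search(r"arXiv:(\d{2})(\d{2})\.\d{4,5}", text)
    if m and 1 <= int(m.group(2)) <= 12:
        return 2000 + int(m.group(1))
    return None


def _frequent(text):
    years = re.findall(r"\b(?:19|20)\d{2}\b", text)
    return int(Counter(years).most_common(1)[0][0]) if years else None


# (confidence, strategy), highest confidence first.
STRATEGIES = [
    (100, _conference),
    (85, _copyright),
    (78, _chinese_year),
    (75, _arxiv),
    (20, _frequent),
]


def extract_year(text):
    """Return (year, confidence) from the first strategy that matches."""
    for confidence, strategy in STRATEGIES:
        year = strategy(text)
        if year is not None:
            return year, confidence
    return None, 0


print(extract_year("Proceedings of CVPR 2023"))
print(extract_year("二〇二三年六月"))
```

Returning the confidence alongside the year lets a caller flag low-confidence results for manual review.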

Supported Conferences

AAAI, IJCAI, NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, ACL, EMNLP, NAACL, SIGIR, KDD, WWW, CHI, USENIX, and 50+ more.

Project Structure

Chou/
├── chou/                  # Main package
│   ├── core/             # Core functionality
│   │   ├── processor.py       # PDF processing
│   │   ├── ocr_extractor.py   # OCR backends
│   │   ├── author_parser.py   # Author name parsing
│   │   ├── year_parser.py     # Year extraction
│   │   └── filename_gen.py    # Filename generation
│   ├── cli/              # Command-line interface
│   └── gui/              # GUI (optional)
├── tests/                # pytest tests
├── requirements.txt      # Dependencies
├── pyproject.toml        # Package configuration
├── README.md             # This file
└── README_CN.md          # Chinese documentation

GUI (Optional)

A graphical user interface is available:

pip install "chou[gui]"
chou-gui

Development

# Install development dependencies
pip install -e ".[test]"

# Run tests
pytest

# Run with verbose output
pytest -v

License

MIT License
