Skip to main content

Academic paper PDF renaming tool - 学术论文PDF重命名工具

Project description

Chou (瞅) - Academic Paper PDF Renamer

A Python tool to automatically rename academic PDF papers to citation-style filenames by extracting title, author, and year information from the PDF content.

Features

  • Extracts title and authors from PDF first page using font size analysis
  • OCR support for scanned PDFs (5 OCR backends available)
  • Extracts publication year using 10 different strategies (supports English and Chinese)
  • Chinese name handling - automatically uses full names for Chinese authors
  • Chinese thesis/dissertation support - detects labeled fields like "论文题目", "作者姓名"
  • Multiple author format options
  • Dry-run mode for safe preview
  • Handles special characters and Unicode in author names
  • Logs all operations and exports results to CSV

Requirements

  • Python >= 3.10
  • PyMuPDF (required)
  • OCR backend (optional, for scanned PDFs)

Installation

From PyPI

pip install chou

From Source

git clone https://github.com/cycleuser/Chou.git
cd Chou
pip install -e .

With OCR Support

Choose one or more OCR backends based on your needs:

# Install with all OCR backends
pip install -e ".[ocr-surya,ocr-paddle,ocr-rapid,ocr-easy,ocr-tesseract]"

# Or install specific backends:
pip install surya-ocr          # Surya - Best accuracy, transformer-based (recommended)
pip install paddleocr paddlepaddle  # PaddleOCR - Good for Chinese
pip install rapidocr-onnxruntime    # RapidOCR - Lightweight, fast
pip install easyocr                 # EasyOCR - Easy to use
pip install pytesseract Pillow      # Tesseract - Classic OCR

Quick Start

After installation, the chou command is available:

# Preview changes (dry-run mode, default)
chou --dir /path/to/papers --dry-run

# Actually rename files
chou --dir /path/to/papers --execute

# Show version
chou --version

Usage

chou [options]

Options

Option Short Description
--dir DIR -d Directory containing PDF files (default: current)
--dry-run -n Preview without renaming (default: True)
--execute -x Actually rename files
--format FMT -f Author name format (see below)
--num-authors N -N Number of authors for n_* formats (default: 3)
--recursive -r Process subdirectories recursively (default: True)
--no-recursive Only process the specified directory
--ocr-engine Specify OCR engine (default: auto-detect)
--no-ocr Disable OCR fallback
--output FILE -o Export results to CSV file
--log-file FILE -l Log file path
--verbose -v Verbose output

Author Format Options (-f)

Format Example Output
first_surname Wang et al. (2023) - Title.pdf
first_full Weihao Wang et al. (2023) - Title.pdf
all_surnames Wang, Zhang, You (2023) - Title.pdf
all_full Weihao Wang, Rufeng Zhang, Mingyu You (2023) - Title.pdf
n_surnames Wang, Zhang et al. (2023) - Title.pdf
n_full Weihao Wang, Rufeng Zhang et al. (2023) - Title.pdf

Note: For Chinese authors, full names are always used (e.g., 张三 instead of just ) since single-character surnames are not meaningful.

Examples

# Use first author's full name
chou -d /path/to/papers -f first_full --dry-run

# Use first 2 authors' surnames
chou -d /path/to/papers -f n_surnames -N 2 --dry-run

# Process and export results
chou -d /path/to/papers --execute -o results.csv

# Use specific OCR engine
chou -d /path/to/papers --ocr-engine rapidocr --dry-run

# Disable OCR
chou -d /path/to/papers --no-ocr --dry-run

OCR Support

For scanned PDFs without embedded text, the tool automatically uses OCR. Available backends (in priority order):

Backend Install Command Notes
Surya pip install surya-ocr Best accuracy, transformer-based
PaddleOCR pip install paddleocr paddlepaddle Good for Chinese
RapidOCR pip install rapidocr-onnxruntime Lightweight, fast
EasyOCR pip install easyocr Easy to use
Tesseract pip install pytesseract Pillow Classic OCR

The tool automatically selects the best available backend. To disable a specific backend:

# Disable Surya OCR (e.g., on low-memory systems)
export CHOU_DISABLE_SURYA=1
chou --dry-run

Year Extraction Strategies

The tool uses 10 strategies to extract publication year, ranked by confidence:

  1. Conference + year (100): CVPR 2023, NeurIPS'22, AAAI-23
  2. Ordinal edition (90): Thirty-Seventh AAAI Conference
  3. Copyright notice (85): Copyright 2023, (c) 2023
  4. Publication date (80): Published: 2023, Accepted: Jan 2023
  5. Chinese year (78): 2023年, 二〇二三年
  6. arXiv ID (75): arXiv:2301.12345
  7. DOI with year (75): 10.1109/CVPR.2023.xxx
  8. Journal volume (70): Vol. 35, 2023
  9. Date pattern (60-65): March 2023, 2023/03
  10. Frequent year (20-50): Most common year in text

Supported Conferences

AAAI, IJCAI, NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, ACL, EMNLP, NAACL, SIGIR, KDD, WWW, CHI, USENIX, and 50+ more.

Project Structure

Chou/
├── chou/                  # Main package
│   ├── core/             # Core functionality
│   │   ├── processor.py       # PDF processing
│   │   ├── ocr_extractor.py   # OCR backends
│   │   ├── author_parser.py   # Author name parsing
│   │   ├── year_parser.py     # Year extraction
│   │   └── filename_gen.py    # Filename generation
│   ├── cli/              # Command-line interface
│   └── gui/              # GUI (optional)
├── tests/                # pytest tests
├── requirements.txt      # Dependencies
├── pyproject.toml        # Package configuration
├── README.md             # This file
└── README_CN.md          # Chinese documentation

GUI (Optional)

A graphical user interface is available:

pip install chou[gui]
chou-gui

Screenshots

1. Initial Window - Drag & drop PDFs or use toolbar to add files:

Initial Window

2. After Processing - Extracted title, authors, year with preview of new filenames:

Processed Results

3. Renamed Files - Files renamed to citation-style format in file manager:

Renamed Files

Development

# Install development dependencies
pip install -e ".[test]"

# Run tests
pytest

# Run with verbose output
pytest -v

Python API

from chou import rename_papers

result = rename_papers(
    "./papers",
    author_format="first_surname",
    dry_run=True,
)
print(result.success)    # True / False
print(result.data)       # list of paper dicts
print(result.metadata)   # summary stats

Agent Integration (OpenAI Function Calling)

Chou exposes an OpenAI-compatible tool for LLM agents:

from chou.tools import TOOLS, dispatch

# Pass TOOLS to the OpenAI chat completion API
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=TOOLS,
)

# Dispatch the tool call
result = dispatch(
    tool_call.function.name,
    tool_call.function.arguments,
)

CLI Help

CLI Help

License

MIT License

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chou-0.1.6.tar.gz (58.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

chou-0.1.6-py3-none-any.whl (51.0 kB view details)

Uploaded Python 3

File details

Details for the file chou-0.1.6.tar.gz.

File metadata

  • Download URL: chou-0.1.6.tar.gz
  • Upload date:
  • Size: 58.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for chou-0.1.6.tar.gz
Algorithm Hash digest
SHA256 639a25e3890f5d1916ed27fa95d6de5a4f82899de18223d0086121aad1dc5ee9
MD5 c3a2e6c19fcd94b8d48592867bbaa190
BLAKE2b-256 0d679678ecf98b996eb9baa7be1dbc55455f4994400769d40a79533e24b0bf52

See more details on using hashes here.

File details

Details for the file chou-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: chou-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 51.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for chou-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 70fbc934d5b0eceba19cb385267a779f9b56fbbd111634e045045c5dae550965
MD5 ec784e70c61671a77afe0d0c677d9ca3
BLAKE2b-256 68a57263f17f2c643dc3f858d968928d9a6cc15e9392c2dcdff21cee605505f6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page