Intelligent academic paper and media renaming tool with multi-source metadata extraction
CiteWright
Anybody else have a huge folder full of files with names like 235680_download.PDF and smith_et_al_2008_full.pdf(2)?
... yeah.
I wrote this because I kept mass-downloading papers from Sci-Hub and then staring at a folder of cryptic filenames, wondering which one was the paper about transformer attention mechanisms and which one was about soil bacteria. Life's too short.
What It Does
- Extracts text from documents and uses arXiv, Semantic Scholar, Crossref, PubMed, OpenLibrary, and Unpaywall to find the actual source
- Renames files to Author_Year_Title.ext like a civilized person
- Handles PDF, TXT, Markdown, DOC/DOCX, and Python files - throw it at it, let's find out
- Maintains a BibTeX database so you don't have to
- Logs everything, doesn't break anything, asks before doing anything destructive
- Optionally uses a local LLM (Ollama) or cloud providers (OpenAI, Anthropic, Gemini) if the free APIs come up empty
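To make the Author_Year_Title.ext pattern concrete, here is a minimal sketch of how such a name could be assembled from metadata. The `build_filename` helper and its `max_title_words` parameter are made up for illustration; this is not CiteWright's actual code, and the real tool may truncate or sanitize differently.

```python
import re
import unicodedata


def build_filename(surname: str, year: int, title: str, ext: str,
                   max_title_words: int = 8) -> str:
    """Return an Author_Year_Title.ext style name (hypothetical helper)."""

    def clean(text: str) -> str:
        # Fold accented characters to ASCII, then turn anything that is
        # not a letter, digit, or space into a space.
        folded = unicodedata.normalize("NFKD", text)
        ascii_text = folded.encode("ascii", "ignore").decode("ascii")
        return re.sub(r"[^A-Za-z0-9 ]+", " ", ascii_text)

    author_part = "".join(clean(surname).split()) or "Unknown"
    title_words = clean(title).split()[:max_title_words]
    title_part = "_".join(title_words) or "Untitled"
    # Normalize the extension: no leading dot, lowercase.
    return f"{author_part}_{year}_{title_part}.{ext.lstrip('.').lower()}"


print(build_filename("Vaswani", 2017, "Attention Is All You Need", ".PDF"))
```

Capping the title length keeps names under filesystem limits, and folding to ASCII avoids surprises when files move between operating systems.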
The Philosophy
I built this with a "try the free stuff first" approach. Why pay for API calls when Crossref is right there?
| Tier | What Happens |
|---|---|
| 1 | Check if the PDF already has metadata embedded. Usually garbage, but sometimes you get lucky. |
| 2 | Extract DOIs, arXiv IDs, ISBNs from the text and look them up. This is where the magic happens. |
| 3 | Search academic APIs using whatever title/author text it can scrape. Works more often than you'd think. |
| 4 | (Optional) Throw the text at an LLM and ask nicely. Costs money unless you're running Ollama locally. |
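The tiers above boil down to "run the cheap thing, fall through if it finds nothing". Here is a rough sketch of the Tier 2 identifier scan plus that fall-through loop. The regexes are approximations, ISBNs are left out for brevity, and `find_identifiers` and `first_hit` are hypothetical names, not functions from CiteWright itself.

```python
import re
from typing import Callable, List, Optional

# Approximate patterns. A DOI starts with "10.", a 4-9 digit registrant
# code, a slash, then a suffix with no whitespace.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
# New-style arXiv IDs: YYMM.NNNN or YYMM.NNNNN, with an optional version.
ARXIV_RE = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)


def find_identifiers(text: str) -> dict:
    """Pull the first DOI and arXiv ID out of raw document text."""
    found = {}
    doi = DOI_RE.search(text)
    if doi:
        # Trailing punctuation usually belongs to the sentence, not the DOI.
        found["doi"] = doi.group(0).rstrip(".,;)")
    arxiv = ARXIV_RE.search(text)
    if arxiv:
        found["arxiv"] = arxiv.group(1)
    return found


def first_hit(text: str, tiers: List[Callable[[str], dict]]) -> Optional[dict]:
    """Run each tier in order and return the first non-empty result."""
    for tier in tiers:
        result = tier(text)
        if result:
            return result
    return None


sample = "See https://doi.org/10.1000/xyz123. Also arXiv:1706.03762v5 [cs.CL]"
print(find_identifiers(sample))
```

In a real pipeline each tier would be a function that calls out to an API; ordering the list from cheapest to most expensive is what keeps the LLM as a last resort.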
Installation
git clone https://github.com/lukeslp/citewright.git
cd citewright
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install .
Want the LLM-powered features and media processing?
pip install ".[all]"
Usage
Preview what would happen (dry run, safe):
citewright pdf ~/papers
Actually rename things:
citewright pdf ~/papers --execute
Go recursive and spit out a BibTeX file:
citewright pdf ~/papers -r --execute --bibtex library.bib
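For a sense of what ends up in a file like library.bib, here is a small sketch that renders one BibTeX entry from a metadata dict. The `to_bibtex` function, the citation key, and the field names are assumptions for illustration; the entries CiteWright writes may use different keys, entry types, or fields.

```python
def to_bibtex(key: str, fields: dict) -> str:
    """Render one @article entry, skipping empty fields (hypothetical helper)."""
    lines = [f"@article{{{key},"]
    for name, value in fields.items():
        if value:
            lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


entry = to_bibtex(
    "vaswani2017attention",
    {
        "author": "Vaswani, Ashish",
        "title": "Attention Is All You Need",
        "year": 2017,
        "doi": "",  # unknown, so it is left out of the entry
    },
)
print(entry)
```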
Let the LLM analyze the stubborn ones:
citewright pdf ~/papers --ai --execute
Rename photos and videos too (uses EXIF data):
citewright media ~/photos --execute
Use vision models to describe images:
citewright media ~/photos --ai --execute
Oh no go back:
citewright undo
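If you're curious how a dry run and an undo can fit together, here is one way to do it: only touch the disk when asked, and write down every rename so it can be reversed. This is a sketch of the general pattern, not CiteWright's internals; `apply_renames`, `undo`, and the JSON journal format are all invented for the example.

```python
import json
from pathlib import Path


def apply_renames(plan: list, journal: Path, execute: bool = False) -> list:
    """Process (old, new) pairs; with execute=False nothing on disk changes."""
    done = []
    for old, new in plan:
        old, new = Path(old), Path(new)
        if new.exists():
            # Never overwrite an existing file; skip this pair.
            continue
        if execute:
            old.rename(new)
        done.append([str(old), str(new)])
    if execute and done:
        # Record what was renamed so it can be reversed later.
        journal.write_text(json.dumps(done))
    return done


def undo(journal: Path) -> int:
    """Reverse the renames recorded in the journal, newest first."""
    entries = json.loads(journal.read_text())
    for old, new in reversed(entries):
        Path(new).rename(old)
    journal.unlink()
    return len(entries)
```

Defaulting `execute` to `False` mirrors the CLI above: the safe behavior is what you get when you forget a flag.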
Configuration
Config lives at ~/.config/citewright/config.json, or use the CLI:
citewright config --show
citewright config --ai-provider openai # Select LLM provider
citewright config --ai-enabled
citewright config --unpaywall-email "you@example.com"
The Unpaywall email is optional but they appreciate it. Be cool.
License
MIT. Do whatever.
Author
Luke Steuber
https://github.com/lukeslp
luke@actuallyuseful.ai