Intelligent academic paper and media renaming tool with multi-source metadata extraction

CiteWright

Anybody else have a huge folder full of files with names like 235680_download.PDF and smith_et_al_2008_full.pdf(2)?

... yeah.

I wrote this because I got tired of mass-downloading papers from Sci-Hub and then staring at a folder of cryptic filenames, wondering which one was the paper about transformer attention mechanisms and which one was about soil bacteria. Life's too short.

What It Does

  • Pulls text out of documents and queries arXiv, Semantic Scholar, Crossref, PubMed, OpenLibrary, and Unpaywall to find the actual source
  • Renames files to Author_Year_Title.ext like a civilized person
  • Handles PDF, TXT, Markdown, DOC/DOCX, and Python files - throw it at it, let's find out
  • Maintains a BibTeX database so you don't have to
  • Logs everything, doesn't break anything, asks before doing anything destructive
  • Optionally uses a local LLM (Ollama) or cloud providers (OpenAI, Anthropic, Gemini) if the free APIs come up empty
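To make the Author_Year_Title.ext pattern concrete, here is a minimal sketch of how such a name could be assembled once the metadata is known. The function name `build_filename`, the `max_title_words` cutoff, and the ASCII-folding rule are all assumptions for illustration, not CiteWright's actual code.

```python
import re
import unicodedata


def build_filename(author: str, year: int, title: str, ext: str,
                   max_title_words: int = 8) -> str:
    """Return an Author_Year_Title.ext style name (hypothetical helper)."""

    def slug(text: str) -> str:
        # Fold accents to plain ASCII, then collapse every run of
        # non-alphanumeric characters into a single underscore.
        folded = unicodedata.normalize("NFKD", text)
        ascii_text = folded.encode("ascii", "ignore").decode("ascii")
        return re.sub(r"[^A-Za-z0-9]+", "_", ascii_text).strip("_")

    # Keep only the first few title words so the name stays manageable.
    short_title = " ".join(title.split()[:max_title_words])
    return f"{slug(author)}_{year}_{slug(short_title)}.{ext.lstrip('.').lower()}"


print(build_filename("Vaswani", 2017, "Attention Is All You Need", ".PDF"))
# -> Vaswani_2017_Attention_Is_All_You_Need.pdf
```

Folding to ASCII and lowercasing the extension is one conservative choice for cross-platform filenames; the real tool may well keep Unicode or format multiple authors differently.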

The Philosophy

I built this with a "try the free stuff first" approach. Why pay for API calls when Crossref is right there?

Tier  What Happens
1     Check if the PDF already has metadata embedded. Usually garbage, but sometimes you get lucky.
2     Extract DOIs, arXiv IDs, ISBNs from the text and look them up. This is where the magic happens.
3     Search academic APIs using whatever title/author text it can scrape. Works more often than you'd think.
4     (Optional) Throw the text at an LLM and ask nicely. Costs money unless you're running Ollama locally.
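The tiers above boil down to "try each strategy in order, stop at the first hit." Here is a rough sketch of that shape, plus the identifier scraping from tier 2. The regexes are deliberately loose, the function names (`extract_identifiers`, `resolve`, and the two stand-in tiers) are hypothetical, and nothing here makes a network call, so treat it as an illustration of the idea rather than how CiteWright does it. The DOI in the sample text is a placeholder.

```python
import re
from typing import Callable, Optional

# Loose patterns; real DOIs and arXiv IDs have more edge cases than these cover.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)
ARXIV_RE = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)


def extract_identifiers(text: str) -> dict:
    """Tier 2 helper: pull the first DOI and arXiv ID out of raw text."""
    doi = DOI_RE.search(text)
    arxiv = ARXIV_RE.search(text)
    return {
        # Sentence punctuation often ends up glued to the end of a DOI.
        "doi": doi.group(0).rstrip(".,;)") if doi else None,
        "arxiv": arxiv.group(1) if arxiv else None,
    }


def resolve(text: str, tiers: list[Callable[[str], Optional[dict]]]) -> Optional[dict]:
    """Run each tier in order and return the first non-empty result."""
    for tier in tiers:
        record = tier(text)
        if record:
            return record
    return None


def embedded_metadata(text: str) -> Optional[dict]:
    return None  # stand-in for tier 1 coming up empty


def identifier_lookup(text: str) -> Optional[dict]:
    ids = extract_identifiers(text)
    return ids if ids["doi"] or ids["arxiv"] else None


sample = "Preprint arXiv:1706.03762v5. Published version: doi:10.1000/xyz123."
print(resolve(sample, [embedded_metadata, identifier_lookup]))
# -> {'doi': '10.1000/xyz123', 'arxiv': '1706.03762'}
```

Ordering the list cheapest-first is what keeps the paid LLM tier as a last resort: it only runs if every earlier callable returned nothing.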

Installation

git clone https://github.com/lukeslp/citewright.git
cd citewright
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install .

Want the LLM-powered features and media processing?

pip install ".[all]"

Usage

Preview what would happen (dry run, safe):

citewright pdf ~/papers

Actually rename things:

citewright pdf ~/papers --execute

Go recursive and spit out a BibTeX file:

citewright pdf ~/papers -r --execute --bibtex library.bib

Let the LLM analyze the stubborn ones:

citewright pdf ~/papers --ai --execute

Rename photos and videos too (uses EXIF data):

citewright media ~/photos --execute

Use vision models to describe images:

citewright media ~/photos --ai --execute

Oh no go back:

citewright undo
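If you're curious how a dry-run default and an undo command can fit together, here is one possible sketch: renames are only applied when an `execute` flag is set, and every applied rename is written to a small journal that can be replayed in reverse. The journal format and the names `apply_renames` and `undo` are assumptions made up for this example; CiteWright's real log may look nothing like this.

```python
import json
import tempfile
from pathlib import Path


def apply_renames(plan: list[tuple[Path, Path]], journal: Path,
                  execute: bool = False) -> None:
    """Print the plan; only touch the disk when execute is True."""
    done = []
    for src, dst in plan:
        print(f"{src.name} -> {dst.name}")
        if execute and not dst.exists():  # never overwrite an existing file
            src.rename(dst)
            done.append([str(src), str(dst)])
    if execute:
        journal.write_text(json.dumps(done))


def undo(journal: Path) -> None:
    """Reverse the journaled renames, newest first, then drop the journal."""
    for src, dst in reversed(json.loads(journal.read_text())):
        Path(dst).rename(src)
    journal.unlink()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old, new = root / "235680_download.PDF", root / "Smith_2008_Example.pdf"
    old.write_text("stub")
    journal = root / "journal.json"

    apply_renames([(old, new)], journal)                # dry run: nothing moves
    assert old.exists() and not new.exists()
    apply_renames([(old, new)], journal, execute=True)  # real rename
    assert new.exists() and not old.exists()
    undo(journal)                                       # back to the old name
    assert old.exists() and not new.exists()
```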

Configuration

Config lives at ~/.config/citewright/config.json, or use the CLI:

citewright config --show
citewright config --ai-provider openai  # Select LLM provider
citewright config --ai-enabled
citewright config --unpaywall-email "you@example.com"

The Unpaywall email is optional but they appreciate it. Be cool.

License

MIT. Do whatever.

Author

Luke Steuber
https://github.com/lukeslp
luke@actuallyuseful.ai

Download files

Source Distribution

citewright-0.1.1.tar.gz (26.8 kB)

Uploaded Source

Built Distribution

citewright-0.1.1-py3-none-any.whl (33.1 kB)

Uploaded Python 3

File details

Details for the file citewright-0.1.1.tar.gz.

File metadata

  • Download URL: citewright-0.1.1.tar.gz
  • Upload date:
  • Size: 26.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for citewright-0.1.1.tar.gz
Algorithm    Hash digest
SHA256       fb27c3e77573d3736e4185e0c1be9969d8aacaea5248698105c3e36277022ef7
MD5          0f93c6f307402fa83c470fb893769b4b
BLAKE2b-256  d6c3efaee32f3872ed22ce158746a5a9ba7b29f0b3901e13560971a9b9947d8e
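If you download the archive by hand, you can check it against the SHA256 digest listed above with nothing but the standard library. This is a minimal sketch: the helper name `sha256_of` is made up here, and it assumes the archive sits in the current directory under its published name.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the hash listing for citewright-0.1.1.tar.gz.
EXPECTED = "fb27c3e77573d3736e4185e0c1be9969d8aacaea5248698105c3e36277022ef7"

archive = Path("citewright-0.1.1.tar.gz")  # assumed download location
if archive.exists():
    print("OK" if sha256_of(archive) == EXPECTED else "MISMATCH")
else:
    print(f"{archive} not found, nothing to verify")
```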

File details

Details for the file citewright-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: citewright-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 33.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for citewright-0.1.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       0f10220ed226cb550bb3d245cfb576af1e4cd60e0c3f4659d4bbf63068ae8462
MD5          24a3397b432c3936757bf51a0399e7cf
BLAKE2b-256  1038c738280fe14ff839760cf8e8e2cc74ff1ba14d1100261ef9330917622f47
