Intelligent academic paper and media renaming tool with multi-source metadata extraction

CiteWright

Anybody else have a huge folder full of files with names like 235680_download.PDF and smith_et_al_2008_full.pdf(2)?

... yeah.

I wrote this because I got tired of mass-downloading papers from Sci-Hub and then staring at a folder of cryptic filenames, wondering which one was the paper about transformer attention mechanisms and which one was about soil bacteria. Life's too short.

What It Does

  • Extracts text from PDFs and queries arXiv, Semantic Scholar, Crossref, PubMed, OpenLibrary, and Unpaywall to find the actual source
  • Renames files to Author_Year_Title.pdf like a civilized person
  • Handles PDFs, ebooks, and most document formats. Throw something at it and find out
  • Maintains a BibTeX database so you don't have to
  • Logs everything, doesn't break anything, asks before doing anything destructive
  • Optionally uses a local LLM (Ollama) or cloud providers (OpenAI, Anthropic, Gemini) if the free APIs come up empty
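To make the renaming and BibTeX bullets concrete, here is a minimal sketch of how resolved metadata could become an Author_Year_Title.pdf name and a BibTeX entry. The `PaperMetadata` class and the helper names are hypothetical, made up for illustration. CiteWright's own data model and formatting rules may well differ.

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass
class PaperMetadata:
    """Hypothetical container; not CiteWright's actual data model."""
    first_author: str
    year: int
    title: str


def slugify(text: str, max_words: int = 8) -> str:
    """Reduce text to ASCII words joined by underscores."""
    # Decompose accented characters, then drop whatever is still non-ASCII.
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    words = re.findall(r"[A-Za-z0-9]+", ascii_text)
    return "_".join(words[:max_words])


def surname_of(author: str) -> str:
    """Naive surname guess: handles 'Last, First' and 'First Last' only."""
    parts = author.split(",")[0].split()
    return slugify(parts[-1]) if parts else "Unknown"


def build_filename(meta: PaperMetadata, ext: str = ".pdf") -> str:
    """Build an Author_Year_Title file name."""
    return f"{surname_of(meta.first_author)}_{meta.year}_{slugify(meta.title)}{ext}"


def to_bibtex(meta: PaperMetadata) -> str:
    """Format a bare-bones @article entry (no escaping of special characters)."""
    key = f"{surname_of(meta.first_author).lower()}{meta.year}"
    return (
        f"@article{{{key},\n"
        f"  author = {{{meta.first_author}}},\n"
        f"  title = {{{meta.title}}},\n"
        f"  year = {{{meta.year}}},\n"
        f"}}"
    )


meta = PaperMetadata("Ashish Vaswani", 2017, "Attention Is All You Need")
print(build_filename(meta))  # Vaswani_2017_Attention_Is_All_You_Need.pdf
print(to_bibtex(meta))
```

The surname guess is deliberately simple and will get multi-word surnames wrong. A real tool would probably lean on the structured author fields that the metadata APIs return.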

The Philosophy

I built this with a "try the free stuff first" approach. Why pay for API calls when Crossref is right there?

Tier  What Happens
1     Check if the PDF already has metadata embedded. Usually garbage, but sometimes you get lucky.
2     Extract DOIs, arXiv IDs, ISBNs from the text and look them up. This is where the magic happens.
3     Search academic APIs using whatever title/author text it can scrape. Works more often than you'd think.
4     (Optional) Throw the text at an LLM and ask nicely. Costs money unless you're running Ollama locally.
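The tiers above boil down to "try each resolver in order, stop at the first hit". Below is a rough sketch of that cascade, plus the identifier-sniffing half of tier 2. The regexes are loose approximations and the tier functions are hypothetical stand-ins, not CiteWright's actual code. ISBNs and the API lookups themselves are left out to keep it short.

```python
import re
from typing import Callable, Optional, Sequence

# Loose approximations for illustration; not CiteWright's actual patterns.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ARXIV_RE = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5})", re.IGNORECASE)

Resolver = Callable[[str], Optional[dict]]


def find_identifiers(text: str) -> dict:
    """Pull a DOI and/or a new-style arXiv ID out of raw text."""
    ids = {}
    doi = DOI_RE.search(text)
    if doi:
        # Trailing punctuation usually belongs to the sentence, not the DOI.
        ids["doi"] = doi.group(0).rstrip(".,;)")
    arxiv = ARXIV_RE.search(text)
    if arxiv:
        ids["arxiv"] = arxiv.group(1)
    return ids


def resolve(text: str, tiers: Sequence[Resolver]) -> Optional[dict]:
    """Try each tier in order and stop at the first one that returns something."""
    for tier in tiers:
        result = tier(text)
        if result:
            return result
    return None


def tier_embedded(text: str) -> Optional[dict]:
    # Stand-in for tier 1; here the embedded metadata is assumed to be useless.
    return None


def tier_identifiers(text: str) -> Optional[dict]:
    # Stand-in for tier 2; a real version would look the IDs up via an API.
    return find_identifiers(text) or None


sample = "Preprint arXiv:1706.03762v5. Published as doi:10.1000/xyz123."
print(resolve(sample, [tier_embedded, tier_identifiers]))
```

Ordering the list from cheapest to most expensive is what keeps the paid LLM tier as a last resort: it only runs if everything before it returns nothing.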

Installation

git clone https://github.com/lukeslp/citewright.git
cd citewright
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install .

Want the AI features and media processing?

pip install ".[all]"

Usage

Preview what would happen (dry run, safe):

citewright pdf ~/papers

Actually rename things:

citewright pdf ~/papers --execute

Go recursive and spit out a BibTeX file:

citewright pdf ~/papers -r --execute --bibtex library.bib

Let AI take a crack at the stubborn ones:

citewright pdf ~/papers --ai --execute

Rename photos and videos too (uses EXIF data):

citewright media ~/photos --execute

Use AI vision to describe images:

citewright media ~/photos --ai --execute

Oh no, go back:

citewright undo
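For the curious, dry-run-by-default plus undo can be sketched as a rename log that gets replayed in reverse. This is an illustration under assumptions, not CiteWright's implementation: the JSON log format and the function names `rename_with_log` and `undo` are invented here.

```python
import json
from pathlib import Path


def rename_with_log(plan: dict, log_path: Path, execute: bool = False) -> list:
    """Apply {old: new} renames; only touch the disk when execute is True."""
    planned = []
    for old, new in plan.items():
        src, dst = Path(old), Path(new)
        # Skip missing sources and never overwrite an existing file.
        if not src.exists() or dst.exists():
            continue
        if execute:
            src.rename(dst)
        planned.append({"from": str(src), "to": str(dst)})
    if execute:
        # Persist what was done so it can be reversed later.
        log_path.write_text(json.dumps(planned, indent=2))
    return planned


def undo(log_path: Path) -> int:
    """Reverse logged renames, newest first; return how many were restored."""
    entries = json.loads(log_path.read_text())
    restored = 0
    for entry in reversed(entries):
        src, dst = Path(entry["to"]), Path(entry["from"])
        if src.exists() and not dst.exists():
            src.rename(dst)
            restored += 1
    log_path.unlink()
    return restored
```

Calling `rename_with_log(plan, log_path)` without `execute=True` returns the planned renames and writes nothing, which mirrors the preview-first behavior of the commands above.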

Configuration

Config lives at ~/.config/citewright/config.json, or use the CLI:

citewright config --show
citewright config --ai-provider openai
citewright config --ai-enabled
citewright config --unpaywall-email "you@example.com"

The Unpaywall email is optional but they appreciate it. Be cool.
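As background on why an email shows up here at all: Unpaywall's public REST API takes an `email` query parameter on its DOI lookups. A small sketch of building such a request URL follows. How CiteWright actually passes the configured address along is an assumption, and `unpaywall_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode

UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build a lookup URL for one DOI, identifying the caller by email."""
    # Keep the slash inside the DOI; percent-encode anything unsafe.
    return f"{UNPAYWALL_BASE}{quote(doi, safe='/')}?{urlencode({'email': email})}"


print(unpaywall_url("10.1000/xyz123", "you@example.com"))
# https://api.unpaywall.org/v2/10.1000/xyz123?email=you%40example.com
```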

License

MIT. Do whatever.

Author

Luke Steuber
https://github.com/lukeslp
luke@actuallyuseful.ai
