citeget

Find, acquire, and manage academic references — for AI agents and humans.

citeget automates the tedious work of tracking down PDFs for academic papers. Point it at a document with a references section, and it tries each available source in turn (direct URLs, Library Genesis, arXiv, Sci-Hub) to download every reference it can.

It also handles the more general case: any list of URLs (blog posts, docs, specs, product pages) can be fetched and saved as Markdown or PDF, which is useful when references aren't peer-reviewed papers. See General-purpose fetch below.

Install

pip install citeget
python -m playwright install chromium   # one-time browser setup

Quick start

CLI

# Search for papers
citeget search "graph theory" --topic articles

# Download top results
citeget download "python programming" --download-dir ~/papers --max-downloads 3

# Acquire all references from a document (academic mode)
citeget acquire my_paper.md

# Fetch arbitrary URLs — markdown by default, pdf if you ask for it
citeget fetch "https://example.com/article" --output-dir ~/Downloads/refs
citeget fetch refs.md --prefer md   # accepts a file with URLs in any form

The acquire command reads the references section, resolves a working directory, and downloads every reference it can find:

$ citeget acquire paper.md
Parsed 34 references
Work dir: paper -- acquired_references/

Skipping 5 already-downloaded reference(s):
  [1] Efficiently modeling long sequences...
  [3] High-speed parallel architectures...
        -> (To re-download, rename or move the existing file.)

[1/29] Ref [2]: A two-step computation of cyclic redundancy code...
  SUCCESS (libgen_articles) -> references/A two-step computation... (Glaise, 1997).pdf
[2/29] Ref [4]: High-speed parallel LFSR architectures...
  SUCCESS (libgen_articles) -> references/High-speed parallel... (Hu et al., 2017).pdf
...

Acquired: 30/34
Output: paper -- acquired_references/

Output files:

  • references/ — downloaded PDFs, named {title} ({authors}, {year}).pdf
  • references.md — all acquired references with clickable local links
  • {datetime}_missed_references.md — what couldn't be found and why
  • {datetime}__acquisition_log.txt — every search attempt (TSV)

Python API

from citeget import search, search_and_download

# Search and get metadata
results = search("machine learning", topic="articles")
for r in results[:3]:
    print(f"{r['title'][:60]}  ({r['year']})")

# One-shot search + download
search_and_download("python programming", download_dir="~/papers", max_downloads=5)

For bulk reference acquisition:

from citeget import (
    parse_references_section,
    resolve_work_dir,
    acquire_all_references,
    write_references_md,
    write_missed_references_md,
)

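# Parse references from any text (here, the contents of paper.md)
from pathlib import Path

my_paper_text = Path("paper.md").read_text(encoding="utf-8")
refs = parse_references_section(my_paper_text)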

# Resolve working directory (auto-derived from filename)
work_dir = resolve_work_dir(reference_file="paper.md")

# Acquire — tries direct URL → libgen → arxiv → sci-hub
successes, failures, log = acquire_all_references(
    refs,
    download_dir=work_dir / "references",
    work_dir=work_dir,
)

# Write output files
write_references_md(successes, work_dir / "references", work_dir / "references.md")

AI agent usage (Claude Code skills)

citeget ships with Claude Code skills — structured prompts that let an AI agent use the tools interactively. The skills live in .claude/skills/ inside this repository.

To use them in Claude Code, either work in the citeget project directory (where the skills are auto-discovered) or copy the skill folders into your project's .claude/skills/ directory. Then invoke them by name:

> /acquire-references my_paper.md
> /research-topic "linear recurrence substitution"
> /review-article draft.md ieee_software
> /check-submission-fit draft.md
> /format-for-journal draft.md cacm_practice
> /prepare-submission draft.md ieee_software

The SKILL.md files are self-contained Markdown documents that describe the workflow, the tools needed, and the expected output, so they can also be used outside Claude Code. Any AI agent system that supports tool-use prompts can consume them: read the SKILL.md file and include it in your system prompt alongside the relevant tool definitions. The skills call into citeget's Python API, so the agent needs access to a Python environment with citeget installed.
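
As a minimal sketch of that integration, the snippet below reads a SKILL.md file and appends it to a base system prompt. The helper name build_system_prompt, the prompt layout, and the example path in the comment are illustrative assumptions, not part of citeget.

```python
from pathlib import Path


def build_system_prompt(skill_path: str, base_prompt: str) -> str:
    """Return base_prompt followed by the full text of a SKILL.md file.

    Hypothetical helper: citeget does not ship this function.
    """
    skill_text = Path(skill_path).read_text(encoding="utf-8")
    return f"{base_prompt.rstrip()}\n\n# Skill instructions\n\n{skill_text.strip()}\n"


# Example call (the per-skill folder name is an assumption):
# prompt = build_system_prompt(
#     ".claude/skills/acquire-references/SKILL.md",
#     "You are a research assistant.",
# )
```

The agent framework still has to expose the tools the skill refers to; this only covers the prompt text.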

Available skills:

| Skill                 | What it does                                               |
|-----------------------|------------------------------------------------------------|
| /fetch-resources      | Download arbitrary URLs as Markdown / PDF (general)        |
| /acquire-references   | Download PDFs for every reference in an academic document  |
| /research-topic       | Deep literature survey with structured research brief      |
| /review-article       | Peer-review style critique with scored dimensions          |
| /check-submission-fit | Journal venue recommendation with fit scores               |
| /format-for-journal   | Reformat a draft for a specific journal's requirements     |
| /prepare-submission   | Generate cover letter, checklist, and submission guide     |

Acquisition strategy

For each reference, citeget tries these sources in order:

  1. Direct URL: if the reference includes an arXiv, OpenReview, or other direct link, download the PDF.
  2. Library Genesis: search by title with progressively adjusted specificity (full title → title + author → short title → author + year).
  3. arXiv API: structured search by author + title keywords.
  4. Sci-Hub: DOI lookup via Crossref, then Sci-Hub download.
  5. Fetch fallback: if no PDF is reachable but the reference has a URL, the page is fetched and saved as Markdown. This catches non-paper references (blog posts, docs, product pages). Disable it with fetch_fallback=False.
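
The ordered fallback above can be sketched as a chain of source functions tried one after another until one returns a file. This is an illustration of the pattern only: acquire_one, the Source type, and the dict-shaped reference are hypothetical, and citeget's internal implementation may differ.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A source takes a reference (here a plain dict) and returns a saved file
# path on success, or None when it has nothing to offer.
Source = Callable[[dict], Optional[str]]


def acquire_one(
    ref: dict, sources: Sequence[Tuple[str, Source]]
) -> Tuple[Optional[str], Optional[str], List[Tuple[str, str]]]:
    """Try each (name, source) pair in order and stop at the first success.

    Returns (source_name, path, attempts). attempts records every source
    that was tried together with its outcome.
    """
    attempts: List[Tuple[str, str]] = []
    for name, source in sources:
        try:
            path = source(ref)
        except Exception as exc:  # one failing source must not abort the chain
            attempts.append((name, f"error: {exc}"))
            continue
        if path:
            attempts.append((name, "success"))
            return name, path, attempts
        attempts.append((name, "not found"))
    return None, None, attempts
```

Keeping the sources in a plain ordered list makes the priority explicit and lets a caller drop or reorder a source without touching the loop.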

Files are named in APA 7 citation style: {title} ({authors_apa7}, {year}).pdf — e.g., Retiming synchronous circuitry (Leiserson & Saxe, 1991).pdf
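
A rough sketch of that naming rule, assuming the author part follows APA 7 parenthetical-citation conventions (one surname, two joined by "&", three or more shortened to "et al."). The function names, the "Unknown" placeholder, and the character stripping are assumptions made for illustration; citeget's own sanitization rules are not documented here.

```python
import re
from typing import Sequence


def apa7_authors(surnames: Sequence[str]) -> str:
    """Format surnames the way APA 7 parenthetical citations do."""
    if not surnames:
        return "Unknown"  # assumption: placeholder when no author is known
    if len(surnames) == 1:
        return surnames[0]
    if len(surnames) == 2:
        return f"{surnames[0]} & {surnames[1]}"
    return f"{surnames[0]} et al."


def reference_filename(title: str, surnames: Sequence[str], year: int) -> str:
    """Build '{title} ({authors}, {year}).pdf', dropping unsafe characters."""
    # Assumption: remove characters that are invalid on common filesystems.
    safe_title = re.sub(r'[\\/:*?"<>|]', "", title)
    safe_title = re.sub(r"\s+", " ", safe_title).strip()
    return f"{safe_title} ({apa7_authors(surnames)}, {year}).pdf"


print(reference_filename("Retiming synchronous circuitry", ["Leiserson", "Saxe"], 1991))
# prints: Retiming synchronous circuitry (Leiserson & Saxe, 1991).pdf
```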

General-purpose fetch

Not all "references" are papers. For lists of arbitrary web URLs, use citeget fetch (or citeget.fetch()) — it accepts a URL, a list, a file of URLs, or prose with embedded URLs, and saves each one as Markdown (default), PDF, or original bytes.

from citeget import fetch

# Pass anything — citeget figures out what URLs are in there
results = fetch(
    "/path/to/links.md",
    output_dir="~/Downloads/refs",
    prefer="md",  # "md" (default), "pdf", "original", or "auto"
)
for r in results:
    print(r.status, r.format, r.output_file)

URL parsing recognizes markdown links [anchor](url), reference-style [1] ... https://url citations, and bare URLs in prose. Filenames are inferred from anchor text → URL path → domain hash.
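
To illustrate the kinds of input described above, here is a small regex-based sketch that pulls Markdown links and bare URLs out of free text. It is not citeget's parser: extract_urls and both patterns are assumptions, and the bare-URL pattern deliberately stops at a closing parenthesis, so URLs that contain parentheses would be truncated.

```python
import re
from typing import List, Tuple

MD_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")
BARE_URL = re.compile(r'https?://[^\s<>")\]]+')


def extract_urls(text: str) -> List[Tuple[str, str]]:
    """Return (anchor, url) pairs in document order, without duplicates.

    Markdown links keep their anchor text; bare URLs get an empty anchor.
    """
    found = []  # (position, anchor, url)
    for m in MD_LINK.finditer(text):
        found.append((m.start(2), m.group(1), m.group(2)))
    linked = {pos for pos, _, _ in found}
    for m in BARE_URL.finditer(text):
        if m.start() in linked:
            continue  # already captured as a Markdown link
        url = m.group(0).rstrip(".,;:")  # drop trailing sentence punctuation
        found.append((m.start(), "", url))
    found.sort()
    seen, result = set(), []
    for _, anchor, url in found:
        if url not in seen:
            seen.add(url)
            result.append((anchor, url))
    return result
```

The anchor text returned for Markdown links is what a filename could be derived from first, before falling back to the URL path.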

PDF rendering is opt-in (HTML→PDF needs wkhtmltopdf):

pip install 'citeget[fetch]'
brew install wkhtmltopdf      # or apt-get install wkhtmltopdf

Without wkhtmltopdf installed, --prefer pdf quietly falls back to Markdown.
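
Because the fallback is silent, it can help to check for the binary before a run. The sketch below shows one way to predict which format will be produced, assuming the fallback is triggered simply by wkhtmltopdf being absent from PATH; effective_format is a hypothetical helper, and citeget's actual detection logic may differ.

```python
import shutil
from typing import Callable, Optional


def effective_format(
    prefer: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the format to expect, downgrading 'pdf' to 'md' when the
    wkhtmltopdf binary is not found on PATH."""
    if prefer == "pdf" and which("wkhtmltopdf") is None:
        return "md"
    return prefer
```

The which parameter exists only so the check can be exercised without touching the real PATH.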

Article publication toolkit

Beyond reference acquisition, citeget includes tools for the full publication workflow. These are primarily used through Claude Code skills, backed by machine-readable journal profiles in citeget/article_pub/data/journal_profiles.json.

Supported journals: IEEE Software, CACM (Practice/Research/Viewpoints), IEEE TSE, ACM Queue.

Standalone scripts in citeget/article_pub/scripts/:

# Check article against journal requirements
python -m citeget.article_pub.scripts.check_article draft.md ieee_software

# Word count with section breakdown
python -m citeget.article_pub.scripts.word_count draft.md --breakdown

# Reference consistency check
python -m citeget.article_pub.scripts.extract_references draft.md

How it works

Library Genesis renders search results via JavaScript, so citeget uses Playwright (headless Chromium) to load pages. Ad domains are blocked for speed. Downloads use session keys extracted from intermediate pages.
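
The general technique (headless Chromium with request blocking) might look like the sketch below, which uses Playwright's request routing to abort requests to ad hosts. The blocklist entries and both function names are assumptions chosen for illustration; citeget's actual blocklist and page-handling code are not shown in this document.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; citeget's actual list is not shown in this document.
AD_DOMAINS = {"doubleclick.net", "googlesyndication.com"}


def is_blocked(url: str) -> bool:
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in AD_DOMAINS)


def fetch_rendered_html(url: str) -> str:
    """Load a page in headless Chromium, aborting requests to ad domains."""
    # Imported lazily so is_blocked can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    def handle(route):
        if is_blocked(route.request.url):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.route("**/*", handle)  # intercept every request
            page.goto(url, wait_until="networkidle")
            return page.content()  # HTML after JavaScript has run
        finally:
            browser.close()
```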

The acquisition log records every attempt in TSV format, making it easy to audit what was tried, what matched, and what failed.
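
For auditing, a TSV log can be summarized with the standard csv module. The column layout of the acquisition log is not documented here, so the sketch below takes the column index as a parameter rather than assuming one; tally_column is a hypothetical helper, and you should inspect the first few lines of your log to pick the right column.

```python
import csv
from collections import Counter
from typing import Iterable


def tally_column(lines: Iterable[str], column: int) -> Counter:
    """Count the values found in one column of tab-separated rows."""
    counts: Counter = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) > column:  # skip short or empty rows
            counts[row[column]] += 1
    return counts


# Usage with a log file (path and column index are placeholders):
# with open(log_path, newline="", encoding="utf-8") as f:
#     print(tally_column(f, 1).most_common())
```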

Download files

  • Source distribution: citeget-0.2.4.tar.gz (140.7 kB)
  • Built distribution: citeget-0.2.4-py3-none-any.whl (67.6 kB, Python 3)
