CLI that fetches all 9 Google Scholar citation formats (BibTeX / EndNote / RefMan / RefWorks / MLA / APA / Chicago / Harvard / Vancouver) for a paper title.

Maintainers

yitianlian

Project description

scholar-cite

A Python CLI that searches Google Scholar by paper title and returns all nine citation formats — BibTeX, EndNote, RefMan (RIS), RefWorks, MLA, APA, Chicago, Harvard, Vancouver.

Status: MVP. The nine formats are verified end-to-end against live Google Scholar via a Playwright browser backend. See docs/test-run-2026-04-19.md for the evidence and docs/ARCHITECTURE.md for how the code is organised.

Why this tool exists
Install
Quick start
Usage
How it works
Missing-format handling
Source-quality ranking
Claude Code & Codex integration
What's implemented vs planned
Running the tests
Project layout
Documentation index
License

Why this tool exists

Google Scholar's "Cite" popup produces nine clean citation formats for any paper. Getting them in bulk is painful though: there's no public API, the HTML surface is rate-limited within a request or two, and export URLs serve text/plain downloads that don't play well with either requests or a headless browser. scholar-cite wraps all of that so you can type:

scholar-cite cite "Attention Is All You Need" --format bibtex

…and get a working BibTeX entry.

Install

Installing always takes two steps:

1. Install the Python package (pulls every Python dependency in automatically).
2. Download the Chromium browser binary that Playwright drives (~150 MB, one-off).

Python 3.10 or later is required (tested on 3.10 – 3.14).

FAQ before you install

Can I just pip install scholar-cite? Version 0.1.0 has been published to PyPI, so pipx install scholar-cite (or pip install scholar-cite) should work. You can also install directly from the git repo (option A below) or from a locally built wheel (option B). Either way, the Chromium download in step 2 is still required.

Do I need an API key or token? No. Google Scholar has no public API. The tool drives a real browser and parses Scholar's own HTML; nothing authenticates. You may need to solve a captcha once in the visible browser window, after which cookies carry the session for days.

Do I have to install dependencies manually? No. pip / pipx reads pyproject.toml and pulls in every Python dep automatically (typer, scholarly, requests, beautifulsoup4, lxml, playwright). The only manual step is step 2 — downloading the Chromium binary. pip can't ship 150 MB of browser inside a Python wheel, so Playwright exposes playwright install chromium to fetch it separately.

What is Playwright and why does scholar-cite need it? Playwright is a Python library that drives a real Chromium browser programmatically. We use it because:

Google Scholar 403s plain HTTP requests within a request or two, even through the scholarly library.
Scholar detects headless browsers and shows a "please show you're not a robot" page to them.
A real headful Chromium with light stealth patches (hide navigator.webdriver etc.) reliably survives. When Scholar does show a captcha, it appears in the visible window and you can click through it once; cookies are cached at ~/.cache/scholar-cite/cookies.json and reused silently for subsequent runs.

We do not use Selenium, pyppeteer, or plain requests for the main path. See docs/ARCHITECTURE.md for how the pieces fit together.
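The approach described above can be sketched as follows, assuming Playwright's sync API. This is an illustration, not scholar-cite's actual code: the helper names (`load_cookies`, `save_cookies`, `open_scholar`), the single stealth patch, and the assumption that the cache file holds a JSON list of Playwright cookie dicts are all hypothetical.

```python
import json
from pathlib import Path

# Cache path quoted in this README. The file format (a JSON list of
# Playwright cookie dicts) is an assumption made for this sketch.
COOKIE_PATH = Path.home() / ".cache" / "scholar-cite" / "cookies.json"

# One example of a light stealth patch: hide navigator.webdriver.
STEALTH_JS = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"


def load_cookies(path: Path = COOKIE_PATH) -> list[dict]:
    """Return cached cookies, or an empty list if nothing is cached."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def save_cookies(cookies: list[dict], path: Path = COOKIE_PATH) -> None:
    """Write cookies to the cache, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cookies, indent=2), encoding="utf-8")


def open_scholar(url: str, cookie_path: Path = COOKIE_PATH) -> str:
    """Open a URL in headful Chromium with cached cookies and return the HTML."""
    # Imported lazily so the cookie helpers work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # visible window
        context = browser.new_context()
        context.add_init_script(STEALTH_JS)  # runs before any page script
        cached = load_cookies(cookie_path)
        if cached:
            context.add_cookies(cached)
        page = context.new_page()
        page.goto(url)
        html = page.content()
        save_cookies(context.cookies(), cookie_path)  # persist for later runs
        browser.close()
        return html
```

Keeping the cookie helpers separate from the browser code makes them easy to test offline, which matches the README's note that the test suite makes no live Scholar calls.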

Option A — install from the git repo (recommended for now)

pipx install git+https://github.com/yitianlian/scholar-cite.git

# Then, once per machine:
playwright install chromium

pipx isolates scholar-cite into its own virtualenv and puts the scholar-cite binary on your PATH. If you prefer pip, replace pipx with pip and manage the venv yourself. SSH (git+ssh://git@github.com/...) also works if you already have an SSH key set up for GitHub.

Option B — build a wheel locally and install it

Useful if you want a single .whl you can copy to other machines.

git clone https://github.com/yitianlian/scholar-cite.git
cd scholar-cite
pip install build
python -m build                     # produces dist/scholar_cite-0.1.0-*.whl

pipx install dist/scholar_cite-0.1.0-py3-none-any.whl
playwright install chromium

Option C — editable install for development

git clone https://github.com/yitianlian/scholar-cite.git
cd scholar-cite
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"             # includes pytest + ruff
playwright install chromium
pytest -q                           # 38 tests, all offline

First-run behaviour (all options)

The first scholar-cite cite "..." call opens a visible Chromium window. If Scholar shows "Please show you're not a robot", click through the challenge once. The tool waits up to 5 minutes, harvests the resulting cookies to ~/.cache/scholar-cite/cookies.json, and reuses them silently on later runs. Run scholar-cite auth status any time to see the cached cookie state, and scholar-cite auth reset to force a fresh login.
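The 5-minute wait can be pictured as a polling loop with a deadline. The sketch below is a hypothetical illustration: `wait_until_unblocked` and its parameters are not scholar-cite's API, and `is_blocked` stands in for whatever check the tool uses to detect the challenge page.

```python
import time
from typing import Callable

CAPTCHA_TIMEOUT_S = 5 * 60  # the 5-minute wait described above


def wait_until_unblocked(
    is_blocked: Callable[[], bool],
    timeout_s: float = CAPTCHA_TIMEOUT_S,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until is_blocked() returns False or the timeout elapses.

    Returns True if the challenge cleared in time, False on timeout.
    """
    deadline = clock() + timeout_s
    while is_blocked():
        if clock() >= deadline:
            return False
        sleep(poll_s)
    return True
```

Passing `clock` and `sleep` as parameters lets a test drive the loop with a fake clock instead of waiting in real time.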

Quick start

# Default: browser path, BibTeX on stdout
scholar-cite cite "Attention Is All You Need"

# All nine formats
scholar-cite cite "Attention Is All You Need" --format all

See install above — the first-run captcha note applies.

Usage

# Single paper → BibTeX on stdout (default format)
scholar-cite cite "<paper title>"

# Pick the formats you want (comma-separated or 'all')
scholar-cite cite "..." --format all
scholar-cite cite "..." --format apa,mla,bibtex

# Cap or expand the candidate pool (Scholar may have multiple clusters)
scholar-cite cite "..." --limit 3

# Machine-readable output (includes citation_errors on partial results)
scholar-cite cite "..." --format all --json

# Write the output to a file instead of stdout
scholar-cite cite "..." --format bibtex -o refs.bib

# Skip the browser and use scholarly's HTTP backend only — no silent fallback
scholar-cite cite "..." --no-browser

# Fail loudly (exit code 4) if any requested format is missing
scholar-cite cite "..." --format all --strict

# Inspect or clear the browser's cookie cache
scholar-cite auth status
scholar-cite auth reset

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success (even if some formats were missing; they are reported) |
| 2 | Search returned no results |
| 4 | `--strict` set and at least one requested format was missing |
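A script that wraps the CLI can branch on these codes. The following is a minimal sketch using only the flags and exit codes documented in this README; the function names `describe_exit` and `cite` are hypothetical.

```python
import subprocess

# Exit codes documented in the table above.
EXIT_MEANINGS = {
    0: "success (missing formats, if any, are reported)",
    2: "search returned no results",
    4: "--strict set and at least one requested format was missing",
}


def describe_exit(code: int) -> str:
    """Map an exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"undocumented exit code {code}")


def cite(title: str, fmt: str = "bibtex", strict: bool = False) -> tuple[int, str]:
    """Run `scholar-cite cite` and return (exit_code, stdout)."""
    cmd = ["scholar-cite", "cite", title, "--format", fmt]
    if strict:
        cmd.append("--strict")
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout
```

A caller might run `code, out = cite("Attention Is All You Need", strict=True)` and write `out` to a file only when `code == 0`, logging `describe_exit(code)` otherwise.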

Example output (`--format all`)

[1] Attention is all you need
    A Vaswani — proceedings.neurips.cc
    cluster_id: 5Gohgn6QFikJ
    ──────────────────────────────────────────────────
    MLA:       Vaswani, Ashish, et al. "Attention is all you need." …
    APA:       Vaswani, A., Shazeer, N., Parmar, N., … (2017). …
    Chicago:   …
    Harvard:   …
    Vancouver: …
    Bibtex:
        @article{vaswani2017attention,
          title={Attention is all you need},
          author={Vaswani, Ashish and Shazeer, Noam and …},
          …
        }
    Endnote:
        %0 Journal Article
        …
    Refman:
        TY  - JOUR
        …
    Refworks:
        # Google Scholar's RefWorks export is an external redirect.
        # Import URL:
        http://www.refworks.com/express?sid=google&…

How it works

scholar-cite has two backends with very different reliability profiles:

Playwright browser (default, most reliable). A real headful Chromium navigates Scholar's search page, cite popup, and export URLs. Light stealth patches reduce anti-bot flags; if Scholar still asks for a captcha, the user solves it once and the cookies carry the session for days.
scholarly HTTP (--no-browser, opt-in). The scholarly library's plain HTTP session. Fast when it works, but Scholar blocks it aggressively. This path does not silently fall back to the browser — failures surface per format instead.

Both paths converge on the same pipeline inside search.py, and both rank candidate clusters by source quality before applying --limit (see below). For the gritty details, read docs/ARCHITECTURE.md.
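To make the cite-popup step concrete, here is a rough sketch of pulling the text formats out of popup HTML with the standard library. The markup shape (a `gs_cith` header cell naming the style and a `gs_citr` div holding the citation) and the sample HTML are assumptions for illustration; the project itself depends on beautifulsoup4 and lxml, and its real parser in citation.py may work differently.

```python
from html.parser import HTMLParser

# Assumed markup: each text format is a table row whose <th class="gs_cith">
# holds the style name and whose <div class="gs_citr"> holds the citation.
SAMPLE_POPUP = """
<table>
  <tr><th class="gs_cith">MLA</th>
      <td><div class="gs_citr">Vaswani, Ashish, et al. "Attention is all you need."</div></td></tr>
  <tr><th class="gs_cith">APA</th>
      <td><div class="gs_citr">Author, A. (Year). <i>Title</i>.</div></td></tr>
</table>
"""


class CitePopupParser(HTMLParser):
    """Collect {style name: citation text} pairs from cite-popup HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.citations: dict[str, str] = {}
        self._target = None   # "name" or "text" while inside a tracked cell
        self._depth = 0       # tag nesting depth inside the tracked cell
        self._name = ""
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if self._target:
            self._depth += 1  # nested tag such as <i> inside a citation
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "th" and "gs_cith" in classes:
            self._target, self._depth, self._buffer = "name", 1, []
        elif tag == "div" and "gs_citr" in classes:
            self._target, self._depth, self._buffer = "text", 1, []

    def handle_data(self, data):
        if self._target:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if not self._target:
            return
        self._depth -= 1
        if self._depth:
            return            # still inside the tracked cell
        text = " ".join("".join(self._buffer).split())
        if self._target == "name":
            self._name = text
        else:
            self.citations[self._name] = text
        self._target = None


parser = CitePopupParser()
parser.feed(SAMPLE_POPUP)
print(parser.citations)
```

The four export formats are not in the popup text itself; per the feature table they are fetched separately through the browser context, so this sketch covers only the text-format half of the pipeline.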

Missing-format handling

When Scholar returns a format incompletely (some clusters are missing export links, some URLs 403, etc.), scholar-cite never drops it silently:

Plain-text output renders [MISSING: <reason>] inline for each failed format.
JSON output adds a citation_errors field per paper.
Stderr gets a short warning summary listing every paper with missing formats.
--strict elevates this to a non-zero exit code (4) for scripts that can't afford partial results.
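The four behaviours above can be sketched in a few lines. The `PaperResult` class and both functions are hypothetical stand-ins (the real dataclasses live in models.py); only the `[MISSING: <reason>]` marker, the `citation_errors` field name, and exit code 4 come from this README.

```python
import sys
from dataclasses import dataclass, field

STRICT_EXIT_CODE = 4  # documented exit code for --strict


@dataclass
class PaperResult:
    # Hypothetical container used only for this sketch.
    title: str
    citations: dict[str, str] = field(default_factory=dict)
    citation_errors: dict[str, str] = field(default_factory=dict)


def render(paper: PaperResult, requested: list[str]) -> str:
    """Render the requested formats, marking failures inline."""
    lines = [paper.title]
    for fmt in requested:
        if fmt in paper.citations:
            lines.append(f"    {fmt}: {paper.citations[fmt]}")
        else:
            reason = paper.citation_errors.get(fmt, "unknown")
            lines.append(f"    {fmt}: [MISSING: {reason}]")
    return "\n".join(lines)


def exit_code(papers: list[PaperResult], requested: list[str], strict: bool) -> int:
    """Warn on stderr about missing formats; return 4 only under strict."""
    missing = {
        p.title: [f for f in requested if f not in p.citations] for p in papers
    }
    missing = {title: fmts for title, fmts in missing.items() if fmts}
    for title, fmts in missing.items():
        print(f"warning: {title}: missing {', '.join(fmts)}", file=sys.stderr)
    return STRICT_EXIT_CODE if strict and missing else 0
```

Writing warnings to stderr keeps stdout clean, so redirecting the citations to a file still works on partial results.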

Source-quality ranking

Google Scholar often indexes a paper multiple times (arXiv preprint, official conference version, third-party mirrors). Citation quality varies wildly — some mirrors produce metadata with reversed author order, fabricated volume numbers, and mangled venue strings. scholar-cite ranks candidates by host:

| Tier | Example hosts |
| --- | --- |
| Trusted venues | `openaccess.thecvf.com`, `aclanthology.org`, `proceedings.neurips.cc`, `proceedings.mlr.press`, `ieeexplore.ieee.org`, `dl.acm.org`, `nature.com` |
| Preprints | `arxiv.org`, `biorxiv.org` |
| Unknown | Everything else (kept in Scholar's original order) |
| Known low-quality | `sandbox.getindico.io`, `scholar.google.com` self-refs |

The real-world consequence: searching "Deep Residual Learning for Image Recognition" with --limit 1 used to land on a sandbox indico mirror that produced @inproceedings{kaiming2016deep, ..., volume={34}}. With ranking on, the same query lands on the clean cluster he2016deep from the official CVPR host. See tests/test_ranking.py::test_rank_papers_handles_resnet_style_scenario.
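A tier-based ranking like this can be expressed as a stable sort on the hostname. The sketch below copies the hosts from the tier table and assumes tiers rank in the order listed there; the function names and the `www.` handling are assumptions, not the logic in ranking.py.

```python
from typing import Optional
from urllib.parse import urlparse

# Hosts copied from the tier table above; a lower tier number ranks first.
TRUSTED = {
    "openaccess.thecvf.com", "aclanthology.org", "proceedings.neurips.cc",
    "proceedings.mlr.press", "ieeexplore.ieee.org", "dl.acm.org", "nature.com",
}
PREPRINTS = {"arxiv.org", "biorxiv.org"}
LOW_QUALITY = {"sandbox.getindico.io", "scholar.google.com"}


def tier(url: str) -> int:
    """Return 0 (trusted), 1 (preprint), 2 (unknown) or 3 (low quality)."""
    host = (urlparse(url).hostname or "").lower().removeprefix("www.")
    if host in TRUSTED:
        return 0
    if host in PREPRINTS:
        return 1
    if host in LOW_QUALITY:
        return 3
    return 2


def rank(urls: list[str], limit: Optional[int] = None) -> list[str]:
    """Sort by tier, then apply the limit.

    sorted() is stable, so candidates within one tier keep Scholar's order.
    """
    ranked = sorted(urls, key=tier)
    return ranked if limit is None else ranked[:limit]
```

Ranking before truncating is the point: with a limit of 1, a trusted host wins even if Scholar listed a low-quality mirror first.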

Claude Code & Codex integration

This repo ships a ready-to-use agent skill so that Claude Code or OpenAI Codex CLI can call the scholar-cite CLI on your behalf when you ask for a citation, without you having to explain the tool every time.

Where the skill lives

Both agent runtimes auto-discover project-scoped skills from their own directory. The content is identical, so the repo keeps a single source in .claude/skills/ and symlinks it for Codex:

scholar-cite/
├── .claude/
│   └── skills/
│       └── scholar-cite/
│           ├── SKILL.md     ← the real file (Claude Code reads here)
│           └── flags.md
└── .agents/
    └── skills/
        └── scholar-cite  →  ../../.claude/skills/scholar-cite   (symlink)
                           (Codex CLI reads here)

How to "install" the skill

Nothing to install. Both agents scan for skills when a session starts in this directory. Just clone the repo and open it in your agent of choice:

| Agent | Skill root it looks at | What you do |
| --- | --- | --- |
| Claude Code | `.claude/skills/<name>/SKILL.md` (project), `~/.claude/skills/<name>/SKILL.md` (user) | Open the repo in Claude Code. The skill is auto-discovered; it's listed in the available-skills section and Claude invokes it via the `Skill` tool when your request matches the description. |
| Codex CLI | `.agents/skills/<name>/SKILL.md` (project), `~/.agents/skills/<name>/SKILL.md` (user) | Open the repo in Codex CLI (`codex` in this directory). Skills are scanned at session start and Codex also watches for changes at runtime. |

To make the skill globally available (every project, not just this one):

# Claude Code
ln -s "$PWD/.claude/skills/scholar-cite" "$HOME/.claude/skills/scholar-cite"

# Codex CLI
mkdir -p "$HOME/.agents/skills"
ln -s "$PWD/.claude/skills/scholar-cite" "$HOME/.agents/skills/scholar-cite"

The CLI itself still needs to be on PATH — see the Install section above.

What the skill tells the agent

When to invoke (trigger phrases in English and Chinese).
When not to use it (arXiv preprints → arxiv skill; headless CI → it won't work; users asking for PDFs → this tool doesn't fetch PDFs).
The common invocations and their flags.
First-run captcha behaviour and the 5-minute wait.
A troubleshooting table for the six recurring failure modes.
The exit-code contract so the agent can branch correctly on failure.

Read .claude/skills/scholar-cite/SKILL.md (and flags.md for the full flag reference and a Python-API snippet) to see exactly what the agent is taught.

What's implemented vs planned

| Feature | Status |
| --- | --- |
| Scholar search (browser + scholarly paths) | ✅ |
| `cluster_id` extraction | ✅ |
| Cite-popup HTML parse (five text formats) | ✅ |
| Four export formats via `BrowserContext.request` | ✅ |
| Playwright cookie persistence / captcha recovery | ✅ |
| Source-quality ranking of candidate clusters | ✅ |
| `auth status` / `auth reset` subcommands | ✅ |
| `--format`, `--limit`, `-o`, `--json`, `--no-browser`, `--strict` | ✅ |
| Batch mode (`-f titles.txt`) | ⏳ planned |
| Interactive picker (`-i`) | ⏳ planned |
| Clipboard (`-c`) | ⏳ planned |
| SQLite cache keyed by `cluster_id` | ⏳ planned |
| SerpAPI fallback backend | ⏳ planned |

Running the tests

pip install -e ".[dev]"
pytest -q

All parsing and fetching logic is covered by offline unit tests. CI makes no live Google Scholar calls; the tests use a saved HTML fixture and fake fetchers.

ruff check src/ tests/      # lint
ruff format src/ tests/     # format

Project layout

scholar-cite/
├── LICENSE                    MIT
├── CHANGELOG.md               Release notes
├── README.md                  ← you are here
├── pyproject.toml
├── docs/
│   ├── ARCHITECTURE.md        Current code map (start here to hack)
│   ├── design.md              Original design spec (planning doc)
│   └── test-run-2026-04-19.md Live end-to-end evidence
├── examples/
│   └── demo_five_papers.py    Fetches BibTeX for 5 classic ML papers
├── src/scholar_cite/
│   ├── cli.py                 Typer CLI (`cite`, `auth status`, `auth reset`)
│   ├── search.py              Browser + scholarly orchestration
│   ├── citation.py            Cite-popup parser + 9-format assembly
│   ├── browser_fetcher.py     Playwright session with cookie persistence
│   ├── ranking.py             Source-quality hostname ranking
│   └── models.py              Paper / CitationSet dataclasses
└── tests/
    ├── fixtures/
    │   └── cite_popup_sample.html
    ├── test_browser_fetcher.py
    ├── test_citation.py
    ├── test_cli.py
    ├── test_ranking.py
    └── test_search.py

Documentation index

| Doc | What you'll find there |
| --- | --- |
| `docs/ARCHITECTURE.md` | Module map, one-query lifecycle, exception policy, cache layout |
| `docs/design.md` | Original 14-section design specification (planning-era snapshot) |
| `docs/test-run-2026-04-19.md` | First live 9-format pipeline run |
| `docs/e2e-verification.md` | Post-fix E2E evidence + wheel install smoke test |
| `.claude/skills/scholar-cite/SKILL.md` | Agent skill, auto-discovered by Claude Code from `.claude/skills/` and by Codex CLI from `.agents/skills/` (symlinked to the same file); teaches the agent when and how to call the CLI |
| `.claude/skills/scholar-cite/flags.md` | Flag reference + Python API snippet referenced by the skill |
| `CHANGELOG.md` | Release-level summary of what changed and why |

License

MIT. Google Scholar's HTML structure and Terms of Service govern your use of the upstream data.

Project details

Release history

0.1.2 (Apr 20, 2026)
0.1.1 (Apr 20, 2026)
0.1.0 (Apr 20, 2026, this version)

Download files

Source Distribution

scholar_cite-0.1.0.tar.gz (46.0 kB), uploaded Apr 20, 2026, Source

Built Distribution

scholar_cite-0.1.0-py3-none-any.whl (26.8 kB), uploaded Apr 20, 2026, Python 3

File details

Details for the file scholar_cite-0.1.0.tar.gz.

File metadata

Download URL: scholar_cite-0.1.0.tar.gz
Upload date: Apr 20, 2026
Size: 46.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scholar_cite-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `aef2eb42fc209033af1592ae0bf158a7553bf008db2da925bd00cfadbcb43e2d` |
| MD5 | `28fcd8aa369dda968d91d00a3afdc640` |
| BLAKE2b-256 | `254b1bb8e95ebd2eab67bc364fe1e5113879633f5bd70917d5a4f91a4bdc8914` |

Provenance

The following attestation bundles were made for scholar_cite-0.1.0.tar.gz:

Publisher: publish-to-pypi.yml on yitianlian/scholar-cite

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scholar_cite-0.1.0.tar.gz
- Subject digest: aef2eb42fc209033af1592ae0bf158a7553bf008db2da925bd00cfadbcb43e2d
- Sigstore transparency entry: 1341613280
- Sigstore integration time: Apr 20, 2026
Source repository:
- Permalink: yitianlian/scholar-cite@419e926b127775958d92f269d0888b2a5b15a6c9
- Branch / Tag: refs/heads/main
- Owner: https://github.com/yitianlian
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@419e926b127775958d92f269d0888b2a5b15a6c9
- Trigger Event: workflow_dispatch

File details

Details for the file scholar_cite-0.1.0-py3-none-any.whl.

File metadata

Download URL: scholar_cite-0.1.0-py3-none-any.whl
Upload date: Apr 20, 2026
Size: 26.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scholar_cite-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `a76107543ce6cd2fdc7198fbac1edede57b92806e0f252151ef4ccd10193b6d3` |
| MD5 | `61fdb0e3d596739bacc8b0657c0b85c9` |
| BLAKE2b-256 | `b7af68a690913cf5577898616d95a378207f56c7929139025b29d5e12c0edc89` |

Provenance

The following attestation bundles were made for scholar_cite-0.1.0-py3-none-any.whl:

Publisher: publish-to-pypi.yml on yitianlian/scholar-cite

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scholar_cite-0.1.0-py3-none-any.whl
- Subject digest: a76107543ce6cd2fdc7198fbac1edede57b92806e0f252151ef4ccd10193b6d3
- Sigstore transparency entry: 1341613282
- Sigstore integration time: Apr 20, 2026
Source repository:
- Permalink: yitianlian/scholar-cite@419e926b127775958d92f269d0888b2a5b15a6c9
- Branch / Tag: refs/heads/main
- Owner: https://github.com/yitianlian
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@419e926b127775958d92f269d0888b2a5b15a6c9
- Trigger Event: workflow_dispatch
