
Google Scholar citation analysis tool for identifying high-impact citations, influential authors, and notable peers for grant applications and CV building.


WhoCitedMe

Python 3.10+ | License: MIT | Code Style: Black

WhoCitedMe is a Python library and CLI tool for researchers and academics. It scrapes Google Scholar citations, identifies who is citing your work, and analyzes the impact of those citations.

It goes beyond simple citation counts by enriching author data, matching missing Scholar IDs, and identifying the "top scholar" (highest-cited author) for each citing paper.

🚀 Key Features

  • 📄 Citing Papers Scraper: Automatically scrape all papers citing a specific Google Scholar profile within a given year range.
  • 🧩 Author Enricher: Handles truncated author lists (e.g., "J Smith, A Doe...") by parsing full citation data.
  • 📊 Author Info Fetcher: High-performance, parallelized fetching of author metrics (Citation Count, h-index, Fellow status).
  • 🆔 ID Matcher: Uses fuzzy matching logic to resolve missing Google Scholar IDs for citing authors.
  • 🏆 Top Scholar Finder: Identifies the most influential author on every citing paper to help you understand who is citing you.
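
To make the ID Matcher idea concrete, here is a minimal sketch of threshold-based fuzzy name matching. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; WhoCitedMe's actual matching logic may well differ. The helper names (`name_similarity`, `best_match`) and the sample names and IDs are hypothetical.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score between two author names (case-insensitive)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_match(name: str, known: dict[str, str], threshold: float = 0.7) -> str | None:
    """Return the Scholar ID of the closest known name, or None below the threshold.

    `known` maps author names to Scholar IDs (illustrative data only).
    """
    best_id, best_score = None, 0.0
    for candidate, scholar_id in known.items():
        score = name_similarity(name, candidate)
        if score > best_score:
            best_id, best_score = scholar_id, score
    return best_id if best_score >= threshold else None


# Made-up names and IDs, purely for illustration.
known_scholars = {"Jane Smith": "ID_AAAA", "Alan Doe": "ID_BBBB"}
print(best_match("jane smith", known_scholars))  # same name, different case
print(best_match("Zhang Wei", known_scholars))   # no sufficiently similar name
```

Raising the threshold trades recall for precision: fewer missing IDs get resolved, but a wrong author is less likely to be attached to a paper.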

🎯 Use Cases

  • Grant Applications: Demonstrate impact by listing high-profile researchers who cite your work.
  • Tenure & Promotion: Provide detailed metrics on the quality of your citations, not just the quantity.
  • Networking: Identify potential collaborators who are already building on your research.

🛠️ Installation

Prerequisites

  • Python 3.10 or higher.
  • Google Chrome installed (required for Selenium scraping).
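
A quick way to check both prerequisites before installing is sketched below. The list of Chrome executable names is an assumption and is far from exhaustive: on macOS and Windows, Chrome is often installed without being on `PATH`, so a "not found" result is only a hint.

```python
import shutil
import sys

# Executable names Chrome commonly goes by on Linux; an assumption, adjust as needed.
CHROME_CANDIDATES = ("google-chrome", "google-chrome-stable", "chrome", "chromium")


def python_ok(version=tuple(sys.version_info[:2])) -> bool:
    """Return True if the given (major, minor) version is at least 3.10."""
    return tuple(version) >= (3, 10)


def find_chrome(candidates=CHROME_CANDIDATES):
    """Return the path of the first Chrome-like executable on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


print("Python 3.10+:", python_ok())
print("Chrome on PATH:", find_chrome() or "not found (it may still be installed elsewhere)")
```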

📦 From PyPI (Recommended)

pip install whocitedme

💻 Local Development (using uv)

We use uv for fast dependency management.

  1. Clone the repository:

    git clone https://github.com/KyanChen/WhoCitedMe.git
    cd WhoCitedMe
    
  2. Set up the environment with uv:

    # Install uv (if not installed)
    pip install uv
    
    # Create virtual environment
    uv venv
    
    # Activate virtual environment
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
  3. Install in editable mode:

    uv pip install -e .
    

📖 Usage

You can use WhoCitedMe either via the command line interface (CLI) or as a Python library.

Command Line Interface (CLI)

The easiest way to run the tool is using the pipeline command, which runs all steps in order.

# Run the full analysis pipeline
whocitedme pipeline --user-id "YOUR_SCHOLAR_ID" --start-year 2018 --end-year 2024

# With custom output directory and worker count
whocitedme pipeline -u "YOUR_SCHOLAR_ID" -s 2018 -e 2024 -o my_output --workers 32

# Run in headless mode with proxy support
whocitedme pipeline -u "YOUR_SCHOLAR_ID" -s 2018 -e 2024 --headless --proxy http://127.0.0.1:7890

Individual Steps

If you prefer to run steps individually:

  1. Scrape Citing Papers:

    whocitedme scrape --user-id "YOUR_SCHOLAR_ID" --start-year 2020 --end-year 2024 --output output/citations.csv
    
    # Run in headless mode (no visible browser window)
    whocitedme scrape -u "YOUR_SCHOLAR_ID" -s 2020 -e 2024 --headless
    
  2. Enrich Author Data:

    whocitedme enrich --input output/citations.csv --output output/citations_enriched.csv
    
    # Start fresh (disable resume from previous run)
    whocitedme enrich -i output/citations.csv -o output/citations_enriched.csv --no-resume
    
  3. Fetch Author Metrics (Parallelized):

    whocitedme fetch-authors --input output/citations_enriched.csv --output output/scholar_database.csv --workers 16
    
    # With proxy support
    whocitedme fetch-authors -i output/citations_enriched.csv --proxy http://127.0.0.1:7890
    
  4. Match Missing IDs:

    whocitedme match-ids --citing output/citations_enriched.csv --scholars output/scholar_database.csv --output output/citations_verified.csv
    
    # With custom matching threshold (0-1, default: 0.7)
    whocitedme match-ids -c output/citations_enriched.csv -s output/scholar_database.csv --threshold 0.8
    
  5. Find Top Scholars:

    whocitedme top-scholar --input output/citations_verified.csv --scholars output/scholar_database.csv --output output/citations_final.csv
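
If you want to script the individual steps, the sketch below assembles the five commands listed above as argument lists and runs them in order with `subprocess.run`. It uses only flags that appear in this README; `build_steps`, `run_steps`, the `out_dir` layout, and the `dry_run` switch are illustrative names, not part of WhoCitedMe.

```python
import subprocess


def build_steps(user_id: str, start: int, end: int, out_dir: str = "output") -> list:
    """Build the five step commands as argument lists (no shell quoting needed)."""
    citations = f"{out_dir}/citations.csv"
    enriched = f"{out_dir}/citations_enriched.csv"
    scholars = f"{out_dir}/scholar_database.csv"
    verified = f"{out_dir}/citations_verified.csv"
    final = f"{out_dir}/citations_final.csv"
    return [
        ["whocitedme", "scrape", "--user-id", user_id, "--start-year", str(start),
         "--end-year", str(end), "--output", citations],
        ["whocitedme", "enrich", "--input", citations, "--output", enriched],
        ["whocitedme", "fetch-authors", "--input", enriched, "--output", scholars],
        ["whocitedme", "match-ids", "--citing", enriched, "--scholars", scholars,
         "--output", verified],
        ["whocitedme", "top-scholar", "--input", verified, "--scholars", scholars,
         "--output", final],
    ]


def run_steps(steps, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence as soon as one step fails.
            subprocess.run(cmd, check=True)


# Dry run by default: prints the commands without launching a browser.
run_steps(build_steps("YOUR_SCHOLAR_ID", 2020, 2024))
```

Each step reads the file written by an earlier one, so stopping at the first failure avoids feeding a missing or partial CSV into the next command.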
    

Python API

For custom workflows, import the classes directly:

from whocitedme import (
    CitingPapersScraper,
    AuthorEnricher,
    AuthorInfoFetcher,
    IDMatcher,
    TopScholarProcessor,
)

# Step 1: Scrape citing papers
scraper = CitingPapersScraper(
    user_id="YOUR_SCHOLAR_ID",
    start_year=2020,
    end_year=2024,
    output_file="output/citations.csv",
    headless=False,  # Set True for headless browser
)
scraper.run()
scraper.close()

# Step 2: Enrich truncated author information
enricher = AuthorEnricher(
    input_file="output/citations.csv",
    output_file="output/citations_enriched.csv",
)
enricher.run(resume=True)  # Resume from previous run if interrupted
enricher.close()

# Step 3: Fetch author metrics (parallelized)
fetcher = AuthorInfoFetcher(
    input_file="output/citations_enriched.csv",
    output_file="output/scholar_database.csv",
)
fetcher.run(max_workers=16)

# Step 4: Match missing Scholar IDs
matcher = IDMatcher(
    citing_file="output/citations_enriched.csv",
    scholar_file="output/scholar_database.csv",
    output_file="output/citations_verified.csv",
    match_threshold=0.7,
)
matcher.run()

# Step 5: Find top scholars for each citation
processor = TopScholarProcessor(
    main_file="output/citations_verified.csv",
    scholar_file="output/scholar_database.csv",
    output_file="output/citations_final.csv",
)
processor.run()

See examples/basic_usage.py for a complete runnable script.
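
As an illustration of what Step 5 computes, the sketch below picks the highest-cited author for each citing paper. It is a simplified stand-in, not the `TopScholarProcessor` implementation; the data layout, author names, and citation counts are invented for the example.

```python
from __future__ import annotations


def top_scholar(authors: list[str], citation_counts: dict[str, int]) -> str | None:
    """Return the author with the highest citation count, or None if none is known."""
    known = [a for a in authors if a in citation_counts]
    if not known:
        return None
    return max(known, key=lambda a: citation_counts[a])


# Invented sample data: citing papers and per-author citation counts.
papers = {
    "Paper A": ["J Smith", "A Doe"],
    "Paper B": ["A Doe", "L Chen"],
    "Paper C": ["Unknown Author"],
}
counts = {"J Smith": 1200, "A Doe": 5400, "L Chen": 800}

for title, authors in papers.items():
    print(title, "->", top_scholar(authors, counts))
```

Authors without known metrics are skipped rather than treated as zero, so a paper whose authors are all unresolved yields `None` instead of an arbitrary pick.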

📂 Project Structure

WhoCitedMe/
├── whocitedme/
│   ├── __init__.py         # Package exports
│   ├── cli.py              # Command-line entry point
│   ├── scrapers/           # Web scrapers using Selenium
│   │   ├── citing_papers.py    # CitingPapersScraper
│   │   ├── author_enricher.py  # AuthorEnricher
│   │   └── author_info.py      # AuthorInfoFetcher
│   ├── processors/         # Data processing logic
│   │   ├── id_matcher.py       # IDMatcher
│   │   └── top_scholar.py      # TopScholarProcessor
│   └── utils/              # Helper utilities
│       ├── browser.py          # Browser driver creation
│       └── captcha.py          # CAPTCHA handling & random sleep
├── examples/               # Usage examples
│   └── basic_usage.py
├── output/                 # Default output directory (git-ignored)
├── pyproject.toml          # Project configuration and dependencies
├── LICENSE                 # MIT License
└── README.md               # This file

⚠️ Troubleshooting & Limits

  • Google Scholar Rate Limits: If you send requests too quickly, Google Scholar may temporarily block your IP address.
    • Solution: The tool has built-in delays, but for large jobs, consider using a VPN or proxy (see the --proxy option).
  • CAPTCHA: If the scraper gets stuck, check the opened Chrome window. You may need to manually solve a CAPTCHA.
  • Chrome Version: Ensure your installed Chrome browser matches the ChromeDriver version (usually handled automatically by undetected-chromedriver).
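
The delay-and-retry idea behind the rate-limit advice can be sketched as exponential backoff with random jitter. This is a generic pattern, not WhoCitedMe's built-in delay logic; `backoff_delay`, `fetch_with_retry`, and their parameters are hypothetical.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def fetch_with_retry(fetch, max_attempts: int = 5, sleep=time.sleep):
    """Call fetch() until it succeeds, sleeping a jittered delay between failures."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))


# Demo with a stand-in fetch function that fails twice, then succeeds.
# The sleep function is replaced by a recorder so the demo does not actually wait.
attempts = {"n": 0}


def flaky_fetch():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated block")
    return "ok"


recorded_delays = []
print(fetch_with_retry(flaky_fetch, sleep=recorded_delays.append))
print("delays that would have been slept:", len(recorded_delays))
```

The jitter spreads retries out so that parallel workers do not all hit the server again at the same moment.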

🤝 Contributing

Contributions are welcome!

  1. Fork the repo.
  2. Create a feature branch (git checkout -b feature/amazing-feature).
  3. Commit your changes.
  4. Push to the branch.
  5. Open a Pull Request.

📄 License

Distributed under the MIT License. See LICENSE for more information.

