PyScholarly

An async Python library for scraping Google Scholar profiles using Playwright, with support for proxy rotation and detailed citation tracking.

⚠️ Disclaimer: This project is for academic and research purposes only. Please be mindful of Google Scholar's terms of service and rate limiting. Use responsibly and at your own risk.

Installation

pip install pyscholarly

Features

  • Async/await support for efficient data fetching
  • Proxy support with rotation strategies:
    • Sequential rotation
    • Random rotation
    • Support for authenticated proxies (username/password)
  • Comprehensive citation metrics:
    • All-time citations
    • Recent citations (last 5 years)
    • Year-to-date citations per publication
    • H-index and i10-index (all-time and recent)
  • Publication details:
    • Title, authors, venue, year
    • Citation counts (all-time and YTD)
  • Debug logging and HTML page saving for troubleshooting
  • Headless mode support
  • Modern Playwright-based scraping

Usage

Basic Usage

from pyscholarly import fetch_scholar_data
import asyncio

async def main():
    # Fetch data for a Google Scholar profile
    author_id = "SCHOLAR_ID"  # Replace with actual Google Scholar ID
    data = await fetch_scholar_data(author_id)
    
    print(f"Author: {data['name']}")
    print(f"Total citations: {data['citedby']}")
    print(f"Recent citations: {data['citedby_recent']}")
    print(f"h-index: {data['hindex']}")
    
    # Print publications with YTD citations
    print("\nPublications:")
    for pub in data['publications']:
        print(f"- {pub['bib']['title']} ({pub['year']}):")
        print(f"  - Total citations: {pub['num_citations']}")
        print(f"  - YTD citations: {pub['ytd_citations']}")

if __name__ == "__main__":
    asyncio.run(main())
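
Because the API is async, several profiles can be fetched concurrently. The following is a minimal sketch, not part of the library: `fetch_many`, `fake_fetch`, and the `max_concurrency` parameter are hypothetical names introduced here for illustration. The fetch coroutine is passed in as an argument so the example runs without network access.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

# Any coroutine function that maps a scholar ID to a result dict.
Fetcher = Callable[[str], Awaitable[Dict[str, Any]]]


async def fetch_many(
    scholar_ids: List[str],
    fetch: Fetcher,
    max_concurrency: int = 2,
) -> Dict[str, Dict[str, Any]]:
    """Fetch several profiles, running at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(scholar_id: str):
        # The semaphore caps how many fetches are in flight at once.
        async with semaphore:
            return scholar_id, await fetch(scholar_id)

    pairs = await asyncio.gather(*(fetch_one(s) for s in scholar_ids))
    return dict(pairs)


async def fake_fetch(scholar_id: str) -> Dict[str, Any]:
    """Stand-in for a real fetch call (assumption: illustrative only)."""
    await asyncio.sleep(0)
    return {"name": f"Author {scholar_id}", "citedby": 0, "publications": []}


results = asyncio.run(fetch_many(["ID_A", "ID_B", "ID_C"], fake_fetch))
print(sorted(results))
```

To use this with the real library, `fetch_scholar_data` could be passed as `fetch` in place of `fake_fetch`. Given the rate-limiting caveat in the disclaimer, a low `max_concurrency` is probably the safer choice.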

Advanced Usage with Proxies

import asyncio
import logging
from pyscholarly import fetch_scholar_data

# Set up debug logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

async def main():
    # Pass a single proxy string...
    # proxies = "username:password@host:port"

    # ...or a list of proxies to rotate through
    proxies = [
        "username1:password1@host1:port1",
        "username2:password2@host2:port2",
    ]

    data = await fetch_scholar_data(
        scholar_id="SCHOLAR_ID",  # Replace with an actual Google Scholar ID
        proxies=proxies,
        proxy_rotation="random",  # or "sequential"
        headless=True,
        logger=logger,
    )
    print(f"Fetched profile for {data['name']}")

if __name__ == "__main__":
    asyncio.run(main())
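
To make the difference between the two rotation strategies concrete, here is a small standalone sketch. It is not the library's internal implementation, which this README does not describe. `rotate_proxies` and its `seed` parameter are hypothetical names used only for illustration.

```python
import itertools
import random
from typing import Iterator, List, Optional


def rotate_proxies(
    proxies: List[str],
    strategy: str = "sequential",
    seed: Optional[int] = None,
) -> Iterator[str]:
    """Return an endless iterator over proxies using the given strategy."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    if strategy == "sequential":
        # Walk the list in order and wrap around at the end.
        return itertools.cycle(proxies)
    if strategy == "random":
        # Pick independently and uniformly each time; repeats are possible.
        rng = random.Random(seed)
        return (rng.choice(proxies) for _ in itertools.count())
    raise ValueError(f"unknown strategy: {strategy!r}")


pool = ["host1:8080", "host2:8080", "host3:8080"]
sequential = rotate_proxies(pool, "sequential")
first_four = [next(sequential) for _ in range(4)]
print(first_four)  # host1, host2, host3, then host1 again
```

Sequential rotation spreads requests evenly and predictably across the pool, while random rotation makes the request pattern harder to predict at the cost of occasionally reusing the same proxy twice in a row.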

Return Data Structure

{
    'name': str,
    'citedby': int,              # All-time citations
    'citedby_recent': int,       # Citations in last 5 years
    'hindex': int,               # All-time h-index
    'hindex_recent': int,        # Recent h-index
    'i10index': int,             # All-time i10-index
    'i10index_recent': int,      # Recent i10-index
    'publications': [
        {
            'bib': {
                'title': str,
                'authors': str,
                'venue': str
            },
            'num_citations': int,    # Total citations
            'ytd_citations': int,    # Year-to-date citations
            'year': str
        },
        # ...
    ]
}
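
As an illustration of working with this structure, the sketch below ranks publications by year-to-date citations and recomputes an h-index from the per-publication counts. The sample values are invented for the example, and `top_by_ytd` and `h_index` are hypothetical helpers, not library functions. The recomputed h-index may differ from `data['hindex']` if the publication list is incomplete.

```python
from typing import Any, Dict, List

# Invented sample data that follows the documented structure.
sample = {
    "name": "Example Author",
    "citedby": 18,
    "hindex": 3,
    "publications": [
        {"bib": {"title": "Paper A", "authors": "X, Y", "venue": "Conf 1"},
         "num_citations": 10, "ytd_citations": 2, "year": "2019"},
        {"bib": {"title": "Paper B", "authors": "X", "venue": "Journal 2"},
         "num_citations": 4, "ytd_citations": 3, "year": "2021"},
        {"bib": {"title": "Paper C", "authors": "X, Z", "venue": "Conf 3"},
         "num_citations": 3, "ytd_citations": 0, "year": "2020"},
        {"bib": {"title": "Paper D", "authors": "X", "venue": "Workshop 4"},
         "num_citations": 1, "ytd_citations": 1, "year": "2023"},
    ],
}


def top_by_ytd(data: Dict[str, Any], n: int = 3) -> List[str]:
    """Return the titles of the n publications with the most YTD citations."""
    ranked = sorted(
        data["publications"], key=lambda p: p["ytd_citations"], reverse=True
    )
    return [p["bib"]["title"] for p in ranked[:n]]


def h_index(citation_counts: List[int]) -> int:
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


print(top_by_ytd(sample, n=2))
print(h_index([p["num_citations"] for p in sample["publications"]]))
```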

Configuration Options

Parameter        Type                               Description
scholar_id       str                                Google Scholar profile ID
proxies          Optional[Union[str, List[str]]]    Single proxy or list of proxies
proxy_rotation   str                                Proxy rotation strategy ('sequential' or 'random')
headless         bool                               Run browser in headless mode
logger           Optional[Logger]                   Custom logger instance
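
Proxy strings use the `username:password@host:port` form shown earlier. For readers who want to validate their proxy strings before passing them in, the sketch below splits one into the `server`, `username`, and `password` keys that Playwright's proxy settings accept. `parse_proxy` is a hypothetical helper, the `http` scheme is an assumption, and whether the library parses proxies this way internally is not stated in this README.

```python
from typing import Dict


def parse_proxy(proxy: str, scheme: str = "http") -> Dict[str, str]:
    """Split 'user:pass@host:port' (or 'host:port') into proxy settings."""
    # Split on the last '@' so the address part is always host:port.
    credentials, _, address = proxy.rpartition("@")
    settings = {"server": f"{scheme}://{address}"}
    if credentials:
        # Split on the first ':' so the password may itself contain colons.
        username, _, password = credentials.partition(":")
        settings["username"] = username
        settings["password"] = password
    return settings


print(parse_proxy("alice:s3cret@proxy.example.com:8080"))
print(parse_proxy("proxy.example.com:8080"))
```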

Requirements

  • Python 3.8+
  • Playwright

Debug Mode

When the logger passed to fetch_scholar_data is set to the DEBUG level, the library saves the fetched HTML pages to a debug_pages directory for troubleshooting.
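
A minimal sketch of how one might set up a DEBUG-level logger and then inspect the saved pages after a run. `make_debug_logger` and `list_debug_pages` are hypothetical helpers. The assumption that debug_pages is created relative to the current working directory, and the names of the files inside it, are not confirmed by this README.

```python
import logging
from pathlib import Path
from typing import List


def make_debug_logger(name: str = "pyscholarly_debug") -> logging.Logger:
    """Create a logger at DEBUG level that writes to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:
        # Attach a handler only once, even if this function is called again.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def list_debug_pages(directory: str = "debug_pages") -> List[Path]:
    """Return the saved files in the debug directory, or [] if it is absent."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.is_file())


logger = make_debug_logger()
for page in list_debug_pages():
    logger.debug("saved page: %s", page)
```

The logger returned here can be passed as the `logger` argument of `fetch_scholar_data`, as in the advanced usage example.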

License

This project is licensed under the MIT License - see the LICENSE file for details.
