An async Python library for scraping Google Scholar profiles

PyScholarly

An async Python library for scraping Google Scholar profiles using Playwright, with support for proxy rotation and detailed citation tracking.

⚠️ Disclaimer: This project is for academic and research purposes only. Please be mindful of Google Scholar's terms of service and rate limiting. Use responsibly and at your own risk.

Installation

pip install pyscholarly

Features

- Async/await support for efficient data fetching
- Proxy support with rotation strategies:
  - Sequential rotation
  - Random rotation
  - Support for authenticated proxies (username/password)
- Comprehensive citation metrics:
  - All-time citations
  - Recent citations (last 5 years)
  - Year-to-date citations per publication
  - H-index and i10-index (all-time and recent)
- Publication details:
  - Title, authors, venue, year
  - Citation counts (all-time and YTD)
- Debug logging and HTML page saving for troubleshooting
- Headless mode support
- Modern Playwright-based scraping

Usage

Basic Usage

from pyscholarly import fetch_scholar_data
import asyncio

async def main():
    # Fetch data for a Google Scholar profile
    author_id = "SCHOLAR_ID"  # Replace with actual Google Scholar ID
    data = await fetch_scholar_data(author_id)
    
    print(f"Author: {data['name']}")
    print(f"Total citations: {data['citedby']}")
    print(f"Recent citations: {data['citedby_recent']}")
    print(f"h-index: {data['hindex']}")
    
    # Print publications with YTD citations
    print("\nPublications:")
    for pub in data['publications']:
        print(f"- {pub['bib']['title']} ({pub['year']}):")
        print(f"  - Total citations: {pub['num_citations']}")
        print(f"  - YTD citations: {pub['ytd_citations']}")

if __name__ == "__main__":
    asyncio.run(main())
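
Because the fetch is a coroutine, several profiles can in principle be requested concurrently. The sketch below is one way to do that with a concurrency cap. `fetch_many` and `max_concurrent` are hypothetical names, not part of the library, and the README does not say whether running several fetches at once is supported, so treat this as an assumption and keep the cap low to respect rate limits.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def fetch_many(
    scholar_ids: List[str],
    fetch: Callable[[str], Awaitable[Dict[str, Any]]],
    max_concurrent: int = 2,
) -> Dict[str, Any]:
    """Fetch several profiles, running at most `max_concurrent` at a time.

    Returns a dict mapping each ID to its data, or to the exception
    raised while fetching it, so one failure does not cancel the rest.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(scholar_id: str) -> Any:
        # The semaphore limits how many fetches are in flight at once
        async with semaphore:
            return await fetch(scholar_id)

    results = await asyncio.gather(
        *(fetch_one(sid) for sid in scholar_ids),
        return_exceptions=True,
    )
    return dict(zip(scholar_ids, results))
```

Inside a coroutine this could be called as `await fetch_many(["ID_1", "ID_2"], fetch_scholar_data)`, checking each value with `isinstance(value, Exception)` before using it.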

Advanced Usage with Proxies

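import asyncio
import logging

from pyscholarly import fetch_scholar_data

# Set up logging; DEBUG level also enables HTML page saving (see Debug Mode)
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

async def main():
    # A single proxy can be passed as a string
    proxy = "username:password@host:port"

    # Or pass a list of proxies to rotate through
    proxies = [
        "username1:password1@host1:port1",
        "username2:password2@host2:port2"
    ]

    data = await fetch_scholar_data(
        scholar_id="SCHOLAR_ID",  # Replace with actual Google Scholar ID
        proxies=proxies,          # or proxies=proxy for a single proxy
        proxy_rotation="random",  # or "sequential"
        headless=True,
        logger=logger
    )
    print(f"Author: {data['name']}")

if __name__ == "__main__":
    asyncio.run(main())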
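
For readers curious how a `username:password@host:port` string relates to the browser configuration, the sketch below splits such a string into the `server`/`username`/`password` dictionary that Playwright's launch `proxy` option accepts. `parse_proxy` is a hypothetical helper written for illustration, and the `http` scheme is an assumption; the library's own parsing may differ.

```python
from typing import Dict


def parse_proxy(proxy: str, scheme: str = "http") -> Dict[str, str]:
    """Split 'username:password@host:port' into Playwright-style settings.

    Credentials are optional; a bare 'host:port' yields only 'server'.
    """
    # Split on the LAST '@' so passwords containing '@' stay intact
    credentials, separator, address = proxy.rpartition("@")
    settings = {"server": f"{scheme}://{address}"}
    if separator:
        # Split on the FIRST ':' so passwords containing ':' stay intact
        username, _, password = credentials.partition(":")
        settings["username"] = username
        settings["password"] = password
    return settings
```

The resulting dictionary has the shape Playwright expects, for example `browser_type.launch(proxy=parse_proxy("user:secret@host:8080"))`.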

Return Data Structure

{
    'name': str,
    'citedby': int,              # All-time citations
    'citedby_recent': int,       # Citations in last 5 years
    'hindex': int,               # All-time h-index
    'hindex_recent': int,        # Recent h-index
    'i10index': int,             # All-time i10-index
    'i10index_recent': int,      # Recent i10-index
    'publications': [
        {
            'bib': {
                'title': str,
                'authors': str,
                'venue': str
            },
            'num_citations': int,    # Total citations
            'ytd_citations': int,    # Year-to-date citations
            'year': str
        },
        # ...
    ]
}
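
As an illustration of working with this structure, the sketch below ranks publications by one of the citation fields. `top_publications` is a hypothetical helper, and the `sample` dictionary is made-up data that only mirrors the documented keys.

```python
from typing import Any, Dict, List


def top_publications(
    data: Dict[str, Any], n: int = 3, key: str = "ytd_citations"
) -> List[Dict[str, Any]]:
    """Return the `n` publications with the highest value for `key`."""
    return sorted(data["publications"], key=lambda pub: pub[key], reverse=True)[:n]


# Made-up sample data following the documented structure
sample = {
    "name": "Example Author",
    "citedby": 120,
    "publications": [
        {"bib": {"title": "Paper A", "authors": "X", "venue": "V"},
         "num_citations": 80, "ytd_citations": 2, "year": "2019"},
        {"bib": {"title": "Paper B", "authors": "X", "venue": "V"},
         "num_citations": 30, "ytd_citations": 9, "year": "2022"},
        {"bib": {"title": "Paper C", "authors": "X", "venue": "V"},
         "num_citations": 10, "ytd_citations": 5, "year": "2023"},
    ],
}

for pub in top_publications(sample, n=2):
    print(f"{pub['bib']['title']}: {pub['ytd_citations']} YTD citations")
```

Passing `key="num_citations"` ranks by all-time citations instead.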

Configuration Options

Parameter	Type	Description
`scholar_id`	str	Google Scholar profile ID
`proxies`	Optional[Union[str, List[str]]]	Single proxy or list of proxies
`proxy_rotation`	str	Proxy rotation strategy ('sequential' or 'random')
`headless`	bool	Run browser in headless mode
`logger`	Optional[Logger]	Custom logger instance
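
The two `proxy_rotation` strategies can be pictured with the sketch below. It is not the library's implementation: `proxy_cycle` and its `rng` parameter are hypothetical, and it only shows the general idea of cycling through a pool in order versus picking an entry at random for each request.

```python
import itertools
import random
from typing import Iterator, List, Optional, Union


def proxy_cycle(
    proxies: Optional[Union[str, List[str]]],
    rotation: str = "sequential",
    rng: Optional[random.Random] = None,
) -> Iterator[Optional[str]]:
    """Return an endless iterator over proxies using the given strategy."""
    # Normalise the input: None means "no proxy", a string is a pool of one
    if proxies is None:
        pool: List[Optional[str]] = [None]
    elif isinstance(proxies, str):
        pool = [proxies]
    else:
        pool = list(proxies)
    if not pool:
        raise ValueError("proxy list is empty")

    if rotation == "sequential":
        # a, b, c, a, b, c, ...
        return itertools.cycle(pool)
    if rotation == "random":
        # Independent random pick on every request
        chooser = rng or random.Random()
        return (chooser.choice(pool) for _ in itertools.count())
    raise ValueError(f"unknown rotation strategy: {rotation!r}")
```

Calling `next()` on the returned iterator before each request yields the proxy to use for that request.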

Requirements

Python 3.8+
Playwright

Debug Mode

The library saves HTML pages for debugging purposes in a debug_pages directory when running with a logger at DEBUG level.
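
A minimal sketch of a debugging setup follows, using only the standard library. `make_debug_logger` and `list_debug_pages` are hypothetical helpers; the only detail taken from this README is the `debug_pages` directory name, and the names of the files saved inside it are not documented here.

```python
import logging
from pathlib import Path
from typing import List


def make_debug_logger(name: str = "scholar_debug") -> logging.Logger:
    """Return a logger set to DEBUG with a single stream handler attached."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


def list_debug_pages(directory: str = "debug_pages") -> List[str]:
    """List files in the debug directory, or [] if it does not exist."""
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(entry.name for entry in path.iterdir() if entry.is_file())
```

The logger would be passed as `logger=make_debug_logger()` to `fetch_scholar_data`, and `list_debug_pages()` called afterwards to see what was saved.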

License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

Version	Release date
0.1.77 (this version)	Jan 7, 2025
0.1.75	Jan 6, 2025
0.1.74	Jan 2, 2025
0.1.73	Jan 2, 2025
0.1.72	Nov 8, 2024
0.1.71	Nov 8, 2024
0.1.70	Nov 8, 2024
0.1.69	Nov 8, 2024
0.1.68	Nov 8, 2024
0.1.67	Nov 8, 2024
0.1.66	Nov 5, 2024
0.1.65	Nov 5, 2024
0.1.64	Oct 28, 2024
0.1.63	Oct 23, 2024
0.1.62	Oct 22, 2024
0.1.61	Oct 22, 2024
0.1.6	Oct 22, 2024
0.1.5	Oct 22, 2024
0.1.1	Oct 22, 2024
0.1.0	Oct 22, 2024

Download files

Source Distribution

pyscholarly-0.1.77.tar.gz (8.9 kB), uploaded Jan 7, 2025, Source

Built Distribution

pyscholarly-0.1.77-py3-none-any.whl (7.6 kB), uploaded Jan 7, 2025, Python 3

File details

Details for the file pyscholarly-0.1.77.tar.gz.

File metadata

Download URL: pyscholarly-0.1.77.tar.gz
Upload date: Jan 7, 2025
Size: 8.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.5

File hashes

Hashes for pyscholarly-0.1.77.tar.gz
Algorithm	Hash digest
SHA256	`afea0cc5c08eeefcfd8a593a8dbd290f018012554fd0a0280b903b6d048bd83f`
MD5	`d307c45e11420dd5fc238540e486c656`
BLAKE2b-256	`c9348eec995d4b7ca9596b90afccefcfc6e54f33389eff7b5e800e933dfce242`

File details

Details for the file pyscholarly-0.1.77-py3-none-any.whl.

File metadata

Download URL: pyscholarly-0.1.77-py3-none-any.whl
Upload date: Jan 7, 2025
Size: 7.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.5

File hashes

Hashes for pyscholarly-0.1.77-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aa62ce235e514b0d240594bfb376dcec14c7ae03d8548800b127d7ff24ee8b80`
MD5	`c3665cb1656e7495bee142ba3c609b74`
BLAKE2b-256	`64845def9ca1783af2d7b9e3598fa8a0618cc5b1c67b443687092df1106f722c`
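
The published digests can be used to check a downloaded file. The sketch below computes a file's SHA256 with the standard library and compares it with an expected value; `sha256_of` and `matches_published_hash` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of(path: Union[str, Path], chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Union[str, Path], expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("pyscholarly-0.1.77-py3-none-any.whl", "aa62ce23...")` with the full SHA256 value listed above should return `True` for an intact download.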
