PyScholarly
An async Python library for scraping Google Scholar profiles using Playwright, with support for proxy rotation and detailed citation tracking.
⚠️ Disclaimer: This project is for academic and research purposes only. Please be mindful of Google Scholar's terms of service and rate limiting. Use responsibly and at your own risk.
Installation
pip install pyscholarly
Features
- Async/await support for efficient data fetching
- Proxy support with rotation strategies:
  - Sequential rotation
  - Random rotation
  - Support for authenticated proxies (username/password)
- Comprehensive citation metrics:
  - All-time citations
  - Recent citations (last 5 years)
  - Year-to-date citations per publication
  - H-index and i10-index (all-time and recent)
- Publication details:
  - Title, authors, venue, year
  - Citation counts (all-time and YTD)
- Debug logging and HTML page saving for troubleshooting
- Headless mode support
- Modern Playwright-based scraping
Usage
Basic Usage
from pyscholarly import fetch_scholar_data
import asyncio

async def main():
    # Fetch data for a Google Scholar profile
    author_id = "SCHOLAR_ID"  # Replace with actual Google Scholar ID
    data = await fetch_scholar_data(author_id)

    print(f"Author: {data['name']}")
    print(f"Total citations: {data['citedby']}")
    print(f"Recent citations: {data['citedby_recent']}")
    print(f"h-index: {data['hindex']}")

    # Print publications with YTD citations
    print("\nPublications:")
    for pub in data['publications']:
        print(f"- {pub['bib']['title']} ({pub['year']}):")
        print(f"  - Total citations: {pub['num_citations']}")
        print(f"  - YTD citations: {pub['ytd_citations']}")

if __name__ == "__main__":
    asyncio.run(main())
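When several profiles are needed, the basic call can be wrapped so that only a few requests run at once. The following is a minimal sketch and not part of PyScholarly: `fetch_many`, `max_concurrent`, and `delay_seconds` are hypothetical names, and `fetch` is assumed to be any coroutine function that takes a profile ID, such as `fetch_scholar_data`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def fetch_many(
    scholar_ids: List[str],
    fetch: Callable[[str], Awaitable[Dict[str, Any]]],
    max_concurrent: int = 2,
    delay_seconds: float = 1.0,
) -> Dict[str, Any]:
    """Fetch several profiles, at most `max_concurrent` at a time.

    Returns a dict mapping each ID to its data, or to the exception
    that was raised while fetching it.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(scholar_id: str) -> Any:
        async with semaphore:
            try:
                return await fetch(scholar_id)
            finally:
                # Pause before releasing the slot to space out requests.
                await asyncio.sleep(delay_seconds)

    results = await asyncio.gather(
        *(fetch_one(sid) for sid in scholar_ids),
        return_exceptions=True,  # one failed profile does not cancel the rest
    )
    return dict(zip(scholar_ids, results))
```

It could be called as `asyncio.run(fetch_many(["ID_1", "ID_2"], fetch_scholar_data))`. Keeping the concurrency low and the delay non-zero is a deliberate choice here, in line with the rate-limiting note in the disclaimer.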
Advanced Usage with Proxies
import asyncio
import logging
from pyscholarly import fetch_scholar_data

# Setup logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

async def main():
    # Single proxy (pass as proxies=proxy)
    proxy = "username:password@host:port"

    # Or multiple proxies
    proxies = [
        "username1:password1@host1:port1",
        "username2:password2@host2:port2"
    ]

    data = await fetch_scholar_data(
        scholar_id="SCHOLAR_ID",
        proxies=proxies,
        proxy_rotation="random",  # or "sequential"
        headless=True,
        logger=logger
    )

if __name__ == "__main__":
    asyncio.run(main())
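Playwright's browser launch accepts proxy settings as a mapping with `server`, `username`, and `password` keys, so a `username:password@host:port` string has to be split at some point. How PyScholarly does this internally is not documented here; the sketch below shows one possible conversion. The function name `to_playwright_proxy` and the default `http` scheme are assumptions.

```python
from typing import Dict
from urllib.parse import unquote, urlsplit


def to_playwright_proxy(proxy: str, scheme: str = "http") -> Dict[str, str]:
    """Convert '[user:pass@]host:port' into Playwright-style proxy settings.

    Hypothetical helper: not part of PyScholarly's public API.
    """
    # Prepend a scheme if missing so urlsplit can locate the host part.
    url = proxy if "://" in proxy else f"{scheme}://{proxy}"
    parts = urlsplit(url)
    # Accessing parts.port also raises ValueError for a non-numeric port.
    if not parts.hostname or parts.port is None:
        raise ValueError(f"Expected [user:pass@]host:port, got {proxy!r}")

    settings = {"server": f"{parts.scheme}://{parts.hostname}:{parts.port}"}
    if parts.username:
        # Credentials may be percent-encoded in the proxy string.
        settings["username"] = unquote(parts.username)
        settings["password"] = unquote(parts.password or "")
    return settings
```

The resulting dictionary has the shape Playwright expects for the `proxy` argument of `launch()`. Proxies without credentials simply omit the `username` and `password` keys.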
Return Data Structure
{
    'name': str,
    'citedby': int,           # All-time citations
    'citedby_recent': int,    # Citations in last 5 years
    'hindex': int,            # All-time h-index
    'hindex_recent': int,     # Recent h-index
    'i10index': int,          # All-time i10-index
    'i10index_recent': int,   # Recent i10-index
    'publications': [
        {
            'bib': {
                'title': str,
                'authors': str,
                'venue': str
            },
            'num_citations': int,   # Total citations
            'ytd_citations': int,   # Year-to-date citations
            'year': str
        },
        # ...
    ]
}
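The per-publication counts in this structure are enough to recompute some of the profile-level metrics, which can be a useful sanity check. The sketch below uses the standard definitions (h-index: the largest h such that h publications have at least h citations each; i10-index: the number of publications with at least 10 citations). The helper names are hypothetical, and the results may differ from the reported `hindex` and `i10index` if the publication list is incomplete.

```python
from typing import Any, Dict, List


def h_index(citations: List[int]) -> int:
    """Largest h such that h publications have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


def i10_index(citations: List[int]) -> int:
    """Number of publications with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


def summarize(data: Dict[str, Any]) -> Dict[str, int]:
    """Recompute metrics from a dict shaped like the structure above."""
    counts = [pub["num_citations"] for pub in data["publications"]]
    return {
        "hindex": h_index(counts),
        "i10index": i10_index(counts),
        "ytd_citations": sum(pub["ytd_citations"] for pub in data["publications"]),
    }
```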
Configuration Options
| Parameter | Type | Description |
|---|---|---|
| scholar_id | str | Google Scholar profile ID |
| proxies | Optional[Union[str, List[str]]] | Single proxy or list of proxies |
| proxy_rotation | str | Proxy rotation strategy ('sequential' or 'random') |
| headless | bool | Run browser in headless mode |
| logger | Optional[Logger] | Custom logger instance |
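To make the two `proxy_rotation` values concrete, here is a minimal sketch of what sequential and random rotation typically mean. It is an illustration only and not PyScholarly's actual implementation; `proxy_sequence` and its `rng` parameter are hypothetical.

```python
import itertools
import random
from typing import Iterator, List, Optional


def proxy_sequence(
    proxies: List[str],
    rotation: str = "sequential",
    rng: Optional[random.Random] = None,
) -> Iterator[str]:
    """Return an endless iterator of proxies for the given strategy."""
    if not proxies:
        raise ValueError("proxies must not be empty")
    if rotation == "sequential":
        # Walk the list in order and wrap around at the end.
        return itertools.cycle(proxies)
    if rotation == "random":
        # Pick independently each time; the same proxy may repeat.
        rng = rng or random.Random()
        return (rng.choice(proxies) for _ in itertools.count())
    raise ValueError(f"Unknown rotation strategy: {rotation!r}")
```

Sequential rotation spreads requests evenly across the list, while random rotation makes the access pattern less predictable at the cost of occasionally reusing the same proxy twice in a row.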
Requirements
- Python 3.8+
- Playwright
Debug Mode
The library saves HTML pages for debugging purposes in a debug_pages directory when running with a logger at DEBUG level.
License
This project is licensed under the MIT License - see the LICENSE file for details.