Simple access to Google Scholar authors and citations
scholarly2
scholarly2 is a fork of scholarly, maintained independently and strictly for academic and nonprofit purposes. It retrieves author and publication data from Google Scholar and returns plain Python dictionaries. The current public workflow is:
- author profiles by Scholar ID with search_author_id(...)
- publication lookup with search_single_pub(...)
- publication search iterators with search_pubs(...)
- citation traversal with citedby(...) and search_citedby(...)
- BibTeX export with bibtex(...)
- journal ranking and mandate CSV endpoints
- proxy configuration, including automatic .env.socks5 loading and explicit load_socks5_proxy_file(path)
Google Scholar behavior changes over time, so exact ranking, citation counts, snippets, and query-token URLs can vary between runs. The parsed result examples below are representative outputs from the current code.
Installation
Install the latest release from PyPI:
pip3 install scholarly2
Install from GitHub:
pip3 install -U git+https://github.com/ma-ji/scholarly2.git
scholarly2 follows Semantic Versioning.
Optional dependencies
Tor support is deprecated since v1.5 and is not actively tested or supported. If you still want it:
pip3 install scholarly2[tor]
For zsh, quote the extra:
pip3 install scholarly2'[tor]'
Quick Start
from itertools import islice
from scholarly2 import scholarly
# Best-match publication lookup.
pub = scholarly.search_single_pub("10.1007/s11266-018-00057-5")
print(pub["bib"]["title"])
# Publication search returns an iterator, not a list.
results = list(islice(scholarly.search_pubs("machine learning"), 3))
for item in results:
    print(item["gsrank"], item["bib"]["title"])
# Author profile lookup is ID-first.
author = scholarly.search_author_id("Smr99uEAAAAJ")
print(author["name"], author["citedby"])
What You Get Back
All main APIs return plain dictionaries.
Common publication fields:
container_type, source, bib, filled, gsrank, pub_url, author_id, num_citations, citedby_url, url_related_articles, url_scholarbib, eprint_url
Common author fields:
container_type, scholar_id, source, name, affiliation, interests, email_domain, homepage, citedby, filled
filled is important:
- filled: False means the object only contains the fields parsed from the current search result or profile page.
- filled: True means scholarly.fill(...) fetched and merged extra metadata.
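As a sketch of how this flag is meant to be used (ensure_filled is an illustrative helper, not part of the scholarly2 API):

```python
# Illustrative helper (not part of scholarly2): fetch extra metadata only
# when the result has not been filled yet.
def ensure_filled(scholarly_module, result):
    """Return `result`, calling fill(...) once if `filled` is still False."""
    if not result.get("filled", False):
        return scholarly_module.fill(result)
    return result
```

With the real module you would call ensure_filled(scholarly, pub) and avoid redundant network requests for already-filled objects.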
Parsed Result Examples
search_single_pub(...)
Best-match publication lookup is useful for exact titles and DOIs.
from scholarly2 import scholarly
pub = scholarly.search_single_pub("10.1007/s11266-018-00057-5")
Representative parsed result:
{'container_type': 'Publication',
'source': <PublicationSource.PUBLICATION_SEARCH_SNIPPET: 'PUBLICATION_SEARCH_SNIPPET'>,
'bib': {'title': 'A century of nonprofit studies: Scaling the knowledge of the field',
'author': ['J Ma', 'S Konrath'],
'pub_year': '2018',
'venue': 'VOLUNTAS: International Journal of Voluntary and ...',
'abstract': 'This empirical study examines knowledge production between 1925 and 2015 in nonprofit and philanthropic studies from quantitative and thematic perspectives. Quantitative results suggest that scholars in this field have been actively generating a considerable amount of literature and a solid intellectual base for developing this field toward a new discipline. Thematic analyses suggest that knowledge production in this field is also growing in cohesion-several main themes have been formed and actively advanced since 1980s, and the study of volunteering can be identified as a unique core theme of this field. The lack of geographic and cultural diversity is a critical challenge for advancing nonprofit studies. New paradigms are needed for developing this research field and mitigating the tension between academia and practice. Methodological and pedagogical implications, limitations, and future studies are discussed.'},
'filled': False,
'gsrank': 1,
'pub_url': 'https://www.cambridge.org/core/journals/voluntas/article/century-of-nonprofit-studies-scaling-the-knowledge-of-the-field/...',
'author_id': ['iVGd04UAAAAJ', '-bDW1IwAAAAJ'],
'url_scholarbib': '/scholar?hl=en&q=info:veUUt9BplfoJ:scholar.google.com/&output=cite&scirp=0&hl=en',
'url_add_sclib': '/citations?...&info=veUUt9BplfoJ&json=',
'num_citations': 124,
'citedby_url': '/scholar?cites=18056454626157585853&as_sdt=2005&sciodt=0,5&hl=en',
'url_related_articles': '/scholar?q=related:veUUt9BplfoJ:scholar.google.com/&scioq=10.1007/s11266-018-00057-5&hl=en&as_sdt=0,5',
'eprint_url': 'https://www.cambridge.org/core/services/aop-cambridge-core/content/view/...pdf'}
Notes:
- search_single_pub(...) returns one best-match publication.
- When Scholar exposes the expanded "Show more" abstract markup, scholarly2 prefers that full abstract.
- For exact DOI or exact-title lookups, this path often returns richer abstracts than a broad search page.
search_pubs(...)
search_pubs(...) returns an iterator over search results. next(...) gives only the first result. Use itertools.islice or a loop if you want more than one.
from itertools import islice
from scholarly2 import scholarly
results = list(islice(scholarly.search_pubs("machine learning"), 3))
Notes:
- search_pubs(...) returns whatever the live Scholar result page exposes for each row.
- If Scholar serves the expanded abstract markup for a result row, scholarly2 returns the full abstract.
- If Scholar only serves the short snippet, scholarly2 returns the snippet.
fill(...) on a publication
Use fill(...) when you want additional publication metadata after the initial search result.
from scholarly2 import scholarly
pub = scholarly.search_single_pub("10.1007/s11266-018-00057-5")
filled_pub = scholarly.fill(pub)
fill(...) is where publication objects usually gain fields such as publisher, journal, pages, volume, number, pub_type, and bib_id.
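A quick offline way to see what fill(...) contributed is to diff the key sets before and after. The two dicts below are illustrative stand-ins; the publisher, journal, and volume values are made up for the example, not real fetched data:

```python
# Stand-in `bib` dicts showing the before/after shape of a fill(...) call.
# The added field values here are invented for illustration.
bib_before = {"title": "A century of nonprofit studies", "pub_year": "2018"}
bib_after = {"title": "A century of nonprofit studies", "pub_year": "2018",
             "publisher": "Example Press", "journal": "VOLUNTAS", "volume": "29"}

added_fields = sorted(set(bib_after) - set(bib_before))
print(added_fields)  # ['journal', 'publisher', 'volume']
```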
search_author_id(...)
Anonymous Google Scholar author-name discovery is not part of the current public workflow. Start from a stable Scholar profile ID.
from scholarly2 import scholarly
author = scholarly.search_author_id("Smr99uEAAAAJ")
You can then fetch more sections:
author = scholarly.fill(author, sections=['basics', 'indices', 'counts', 'publications'])
Search Semantics
search_single_pub(...) vs search_pubs(...)
- search_single_pub(query) returns one best-match result.
- search_pubs(query) returns an iterator over search result rows.
- next(scholarly.search_pubs(...)) returns only the first result.
- Use itertools.islice(...) or a loop to consume more results.
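The iterator semantics can be demonstrated offline with a stand-in generator; fake_search_pubs below only mimics the shape of search_pubs(...) results and makes no network requests:

```python
from itertools import islice

# Stand-in generator mimicking the lazy result rows of search_pubs(...).
def fake_search_pubs(query):
    for rank in range(1, 1000):
        yield {"gsrank": rank, "bib": {"title": f"{query} result {rank}"}}

first_only = next(fake_search_pubs("machine learning"))            # first row only
top_three = list(islice(fake_search_pubs("machine learning"), 3))  # first 3 rows
```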
filled
- filled: False means initial parsed result.
- filled: True means additional metadata was fetched.
- Authors use a list of filled sections, such as ['basics'] or ['basics', 'indices', 'counts'].
Finding Author IDs
If you have a Scholar profile URL like:
https://scholar.google.com/citations?user=4bahYMkAAAAJ&hl=en
Use the user parameter value with search_author_id(...).
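The user parameter can be pulled out of a profile URL with the standard library; scholar_id_from_url is an illustrative helper, not part of the scholarly2 API:

```python
from urllib.parse import parse_qs, urlparse

# Illustrative helper (not part of scholarly2): extract the `user` query
# parameter from a Scholar profile URL.
def scholar_id_from_url(profile_url):
    return parse_qs(urlparse(profile_url).query)["user"][0]

print(scholar_id_from_url(
    "https://scholar.google.com/citations?user=4bahYMkAAAAJ&hl=en"
))  # 4bahYMkAAAAJ
```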
You can also collect author IDs from publication results:
from scholarly2 import scholarly
pub = scholarly.search_single_pub("Creating correct blur and its effect on accommodation")
print(pub["author_id"])
# ['4bahYMkAAAAJ', '3xJXtlwAAAAJ', 'Smr99uEAAAAJ']
Citations and BibTeX
Get citations for a publication:
from itertools import islice
from scholarly2 import scholarly
pub = scholarly.search_single_pub("10.1007/s11266-018-00057-5")
first_citations = list(islice(scholarly.citedby(pub), 3))
Export BibTeX:
from scholarly2 import scholarly
pub = scholarly.search_single_pub("10.1007/s11266-018-00057-5")
print(scholarly.bibtex(pub))
Proxies
Google Scholar rate-limits aggressively. If you make enough requests, you should expect blocking and captcha pages. Use proxies for anything non-trivial.
There are many proxy providers available; I often use IPRoyal (disclaimer: this is a referral link). You are welcome to use your own, but make sure you choose residential proxies (the product may be named differently depending on the provider).
For simplicity, only SOCKS5 workflows are recommended. The legacy methods ScraperAPI(), Luminati(), FreeProxies(), SingleProxy(), Tor_External(), and Tor_Internal() remain for compatibility but are deprecated and will be removed in future releases.
Automatic .env.socks5 loading
If a .env.socks5 file exists in your working directory, scholarly2 loads it automatically at import time. Put one proxy per line in:
USER:PASS@HOST:PORT
Example:
user1:password1@127.0.0.1:1080
user2:password2@proxy.example.com:2080
See .env.socks5.example for the expected format.
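A line in this format can be split into its parts with a small sketch; parse_proxy_line is illustrative only, not a scholarly2 helper (splitting on the last "@" keeps passwords that contain "@" intact):

```python
# Illustrative parser (not part of scholarly2): split one proxy line of the
# USER:PASS@HOST:PORT format into its parts.
def parse_proxy_line(line):
    credentials, _, endpoint = line.strip().rpartition("@")
    user, _, password = credentials.partition(":")
    host, _, port = endpoint.rpartition(":")
    return {"user": user, "password": password, "host": host, "port": int(port)}

parse_proxy_line("user1:password1@127.0.0.1:1080")
# {'user': 'user1', 'password': 'password1', 'host': '127.0.0.1', 'port': 1080}
```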
Direct SOCKS5 configuration
Use ProxyGenerator.Socks5Proxies(...) when you want to configure the proxy pool in code:
from scholarly2 import ProxyGenerator, scholarly
pg = ProxyGenerator()
pg.Socks5Proxies([
    "user1:password1@127.0.0.1:1080",
    "user2:password2@proxy.example.com:2080",
])
scholarly.use_proxy(pg)
pub = scholarly.search_single_pub("10.1007/s11266-018-00057-5")
If you pass only one proxy generator to scholarly.use_proxy(pg), that same SOCKS5 pool is reused for all requests.
Explicit file loading
Use load_socks5_proxy_file(path) to load a proxy file from any location at runtime:
from scholarly2 import scholarly
ok = scholarly.load_socks5_proxy_file("/path/to/my.env.socks5")
if ok:
    print("Proxies loaded")
This is useful when your proxy file lives outside the working directory or has a non-standard name. The file format is the same one-proxy-per-line format as .env.socks5.
Deprecated legacy proxy methods
ProxyGenerator.ScraperAPI(), Luminati(), FreeProxies(), SingleProxy(), Tor_External(), and Tor_Internal() are deprecated compatibility paths. Existing code can still call them, but new setups should use .env.socks5, Socks5Proxies(...), Socks5ProxyFile(...), or load_socks5_proxy_file(path).
Availability Notes
Generally usable anonymously:
- search_author_id
- search_pubs
- search_single_pub
- search_citedby
- fill
- citedby
- bibtex
- journal endpoints
- mandates CSV retrieval
Google may gate these Citations author-discovery endpoints behind sign-in:
- search_keyword
- search_keywords
- search_author_custom_url
- search_org
- search_author_by_organization
If you need a reliable author workflow, prefer search_author_id(...).
Tests
From the repository root:
python -m unittest -v testdata.test_module
Target a smaller subset while iterating:
python -m unittest -v testdata.test_module.TestPublicationParser
python -m unittest -v testdata.test_module.TestNavigator
Documentation
See the hosted docs for the full API reference and quickstart.
Contributing
Contributions are welcome. Please create an issue, fork the repository, and submit a pull request. See .github/CONTRIBUTING.md for details.
License
The original code that this project was forked from was released by Luciano Bello under a WTFPL license. In keeping with that spirit, all code is released under the Unlicense.
File details
Details for the file scholarly2-2.0.0.tar.gz.
File metadata
- Download URL: scholarly2-2.0.0.tar.gz
- Upload date:
- Size: 45.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 175cea4a6b195850b8b3568d8416a4b2b6d48b4bd388cf58e57af2a28888fd4a |
| MD5 | d6e0fb9cdff7ff41dcc9b40e4c3c9684 |
| BLAKE2b-256 | b7c84dfb14c87d74fb9e95aecca0d44384b6131eebeae1a4fcff951a0de712b3 |
File details
Details for the file scholarly2-2.0.0-py3-none-any.whl.
File metadata
- Download URL: scholarly2-2.0.0-py3-none-any.whl
- Upload date:
- Size: 44.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ec9012a57cd4fccfbba647726aaedbeeffe98a502d45bccbda6ea9f665c23a96 |
| MD5 | b6eb9ff93332ff3bccab0ac660c231e3 |
| BLAKE2b-256 | b805a26311d736afd6efd43b74c00c65999a788f345a6253e6cdce28c3d2f1f5 |