Simple access to Google Scholar authors and citations

scholarly2

scholarly2 is a fork of scholarly, maintained independently and strictly for academic and nonprofit purposes. It retrieves author and publication data from Google Scholar and returns plain Python dictionaries. The current public workflow is:

  • author profiles by Scholar ID with search_author_id(...)
  • publication lookup with search_single_pub(...)
  • publication search iterators with search_pubs(...)
  • citation traversal with citedby(...) and search_citedby(...)
  • BibTeX export with bibtex(...)
  • journal ranking and mandate CSV endpoints
  • proxy configuration, including automatic .env.socks5 loading and explicit load_socks5_proxy_file(path)

Google Scholar behavior changes over time, so exact ranking, citation counts, snippets, and query-token URLs can vary between runs. The parsed result examples below are representative outputs from the current code.

Installation

Install the latest release from PyPI:

pip3 install scholarly2

Install from GitHub:

pip3 install -U git+https://github.com/ma-ji/scholarly2.git

scholarly2 follows Semantic Versioning.

Optional dependencies

Tor support has been deprecated since v1.5 and is not actively tested or supported. If you still want it:

pip3 install scholarly2[tor]

For zsh, quote the extra:

pip3 install scholarly2'[tor]'

Quick Start

from itertools import islice
from scholarly2 import scholarly

# Best-match publication lookup.
pub = scholarly.search_single_pub("10.1007/s11266-018-00057-5")
print(pub["bib"]["title"])

# Publication search returns an iterator, not a list.
results = list(islice(scholarly.search_pubs("machine learning"), 3))
for item in results:
    print(item["gsrank"], item["bib"]["title"])

# Author profile lookup is ID-first.
author = scholarly.search_author_id("Smr99uEAAAAJ")
print(author["name"], author["citedby"])

What You Get Back

All main APIs return plain dictionaries.

Common publication fields:

  • container_type
  • source
  • bib
  • filled
  • gsrank
  • pub_url
  • author_id
  • num_citations
  • citedby_url
  • url_related_articles
  • url_scholarbib
  • eprint_url

Common author fields:

  • container_type
  • scholar_id
  • source
  • name
  • affiliation
  • interests
  • email_domain
  • homepage
  • citedby
  • filled

filled is important:

  • filled: False means the object only contains the fields parsed from the current search result or profile page.
  • filled: True means scholarly.fill(...) fetched and merged extra metadata.

Parsed Result Examples

search_single_pub(...)

Best-match publication lookup is useful for exact titles and DOIs.

from scholarly2 import scholarly

pub = scholarly.search_single_pub("10.1007/s11266-018-00057-5")

Representative parsed result:

{'container_type': 'Publication',
 'source': <PublicationSource.PUBLICATION_SEARCH_SNIPPET: 'PUBLICATION_SEARCH_SNIPPET'>,
 'bib': {'title': 'A century of nonprofit studies: Scaling the knowledge of the field',
         'author': ['J Ma', 'S Konrath'],
         'pub_year': '2018',
         'venue': 'VOLUNTAS: International Journal of Voluntary and ...',
         'abstract': 'This empirical study examines knowledge production between 1925 and 2015 in nonprofit and philanthropic studies from quantitative and thematic perspectives. Quantitative results suggest that scholars in this field have been actively generating a considerable amount of literature and a solid intellectual base for developing this field toward a new discipline. Thematic analyses suggest that knowledge production in this field is also growing in cohesion-several main themes have been formed and actively advanced since 1980s, and the study of volunteering can be identified as a unique core theme of this field. The lack of geographic and cultural diversity is a critical challenge for advancing nonprofit studies. New paradigms are needed for developing this research field and mitigating the tension between academia and practice. Methodological and pedagogical implications, limitations, and future studies are discussed.'},
 'filled': False,
 'gsrank': 1,
 'pub_url': 'https://www.cambridge.org/core/journals/voluntas/article/century-of-nonprofit-studies-scaling-the-knowledge-of-the-field/...',
 'author_id': ['iVGd04UAAAAJ', '-bDW1IwAAAAJ'],
 'url_scholarbib': '/scholar?hl=en&q=info:veUUt9BplfoJ:scholar.google.com/&output=cite&scirp=0&hl=en',
 'url_add_sclib': '/citations?...&info=veUUt9BplfoJ&json=',
 'num_citations': 124,
 'citedby_url': '/scholar?cites=18056454626157585853&as_sdt=2005&sciodt=0,5&hl=en',
 'url_related_articles': '/scholar?q=related:veUUt9BplfoJ:scholar.google.com/&scioq=10.1007/s11266-018-00057-5&hl=en&as_sdt=0,5',
 'eprint_url': 'https://www.cambridge.org/core/services/aop-cambridge-core/content/view/...pdf'}

Notes:

  • search_single_pub(...) returns one best-match publication.
  • When Scholar exposes the expanded Show more abstract markup, scholarly2 prefers that full abstract.
  • For exact DOI or exact-title lookups, this path often returns richer abstracts than a broad search page.
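The URL fields in the representative result above (citedby_url, url_related_articles, url_scholarbib) are relative paths. The following is a minimal sketch, using only the standard library, of turning one into an absolute URL and reading its cites parameter. The base URL https://scholar.google.com and the helper names absolute_scholar_url and cites_id are assumptions made for illustration; they are not part of scholarly2.

```python
from urllib.parse import parse_qs, urljoin, urlparse

# Assumed base: the relative paths in the parsed result do not include a host.
SCHOLAR_BASE = "https://scholar.google.com"


def absolute_scholar_url(relative_url):
    """Join a relative Scholar path onto the assumed base URL (hypothetical helper)."""
    return urljoin(SCHOLAR_BASE, relative_url)


def cites_id(citedby_url):
    """Return the value of the `cites` query parameter, or None if it is absent."""
    values = parse_qs(urlparse(citedby_url).query).get("cites")
    return values[0] if values else None


# Value copied from the representative result above.
citedby_url = "/scholar?cites=18056454626157585853&as_sdt=2005&sciodt=0,5&hl=en"
print(absolute_scholar_url(citedby_url))
print(cites_id(citedby_url))
```

Because query-token URLs can vary between runs, treat the extracted value as a snapshot rather than a stable key.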

search_pubs(...)

search_pubs(...) returns an iterator over search results. next(...) gives only the first result. Use itertools.islice or a loop if you want more than one.

from itertools import islice
from scholarly2 import scholarly

results = list(islice(scholarly.search_pubs("machine learning"), 3))

Notes:

  • search_pubs(...) returns whatever the live Scholar result page exposes for each row.
  • If Scholar serves the expanded abstract markup for a result row, scholarly2 returns the full abstract.
  • If Scholar only serves the short snippet, scholarly2 returns the snippet.
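The iterator behavior described above can be tried without touching the network. In this sketch, fake_search_pubs is a hypothetical stand-in generator that yields result-like dictionaries; it only illustrates how next(...) and itertools.islice consume an iterator and says nothing about how scholarly2 pages through results internally.

```python
from itertools import islice


def fake_search_pubs(query):
    """Hypothetical stand-in: lazily yields five result-like dictionaries."""
    for rank in range(1, 6):
        yield {"gsrank": rank, "bib": {"title": f"{query} result {rank}"}}


results = fake_search_pubs("machine learning")

first = next(results)                # consumes exactly one row
next_two = list(islice(results, 2))  # continues where next() stopped
remaining = list(results)            # drains whatever is left

print(first["gsrank"])
print([row["gsrank"] for row in next_two])
print([row["gsrank"] for row in remaining])
```

An iterator is consumed as you read it, so call search_pubs(...) again if you need to start over from the first result.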

fill(...) on a publication

Use fill(...) when you want additional publication metadata after the initial search result.

from scholarly2 import scholarly

pub = scholarly.search_single_pub("10.1007/s11266-018-00057-5")
filled_pub = scholarly.fill(pub)

fill(...) is where publication objects usually gain fields such as publisher, journal, pages, volume, number, pub_type, and bib_id.
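To see which fields a fill(...) call contributed, you can compare key sets before and after. This is a minimal sketch under stated assumptions: added_fields is a hypothetical helper, the two dictionaries hold placeholder values rather than real Scholar output, and placing the new fields under bib is an assumption. It also assumes you keep a copy of the result (for example with copy.deepcopy) before calling fill(...), since whether fill(...) modifies its argument in place is not stated here.

```python
def added_fields(before, after):
    """Return (new top-level keys, new bib keys) present only in `after`."""
    top_level = sorted(set(after) - set(before))
    bib = sorted(set(after.get("bib", {})) - set(before.get("bib", {})))
    return top_level, bib


# Hypothetical before/after dictionaries with placeholder values.
before = {"filled": False, "bib": {"title": "Example title", "pub_year": "2018"}}
after = {
    "filled": True,
    "bib": {
        "title": "Example title",
        "pub_year": "2018",
        "journal": "placeholder",
        "volume": "placeholder",
        "pages": "placeholder",
    },
}

top_level, bib = added_fields(before, after)
print("new top-level keys:", top_level)
print("new bib keys:", bib)
```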

search_author_id(...)

Anonymous Google Scholar author-name discovery is not part of the current public workflow. Start from a stable Scholar profile ID.

from scholarly2 import scholarly

author = scholarly.search_author_id("Smr99uEAAAAJ")

You can then fetch more sections:

author = scholarly.fill(author, sections=['basics', 'indices', 'counts', 'publications'])

Search Semantics

search_single_pub(...) vs search_pubs(...)

  • search_single_pub(query) returns one best-match result.
  • search_pubs(query) returns an iterator over search result rows.
  • next(scholarly.search_pubs(...)) returns only the first result.
  • Use itertools.islice(...) or a loop to consume more results.

filled

  • filled: False means initial parsed result.
  • filled: True means additional metadata was fetched.
  • Authors use a list of filled sections, such as ['basics'] or ['basics', 'indices', 'counts'].
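Since filled is a boolean on publications and a list of sections on authors, code that handles both needs two checks. The sketch below assumes exactly that shape; is_publication_filled and missing_author_sections are hypothetical helpers, and the sample dictionaries are illustrative rather than real output.

```python
def is_publication_filled(pub):
    """True only when the publication dict carries filled: True."""
    return pub.get("filled") is True


def missing_author_sections(author, wanted):
    """Return the sections in `wanted` that are not yet listed in author['filled']."""
    filled = author.get("filled")
    # Treat anything that is not a collection of section names as "nothing filled".
    done = set(filled) if isinstance(filled, (list, tuple, set)) else set()
    return [section for section in wanted if section not in done]


# Illustrative dictionaries, not real Scholar output.
pub = {"filled": False, "bib": {"title": "Example title"}}
author = {"name": "Example Author", "filled": ["basics"]}

wanted = ["basics", "indices", "counts", "publications"]
print(is_publication_filled(pub))
print(missing_author_sections(author, wanted))
```

The list returned by missing_author_sections could be passed as the sections argument of scholarly.fill(...) to request only what is not there yet.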

Finding Author IDs

If you have a Scholar profile URL like:

https://scholar.google.com/citations?user=4bahYMkAAAAJ&hl=en

use the value of its user parameter with search_author_id(...).
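Extracting the ID from a profile URL can be done with the standard library. This is a small sketch; scholar_id_from_url is a hypothetical helper name, not a scholarly2 function.

```python
from urllib.parse import parse_qs, urlparse


def scholar_id_from_url(profile_url):
    """Return the `user` query parameter of a Scholar profile URL, or None."""
    values = parse_qs(urlparse(profile_url).query).get("user")
    return values[0] if values else None


profile_url = "https://scholar.google.com/citations?user=4bahYMkAAAAJ&hl=en"
scholar_id = scholar_id_from_url(profile_url)
print(scholar_id)
```

The returned string is what you would then pass to scholarly.search_author_id(...).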

You can also collect author IDs from publication results:

from scholarly2 import scholarly

pub = scholarly.search_single_pub("Creating correct blur and its effect on accommodation")
print(pub["author_id"])
# ['4bahYMkAAAAJ', '3xJXtlwAAAAJ', 'Smr99uEAAAAJ']
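If you also want to know which name each ID belongs to, one option is to pair bib["author"] with author_id by position. This sketch assumes the two lists are aligned and that an author without a profile has an empty entry; neither is confirmed here, so check the pairing against the live result. authors_with_ids is a hypothetical helper, and the sample values come from the representative search_single_pub(...) result shown earlier.

```python
def authors_with_ids(pub):
    """Pair author names with Scholar IDs by position, skipping empty IDs.

    Assumes pub['bib']['author'] is a list aligned with pub['author_id'].
    """
    names = pub.get("bib", {}).get("author", [])
    ids = pub.get("author_id", [])
    return [(name, scholar_id) for name, scholar_id in zip(names, ids) if scholar_id]


# Fields copied from the representative parsed result.
pub = {
    "bib": {"author": ["J Ma", "S Konrath"]},
    "author_id": ["iVGd04UAAAAJ", "-bDW1IwAAAAJ"],
}

for name, scholar_id in authors_with_ids(pub):
    print(name, scholar_id)
```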

Citations and BibTeX

Get citations for a publication:

from itertools import islice
from scholarly2 import scholarly

pub = scholarly.search_single_pub("10.1007/s11266-018-00057-5")
first_citations = list(islice(scholarly.citedby(pub), 3))

Export BibTeX:

from scholarly2 import scholarly

pub = scholarly.search_single_pub("10.1007/s11266-018-00057-5")
print(scholarly.bibtex(pub))

Proxies

Google Scholar rate-limits aggressively. If you make enough requests, you should expect blocking and captcha pages. Use proxies for anything non-trivial.

Many proxy providers are available; I often use IPRoyal (disclaimer: this is a referral link). You are welcome to use your own, but make sure you choose Residential Proxies (which may be named differently depending on the provider).

For simplicity, only SOCKS5 workflows are recommended. The legacy methods ScraperAPI(), Luminati(), FreeProxies(), SingleProxy(), Tor_External(), and Tor_Internal() remain for compatibility but are deprecated and will be removed in future releases.

Automatic .env.socks5 loading

If a .env.socks5 file exists in your working directory, scholarly2 loads it automatically at import time. Put one proxy per line in the following format:

USER:PASS@HOST:PORT

Example:

user1:password1@127.0.0.1:1080
user2:password2@proxy.example.com:2080

See .env.socks5.example for the expected format.
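Before handing a proxy file to the library, it can help to check that every line matches USER:PASS@HOST:PORT. The following is a validation sketch only: parse_socks5_line is a hypothetical helper, and it does not reproduce how scholarly2 itself reads the file (for example, it makes no assumption about comment lines being allowed).

```python
def parse_socks5_line(line):
    """Split USER:PASS@HOST:PORT into a dict; return None for blank lines."""
    line = line.strip()
    if not line:
        return None
    # Split on the last "@" so the host part is always taken from the end.
    credentials, _, endpoint = line.rpartition("@")
    user, _, password = credentials.partition(":")
    host, _, port = endpoint.rpartition(":")
    if not (user and password and host and port.isdigit()):
        raise ValueError(f"expected USER:PASS@HOST:PORT, got {line!r}")
    return {"user": user, "password": password, "host": host, "port": int(port)}


# The two example lines from above, as they would appear in the file.
sample_text = (
    "user1:password1@127.0.0.1:1080\n"
    "user2:password2@proxy.example.com:2080\n"
)

proxies = [p for p in map(parse_socks5_line, sample_text.splitlines()) if p]
for proxy in proxies:
    print(proxy["host"], proxy["port"])
```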

Direct SOCKS5 configuration

Use ProxyGenerator.Socks5Proxies(...) when you want to configure the proxy pool in code:

from scholarly2 import ProxyGenerator, scholarly

pg = ProxyGenerator()
pg.Socks5Proxies([
    "user1:password1@127.0.0.1:1080",
    "user2:password2@proxy.example.com:2080",
])
scholarly.use_proxy(pg)

pub = scholarly.search_single_pub("10.1007/s11266-018-00057-5")

If you pass only one proxy generator to scholarly.use_proxy(pg), that same SOCKS5 pool is reused for all requests.

Explicit file loading

Use load_socks5_proxy_file(path) to load a proxy file from any location at runtime:

from scholarly2 import scholarly

ok = scholarly.load_socks5_proxy_file("/path/to/my.env.socks5")
if ok:
    print("Proxies loaded")

This is useful when your proxy file lives outside the working directory or has a non-standard name. The file format is the same one-proxy-per-line format as .env.socks5.

Deprecated legacy proxy methods

ProxyGenerator.ScraperAPI(), Luminati(), FreeProxies(), SingleProxy(), Tor_External(), and Tor_Internal() are deprecated compatibility paths. Existing code can still call them, but new setups should use .env.socks5, Socks5Proxies(...), Socks5ProxyFile(...), or load_socks5_proxy_file(path).

Availability Notes

Generally usable anonymously:

  • search_author_id
  • search_pubs
  • search_single_pub
  • search_citedby
  • fill
  • citedby
  • bibtex
  • journal endpoints
  • mandates CSV retrieval

Google may gate these Citations author-discovery endpoints behind sign-in:

  • search_keyword
  • search_keywords
  • search_author_custom_url
  • search_org
  • search_author_by_organization

If you need a reliable author workflow, prefer search_author_id(...).

Tests

From the repository root:

python -m unittest -v testdata.test_module

Target a smaller subset while iterating:

python -m unittest -v testdata.test_module.TestPublicationParser
python -m unittest -v testdata.test_module.TestNavigator

Documentation

See the hosted docs for the full API reference and quickstart.

Contributing

Contributions are welcome. Please create an issue, fork the repository, and submit a pull request. See .github/CONTRIBUTING.md for details.

License

The original code that this project was forked from was released by Luciano Bello under a WTFPL license. In keeping with that spirit, all code is released under the Unlicense.
