paperscraper

Overview

paperscraper is a Python package, distributed via PyPI, that facilitates scraping publication metadata from PubMed and from preprint servers such as arXiv, medRxiv, bioRxiv or chemRxiv. It provides a streamlined interface for scraping metadata and comes with simple postprocessing functions and plotting routines for meta-analysis.

Getting started

pip install paperscraper

This is enough to query PubMed, arXiv or Google Scholar.

Download X-rxiv Dumps

However, to scrape publication data from the preprint servers bioRxiv, medRxiv or chemRxiv, the setup is different: the entire dump is downloaded and stored in the server_dumps folder in .jsonl format (one paper per line).

from paperscraper.get_dumps import chemrxiv, biorxiv, medrxiv
chemrxiv()  # Takes ~1h and should result in ~10 MB file
medrxiv()  # Takes ~30min and should result in ~35 MB file
biorxiv()  # Takes ~2.5h and should result in ~250 MB file

NOTE: For chemrxiv you need to create an access token in your account on figshare.com. Either pass the token as a keyword argument (chemrxiv(token=your_token)) or save it under ~/.config/figshare/chemrxiv.txt.

NOTE: Once the dumps are stored, please make sure to restart the Python interpreter so that the changes take effect.
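
For convenience, here is a minimal sketch for storing the token at the default path mentioned above (the token string is a placeholder you must replace with your own):

from pathlib import Path

# Placeholder: paste the access token created in your figshare.com account.
token = 'YOUR_FIGSHARE_TOKEN'

# Store it at the default location that paperscraper checks (see note above).
token_file = Path.home() / '.config' / 'figshare' / 'chemrxiv.txt'
token_file.parent.mkdir(parents=True, exist_ok=True)
token_file.write_text(token)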

Examples

paperscraper is built on top of the packages pymed, arxiv and scholarly.

Publication keyword search

Consider you want to perform a publication keyword search with the query: COVID-19 AND Artificial Intelligence AND Medical Imaging. A query is given as a list of lists: terms within an inner list (synonyms) are combined with OR, and the inner lists themselves are combined with AND.

  • Scrape papers from PubMed:
from paperscraper.pubmed import get_and_dump_pubmed_papers
covid19 = ['COVID-19', 'SARS-CoV-2']
ai = ['Artificial intelligence', 'Deep learning', 'Machine learning']
mi = ['Medical imaging']
query = [covid19, ai, mi]

get_and_dump_pubmed_papers(query, output_filepath='covid19_ai_imaging.jsonl')
  • Scrape papers from arXiv:
from paperscraper.arxiv import get_and_dump_arxiv_papers

get_and_dump_arxiv_papers(query, output_filepath='covid19_ai_imaging.jsonl')
  • Scrape papers from bioRxiv, medRxiv or chemRxiv:
from paperscraper.xrxiv.xrxiv_query import XRXivQuery

querier = XRXivQuery('server_dumps/chemrxiv_2020-11-10.jsonl')
querier.search_keywords(query, output_filepath='covid19_ai_imaging.jsonl')

You can also use dump_queries as a wrapper to run a list of queries against all available databases; the results are stored in per-database subfolders of the given root directory.

from paperscraper import dump_queries

queries = [[covid19, ai, mi], [covid19, ai], [ai]]
dump_queries(queries, '.')
  • Scrape papers from Google Scholar:

Thanks to scholarly, there is an endpoint for Google Scholar too. Unlike the others, it does not understand Boolean expressions; use it just like the Google Scholar search field.

from paperscraper.scholar import get_and_dump_scholar_papers
topic = 'Machine Learning'
get_and_dump_scholar_papers(topic)
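
All dump functions write .jsonl files with one JSON record per line, so the results can be postprocessed with the standard library alone. A minimal sketch, assuming the PubMed query above has been run (the exact metadata fields depend on the database that was queried):

import json

# Load the dumped papers back in: one JSON object (paper) per line.
with open('covid19_ai_imaging.jsonl', 'r') as f:
    papers = [json.loads(line) for line in f]

print(len(papers), 'papers retrieved')
print(sorted(papers[0].keys()))  # inspect the available metadata fields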

Citation search

A benefit of the Scholar endpoint is that a paper's citation count can be fetched:

from paperscraper.scholar import get_citations_from_title
title = 'Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I.'
get_citations_from_title(title)

NOTE: The Scholar endpoint does not require authentication, but since it regularly prompts with captchas, it is difficult to use at scale.

Plotting

When multiple query searches are performed, two types of plots can be generated automatically: Venn diagrams and bar plots.

Barplots

Compare the temporal evolution of different queries across different servers.

import os

from paperscraper import QUERY_FN_DICT
from paperscraper.postprocessing import aggregate_paper
from paperscraper.utils import get_filename_from_query

# Define search terms and their synonyms
ml = ['Deep learning', 'Neural Network', 'Machine learning']
mol = ['molecule', 'molecular', 'drug', 'ligand', 'compound']
gnn = ['gcn', 'gnn', 'graph neural', 'graph convolutional', 'molecular graph']
smiles = ['SMILES', 'Simplified molecular']
fp = ['fingerprint', 'molecular fingerprint', 'fingerprints']

# Define queries
queries = [[ml, mol, smiles], [ml, mol, fp], [ml, mol, gnn]]

root = '../keyword_dumps'

data_dict = dict()
for query in queries:
    filename = get_filename_from_query(query)
    data_dict[filename] = dict()
    for db in QUERY_FN_DICT:
        # Assuming the keyword search has been performed already
        with open(os.path.join(root, db, filename), 'r') as f:
            data = f.readlines()

        # Unstructured matches are aggregated into 6 bins, 1 per year
        # from 2015 to 2020. Sanity check is performed by having 
        # `filtering=True`, removing papers that don't contain all of
        # the keywords in query.
        data_dict[filename][db], filtered = aggregate_paper(
            data, 2015, bins_per_year=1, filtering=True,
            filter_keys=query, return_filtered=True
        )

# Plotting is now very simple
from paperscraper.plotting import plot_comparison

data_keys = [
    'deeplearning_molecule_fingerprint.jsonl',
    'deeplearning_molecule_smiles.jsonl', 
    'deeplearning_molecule_gcn.jsonl'
]
plot_comparison(
    data_dict,
    data_keys,
    title_text="'Deep Learning' AND 'Molecule' AND X",
    keyword_text=['Fingerprint', 'SMILES', 'Graph'],
    figpath='mol_representation'
)

[Figure: bar plots comparing the temporal evolution of the three molecular-representation queries]

Venn Diagrams

from paperscraper.plotting import (
    plot_venn_two, plot_venn_three, plot_multiple_venn
)

# Subset sizes presumably follow the matplotlib-venn convention:
# two sets: (A, B, AB); three sets: (A, B, AB, C, AC, BC, ABC).
sizes_2020 = (30842, 14474, 2292, 35476, 1904, 1408, 376)
sizes_2019 = (55402, 11899, 2563)
labels_2020 = ('Medical\nImaging', 'Artificial\nIntelligence', 'COVID-19')
labels_2019 = ['Medical Imaging', 'Artificial\nIntelligence']

plot_venn_two(sizes_2019, labels_2019, title='2019', figname='ai_imaging')

[Figure: two-set Venn diagram for 2019]

plot_venn_three(
    sizes_2020, labels_2020, title='2020', figname='ai_imaging_covid'
)

[Figure: three-set Venn diagram for 2020]

Or plot both together:

plot_multiple_venn(
    [sizes_2019, sizes_2020], [labels_2019, labels_2020], 
    titles=['2019', '2020'], suptitle='Keyword search comparison', 
    gridspec_kw={'width_ratios': [1, 2]}, figsize=(10, 6),
    figname='both'
)

[Figure: both Venn diagrams side by side]

Citation

If you use paperscraper, please cite the following:

@article{born2020role,
  title={On the Role of Artificial Intelligence in Medical Imaging of COVID-19},
  author={Born, Jannis and Beymer, David and Rajan, Deepta and Coy, Adam and Mukherjee, Vandana V and Manica, Matteo and Prasanna, Prasanth and Ballah, Deddeh and Shah, Pallav L and Karteris, Emmanouil and others},
  journal={medRxiv},
  year={2020},
  publisher={Cold Spring Harbor Laboratory Press}
}
