
The easiest way to scrape preprints from bioRxiv


bio2csv 🐧

bio2csv is a Python package for scraping all research papers on bioRxiv that match a search query (such as penguins). It retrieves metadata (title, authors, and a link to each paper) and can optionally fetch the abstract and the full text. You can also scrape all research papers in a specific bioRxiv subject area, such as Genetics or Paleontology. To encourage responsible use of bioRxiv, the code inserts short random delays between requests so that it does not overload or spam the site.


Easy Installation

You can install the bio2csv package with pip:

pip install bio2csv

The package uses the following imports. You probably already have most of them:

  • from bs4 import BeautifulSoup (less common; install it with pip install beautifulsoup4)
  • from tqdm import tqdm
  • import requests
  • import re (standard library)
  • import time (standard library)
  • import pandas as pd
  • import random (standard library)
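
If you are unsure which of the third-party imports in the list above (bs4, tqdm, requests, pandas) are present in your environment, a quick check along these lines may help. This is a minimal sketch and not part of bio2csv; the helper name missing_modules is hypothetical.

```python
from importlib.util import find_spec

# Import names of the third-party dependencies listed above.
# Note that the beautifulsoup4 package is imported as "bs4".
THIRD_PARTY = ["bs4", "tqdm", "requests", "pandas"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment.

    Hypothetical helper for illustration; it is not provided by bio2csv.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(THIRD_PARTY)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All third-party dependencies were found.")
```

Anything reported as missing can be installed with pip, keeping in mind that bs4 is installed as beautifulsoup4.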

Usage

There are two functions available: scrape_biorxiv() and fetch_paper_details(). scrape_biorxiv() repeatedly calls fetch_paper_details().

scrape_biorxiv

Parameters:

  • base_url (str, optional): The base URL to scrape papers from. It must end with ?page=. Default is "https://www.biorxiv.org/collection/genetics?page=". You can use any of the subject areas listed at https://www.biorxiv.org/, or a search result URL such as https://www.biorxiv.org/search/penguins. In either case, append ?page= to the end of the URL, for example https://www.biorxiv.org/search/penguins?page=.

  • pages (int, optional): The number of pages to scrape. Default is 5.

  • get_abstract (bool, optional): Whether to fetch the abstract of each paper. Default is True. If you don't want to fetch the abstracts, set this to False.

  • get_full_text (bool, optional): Whether to fetch the full text of each paper. Default is True. If you don't want to fetch the full texts, set this to False. Images will not be fetched.

Returns:

  • pandas.DataFrame: A DataFrame containing the details of the scraped papers.
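
Because base_url has to end with ?page=, it can be convenient to normalize URLs before passing them in. The sketch below assumes a simple URL with no existing query string; with_page_suffix is a hypothetical helper, not a function exported by bio2csv.

```python
PAGE_SUFFIX = "?page="


def with_page_suffix(url):
    """Return url ending in '?page=', appending the suffix only if it is missing.

    Hypothetical helper for illustration. It assumes the URL has no other
    query string (for example a plain search or collection URL).
    """
    return url if url.endswith(PAGE_SUFFIX) else url + PAGE_SUFFIX


# Both forms end up as a valid base_url for scrape_biorxiv().
search_url = with_page_suffix("https://www.biorxiv.org/search/penguins")
collection_url = with_page_suffix("https://www.biorxiv.org/collection/genetics?page=")
print(search_url)
print(collection_url)
```

The result can then be passed as the base_url argument of scrape_biorxiv().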

fetch_paper_details

fetch_paper_details fetches the abstract and the full text of a single paper.

Parameters:

  • paper_url (str): The URL of the paper to fetch the details from.

  • session (requests.Session): An active requests.Session() to fetch the details.

Returns:

  • tuple: A tuple containing the abstract and the full text of the paper.

Note: If the function encounters any error while fetching the details, it will return "Not found" for the abstract and/or the full text.
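
Since failures are reported through the string "Not found" rather than an exception, callers may want to check for that value before using the text. The following is a minimal sketch; is_found is a hypothetical helper, and the tuple below holds placeholder values rather than real output from the package.

```python
NOT_FOUND = "Not found"  # the value described in the note above


def is_found(value):
    """Return True if value is real text rather than the "Not found" marker.

    Hypothetical helper for illustration; it is not part of bio2csv.
    """
    return isinstance(value, str) and value != NOT_FOUND


# Placeholder tuple in the (abstract, full_text) shape that
# fetch_paper_details() returns; these strings are made up for the example.
abstract, full_text = "An example abstract.", "Not found"

if is_found(abstract):
    print("Abstract is available.")
if not is_found(full_text):
    print("Full text could not be fetched.")
```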

Quickstart

Here's a simple usage example 🐧:

# In a notebook (such as Colab), install the package first with:
# !pip install bio2csv

from bio2csv import scrape_biorxiv

# 🐧Scrape the first 2 pages of the search results for "penguin" and get the abstract and full texts. 🐧
df = scrape_biorxiv(
    base_url="https://www.biorxiv.org/search/penguin?page=",
    pages=2,
    get_abstract=True,
    get_full_text=True,
)

# Print the resulting DataFrame
print(df)

# Save to CSV
df.to_csv("PenguinPapers.csv")

Fetching text for a single paper

You can also use the fetch_paper_details function to fetch the abstract and full text of a single paper:

from bio2csv import fetch_paper_details
import requests

# Initialize a session
session = requests.Session()

# URL of a paper about penguin conservation 🐧
paper_url = "https://www.biorxiv.org/content/10.1101/2021.04.06.438390v1"

# Fetch details
abstract, full_text = fetch_paper_details(paper_url, session)

# Print details
print(f"Abstract: {abstract}")
print(f"Full Text: {full_text}")

Please note that the fetch_paper_details function needs an active requests.Session() to work.

Only Scraping Abstracts

from bio2csv import scrape_biorxiv

# Scrape only the abstracts from the first 5 pages of the Genetics collection (remember, the default base_url is for the Genetics collection)
df_abstracts = scrape_biorxiv(pages=5, get_abstract=True, get_full_text=False)

print(df_abstracts)

Only Scraping Full Text

from bio2csv import scrape_biorxiv

# Scrape only the full text from the first 5 pages of the Genetics collection (the default base_url)
df_full_texts = scrape_biorxiv(pages=5, get_abstract=False, get_full_text=True)

print(df_full_texts)

Remember to scrape responsibly and to respect bioRxiv's terms of service! This code sends about one request every 10 seconds, so it will not overload the bioRxiv servers. I intentionally did not implement multithreading, both to prevent abuse of bioRxiv and to keep you from getting IP banned.
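
If you write your own loop around fetch_paper_details or another fetcher, the same idea of sequential requests with random pauses can be sketched as follows. This is not the package's actual implementation: polite_fetch_all, its parameters, and the 5 to 15 second range are assumptions chosen only so that the average gap is roughly 10 seconds.

```python
import random
import time


def polite_fetch_all(urls, fetch, min_delay=5.0, max_delay=15.0, sleep=time.sleep):
    """Call fetch(url) for each URL in order, pausing randomly between calls.

    Hypothetical helper for illustration; the delay bounds are assumptions,
    not values taken from bio2csv. The sleep argument exists so the pause
    can be replaced in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait a random amount of time before every request after the first.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Demonstration with a stand-in fetcher and very short delays.
    pages = polite_fetch_all(
        ["page-1", "page-2", "page-3"],
        fetch=lambda url: f"fetched {url}",
        min_delay=0.01,
        max_delay=0.02,
    )
    print(pages)
```

Keeping the requests strictly sequential, with one pause between each pair, is what bounds the request rate; adding threads would multiply it.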

Contributing

Contributions to bio2csv are welcome! If you have a feature request, bug report, or proposal, please open an issue on this repository. If you wish to contribute code, please fork the repository, make your changes, and submit a pull request. The penguin examples were inspired by my CS161 class at Stanford, which features Plucky the Pedantic Penguin. If you find this repository useful, consider donating to the Global Penguin Society. 🐧🐧🐧

License

bio2csv is released under the MIT License. For more details, see the LICENSE file in this repository. You are responsible for how you use this package. I am not liable for any losses, harms, damages, or other consequences arising from its use.


