The easiest way to scrape preprints from bioRxiv

bio2csv 🐧

bio2csv is a Python package that lets you easily scrape every research paper matching a search query (such as penguins) on bioRxiv. It retrieves metadata (title, authors, and a link to each paper), and it can also fetch the abstract and the full text if requested. You can likewise scrape all research papers that fall under a specific bioRxiv subject area, such as Genetics or Paleontology. To encourage responsible use of bioRxiv, the code inserts short random delays between requests to prevent overload/spam.

Easy Installation

You can install the bio2csv package with pip:

pip install bio2csv

You probably already have most of these dependencies (re, time, and random ship with the Python standard library, so there is nothing to install for them):

  • beautifulsoup4, imported as bs4 (this one is less common; run pip install beautifulsoup4 if you don't have it)
  • tqdm
  • requests
  • pandas
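
If any are missing, the third-party ones can be installed in one go:

pip install beautifulsoup4 tqdm requests pandas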

Usage

There are two functions available: scrape_biorxiv() and fetch_paper_details(). scrape_biorxiv() repeatedly calls fetch_paper_details().
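
Conceptually, the relationship between the two looks like the following sketch. This is a simplified illustration, not the package's actual source; extract_paper_links and the zero-based page numbering are assumptions made here for illustration:

import re
import requests
from bio2csv import fetch_paper_details

def extract_paper_links(html):
    # Hypothetical helper: pull /content/10.1101/... paper links out of a
    # results page. The href pattern is an assumption about bioRxiv's markup.
    return ["https://www.biorxiv.org" + path
            for path in re.findall(r'href="(/content/10\.1101/[^"]+)"', html)]

def scrape_pages_sketch(base_url, pages):
    rows = []
    with requests.Session() as session:
        for page in range(pages):  # assumes zero-based ?page= numbering
            listing = session.get(f"{base_url}{page}")  # one results page
            for url in extract_paper_links(listing.text):
                abstract, full_text = fetch_paper_details(url, session)
                rows.append({"url": url, "abstract": abstract, "full_text": full_text})
    return rows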

scrape_biorxiv

Parameters:

  • base_url (str): The base URL to scrape papers from. Default is "https://www.biorxiv.org/collection/genetics?page=". You can use any of the subject-area collections listed at https://www.biorxiv.org/, or a search-results URL such as https://www.biorxiv.org/search/penguins. Important: you must append ?page= to the end of the URL (see the examples below).

  • pages (int, optional): The number of pages to scrape. Default is 5.

  • get_abstract (bool, optional): Whether to fetch the abstract of each paper. Default is True. If you don't want to fetch the abstracts, set this to False.

  • get_full_text (bool, optional): Whether to fetch the full text of each paper. Default is True. If you don't want to fetch the full texts, set this to False. Images will not be fetched.

Returns:

  • pandas.DataFrame: A DataFrame containing the details of the scraped papers.
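
For example, either of these is a valid base_url, following the same pattern as the default (note the trailing ?page= in each; the Paleontology URL is inferred from the collection naming scheme):

# A subject-area collection:
base_url = "https://www.biorxiv.org/collection/paleontology?page="

# A search-results page:
base_url = "https://www.biorxiv.org/search/penguins?page="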

fetch_paper_details

The fetch_paper_details function fetches the abstract and the full text of a single paper.

Parameters:

  • paper_url (str): The URL of the paper to fetch the details from.

  • session (requests.Session): An active requests.Session() to fetch the details.

Returns:

  • tuple: A tuple containing the abstract and the full text of the paper.

Note: If the function encounters any error while fetching the details, it will return "Not found" for the abstract and/or the full text.
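
Since failures are signaled by returning the string "Not found" rather than by raising an exception, it's worth checking for that value explicitly. A minimal sketch:

import requests
from bio2csv import fetch_paper_details

session = requests.Session()
paper_url = "https://www.biorxiv.org/content/10.1101/2021.04.06.438390v1"
abstract, full_text = fetch_paper_details(paper_url, session)

# Errors come back as the literal string "Not found", so filter
# such results out before any downstream processing.
if abstract == "Not found":
    print(f"Could not retrieve an abstract for {paper_url}")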

Quickstart

Here's a simple usage example 🐧:

!pip install bio2csv  # in a notebook/Colab; from a shell, drop the leading "!"

from bio2csv import scrape_biorxiv

# 🐧 Scrape the first 2 pages of the search results for "penguin",
# fetching both the abstract and the full text of each paper. 🐧
df = scrape_biorxiv(
    base_url="https://www.biorxiv.org/search/penguin?page=",
    pages=2,
    get_abstract=True,
    get_full_text=True,
)

# Print the resulting DataFrame
print(df)

# Save to CSV
df.to_csv("PenguinPapers.csv")

Fetching text for a single paper

You can also use the fetch_paper_details function to fetch the abstract and full text of a single paper:

from bio2csv import fetch_paper_details
import requests

# Initialize a session
session = requests.Session()

# URL of a paper about penguin conservation 🐧
paper_url = "https://www.biorxiv.org/content/10.1101/2021.04.06.438390v1"

# Fetch details
abstract, full_text = fetch_paper_details(paper_url, session)

# Print details
print(f"Abstract: {abstract}")
print(f"Full Text: {full_text}")

Please note that the fetch_paper_details function needs an active requests.Session() to work.
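
If you are fetching several papers, reuse a single session for all of them; using it as a context manager also closes the connection pool when you are done. A small usage sketch (the URL list is illustrative):

import requests
from bio2csv import fetch_paper_details

paper_urls = [
    "https://www.biorxiv.org/content/10.1101/2021.04.06.438390v1",
    # ...add more paper URLs here
]

# A single Session reuses HTTP connections across requests and is
# closed automatically when the with-block exits.
with requests.Session() as session:
    for url in paper_urls:
        abstract, full_text = fetch_paper_details(url, session)
        print(url, "->", abstract[:80])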

Only Scraping Abstracts

from bio2csv import scrape_biorxiv

# Scrape only the abstracts from the first 5 pages of the Genetics collection (remember, the default base_url is for the Genetics collection)
df_abstracts = scrape_biorxiv(pages=5, get_abstract=True, get_full_text=False)

print(df_abstracts)

Only Scraping Full Text

from bio2csv import scrape_biorxiv

# Scrape only the full text from the first 5 pages of the Genetics collection
df_full_texts = scrape_biorxiv(pages=5, get_abstract=False, get_full_text=True)

print(df_full_texts)

Remember, it's important to use web scraping responsibly and to respect terms of service! This code sends about one request every 10 seconds, so it will not overload the bioRxiv servers. I intentionally did not implement multithreading, in order to prevent abuse of bioRxiv. Also, you don't want to get IP banned.
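
As a back-of-envelope estimate of how long a scrape will take (the papers-per-page figure below is an assumption, not something the package documents):

# Rough runtime estimate for a full-text scrape.
pages = 5
papers_per_page = 10      # assumed average number of results per page
seconds_per_request = 10  # approximate pace noted above
total_seconds = pages * papers_per_page * seconds_per_request
print(f"~{total_seconds / 60:.0f} minutes")  # -> ~8 minutes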

Contributing

Contributions to bio2csv are welcome! If you have a feature request, bug report, or proposal, please open an issue on this repository. If you wish to contribute code, please fork the repository, make your changes, and submit a pull request. The penguin examples were inspired by my CS161 class at Stanford, which features Plucky the Pedantic Penguin. If you find this repository useful, consider donating to the Global Penguin Society. 🐧🐧🐧

License

bio2csv is released under the MIT License; for more details, see the LICENSE file in this repository. You are responsible for how you use this package, and I am not liable for any losses, harms, damages, or other consequences arising from its use.

