the easiest way to scrape preprints from biorxiv
bio2csv 🐧
bio2csv is a Python package that scrapes every research paper matching a search query (such as penguins) on bioRxiv. It retrieves metadata (title, authors, and a link to each paper), and it can also fetch the abstract and the full text if specified.
You can also scrape all research papers in a specific bioRxiv subject area, such as Genetics or Paleontology. To encourage responsible use of bioRxiv, the code inserts short random delays between requests to avoid overloading or spamming the server.
Easy Installation
You can install the bio2csv package with pip:
pip install bio2csv
You probably already have most of these dependencies (re, time, and random ship with Python itself):
- from bs4 import BeautifulSoup (This one is less common. Run pip install beautifulsoup4)
- from tqdm import tqdm
- import requests
- import re
- import time
- import pandas as pd
- import random
Usage
There are two functions available: scrape_biorxiv() and fetch_paper_details(). scrape_biorxiv() repeatedly calls fetch_paper_details().
scrape_biorxiv
Parameters:
- base_url (str): READ THIS FULLY. The base URL to scrape the papers from. Default is "https://www.biorxiv.org/collection/genetics?page=". You can choose any of the subject areas listed at https://www.biorxiv.org/, or a search result URL such as https://www.biorxiv.org/search/penguins. You MUST APPEND ?page= to the end of the URL!
- pages (int, optional): The number of pages to scrape. Default is 5.
- get_abstract (bool, optional): Whether to fetch the abstract of each paper. Default is True. If you don't want to fetch the abstracts, set this to False.
- get_full_text (bool, optional): Whether to fetch the full text of each paper. Default is True. If you don't want to fetch the full texts, set this to False. Images will not be fetched.
Returns:
- pandas.DataFrame: A DataFrame containing the details of the scraped papers.
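Since forgetting the ?page= suffix is the easiest mistake to make with base_url, here is a minimal sketch of a helper that appends it for you. The names with_page_suffix and search_base_url are hypothetical and not part of bio2csv; the search URL pattern is simply the one shown in the parameter description above.

```python
from urllib.parse import quote

PAGE_SUFFIX = "?page="


def with_page_suffix(url: str) -> str:
    """Return the URL ending in '?page=', appending it only if it is missing."""
    return url if url.endswith(PAGE_SUFFIX) else url + PAGE_SUFFIX


def search_base_url(query: str) -> str:
    """Build a search base_url following the /search/<query> pattern above.

    Hypothetical helper: the query is percent-encoded so spaces are safe.
    """
    return with_page_suffix(f"https://www.biorxiv.org/search/{quote(query)}")


print(with_page_suffix("https://www.biorxiv.org/collection/genetics"))
print(search_base_url("penguins"))
```

The result can then be passed as base_url to scrape_biorxiv(). This sketch does not handle URLs that already carry other query parameters.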
fetch_paper_details
fetch_paper_details fetches the abstract and the full text of a single paper.
Parameters:
- paper_url (str): The URL of the paper to fetch the details from.
- session (requests.Session): An active requests.Session() used to fetch the details.
Returns:
- tuple: A tuple containing the abstract and the full text of the paper.
Note: If the function encounters any error while fetching the details, it will return "Not found" for the abstract and/or the full text.
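Because failures come back as the string "Not found" rather than as an exception, it can help to check the returned tuple before using it. The sketch below assumes only what the note above says; missing_fields is a hypothetical helper, not part of bio2csv, and the sample strings stand in for a real fetch_paper_details() result.

```python
NOT_FOUND = "Not found"  # sentinel string described in the note above


def missing_fields(abstract: str, full_text: str) -> list:
    """Return the names of the fields that came back as 'Not found'."""
    missing = []
    if abstract == NOT_FOUND:
        missing.append("abstract")
    if full_text == NOT_FOUND:
        missing.append("full_text")
    return missing


# Stand-in values; in practice these come from fetch_paper_details(paper_url, session).
abstract, full_text = "Penguins are flightless seabirds.", "Not found"
print(missing_fields(abstract, full_text))
```

An empty list means both fields were fetched; anything else tells you which field to retry or skip.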
Quickstart
Parameters
Here's a simple usage example 🐧:
!pip install bio2csv
from bio2csv import scrape_biorxiv
# 🐧Scrape the first 2 pages of the search results for "penguin" and get the abstract and full texts. 🐧
df = scrape_biorxiv(pages=2, base_url = 'https://www.biorxiv.org/search/penguin?page=', get_abstract=True, get_full_text=True)
# Print the resulting DataFrame
print(df)
# Save to CSV
df.to_csv("PenguinPapers.csv")
Fetching text for a single paper
You can also use the fetch_paper_details function to fetch the abstract and full text of a single paper:
from bio2csv import fetch_paper_details
import requests
# Initialize a session
session = requests.Session()
# URL of a paper about penguin conservation 🐧
paper_url = "https://www.biorxiv.org/content/10.1101/2021.04.06.438390v1"
# Fetch details
abstract, full_text = fetch_paper_details(paper_url, session)
# Print details
print(f"Abstract: {abstract}")
print(f"Full Text: {full_text}")
Please note that the fetch_paper_details function needs an active requests.Session() to work.
Only Scraping Abstracts
from bio2csv import scrape_biorxiv
# Scrape only the abstracts from the first 5 pages of the Genetics collection (remember, the default base_url is for the Genetics collection)
df_abstracts = scrape_biorxiv(pages=5, get_abstract=True, get_full_text=False)
print(df_abstracts)
Only Scraping Full Text
from bio2csv import scrape_biorxiv
# Scrape only the full text from the first 5 pages of the genetics collection
df_full_texts = scrape_biorxiv(pages=5, get_abstract=False, get_full_text=True)
print(df_full_texts)
Remember, it's important to use web scraping responsibly and to respect terms of service! This code sends about one request every 10 seconds, so it will not overload the bioRxiv servers. I intentionally did not implement multithreading, in order to prevent abuse of bioRxiv. Also, you don't want to get IP banned.
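If you write your own requests against bioRxiv alongside this package, the same idea of a short random pause is easy to reproduce. This is a minimal sketch, not the package's actual implementation: polite_delay is a hypothetical name, and the 5 to 15 second bounds are an assumption chosen to average roughly 10 seconds.

```python
import random
import time


def polite_delay(min_seconds: float = 5.0, max_seconds: float = 15.0, sleep=time.sleep) -> float:
    """Sleep for a random duration between the two bounds and return it.

    The bounds are assumptions for illustration. The sleep argument lets
    you swap in a stand-in when you don't want to actually wait.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Demo with a stand-in sleep that only records the requested waits.
waits = []
for _ in range(3):
    polite_delay(sleep=waits.append)
print(len(waits))
```

Call polite_delay() with no arguments between real requests to actually pause.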
Contributing
Contributions to bio2csv are welcome! If you have a feature request, bug report, or proposal, please open an issue on this repository. If you wish to contribute code, please fork the repository, make your changes, and submit a pull request.
The penguin examples were inspired by my CS161 class at Stanford which features Plucky the Pedantic Penguin.
If you find this repository useful, consider donating to the Global Penguin Society
🐧🐧🐧
License
bio2csv is released under the MIT License. For more details, see the LICENSE file in this repository.
You are responsible for how you use this package. I am not liable for any losses, harms, damages, or other consequences arising from its use.