Data Scraping for PubMed Central.

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
- Python :: 3.11

Project description

ScrapeMed

Data Scraping for PubMed Central

⭐ Used by Duke University to power medical generative AI research.

⭐ Enables Pythonic, object-oriented access to a massive amount of research data. PMC constitutes over 14% of The Pile.

⭐ Natural language Paper querying and Paper embedding, powered by LangChain and ChromaDB

⭐ Easy to integrate with pandas for data science workflows

Installation

Available on PyPI. To install, run: pip install scrapemed

Feature List

Scraping API for PubMed Central (PMC) ✅
Data Validation ✅
Markup Language Cleaning ✅
Processes all PMC XML into Paper objects ✅
Dataset building functionality (paperSets) ✅
Semantic paper vectorization with ChromaDB ✅
Natural language Paper querying ✅
Integration with pandas ✅
paperSet visualization ✅
Direct Search for Papers by PMCID on PMC ✅
Advanced Term Search for Papers on PMC ✅

Introduction

ScrapeMed is designed to make large-scale data science projects relying on PubMed Central (PMC) easy. The raw XML that can be downloaded from PMC is inconsistent and messy, and ScrapeMed aims to solve that problem at scale. ScrapeMed downloads, validates, cleans, and parses data from nearly all PMC articles into Paper objects which can then be used to build datasets (paperSets), or investigated in detail for literature reviews.
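To make the validate, clean, and parse steps concrete, here is a minimal sketch that uses only the Python standard library. It is not ScrapeMed's implementation: `MiniPaper`, `clean_text`, `parse_article`, and the sample XML are hypothetical names and data invented for illustration, and real PMC XML is far larger and less regular.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hand-written stand-in for a PMC article (assumption: not real PMC data).
SAMPLE_XML = """
<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmc">1234567</article-id>
      <title-group>
        <article-title>An <italic>example</italic> title</article-title>
      </title-group>
      <abstract><p>First paragraph.</p><p>Second   paragraph.</p></abstract>
    </article-meta>
  </front>
</article>
"""


@dataclass
class MiniPaper:
    """Hypothetical, much-reduced analogue of a Paper object."""
    pmcid: str
    title: str
    abstract: list[str] = field(default_factory=list)


def clean_text(elem):
    """Flatten an element to plain text: drop inline markup, collapse whitespace."""
    return " ".join("".join(elem.itertext()).split())


def parse_article(xml_text):
    """Validate the minimum required fields, then build a MiniPaper."""
    root = ET.fromstring(xml_text.strip())
    pmcid_elem = root.find(".//article-id[@pub-id-type='pmc']")
    title_elem = root.find(".//article-title")
    if pmcid_elem is None or title_elem is None:
        raise ValueError("article is missing a PMC id or a title")
    abstract = [clean_text(p) for p in root.findall(".//abstract/p")]
    return MiniPaper(
        pmcid=clean_text(pmcid_elem),
        title=clean_text(title_elem),
        abstract=abstract,
    )


paper = parse_article(SAMPLE_XML)
print(paper.pmcid, "|", paper.title, "|", paper.abstract)
```

The sketch shows why a cleaning pass matters: inline markup such as the italic tag and stray whitespace are removed before the text reaches the object.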

Beyond the heavy lifting ScrapeMed performs behind the scenes to standardize data scraped from PMC, it includes a number of features that make data science and literature review work easier. A few are listed below:

- Papers can be queried with natural language [.query()], or simply chunked and embedded for storage in a vector DB [.vectorize()]. Papers can also be converted to pandas Series easily [.to_relational()] for data science workflows.
- paperSets can be visualized [.visualize()], or converted to pandas DataFrames [.to_relational()]. paperSets can be generated not only from a list of PMCIDs, but also from a search term using PMC advanced search [.from_search()].
- Useful for advanced users: TextSections and TextParagraphs found within the .abstract and .body attributes of Paper objects contain not only text [.text], but also text with attached reference data [.text_with_refs]. Reference data includes tables, figures, and citations. These are processed into DataFrames and data dicts and can be found in the .ref_map attribute of a Paper object. To decode a reference, look it up by its MHTML index: for example, an MHTML tag of "MHTML::dataref::14" found in a TextSection of paper p corresponds to the table, figure, or citation at p.ref_map[14].
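The reference lookup described in the last item can be sketched as follows. This is an illustration only: `decode_refs` is a hypothetical helper, not part of ScrapeMed, and the `ref_map` dict and sample sentence are made-up stand-ins for a real Paper's .ref_map and .text_with_refs. The only detail taken from the description above is the tag format "MHTML::dataref::N" and lookup by integer index.

```python
import re

# Matches the tag format quoted above, e.g. "MHTML::dataref::14".
DATAREF = re.compile(r"MHTML::dataref::(\d+)")


def decode_refs(text_with_refs, ref_map):
    """Return (index, referenced item) pairs for every dataref tag in the text."""
    decoded = []
    for match in DATAREF.finditer(text_with_refs):
        idx = int(match.group(1))
        decoded.append((idx, ref_map[idx]))
    return decoded


# Made-up stand-ins (assumptions): a real ref_map holds DataFrames and data dicts.
ref_map = {
    14: {"type": "table", "label": "Table 1"},
    15: {"type": "citation", "label": "Example citation"},
}
text = "Outcomes are summarized in MHTML::dataref::14 and agree with MHTML::dataref::15."

refs = decode_refs(text, ref_map)
for idx, item in refs:
    print(idx, item["type"], item["label"])
```

With a real Paper p, the same lookup would presumably be applied to the .text_with_refs of a TextSection and to p.ref_map, but the exact types involved should be checked against the ScrapeMed documentation.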

Documentation

The docs are hosted on Read the Docs!

Sponsorship

Package sponsored by Daceflow.ai!

If you'd like to sponsor a feature or donate to the project, reach out to me at danielfrees@g.ucla.edu.

Release history

- 1.1.3 (this version): Jan 6, 2024
- 1.0.8: Sep 19, 2023
- 1.0.7: Sep 14, 2023
- 1.0.6: Sep 8, 2023
- 1.0.5: Sep 8, 2023
- 1.0.4: Sep 6, 2023

Download files

Source Distribution

scrapemed-1.1.3.tar.gz (45.8 kB)

Uploaded Jan 6, 2024 Source

Built Distribution

scrapemed-1.1.3-py3-none-any.whl (47.2 kB)

Uploaded Jan 6, 2024 Python 3

File details

Details for the file scrapemed-1.1.3.tar.gz.

File metadata

Download URL: scrapemed-1.1.3.tar.gz
Upload date: Jan 6, 2024
Size: 45.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for scrapemed-1.1.3.tar.gz
Algorithm	Hash digest
SHA256	`5fef2f5f74d0cb64a33020df1c15aad52da5c36702d2c4f0fd8e63edfdbcd24b`
MD5	`aec676b6aa526384008a3ef894a9a44c`
BLAKE2b-256	`a9785e8a6c79d9d775aab01993bc67928cb6974af9a460724565f16db461d9ca`

File details

Details for the file scrapemed-1.1.3-py3-none-any.whl.

File metadata

Download URL: scrapemed-1.1.3-py3-none-any.whl
Upload date: Jan 6, 2024
Size: 47.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for scrapemed-1.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`249c3820ddc3045e344885f0b9bcac44a4af37b86f5aee6b2ae11155a8eb8153`
MD5	`04184b5ac45cd59a271bf1270d520425`
BLAKE2b-256	`3957254e8458ee57c1b71c4a0e51e2d13b8ae8993d30d88c1b55bb9e0dc760e6`
