
ScrapeMed

Data Scraping for PubMed Central


Used by Duke University to power medical generative AI research.

⭐ Enables Pythonic, object-oriented access to a massive amount of research data. PMC constitutes over 14% of The Pile.

⭐ Natural language Paper querying and Paper embedding, powered by LangChain and ChromaDB.

⭐ Available on PyPI! Simply pip install scrapemed.
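The natural language querying and embedding highlighted above boil down to vectorizing paper text and ranking chunks by similarity to a query. A toy illustration of that idea in plain Python (this is not ScrapeMed's actual code, which uses real embeddings via LangChain and ChromaDB):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (ScrapeMed uses real embeddings via ChromaDB)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Rank paper chunks by similarity to a natural-language query.
chunks = [
    "mitochondria are the powerhouse of the cell",
    "randomized controlled trial of a new statin",
]
query = "statin clinical trial"
best = max(chunks, key=lambda c: cosine(embed(query), embed(c)))
```

Real embeddings capture meaning beyond shared words, but the retrieval loop (embed the query, score every chunk, take the best) is the same.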

Shoutout:

Package sponsored by Daceflow.ai!

Feature List

  • Scraping API for PubMed Central (PMC) ✅
  • Full Advanced Term Search for Papers on PMC ✅
  • Direct Search for Papers by PMCID on PMC ✅
  • Data Validation ✅
  • Markup Language Cleaning ✅
  • Process PMC XML into Paper objects ✅
  • Dataset building functionality (paperSets) ✅
  • Integration with pandas for easy use in data science applications ✅
  • Semantic paper vectorization with ChromaDB ✅
  • Natural language paper querying ✅
  • paperSet visualization ✅
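For context on the scraping features above: PMC full text is served as JATS XML by NCBI's E-utilities efetch endpoint, which is the kind of request ScrapeMed's scraping API issues for you. A minimal sketch of building such a request URL (the helper name and exact parameter set here are illustrative, not ScrapeMed's API):

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def pmc_xml_url(pmcid: int, email: str) -> str:
    """Build an efetch URL for a PMC paper's full-text XML (illustrative helper)."""
    # db=pmc + rettype=xml returns JATS XML for open-access papers;
    # NCBI asks that you identify yourself with an email parameter.
    params = {"db": "pmc", "id": pmcid, "rettype": "xml", "email": email}
    return f"{EFETCH}?{urlencode(params)}"

url = pmc_xml_url(7067710, "youremail@example.com")
```

This is why the PMC_EMAIL setting described below matters: NCBI expects a contact email on E-utilities requests.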

Developer Usage

License: MIT

Feel free to fork and continue work on the ScrapeMed package; it is licensed under the MIT License to promote collaboration, extension, and inheritance.

Make sure to create a conda environment and install the necessary requirements before developing this package.

e.g.: $ conda create --name myenv --file requirements.txt

Add a .env file in your base scrapemed directory with PMC_EMAIL=youremail@example.com. This is necessary for several of the test scripts and may be useful for your development in general.
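If you prefer not to hand-roll .env loading, the python-dotenv package handles it; a minimal stdlib-only stand-in looks like this (illustrative sketch, not part of ScrapeMed):

```python
import os

def load_dotenv_minimal(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv (illustrative only).

    Reads KEY=VALUE lines, skipping blanks and comments, and sets them
    in os.environ without overwriting values that are already present.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```

With a .env containing PMC_EMAIL=youremail@example.com, code can then read os.environ["PMC_EMAIL"] wherever an email is required.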

You will need clang++ installed for chromadb (and therefore Paper vectorization) to work. You also need Python 3.11 installed and active in your development environment.

Here is an overview of the package structure:

Under examples you can find some example work using the scrapemed modules, which may provide some insight into usage possibilities.

Under examples/data you will find some example downloaded data (XML from PubMed Central). It is recommended that any data you download while working out of the notebooks goes here. Downloads will also go here by default when passing download=True to the scrapemed module functions that support it.
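As an illustration of that convention, a hypothetical download helper (not ScrapeMed's actual function) might write fetched XML under examples/data like so:

```python
from pathlib import Path

def save_pmc_xml(xml_text: str, pmcid: int, data_dir: str = "examples/data") -> Path:
    """Hypothetical helper mirroring the download=True convention.

    Writes the paper's XML to <data_dir>/PMC<pmcid>.xml, creating the
    directory if needed, and returns the path written.
    """
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"PMC{pmcid}.xml"
    out_path.write_text(xml_text, encoding="utf-8")
    return out_path
```

Keeping all downloads under one directory makes it easy to .gitignore scraped data and to rebuild paperSets from local files.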

Under scrapemed/tests you will find several Python scripts that can be run using pytest. If you also clone .github/workflows/test-scrapemed.yml, these tests will run automatically on any PR or push to your GitHub repo. Under scrapemed/test/testdata is some XML data crafted for the purpose of testing scrapemed. This data is necessary to run the testing scripts.
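A test in that style might parse a crafted XML fixture and assert on its structure; a generic pytest-style sketch (the fixture below is invented for illustration, not one of ScrapeMed's actual test files):

```python
import xml.etree.ElementTree as ET

# Invented JATS-like fixture, standing in for a file under testdata/.
SAMPLE_XML = """<article>
  <front><article-meta>
    <article-id pub-id-type="pmc">7067710</article-id>
    <title-group><article-title>Example Title</article-title></title-group>
  </article-meta></front>
</article>"""

def test_parses_title():
    """Check that the title can be recovered from the fixture XML."""
    root = ET.fromstring(SAMPLE_XML)
    assert root.findtext(".//article-title") == "Example Title"
```

pytest discovers any function named test_* automatically, so running `pytest` from the repo root picks these up with no extra configuration.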

Each of the scrapemed Python modules has a docstring at the top describing its general purpose and usage. All functions should also have descriptive docstrings with descriptions of their inputs and outputs. Please contact me if any documentation is unclear. Full documentation is on its way.

Project details


Download files

Download the file for your platform.

Source Distribution

scrapemed-1.0.6.tar.gz (38.2 kB)

Uploaded Source

Built Distribution

scrapemed-1.0.6-py3-none-any.whl (39.3 kB)

Uploaded Python 3

File details

Details for the file scrapemed-1.0.6.tar.gz.

File metadata

  • Download URL: scrapemed-1.0.6.tar.gz
  • Size: 38.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for scrapemed-1.0.6.tar.gz

  • SHA256: 359272d22f1ba433a7a6f4c549b014f51350e653d6443a46ddb653b9368af874
  • MD5: a72fd38faf4a534c163539487bc59ea3
  • BLAKE2b-256: 74360560981d80dc54f4058b0125fb740dd3edb61594774e2ee01b3cc6002210


File details

Details for the file scrapemed-1.0.6-py3-none-any.whl.

File metadata

  • Download URL: scrapemed-1.0.6-py3-none-any.whl
  • Size: 39.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for scrapemed-1.0.6-py3-none-any.whl

  • SHA256: fd3f0b5c224f697ff502750e41636176fbd3b8c0718e4e42f3bc219d36747a84
  • MD5: a0a688c0e5c89cb2dc257f7978d7fe63
  • BLAKE2b-256: 9f76ce55dd974b836069bbeb4559bf2fafc31f8083d1d44a161519dd241ad362

