Automated BioMedical Information Curation for Machine Learning Applications.
BioVida is a library designed to make it easy to gain access to existing data sets of biomedical images as well as build brand new, custom-made ones.
It is hoped that by automating the tedious data munging that is typically involved in this process, more people will become interested in applying machine learning to biomedical images and, in turn, advancing insights into human disease.
In a nod to recursion, BioVida tries to accomplish some of this automation with machine learning itself, using tools like convolutional neural networks.
Installation
Python Package Index:
$ pip install biovida
Latest Build:
$ pip install git+https://github.com/TariqAHassan/BioVida@master
Requires Python 3.4+
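After installing, it can be useful to confirm that the interpreter meets the stated minimum version and that the package is visible to it. The following is a minimal sketch using only the standard library; the helper name `environment_ok` is hypothetical and is not part of BioVida.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 4)  # minimum version stated in the installation notes


def environment_ok(package="biovida", min_python=MIN_PYTHON):
    """Return (python_ok, package_found) for the current interpreter.

    Hypothetical helper for illustration; not part of BioVida.
    """
    python_ok = sys.version_info[:2] >= min_python
    # find_spec returns None when a top-level package cannot be located.
    package_found = importlib.util.find_spec(package) is not None
    return python_ok, package_found


python_ok, package_found = environment_ok()
print("Python version OK:", python_ok)
print("biovida importable:", package_found)
```

If the second line prints `False`, the package was most likely installed into a different environment than the one running the script.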
Images: Stable
In just a few lines of code, you can gain access to biomedical databases which store tens of millions of images.
Please note that you are bound to adhere to the copyright and other usage restrictions under which this data is provided to you by its creators.
Open-i BioMedical Image Search Engine
# 1. Import the Interface for the NIH's Open-i API.
from biovida.images import OpeniInterface
# 2. Create an Instance of the Tool
opi = OpeniInterface()
# 3. Perform a search for X-rays and CT scans of lung cancer
opi.search(query='lung cancer', image_type=['x_ray', 'ct']) # Results Found: 9,220.
# 4. Pull the data
search_df = opi.pull()
Cancer Imaging Archive
# 1. Import the interface for the Cancer Imaging Archive
from biovida.images import CancerImageInterface
# 2. Create an Instance of the Tool
cii = CancerImageInterface(YOUR_API_KEY_HERE)
# 3. Perform a search
cii.search(cancer_type='esophageal')
# 4. Pull the data
cdf = cii.pull()
Both CancerImageInterface and OpeniInterface cache images for later use. When data is ‘pulled’, a records_db is generated: a dataframe of all text data associated with the images. These dataframes are exposed as attributes of the interface, e.g., cii.records_db. While records_db only stores data from the most recent pull, the cache_records_db dataframe provides an account of all image data currently cached.
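The relationship between the two dataframes can be pictured with plain pandas. This is only a sketch of the idea, not BioVida's implementation: the `update_cache` helper, the `image_id` key and the `modality` column are assumptions made for illustration.

```python
import pandas as pd


def update_cache(cache_db, records_db, key="image_id"):
    """Append the latest pull to a running cache, keeping one row per key.

    Hypothetical helper; BioVida's own bookkeeping may differ.
    """
    combined = pd.concat([cache_db, records_db], ignore_index=True)
    return combined.drop_duplicates(subset=key, keep="last").reset_index(drop=True)


# Two illustrative pulls that overlap on image_id 2.
first_pull = pd.DataFrame({"image_id": [1, 2], "modality": ["x_ray", "ct"]})
second_pull = pd.DataFrame({"image_id": [2, 3], "modality": ["ct", "ct"]})

cache_records_db = first_pull.copy()
cache_records_db = update_cache(cache_records_db, second_pull)
records_db = second_pull  # reflects only the most recent pull

print(len(records_db))        # rows from the latest pull
print(len(cache_records_db))  # rows across every pull so far
```

In this picture, records_db answers "what did the last search return?", while cache_records_db answers "what is on disk right now?".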
Splitting Images
BioVida can divide cached images into train/validation/test.
from biovida.images import image_divvy
# 1. Define a rule to 'divvy' up images in the cache.
def my_divvy_rule(row):
    if row['image_modality_major'] == 'x_ray':
        return 'x_ray'
    elif row['image_modality_major'] == 'ct':
        return 'ct'
# 2. Define Proportions and Divide Data
tt = image_divvy(opi, my_divvy_rule, action='ndarray', train_val_test_dict={'train': 0.8, 'test': 0.2})
# 3. The resultant ndarrays can be unpacked as follows:
train_ct, train_xray = tt['train']['ct'], tt['train']['x_ray']
test_ct, test_xray = tt['test']['ct'], tt['test']['x_ray']
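To make the idea of a rule-driven split concrete, here is a minimal standard-library sketch. It is not how image_divvy is implemented: `divvy_sketch` is a hypothetical function, it works on plain dictionaries rather than cached images, it returns lists of rows instead of ndarrays, and skipping rows for which the rule returns None is an assumption.

```python
import random
from collections import defaultdict


def my_divvy_rule(row):
    if row['image_modality_major'] == 'x_ray':
        return 'x_ray'
    elif row['image_modality_major'] == 'ct':
        return 'ct'


def divvy_sketch(rows, rule, proportions, seed=0):
    """Group rows by the label `rule` returns, then split each group."""
    groups = defaultdict(list)
    for row in rows:
        label = rule(row)
        if label is not None:  # assumption: unlabelled rows are skipped
            groups[label].append(row)

    rng = random.Random(seed)
    names = list(proportions)
    split = {name: {} for name in names}
    for label, members in groups.items():
        rng.shuffle(members)
        start = 0
        for i, name in enumerate(names):
            if i == len(names) - 1:
                end = len(members)  # the last split takes the remainder
            else:
                end = start + round(proportions[name] * len(members))
            split[name][label] = members[start:end]
            start = end
    return split


# Illustrative records: 10 x-rays, 5 CTs and 3 MRIs (the rule ignores MRIs).
rows = [{'image_id': i, 'image_modality_major': m}
        for i, m in enumerate(['x_ray'] * 10 + ['ct'] * 5 + ['mri'] * 3)]

tt = divvy_sketch(rows, my_divvy_rule, {'train': 0.8, 'test': 0.2})
train_ct, train_xray = tt['train']['ct'], tt['train']['x_ray']
test_ct, test_xray = tt['test']['ct'], tt['test']['x_ray']
print(len(train_xray), len(test_xray), len(train_ct), len(test_ct))
```

The nested dictionary mirrors the access pattern used above: the outer key names the split and the inner key is whatever label the rule returned.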
Images: Experimental
Automated Image Data Cleaning
Unfortunately, the data pulled from Open-i above is likely to contain a large number of images that are unrelated to the search query and/or unsuitable for machine learning.
The experimental OpeniImageProcessing class can be used to completely automate this data cleaning process, which is partly powered by a Convolutional Neural Network.
# 1. Import Image Processing Tools
from biovida.images import OpeniImageProcessing
# 2. Instantiate the Tool using the OpeniInterface Instance
ip = OpeniImageProcessing(opi)
# 3. Analyze the Images
idf = ip.auto()
# 4. Use the Analysis to Clean Images
ip.clean_image_dataframe()
It is easy to split these images into training and test sets.
from biovida.images import image_divvy
def my_divvy_rule(row):
    if row['image_modality_major'] == 'x_ray':
        return 'x_ray'
    elif row['image_modality_major'] == 'ct':
        return 'ct'
tt = image_divvy(ip, my_divvy_rule, action='ndarray', train_val_test_dict={'train': 0.8, 'test': 0.2})
# These ndarrays can be unpacked as shown above.
Genomic Data
While primarily focused on images, BioVida also provides a simple interface for obtaining related information, such as genomic data.
# 1. Import the Interface for DisGeNET.org
from biovida.genomics import DisgenetInterface
# 2. Create an Instance of the Tool
dna = DisgenetInterface()
# 3. Pull a Database
gdf = dna.pull('curated')
Diagnostic Data
BioVida also makes it easy to obtain diagnostic data.
Information on disease definitions, families and synonyms:
# 1. Import the Interface for DiseaseOntology.org
from biovida.diagnostics import DiseaseOntInterface
# 2. Create an Instance of the Tool
doi = DiseaseOntInterface()
# 3. Pull the Database
ddf = doi.pull()
Information on symptoms associated with diseases:
# 1. Import the Interface for Disease-Symptoms Information
from biovida.diagnostics import DiseaseSymptomsInterface
# 2. Create an Instance of the Tool
dsi = DiseaseSymptomsInterface()
# 3. Pull the Database
dsdf = dsi.pull()
Unifying Information
The unify_against_images function combines image data with information obtained through DisgenetInterface, DiseaseOntInterface and DiseaseSymptomsInterface.
from biovida.unification import unify_against_images
unify_against_images(interfaces=[cii, opi], db_to_extract='cache_records_db')
Left side of DataFrame: Image Data Alone
| | article_type | image_id | image_caption | modality_best_guess | age | sex | disease | … |
|---|---|---|---|---|---|---|---|---|
| 0 | case_report | 1 | … | Magnetic Resonance Imaging (MRI) | 73 | male | fibroma | … |
| 1 | case_report | 2 | … | Magnetic Resonance Imaging (MRI) | 73 | male | fibroma | … |
| 2 | case_report | 1 | … | Computed Tomography (CT): angiography | 45 | female | bile duct cancer | … |
Right side of DataFrame: Added Information
| disease_family | disease_synonym | disease_definition | known_associated_symptoms | mentioned_symptoms | known_associated_genes |
|---|---|---|---|---|---|
| (cell type benign neoplasm,) | nan | nan | (abdominal pain,…) | (pain,) | ((ANTXR2, 0.12), …) |
| (cell type benign neoplasm,) | nan | nan | (abdominal pain,…) | (pain,) | ((ANTXR2, 0.12), …) |
| (biliary tract cancer,) | (bile duct tumor,…) | A biliary tract… | (abdominal obesity,..) | (colic,) | nan |
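One way to picture how the left and right halves relate is a left join on the disease name. The sketch below uses plain pandas with a few values taken from the tables above; it is an illustration of the concept only, and the two small dataframes and the join key are assumptions rather than a description of what unify_against_images does internally.

```python
import pandas as pd

# Illustrative image records (a subset of the "left side" columns).
images = pd.DataFrame({
    "image_id": [1, 2, 1],
    "disease": ["fibroma", "fibroma", "bile duct cancer"],
})

# Illustrative disease reference table (one "right side" column).
disease_info = pd.DataFrame({
    "disease": ["fibroma", "bile duct cancer"],
    "disease_family": [("cell type benign neoplasm",), ("biliary tract cancer",)],
})

# A left join keeps every image row and attaches disease details where known.
unified = images.merge(disease_info, on="disease", how="left")
print(unified)
```

A left join is the natural choice here because an image record should survive even when no matching disease information exists; the added columns are simply NaN for that row, as in the table above.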
Documentation
Contributing
For more information on how to contribute, see the contributing document.
Bug reports and feature requests are always welcome and can be provided through the Issues page.
Resources
The resources document provides an account of all data sources and scholarly work used by BioVida.