
Phenotype comparison scoring by semantic similarity.


phenopy

phenopy is a Python package for scoring phenotype similarity by semantic similarity. It is a lightweight, optimized command line tool and library that scores generic entities annotated with phenotype terms from the Human Phenotype Ontology (HPO).


Installation

GitHub

Install from GitHub:

git clone https://github.com/GeneDx/phenopy.git
cd phenopy
python setup.py install

Command Line Usage

Initial setup

phenopy is designed to run with minimal setup from the user. To run phenopy with default parameters (recommended), skip ahead to the Commands overview.

This section provides details about where phenopy stores data resources and config files. The following occurs when you run phenopy for the first time.

  1. phenopy creates a .phenopy/ directory in your home folder and downloads external resources from HPO into the $HOME/.phenopy/data/ directory.
  2. phenopy stores a binary version of the HPO as a networkx graph object here: $HOME/.phenopy/data/hpo_network.pickle.
  3. phenopy creates a $HOME/.phenopy/phenopy.ini config file where users can set variables for phenopy to use at runtime.
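To check what the first run left on disk, the locations listed above can be assembled and tested for existence. This is a minimal sketch that relies only on the paths named in this section; the helper name phenopy_paths is hypothetical and is not part of phenopy.

```python
from pathlib import Path


def phenopy_paths(home=None):
    """Return the locations this README names, relative to a home directory."""
    home = Path(home) if home is not None else Path.home()
    base = home / ".phenopy"
    return {
        "base": base,                                   # $HOME/.phenopy/
        "data": base / "data",                          # downloaded HPO resources
        "network": base / "data" / "hpo_network.pickle",  # cached networkx graph
        "config": base / "phenopy.ini",                 # runtime settings
    }


if __name__ == "__main__":
    # Report which of the expected files and directories exist.
    for name, path in phenopy_paths().items():
        status = "found" if path.exists() else "missing"
        print(f"{name:8s} {path} ({status})")
```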

Commands overview

phenopy is primarily used as a command line tool. An entity, as described here, is presented as a sample, gene, or disease, but could be any concept that warrants annotation of phenotype terms.

  1. Score similarity of an entity defined by the HPO terms from an input file against all the genes in .phenopy/data/phenotype_to_genes.txt. We provide a test input file in the repo.

    phenopy score tests/data/test.score.txt
    

    Output:

    #query	gene	score
    SAMPLE	NCBI:10000[AKT3]	0.0252
    SAMPLE	NCBI:10002[NR2E3]	0.0148
    SAMPLE	NCBI:100033413[SNORD116-1]	0.0283
    ...
    
  2. Score similarity of an entity defined by the HPO terms from an input file against a custom list of entities with HPO annotations, referred to as the --records-file.

    phenopy score tests/data/test.score.txt --records-file tests/data/test.score-product.txt
    

    Output:

    #query	entity_id	score
    SAMPLE	118200	0.0584
    SAMPLE	118210	0.057
    SAMPLE	118220	0.0563
    ...
    
  3. Score pairwise similarity of entities defined in the --records-file.

    phenopy score-product tests/data/test.score-product.txt --threads 4
    

    Output:

    118200	118200	0.7692
    118200	118300	0.5345
    118200	300905	0.2647
    ...
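
The output of the commands above can be post-processed with a few lines of Python. The sketch below assumes the columns are tab-separated with the score in the last column, as in the samples shown; parse_scores and top_hits are hypothetical helper names, not phenopy functions.

```python
def parse_scores(text):
    """Parse phenopy score output into (query, target, score) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the '#query ...' header, and the '...' elision.
        if not line or line.startswith("#") or line == "...":
            continue
        query, target, score = line.split("\t")
        rows.append((query, target, float(score)))
    return rows


def top_hits(rows, n=10):
    """Return the n highest-scoring rows, best first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]


# Sample lines copied from the first output block above.
sample = (
    "#query\tgene\tscore\n"
    "SAMPLE\tNCBI:10000[AKT3]\t0.0252\n"
    "SAMPLE\tNCBI:10002[NR2E3]\t0.0148\n"
    "SAMPLE\tNCBI:100033413[SNORD116-1]\t0.0283\n"
)

for query, target, score in top_hits(parse_scores(sample), n=2):
    print(query, target, score)
```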
    

Parameters

For a full list of command arguments use phenopy [subcommand] --help:

phenopy score --help

Output:

    --records_file=RECORDS_FILE
        One record per line, tab delimited. First column: unique record identifier. Second column: pipe-separated list of HPO identifiers (e.g. HP:0000001).
    --query_name=QUERY_NAME
        Unique identifier for the query file.
    --obo_file=OBO_FILE
        OBO file from https://hpo.jax.org/app/download/ontology.
    --pheno2genes_file=PHENO2GENES_FILE
        Phenotypes to genes from https://hpo.jax.org/app/download/annotation.
    --threads=THREADS
        Number of parallel processes to use.
    --agg_score=AGG_SCORE
        The aggregation method to use for summarizing the similarity matrix between two term sets. Must be one of {'BMA', 'maximum'}.
    --no_parents=NO_PARENTS
        If provided, scoring is done by only using the most informative nodes. All parent nodes are removed.
    --hpo_network_file=HPO_NETWORK_FILE
        If provided, phenopy will try to load a cached hpo_network object from file.
    --custom_annotations_file=CUSTOM_ANNOTATIONS_FILE
        A comma-separated list of custom annotation files in the same format as tests/data/test.score-product.txt
    --output_file=OUTPUT_FILE
        filepath where to store the results.  
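
The records file format described under --records_file (one record per line, an identifier, a tab, then pipe-separated HPO identifiers) is easy to write and read programmatically. The following is a minimal sketch of that format only; format_record and parse_record are hypothetical helpers, and the record identifier RECORD_1 is made up for illustration.

```python
def format_record(record_id, hpo_terms):
    """Build one records-file line: identifier, tab, pipe-separated HPO terms."""
    return str(record_id) + "\t" + "|".join(hpo_terms)


def parse_record(line):
    """Split one records-file line back into (identifier, list of HPO terms)."""
    record_id, terms = line.rstrip("\n").split("\t", 1)
    return record_id, [term for term in terms.split("|") if term]


# RECORD_1 is a made-up identifier; the HPO terms are the ones used in Library Usage.
line = format_record("RECORD_1", ["HP:0001882", "HP:0011839"])
print(line)
print(parse_record(line))
```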

Library Usage

The phenopy library can be used as a Python module, allowing more control for advanced users.

import os
from phenopy import config
from phenopy.obo import restore
from phenopy.score import Scorer

network_file = os.path.join(config.data_directory, 'hpo_network.pickle')

hpo = restore(network_file)
scorer = Scorer(hpo)

terms_a = ['HP:0001882', 'HP:0011839']
terms_b = ['HP:0001263', 'HP:0000252']

print(scorer.score(terms_a, terms_b))

Output:

0.0005
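
The single number returned above summarizes a term-by-term similarity matrix; the --agg_score option names two aggregation methods, BMA and maximum. The sketch below shows one common definition of best-match average (the mean of the average row maximum and the average column maximum) next to the plain maximum. It is an illustration under that assumption, not phenopy's own code, and the matrix values are invented.

```python
def aggregate(matrix, method="BMA"):
    """Summarize a similarity matrix (list of equal-length rows) as one score."""
    if not matrix or not matrix[0]:
        raise ValueError("similarity matrix must not be empty")
    if method == "maximum":
        # Highest similarity between any pair of terms.
        return max(max(row) for row in matrix)
    if method == "BMA":
        # Best match for each term of set A (rows) and of set B (columns).
        row_best = [max(row) for row in matrix]
        col_best = [max(col) for col in zip(*matrix)]
        mean_rows = sum(row_best) / len(row_best)
        mean_cols = sum(col_best) / len(col_best)
        return (mean_rows + mean_cols) / 2
    raise ValueError("method must be 'BMA' or 'maximum'")


# Invented similarities: rows are terms of set A, columns are terms of set B.
toy = [
    [0.2, 0.8],
    [0.4, 0.6],
]
print(aggregate(toy, "BMA"))
print(aggregate(toy, "maximum"))
```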

Another example uses the library to prune parent phenotypes from phenotype_to_genes.txt:

import os
from phenopy import config
from phenopy.obo import restore
from phenopy.util import export_pheno2genes_with_no_parents


network_file = os.path.join(config.data_directory, 'hpo_network.pickle')
phenotype_to_genes_file = os.path.join(config.data_directory, 'phenotype_to_genes.txt')
phenotype_to_genes_no_parents_file = os.path.join(config.data_directory, 'phenotype_to_genes_no_parents.txt')

hpo = restore(network_file)
export_pheno2genes_with_no_parents(phenotype_to_genes_file, phenotype_to_genes_no_parents_file, hpo)
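
To make the idea of pruning parents concrete, here is a minimal sketch on a toy ontology: any term that is an ancestor of another term in the same set is dropped, so only the most specific terms remain. The parent map and the term names A to D are invented, and this is not the implementation of export_pheno2genes_with_no_parents.

```python
def ancestors(term, parents):
    """Collect every ancestor of a term by walking the child -> parents map."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def drop_parents(terms, parents):
    """Keep only terms that are not an ancestor of another term in the set."""
    terms = list(terms)
    redundant = set()
    for term in terms:
        redundant |= ancestors(term, parents)
    return [term for term in terms if term not in redundant]


# Toy ontology: A is the root, B and D are children of A, C is a child of B.
toy_parents = {"B": ["A"], "C": ["B"], "D": ["A"]}
print(drop_parents(["A", "B", "C", "D"], toy_parents))
```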

Config

We recommend the default settings for most users, but the config file at $HOME/.phenopy/phenopy.ini can be modified.

IMPORTANT NOTE:
If the config variable hpo_network_file is defined, phenopy will try to load this stored version of the HPO and ignore the following command-line arguments: obo_file and custom_annotations_file.

To run phenopy with a different obo_file or custom_annotations_file, rename or move the HPO network file: mv $HOME/.phenopy/data/hpo_network.pickle $HOME/.phenopy/data/hpo_network.old.pickle

To run phenopy with a previously stored version of the HPO network, simply set hpo_network_file = /path/to/hpo_network.pickle.
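
Before a run, it can help to confirm whether hpo_network_file is set, since that decides whether obo_file and custom_annotations_file are honored. The sketch below uses Python's configparser to look for the key in any section. The section name [hpo] in the sample text is an assumption, as this README does not show the layout of phenopy.ini, and find_network_file is a hypothetical helper.

```python
import configparser


def find_network_file(ini_text):
    """Return the hpo_network_file value from any section, or None if unset."""
    # interpolation=None keeps '%' characters in paths from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    for section in parser.sections():
        value = parser.get(section, "hpo_network_file", fallback="").strip()
        if value:
            return value
    return None


# Hypothetical config content; the [hpo] section name is assumed.
sample_ini = "[hpo]\nhpo_network_file = /path/to/hpo_network.pickle\n"
print(find_network_file(sample_ini))
```

To check a real installation, pass the contents of $HOME/.phenopy/phenopy.ini to find_network_file instead of the sample string.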

Contributing

We welcome contributions from the community. Please follow these steps to set up a local development environment.

pipenv install --dev

To run tests locally:

pipenv shell
coverage run --source=. -m unittest discover --start-directory tests/
coverage report -m

References

The underlying algorithm which determines the semantic similarity for any two HPO terms is based on an implementation of HRSS, published here.
