Search the biomedical literature for protein interactions and protein associations.

PEDL

PEDL is a tool for predicting protein-protein associations from the biomedical literature. It searches more than 30 million abstracts of biomedical publications and over 4 million full texts with the help of PubTatorCentral. A state-of-the-art machine reading model then predicts which types of association between the proteins are supported by the literature. Among others, PEDL can detect posttranslational modifications, transcription factor-target interactions, complex formations and controlled transports.

Installation

pip install pedl

Usage

PEDL supports two commands: pedl predict and pedl summarize. The default workflow is to first predict associations for one or more protein pairs of interest, which stores the results for each pair in a separate file. The contents of these files can then be aggregated into a single summary file with summarize.

PEDL expects proteins to be identified either via HGNC symbols (for human genes) or Entrez Gene IDs. These can be looked up via standard web interfaces like NCBI Gene.

predict

  • Interactions between single proteins

    pedl predict --p1 CD274 --p2 CMTM6 --out PEDL_predictions
    

    Results:

    $ ls PEDL_predictions/
    CD274-CMTM6.txt  CMTM6-CD274.txt
    
    $ head -n1 PEDL_predictions/CD274-CMTM6.txt
    in-complex-with	0.98	6978769	A PD-L1 antibody, H1A, was developed to destabilize PD-L1 by disrupting the <e1>PD-L1</e1> stabilizer <e2>CMTM6</e2>.	PEDL
    
  • Pairwise interactions between multiple proteins

    pedl predict --p1 CMTM6 --p2 54918 920 --out PEDL_predictions
    

    This searches for interactions between CMTM6 and 54918, and between CMTM6 and 920.

  • Read protein lists from files

    pedl predict --p1 proteins.txt --p2 54918 920 --out PEDL_predictions
    

    This searches for interactions between the proteins in proteins.txt and 54918, as well as between the proteins in proteins.txt and 920.

  • Allow multiple sentences

    By default, PEDL only searches for interactions described in a single sentence. If you want PEDL to read text snippets that span multiple sentences, use --multi_sentence. Note that this may slow down reading considerably if you are not using a GPU.

    pedl predict --p1 CD274 --p2 CMTM6 --out PEDL_predictions --multi_sentence
    
  • Search for multiple species at once

    If the provided gene IDs are from human, mouse, rat or zebrafish, PEDL can automatically search for interactions in the other supported model species via homology classes defined by the Alliance of Genome Resources:

    pedl predict --p1 29126 --p2 54918 --out PEDL_predictions --expand_species mouse zebrafish
    

    This would also include interactions in mouse and zebrafish.

  • Interactions from pathway databases

    It is also possible to query PathwayCommons for interactions. This requires the Python package indra, which can be installed via pip install indra:

    pedl predict --p1 29126 --p2 54918 --out PEDL_predictions --dbs pid reactome kegg
    

    This queries pid, reactome and kegg. See --help for the full list of available databases.

  • Large gene lists

    If you need to test for more than 100 interactions at once, you have to use a local copy of PubTatorCentral, which can be downloaded here. Unpack the PubTatorCentral files and point PEDL towards the files:

    pedl predict --p1 large_protein_list1.txt --p2 large_protein_list2 --out PEDL_predictions --pubtator [PATH_TO_PUBTATOR]
    

    In this case, it is also strongly advised to use a CUDA-compatible GPU to speed up the machine reading:

    pedl predict --p1 large_protein_list1.txt --p2 large_protein_list2 --out PEDL_predictions \
      --pubtator [PATH_TO_PUBTATOR] --device cuda
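
The result files written by predict can be post-processed with a few lines of Python. The sketch below assumes that every line holds five tab-separated fields, as in the CD274-CMTM6.txt example above: association type, score, a document identifier (presumably a PubMed ID), the evidence text with <e1>/<e2> entity tags, and the source. The Prediction type, its field names, and both helper functions are illustrative and not part of PEDL.

```python
import re
from typing import NamedTuple, Tuple


class Prediction(NamedTuple):
    # Field names are assumptions inferred from the example output line.
    association_type: str
    score: float
    doc_id: str
    text: str
    source: str


def parse_prediction_line(line: str) -> Prediction:
    """Split one tab-separated result line into its five fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    association_type, score, doc_id, text, source = fields
    return Prediction(association_type, float(score), doc_id, text, source)


def tagged_entities(text: str) -> Tuple[str, str]:
    """Return the mentions wrapped in <e1>...</e1> and <e2>...</e2>."""
    e1 = re.search(r"<e1>(.*?)</e1>", text)
    e2 = re.search(r"<e2>(.*?)</e2>", text)
    if e1 is None or e2 is None:
        raise ValueError("missing <e1> or <e2> tag")
    return e1.group(1), e2.group(1)


# The example line from CD274-CMTM6.txt shown above
example = (
    "in-complex-with\t0.98\t6978769\t"
    "A PD-L1 antibody, H1A, was developed to destabilize PD-L1 by disrupting "
    "the <e1>PD-L1</e1> stabilizer <e2>CMTM6</e2>.\tPEDL"
)
prediction = parse_prediction_line(example)
print(prediction.association_type, prediction.score, tagged_entities(prediction.text))
```

To process a whole file, the same function can be applied to each line of, for example, PEDL_predictions/CD274-CMTM6.txt.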
    

summarize

Use summarize to create a summary file describing all results in a directory. By default, PEDL writes the summary file next to the results directory.

pedl summarize PEDL_predictions

Results:

$ head -n4 PEDL_predictions.tsv
p1      association type        p2      score (sum)     score (max)
CMTM6   controls-state-change-of        CD274   4.17    0.90
CMTM6   in-complex-with CD274   2.48    0.97
CD274   in-complex-with CMTM6   2.40    0.98
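
For further filtering, the summary can be loaded with Python's csv module. This is a minimal sketch that assumes the file is tab-separated and carries the header shown above; the load_summary helper and the 0.95 threshold are illustrative choices, and the inline sample is copied from the output above.

```python
import csv
import io


def load_summary(handle):
    """Read a PEDL summary into a list of dicts with numeric score columns."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Column names are taken from the header of the example output.
        row["score (sum)"] = float(row["score (sum)"])
        row["score (max)"] = float(row["score (max)"])
        rows.append(row)
    return rows


sample = (
    "p1\tassociation type\tp2\tscore (sum)\tscore (max)\n"
    "CMTM6\tcontrols-state-change-of\tCD274\t4.17\t0.90\n"
    "CMTM6\tin-complex-with\tCD274\t2.48\t0.97\n"
    "CD274\tin-complex-with\tCMTM6\t2.40\t0.98\n"
)
rows = load_summary(io.StringIO(sample))

# Keep only associations with at least one high-scoring evidence text
confident = [r for r in rows if r["score (max)"] >= 0.95]
for r in confident:
    print(r["p1"], r["association type"], r["p2"], r["score (max)"])
```

With a real summary, pass an open file instead of the in-memory sample, for example open("PEDL_predictions.tsv", newline="").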

Results can also be aggregated, ignoring the association type and the direction of the association:

  $ pedl summarize PEDL_predictions --no_association_type

  $ cat PEDL_predictions.tsv
  p1      association type        p2      score (sum)     score (max)
  CD274   association     CMTM6   11.52   1.00
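
The idea behind this undirected aggregation can be sketched in Python. How PEDL computes the two score columns internally is not described here; the sketch assumes, as the column names suggest, that score (sum) adds up the scores of all evidence texts for a pair and score (max) is their maximum. The aggregate_undirected helper is hypothetical, and the three input scores are only a small illustrative subset, so the totals differ from the output above.

```python
from collections import defaultdict


def aggregate_undirected(predictions):
    """Group (p1, association_type, p2, score) tuples by unordered protein pair."""
    scores_by_pair = defaultdict(list)
    for p1, _association_type, p2, score in predictions:
        # Sorting the two names makes (A, B) and (B, A) fall into the same group.
        pair = tuple(sorted((p1, p2)))
        scores_by_pair[pair].append(score)
    return {
        pair: {"sum": round(sum(scores), 2), "max": max(scores)}
        for pair, scores in scores_by_pair.items()
    }


# Illustrative scores only, not the full evidence set behind the output above
predictions = [
    ("CD274", "in-complex-with", "CMTM6", 0.98),
    ("CMTM6", "in-complex-with", "CD274", 0.97),
    ("CMTM6", "controls-state-change-of", "CD274", 0.90),
]
summary = aggregate_undirected(predictions)
print(summary)
```

Because the pair is stored in sorted order, both directions and all association types collapse into a single entry per protein pair.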

References

Code and instructions to reproduce the results of our paper can be found here.

If you use PEDL in your work, please cite us:

@article{weber2020pedl,
  title={PEDL: extracting protein--protein associations using deep language models and distant supervision},
  author={Weber, Leon and Thobe, Kirsten and Migueles Lozano, Oscar Arturo and Wolf, Jana and Leser, Ulf},
  journal={Bioinformatics},
  volume={36},
  number={Supplement\_1},
  pages={i490--i498},
  year={2020},
  publisher={Oxford University Press}
}
