Read, process and write ProteinNet data

ProteinNetPy 1.0.1

A Python library for working with ProteinNet text data, allowing you to easily load, stream and filter data, map functions across records and produce TensorFlow datasets. For details of the dataset see the ProteinNet Bioinformatics paper. Documentation for all functions in the module is available in the project's online documentation.

Install

pip install proteinnetpy

Or install the development version from GitHub:

pip install git+https://github.com/allydunham/proteinnetpy

Requirements

  • Python 3
  • NumPy
  • Biopython
  • TensorFlow (if using the datasets module)

Basic Usage

The main object used in ProteinNetPy is the ProteinNetRecord, which provides access to the record fields and methods for common manipulations, such as calculating a one-hot sequence representation or a residue distance matrix. It also supports the applicable built-in operations, such as len and str. While the parser module contains a generator to parse files, it is generally easier to use the ProteinNetDataset class from the data module:

from proteinnetpy.data import ProteinNetDataset
data = ProteinNetDataset(path="path/to/proteinnet")
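To make the record manipulations mentioned above concrete, here is a minimal, library-independent sketch of what a one-hot sequence encoding and a residue distance matrix are. It is not the ProteinNetRecord implementation: the residue ordering in AMINO_ACIDS, the helper names one_hot and distance_matrix, and the example coordinates are all assumptions made for illustration.

```python
import math

# Assumed residue ordering (alphabetical one-letter codes); the library's own
# ordering may differ, so treat this as illustrative only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Return a len(sequence) x 20 matrix with a single 1 in each row."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1
        rows.append(row)
    return rows


def distance_matrix(coords):
    """Return pairwise Euclidean distances between 3D coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


# Hypothetical three-residue protein with made-up coordinates
encoded = one_hot("ACD")
distances = distance_matrix([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)])
```

None of this is needed in normal use, where records are obtained from a ProteinNetDataset and expose these calculations as methods.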

The ProteinNetDataset class includes a preload argument, which determines whether the dataset is loaded into memory or streamed. It also supports filtering through the filter_func argument, which takes a function that returns a truthy value for each record that should be kept in the dataset. A range of common filters is included in the data module, as well as combine_filters(), which applies all passed filters to each record.
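A filter is simply a function from a record to a truthy or falsy value. The sketch below shows the idea without importing the library: FakeRecord is a hypothetical stand-in for a ProteinNetRecord, and length_between, no_unknown_residues and all_filters are hand-written helpers that only mimic the behaviour described above, not the data module's own functions.

```python
from dataclasses import dataclass


@dataclass
class FakeRecord:
    """Hypothetical stand-in for a ProteinNetRecord, for illustration only."""
    id: str
    primary: str

    def __len__(self):
        return len(self.primary)


def length_between(min_length, max_length):
    """Build a filter keeping records whose length is within the bounds."""
    def _filter(record):
        return min_length <= len(record) <= max_length
    return _filter


def no_unknown_residues(record):
    """Keep records whose sequence contains no unknown residue 'X'."""
    return "X" not in record.primary


def all_filters(*filters):
    """Combine filters so a record is kept only if every filter accepts it."""
    def _combined(record):
        return all(f(record) for f in filters)
    return _combined


records = [
    FakeRecord("a", "MKV"),       # too short
    FakeRecord("b", "MKVLAAGX"),  # contains an unknown residue
    FakeRecord("c", "MKVLAAGI"),  # passes both filters
]
keep = all_filters(length_between(4, 10), no_unknown_residues)
kept_ids = [r.id for r in records if keep(r)]
```

With the real library, a function like keep would be passed as filter_func when constructing a ProteinNetDataset, so that rejected records never appear during iteration.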

Once a dataset has been loaded it can be iterated over to process data. The ProteinNetMap class creates map objects that map a function over the dataset, including options to stream the map on each iteration or pre-calculate results. They have a generate method that creates a generator object yielding the output of the function. The LabeledFunction class is provided to create functions annotated with output types and shapes, used for automatically creating TensorFlow datasets. The mutation module provides some example functions that return mutated records.
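The difference between streaming a map and pre-calculating its results can be sketched in a few lines. SketchMap below is a hypothetical, simplified analogue written for this illustration; it is not the ProteinNetMap class, and its constructor arguments are assumptions. The calls list exists only to show when the mapped function actually runs.

```python
class SketchMap:
    """Hypothetical, simplified analogue of a map over a dataset."""

    def __init__(self, data, func, static=False):
        self.data = data
        self.func = func
        self.static = static
        # When static, compute every result once up front; otherwise defer
        # all work until the map is iterated.
        self._cache = [func(x) for x in data] if static else None

    def generate(self):
        """Yield the function output for each item in the dataset."""
        if self.static:
            yield from self._cache
        else:
            for item in self.data:
                yield self.func(item)

    def __iter__(self):
        return self.generate()


calls = []  # records every invocation of the mapped function


def seq_length(sequence):
    calls.append(sequence)
    return len(sequence)


sequences = ["MKV", "MKVLA"]
streamed = SketchMap(sequences, seq_length, static=False)  # no calls yet
cached = SketchMap(sequences, seq_length, static=True)     # two calls here

streamed_values = list(streamed.generate())  # two more calls
cached_values = list(cached.generate())      # no further calls
```

Streaming recomputes the function on every pass, which keeps memory use low and suits functions with random output, while pre-calculation trades memory for speed when the output is deterministic.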

The following example code shows a typical simple usage, creating a streamed TensorFlow dataset from ProteinNet data:

from proteinnetpy import data
from proteinnetpy import tfdataset

class MapFunction(data.LabeledFunction):
    """
    Example ProteinNetMap function outputting a one-hot sequence and contact graph input data
    and multiple alignment PSSM labels
    """
    def __init__(self):
        self.output_shapes = (([None, 20], [None, None]), [None, 20])
        self.output_types = (('float32', 'float32'), 'int32')

    def __call__(self, record):
        return (record.get_one_hot_sequence().T, record.distance_matrix()), record.evolutionary.T

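# Keep only records between 32 and 2000 residues long
filter_func = data.make_length_filter(min_length=32, max_length=2000)
# Use a separate name so the imported data module is not shadowed
dataset = data.ProteinNetDataset(path="path/to/proteinnet", filter_func=filter_func, preload=False)
pn_map = data.ProteinNetMap(dataset, map=MapFunction(), static=False, filter_errors=True)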

tf_dataset = tfdataset.proteinnet_tf_dataset(pn_map, batch_size=100, prefetch=400, shuffle_buffer=200)

Many more functions, arguments and uses are available, with detailed descriptions currently found in docstrings. Full documentation will be generated from these for a future release.

Scripts

The package also provides convenience scripts for processing ProteinNet datasets:

  • add_angles_to_proteinnet - Add extra fields to a ProteinNet file with φ, ψ and χ backbone/torsion angles
  • proteinnet_to_fasta - Extract a fasta file with the sequences from a ProteinNet file
  • filter_proteinnet - Filter a ProteinNet file to include/exclude records from a list of IDs

Detailed usage instructions for each can be found using the -h argument.
