GCGC is a preprocessing library for biological sequence model development.

GCGC

GCGC is a Python package for feature processing on biological sequences.

Installation

Install GCGC via pip:

$ pip install gcgc

Documentation

The GCGC documentation is at gcgc.trenthauck.com; see it for an example.

Citing GCGC

If you use GCGC in your research, cite it with the following:

@misc{trent_hauck_2018_2329966,
  author       = {Trent Hauck},
  title        = {GCGC},
  month        = dec,
  year         = 2018,
  doi          = {10.5281/zenodo.2329966},
  url          = {https://doi.org/10.5281/zenodo.2329966}
}

Changelog

0.11.0 (2019-11-15)

Added

  • Added the SequenceTokenizerSpec object for specifying the tokenizer.
  • Added Vocab object for storing the int to token, and token to int encodings.
  • Added example of using tensorflow/keras together with gcgc.
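
The two-way mapping described for the Vocab object can be illustrated with a minimal sketch in plain Python. This is not GCGC's API: the function names `build_vocab` and `invert_vocab` and the `<pad>` token are hypothetical and exist only to show the idea of keeping token-to-int and int-to-token encodings side by side.

```python
from typing import Dict, List


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Map each distinct token to an integer, in first-seen order."""
    token_to_int: Dict[str, int] = {}
    for token in tokens:
        if token not in token_to_int:
            token_to_int[token] = len(token_to_int)
    return token_to_int


def invert_vocab(token_to_int: Dict[str, int]) -> Dict[int, str]:
    """Build the reverse lookup, from integer back to token."""
    return {index: token for token, index in token_to_int.items()}


# Hypothetical DNA vocabulary with one padding token (an assumption).
token_to_int = build_vocab(["<pad>", "A", "C", "G", "T"])
int_to_token = invert_vocab(token_to_int)

encoded = [token_to_int[base] for base in "GATTACA"]
decoded = "".join(int_to_token[i] for i in encoded)
```

Keeping both dictionaries makes encoding and decoding a constant-time lookup in either direction.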

0.10.0 (2019-11-09)

Changed

gcgc has been revamped quite a bit to better support existing processing pipelines for NLP without trying to do too much. See the docs for more information about how this works.

0.9.0 (2019-08-05)

Added

  • Parser now outputs the length of the tensor not including padding. This is useful for packing and length-based iteration.
  • Generating masked output from the parse_record method is now available.
  • Alphabet can include an optional mask token.

Changed

  • Can now specify the kmer step size to use when supplying a kmer value.
  • Renames EncodedSeq.integer_encoded to EncodedSeq.get_integer_encoding, which takes a kmer_step_size to specify the step size to use when encoding.
  • Add parsed_seq_len to the SequenceParser object to control how much padding to apply to the end of the integer encoded sequence. This is useful since a batch of tensors is expected to have the same size.
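
As a rough illustration of the kmer step size and end-padding ideas in this release, here is a plain-Python sketch. It does not use GCGC itself: `kmers_with_step` and `pad_to_length` are hypothetical helpers, and truncating sequences longer than the target length is an assumption rather than documented behavior.

```python
from typing import List, Tuple


def kmers_with_step(seq: str, kmer_size: int, step: int = 1) -> List[str]:
    """Return every full-length kmer, advancing `step` positions each time."""
    last_start = len(seq) - kmer_size
    return [seq[i:i + kmer_size] for i in range(0, last_start + 1, step)]


def pad_to_length(
    encoded: List[int], target_len: int, pad_value: int = 0
) -> Tuple[List[int], int]:
    """Pad (or truncate, by assumption) to target_len.

    Also returns the length excluding padding, which is the kind of value
    that is useful for packing and length-based iteration.
    """
    unpadded_len = min(len(encoded), target_len)
    padded = encoded[:target_len] + [pad_value] * (target_len - unpadded_len)
    return padded, unpadded_len


overlapping = kmers_with_step("ATCGAT", kmer_size=3, step=1)
non_overlapping = kmers_with_step("ATCGAT", kmer_size=3, step=3)
padded, true_len = pad_to_length([1, 2, 3], target_len=5)
```

A step equal to the kmer size yields non-overlapping kmers, while a step of 1 yields every overlapping window.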

0.8.0 (2019-07-04)

Fixed

  • Broken test due to platform differences in Path.glob sorting.

Added

  • Users can optionally specify start or end tokens.

Removed

  • Removed one_hot_encoding. The user can do that pretty easily if needed. E.g. see scatter in PyTorch.
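
For readers who still need one-hot output, the conversion from an integer encoding is short. The sketch below uses plain Python lists so it runs without extra dependencies; `one_hot` is a hypothetical helper, not something GCGC provides. With PyTorch tensors, a scatter call along the vocabulary dimension would typically produce the same matrix.

```python
from typing import List


def one_hot(encoded: List[int], vocab_size: int) -> List[List[int]]:
    """Turn each integer into a row with a single 1 at that index."""
    rows = [[0] * vocab_size for _ in encoded]
    for row, index in zip(rows, encoded):
        row[index] = 1
    return rows


matrix = one_hot([0, 2, 1], vocab_size=3)
```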

0.7.0 (2019-06-22)

Added

  • Properties to access the integer encodings of special tokens. (35cae2a)
    • Alphabet.encoded_start
    • Alphabet.encoded_end
    • Alphabet.encoded_padding
  • Remove uniprot dataset creation. (e233162)
  • Simplify index handling for GenomicDataset. (3213a9e)

0.6.1 (2019-06-10)

Added

  • Updated package management so gcgc is easier to use with other versions of torch.

0.6.0 (2019-04-04)

Added

  • Ability for kmer size to be passed to an alphabet.

0.5.2 (2019-03-21)

Added

  • Add Dockerfile and docker-compose.yml for development.
  • EncodedSeq.shift, which shifts the sequence by an integer offset.
  • EncodedSeq.from_integer_encoded_seq will take a list of integers and an alphabet and return an EncodedSeq object.
  • Add the ability to apply a function to the rollout_kmers yielded values.

Changed

  • Alphabet special characters are now located at the start, rather than the end, of the letters and token sequence.
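
One practical effect of putting special characters first can be sketched as follows. This is an illustration only: `build_tokens` and the token strings `<pad>`, `<start>`, and `<end>` are hypothetical and are not taken from GCGC.

```python
from typing import Dict, List


def build_tokens(letters: str, special_tokens: List[str]) -> List[str]:
    """Place the special tokens before the alphabet letters."""
    return list(special_tokens) + list(letters)


def index_tokens(tokens: List[str]) -> Dict[str, int]:
    """Assign each token its position in the list as its integer id."""
    return {token: i for i, token in enumerate(tokens)}


special = ["<pad>", "<start>", "<end>"]  # hypothetical special tokens
dna_ids = index_tokens(build_tokens("ATCG", special))
protein_ids = index_tokens(build_tokens("ACDEFGHIKLMNPQRSTVWY", special))
```

With this ordering the special tokens receive the same integer ids regardless of how many letters the alphabet contains.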

0.5.1 (2019-01-09)

Added

  • Add extra CSS to underline links in articles.
  • Exit if the download directory doesn't exist in the call to download organism.
  • Wording improvements in docs.

0.5.0 (2018-12-31)

Added

  • Include seq_tensor_one_hot in the PyTorch Parser.
  • Added a GCGCRecord.encoded_seq property.
  • New gcgc.random module to start holding sequence data.
  • New gcgc.rollout module to handle working through chunks of sequences.
    • rollout_kmers will roll out kmers.
    • rollout_seq_features will roll out the SeqFeatures from a SeqRecord.
  • EncodingAlphabet now can optionally take a gap_characters set of characters to add to the alphabet letters. It also takes add_lower_case_for_inserts which will duplicate the alphabet, but convert the letters to lowercase.
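
The effect of the two EncodingAlphabet options can be approximated in plain Python. The sketch below borrows the parameter names from the entry above, but `build_letters` is a hypothetical function, and the ordering it uses (uppercase letters, then lowercase copies, then sorted gap characters) is an assumption, not GCGC's documented behavior.

```python
from typing import Optional, Set


def build_letters(
    letters: str,
    gap_characters: Optional[Set[str]] = None,
    add_lower_case_for_inserts: bool = False,
) -> str:
    """Extend an alphabet with lowercase copies and gap characters."""
    result = letters
    if add_lower_case_for_inserts:
        # Duplicate the alphabet in lowercase, e.g. to mark inserted residues.
        result += letters.lower()
    if gap_characters:
        # Sort so the result does not depend on set iteration order.
        result += "".join(sorted(gap_characters))
    return result


extended = build_letters(
    "ATCG", gap_characters={"-", "."}, add_lower_case_for_inserts=True
)
```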

Fixed

  • Fixed bug in GenomicDataset.from_path where it still referred to init_from_path_generator.

0.4.0

Added

  • EncodedSeq now supports iterating through kmers, see EncodedSeq.rollout_kmers for options.
  • GCGC is citable.
  • GCGC now has a CHANGELOG.md.
