Find specific gene or transcript kmers. And more.

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Intended Audience
- Science/Research
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Natural Language
- English
Operating System
- POSIX :: Linux
Programming Language
- Python

Project description

Kmerator

Prototype for decomposition of transcript or gene sequences and extraction of their specific k-mers

ref: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8221386/

Kmerator is a prototype tool designed to predict specific k-mers (also called tags) from input sequences, given a reference genome and an ENSEMBL-like transcriptome. From these specific k-mers, it also outputs the corresponding specific contigs, which are sequences of consecutive k-mers (two consecutive k-mers must overlap by k-1 nucleotides; otherwise a new contig starts). Kmerator first uses Jellyfish [1] to create 2 queryable indexes from the reference genome and transcriptome, then decomposes your input transcript or gene sequences and counts the occurrences of each k-mer in the genome and transcriptome. The occurrence counts are then interpreted, in different ways, to select the specific k-mers from your input.
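The decomposition and contig-building steps described above can be sketched in a few lines of Python. This is a minimal illustration, not kmerator's actual code: the function names are hypothetical, and the `is_specific` callback stands in for the Jellyfish count lookups that decide whether a k-mer is kept.

```python
def kmers(sequence, k=31):
    """Yield (position, k-mer) for every window of length k in the sequence."""
    for i in range(len(sequence) - k + 1):
        yield i, sequence[i:i + k]


def contigs_from_specific(sequence, is_specific, k=31):
    """Merge runs of consecutive specific k-mers into contigs.

    Two specific k-mers at positions p and p+1 overlap by k-1 nucleotides,
    so the second one extends the current contig by a single nucleotide.
    Any gap between specific k-mers starts a new contig.
    """
    contigs = []
    current = ""
    previous_pos = None
    for pos, kmer in kmers(sequence, k):
        if not is_specific(kmer):
            continue
        if previous_pos is not None and pos == previous_pos + 1:
            current += kmer[-1]          # overlap of k-1: add one nucleotide
        else:
            if current:
                contigs.append(current)  # close the previous contig
            current = kmer               # start a new contig
        previous_pos = pos
    if current:
        contigs.append(current)
    return contigs


# Toy example with k=3 and a hand-picked set of "specific" k-mers.
specific = {"CGT", "GTA", "GGT", "GTT"}
toy_contigs = contigs_from_specific("ACGTACGGTT", lambda km: km in specific, k=3)
print(toy_contigs)
```

In the toy sequence, the specific k-mers at positions 1 and 2 merge into one contig and those at positions 6 and 7 into another, because the k-mers in between are not specific.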

Before using kmerator, a Jellyfish index of the reference genome must be created. kmerator automatically creates a dataset for the requested species and release number (by default, homo_sapiens and the latest release). A dataset is composed of 4 files per species/release: a Jellyfish index of the modified transcriptome from Ensembl (cDNA + ncRNA, without alternative chromosomes), a binary file representing the same transcriptome, another binary file containing general information on the genes of the transcriptome, and a report file.

Specific kmers

Specific contigs

Dependencies

Python >= v3.7
Jellyfish >= 2.0

Installation

Solution 1 (preferred)

Install with pip

pip3 install kmerator

Solution 2

Installation from GitHub

git clone https://github.com/Transipedia/kmerator3.git
ln -s $PWD/kmerator3/kmerator/kmerator.py /usr/local/bin/kmerator  # or somewhere in your $PATH

How to use kmerator

First of all, remember that kmerator needs a Jellyfish index of the genome. You must build it for the species you are studying. You can store the index file wherever you want and name it as you like.
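As a rough sketch of how the genome index could be built, the snippet below assembles a `jellyfish count` command line from Python. The file names and the hash size are assumptions for illustration. Whether the index should use canonical counting (Jellyfish's `-C` flag) is not stated here, so the sketch leaves it off by default and exposes it as an option. The k-mer length of the index should match the one passed to kmerator (31 by default).

```python
import shlex


def jellyfish_count_command(genome_fasta, index_path, k=31,
                            hash_size="100M", threads=1, canonical=False):
    """Build the argument list for `jellyfish count` (nothing is executed here)."""
    cmd = ["jellyfish", "count",
           "-m", str(k),          # k-mer length, should match kmerator's -k
           "-s", hash_size,       # initial hash size (assumed value)
           "-t", str(threads),    # number of threads
           "-o", index_path]      # output index file (.jf)
    if canonical:
        cmd.append("-C")          # canonical counting (assumption, off by default)
    cmd.append(genome_fasta)
    return cmd


# Hypothetical file names, for illustration only.
cmd = jellyfish_count_command("genome.fa", "genome.jf", threads=4)
print(" ".join(shlex.quote(part) for part in cmd))
# To actually build the index: subprocess.run(cmd, check=True)
```

The resulting `.jf` file is the one to give to the --genome argument or to the genome directive of the configuration file.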

Configuration file

kmerator takes many arguments, so to reduce the number you have to type, it is advisable to edit the configuration file with the command:

kmerator -e

If you fill in the datadir and genome directives, you will not have to pass the --datadir and --genome arguments on every run. If you are working on a species other than human, you can also fill in the specie directive. For a long-term project, you may also want to set a release number.

Execute requests

There are two main cases:

You are looking for specific k-mers of annotated genes or transcripts: use the --selection option, followed by:
- the list of genes and/or transcripts, separated by spaces
- or a file containing the list of genes/transcripts. The separator can be a space, a tab or a newline, and comments are allowed (#)
You are looking for specific k-mers of unannotated sequences: use the --fasta-file option, followed by a fasta file containing your sequences. If you are working on chimeras, add the --chimera option.
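To make the selection file format concrete, here is a minimal Python sketch of how such a file could be read. It is not kmerator's own parser: the function name is hypothetical, and it assumes that a `#` starts a comment running to the end of the line.

```python
def parse_selection(text):
    """Return the gene/transcript names found in the text of a selection file.

    Names may be separated by spaces, tabs or newlines.
    Assumption: everything after a '#' on a line is a comment.
    """
    items = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]   # drop the comment part, if any
        items.extend(line.split())     # split on any run of whitespace
    return items


example_file = """\
# genes of interest
npm1 brca2\t# two gene symbols on one line
ENST00000255409
ENSG00000159216
"""
selection = parse_selection(example_file)
print(selection)
```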

Examples:

kmerator -s npm1 brca2 ENST00000255409 ENSG00000159216    # you can mix genes and transcripts
kmerator -s genes.txt                                     # you can also use a file with gene list
kmerator -f file.fa                                       # give a fasta file for unannotated sequences

Note: the above commands assume that the configuration file contains at least the datadir and genome directives. The default species is homo_sapiens, and the latest available release is used (if it is not present in datadir, kmerator offers to build the dataset automatically).
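When the configuration file is not filled in, the same requests need the --datadir and --genome arguments on the command line. The sketch below builds such a command from Python, for example to drive kmerator from a pipeline script. Only options listed in the help output are used. The helper name and the paths are hypothetical, and treating --selection and --fasta-file as mutually exclusive is an assumption drawn from their descriptions.

```python
def kmerator_command(selection=None, fasta_file=None, datadir=None,
                     genome=None, specie=None, release=None, yes=False):
    """Build a kmerator argument list (nothing is executed here)."""
    # Assumption: a request uses either --selection or --fasta-file, not both.
    if (selection is None) == (fasta_file is None):
        raise ValueError("give exactly one of selection or fasta_file")
    cmd = ["kmerator"]
    if selection is not None:
        cmd += ["--selection", *selection]
    else:
        cmd += ["--fasta-file", fasta_file]
    options = (("--datadir", datadir), ("--genome", genome),
               ("--specie", specie), ("--release", release))
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    if yes:
        cmd.append("--yes")   # answer 'yes' to prompts, e.g. dataset building
    return cmd


# Hypothetical paths, for illustration only.
request = kmerator_command(selection=["npm1", "brca2"],
                           datadir="/data/kmerator",
                           genome="/data/genome.jf",
                           yes=True)
print(" ".join(request))
# To run it: subprocess.run(request, check=True)
```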

Note the difference between genes and transcripts

When you are looking for the specific kmers of a gene (symbol, alias or Ensembl name), kmerator fetches the sequence of its canonical transcript, extracts its kmers and keeps those that are found only in that gene.
When you are looking for a transcript, kmerator keeps only the kmers found in that transcript and in no other. If isoforms completely cover the transcript, no kmer will be kept.
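The difference between the two levels can be illustrated with a small Python sketch. It is a simplified reading of the rules above, not kmerator's implementation: it only looks at which transcripts contain a k-mer and ignores the genome occurrence counts. The function names and the identifiers (T1, G1, ...) are hypothetical.

```python
def gene_specific(hits, gene_id, transcript_to_gene):
    """Gene level: every transcript containing the k-mer belongs to the gene."""
    return bool(hits) and all(transcript_to_gene[t] == gene_id for t in hits)


def transcript_specific(hits, transcript_id):
    """Transcript level: the requested transcript is the only one containing the k-mer."""
    return hits == {transcript_id}


# Toy annotation: gene G1 has two isoforms (T1, T2), gene G2 has one (T3).
transcript_to_gene = {"T1": "G1", "T2": "G1", "T3": "G2"}

# Set of transcripts in which each toy k-mer occurs.
shared_by_isoforms = {"T1", "T2"}
unique_to_t1 = {"T1"}
shared_across_genes = {"T1", "T3"}

for name, hits in [("shared_by_isoforms", shared_by_isoforms),
                   ("unique_to_t1", unique_to_t1),
                   ("shared_across_genes", shared_across_genes)]:
    print(name,
          gene_specific(hits, "G1", transcript_to_gene),
          transcript_specific(hits, "T1"))
```

A k-mer shared by both isoforms of G1 is kept at the gene level but rejected at the transcript level, which is why a transcript fully covered by its isoforms yields no kmer.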

Datasets

To work, kmerator needs a Jellyfish index of the genome, a Jellyfish index of the transcriptome and various other files. You have to build the Jellyfish genome index manually. In contrast, kmerator builds the Jellyfish transcriptome index and the other files it needs, which together we call a dataset. There is one dataset per species and per transcriptome release. When kmerator does not find the requested transcriptome release in datadir (by default, the latest available on Ensembl), it offers to build the dataset automatically. In addition, dataset management options are available:

kmerator -l            # list local datasets
kmerator -u            # find last release on Ensembl, and build dataset if not present
kmerator --mk-dataset  # build dataset according to -r <release> and -S <specie> arguments
kmerator --rm-dataset  # delete dataset according to -r <release> and -S <specie> arguments

All arguments

optional arguments:                                                                                      
  -h, --help            show this help message and exit                                                  
  -s SELECTION [SELECTION ...], --selection SELECTION [SELECTION ...]                                    
                        list of gene IDs (ENSG, gene Symbol or alias) or transcript IDs (ENST) from which you want to extract specific kmers. For
                        genes, kmerator searches for specific kmers along the gene. For transcripts, it searches for kmers specific to the transcript. You can also give
                        a file with your genes/transcripts separated by space, tab or newline. If you want to use your own unannotated sequences, you must
                        give your fasta file with the --fasta-file option.
  -f FASTA_FILE, --fasta-file FASTA_FILE                                                                 
                        Use this option when your sequences are unannotated or come from an annotation file external to Ensembl. Otherwise, use
                        the --selection option.
  -d DATADIR, --datadir DATADIR                     
                        Storage directory for kmerator datasets. We recommend setting this parameter by editing the configuration file (kmerator --edit)
  -g GENOME, --genome GENOME                                                                             
                        Genome jellyfish index (.jf) to use for k-mers requests.                                                                                                                                   
  -S SPECIE, --specie SPECIE                                                                                                                                                                                       
                        indicate a species referenced in Ensembl; for help, see https://rest.ensembl.org/documentation/info/species. You can use
                        the 'name', the 'display_name' or any 'alias'. For example human, homo_sapiens or homsap are valid (default: human).
  -k KMER_LENGTH, --kmer-length KMER_LENGTH                                                              
                        k-mer length that you want to use (default 31).                                  
  -r RELEASE, --release RELEASE                     
                        release of transcriptome (default: last).                                        
  --chimera             Only with '--fasta-file' option.                                                 
  --stringent           Only for genes with '--selection' option: use this option if you want to select gene-specific k-mers present in ALL known
                        transcripts for your gene. If false, a k-mer is considered as gene-specific if present in at least one isoform of your gene of
                        interest.
  -o OUTPUT, --output OUTPUT
                        output directory, created if it does not exist (default: 'output')
  -t THREAD, --thread THREAD
                        run n processes simultaneously (default: 1)
  --tmpdir TMPDIR       directory for temporary files (default: /tmp/kmerator_<random>)
  -D, --debug           Show more details while Kmerator is running.                                     
  --keep                keep intermediate files (sequences, indexes, separate kmers and contigs files).                                                                                                            
  -y, --yes             assumes 'yes' as the prompt answer, run non-interactively.                                                                                                                                 
  -e, --edit-config     Edit config file                                                                 
  -l, --list-dataset, --list-datasets               
                        list the local datasets (based on the datadir option).                           
  --rm-dataset          remove a dataset, according to the --specie and --release options
  --mk-dataset          make a dataset, according to the --specie and --release options
  -u, --update-dataset  builds a new dataset if a new version is found on Ensembl                                                                                                                                  
  -v, --version         show program's version number and exit

References

[1] Guillaume Marçais, Carl Kingsford, A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, Volume 27, Issue 6, 15 March 2011, Pages 764–770, https://doi.org/10.1093/bioinformatics/btr011
[2] Rodriguez JM, et al. Nucleic Acids Res. Database issue; 2017 Oct 23

Release history

1.0.2

Oct 18, 2024

1.0.0

Jun 12, 2024

0.10.0

May 16, 2024

0.9.6b0 pre-release

May 15, 2024

0.9.3b0 pre-release

Dec 18, 2023

0.9.2b0 pre-release

Sep 7, 2023

0.8.4b0 pre-release

Aug 29, 2023

0.8.3b0 pre-release

Jun 19, 2023

0.8.2b0 pre-release

Jun 19, 2023

0.8.1b0 pre-release

Jun 15, 2023

This version

0.7.7b0 pre-release

May 16, 2023

0.7.6b0 pre-release

May 15, 2023

0.7.5b0 pre-release

Apr 17, 2023

0.7.4b0 pre-release

Apr 17, 2023

0.7.3b0 pre-release

Apr 12, 2023

0.7.2b0 pre-release

Mar 7, 2023

0.7.1b0 pre-release

Mar 7, 2023

0.7.0b0 pre-release

Feb 28, 2023

Download files

Source Distribution

kmerator-0.7.7b0.tar.gz (29.3 kB)

Uploaded May 16, 2023 Source

Built Distribution

kmerator-0.7.7b0-py3-none-any.whl (42.4 kB)

Uploaded May 16, 2023 Python 3

File details

Details for the file kmerator-0.7.7b0.tar.gz.

File metadata

Download URL: kmerator-0.7.7b0.tar.gz
Upload date: May 16, 2023
Size: 29.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.9.2

File hashes

Hashes for kmerator-0.7.7b0.tar.gz
Algorithm	Hash digest
SHA256	`f452df8d185feff3816a269521bc2467663c4f83e996468c12c2887ecb38d11c`
MD5	`9dd6779f36f3a012e6100e1ace3e002a`
BLAKE2b-256	`2dc4691b2f96419494c4402c3cbaed2123a3ac4376637cffc1b6661b3115d106`

File details

Details for the file kmerator-0.7.7b0-py3-none-any.whl.

File metadata

Download URL: kmerator-0.7.7b0-py3-none-any.whl
Upload date: May 16, 2023
Size: 42.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.9.2

File hashes

Hashes for kmerator-0.7.7b0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`73bc3926164d31c5f78ec4481a2d93fcfc249069f78cfe2a525d0b8e1ff4fc12`
MD5	`da4fc3ae27794aec2b880156315d9be6`
BLAKE2b-256	`2e69c245cb5ebe15a8e914df56d74bcd0c90a21a901583b1535b80c39585b488`