Count small kmer frequencies in nucleotide sequences.
Project description
SEEKR
Find communities of nucleotide sequences based on k-mer frequencies.
Installation
To use this library, you have to have >=Python3.9 on your computer.
Install through Python Package Index (PyPI)
$ pip install seekr
This will make both the command line tool and the python module available.
Install through Github
$ pip install git+https://github.com/CalabreseLab/seekr.git
This will install seekr from the main branck of the Github page and make both the command line tool and the python module available.
Install through Docker Hub
First you need to install Docker on your local computer. Then pull the Docker Image:
$ docker pull calabreselab/seekr:latest
This will install the Docker container which enables running seekr from the command line or Jupyter Notebook. See below the Seekr Docker Image section for more details.
CentOS
Users have been successful in installing seekr
from source on CentOS:
conda create --name seekr_source python=3.9
conda activate seekr_source
git clone https://github.com/CalabreseLab/seekr.git
python3 setup.py install
conda install python-igraph
conda install louvain
See this issue for further discussion.
Usage
You can either use SEEKR from the command line or as a python module.
The package is broken up into a set of tools, each of which perform a single task.
From the command line, all of the functions will begin with seekr_
.
For example, you can use seekr_kmer_counts
to generate a normalized k-mer count matrix of m
rows by n
columns,
where m
is the number of transcripts in a fasta file and n
is 4^k-mer.
Then seekr_pearson
can be used to calculate how well correlated all pairwise combinations of sequences are.
To see all tools and some examples, run:
$ seekr
Quickstart
To get a .csv file of communities for every transcript in a small .fa file called example.fa
,
(where RNAs have been normalized to a data set of canonical transcripts from GENCODE,
we would run:
$ seekr_download_gencode lncRNA -g
$ seekr_filter_gencode v33_lncRNA.fa -gtf v33_lncRNA.chr_patch_hapl_scaff.annotation.gtf -len 500 -can -o v33 # Name may change with GENCODE updates.
$ seekr_norm_vectors v33_filtered.fa
$ seekr_kmer_counts example.fa -o 6mers.csv -mv mean.npy -sv std.npy
$ seekr_pearson 6mers.csv 6mers.csv -o example_vs_self.csv
$ cat example_vs_self.csv
This quickstart procedure produces a number of other files beside the file example_vs_self.csv
.
See below to learn about the other files produced along the way.
Notes:
- We'll use
example.fa
as a small sample set, if you want to download that file and follow along. - GENCODE is a high quality source for human and mouse lncRNA annotation.
Fasta files can be found here.
- In the examples below we'll generically refer to
gencode.fa
. Any sufficiently large fasta file can be used, as needed.
- In the examples below we'll generically refer to
Here are some examples if you just want to get going.
Command line examples
seekr_download
Browsing GENCODE is nice if you want to explore fasta file options. But if you know what you want, you can just download it from the command line. This tool is also helpful on remote clusters.
To download all human transcripts of the latest release into a fasta file, run:
$ seekr_download_gencode all
GENCODE also stores mouse sequences. You can select mouse using the --species
flag:
$ seekr_download_gencode all -s mouse
For consistency across experiments, you may want to stick to a particular release of GENCODE. To get lncRNAs from the M5 release of mouse, use --release
:
$ seekr_download_gencode lncRNA -s mouse -r M5
If you do not want the script to automatically unzip the file, you can leave the fasta file gzipped with --zip
:
$ seekr_download_gencode all -z
Finally, if you want to download the gtf file from the same species and same release for further useage in sequence filtering with seekr_filter_gencode, you can do so with --gtf
:
$ seekr_download_gencode all -g
seekr_kmer_counts
Let's make a small .csv
file of counts.
We'll set a couple flags:
--kmer 2
so we only have 16 k-mers--outfile out_counts.csv
. This file will contain the log2-transformed z-scores of k-mer counts per kb.
$ seekr_kmer_counts example.fa -o out_counts.csv -k 2
$ cat out_counts.csv
You can also see the output of this command here.
Three options are available for log transformation, using the --log2
flag.
Pass --log2 Log2.pre
for log transformation of length normalized k-mer counts, with a +1 pseudo-count,
pass --log2 Log2.post
for log transformation of z-scores following count standardization (this is the default),
and pass --log2 Log2.none
for no log transformation.
If we want to avoid normalization, we can produce k-mer counts per kb by setting the --log2 Log2.none
, --uncentered
and --unstandardized
flags:
$ seekr_kmer_counts example.fa -o out_counts.csv -k 2 --log2 Log2.none -uc -us
Similarly, if we want a more compact, efficient numpy file,
we can add the --binary
and --remove_label
flags:
$ seekr_kmer_counts example.fa -o out_counts.npy -k 2 --binary --remove_label
Note: This numpy file is binary, so you won't be able to view it directly.
What happens if we also remove the --kmer 2
option?
$ seekr_kmer_counts example.fa -o out_counts.npy
~/seekr/seekr/kmer_counts.py:143: RuntimeWarning: invalid value encountered in true_divide
self.counts /= self.std
WARNING: You have `np.nan` values in your counts after standardization.
This is likely due to a *k*-mer not appearing in any of your sequences. Try:
1) using a smaller *k*-mer size,
2) beginning with a larger set of sequences,
3) passing precomputed normalization vectors from a larger data set (e.g. GENCODE).
The code runs, but we get a warning. That's because we're normalizing 4096 columns of k-mers. Most of those k-mers never appear in any of our 5 lncRNAs. This necessarily results in division by 0. If we use a much larger set of sequences, this same line works fine:
$ kmer_counts gencode.fa -o gencode_counts.npy
But what should you do if you're only interested in specific sequences?
seekr_norm_vectors
An effective way to find important k-mers in a small number of RNAs is to count their k-mers, but normalize their counts to mean and standard deviation vectors produced from a larger set of transcripts. We can produce these vectors once, then use them on multiple smaller sets of RNAs of interest. To produce the vectors, run:
Note: If --log2 Log2.post
is passed in seekr_kmer_counts, then the --log2 Log2.post
flag must be passed to seekr_norm_vectors.
This is so that the log-transformed k-mer counts are standardized against reference k-mer counts that are also log transformed.
$ seekr_norm_vectors gencode.fa
If you run ls
, you should see mean.npy
and std.npy
in your directory.
To specify the path of these output files,
use the --mean_vector
and --std_vector
flags:
$ seekr_norm_vectors gencode.fa -k 5 -mv mean_5mers.npy -sv std_5mers.npy
Now, we can use these vectors to analyze our RNAs of interest:
$ kmer_counts example.fa -o out_5mers_gencode_norm.csv -k 5 -mv mean_5mers.npy -sv std_5mers.npy
seekr_pearson
To find Pearson correlations between k-mer count profiles, run seekr_pearson
.
Running the program and options are similar to seekr_kmers_counts
.
Input files for seekr_pearson
will always be the output files from
one or more runs of kmer_counts
.
The default settings accept two csv files and output a third csv file.
$ seekr_pearson out_counts.csv out_counts.csv -o example_vs_self.csv
$ cat example_vs_self.csv
The only other options besides the -o
flag control binary versus plain text input and output.
If you have a binary input file (i.e. a .npy file),
and also want a binary output file, you can use the --binary_input
and --binary_output
flags:
$ seekr_pearson out_counts.npy out_counts.npy -o example_vs_self.npy -bi -bo
If we want to compare counts between two files (e.g. RNAs between mouse and human), that is also possible:
$ seekr_pearson human_6mers.npy mouse_6mers.npy -o human_vs_mouse.npy
Module example
For larger or more specific workloads, it may be better to use the seekr
module in python.
In this example, we'll calculate similarities between two example fasta files,
(e.g., XIST and a set of RNAs we think could be similar to XIST)
using the normalization vectors from the human GENCODE set.
We'll use all k-mers from 3 to 6, and label transcripts with unique labels.
import numpy as np
import pandas as pd
from seekr.kmer_counts import BasicCounter
from seekr.pearson import pearson
from seekr.fasta_reader import Reader
gencode = 'gencode.fa'
xist = 'xist.fa'
lncRNAs = 'other_lncs.fa'
# Make sure each lncRNA in other_lncs.fa has a unique name
headers = Reader(lncRNAs).get_headers()
names = [h.strip('>') + f'_{i}' for i, h in enumerate(headers)]
for k in range(3, 7):
# Make normalization vectors
gencode_counter = BasicCounter(gencode, k=k)
gencode_counter.get_counts()
mean_path = f'mean_{k}mers.npy'
std_path = f'std_{k}mers.npy'
np.save(mean_path, gencode_counter.mean)
np.save(std_path, gencode_counter.std)
# Count *k*-mers
xist_counter = BasicCounter(xist,
outfile=f'{k}mers_xist.npy',
mean=mean_path,
std=std_path,
k=k)
lncs_counter = BasicCounter(lncRNAs,
outfile=f'{k}mers_lncs.npy',
mean=mean_path,
std=std_path,
k=k)
xist_counter.make_count_file(names=['XIST'])
lncs_counter.make_count_file(names=names)
# Find similarities
sim = pearson(xist_counter.counts,
lncs_counter.counts,
outfile=f'xist_vs_lncs_{k}mers.npy')
# Save labeled .csv file of similarities.
sim_df = pd.DataFrame(sim, ['XIST'], names)
sim_df.to_csv(f'xist_vs_lncs_{k}mers.csv')
Each loop will write six files to disk:
mean_{k}mers.npy
: Mean vector for GENCODE human lncRNAs. Once this has been saved, the first portion of the code doesn't need to be run again.std_{k}mers.npy
: Standard deviation vector for GENCODE human lncRNAs.{k}mers_xist.npy
: Normalized k-mer profile for Xist.{k}mers_lncs.npy
: Normalized k-mer profile for other lncRNAs of interest.xist_vs_lncs_{k}mers.npy
: Pearson's r values for all pairwise comparisons between Xist and the other lncRNAs.xist_vs_lncs_{k}mers.csv
: Labeled, plain text version of pairwise comparisons.
Seekr2.0.0 functions update
In seekr2.0.0 we updated 9 functions spanning from sequence processing, stastics to visualizations. The statistics part focuses on fitting a background distribution and then using this background distribution to calculate p values for pearson's correlation r values. In this way, p value could be used straightforwardly to call pairs of significant similarities. The visualization part focuses on plotting the r or p value from various aspects as well as rendering sequence similarities based on seekr mode.
Sequence processing
filter_gencode
Filter gencode fasta file by 1: length; 2: transcript feature type and Ensemble_canonical tag 3: transcript feature type and isoform number 4: remove duplicated sequences. Please use gencode downloaded fasta and gtf files as input, other formats are not tested. Make sure the fasta and gtf files are from the same release and same species. If canonical filtering is turned on, only sequences that has the feature type 'transcript' and has a tag 'Ensembl_canonical' will be kept. For older version of gencode gtfs that does not have the tag 'Ensembl_canonical', please use isoform filter. If isoform filtering is turned on, only sequences that has the feature type 'transcript' and has a transcript_name ('Gm37180-201') that contains the isoform number (201) will be kept. Isoform number support regular expression, for example, '[0-9]01'
will match 101, 201, 301 up to 901. Please quote the argument if using regular expression in the command line to avoid errors. If rm_dup is set to True, sequences that are exactly the same will be removed and only the first occurence will be kept. If more than 50 transcript_ids in gtf file cannot be matched to the fasta headers, there will be a warining. transcript_id in gtf file is located inside the 9th field. transcript_id in the fasta headers is the first element of the header split by |
. transcript_name is also located inside the 9th field. These are standard formats from gencode. Please refer to Gencode data format website for further details.
Python example:
from seekr import filter_gencode
headers, seqs = filter_gencode.filter_gencode(fasta_path='gencode.vM33.lncRNA_transcripts.fa',
gtf_path='gencode.vM33.long_noncoding_RNAs.gtf',
len_threshold=500, canonical=True, isoform='201',
rm_dup=False, outputname='test')
Console examples:
filter by length only (>=500)
$ seekr_filter_gencode gencode.vM33.lncRNA_transcripts.fa -len 500 -o test
filter by length (>=500) and Ensemble_canonical tag and isoform number 201
$ seekr_filter_gencode gencode.vM33.lncRNA_transcripts.fa -gtf gencode.vM33.long_noncoding_RNAs.gtf -len 500 -can -iso 201 -o test
filter by Ensemble_canonical tag and remove duplicated sequences
$ seekr_filter_gencode gencode.vM33.lncRNA_transcripts.fa -gtf gencode.vM33.long_noncoding_RNAs.gtf -can -rd -o test
filter by isoform number with regular expression, remeber to quote the full argument when using regular expression
$ seekr_filter_gencode gencode.vM33.lncRNA_transcripts.fa -gtf gencode.vM33.long_noncoding_RNAs.gtf -iso '[0-9]01' -o test
Statistics
find_dist
Find the best fitted model to the input background sequences. Data to be fitted are all possible pairwise kmer pearson correlation for the background sequences. User can define the list of models for fitting. The output is usually a fitted model, but could also be the actual data (pearson correlation r values as an numpy array). This is helpful if the background distribution is small, in which case fitting to any model would not give a descent result. The results (best fitted model or actual data) will be used to calculate p-values.
If fit_model is True, it returns a dataframe contains the distributions name, goodness of fit (like D stats in ks) and fitted params for all models. If fit_model is False, it returns the actual data (with or without subsetting based on subsetting argument) as an numpy array, without fitting to any distributions. If plotfit is given, a pdf plot with the given name and path will be saved, showing the fitted distributions (red dash line) vs the actual data (blue solid line) for all distributions in models. If outputname is given, a csv file (outputname.csv) will be saved for either a list of fitted models or an numpy array.
Python example:
from seekr import find_dist
fitres = find_dist.find_dist(inputseq='default', k_mer=4, models='common10',
subsetting=True, subset_size = 10000,
fit_model=True, statsmethod='ks',progress_bar=True,
plotfit='modefitplot', outputname='test_fitres')
Console example (minimal and full versions):
$ seekr_find_dist default -o test_fitres
$ seekr_find_dist default -k 4 -l Log2.pre -mdl common10 -sbt -sbs 100000 -fm -statm ks -pb -pf modelfitplot -o test_fitres
find_pval
Calculate p values of the seekr.pearson correlation r values for input sequences 1 vs input sequences 2, sequences 1 and sequences 2 can be the same. The calculation is based on the output of find_dist, which is either a list of models or an numpy array. This function connects the output of find_dist and the input sequnces of interests. Given the background sequencs, find_dist calculates all possible pairwise seekr.pearson values and then outputs either a list of fitted models or the numpy array of the actual data. find_pval firstly calculates the seekr.pearson of the two input sequences, which produces a correlation matrix. p value is then calculated for each r value in the pearson correlation matrix based on the output of find_dist.
The output of find_pval is a dataframe of p values, with the row names (input 1) and column names (input 2) as the headers of the input sequences. If outputname is given, the dataframe is saved to a csv file named as outputname.csv.
Python example:
from seekr import find_pval
pvals=find_pval.find_pval(seq1file='test1.fa', seq2file='test2.fa',
mean_path='mean_4mers.npy', std_path='std_4mers.npy',
k_mer=4, fitres=fitres, log2='Log2.post',
bestfit=1, outputname='test_pval', progress_bar=True)
Console example (minimal and full versions):
$ seekr_find_pval seqs1.fa seqs2.fa mean_4mer.npy std_4mer.npy 4 fitres.csv -o test_pval
$ seekr_find_pval seqs1.fa seqs2.fa mean_4mer.npy std_4mer.npy 4 fitres.csv -ft npy -bf 1 -o test_pval -pb
adj_pval
Adjust p-values for multiple comparisons. This function performs multiple comparison correction for p values calculated by find_pval. Multiple comparison correction is performed by statsmodels.stats.multitest.multipletests function. The input is a dataframe of p values, with the row names (input 1) and column names (input 2) as the headers of the input sequences. The output is a dataframe of adjusted p values, with the same dimension as the input dataframe. If the input dataframe is a symmetric matrix: the row and column (or input1 and input2 in find_pval) are the same. The number of multiple comparisons is automatically corrected to be half the matrix size, excluding the diagonal. That is only the upper triangle of the matrix (excluding diagonal) is used for multiple comparison correction. In the output dataframe, the upper triangle of the matrix (excluding diagonal) is filled with the adjusted p values, while the lower triangle and the diagonal is filled with NaN. Symmetry is determined by checking if the input dataframe is equal to its transpose after excluding the diagonal elements. Before transposing and comparison, the input dataframe is also rounded to the 5th digit after the decimal point to avoid floating point error. If the input dataframe is not a symmetric matrix the total matrix is used for multiple comparison correction
The output of adj_pval is a dataframe of adjusted p values, with the row names (input 1) and column names (input 2) same as the input dataframe. If outputname is given, the dataframe is saved to a csv file named as outputname.csv.
Python example:
from seekr import adj_pval
adjpvals=adj_pval.adj_pval(pvals, method='bonferroni', alpha=0.05, outputname='test_bonferroni_0.05_adj_pval')
Console example (minimal and full versions):
$ seekr_adj_pval test_pval.csv fdr_bh
$ seekr_adj_pval test_pval.csv bonferroni -a 0.05 -o test_bonferroni_0.05_adj_pval
Visualization
kmer_heatmap
Customizeable heatmap (outputname.hformat) for easier visualization of the results of seekr.pearson (r values) or find_pval (p values). It takes in a dataframe with row and column names and plot the heatmap with/without dendrograms for both rows and columns. kmer_dendrgram is a good alternative to plot only the dendrograms (partial or full) to get a better idea of the clustering. For interested subgroups, use kmer_leiden, kmer_count_barplot, kmer_msd_barplot, kmer_comp_textplot and kmer_indi_textplot for further analysis.
Python example:
from seekr import kmer_heatmap
kmer_heatmap.kmer_heatmap(df, datamin=-1, datamax=1, thresh_value=0.00,
color_range=['#1b7837', '#ffffff', '#c51b7d'],
cluster=True,
distmetric='correlation', linkmethod='complete',
hmapw_ratio=0.4, hmaph_ratio=0.3, x_tick_size=10,
y_tick_size=10, cbar_font_size=26,
outputname='kmer_heatmap',hformat='pdf', hdpi=300)
Console example (minimal and full versions):
$ seekr_kmer_heatmap pval.csv 0 1
$ seekr_kmer_heatmap pval.csv 0 1 -th 0.05 -cr '#1b7837,#ffffff,#c51b7d' -cl -distm correlation -linkm complete -wratio 0.3 -hratio 0.3 -xts 16 -yts 16 -cfs 16 -o /Users/username/Desktop/kmer_heatmap -hf pdf -hd 300
kmer_dendrogram
Customizeable dendrograms (outputname.pformat) for easier visualization of the hierarchical clustering results of seekr.pearson (r values) or find_pval (p values). Performs hierarchial clustering on either rows or columns of a dataframe, which is a better way to visualize the whole or partial clusters in kmer_heatmap. For interested subgroups, use kmer_leiden, kmer_count_barplot, kmer_msd_barplot, kmer_comp_textplot and kmer_indi_textplot for further analysis.
Python example:
from seekr import kmer_dendrogram
kmer_dendrogram.kmer_dendrogram(df, dendro_direct='row',
distmetric='correlation', linkmethod='complete',
plot_ht=8, wd_ratio=0.5, leaf_font_size = 16,
outputname='kmer_dendrogram', pformat='pdf', pdpi=300)
Console example (minimal and full versions):
$ seekr_kmer_dendrogram pval.csv
$ seekr_kmer_dendrogram pval.csv -dd column -distm correlation -linkm complete -ph 8 -wratio 0.5 -lfs 16 -o kmer_dendrogram -pf pdf -d 300
kmer_leiden
Plot Leiden community network (plotname.pdf ) for input fasta seqeunces: calculte sequences distance matrix as seekr pearson correlation of the fasta file to itself. After filtering the edges with pearsoncutoff, use Leiden algorithms to call cluster. Seed can be set to get reporducible community assignments. The resulting undirected network edges are weighted with the pearson correlation r values: darker and thicker edges have higher weights. It works better with small number of nodes (less than 50). For better visualization and customization, please set csvfile to a customized name (string) and import the saved nodes and edges files (csvfile_edges_leiden.csv and csvfile_nodes_leiden.csv) directly to Gephi.
Python example:
from seekr import kmer_leiden
kmer_leiden.kmer_leiden(inputfile='testld.fa',mean='mean_4mers.npy',std='std_4mers.npy',
k=4, algo='RBERVertexPartition', rs=1.0, pearsoncutoff=0, setseed=False,
edgecolormethod='gradient', edgethreshold=0.1, labelfontsize=12,
plotname='kmer_leiden', csvfile=None)
Console example (minimal and full versions):
$ seekr_kmer_leiden ldseq.fa mean_4mer.npy std_4mer.npy 4 -cf testdata
$ seekr_kmer_leiden ldseq.fa mean_4mer.npy std_4mer.npy 4 -a RBERVertexPartition -r 1.0 -pco 0.1 -sd -ec threshold -et 0.2 -m 12 -pn kmer_leiden -cf testdata
kmer_count_barplot
Barplot (outputname.pformat) of the transformed or raw z-score for kmer words of the input sequences (limit to 10). Calculate the z-score with BasicCounter, where user can define whether and how to do log2 transform. Order kmer words by the summed difference from the mean among all sequences, in descending or ascending order. Plot the top x kmer words with bars differently colored for each input sequences.
Python example:
from seekr import kmer_count_barplot
kmer_count_barplot.kmer_count_barplot(inputfile='test.fa', mean='mean_4mers.npy',
std = 'std_4mers.npy', log2 = 'Log2.post',
k=4, sortmethod='ascending', topkmernumber=10,
xlabelsize=20, ylabelsize=20,
xticksize=20, yticksize=20, lengendsize=12,
outputname='kmer_count_barplot', pformat='pdf', pdpi=300)
Console example (minimal and full versions):
$ seekr_kmer_count_barplot test.fa mean_4mer.npy std_4mer.npy 4
$ seekr_kmer_count_barplot test.fa mean_4mer.npy std_4mer.npy 4 -l Log2.post -sm ascending -tn 10 -xls 20 -yls 20 -xts 20 -yts 20 -ls 12 -o kmer_count_barplot -pf pdf -d 300
kmer_msd_barplot
Barplot (outputname.pformat) of the mean of the transformed or raw z-score for kmer words across input sequences. Error bar represents the standard deviation. Order kmer words by mean or sd, in descending or ascending order. Plot the top x kmer words with each bar corresponds to one kmer word.
Python example:
from seekr import kmer_msd_barplot
kmer_msd_barplot.kmer_msd_barplot(inputfile='test.fa', mean='mean_4mers.npy',
std = 'std_4mers.npy', log2 = 'Log2.post',
k=4, sortstat='sd', sortmethod='ascending',
topkmernumber=10,xlabelsize=20, ylabelsize=20,
xticksize=20, yticksize=20,
outputname='kmer_msd_barplot', pformat='pdf', pdpi=300)
Console example (minimal and full versions):
$ seekr_kmer_msd_barplot test.fa mean_4mer.npy std_4mer.npy 4
$ seekr_kmer_msd_barplot test.fa mean_4mer.npy std_4mer.npy 4 -l Log2.post -ss mean -sm descending -tn 10 -xls 20 -yls 20 -xts 20 -yts 20 -o kmer_msd_barplot -pf pdf -d 300
kmer_comp_textplot
Highlight the interested kmer words and compare them in two input sequences (outputname.plotformat). Align the 2 input sequences at the beginning and label sequence positions on the bottom. Highlight in colors the words of interest. For multiple words, if the words overlap, the overlapped characters will be highlighted in the color of the first word in the list. Therefore arrange the words of interest based on the priority of the words, with the most important word at the beginning of the list.
Python example:
from seekr import kmer_comp_textplot
kmer_comp_textplot.kmer_comp_textplot(seq1file='test1.fa', seq2file='test2.fa',
words=['CTCT','GTAG','AAAA','GCGC'],
color_vec='default', wraplen=60,
char_spacing=1.0, line_spacing=0.5,
seqfontsize=28, numfontsize=18, colorblockh=0.5,
outputname='comp_textplot', plotformat='pdf', plotdpi=300)
Console example (minimal and full versions):
$ seekr_kmer_comp_textplot seq1.fa seq2.fa 'ATTA,AAAA,ACTC'
$ seekr_kmer_comp_textplot seq1.fa seq2.fa 'ATTA,AAAA,ACTC' -cv '#d62728,#e377c2,#ff7f0e' -wl 60 -cs 1.0 -ls 0.5 -sfs 28 -nfs 18 -cbh 0.5 -o comp_textplot -pf pdf -d 300
kmer_indi_textplot
Highlight the input kmer words across all input sequences and save for each sequence a separate plot. This is a similar function as kmer_comp_textplot. But instead of aligning and comparing two sequences, kmer_indi_textplot highlight the interested words in all input sequences and plot each sequence individually. For multiple words, if the words overlap, the overlapped characters will be highlighted in the color of the first word in the list. Therefore arrange the words of interest based on the priority of the words, with the most important word at the beginning of the list.
Python example:
from seekr import kmer_indi_textplot
kmer_indi_textplot.kmer_indi_textplot(seqfile='test.fa',
words=['CTCT','GTAG','AAAA','GCGC'],
color_vec='default', wraplen=60,
char_spacing=1.0, line_spacing=0.5,
seqfontsize=28, numfontsize=18, colorblockh=0.5,
outputpath='', plotformat='pdf', plotdpi=300)
Console example (minimal and full versions):
$ seekr_kmer_indi_textplot seq.fa 'ATTA,AAAA,ACTC'
$ seekr_kmer_indi_textplot seq.fa 'ATTA,AAAA,ACTC' -cv '#d62728,#e377c2,#ff7f0e' -wl 60 -cs 1.0 -ls 0.5 -sfs 28 -nfs 18 -cbh 0.5 -op /Users/username/Desktop/indi_textplot/ -pf pdf -d 300
Seekr Docker Image
We now provide a Docker image of seekr which hopefully avoids all the package dependencies and complications when install and run seekr through pip.
Docker Installation
Firstly, you should install Docker on your computer. Please refer to the official website for installation instructions.
After you have successfully installed Docker. Start/Run the application and make sure it is running properly – you should see the docker icon on your task bar with the status indicated.
Pull Docker Image and Test Run
- Start your command line tool: Terminal for MacOS and CMD for Windows. You can also use Powershell or Cygwin for Windows, but Cygwin might have interaction issues.
From the command line, pull the Docker Image:
$ docker pull calabreselab/seekr:latest
You can replace latest
with a specific tag if needed.
- Test Run the Docker Image
$ docker run -it --rm calabreselab/seekr:latest
The -it
tag sets it to interactive mode. If you don't need to run the Docker container in interactive mode (i.e., you don't need a shell inside the container), you can simply omit the -it
flag.
This will print the user manual out to the command line, which is basically the same as you run the command seekr
directly from command line when you pip install seekr.
Run Docker Image from command line
You can run the seekr function from this Docker Image directly from command line with the following specified syntax.
$ docker run -v /path/to/your/files:/data calabreselab/seekr:latest seekr_kmer_comp_textplot /data/seq1.fa /data/seq2.fa 'ATTA,AAAA,ACTC' -o /data/comp_textplot
In this command:
-v /path/to/your/files:/data
: This mounts the directory/path/to/your/files
from your host machine (where seq1.fa and seq2.fa are located) to the/data
directory inside the Docker container. Replace/path/to/your/files
with the actual path to your files.seekr_kmer_comp_textplot /data/seq1.fa /data/seq2.fa 'ATTA,AAAA,ACTC' -o /data/test
: This is the command that gets executed inside the Docker container. Since we mounted our files to/data
in the container, we reference them with/data/seq1.fa
and/data/seq2.fa
.- The
/data
folder is basically a mirror of the folder you specified in/path/to/your/files
. So by specifying-o /data/comp_textplot
(output with the path /data/ and plotname as in comp_textplot.pdf) we can have the output files directly written in to the folder in/path/to/your/files
. - Please remember to specify your output path to
/data/
otherwise it will not be saved to your folder on local machine and it would be hard to locate it even inside the Docker Container Filesystem (in this case, when the Docker Container is removed, your data will be deleted as well).
Examples of code mounts e:/test on Windows as the folder that contains the input and holds the output files:
$ docker run -v e:/test:/data calabreselab/seekr:latest seekr_kmer_comp_textplot /data/test1.fa /data/test2.fa 'ATTA,AAAA,ACTC' -o /data/comp_textplot
$ docker run -v e:/test:/data calabreselab/seekr:latest seekr_kmer_leiden /data/testld.fa /data/bkg_mean_4mers.npy /data/bkg_std_4mers.npy 4 -s -o /data/kmer_leiden
Basically you need to add: docker run -v /path/to/your/files:/data calabreselab/seekr:latest
before the command line code for seekr (see above for examples of all functions).
Run Docker Image with Jupyter Notebook
If you want to work with python codes instead of directly calling from the command line, you can choose to run seekr with Jupyter Notebook inside the Docker Container.
- Run Docker Container with Interactive Terminal:
$ docker run -it -p 8888:8888 -v /path/on/host:/path/in/container calabreselab/seekr:latest /bin/bash
This command will start the container and give you a bash terminal inside it. The -p 8888:8888
flag maps the port 8888 from the container to your host machine so that you can access the Jupyter Notebook server.
/path/on/host
is the path to a directory on your local machine where you want the notebooks and other data to be saved. /path/in/container
is the directory inside the container where you want to access and save the files from Jupyter Notebook.
When you use Jupyter Notebook now and create or save notebooks, they will be stored at /path/in/container
inside the Docker container. Due to the volume mount, these files will also be accessible and stored at /path/on/host
on your host machine so that you can later access the code and files even when the container is removed.
Example of code:
$ docker run -it -p 8888:8888 -v e:/test:/data calabreselab/seekr:latest /bin/bash
- Manually start Jupyter Notebook. From the bash terminal inside the Docker container:
$ jupyter notebook --ip=0.0.0.0 --port=8888 --NotebookApp.token='' --NotebookApp.password='' --allow-root
--ip=0.0.0.0
: Allow connections from any IP address. This is important for accessing Jupyter from outside the container.--port=8888
: Run Jupyter on this port. You can change this if needed, but remember to adjust the-p
flag when starting the Docker container.--allow-root
: Allow Jupyter to be run as the root user inside the Docker container.--NotebookApp.token=''
: This disables the token-based security measure.--NotebookApp.password=''
: This ensures that there's no password required to access the Jupyter server.
Disabling the token and password for Jupyter Notebook reduces security. It's generally okay for local development, but you should avoid doing this in production or any publicly accessible server.
-
Access the Jupyter Notebook from your host machine's browser by entering this address: http://localhost:8888/
-
Run Python functions as demonstrated above. It would be convenient if all input files can be copied over to the folder you have mounted:
/path/on/host
. Pay attention that when you specify your input and output file route, use the/path/in/container
route, as that is a mirror to your local folder.
Example of code:
from seekr import kmer_comp_textplot
kmer_comp_textplot.kmer_comp_textplot(seq1file='/data/test1.fa', seq2file='/data/test2.fa',
words=['ATCG','GTAG','AAAA','GCGC'],
color_vec='default', wraplen=60,
char_spacing=1.0, line_spacing=0.5,
seqfontsize=28, numfontsize=18, colorblockh=0.5,
outputname='/data/comp_textplot', plotformat='pdf',
plotdpi=300)
Once you are done, you can click the Shut Down button under the File tab of Jupyter Notebook to shut down the instance or you can just click Ctrl+C
twice from the command line to kill the process.
Then you need to exit the Docker Container from the interative session by typing exit
and hit enter.
Cleanup (Optional)
-
Clean Up Docker Container:
- List all containers, including the stopped ones:
docker ps -a
- To remove a specific container:
docker rm CONTAINER_ID_OR_NAME
- To remove all stopped containers:
docker container prune
- List all containers, including the stopped ones:
-
Clean Up Docker Image. If you want to remove the Docker image you used:
- List all images:
docker images
- Remove a specific image:
docker rmi IMAGE_ID_OR_NAME
- List all images:
You'd typically only remove an image if you're sure you won't be using it again soon, or if you want to fetch a fresh version of it from a repository like Docker Hub.
- Additional Cleanup. Docker also maintains a cache of intermediate images and volumes. Over time, these can accumulate. To free up space:
- Remove unused data:
docker system prune
- To also remove unused volumes (be careful, as this might remove data you want to keep):
docker system prune --volumes
- Remove unused data:
Remember to always be cautious when cleaning up, especially with commands that remove data. Make sure you have backups of any essential data, and always double-check what you're deleting.
Additional considerations
This section is a "not-so-quickstart", providing more complete views on selection of input data and parameter selection.
Some general advice for thinking about how to use SEEKR. One challenge that we continually face in the lab is there are few ground truths in the lncRNA field and thus it is often unclear how to decide on the best parameters for sequence comparisons using SEEKR. Below are some points that may be useful – these are also discussed in the conclusions of PMID 31097619
Selection of a set of sequences to use for the calculation of standardization vectors
In our experience, one of the most useful features of SEEKR is that it provides a metric of relative similarity.
Consider two lncRNAs, lncRNA-X, which is from mouse, and lncRNA-Y, which is from human.
One way to compare these two lncRNAs using SEEKR would be to calculate their k-mer profiles and compare these profiles in relation to all known mouse lncRNAs.
To do this, one would first use all mouse lncRNAs as an input for the "seekr_norm_vectors" script, to determine the mean and standard deviation of counts for all k-mers in all mouse lncRNAs.
Then, users would take those standardization vectors along with the sequences of lncRNA-X and lncRNA-Y and use them in the “kmer_counts” script to calculate k-mer profiles of lncRNA-X and lncRNA-Y in relation to all mouse lncRNAs.
Finally, users would employ the “seekr_pearson” script to determine how similar lncRNA-X and lncRNA-Y were to each other relative to all other mouse lncRNAs.
The point here is that the set of sequences used to calculate standardization vectors is a key variable – perhaps analogous to a reference gene in a quantitative PCR experiment or a loading control in a western blot.
Users should think through what comparison they are interested in performing and choose their set of standardization sequences accordingly.
In the example above, where all mouse lncRNAs were used for standardization, users might discover that “At the level of k-mers, lncRNA-X is more similar to lncRNA-Y than it is similar to 99% of other mouse lncRNAs”.
But changing the set of sequences for standardization can change the question being asked.
For example, users might be interested in comparing lncRNA-X and lncRNA-Y relative to all known human enhancer RNAs (eRNAs).
Here, using all human eRNAs as the standardization set, users might make an additional insightful discovery, perhaps: “At the level of k-mers, lncRNA-X is no more similar to lncRNA-Y than it is similar to the average human eRNA”.
Relatedly, when users are comparing two large groups of sequences (let us call these “set A” and “set B”), we again recommend thinking about what set of reference sequences would be best for standardization.
In most cases, users will probably want to use the same set of reference sequences to standardize k-mer counts in set A and in set B.
Perhaps set A and set B should again be standardized relative to all mouse lncRNAs; or, both set A and set B be should be standardized relative to the sequences in set B.
But standardizing set A relative to itself, then standardizing set B relative to itself, then using “seekr_pearson” to compare the two sets of sequences might yield a non-sensical comparison.
Selection of k-mer length
In our experience, the most robust biological trends have been relatively insensitive to the length of k-mer used in SEEKR. Still, when deciding on a length of k to use for comparisons, we recommend using a k-mer length for which 4^k is similar to the length of the average feature or key feature that is being compared. The reason for this is that as the length of k increases, so does the number of zero values for k-mer counts in a given sequence. For example, there are 16384 possible 7-mers. If users were interested in finding lncRNAs that are similar to lncRNA-X, which is 500 nucleotides long, a k-mer length of 7 would not be ideal, because the vector of 7-mer counts that corresponds to lncRNA-X would be dominated by zero values. In this example, unless users had a specific rationale for searching 7-mers, a k-mer length of 4 (256 possible k-mers) or 5 (1024 possible k-mers) would provide the basis for a stronger comparison.
Issues and Help
If you have questions about how you can use seekr in your own research, please send an email to jmcalabr@med.unc.edu
For full documentation, run:
$ seekr
Any suggestions, questions, or problems can be directed to our GitHub Issues page.
Please also see the pre-print to a methods paper we wrote last year. This paper was originally scheduled to appear in Methods in Molecular Biology in 2020 but its publication date may be delayed.
Citation
If you use this work, please cite:
Kirk, J. M., Kim, S. O., Inoue, K., Smola, M. J., Lee, D. M., Schertzer, M. D., … Calabrese, J. M. (2018). Functional classification of long non-coding RNAs by k -mer content. Nature Genetics, 50(10), 1474–1482. https://doi.org/10.1038/s41588-018-0207-8
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file seekr-2.0.2.tar.gz
.
File metadata
- Download URL: seekr-2.0.2.tar.gz
- Upload date:
- Size: 6.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.9.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 46bd4a325827b908f573cf28bae610962c74abca874768cc47cd1c23750567e4 |
|
MD5 | d5538dfa08969b2160c6133b625b51c0 |
|
BLAKE2b-256 | ad4a7c1ac248e9a565e4226080f1e283c6d17df33faf3194618d23533a877ea4 |
File details
Details for the file seekr-2.0.2-py3-none-any.whl
.
File metadata
- Download URL: seekr-2.0.2-py3-none-any.whl
- Upload date:
- Size: 6.3 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.9.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0f586ce8212b6b06ba45c32fc9e0c31cf3acd2f06d683446c0a9de061c8a2840 |
|
MD5 | ef98b2e1f9a0b9d746d4e04bc3c65add |
|
BLAKE2b-256 | 2f132ce102a3b56285c6389d6d6e4888d77256323e8e5d37b390e843b265a7a5 |