roux : Helper functions

Installation

pip install roux

Examples


Import commonly used helper functions, e.g. for use in Jupyter notebooks.


Usage

# import helper functions
from roux.global_imports import *

Documentation

roux.global_imports

For details on roux.global_imports such as which helper functions are imported, see detailed_roux_global_imports.ipynb.


Helper attributes of pandas dataframes.

# import helper functions
from roux.global_imports import *
INFO:root:pandarallel.initialize(nb_workers=4,progress_bar=True)
WARNING:root:not found: metadata


INFO: Pandarallel will run on 6 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.

Helper functions for basic data validations

## demo data
import seaborn as sns
df1=sns.load_dataset('iris')
# .rd (roux data) attributes
## validate no missing values in the table
assert df1.rd.validate_no_na()
## validate no duplicates in the table
df1.rd.validate_no_dups()
WARNING:root:duplicate rows found

False
Helper functions for checking duplicates in a table

df1.rd.check_dups()
INFO:root:duplicate rows: 1% (2/150)
sepal_length sepal_width petal_length petal_width species
101 5.8 2.7 5.1 1.9 virginica
142 5.8 2.7 5.1 1.9 virginica

Helper functions for logging changes in the dataframe shapes

_=df1.log.drop_duplicates()
INFO:root:drop_duplicates: shape changed: (150, 5)->(149, 5), width constant
## within pipes
_=(df1
   .log.drop_duplicates()
   .log.check_nunique(groupby='species',subset='sepal_length')
  )
INFO:root:drop_duplicates: shape changed: (150, 5)->(149, 5), width constant
INFO:root:unique {'groupby': 'species', 'subset': 'sepal_length'}
INFO:root:species
setosa        15
versicolor    21
virginica     21
Name: sepal_length, dtype: int64

Helper functions to filter dataframe using a dictionary

_=df1.rd.filter_rows({'species':'setosa'})
INFO:root:(150, 5)
INFO:root:(50, 5)

Helper functions to merge tables while validating the changes in shapes

df2=df1.groupby('species').head(1)
df1.log.merge(right=df2,
              how='inner',
              on='species',
              validate='m:1',
             validate_equal_length=True,
             # validate_no_decrease_length=True,
             ).head(1)
INFO:root:merge: shape changed: (150, 5)->(150, 9), length constant
sepal_length_x sepal_width_x petal_length_x petal_width_x species sepal_length_y sepal_width_y petal_length_y petal_width_y
0 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2

Documentation

roux.lib.df roux.lib.dfs


Helper functions for input/output.


Saving and reading dictionaries

Unifies input/output functions for .yaml, .json, .joblib and .pickle files.

d={'a':1,'b':2,'c':3}
d
{'a': 1, 'b': 2, 'c': 3}
from roux.lib.io import to_dict
to_dict(d,'tests/output/data/dict.json')
'data/dict.json'
from roux.lib.io import read_dict
read_dict('tests/output/data/dict.json')
{'a': 1, 'b': 2, 'c': 3}

Saving and reading tables

Unifies several of pandas's input/output functions.

# demo data
import seaborn as sns
df1=sns.load_dataset('iris')
from roux.lib.io import to_table
to_table(df1,'tests/output/data/table.tsv')
'data/table.tsv'
from roux.lib.io import read_table
read_table('tests/output/data/table.tsv')
WARNING:root:dropped columns: Unnamed: 0
INFO:root:shape = (150, 5)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

Documentation

roux.viz.io


Helper functions applicable to strings.

# import helper functions
from roux.lib.str import encode,decode

Encoding and decoding data

Reversible

# example data
parameters=dict(
    colindex='drug id',
    colsample='sample id',
    coly='auc',
    formulas={
            f"auc ~ C(sample_subset, Treatment(reference='ref')) + primary_or_metastasis": 'mixedlm',
        },
    kws_get_stats_regression=dict(
            groups='screen_id',
        ),
    colsubset='sample subset',
    variable="C(sample_subset, Treatment(reference='ref'))[T.test]",
)
## encoding
encoded=encode(parameters)
print(encoded)
eNqVj00KwjAQRq8Ssqli8QCCK6_gTiSk7WcJNkmZSbRF9OwmjYtuhSwm7_HNz0u2fjCuwyQPQnYUe2E6WYuMWdtxQOalWpnYMMLK_ECxcxY6tvl782TjoDmhV2biI06bElIlVIszQQcLFzaEGwiuxbFKZbXdip0YyVhNs_KkLILm9ExuJ62Z0A1WvtOY-5NVj6CSDawIPYHZeLeM7cnHcYlwS4BT6Y4cemgyuikX_rPU5bwP4HCV7y_fP20r
## decoding
decoded=decode(encoded,out='dict')
print(decoded)
{'colindex': 'drug id', 'colsample': 'sample id', 'colsubset': 'sample subset', 'coly': 'auc', 'formulas': {"auc ~ C(sample_subset, Treatment(reference='ref')) + primary_or_metastasis": 'mixedlm'}, 'kws_get_stats_regression': {'groups': 'screen_id'}, 'variable': "C(sample_subset, Treatment(reference='ref'))[T.test]"}
## test reversibility
assert parameters==decoded

Non-reversible

## clear variables
%reset_selective -f "encoded.*"
## encoding
encoded=encode(parameters,short=True)
print(encoded)
e11fafe6bf21d3db843f8a0e4cea21bc600832b3ed738d2b09ee644ce8008e44
## dictionary shuffled
from random import sample
parameters_shuffled={k:parameters[k] for k in sample(list(parameters.keys()), len(parameters))}
## encoding dictionary shuffled
encoded_shuffled=encode(parameters_shuffled,short=True)
print(encoded_shuffled)
e11fafe6bf21d3db843f8a0e4cea21bc600832b3ed738d2b09ee644ce8008e44
## test equality
assert encoded==encoded_shuffled
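The short encoding is insensitive to key order because the dictionary is serialised in a canonical (sorted) form before hashing. roux's exact implementation is not shown here; the following is a minimal stdlib sketch of the idea (the function name `encode_short` is illustrative, not roux's API):

```python
import hashlib
import json

def encode_short(d: dict) -> str:
    # Serialise with sorted keys so logically equal dicts map to the same string
    canonical = json.dumps(d, sort_keys=True)
    # Hash the canonical form to get a fixed-length, order-insensitive identifier
    return hashlib.sha256(canonical.encode()).hexdigest()

a = {'x': 1, 'y': 2}
b = {'y': 2, 'x': 1}  # same content, different insertion order
assert encode_short(a) == encode_short(b)
```

Because the hash discards the original data, this form of encoding is non-reversible, unlike the compressed encoding above.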

Documentation

roux.lib.str


Helper functions applicable to file-system.


Encoding and decoding data

# example data
parameters=dict(
    colindex='drug id',
    colsample='sample id',
    coly='auc',
    formulas={
            f"auc ~ C(sample_subset, Treatment(reference='ref')) + primary_or_metastasis": 'mixedlm',
        },
    kws_get_stats_regression=dict(
            groups='screen_id',
        ),
    colsubset='sample subset',
    variable="C(sample_subset, Treatment(reference='ref'))[T.test]",
)
## simulate modifications in dictionaries by subsetting for example
from random import sample
inputs=[]
for n in range(2,len(parameters),1):
    inputs.append({k:parameters[k] for k in list(parameters.keys())[:n]})
# import helper functions
from roux.lib.sys import to_output_paths
output_paths=to_output_paths(
    input_paths=['tests/input/1/output.tsv','tests/input/2/output.tsv'],
    replaces_output_path={'input':'output'},
    inputs=inputs,
    output_path='tests/output/{KEY}/output.tsv',
    )
output_paths
{'tests/output/2/output.tsv': 'tests/input/2/output.tsv',
 'tests/output/eae9949969282a98c356fe9f0e6d6aa9025eccc46d914226e3b36280d340e4fb/output.tsv': {'colindex': 'drug id',
  'colsample': 'sample id',
  'coly': 'auc'},
 'tests/output/f501f7c1674e142f9632f46cc921179f62f38a94c8cc59e0e5c15a2944689eb7/output.tsv': {'colindex': 'drug id',
  'colsample': 'sample id',
  'coly': 'auc',
  'formulas': {"auc ~ C(sample_subset, Treatment(reference='ref')) + primary_or_metastasis": 'mixedlm'}},
 'tests/output/28d279121c3b1e82d60de99397e4d0f90f4f096127d06fed0c1d0422034a8967/output.tsv': {'colindex': 'drug id',
  'colsample': 'sample id',
  'coly': 'auc',
  'formulas': {"auc ~ C(sample_subset, Treatment(reference='ref')) + primary_or_metastasis": 'mixedlm'},
  'kws_get_stats_regression': {'groups': 'screen_id'}},
 'tests/output/3b765e060a1cc9b06ec9e3010a797a7f6c77479488e1bdde90565bfa912fa3ea/output.tsv': {'colindex': 'drug id',
  'colsample': 'sample id',
  'coly': 'auc',
  'formulas': {"auc ~ C(sample_subset, Treatment(reference='ref')) + primary_or_metastasis": 'mixedlm'},
  'kws_get_stats_regression': {'groups': 'screen_id'},
  'colsubset': 'sample subset'},
 'tests/output/1/output.tsv': 'tests/input/1/output.tsv',
 'tests/output/032180568f7e29dd4a3042301f150f0988f968c2d093fa8a79a669ceee8359b6/output.tsv': {'colindex': 'drug id',
  'colsample': 'sample id'}}
len(output_paths)
7

Documentation

roux.lib.sys


Helper functions for querying data from Biomart


Requirements

# installing the required roux subpackage
!pip install roux[query]

A wrapper around pybiomart.

from roux.query.biomart import query
df01=query(
      species='homo sapiens',
      release=100,
      attributes=['ensembl_gene_id','entrezgene_id',
                  'percentage_gene_gc_content','hgnc_symbol','transcript_count','transcript_length'],
      filters={'biotype':['protein_coding'],},
)
INFO:root:hsapiens_gene_ensembl version: 100 is used
df01.head(1)
Gene stable ID NCBI gene (formerly Entrezgene) ID HGNC symbol Gene % GC content Transcript count Transcript length (including UTRs and CDS)
0 ENSG00000198888 4535.0 MT-ND1 47.7 1 956
from roux.lib.io import to_table
to_table(df01,'tests/output/data/biomart/00_raw.tsv')
'tests/output/data/biomart/00_raw.tsv'

Documentation

roux.query.biomart


Helper functions for clustering.


Requirements

# installing the required roux subpackage
!pip install roux[stat]
from roux.lib.io import read_table,to_table
## reading a table generated using the roux_query.ipynb notebook
df01=read_table('tests/output/data/biomart/00_raw.tsv')
WARNING:root:dropped columns: Unnamed: 0
INFO:root:shape = (167181, 5)
from roux.lib.io import *
df1=df01.log.drop_duplicates(subset=['Gene stable ID','Gene % GC content'])
INFO:root:drop_duplicates: shape changed: (167181, 5)->(22802, 5), width constant
df1.head(1)
Gene stable ID HGNC symbol Gene % GC content Transcript count Transcript length (including UTRs and CDS)
0 ENSG00000198888 MT-ND1 47.7 1 956
to_table(df1,'tests/output/data/biomart/01_dedup.tsv')
'data/biomart/01_dedup.tsv'

Documentation

roux.lib.io

Fitting a Gaussian-Mixture Model

from roux.lib.io import read_table
df1=read_table('tests/output/data/biomart/01_dedup.tsv')
WARNING:root:dropped columns: Unnamed: 0
INFO:root:shape = (22802, 5)
from os.path import exists
import matplotlib.pyplot as plt
from roux.stat.cluster import cluster_1d
from roux.viz.io import to_plot
d1=cluster_1d(
    ds=df1['Gene % GC content'].copy(),
    n_clusters=2,
    clf_type='gmm',
    random_state=88,
    returns=['coff','mix_pdf','two_pdfs','weights'],
    ax=None,
    bins=60,
    test=True,
)
ax=plt.gca()
ax.set(xlabel='Gene % GC content',ylabel='density')
to_plot('plot/hist_gmm.png')
assert exists('tests/output/plot/hist_gmm.png')
INFO:root:intersections [46.95]
WARNING:root:overwritting: plot/hist_gmm.png

Documentation

roux.stat.cluster


Helper functions for annotating visualisations.

# installing the required roux subpackage
!pip install roux[viz]

Example of annotated scatter plot

# demo data
import seaborn as sns
df1=sns.load_dataset('iris')
# plot
from roux.viz.scatter import plot_scatter
ax=plot_scatter(df1,colx='sepal_length',coly='petal_width')
from roux.viz.annot import annot_side
ax=annot_side(ax=ax,
           df1=df1.sample(5),
           colx='sepal_length',coly='petal_width',cols='species',length_axhline=1.3)
ax=annot_side(ax=ax,
           df1=df1.sort_values('petal_width',ascending=False).head(5),
           colx='sepal_length',coly='petal_width',cols='species',length_axhline=1,
           loc='top',)
from roux.viz.io import to_plot
_=to_plot('tests/output/plot/scatter_annotated.png')
WARNING:root:overwritting: plot/scatter_annotated.png

Documentation

roux.viz.annot roux.viz.scatter

Example of annotated histogram

# demo data
import seaborn as sns
df1=sns.load_dataset('iris')

# plot
from roux.viz.dist import hist_annot
ax=hist_annot(df1,colx='sepal_length',colssubsets=['species'],bins=10,
          params_scatter=dict(marker='|',alpha=1))
from roux.viz.io import to_plot
_=to_plot('tests/output/plot/hist_annotated.png')

Documentation

roux.viz.dist

Example of annotated heatmap

# demo data
import seaborn as sns
df1=sns.load_dataset('iris')
df1=(df1
    .set_index('species')
    .melt(ignore_index=False)
    .reset_index()
    .pivot_table(index='variable',columns='species',values='value',aggfunc='mean'))

# plot
ax=sns.heatmap(df1,
            cmap='Blues',
           cbar_kws=dict(label='mean value'))
from roux.viz.annot import show_box
ax=show_box(ax=ax,xy=[1,2],width=2,height=1,ec='red',lw=2)
from roux.viz.io import to_plot
_=to_plot('tests/output/plot/heatmap_annotated.png')

Documentation

roux.viz.heatmap

Example of annotated distributions

# demo data
import seaborn as sns
df1=sns.load_dataset('iris')
df1=df1.loc[df1['species'].isin(['setosa','virginica']),:]
df1['id']=range(len(df1))

# plot
from roux.viz.dist import plot_dists
ax=plot_dists(df1,x='sepal_length',y='species',colindex='id',kind=['box','strip'])
from roux.viz.io import to_plot
_=to_plot('tests/output/plot/dists_annotated.png')

Documentation

roux.viz.dist

Example of annotated barplot

# demo data
import seaborn as sns
df1=sns.load_dataset('iris')

# plot
from roux.viz.bar import plot_barh
ax=plot_barh(df1.sort_values('sepal_length',ascending=False).head(5),
          colx='sepal_length',coly='species',colannnotside='sepal_length')
from roux.viz.io import to_plot
_=to_plot('tests/output/plot/bar_annotated.png')

Documentation

roux.viz.bar


Helper functions for the input/output of visualizations.


Saving plots with the source data

# demo data
import seaborn as sns
df1=sns.load_dataset('iris')
# import helper functions
from roux.viz.io import *

## parameters
kws_plot=dict(y='sepal_width')
## log the code from this cell of the notebook
log_code()
# plot
fig,ax=plt.subplots(figsize=[3,3])
sns.scatterplot(data=df1,x='sepal_length',y=kws_plot['y'],hue='species',
                ax=ax,)
## save the plot
to_plot('tests/output/plot/plot_saved.png',# filename
       df1=df1, #source data
       kws_plot=kws_plot,# plotting parameters
       )
assert exists('tests/output/plot/plot_saved.png')
Activating auto-logging. Current session state plus future input saved.
Filename       : log_notebook.log
Mode           : over
Output logging : False
Raw input log  : False
Timestamping   : False
State          : active

"Reading" the plots

read_plot('tests/output/plot/plot_saved.png')
INFO:root:Python implementation: CPython
Python version       : 3.7.13
IPython version      : 7.31.1
numpy     : 1.18.1
sys       : 3.7.13 (default, Mar 29 2022, 02:18:16) 
[GCC 7.5.0]
seaborn   : 0.11.2
tqdm      : 4.64.1
json      : 2.0.9
matplotlib: 3.5.1
re        : 2.2.1
logging   : 0.5.1.2
scipy     : 1.7.3
yaml      : 6.0
pandas    : 1.3.5

INFO:root:shape = (150, 5)

INFO: Pandarallel will run on 6 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

<AxesSubplot:xlabel='sepal_length', ylabel='sepal_width'>

"Reading" the plot and modifying its parameters

read_plot('tests/output/plot/plot_saved.png',kws_plot=dict(y='petal_length'),title='modified')
INFO:root:shape = (150, 5)

<AxesSubplot:title={'center':'modified'}, xlabel='sepal_length', ylabel='petal_length'>

Documentation

roux.viz.io


Helper functions for line plots.


Step plot

Demo data

import pandas as pd
data=pd.DataFrame({
    'sample size':[100,30,60,50,30,20,25],
    })
data['step name']='step#'+(data.index+1).astype(str)
data
sample size step name
0 100 step#1
1 30 step#2
2 60 step#3
3 50 step#4
4 30 step#5
5 20 step#6
6 25 step#7

Plot

from roux.viz.line import plot_steps
ax=plot_steps(
    data,
    col_step_name='step name',
    col_step_size='sample size',
    )

Documentation

roux.viz.line


Helper functions for scatter plots.


Volcano plot

Demo data

import pandas as pd
data = pd.read_csv('https://git.io/volcano_data1.csv')
data['P']=data['P'].replace(data['P'].min(),0) # to show P=0 as a triangle 
data=pd.concat([
    data.query(expr='P>=0.001'),
    data.query(expr='P<0.001').assign(
        **{'categories':lambda df: pd.qcut(df['BP'],3, labels=['low','med','high'])}, # to annotate
        ),
    ],axis=0)
data.head(1)
CHR BP P SNP ZSCORE EFFECTSIZE GENE DISTANCE categories
0 1 937641 0.335344 rs9697358 0.9634 -0.0946 ISG15 1068 NaN

Plot

from roux.viz.scatter import plot_volcano
ax=plot_volcano(
    data,
    colx='EFFECTSIZE',
    coly='P',
    colindex='SNP',
    show_labels=3, # show top n 
    collabel='SNP',
    text_increase='n',
    text_decrease='n',
    # palette=sns.color_palette()[:3], # increase, decrease, ns
    )
WARNING:root:transforming the coly ("P") values.
WARNING:root:zeros found, replaced with min 0.001
/mnt/d/Documents/code/roux/roux/stat/transform.py:67: RuntimeWarning: divide by zero encountered in log10
  return -1*(np.log10(x))

With highlighted points

import seaborn as sns # required to set the palette of the outlines
ax=plot_volcano(
    data=data.query(expr="P<0.05"),
    colx='EFFECTSIZE',
    coly='P',
    colindex='SNP',
    # show_labels=3, # show top n 
    # collabel='SNP',
    show_outlines='categories',
    outline_colors=sns.color_palette()[:3],
    text_increase='n',
    text_decrease='n',
    palette=sns.color_palette('pastel')[:3], # increase, decrease, ns
    legend=True,
    )
WARNING:root:transforming the coly ("P") values.
WARNING:root:zeros found, replaced with min 0.001
/mnt/d/Documents/code/roux/roux/stat/transform.py:67: RuntimeWarning: divide by zero encountered in log10
  return -1*(np.log10(x))

Documentation

roux.viz.scatter

API


module roux.global_imports

For use in Jupyter notebooks, for example.

module roux.lib.df

For processing individual pandas DataFrames/Series


function get_name

get_name(df1: DataFrame, cols: list = None, coff: float = 2, out=None)

Gets the name of the dataframe.

Especially useful within a groupby + pandarallel context.

Parameters:

  • df1 (DataFrame): input dataframe.
  • cols (list): list groupby columns.
  • coff (int): cutoff of unique values to infer the name.
  • out (str): format of the output (list|not).

Returns:

  • name (tuple|str|list): name of the dataframe.

function get_groupby_columns

get_groupby_columns(df_)

Get the columns supplied to groupby.

Parameters:

  • df_ (DataFrame): input dataframe.

Returns:

  • columns (list): list of columns.

function get_constants

get_constants(df1)

Get the columns with a single unique value.

Parameters:

  • df1 (DataFrame): input dataframe.

Returns:

  • columns (list): list of columns.
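
Constant columns can be found with plain pandas by checking `nunique`; a minimal sketch of the same idea (illustrative, not roux's actual code):

```python
import pandas as pd

def get_constants_sketch(df: pd.DataFrame) -> list:
    # A column is "constant" if it holds a single unique value
    return [c for c in df.columns if df[c].nunique(dropna=False) == 1]

df = pd.DataFrame({'a': [1, 1, 1], 'b': [1, 2, 3]})
constants = get_constants_sketch(df)  # ['a']
```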

function drop_unnamedcol

drop_unnamedcol(df)

Deletes the columns with "Unnamed" prefix.

Parameters:

  • df (DataFrame): input dataframe.

Returns:

  • df (DataFrame): output dataframe.

function drop_levelcol

drop_levelcol(df)

Deletes the potentially temporary columns with the "level" prefix.

Parameters:

  • df (DataFrame): input dataframe.

Returns:

  • df (DataFrame): output dataframe.

function drop_constants

drop_constants(df)

Deletes columns with a single unique value.

Parameters:

  • df (DataFrame): input dataframe.

Returns:

  • df (DataFrame): output dataframe.

function dropby_patterns

dropby_patterns(df1, patterns=None, strict=False, test=False)

Deletes columns whose names contain the given substrings (patterns).

Parameters:

  • df1 (DataFrame): input dataframe.
  • patterns (list): list of substrings.
  • test (bool): verbose.

Returns:

  • df1 (DataFrame): output dataframe.

function flatten_columns

flatten_columns(df: DataFrame, sep: str = ' ', **kws) -> DataFrame

Multi-index columns to single-level.

Parameters:

  • df (DataFrame): input dataframe.
  • sep (str): separator within the joined tuples (' ').

Returns:

  • df (DataFrame): output dataframe.

Keyword Arguments:

  • kws (dict): parameters provided to coltuples2str function.
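
The flattening can be reproduced with plain pandas by joining each column tuple; a minimal sketch (not roux's actual code):

```python
import pandas as pd

def flatten_columns_sketch(df: pd.DataFrame, sep: str = ' ') -> pd.DataFrame:
    df = df.copy()
    # Join each column tuple into a single string, e.g. ('a', 'mean') -> 'a mean'
    df.columns = [sep.join(map(str, t)).strip() for t in df.columns]
    return df

df = pd.DataFrame(
    [[1, 2]],
    columns=pd.MultiIndex.from_tuples([('a', 'mean'), ('a', 'std')]),
)
flat = flatten_columns_sketch(df)  # columns: 'a mean', 'a std'
```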

function lower_columns

lower_columns(df)

Column names of the dataframe to lower-case letters.

Parameters:

  • df (DataFrame): input dataframe.

Returns:

  • df (DataFrame): output dataframe.

function renameby_replace

renameby_replace(
    df: DataFrame,
    replaces: dict,
    ignore: bool = True,
    **kws
) -> DataFrame

Rename columns by replacing sub-strings.

Parameters:

  • df (DataFrame): input dataframe.
  • replaces (dict|list): from->to format or list containing substrings to remove.
  • ignore (bool): if True, not validate the successful replacements.

Returns:

  • df (DataFrame): output dataframe.

Keyword Arguments:

  • kws (dict): parameters provided to replacemany function.
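
The from->to behaviour amounts to applying substring replacements to every column name; a minimal sketch of that logic (illustrative, not roux's implementation):

```python
import pandas as pd

def renameby_replace_sketch(df: pd.DataFrame, replaces: dict) -> pd.DataFrame:
    cols = []
    for c in df.columns:
        # Apply each from->to substring replacement in turn
        for old, new in replaces.items():
            c = c.replace(old, new)
        cols.append(c)
    df = df.copy()
    df.columns = cols
    return df

df = pd.DataFrame(columns=['sepal length', 'petal length'])
out = renameby_replace_sketch(df, {' length': '_length'})
```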

function clean_columns

clean_columns(df: DataFrame) -> DataFrame

Standardise columns.

Steps: 1. Strip flanking white-spaces. 2. Lower-case letters.

Parameters:

  • df (DataFrame): input dataframe.

Returns:

  • df (DataFrame): output dataframe.

function clean

clean(
    df: DataFrame,
    cols: list = [],
    drop_constants: bool = False,
    drop_unnamed: bool = True,
    verb: bool = False
) -> DataFrame

Deletes potentially temporary columns.

Steps: 1. Strip flanking white-spaces. 2. Lower-case letters.

Parameters:

  • df (DataFrame): input dataframe.
  • drop_constants (bool): whether to delete the columns with a single unique value.
  • drop_unnamed (bool): whether to delete the columns with 'Unnamed' prefix.
  • verb (bool): verbose.

Returns:

  • df (DataFrame): output dataframe.

function compress

compress(df1, coff_categories=20, test=False)

Compress the dataframe by converting columns containing strings/objects to categorical.

Parameters:

  • df1 (DataFrame): input dataframe.
  • coff_categories (int): if the number of unique values is less than the cutoff, the column is converted to categorical.
  • test (bool): verbose.

Returns:

  • df1 (DataFrame): output dataframe.
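
The compression relies on pandas categoricals, which store low-cardinality string columns far more compactly; a minimal sketch of the approach (illustrative, not roux's actual code):

```python
import pandas as pd

def compress_sketch(df: pd.DataFrame, coff_categories: int = 20) -> pd.DataFrame:
    df = df.copy()
    for c in df.select_dtypes('object').columns:
        # Convert low-cardinality string columns to memory-efficient categoricals
        if df[c].nunique() < coff_categories:
            df[c] = df[c].astype('category')
    return df

df = pd.DataFrame({
    'species': ['a', 'b', 'a'] * 10,          # 3 unique values -> categorical
    'note': [str(i) for i in range(30)],       # 30 unique values -> left as object
})
out = compress_sketch(df, coff_categories=5)
```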

function clean_compress

clean_compress(df, kws_compress={}, **kws_clean)

Clean and compress the dataframe.

Parameters:

  • df (DataFrame): input dataframe.
  • kws_compress (int): keyword arguments for the compress function.
  • test (bool): verbose.

Keyword Arguments:

  • kws_clean (dict): parameters provided to clean function.

Returns:

  • df1 (DataFrame): output dataframe.

See Also: clean compress


function check_na

check_na(df, subset=None, perc=False)

Number/percentage of missing values in columns.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • perc (bool): output percentages.

Returns:

  • ds (Series): output stats.
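
Counting missing values per column is a one-liner in pandas; a minimal sketch of the count/percentage behaviour (illustrative, not roux's actual code):

```python
import numpy as np
import pandas as pd

def check_na_sketch(df: pd.DataFrame, subset=None, perc: bool = False) -> pd.Series:
    cols = subset if subset is not None else df.columns
    ds = df[cols].isna().sum()
    # Optionally report percentages instead of counts
    return ds * 100 / len(df) if perc else ds

df = pd.DataFrame({'a': [1, np.nan], 'b': [1, 2]})
counts = check_na_sketch(df)
```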

function validate_no_na

validate_no_na(df, subset=None)

Validate no missing values in columns.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • perc (bool): output percentages.

Returns:

  • ds (Series): output stats.

function assert_no_na

assert_no_na(df, subset=None)

Assert that there are no missing values in the columns.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • perc (bool): output percentages.

Returns:

  • ds (Series): output stats.

function check_nunique

check_nunique(
    df: DataFrame,
    subset: list = None,
    groupby: str = None,
    perc: bool = False
) -> Series

Number/percentage of unique values in columns.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • perc (bool): output percentages.

Returns:

  • ds (Series): output stats.

function check_inflation

check_inflation(df1, subset=None)

Occurrences of values in columns.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.

Returns:

  • ds (Series): output stats.

function check_dups

check_dups(df, subset=None, perc=False)

Check duplicates.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • perc (bool): output percentages.

Returns:

  • ds (Series): output stats.
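
The underlying check maps to `DataFrame.duplicated`; a minimal sketch that reports every member of each duplicated group (illustrative, not roux's actual code):

```python
import pandas as pd

def check_dups_sketch(df: pd.DataFrame, subset=None) -> pd.DataFrame:
    # keep=False marks all members of a duplicated group, matching a "show all dups" report
    return df[df.duplicated(subset=subset, keep=False)]

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
dups = check_dups_sketch(df)  # both copies of the (1, 'x') row
```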

function check_duplicated

check_duplicated(df, subset=None, perc=False)

Check duplicates (alias of check_dups)


function validate_no_dups

validate_no_dups(df, subset=None)

Validate that there are no duplicates.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.

function validate_no_duplicates

validate_no_duplicates(df, subset=None)

Validate that there are no duplicates (alias of validate_no_dups).


function assert_no_dups

assert_no_dups(df, subset=None)

Assert that there are no duplicates.


function validate_dense

validate_dense(
    df01: DataFrame,
    subset: list = None,
    duplicates: bool = True,
    na: bool = True,
    message=None
) -> DataFrame

Validate no missing values and no duplicates in the dataframe.

Parameters:

  • df01 (DataFrame): input dataframe.
  • subset (list): list of columns.
  • duplicates (bool): whether to check duplicates.
  • na (bool): whether to check na.
  • message (str): error message

function assert_dense

assert_dense(
    df01: DataFrame,
    subset: list = None,
    duplicates: bool = True,
    na: bool = True,
    message=None
) -> DataFrame

Alias of validate_dense.

Notes:

to be deprecated in future releases.


function classify_mappings

classify_mappings(
    df1: DataFrame,
    col1: str,
    col2: str,
    clean: bool = False
) -> DataFrame

Classify mappings between items in two columns.

Parameters:

  • df1 (DataFrame): input dataframe.
  • col1 (str): name of column #1.
  • col2 (str): name of column #2.
  • clean (bool): drop the columns containing the counts.

Returns:

  • (pd.DataFrame): output.

function check_mappings

check_mappings(
    df: DataFrame,
    subset: list = None,
    out: str = 'full'
) -> DataFrame

Mapping between items in two columns.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • out (str): format of the output.

Returns:

  • ds (Series): output stats.

function validate_1_1_mappings

validate_1_1_mappings(df: DataFrame, subset: list = None) -> DataFrame

Validate that the mapping between items in two columns is 1:1.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • out (str): format of the output.

function get_mappings

get_mappings(
    df1: DataFrame,
    subset=None,
    keep='1:1',
    clean=False,
    cols=None
) -> DataFrame

Get the mappings between items in two columns.

Parameters:

  • df1 (DataFrame): input dataframe.
  • subset (list): list of columns.
  • keep (str): type of mapping (1:1|1:m|m:1).
  • clean (bool): whether remove temporary columns.
  • cols (list): alias of subset.

Returns:

  • df (DataFrame): output dataframe.

function groupby_filter_fast

groupby_filter_fast(
    df1: DataFrame,
    col_groupby,
    fun_agg,
    expr,
    col_agg: str = 'temporary',
    **kws_query
) -> DataFrame

Groupby and filter fast.

Parameters:

  • df1 (DataFrame): input dataframe.
  • col_groupby (str|list): column name/s to group by.
  • fun_agg (object): aggregation function used for filtering.
  • expr (str): query expression applied to the aggregated values.
  • col_agg (str): name of the temporary aggregation column ('temporary').

Returns:

  • df1 (DataFrame): output dataframe.

Todo: Deprecation if pandas.core.groupby.DataFrameGroupBy.filter is faster.


function to_map_binary

to_map_binary(df: DataFrame, colgroupby=None, colvalue=None) -> DataFrame

Convert linear mappings to a binary map

Parameters:

  • df (DataFrame): input dataframe.
  • colgroupby (str): name of the column for groupby.
  • colvalue (str): name of the column containing values.

Returns:

  • df1 (DataFrame): output dataframe.
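
A linear value-to-group mapping can be turned into a binary map with `pd.crosstab`; a minimal sketch of the idea (illustrative, not roux's actual code):

```python
import pandas as pd

def to_map_binary_sketch(df: pd.DataFrame, colgroupby: str, colvalue: str) -> pd.DataFrame:
    # Each cell is True if that (value, group) pair occurs in the linear mapping
    return pd.crosstab(df[colvalue], df[colgroupby]) > 0

df = pd.DataFrame({'gene': ['g1', 'g1', 'g2'], 'pathway': ['p1', 'p2', 'p1']})
m = to_map_binary_sketch(df, colgroupby='pathway', colvalue='gene')
```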

function check_intersections

check_intersections(
    df: DataFrame,
    colindex=None,
    colgroupby=None,
    plot=False,
    **kws_plot
) -> DataFrame

Check intersections. The linear dataframe is converted to a binary map and then to a series using groupby.

Parameters:

  • df (DataFrame): input dataframe.
  • colindex (str): name of the index column.
  • colgroupby (str): name of the groupby column.
  • plot (bool): plot or not.

Returns:

  • ds1 (Series): output Series.

Keyword Arguments:

  • kws_plot (dict): parameters provided to the plotting function.

function get_totals

get_totals(ds1)

Get totals from the output of check_intersections.

Parameters:

  • ds1 (Series): input Series.

Returns:

  • d (dict): output dictionary.

function filter_rows

filter_rows(
    df,
    d,
    sign='==',
    logic='and',
    drop_constants=False,
    test=False,
    verbose=True
)

Filter rows using a dictionary.

Parameters:

  • df (DataFrame): input dataframe.
  • d (dict): dictionary.
  • sign (str): condition within mappings ('==').
  • logic (str): condition between mappings ('and').
  • drop_constants (bool): to drop the columns with single unique value (False).
  • test (bool): testing (False).
  • verbose (bool): more verbose (True).

Returns:

  • df (DataFrame): output dataframe.
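
The dictionary-based filtering combines one boolean mask per key; a minimal sketch of the '==' / 'and'-'or' behaviour (illustrative, not roux's actual code):

```python
import pandas as pd

def filter_rows_sketch(df: pd.DataFrame, d: dict, logic: str = 'and') -> pd.DataFrame:
    # One mask per key: equality for scalars, membership for lists
    masks = [
        df[k].isin(v) if isinstance(v, list) else df[k].eq(v)
        for k, v in d.items()
    ]
    combined = masks[0]
    for m in masks[1:]:
        combined = (combined & m) if logic == 'and' else (combined | m)
    return df[combined]

df = pd.DataFrame({'species': ['setosa', 'virginica'], 'n': [1, 2]})
out = filter_rows_sketch(df, {'species': 'setosa'})
```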

function get_bools

get_bools(df, cols, drop=False)

Columns to bools. One-hot-encoder (get_dummies).

Parameters:

  • df (DataFrame): input dataframe.
  • cols (list): columns to encode.
  • drop (bool): drop the cols (False).

Returns:

  • df (DataFrame): output dataframe.
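
The one-hot encoding itself is what `pd.get_dummies` provides; a minimal example of the underlying operation (illustrative of the concept, not roux's wrapper):

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2], 'species': ['a', 'b', 'a']})
# One-hot encode the 'species' column alongside the remaining columns
bools = pd.get_dummies(df, columns=['species'])
```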

function agg_bools

agg_bools(df1, cols)

Bools to columns. Reverse of one-hot encoder (get_dummies).

Parameters:

  • df1 (DataFrame): input dataframe.
  • cols (list): columns.

Returns:

  • ds (Series): output series.
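
Reversing a one-hot encoding amounts to picking, per row, the column that holds True; a minimal sketch using `idxmax` (illustrative, not roux's actual code):

```python
import pandas as pd

def agg_bools_sketch(df: pd.DataFrame, cols: list) -> pd.Series:
    # idxmax along axis=1 returns the first column holding True in each row;
    # assumes exactly one True per row, as produced by a one-hot encoder
    return df[cols].idxmax(axis=1)

df = pd.DataFrame({'a': [True, False], 'b': [False, True]})
ds = agg_bools_sketch(df, ['a', 'b'])
```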

function melt_paired

melt_paired(
    df: DataFrame,
    cols_index: list = None,
    suffixes: list = None,
    cols_value: list = None
) -> DataFrame

Melt a paired dataframe.

Parameters:

  • df (DataFrame): input dataframe.
  • cols_index (list): paired index columns (None).
  • suffixes (list): paired suffixes (None).
  • cols_value (list): names of the columns containing the values (None).

Notes:

Partial melt melts selected columns cols_value.

Examples: Paired parameters: cols_value=['value1','value2'], suffixes=['gene1','gene2'].


function get_chunks

get_chunks(
    df1: DataFrame,
    colindex: str,
    colvalue: str,
    bins: int = None,
    value: str = 'right'
) -> DataFrame

Get chunks of a dataframe.

Parameters:

  • df1 (DataFrame): input dataframe.
  • colindex (str): name of the index column.
  • colvalue (str): name of the column containing values [0-100]
  • bins (int): number of bins.
  • value (str): value to use as the name of the chunk ('right').

Returns:

  • ds (Series): output series.

function get_group

get_group(groups, i: int = None, verbose: bool = True) -> DataFrame

Get a dataframe for a group out of the groupby object.

Parameters:

  • groups (object): groupby object.
  • i (int): index of the group (None).
  • verbose (bool): verbose (True).

Returns:

  • df (DataFrame): output dataframe.

Notes:

Useful for testing groupby.


function infer_index

infer_index(
    data: DataFrame,
    cols_drop=[],
    include=<class 'object'>,
    exclude=None
) → list

Infer the index (id) of the table.


function to_multiindex_columns

to_multiindex_columns(df, suffixes, test=False)

Single level columns to multiindex.

Parameters:

  • df (DataFrame): input dataframe.
  • suffixes (list): list of suffixes.
  • test (bool): verbose (False).

Returns:

  • df (DataFrame): output dataframe.

function to_ranges

to_ranges(df1, colindex, colbool, sort=True)

Ranges from boolean columns.

Parameters:

  • df1 (DataFrame): input dataframe.
  • colindex (str): column containing index items.
  • colbool (str): column containing boolean values.
  • sort (bool): sort the dataframe (True).

Returns:

  • df1 (DataFrame): output dataframe.

TODO: compare with io_sets.bools2intervals.


function to_boolean

to_boolean(df1)

Boolean from ranges.

Parameters:

  • df1 (DataFrame): input dataframe.

Returns:

  • ds (Series): output series.

TODO: compare with io_sets.bools2intervals.


function to_cat

to_cat(ds1, cats, ordered=True)

To series containing categories.

Parameters:

  • ds1 (Series): input series.
  • cats (list): categories.
  • ordered (bool): if the categories are ordered (True).

Returns:

  • ds1 (Series): output series.

function sort_valuesby_list

sort_valuesby_list(df1, by, cats, **kws)

Sort dataframe by custom order of items in a column.

Parameters:

  • df1 (DataFrame): input dataframe.
  • by (str): column.
  • cats (list): ordered list of items.

Keyword parameters:

  • kws (dict): parameters provided to sort_values.

Returns:

  • df (DataFrame): output dataframe.

function agg_by_order

agg_by_order(x, order)

Get first item in the order.

Parameters:

  • x (list): list.
  • order (list): desired order of the items.

Returns:

  • k: first item.

Notes:

Used for sorting strings, e.g. damaging > other non-conserving > other conserving.

TODO: Convert categories to numbers and take min
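
Illustratively, the order-based pick can be sketched in pure Python (a hypothetical stand-in, `agg_by_order_sketch`; the actual roux function may operate on pandas objects):

```python
def agg_by_order_sketch(x, order):
    """Return the first item from `order` that is present in `x`."""
    for k in order:
        if k in x:
            return k
    return None

# 'damaging' outranks the other categories present, so it wins:
agg_by_order_sketch(
    ['other conserving', 'damaging'],
    order=['damaging', 'other non-conserving', 'other conserving'],
)
```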


function agg_by_order_counts

agg_by_order_counts(x, order)

Get the aggregated counts by order*.

Parameters:

  • x (list): list.
  • order (list): desired order of the items.

Returns:

  • df (DataFrame): output dataframe.

Examples:

df=pd.DataFrame({'a1':['a','b','c','a','b','c','d'],
                 'b1':['a1','a1','a1','b1','b1','b1','b1']})
df.groupby('b1').apply(lambda df: agg_by_order_counts(x=df['a1'], order=['b','c','a']))


function groupby_sort_values

groupby_sort_values(
    df,
    col_groupby,
    col_sortby,
    subset=None,
    col_subset=None,
    func='mean',
    ascending=True
)

Sort groups.

Parameters:

  • df (DataFrame): input dataframe.
  • col_groupby (str|list): column/s to groupby with.
  • col_sortby (str|list): column/s to sort values with.
  • subset (list): columns (None).
  • col_subset (str): column containing the subset (None).
  • func (str): aggregate function, provided to numpy ('mean').
  • ascending (bool): sort values ascending (True).

Returns:

  • df (DataFrame): output dataframe.

function swap_paired_cols

swap_paired_cols(df_, suffixes=['gene1', 'gene2'])

Swap suffixes of paired columns.

Parameters:

  • df_ (DataFrame): input dataframe.
  • suffixes (list): suffixes.

Returns:

  • df (DataFrame): output dataframe.

function sort_columns_by_values

sort_columns_by_values(
    df: DataFrame,
    cols_sortby=['mutation gene1', 'mutation gene2'],
    suffixes=['gene1', 'gene2'],
    clean=False
) → DataFrame

Sort the values in columns in ascending order.

Parameters:

  • df (DataFrame): input dataframe.
  • cols_sortby (list): (['mutation gene1','mutation gene2'])
  • suffixes (list): suffixes, without spaces. (['gene1','gene2'])

Returns:

  • df (DataFrame): output dataframe.

Notes:

In the output dataframe, 'sorted' indicates that the values were sorted (i.e. originally gene1 > gene2).


function make_ids

make_ids(df, cols, ids_have_equal_length, sep='--', sort=False)

Make ids by joining string ids in more than one columns.

Parameters:

  • df (DataFrame): input dataframe.
  • cols (list): columns.
  • ids_have_equal_length (bool): ids have equal length, if True faster processing.
  • sep (str): separator between the ids ('--').
  • sort (bool): sort the ids before joining (False).

Returns:

  • ds (Series): output series.
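
A minimal pure-Python sketch of the joining idea (the name `make_ids_sketch` is hypothetical; the roux function takes a DataFrame and returns a Series):

```python
def make_ids_sketch(rows, sep='--', sort=False):
    """Join the string ids of each row with a separator,
    optionally sorting the ids within a row first."""
    return [sep.join(sorted(r) if sort else r) for r in rows]

# with sort=True, the second pair is reordered before joining
make_ids_sketch([('geneA', 'geneB'), ('geneD', 'geneC')], sort=True)
```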

function make_ids_sorted

make_ids_sorted(df, cols, ids_have_equal_length, sep='--')

Make sorted ids by joining string ids in more than one columns.

Parameters:

  • df (DataFrame): input dataframe.
  • cols (list): columns.
  • ids_have_equal_length (bool): ids have equal length, if True faster processing.
  • sep (str): separator between the ids ('--').

Returns:

  • ds (Series): output series.

function get_alt_id

get_alt_id(s1='A--B', s2='A', sep='--')

Get alternate/partner id from a paired id.

Parameters:

  • s1 (str): joined id.
  • s2 (str): query id.

Returns:

  • s (str): partner id.
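
The lookup can be sketched as follows (`get_alt_id_sketch` is a hypothetical stand-in; the actual implementation may differ):

```python
def get_alt_id_sketch(s1='A--B', s2='A', sep='--'):
    """Return the partner id from a joined pair id."""
    ids = s1.split(sep)
    return [i for i in ids if i != s2][0]
```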

function split_ids

split_ids(df1, col, sep='--', prefix=None)

Split joined ids to individual ones.

Parameters:

  • df1 (DataFrame): input dataframe.
  • col (str): column containing the joined ids.
  • sep (str): separator within the joined ids ('--').
  • prefix (str): prefix of the individual ids (None).

Return:

  • df1 (DataFrame): output dataframe.

function dict2df

dict2df(d, colkey='key', colvalue='value')

Dictionary to DataFrame.

Parameters:

  • d (dict): dictionary.
  • colkey (str): name of column containing the keys.
  • colvalue (str): name of column containing the values.

Returns:

  • df (DataFrame): output dataframe.

function log_shape_change

log_shape_change(d1, fun='')

Report the changes in the shapes of a DataFrame.

Parameters:

  • d1 (dic): dictionary containing the shapes.
  • fun (str): name of the function.

function log_apply

log_apply(
    df,
    fun,
    validate_equal_length=False,
    validate_equal_width=False,
    validate_equal_shape=False,
    validate_no_decrease_length=False,
    validate_no_decrease_width=False,
    validate_no_increase_length=False,
    validate_no_increase_width=False,
    *args,
    **kwargs
)

Report (log) the changes in the shapes of the dataframe before and after an operation/s.

Parameters:

  • df (DataFrame): input dataframe.
  • fun (object): function to apply on the dataframe.
  • validate_equal_length (bool): Validate that the number of rows i.e. length of the dataframe remains the same before and after the operation.
  • validate_equal_width (bool): Validate that the number of columns i.e. width of the dataframe remains the same before and after the operation.
  • validate_equal_shape (bool): Validate that the number of rows and columns i.e. shape of the dataframe remains the same before and after the operation.

Keyword parameters:

  • args (tuple): provided to fun.
  • kwargs (dict): provided to fun.

Returns:

  • df (DataFrame): output dataframe.

class log

Report (log) the changes in the shapes of the dataframe before and after an operation/s.

TODO: Create the attributes (attr) using strings e.g. setattr.

    import inspect
    fun=inspect.currentframe().f_code.co_name

method __init__

__init__(pandas_obj)

method check_dups

check_dups(**kws)

method check_na

check_na(**kws)

method check_nunique

check_nunique(**kws)

method clean

clean(**kws)

method drop

drop(**kws)

method drop_duplicates

drop_duplicates(**kws)

method dropna

dropna(**kws)

method explode

explode(**kws)

method filter_

filter_(**kws)

method filter_rows

filter_rows(**kws)

method groupby

groupby(**kws)

method join

join(**kws)

method melt

melt(**kws)

method melt_paired

melt_paired(**kws)

method merge

merge(**kws)

method pivot

pivot(**kws)

method pivot_table

pivot_table(**kws)

method query

query(**kws)

method stack

stack(**kws)

method unstack

unstack(**kws)

module roux.lib.dfs

For processing multiple pandas DataFrames/Series


function filter_dfs

filter_dfs(dfs, cols, how='inner')

Filter dataframes based on items in the common columns.

Parameters:

  • dfs (list): list of dataframes.
  • cols (list): list of columns.
  • how (str): how to filter ('inner')

Returns:

  • dfs (list): list of dataframes.

function merge_with_many_columns

merge_with_many_columns(
    df1: DataFrame,
    right: str,
    left_on: str,
    right_ons: list,
    right_id: str,
    how: str = 'inner',
    validate: str = '1:1',
    test: bool = False,
    verbose: bool = False,
    **kws_merge
) → DataFrame

Merge with many columns. For example, if ids in the left table can map to ids located in multiple columns of the right table.

Parameters:

  • df1 (pd.DataFrame): left table.
  • right (pd.DataFrame): right table.
  • left_on (str): column in the left table to merge on.
  • right_ons (list): columns in the right table to merge on.
  • right_id (str): column in the right table containing, for example, the ids to be merged.

Keyword parameters:

  • kws_merge: to be supplied to pandas.DataFrame.merge.

Returns: Merged table.


function merge_paired

merge_paired(
    df1,
    df2,
    left_ons,
    right_on,
    common=[],
    right_ons_common=[],
    how='inner',
    validates=['1:1', '1:1'],
    suffixes=None,
    test=False,
    verb=True,
    **kws
)

Merge unpaired dataframes to a paired dataframe.

Parameters:

  • df1 (DataFrame): paired dataframe.
  • df2 (DataFrame): unpaired dataframe.
  • left_ons (list): columns of the df1 (suffixed).
  • right_on (str|list): column/s of the df2 (to be suffixed).
  • common (str|list): common column/s between df1 and df2 (not suffixed).
  • right_ons_common (str|list): common column/s between df2 to be used for merging (not to be suffixed).
  • how (str): method of merging ('inner').
  • validates (list): validate mappings for the 1st mapping between df1 and df2 and 2nd one between df1+df2 and df2 (['1:1','1:1']).
  • suffixes (list): suffixes to be used (None).
  • test (bool): testing (False).
  • verb (bool): verbose (True).

Keyword Parameters:

  • kws (dict): parameters provided to merge.

Returns:

  • df (DataFrame): output dataframe.

Examples:

Parameters:

how='inner',
left_ons=['gene id gene1','gene id gene2'], # suffixed
common='sample id', # not suffixed
right_on='gene id', # to be suffixed
right_ons_common=[], # not to be suffixed


function merge_dfs

merge_dfs(dfs, **kws)

Merge dataframes from left to right.

Parameters:

  • dfs (list): list of dataframes.

Keyword Parameters:

  • kws (dict): parameters provided to merge.

Returns:

  • df (DataFrame): output dataframe.

Notes:

For example, reduce(lambda x, y: x.merge(y), [1, 2, 3, 4, 5]) merges ((((1.merge(2)).merge(3)).merge(4)).merge(5)).
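
The left-to-right folding that reduce performs can be illustrated with a pure-Python stand-in (dict-merge as the step function, mirroring how the dataframes are merged pairwise):

```python
from functools import reduce

# reduce folds left to right: ((d1 merged d2) merged d3);
# with dicts, later values win on key conflicts
merged = reduce(lambda x, y: {**x, **y}, [{'a': 1}, {'b': 2}, {'a': 3}])
```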


function compare_rows

compare_rows(df1, df2, test=False, **kws)

module roux.lib.dict

For processing dictionaries.


function head_dict

head_dict(d, lines=5)

function sort_dict

sort_dict(d1, by=1, ascending=True)

Sort dictionary by values.

Parameters:

  • d1 (dict): input dictionary.
  • by (int): index of the value among the values.
  • ascending (bool): ascending order.

Returns:

  • d1 (dict): output dictionary.

function merge_dicts

merge_dicts(l: list) → dict

Merge dictionaries.

Parameters:

  • l (list): list containing the dictionaries.

Returns:

  • d (dict): output dictionary.

TODOs: in python>=3.9, merged = d1 | d2
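
A sketch of the merging idiom (later dictionaries win on conflicting keys):

```python
d1 = {'a': 1, 'b': 2}
d2 = {'b': 3, 'c': 4}

# pre-3.9 idiom
merged = {**d1, **d2}

# python>=3.9 equivalent noted in the TODO:
# merged = d1 | d2
```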


function merge_dict_values

merge_dict_values(l, test=False)

Merge dictionary values.

Parameters:

  • l (list): list containing the dictionaries.
  • test (bool): verbose.

Returns:

  • d (dict): output dictionary.

function flip_dict

flip_dict(d)

Switch values with keys and vice versa.

Parameters:

  • d (dict): input dictionary.

Returns:

  • d (dict): output dictionary.
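
The core of such a flip is a one-line dict comprehension (a sketch, `flip_dict_sketch`; it assumes the values are unique and hashable):

```python
def flip_dict_sketch(d):
    """Swap keys and values; assumes values are unique and hashable."""
    return {v: k for k, v in d.items()}
```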

module roux.lib.google

Processing files from google-cloud services.


function get_service

get_service(service_name='drive', access_limit=True, client_config=None)

Creates a google service object.

:param service_name: name of the service e.g. drive
:param access_limit: True if access is limited else False
:param client_config: custom client config
:return: google service object

Ref: https://developers.google.com/drive/api/v3/about-auth


function list_files_in_folder

list_files_in_folder(service, folderid, filetype=None, fileext=None, test=False)

Lists files in a google drive folder.

:param service: service object e.g. drive
:param folderid: folder id from google drive
:param filetype: specify file type
:param fileext: specify file extension
:param test: True if verbose else False
:return: list of files in the folder


function get_file_id

get_file_id(p)

function download_file

download_file(
    p=None,
    file_id=None,
    service=None,
    outd=None,
    outp=None,
    convert=False,
    force=False,
    test=False
)

Downloads a specified file.

:param service: google service object
:param file_id: file id as on google drive
:param filetypes: specify file type
:param outp: path to the output file
:param test: True if verbose else False

Ref: https://developers.google.com/drive/api/v3/ref-export-formats


function upload_file

upload_file(service, filep, folder_id, test=False)

Uploads a local file onto google drive.

:param service: google service object
:param filep: path of the file
:param folder_id: id of the folder on google drive where the file will be uploaded
:param test: True if verbose else False
:return: id of the uploaded file


function upload_files

upload_files(service, ps, folder_id, **kws)

function download_drawings

download_drawings(folderid, outd, service=None, test=False)

Download specific files: drawings

TODOs: 1. use download_file


function get_comments

get_comments(
    fileid,
    fields='comments/quotedFileContent/value,comments/content,comments/id',
    service=None
)

Get comments.

fields:
    comments/
        kind:
        id:
        createdTime:
        modifiedTime:
        author:
            kind:
            displayName:
            photoLink:
            me: True
        htmlContent:
        content:
        deleted:
        quotedFileContent:
            mimeType:
            value:
        anchor:
        replies: []


function search

search(query, results=1, service=None, **kws_search)

Google search.

:param query: exact terms
:return: dict


function get_search_strings

get_search_strings(text, num=5, test=False)

Google search.

:param text: string
:param num: number of results
:param test: True if verbose else False
:return lines: list


function get_metadata_of_paper

get_metadata_of_paper(
    file_id,
    service_drive,
    service_search,
    metadata=None,
    force=False,
    test=False
)

Get the metadata of a pdf document.


function share

share(
    drive_service,
    content_id,
    share=False,
    unshare=False,
    user_permission=None,
    permissionId='anyoneWithLink'
)

:param user_permission: e.g.

    user_permission = {
        'type': 'anyone',
        'role': 'reader',
        'email': '@',
    }

Ref: https://developers.google.com/drive/api/v3/manage-sharing


class slides


method create_image

create_image(service, presentation_id, page_id, image_id)

Image less than 1.5 MB.


method get_page_ids

get_page_ids(service, presentation_id)

module roux.lib.io

For input/output of data files.


function to_zip

to_zip(p, outp=None, fmt='zip')

Compress a file/directory.

Parameters:

  • p (str): path to the file/directory.
  • outp (str): path to the output compressed file.
  • fmt (str): format of the compressed file.

Returns:

  • outp (str): path of the compressed file.

function read_zip

read_zip(p: str, file_open: str = None, fun_read=None, test: bool = False)

Read the contents of a zip file.

Parameters:

  • p (str): path of the file.
  • file_open (str): path of file within the zip file to open.
  • fun_read (object): function to read the file.

Examples:

  1. Setting fun_read parameter for reading tab-separated table from a zip file.

fun_read=lambda x: pd.read_csv(io.StringIO(x.decode('utf-8')), sep='\t', header=None)


function get_version

get_version(suffix='')

Get the time-based version string.

Parameters:

  • suffix (string): suffix.

Returns:

  • version (string): version.
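
One plausible shape of such a time-based version string (a sketch under the assumption that the version is a timestamp; the exact format roux uses may differ):

```python
from datetime import datetime

def get_version_sketch(suffix=''):
    """Build a timestamp-based version string, e.g. 'v20240101_120000_suffix'."""
    v = 'v' + datetime.now().strftime('%Y%m%d_%H%M%S')
    return v + ('_' + suffix if suffix else '')
```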

function version

version(p, outd=None, **kws)

Get the version of the file/directory.

Parameters:

  • p (str): path.
  • outd (str): output directory.

Keyword parameters:

  • kws (dict): provided to get_version.

Returns:

  • version (string): version.

function backup

backup(
    p,
    outd,
    versioned=False,
    suffix='',
    zipped=False,
    move_only=False,
    test=True,
    no_test=False
)

Backup a directory

Steps:

  0. Create a version directory in outd.
  1. Move ps to the version (time) directory, with common parents up to the level of the version directory.
  2. Zip or not.

Parameters:

  • p (str): input path.
  • outd (str): output directory path.
  • versioned (bool): custom version for the backup (False).
  • suffix (str): custom suffix for the backup ('').
  • zipped (bool): whether to zip the backup (False).
  • test (bool): testing (True).
  • no_test (bool): no testing (False).

TODO:

  1. Chain to if exists and force.
  2. Option to remove dirs: find and move/zip "find -regex ./_." "find -regex ./test."

function read_url

read_url(url)

Read text from a URL.

Parameters:

  • url (str): URL link.

Returns:

  • s (string): text content of the URL.

function download

download(
    url: str,
    outd: str,
    path: str = None,
    force: bool = False,
    verbose: bool = True
) → str

Download a file.

Parameters:

  • url (str): URL.
  • path (str): custom output path (None)
  • outd (str): output directory ('data/database').
  • force (bool): overwrite output (False).
  • verbose (bool): verbose (True).

Returns:

  • path (str): output path (None)

function read_text

read_text(p)

Read a file. To be called by other functions.

Args:

  • p (str): path.

Returns:

  • s (str): contents.

function to_list

to_list(l1, p)

Save list.

Parameters:

  • l1 (list): input list.
  • p (str): path.

Returns:

  • p (str): path.

function read_list

read_list(p)

Read the lines in the file.

Args:

  • p (str): path.

Returns:

  • l (list): list.
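
The to_list/read_list pair round-trips a list through a plain text file, one item per line. A self-contained sketch (hypothetical `_sketch` names; the roux versions may handle more formats):

```python
import os
import tempfile

def to_list_sketch(l1, p):
    """Write one item per line."""
    with open(p, 'w') as f:
        f.write('\n'.join(l1))
    return p

def read_list_sketch(p):
    """Read the lines back into a list."""
    with open(p) as f:
        return f.read().splitlines()

p = os.path.join(tempfile.mkdtemp(), 'items.txt')
read_list_sketch(to_list_sketch(['a', 'b', 'c'], p))
```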

function read_yaml

read_yaml(p)

Read .yaml file.

Parameters:

  • p (str): path.

Returns:

  • d (dict): output dictionary.

function to_yaml

to_yaml(d, p, **kws)

Save .yaml file.

Parameters:

  • d (dict): input dictionary.
  • p (str): path.

Keyword Arguments:

  • kws (d): parameters provided to yaml.safe_dump.

Returns:

  • p (str): path.

function read_json

read_json(path_to_file, encoding=None)

Read .json file.

Parameters:

  • p (str): path.

Returns:

  • d (dict): output dictionary.

function to_json

to_json(data, p)

Save .json file.

Parameters:

  • d (dict): input dictionary.
  • p (str): path.

Returns:

  • p (str): path.
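
The read_json/to_json pair is a thin wrapper over the standard json module; a self-contained round-trip sketch (hypothetical `_sketch` names):

```python
import json
import os
import tempfile

def to_json_sketch(data, p):
    """Serialize data to a .json file and return the path."""
    with open(p, 'w') as f:
        json.dump(data, f)
    return p

def read_json_sketch(p):
    """Load a .json file back into a Python object."""
    with open(p) as f:
        return json.load(f)

p = os.path.join(tempfile.mkdtemp(), 'd.json')
read_json_sketch(to_json_sketch({'key': 'value'}, p))
```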

function read_pickle

read_pickle(p)

Read .pickle file.

Parameters:

  • p (str): path.

Returns:

  • d (dict): output dictionary.

function is_dict

is_dict(p)

function read_dict

read_dict(p, fmt: str = '', apply_on_keys=None, **kws) → dict

Read dictionary file.

Parameters:

  • p (str): path.
  • fmt (str): format of the file.

Keyword Arguments:

  • kws (d): parameters provided to reader function.

Returns:

  • d (dict): output dictionary.

function to_dict

to_dict(d, p, **kws)

Save dictionary file.

Parameters:

  • d (dict): input dictionary.
  • p (str): path.

Keyword Arguments:

  • kws (d): parameters provided to export function.

Returns:

  • p (str): path.

function post_read_table

post_read_table(df1, clean, tables, verbose=True, **kws_clean)

Post-reading a table.

Parameters:

  • df1 (DataFrame): input dataframe.
  • clean (bool): whether to apply the clean function.
  • tables (int): number of tables.
  • verbose (bool): verbose.

Keyword parameters:

  • kws_clean (dict): paramters provided to the clean function.

Returns:

  • df (DataFrame): output dataframe.

function read_table

read_table(
    p,
    ext=None,
    clean=True,
    filterby_time=None,
    params={},
    kws_clean={},
    kws_cloud={},
    check_paths=True,
    tables=1,
    test=False,
    verbose=True,
    **kws_read_tables
)

Table/s reader.

Parameters:

  • p (str): path of the file. It could be an input for read_ps, which would include strings with wildcards, lists etc.
  • ext (str): extension of the file (default: None, meaning inferred from the path).
  • clean (bool): clean the table (default: True).
  • filterby_time (str): filter by time (default: None).
  • check_paths (bool): read files in the path column (default: True).
  • test (bool): testing (default: False).
  • params (dict): parameters provided to pd.read_csv (default: {}). For example, params['columns']: columns to read.
  • kws_clean (dict): parameters provided to rd.clean (default: {}).
  • kws_cloud (dict): parameters for reading files from google-drive (default: {}).
  • tables (int): how many tables to be read (default: 1).
  • verbose (bool): verbose (default: True).

Keyword parameters:

  • kws_read_tables (dict): parameters provided to the read_tables function. For example:
  • drop_index (bool): whether to drop the index column e.g. path (default: True).
  • replaces_index (object|dict|list|str): for example, 'basenamenoext' if path to basename.
  • colindex (str): the name of the column containing the paths (default: 'path').

Returns:

  • df (DataFrame): output dataframe.

Examples:

  1. For reading specific columns only set params=dict(columns=list).

  2. While reading many files, convert paths to a column with corresponding values:

drop_index=False, colindex='parameter', replaces_index=lambda x: Path(x).parent

  3. Reading a vcf file:

p='*.vcf|vcf.gz'
read_table(p,
    params_read_csv=dict(
        #compression='gzip',
        sep='\t', comment='#', header=None,
        names=replace_many(get_header(path, comment='#', lineno=-1), ['#', '\n'], '').split('\t'),
    )
)

function get_logp

get_logp(ps)

Infer the path of the log file.

Parameters:

  • ps (list): list of paths.

Returns:

  • p (str): path of the output file.

function apply_on_paths

apply_on_paths(
    ps,
    func,
    replaces_outp=None,
    replaces_index=None,
    drop_index=True,
    colindex='path',
    filter_rows=None,
    fast=False,
    progress_bar=True,
    params={},
    dbug=False,
    test1=False,
    verbose=True,
    kws_read_table={},
    **kws
)

Apply a function on list of files.

Parameters:

  • ps (str|list): paths or string to infer paths using read_ps.
  • func (function): function to be applied on each of the paths.
  • replaces_outp (dict|function): infer the output path (outp) by replacing substrings in the input paths (p).
  • filter_rows (dict): filter the rows based on dict, using rd.filter_rows.
  • fast (bool): parallel processing (default:False).
  • progress_bar (bool): show progress bar(default:True).
  • params (dict): parameters provided to the pd.read_csv function.
  • dbug (bool): debug mode on (default:False).
  • test1 (bool): test on one path (default:False).
  • kws_read_table (dict): parameters provided to the read_table function (default:{}).
  • replaces_index (object|dict|list|str): for example, 'basenamenoext' if path to basename.
  • drop_index (bool): whether to drop the index column e.g. path (default: True).
  • colindex (str): the name of the column containing the paths (default: 'path')

Keyword parameters:

  • kws (dict): parameters provided to the function.

Example:

  1. Function:

def apply_(p, outd='data/data_analysed', force=False):
    outp=f"{outd}/{basenamenoext(p)}.pqt"
    if exists(outp) and not force:
        return
    df01=read_table(p)

apply_on_paths(
    ps=glob("data/data_analysed/*"),
    func=apply_,
    outd="data/data_analysed/",
    force=True,
    fast=False,
    read_path=True,
)

TODOs: Move out of io.


function read_tables

read_tables(
    ps,
    fast=False,
    filterby_time=None,
    to_dict=False,
    params={},
    tables=None,
    **kws_apply_on_paths
)

Read multiple tables.

Parameters:

  • ps (list): list of paths.
  • fast (bool): parallel processing (default:False)
  • filterby_time (str): filter by time (default:None)
  • drop_index (bool): drop index (default:True)
  • to_dict (bool): output dictionary (default:False)
  • params (dict): parameters provided to the pd.read_csv function (default:{})
  • tables: number of tables (default:None).

Keyword parameters:

  • kws_apply_on_paths (dict): parameters provided to apply_on_paths.

Returns:

  • df (DataFrame): output dataframe.

TODOs: Parameter to report the creation dates of the newest and the oldest files.


function to_table

to_table(df, p, colgroupby=None, test=False, **kws)

Save table.

Parameters:

  • df (DataFrame): the input dataframe.
  • p (str): output path.
  • colgroupby (str|list): columns to groupby with to save the subsets of the data as separate files.
  • test (bool): testing on (default:False).

Keyword parameters:

  • kws (dict): parameters provided to the to_manytables function.

Returns:

  • p (str): path of the output.

function to_manytables

to_manytables(df, p, colgroupby, fmt='', ignore=False, **kws_get_chunks)

Save many tables.

Parameters:

  • df (DataFrame): the input dataframe.
  • p (str): output path.
  • colgroupby (str|list): columns to groupby with to save the subsets of the data as separate files.
  • fmt (str): if '=', include column names in the folder name, e.g. col1=True.
  • ignore (bool): ignore the warnings (default:False).

Keyword parameters:

  • kws_get_chunks (dict): parameters provided to the get_chunks function.

Returns:

  • p (str): path of the output.

TODOs:

  1. Change the default parameter to fmt='='.

function to_table_pqt

to_table_pqt(df, p, engine='fastparquet', compression='gzip', **kws_pqt)

function tsv2pqt

tsv2pqt(p)

Convert tab-separated file to Apache parquet.

Parameters:

  • p (str): path of the input.

Returns:

  • p (str): path of the output.

function pqt2tsv

pqt2tsv(p: str) → str

Convert Apache parquet file to tab-separated.

Parameters:

  • p (str): path of the input.

Returns:

  • p (str): path of the output.

function read_excel

read_excel(
    p: str,
    sheet_name: str = None,
    kws_cloud: dict = {},
    test: bool = False,
    **kws
)

Read excel file

Parameters:

  • p (str): path of the file.
  • sheet_name (str|None): read 1st sheet if None (default:None)
  • kws_cloud (dict): parameters provided to read the file from the google drive (default:{})
  • test (bool): if False and sheet_name not provided, return all sheets as a dictionary, else if True, print list of sheets.

Keyword parameters:

  • kws: parameters provided to the excel reader.

function to_excel_commented

to_excel_commented(p: str, comments: dict, outp: str = None, author: str = None)

Add comments to the columns of excel file and save.

Args:

  • p (str): input path of excel file.
  • comments (dict): map between column names and comment e.g. description of the column.
  • outp (str): output path of excel file. Defaults to None.
  • author (str): author of the comments. Defaults to 'Author'.

TODOs: 1. Increase the limit on the number of columns that comments can be added to. Currently it is 26, i.e. up to Z1.


function to_excel

to_excel(
    sheetname2df: dict,
    outp: str,
    comments: dict = None,
    author: str = None,
    append: bool = False,
    **kws
)

Save excel file.

Parameters:

  • sheetname2df (dict): dictionary mapping the sheetname to the dataframe.
  • outp (str): output path.
  • append (bool): append the dataframes (default:False).
  • comments (dict): map between column names and comment e.g. description of the column.

Keyword parameters:

  • kws: parameters provided to the excel writer.

function check_chunks

check_chunks(outd, col, plot=True)

Create chunks of the tables.

Parameters:

  • outd (str): output directory.
  • col (str): the column with values that are used for getting the chunks.
  • plot (bool): plot the chunk sizes (default:True).

Returns:

  • df3 (DataFrame): output dataframe.

module roux.lib

Global Variables

  • df
  • set
  • str
  • dict
  • dfs
  • sys
  • text
  • io

function to_class

to_class(cls)

Get the decorator to attach functions.

Parameters:

  • cls (class): class object.

Returns:

  • decorator (decorator): decorator object.

References:

  • https://gist.github.com/mgarod/09aa9c3d8a52a980bd4d738e52e5b97a

function decorator

decorator(func)

class rd

roux-dataframe (.rd) extension.

method __init__

__init__(pandas_obj)

module roux.lib.seq

For processing biological sequence data.

Global Variables

  • bed_colns

function reverse_complement

reverse_complement(s)

Reverse complement.

Args:

  • s (str): sequence

Returns:

  • s (str): reverse complemented sequence
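
The standard idiom for this is a translation table plus a reversal; a self-contained sketch (hypothetical `_sketch` name, DNA alphabet assumed):

```python
def reverse_complement_sketch(s):
    """Reverse-complement a DNA sequence (A<->T, C<->G), case-preserving."""
    return s.translate(str.maketrans('ACGTacgt', 'TGCAtgca'))[::-1]
```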

function fa2df

fa2df(alignedfastap: str, ids2cols=False) → DataFrame

Aligned fasta file to dataframe.

Args:

  • alignedfastap (str): path.
  • ids2cols (bool, optional): ids of the sequences to columns. Defaults to False.

Returns:

  • DataFrame: output dataframe.

function to_genomeocoords

to_genomeocoords(genomecoord: str) → tuple

String-formatted genome co-ordinates to separated values.

Args: genomecoord (str):

Raises:

  • ValueError: format of the genome co-ordinates.

Returns:

  • tuple: separated values i.e. chrom,start,end,strand
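
A sketch of such a parser, assuming a 'chrom:start-end(strand)' format (the exact format roux accepts is an assumption here):

```python
import re

def to_genomecoords_sketch(genomecoord):
    """Parse 'chrom:start-end(strand)' into (chrom, start, end, strand).

    Raises ValueError if the string does not match the assumed format."""
    m = re.match(r'^(\w+):(\d+)-(\d+)([+-]?)$', genomecoord)
    if m is None:
        raise ValueError(f"unrecognized genome co-ordinate format: {genomecoord}")
    chrom, start, end, strand = m.groups()
    return chrom, int(start), int(end), strand
```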

function to_bed

to_bed(df, col_genomeocoord)

Genome co-ordinates to bed.

Args:

  • df (DataFrame): input dataframe.
  • col_genomeocoord (str): column with the genome coordinates.

Returns:

  • DataFrame: output dataframe.

function read_fasta

read_fasta(fap: str, key_type: str = 'id', duplicates: bool = False) → dict

Read fasta

Args:

  • fap (str): path
  • key_type (str, optional): key type. Defaults to 'id'.
  • duplicates (bool, optional): duplicates present. Defaults to False.

Returns:

  • dict: data.

Notes:

  1. If duplicates are present, key_type is set to 'description' instead of 'id'.

function to_fasta

to_fasta(
    sequences: dict,
    output_path: str,
    molecule_type: str,
    force: bool = True,
    **kws_SeqRecord
) → str

Save fasta file.

Args:

  • sequences (dict): dictionary mapping the sequence name to the sequence.
  • output_path (str): path of the fasta file.
  • force (bool): overwrite if file exists.

Returns:

  • output_path (str): path of the fasta file

module roux.lib.set

For processing list-like sets.


function union

union(l)

Union of lists.

Parameters:

  • l (list): list of lists.

Returns:

  • l (list): list.

function intersection

intersection(l)

Intersections of lists.

Parameters:

  • l (list): list of lists.

Returns:

  • l (list): list.

function nunion

nunion(l)

Count the items in union.

Parameters:

  • l (list): list of lists.

Returns:

  • i (int): count.

function nintersection

nintersection(l)

Count the items in the intersection.

Parameters:

  • l (list): list of lists.

Returns:

  • i (int): count.

function dropna

dropna(x)

Drop np.nan items from a list.

Parameters:

  • x (list): list.

Returns:

  • x (list): list.

function unique

unique(l)

Unique items in a list.

Parameters:

  • l (list): input list.

Returns:

  • l (list): list.

function list2str

list2str(x, ignore=False)

Returns string if single item in a list.

Parameters:

  • x (list): list

Returns:

  • s (str): string.

function unique_str

unique_str(l, **kws)

Unique single item from a list.

Parameters:

  • l (list): input list.

Returns:

  • l (list): list.

function nunique

nunique(l, **kws)

Count unique items in a list

Parameters:

  • l (list): list

Returns:

  • i (int): count.

function flatten

flatten(l)

List of lists to list.

Parameters:

  • l (list): input list.

Returns:

  • l (list): output list.
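
Flattening a list of lists is commonly done with itertools; a minimal sketch of what flatten presumably does (the roux implementation may differ):

```python
from itertools import chain

def flatten(l):
    # list of lists -> one flat list, order preserved
    return list(chain.from_iterable(l))
```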

function get_alt

get_alt(l1, s)

Get alternate item between two.

Parameters:

  • l1 (list): list.
  • s (str): item.

Returns:

  • s (str): alternate item.

function jaccard_index

jaccard_index(l1, l2)

function intersections

intersections(dn2list, jaccard=False, count=True, fast=False, test=False)

Get intersections between lists.

Parameters:

  • dn2list (dict): dictionary mapping names to lists.
  • jaccard (bool): return jaccard indices.
  • count (bool): return counts.
  • fast (bool): fast.
  • test (bool): verbose.

Returns:

  • df (DataFrame): output dataframe.

TODOs: 1. feed as an estimator to df.corr(). 2. faster processing by filling up the symmetric half of the adjacency matrix.


function range_overlap

range_overlap(l1, l2)

Overlap between ranges.

Parameters:

  • l1 (list): start and end integers of one range.
  • l2 (list): start and end integers of other range.

Returns:

  • l (list): overlapped range.
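
A minimal sketch of the presumed range-overlap logic, assuming inclusive [start, end] ranges (the actual roux implementation may treat ends differently):

```python
def range_overlap(l1, l2):
    # positions shared by two [start, end] ranges (ends inclusive);
    # empty list if the ranges do not overlap
    return list(range(max(l1[0], l2[0]), min(l1[1], l2[1]) + 1))
```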

function get_windows

get_windows(
    a,
    size=None,
    overlap=None,
    windows=None,
    overlap_fraction=None,
    stretch_last=False,
    out_ranges=True
)

Windows/segments from a range.

Parameters:

  • a (list): range.
  • size (int): size of the windows.
  • windows (int): number of windows.
  • overlap_fraction (float): overlap fraction.
  • overlap (int): overlap length.
  • stretch_last (bool): stretch last window.
  • out_ranges (bool): whether to output ranges.

Returns:

  • df1 (DataFrame): output dataframe.

Notes:

  1. For development: casting to int floors the value, like np.floor.

function bools2intervals

bools2intervals(v)

Convert bools to intervals.

Parameters:

  • v (list): list of bools.

Returns:

  • l (list): intervals.

function list2ranges

list2ranges(l)

function get_pairs

get_pairs(
    items: list,
    items_with: list = None,
    size: int = 2,
    with_self: bool = False
) → DataFrame

Creates a dataframe with the paired items.

Parameters:

  • items: the list of items to pair.
  • items_with: list of items to pair with.
  • size: size of the combinations.
  • with_self: pair with self or not.

Returns: table with pairs of items.

Notes:

  1. the ids of the items are sorted e.g. 'a'-'b' not 'b'-'a'.
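The pairing logic, including the sorted-within-pair convention noted above, can be sketched with itertools.combinations. This is an illustrative approximation returning plain tuples rather than the DataFrame the roux function produces:

```python
from itertools import combinations

def get_pairs(items, with_self=False):
    # unique unordered pairs; ids sorted within each pair,
    # i.e. ('a', 'b') not ('b', 'a')
    uniq = sorted(set(items))
    pairs = [tuple(sorted(p)) for p in combinations(uniq, 2)]
    if with_self:
        pairs += [(i, i) for i in uniq]
    return sorted(pairs)
```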

module roux.lib.str

For processing strings.


function substitution

substitution(s, i, replaceby)

Substitute character in a string.

Parameters:

  • s (string): string.
  • i (int): location.
  • replaceby (string): character to substitute with.

Returns:

  • s (string): output string.

function replace_many

replace_many(
    s: str,
    replaces: dict,
    replacewith: str = '',
    ignore: bool = False
)

Rename by replacing sub-strings.

Parameters:

  • s (str): input string.
  • replaces (dict|list): from->to format or list containing substrings to remove.
  • replacewith (str): replace to in case replaces is a list.
  • ignore (bool): if True, do not validate that the replacements succeeded.

Returns:

  • s (str): output string.

function tuple2str

tuple2str(tup, sep=' ')

Join tuple items.

Parameters:

  • tup (tuple|list): input tuple/list.
  • sep (str): separator between the items.

Returns:

  • s (str): output string.
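
A minimal sketch of the presumed behavior, joining items after converting them to strings (the roux implementation may differ):

```python
def tuple2str(tup, sep=' '):
    # join tuple/list items into a single string
    return sep.join(map(str, tup))
```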

function linebreaker

linebreaker(text, width=None, break_pt=None, sep='\n', **kws)

Insert newlines within a string.

Parameters:

  • text (str): string.
  • width (int): insert newline at this interval.
  • sep (string): separator to split the sub-strings.

Returns:

  • s (string): output string.

function findall

findall(s, ss, outends=False, outstrs=False, suffixlen=0)

Find the substrings or their locations in a string.

Parameters:

  • s (string): input string.
  • ss (string): substring.
  • outends (bool): output end positions.
  • outstrs (bool): output strings.
  • suffixlen (int): length of the suffix.

Returns:

  • l (list): output list.

function get_marked_substrings

get_marked_substrings(
    s,
    leftmarker='{',
    rightmarker='}',
    leftoff=0,
    rightoff=0
) → list

Get the substrings flanked with markers from a string.

Parameters:

  • s (str): input string.
  • leftmarker (str): marker on the left.
  • rightmarker (str): marker on the right.
  • leftoff (int): offset on the left.
  • rightoff (int): offset on the right.

Returns:

  • l (list): list of substrings.
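
Extracting marker-flanked substrings (e.g. template placeholders like '{key}') can be sketched with a regular expression; an illustrative approximation of the roux helper, ignoring the offset parameters:

```python
import re

def get_marked_substrings(s, leftmarker='{', rightmarker='}'):
    # non-greedy match of everything between the two markers
    pattern = re.escape(leftmarker) + r'(.*?)' + re.escape(rightmarker)
    return re.findall(pattern, s)
```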

function mark_substrings

mark_substrings(s, ss, leftmarker='(', rightmarker=')') → str

Mark sub-string/s in a string.

Parameters:

  • s (str): input string.
  • ss (str): substring.
  • leftmarker (str): marker on the left.
  • rightmarker (str): marker on the right.

Returns:

  • s (str): string.

function get_bracket

get_bracket(s, leftmarker='(', righttmarker=')') → str

Get bracketed substrings.

Parameters:

  • s (string): string.
  • leftmarker (str): marker on the left.
  • rightmarker (str): marker on the right.

Returns:

  • s (str): string.

TODOs: 1. Use get_marked_substrings.


function align

align(
    s1: str,
    s2: str,
    prefix: bool = False,
    suffix: bool = False,
    common: bool = True
) → list

Align strings.

Parameters:

  • s1 (str): 1st string.
  • s2 (str): 2nd string.
  • prefix (bool): align the prefixes.
  • suffix (bool): align the suffixes.
  • common (bool): return the common substring.

Returns:

  • l (list): output list.

Notes:

  1. Code to test: [get_prefix(source, target, common=False), get_prefix(source, target, common=True), get_suffix(source, target, common=False), get_suffix(source, target, common=True)]

function get_prefix

get_prefix(s1: str, s2: str, common: bool = True, clean: bool = True) → str

Get the prefix of the strings

Parameters:

  • s1 (str): 1st string.
  • s2 (str): 2nd string.
  • common (bool): get the common prefix (default:True).
  • clean (bool): clean the leading and trailing whitespaces (default:True).

Returns:

  • s (str): prefix.

function get_suffix

get_suffix(s1: str, s2: str, common: bool = True, clean: bool = True) → str

Get the suffix of the strings

Parameters:

  • s1 (str): 1st string.
  • s2 (str): 2nd string.
  • common (bool): get the common suffix (default:True).
  • clean (bool): clean the leading and trailing whitespaces (default:True).

Returns:

  • s (str): suffix.
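
Common prefixes and suffixes can be sketched with os.path.commonprefix; an illustrative approximation of the common=True case of the two helpers above:

```python
import os.path

def get_prefix(s1, s2, clean=True):
    # longest common leading substring of the two strings
    s = os.path.commonprefix([s1, s2])
    return s.strip() if clean else s

def get_suffix(s1, s2, clean=True):
    # longest common trailing substring, via the reversed prefix
    s = os.path.commonprefix([s1[::-1], s2[::-1]])[::-1]
    return s.strip() if clean else s
```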

function get_fix

get_fix(s1: str, s2: str, **kws: dict) → str

Infer common prefix or suffix.

Parameters:

  • s1 (str): 1st string.
  • s2 (str): 2nd string.

Keyword parameters:

  • kws: parameters provided to the get_prefix and get_suffix functions.

Returns:

  • s (str): prefix or suffix.

function removesuffix

removesuffix(s1: str, suffix: str) → str

Remove suffix.

Parameters:

  • s1 (str): input string.
  • suffix (str): suffix.

Returns:

  • s1 (str): string without the suffix.

TODOs: 1. Deprecate in Python >= 3.9; use str.removesuffix() instead.


function str2dict

str2dict(
    s: str,
    reversible: bool = True,
    sep: str = ';',
    sep_equal: str = '='
) → dict

String to dictionary.

Parameters:

  • s (str): string.
  • sep (str): separator between entries (default:';').
  • sep_equal (str): separator between the keys and the values (default:'=').

Returns:

  • d (dict): dictionary.

References:

  • 1. https://stackoverflow.com/a/186873/3521099

function dict2str

dict2str(
    d1: dict,
    reversible: bool = True,
    sep: str = ';',
    sep_equal: str = '='
) → str

Dictionary to string.

Parameters:

  • d (dict): dictionary.
  • sep (str): separator between entries (default:';').
  • sep_equal (str): separator between the keys and the values (default:'=').
  • reversible (bool): use JSON so that the string can be decoded back (default:True).

Returns:

  • s (str): string.
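
The str2dict/dict2str pair forms a round trip; a minimal sketch of the presumed non-JSON encoding (values come back as strings, and nested values are not supported in this approximation):

```python
def dict2str(d, sep=';', sep_equal='='):
    # flat dict -> 'k=v;k=v'
    return sep.join(f'{k}{sep_equal}{v}' for k, v in d.items())

def str2dict(s, sep=';', sep_equal='='):
    # inverse of dict2str; splits on the first sep_equal per entry
    return dict(kv.split(sep_equal, 1) for kv in s.split(sep))
```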

function str2num

str2num(s: str) → float

String to number.

Parameters:

  • s (str): string.

Returns:

  • i (int): number.

function num2str

num2str(
    num: float,
    magnitude: bool = False,
    coff: float = 10000,
    decimals: int = 0
) → str

Number to string.

Parameters:

  • num (int): number.
  • magnitude (bool): use magnitudes (default:False).
  • coff (int): cutoff (default:10000).
  • decimals (int): decimal points (default:0).

Returns:

  • s (str): string.

TODOs: 1. ~ if magnitude else not


function encode

encode(data, short: bool = False, method_short: str = 'sha256', **kws) → str

Encode the data as a string.

Parameters:

  • data (str|dict|Series): input data.
  • short (bool): Outputs short string, compatible with paths but non-reversible. Defaults to False.
  • method_short (str): method used for encoding when short=True.

Keyword parameters:

  • kws: parameters provided to encoding function.

Returns:

  • s (string): output string.

function decode

decode(s, out=None, **kws)

Decode data from a string.

Parameters:

  • s (string): encoded string.
  • out (str): output format (dict|df).

Keyword parameters:

  • kws: parameters provided to dict2df.

Returns:

  • d (dict|DataFrame): output data.

module roux.lib.sys

For processing file paths for example.


function basenamenoext

basenamenoext(p)

Basename without the extension.

Args:

  • p (str): path.

Returns:

  • s (str): output.

function remove_exts

remove_exts(p: str, exts: tuple = None)

Filename without the extension.

Args:

  • p (str): path.
  • exts (tuple): extensions.

Returns:

  • s (str): output.
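
The difference between the two helpers above can be sketched with pathlib: one strips only the last extension, the other strips them all. An illustrative approximation, ignoring the exts parameter:

```python
from pathlib import Path

def basenamenoext(p):
    # filename without its last extension
    return Path(p).stem

def remove_exts(p):
    # strip every extension, e.g. 'data/x.tsv.gz' -> 'x'
    return Path(p).name.split('.', 1)[0]
```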

function read_ps

read_ps(ps, test=True) → list

Read a list of paths.

Parameters:

  • ps (list|str): list of paths or a string with wildcard/s.
  • test (bool): testing.

Returns:

  • ps (list): list of paths.

function to_path

to_path(s, replacewith='_', verbose=False, coff_len_escape_replacement=100)

Normalise a string for use as a file path.

Parameters:

  • s (string): input string.
  • replacewith (str): replace the whitespaces or incompatible characters with.

Returns:

  • s (string): output string.

function makedirs

makedirs(p: str, exist_ok=True, **kws)

Make directories recursively.

Args:

  • p (str): path.
  • exist_ok (bool, optional): no error if the directory exists. Defaults to True.

Returns:

  • p_ (str): the path of the directory.

function to_output_path

to_output_path(ps, outd=None, outp=None, suffix='')

Infer a single output path for a list of paths.

Parameters:

  • ps (list): list of paths.
  • outd (str): path of the output directory.
  • outp (str): path of the output file.
  • suffix (str): suffix of the filename.

Returns:

  • outp (str): path of the output file.

function to_output_paths

to_output_paths(
    input_paths: list = None,
    inputs: list = None,
    output_path: str = None,
    encode_short: bool = True,
    replaces_output_path=None,
    key_output_path: str = None,
    force: bool = False,
    verbose: bool = False
) → dict

Infer an output path for each of the input paths or inputs.

Parameters:

  • input_paths (list): list of input paths. Defaults to None.
  • inputs (list): list of inputs e.g. dictionaries. Defaults to None.
  • output_path (str): output path with a placeholder '{KEY}' to be replaced. Defaults to None.
  • encode_short (bool): short encoded string, else long encoded string (reversible) is used. Defaults to True.
  • replaces_output_path: list, dictionary or function to replace the input paths. Defaults to None.
  • key_output_path (str): key used to incorporate the output_path variable among the inputs. Defaults to None.
  • force (bool): overwrite the outputs. Defaults to False.
  • verbose (bool): verbose output. Defaults to False.

Returns: dictionary with the output path mapped to input paths or inputs.


function get_encoding

get_encoding(p)

Get encoding of a file.

Parameters:

  • p (str): file path

Returns:

  • s (string): encoding.

function get_all_subpaths

get_all_subpaths(d='.', include_directories=False)

Get all the subpaths.

Args:

  • d (str, optional): directory path. Defaults to '.'.
  • include_directories (bool, optional): to include the directories. Defaults to False.

Returns:

  • paths (list): sub-paths.

function get_env

get_env(env_name: str, return_path: bool = False)

Get the virtual environment as a dictionary.

Args:

  • env_name (str): name of the environment.

Returns:

  • d (dict): parameters of the virtual environment.

function runbash

runbash(s1, env=None, test=False, **kws)

Run a bash command.

Args:

  • s1 (str): command.
  • env (str): environment name.
  • test (bool, optional): testing. Defaults to False.

Returns:

  • output: output of the subprocess.call function.

TODOs: 1. logp 2. error ignoring


function runbash_tmp

runbash_tmp(
    s1: str,
    env: str,
    df1=None,
    inp='INPUT',
    input_type='df',
    output_type='path',
    tmp_infn='in.txt',
    tmp_outfn='out.txt',
    outp=None,
    force=False,
    test=False,
    **kws
)

Run a bash command in /tmp directory.

Args:

  • s1 (str): command.
  • env (str): environment name.
  • df1 (DataFrame, optional): input dataframe. Defaults to None.
  • inp (str, optional): input path. Defaults to 'INPUT'.
  • input_type (str, optional): input type. Defaults to 'df'.
  • output_type (str, optional): output type. Defaults to 'path'.
  • tmp_infn (str, optional): temporary input file. Defaults to 'in.txt'.
  • tmp_outfn (str, optional): temporary output file. Defaults to 'out.txt'.
  • outp (str, optional): output path. Defaults to None.
  • force (bool, optional): force. Defaults to False.
  • test (bool, optional): test. Defaults to False.

Returns:

  • output: output of the subprocess.call function.

function create_symlink

create_symlink(p: str, outp: str, test=False)

Create symbolic links.

Args:

  • p (str): input path.
  • outp (str): output path.
  • test (bool, optional): test. Defaults to False.

Returns:

  • outp (str): output path.

function input_binary

input_binary(q: str)

Get input in binary format.

Args:

  • q (str): question.

Returns:

  • b (bool): response.

function is_interactive

is_interactive()

Check if the UI is interactive e.g. jupyter or command line.


function is_interactive_notebook

is_interactive_notebook()

Check if the UI is an interactive notebook, e.g. jupyter.


function get_excecution_location

get_excecution_location(depth=1)

Get the location of the function being executed.

Args:

  • depth (int, optional): Depth of the location. Defaults to 1.

Returns:

  • tuple (tuple): filename and line number.

function get_datetime

get_datetime(outstr=True)

Get the date and time.

Args:

  • outstr (bool, optional): string output. Defaults to True.

Returns:

  • s: date and time.

function p2time

p2time(filename: str, time_type='m')

Get the creation/modification dates of files.

Args:

  • filename (str): filename.
  • time_type (str, optional): type of time, e.g. 'm' for modification. Defaults to 'm'.

Returns:

  • time (str): time.

function ps2time

ps2time(ps: list, **kws_p2time)

Get the times for a list of files.

Args:

  • ps (list): list of paths.

Returns:

  • ds (Series): paths mapped to corresponding times.

function get_logger

get_logger(program='program', argv=None, level=None, dp=None)

Get the logging object.

Args:

  • program (str, optional): name of the program. Defaults to 'program'.
  • argv (list, optional): arguments. Defaults to None.
  • level (int, optional): level of logging. Defaults to None.
  • dp (str, optional): directory path. Defaults to None.

module roux.lib.text

For processing text files.


function get_header

get_header(path: str, comment='#', lineno=None)

Get the header of a file.

Args:

  • path (str): path.
  • comment (str): comment identifier.
  • lineno (int): read up to this line number.

Returns:

  • lines (list): header.
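
Reading leading comment lines can be sketched as a simple scan that stops at the first non-comment line; an illustrative approximation of the lineno=None case:

```python
def get_header(path, comment='#'):
    # collect leading lines that start with the comment character
    lines = []
    with open(path) as f:
        for line in f:
            if not line.startswith(comment):
                break
            lines.append(line.rstrip('\n'))
    return lines
```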

function cat

cat(ps, outp)

Concatenate text files.

Args:

  • ps (list): list of paths.
  • outp (str): output path.

Returns:

  • outp (str): output path.

module roux.query.biomart

For querying BioMart database.

Global Variables

  • release2prefix

function get_ensembl_dataset_name

get_ensembl_dataset_name(x: str) → str

Get the name of the Ensembl dataset.

Args:

  • x (str): species name.

Returns:

  • str: output.

function query

query(
    species: str,
    release: int,
    attributes: list = None,
    filters: list = None,
    databasep: str = 'external/biomart/',
    dataset_name: str = None,
    force: bool = False,
    **kws_query
) → DataFrame

Query the biomart database.

Args:

  • species (str): species name.
  • release (int): Ensembl release.
  • attributes (list, optional): list of attributes. Defaults to None.
  • filters (list, optional): list of filters. Defaults to None.
  • databasep (str, optional): path to the local database folder. Defaults to 'external/biomart/'.
  • dataset_name (str, optional): dataset name. Defaults to None.
  • force (bool, optional): overwrite output. Defaults to False.

Returns:

  • pd.DataFrame: output

Examples:

  1. Setting filters for the human data, e.g. to remove mitochondrial and non-protein-coding genes: filters={'biotype': ['protein_coding']}.

module roux.query.ensembl

For querying Ensembl databases.

Global Variables

  • release2prefix

function to_gene_name

to_gene_name(k: str, ensembl: object) → str

Gene id to gene name.

Args:

  • k (str): gene id.
  • ensembl (object): ensembl object.

Returns:

  • str: gene name.

Notes:

  1. The ensembl object: from pyensembl import EnsemblRelease; ensembl = EnsemblRelease(release=100).

function to_protein_id

to_protein_id(k: str, ensembl: object) → str

Transcript id to protein id.

Args:

  • k (str): transcript id.
  • ensembl (object): ensembl object.

Returns:

  • str: protein id.

Notes:

  1. The ensembl object: from pyensembl import EnsemblRelease; ensembl = EnsemblRelease(release=100).

function to_gene_id

to_gene_id(k: str, ensembl: object) → str

Transcript id to gene id.

Args:

  • k (str): transcript id.
  • ensembl (object): ensembl object.

Returns:

  • str: gene id.

Notes:

  1. The ensembl object: from pyensembl import EnsemblRelease; ensembl = EnsemblRelease(release=100).

function to_transcript_id

to_transcript_id(k: str, ensembl: object) → str

Protein id to transcript id.

Args:

  • k (str): protein id.
  • ensembl (object): ensembl object.

Returns:

  • str: transcript id.

Notes:

  1. The ensembl object: from pyensembl import EnsemblRelease; ensembl = EnsemblRelease(release=100).

function to_dnaseq

to_dnaseq(k: str, ensembl: object) → str

Gene id to DNA sequence.

Args:

  • k (str): gene id.
  • ensembl (object): ensembl object.

Returns:

  • str: DNA sequence.

Notes:

  1. The ensembl object: from pyensembl import EnsemblRelease; ensembl = EnsemblRelease(release=100).

function to_protein_id_longest

to_protein_id_longest(k: str, ensembl: object) → str

Gene id to protein id of the longest protein.

Args:

  • k (str): gene id.
  • ensembl (object): ensembl object.

Returns:

  • str: protein id.

Notes:

  1. The ensembl object: from pyensembl import EnsemblRelease; ensembl = EnsemblRelease(release=100).

function to_protein_seq

to_protein_seq(k: str, ensembl: object, transcript: bool = False) → str

Protein/transcript id to protein sequence.

Args:

  • k (str): protein id.
  • ensembl (object): ensembl object.

Returns:

  • str: protein sequence.

Notes:

  1. The ensembl object: from pyensembl import EnsemblRelease; ensembl = EnsemblRelease(release=100).

function to_cdsseq

to_cdsseq(k: str, ensembl: object) → str

Transcript id to coding sequence (CDS).

Args:

  • k (str): transcript id.
  • ensembl (object): ensembl object.

Returns:

  • str: CDS sequence.

Notes:

  1. The ensembl object: from pyensembl import EnsemblRelease; ensembl = EnsemblRelease(release=100).

function get_utr_sequence

get_utr_sequence(k: str, ensembl: object, loc: str = 'five') → str

Protein id to UTR sequence.

Args:

  • k (str): transcript id.
  • ensembl (object): ensembl object.
  • loc (str): location of the UTR.

Returns:

  • str: UTR sequence.

Notes:

  1. The ensembl object: from pyensembl import EnsemblRelease; ensembl = EnsemblRelease(release=100).

function is_protein_coding

is_protein_coding(k: str, ensembl: object, geneid: bool = True) → bool

A gene or protein is protein coding or not.

Args:

  • k (str): protein/gene id.
  • ensembl (object): ensembl object.
  • geneid (bool): if gene id is provided.

Returns:

  • bool: is protein coding.

Notes:

  1. The ensembl object: from pyensembl import EnsemblRelease; ensembl = EnsemblRelease(release=100).

function rest

rest(
    ids: list,
    function: str = 'lookup',
    target_taxon: str = '9606',
    release: str = '100',
    format_: str = 'full',
    test: bool = False,
    **kws
)

Query Ensembl database using REST API.

Args:

  • ids (list): ids.
  • function (str, optional): query function. Defaults to 'lookup'.
  • target_taxon (str, optional): taxonomic id of the species. Defaults to '9606'.
  • release (str, optional): ensembl release. Defaults to '100'.
  • format_ (str, optional): format of the output. Defaults to 'full'.
  • test (bool, optional): test mode. Defaults to False.

Keyword Args:

  • kws: additional queries.

Raises:

  • ValueError: ids should be str or list.

Returns:

  • dict: output.

function to_homology

to_homology(
    x: str,
    release: int = 100,
    homologytype: str = 'orthologues',
    outd: str = 'data/database',
    force: bool = False
) → dict

Query homology of a gene using Ensembl REST API.

Args:

  • x (str): gene id.
  • release (int, optional): Ensembl release number. Defaults to 100.
  • homologytype (str, optional): type of the homology. Defaults to 'orthologues'.
  • outd (str, optional): path of the output folder. Defaults to 'data/database'.
  • force (bool, optional): overwrite output. Defaults to False.

Returns:

  • dict: output.

function to_domains

to_domains(
    x: str,
    release: int,
    species: str = 'homo_sapiens',
    outd: str = 'data/database',
    force: bool = False
) → DataFrame

Protein id to domains.

Args:

  • x (str): protein id.
  • release (int): Ensembl release.
  • species (str, optional): species name. Defaults to 'homo_sapiens'.
  • outd (str, optional): path of the output directory. Defaults to 'data/database'.
  • force (bool, optional): overwrite output. Defaults to False.

Returns:

  • pd.DataFrame: output.

function to_species_name

to_species_name(k: str) → str

Convert to species name.

Args:

  • k (str): taxonomic id.

Returns:

  • str: species name.

function to_taxid

to_taxid(k: str) → str

Convert to taxonomic ids.

Args:

  • k (str): species name.

Returns:

  • str: taxonomic id.

function convert_coords_human_assemblies

convert_coords_human_assemblies(
    release: int,
    chrom: str,
    start: int,
    end: int,
    frm: int = 38,
    to: int = 37,
    test: bool = False,
    force: bool = False
) → dict

Convert coordinates between human assemblies.

Args:

  • release (int): Ensembl release.
  • chrom (str): chromosome name.
  • start (int): start position.
  • end (int): end position.
  • frm (int, optional): assembly to convert from. Defaults to 38.
  • to (int, optional): assembly to convert to. Defaults to 37.
  • test (bool, optional): test mode. Defaults to False.
  • force (bool, optional): overwrite outputs. Defaults to False.

Returns:

  • dict: output.

function map_id

map_id(
    df1: DataFrame,
    gene_id: str,
    release: str,
    release_to: str,
    out: str = 'df',
    test: bool = False
) → DataFrame

Map ids between releases.

Args:

  • df1 (pd.DataFrame): input dataframe.
  • gene_id (str): gene id.
  • release (str): release to convert from.
  • release_to (str): release to convert to.
  • out (str, optional): output type. Defaults to 'df'.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • pd.DataFrame: output.

function read_idmapper_output

read_idmapper_output(outp: str) → DataFrame

Read the output of Ensembl's idmapper.

Args:

  • outp (str): path to the file.

Returns:

  • pd.DataFrame: output.

function map_ids_

map_ids_(ids: list, df00: DataFrame, release: int, release_to: int) → DataFrame

Function for mapping many ids.

Args:

  • ids (list): list of ids.
  • df00 (pd.DataFrame): source dataframe.
  • release (str): release to convert from.
  • release_to (str): release to convert to.

Returns:

  • pd.DataFrame: output.

function map_ids

map_ids(
    srcp: str,
    dbp: str,
    ids: list,
    release: int = 75,
    release_to: int = 100,
    species: str = 'human',
    test: bool = False
) → DataFrame

Map many ids between Ensembl releases.

Args:

  • srcp (str): path to the IDmapper.pl file.
  • dbp (str): path to the database.
  • ids (list): list of ids.
  • release (str): release to convert from.
  • release_to (str): release to convert to.
  • species (str, optional): species name. Defaults to 'human'.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • pd.DataFrame: output.

Examples: srcp='deps/ensembl-tools/scripts/id_history_converter/IDmapper.pl', dbp='data/database/ensembl_id_history_converter/db.pqt', ids=ensembl.gene_ids(),

module roux.query

module roux.run

For access to a few functions from the terminal.

module roux.stat.binary

For processing binary data.


function compare_bools_jaccard

compare_bools_jaccard(x, y)

Compare bools in terms of the jaccard index.

Args:

  • x (list): list of bools.
  • y (list): list of bools.

Returns:

  • float: jaccard index.
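
The Jaccard index over two boolean vectors is the size of their elementwise AND divided by the size of their elementwise OR; a minimal pure-Python sketch of the presumed computation (the roux implementation may be vectorised):

```python
def compare_bools_jaccard(x, y):
    # |x AND y| / |x OR y| over paired boolean vectors
    inter = sum(a and b for a, b in zip(x, y))
    union_ = sum(a or b for a, b in zip(x, y))
    return inter / union_ if union_ else 0.0
```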

function compare_bools_jaccard_df

compare_bools_jaccard_df(df: DataFrame) → DataFrame

Pairwise compare bools in terms of the jaccard index.

Args:

  • df (DataFrame): dataframe with boolean columns.

Returns:

  • DataFrame: matrix with comparisons between the columns.

function classify_bools

classify_bools(l: list) → str

Classify bools.

Args:

  • l (list): list of bools

Returns:

  • str: classification.

function frac

frac(x: list) → float

Fraction.

Args:

  • x (list): list of bools.

Returns:

  • float: fraction of True values.

function perc

perc(x: list) → float

Percentage.

Args:

  • x (list): list of bools.

Returns:

  • float: percentage of True values.
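
Both helpers reduce to counting True values, since True sums as 1 in Python; an illustrative sketch:

```python
def frac(x):
    # fraction of True values in a boolean list
    return sum(x) / len(x)

def perc(x):
    # percentage of True values
    return 100 * frac(x)
```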

function get_stats_confusion_matrix

get_stats_confusion_matrix(df_: DataFrame) → DataFrame

Get stats confusion matrix.

Args:

  • df_ (DataFrame): Confusion matrix.

Returns:

  • DataFrame: stats.

function get_cutoff

get_cutoff(
    y_true,
    y_score,
    method,
    show_diagonal=True,
    show_area=True,
    show_cutoff=True,
    color='k',
    returns=['ax'],
    ax=None
)

Obtain threshold based on ROC or PR curve.

Returns:

  • Table: columns: values; method: ROC, PR; variable: threshold (index), TPR, FPR, TP counts, precision, recall values.
  • Plots: AUC ROC; TPR vs TP counts; PR; specificity vs TP counts.
  • Dictionary: thresholds from AUC, PR.

module roux.stat.classify

For classification.


function drop_low_complexity

drop_low_complexity(
    df1: DataFrame,
    min_nunique: int,
    max_inflation: int,
    max_nunique: int = None,
    cols: list = None,
    cols_keep: list = [],
    test: bool = False,
    verbose: bool = False
) → DataFrame

Remove low-complexity columns from the data.

Args:

  • df1 (pd.DataFrame): input data.
  • min_nunique (int): minimum unique values.
  • max_inflation (int): maximum over-representation of the values.
  • cols (list, optional): columns. Defaults to None.
  • cols_keep (list, optional): columns to keep. Defaults to [].
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • pd.DataFrame: output data.

function get_Xy_for_classification

get_Xy_for_classification(
    df1: DataFrame,
    coly: str,
    qcut: float = None,
    drop_xs_low_complexity: bool = False,
    min_nunique: int = 5,
    max_inflation: float = 0.5,
    **kws
) → dict

Get X matrix and y vector.

Args:

  • df1 (pd.DataFrame): input data, should be indexed.
  • coly (str): column with y values, bool if qcut is None else float/int
  • qcut (float, optional): quantile cut-off. Defaults to None.
  • drop_xs_low_complexity (bool, optional): to drop columns with <5 unique values. Defaults to False.
  • min_nunique (int, optional): minimum unique values in the column. Defaults to 5.
  • max_inflation (float, optional): maximum inflation. Defaults to 0.5.

Keyword arguments:

  • kws: parameters provided to drop_low_complexity.

Returns:

  • dict: output.

function get_cvsplits

get_cvsplits(
    X: np.array,
    y: np.array,
    cv: int = 5,
    random_state: int = None,
    outtest: bool = True
) → dict

Get cross-validation splits.

Args:

  • X (np.array): X matrix.
  • y (np.array): y vector.
  • cv (int, optional): cross validations. Defaults to 5.
  • random_state (int, optional): random state. Defaults to None.
  • outtest (bool, optional): output testing. Defaults to True.

Returns:

  • dict: output.

function get_grid_search

get_grid_search(
    modeln: str,
    X: np.array,
    y: np.array,
    param_grid: dict = {},
    cv: int = 5,
    n_jobs: int = 6,
    random_state: int = None,
    scoring: str = 'balanced_accuracy',
    **kws
) → object

Grid search.

Args:

  • modeln (str): name of the model.
  • X (np.array): X matrix.
  • y (np.array): y vector.
  • param_grid (dict, optional): parameter grid. Defaults to {}.
  • cv (int, optional): cross-validations. Defaults to 5.
  • n_jobs (int, optional): number of cores. Defaults to 6.
  • random_state (int, optional): random state. Defaults to None.
  • scoring (str, optional): scoring system. Defaults to 'balanced_accuracy'.

Keyword arguments:

  • kws: parameters provided to the GridSearchCV function.

Returns:

  • object: grid_search.

References:

  • 1. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
  • 2. https://scikit-learn.org/stable/modules/model_evaluation.html

function get_estimatorn2grid_search

get_estimatorn2grid_search(
    estimatorn2param_grid: dict,
    X: DataFrame,
    y: Series,
    **kws
) → dict

Estimator-wise grid search.

Args:

  • estimatorn2param_grid (dict): estimator name to the grid search map.
  • X (pd.DataFrame): X matrix.
  • y (pd.Series): y vector.

Returns:

  • dict: output.

function get_test_scores

get_test_scores(d1: dict) → DataFrame

Test scores.

Args:

  • d1 (dict): dictionary with objects.

Returns:

  • pd.DataFrame: output.

TODOs: Get best param index.


function plot_metrics

plot_metrics(outd: str, plot: bool = False) -> DataFrame

Plot performance metrics.

Args:

  • outd (str): output directory.
  • plot (bool, optional): make plots. Defaults to False.

Returns:

  • pd.DataFrame: output data.

function get_probability

get_probability(
    estimatorn2grid_search: dict,
    X: np.array,
    y: np.array,
    colindex: str,
    coff: float = 0.5,
    test: bool = False
)

Classification probability.

Args:

  • estimatorn2grid_search (dict): estimator to the grid search map.
  • X (np.array): X matrix.
  • y (np.array): y vector.
  • colindex (str): index column.
  • coff (float, optional): cut-off. Defaults to 0.5.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • pd.DataFrame: output.

function run_grid_search

run_grid_search(
    df: DataFrame,
    colindex: str,
    coly: str,
    n_estimators: int,
    qcut: float = None,
    evaluations: list = ['prediction', 'feature importances', 'partial dependence'],
    estimatorn2param_grid: dict = None,
    drop_xs_low_complexity: bool = False,
    min_nunique: int = 5,
    max_inflation: float = 0.5,
    cols_keep: list = [],
    outp: str = None,
    test: bool = False,
    **kws
) -> dict

Run grid search.

Args:

  • df (pd.DataFrame): input data.
  • colindex (str): column with the index.
  • coly (str): column with y values. Data type bool if qcut is None else float/int.
  • n_estimators (int): number of estimators.
  • qcut (float, optional): quantile cut-off. Defaults to None.
  • evaluations (list, optional): evaluations types. Defaults to ['prediction','feature importances', 'partial dependence', ].
  • estimatorn2param_grid (dict, optional): estimator to the parameter grid map. Defaults to None.
  • drop_xs_low_complexity (bool, optional): drop the low complexity columns. Defaults to False.
  • min_nunique (int, optional): minimum unique values allowed. Defaults to 5.
  • max_inflation (float, optional): maximum inflation allowed. Defaults to 0.5.
  • cols_keep (list, optional): columns to keep. Defaults to [].
  • outp (str, optional): output path. Defaults to None.
  • test (bool, optional): test mode. Defaults to False.

Keyword arguments:

  • kws: parameters provided to get_estimatorn2grid_search.

Returns:

  • dict: estimator to grid search map.

function plot_feature_predictive_power

plot_feature_predictive_power(
    df3: DataFrame,
    ax: Axes = None,
    figsize: list = [3, 3],
    **kws
) -> Axes

Plot feature-wise predictive power.

Args:

  • df3 (pd.DataFrame): input data.
  • ax (plt.Axes, optional): axes object. Defaults to None.
  • figsize (list, optional): figure size. Defaults to [3,3].

Returns:

  • plt.Axes: output.

function get_feature_predictive_power

get_feature_predictive_power(
    d0: dict,
    df01: DataFrame,
    n_splits: int = 5,
    n_repeats: int = 10,
    random_state: int = None,
    plot: bool = False,
    drop_na: bool = False,
    **kws
) -> DataFrame

Get feature-wise predictive power.

Notes:

x-values should be scale- and sign-agnostic.

Args:

  • d0 (dict): input dictionary.
  • df01 (pd.DataFrame): input data.
  • n_splits (int, optional): number of splits. Defaults to 5.
  • n_repeats (int, optional): number of repeats. Defaults to 10.
  • random_state (int, optional): random state. Defaults to None.
  • plot (bool, optional): plot. Defaults to False.
  • drop_na (bool, optional): drop missing values. Defaults to False.

Returns:

  • pd.DataFrame: output data.

function get_feature_importances

get_feature_importances(
    estimatorn2grid_search: dict,
    X: DataFrame,
    y: Series,
    scoring: str = 'roc_auc',
    n_repeats: int = 20,
    n_jobs: int = 6,
    random_state: int = None,
    plot: bool = False,
    test: bool = False,
    **kws
) -> DataFrame

Feature importances.

Args:

  • estimatorn2grid_search (dict): map between estimator name and grid search object.
  • X (pd.DataFrame): X matrix.
  • y (pd.Series): y vector.
  • scoring (str, optional): scoring type. Defaults to 'roc_auc'.
  • n_repeats (int, optional): number of repeats. Defaults to 20.
  • n_jobs (int, optional): number of cores. Defaults to 6.
  • random_state (int, optional): random state. Defaults to None.
  • plot (bool, optional): plot. Defaults to False.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • pd.DataFrame: output data.

function get_partial_dependence

get_partial_dependence(
    estimatorn2grid_search: dict,
    X: DataFrame,
    y: Series
) -> DataFrame

Partial dependence.

Args:

  • estimatorn2grid_search (dict): map between estimator name and grid search object.
  • X (pd.DataFrame): X matrix.
  • y (pd.Series): y vector.

Returns:

  • pd.DataFrame: output data.

module roux.stat.cluster

For clustering data.


function check_clusters

check_clusters(df: DataFrame)

Check clusters.

Args:

  • df (DataFrame): dataframe.

function get_clusters

get_clusters(
    X: np.array,
    n_clusters: int,
    random_state=88,
    params={},
    test=False
) -> dict

Get clusters.

Args:

  • X (np.array): vector.
  • n_clusters (int): number of clusters.
  • random_state (int, optional): random state. Defaults to 88.
  • params (dict, optional): parameters for the MiniBatchKMeans function. Defaults to {}.
  • test (bool, optional): test. Defaults to False.

Returns:

  • dict: output.


function get_n_clusters_optimum

get_n_clusters_optimum(df5: DataFrame, test=False) -> int

Get n clusters optimum.

Args:

  • df5 (DataFrame): input dataframe
  • test (bool, optional): test. Defaults to False.

Returns:

  • int: knee point.
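
A common way to locate the knee point of a score-vs-k curve is the point farthest from the chord joining the curve's endpoints. A numpy sketch under that assumption (roux may use a different criterion; `knee_point` is an illustrative name):

```python
# Knee detection: pick the point with maximum perpendicular
# distance to the line through the first and last points.
import numpy as np

def knee_point(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Line through endpoints in implicit form a*x + b*y + c = 0
    a = y[-1] - y[0]
    b = x[0] - x[-1]
    c = x[-1] * y[0] - x[0] * y[-1]
    dist = np.abs(a * x + b * y + c) / np.hypot(a, b)
    return int(x[np.argmax(dist)])

ks = [2, 3, 4, 5, 6, 7]
inertia = [100, 45, 20, 15, 12, 10]  # curve flattens after k=4
k_opt = knee_point(ks, inertia)
```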

function plot_silhouette

plot_silhouette(df: DataFrame, n_clusters_optimum=None, ax=None)

Plot silhouette

Args:

  • df (DataFrame): input dataframe.
  • n_clusters_optimum (int, optional): number of clusters. Defaults to None.
  • ax (axes, optional): axes object. Defaults to None.

Returns:

  • ax (axes): axes object.

function get_clusters_optimum

get_clusters_optimum(
    X: np.array,
    n_clusters=range(2, 11),
    params_clustering={},
    test=False
) -> dict

Get optimum clusters.

Args:

  • X (np.array): samples to cluster in indexed format.
  • n_clusters (int, optional): numbers of clusters to evaluate. Defaults to range(2,11).
  • params_clustering (dict, optional): parameters provided to get_clusters. Defaults to {}.
  • test (bool, optional): test. Defaults to False.

Returns:

  • dict: output.

function get_gmm_params

get_gmm_params(g, x, n_clusters=2, test=False)

Intersection point of the two peak Gaussian mixture Models (GMMs).

Args:

  • out (str): coff only or params for all the parameters.

function get_gmm_intersection

get_gmm_intersection(x, two_pdfs, means, weights, test=False)
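
Setting the two weighted Gaussian pdfs equal and taking logs reduces the intersection to a quadratic in x. A numpy sketch of that derivation (the helper name is illustrative, not roux's exact implementation):

```python
# Intersection(s) of two weighted Gaussian pdfs:
# w1*N(x; m1, s1) = w2*N(x; m2, s2). Taking logs gives a
# quadratic a*x^2 + b*x + c = 0 in x.
import numpy as np

def gaussian_intersections(m1, s1, m2, s2, w1=0.5, w2=0.5):
    a = 1 / (2 * s2**2) - 1 / (2 * s1**2)
    b = m1 / s1**2 - m2 / s2**2
    c = (m2**2 / (2 * s2**2) - m1**2 / (2 * s1**2)
         + np.log((w1 * s2) / (w2 * s1)))
    # Equal variances collapse the quadratic to a linear equation.
    return np.roots([a, b, c]) if a != 0 else np.array([-c / b])

xs = gaussian_intersections(0, 1, 4, 1)  # equal variances -> midpoint
```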

function cluster_1d

cluster_1d(
    ds: Series,
    n_clusters: int,
    clf_type='gmm',
    random_state=1,
    test=False,
    returns=['coff'],
    **kws_clf
) -> dict

Cluster 1D data.

Args:

  • ds (Series): series.
  • n_clusters (int): number of clusters.
  • clf_type (str, optional): type of classification. Defaults to 'gmm'.
  • random_state (int, optional): random state. Defaults to 1.
  • test (bool, optional): test. Defaults to False.
  • returns (list, optional): return format. Defaults to ['coff'].
  • ax (axes, optional): axes object. Defaults to None.

Raises:

  • ValueError: clf_type

Returns:

  • dict: output.

function get_pos_umap

get_pos_umap(df1, spread=100, test=False, k='', **kws) -> DataFrame

Get positions of the umap points.

Args:

  • df1 (DataFrame): input dataframe
  • spread (int, optional): spread extent. Defaults to 100.
  • test (bool, optional): test. Defaults to False.
  • k (str, optional): number of clusters. Defaults to ''.

Returns:

  • DataFrame: output dataframe.

module roux.stat.compare

For comparison related stats.


function get_cols_x_for_comparison

get_cols_x_for_comparison(
    df1: DataFrame,
    cols_y: list,
    cols_index: list,
    cols_drop: list = [],
    cols_dropby_patterns: list = [],
    coff_rs: float = 0.7,
    min_nunique: int = 5,
    max_inflation: int = 50,
    verbose: bool = False,
    test: bool = False
) -> dict

Identify X columns.

Parameters:

  • df1 (pd.DataFrame): input table.
  • cols_y (list): y columns.

function to_filteredby_samples

to_filteredby_samples(
    df1: DataFrame,
    colindex: str,
    colsample: str,
    coff_samples_min: int,
    colsubset: str,
    coff_subsets_min: int = 2
) -> DataFrame

Filter table before calculating differences. (1) Retain minimum number of samples per item representing a subset and (2) Retain minimum number of subsets per item.

Parameters:

  • df1 (pd.DataFrame): input table.
  • colindex (str): column containing items.
  • colsample (str): column containing samples.
  • coff_samples_min (int): minimum number of samples.
  • colsubset (str): column containing subsets.
  • coff_subsets_min (int): minimum number of subsets. Defaults to 2.

Returns: pd.DataFrame

Examples:

Parameters: colindex='genes id', colsample='sample id', coff_samples_min=3, colsubset='pLOF or WT', coff_subsets_min=2


function to_preprocessed_data

to_preprocessed_data(
    df1: DataFrame,
    columns: dict,
    fill_missing_desc_value: bool = False,
    fill_missing_cont_value: bool = False,
    normby_zscore: bool = False,
    verbose: bool = False,
    test: bool = False
) -> DataFrame

function get_comparison

get_comparison(
    df1: DataFrame,
    d1: dict = None,
    coff_p: float = 0.05,
    between_ys: bool = False,
    verbose: bool = False,
    **kws
)

Compare the x and y columns.

Parameters:

  • df1 (pd.DataFrame): input table.
  • d1 (dict): columns dict, output of get_cols_x_for_comparison.
  • between_ys (bool): also compare the y columns with each other.

Notes:

Column information:

    d1={'cols_index': ['id'],
        'cols_x': {'cont': [], 'desc': []},
        'cols_y': {'cont': [], 'desc': []}}

Comparison types:

  1. continuous vs continuous -> correlation.
  2. discrete vs continuous -> difference.
  3. discrete vs discrete -> Fisher's exact or chi-square test.


function compare_strings

compare_strings(l0: list, l1: list, cutoff: float = 0.5) -> DataFrame

Compare two lists of strings.

Parameters:

  • l0 (list): list of strings.
  • l1 (list): list of strings to compare with.
  • cutoff (float): threshold to filter the comparisons.

Returns: table with the similarity scores.

TODOs: 1. Add option for semantic similarity.
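
String similarity with a cutoff can be sketched with the standard library's difflib; this illustrates the idea, though compare_strings may use a different metric and output format:

```python
# Pairwise string similarity, keeping pairs above a cutoff
# (illustrative helper; roux returns a table instead of rows).
from difflib import SequenceMatcher

def similarity_table(l0, l1, cutoff=0.5):
    """Return (string0, string1, score) rows above the cutoff."""
    rows = []
    for a in l0:
        for b in l1:
            score = SequenceMatcher(None, a, b).ratio()
            if score >= cutoff:
                rows.append((a, b, round(score, 2)))
    return rows

rows = similarity_table(['spearman', 'pearson'], ['pearsonr', 'kendall'])
```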

module roux.stat.corr

For correlation stats.


function get_spearmanr

get_spearmanr(
    x: np.array,
    y: np.array
) -> tuple

Get Spearman correlation coefficient.

Args:

  • x (np.array): x vector.
  • y (np.array): y vector.

Returns:

  • tuple: rs, p-value
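
A usage sketch of the underlying scipy call that a Spearman helper like this presumably wraps:

```python
# Spearman rank correlation via scipy (assumed to be the
# backend of get_spearmanr; not confirmed by the source).
import numpy as np
from scipy.stats import spearmanr

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 9, 16, 25])  # monotonic apart from one swap
rho, p = spearmanr(x, y)
```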

function get_pearsonr

get_pearsonr(x: np.array, y: np.array) -> tuple

Get Pearson correlation coefficient.

Args:

  • x (np.array): x vector.
  • y (np.array): y vector.

Returns:

  • tuple: rs, p-value

function get_corr_bootstrapped

get_corr_bootstrapped(
    x: np.array,
    y: np.array,
    method='spearman',
    ci_type='max',
    cv: int = 5,
    random_state=1,
    verbose=False
) -> tuple

Get correlations after bootstrapping.

Args:

  • x (np.array): x vector.
  • y (np.array): y vector.
  • method (str, optional): method name. Defaults to 'spearman'.
  • ci_type (str, optional): confidence interval type. Defaults to 'max'.
  • cv (int, optional): number of bootstraps. Defaults to 5.
  • random_state (int, optional): random state. Defaults to 1.
  • verbose (bool, optional): verbose. Defaults to False.

Returns:

  • tuple: mean correlation coefficient, confidence interval
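
Bootstrapping a correlation amounts to resampling (x, y) pairs with replacement and summarizing the distribution of coefficients. A numpy sketch (illustrative: roux's version also supports Spearman and several CI types):

```python
# Bootstrap a Pearson correlation: mean coefficient plus a
# 95% percentile confidence interval (illustrative helper).
import numpy as np

def bootstrap_corr(x, y, n_boot=200, random_state=1):
    rng = np.random.default_rng(random_state)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        i = rng.integers(0, n, n)               # resample pairs
        rs.append(np.corrcoef(x[i], y[i])[0, 1])
    rs = np.array(rs)
    ci = np.percentile(rs, [2.5, 97.5])         # 95% interval
    return rs.mean(), ci

x = np.linspace(0, 1, 50)
y = 2 * x + np.random.default_rng(0).normal(0, 0.1, 50)
r_mean, (lo, hi) = bootstrap_corr(x, y)
```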

function corr_to_str

corr_to_str(
    method: str,
    r: float,
    p: float,
    fmt='<',
    n=True,
    ci=None,
    ci_type=None,
    magnitide=True
) -> str

Correlation to string.

Args:

  • method (str): method name.
  • r (float): correlation coefficient.
  • p (float): p-value
  • fmt (str, optional): format of the p-value. Defaults to '<'.
  • n (bool, optional): sample size. Defaults to True.
  • ci (type, optional): confidence interval. Defaults to None.
  • ci_type (type, optional): confidence interval type. Defaults to None.
  • magnitide (bool, optional): show magnitude of the sample size. Defaults to True.

Returns:

  • str: string with the correlation stats.

function get_corr

get_corr(
    x: np.array,
    y: np.array,
    method='spearman',
    bootstrapped=False,
    ci_type='max',
    magnitide=True,
    outstr=False,
    **kws
)

Correlation between vectors (wrapper).

Args:

  • x (np.array): x.
  • y (np.array): y.
  • method (str, optional): method name. Defaults to 'spearman'.
  • bootstrapped (bool, optional): bootstrapping. Defaults to False.
  • ci_type (str, optional): confidence interval type. Defaults to 'max'.
  • magnitide (bool, optional): show magnitude. Defaults to True.
  • outstr (bool, optional): output as string. Defaults to False.

Keyword arguments:

  • kws: parameters provided to get_corr_bootstrapped function.

function get_corrs

get_corrs(
    df1: DataFrame,
    method: str,
    cols: list,
    cols_with: list = [],
    coff_inflation_min: float = None,
    test: bool = False,
    **kws
)

Correlate the columns of a dataframe.

Args:

  • df1 (DataFrame): input dataframe.
  • method (str): method of correlation: 'spearman' or 'pearson'.
  • cols (list): columns.
  • cols_with (list, optional): columns to correlate with, i.e. variable2.

Keyword arguments:

  • kws: parameters provided to get_corr function.

Returns:

  • DataFrame: output dataframe.

TODOs:

  0. Use lib.set.get_pairs to get the combinations.
  1. Provide a 2D array to scipy.stats.spearmanr?
  2. Add parallel processing through the fast parameter.


function get_partial_corrs

get_partial_corrs(
    df: DataFrame,
    xs: list,
    ys: list,
    method='spearman',
    splits=5
) -> DataFrame

Get partial correlations.

Args:

  • df (DataFrame): input dataframe.
  • xs (list): columns used as x variables.
  • ys (list): columns used as y variables.
  • method (str, optional): method name. Defaults to 'spearman'.
  • splits (int, optional): number of splits. Defaults to 5.

Returns:

  • DataFrame: output dataframe.

function check_collinearity

check_collinearity(
    df1: DataFrame,
    threshold: float = 0.7,
    colvalue: str = '$r_s$',
    cols_variable: list = ['variable1', 'variable2'],
    coff_pval: float = 0.05,
    method: str = 'spearman',
    coff_inflation_min: int = 50
) -> Series

Check collinearity.

Args:

  • df1 (DataFrame): input dataframe.
  • threshold (float): minimum threshold for the collinearity.

Returns:

  • DataFrame: output dataframe.

TODOs: 1. Calculate variance inflation factor (VIF).


function pairwise_chi2

pairwise_chi2(df1: DataFrame, cols_values: list) -> DataFrame

Pairwise chi2 test.

Args:

  • df1 (DataFrame): pd.DataFrame
  • cols_values (list): list of columns.

Returns:

  • DataFrame: output dataframe.

TODOs: 0. Use lib.set.get_pairs to get the combinations.

module roux.stat.diff

For difference related stats.


function get_demo_data

get_demo_data() -> DataFrame

Demo data to test the differences.


function compare_classes

compare_classes(x, y, method=None)

function compare_classes_many

compare_classes_many(df1: DataFrame, cols_y: list, cols_x: list) -> DataFrame

function get_pval

get_pval(
    df: DataFrame,
    colvalue='value',
    colsubset='subset',
    colvalue_bool=False,
    colindex=None,
    subsets=None,
    test=False,
    fun=None
) -> tuple

Get p-value.

Args:

  • df (DataFrame): input dataframe.
  • colvalue (str, optional): column with values. Defaults to 'value'.
  • colsubset (str, optional): column with subsets. Defaults to 'subset'.
  • colvalue_bool (bool, optional): column with boolean values. Defaults to False.
  • colindex (str, optional): column with the index. Defaults to None.
  • subsets (list, optional): subset types. Defaults to None.
  • test (bool, optional): test. Defaults to False.
  • fun (function, optional): function. Defaults to None.

Raises:

  • ArgumentError: colvalue or colsubset not found in df.
  • ValueError: need only 2 subsets.

Returns:

  • tuple: stat,p-value
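
With exactly two subsets, a typical non-parametric choice is the Mann-Whitney U test (the "MWU test" referenced elsewhere in this module); a sketch of the scipy call such a function can delegate to:

```python
# Two-subset comparison with the Mann-Whitney U test
# (assumed backend; get_pval also accepts a custom `fun`).
from scipy.stats import mannwhitneyu

control = [1.1, 1.3, 1.2, 1.4, 1.2]
treated = [2.1, 2.4, 2.2, 2.6, 2.3]
# Complete separation of the groups -> U statistic of 0 for `control`
stat, pval = mannwhitneyu(control, treated, alternative='two-sided')
```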

function get_stat

get_stat(
    df1: DataFrame,
    colsubset: str,
    colvalue: str,
    colindex: str,
    subsets=None,
    cols_subsets=['subset1', 'subset2'],
    df2=None,
    stats=[np.mean, np.median, np.var, len],
    coff_samples_min=None,
    verb=False,
    **kws
) -> DataFrame

Get statistics.

Args:

  • df1 (DataFrame): input dataframe.
  • colvalue (str, optional): column with values. Defaults to 'value'.
  • colsubset (str, optional): column with subsets. Defaults to 'subset'.
  • colindex (str, optional): column with the index. Defaults to None.
  • subsets (list, optional): subset types. Defaults to None.
  • cols_subsets (list, optional): columns with subsets. Defaults to ['subset1', 'subset2'].
  • df2 (DataFrame, optional): second dataframe. Defaults to None.
  • stats (list, optional): summary statistics. Defaults to [np.mean,np.median,np.var]+[len].
  • coff_samples_min (int, optional): minimum sample size required. Defaults to None.
  • verb (bool, optional): verbose. Defaults to False.

Keyword Arguments:

  • kws: parameters provided to get_pval function.

Raises:

  • ArgumentError: colvalue or colsubset not found in df.
  • ValueError: len(subsets)<2

Returns:

  • DataFrame: output dataframe.

TODOs: 1. Rename to more specific get_diff, also other get_stat*/get_pval* functions.


function get_stats

get_stats(
    df1: DataFrame,
    colsubset: str,
    cols_value: list,
    colindex: str,
    subsets=None,
    df2=None,
    cols_subsets=['subset1', 'subset2'],
    stats=[np.mean, np.median, np.var, len],
    axis=0,
    test=False,
    **kws
) -> DataFrame

Get statistics by iterating over columns with values.

Args:

  • df1 (DataFrame): input dataframe.
  • colsubset (str, optional): column with subsets.
  • cols_value (list): list of columns with values.
  • colindex (str, optional): column with the index.
  • subsets (list, optional): subset types. Defaults to None.
  • df2 (DataFrame, optional): second dataframe, e.g. pd.DataFrame({"subset1":['test'],"subset2":['reference']}). Defaults to None.
  • cols_subsets (list, optional): columns with subsets. Defaults to ['subset1', 'subset2'].
  • stats (list, optional): summary statistics. Defaults to [np.mean,np.median,np.var]+[len].
  • axis (int, optional): 1 if different tests else use 0. Defaults to 0.

Keyword Arguments:

  • kws: parameters provided to get_pval function.

Raises:

  • ArgumentError: colvalue or colsubset not found in df.
  • ValueError: len(subsets)<2

Returns:

  • DataFrame: output dataframe.

TODOs: 1. No column prefix if len(cols_value)==1.


function get_significant_changes

get_significant_changes(
    df1: DataFrame,
    coff_p=0.025,
    coff_q=0.1,
    alpha=None,
    changeby='mean',
    value_aggs=['mean', 'median']
) -> DataFrame

Get significant changes.

Args:

  • df1 (DataFrame): input dataframe.
  • coff_p (float, optional): cutoff on p-value. Defaults to 0.025.
  • coff_q (float, optional): cutoff on q-value. Defaults to 0.1.
  • alpha (float, optional): alias for coff_p. Defaults to None.
  • changeby (str, optional): "" to check for change by both mean and median. Defaults to "mean".
  • value_aggs (list, optional): values to aggregate. Defaults to ['mean','median'].

Returns:

  • DataFrame: output dataframe.

function apply_get_significant_changes

apply_get_significant_changes(
    df1: DataFrame,
    cols_value: list,
    cols_groupby: list,
    cols_grouped: list,
    fast=False,
    **kws
) -> DataFrame

Apply on dataframe to get significant changes.

Args:

  • df1 (DataFrame): input dataframe.
  • cols_value (list): columns with values.
  • cols_groupby (list): columns with groups.

Returns:

  • DataFrame: output dataframe.

function get_stats_groupby

get_stats_groupby(
    df1: DataFrame,
    cols_group: list,
    coff_p: float = 0.05,
    coff_q: float = 0.1,
    alpha=None,
    fast=False,
    **kws
) -> DataFrame

Iterate over groups to get the differences.

Args:

  • df1 (DataFrame): input dataframe.
  • cols_group (list): columns to iterate over.
  • coff_p (float, optional): cutoff on p-value. Defaults to 0.05.
  • coff_q (float, optional): cutoff on q-value. Defaults to 0.1.
  • alpha (float, optional): alias for coff_p. Defaults to None.
  • fast (bool, optional): parallel processing. Defaults to False.

Returns:

  • DataFrame: output dataframe.

function get_diff

get_diff(
    df1: DataFrame,
    cols_x: list,
    cols_y: list,
    cols_index: list,
    cols_group: list,
    coff_p: float = None,
    test: bool = False,
    **kws
) -> DataFrame

Wrapper around get_stats_groupby.

Keyword parameters: cols=['variable x','variable y'], coff_p=0.05, coff_q=0.01, colindex=['id'],


function binby_pvalue_coffs

binby_pvalue_coffs(
    df1: DataFrame,
    coffs=[0.01, 0.05, 0.1],
    color=False,
    testn='MWU test, FDR corrected',
    colindex='genes id',
    colgroup='tissue',
    preffix='',
    colns=None,
    palette=None
) -> tuple

Bin data by pvalue cutoffs.

Args:

  • df1 (DataFrame): input dataframe.
  • coffs (list, optional): cut-offs. Defaults to [0.01, 0.05, 0.1].
  • color (bool, optional): color assignment. Defaults to False.
  • testn (str, optional): test name. Defaults to 'MWU test, FDR corrected'.
  • colindex (str, optional): column with index. Defaults to 'genes id'.
  • colgroup (str, optional): column with the groups. Defaults to 'tissue'.
  • preffix (str, optional): prefix. Defaults to ''.
  • colns (type, optional): number of columns. Defaults to None.
  • palette (type, optional): palette. Defaults to None.

Returns:

  • tuple: output.

Notes:

  1. To be deprecated in the favor of the functions used for enrichment analysis for example.

module roux.stat.enrich

For enrichment related stats.


function get_enrichment

get_enrichment(
    df1: DataFrame,
    df2: DataFrame,
    background: int,
    colid: str = 'gene id',
    colref: str = 'gene set id',
    colrefname: str = 'gene set name',
    colreftype: str = 'gene set type',
    colrank: str = 'rank',
    outd: str = None,
    name: str = None,
    cutoff: float = 0.05,
    permutation_num: int = 1000,
    verbose: bool = False,
    no_plot: bool = True,
    **kws_prerank
)

Get enrichments between sets.

Args:

  • df1 (pd.DataFrame): test data.
  • df2 (pd.DataFrame): reference set data.
  • background (int): background size.
  • colid (str, optional): column containing unique ids of the elements. Defaults to 'gene id'.
  • colref (str, optional): column containing the unique ids of the sets. Defaults to 'gene set id'.
  • colrefname (str, optional): column containing names of the sets. Defaults to 'gene set name'.
  • colreftype (str, optional): column containing the type/group name of the sets. Defaults to 'gene set type'.
  • colrank (str, optional): column containing the ranks. Defaults to 'rank'.
  • outd (str, optional): output directory path. Defaults to None.
  • name (str, optional): name of the result. Defaults to None.
  • cutoff (float, optional): p-value cutoff. Defaults to 0.05.
  • verbose (bool, optional): verbose. Defaults to False.
  • no_plot (bool, optional): do no plot. Defaults to True.

Returns:

  • pd.DataFrame: if ranks are provided, the highest ranks come first within the leading-edge gene ids.

Notes:

  1. Unique ids are provided as inputs.

function get_enrichments

get_enrichments(
    df1: DataFrame,
    df2: DataFrame,
    background: int,
    coltest: str = 'subset',
    colid: str = 'gene id',
    colref: str = 'gene set id',
    colreftype: str = 'gene set type',
    fast: bool = False,
    **kws
) -> DataFrame

Get enrichments between sets, iterate over types/groups of test elements e.g. upregulated and downregulated genes.

Args:

  • df1 (pd.DataFrame): test data.
  • df2 (pd.DataFrame): reference set data.
  • background (int): background size.
  • colid (str, optional): column containing unique ids of the elements. Defaults to 'gene id'.
  • colref (str, optional): column containing the unique ids of the sets. Defaults to 'gene set id'.
  • colrefname (str, optional): column containing names of the sets. Defaults to 'gene set name'.
  • colreftype (str, optional): column containing the type/group name of the sets. Defaults to 'gene set type'.
  • fast (bool, optional): parallel processing. Defaults to False.

Returns:

  • pd.DataFrame: output.

module roux.stat.fit

For fitting data.


function fit_curve_fit

fit_curve_fit(
    func,
    xdata: np.array = None,
    ydata: np.array = None,
    bounds: tuple = (-np.inf, np.inf),
    test=False,
    plot=False
) -> tuple

Wrapper around scipy's curve_fit.

Args:

  • func (function): fitting function.
  • xdata (np.array, optional): x data. Defaults to None.
  • ydata (np.array, optional): y data. Defaults to None.
  • bounds (tuple, optional): bounds. Defaults to (-np.inf, np.inf).
  • test (bool, optional): test. Defaults to False.
  • plot (bool, optional): plot. Defaults to False.

Returns:

  • tuple: output.

function fit_gauss_bimodal

fit_gauss_bimodal(
    data: np.array,
    bins: int = 50,
    expected: tuple = (1, 0.2, 250, 2, 0.2, 125),
    test=False
) -> tuple

Fit a bimodal Gaussian distribution to the data in vector format.

Args:

  • data (np.array): vector.
  • bins (int, optional): bins. Defaults to 50.
  • expected (tuple, optional): expected parameters. Defaults to (1,.2,250,2,.2,125).
  • test (bool, optional): test. Defaults to False.

Returns:

  • tuple: output.

Notes:

Observed better performance with roux.stat.cluster.cluster_1d.


function get_grid

get_grid(
    x: np.array,
    y: np.array,
    z: np.array = None,
    off: int = 0,
    grids: int = 100,
    method='linear',
    test=False,
    **kws
) -> tuple

2D grids from 1D data.

Args:

  • x (np.array): vector.
  • y (np.array): vector.
  • z (np.array, optional): vector. Defaults to None.
  • off (int, optional): offsets. Defaults to 0.
  • grids (int, optional): grids. Defaults to 100.
  • method (str, optional): method. Defaults to 'linear'.
  • test (bool, optional): test. Defaults to False.

Returns:

  • tuple: output.

function fit_gaussian2d

fit_gaussian2d(
    x: np.array,
    y: np.array,
    z: np.array,
    grid=True,
    grids=20,
    method='linear',
    off=0,
    rescalez=True,
    test=False
) -> tuple

Fit a 2D Gaussian.

Args:

  • x (np.array): vector.
  • y (np.array): vector.
  • z (np.array): vector.
  • grid (bool, optional): grid. Defaults to True.
  • grids (int, optional): grids. Defaults to 20.
  • method (str, optional): method. Defaults to 'linear'.
  • off (int, optional): offsets. Defaults to 0.
  • rescalez (bool, optional): rescale the z values. Defaults to True.
  • test (bool, optional): test. Defaults to False.

Returns:

  • tuple: output.

function fit_2d_distribution_kde

fit_2d_distribution_kde(
    x: np.array,
    y: np.array,
    bandwidth: float,
    xmin: float = None,
    xmax: float = None,
    xbins=100j,
    ymin: float = None,
    ymax: float = None,
    ybins=100j,
    test=False,
    **kwargs
) -> tuple

2D kernel density estimate (KDE).

Notes:

Cut off outliers:

    quantile_coff=0.01
    params_grid=merge_dicts([
        df01.loc[:,var2col.values()].quantile(quantile_coff).rename(index=flip_dict({f"{k}min":var2col[k] for k in var2col})).to_dict(),
        df01.loc[:,var2col.values()].quantile(1-quantile_coff).rename(index=flip_dict({f"{k}max":var2col[k] for k in var2col})).to_dict(),
    ])

Args:

  • x (np.array): vector.
  • y (np.array): vector.
  • bandwidth (float): bandwidth
  • xmin (float, optional): x minimum. Defaults to None.
  • xmax (float, optional): x maximum. Defaults to None.
  • xbins (type, optional): x bins. Defaults to 100j.
  • ymin (float, optional): y minimum. Defaults to None.
  • ymax (float, optional): y maximum. Defaults to None.
  • ybins (type, optional): y bins. Defaults to 100j.
  • test (bool, optional): test. Defaults to False.

Returns:

  • tuple: output.

function check_poly_fit

check_poly_fit(d: DataFrame, xcol: str, ycol: str, degmax: int = 5) -> DataFrame

Check the fit of polynomial equations.

Args:

  • d (pd.DataFrame): input dataframe.
  • xcol (str): column containing the x values.
  • ycol (str): column containing the y values.
  • degmax (int, optional): degree maximum. Defaults to 5.

Returns:

  • pd.DataFrame: output.

function mlr_2

mlr_2(df: DataFrame, coly: str, colxs: list) -> tuple

Multiple linear regression between two variables.

Args:

  • df (pd.DataFrame): input dataframe.
  • coly (str): column containing y values.
  • colxs (list): columns containing x values.

Returns:

  • tuple: output.

function get_mlr_2_str

get_mlr_2_str(df: DataFrame, coly: str, colxs: list) -> str

Get the result of the multiple linear regression between two variables as a string.

Args:

  • df (pd.DataFrame): input dataframe.
  • coly (str): column containing y values.
  • colxs (list): columns containing x values.

Returns:

  • str: output.

module roux.stat.io

For input/output of stats.


function perc_label

perc_label(a, b=None, bracket=True)

function pval2annot

pval2annot(
    pval: float,
    alternative: str = None,
    alpha: float = 0.05,
    fmt: str = '*',
    power: bool = True,
    linebreak: bool = False,
    replace_prefix: str = None
)

P/Q-value to annotation.

Parameters:

  • fmt (str): *|<|'num'

module roux.stat

Global Variables

  • binary
  • io

module roux.stat.network

For network related stats.


function get_subgraphs

get_subgraphs(df1: DataFrame, source: str, target: str) -> DataFrame

Subgraphs from the edge list.

Args:

  • df1 (pd.DataFrame): input dataframe containing edge-list.
  • source (str): source node.
  • target (str): target node.

Returns:

  • pd.DataFrame: output.

module roux.stat.norm

For normalisation.


function norm_by_quantile

norm_by_quantile(X: np.array) -> np.array

Normalize the columns of X to each have the same distribution.

Notes:

Given an expression matrix (microarray data, read counts, etc) of M genes by N samples, quantile normalization ensures all samples have the same spread of data (by construction). The data across each row are averaged to obtain an average column. Each column quantile is replaced with the corresponding quantile of the average column.

Parameters:

  • X : 2D array of float, shape (M, N). The input data, with M rows (genes/features) and N columns (samples).

Returns:

  • Xn : 2D array of float, shape (M, N). The normalized data.
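
The procedure described in the notes can be written directly in numpy. A minimal sketch (ties are broken by order here, whereas a full implementation averages tied ranks):

```python
# Quantile normalization: replace each column's values with the
# rank-wise mean of the sorted columns ("average column").
import numpy as np

def quantile_normalize(X):
    ranks = X.argsort(axis=0).argsort(axis=0)      # per-column ranks
    mean_sorted = np.sort(X, axis=0).mean(axis=1)  # average column
    return mean_sorted[ranks]

X = np.array([[5., 4., 3.],
              [2., 1., 4.],
              [3., 4., 6.],
              [4., 2., 8.]])
Xn = quantile_normalize(X)
```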

function norm_by_gaussian_kde

norm_by_gaussian_kde(values: np.array) -> np.array

Normalise a matrix by Gaussian KDE.

Args:

  • values (np.array): input matrix.

Returns:

  • np.array: output matrix.

References:

  • https://github.com/saezlab/protein_attenuation/blob/6c1e81af37d72ef09835ee287f63b000c7c6663c/src/protein_attenuation/utils.py

function zscore

zscore(df: DataFrame, cols: list = None) -> DataFrame

Z-score.

Args:

  • df (pd.DataFrame): input table.

Returns:

  • pd.DataFrame: output table.

function zscore_robust

zscore_robust(a: np.array) -> np.array

Robust Z-score.

Args:

  • a (np.array): input data.

Returns:

  • np.array: output.

Example:

    t = sc.stats.norm.rvs(size=100, scale=1, random_state=123456)
    plt.hist(t, bins=40)
    plt.hist(apply_zscore_robust(t), bins=40)
    print(np.median(t), np.median(apply_zscore_robust(t)))

module roux.stat.paired

For paired stats.


function get_ratio_sorted

get_ratio_sorted(a: float, b: float, increase=True) → float

Get ratio sorted.

Args:

  • a (float): value #1.
  • b (float): value #2.
  • increase (bool, optional): check for increase. Defaults to True.

Returns:

  • float: output.

function diff

diff(a: float, b: float, absolute=True) → float

Get difference.

Args:

  • a (float): value #1.
  • b (float): value #2.
  • absolute (bool, optional): get absolute difference. Defaults to True.

Returns:

  • float: output.

function get_diff_sorted

get_diff_sorted(a: float, b: float) → float

Difference sorted/absolute.

Args:

  • a (float): value #1.
  • b (float): value #2.

Returns:

  • float: output.

function balance

balance(a: float, b: float, absolute=True) → float

Balance.

Args:

  • a (float): value #1.
  • b (float): value #2.
  • absolute (bool, optional): absolute difference. Defaults to True.

Returns:

  • float: output.

function get_paired_sets_stats

get_paired_sets_stats(l1: list, l2: list, test: bool = False) → list

Paired stats comparing two sets.

Args:

  • l1 (list): list #1.
  • l2 (list): list #2.
  • test (bool): test mode. Defaults to False.

Returns:

  • list: tuple (overlap, intersection, union, ratio).
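A minimal sketch of how the (overlap, intersection, union, ratio) tuple can be computed, assuming overlap is the overlap coefficient and ratio the Jaccard index (`paired_sets_stats` is a hypothetical name; roux's exact definitions may differ):

```python
def paired_sets_stats(l1, l2):
    """Overlap statistics for two collections, per the tuple above."""
    s1, s2 = set(l1), set(l2)
    intersection = len(s1 & s2)
    union = len(s1 | s2)
    overlap = intersection / min(len(s1), len(s2))  # overlap coefficient
    ratio = intersection / union                    # Jaccard index
    return overlap, intersection, union, ratio

stats = paired_sets_stats(['a', 'b', 'c'], ['b', 'c', 'd'])
```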

function get_stats_paired

get_stats_paired(
    df1: DataFrame,
    cols: list,
    input_logscale: bool,
    prefix: str = None,
    drop_cols: bool = False,
    unidirectional_stats: list = ['min', 'max'],
    fast: bool = False
) → DataFrame

Paired stats, row-wise.

Args:

  • df1 (pd.DataFrame): input data.
  • cols (list): columns.
  • input_logscale (bool): if the input data is log-scaled.
  • prefix (str, optional): prefix of the output column/s. Defaults to None.
  • drop_cols (bool, optional): drop these columns. Defaults to False.
  • unidirectional_stats (list, optional): column-wise stats. Defaults to ['min','max'].
  • fast (bool, optional): parallel processing. Defaults to False.

Returns:

  • pd.DataFrame: output dataframe.

function get_stats_paired_agg

get_stats_paired_agg(
    x: <built-in function array>,
    y: <built-in function array>,
    ignore: bool = False,
    verb: bool = True
) → Series

Paired stats aggregated, for example, to classify 2D distributions.

Args:

  • x (np.array): x vector.
  • y (np.array): y vector.
  • ignore (bool, optional): suppress warnings. Defaults to False.
  • verb (bool, optional): verbose. Defaults to True.

Returns:

  • pd.Series: output.

function classify_sharing

classify_sharing(
    df1: DataFrame,
    column_value: str,
    bins: list = [0, 25, 75, 100],
    labels: list = ['low', 'medium', 'high'],
    prefix: str = '',
    verbose: bool = False
) → DataFrame

Classify sharing % calculated from Jaccard index.

Parameters:

  • df1 (pd.DataFrame): input table.
  • column_value (str): column with values.
  • bins (list): bins. Defaults to [0,25,75,100].
  • labels (list): bin labels. Defaults to ['low','medium','high'].
  • prefix (str): prefix of the columns.
  • verbose (bool): verbose. Defaults to False.

module roux.stat.regress

For regression.


function to_columns_renamed_for_regression

to_columns_renamed_for_regression(df1: DataFrame, columns: dict) → DataFrame

[UNDER DEVELOPMENT]


function check_covariates

check_covariates(df1, covariates, colindex, plot: bool = False)

[UNDER DEVELOPMENT] Quality check covariates for redundancy.

Todos: Support continuous-value covariates using get_comparison from roux.stat.compare.


function to_input_data_for_regression

to_input_data_for_regression(
    df1: DataFrame,
    cols_y: list,
    cols_index: list,
    desc_test_values: dict,
    verbose: bool = False,
    test: bool = False,
    **kws
) → tuple

Input data for the regression.

Parameters:

  • df1 (pd.DataFrame): input data.
  • cols_y (list): y columns.
  • cols_index (list): index columns.

Returns: Output table.


function to_formulas

to_formulas(
    formula: str,
    covariates: list,
    covariate_dtypes: dict = None
) → list

[UNDER DEVELOPMENT] Generate formulas.

Notes:

covariate_dtypes=data.dtypes.to_dict()


function get_stats_regression

get_stats_regression(
    data: DataFrame,
    formulas: dict = {},
    variable: str = None,
    converged_only=False,
    out='df',
    verb=False,
    test=False,
    **kws_model
) → DataFrame

Get stats from regression models.

Args:

  • data (DataFrame): input dataframe.
  • formulas (dict, optional): base formula e.g. 'y ~ x' to model name map. Defaults to {}.
  • variable (str, optional): variable name e.g. 'C(variable)[T.True]', used to retrieve the stats for. Defaults to None.
  • converged_only (bool, optional): get the stats from the converged models only. Defaults to False.
  • out (str, optional): output format. Defaults to 'df'.
  • verb (bool, optional): verbose. Defaults to False.
  • test (bool, optional): test. Defaults to False.

Returns:

  • DataFrame: output.

function to_filteredby_variable

to_filteredby_variable(
    df1: DataFrame,
    variable: str,
    colindex: str,
    coff_q: float = 0.1,
    coff_p_covariates: float = 0.05,
    plot: bool = False,
    test: bool = False
) → DataFrame

Filter regression statistics.

Args:

  • df1 (DataFrame): input dataframe.
  • variable (str): variable name to filter by.
  • colindex (str): columns with index.
  • coff_q (float, optional): cut-off on the q-value. Defaults to 0.1.
  • by_covariates (bool, optional): filter by these covariates. Defaults to True.
  • coff_p_covariates (float, optional): cut-off on the p-value for the covariates. Defaults to 0.05.
  • test (bool, optional): test. Defaults to False.

Raises:

  • ValueError: pval.

Returns:

  • DataFrame: output.

Notes:

Filtering steps:

  1. By variable of interest.
  2. By statistical significance.
  3. By statistical significance of co-variates.


function run_lr_test

run_lr_test(
    data: DataFrame,
    formula: str,
    covariate: str,
    col_group: str,
    params_model: dict = {'reml': False}
) → tuple

Run LR test.

Args:

  • data (pd.DataFrame): input data.
  • formula (str): formula.
  • covariate (str): covariate.
  • col_group (str): column with the group.
  • params_model (dict, optional): parameters of the model. Defaults to {'reml':False}.

Returns:

  • tuple: output tuple (stat, pval, dres).
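The likelihood-ratio test itself reduces to a chi-squared comparison of the two models' log-likelihoods. A minimal sketch of just that step (roux's run_lr_test fits the models via statsmodels; `lr_test` here is a hypothetical helper, with example log-likelihoods as inputs):

```python
from scipy import stats

def lr_test(llf_full: float, llf_reduced: float, df_diff: int = 1):
    """Likelihood-ratio test comparing a full model to a nested, reduced model."""
    stat = 2.0 * (llf_full - llf_reduced)   # LR statistic
    pval = stats.chi2.sf(stat, df=df_diff)  # chi-squared tail probability
    return stat, pval

stat, pval = lr_test(llf_full=-100.0, llf_reduced=-105.0)
```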

function plot_residuals_versus_fitted

plot_residuals_versus_fitted(model: object) → Axes

Plot residuals versus fitted (RVF).

Args:

  • model (object): model.

Returns:

  • plt.Axes: output.

function plot_residuals_versus_groups

plot_residuals_versus_groups(model: object) → Axes

Plot residuals versus groups.

Args:

  • model (object): model.

Returns:

  • plt.Axes: output.

function plot_model_qcs

plot_model_qcs(model: object)

Plot Quality Checks.

Args:

  • model (object): model.

module roux.stat.set

For set related stats.


function get_intersection_stats

get_intersection_stats(df, coltest, colset, background_size=None)

function get_set_enrichment_stats

get_set_enrichment_stats(test, sets, background, fdr_correct=True)

test:

    get_set_enrichment_stats(
        background=range(120),
        test=range(100),
        sets={f"set {i}": list(np.unique(np.random.randint(low=100, size=i+1))) for i in range(100)},
    )
    # background is int
    get_set_enrichment_stats(
        background=110,
        test=unique(range(100)),
        sets={f"set {i}": unique(np.random.randint(low=140, size=i+1)) for i in range(0, 140, 10)},
    )


function test_set_enrichment

test_set_enrichment(tests_set2elements, test2_set2elements, background_size)

function get_paired_sets_stats

get_paired_sets_stats(l1, l2)

overlap, intersection, union, ratio


function get_enrichment

get_enrichment(
    df1,
    df2,
    background,
    colid='gene id',
    colref='gene set id',
    colrefname='gene set name',
    colreftype='gene set type',
    colrank='rank',
    outd=None,
    name=None,
    cutoff=0.05,
    permutation_num=1000,
    verbose=False,
    no_plot=True,
    **kws_prerank
)

Returns:

  • leading-edge gene ids (high rank first).


function get_enrichments

get_enrichments(
    df1,
    df2,
    background,
    coltest='subset',
    colid='gene id',
    colref='gene set id',
    colreftype='gene set type',
    fast=False,
    **kws
)

Args:

  • df1: test sets.
  • df2: reference sets.

module roux.stat.solve

For solving equations.


function get_intersection_locations

get_intersection_locations(
    y1: np.array,
    y2: np.array,
    test: bool = False,
    x: np.array = None
) → list

Get co-ordinates of the intersection (x[idx]).

Args:

  • y1 (np.array): vector.
  • y2 (np.array): vector.
  • test (bool, optional): test mode. Defaults to False.
  • x (np.array, optional): vector. Defaults to None.

Returns:

  • list: output.

module roux.stat.transform

For transformations.


function plog

plog(x, p: float, base: int)

Pseudo-log.

Args:

  • x (float|np.array): input.
  • p (float): pseudo-count.
  • base (int): base of the log.

Returns: output.


function anti_plog

anti_plog(x, p: float, base: int)

Anti-pseudo-log.

Args:

  • x (float|np.array): input.
  • p (float): pseudo-count.
  • base (int): base of the log.

Returns: output.
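The pseudo-log and its inverse can be sketched as follows, a plausible implementation consistent with the signatures above (not necessarily roux's exact formula):

```python
import numpy as np

def plog(x, p: float, base: int):
    """Pseudo-log: add a pseudo-count p before taking the log."""
    return np.log(x + p) / np.log(base)

def anti_plog(x, p: float, base: int):
    """Inverse of plog: exponentiate, then remove the pseudo-count."""
    return base ** x - p

x = np.array([0.0, 1.0, 9.0])
# the two transforms round-trip back to the input
assert np.allclose(anti_plog(plog(x, p=1, base=2), p=1, base=2), x)
```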


function log_pval

log_pval(
    x,
    errors: str = 'raise',
    replace_zero_with: float = None,
    p_min: float = None
)

Transform p-values to Log10.

Parameters:

  • x: input.
  • errors (str): Defaults to 'raise'; else replace (in case of visualization only).
  • p_min (float): replace zeros with this value. Note: to be used for visualization only.

Returns: output.
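A minimal sketch of the -log10 transform with zero replacement (`log_pval_sketch` is a hypothetical name; roux's error handling is more involved):

```python
import numpy as np

def log_pval_sketch(x, replace_zero_with: float = None):
    """-log10 transform of p-values; optionally replace zeros (visualization only)."""
    x = np.asarray(x, dtype=float)
    if replace_zero_with is not None:
        x = np.where(x == 0, replace_zero_with, x)  # avoid -log10(0) = inf
    return -np.log10(x)

y = log_pval_sketch([1.0, 0.05, 1e-10])
```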


function get_q

get_q(ds1: Series, col: str = None, verb: bool = True, test_coff: float = 0.1)

Convert to FDR-corrected P-values (q-values).


function glog

glog(x: float, l=2)

Generalised logarithm.

Args:

  • x (float): input.
  • l (int, optional): pseudo-count. Defaults to 2.

Returns:

  • float: output.

function rescale

rescale(
    a: np.array,
    range1: tuple = None,
    range2: tuple = [0, 1]
) → np.array

Rescale within a new range.

Args:

  • a (np.array): input vector.
  • range1 (tuple, optional): existing range. Defaults to None.
  • range2 (tuple, optional): new range. Defaults to [0,1].

Returns:

  • np.array: output.
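The rescaling is a linear map from one range onto another; a minimal sketch consistent with the signature (not roux's exact implementation):

```python
import numpy as np

def rescale(a: np.ndarray, range1: tuple = None, range2: tuple = (0, 1)) -> np.ndarray:
    """Linearly map values from range1 (default: data min/max) onto range2."""
    if range1 is None:
        range1 = (np.min(a), np.max(a))
    frac = (a - range1[0]) / (range1[1] - range1[0])  # position within range1, in [0, 1]
    return range2[0] + frac * (range2[1] - range2[0])

out = rescale(np.array([10., 15., 20.]))
```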

function rescale_divergent

rescale_divergent(df1: DataFrame, col: str) → DataFrame

Rescale divergently i.e. two-sided.

Args:

  • df1 (pd.DataFrame): input data.
  • col (str): column.

Returns:

  • pd.DataFrame: column.

Notes:

Under development.

module roux.stat.variance

For variance related stats.


function confidence_interval_95

confidence_interval_95(x: np.array) → float

95% confidence interval.

Args:

  • x (np.array): input vector.

Returns:

  • float: output.
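A minimal sketch using the normal approximation, i.e. 1.96 standard errors of the mean (roux may instead use a t-distribution; `confidence_interval_95` here is an illustrative re-implementation):

```python
import numpy as np

def confidence_interval_95(x: np.ndarray) -> float:
    """Half-width of the 95% confidence interval of the mean: 1.96 * standard error."""
    x = np.asarray(x, dtype=float)
    return 1.96 * np.std(x, ddof=1) / np.sqrt(len(x))

ci = confidence_interval_95(np.arange(100.0))
```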

function get_ci

get_ci(rs, ci_type, outstr=False)

module roux.viz.annot

For annotations.


function set_label

set_label(
    s: str,
    ax: Axes,
    x: float = 0,
    y: float = 0,
    ha: str = 'left',
    va: str = 'top',
    loc=None,
    off_loc=0.01,
    title: bool = False,
    **kws
) → Axes

Set label on a plot.

Args:

  • x (float): x position.
  • y (float): y position.
  • s (str): label.
  • ax (plt.Axes): plt.Axes object.
  • ha (str, optional): horizontal alignment. Defaults to 'left'.
  • va (str, optional): vertical alignment. Defaults to 'top'.
  • loc (int, optional): location of the label. 1: 'upper right', 2: 'upper left', 3: 'lower left', 4: 'lower right'.
  • off_loc (tuple, optional): x and y location offsets.
  • title (bool, optional): set as title. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.

function annot_side

annot_side(
    ax: Axes,
    df1: DataFrame,
    colx: str,
    coly: str,
    cols: str = None,
    hue: str = None,
    loc: str = 'right',
    scatter=False,
    lines=True,
    text=True,
    invert_xaxis: bool = False,
    offx3: float = 0.15,
    offymin: float = 0.1,
    offymax: float = 0.9,
    offx_text: float = 0,
    offy_text: float = 0,
    break_pt: int = 25,
    length_axhline: float = 3,
    va: str = 'bottom',
    zorder: int = 1,
    color: str = 'gray',
    kws_line: dict = {},
    kws_scatter: dict = {'zorder': 2, 'alpha': 0.75, 'marker': '|', 's': 100},
    **kws_text
) → Axes

Annotate elements of the plot on its side.

Args:

  • df1 (pd.DataFrame): input data
  • colx (str): column with x values.
  • coly (str): column with y values.
  • cols (str): column with labels.
  • hue (str): column with colors of the labels.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • loc (str, optional): location. Defaults to 'right'.
  • invert_xaxis (bool, optional): invert xaxis. Defaults to False.
  • offx3 (float, optional): x-offset for bend position of the arrow. Defaults to 0.15.
  • offymin (float, optional): x-offset minimum. Defaults to 0.1.
  • offymax (float, optional): x-offset maximum. Defaults to 0.9.
  • break_pt (int, optional): break point of the labels. Defaults to 25.
  • length_axhline (float, optional): length of the horizontal line i.e. the "underline". Defaults to 3.
  • zorder (int, optional): z-order. Defaults to 1.
  • color (str, optional): color of the line. Defaults to 'gray'.
  • kws_line (dict, optional): parameters for formatting the line. Defaults to {}.

Keyword Args:

  • kws: parameters provided to the ax.text function.

Returns:

  • plt.Axes: plt.Axes object.

function annot_corners

annot_corners(
    ax: Axes,
    df1: DataFrame,
    colx: str,
    coly: str,
    coltext: str,
    off: float = 0.1,
    **kws
) → Axes

Annotate points above and below the diagonal.


function confidence_ellipse

confidence_ellipse(x, y, ax, n_std=3.0, facecolor='none', **kwargs)

Create a plot of the covariance confidence ellipse of x and y.

Parameters:

  • x, y (array-like, shape (n,)): input data.
  • ax (matplotlib.axes.Axes): the axes object to draw the ellipse into.
  • n_std (float): the number of standard deviations to determine the ellipse's radiuses.
  • **kwargs: forwarded to matplotlib.patches.Ellipse.

Returns:

  • matplotlib.patches.Ellipse

References:

  • https://matplotlib.org/3.5.0/gallery/statistics/confidence_ellipse.html


function show_box

show_box(
    ax: Axes,
    xy: tuple,
    width: float,
    height: float,
    fill: str = None,
    alpha: float = 1,
    lw: float = 1.1,
    ec: str = 'k',
    clip_on: bool = False,
    scale_width: float = 1,
    scale_height: float = 1,
    xoff: float = 0,
    yoff: float = 0,
    **kws
) → Axes

Highlight sections of a plot e.g. heatmap by drawing boxes.

Args:

  • xy (tuple): position of left, bottom corner of the box.
  • width (float): width.
  • height (float): height.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • fill (str, optional): fill the box with color. Defaults to None.
  • alpha (float, optional): alpha of color. Defaults to 1.
  • lw (float, optional): line width. Defaults to 1.1.
  • ec (str, optional): edge color. Defaults to 'k'.
  • clip_on (bool, optional): clip the boxes by the axis limit. Defaults to False.
  • scale_width (float, optional): scale width. Defaults to 1.
  • scale_height (float, optional): scale height. Defaults to 1.
  • xoff (float, optional): x-offset. Defaults to 0.
  • yoff (float, optional): y-offset. Defaults to 0.

Keyword Args:

  • kws: parameters provided to the Rectangle function.

Returns:

  • plt.Axes: plt.Axes object.

function annot_confusion_matrix

annot_confusion_matrix(df_: DataFrame, ax: Axes = None, off: float = 0.5) → Axes

Annotate a confusion matrix.

Args:

  • df_ (pd.DataFrame): input data.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • off (float, optional): offset. Defaults to 0.5.

Returns:

  • plt.Axes: plt.Axes object.

function get_logo_ax

get_logo_ax(
    ax: Axes,
    size: float = 0.5,
    bbox_to_anchor: list = None,
    loc: str = 1,
    axes_kwargs: dict = {'zorder': -1}
) → Axes

Get plt.Axes for placing the logo.

Args:

  • ax (plt.Axes): plt.Axes object.
  • size (float, optional): size of the subplot. Defaults to 0.5.
  • bbox_to_anchor (list, optional): location. Defaults to None.
  • loc (str, optional): location. Defaults to 1.
  • axes_kwargs (dict, optional): parameters provided to inset_axes. Defaults to {'zorder':-1}.

Returns:

  • plt.Axes: plt.Axes object.

function set_logo

set_logo(
    imp: str,
    ax: Axes,
    size: float = 0.5,
    bbox_to_anchor: list = None,
    loc: str = 1,
    axes_kwargs: dict = {'zorder': -1},
    params_imshow: dict = {'aspect': 'auto', 'alpha': 1, 'interpolation': 'catrom'},
    test: bool = False,
    force: bool = False
) → Axes

Set logo.

Args:

  • imp (str): path to the logo file.
  • ax (plt.Axes): plt.Axes object.
  • size (float, optional): size of the subplot. Defaults to 0.5.
  • bbox_to_anchor (list, optional): location. Defaults to None.
  • loc (str, optional): location. Defaults to 1.
  • axes_kwargs (dict, optional): parameters provided to inset_axes. Defaults to {'zorder':-1}.
  • params_imshow (dict, optional): parameters provided to the imshow function. Defaults to {'aspect':'auto','alpha':1, 'interpolation':'catrom'}.
  • test (bool, optional): test mode. Defaults to False.
  • force (bool, optional): overwrite file. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.

function color_ax

color_ax(ax: Axes, c: str, linewidth: float = None) → Axes

Color border of plt.Axes.

Args:

  • ax (plt.Axes): plt.Axes object.
  • c (str): color.
  • linewidth (float, optional): line width. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function annot_n_legend

annot_n_legend(ax, df1: DataFrame, colid: str, colgroup: str, **kws)

module roux.viz.ax_

For setting up subplots.


function set_

set_(ax: Axes, test: bool = False, **kws) → Axes

Set many axis parameters.

Args:

  • ax (plt.Axes): plt.Axes object.
  • test (bool, optional): test mode. Defaults to False.

Keyword Args:

  • kws: parameters provided to the ax.set function.

Returns:

  • plt.Axes: plt.Axes object.

function set_ylabel

set_ylabel(
    ax: Axes,
    s: str = None,
    x: float = -0.1,
    y: float = 1.02,
    xoff: float = 0,
    yoff: float = 0
) → Axes

Set ylabel horizontal.

Args:

  • ax (plt.Axes): plt.Axes object.
  • s (str, optional): ylabel. Defaults to None.
  • x (float, optional): x position. Defaults to -0.1.
  • y (float, optional): y position. Defaults to 1.02.
  • xoff (float, optional): x offset. Defaults to 0.
  • yoff (float, optional): y offset. Defaults to 0.

Returns:

  • plt.Axes: plt.Axes object.

function rename_labels

rename_labels(ax, d1)

function rename_ticklabels

rename_ticklabels(
    ax: Axes,
    axis: str,
    rename: dict = None,
    replace: dict = None,
    ignore: bool = False
) → Axes

Rename the ticklabels.

Args:

  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • axis (str): axis (x|y).
  • rename (dict, optional): replace strings. Defaults to None.
  • replace (dict, optional): replace sub-strings. Defaults to None.
  • ignore (bool, optional): ignore warnings. Defaults to False.

Raises:

  • ValueError: either rename or replace should be provided.

Returns:

  • plt.Axes: plt.Axes object.

function get_ticklabel_position

get_ticklabel_position(ax: Axes, axis: str) → Axes

Get positions of the ticklabels.

Args:

  • ax (plt.Axes): plt.Axes object.
  • axis (str): axis (x|y).

Returns:

  • plt.Axes: plt.Axes object.

function set_ticklabels_color

set_ticklabels_color(ax: Axes, ticklabel2color: dict, axis: str = 'y') → Axes

Set colors to ticklabels.

Args:

  • ax (plt.Axes): plt.Axes object.
  • ticklabel2color (dict): colors of the ticklabels.
  • axis (str): axis (x|y).

Returns:

  • plt.Axes: plt.Axes object.

function format_ticklabels

format_ticklabels(
    ax: Axes,
    axes: tuple = ['x', 'y'],
    n: int = None,
    fmt: str = None,
    font: str = None
) → Axes

format_ticklabels

Args:

  • ax (plt.Axes): plt.Axes object.
  • axes (tuple, optional): axes. Defaults to ['x','y'].
  • n (int, optional): number of ticks. Defaults to None.
  • fmt (str, optional): format. Defaults to None.
  • font (str, optional): font. Defaults to 'DejaVu Sans Mono'.

Returns:

  • plt.Axes: plt.Axes object.

TODOs: 1. include color_ticklabels


function set_equallim

set_equallim(
    ax: Axes,
    diagonal: bool = False,
    difference: float = None,
    format_ticks: bool = True,
    **kws_format_ticklabels
) → Axes

Set equal axis limits.

Args:

  • ax (plt.Axes): plt.Axes object.
  • diagonal (bool, optional): show diagonal. Defaults to False.
  • difference (float, optional): difference from . Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function get_axlims

get_axlims(ax: Axes) → Axes

Get axis limits.

Args:

  • ax (plt.Axes): plt.Axes object.

Returns:

  • plt.Axes: plt.Axes object.

function set_axlims

set_axlims(ax: Axes, off: float, axes: list = ['x', 'y']) → Axes

Set axis limits.

Args:

  • ax (plt.Axes): plt.Axes object.
  • off (float): offset.
  • axes (list, optional): axis name/s. Defaults to ['x','y'].

Returns:

  • plt.Axes: plt.Axes object.

function get_axlimsby_data

get_axlimsby_data(
    X: Series,
    Y: Series,
    off: float = 0.2,
    equal: bool = False
) → Axes

Infer axis limits from data.

Args:

  • X (pd.Series): x values.
  • Y (pd.Series): y values.
  • off (float, optional): offsets. Defaults to 0.2.
  • equal (bool, optional): equal limits. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.

function split_ticklabels

split_ticklabels(
    ax: Axes,
    axis='x',
    grouped=False,
    group_x=0.01,
    group_prefix=None,
    group_loc='left',
    group_colors=None,
    group_alpha=0.2,
    show_group_line=True,
    show_group_span=True,
    sep: str = '-',
    pad_major=6,
    **kws
) → Axes

Split ticklabels into major and minor. Two minor ticks are created per major tick.

Args:

  • ax (plt.Axes): plt.Axes object.
  • sep (str, optional): separator within the tick labels. Defaults to '-'.

Returns:

  • plt.Axes: plt.Axes object.

function set_grids

set_grids(ax: Axes, axis: str = None) → Axes

Show grids.

Args:

  • ax (plt.Axes): plt.Axes object.
  • axis (str, optional): axis name. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function rename_legends

rename_legends(ax: Axes, replaces: dict, **kws_legend) → Axes

Rename legends.

Args:

  • ax (plt.Axes): plt.Axes object.
  • replaces (dict): replacements for the legend labels.

Returns:

  • plt.Axes: plt.Axes object.

function append_legends

append_legends(ax: Axes, labels: list, handles: list, **kws) → Axes

Append to legends.

Args:

  • ax (plt.Axes): plt.Axes object.
  • labels (list): labels.
  • handles (list): handles.

Returns:

  • plt.Axes: plt.Axes object.

function sort_legends

sort_legends(ax: Axes, sort_order: list = None, **kws) → Axes

Sort or filter legends.

Args:

  • ax (plt.Axes): plt.Axes object.
  • sort_order (list, optional): order of legends. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

Notes:

  1. Filter the legends by providing the indices of the legends to keep.

function drop_duplicate_legend

drop_duplicate_legend(ax, **kws)

function reset_legend_colors

reset_legend_colors(ax)

Reset legend colors.

Args:

  • ax (plt.Axes): plt.Axes object.

Returns:

  • plt.Axes: plt.Axes object.

function set_legends_merged

set_legends_merged(axs)

Merge the legends of multiple plt.Axes objects.

Args:

  • axs (list): list of plt.Axes objects.

Returns:

  • plt.Axes: first plt.Axes object in the list.

function set_legend_custom

set_legend_custom(
    ax: Axes,
    legend2param: dict,
    param: str = 'color',
    lw: float = 1,
    marker: str = 'o',
    markerfacecolor: bool = True,
    size: float = 10,
    color: str = 'k',
    linestyle: str = '',
    title_ha: str = 'center',
    frameon: bool = True,
    **kws
) → Axes

Set custom legends.

Args:

  • ax (plt.Axes): plt.Axes object.
  • legend2param (dict): legend name to parameter to change e.g. name of the color.
  • param (str, optional): parameter to change. Defaults to 'color'.
  • lw (float, optional): line width. Defaults to 1.
  • marker (str, optional): marker type. Defaults to 'o'.
  • markerfacecolor (bool, optional): marker face color. Defaults to True.
  • size (float, optional): size of the markers. Defaults to 10.
  • color (str, optional): color of the markers. Defaults to 'k'.
  • linestyle (str, optional): line style. Defaults to ''.
  • title_ha (str, optional): title horizontal alignment. Defaults to 'center'.
  • frameon (bool, optional): show frame. Defaults to True.

Returns:

  • plt.Axes: plt.Axes object.

TODOs: 1. different number of points for each entry

    from matplotlib.legend_handler import HandlerTuple
    l1, = plt.plot(-1, -1, lw=0, marker="o", markerfacecolor='k', markeredgecolor='k')
    l2, = plt.plot(-0.5, -1, lw=0, marker="o", markerfacecolor="none", markeredgecolor='k')
    plt.legend([(l1,), (l1, l2)], ["test 1", "test 2"],
               handler_map={tuple: HandlerTuple(2)})

References:

  • https://matplotlib.org/stable/api/markers_api.html
  • http://www.cis.jhu.edu/~shanest/mpt/js/mathjax/mathjax-dev/fonts/Tables/STIX/STIX/All/All.html

function get_line_cap_length

get_line_cap_length(ax: Axes, linewidth: float) → Axes

Get the line cap length.

Args:

  • ax (plt.Axes): plt.Axes object
  • linewidth (float): width of the line.

Returns:

  • plt.Axes: plt.Axes object

function get_subplot_dimentions

get_subplot_dimentions(ax=None)

Calculate the aspect ratio of plt.Axes.

Args:

  • ax (plt.Axes): plt.Axes object

Returns:

  • plt.Axes: plt.Axes object

References:

  • https://github.com/matplotlib/matplotlib/issues.py#issuecomment-285472404

function set_colorbar

set_colorbar(
    fig: object,
    ax: Axes,
    ax_pc: Axes,
    label: str,
    bbox_to_anchor: tuple = (0.05, 0.5, 1, 0.45),
    orientation: str = 'vertical'
)

Set colorbar.

Args:

  • fig (object): figure object.
  • ax (plt.Axes): plt.Axes object.
  • ax_pc (plt.Axes): plt.Axes object for the colorbar.
  • label (str): label
  • bbox_to_anchor (tuple, optional): location. Defaults to (0.05, 0.5, 1, 0.45).
  • orientation (str, optional): orientation. Defaults to "vertical".

Returns: figure object.


function set_colorbar_label

set_colorbar_label(ax: Axes, label: str) → Axes

Find colorbar and set label for it.

Args:

  • ax (plt.Axes): plt.Axes object.
  • label (str): label.

Returns:

  • plt.Axes: plt.Axes object.

module roux.viz.bar

For bar plots.


function plot_barh

plot_barh(
    df1: DataFrame,
    colx: str,
    coly: str,
    colannnotside: str = None,
    x1: float = None,
    offx: float = 0,
    ax: Axes = None,
    **kws
) → Axes

Plot horizontal bar plot with text on them.

Args:

  • df1 (pd.DataFrame): input data.
  • colx (str): x column.
  • coly (str): y column.
  • colannnotside (str): column with annotations to show on the right side of the plot.
  • x1 (float): x position of the text.
  • offx (float): x-offset of x1, multiplier.
  • color (str): color of the bars.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the barh function.

Returns:

  • plt.Axes: plt.Axes object.

function plot_value_counts

plot_value_counts(
    df: DataFrame,
    col: str,
    logx: bool = False,
    kws_hist: dict = {'bins': 10},
    kws_bar: dict = {},
    grid: bool = False,
    axes: list = None,
    fig: object = None,
    hist: bool = True
)

Plot the output of pandas value_counts.

Args:

  • df (pd.DataFrame): input data value_counts.
  • col (str): column with counts.
  • logx (bool, optional): x-axis on log-scale. Defaults to False.
  • kws_hist (dict, optional): parameters provided to the hist function. Defaults to {'bins':10}.
  • kws_bar (dict, optional): parameters provided to the bar function. Defaults to {}.
  • grid (bool, optional): show grids or not. Defaults to False.
  • axes (list, optional): list of plt.axes. Defaults to None.
  • fig (object, optional): figure object. Defaults to None.
  • hist (bool, optional): show histogram. Defaults to True.

function plot_barh_stacked_percentage

plot_barh_stacked_percentage(
    df1: DataFrame,
    coly: str,
    colannot: str,
    color: str = None,
    yoff: float = 0,
    ax: Axes = None
) → Axes

Plot horizontal stacked bar plot with percentages.

Args:

  • df1 (pd.DataFrame): input data. values in rows sum to 100%.
  • coly (str): y column. yticklabels, e.g. retained and dropped.
  • colannot (str): column with annotations.
  • color (str, optional): color. Defaults to None.
  • yoff (float, optional): y-offset. Defaults to 0.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function plot_bar_serial

plot_bar_serial(
    d1: dict,
    polygon: bool = False,
    polygon_x2i: float = 0,
    labelis: list = [],
    y: float = 0,
    ylabel: str = None,
    off_arrowy: float = 0.15,
    kws_rectangle={'height': 0.5, 'linewidth': 1},
    ax: Axes = None
) → Axes

Barplots with serial increase in resolution.

Args:

  • d1 (dict): dictionary with the data.
  • polygon (bool, optional): show polygon. Defaults to False.
  • polygon_x2i (float, optional): connect polygon to this subset. Defaults to 0.
  • labelis (list, optional): label these subsets. Defaults to [].
  • y (float, optional): y position. Defaults to 0.
  • ylabel (str, optional): y label. Defaults to None.
  • off_arrowy (float, optional): offset for the arrow. Defaults to 0.15.
  • kws_rectangle (dict, optional): parameters provided to the rectangle function. Defaults to dict(height=0.5,linewidth=1).
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function plot_barh_stacked_percentage_intersections

plot_barh_stacked_percentage_intersections(
    df0: DataFrame,
    colxbool: str,
    colybool: str,
    colvalue: str,
    colid: str,
    colalt: str,
    colgroupby: str,
    coffgroup: float = 0.95,
    ax: Axes = None
) → Axes

Plot horizontal stacked bar plot with percentages and intersections.

Args:

  • df0 (pd.DataFrame): input data.
  • colxbool (str): x column.
  • colybool (str): y column.
  • colvalue (str): column with the values.
  • colid (str): column with ids.
  • colalt (str): column with the alternative subset.
  • colgroupby (str): column with groups.
  • coffgroup (float, optional): cut-off between the groups. Defaults to 0.95.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

Examples:

Parameters: colxbool='paralog', colybool='essential', colvalue='value', colid='gene id', colalt='singleton', coffgroup=0.95, colgroupby='tissue',


function to_input_data_sankey

to_input_data_sankey(
    df0,
    colid,
    cols_groupby=None,
    colall='all',
    remove_all=False
)

function plot_sankey

plot_sankey(
    df1,
    cols_groupby=None,
    hues=None,
    node_color=None,
    link_color=None,
    info=None,
    x=None,
    y=None,
    colors=None,
    hovertemplate=None,
    text_width=20,
    convert=True,
    width=400,
    height=400,
    outp=None,
    validate=True,
    test=False,
    **kws
)

module roux.viz.colors

For setting up colors.


function rgbfloat2int

rgbfloat2int(rgb_float)

function get_colors_default

get_colors_default() → list

Get default colors.

Returns:

  • list: colors.

function get_ncolors

get_ncolors(
    n: int,
    cmap: str = 'Spectral',
    ceil: bool = False,
    test: bool = False,
    N: int = 20,
    out: str = 'hex',
    **kws_get_cmap_section
)  list

Get colors.

Args:

  • n (int): number of colors to get.
  • cmap (str, optional): colormap. Defaults to 'Spectral'.
  • ceil (bool, optional): ceil. Defaults to False.
  • test (bool, optional): test mode. Defaults to False.
  • N (int, optional): number of colors in the colormap. Defaults to 20.
  • out (str, optional): output. Defaults to 'hex'.

Returns:

  • list: colors.
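
The core behavior — sampling n evenly spaced colors from a colormap and converting them to hex — can be sketched with plain matplotlib (an assumed equivalent, not the roux implementation itself; `sample_hex_colors` is a hypothetical name):

```python
# Sample n evenly spaced hex colors from a named colormap
# (a sketch of what get_ncolors is documented to do).
from matplotlib import colormaps
from matplotlib.colors import to_hex

def sample_hex_colors(n: int, cmap: str = "Spectral") -> list:
    cm = colormaps[cmap]
    # spread sample points across [0, 1]; guard against n == 1
    return [to_hex(cm(i / max(n - 1, 1))) for i in range(n)]

colors = sample_hex_colors(3)
```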

function get_val2color

get_val2color(
    ds: Series,
    vmin: float = None,
    vmax: float = None,
    cmap: str = 'Reds'
)  dict

Get color for a value.

Args:

  • ds (pd.Series): values.
  • vmin (float, optional): minimum value. Defaults to None.
  • vmax (float, optional): maximum value. Defaults to None.
  • cmap (str, optional): colormap. Defaults to 'Reds'.

Returns:

  • dict: output.
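
A minimal sketch of the assumed idea — min-max normalize the values, then look each one up in the colormap (`val2color` here is a hypothetical stand-in, not the roux function):

```python
# Map each value in a series to a colormap color via min-max
# normalization (assumed behavior of get_val2color).
import pandas as pd
from matplotlib import colormaps
from matplotlib.colors import Normalize, to_hex

def val2color(ds: pd.Series, cmap: str = "Reds", vmin=None, vmax=None) -> dict:
    norm = Normalize(vmin=ds.min() if vmin is None else vmin,
                     vmax=ds.max() if vmax is None else vmax)
    cm = colormaps[cmap]
    return {v: to_hex(cm(norm(v))) for v in ds}

d = val2color(pd.Series([1, 2, 3]))
```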

function saturate_color

saturate_color(color, alpha: float)  object

Saturate a color.

Args: color (type):

  • alpha (float): alpha level.

Returns:

  • object: output.

References:

  • https://stackoverflow.com/a/60562502/3521099
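
The referenced answer lightens or darkens a color by scaling its HLS lightness; a self-contained sketch (the convention that alpha > 1 darkens is an assumption):

```python
# Scale the HLS lightness of a color, the approach used in the
# referenced Stack Overflow answer (alpha > 1 darkens here -- an
# assumption about the convention).
import colorsys
from matplotlib.colors import to_rgb, to_hex

def saturate(color, alpha: float) -> str:
    h, l, s = colorsys.rgb_to_hls(*to_rgb(color))
    l = max(0.0, min(1.0, 1 - alpha * (1 - l)))  # clamp to [0, 1]
    return to_hex(colorsys.hls_to_rgb(h, l, s))

darker = saturate("red", 1.5)
```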

function mix_colors

mix_colors(d: dict)  str

Mix colors.

Args:

  • d (dict): colors to alpha map.

Returns:

  • str: hex color.

References:

  • https://stackoverflow.com/a/61488997/3521099
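
Mixing can be sketched as a weighted average of RGB channels, keyed by the colors-to-alpha map (assumed behavior; `mix` is a hypothetical name):

```python
# Mix colors by a weighted average of their RGB channels and
# return the result as hex (a sketch of the documented idea).
from matplotlib.colors import to_rgb, to_hex

def mix(d: dict) -> str:
    total = sum(d.values())
    rgb = tuple(sum(to_rgb(c)[i] * w for c, w in d.items()) / total
                for i in range(3))
    return to_hex(rgb)

purple = mix({"#ff0000": 1, "#0000ff": 1})
```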

function make_cmap

make_cmap(cs: list, N: int = 20, **kws)

Create a colormap.

Args:

  • cs (list): colors
  • N (int, optional): resolution i.e. number of colors. Defaults to 20.

Returns: cmap.
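
make_cmap presumably wraps matplotlib's standard route for building a colormap from a list of colors:

```python
# Build a 20-step colormap interpolated between two colors;
# LinearSegmentedColormap.from_list is the standard matplotlib route
# that make_cmap presumably wraps.
from matplotlib.colors import LinearSegmentedColormap

cmap = LinearSegmentedColormap.from_list("custom", ["#ffffff", "#ff0000"], N=20)
```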


function get_cmap_section

get_cmap_section(
    cmap,
    vmin: float = 0.0,
    vmax: float = 1.0,
    n: int = 100
)  object

Get section of a colormap.

Args:

  • cmap (object| str): colormap.
  • vmin (float, optional): minimum value. Defaults to 0.0.
  • vmax (float, optional): maximum value. Defaults to 1.0.
  • n (int, optional): resolution i.e. number of colors. Defaults to 100.

Returns:

  • object: cmap.
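
Taking a section of a colormap can be sketched by resampling the [vmin, vmax] range into a new ListedColormap (`cmap_section` is a hypothetical stand-in for the assumed equivalent):

```python
# Resample the [vmin, vmax] section of a colormap into a new
# ListedColormap (assumed equivalent of get_cmap_section).
import numpy as np
from matplotlib import colormaps
from matplotlib.colors import ListedColormap

def cmap_section(cmap, vmin=0.0, vmax=1.0, n=100):
    cm = colormaps[cmap] if isinstance(cmap, str) else cmap
    return ListedColormap(cm(np.linspace(vmin, vmax, n)))

reds_upper = cmap_section("Reds", vmin=0.5, vmax=1.0)
```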

function append_cmap

append_cmap(
    cmap: str = 'Reds',
    color: str = '#D3DDDC',
    cmap_min: float = 0.2,
    cmap_max: float = 0.8,
    ncolors: int = 100,
    ncolors_min: int = 1,
    ncolors_max: int = 0
)

Append a color to colormap.

Args:

  • cmap (str, optional): colormap. Defaults to 'Reds'.
  • color (str, optional): color to append. Defaults to '#D3DDDC'.
  • cmap_min (float, optional): cmap_min. Defaults to 0.2.
  • cmap_max (float, optional): cmap_max. Defaults to 0.8.
  • ncolors (int, optional): number of colors. Defaults to 100.
  • ncolors_min (int, optional): number of colors minimum. Defaults to 1.
  • ncolors_max (int, optional): number of colors maximum. Defaults to 0.

Returns: cmap.

References:

  • https://matplotlib.org/stable/tutorials/colors/colormap-manipulation.html

module roux.viz.compare

For comparative plots.


function plot_comparisons

plot_comparisons(
    plot_data,
    x,
    ax=None,
    output_dir_path=None,
    force=False,
    return_path=False
)

Parameters:

  • plot_data: output of .stat.compare.get_comparison

Notes:

sample type: different sample of the same data.

module roux.viz.dist

For distribution plots.


function hist_annot

hist_annot(
    dplot: DataFrame,
    colx: str,
    colssubsets: list = [],
    bins: int = 100,
    subset_unclassified: bool = True,
    cmap: str = 'hsv',
    ymin=None,
    ymax=None,
    ylimoff: float = 1,
    ywithinoff: float = 1.2,
    annotaslegend: bool = True,
    annotn: bool = True,
    params_scatter: dict = {'zorder': 2, 'alpha': 0.1, 'marker': '|'},
    xlim: tuple = None,
    ax: Axes = None,
    **kws
)  Axes

Annoted histogram.

Args:

  • dplot (pd.DataFrame): input dataframe.
  • colx (str): x column.
  • colssubsets (list, optional): columns indicating subsets. Defaults to [].
  • bins (int, optional): bins. Defaults to 100.
  • subset_unclassified (bool, optional): label the non-annotated subset as 'unclassified'. Defaults to True.
  • cmap (str, optional): colormap. Defaults to 'hsv'.
  • ymin (float, optional): minimum y-axis limit. Defaults to None.
  • ymax (float, optional): maximum y-axis limit. Defaults to None.
  • ylimoff (float, optional): y-offset for the y-axis limit. Defaults to 1.
  • ywithinoff (float, optional): y-offset for the distance within labels. Defaults to 1.2.
  • annotaslegend (bool, optional): convert labels to legends. Defaults to True.
  • annotn (bool, optional): annotate sample sizes. Defaults to True.
  • params_scatter (dict, optional): parameters of the scatter plot. Defaults to {'zorder':2,'alpha':0.1,'marker':'|'}.
  • xlim (tuple, optional): x-axis limits. Defaults to None.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the hist function.

Returns:

  • plt.Axes: plt.Axes object.

TODOs: For scatter, use annot_side with loc='top'.


function plot_gmm

plot_gmm(
    x: Series,
    coff: float = None,
    mix_pdf: object = None,
    two_pdfs: tuple = None,
    weights: tuple = None,
    n_clusters: int = 2,
    bins: int = 20,
    show_cutoff: bool = True,
    show_cutoff_line: bool = True,
    colors: list = ['gray', 'gray', 'lightgray'],
    out_coff: bool = False,
    hist: bool = True,
    test: bool = False,
    ax: Axes = None,
    kws_axvline={'color': 'k'},
    **kws
)  Axes

Plot Gaussian mixture Models (GMMs).

Args:

  • x (pd.Series): input vector.
  • coff (float, optional): intersection between two fitted distributions. Defaults to None.
  • mix_pdf (object, optional): Probability density function of the mixed distribution. Defaults to None.
  • two_pdfs (tuple, optional): Probability density functions of the separate distributions. Defaults to None.
  • weights (tuple, optional): weights of the individual distributions. Defaults to None.
  • n_clusters (int, optional): number of distributions. Defaults to 2.
  • bins (int, optional): bins. Defaults to 20.
  • show_cutoff (bool, optional): show the cutoff. Defaults to True.
  • show_cutoff_line (bool, optional): show the cutoff line. Defaults to True.
  • colors (list, optional): colors of the individual distributions and of the mixed one. Defaults to ['gray','gray','lightgray'].
  • out_coff (bool, optional): return the cutoff. Defaults to False.
  • hist (bool, optional): show histogram. Defaults to True.
  • test (bool, optional): test mode. Defaults to False.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the hist function.
  • kws_axvline: parameters provided to the axvline function.

Returns:

  • plt.Axes: plt.Axes object.

function plot_normal

plot_normal(x: Series, ax: Axes = None)  Axes

Plot normal distribution.

Args:

  • x (pd.Series): input vector.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.
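
The generic recipe behind plot_normal — a density histogram with a fitted normal curve on top — can be sketched with matplotlib and numpy alone (a sketch of the assumed behavior, fitting via sample mean and standard deviation):

```python
# Overlay a fitted normal density on a histogram (assumed recipe).
import matplotlib
matplotlib.use("Agg")  # render off-screen
import numpy as np
import matplotlib.pyplot as plt

x = np.random.default_rng(0).normal(size=200)
fig, ax = plt.subplots()
ax.hist(x, bins=20, density=True, color="lightgray")
mu, sd = x.mean(), x.std()  # moment-based normal fit
grid = np.linspace(x.min(), x.max(), 100)
pdf = np.exp(-((grid - mu) ** 2) / (2 * sd**2)) / (sd * np.sqrt(2 * np.pi))
ax.plot(grid, pdf, color="k")
```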

function plot_dists

plot_dists(
    df1: DataFrame,
    x: str,
    y: str,
    colindex: str,
    hue: str = None,
    order: list = None,
    hue_order: list = None,
    kind: str = 'box',
    show_p: bool = True,
    show_n: bool = True,
    show_n_prefix: str = '',
    show_n_ha='left',
    alternative: str = 'two-sided',
    offx_n: float = 0,
    xlim: tuple = None,
    xscale: str = 'linear',
    offx_pval: float = 0.05,
    offy_pval: float = None,
    saturate_color_alpha: float = 1.5,
    ax: Axes = None,
    test: bool = False,
    kws_stats: dict = {},
    **kws
)  Axes

Plot distributions.

Args:

  • df1 (pd.DataFrame): input data.
  • x (str): x column.
  • y (str): y column.
  • colindex (str): index column.
  • hue (str, optional): column with values to be encoded as hues. Defaults to None.
  • order (list, optional): order of categorical values. Defaults to None.
  • hue_order (list, optional): order of values to be encoded as hues. Defaults to None.
  • kind (str, optional): kind of distribution. Defaults to 'box'.
  • show_p (bool, optional): show p-values. Defaults to True.
  • show_n (bool, optional): show sample sizes. Defaults to True.
  • show_n_prefix (str, optional): show prefix of sample size label i.e. n=. Defaults to ''.
  • offx_n (float, optional): x-offset for the sample size label. Defaults to 0.
  • xlim (tuple, optional): x-axis limits. Defaults to None.
  • offx_pval (float, optional): x-offset for the p-value labels. Defaults to 0.05.
  • offy_pval (float, optional): y-offset for the p-value labels. Defaults to None.
  • saturate_color_alpha (float, optional): saturation of the color. Defaults to 1.5.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • test (bool, optional): test mode. Defaults to False.
  • kws_stats (dict, optional): parameters provided to the stat function. Defaults to {}.

Keyword Args:

  • kws: parameters provided to the seaborn function.

Returns:

  • plt.Axes: plt.Axes object.

TODOs: 1. Sort categories. 2. Change alpha of the boxplot rather than changing saturation of the swarmplot.


function pointplot_groupbyedgecolor

pointplot_groupbyedgecolor(data: DataFrame, ax: Axes = None, **kws)  Axes

Plot seaborn's pointplot grouped by edgecolor of points.

Args:

  • data (pd.DataFrame): input data.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the seaborn's pointplot function.

Returns:

  • plt.Axes: plt.Axes object.

module roux.viz.figure

For setting up figures.


function get_subplots

get_subplots(nrows: int, ncols: int, total: int = None)  list

Get subplots.

Args:

  • nrows (int): number of rows.
  • ncols (int): number of columns.
  • total (int, optional): total subplots. Defaults to None.

Returns:

  • list: list of plt.Axes objects.
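
A minimal equivalent: create the grid, keep only the first `total` axes, and hide the rest (assumed behavior; `get_axes` is a hypothetical name):

```python
# Create an nrows x ncols grid and return only the first `total`
# axes, hiding the unused panels (assumed behavior of get_subplots).
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def get_axes(nrows: int, ncols: int, total: int = None) -> list:
    fig, axs = plt.subplots(nrows, ncols)
    flat = list(axs.flat)
    total = len(flat) if total is None else total
    for ax in flat[total:]:
        ax.set_visible(False)  # hide leftover panels
    return flat[:total]

axes = get_axes(2, 2, total=3)
```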

function labelplots

labelplots(
    fig,
    axes: list = None,
    labels: list = None,
    xoff: float = 0,
    yoff: float = 0,
    custom_positions: dict = {},
    size: float = 20,
    va: str = 'bottom',
    ha: str = 'right',
    test: bool = False,
    **kws_text
)

Label (sub)plots.

Args:

  • fig: plt.figure object.
  • axes (list, optional): list of plt.Axes objects. Defaults to None.
  • labels (list, optional): list of labels. Defaults to None.
  • xoff (float, optional): x offset. Defaults to 0.
  • yoff (float, optional): y offset. Defaults to 0.
  • custom_positions (dict, optional): custom label positions. Defaults to {}.
  • size (float, optional): font size. Defaults to 20.
  • va (str, optional): vertical alignment. Defaults to 'bottom'.
  • ha (str, optional): horizontal alignment. Defaults to 'right'.
  • test (bool, optional): test mode. Defaults to False.

module roux.viz.heatmap

For heatmaps.


function plot_table

plot_table(
    df1: DataFrame,
    xlabel: str = None,
    ylabel: str = None,
    annot: bool = True,
    cbar: bool = False,
    linecolor: str = 'k',
    linewidths: float = 1,
    cmap: str = None,
    sorty: bool = False,
    linebreaky: bool = False,
    scales: tuple = [1, 1],
    ax: Axes = None,
    **kws
)  Axes

Plot to show a table.

Args:

  • df1 (pd.DataFrame): input data.
  • xlabel (str, optional): x label. Defaults to None.
  • ylabel (str, optional): y label. Defaults to None.
  • annot (bool, optional): show numbers. Defaults to True.
  • cbar (bool, optional): show colorbar. Defaults to False.
  • linecolor (str, optional): line color. Defaults to 'k'.
  • linewidths (float, optional): line widths. Defaults to 1.
  • cmap (str, optional): color map. Defaults to None.
  • sorty (bool, optional): sort rows. Defaults to False.
  • linebreaky (bool, optional): linebreak for y labels. Defaults to False.
  • scales (tuple, optional): scale of the table. Defaults to [1,1].
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the sns.heatmap function.

Returns:

  • plt.Axes: plt.Axes object.

function plot_crosstab

plot_crosstab(
    df1: DataFrame,
    cols: list = None,
    alpha: float = 0.05,
    method: str = None,
    confusion: bool = False,
    rename_cols: bool = False,
    sort_cols: tuple = [True, True],
    annot_pval: str = 'bottom',
    cmap: str = 'Reds',
    ax: Axes = None,
    **kws
)  Axes

Plot crosstab table.

Args:

  • df1 (pd.DataFrame): input data
  • cols (list, optional): columns. Defaults to None.
  • alpha (float, optional): alpha for the stats. Defaults to 0.05.
  • method (str, optional): method to check the association ['chi2','FE']. Defaults to None.
  • rename_cols (bool, optional): rename the columns. Defaults to False.
  • annot_pval (str, optional): annotate p-values. Defaults to 'bottom'.
  • cmap (str, optional): colormap. Defaults to 'Reds'.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Raises:

  • ValueError: annot_pval position should be the allowed one.

Returns:

  • plt.Axes: plt.Axes object.

TODOs: 1. Use compare_classes to get the stats.

module roux.viz.image

For visualization of images.


function plot_image

plot_image(
    imp: str,
    ax: Axes = None,
    force=False,
    margin=0,
    axes=False,
    test=False,
    **kwarg
)  Axes

Plot image e.g. schematic.

Args:

  • imp (str): path of the image.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • force (bool, optional): overwrite output. Defaults to False.
  • margin (int, optional): margins. Defaults to 0.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.

Keyword Args:

  • kwarg: backend-specific parameters, e.g. cairosvg: {'dpi':500,'scale':2}; imagemagick: {'trim':False,'alpha':False}.

module roux.viz.io

For input/output of plots.


function to_plotp

to_plotp(
    ax: Axes = None,
    prefix: str = 'plot/plot_',
    suffix: str = '',
    fmts: list = ['png']
)  str

Infer output path for a plot.

Args:

  • ax (plt.Axes): plt.Axes object.
  • prefix (str, optional): prefix with directory path for the plot. Defaults to 'plot/plot_'.
  • suffix (str, optional): suffix of the filename. Defaults to ''.
  • fmts (list, optional): formats of the images. Defaults to ['png'].

Returns:

  • str: output path for the plot.

function savefig

savefig(
    plotp: str,
    tight_layout: bool = True,
    bbox_inches: list = None,
    fmts: list = ['png'],
    savepdf: bool = False,
    normalise_path: bool = True,
    replaces_plotp: dict = None,
    dpi: int = 500,
    force: bool = True,
    kws_replace_many: dict = {},
    kws_savefig: dict = {},
    **kws
)  str

Wrapper around plt.savefig.

Args:

  • plotp (str): output path or plt.Axes object.
  • tight_layout (bool, optional): tight_layout. Defaults to True.
  • bbox_inches (list, optional): bbox_inches. Defaults to None.
  • savepdf (bool, optional): savepdf. Defaults to False.
  • normalise_path (bool, optional): normalise_path. Defaults to True.
  • replaces_plotp (dict, optional): replaces_plotp. Defaults to None.
  • dpi (int, optional): dpi. Defaults to 500.
  • force (bool, optional): overwrite output. Defaults to True.
  • kws_replace_many (dict, optional): parameters provided to the replace_many function. Defaults to {}.

Keyword Args:

  • kws: parameters provided to to_plotp function.
  • kws_savefig: parameters provided to to_savefig function.
  • kws_replace_many: parameters provided to replace_many function.

Returns:

  • str: output path.

function savelegend

savelegend(
    plotp: str,
    legend: object,
    expand: list = [-5, -5, 5, 5],
    **kws_savefig
)  str

Save only the legend of the plot/figure.

Args:

  • plotp (str): output path.
  • legend (object): legend object.
  • expand (list, optional): expand. Defaults to [-5,-5,5,5].

Returns:

  • str: output path.

References:

  • 1. https://stackoverflow.com/a/47749903/3521099

function update_kws_plot

update_kws_plot(kws_plot: dict, kws_plotp: dict, test: bool = False)  dict

Update the input parameters.

Args:

  • kws_plot (dict): input parameters.
  • kws_plotp (dict): saved parameters.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • dict: updated parameters.

function get_plot_inputs

get_plot_inputs(plotp: str, df1: DataFrame, kws_plot: dict, outd: str)  tuple

Get plot inputs.

Args:

  • plotp (str): path of the plot.
  • df1 (pd.DataFrame): data for the plot.
  • kws_plot (dict): parameters of the plot.
  • outd (str): output directory.

Returns:

  • tuple: (path,dataframe,dict)

function log_code

log_code()

Log the code.


function get_lines

get_lines(
    logp: str = 'log_notebook.log',
    sep: str = '# plot',
    test: bool = False
)  list

Get lines from the log.

Args:

  • logp (str, optional): path to the log file. Defaults to 'log_notebook.log'.
  • sep (str, optional): separator of the code blocks. Defaults to '# plot'.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • list: lines of code.

function to_script

to_script(
    srcp: str,
    plotp: str,
    defn: str = 'plot_',
    s4: str = '    ',
    test: bool = False,
    **kws
)  str

Save the script with the code for the plot.

Args:

  • srcp (str): path of the script.
  • plotp (str): path of the plot.
  • defn (str, optional): prefix of the function. Defaults to "plot_".
  • s4 (str, optional): a tab. Defaults to '    ' (four spaces).
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • str: path of the script.

TODOs: 1. Make compatible with names of input dataframes other than df1. 2. Get the variable name of the dataframe:

    def get_df_name(df):
        name = [x for x in globals() if globals()[x] is df and not x.startswith('_')][0]
        return name

  3. Replace df1 with the variable name of the dataframe.

function to_plot

to_plot(
    plotp: str,
    df1: DataFrame = None,
    kws_plot: dict = {},
    logp: str = 'log_notebook.log',
    sep: str = '# plot',
    validate: bool = False,
    show_path: bool = False,
    show_path_offy: float = -0.2,
    force: bool = True,
    test: bool = False,
    quiet: bool = True,
    **kws
)  str

Save a plot.

Args:

  • plotp (str): output path.
  • df1 (pd.DataFrame, optional): dataframe with plotting data. Defaults to None.
  • kws_plot (dict, optional): parameters for plotting. Defaults to dict().
  • logp (str, optional): path to the log. Defaults to 'log_notebook.log'.
  • sep (str, optional): separator of the code blocks. Defaults to '# plot'.
  • validate (bool, optional): validate the "readability" using read_plot function. Defaults to False.
  • show_path (bool, optional): show path on the plot. Defaults to False.
  • show_path_offy (float, optional): y-offset for the path label. Defaults to -0.2.
  • force (bool, optional): overwrite output. Defaults to True.
  • test (bool, optional): test mode. Defaults to False.
  • quiet (bool, optional): quiet mode. Defaults to True.

Returns:

  • str: output path.

Notes:

Requirement: 1. Start logging in the jupyter notebook:

    from IPython import get_ipython
    log_notebookp = 'log_notebook.log'
    open(log_notebookp, 'w').close()
    get_ipython().run_line_magic('logstart', f'{log_notebookp} over')


function read_plot

read_plot(p: str, safe: bool = False, test: bool = False, **kws)  Axes

Generate the plot from data, parameters and a script.

Args:

  • p (str): path of the plot saved using to_plot function.
  • safe (bool, optional): read as an image. Defaults to False.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.

function to_concat

to_concat(
    ps: list,
    how: str = 'h',
    use_imagemagick: bool = False,
    use_conda_env: bool = False,
    test: bool = False,
    **kws_outp
)  str

Concat images.

Args:

  • ps (list): list of paths.
  • how (str, optional): horizontal ('h') or vertical ('v'). Defaults to 'h'.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • str: path of the output.

function to_montage

to_montage(
    ps: list,
    layout: str,
    source_path: str = None,
    env_name: str = None,
    hspace: float = 0,
    vspace: float = 0,
    output_path: str = None,
    test: bool = False,
    **kws_outp
)  str

To montage.

Args:

  • ps (list): list of paths.
  • layout (str): layout of the images.
  • hspace (int, optional): horizontal space. Defaults to 0.
  • vspace (int, optional): vertical space. Defaults to 0.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • str: path of the output.

function to_gif

to_gif(
    ps: list,
    outp: str,
    duration: int = 200,
    loop: int = 0,
    optimize: bool = True
)  str

Convert to GIF.

Args:

  • ps (list): list of paths.
  • outp (str): output path.
  • duration (int, optional): duration. Defaults to 200.
  • loop (int, optional): loop or not. Defaults to 0.
  • optimize (bool, optional): optimize the size. Defaults to True.

Returns:

  • str: output path.

References:

  • 1. https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#gif
  • 2. https://stackoverflow.com/a/57751793/3521099

function to_data

to_data(path: str)  str

Convert to base64 string.

Args:

  • path (str): path of the input.

Returns: base64 string.
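
The conversion is presumably the standard base64 data-URI encoding; a self-contained sketch (`to_data_uri` is a hypothetical name, and the mime type is an assumption):

```python
# Encode a file's bytes as a base64 data-URI string, which is
# presumably what to_data produces (mime type is an assumption).
import base64

def to_data_uri(path: str, mime: str = "image/png") -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"
```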


function to_convert

to_convert(filep: str, outd: str = None, fmt: str = 'JPEG')  str

Convert format of image using PIL.

Args:

  • filep (str): input path.
  • outd (str, optional): output directory. Defaults to None.
  • fmt (str, optional): format of the output. Defaults to "JPEG".

Returns:

  • str: output path.

function to_raster

to_raster(
    plotp: str,
    dpi: int = 500,
    alpha: bool = False,
    trim: bool = False,
    force: bool = False,
    test: bool = False
)  str

Convert a vector image to a raster image.

Args:

  • plotp (str): input path.
  • dpi (int, optional): DPI. Defaults to 500.
  • alpha (bool, optional): transparency. Defaults to False.
  • trim (bool, optional): trim margins. Defaults to False.
  • force (bool, optional): overwrite output. Defaults to False.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • str: output path.

Notes:

  1. Runs a bash command: convert -density 300 -trim.

function to_rasters

to_rasters(plotd, ext='svg')

Convert many images to raster. Uses inkscape.

Args:

  • plotd (str): directory.
  • ext (str, optional): extension of the output. Defaults to 'svg'.

module roux.viz.line

For line plots.


function plot_range

plot_range(
    df00: DataFrame,
    colvalue: str,
    colindex: str,
    k: str,
    headsize: int = 15,
    headcolor: str = 'lightgray',
    ax: Axes = None
)  Axes

Plot range/intervals e.g. genome coordinates as lines.

Args:

  • df00 (pd.DataFrame): input data.
  • colvalue (str): column with values.
  • colindex (str): column with ids.
  • k (str): subset name.
  • headsize (int, optional): margin at top. Defaults to 15.
  • headcolor (str, optional): color of the margin. Defaults to 'lightgray'.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function plot_connections

plot_connections(
    dplot: DataFrame,
    label2xy: dict,
    colval: str = '$r_{s}$',
    line_scale: int = 40,
    legend_title: str = 'similarity',
    label2rename: dict = None,
    element2color: dict = None,
    xoff: float = 0,
    yoff: float = 0,
    rectangle: dict = {'width': 0.2, 'height': 0.32},
    params_text: dict = {'ha': 'center', 'va': 'center'},
    params_legend: dict = {'bbox_to_anchor': (1.1, 0.5), 'ncol': 1, 'frameon': False},
    legend_elements: list = [],
    params_line: dict = {'alpha': 1},
    ax: Axes = None,
    test: bool = False
)  Axes

Plot connections between points with annotations.

Args:

  • dplot (pd.DataFrame): input data.
  • label2xy (dict): label to position.
  • colval (str, optional): column with values. Defaults to '$r_{s}$'.
  • line_scale (int, optional): line_scale. Defaults to 40.
  • legend_title (str, optional): legend_title. Defaults to 'similarity'.
  • label2rename (dict, optional): label2rename. Defaults to None.
  • element2color (dict, optional): element2color. Defaults to None.
  • xoff (float, optional): xoff. Defaults to 0.
  • yoff (float, optional): yoff. Defaults to 0.
  • rectangle (dict, optional): rectangle. Defaults to {'width':0.2,'height':0.32}.
  • params_text (dict, optional): params_text. Defaults to {'ha':'center','va':'center'}.
  • params_legend (dict, optional): params_legend. Defaults to {'bbox_to_anchor':(1.1, 0.5), 'ncol':1, 'frameon':False}.
  • legend_elements (list, optional): legend_elements. Defaults to [].
  • params_line (dict, optional): params_line. Defaults to {'alpha':1}.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.

function plot_kinetics

plot_kinetics(
    df1: DataFrame,
    x: str,
    y: str,
    hue: str,
    cmap: str = 'Reds_r',
    ax: Axes = None,
    test: bool = False,
    kws_legend: dict = {},
    **kws_set
)  Axes

Plot time-dependent kinetic data.

Args:

  • df1 (pd.DataFrame): input data.
  • x (str): x column.
  • y (str): y column.
  • hue (str): hue column.
  • cmap (str, optional): colormap. Defaults to 'Reds_r'.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • test (bool, optional): test mode. Defaults to False.
  • kws_legend (dict, optional): legend parameters. Defaults to {}.

Returns:

  • plt.Axes: plt.Axes object.

function plot_steps

plot_steps(
    df1: DataFrame,
    col_step_name: str,
    col_step_size: str,
    ax: Axes = None,
    test: bool = False
)  Axes

Plot step-wise changes in numbers, e.g. for a filtering process.

Args:

  • df1 (pd.DataFrame): input data.
  • col_step_size (str): column containing the numbers.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.

module roux.viz

Global Variables

  • colors
  • figure
  • io
  • ax_
  • annot

module roux.viz.scatter

For scatter plots.


function plot_trendline

plot_trendline(
    dplot: DataFrame,
    colx: str,
    coly: str,
    params_plot: dict = {'color': 'r', 'lw': 2},
    poly: bool = False,
    lowess: bool = True,
    linestyle: str = 'solid',
    params_poly: dict = {'deg': 1},
    params_lowess: dict = {'frac': 0.7, 'it': 5},
    ax: Axes = None,
    **kws
)  Axes

Plot a trendline.

Args:

  • dplot (pd.DataFrame): input dataframe.
  • colx (str): x column.
  • coly (str): y column.
  • params_plot (dict, optional): parameters provided to the plot. Defaults to {'color':'r','lw':2}.
  • poly (bool, optional): apply polynomial function. Defaults to False.
  • lowess (bool, optional): apply lowess function. Defaults to True.
  • linestyle (str, optional): line style. Defaults to 'solid'.
  • params_poly (dict, optional): parameters provided to the polynomial function. Defaults to {'deg':1}.
  • params_lowess (dict, optional): parameters provided to the lowess function. Defaults to {'frac':0.7,'it':5}.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the plot function.

Returns:

  • plt.Axes: plt.Axes object.

TODOs: 1. Label with goodness of fit, r (y_hat vs y)


function plot_scatter

plot_scatter(
    dplot: DataFrame,
    colx: str,
    coly: str,
    colz: str = None,
    kind: str = 'scatter',
    trendline_method: str = 'poly',
    stat_method: str = 'spearman',
    bootstrapped: bool = False,
    cmap: str = 'Reds',
    label_colorbar: str = None,
    gridsize: int = 25,
    bbox_to_anchor: list = [1, 1],
    loc: str = 'upper left',
    title: str = None,
    params_plot: dict = {},
    params_plot_trendline: dict = {},
    params_set_label: dict = {},
    ax: Axes = None,
    **kws
)  Axes

Plot scatter.

Args:

  • dplot (pd.DataFrame): input dataframe.
  • colx (str): x column.
  • coly (str): y column.
  • colz (str, optional): z column. Defaults to None.
  • kind (str, optional): kind of scatter. Defaults to 'scatter'.
  • trendline_method (str, optional): trendline method ['poly','lowess']. Defaults to 'poly'.
  • stat_method (str, optional): method of annoted stats ['mlr',"spearman"]. Defaults to "spearman".
  • bootstrapped (bool, optional): bootstrap data. Defaults to False.
  • cmap (str, optional): colormap. Defaults to 'Reds'.
  • label_colorbar (str, optional): label of the colorbar. Defaults to None.
  • gridsize (int, optional): number of grids in the hexbin. Defaults to 25.
  • bbox_to_anchor (list, optional): location of the legend. Defaults to [1,1].
  • loc (str, optional): location of the legend. Defaults to 'upper left'.
  • title (str, optional): title of the plot. Defaults to None.
  • params_plot (dict, optional): parameters provided to the plot function. Defaults to {}.
  • params_plot_trendline (dict, optional): parameters provided to the plot_trendline function. Defaults to {}.
  • params_set_label (dict, optional): parameters provided to the set_label function. Defaults to {}.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the plot function.

Returns:

  • plt.Axes: plt.Axes object.

Notes:

For a rasterized scatter plot, set scatter_kws={'rasterized': True}.

TODOs: 1. Access the function as an attribute of roux-data i.e. rd.


function plot_qq

plot_qq(x: Series)  Axes

plot QQ.

Args:

  • x (pd.Series): input vector.

Returns:

  • plt.Axes: plt.Axes object.

function plot_ranks

plot_ranks(
    df1: DataFrame,
    colid: str,
    colx: str,
    coly: str = 'rank',
    ascending: bool = True,
    ax=None,
    **kws
)  Axes

Plot rankings.

Args:

  • df1 (pd.DataFrame): input data.
  • colid (str): column with unique ids.
  • colx (str): x column.
  • coly (str, optional): y column. Defaults to 'rank'.
  • ascending (bool, optional): rank in ascending order. Defaults to True.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the seaborn.scatterplot function.

Returns:

  • plt.Axes: plt.Axes object.
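
The rank computation underlying such a plot can be sketched with pandas alone — compute the rank column, then scatter rank against value (the column names here are illustrative):

```python
# Rank values before plotting them, as plot_rank presumably does
# internally (column names are illustrative).
import pandas as pd

df = pd.DataFrame({"id": list("abcd"), "score": [3.0, 1.0, 4.0, 2.0]})
df["rank"] = df["score"].rank(ascending=True)
```

The resulting `rank` column would then serve as the y axis of the scatter plot.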

function plot_volcano

plot_volcano(
    data: DataFrame,
    colx: str,
    coly: str,
    colindex: str,
    hue: str = 'x',
    style: str = 'P=0',
    show_labels: int = None,
    show_outlines: int = None,
    outline_colors: list = ['k'],
    collabel: str = None,
    show_line=True,
    line_pvalue=0.1,
    line_x=0.0,
    show_text: bool = True,
    text_increase: str = None,
    text_decrease: str = None,
    text_diff: str = None,
    legend: bool = False,
    verbose: bool = False,
    p_min: float = 0.01,
    ax=None,
    kws_legend: dict = {},
    **kws_scatterplot
)  Axes

[UNDER DEVELOPMENT] Volcano plot.

Parameters:

Keyword parameters:

Returns: plt.Axes

module roux.viz.sequence

For plotting sequences.


function plot_domain

plot_domain(
    d1: dict,
    x: float = 1,
    xoff: float = 0,
    y: float = 1,
    height: float = 0.8,
    ax: Axes = None,
    **kws
)  Axes

Plot protein domain.

Args:

  • d1 (dict): plotting data including intervals.
  • x (float, optional): x position. Defaults to 1.
  • xoff (float, optional): x-offset. Defaults to 0.
  • y (float, optional): y position. Defaults to 1.
  • height (float, optional): height. Defaults to 0.8.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function plot_protein

plot_protein(
    df: DataFrame,
    ax: Axes = None,
    label: str = None,
    alignby: str = None,
    test: bool = False,
    **kws
)  Axes

Plot protein.

Args:

  • df (pd.DataFrame): input data.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • label (str, optional): protein name. Defaults to None.
  • alignby (str, optional): align proteins by this domain. Defaults to None.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.

function plot_gene

plot_gene(
    df1: DataFrame,
    label: str = None,
    kws_plot: dict = {},
    test: bool = False,
    outd: str = None,
    ax: Axes = None,
    off_figw: float = 1,
    off_figh: float = 1
)  Axes

Plot genes.

Args:

  • df1 (pd.DataFrame): input data.
  • label (str, optional): label to show. Defaults to None.
  • kws_plot (dict, optional): parameters provided to the plot function. Defaults to {}.
  • test (bool, optional): test mode. Defaults to False.
  • outd (str, optional): output directory. Defaults to None.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • off_figw (float, optional): width offset. Defaults to 1.
  • off_figh (float, optional): height offset. Defaults to 1.

Returns:

  • plt.Axes: plt.Axes object.

function plot_genes_legend

plot_genes_legend(df: DataFrame, d1: dict)

Make the legends for the genes.

Args:

  • df (pd.DataFrame): input data.
  • d1 (dict): plotting data.

function plot_genes_data

plot_genes_data(
    df1: DataFrame,
    release: int,
    species: str,
    custom: bool = False,
    colsort: str = None,
    cmap: str = 'Spectral',
    fast: bool = False
) → tuple

Plot gene-wise data.

Args:

  • df1 (pd.DataFrame): input data.
  • release (int): Ensembl release.
  • species (str): species name.
  • custom (bool, optional): customised. Defaults to False.
  • colsort (str, optional): column to sort by. Defaults to None.
  • cmap (str, optional): colormap. Defaults to 'Spectral'.
  • fast (bool, optional): parallel processing. Defaults to False.

Returns:

  • tuple: (dataframe, dictionary)

function plot_genes

plot_genes(
    df1,
    custom=False,
    colsort=None,
    release=100,
    cmap='Spectral',
    **kws_plot_gene
)

Plot many genes.

Args:

  • df1 (pd.DataFrame): input data.
  • release (int): Ensembl release.
  • custom (bool, optional): customised. Defaults to False.
  • colsort (str, optional): column to sort by. Defaults to None.
  • cmap (str, optional): colormap. Defaults to 'Spectral'.

Keyword Args:

  • kws_plot_gene: parameters provided to the plot_genes_data function.

Returns:

  • tuple: (dataframe, dictionary)

module roux.viz.sets

For plotting sets.


function plot_venn

plot_venn(
    ds1: Series,
    ax: Axes = None,
    figsize: tuple = [2.5, 2.5],
    show_n: bool = True,
    outmore=False,
    **kws
) → Axes

Plot Venn diagram.

Args:

  • ds1 (pd.Series): input pandas.Series or dictionary. Subsets in the index levels, mapped to counts.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • figsize (tuple, optional): figure size. Defaults to [2.5,2.5].
  • show_n (bool, optional): show sample sizes. Defaults to True.

Returns:

  • plt.Axes: plt.Axes object.
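The `ds1` input maps subset membership (encoded in the index levels) to counts. A minimal sketch of building such a series with pandas; the set names and counts below are hypothetical, for illustration only:

```python
import pandas as pd

# Hypothetical subset counts for two sets, A and B:
# each index level is a boolean membership flag, values are counts.
ds1 = pd.Series(
    {
        (True, False): 10,  # only in A
        (False, True): 7,   # only in B
        (True, True): 3,    # in both A and B
    }
)
ds1.index.names = ["A", "B"]
print(ds1)
```

A series shaped like this could then be passed as the first argument to `plot_venn` (or `plot_intersections` below).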

function plot_intersections

plot_intersections(
    ds1: Series,
    item_name: str = None,
    figsize: tuple = [4, 4],
    text_width: float = 2,
    yorder: list = None,
    sort_by: str = 'cardinality',
    sort_categories_by: str = None,
    element_size: int = 40,
    facecolor: str = 'gray',
    bari_annot: int = None,
    totals_bar: bool = False,
    totals_text: bool = True,
    intersections_ylabel: float = None,
    intersections_min: float = None,
    test: bool = False,
    annot_text: bool = False,
    set_ylabelx: float = -0.25,
    set_ylabely: float = 0.5,
    **kws
) → Axes

Plot upset plot.

Args:

  • ds1 (pd.Series): input vector.
  • item_name (str, optional): name of items. Defaults to None.
  • figsize (tuple, optional): figure size. Defaults to [4,4].
  • text_width (float, optional): max. width of the text. Defaults to 2.
  • yorder (list, optional): order of y elements. Defaults to None.
  • sort_by (str, optional): sorting method. Defaults to 'cardinality'.
  • sort_categories_by (str, optional): sorting method. Defaults to None.
  • element_size (int, optional): size of elements. Defaults to 40.
  • facecolor (str, optional): facecolor. Defaults to 'gray'.
  • bari_annot (int, optional): annotate nth bar. Defaults to None.
  • totals_text (bool, optional): show totals. Defaults to True.
  • intersections_ylabel (float, optional): y-label of the intersections. Defaults to None.
  • intersections_min (float, optional): intersection minimum to show. Defaults to None.
  • test (bool, optional): test mode. Defaults to False.
  • annot_text (bool, optional): annotate text. Defaults to False.
  • set_ylabelx (float, optional): x position of the ylabel. Defaults to -0.25.
  • set_ylabely (float, optional): y position of the ylabel. Defaults to 0.5.

Keyword Args:

  • kws: parameters provided to the upset.plot function.

Returns:

  • plt.Axes: plt.Axes object.

Notes:

sort_by: {'cardinality', 'degree'}. If 'cardinality', subsets are listed from largest to smallest. If 'degree', they are listed in order of the number of categories intersected.

sort_categories_by: {'cardinality', None}. Whether to sort the categories by total cardinality, or leave them in the provided order.

References: https://upsetplot.readthedocs.io/en/stable/api.html


function plot_enrichment

plot_enrichment(
    data: DataFrame,
    x: str,
    y: str,
    s: str,
    hue='Q',
    xlabel=None,
    ylabel='significance\n(-log10(Q))',
    size: int = None,
    color: str = None,
    annots_side: int = 5,
    annots_side_labels=None,
    coff_fdr: float = None,
    xlim: tuple = None,
    xlim_off: float = 0.2,
    ylim: tuple = None,
    ax: Axes = None,
    break_pt: int = 25,
    annot_coff_fdr: bool = False,
    kws_annot: dict = {'loc': 'right', 'offx3': 0.15},
    returns='ax',
    **kwargs
) → Axes

Plot enrichment stats.

Args:

  • data (pd.DataFrame): input data.
  • x (str): x column.
  • y (str): y column.
  • s (str): size column.
  • size (int, optional): size of the points. Defaults to None.
  • color (str, optional): color of the points. Defaults to None.
  • annots_side (int, optional): how many labels to show on the side. Defaults to 5.
  • coff_fdr (float, optional): FDR cutoff. Defaults to None.
  • xlim (tuple, optional): x-axis limits. Defaults to None.
  • xlim_off (float, optional): x-offset on limits. Defaults to 0.2.
  • ylim (tuple, optional): y-axis limits. Defaults to None.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • break_pt (int, optional): break point (' ') for the labels. Defaults to 25.
  • annot_coff_fdr (bool, optional): show FDR cutoff. Defaults to False.
  • kws_annot (dict, optional): parameters provided to the annot_side function. Defaults to dict(loc='right', annot_count_max=5, offx3=0.15).

Keyword Args:

  • kwargs: parameters provided to the sns.scatterplot function.

Returns:

  • plt.Axes: plt.Axes object.

module roux.workflow.df

For management of tables.


function exclude_items

exclude_items(df1: DataFrame, metadata: dict) → DataFrame

Exclude items from the table with the workflow info.

Args:

  • df1 (pd.DataFrame): input table.
  • metadata (dict): metadata of the repository.

Returns:

  • pd.DataFrame: output.
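The exclusion can be sketched as filtering out rows whose values appear in the metadata's exclude lists. The metadata layout below (an `exclude` key mapping columns to values) is an assumption for illustration, not necessarily the repository's actual schema:

```python
import pandas as pd

def exclude_items_sketch(df1: pd.DataFrame, metadata: dict) -> pd.DataFrame:
    """Drop rows whose column values are listed under metadata['exclude'].

    Hypothetical helper illustrating the idea; not roux's implementation.
    """
    for col, values in metadata.get("exclude", {}).items():
        df1 = df1.loc[~df1[col].isin(values)]
    return df1

df1 = pd.DataFrame({"gene": ["a", "b", "c"], "score": [1, 2, 3]})
metadata = {"exclude": {"gene": ["b"]}}
out = exclude_items_sketch(df1, metadata)
print(out["gene"].tolist())  # → ['a', 'c']
```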

module roux.workflow.function

For function management.


function get_quoted_path

get_quoted_path(s1: str) → str

Quoted paths.

Args:

  • s1 (str): path.

Returns:

  • str: quoted path.
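Quoting matters when paths containing spaces are interpolated into shell commands. A minimal stdlib equivalent of the idea (not necessarily roux's implementation):

```python
import shlex

def quote_path_sketch(s1: str) -> str:
    """Quote a path only when the shell would otherwise mis-split it.

    Illustrative helper; shlex.quote returns the input unchanged
    when it contains no shell-special characters.
    """
    return shlex.quote(s1)

print(quote_path_sketch("data/my file.tsv"))  # → 'data/my file.tsv' (quoted)
print(quote_path_sketch("data/file.tsv"))    # → data/file.tsv (unchanged)
```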

function get_path

get_path(
    s: str,
    validate: bool,
    prefixes=['data/', 'metadata/', 'plot/'],
    test=False
) → str

Extract paths from a line of code.

Args:

  • s (str): line of code.
  • validate (bool): validate the output.
  • prefixes (list, optional): allowed prefixes. Defaults to ['data/','metadata/','plot/'].
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • str: path.

TODOs: 1. Use wildcards i.e. *'s.
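Extracting a path from a line of code amounts to finding a quoted string that starts with one of the allowed prefixes. A hedged regex sketch of that idea (roux's actual parser may work differently; `get_path_sketch` is a hypothetical name):

```python
import re

def get_path_sketch(s: str, prefixes=("data/", "metadata/", "plot/")):
    """Return the first quoted path with an allowed prefix, else None."""
    for m in re.finditer(r"['\"]([^'\"]+)['\"]", s):
        if m.group(1).startswith(tuple(prefixes)):
            return m.group(1)
    return None

print(get_path_sketch("df.to_csv('data/out/table.tsv', sep=',')"))  # → data/out/table.tsv
```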


function remove_dirs_from_outputs

remove_dirs_from_outputs(outputs: list, test: bool = False) → list

Remove directories from the output paths.

Args:

  • outputs (list): output paths.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • list: paths.

function get_ios

get_ios(l: list, test=False) → tuple

Get input and output (IO) paths.

Args:

  • l (list): list of lines of code.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • tuple: paths of inputs and outputs.

function get_name

get_name(s: str, i: int, sep_step: str = '## step') → str

Get name of the function.

Args:

  • s (str): lines in markdown format.
  • sep_step (str, optional): separator marking the start of a step. Defaults to "## step".
  • i (int): index of the step.

Returns:

  • str: name of the function.
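Deriving a function name from a markdown step header can be sketched as follows. The header format follows the `sep_step='## step'` convention above; the slugification details are assumptions about how such a name could be formed, not roux's exact logic:

```python
def get_name_sketch(s: str, i: int, sep_step: str = "## step") -> str:
    """Take the i-th '## step' header in the markdown and slugify it."""
    headers = [ln for ln in s.splitlines() if ln.startswith(sep_step)]
    # e.g. '## step load data' -> 'load_data'
    return headers[i].replace(sep_step, "").strip().replace(" ", "_")

md = "## step load data\ndf = read_table(...)\n## step plot results\nplot(df)"
print(get_name_sketch(md, 1))  # → plot_results
```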

function get_step

get_step(
    l: list,
    name: str,
    sep_step: str = '## step',
    sep_step_end: str = '## tests',
    test=False,
    tab='    '
) → dict

Get code for a step.

Args:

  • l (list): list of lines of code
  • name (str): name of the function.
  • test (bool, optional): test mode. Defaults to False.
  • tab (str, optional): tab format. Defaults to ' '.

Returns:

  • dict: step name to code map.

function to_task

to_task(
    notebookp,
    task=None,
    sep_step: str = '## step',
    sep_step_end: str = '## tests',
    notebook_suffix: str = '_v',
    force=False,
    validate=False,
    path_prefix=None,
    verbose=True,
    test=False
) → str

Get the lines of code for a task (script to be saved as an individual .py file).

Args:

  • notebookp (type): path of the notebook.
  • sep_step (str, optional): separator marking the start of a step. Defaults to "## step".
  • sep_step_end (str, optional): separator marking the end of a step. Defaults to "## tests".
  • notebook_suffix (str, optional): suffix of the notebook file to be considered as a "task".
  • force (bool, optional): overwrite output. Defaults to False.
  • validate (bool, optional): validate output. Defaults to False.
  • path_prefix (type, optional): prefix to the path. Defaults to None.
  • verbose (bool, optional): show verbose. Defaults to True.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • str: lines of the code.

module roux.workflow.io

For input/output of workflow.


function clear_variables

clear_variables(dtype=None, variables=None)

Clear variables of a given data type from the workspace.


function clear_dataframes

clear_dataframes()

function get_lines

get_lines(p: str, keep_comments: bool = True) → list

Get lines of code from notebook.

Args:

  • p (str): path to notebook.
  • keep_comments (bool, optional): keep comments. Defaults to True.

Returns:

  • list: lines.
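Since .ipynb files are JSON, extracting the code lines can be sketched with the stdlib alone. roux may use a notebook library instead; the cell structure below is the standard nbformat layout, and `get_lines_sketch` is a hypothetical helper:

```python
import io
import json

def get_lines_sketch(fh, keep_comments: bool = True) -> list:
    """Collect source lines from the code cells of a notebook's JSON."""
    nb = json.load(fh)
    lines = []
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        for ln in cell["source"]:
            ln = ln.rstrip("\n")
            if not keep_comments and ln.lstrip().startswith("#"):
                continue
            lines.append(ln)
    return lines

# A tiny in-memory notebook with one markdown and one code cell.
nb_json = json.dumps({"cells": [
    {"cell_type": "markdown", "source": ["# title\n"]},
    {"cell_type": "code", "source": ["# load\n", "import pandas as pd\n"]},
]})
print(get_lines_sketch(io.StringIO(nb_json), keep_comments=False))  # → ['import pandas as pd']
```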

function to_py

to_py(
    notebookp: str,
    pyp: str = None,
    force: bool = False,
    **kws_get_lines
) → str

To python script (.py).

Args:

  • notebookp (str): path to the notebook path.
  • pyp (str, optional): path to the python file. Defaults to None.
  • force (bool, optional): overwrite output. Defaults to False.

Returns:

  • str: path of the output.

function import_from_file

import_from_file(pyp: str)

Import functions from python (.py) file.

Args:

  • pyp (str): python file (.py).

function to_parameters

to_parameters(f: object, test: bool = False) → dict

Get function to parameters map.

Args:

  • f (object): function.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • dict: output.

function read_nb_md

read_nb_md(p: str) → list

Read notebook's documentation in the markdown cells.

Args:

  • p (str): path of the notebook.

Returns:

  • list: lines of the strings.

function read_config

read_config(p: str, config_base=None, convert_dtype: bool = True)

Read configuration.

Parameters:

  • p (str): input path.

function read_metadata

read_metadata(
    p: str = './metadata.yaml',
    ind: str = './metadata/',
    max_paths: int = 30,
    **kws_read_config
) → dict

Read metadata.

Args:

  • p (str, optional): file containing metadata. Defaults to './metadata.yaml'.
  • ind (str, optional): directory containing specific settings and other data to be incorporated into metadata. Defaults to './metadata/'.

Returns:

  • dict: output.

TODOs: 1. Metadata files include colors.yaml, database.yaml, constants.yaml etc.
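Incorporating per-file settings from `./metadata/` into the base `metadata.yaml` is essentially a dictionary merge. A minimal sketch with plain dicts; the key names and the shallow-merge behavior are assumptions for illustration, and the YAML loading is omitted:

```python
def merge_metadata_sketch(base: dict, overrides: dict) -> dict:
    """Shallow-merge directory-level settings over the base metadata."""
    out = dict(base)
    out.update(overrides)
    return out

base = {"species": "homo_sapiens", "release": 100}
overrides = {"release": 110}  # e.g. values loaded from a file under ./metadata/
print(merge_metadata_sketch(base, overrides))  # → {'species': 'homo_sapiens', 'release': 110}
```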


function to_info

to_info(p: str = '*_*_v*.ipynb', outp: str = 'README.md') → str

Save README.md file.

Args:

  • p (str, optional): path of the notebook files that would be converted to "tasks". Defaults to '*_*_v*.ipynb'.
  • outp (str, optional): path of the output file. Defaults to 'README.md'.

Returns:

  • str: path of the output file.

function make_symlinks

make_symlinks(
    d1: dict,
    d2: dict,
    project_path: str,
    data: bool = True,
    notebook_suffix: str = '_v',
    test: bool = False
) → list

Make symbolic links.

Args:

  • d1 (dict): project name to repo name.
  • d2 (dict): task name to a tuple of (from project name, to project name).
  • project_path (str): path of the repository.
  • data (bool, optional): make links for the data. Defaults to True.
  • notebook_suffix (str, optional): suffix of the notebook file to be considered as a "task".
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • list: list of commands.

function to_workflow

to_workflow(df2: DataFrame, workflowp: str, tab: str = '    ') → str

Save workflow file.

Args:

  • df2 (pd.DataFrame): input table.
  • workflowp (str): path of the workflow file.
  • tab (str, optional): tab format. Defaults to ' '.

Returns:

  • str: path of the workflow file.

function create_workflow_report

create_workflow_report(workflowp: str, env: str) → int

Create report for the workflow run.

Parameters:

  • workflowp (str): path of the workflow file (snakemake).
  • env (str): name of the conda virtual environment where the required workflow dependency (i.e. snakemake) is available.

function to_diff_notebooks

to_diff_notebooks(
    notebook_paths,
    url_prefix='https://localhost:8888/nbdime/difftool?',
    remove_prefix='file://',
    verbose=True
) → list

"Diff" notebooks using nbdiff (https://nbdime.readthedocs.io/en/latest/)

TODOs: 1. Deprecate if the functionality is added to nbdiff-web.

module roux.workflow.knit

For workflow set up.


function nb_to_py

nb_to_py(
    notebookp: str,
    test: bool = False,
    validate: bool = True,
    sep_step: str = '## step',
    notebook_suffix: str = '_v'
)

Notebook to script.

Args:

  • notebookp (str): path to the notebook.
  • sep_step (str, optional): separator marking the start of a step. Defaults to "## step".
  • notebook_suffix (str, optional): suffix of the notebook file to be considered as a "task".
  • test (bool, optional): test mode. Defaults to False.
  • validate (bool, optional): validate. Defaults to True.

TODOs: 1. Add check_outputs parameter to only filter out non-executable code (i.e. tests) if False else edit the code.


function sort_stepns

sort_stepns(l: list) → list

Sort steps (functions) of a task (script).

Args:

  • l (list): list of steps.

Returns:

  • list: sorted list of steps.

module roux.workflow

Global Variables

  • io
  • df

module roux.workflow.monitor

For workflow monitors.


function plot_workflow_log

plot_workflow_log(dplot: DataFrame) → Axes

Plot workflow log.

Args:

  • dplot (pd.DataFrame): input data (dparam).

Returns:

  • plt.Axes: output.

TODOs: 1. use the statistics tagged as ## stats.

module roux.workflow.task

For task management.


function run_notebook

run_notebook(
    parameters: dict,
    input_notebook_path: str,
    kernel: str,
    output_notebook_path: str = None,
    test=False,
    verbose=False,
    **kws_papermill
)

[UNDER DEVELOPMENT] Execute a single notebook.


function run_notebooks

run_notebooks(
    input_notebook_path: str,
    inputs: list,
    output_path: str,
    kernel: str,
    fast: bool = False,
    test1: bool = False,
    force: bool = False,
    test: bool = False,
    verbose: bool = False,
    **kws_papermill
)

[UNDER DEVELOPMENT] Execute a list of notebooks.

TODOs: 1. Integrate with apply_on_paths for parallel processing etc. 2. Reporting by quarto?

module roux.workflow.version

For version control.


function git_commit

git_commit(repop: str, suffix_message: str = '')

Version control.

Args:

  • repop (str): path to the repository.
  • suffix_message (str, optional): add suffix to the version (commit) message. Defaults to ''.
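The commit itself can be sketched as assembling a git command for the repository. roux's internals may differ (e.g. it could use a git library); this only illustrates the kind of command that would run, and the message format is an assumption:

```python
def git_commit_cmd_sketch(repop: str, suffix_message: str = "") -> list:
    """Build a git commit command for the repo at `repop`.

    Illustrative only; the auto-generated message format is hypothetical.
    """
    message = f"auto-commit{suffix_message}"
    # -C runs git as if started in repop; -am stages tracked files and commits.
    return ["git", "-C", repop, "commit", "-am", message]

print(git_commit_cmd_sketch(".", suffix_message=" (docs)"))
```

The returned list could then be passed to `subprocess.run` to perform the commit.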

module roux.workflow.workflow

For workflow management.


function get_scripts

get_scripts(
    ps: list,
    notebook_prefix: str = '\\d{2}',
    notebook_suffix: str = '_v\\d{2}',
    test: bool = False,
    fast: bool = True,
    cores: int = 6,
    force: bool = False,
    tab: str = '    ',
    **kws
) → DataFrame

Get scripts.

Args:

  • ps (list): paths.
  • notebook_prefix (str, optional): prefix of the notebook file to be considered as a "task".
  • notebook_suffix (str, optional): suffix of the notebook file to be considered as a "task".
  • test (bool, optional): test mode. Defaults to False.
  • fast (bool, optional): parallel processing. Defaults to True.
  • cores (int, optional): cores to use. Defaults to 6.
  • force (bool, optional): overwrite the outputs. Defaults to False.
  • tab (str, optional): tab in spaces. Defaults to ' '.

Returns:

  • pd.DataFrame: output table.
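The `notebook_prefix` and `notebook_suffix` defaults imply that task notebooks are named like `01_curate_v01.ipynb`. A sketch of the matching; how roux actually assembles the pattern is an assumption:

```python
import re

def is_task_notebook_sketch(p: str,
                            notebook_prefix: str = r"\d{2}",
                            notebook_suffix: str = r"_v\d{2}") -> bool:
    """Check whether a filename matches the prefix/suffix task convention."""
    pattern = rf"^{notebook_prefix}.*{notebook_suffix}\.ipynb$"
    return re.match(pattern, p) is not None

print(is_task_notebook_sketch("01_curate_v01.ipynb"))  # → True
print(is_task_notebook_sketch("scratch.ipynb"))        # → False
```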

function to_scripts

to_scripts(
    packagep: str,
    notebooksdp: str,
    validate: bool = False,
    ps: list = None,
    notebook_prefix: str = '\\d{2}',
    notebook_suffix: str = '_v\\d{2}',
    scripts: bool = True,
    workflow: bool = True,
    sep_step: str = '## step',
    todos: bool = False,
    git: bool = True,
    clean: bool = False,
    test: bool = False,
    force: bool = True,
    tab: str = '    ',
    **kws
)

To scripts.

Args:

  • packagep (str): path to the package.
  • notebooksdp (str, optional): path to the notebooks. Defaults to None.
  • validate (bool, optional): validate if functions are formatted correctly. Defaults to False.
  • ps (list, optional): paths. Defaults to None.
  • notebook_prefix (str, optional): prefix of the notebook file to be considered as a "task".
  • notebook_suffix (str, optional): suffix of the notebook file to be considered as a "task".
  • scripts (bool, optional): make scripts. Defaults to True.
  • workflow (bool, optional): make workflow file. Defaults to True.
  • sep_step (str, optional): separator marking the start of a step. Defaults to "## step".
  • todos (bool, optional): show todos. Defaults to False.
  • git (bool, optional): save version. Defaults to True.
  • clean (bool, optional): clean temporary files. Defaults to False.
  • test (bool, optional): test mode. Defaults to False.
  • force (bool, optional): overwrite outputs. Defaults to True.
  • tab (str, optional): tab size. Defaults to ' '.

Keyword parameters:

  • kws: parameters provided to the get_script function, including sep_step and sep_step_end

TODOs:

  • 1. For version control, use https://github.com/jupyterlab/jupyterlab-git.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

roux-0.0.7.tar.gz (777.2 kB)

Uploaded Source

Built Distribution

roux-0.0.7-py3-none-any.whl (493.1 kB)

Uploaded Python 3
