roux

Convenience functions in Python.
Examples · Explore the API

Examples

⌗ Dataframes.
⌗⌗ Paired Dataframes.
💾 General Input/Output.
⬤⬤ Sets.
🔤 Strings encoding/decoding.
🗃 File paths Input/Output.
🏷 Classification.
✨ Clustering.
✨ Correlations.
✨ Differences.
📈 Data fitting.
📊 Data normalization.
⬤⬤ Comparison between sets.
📈🔖 Annotating visualisations.
🔧 Subplot-level adjustments.
📈 Diagrams.
📈 Distribution plots.
📈 Wrapper around Series plotting functions.
📈📈 Annotating figure.
📈💾 Visualizations Input/Output.
📈 Line plots.
📈 Scatter plots.
📈⬤⬤ Plots of sets.
📈🎨✨ Visualizations theming.
⚙️🗺️ Reading multiple configs.
⚙️⏩ Running multiple tasks.
⚙️⏩ Workflow using notebooks

Installation

pip install roux              # with basic dependencies  
pip install roux[all]         # with all the additional dependencies (recommended). 

With additional dependencies as required:

pip install roux[viz]         # for visualizations e.g. seaborn etc.
pip install roux[data]        # for data operations e.g. reading excel files etc.
pip install roux[stat]        # for statistics e.g. statsmodels etc.
pip install roux[fast]        # for faster processing e.g. parallelization etc.
pip install roux[workflow]    # for workflow operations e.g. omegaconf etc.
pip install roux[interactive] # for interactive operations in jupyter notebook e.g. watermark, icecream etc.

Command-line usage

ℹ️ Available command line tools and their usage.
roux --help

⭐ Remove *'s (wildcard imports) from a Jupyter notebook.
roux removestar path/to/notebook

🗺️ Read configuration.
roux read-config path/to/file

🗺️ Read metadata.
roux read-metadata path/to/file

📁 Find the latest and the oldest file in a list.
roux read-ps list_of_paths

💾 Backup a directory with a timestamp (ISO).
roux backup path/to/directory

How to cite?

  1. Using BibTeX:
@software{Dandage_roux,
  title   = {roux: Streamlined and Versatile Data Processing Toolkit},
  author  = {Dandage, Rohan},
  year    = {2024},
  url     = {https://zenodo.org/doi/10.5281/zenodo.2682670},
  version = {0.1.2},
  note    = {The URL is a DOI link to the permanent archive of the software.},
}
  2. Using the DOI link, or

  3. Using citation information from the CITATION.CFF file.

Future directions, for which contributions are welcome

  • Addition of visualization function as attributes to rd dataframes.
  • Refactoring of the workflow functions.

Similar projects

API

module roux.viz.compare

For comparative plots.


function plot_comparisons

plot_comparisons(
    plot_data,
    x,
    ax=None,
    output_dir_path=None,
    force=False,
    return_path=False
)

Parameters:

  • plot_data: output of .stat.compare.get_comparison

Notes:

sample type: different sample of the same data.

module roux.stat.cluster

For clustering data.


function check_clusters

check_clusters(df: DataFrame)

Check clusters.

Args:

  • df (DataFrame): dataframe.

function get_clusters

get_clusters(
    X: np.array,
    n_clusters: int,
    random_state=88,
    params={},
    test=False
) -> dict

Get clusters.

Args:

  • X (np.array): vector
  • n_clusters (int): int
  • random_state (int, optional): random state. Defaults to 88.
  • params (dict, optional): parameters for the MiniBatchKMeans function. Defaults to {}.
  • test (bool, optional): test. Defaults to False.

Returns: dict:


function get_n_clusters_optimum

get_n_clusters_optimum(df5: DataFrame, test=False) -> int

Get n clusters optimum.

Args:

  • df5 (DataFrame): input dataframe
  • test (bool, optional): test. Defaults to False.

Returns:

  • int: knee point.

function plot_silhouette

plot_silhouette(df: DataFrame, n_clusters_optimum=None, ax=None)

Plot silhouette

Args:

  • df (DataFrame): input dataframe.
  • n_clusters_optimum (int, optional): number of clusters. Defaults to None.
  • ax (axes, optional): axes object. Defaults to None.

Returns:

  • ax (axes): axes object.

function get_clusters_optimum

get_clusters_optimum(
    X: np.array,
    n_clusters=range(2, 11),
    params_clustering={},
    test=False
) -> dict

Get optimum clusters.

Args:

  • X (np.array): samples to cluster in indexed format.
  • n_clusters (range, optional): numbers of clusters to evaluate. Defaults to range(2,11).
  • params_clustering (dict, optional): parameters provided to get_clusters. Defaults to {}.
  • test (bool, optional): test. Defaults to False.

Returns:

  • dict: description

function get_gmm_params

get_gmm_params(g, x, n_clusters=2, test=False)

Intersection point of a two-peak Gaussian mixture model (GMM).

Args:

  • out (str): coff only or params for all the parameters.

function get_gmm_intersection

get_gmm_intersection(x, two_pdfs, means, weights, test=False)
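
The source gives no description for get_gmm_intersection. As a rough illustration of what the intersection of two weighted Gaussian components means, the sketch below finds the point between the two means where the weighted densities are equal, using bisection. It uses only the standard library; `gaussian_intersection` and its arguments are hypothetical and are not roux's implementation.

```python
from statistics import NormalDist


def gaussian_intersection(means, sigmas, weights, tol=1e-9):
    """Find x between the two means where the weighted PDFs are equal."""
    # Order the two components by their mean.
    (m1, s1, w1), (m2, s2, w2) = sorted(zip(means, sigmas, weights))
    d1, d2 = NormalDist(m1, s1), NormalDist(m2, s2)

    def diff(x):
        return w1 * d1.pdf(x) - w2 * d2.pdf(x)

    lo, hi = m1, m2
    # Bisection needs a sign change between the two means.
    if diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("no single crossing between the means")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


cutoff = gaussian_intersection(means=[0.0, 4.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
```

With equal weights and equal widths, the cutoff falls midway between the two means; unequal weights shift it towards the lighter component.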

function cluster_1d

cluster_1d(
    ds: Series,
    n_clusters: int,
    clf_type='gmm',
    random_state=1,
    test=False,
    returns=['coff'],
    **kws_clf
) -> dict

Cluster 1D data.

Args:

  • ds (Series): series.
  • n_clusters (int): number of clusters.
  • clf_type (str, optional): type of classification. Defaults to 'gmm'.
  • random_state (int, optional): random state. Defaults to 1.
  • test (bool, optional): test. Defaults to False.
  • returns (list, optional): return format. Defaults to ['coff'].
  • ax (axes, optional): axes object. Defaults to None.

Raises:

  • ValueError: clf_type

Returns:

  • dict: description

function get_pos_umap

get_pos_umap(df1, spread=100, test=False, k='', **kws) -> DataFrame

Get positions of the umap points.

Args:

  • df1 (DataFrame): input dataframe
  • spread (int, optional): spread extent. Defaults to 100.
  • test (bool, optional): test. Defaults to False.
  • k (str, optional): number of clusters. Defaults to ''.

Returns:

  • DataFrame: output dataframe.

module roux.workflow.version

For version control.


function git_commit

git_commit(repop: str, suffix_message: str = '', force=False)

Version control.

Args:

  • repop (str): path to the repository.
  • suffix_message (str, optional): add suffix to the version (commit) message. Defaults to ''.

module roux.workflow.log


function print_parameters

print_parameters(d: dict)

Print a dictionary with parameters as lines of code.

Parameters:

  • d (dict): dictionary with parameters
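
A minimal sketch of the idea, assuming that "lines of code" means one assignment per parameter. `format_parameters` is a hypothetical stand-in written for this illustration, not the roux function itself.

```python
def format_parameters(d: dict) -> list:
    """Return one `name = value` line per parameter, using repr for the values."""
    return [f"{name} = {value!r}" for name, value in d.items()]


lines = format_parameters({"input_path": "data/in.tsv", "n_clusters": 3, "force": False})
print("\n".join(lines))
```

Using `repr` keeps strings quoted, so the printed lines can be pasted back into a notebook cell.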

function test_params

test_params(params, i=0)

module roux.workflow.io

For input/output of workflow.


function clear_variables

clear_variables(dtype=None, variables=None)

Clear dataframes from the workspace.


function clear_dataframes

clear_dataframes()

function to_py

to_py(
    notebookp: str,
    pyp: str = None,
    force: bool = False,
    **kws_get_lines
) -> str

To python script (.py).

Args:

  • notebookp (str): path to the notebook.
  • pyp (str, optional): path to the python file. Defaults to None.
  • force (bool, optional): overwrite output. Defaults to False.

Returns:

  • str: path of the output.

function to_nb_cells

to_nb_cells(notebook, outp, new_cells, validate_diff=None)

Replace notebook cells.


function import_from_file

import_from_file(pyp: str)

Import functions from python (.py) file.

Args:

  • pyp (str): python file (.py).
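
Importing from a file path can be sketched with the standard library's importlib. This shows the general technique only; whether roux uses the same mechanism is an assumption, and `load_module_from_path` is a hypothetical name.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(py_path: str):
    """Load a .py file as a module object without adding it to sys.path."""
    name = Path(py_path).stem
    spec = importlib.util.spec_from_file_location(name, py_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module


# Write a throwaway script so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "helpers_demo.py"
    script.write_text("def double(x):\n    return 2 * x\n")
    helpers = load_module_from_path(str(script))
    doubled = helpers.double(21)
```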

function infer_parameters

infer_parameters(input_value, default_value)

Infer the input values and post warning messages.

Parameters:

  • input_value: the primary value.
  • default_value: the default/alternative/inferred value.

Returns: Inferred value.


function to_parameters

to_parameters(f: object, test: bool = False) -> dict

Get function to parameters map.

Args:

  • f (object): function.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • dict: output.
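
One way to build a function-to-parameters map is through `inspect.signature`. The sketch below assumes the map goes from parameter name to default value; the exact structure returned by to_parameters is not stated in the source, and `parameters_with_defaults` is a hypothetical name.

```python
import inspect


def parameters_with_defaults(f) -> dict:
    """Map each parameter name of `f` to its default (None if it has no default)."""
    return {
        name: (None if p.default is inspect.Parameter.empty else p.default)
        for name, p in inspect.signature(f).parameters.items()
    }


def example(path, force=False, sep="\t"):
    return path


params = parameters_with_defaults(example)
```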

function read_config

read_config(
    p: str,
    config_base=None,
    inputs=None,
    append_to_key=None,
    convert_dtype: bool = True,
    verbose: bool = True
)

Read configuration.

Parameters:

  • p (str): input path.
  • config_base: base config with the inputs for the interpolations

function read_metadata

read_metadata(
    p: str,
    ind: str = None,
    max_paths: int = 30,
    config_path_key: str = 'config_path',
    config_paths: list = [],
    config_paths_auto=False,
    verbose: bool = False,
    **kws_read_config
) -> dict

Read metadata.

Args:

  • p (str, optional): file containing metadata. Defaults to './metadata.yaml'.
  • ind (str, optional): directory containing specific settings and other data to be incorporated into metadata. Defaults to './metadata/'.

Returns:

  • dict: output.

function to_workflow

to_workflow(df2: DataFrame, workflowp: str, tab: str = '    ') -> str

Save workflow file.

Args:

  • df2 (pd.DataFrame): input table.
  • workflowp (str): path of the workflow file.
  • tab (str, optional): tab format. Defaults to '    ' (four spaces).

Returns:

  • str: path of the workflow file.

function create_workflow_report

create_workflow_report(workflowp: str, env: str) -> int

Create report for the workflow run.

Parameters:

  • workflowp (str): path of the workflow file (snakemake).
  • env (str): name of the conda virtual environment where the required workflow dependency, i.e. snakemake, is available.

function replacestar

replacestar(
    input_path,
    output_path=None,
    replace_from='from roux.global_imports import *',
    in_place: bool = False,
    attributes={'pandarallel': ['parallel_apply'], 'rd': ['.rd.', '.log.']},
    verbose: bool = False,
    test: bool = False,
    **kws_fix_code
)

Post-development, replace wildcard (global) import from roux i.e. 'from roux.global_imports import *' with individual imports with accompanying documentation.

Usage: For notebooks developed using roux.global_imports.

Parameters:

  • input_path (str): path to the .py or .ipynb file.
  • output_path (str): path to the output.
  • py_path (str): path to the intermediate .py file.
  • in_place (bool): whether to carry out the modification in place.
  • return_replacements (bool): return dict with strings to be replaced.
  • attributes (dict): attribute names mapped to their keywords for searching.
  • verbose (bool): verbose toggle.
  • test (bool): test-mode if output file not provided and in-place modification not allowed.

Returns:

  • output_path (str): path to the modified notebook.

Examples:

roux replacestar -i notebook.ipynb
roux replacestar -i notebooks/*.ipynb


function replacestar_ruff

replacestar_ruff(
    p: str,
    outp: str,
    replace: str = 'from roux.global_imports import *',
    verbose=True
) -> str

function post_code

post_code(p: str, lint: bool, format: bool, verbose: bool = True)

function to_clean_nb

to_clean_nb(
    p,
    outp: str = None,
    in_place: bool = False,
    temp_outp: str = None,
    clear_outputs=False,
    drop_code_lines_containing=['.*%run .*', '^#\\s*.*=.*', '^#\\s*".*', "^#\\s*'.*", '^#\\s*f".*', "^#\\s*f'.*", '^#\\s*df.*', '^#\\s*.*kws_.*', '^\\s*#\\s*$', '^\\s*#\\s*break\\s*$', '\\[X', '\\[old ', '#old', '# old', '\\[not used', '# not used', '#tmp', '# tmp', '#temp', '# temp', 'check ', 'checking', '# check', '\\[SKIP', 'DEBUG '],
    drop_headers_containing=['check', '[check', 'old', '[old', 'tmp', '[tmp'],
    lint=False,
    format=False,
    **kws_fix_code
) -> str

Wrapper around the notebook post-processing functions.

Usage: For notebooks developed using roux.global_imports.

On command line:

Single input:
roux to-clean-nb in.ipynb out.ipynb -c -l -f

Multiple inputs:
roux to-clean-nb "in*.ipynb" -i -c -l -f

Parameters:

  • temp_outp (str): path to the intermediate output.
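
The default drop_code_lines_containing list in the signature is a set of regular expressions. A minimal sketch of how such patterns could filter the lines of a code cell is given below. It assumes a line is dropped when any pattern matches anywhere in it (`re.search`); the matching rule roux actually applies is not stated in the source, and `drop_lines` is a hypothetical helper.

```python
import re

# A few of the default patterns listed in the signature.
DROP_PATTERNS = [r".*%run .*", r"^#\s*.*=.*", r"^\s*#\s*$", r"#tmp", r"# check", r"DEBUG "]


def drop_lines(code: str, patterns=DROP_PATTERNS) -> str:
    """Remove every line that matches at least one pattern."""
    kept = [
        line
        for line in code.splitlines()
        if not any(re.search(p, line) for p in patterns)
    ]
    return "\n".join(kept)


cell = "\n".join(
    [
        "df = read_table(path)",
        "# df = df.head()",   # commented-out assignment
        "#",                  # empty comment
        "df.shape  # check",  # ad hoc check
        "result = df.sum()",
    ]
)
cleaned = drop_lines(cell)
```

Only the first and last lines of the cell survive in this example.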

module roux.viz.image

For visualization of images.


function plot_image

plot_image(
    imp: str,
    ax: Axes = None,
    force=False,
    margin=0,
    axes=False,
    test=False,
    **kwarg
) -> Axes

Plot image e.g. schematic.

Args:

  • imp (str): path of the image.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • force (bool, optional): overwrite output. Defaults to False.
  • margin (int, optional): margins. Defaults to 0.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.

Keyword Args:

  • kwarg: cairosvg: {'dpi':500,'scale':2}; imagemagick: {'trim':False,'alpha':False}


function plot_images

plot_images(image_paths, ncols=3, title_func=None, size=3)

module roux.lib.sys

For processing file paths, for example.


function basenamenoext

basenamenoext(p)

Basename without the extension.

Args:

  • p (str): path.

Returns:

  • s (str): output.

function remove_exts

remove_exts(p: str)

Filename without the extension.

Args:

  • p (str): path.

Returns:

  • s (str): output.
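
The difference between the two helpers can be sketched with pathlib. This is one possible reading, assuming basenamenoext drops the directories and the last extension, while remove_exts keeps the directories and strips every extension; the function names below are hypothetical.

```python
from pathlib import Path


def basename_without_ext(p: str) -> str:
    """File name with directories and the last extension removed."""
    return Path(p).stem


def path_without_exts(p: str) -> str:
    """Path with every trailing extension removed (e.g. '.tsv.gz')."""
    path = Path(p)
    n_ext = len("".join(path.suffixes))  # total length of all extensions
    stem = path.name[: len(path.name) - n_ext]
    return path.with_name(stem).as_posix()


name = basename_without_ext("data/table.tsv")
stripped = path_without_exts("data/table.tsv.gz")
```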

function read_ps

read_ps(ps, test: bool = True, verbose: bool = True) -> list

Read a list of paths.

Parameters:

  • ps (list|str): list of paths or a string with wildcard/s.
  • test (bool): testing.
  • verbose (bool): verbose.

Returns:

  • ps (list): list of paths.

function to_path

to_path(s, replacewith='_', verbose=False, coff_len_escape_replacement=100)

Normalise a string to be used as a file path.

Parameters:

  • s (string): input string.
  • replacewith (str): replace the whitespaces or incompatible characters with.

Returns:

  • s (string): output string.
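
A minimal sketch of this kind of normalisation with a regular expression. The set of characters treated as incompatible is an assumption made for this illustration, and `normalise_to_path` is a hypothetical name, not the roux implementation.

```python
import re


def normalise_to_path(s: str, replacewith: str = "_") -> str:
    """Replace runs of whitespace or unusual characters with `replacewith`."""
    # Keep letters, digits, underscore, hyphen, dot and the path separator.
    return re.sub(r"[^\w\-./]+", replacewith, s.strip())


normalised = normalise_to_path("results/my plot: v1?.png")
```

Here the space, the colon followed by a space, and the question mark are each collapsed into a single underscore.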

function makedirs

makedirs(p: str, exist_ok=True, **kws)

Make directories recursively.

Args:

  • p (str): path.
  • exist_ok (bool, optional): no error if the directory exists. Defaults to True.

Returns:

  • p_ (str): the path of the directory.

function to_output_path

to_output_path(ps, outd=None, outp=None, suffix='')

Infer a single output path for a list of paths.

Parameters:

  • ps (list): list of paths.
  • outd (str): path of the output directory.
  • outp (str): path of the output file.
  • suffix (str): suffix of the filename.

Returns:

  • outp (str): path of the output file.

function to_output_paths

to_output_paths(
    input_paths: list = None,
    inputs: list = None,
    output_path_base: str = None,
    encode_short: bool = True,
    replaces_output_path=None,
    key_output_path: str = None,
    force: bool = False,
    verbose: bool = False
) -> dict

Infer an output path for each of the paths or inputs.

Parameters:

  • input_paths (list) : list of input paths. Defaults to None.
  • inputs (list) : list of inputs e.g. dictionaries. Defaults to None.
  • output_path_base (str) : output path with a placeholder '{KEY}' to be replaced. Defaults to None.
  • encode_short (bool) : short encoded string, else long encoded string (reversible) is used. Defaults to True.
  • replaces_output_path : list, dictionary or function to replace the input paths. Defaults to None.
  • key_output_path (str) : key to be used to incorporate output_path variable among the inputs. Defaults to None.
  • force (bool): overwrite the outputs. Defaults to False.
  • verbose (bool) : show verbose. Defaults to False.

Returns: dictionary with the output path mapped to input paths or inputs.

TODOs: 1. Placeholders other than {KEY}.
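
The '{KEY}' placeholder mechanism can be sketched as follows. The short key here is a truncated SHA-256 hash of the input, which is only a stand-in: how roux encodes inputs is not described in the source, and `output_paths_for` is a hypothetical helper.

```python
import hashlib
import json


def output_paths_for(inputs: list, output_path_base: str) -> dict:
    """Map one output path to each input by filling the '{KEY}' placeholder."""
    mapping = {}
    for params in inputs:
        # Short, deterministic key derived from the content of the input.
        text = json.dumps(params, sort_keys=True)
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
        mapping[output_path_base.replace("{KEY}", key)] = params
    return mapping


runs = [{"alpha": 0.1}, {"alpha": 0.5}]
paths = output_paths_for(runs, "outputs/{KEY}/result.tsv")
```

Because the key depends only on the input, re-running with the same inputs yields the same output paths, which is what makes checks such as the force option meaningful.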


function get_encoding

get_encoding(p)

Get encoding of a file.

Parameters:

  • p (str): file path

Returns:

  • s (string): encoding.

function get_all_subpaths

get_all_subpaths(d='.', include_directories=False)

Get all the subpaths.

Args:

  • d (str, optional): directory to search. Defaults to '.'.
  • include_directories (bool, optional): to include the directories. Defaults to False.

Returns:

  • paths (list): sub-paths.

function get_env

get_env(env_name: str, return_path: bool = False)

Get the virtual environment as a dictionary.

Args:

  • env_name (str): name of the environment.

Returns:

  • d (dict): parameters of the virtual environment.

function run_com

run_com(com: str, env=None, test: bool = False, **kws)

Run a bash command.

Args:

  • com (str): command.
  • env (str): environment name.
  • test (bool, optional): testing. Defaults to False.

Returns:

  • output: output of the subprocess.call function.

TODOs: 1. logp 2. error ignoring


function runbash_tmp

runbash_tmp(
    s1: str,
    env: str,
    df1=None,
    inp='INPUT',
    input_type='df',
    output_type='path',
    tmp_infn='in.txt',
    tmp_outfn='out.txt',
    outp=None,
    force=False,
    test=False,
    **kws
)

Run a bash command in /tmp directory.

Args:

  • s1 (str): command.
  • env (str): environment name.
  • df1 (DataFrame, optional): input dataframe. Defaults to None.
  • inp (str, optional): input path. Defaults to 'INPUT'.
  • input_type (str, optional): input type. Defaults to 'df'.
  • output_type (str, optional): output type. Defaults to 'path'.
  • tmp_infn (str, optional): temporary input file. Defaults to 'in.txt'.
  • tmp_outfn (str, optional): temporary output file. Defaults to 'out.txt'.
  • outp (type, optional): output path. Defaults to None.
  • force (bool, optional): force. Defaults to False.
  • test (bool, optional): test. Defaults to False.

Returns:

  • output: output of the subprocess.call function.

function create_symlink

create_symlink(p: str, outp: str, test=False, force=False)

Create symbolic links.

Args:

  • p (str): input path.
  • outp (str): output path.
  • test (bool, optional): test. Defaults to False.

Returns:

  • outp (str): output path.

TODOs:

  • Use pathlib: Path(p).symlink_to(Path(outp))

function input_binary

input_binary(q: str)

Get input in binary format.

Args:

  • q (str): question.

Returns:

  • b (bool): response.

function is_interactive

is_interactive()

Check if the UI is interactive e.g. jupyter or command line.


function is_interactive_notebook

is_interactive_notebook()

Check if the UI is interactive e.g. jupyter or command line.


function get_excecution_location

get_excecution_location(depth=1)

Get the location of the function being executed.

Args:

  • depth (int, optional): Depth of the location. Defaults to 1.

Returns:

  • tuple (tuple): filename and line number.

function get_datetime

get_datetime(outstr: bool = True, fmt='%G%m%dT%H%M%S')

Get the date and time.

Args:

  • outstr (bool, optional): string output. Defaults to True.
  • fmt (str): format of the string.

Returns:

  • s : date and time.

function p2time

p2time(filename: str, time_type='m')

Get the creation/modification dates of files.

Args:

  • filename (str): filename.
  • time_type (str, optional): description. Defaults to 'm'.

Returns:

  • time (str): time.

function ps2time

ps2time(ps: list, **kws_p2time)

Get the times for a list of files.

Args:

  • ps (list): list of paths.

Returns:

  • ds (Series): paths mapped to corresponding times.

function get_logger

get_logger(program='program', argv=None, level=None, dp=None)

Get the logging object.

Args:

  • program (str, optional): name of the program. Defaults to 'program'.
  • argv (type, optional): arguments. Defaults to None.
  • level (type, optional): level of logging. Defaults to None.
  • dp (type, optional): description. Defaults to None.

function tree

tree(folder_path: str, log=True)

function grep

grep(
    p: str,
    checks: list,
    exclude: list = [],
    exclude_str: list = [],
    verbose: bool = True
) -> list

To get the output of grep as a list of strings.

Parameters:

  • p (str): input path

module roux.stat.transform

For transformations.


function plog

plog(x, p: float, base: int)

Pseudo-log.

Args:

  • x (float|np.array): input.
  • p (float): pseudo-count.
  • base (int): base of the log.

Returns: output.


function anti_plog

anti_plog(x, p: float, base: int)

Anti-pseudo-log.

Args:

  • x (float|np.array): input.
  • p (float): pseudo-count.
  • base (int): base of the log.

Returns: output.
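
The pair of transforms can be sketched as below, assuming the usual definition of a pseudo-log, i.e. the logarithm of the input plus the pseudo-count. The formulas and the function names are assumptions for illustration; the source does not spell out the exact expressions.

```python
import math


def pseudo_log(x: float, p: float, base: int) -> float:
    """log_base(x + p): the pseudo-count p keeps zeros finite."""
    return math.log(x + p, base)


def anti_pseudo_log(y: float, p: float, base: int) -> float:
    """Inverse of pseudo_log: base**y - p."""
    return base ** y - p


y = pseudo_log(0.0, p=1.0, base=2)  # log2(0 + 1)
x_back = anti_pseudo_log(pseudo_log(7.0, p=1.0, base=2), p=1.0, base=2)
```

Applying the anti-transform with the same pseudo-count and base recovers the original value, up to floating-point error.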


function log_pval

log_pval(
    x,
    errors: str = 'raise',
    replace_zero_with: float = None,
    p_min: float = None
)

Transform p-values to Log10.

Parameters:

  • x: input.
  • errors (str): Defaults to 'raise' else replace (in case of visualization only).
  • p_min (float): Replace zeros with this value. Note: to be used for visualization only.

Returns: output.
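
A minimal sketch of the zero handling described in the parameters, assuming the conventional negative log10 transform. The fallback used when p_min is not given (the smallest non-zero p-value in the input) is an assumption of this sketch, and `neg_log10_pvals` is a hypothetical name.

```python
import math


def neg_log10_pvals(pvals, errors="raise", p_min=None):
    """Return -log10(p) for each p-value, handling zeros explicitly."""
    if any(p == 0 for p in pvals):
        if errors == "raise":
            raise ValueError("p-value of zero found; log10(0) is undefined")
        if p_min is None:
            # Assumed fallback: the smallest non-zero p-value in the input.
            p_min = min(p for p in pvals if p > 0)
        pvals = [p_min if p == 0 else p for p in pvals]
    return [-math.log10(p) for p in pvals]


scores = neg_log10_pvals([0.1, 0.001, 0.0], errors="replace", p_min=1e-6)
```

Replacing zeros changes the data, so it suits plotting only, in line with the note in the parameters.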


function get_q

get_q(ds1: Series, col: str = None, verb: bool = True, test_coff: float = 0.1)

To FDR corrected P-value.
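
The source does not name the FDR method. As an illustration of what an FDR-corrected p-value is, the sketch below implements the Benjamini-Hochberg adjustment in plain Python; that this is the method behind get_q is an assumption, and `benjamini_hochberg` is a hypothetical name.

```python
def benjamini_hochberg(pvals: list) -> list:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices by ascending p
    qvals = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        qvals[i] = running_min
    return qvals


qvals = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each p-value is scaled by n/rank, and the running minimum ensures that a smaller p-value never receives a larger q-value than a bigger one.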


function glog

glog(x: float, l=2)

Generalised logarithm.

Args:

  • x (float): input.
  • l (int, optional): pseudo-count. Defaults to 2.

Returns:

  • float: output.

function rescale

rescale(
    a: np.array,
    range1: tuple = None,
    range2: tuple = [0, 1]
) -> np.array

Rescale within a new range.

Args:

  • a (np.array): input vector.
  • range1 (tuple, optional): existing range. Defaults to None.
  • range2 (tuple, optional): new range. Defaults to [0,1].

Returns:

  • np.array: output.
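
Rescaling between ranges is a linear map. The sketch below works on plain lists and assumes that, when range1 is None, the existing range is taken from the minimum and maximum of the input; `rescale_values` is a hypothetical name for this illustration.

```python
def rescale_values(a, range1=None, range2=(0, 1)):
    """Linearly map values from range1 (default: min/max of a) onto range2."""
    lo1, hi1 = (min(a), max(a)) if range1 is None else range1
    lo2, hi2 = range2
    scale = (hi2 - lo2) / (hi1 - lo1)  # fails if the existing range has zero width
    return [lo2 + (v - lo1) * scale for v in a]


unit = rescale_values([2, 4, 6])                      # onto [0, 1]
percent = rescale_values([2, 4, 6], range2=(0, 100))  # onto [0, 100]
```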

function rescale_divergent

rescale_divergent(df1: DataFrame, col: str, col_sign: str = None) -> DataFrame

Rescale divergently i.e. two-sided.

Args:

  • df1 (pd.DataFrame): input dataframe.
  • col (str): column.

Returns:

  • pd.DataFrame: column.

Notes:

Under development.

module roux.lib.ds

For processing pandas Series.


function get_near_quantile

get_near_quantile(x: Series, q: float)

Retrieve the nearest value to a quantile.
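
A quantile usually falls between two observations, so "nearest value" means picking the observed value closest to it. A minimal sketch on a plain list follows, assuming linear interpolation for the quantile itself; `nearest_to_quantile` is a hypothetical name.

```python
def nearest_to_quantile(values, q: float):
    """Return the observed value closest to the q-th quantile."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)  # fractional index of the quantile
    lower = int(pos)
    upper = min(lower + 1, len(xs) - 1)
    quantile = xs[lower] + (pos - lower) * (xs[upper] - xs[lower])
    return min(xs, key=lambda v: abs(v - quantile))


median_like = nearest_to_quantile([1, 2, 3, 4, 10], 0.5)
upper_like = nearest_to_quantile([1, 2, 3, 4, 10], 0.9)
```

In the second call the interpolated quantile lies between 4 and 10, and the function returns whichever of the two is closer.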

module roux.viz.dist

For distribution plots.


function hist_annot

hist_annot(
    dplot: DataFrame,
    colx: str,
    colssubsets: list = [],
    bins: int = 100,
    subset_unclassified: bool = True,
    cmap: str = 'hsv',
    ymin=None,
    ymax=None,
    ylimoff: float = 1,
    ywithinoff: float = 1.2,
    annotaslegend: bool = True,
    annotn: bool = True,
    params_scatter: dict = {'zorder': 2, 'alpha': 0.1, 'marker': '|'},
    xlim: tuple = None,
    ax: Axes = None,
    **kws
) -> Axes

Annotated histogram.

Args:

  • dplot (pd.DataFrame): input dataframe.
  • colx (str): x column.
  • colssubsets (list, optional): columns indicating subsets. Defaults to [].
  • bins (int, optional): bins. Defaults to 100.
  • subset_unclassified (bool, optional): call non-annotated subset as 'unclassified'. Defaults to True.
  • cmap (str, optional): colormap. Defaults to 'hsv'.
  • ylimoff (float, optional): y-offset for y-axis limit. Defaults to 1.
  • ywithinoff (float, optional): y-offset for the distance within labels. Defaults to 1.2.
  • annotaslegend (bool, optional): convert labels to legends. Defaults to True.
  • annotn (bool, optional): annotate sample sizes. Defaults to True.
  • params_scatter (type, optional): parameters of the scatter plot. Defaults to {'zorder':2,'alpha':0.1,'marker':'|'}.
  • xlim (tuple, optional): x-axis limits. Defaults to None.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the hist function.

Returns:

  • plt.Axes: plt.Axes object.

TODOs: For scatter, use annot_side with loc='top'.


function plot_gmm

plot_gmm(
    x: Series,
    coff: float = None,
    mix_pdf: object = None,
    two_pdfs: tuple = None,
    weights: tuple = None,
    n_clusters: int = 2,
    bins: int = 20,
    show_cutoff: bool = True,
    show_cutoff_line: bool = True,
    colors: list = ['gray', 'gray', 'lightgray'],
    out_coff: bool = False,
    hist: bool = True,
    test: bool = False,
    ax: Axes = None,
    kws_axvline={'color': 'k'},
    **kws
) -> Axes

Plot Gaussian mixture Models (GMMs).

Args:

  • x (pd.Series): input vector.
  • coff (float, optional): intersection between two fitted distributions. Defaults to None.
  • mix_pdf (object, optional): Probability density function of the mixed distribution. Defaults to None.
  • two_pdfs (tuple, optional): Probability density functions of the separate distributions. Defaults to None.
  • weights (tuple, optional): weights of the individual distributions. Defaults to None.
  • n_clusters (int, optional): number of distributions. Defaults to 2.
  • bins (int, optional): bins. Defaults to 20.
  • colors (list, optional): colors of the individual distributions and of the mixed one. Defaults to ['gray','gray','lightgray'].
  • out_coff (bool, optional): return the cutoff. Defaults to False.
  • hist (bool, optional): show histogram. Defaults to True.
  • test (bool, optional): test mode. Defaults to False.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the hist function.
  • kws_axvline: parameters provided to the axvline function.

Returns:

  • plt.Axes: plt.Axes object.

function plot_normal

plot_normal(x: Series, ax: Axes = None) -> Axes

Plot normal distribution.

Args:

  • x (pd.Series): input vector.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function get_jitter_positions

get_jitter_positions(ax, df1, order, column_category, column_position)

function plot_dists

plot_dists(
    df1: DataFrame,
    x: str,
    y: str,
    colindex: str,
    hue: str = None,
    order: list = None,
    hue_order: list = None,
    kind: str = 'box',
    show_p: bool = True,
    show_n: bool = True,
    show_n_prefix: str = '',
    show_n_ha=None,
    show_n_ticklabels: bool = True,
    show_outlines: bool = False,
    kws_outlines: dict = {},
    alternative: str = 'two-sided',
    offx_n: float = 0,
    axis_cont_lim: tuple = None,
    axis_cont_scale: str = 'linear',
    offs_pval: dict = None,
    fmt_pval: str = '<',
    alpha: float = 0.5,
    ax: Axes = None,
    test: bool = False,
    kws_stats: dict = {},
    **kws
) -> Axes

Plot distributions.

Args:

  • df1 (pd.DataFrame): input data.
  • x (str): x column.
  • y (str): y column.
  • colindex (str): index column.
  • hue (str, optional): column with values to be encoded as hues. Defaults to None.
  • order (list, optional): order of categorical values. Defaults to None.
  • hue_order (list, optional): order of values to be encoded as hues. Defaults to None.
  • kind (str, optional): kind of distribution. Defaults to 'box'.
  • show_p (bool, optional): show p-values. Defaults to True.
  • show_n (bool, optional): show sample sizes. Defaults to True.
  • show_n_prefix (str, optional): show prefix of sample size label i.e. n=. Defaults to ''.
  • offx_n (float, optional): x-offset for the sample size label. Defaults to 0.
  • axis_cont_lim (tuple, optional): x-axis limits. Defaults to None.
  • offs_pval (dict, optional): x and y offsets for the p-value labels. Defaults to None.
  • # saturate_color_alpha (float, optional): saturation of the color. Defaults to 1.5.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • test (bool, optional): test mode. Defaults to False.
  • kws_stats (dict, optional): parameters provided to the stat function. Defaults to {}.

Keyword Args:

  • kws: parameters provided to the seaborn function.

Returns:

  • plt.Axes: plt.Axes object.

TODOs: 1. Sort categories. 2. Change alpha of the boxplot rather than changing saturation of the swarmplot.


function pointplot_groupbyedgecolor

pointplot_groupbyedgecolor(data: DataFrame, ax: Axes = None, **kws) -> Axes

Plot seaborn's pointplot grouped by edgecolor of points.

Args:

  • data (pd.DataFrame): input data.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the seaborn's pointplot function.

Returns:

  • plt.Axes: plt.Axes object.

module roux.viz.theme

Theming.


function set_theme

set_theme(
    font: str = 'Myriad Pro',
    fontsize: int = 12,
    pad: int = 2,
    palette: list = ['#50AADC', '#D3DDDC', '#F1D929', '#f55f5f', '#046C9A', '#00A08A', '#F2AD00', '#F98400', '#5BBCD6', '#ECCBAE', '#D69C4E', '#ABDDDE', '#000000']
)

Set the theme.

Parameters:

  • font (str): font name.
  • fontsize (int): font size.
  • pad (int): padding.

TODOs: Addition of palette options.

module roux.workflow.workflow

For workflow management.


function get_scripts

get_scripts(
    ps: list,
    notebook_prefix: str = '\\d{2}',
    notebook_suffix: str = '_v\\d{2}',
    test: bool = False,
    fast: bool = True,
    cores: int = 6,
    force: bool = False,
    tab: str = '    ',
    **kws
) -> DataFrame

Get scripts.

Args:

  • ps (list): paths.
  • notebook_prefix (str, optional): prefix of the notebook file to be considered as a "task".
  • notebook_suffix (str, optional): suffix of the notebook file to be considered as a "task".
  • test (bool, optional): test mode. Defaults to False.
  • fast (bool, optional): parallel processing. Defaults to True.
  • cores (int, optional): cores to use. Defaults to 6.
  • force (bool, optional): overwrite the outputs. Defaults to False.
  • tab (str, optional): tab in spaces. Defaults to '    ' (four spaces).

Returns:

  • pd.DataFrame: output table.
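
The defaults for notebook_prefix and notebook_suffix are regular expressions. The sketch below shows which file names they would select, assuming the prefix must match the start and the suffix the end of the file name without its extension. How roux combines the two patterns is not stated in the source, and `is_task_notebook` is a hypothetical helper.

```python
import re
from pathlib import Path


def is_task_notebook(path: str, prefix: str = r"\d{2}", suffix: str = r"_v\d{2}") -> bool:
    """True if the file name (without extension) starts with `prefix` and ends with `suffix`."""
    stem = Path(path).stem
    return re.fullmatch(rf"{prefix}.*{suffix}", stem) is not None


candidates = [
    "01_preprocess_v01.ipynb",
    "02_model_v03.ipynb",
    "scratch.ipynb",   # no numeric prefix
    "03_plots.ipynb",  # no version suffix
]
tasks = [p for p in candidates if is_task_notebook(p)]
```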

function to_scripts

to_scripts(
    packagep: str,
    notebooksdp: str,
    validate: bool = False,
    ps: list = None,
    notebook_prefix: str = '\\d{2}',
    notebook_suffix: str = '_v\\d{2}',
    scripts: bool = True,
    workflow: bool = True,
    sep_step: str = '## step',
    todos: bool = False,
    git: bool = True,
    clean: bool = False,
    test: bool = False,
    force: bool = True,
    tab: str = '    ',
    **kws
)

To scripts.

Args:

  • # packagen (str): package name.
  • packagep (str): path to the package.
  • notebooksdp (str, optional): path to the notebooks. Defaults to None.
  • validate (bool, optional): validate if functions are formatted correctly. Defaults to False.
  • ps (list, optional): paths. Defaults to None.
  • notebook_prefix (str, optional): prefix of the notebook file to be considered as a "task".
  • notebook_suffix (str, optional): suffix of the notebook file to be considered as a "task".
  • scripts (bool, optional): make scripts. Defaults to True.
  • workflow (bool, optional): make workflow file. Defaults to True.
  • sep_step (str, optional): separator marking the start of a step. Defaults to "## step".
  • todos (bool, optional): show todos. Defaults to False.
  • git (bool, optional): save version. Defaults to True.
  • clean (bool, optional): clean temporary files. Defaults to False.
  • test (bool, optional): test mode. Defaults to False.
  • force (bool, optional): overwrite outputs. Defaults to True.
  • tab (str, optional): tab size. Defaults to '    ' (four spaces).

Keyword parameters:

  • kws: parameters provided to the get_script function, including sep_step and sep_step_end

TODOs:

  • 1. For version control, use https://github.com/jupyterlab/jupyterlab-git.

module roux.stat

Global Variables

  • binary
  • io

module roux.lib.io

For input/output of data files.


function read_zip

read_zip(p: str, file_open: str = None, fun_read=None, test: bool = False)

Read the contents of a zip file.

Parameters:

  • p (str): path of the file.
  • file_open (str): path of file within the zip file to open.
  • fun_read (object): function to read the file.

Examples:

  1. Setting fun_read parameter for reading tab-separated table from a zip file.

from io import StringIO
...
fun_read=lambda x: pd.read_csv(StringIO(x.decode('utf-8')), sep='\t', header=None),

or

from io import BytesIO
...
fun_read=lambda x: pd.read_table(BytesIO(x)),
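
To make the role of fun_read concrete, the sketch below reads a tab-separated member of a zip archive using only the standard library. It mirrors the idea of the examples for read_zip (the member's bytes are handed to a reader) but does not call roux or pandas; `read_tsv_from_zip` is a hypothetical helper.

```python
import csv
import io
import zipfile


def read_tsv_from_zip(zip_bytes: bytes, member: str) -> list:
    """Read one tab-separated member of a zip archive into a list of rows."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        raw = archive.read(member)  # bytes of the member file
    text = io.StringIO(raw.decode("utf-8"))
    return list(csv.reader(text, delimiter="\t"))


# Build a small archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("table.tsv", "gene\tscore\nA\t1\nB\t2\n")

rows = read_tsv_from_zip(buffer.getvalue(), "table.tsv")
```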


function to_zip_dir

to_zip_dir(source, destination=None, fmt='zip')

Zip a folder. Ref: https://stackoverflow.com/a/50381250/3521099


function to_zip

to_zip(
    p: str,
    outp: str = None,
    func_rename=None,
    fmt: str = 'zip',
    test: bool = False
)

Compress a file/directory.

Parameters:

  • p (str): path to the file/directory.
  • outp (str): path to the output compressed file.
  • fmt (str): format of the compressed file.

Returns:

  • outp (str): path of the compressed file.

function to_dir

to_dir(
    paths: dict,
    output_dir_path: str,
    rename_basename=None,
    force=False,
    test=False
)

function get_version

get_version(suffix: str = '') -> str

Get the time-based version string.

Parameters:

  • suffix (string): suffix.

Returns:

  • version (string): version.
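
The exact format of the version string is not documented here. As an illustration only, a time-based version with an optional suffix could be built as follows; the helper make_version and the timestamp format are assumptions, not the roux implementation:

```python
from datetime import datetime


def make_version(suffix: str = "", now: datetime = None) -> str:
    """Hypothetical helper: time-based version string, optionally suffixed."""
    now = now or datetime.now()
    version = now.strftime("%Y%m%d_%H%M%S")
    return f"{version}_{suffix}" if suffix else version


# A fixed time is passed here only to make the output reproducible.
print(make_version(suffix="draft", now=datetime(2024, 1, 2, 3, 4, 5)))
```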

function to_version

to_version(
    p: str,
    outd: str = None,
    test: bool = False,
    label: str = '',
    **kws: dict
) -> str

Rename a file/directory to a version.

Parameters:

  • p (str): path.
  • outd (str): output directory.

Keyword parameters:

  • kws (dict): provided to get_version.

Returns:

  • version (string): version.

TODOs: 1. Use to_dir.


function backup

backup(
    p: str,
    outd: str = None,
    versioned: bool = False,
    suffix: str = '',
    zipped: bool = False,
    move_only: bool = False,
    test: bool = True,
    verbose: bool = False,
    no_test: bool = False
)

Backup a directory

Steps:

  0. Create the version directory in outd.
  1. Move ps to the version (time) directory, keeping the common parents up to the level of the version directory.
  2. Optionally zip the result.

Parameters:

  • p (str): input path.
  • outd (str): output directory path.
  • versioned (bool): custom version for the backup (False).
  • suffix (str): custom suffix for the backup ('').
  • zipped (bool): whether to zip the backup (False).
  • test (bool): testing (True).
  • no_test (bool): no testing. Usage in command line (False).

TODOs:

  • 1. Use to_dir.
  • 2. Option to remove dirs: find and move/zip "find -regex ./_." "find -regex ./test."
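
The steps listed for backup can be sketched with the standard library. This is a simplified illustration, not the roux code: the function backup_sketch, the timestamp format and the flat directory layout (no common-parent handling) are all assumptions.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_sketch(p: str, outd: str, zipped: bool = False) -> str:
    """Hypothetical re-implementation of the listed steps, for illustration."""
    # Step 0: create the version (time) directory inside the output directory.
    version_dir = Path(outd) / datetime.now().strftime("%Y%m%d_%H%M%S")
    version_dir.mkdir(parents=True, exist_ok=True)
    # Step 1: move the input into the version directory.
    moved = shutil.move(p, str(version_dir / Path(p).name))
    if not zipped:
        return moved
    # Step 2: optionally zip the version directory into '<version_dir>.zip'.
    return shutil.make_archive(str(version_dir), "zip", root_dir=str(version_dir))


# Demonstrate on a throwaway directory.
root = Path(tempfile.mkdtemp())
(root / "results").mkdir()
(root / "results" / "table.tsv").write_text("a\t1\n")
archive_path = backup_sketch(str(root / "results"), str(root / "backups"), zipped=True)
print(archive_path)
```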


function read_url

read_url(url)

Read text from a URL.

Parameters:

  • url (str): URL link.

Returns:

  • s (string): text content of the URL.

function download

download(
    url: str,
    path: str = None,
    outd: str = None,
    force: bool = False,
    verbose: bool = True
) -> str

Download a file.

Parameters:

  • url (str): URL.
  • path (str): custom output path (None)
  • outd (str): output directory ('data/database').
  • force (bool): overwrite output (False).
  • verbose (bool): verbose (True).

Returns:

  • path (str): output path (None)

function read_text

read_text(p)

Read a file. To be called by other functions

Args:

  • p (str): path.

Returns:

  • s (str): contents.

function to_list

to_list(l1, p)

Save list.

Parameters:

  • l1 (list): input list.
  • p (str): path.

Returns:

  • p (str): path.

function read_list

read_list(p)

Read the lines in the file.

Args:

  • p (str): path.

Returns:

  • l (list): list.

function is_dict

is_dict(p)

function read_dict

read_dict(p, fmt: str = '', apply_on_keys=None, **kws) -> dict

Read dictionary file.

Parameters:

  • p (str): path.
  • fmt (str): format of the file.

Keyword Arguments:

  • kws (dict): parameters provided to the reader function.

Returns:

  • d (dict): output dictionary.

function to_dict

to_dict(d, p, **kws)

Save dictionary file.

Parameters:

  • d (dict): input dictionary.
  • p (str): path.

Keyword Arguments:

  • kws (dict): parameters provided to the export function.

Returns:

  • p (str): path.

function post_read_table

post_read_table(
    df1: DataFrame,
    clean: bool,
    tables: list,
    verbose: bool = True,
    **kws_clean: dict
)

Post-reading a table.

Parameters:

  • df1 (DataFrame): input dataframe.
  • clean (bool): whether to apply the clean function.
  • tables (list)
  • verbose (bool): verbose.

Keyword parameters:

  • kws_clean (dict): parameters provided to the clean function.

Returns:

  • df (DataFrame): output dataframe.

function read_table

read_table(
    p: str,
    ext: str = None,
    clean: bool = True,
    filterby_time=None,
    params: dict = {},
    kws_clean: dict = {},
    kws_cloud: dict = {},
    check_paths: bool = True,
    use_paths: bool = False,
    tables: int = 1,
    test: bool = False,
    verbose: bool = True,
    engine: str = 'pyarrow',
    **kws_read_tables: dict
)

Table/s reader.

Parameters:

  • p (str): path of the file. It could be an input for read_ps, which would include strings with wildcards, list etc.
  • ext (str): extension of the file (default: None, meaning inferred from the path).
  • clean (bool): whether to apply the clean function (default: True).
  • filterby_time: filter by time (default: None).
  • check_paths (bool): read files in the path column (default: True).
  • use_paths (bool): forced read files in the path column (default: False).
  • test (bool): testing (default: False).
  • params (dict): parameters provided to pd.read_csv (default: {}). For example, params['columns']: columns to read.
  • kws_clean (dict): parameters provided to rd.clean (default: {}).
  • kws_cloud (dict): parameters for reading files from google-drive (default: {}).
  • tables (int): how many tables to be read (default: 1).
  • verbose (bool): verbose (default: True).

Keyword parameters:

  • kws_read_tables (dict): parameters provided to the read_tables function. For example: to_col={colindex: replaces_index}

Returns:

  • df (DataFrame): output dataframe.

Examples:

  1. For reading specific columns only set params=dict(columns=list).

  2. For reading many files, convert paths to a column with corresponding values:

to_col={colindex: replaces_index}

  3. Reading a vcf file:

p='*.vcf|vcf.gz'
read_table(
    p,
    params_read_csv=dict(
        #compression='gzip',
        sep='\t',comment='#',header=None,
        names=replace_many(get_header(path,comment='#',lineno=-1),['#',' '],'').split('\t'),
    ),
)
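
The idea of converting paths to a column when reading many files can be illustrated without roux. In this minimal sketch, read_many and its colindex argument are hypothetical stand-ins that use the csv module rather than pandas; each row records the file it came from, with the path reduced to the file name without extension:

```python
import csv
import tempfile
from pathlib import Path

# Write two small tab-separated files to a temporary directory.
root = Path(tempfile.mkdtemp())
(root / "sample1.tsv").write_text("gene\tscore\nA\t1\n")
(root / "sample2.tsv").write_text("gene\tscore\nB\t2\n")


def read_many(paths: list, colindex: str = "path") -> list:
    """Hypothetical reader: concatenate rows, recording the source of each row."""
    rows = []
    for path in sorted(paths):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                # Replace the full path with the file name without extension.
                row[colindex] = Path(path).stem
                rows.append(row)
    return rows


rows = read_many(list(root.glob("*.tsv")), colindex="sample")
print(rows)
```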

function get_logp

get_logp(ps: list) -> str

Infer the path of the log file.

Parameters:

  • ps (list): list of paths.

Returns:

  • p (str): path of the output file.

function apply_on_paths

apply_on_paths(
    ps: list,
    func,
    replaces_outp: str = None,
    to_col: dict = None,
    replaces_index=None,
    drop_index: bool = True,
    colindex: str = 'path',
    filter_rows: dict = None,
    fast: bool = False,
    progress_bar: bool = True,
    params: dict = {},
    dbug: bool = False,
    test1: bool = False,
    verbose: bool = True,
    kws_read_table: dict = {},
    **kws: dict
)

Apply a function on list of files.

Parameters:

  • ps (str|list): paths or string to infer paths using read_ps.
  • to_col (dict): convert the paths to a column e.g. {colindex: replaces_index}
  • func (function): function to be applied on each of the paths.
  • replaces_outp (dict|function): infer the output path (outp) by replacing substrings in the input paths (p).
  • filter_rows (dict): filter the rows based on dict, using rd.filter_rows.
  • fast (bool): parallel processing (default:False).
  • progress_bar (bool): show progress bar(default:True).
  • params (dict): parameters provided to the pd.read_csv function.
  • dbug (bool): debug mode on (default:False).
  • test1 (bool): test on one path (default:False).
  • kws_read_table (dict): parameters provided to the read_table function (default:{}).
  • replaces_index (object|dict|list|str): for example, 'basenamenoext' if path to basename.
  • drop_index (bool): whether to drop the index column e.g. path (default: True).
  • colindex (str): the name of the column containing the paths (default: 'path')

Keyword parameters:

  • kws (dict): parameters provided to the function.

Example:

  1. Function:

def apply_(p,outd='data/data_analysed',force=False):
    outp=f"{outd}/{basenamenoext(p)}.pqt"
    if exists(outp) and not force:
        return
    df01=read_table(p)

apply_on_paths(
    ps=glob("data/data_analysed/*"),
    func=apply_,
    outd="data/data_analysed/",
    force=True,
    fast=False,
    read_path=True,
)
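
The replaces_outp parameter is described as inferring the output path by replacing substrings in the input path. A minimal sketch of that behaviour, assuming a plain sequential loop (apply_sketch and to_upper are hypothetical names, and none of the parallel or table-reading features are modelled):

```python
import tempfile
from pathlib import Path


def apply_sketch(paths: list, func, replaces_outp: dict, force: bool = False) -> list:
    """Hypothetical loop: run func on each path, skipping outputs that exist."""
    outputs = []
    for p in paths:
        # Infer the output path by replacing substrings of the input path.
        outp = str(p)
        for old, new in replaces_outp.items():
            outp = outp.replace(old, new)
        if force or not Path(outp).exists():
            Path(outp).parent.mkdir(parents=True, exist_ok=True)
            func(str(p), outp)
        outputs.append(outp)
    return outputs


def to_upper(p: str, outp: str) -> None:
    """Example task: write an upper-cased copy of the input file."""
    Path(outp).write_text(Path(p).read_text().upper())


root = Path(tempfile.mkdtemp())
(root / "data_raw").mkdir()
(root / "data_raw" / "a.txt").write_text("abc")
outputs = apply_sketch(
    sorted((root / "data_raw").glob("*.txt")),
    func=to_upper,
    replaces_outp={"data_raw": "data_analysed"},
)
print(Path(outputs[0]).read_text())
```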

TODOs: Move out of io.


function read_tables

read_tables(
    ps: list,
    fast: bool = False,
    filterby_time=None,
    to_dict: bool = False,
    params: dict = {},
    tables: int = None,
    **kws_apply_on_paths: dict
)

Read multiple tables.

Parameters:

  • ps (list): list of paths.
  • fast (bool): parallel processing (default:False)
  • filterby_time (str): filter by time (default:None)
  • drop_index (bool): drop index (default:True)
  • to_dict (bool): output dictionary (default:False)
  • params (dict): parameters provided to the pd.read_csv function (default:{})
  • tables: number of tables (default:None).

Keyword parameters:

  • kws_apply_on_paths (dict): parameters provided to apply_on_paths.

Returns:

  • df (DataFrame): output dataframe.

TODOs: Parameter to report the creation dates of the newest and the oldest files.


function to_table

to_table(
    df: DataFrame,
    p: str,
    colgroupby: str = None,
    test: bool = False,
    **kws
)

Save table.

Parameters:

  • df (DataFrame): the input dataframe.
  • p (str): output path.
  • colgroupby (str|list): columns to groupby with to save the subsets of the data as separate files.
  • test (bool): testing on (default:False).

Keyword parameters:

  • kws (dict): parameters provided to the to_manytables function.

Returns:

  • p (str): path of the output.

function to_manytables

to_manytables(
    df: DataFrame,
    p: str,
    colgroupby: str,
    fmt: str = '',
    ignore: bool = False,
    kws_get_chunks={},
    **kws_to_table
)

Save many tables.

Parameters:

  • df (DataFrame): the input dataframe.
  • p (str): output path.
  • colgroupby (str|list): columns to groupby with to save the subsets of the data as separate files.
  • fmt (str): if '=' column names in the folder name e.g. col1=True.
  • ignore (bool): ignore the warnings (default:False).

Keyword parameters:

  • kws_get_chunks (dict): parameters provided to the get_chunks function.

Returns:

  • p (str): path of the output.

TODOs:

  • 1. Change in default parameter: fmt='='.
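
Saving the subsets of a table as separate files, with the column name in the folder name (e.g. col1=True), can be sketched as follows. The function to_many_sketch, the fixed file name table.tsv and the use of the csv module instead of dataframes are assumptions made for illustration:

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def to_many_sketch(rows: list, outd: str, colgroupby: str) -> list:
    """Hypothetical writer: one file per group, in a 'column=value' directory."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[colgroupby]].append(row)
    paths = []
    for value, subset in groups.items():
        outp = Path(outd) / f"{colgroupby}={value}" / "table.tsv"
        outp.parent.mkdir(parents=True, exist_ok=True)
        with open(outp, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(subset[0]), delimiter="\t")
            writer.writeheader()
            writer.writerows(subset)
        paths.append(str(outp))
    return paths


rows = [
    {"gene": "A", "significant": True},
    {"gene": "B", "significant": False},
    {"gene": "C", "significant": True},
]
paths = to_many_sketch(rows, tempfile.mkdtemp(), colgroupby="significant")
print(paths)
```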

function to_table_pqt

to_table_pqt(
    df: DataFrame,
    p: str,
    engine: str = 'pyarrow',
    compression: str = 'gzip',
    **kws_pqt: dict
) -> str

Save a parquet file.

Parameters:

  • df (pd.DataFrame): table.
  • p (str): path.

Keyword parameters: Parameters provided to pd.DataFrame.to_parquet.

Returns:


function tsv2pqt

tsv2pqt(p: str) -> str

Convert tab-separated file to Apache parquet.

Parameters:

  • p (str): path of the input.

Returns:

  • p (str): path of the output.

function pqt2tsv

pqt2tsv(p: str) -> str

Convert Apache parquet file to tab-separated.

Parameters:

  • p (str): path of the input.

Returns:

  • p (str): path of the output.

function read_excel

read_excel(
    p: str,
    sheet_name: str = None,
    kws_cloud: dict = {},
    test: bool = False,
    **kws
)

Read excel file

Parameters:

  • p (str): path of the file.
  • sheet_name (str|None): read 1st sheet if None (default:None)
  • kws_cloud (dict): parameters provided to read the file from the google drive (default:{})
  • test (bool): if False and sheet_name not provided, return all sheets as a dictionary, else if True, print list of sheets.

Keyword parameters:

  • kws: parameters provided to the excel reader.

function to_excel_commented

to_excel_commented(p: str, comments: dict, outp: str = None, author: str = None)

Add comments to the columns of excel file and save.

Args:

  • p (str): input path of excel file.
  • comments (dict): map between column names and comment e.g. description of the column.
  • outp (str): output path of excel file. Defaults to None.
  • author (str): author of the comments. Defaults to 'Author'.

TODOs: 1. Increase the limit on the number of columns to which comments can be added. Currently it is 26, i.e. up to Z1.


function to_excel

to_excel(
    sheetname2df: dict,
    outp: str,
    comments: dict = None,
    save_input: bool = False,
    author: str = None,
    append: bool = False,
    adjust_column_width: bool = True,
    **kws
)

Save excel file.

Parameters:

  • sheetname2df (dict): dictionary mapping the sheetname to the dataframe.
  • outp (str): output path.
  • append (bool): append the dataframes (default:False).
  • comments (dict): map between column names and comment e.g. description of the column.
  • save_input (bool): additionally save the input tables in text format.

Keyword parameters:

  • kws: parameters provided to the excel writer.

function check_chunks

check_chunks(outd, col, plot=True)

Create chunks of the tables.

Parameters:

  • outd (str): output directory.
  • col (str): the column with values that are used for getting the chunks.
  • plot (bool): plot the chunk sizes (default:True).

Returns:

  • df3 (DataFrame): output dataframe.

module roux.lib

Global Variables

  • set
  • str
  • sys
  • df
  • dfs
  • text
  • io

function to_class

to_class(cls)

Get the decorator to attach functions.

Parameters:

  • cls (class): class object.

Returns:

  • decorator (decorator): decorator object.

References:

  • https://gist.github.com/mgarod/09aa9c3d8a52a980bd4d738e52e5b97a
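
A decorator that attaches functions to a class, in the spirit of the reference above, can be sketched like this. The names attach_to, Table and count_rows are hypothetical; the actual to_class may differ in detail:

```python
from functools import wraps


def attach_to(cls):
    """Hypothetical stand-in for the pattern: return a decorator bound to cls."""

    def decorator(func):
        @wraps(func)
        def wrapper(self, *args, **kwargs):
            return func(self, *args, **kwargs)

        # Attach the function to the class under its own name.
        setattr(cls, func.__name__, wrapper)
        # Return the function so it stays usable on its own.
        return func

    return decorator


class Table:
    def __init__(self, rows):
        self.rows = rows


@attach_to(Table)
def count_rows(self):
    return len(self.rows)


print(Table([1, 2, 3]).count_rows())
```

The function is defined after the class and still becomes available as a method on its instances.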

function decorator

decorator(func)

class rd

roux-dataframe (.rd) extension.

method __init__

__init__(pandas_obj)

class rs

roux-series (.rs) extension.

method __init__

__init__(pandas_obj)

module roux.viz.figure

For setting up figures.


function get_children

get_children(fig)

Get all the individual objects included in the figure.


function get_child_text

get_child_text(search_name, all_children=None, fig=None)

Get text object.


function align_texts

align_texts(fig, texts: list, align: str, test=False)

Align text objects.


function labelplots

labelplots(
    axes: list = None,
    fig=None,
    labels: list = None,
    xoff: float = 0,
    yoff: float = 0,
    auto: bool = False,
    xoffs: dict = {},
    yoffs: dict = {},
    va: str = 'center',
    ha: str = 'left',
    verbose: bool = True,
    test: bool = False,
    **kws_text
)

Label (sub)plots.

Args:

  • fig : plt.figure object.
  • axes (list): list of plt.Axes objects.
  • xoff (int, optional): x offset. Defaults to 0.
  • yoff (int, optional): y offset. Defaults to 0.
  • params_alignment (dict, optional): alignment parameters. Defaults to {}.
  • params_text (dict, optional): parameters provided to plt.text. Defaults to {'size':20,'va':'bottom', 'ha':'right' }.
  • test (bool, optional): test mode. Defaults to False.

Todos: 1. Get the x coordinate of the ylabel.


function annot_axs

annot_axs(data, ax1, ax2, cols, **kws_line)

module roux.workflow.function

For function management.


function get_quoted_path

get_quoted_path(s1: str) -> str

Quoted paths.

Args:

  • s1 (str): path.

Returns:

  • str: quoted path.

function get_path

get_path(
    s: str,
    validate: bool,
    prefixes=['data/', 'metadata/', 'plot/'],
    test=False
) -> str

Extract paths from a line of code.

Args:

  • s (str): line of code.
  • validate (bool): validate the output.
  • prefixes (list, optional): allowed prefixes. Defaults to ['data/','metadata/','plot/'].
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • str: path.

TODOs: 1. Use wildcards i.e. *'s.
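
Extracting a path from a line of code, restricted to the allowed prefixes, could be done with a regular expression. This is a sketch under assumptions: extract_path is a hypothetical helper that only recognises quoted string literals and returns the first match.

```python
import re


def extract_path(s: str, prefixes=("data/", "metadata/", "plot/")):
    """Hypothetical helper: first quoted string that starts with an allowed prefix."""
    alternatives = "|".join(re.escape(prefix) for prefix in prefixes)
    quote = "[\"']"
    # A quote, then a prefix followed by any non-quote characters, then a quote.
    pattern = f"{quote}((?:{alternatives})[^\"']*){quote}"
    match = re.search(pattern, s)
    return match.group(1) if match else None


print(extract_path("df = read_table('data/raw/table.tsv')"))
```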


function remove_dirs_from_outputs

remove_dirs_from_outputs(outputs: list, test: bool = False) -> list

Remove directories from the output paths.

Args:

  • outputs (list): output paths.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • list: paths.

function get_ios

get_ios(l: list, test=False) -> tuple

Get input and output (IO) paths.

Args:

  • l (list): list of lines of code.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • tuple: paths of inputs and outputs.

function get_name

get_name(s: str, i: int, sep_step: str = '## step') -> str

Get name of the function.

Args:

  • s (str): lines in markdown format.
  • sep_step (str, optional): separator marking the start of a step. Defaults to "## step".
  • i (int): index of the step.

Returns:

  • str: name of the function.

function get_step

get_step(
    l: list,
    name: str,
    sep_step: str = '## step',
    sep_step_end: str = '## tests',
    test=False,
    tab='    '
) -> dict

Get code for a step.

Args:

  • l (list): list of lines of code
  • name (str): name of the function.
  • test (bool, optional): test mode. Defaults to False.
  • tab (str, optional): tab format. Defaults to four spaces ('    ').

Returns:

  • dict: step name to code map.
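
How the sep_step and sep_step_end markers could delimit steps is sketched below. The parser split_steps is hypothetical and simplified: it takes the text after the marker as the step name and ignores everything between an end marker and the next step.

```python
def split_steps(lines: list, sep_step: str = "## step", sep_step_end: str = "## tests") -> dict:
    """Hypothetical parser: map each step heading to the lines of code below it."""
    steps, current = {}, None
    for line in lines:
        if line.startswith(sep_step):
            # A new step starts; use the heading text as its name.
            current = line[len(sep_step):].strip() or f"step{len(steps) + 1}"
            steps[current] = []
        elif line.startswith(sep_step_end):
            # Everything after the end marker is ignored until the next step.
            current = None
        elif current is not None:
            steps[current].append(line)
    return steps


lines = [
    "## step load",
    "df = read()",
    "## tests",
    "assert len(df)",
    "## step save",
    "write(df)",
]
steps = split_steps(lines)
print(steps)
```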

function to_task

to_task(
    notebookp,
    task=None,
    sep_step: str = '## step',
    sep_step_end: str = '## tests',
    notebook_suffix: str = '_v',
    force=False,
    validate=False,
    path_prefix=None,
    verbose=True,
    test=False
) -> str

Get the lines of code for a task (script to be saved as an individual .py file).

Args:

  • notebookp (str): path of the notebook.
  • sep_step (str, optional): separator marking the start of a step. Defaults to "## step".
  • sep_step_end (str, optional): separator marking the end of a step. Defaults to "## tests".
  • notebook_suffix (str, optional): suffix of the notebook file to be considered as a "task".
  • force (bool, optional): overwrite output. Defaults to False.
  • validate (bool, optional): validate output. Defaults to False.
  • path_prefix (type, optional): prefix to the path. Defaults to None.
  • verbose (bool, optional): show verbose. Defaults to True.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • str: lines of the code.

function get_global_imports

get_global_imports() -> DataFrame

Get the metadata of the functions imported by "from roux import global_imports".

module roux.stat.fit

For fitting data.


function fit_curve_fit

fit_curve_fit(
    func,
    xdata: <built-in function array> = None,
    ydata: <built-in function array> = None,
    bounds: tuple = (-inf, inf),
    test=False,
    plot=False
) -> tuple

Wrapper around scipy's curve_fit.

Args:

  • func (function): fitting function.
  • xdata (np.array, optional): x data. Defaults to None.
  • ydata (np.array, optional): y data. Defaults to None.
  • bounds (tuple, optional): bounds. Defaults to (-np.inf, np.inf).
  • test (bool, optional): test. Defaults to False.
  • plot (bool, optional): plot. Defaults to False.

Returns:

  • tuple: output.

function fit_gauss_bimodal

fit_gauss_bimodal(
    data: <built-in function array>,
    bins: int = 50,
    expected: tuple = (1, 0.2, 250, 2, 0.2, 125),
    test=False
) -> tuple

Fit bimodal gaussian distribution to the data in vector format.

Args:

  • data (np.array): vector.
  • bins (int, optional): bins. Defaults to 50.
  • expected (tuple, optional): expected parameters. Defaults to (1,.2,250,2,.2,125).
  • test (bool, optional): test. Defaults to False.

Returns:

  • tuple: description

Notes:

Observed better performance with roux.stat.cluster.cluster_1d.


function get_grid

get_grid(
    x: <built-in function array>,
    y: <built-in function array>,
    z: <built-in function array> = None,
    off: int = 0,
    grids: int = 100,
    method='linear',
    test=False,
    **kws
) -> tuple

2D grids from 1d data.

Args:

  • x (np.array): vector.
  • y (np.array): vector.
  • z (np.array, optional): vector. Defaults to None.
  • off (int, optional): offsets. Defaults to 0.
  • grids (int, optional): grids. Defaults to 100.
  • method (str, optional): method. Defaults to 'linear'.
  • test (bool, optional): test. Defaults to False.

Returns:

  • tuple: output.

function fit_gaussian2d

fit_gaussian2d(
    x: <built-in function array>,
    y: <built-in function array>,
    z: <built-in function array>,
    grid=True,
    grids=20,
    method='linear',
    off=0,
    rescalez=True,
    test=False
) -> tuple

Fit gaussian 2D.

Args:

  • x (np.array): vector.
  • y (np.array): vector.
  • z (np.array): vector.
  • grid (bool, optional): grid. Defaults to True.
  • grids (int, optional): grids. Defaults to 20.
  • method (str, optional): method. Defaults to 'linear'.
  • off (int, optional): offsets. Defaults to 0.
  • rescalez (bool, optional): rescalez. Defaults to True.
  • test (bool, optional): test. Defaults to False.

Returns:

  • tuple: output.

function fit_2d_distribution_kde

fit_2d_distribution_kde(
    x: <built-in function array>,
    y: <built-in function array>,
    bandwidth: float,
    xmin: float = None,
    xmax: float = None,
    xbins=100j,
    ymin: float = None,
    ymax: float = None,
    ybins=100j,
    test=False,
    **kwargs
) -> tuple

2D kernel density estimate (KDE).

Notes:

Cut off outliers:

quantile_coff=0.01
params_grid=merge_dicts([
    df01.loc[:,var2col.values()].quantile(quantile_coff).rename(index=flip_dict({f"{k}min":var2col[k] for k in var2col})).to_dict(),
    df01.loc[:,var2col.values()].quantile(1-quantile_coff).rename(index=flip_dict({f"{k}max":var2col[k] for k in var2col})).to_dict(),
])

Args:

  • x (np.array): vector.
  • y (np.array): vector.
  • bandwidth (float): bandwidth
  • xmin (float, optional): x minimum. Defaults to None.
  • xmax (float, optional): x maximum. Defaults to None.
  • xbins (type, optional): x bins. Defaults to 100j.
  • ymin (float, optional): y minimum. Defaults to None.
  • ymax (float, optional): y maximum. Defaults to None.
  • ybins (type, optional): y bins. Defaults to 100j.
  • test (bool, optional): test. Defaults to False.

Returns:

  • tuple: output.

function check_poly_fit

check_poly_fit(d: DataFrame, xcol: str, ycol: str, degmax: int = 5) -> DataFrame

Check the fit of polynomial equations.

Args:

  • d (pd.DataFrame): input dataframe.
  • xcol (str): column containing the x values.
  • ycol (str): column containing the y values.
  • degmax (int, optional): degree maximum. Defaults to 5.

Returns:

  • pd.DataFrame: description

function mlr_2

mlr_2(df: DataFrame, coly: str, colxs: list) -> tuple

Multiple linear regression between two variables.

Args:

  • df (pd.DataFrame): input dataframe.
  • coly (str): column containing y values.
  • colxs (list): columns containing x values.

Returns:

  • tuple: output.

function get_mlr_2_str

get_mlr_2_str(df: DataFrame, coly: str, colxs: list) -> str

Get the result of the multiple linear regression between two variables as a string.

Args:

  • df (pd.DataFrame): input dataframe.
  • coly (str): column containing y values.
  • colxs (list): columns containing x values.

Returns:

  • str: output.

module roux.stat.sets

For set related stats.


function get_overlap

get_overlap(
    items_set: list,
    items_test: list,
    output_format: str = 'list'
) -> list

Get overlapping items as a list or a string.

Args:

  • items_set (list): items in the reference set
  • items_test (list): items to test
  • output_format (str, optional): format of the output. Defaults to 'list'.

Raises:

  • ValueError: output_format can be list or str

function get_overlap_size

get_overlap_size(
    items_set: list,
    items_test: list,
    fraction: bool = False,
    perc: bool = False,
    by: str = None
) -> float

Percentage Jaccard index.

Args:

  • items_set (list): items in the reference set
  • items_test (list): items to test
  • fraction (bool, optional): output fraction. Defaults to False.
  • perc (bool, optional): output percentage. Defaults to False.
  • by (str, optional): fraction by. Defaults to None.

Returns:

  • float: overlap size.
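
As a rough illustration of the quantities involved, the overlap between two item lists can be reported as a count, as the Jaccard index (intersection over union), or as a percentage. The function overlap_size below is a hypothetical sketch; it does not model the by parameter:

```python
def overlap_size(items_set: list, items_test: list, fraction: bool = False, perc: bool = False) -> float:
    """Hypothetical sketch: overlap count, or Jaccard index as fraction or percentage."""
    a, b = set(items_set), set(items_test)
    overlap = len(a & b)
    if not (fraction or perc):
        return overlap
    # Jaccard index: size of the intersection divided by size of the union.
    jaccard = overlap / len(a | b)
    return 100 * jaccard if perc else jaccard


print(overlap_size(["a", "b", "c"], ["b", "c", "d"], perc=True))
```

Here two of the four distinct items are shared, so the percentage is 50.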

function get_item_set_size_by_background

get_item_set_size_by_background(items_set: list, background: int) -> float

Item set size by background

Args:

  • items_set (list): items in the reference set
  • background (int): background size

Returns:

  • float: Item set size by background

Notes:

Denominator of the fold change.


function get_fold_change

get_fold_change(items_set: list, items_test: list, background: int) -> float

Get fold change.

Args:

  • items_set (list): items in the reference set
  • items_test (list): items to test
  • background (int): background size

Returns:

  • float: fold change

Notes:

fc = (intersection/(test items))/((items in the item set)/background)
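
The formula in the note can be written out directly. This sketch (the name fold_change is hypothetical) treats both inputs as sets and assumes non-empty inputs:

```python
def fold_change(items_set: list, items_test: list, background: int) -> float:
    """Sketch of fc = (intersection/(test items))/((items in the item set)/background)."""
    a, b = set(items_set), set(items_test)
    # Numerator: fraction of the test items that fall in the reference set.
    observed = len(a & b) / len(b)
    # Denominator: fraction of the background covered by the reference set.
    expected = len(a) / background
    return observed / expected


print(fold_change(["a", "b", "c", "d"], ["a", "b"], background=8))
```

With all test items inside a reference set that covers half of the background, the fold change is 2.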


function get_hypergeom_pval

get_hypergeom_pval(items_set: list, items_test: list, background: int) -> float

Calculate hypergeometric P-value.

Args:

  • items_set (list): items in the reference set
  • items_test (list): items to test
  • background (int): background size

Returns:

  • float: hypergeometric P-value
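
For orientation, a hypergeometric P-value for enrichment is commonly computed as the probability of seeing at least the observed overlap. The sketch below uses math.comb (Python 3.8+); the one-sided upper tail is an assumption about the intended test, and hypergeom_pval is a hypothetical name:

```python
from math import comb


def hypergeom_pval(items_set: list, items_test: list, background: int) -> float:
    """Hypothetical sketch: P(overlap >= observed) under the hypergeometric null."""
    a, b = set(items_set), set(items_test)
    k, K, n, N = len(a & b), len(a), len(b), background
    total = comb(N, n)
    # Sum the probabilities of the observed overlap and every larger one.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)) / total


pval = hypergeom_pval(["a", "b", "c", "d"], ["a", "b"], background=8)
print(pval)
```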

function get_contigency_table

get_contigency_table(items_set: list, items_test: list, background: int) -> list

Get a contingency table required for the Fisher's test.

Args:

  • items_set (list): items in the reference set
  • items_test (list): items to test
  • background (int): background size

Returns:

  • list: contingency table

Notes:

                            within item (/reference) set:
                            True            False
within test item:   True    intersection    True False
                    False   False False     total-size of union


function get_odds_ratio

get_odds_ratio(items_set: list, items_test: list, background: int) -> float

Calculate Odds ratio and P-values using Fisher's exact test.

Args:

  • items_set (list): items in the reference set
  • items_test (list): items to test
  • background (int): background size

Returns:

  • float: Odds ratio
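
A 2x2 contingency table and the sample odds ratio can be sketched as below. The cell order (rows: within test items or not; columns: within the reference set or not) is an assumption, and the names are hypothetical. The P-value of Fisher's exact test is not computed here; scipy.stats.fisher_exact accepts a table of this shape.

```python
def contingency_table(items_set: list, items_test: list, background: int) -> list:
    """Hypothetical sketch of a 2x2 table for two item lists."""
    a, b = set(items_set), set(items_test)
    both = len(a & b)
    only_test = len(b - a)
    only_set = len(a - b)
    # Items of the background that are in neither list.
    neither = background - len(a | b)
    return [[both, only_test], [only_set, neither]]


def odds_ratio(table: list) -> float:
    """Sample odds ratio; undefined when an off-diagonal cell is zero."""
    (both, only_test), (only_set, neither) = table
    return (both * neither) / (only_test * only_set)


table = contingency_table(["a", "b", "c"], ["a", "b", "d"], background=10)
print(table, odds_ratio(table))
```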

function get_enrichment

get_enrichment(
    df1: DataFrame,
    df2: DataFrame,
    colid: str,
    colset: str,
    background: int,
    coltest: str = None,
    test_type: list = None,
    verbose: bool = False
) -> DataFrame

Calculate the enrichments.

Args:

  • df1 (pd.DataFrame): table containing items to test
  • df2 (pd.DataFrame): table containing reference sets and items
  • colid (str): column with IDs of items
  • colset (str): column sets
  • coltest (str): column tests
  • background (int): background size.
  • test_type (list): hypergeom or Fisher. Defaults to both.
  • verbose (bool): verbose

Returns:

  • pd.DataFrame: output table

module roux.viz.ds

For wrappers around pandas Series plotting attributes.


function hist

hist(ds: Series, ax: Axes = None, kws_set_label_n={}, **kws)

module roux.viz.blends

Blends of plotting functions.


function plot_ranks

plot_ranks(
    data: DataFrame,
    kws_plot: dict,
    col: str,
    colid: str,
    col_label: str = None,
    xlim_min: float = -20,
    ax=None
)

module roux.viz.colors

For setting up colors.


function rgbfloat2int

rgbfloat2int(rgb_float)

function get_colors_default

get_colors_default() -> list

Get default colors.

Returns:

  • list: colors.

function get_ncolors

get_ncolors(
    n: int,
    cmap: str = 'Spectral',
    ceil: bool = False,
    test: bool = False,
    N: int = 20,
    out: str = 'hex',
    **kws_get_cmap_section
) -> list

Get colors.

Args:

  • n (int): number of colors to get.
  • cmap (str, optional): colormap. Defaults to 'Spectral'.
  • ceil (bool, optional): ceil. Defaults to False.
  • test (bool, optional): test mode. Defaults to False.
  • N (int, optional): number of colors in the colormap. Defaults to 20.
  • out (str, optional): output. Defaults to 'hex'.

Returns:

  • list: colors.

function get_val2color

get_val2color(
    ds: Series,
    vmin: float = None,
    vmax: float = None,
    cmap: str = 'Reds'
) -> dict

Get color for a value.

Args:

  • ds (pd.Series): values.
  • vmin (float, optional): minimum value. Defaults to None.
  • vmax (float, optional): maximum value. Defaults to None.
  • cmap (str, optional): colormap. Defaults to 'Reds'.

Returns:

  • dict: output.

function saturate_color

saturate_color(color, alpha: float) -> object

Saturate a color.

Args:

  • color: input color.
  • alpha (float): alpha level.

Returns:

  • object: output.

References:

  • https://stackoverflow.com/a/60562502/3521099

function mix_colors

mix_colors(d: dict) -> str

Mix colors.

Args:

  • d (dict): colors to alpha map.

Returns:

  • str: hex color.

References:

  • https://stackoverflow.com/a/61488997/3521099
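
One simple way to mix colors from a color-to-alpha map is a weighted average of the RGB channels. This is an illustrative assumption, not necessarily the method used by mix_colors or by the referenced answer; mix_colors_sketch is a hypothetical name and expects 6-digit hex colors:

```python
def mix_colors_sketch(d: dict) -> str:
    """Hypothetical sketch: alpha-weighted average of hex colors."""
    total = sum(d.values())
    channels = [0.0, 0.0, 0.0]
    for color, alpha in d.items():
        # Split '#rrggbb' into its three integer channels.
        rgb = [int(color.lstrip("#")[i:i + 2], 16) for i in (0, 2, 4)]
        for i in range(3):
            channels[i] += rgb[i] * alpha / total
    return "#" + "".join(f"{round(c):02x}" for c in channels)


print(mix_colors_sketch({"#ff0000": 1, "#0000ff": 3}))
```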

function make_cmap

make_cmap(cs: list, N: int = 20, **kws)

Create a colormap.

Args:

  • cs (list): colors
  • N (int, optional): resolution i.e. number of colors. Defaults to 20.

Returns: cmap.


function get_cmap_section

get_cmap_section(
    cmap,
    vmin: float = 0.0,
    vmax: float = 1.0,
    n: int = 100
) -> object

Get section of a colormap.

Args:

  • cmap (object| str): colormap.
  • vmin (float, optional): minimum value. Defaults to 0.0.
  • vmax (float, optional): maximum value. Defaults to 1.0.
  • n (int, optional): resolution i.e. number of colors. Defaults to 100.

Returns:

  • object: cmap.

function append_cmap

append_cmap(
    cmap: str = 'Reds',
    color: str = '#D3DDDC',
    cmap_min: float = 0.2,
    cmap_max: float = 0.8,
    ncolors: int = 100,
    ncolors_min: int = 1,
    ncolors_max: int = 0
)

Append a color to colormap.

Args:

  • cmap (str, optional): colormap. Defaults to 'Reds'.
  • color (str, optional): color. Defaults to '#D3DDDC'.
  • cmap_min (float, optional): cmap_min. Defaults to 0.2.
  • cmap_max (float, optional): cmap_max. Defaults to 0.8.
  • ncolors (int, optional): number of colors. Defaults to 100.
  • ncolors_min (int, optional): number of colors minimum. Defaults to 1.
  • ncolors_max (int, optional): number of colors maximum. Defaults to 0.

Returns: cmap.

References:

  • https://matplotlib.org/stable/tutorials/colors/colormap-manipulation.html

module roux.viz.diagram

For diagrams e.g. flowcharts


function diagram_nb

diagram_nb(
    graph: str,
    counts: dict = None,
    out: bool = False,
    test: bool = False
)

Show a diagram in jupyter notebook using mermaid.js.

Parameters:

References:

  • 1. https://mermaid.js.org/config/Tutorials.html#jupyter-integration-with-mermaid-js

Examples:

graph LR;
    i1(["input1"]) & d1[("data1")]
    --> p1[["process1"]]
        --> o1(["output1"])
    p1 --> o2["output2"]:::ends
classDef ends fill:#fff,stroke:#fff

module roux.workflow

Global Variables

  • io
  • log
  • task
  • nb

module roux.global_imports

For importing commonly used functions at the development phase.

Requirements:

pip install roux[all]

Usage: in interactive sessions (e.g. in jupyter notebooks) to facilitate faster code development.

Note: Post-development, to remove *s from the code, use removestar (pip install removestar).

removestar file

module roux.viz.annot

For annotations.


function annot_side

annot_side(
    ax: Axes,
    df1: DataFrame,
    colx: str,
    coly: str,
    cols: str = None,
    hue: str = None,
    loc: str = 'right',
    scatter=False,
    scatter_marker='|',
    scatter_alpha=0.75,
    lines=True,
    offx3: float = 0.15,
    offymin: float = 0.1,
    offymax: float = 0.9,
    length_axhline: float = 3,
    text=True,
    text_offx: float = 0,
    text_offy: float = 0,
    invert_xaxis: bool = False,
    break_pt: int = 25,
    va: str = 'bottom',
    zorder: int = 2,
    color: str = 'gray',
    kws_line: dict = {},
    kws_scatter: dict = {},
    **kws_text
) -> Axes

Annotate elements of the plot on the side of the plot.

Args:

  • df1 (pd.DataFrame): input data
  • colx (str): column with x values.
  • coly (str): column with y values.
  • cols (str): column with labels.
  • hue (str): column with colors of the labels.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • loc (str, optional): location. Defaults to 'right'.
  • invert_xaxis (bool, optional): invert xaxis. Defaults to False.
  • offx3 (float, optional): x-offset for bend position of the arrow. Defaults to 0.15.
  • offymin (float, optional): y-offset minimum. Defaults to 0.1.
  • offymax (float, optional): y-offset maximum. Defaults to 0.9.
  • break_pt (int, optional): break point of the labels. Defaults to 25.
  • length_axhline (float, optional): length of the horizontal line i.e. the "underline". Defaults to 3.
  • zorder (int, optional): z-order. Defaults to 2.
  • color (str, optional): color of the line. Defaults to 'gray'.
  • kws_line (dict, optional): parameters for formatting the line. Defaults to {}.

Keyword Args:

  • kws: parameters provided to the ax.text function.

Returns:

  • plt.Axes: plt.Axes object.

function annot_side_curved

annot_side_curved(
    data,
    colx: str,
    coly: str,
    col_label: str,
    off: float = 0.5,
    lim: tuple = None,
    limf: tuple = None,
    loc: str = 'right',
    ax=None,
    test: bool = False,
    kws_text={},
    **kws_line
)

Annotate elements of the plot on the side of the plot using Bezier lines.

Usage: 1. Allows m:1 mappings between points and labels


function show_outlines

show_outlines(
    data: DataFrame,
    colx: str,
    coly: str,
    column_outlines: str,
    outline_colors: dict,
    style=None,
    legend: bool = True,
    kws_legend: dict = {},
    zorder: int = 3,
    ax: Axes = None,
    **kws_scatter
) → Axes

Outline points on the scatter plot by categories.


function show_confidence_ellipse

show_confidence_ellipse(x, y, ax, n_std=3.0, facecolor='none', **kwargs)

Create a plot of the covariance confidence ellipse of x and y.

Parameters:

  • x, y (array-like, shape (n, )): input data.
  • ax (matplotlib.axes.Axes): the axes object to draw the ellipse into.
  • n_std (float): the number of standard deviations that determines the ellipse's radii.
  • **kwargs: forwarded to matplotlib.patches.Ellipse.

Returns:

  • matplotlib.patches.Ellipse

References:

  • https://matplotlib.org/3.5.0/gallery/statistics/confidence_ellipse.html


function show_box

show_box(
    ax: Axes,
    xy: tuple,
    width: float,
    height: float,
    fill: str = None,
    alpha: float = 1,
    lw: float = 1.1,
    edgecolor: str = 'k',
    clip_on: bool = False,
    scale_width: float = 1,
    scale_height: float = 1,
    xoff: float = 0,
    yoff: float = 0,
    **kws
) → Axes

Highlight sections of a plot e.g. heatmap by drawing boxes.

Args:

  • xy (tuple): position of left, bottom corner of the box.
  • width (float): width.
  • height (float): height.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • fill (str, optional): fill the box with color. Defaults to None.
  • alpha (float, optional): alpha of color. Defaults to 1.
  • lw (float, optional): line width. Defaults to 1.1.
  • edgecolor (str, optional): edge color. Defaults to 'k'.
  • clip_on (bool, optional): clip the boxes by the axis limit. Defaults to False.
  • scale_width (float, optional): scale width. Defaults to 1.
  • scale_height (float, optional): scale height. Defaults to 1.
  • xoff (float, optional): x-offset. Defaults to 0.
  • yoff (float, optional): y-offset. Defaults to 0.

Keyword Args:

  • kws: parameters provided to the Rectangle function.

Returns:

  • plt.Axes: plt.Axes object.

function color_ax

color_ax(ax: Axes, c: str, linewidth: float = None) → Axes

Color border of plt.Axes.

Args:

  • ax (plt.Axes): plt.Axes object.
  • c (str): color.
  • linewidth (float, optional): line width. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function show_n_legend

show_n_legend(ax, df1: DataFrame, colid: str, colgroup: str, **kws)

function show_scatter_stats

show_scatter_stats(
    ax: Axes,
    data: DataFrame,
    x,
    y,
    z,
    method: str,
    resample: bool = False,
    show_n: bool = True,
    show_n_prefix: str = '',
    prefix: str = '',
    loc=None,
    zorder: int = 5,
    verbose: bool = True,
    kws_stat={},
    **kws_set_label
)

resample (bool, optional): resample data. Defaults to False.


function show_crosstab_stats

show_crosstab_stats(
    data: DataFrame,
    cols: list,
    method: str = None,
    alpha: float = 0.05,
    loc: str = None,
    xoff: float = 0,
    yoff: float = 0,
    linebreak: bool = False,
    ax: Axes = None,
    **kws_set_label
) → Axes

Annotate a confusion matrix.

Args:

  • data (pd.DataFrame): input data.
  • cols (list): list of columns with the categories.
  • method (str, optional): method used to calculate the statistical significance.
  • alpha (float, optional): alpha for the stats. Defaults to 0.05.
  • loc (str, optional): location. Over-rides kws_set_label. Defaults to None.
  • xoff (float, optional): x offset. Defaults to 0.
  • yoff (float, optional): y offset. Defaults to 0.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws_set_label: keyword parameters provided to set_label.

Returns:

  • plt.Axes: plt.Axes object.

function show_confusion_matrix_stats

show_confusion_matrix_stats(
    df_: DataFrame,
    ax: Axes = None,
    off: float = 0.5
) → Axes

Annotate a confusion matrix.

Args:

  • df_ (pd.DataFrame): input data.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • off (float, optional): offset. Defaults to 0.5.

Returns:

  • plt.Axes: plt.Axes object.

function set_suptitle

set_suptitle(axs, title, offy=0, **kws_text)

Combined title for a list of subplots.

module roux.vizi

module roux.lib.set

For processing list-like sets.


function union

union(l)

Union of lists.

Parameters:

  • l (list): list of lists.

Returns:

  • l (list): list.

function intersection

intersection(l)

Intersections of lists.

Parameters:

  • l (list): list of lists.

Returns:

  • l (list): list.

function nunion

nunion(l)

Count the items in union.

Parameters:

  • l (list): list of lists.

Returns:

  • i (int): count.

function nintersection

nintersection(l)

Count the items in intersection.

Parameters:

  • l (list): list of lists.

Returns:

  • i (int): count.
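
The set operations documented above can be illustrated with a small stand-alone sketch. The names union_sketch and intersection_sketch are hypothetical and written with the standard library only; this is not roux's implementation, and the order of the returned items is not guaranteed.

```python
from functools import reduce


def union_sketch(lists):
    """Items present in at least one of the lists (hypothetical helper)."""
    return list(reduce(lambda a, b: a | b, (set(x) for x in lists), set()))


def intersection_sketch(lists):
    """Items present in every list (hypothetical helper)."""
    sets = [set(x) for x in lists]
    return list(reduce(lambda a, b: a & b, sets)) if sets else []


groups = [["a", "b", "c"], ["b", "c", "d"], ["c", "e"]]
print(sorted(union_sketch(groups)))         # items in any list
print(sorted(intersection_sketch(groups)))  # items shared by all lists
# The counting variants simply take the length of the result.
print(len(union_sketch(groups)), len(intersection_sketch(groups)))
```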

function check_non_overlaps_with

check_non_overlaps_with(l1: list, l2: list, out_count: bool = False, log=True)

function validate_overlaps_with

validate_overlaps_with(l1, l2, **kws_check)

function assert_overlaps_with

assert_overlaps_with(l1, l2, out_count=False)

function jaccard_index

jaccard_index(l1, l2)
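
The Jaccard index of two collections is conventionally the size of their intersection divided by the size of their union. A minimal sketch of that definition follows; jaccard_sketch is a hypothetical name, and the handling of two empty inputs is an assumption rather than documented roux behaviour.

```python
def jaccard_sketch(l1, l2):
    """|intersection| / |union| of two lists, treated as sets."""
    s1, s2 = set(l1), set(l2)
    union = s1 | s2
    if not union:
        # Assumption: the index is undefined when both inputs are empty.
        return float("nan")
    return len(s1 & s2) / len(union)


print(jaccard_sketch(["a", "b", "c"], ["b", "c", "d"]))  # 2 shared / 4 total
```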

function dropna

dropna(x)

Drop np.nan items from a list.

Parameters:

  • x (list): list.

Returns:

  • x (list): list.

function unique

unique(l)

Unique items in a list.

Parameters:

  • l (list): input list.

Returns:

  • l (list): list.

Notes:

The function can return list of lists if used in pandas.core.groupby.DataFrameGroupBy.agg context.


function unique_sorted

unique_sorted(l)

Sorted unique items in a list.

Parameters:

  • l (list): input list.

Returns:

  • l (list): list.

Notes:

The function can return list of lists if used in pandas.core.groupby.DataFrameGroupBy.agg context.


function list2str

list2str(x, fmt=None, ignore=False)

Returns string if single item in a list.

Parameters:

  • x (list): list

Returns:

  • s (str): string.

function lists2str

lists2str(ds: DataFrame, **kws_list2str) → str

Combine lists with ids into a unified string.

Usage: pandas aggregation functions.


function unique_str

unique_str(l, **kws)

Unique single item from a list.

Parameters:

  • l (list): input list.

Returns:

  • l (list): list.

function nunique

nunique(l, **kws)

Count unique items in a list

Parameters:

  • l (list): list

Returns:

  • i (int): count.

function flatten

flatten(l)

List of lists to list.

Parameters:

  • l (list): input list.

Returns:

  • l (list): output list.
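
Flattening one level of nesting is commonly done with itertools.chain. The sketch below assumes exactly one level of nesting; flatten_sketch is a hypothetical name, not the library function.

```python
from itertools import chain


def flatten_sketch(lists):
    """Concatenate the inner lists of a list of lists (one level only)."""
    return list(chain.from_iterable(lists))


print(flatten_sketch([[1, 2], [3], []]))
```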

function get_alt

get_alt(l1, s)

Get alternate item between two.

Parameters:

  • l1 (list): list.
  • s (str): item.

Returns:

  • s (str): alternate item.

function intersections

intersections(dn2list, jaccard=False, count=True, fast=False, test=False)

Get intersections between lists.

Parameters:

  • dn2list (dict): dictionary mapping to lists.
  • jaccard (bool): return jaccard indices.
  • count (bool): return counts.
  • fast (bool): fast.
  • test (bool): verbose.

Returns:

  • df (DataFrame): output dataframe.

TODOs:

  1. Feed as an estimator to df.corr().
  2. Faster processing by filling up the symmetric half of the adjacency matrix.


function range_overlap

range_overlap(l1, l2)

Overlap between ranges.

Parameters:

  • l1 (list): start and end integers of one range.
  • l2 (list): start and end integers of other range.

Returns:

  • l (list): overlapped range.
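
As an illustration of the idea, the overlap of two integer ranges can be computed from the larger start and the smaller end. The sketch assumes inclusive ends and returns a [start, end] pair, or an empty list when the ranges do not overlap; both choices are assumptions, and range_overlap_sketch is a hypothetical name.

```python
def range_overlap_sketch(r1, r2):
    """Overlap of two inclusive integer ranges given as [start, end]."""
    start = max(r1[0], r2[0])
    end = min(r1[1], r2[1])
    # No overlap when the later start lies beyond the earlier end.
    return [start, end] if start <= end else []


print(range_overlap_sketch([1, 10], [5, 15]))
print(range_overlap_sketch([1, 3], [5, 8]))
```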

function get_windows

get_windows(
    a,
    size=None,
    overlap=None,
    windows=None,
    overlap_fraction=None,
    stretch_last=False,
    out_ranges=True
)

Windows/segments from a range.

Parameters:

  • a (list): range.
  • size (int): size of the windows.
  • windows (int): number of windows.
  • overlap_fraction (float): overlap fraction.
  • overlap (int): overlap length.
  • stretch_last (bool): stretch last window.
  • out_ranges (bool): whether to output ranges.

Returns:

  • df1 (DataFrame): output dataframe.

Notes:

  1. For development, use of int provides np.floor.

function bools2intervals

bools2intervals(v)

Convert bools to intervals.

Parameters:

  • v (list): list of bools.

Returns:

  • l (list): intervals.
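
One way to turn a list of booleans into intervals is to group consecutive equal values and record the positions of each run of True. The sketch below assumes 0-based positions with inclusive ends, which may differ from the output format of the library function; bools2intervals_sketch is a hypothetical name.

```python
from itertools import groupby


def bools2intervals_sketch(values):
    """(start, end) index pairs for each consecutive run of True values."""
    intervals = []
    position = 0
    for flag, run in groupby(values):
        length = len(list(run))
        if flag:
            intervals.append((position, position + length - 1))
        position += length
    return intervals


flags = [False, True, True, False, True]
print(bools2intervals_sketch(flags))
```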

function list2ranges

list2ranges(l)

function get_pairs

get_pairs(
    items: list,
    items_with: list = None,
    size: int = 2,
    with_self: bool = False,
    unique: bool = False
) → DataFrame

Creates a dataframe with the paired items.

Parameters:

  • items: the list of items to pair.
  • items_with: list of items to pair with.
  • size: size of the combinations.
  • with_self: pair with self or not.
  • unique (bool): get unique pairs (defaults to False).

Returns: table with pairs of items.

Notes:

  1. The ids of the items are sorted e.g. 'a'-'b' not 'b'-'a'.
  2. itertools.combinations does not pair self.
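
The two notes can be seen directly with the standard library. This sketch only shows the itertools behaviour the notes refer to; it does not build the dataframe that get_pairs returns, and the variable names are illustrative.

```python
from itertools import combinations, combinations_with_replacement

items = ["b", "a", "c"]

# Sorting first gives pairs such as ('a', 'b') rather than ('b', 'a').
pairs = list(combinations(sorted(items), 2))
print(pairs)

# combinations never pairs an item with itself;
# combinations_with_replacement adds the self-pairs.
pairs_with_self = list(combinations_with_replacement(sorted(items), 2))
print(pairs_with_self)
```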

module roux.stat.solve

For solving equations.


function get_intersection_locations

get_intersection_locations(
    y1: np.array,
    y2: np.array,
    test: bool = False,
    x: np.array = None
) → list

Get co-ordinates of the intersection (x[idx]).

Args:

  • y1 (np.array): vector.
  • y2 (np.array): vector.
  • test (bool, optional): test mode. Defaults to False.
  • x (np.array, optional): vector. Defaults to None.

Returns:

  • list: output.
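
A common way to locate the intersections of two sampled curves is to look for sign changes in y1 - y2. The sketch below illustrates that general technique in plain Python; it is an assumption that the library function works along these lines, and intersection_indices_sketch is a hypothetical name.

```python
def intersection_indices_sketch(y1, y2):
    """Indices i where the sign of (y1 - y2) changes between i and i + 1."""
    diffs = [a - b for a, b in zip(y1, y2)]
    signs = [(d > 0) - (d < 0) for d in diffs]  # -1, 0 or 1
    return [i for i in range(len(signs) - 1) if signs[i] != signs[i + 1]]


x = [10, 20, 30, 40]
y1 = [0, 1, 2, 3]
y2 = [3, 2, 1, 0]
idx = intersection_indices_sketch(y1, y2)
# The crossing lies between x[i] and x[i + 1] for each reported index i.
print([x[i] for i in idx])
```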

module roux.stat.preprocess

For classification.


function dropna_matrix

dropna_matrix(
    df1,
    coff_cols_min_perc_na=5,
    coff_rows_min_perc_na=5,
    test=False,
    verbose=False
)

function drop_low_complexity

drop_low_complexity(
    df1: DataFrame,
    min_nunique: int,
    max_inflation: int,
    max_nunique: int = None,
    cols: list = None,
    cols_keep: list = [],
    test: bool = False,
    verbose: bool = False
) → DataFrame

Remove low-complexity columns from the data.

Args:

  • df1 (pd.DataFrame): input data.
  • min_nunique (int): minimum unique values.
  • max_inflation (int): maximum over-representation of the values.
  • cols (list, optional): columns. Defaults to None.
  • cols_keep (list, optional): columns to keep. Defaults to [].
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • pd.DataFrame: output data.

function get_cols_x_for_comparison

get_cols_x_for_comparison(
    df1: DataFrame,
    cols_y: list,
    cols_index: list,
    cols_drop: list = [],
    cols_dropby_patterns: list = [],
    dropby_low_complexity: bool = True,
    min_nunique: int = 5,
    max_inflation: int = 50,
    dropby_collinearity: bool = True,
    coff_rs: float = 0.7,
    dropby_variance_inflation: bool = True,
    verbose: bool = False,
    test: bool = False
) → dict

Identify X columns.

Parameters:

  • df1 (pd.DataFrame): input table.
  • cols_y (list): y columns.

function to_preprocessed_data

to_preprocessed_data(
    df1: DataFrame,
    columns: dict,
    fill_missing_desc_value: bool = False,
    fill_missing_cont_value: bool = False,
    normby_zscore: bool = False,
    verbose: bool = False,
    test: bool = False
) → DataFrame

Preprocess data.


function to_filteredby_samples

to_filteredby_samples(
    df1: DataFrame,
    colindex: str,
    colsample: str,
    coff_samples_min: int,
    colsubset: str,
    coff_subsets_min: int = 2
) → DataFrame

Filter table before calculating differences. (1) Retain minimum number of samples per item representing a subset and (2) Retain minimum number of subsets per item.

Parameters:

  • df1 (pd.DataFrame): input table.
  • colindex (str): column containing items.
  • colsample (str): column containing samples.
  • coff_samples_min (int): minimum number of samples.
  • colsubset (str): column containing subsets.
  • coff_subsets_min (int): minimum number of subsets. Defaults to 2.

Returns: pd.DataFrame

Examples:

Parameters: colindex='genes id', colsample='sample id', coff_samples_min=3, colsubset='pLOF or WT', coff_subsets_min=2


function get_cvsplits

get_cvsplits(
    X: np.array,
    y: np.array = None,
    cv: int = 5,
    random_state: int = None,
    outtest: bool = True
) → dict

Get cross-validation splits. A friendly wrapper around sklearn.model_selection.KFold.

Args:

  • X (np.array): X matrix.
  • y (np.array): y vector.
  • cv (int, optional): cross validations. Defaults to 5.
  • random_state (int, optional): random state. Defaults to None.
  • outtest (bool, optional): output test data. Defaults to True.

Returns:

  • dict: output.

module roux.stat.io

For input/output of stats.


function perc_label

perc_label(a, b=None, bracket=True)

function pval2annot

pval2annot(
    pval: float,
    alternative: str = None,
    alpha: float = 0.05,
    fmt: str = '*',
    power: bool = True,
    linebreak: bool = False,
    replace_prefix: str = None
)

P/Q-value to annotation.

Parameters:

  • fmt (str): *|<|'num'

module roux.workflow.task

For task management.


function validate_params

validate_params(d: dict) → bool

function run_task

run_task(
    parameters: dict,
    input_notebook_path: str,
    kernel: str = None,
    output_notebook_path: str = None,
    start_timeout: int = 480,
    verbose=False,
    force=False,
    **kws_papermill
) → str

Run a single task.

Parameters:

  • parameters (dict): parameters including output_paths.
  • input_notebook_path (str): path to the input notebook which is parameterized.
  • kernel (str): kernel to be used.
  • output_notebook_path (str): path to the output notebook which is used as a report.
  • verbose (bool): verbose.

Keyword parameters:

  • kws_papermill: parameters provided to the pm.execute_notebook function.

Returns: Output path.


function apply_run_task

apply_run_task(
    x: str,
    input_notebook_path: str,
    kernel: str,
    force=False,
    **kws_papermill
)

function run_tasks

run_tasks(
    input_notebook_path: str,
    kernel: str = None,
    inputs: list = None,
    output_path_base: str = None,
    parameters_list=None,
    fast: bool = False,
    fast_workers: int = 6,
    to_filter_nbby_patterns_kws=None,
    input_notebook_temp_path=None,
    out_paths: bool = True,
    test1: bool = False,
    force: bool = False,
    test: bool = False,
    verbose: bool = False,
    **kws_papermill
) → list

Run a list of tasks.

Parameters:

  • input_notebook_path (str): path to the input notebook which is parameterized.
  • kernel (str): kernel to be used.
  • inputs (list): list of parameters without the output paths, which would be inferred by encoding.
  • output_path_base (str): output path with a placeholder e.g. 'path/to/{KEY}/file'.
  • parameters_list (list): list of parameters including the output paths.
  • out_paths (bool): return paths of the reports. Defaults to True.
  • test1 (bool): test only the first task in the list. Defaults to False.
  • fast (bool): enable parallel processing.
  • fast_workers (int): number of parallel processes.
  • force (bool): overwrite the outputs.
  • test (bool): test mode.
  • verbose (bool): verbose.

Keyword parameters:

  • kws_papermill: parameters provided to the pm.execute_notebook function e.g. working directory (cwd=).
  • to_filter_nbby_patterns_kws (dict): dictionary containing parameters to be provided to the to_filter_nbby_patterns function. Defaults to None.

Returns:

  • parameters_list (list): list of parameters including the output paths, inferred if not provided.

TODOs:

  0. Ignore temporary parameters e.g. test, verbose etc. while encoding inputs.
  1. Integrate with apply_on_paths for parallel processing etc.

Notes:

  1. To resolve "RuntimeError: This event loop is already running" in Python when using multiprocessing, execute: import nest_asyncio; nest_asyncio.apply()

module roux.viz.heatmap

For heatmaps.


function plot_table

plot_table(
    df1: DataFrame,
    xlabel: str = None,
    ylabel: str = None,
    annot: bool = True,
    cbar: bool = False,
    linecolor: str = 'k',
    linewidths: float = 1,
    cmap: str = None,
    sorty: bool = False,
    linebreaky: bool = False,
    scales: tuple = [1, 1],
    ax: Axes = None,
    **kws
) → Axes

Plot to show a table.

Args:

  • df1 (pd.DataFrame): input data.
  • xlabel (str, optional): x label. Defaults to None.
  • ylabel (str, optional): y label. Defaults to None.
  • annot (bool, optional): show numbers. Defaults to True.
  • cbar (bool, optional): show colorbar. Defaults to False.
  • linecolor (str, optional): line color. Defaults to 'k'.
  • linewidths (float, optional): line widths. Defaults to 1.
  • cmap (str, optional): color map. Defaults to None.
  • sorty (bool, optional): sort rows. Defaults to False.
  • linebreaky (bool, optional): linebreak for y labels. Defaults to False.
  • scales (tuple, optional): scale of the table. Defaults to [1,1].
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the sns.heatmap function.

Returns:

  • plt.Axes: plt.Axes object.

module roux.stat.paired

For paired stats.


function get_ratio_sorted

get_ratio_sorted(a: float, b: float, increase=True) → float

Get ratio sorted.

Args:

  • a (float): value #1.
  • b (float): value #2.
  • increase (bool, optional): check for increase. Defaults to True.

Returns:

  • float: output.

function diff

diff(a: float, b: float, absolute=True) → float

Get difference

Args:

  • a (float): value #1.
  • b (float): value #2.
  • absolute (bool, optional): get absolute difference. Defaults to True.

Returns:

  • float: output.

function get_diff_sorted

get_diff_sorted(a: float, b: float) → float

Difference sorted/absolute.

Args:

  • a (float): value #1.
  • b (float): value #2.

Returns:

  • float: output.

function balance

balance(a: float, b: float, absolute=True) → float

Balance.

Args:

  • a (float): value #1.
  • b (float): value #2.
  • absolute (bool, optional): absolute difference. Defaults to True.

Returns:

  • float: output.

function get_paired_sets_stats

get_paired_sets_stats(l1: list, l2: list, test: bool = False) → list

Paired stats comparing two sets.

Args:

  • l1 (list): set #1.
  • l2 (list): set #2.
  • test (bool): test mode. Defaults to False.

Returns:

  • list: tuple (overlap, intersection, union, ratio).

function get_stats_paired

get_stats_paired(
    df1: DataFrame,
    cols: list,
    input_logscale: bool,
    prefix: str = None,
    drop_cols: bool = False,
    unidirectional_stats: list = ['min', 'max'],
    fast: bool = False
) → DataFrame

Paired stats, row-wise.

Args:

  • df1 (pd.DataFrame): input data.
  • cols (list): columns.
  • input_logscale (bool): if the input data is log-scaled.
  • prefix (str, optional): prefix of the output column/s. Defaults to None.
  • drop_cols (bool, optional): drop these columns. Defaults to False.
  • unidirectional_stats (list, optional): column-wise stats. Defaults to ['min','max'].
  • fast (bool, optional): parallel processing. Defaults to False.

Returns:

  • pd.DataFrame: output dataframe.

function get_stats_paired_agg

get_stats_paired_agg(
    x: np.array,
    y: np.array,
    ignore: bool = False,
    verb: bool = True
) → Series

Paired stats aggregated, for example, to classify 2D distributions.

Args:

  • x (np.array): x vector.
  • y (np.array): y vector.
  • ignore (bool, optional): suppress warnings. Defaults to False.
  • verb (bool, optional): verbose. Defaults to True.

Returns:

  • pd.Series: output.

function classify_sharing

classify_sharing(
    df1: DataFrame,
    column_value: str,
    bins: list = [0, 25, 75, 100],
    labels: list = ['low', 'medium', 'high'],
    prefix: str = '',
    verbose: bool = False
) → DataFrame

Classify sharing % calculated from Jaccard index.

Parameters:

  • df1 (pd.DataFrame): input table.
  • column_value (str): column with values.
  • bins (list): bins. Defaults to [0,25,75,100].
  • labels (list): bin labels. Defaults to ['low','medium','high'],
  • prefix (str): prefix of the columns.
  • verbose (bool): verbose. Defaults to False.
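
With the default bins and labels, a sharing percentage maps to 'low', 'medium' or 'high'. The sketch below shows one way to do such binning with the standard library, assuming right-inclusive bins, i.e. (0, 25], (25, 75] and (75, 100]; whether the library includes the lowest edge is not stated here, and classify_sketch is a hypothetical name.

```python
from bisect import bisect_left


def classify_sketch(value, bins=(0, 25, 75, 100), labels=("low", "medium", "high")):
    """Label a percentage using right-inclusive bins; None if out of range."""
    if not bins[0] < value <= bins[-1]:
        return None
    return labels[bisect_left(bins, value) - 1]


for value in (10, 25, 50, 90):
    print(value, classify_sketch(value))
```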

module roux.stat.variance

For variance related stats.


function confidence_interval_95

confidence_interval_95(x: np.array) → float

95% confidence interval.

Args:

  • x (np.array): input vector.

Returns:

  • float: output.

function get_ci

get_ci(rs, ci_type, outstr=False)

function get_variance_inflation

get_variance_inflation(data, coly: str, cols_x: list = None)

Variance Inflation Factor (VIF). A wrapper around statsmodels' variance_inflation_factor function.

Parameters:

  • data (pd.DataFrame): input data.
  • coly (str): dependent variable.
  • cols_x (list): independent variables.

Returns: pd.Series

module roux.stat.norm

For normalisation.


function to_norm

to_norm(x, off=1e-05)

Normalise a vector bounded between 0 and 1.


function norm_by_quantile

norm_by_quantile(X: np.array) → np.array

Quantile normalize the columns of X.

Parameters:

  • X : 2D array of float, shape (M, N). The input data, with M rows (genes/features) and N columns (samples).

Returns:

  • Xn : 2D array of float, shape (M, N). The normalized data.

Notes:

Faster processing (~5 times faster than the other functions tested) because of the use of numpy arrays.

TODOs: Use sklearn.preprocessing.QuantileTransformer, whose output_distribution parameter allows rescaling back to the same kind of distribution.
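
For readers unfamiliar with quantile normalization, the textbook procedure is: sort each column, average the values at each rank across columns, then give every entry the average for its rank. The pure-Python sketch below illustrates that procedure on a small matrix; it is not the numpy-based implementation described above, it breaks ties by row order (an assumption), and quantile_normalize_sketch is a hypothetical name.

```python
def quantile_normalize_sketch(X):
    """Quantile-normalize the columns of a list-of-rows matrix."""
    n_rows, n_cols = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(n_cols)]
    sorted_columns = [sorted(col) for col in columns]
    # Mean of the values holding each rank, taken across the columns.
    rank_means = [
        sum(col[i] for col in sorted_columns) / n_cols for i in range(n_rows)
    ]
    Xn = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Row indices of this column ordered from smallest to largest value.
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            Xn[i][j] = rank_means[rank]
    return Xn


X = [[5, 4, 3], [2, 1, 4], [3, 5, 6], [4, 2, 8]]
Xn = quantile_normalize_sketch(X)
for row in Xn:
    print(row)
```

After normalization every column holds the same set of values, only in a different order.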


function norm_by_gaussian_kde

norm_by_gaussian_kde(
    values: np.array
) → np.array

Normalise matrix by gaussian KDE.

Args:

  • values (np.array): input matrix.

Returns:

  • np.array: output matrix.

References:

  • https://github.com/saezlab/protein_attenuation/blob/6c1e81af37d72ef09835ee287f63b000c7c6663c/src/protein_attenuation/utils.py

function zscore

zscore(df: DataFrame, cols: list = None) → DataFrame

Z-score.

Args:

  • df (pd.DataFrame): input table.

Returns:

  • pd.DataFrame: output table.

TODOs: 1. Use scipy's or sklearn's zscore because of its additional options, e.g. from scipy.stats import zscore; df.apply(zscore)


function zscore_robust

zscore_robust(a: np.array) → np.array

Robust Z-score.

Args:

  • a (np.array): input data.

Returns:

  • np.array: output.

Example:

    t = sc.stats.norm.rvs(size=100, scale=1, random_state=123456)
    plt.hist(t, bins=40)
    plt.hist(zscore_robust(t), bins=40)
    print(np.median(t), np.median(zscore_robust(t)))
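
A widely used definition of the robust Z-score centres the data on the median and scales by the median absolute deviation (MAD) multiplied by 1.4826, which makes it comparable to a standard deviation for normal data. The sketch below follows that definition; whether the library uses the same constant is an assumption, and zscore_robust_sketch is a hypothetical name.

```python
from statistics import median


def zscore_robust_sketch(values):
    """(x - median) / (1.4826 * MAD) for each x in values."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the robust Z-score is undefined.")
    scale = 1.4826 * mad
    return [(v - med) / scale for v in values]


data = [1, 2, 3, 4, 100]
scores = zscore_robust_sketch(data)
# The outlier barely affects the median and MAD, so it stands out clearly.
print(scores)
```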


function norm_covariance_PCA

norm_covariance_PCA(
    X: np.array,
    use_svd: bool = True,
    use_sklearn: bool = True,
    rescale_centered: bool = True,
    random_state: int = 0,
    test: bool = False,
    verbose: bool = False
) → np.array

Covariance normalization by PCA whitening.

Args:

  • X (np.array): input array
  • use_svd (bool, optional): use SVD method. Defaults to True.
  • use_sklearn (bool, optional): use sklearn for SVD method. Defaults to True.
  • rescale_centered (bool, optional): rescale to centered input. Defaults to True.
  • random_state (int, optional): random state. Defaults to 0.
  • test (bool, optional): test mode. Defaults to False.
  • verbose (bool, optional): verbose. Defaults to False.

Returns:

  • np.array: transformed data.

module roux.stat.diff

For difference related stats.


function compare_classes

compare_classes(x, y, method=None)

Compare classes


function compare_classes_many

compare_classes_many(df1: DataFrame, cols_y: list, cols_x: list) → DataFrame

function get_pval

get_pval(
    df: DataFrame,
    colvalue='value',
    colsubset='subset',
    colvalue_bool=False,
    colindex=None,
    subsets=None,
    test=False,
    func=None
) → tuple

Get p-value.

Args:

  • df (DataFrame): input dataframe.
  • colvalue (str, optional): column with values. Defaults to 'value'.
  • colsubset (str, optional): column with subsets. Defaults to 'subset'.
  • colvalue_bool (bool, optional): column with boolean values. Defaults to False.
  • colindex (str, optional): column with the index. Defaults to None.
  • subsets (list, optional): subset types. Defaults to None.
  • test (bool, optional): test. Defaults to False.
  • func (function, optional): function. Defaults to None.

Raises:

  • ArgumentError: colvalue or colsubset not found in df.
  • ValueError: need only 2 subsets.

Returns:

  • tuple: stat,p-value

function get_stat

get_stat(
    df1: DataFrame,
    colsubset: str,
    colvalue: str,
    colindex: str,
    subsets=None,
    cols_subsets=['subset1', 'subset2'],
    df2=None,
    stats=['mean', 'median', 'var', 'size'],
    coff_samples_min=None,
    verb=False,
    func=None,
    **kws
) → DataFrame

Get statistics.

Args:

  • df1 (DataFrame): input dataframe.
  • colvalue (str): column with values.
  • colsubset (str): column with subsets.
  • colindex (str): column with the index.
  • subsets (list, optional): subset types. Defaults to None.
  • cols_subsets (list, optional): columns with subsets. Defaults to ['subset1', 'subset2'].
  • df2 (DataFrame, optional): second dataframe. Defaults to None.
  • stats (list, optional): summary statistics. Defaults to ['mean', 'median', 'var', 'size'].
  • coff_samples_min (int, optional): minimum sample size required. Defaults to None.
  • verb (bool, optional): verbose. Defaults to False.

Keyword Arguments:

  • kws: parameters provided to get_pval function.

Raises:

  • ArgumentError: colvalue or colsubset not found in df.
  • ValueError: len(subsets)<2

Returns:

  • DataFrame: output dataframe.

TODOs: 1. Rename to more specific get_diff, also other get_stat*/get_pval* functions.


function get_stats

get_stats(
    df1: DataFrame,
    colsubset: str,
    cols_value: list,
    colindex: str,
    subsets=None,
    df2=None,
    cols_subsets=['subset1', 'subset2'],
    stats=['mean', 'median', 'var', 'size'],
    axis=0,
    test=False,
    **kws
) → DataFrame

Get statistics by iterating over columns with values.

Args:

  • df1 (DataFrame): input dataframe.
  • colsubset (str): column with subsets.
  • cols_value (list): list of columns with values.
  • colindex (str): column with the index.
  • subsets (list, optional): subset types. Defaults to None.
  • df2 (DataFrame, optional): second dataframe, e.g. pd.DataFrame({"subset1":['test'],"subset2":['reference']}). Defaults to None.
  • cols_subsets (list, optional): columns with subsets. Defaults to ['subset1', 'subset2'].
  • stats (list, optional): summary statistics. Defaults to ['mean', 'median', 'var', 'size'].
  • axis (int, optional): 1 if different tests else use 0. Defaults to 0.

Keyword Arguments:

  • kws: parameters provided to get_pval function.

Raises:

  • ArgumentError: colvalue or colsubset not found in df.
  • ValueError: len(subsets)<2

Returns:

  • DataFrame: output dataframe.

TODOs: 1. No column prefix if len(cols_value)==1.


function get_significant_changes

get_significant_changes(
    df1: DataFrame,
    coff_p=0.025,
    coff_q=0.1,
    alpha=None,
    change_type=['diff', 'ratio'],
    changeby='mean',
    value_aggs=['mean', 'median']
) → DataFrame

Get significant changes.

Args:

  • df1 (DataFrame): input dataframe.
  • coff_p (float, optional): cutoff on p-value. Defaults to 0.025.
  • coff_q (float, optional): cutoff on q-value. Defaults to 0.1.
  • alpha (float, optional): alias for coff_p. Defaults to None.
  • changeby (str, optional): "" to check for change by both mean and median. Defaults to 'mean'.
  • value_aggs (list, optional): values to aggregate. Defaults to ['mean','median'].

Returns:

  • DataFrame: output dataframe.

function apply_get_significant_changes

apply_get_significant_changes(
    df1: DataFrame,
    cols_value: list,
    cols_groupby: list,
    cols_grouped: list,
    fast=False,
    **kws
) → DataFrame

Apply on dataframe to get significant changes.

Args:

  • df1 (DataFrame): input dataframe.
  • cols_value (list): columns with values.
  • cols_groupby (list): columns with groups.

Returns:

  • DataFrame: output dataframe.

function get_stats_groupby

get_stats_groupby(
    df1: DataFrame,
    cols_group: list,
    coff_p: float = 0.05,
    coff_q: float = 0.1,
    alpha=None,
    fast=False,
    **kws
) → DataFrame

Iterate over groups, to get the differences.

Args:

  • df1 (DataFrame): input dataframe.
  • cols_group (list): columns to iterate over.
  • coff_p (float, optional): cutoff on p-value. Defaults to 0.05.
  • coff_q (float, optional): cutoff on q-value. Defaults to 0.1.
  • alpha (float, optional): alias for coff_p. Defaults to None.
  • fast (bool, optional): parallel processing. Defaults to False.

Returns:

  • DataFrame: output dataframe.

function get_diff

get_diff(
    df1: DataFrame,
    cols_x: list,
    cols_y: list,
    cols_index: list,
    cols_group: list,
    coff_p: float = None,
    test: bool = False,
    func=None,
    **kws
) → DataFrame

Wrapper around the get_stats_groupby

Keyword parameters: cols=['variable x','variable y'], coff_p=0.05, coff_q=0.01, colindex=['id'],


function binby_pvalue_coffs

binby_pvalue_coffs(
    df1: DataFrame,
    coffs=[0.01, 0.05, 0.1],
    color=False,
    testn='MWU test, FDR corrected',
    colindex='genes id',
    colgroup='tissue',
    preffix='',
    colns=None,
    palette=None
) → tuple

Bin data by pvalue cutoffs.

Args:

  • df1 (DataFrame): input dataframe.
  • coffs (list, optional): cut-offs. Defaults to [0.01,0.05,0.1].
  • color (bool, optional): color assignment. Defaults to False.
  • testn (str, optional): test number. Defaults to 'MWU test, FDR corrected'.
  • colindex (str, optional): column with index. Defaults to 'genes id'.
  • colgroup (str, optional): column with the groups. Defaults to 'tissue'.
  • preffix (str, optional): prefix. Defaults to ''.
  • colns (type, optional): columns number. Defaults to None.
  • palette (type, optional): color palette. Defaults to None.

Returns:

  • tuple: output.

Notes:

  1. To be deprecated in favor of the functions used for enrichment analysis, for example.

module roux.workflow.df

For management of tables.


function exclude_items

exclude_items(df1: DataFrame, metadata: dict) → DataFrame

Exclude items from the table with the workflow info.

Args:

  • df1 (pd.DataFrame): input table.
  • metadata (dict): metadata of the repository.

Returns:

  • pd.DataFrame: output.

module roux.lib.dict

For processing dictionaries.


function head_dict

head_dict(d, lines=5)

function sort_dict

sort_dict(d1, by=1, ascending=True)

Sort dictionary by values.

Parameters:

  • d1 (dict): input dictionary.
  • by (int): index of the value among the values.
  • ascending (bool): ascending order.

Returns:

  • d1 (dict): output dictionary.

function merge_dicts

merge_dicts(l: list) → dict

Merge dictionaries.

Parameters:

  • l (list): list containing the dictionaries.

Returns:

  • d (dict): output dictionary.

TODOs: 1. In python>=3.9, merged = d1 | d2?


function merge_dicts_deep

merge_dicts_deep(left: dict, right: dict) → dict

Merge nested dictionaries. Overwrites left with right.

Parameters:

  • left (dict): dictionary #1
  • right (dict): dictionary #2

TODOs: 1. In python>=3.9, merged = d1 | d2?
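
The difference between a deep merge and a shallow one (such as the d1 | d2 operator mentioned in the TODO) shows up with nested dictionaries. The recursive sketch below is a stand-alone illustration in which values from right overwrite those from left; merge_dicts_deep_sketch is a hypothetical name, and returning a new dictionary instead of modifying left is an assumption.

```python
def merge_dicts_deep_sketch(left, right):
    """Recursively merge right into a copy of left; right wins on conflicts."""
    merged = dict(left)
    for key, value in right.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dicts_deep_sketch(merged[key], value)
        else:
            merged[key] = value
    return merged


left = {"a": {"x": 1, "y": 2}, "b": 1}
right = {"a": {"y": 3, "z": 4}, "c": 5}

deep = merge_dicts_deep_sketch(left, right)
# Equivalent to left | right in Python >= 3.9: nested "a" is replaced whole.
shallow = {**left, **right}
print(deep)
print(shallow)
```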


function merge_dict_values

merge_dict_values(l, test=False)

Merge dictionary values.

Parameters:

  • l (list): list containing the dictionaries.
  • test (bool): verbose.

Returns:

  • d (dict): output dictionary.

function flip_dict

flip_dict(d)

Switch values with keys and vice versa.

Parameters:

  • d (dict): input dictionary.

Returns:

  • d (dict): output dictionary.
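
For a one-to-one mapping, flipping a dictionary is a single comprehension. The sketch assumes that the values are hashable and unique; how the library function treats repeated values is not covered here, and flip_dict_sketch is a hypothetical name.

```python
def flip_dict_sketch(d):
    """Swap keys and values of a one-to-one mapping."""
    flipped = {value: key for key, value in d.items()}
    if len(flipped) != len(d):
        # Assumption: repeated values are rejected rather than silently dropped.
        raise ValueError("Values are not unique; the mapping cannot be flipped.")
    return flipped


print(flip_dict_sketch({"a": 1, "b": 2}))
```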

module roux.workflow.nb

For operations on jupyter notebooks.


function get_lines

get_lines(p: str, keep_comments: bool = True)  list

Get lines of code from notebook.

Args:

  • p (str): path to notebook.
  • keep_comments (bool, optional): keep comments. Defaults to True.

Returns:

  • list: lines.
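
Because a notebook file is JSON, the idea can be sketched with the standard library alone. The following is an illustration under that assumption, not roux's implementation; get_code_lines_sketch is a hypothetical name, and the treatment of keep_comments (dropping lines that start with #) is a guess at the intended behaviour.

```python
import json
import os
import tempfile


def get_code_lines_sketch(p: str, keep_comments: bool = True) -> list:
    """Hypothetical stand-in: collect source lines from the code cells of a notebook."""
    with open(p, encoding="utf-8") as handle:
        nb = json.load(handle)
    lines = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The notebook format stores the source as a string or a list of strings.
        text = source if isinstance(source, str) else "".join(source)
        for line in text.splitlines():
            if not keep_comments and line.lstrip().startswith("#"):
                continue
            lines.append(line)
    return lines


# Build a tiny notebook on disk to try the function.
nb = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["# load data\n", "x = 1"]},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.ipynb")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(nb, handle)
    all_lines = get_code_lines_sketch(path)
    code_only = get_code_lines_sketch(path, keep_comments=False)

print(all_lines)  # ['# load data', 'x = 1']
print(code_only)  # ['x = 1']
```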

function read_nb_md

read_nb_md(p: str, n: int = None)  list

Read notebook's documentation in the markdown cells.

Args:

  • p (str): path of the notebook.
  • n (int): number of the markdown cells to extract.

Returns:

  • list: lines of the strings.

function to_info

to_info(p: str, outp: str, linkd: str = '')  str

Save README.md file with table of contents obtained from jupyter notebooks.

Args:

  • p (str, optional): path of the notebook files that would be converted to "tasks".
  • outp (str, optional): path of the output file, e.g. 'README.md'.

Returns:

  • str: path of the output file.

function to_replaced_nb

to_replaced_nb(
    nb_path,
    output_path,
    replaces: dict = {},
    cell_type: str = 'code',
    drop_lines_with_substrings: list = None,
    test=False
)

Replace text in a jupyter notebook.

Parameters:

  • nb: notebook object obtained from nbformat.reads.
  • replaces (dict): mapping of the text to replace to the text to replace it with.
  • cell_type (str): the type of the cell.

Returns:

  • new_nb: notebook object.

function to_filtered_nb

to_filtered_nb(
    p: str,
    outp: str,
    header: str,
    kind: str = 'include',
    validate_diff: int = None
)

Filter sections in a notebook based on markdown headings.

Args:

  • header (str): exact first line of a markdown cell marking a section in a notebook. validate_diff

function to_filter_nbby_patterns

to_filter_nbby_patterns(p, outp, patterns=None, **kws)

Filter out notebook cells if the pattern string is found.

Args:

  • patterns (list): list of string patterns.

function to_clear_unused_cells

to_clear_unused_cells(
    notebook_path,
    new_notebook_path,
    validate_diff: int = None
)

Remove code cells with all lines commented.
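
A minimal sketch of the filtering rule, applied to a notebook already loaded as a dictionary. It assumes that blank lines are ignored and that empty cells are kept; drop_commented_cells is a hypothetical name and this is not roux's implementation.

```python
def drop_commented_cells(nb: dict) -> dict:
    """Hypothetical stand-in: drop code cells in which every non-blank line is a comment."""

    def all_commented(cell: dict) -> bool:
        if cell.get("cell_type") != "code":
            return False
        source = cell.get("source", [])
        text = source if isinstance(source, str) else "".join(source)
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        # Empty cells are kept; only cells made up entirely of comments are dropped.
        return bool(lines) and all(line.startswith("#") for line in lines)

    return {**nb, "cells": [cell for cell in nb.get("cells", []) if not all_commented(cell)]}


nb = {"cells": [
    {"cell_type": "code", "source": ["# x = 1\n", "# y = 2"]},
    {"cell_type": "code", "source": ["z = 3  # keep"]},
    {"cell_type": "markdown", "source": ["# Heading"]},
]}
kept = drop_commented_cells(nb)["cells"]
print(len(kept))  # 2
```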


function to_clear_outputs

to_clear_outputs(notebook_path, new_notebook_path)

function to_filtered_outputs

to_filtered_outputs(input_path, output_path, warnings=True, strings=True)

module roux.viz.sets

For plotting sets.


function plot_venn

plot_venn(
    ds1: Series,
    ax: Axes = None,
    figsize: tuple = [2.5, 2.5],
    show_n: bool = True,
    outmore=False,
    **kws
)  Axes

Plot Venn diagram.

Args:

  • ds1 (pd.Series): input pandas.Series or dictionary. Subsets in the index levels, mapped to counts.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • figsize (tuple, optional): figure size. Defaults to [2.5,2.5].
  • show_n (bool, optional): show sample sizes. Defaults to True.

Returns:

  • plt.Axes: plt.Axes object.

function plot_intersection_counts

plot_intersection_counts(
    df1: DataFrame,
    cols: list = None,
    kind: str = 'table',
    method: str = None,
    show_counts: bool = True,
    show_pval: bool = True,
    confusion: bool = False,
    rename_cols: bool = False,
    sort_cols: tuple = [True, True],
    order_x: list = None,
    order_y: list = None,
    cmap: str = 'Reds',
    ax: Axes = None,
    kws_show_stats: dict = {},
    **kws_plot
)  Axes

Plot counts for the intersection between two sets.

Args:

  • df1 (pd.DataFrame): input data
  • cols (list, optional): columns. Defaults to None.
  • kind (str, optional): kind of plot: table or barplot. Defaults to 'table'.
  • method (str, optional): method to check the association ['chi2','FE']. Defaults to None.
  • rename_cols (bool, optional): rename the columns. Defaults to False.
  • show_pval (bool, optional): annotate p-values. Defaults to True.
  • cmap (str, optional): colormap. Defaults to 'Reds'.
  • kws_show_stats (dict, optional): arguments provided to stats function. Defaults to {}.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Raises:

  • ValueError: show_pval position should be the allowed one.

Keyword Args:

  • kws_plot: keyword arguments provided to the plotting function.

Returns:

  • plt.Axes: plt.Axes object.

TODOs: 1. Use compare_classes to get the stats.


function plot_intersections

plot_intersections(
    ds1: Series,
    item_name: str = None,
    figsize: tuple = [4, 4],
    text_width: float = 2,
    yorder: list = None,
    sort_by: str = 'cardinality',
    sort_categories_by: str = None,
    element_size: int = 40,
    facecolor: str = 'gray',
    bari_annot: int = None,
    totals_bar: bool = False,
    totals_text: bool = True,
    intersections_ylabel: float = None,
    intersections_min: float = None,
    test: bool = False,
    annot_text: bool = False,
    set_ylabelx: float = -0.25,
    set_ylabely: float = 0.5,
    **kws
)  Axes

Plot upset plot.

Args:

  • ds1 (pd.Series): input vector.
  • item_name (str, optional): name of items. Defaults to None.
  • figsize (tuple, optional): figure size. Defaults to [4,4].
  • text_width (float, optional): max. width of the text. Defaults to 2.
  • yorder (list, optional): order of y elements. Defaults to None.
  • sort_by (str, optional): sorting method. Defaults to 'cardinality'.
  • sort_categories_by (str, optional): sorting method. Defaults to None.
  • element_size (int, optional): size of elements. Defaults to 40.
  • facecolor (str, optional): facecolor. Defaults to 'gray'.
  • bari_annot (int, optional): annotate nth bar. Defaults to None.
  • totals_text (bool, optional): show totals. Defaults to True.
  • intersections_ylabel (float, optional): y-label of the intersections. Defaults to None.
  • intersections_min (float, optional): intersection minimum to show. Defaults to None.
  • test (bool, optional): test mode. Defaults to False.
  • annot_text (bool, optional): annotate text. Defaults to False.
  • set_ylabelx (float, optional): x position of the ylabel. Defaults to -0.25.
  • set_ylabely (float, optional): y position of the ylabel. Defaults to 0.5.

Keyword Args:

  • kws: parameters provided to the upset.plot function.

Returns:

  • plt.Axes: plt.Axes object.

Notes:

  • sort_by: {'cardinality', 'degree'}. If 'cardinality', subsets are listed from largest to smallest. If 'degree', they are listed in order of the number of categories intersected.
  • sort_categories_by: {'cardinality', None}. Whether to sort the categories by total cardinality, or leave them in the provided order.

References: https://upsetplot.readthedocs.io/en/stable/api.html


function plot_enrichment

plot_enrichment(
    data: DataFrame,
    x: str,
    y: str,
    s: str,
    hue='Q',
    xlabel=None,
    ylabel='significance\n(-log10(Q))',
    size: int = None,
    color: str = None,
    annots_side: int = 5,
    annots_side_labels=None,
    coff_fdr: float = None,
    xlim: tuple = None,
    xlim_off: float = 0.2,
    ylim: tuple = None,
    ax: Axes = None,
    break_pt: int = 25,
    annot_coff_fdr: bool = False,
    kws_annot: dict = {'loc': 'right', 'offx3': 0.15},
    returns='ax',
    **kwargs
)  Axes

Plot enrichment stats.

Args:

  • data (pd.DataFrame): input data.
  • x (str): x column.
  • y (str): y column.
  • s (str): size column.
  • size (int, optional): size of the points. Defaults to None.
  • color (str, optional): color of the points. Defaults to None.
  • annots_side (int, optional): how many labels to show on side. Defaults to 5.
  • coff_fdr (float, optional): FDR cutoff. Defaults to None.
  • xlim (tuple, optional): x-axis limits. Defaults to None.
  • xlim_off (float, optional): x-offset on limits. Defaults to 0.2.
  • ylim (tuple, optional): y-axis limits. Defaults to None.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • break_pt (int, optional): break point (' ') for the labels. Defaults to 25.
  • annot_coff_fdr (bool, optional): show FDR cutoff. Defaults to False.
  • kws_annot (dict, optional): parameters provided to the annot_side function. Defaults to dict(loc='right', annot_count_max=5, offx3=0.15).

Keyword Args:

  • kwargs: parameters provided to the sns.scatterplot function.

Returns:

  • plt.Axes: plt.Axes object.

function plot_pie

plot_pie(
    counts: list,
    labels: list,
    scales_line_xy: tuple = (1.1, 1.1),
    remove_wedges: list = None,
    remove_wedges_index: list = [],
    line_color: str = 'k',
    annot_side: bool = False,
    kws_annot_side: dict = {},
    ax: Axes = None,
    **kws_pie
)  Axes

Pie plot.

Args:

  • counts (list): counts.
  • labels (list): labels.
  • scales_line_xy (tuple, optional): scales for the lines. Defaults to (1.1,1.1).
  • remove_wedges (list, optional): remove wedge/s. Defaults to None.
  • remove_wedges_index (list, optional): remove wedge/s by index. Defaults to [].
  • line_color (str, optional): line color. Defaults to 'k'.
  • annot_side (bool, optional): annotations on side using the annot_side function. Defaults to False.
  • kws_annot_side (dict, optional): keyword arguments provided to the annot_side function. Defaults to {}.
  • ax (plt.Axes, optional): subplot. Defaults to None.

Keyword Args:

  • kws_pie: keyword arguments provided to the pie chart function.

Returns:

  • plt.Axes: subplot

References:

  • https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html

module roux.stat.compare

For comparison related stats.


function get_comparison

get_comparison(
    df1: DataFrame,
    d1: dict = None,
    coff_p: float = 0.05,
    between_ys: bool = False,
    verbose: bool = False,
    **kws
)

Compare the x and y columns.

Parameters:

  • df1 (pd.DataFrame): input table.
  • d1 (dict): columns dict, output of get_cols_x_for_comparison.
  • between_ys (bool): compare y's

Notes:

Column information:

d1={'cols_index': ['id'], 'cols_x': {'cont': [], 'desc': []}, 'cols_y': {'cont': [], 'desc': []}}

Comparison types:

  1. continuous vs continuous -> correlation
  2. discrete vs continuous -> difference
  3. discrete vs discrete -> FE or chi square
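
The mapping from column kinds to comparison types can be sketched as a small dispatch function. It reuses the 'cont' (continuous) and 'desc' (discrete) keys from the column dictionary; comparison_type is a hypothetical helper for illustration and not part of roux.

```python
def comparison_type(kind_x: str, kind_y: str) -> str:
    """Map a pair of column kinds ('cont' or 'desc') to the comparison named in the notes."""
    kinds = sorted([kind_x, kind_y])
    if kinds == ["cont", "cont"]:
        return "correlation"
    if kinds == ["cont", "desc"]:
        return "difference"
    if kinds == ["desc", "desc"]:
        return "FE or chi square"
    raise ValueError(f"unknown column kinds: {kind_x!r}, {kind_y!r}")


print(comparison_type("cont", "cont"))  # correlation
print(comparison_type("desc", "cont"))  # difference
print(comparison_type("desc", "desc"))  # FE or chi square
```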


function compare_strings

compare_strings(l0: list, l1: list, cutoff: float = 0.5)  DataFrame

Compare two lists of strings.

Parameters:

  • l0 (list): list of strings.
  • l1 (list): list of strings to compare with.
  • cutoff (float): threshold to filter the comparisons.

Returns: table with the similarity scores.

TODOs: 1. Add option for semantic similarity.
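
A minimal sketch of a pairwise comparison with a cutoff. The similarity measure used by roux is not stated here, so this illustration substitutes difflib.SequenceMatcher from the standard library and returns a list of rows instead of a DataFrame; compare_strings_sketch and the column names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def compare_strings_sketch(l0: list, l1: list, cutoff: float = 0.5) -> list:
    """Hypothetical stand-in: score every pair and keep those at or above the cutoff."""
    rows = []
    for s0, s1 in product(l0, l1):
        score = SequenceMatcher(None, s0, s1).ratio()
        if score >= cutoff:
            rows.append({"string1": s0, "string2": s1, "similarity": score})
    # Most similar pairs first.
    return sorted(rows, key=lambda row: row["similarity"], reverse=True)


rows = compare_strings_sketch(["gene"], ["gene", "genes", "xyz"])
for row in rows:
    print(row["string1"], row["string2"], round(row["similarity"], 2))
```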

module roux.lib.dfs

For processing multiple pandas DataFrames/Series


function filter_dfs

filter_dfs(dfs: list, cols: list, how: str = 'inner')  DataFrame

Filter dataframes based on items in the common columns.

Parameters:

  • dfs (list): list of dataframes.
  • cols (list): list of columns.
  • how (str): how to filter ('inner')

Returns:

  • dfs (list): list of dataframes.

function merge_with_many_columns

merge_with_many_columns(
    df1: DataFrame,
    right: str,
    left_on: str,
    right_ons: list,
    right_id: str,
    how: str = 'inner',
    validate: str = '1:1',
    test: bool = False,
    verbose: bool = False,
    **kws_merge
)  DataFrame

Merge with many columns. For example, if ids in the left table can map to ids located in multiple columns of the right table.

Parameters:

  • df1 (pd.DataFrame): left table.
  • right (pd.DataFrame): right table.
  • left_on (str): column in the left table to merge on.
  • right_ons (list): columns in the right table to merge on.
  • right_id (str): column in the right dataframe with for example the ids to be merged.

Keyword parameters:

  • kws_merge: to be supplied to pandas.DataFrame.merge.

Returns: Merged table.


function merge_paired

merge_paired(
    df1: DataFrame,
    df2: DataFrame,
    left_ons: list,
    right_on: list,
    common: list = [],
    right_ons_common: list = [],
    how: str = 'inner',
    validates: list = ['1:1', '1:1'],
    suffixes: list = None,
    test: bool = False,
    verb: bool = True,
    **kws
)  DataFrame

Merge unpaired dataframes to a paired dataframe.

Parameters:

  • df1 (DataFrame): paired dataframe.
  • df2 (DataFrame): unpaired dataframe.
  • left_ons (list): columns of the df1 (suffixed).
  • right_on (str|list): column/s of the df2 (to be suffixed).
  • common (str|list): common column/s between df1 and df2 (not suffixed).
  • right_ons_common (str|list): common column/s between df2 to be used for merging (not to be suffixed).
  • how (str): method of merging ('inner').
  • validates (list): validate mappings for the 1st mapping between df1 and df2 and 2nd one between df1+df2 and df2 (['1:1','1:1']).
  • suffixes (list): suffixes to be used (None).
  • test (bool): testing (False).
  • verb (bool): verbose (True).

Keyword Parameters:

  • kws (dict): parameters provided to merge.

Returns:

  • df (DataFrame): output dataframe.

Examples:

Parameters:

  how='inner',
  left_ons=['gene id gene1','gene id gene2'], # suffixed
  common='sample id', # not suffixed
  right_on='gene id', # to be suffixed
  right_ons_common=[], # not to be suffixed


function merge_dfs

merge_dfs(dfs: list, **kws)  DataFrame

Merge dataframes from left to right.

Parameters:

  • dfs (list): list of dataframes.

Keyword Parameters:

  • kws (dict): parameters provided to merge.

Returns:

  • df (DataFrame): output dataframe.

Notes:

For example, reduce(lambda x, y: x.merge(y), [1, 2, 3, 4, 5]) merges ((((1.merge(2)).merge(3)).merge(4)).merge(5)).
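
The left-to-right nesting described in the note can be made visible without any dataframes by letting reduce build the expression as a string. This is only an illustration of the folding order; the labels stand in for dataframes.

```python
from functools import reduce

# Each label stands in for a dataframe; the lambda records the order of the merges.
labels = ["1", "2", "3", "4", "5"]
nesting = reduce(lambda x, y: f"({x}.merge({y}))", labels)
print(nesting)  # ((((1.merge(2)).merge(3)).merge(4)).merge(5))
```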


function compare_rows

compare_rows(df1, df2, test=False, **kws)

module roux.viz.scatter

For scatter plots.


function plot_scatter_agg

plot_scatter_agg(
    dplot: DataFrame,
    x: str = None,
    y: str = None,
    z: str = None,
    kws_legend={'bbox_to_anchor': [1, 1], 'loc': 'upper left'},
    title=None,
    label_colorbar=None,
    ax=None,
    kind=None,
    verbose=False,
    cmap='Blues',
    gridsize=10,
    **kws
)

UNDER DEV.


function plot_scatter

plot_scatter(
    data: DataFrame,
    x: str = None,
    y: str = None,
    z: str = None,
    kind: str = 'scatter',
    scatter_kws={},
    line_kws={},
    stat_method: str = 'spearman',
    stat_kws={},
    hollow: bool = False,
    ax: Axes = None,
    verbose: bool = True,
    **kws
)  Axes

Plot scatter with multiple layers and stats.

Args:

  • data (pd.DataFrame): input dataframe.
  • x (str): x column.
  • y (str): y column.
  • z (str, optional): z column. Defaults to None.
  • kind (str, optional): kind of scatter. Defaults to 'scatter'.
  • trendline_method (str, optional): trendline method ['poly','lowess']. Defaults to 'poly'.
  • stat_method (str, optional): method of annotated stats ['mlr',"spearman"]. Defaults to "spearman".
  • cmap (str, optional): colormap. Defaults to 'Reds'.
  • label_colorbar (str, optional): label of the colorbar. Defaults to None.
  • gridsize (int, optional): number of grids in the hexbin. Defaults to 25.
  • bbox_to_anchor (list, optional): location of the legend. Defaults to [1,1].
  • loc (str, optional): location of the legend. Defaults to 'upper left'.
  • title (str, optional): title of the plot. Defaults to None.
  • #params_plot (dict, optional): parameters provided to the plot function. Defaults to {}.
  • line_kws (dict, optional): parameters provided to the plot_trendline function. Defaults to {}.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the plot function.

Returns:

  • plt.Axes: plt.Axes object.

Notes:

  1. For a rasterized scatter plot set scatter_kws={'rasterized': True}
  2. This function does not apply multiple colors, similar to sns.regplot.

function plot_qq

plot_qq(x: Series)  Axes

Plot QQ.

Args:

  • x (pd.Series): input vector.

Returns:

  • plt.Axes: plt.Axes object.

function plot_ranks

plot_ranks(
    df1: DataFrame,
    col: str,
    colid: str,
    ranks_on: str = 'y',
    ascending: bool = True,
    col_rank: str = None,
    line: bool = True,
    kws_line={},
    show_topn: int = None,
    show_ids: list = None,
    ax=None,
    **kws
)  Axes

Plot rankings.

Args:

  • dplot (pd.DataFrame): input data.
  • colx (str): x column.
  • coly (str): y column.
  • colid (str): column with unique ids.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the seaborn.scatterplot function.

Returns:

  • plt.Axes: plt.Axes object.

Usage: Combined with annotations using annot_side.


function plot_volcano

plot_volcano(
    data: DataFrame,
    colx: str,
    coly: str,
    colindex: str,
    hue: str = 'x',
    style: str = 'P=0',
    style_order: list = ['o', '^'],
    markers: list = ['o', '^'],
    show_labels: int = None,
    labels_layout: str = None,
    labels_kws: dict = {},
    show_outlines: int = None,
    outline_colors: list = ['k'],
    collabel: str = None,
    show_line=True,
    line_pvalue=0.1,
    line_x: float = 0.0,
    line_x_min: float = None,
    show_text: bool = True,
    text_increase: str = None,
    text_decrease: str = None,
    text_diff: str = None,
    legend: bool = False,
    verbose: bool = False,
    p_min: float = None,
    ax: Axes = None,
    outmore: bool = False,
    kws_legend: dict = {},
    **kws_scatterplot
)  Axes

Volcano plot.

Parameters:

Keyword parameters:

Returns: plt.Axes

module roux.run

For access to a few functions from the terminal.

module roux.lib.str

For processing strings.


function substitution

substitution(s, i, replaceby)

Substitute character in a string.

Parameters:

  • s (string): string.
  • i (int): location.
  • replaceby (string): character to substitute with.

Returns:

  • s (string): output string.
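
A minimal sketch of a single-character substitution by slicing, assuming that i is a 0-based position; substitution_sketch is a hypothetical name and not roux's implementation.

```python
def substitution_sketch(s: str, i: int, replaceby: str) -> str:
    """Hypothetical stand-in: replace the character at 0-based position i."""
    return s[:i] + replaceby + s[i + 1:]


print(substitution_sketch("ACGT", 2, "T"))  # ACTT
```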

function replace_many

replace_many(
    s: str,
    replaces: dict,
    replacewith: str = '',
    ignore: bool = False
)

Rename by replacing sub-strings.

Parameters:

  • s (str): input string.
  • replaces (dict|list): from->to format or list containing substrings to remove.
  • replacewith (str): replacement string used when replaces is a list.
  • ignore (bool): if True, do not validate that the replacements succeeded.

Returns:

  • s (str): output string.
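
A minimal sketch of how the two forms of replaces could behave. It assumes that validation means raising an error when a substring to replace is absent, which is an interpretation of the ignore parameter; replace_many_sketch is a hypothetical stand-in.

```python
def replace_many_sketch(s: str, replaces, replacewith: str = "", ignore: bool = False) -> str:
    """Hypothetical stand-in: apply several sub-string replacements in order."""
    if isinstance(replaces, list):
        # A list means: replace each listed sub-string with `replacewith`.
        replaces = {old: replacewith for old in replaces}
    for old, new in replaces.items():
        if old not in s and not ignore:
            raise ValueError(f"substring not found: {old!r}")
        s = s.replace(old, new)
    return s


print(replace_many_sketch("gene id (human)", {" ": "_", "(": "", ")": ""}))  # gene_id_human
print(replace_many_sketch("gene id (human)", ["(", ")"]))                    # gene id human
```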

function filter_list

filter_list(l: list, patterns: list, kind='out')  list

Filter a list of strings.

Args:

  • l (list): list of strings.
  • patterns (list): list of regex patterns. patterns are applied after stripping the whitespaces.

Returns: (list) list of filtered strings.


function tuple2str

tuple2str(tup, sep=' ')

Join tuple items.

Parameters:

  • tup (tuple|list): input tuple/list.
  • sep (str): separator between the items.

Returns:

  • s (str): output string.

function linebreaker

linebreaker(text, width=None, break_pt=None, sep='\n', **kws)

Insert newlines within a string.

Parameters:

  • text (str): string.
  • width (int): insert newline at this interval.
  • sep (string): separator to split the sub-strings.

Returns:

  • s (string): output string.
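
A minimal sketch of width-based line breaking using textwrap from the standard library. Breaking only at whitespace is an assumption of this illustration, and the break_pt parameter is not covered; linebreaker_sketch is a hypothetical name.

```python
import textwrap


def linebreaker_sketch(text: str, width: int, sep: str = "\n") -> str:
    """Hypothetical stand-in: wrap at whitespace so that no line exceeds `width`."""
    return sep.join(textwrap.wrap(text, width=width))


label = linebreaker_sketch("regulation of transcription by RNA polymerase II", width=20)
print(label)
```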


function findall

findall(s, ss, outends=False, outstrs=False, suffixlen=0)

Find the substrings or their locations in a string.

Parameters:

  • s (string): input string.
  • ss (string): substring.
  • outends (bool): output end positions.
  • outstrs (bool): output strings.
  • suffixlen (int): length of the suffix.

Returns:

  • l (list): output list.

function get_marked_substrings

get_marked_substrings(
    s,
    leftmarker='{',
    rightmarker='}',
    leftoff=0,
    rightoff=0
)  list

Get the substrings flanked with markers from a string.

Parameters:

  • s (str): input string.
  • leftmarker (str): marker on the left.
  • rightmarker (str): marker on the right.
  • leftoff (int): offset on the left.
  • rightoff (int): offset on the right.

Returns:

  • l (list): list of substrings.
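
A minimal sketch of extracting marked substrings with a non-greedy regular expression. The leftoff and rightoff offsets are left out, and nested markers are not handled; get_marked_substrings_sketch is a hypothetical stand-in.

```python
import re


def get_marked_substrings_sketch(s: str, leftmarker: str = "{", rightmarker: str = "}") -> list:
    """Hypothetical stand-in: return the text between each pair of markers."""
    pattern = re.escape(leftmarker) + r"(.*?)" + re.escape(rightmarker)
    return re.findall(pattern, s)


print(get_marked_substrings_sketch("data/{sample}/{replicate}.tsv"))  # ['sample', 'replicate']
```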

function mark_substrings

mark_substrings(s, ss, leftmarker='(', rightmarker=')')  str

Mark sub-string/s in a string.

Parameters:

  • s (str): input string.
  • ss (str): substring.
  • leftmarker (str): marker on the left.
  • rightmarker (str): marker on the right.

Returns:

  • s (str): string.

function get_bracket

get_bracket(s, leftmarker='(', righttmarker=')')  str

Get bracketed substrings.

Parameters:

  • s (string): string.
  • leftmarker (str): marker on the left.
  • rightmarker (str): marker on the right.

Returns:

  • s (str): string.

TODOs: 1. Use get_marked_substrings.


function align

align(
    s1: str,
    s2: str,
    prefix: bool = False,
    suffix: bool = False,
    common: bool = True
)  list

Align strings.

Parameters:

  • s1 (str): string #1.
  • s2 (str): string #2.
  • prefix (bool): prefix.
  • suffix (bool): suffix.
  • common (bool): common substring.

Returns:

  • l (list): output list.

Notes:

  1. Code to test: [ get_prefix(source,target,common=False), get_prefix(source,target,common=True), get_suffix(source,target,common=False), get_suffix(source,target,common=True),]

function get_prefix

get_prefix(s1, s2: str = None, common: bool = True, clean: bool = True)  str

Get the prefix of the strings

Parameters:

  • s1 (str|list): 1st string.
  • s2 (str): 2nd string (default:None).
  • common (bool): get the common prefix (default:True).
  • clean (bool): clean the leading and trailing whitespaces (default:True).

Returns:

  • s (str): prefix.

function get_suffix

get_suffix(s1, s2: str = None, common: bool = True, clean: bool = True)  str

Get the suffix of the strings

Parameters:

  • s1 (str|list): 1st string.
  • s2 (str): 2nd string (default:None).
  • common (bool): get the common suffix (default:True).
  • clean (bool): clean the leading and trailing whitespaces (default:True).

Returns:

  • s (str): suffix.
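
A minimal sketch of finding the common prefix and suffix of two strings with os.path.commonprefix, which compares strings character by character. It covers only the common=True case; common_prefix and common_suffix are hypothetical names, not roux functions.

```python
import os


def common_prefix(s1: str, s2: str, clean: bool = True) -> str:
    """Hypothetical stand-in: longest shared leading sub-string."""
    prefix = os.path.commonprefix([s1, s2])
    return prefix.strip() if clean else prefix


def common_suffix(s1: str, s2: str, clean: bool = True) -> str:
    """Hypothetical stand-in: longest shared trailing sub-string, found on reversed strings."""
    suffix = os.path.commonprefix([s1[::-1], s2[::-1]])[::-1]
    return suffix.strip() if clean else suffix


print(common_prefix("gene id gene1", "gene id gene2"))  # gene id gene
print(common_suffix("sample1 count", "sample2 count"))  # count
```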

function get_fix

get_fix(s1: str, s2: str, **kws: dict)  str

Infer common prefix or suffix.

Parameters:

  • s1 (str): 1st string.
  • s2 (str): 2nd string.

Keyword parameters:

  • kws: parameters provided to the get_prefix and get_suffix functions.

Returns:

  • s (str): prefix or suffix.

function removesuffix

removesuffix(s1: str, suffix: str)  str

Remove suffix.

Parameters:

  • s1 (str): input string.
  • suffix (str): suffix.

Returns:

  • s1 (str): string without the suffix.

TODOs: 1. Deprecate in Python >= 3.9; use str.removesuffix() instead.
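
A minimal sketch of suffix removal that works on Python versions without str.removesuffix (which was added in Python 3.9). removesuffix_sketch is a hypothetical name; the guard on an empty suffix avoids slicing with -0, which would return an empty string.

```python
def removesuffix_sketch(s1: str, suffix: str) -> str:
    """Hypothetical stand-in: drop `suffix` from the end of `s1` if it is present."""
    if suffix and s1.endswith(suffix):
        return s1[: -len(suffix)]
    return s1


print(removesuffix_sketch("table.tsv", ".tsv"))  # table
print(removesuffix_sketch("table.tsv", ".csv"))  # table.tsv
```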


function str2dict

str2dict(
    s: str,
    reversible: bool = True,
    sep: str = ';',
    sep_equal: str = '='
)  dict

String to dictionary.

Parameters:

  • s (str): string.
  • sep (str): separator between entries (default:';').
  • sep_equal (str): separator between the keys and the values (default:'=').

Returns:

  • d (dict): dictionary.

References:

  • 1. https://stackoverflow.com/a/186873/3521099

function dict2str

dict2str(
    d1: dict,
    reversible: bool = True,
    sep: str = ';',
    sep_equal: str = '='
)  str

Dictionary to string.

Parameters:

  • d1 (dict): dictionary.
  • sep (str): separator between entries (default:';').
  • sep_equal (str): separator between the keys and the values (default:'=').
  • reversible (bool): use json

Returns:

  • s (str): string.
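
A minimal sketch of the two conversion modes, assuming that reversible=True means a JSON round trip and that reversible=False means a plain key=value format joined by sep. The function names are hypothetical stand-ins. In the plain format all values come back as strings, which is why it is not reversible in general.

```python
import json


def dict2str_sketch(d1: dict, reversible: bool = True, sep: str = ";", sep_equal: str = "=") -> str:
    """Hypothetical stand-in: JSON if reversible, otherwise 'key=value;key=value'."""
    if reversible:
        return json.dumps(d1)
    return sep.join(f"{key}{sep_equal}{value}" for key, value in d1.items())


def str2dict_sketch(s: str, reversible: bool = True, sep: str = ";", sep_equal: str = "=") -> dict:
    """Hypothetical stand-in: inverse of dict2str_sketch."""
    if reversible:
        return json.loads(s)
    return dict(item.split(sep_equal, 1) for item in s.split(sep))


d = {"n": 10, "method": "spearman"}
print(dict2str_sketch(d, reversible=False))                   # n=10;method=spearman
print(str2dict_sketch("n=10;method=spearman", reversible=False))  # {'n': '10', 'method': 'spearman'}
print(str2dict_sketch(dict2str_sketch(d)) == d)               # True
```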

function str2num

str2num(s: str)  float

String to number.

Parameters:

  • s (str): string.

Returns:

  • i (int): number.

function num2str

num2str(
    num: float,
    magnitude: bool = False,
    coff: float = 10000,
    decimals: int = 0
)  str

Number to string.

Parameters:

  • num (int): number.
  • magnitude (bool): use magnitudes (default:False).
  • coff (int): cutoff (default:10000).
  • decimals (int): decimal points (default:0).

Returns:

  • s (str): string.

TODOs 1. ~ if magnitude else not


function encode

encode(data, short: bool = False, method_short: str = 'sha256', **kws)  str

Encode the data as a string.

Parameters:

  • data (str|dict|Series): input data.
  • short (bool): Outputs short string, compatible with paths but non-reversible. Defaults to False.
  • method_short (str): method used for encoding when short=True.

Keyword parameters:

  • kws: parameters provided to encoding function.

Returns:

  • s (string): output string.

function decode

decode(s, out=None, **kws_out)

Decode data from a string.

Parameters:

  • s (string): encoded string.
  • out (str): output format (dict|df).

Keyword parameters:

  • kws_out: parameters provided to dict2df.

Returns:

  • d (dict|DataFrame): output data.
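
A minimal sketch of the trade-off between the reversible and the short encodings. It assumes JSON plus URL-safe base64 for the reversible form and a hashlib digest for the short form; the actual encoding used by roux is not stated here, pandas Series input is not handled, and encode_sketch and decode_sketch are hypothetical names.

```python
import base64
import hashlib
import json


def encode_sketch(data, short: bool = False, method_short: str = "sha256") -> str:
    """Hypothetical stand-in: reversible base64 string, or a short one-way hash."""
    text = json.dumps(data, sort_keys=True)
    if short:
        # One-way: compact and safe in file paths, but cannot be decoded.
        return hashlib.new(method_short, text.encode("utf-8")).hexdigest()
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")


def decode_sketch(s: str):
    """Hypothetical stand-in: inverse of encode_sketch for the reversible form."""
    return json.loads(base64.urlsafe_b64decode(s.encode("ascii")).decode("utf-8"))


params = {"method": "spearman", "n": 10}
encoded = encode_sketch(params)
print(decode_sketch(encoded) == params)       # True
print(len(encode_sketch(params, short=True)))  # 64
```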

function to_formula

to_formula(
    replaces={' ': 'SPACE', '(': 'LEFTBRACKET', ')': 'RIGHTTBRACKET', '.': 'DOT', ',': 'COMMA', '%': 'PERCENT', "'": 'INVCOMMA', '+': 'PLUS', '-': 'MINUS'},
    reverse=False
)  dict

Converts strings to the formula format, compatible with patsy for example.
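
A minimal sketch of how such a mapping can be applied to a column name and reversed afterwards. The mapping is copied from the default shown in the signature above; rename_for_formula is a hypothetical helper, and the round trip assumes that the original name does not already contain any of the tokens.

```python
REPLACES = {
    " ": "SPACE", "(": "LEFTBRACKET", ")": "RIGHTTBRACKET", ".": "DOT", ",": "COMMA",
    "%": "PERCENT", "'": "INVCOMMA", "+": "PLUS", "-": "MINUS",
}


def rename_for_formula(name: str, reverse: bool = False) -> str:
    """Hypothetical helper: swap special characters for tokens, or tokens back to characters."""
    for char, token in REPLACES.items():
        name = name.replace(token, char) if reverse else name.replace(char, token)
    return name


safe = rename_for_formula("growth rate (%)")
print(safe)                                    # growthSPACErateSPACELEFTBRACKETPERCENTRIGHTTBRACKET
print(rename_for_formula(safe, reverse=True))  # growth rate (%)
```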

module roux.workflow.monitor

For workflow monitors.


function plot_workflow_log

plot_workflow_log(dplot: DataFrame)  Axes

Plot workflow log.

Args:

  • dplot (pd.DataFrame): input data (dparam).

Returns:

  • plt.Axes: output.

TODOs: 1. use the statistics tagged as ## stats.

module roux.lib.text

For processing text files.


function get_header

get_header(path: str, comment='#', lineno=None)

Get the header of a file.

Args:

  • path (str): path.
  • comment (str): comment identifier.
  • lineno (int): line number up to which the header is read.

Returns:

  • lines (list): header.

function cat

cat(ps, outp)

Concatenate text files.

Args:

  • ps (list): list of paths.
  • outp (str): output path.

Returns:

  • outp (str): output path.

module roux.vizi.scatter


function plot_scatters_grouped

plot_scatters_grouped(
    data: DataFrame,
    cols_groupby: list,
    aggfunc: dict,
    orient='h',
    **kws_encode
)

Scatters grouped by categories.

Args:

  • data (pd.DataFrame): input data,
  • cols_groupby (list): list of columns to groupby,
  • aggfunc (dict): columns mapped to the aggregation function,

Keyword Args:

  • kws_encode: parameters provided to the encode attribute

Returns: Altair figure

module roux.stat.network

For network related stats.


function get_subgraphs

get_subgraphs(df1: DataFrame, source: str, target: str)  DataFrame

Subgraphs from the edge list.

Args:

  • df1 (pd.DataFrame): input dataframe containing edge-list.
  • source (str): source node.
  • target (str): target node.

Returns:

  • pd.DataFrame: output.
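
A minimal sketch of grouping nodes into subgraphs, assuming that a subgraph means a connected component of the undirected graph defined by the edge list. It uses plain tuples instead of a DataFrame and a depth-first traversal from the standard library; get_subgraphs_sketch is a hypothetical name, and roux may rely on a graph library instead.

```python
from collections import defaultdict


def get_subgraphs_sketch(edges: list) -> dict:
    """Hypothetical stand-in: map each node to the label of its connected component."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)

    node2subgraph = {}
    label = 0
    for start in list(neighbors):
        if start in node2subgraph:
            continue
        # Depth-first traversal from an unlabeled node labels one whole component.
        stack = [start]
        while stack:
            node = stack.pop()
            if node in node2subgraph:
                continue
            node2subgraph[node] = label
            stack.extend(n for n in neighbors[node] if n not in node2subgraph)
        label += 1
    return node2subgraph


print(get_subgraphs_sketch([("a", "b"), ("b", "c"), ("x", "y")]))
# {'a': 0, 'b': 0, 'c': 0, 'x': 1, 'y': 1}
```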

module roux.lib.google

Processing files from google-cloud services.


function get_service

get_service(service_name='drive', access_limit=True, client_config=None)

Creates a google service object.

Parameters:

  • service_name: name of the service, e.g. drive.
  • access_limit: True if access is limited, else False.
  • client_config: custom client config.

Returns:

  • google service object.

Ref: https://developers.google.com/drive/api/v3/about-auth


function list_files_in_folder

list_files_in_folder(service, folderid, filetype=None, fileext=None, test=False)

Lists files in a google drive folder.

Parameters:

  • service: service object, e.g. drive.
  • folderid: folder id from google drive.
  • filetype: specify file type.
  • fileext: specify file extension.
  • test: True if verbose, else False.

Returns:

  • list of files in the folder.


function get_file_id

get_file_id(p)

function download_file

download_file(
    p=None,
    file_id=None,
    service=None,
    outd=None,
    outp=None,
    convert=False,
    force=False,
    test=False
)

Downloads a specified file.

Parameters:

  • service: google service object.
  • file_id: file id as on google drive.
  • filetypes: specify file type.
  • outp: path to the output file.
  • test: True if verbose, else False.

Ref: https://developers.google.com/drive/api/v3/ref-export-formats


function upload_file

upload_file(service, filep, folder_id, test=False)

Uploads a local file onto google drive.

Parameters:

  • service: google service object.
  • filep: path of the file.
  • folder_id: id of the folder on google drive where the file will be uploaded.
  • test: True if verbose, else False.

Returns:

  • id of the uploaded file.


function upload_files

upload_files(service, ps, folder_id, **kws)

function download_drawings

download_drawings(folderid, outd, service=None, test=False)

Download specific files: drawings

TODOs: 1. use download_file


function get_comments

get_comments(
    fileid,
    fields='comments/quotedFileContent/value,comments/content,comments/id',
    service=None
)

Get comments.

fields:

  comments/
    kind:
    id:
    createdTime:
    modifiedTime:
    author:
      kind:
      displayName:
      photoLink:
      me: True
    htmlContent:
    content:
    deleted:
    quotedFileContent:
      mimeType:
      value:
    anchor:
    replies: []


function search

search(query, results=1, service=None, **kws_search)

Google search.

Parameters:

  • query: exact terms.

Returns:

  • dict


function get_search_strings

get_search_strings(text, num=5, test=False)

Google search.

Parameters:

  • text: string.
  • num: number of results.
  • test: True if verbose, else False.

Returns:

  • lines: list


function get_metadata_of_paper

get_metadata_of_paper(
    file_id,
    service_drive,
    service_search,
    metadata=None,
    force=False,
    test=False
)

Get the metadata of a pdf document.


function share

share(
    drive_service,
    content_id,
    share=False,
    unshare=False,
    user_permission=None,
    permissionId='anyoneWithLink'
)

Parameters:

  • user_permission: user_permission = { 'type': 'anyone', 'role': 'reader', 'email':'@' }

Ref: https://developers.google.com/drive/api/v3/manage-sharing


class slides


method create_image

create_image(service, presentation_id, page_id, image_id)

image less than 1.5 Mb


method get_page_ids

get_page_ids(service, presentation_id)

module roux.viz

Global Variables

  • ds
  • theme
  • ax_
  • colors
  • diagram
  • io

module roux.viz.ax_

For setting up subplots.


function set_axes_minimal

set_axes_minimal(ax, xlabel=None, ylabel=None, off_axes_pad=0)  Axes

Set minimal axes labels, at the lower left corner.


function set_axes_arrows

set_axes_arrows(
    ax: Axes,
    length: float = 0.1,
    pad: float = 0.2,
    color: str = 'k',
    head_width: float = 0.03,
    head_length: float = 0.02,
    length_includes_head: bool = True,
    clip_on: bool = False,
    **kws_arrow
)

Set arrows next to the axis labels.

Parameters:

  • ax (plt.Axes): subplot. color=

function set_label

set_label(
    s: str,
    ax: Axes,
    x: float = 0,
    y: float = 0,
    ha: str = 'left',
    va: str = 'top',
    loc=None,
    off_loc=0.01,
    title: bool = False,
    **kws
)  Axes

Set label on a plot.

Args:

  • x (float): x position.
  • y (float): y position.
  • s (str): label.
  • ax (plt.Axes): plt.Axes object.
  • ha (str, optional): horizontal alignment. Defaults to 'left'.
  • va (str, optional): vertical alignment. Defaults to 'top'.
  • loc (int, optional): location of the label. 1:'upper right', 2:'upper left', 3:'lower left', 4:'lower right'. Defaults to None.
  • off_loc (tuple, optional): x and y location offsets. Defaults to 0.01.
  • title (bool, optional): set as title. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.

function set_ylabel

set_ylabel(
    ax: Axes,
    s: str = None,
    x: float = -0.1,
    y: float = 1.02,
    xoff: float = 0,
    yoff: float = 0
)  Axes

Set ylabel horizontal.

Args:

  • ax (plt.Axes): plt.Axes object.
  • s (str, optional): ylabel. Defaults to None.
  • x (float, optional): x position. Defaults to -0.1.
  • y (float, optional): y position. Defaults to 1.02.
  • xoff (float, optional): x offset. Defaults to 0.
  • yoff (float, optional): y offset. Defaults to 0.

Returns:

  • plt.Axes: plt.Axes object.

function get_ax_labels

get_ax_labels(ax: Axes)

function format_labels

format_labels(
    ax,
    axes: list = ['x', 'y'],
    fmt='cap1',
    title_fontsize=15,
    rename_labels=None,
    rotate_ylabel=True,
    y=1.05,
    test=False
)

function rename_ticklabels

rename_ticklabels(
    ax: Axes,
    axis: str,
    rename: dict = None,
    replace: dict = None,
    ignore: bool = False
)  Axes

Rename the ticklabels.

Args:

  • ax (plt.Axes): plt.Axes object.
  • axis (str): axis (x|y).
  • rename (dict, optional): replace strings. Defaults to None.
  • replace (dict, optional): replace sub-strings. Defaults to None.
  • ignore (bool, optional): ignore warnings. Defaults to False.

Raises:

  • ValueError: either rename or replace should be provided.

Returns:

  • plt.Axes: plt.Axes object.
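
The difference between rename (whole labels) and replace (sub-strings) can be sketched on a plain list of label strings. This is only an illustration under the assumption that rename maps exact labels and replace swaps sub-strings; the helper name rename_labels is hypothetical and this is not roux's implementation.

```python
def rename_labels(labels, rename=None, replace=None):
    """Hypothetical helper: relabel tick labels given as plain strings."""
    if rename is None and replace is None:
        # Mirrors the documented ValueError when neither mapping is given.
        raise ValueError("either rename or replace should be provided")
    out = []
    for label in labels:
        if rename is not None:
            # Whole-label substitution; labels without a mapping are kept.
            label = rename.get(label, label)
        if replace is not None:
            # Sub-string substitution, applied in the order of the dict.
            for old, new in replace.items():
                label = label.replace(old, new)
        out.append(label)
    return out


renamed = rename_labels(["ctrl", "treated_a"], rename={"ctrl": "control"})
replaced = rename_labels(["ctrl", "treated_a"], replace={"_": " "})
```

With matplotlib, the input labels would typically come from [t.get_text() for t in ax.get_xticklabels()].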

function get_ticklabel_position

get_ticklabel_position(ax: Axes, axis: str)  Axes

Get positions of the ticklabels.

Args:

  • ax (plt.Axes): plt.Axes object.
  • axis (str): axis (x|y).

Returns:

  • plt.Axes: plt.Axes object.

function set_ticklabels_color

set_ticklabels_color(ax: Axes, ticklabel2color: dict, axis: str = 'y')  Axes

Set colors to ticklabels.

Args:

  • ax (plt.Axes): plt.Axes object.
  • ticklabel2color (dict): colors of the ticklabels.
  • axis (str): axis (x|y).

Returns:

  • plt.Axes: plt.Axes object.

function format_ticklabels

format_ticklabels(
    ax: Axes,
    axes: tuple = ['x', 'y'],
    interval: float = None,
    n: int = None,
    fmt: str = None,
    font: str = None
)  Axes

Format the ticklabels.

Args:

  • ax (plt.Axes): plt.Axes object.
  • axes (tuple, optional): axes. Defaults to ['x','y'].
  • n (int, optional): number of ticks. Defaults to None.
  • fmt (str, optional): format e.g. '.0f'. Defaults to None.
  • font (str, optional): font. Defaults to 'DejaVu Sans Mono'.

Returns:

  • plt.Axes: plt.Axes object.

TODOs: 1. include color_ticklabels


function split_ticklabels

split_ticklabels(
    ax: Axes,
    fmt: str,
    axis='x',
    group_x=-0.45,
    group_y=-0.25,
    group_prefix=None,
    group_suffix=False,
    group_loc='center',
    group_colors=None,
    group_alpha=0.2,
    show_group_line=True,
    group_line_off_x=0.15,
    group_line_off_y=0.1,
    show_group_span=False,
    group_span_kws={},
    sep: str = '-',
    pad_major=6,
    off: float = 0.2,
    test: bool = False,
    **kws
)  Axes

Split ticklabels into major and minor. Two minor ticks are created per major tick.

Args:

  • ax (plt.Axes): plt.Axes object.
  • fmt (str): 'group'-wise or 'pair'-wise splitting of the ticklabels.
  • axis (str): name of the axis: x or y.
  • sep (str, optional): separator within the tick labels. Defaults to '-'.
  • test (bool, optional): test-mode. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.
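
As a rough sketch of the splitting step only (no plotting), the snippet below separates labels of the assumed form 'group-member' at the separator and collects the members per group. The helper name split_labels and the label format are assumptions made for illustration, not roux's code.

```python
def split_labels(labels, sep="-"):
    """Hypothetical helper: group 'group<sep>member' labels by their group part."""
    groups = {}
    for label in labels:
        # partition splits at the first occurrence of the separator only.
        group, _, member = label.partition(sep)
        groups.setdefault(group, []).append(member)
    return groups


groups = split_labels(["a-1", "a-2", "b-1", "b-2"])
```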

function get_axlimsby_data

get_axlimsby_data(
    X: Series,
    Y: Series,
    off: float = 0.2,
    equal: bool = False
)  Axes

Infer axis limits from data.

Args:

  • X (pd.Series): x values.
  • Y (pd.Series): y values.
  • off (float, optional): offsets. Defaults to 0.2.
  • equal (bool, optional): equal limits. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.
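
One way to infer limits from data is to pad the observed range on both sides. The sketch below assumes that off is a fraction of the data range and that equal makes both axes share the widest limits; both are assumptions, and limits_from_data is a hypothetical name rather than the library's function.

```python
def limits_from_data(xs, ys, off=0.2, equal=False):
    """Hypothetical helper: padded (min, max) limits for x and y values."""

    def padded(values):
        lo, hi = min(values), max(values)
        pad = (hi - lo) * off  # assumed: offset is a fraction of the range
        return lo - pad, hi + pad

    xlim, ylim = padded(xs), padded(ys)
    if equal:
        # Use the widest span for both axes.
        lo, hi = min(xlim[0], ylim[0]), max(xlim[1], ylim[1])
        xlim = ylim = (lo, hi)
    return xlim, ylim


xlim, ylim = limits_from_data([0, 10], [2, 4], off=0.5)
xlim_eq, ylim_eq = limits_from_data([0, 10], [2, 4], off=0.5, equal=True)
```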

function get_axlims

get_axlims(ax: Axes)  Axes

Get axis limits.

Args:

  • ax (plt.Axes): plt.Axes object.

Returns:

  • plt.Axes: plt.Axes object.

function set_equallim

set_equallim(
    ax: Axes,
    diagonal: bool = False,
    difference: float = None,
    format_ticks: bool = True,
    **kws_format_ticklabels
)  Axes

Set equal axis limits.

Args:

  • ax (plt.Axes): plt.Axes object.
  • diagonal (bool, optional): show diagonal. Defaults to False.
  • difference (float, optional): difference from . Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function set_axlims

set_axlims(
    ax: Axes,
    off: float,
    axes: list = ['x', 'y'],
    equal=False,
    **kws_set_equallim
)  Axes

Set axis limits.

Args:

  • ax (plt.Axes): plt.Axes object.
  • off (float): offset.
  • axes (list, optional): axis name/s. Defaults to ['x','y'].

Returns:

  • plt.Axes: plt.Axes object.

function set_grids

set_grids(ax: Axes, axis: str = None)  Axes

Show grids based on the shape (aspect ratio) of the plot.

Args:

  • ax (plt.Axes): plt.Axes object.
  • axis (str, optional): axis name. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function format_legends

format_legends(ax: Axes, **kws_legend)  Axes

Format legend text.

Args:

  • ax (plt.Axes): plt.Axes object.

Returns:

  • plt.Axes: plt.Axes object.

function rename_legends

rename_legends(ax: Axes, replaces: dict, **kws_legend)  Axes

Rename legends.

Args:

  • ax (plt.Axes): plt.Axes object.
  • replaces (dict): strings to replace in the legend labels (from->to).

Returns:

  • plt.Axes: plt.Axes object.

function append_legends

append_legends(ax: Axes, labels: list, handles: list, **kws)  Axes

Append to legends.

Args:

  • ax (plt.Axes): plt.Axes object.
  • labels (list): labels.
  • handles (list): handles.

Returns:

  • plt.Axes: plt.Axes object.

function sort_legends

sort_legends(ax: Axes, sort_order: list = None, **kws)  Axes

Sort or filter legends.

Args:

  • ax (plt.Axes): plt.Axes object.
  • sort_order (list, optional): order of legends. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

Notes:

  1. Filter the legends by providing the indices of the legends to keep.
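
The sorting and filtering described above can be sketched on plain lists of handles and labels. This assumes that sort_order holds indices of the entries to keep, in the desired order, and that None means alphabetical order; the helper reorder_legend_entries is hypothetical and strings stand in for the matplotlib handles.

```python
def reorder_legend_entries(handles, labels, sort_order=None):
    """Hypothetical helper: reorder or filter legend entries by index."""
    if sort_order is None:
        # Assumed default: alphabetical order of the labels.
        order = sorted(range(len(labels)), key=lambda i: labels[i])
    else:
        # Indices that are left out are dropped from the legend.
        order = list(sort_order)
    return [handles[i] for i in order], [labels[i] for i in order]


handles = ["h_b", "h_a", "h_c"]  # placeholders for matplotlib artists
labels = ["b", "a", "c"]
sorted_entries = reorder_legend_entries(handles, labels)
kept_entries = reorder_legend_entries(handles, labels, sort_order=[2, 0])
```

With matplotlib, the inputs would come from ax.get_legend_handles_labels() and the result would be passed to ax.legend(handles, labels).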

function drop_duplicate_legend

drop_duplicate_legend(ax, **kws)

function reset_legend_colors

reset_legend_colors(ax)

Reset legend colors.

Args:

  • ax (plt.Axes): plt.Axes object.

Returns:

  • plt.Axes: plt.Axes object.

function set_legends_merged

set_legends_merged(axs, **kws_legend)

Set legends merged across the subplots.

Args:

  • axs (list): list of plt.Axes objects.

Returns:

  • plt.Axes: first plt.Axes object in the list.

function set_legend_custom

set_legend_custom(
    ax: Axes,
    legend2param: dict,
    param: str = 'color',
    lw: float = 1,
    marker: str = 'o',
    markerfacecolor: bool = True,
    size: float = 10,
    color: str = 'k',
    linestyle: str = '',
    title_ha: str = 'center',
    **kws
)  Axes

Set custom legends.

Args:

  • ax (plt.Axes): plt.Axes object.
  • legend2param (dict): legend name to parameter to change e.g. name of the color.
  • param (str, optional): parameter to change. Defaults to 'color'.
  • lw (float, optional): line width. Defaults to 1.
  • marker (str, optional): marker type. Defaults to 'o'.
  • markerfacecolor (bool, optional): marker face color. Defaults to True.
  • size (float, optional): size of the markers. Defaults to 10.
  • color (str, optional): color of the markers. Defaults to 'k'.
  • linestyle (str, optional): line style. Defaults to ''.
  • title_ha (str, optional): title horizontal alignment. Defaults to 'center'.
  • frameon (bool, optional): show frame. Defaults to True.

Returns:

  • plt.Axes: plt.Axes object.

TODOs: 1. different number of points for each entry

    import matplotlib.pyplot as plt
    from matplotlib.legend_handler import HandlerTuple
    l1, = plt.plot(-1, -1, lw=0, marker="o", markerfacecolor='k', markeredgecolor='k')
    l2, = plt.plot(-0.5, -1, lw=0, marker="o", markerfacecolor="none", markeredgecolor='k')
    plt.legend([(l1,), (l1, l2)], ["test 1", "test 2"],
               handler_map={tuple: HandlerTuple(2)})

References:

  • https://matplotlib.org/stable/api/markers_api.html
  • http://www.cis.jhu.edu/~shanest/mpt/js/mathjax/mathjax-dev/fonts/Tables/STIX/STIX/All/All.html

function get_line_cap_length

get_line_cap_length(ax: Axes, linewidth: float)  Axes

Get the line cap length.

Args:

  • ax (plt.Axes): plt.Axes object
  • linewidth (float): width of the line.

Returns:

  • plt.Axes: plt.Axes object

function set_colorbar

set_colorbar(
    fig: object,
    ax: Axes,
    ax_pc: Axes,
    label: str,
    bbox_to_anchor: tuple = (0.05, 0.5, 1, 0.45),
    orientation: str = 'vertical'
)

Set colorbar.

Args:

  • fig (object): figure object.
  • ax (plt.Axes): plt.Axes object.
  • ax_pc (plt.Axes): plt.Axes object for the colorbar.
  • label (str): label
  • bbox_to_anchor (tuple, optional): location. Defaults to (0.05, 0.5, 1, 0.45).
  • orientation (str, optional): orientation. Defaults to "vertical".

Returns: figure object.


function set_colorbar_label

set_colorbar_label(ax: Axes, label: str)  Axes

Find colorbar and set label for it.

Args:

  • ax (plt.Axes): plt.Axes object.
  • label (str): label.

Returns:

  • plt.Axes: plt.Axes object.

function format_ax

format_ax(
    ax=None,
    kws_fmt_ticklabels={},
    kws_fmt_labels={},
    kws_legend={},
    rotate_ylabel=False
)

module roux.viz.io

For input/output of plots.


function to_plotp

to_plotp(
    ax: Axes = None,
    prefix: str = 'plot/plot_',
    suffix: str = '',
    fmts: list = ['png']
)  str

Infer output path for a plot.

Args:

  • ax (plt.Axes): plt.Axes object.
  • prefix (str, optional): prefix with directory path for the plot. Defaults to 'plot/plot_'.
  • suffix (str, optional): suffix of the filename. Defaults to ''.
  • fmts (list, optional): formats of the images. Defaults to ['png'].

Returns:

  • str: output path for the plot.

function savefig

savefig(
    plotp: str,
    tight_layout: bool = True,
    bbox_inches: list = None,
    fmts: list = ['png'],
    savepdf: bool = False,
    normalise_path: bool = True,
    replaces_plotp: dict = None,
    dpi: int = 500,
    force: bool = True,
    kws_replace_many: dict = {},
    kws_savefig: dict = {},
    verbose: bool = False,
    **kws
)  str

Wrapper around plt.savefig.

Args:

  • plotp (str): output path or plt.Axes object.
  • tight_layout (bool, optional): tight_layout. Defaults to True.
  • bbox_inches (list, optional): bbox_inches. Defaults to None.
  • savepdf (bool, optional): savepdf. Defaults to False.
  • normalise_path (bool, optional): normalise_path. Defaults to True.
  • replaces_plotp (dict, optional): replaces_plotp. Defaults to None.
  • dpi (int, optional): dpi. Defaults to 500.
  • force (bool, optional): overwrite output. Defaults to True.
  • kws_replace_many (dict, optional): parameters provided to the replace_many function. Defaults to {}.

Keyword Args:

  • kws: parameters provided to to_plotp function.
  • kws_savefig: parameters provided to to_savefig function.
  • kws_replace_many: parameters provided to replace_many function.

Returns:

  • str: output path.

function savelegend

savelegend(
    plotp: str,
    legend: object,
    expand: list = [-5, -5, 5, 5],
    **kws_savefig
)  str

Save only the legend of the plot/figure.

Args:

  • plotp (str): output path.
  • legend (object): legend object.
  • expand (list, optional): expand. Defaults to [-5,-5,5,5].

Returns:

  • str: output path.

References:

  • 1. https://stackoverflow.com/a/47749903/3521099

function update_kws_plot

update_kws_plot(kws_plot: dict, kws_plotp: dict, test: bool = False)  dict

Update the input parameters.

Args:

  • kws_plot (dict): input parameters.
  • kws_plotp (dict): saved parameters.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • dict: updated parameters.

function get_plot_inputs

get_plot_inputs(
    plotp: str,
    df1: DataFrame = None,
    kws_plot: dict = {},
    outd: str = None
)  tuple

Get plot inputs.

Args:

  • plotp (str): path of the plot.
  • df1 (pd.DataFrame): data for the plot.
  • kws_plot (dict): parameters of the plot.
  • outd (str): output directory.

Returns:

  • tuple: (path,dataframe,dict)

function log_code

log_code()

Log the code.


function get_lines

get_lines(
    logp: str = 'log_notebook.log',
    sep: str = 'begin_plot()',
    test: bool = False
)  list

Get lines from the log.

Args:

  • logp (str, optional): path to the log file. Defaults to 'log_notebook.log'.
  • sep (str, optional): label marking the start of code of the plot. Defaults to 'begin_plot()'.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • list: lines of code.
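
A minimal sketch of the idea, assuming that the code of the plot is everything that follows the last line containing the marker in the log. The helper name lines_after_marker and the log contents are made up for illustration.

```python
import os
import tempfile


def lines_after_marker(logp, sep="begin_plot()"):
    """Hypothetical helper: lines that follow the last marker line in a log."""
    with open(logp) as fh:
        lines = [line.rstrip("\n") for line in fh]
    # Positions of all the lines that contain the marker.
    starts = [i for i, line in enumerate(lines) if sep in line]
    if not starts:
        return []
    return lines[starts[-1] + 1:]


with tempfile.TemporaryDirectory() as tmpd:
    logp = os.path.join(tmpd, "log_notebook.log")
    with open(logp, "w") as fh:
        fh.write("import x\nbegin_plot()\nold()\nbegin_plot()\nax = plot()\nsave(ax)\n")
    code_lines = lines_after_marker(logp)
    no_marker = lines_after_marker(logp, sep="not_in_log")
```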

function to_script

to_script(
    srcp: str,
    plotp: str,
    defn: str = 'plot_',
    s4: str = '    ',
    test: bool = False,
    validate: bool = False,
    **kws
)  str

Save the script with the code for the plot.

Args:

  • srcp (str): path of the script.
  • plotp (str): path of the plot.
  • defn (str, optional): prefix of the function. Defaults to "plot_".
  • s4 (str, optional): indentation of four spaces. Defaults to '    '.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • str: path of the script.

TODOs:

  1. Make compatible with names of the input dataframes other than df1.
     1. Get the variable name of the dataframe.

        def get_df_name(df):
            name = [x for x in globals() if globals()[x] is df and not x.startswith('-')][0]
            return name

     2. Replace df1 with the variable name of the dataframe.

function to_plot

to_plot(
    plotp: str,
    data: DataFrame = None,
    df1: DataFrame = None,
    kws_plot: dict = {},
    logp: str = 'log_notebook.log',
    sep: str = 'begin_plot()',
    validate: bool = False,
    show_path: bool = False,
    show_path_offy: float = -0.2,
    force: bool = True,
    test: bool = False,
    quiet: bool = True,
    **kws
)  str

Save a plot.

Args:

  • plotp (str): output path.
  • df1 (pd.DataFrame, optional): dataframe with plotting data. Defaults to None.
  • data (pd.DataFrame, optional): dataframe with plotting data. Defaults to None.
  • kws_plot (dict, optional): parameters for plotting. Defaults to dict().
  • logp (str, optional): path to the log. Defaults to 'log_notebook.log'.
  • sep (str, optional): separator marking the start of the plotting code in jupyter notebook. Defaults to 'begin_plot()'.
  • validate (bool, optional): validate the "readability" using read_plot function. Defaults to False.
  • show_path (bool, optional): show path on the plot. Defaults to False.
  • show_path_offy (float, optional): y-offset for the path label. Defaults to -0.2.
  • force (bool, optional): overwrite output. Defaults to True.
  • test (bool, optional): test mode. Defaults to False.
  • quiet (bool, optional): quiet mode. Defaults to True.

Returns:

  • str: output path.

Notes:

Requirement:

  1. Start logging in the jupyter notebook.

     from IPython import get_ipython
     log_notebookp = f'log_notebook.log'
     open(log_notebookp, 'w').close()
     get_ipython().run_line_magic('logstart', '{log_notebookp} over')


function read_plot

read_plot(p: str, safe: bool = False, test: bool = False, **kws)  Axes

Generate the plot from data, parameters and a script.

Args:

  • p (str): path of the plot saved using to_plot function.
  • safe (bool, optional): read as an image. Defaults to False.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.

function to_concat

to_concat(
    ps: list,
    how: str = 'h',
    use_imagemagick: bool = False,
    use_conda_env: bool = False,
    test: bool = False,
    **kws_outp
)  str

Concat images.

Args:

  • ps (list): list of paths.
  • how (str, optional): horizontal ('h') or vertical ('v'). Defaults to 'h'.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • str: path of the output.

function to_montage

to_montage(
    ps: list,
    layout: str,
    source_path: str = None,
    env_name: str = None,
    hspace: float = 0,
    vspace: float = 0,
    output_path: str = None,
    test: bool = False,
    **kws_outp
)  str

To montage.

Args:

  • ps (list): list of paths.
  • layout (str): layout of the images.
  • hspace (float, optional): horizontal space. Defaults to 0.
  • vspace (float, optional): vertical space. Defaults to 0.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • str: path of the output.

function to_gif

to_gif(
    ps: list,
    outp: str,
    duration: int = 200,
    loop: int = 0,
    optimize: bool = True
)  str

Convert to GIF.

Args:

  • ps (list): list of paths.
  • outp (str): output path.
  • duration (int, optional): duration. Defaults to 200.
  • loop (int, optional): loop or not. Defaults to 0.
  • optimize (bool, optional): optimize the size. Defaults to True.

Returns:

  • str: output path.

References:

  • 1. https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#gif
  • 2. https://stackoverflow.com/a/57751793/3521099

function to_data

to_data(path: str)  str

Convert to base64 string.

Args:

  • path (str): path of the input.

Returns: base64 string.
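
For reference, encoding a file as a base64 string needs only the standard library. The sketch below is a generic illustration of the technique, not roux's implementation; the helper name file_to_base64 and the file contents are hypothetical.

```python
import base64
import os
import tempfile


def file_to_base64(path):
    """Hypothetical helper: read a file and return its base64 text."""
    with open(path, "rb") as fh:
        # b64encode returns bytes; decode to get a regular string.
        return base64.b64encode(fh.read()).decode("ascii")


with tempfile.TemporaryDirectory() as tmpd:
    path = os.path.join(tmpd, "plot.png")
    with open(path, "wb") as fh:
        fh.write(b"\x89PNG placeholder bytes")
    encoded = file_to_base64(path)
```

Such a string is what a data URI of an image embedded in HTML carries after the 'base64,' prefix.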


function to_convert

to_convert(filep: str, outd: str = None, fmt: str = 'JPEG')  str

Convert format of image using PIL.

Args:

  • filep (str): input path.
  • outd (str, optional): output directory. Defaults to None.
  • fmt (str, optional): format of the output. Defaults to "JPEG".

Returns:

  • str: output path.

function to_raster

to_raster(
    plotp: str,
    dpi: int = 500,
    alpha: bool = False,
    trim: bool = False,
    force: bool = False,
    test: bool = False
)  str

Convert a plot to a raster image.

Args:

  • plotp (str): input path.
  • dpi (int, optional): DPI. Defaults to 500.
  • alpha (bool, optional): transparency. Defaults to False.
  • trim (bool, optional): trim margins. Defaults to False.
  • force (bool, optional): overwrite output. Defaults to False.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • str: output path.

Notes:

  1. Runs a bash command: convert -density 300 -trim.

function to_rasters

to_rasters(plotd, ext='svg')

Convert many images to raster. Uses inkscape.

Args:

  • plotd (str): directory.
  • ext (str, optional): extension of the output. Defaults to 'svg'.

module roux.stat.corr

For correlation stats.


function resampled

resampled(
    x: np.array,
    y: np.array,
    method_fun: object,
    method_kws: dict = {},
    ci_type: str = 'max',
    cv: int = 5,
    random_state: int = 1,
    verbose: bool = False
)  tuple

Get correlations after resampling.

Args:

  • x (np.array): x vector.
  • y (np.array): y vector.
  • method_fun (object): method function.
  • ci_type (str, optional): confidence interval type. Defaults to 'max'.
  • cv (int, optional): number of resamples. Defaults to 5.
  • random_state (int, optional): random state. Defaults to 1.
  • verbose (bool): verbose.

Returns:

  • dict: results containing mean correlation coefficient, CI and CI type.
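
The resampling idea can be sketched with the standard library alone: compute the statistic on several random subsamples and summarise the spread. The subsample fraction frac, the reading of 'max' as the largest deviation from the mean, and the names pearson and resampled_corr are all assumptions made for this illustration; the actual function may resample differently.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def resampled_corr(x, y, method_fun=pearson, cv=5, random_state=1, frac=0.8):
    """Hypothetical helper: correlation averaged over random subsamples."""
    rng = random.Random(random_state)  # fixed seed for reproducibility
    n = len(x)
    k = max(2, int(n * frac))  # assumed: each resample keeps 80% of the points
    rs = []
    for _ in range(cv):
        idx = rng.sample(range(n), k)  # subsample without replacement
        rs.append(method_fun([x[i] for i in idx], [y[i] for i in idx]))
    r_mean = mean(rs)
    # Assumed meaning of ci_type='max': largest deviation from the mean.
    ci = max(abs(r - r_mean) for r in rs)
    return {"rr": r_mean, "ci": ci, "ci_type": "max"}


x = list(range(20))
y = [2 * v + 1 for v in x]  # perfectly linear relationship
res = resampled_corr(x, y)
```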

function get_corr

get_corr(
    x: str,
    y: str,
    method: str,
    df: DataFrame = None,
    method_kws: dict = {},
    pval: bool = True,
    preprocess: bool = True,
    n_min=10,
    preprocess_kws: dict = {},
    resample: bool = False,
    cv=5,
    resample_kws: dict = {},
    verbose: bool = False,
    test: bool = False
)  dict

Correlation between vectors. A unifying wrapper around scipy's functions to calculate correlations and distances. Allows application of resampling on those functions.

Usage: 1. Linear table with paired values. For a matrix, use pd.DataFrame.corr instead.

Args:

  • x (str): x column name or a vector.
  • y (str): y column name or a vector.
  • method (str): method name.
  • df (pd.DataFrame): input table.
  • pval (bool): calculate p-value.
  • resample (bool, optional): resampling. Defaults to False.
  • preprocess (bool): preprocess the input.
  • preprocess_kws (dict): parameters provided to the pre-processing function i.e. _pre.
  • resample_kws (dict): parameters provided to the resampling function i.e. resample.
  • verbose (bool): verbose.

Returns:

  • res (dict): a dictionary containing results.

Notes:

The res dictionary contains the following values:

  • method: method name.
  • r: correlation coefficient or distance.
  • p: p-value of the correlation.
  • n: sample size.
  • rr: resampled average 'r'.
  • ci: CI.
  • ci_type: CI type.


function get_corrs

get_corrs(
    data: DataFrame,
    method: str,
    cols: list = None,
    cols_with: list = None,
    coff_inflation_min: float = None,
    get_pairs_kws={},
    fast: bool = False,
    test: bool = False,
    verbose: bool = False,
    **kws_get_corr
)  DataFrame

Correlate many columns of a dataframe.

Parameters:

  • data (DataFrame): input dataframe.
  • method (str): method of correlation, spearman or pearson.
  • cols (list): columns.
  • cols_with (list): columns to correlate with i.e. variable2.
  • fast (bool): use parallel-processing if True.

Keyword arguments:

  • kws_get_corr: parameters provided to get_corr function.

Returns:

  • DataFrame: output dataframe.

Notes:

In the fast mode (fast=True), to set the number of processes, run the following before executing the get_corrs command:

    from pandarallel import pandarallel
    pandarallel.initialize(nb_workers={},progress_bar=True,use_memory_fs=False)


function check_collinearity

check_collinearity(
    df1: DataFrame,
    threshold: float = 0.7,
    colvalue: str = 'r',
    cols_variable: list = ['variable1', 'variable2'],
    coff_pval: float = 0.05,
    method: str = 'spearman',
    coff_inflation_min: int = 50
)  Series

Check collinearity.

Args:

  • df1 (DataFrame): input dataframe.
  • threshold (float): minimum threshold for the collinearity.

Returns:

  • DataFrame: output dataframe with minimum correlation among correlated subnetwork of columns.

function pairwise_chi2

pairwise_chi2(df1: DataFrame, cols_values: list)  DataFrame

Pairwise chi2 test.

Args:

  • df1 (DataFrame): pd.DataFrame
  • cols_values (list): list of columns.

Returns:

  • DataFrame: output dataframe.

TODOs: 0. use lib.set.get_pairs to get the combinations.

module roux.viz.line

For line plots.


function plot_range

plot_range(
    df00: DataFrame,
    colvalue: str,
    colindex: str,
    k: str,
    headsize: int = 15,
    headcolor: str = 'lightgray',
    ax: Axes = None,
    **kws_area
)  Axes

Plot range/intervals e.g. genome coordinates as lines.

Args:

  • df00 (pd.DataFrame): input data.
  • colvalue (str): column with values.
  • colindex (str): column with ids.
  • k (str): subset name.
  • headsize (int, optional): margin at top. Defaults to 15.
  • headcolor (str, optional): color of the margin. Defaults to 'lightgray'.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword args:

  • kws_area: keyword parameters provided to area function.

Returns:

  • plt.Axes: plt.Axes object.

function plot_bezier

plot_bezier(
    pt1,
    pt2,
    pt1_guide=None,
    pt2_guide=None,
    direction='h',
    off_guide=0.25,
    ax=None,
    test=False,
    **kws_line
)

function plot_kinetics

plot_kinetics(
    df1: DataFrame,
    x: str,
    y: str,
    hue: str,
    cmap: str = 'Reds_r',
    ax: Axes = None,
    test: bool = False,
    kws_legend: dict = {},
    **kws_set
)  Axes

Plot time-dependent kinetic data.

Args:

  • df1 (pd.DataFrame): input data.
  • x (str): x column.
  • y (str): y column.
  • hue (str): hue column.
  • cmap (str, optional): colormap. Defaults to 'Reds_r'.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • test (bool, optional): test mode. Defaults to False.
  • kws_legend (dict, optional): legend parameters. Defaults to {}.

Returns:

  • plt.Axes: plt.Axes object.

function plot_steps

plot_steps(
    df1: DataFrame,
    col_step_name: str,
    col_step_size: str,
    ax: Axes = None,
    test: bool = False
)  Axes

Plot step-wise changes in numbers, e.g. for a filtering process.

Args:

  • df1 (pd.DataFrame): input data.
  • col_step_name (str): column containing the names of the steps.
  • col_step_size (str): column containing the numbers.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.
  • test (bool, optional): test mode. Defaults to False.

Returns:

  • plt.Axes: plt.Axes object.

module roux.lib.df

For processing individual pandas DataFrames/Series. Mainly used in piped operations.


function get_name

get_name(df1: DataFrame, cols: list = None, coff: float = 2, out=None)

Gets the name of the dataframe.

Especially useful within groupby+pandarallel context.

Parameters:

  • df1 (DataFrame): input dataframe.
  • cols (list): list of groupby columns.
  • coff (int): cutoff of unique values to infer the name.
  • out (str): format of the output (list|not).

Returns:

  • name (tuple|str|list): name of the dataframe.

function log_name

log_name(df1: DataFrame, **kws_get_name)

function get_groupby_columns

get_groupby_columns(df_)

Get the columns supplied to groupby.

Parameters:

  • df_ (DataFrame): input dataframe.

Returns:

  • columns (list): list of columns.

function get_constants

get_constants(df1)

Get the columns with a single unique value.

Parameters:

  • df1 (DataFrame): input dataframe.

Returns:

  • columns (list): list of columns.
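
The check for constant columns can be sketched in plain Python, with a list of row dictionaries standing in for the DataFrame. The helper name constant_columns and the example rows are hypothetical; this is not the library's code.

```python
def constant_columns(rows):
    """Hypothetical helper: names of the columns holding one unique value."""
    columns = list(rows[0].keys()) if rows else []
    # A column is constant if the set of its values has exactly one member.
    return [c for c in columns if len({row[c] for row in rows}) == 1]


rows = [
    {"id": 1, "species": "yeast", "batch": "a"},
    {"id": 2, "species": "yeast", "batch": "b"},
]
constants = constant_columns(rows)
```

With pandas, df.nunique() gives the per-column counts that such a check would rely on.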

function drop_unnamedcol

drop_unnamedcol(df)

Deletes the columns with "Unnamed" prefix.

Parameters:

  • df (DataFrame): input dataframe.

Returns:

  • df (DataFrame): output dataframe.

function drop_levelcol

drop_levelcol(df)

Deletes the potentially temporary columns whose names have the "level" prefix.

Parameters:

  • df (DataFrame): input dataframe.

Returns:

  • df (DataFrame): output dataframe.

function drop_constants

drop_constants(df)

Deletes columns with a single unique value.

Parameters:

  • df (DataFrame): input dataframe.

Returns:

  • df (DataFrame): output dataframe.

function dropby_patterns

dropby_patterns(
    df1,
    patterns=None,
    strict=False,
    test=False,
    verbose=True,
    errors='raise'
)

Deletes columns containing substrings i.e. patterns.

Parameters:

  • df1 (DataFrame): input dataframe.
  • patterns (list): list of substrings.
  • test (bool): verbose.

Returns:

  • df1 (DataFrame): output dataframe.

function flatten_columns

flatten_columns(df: DataFrame, sep: str = ' ', **kws)  DataFrame

Multi-index columns to single-level.

Parameters:

  • df (DataFrame): input dataframe.
  • sep (str): separator within the joined tuples (' ').

Returns:

  • df (DataFrame): output dataframe.

Keyword Arguments:

  • kws (dict): parameters provided to coltuples2str function.
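
Flattening amounts to joining the levels of each column tuple with the separator. The sketch below shows that step on a plain list of column names; the helper name flatten is hypothetical and single-level names are assumed to pass through unchanged.

```python
def flatten(columns, sep=" "):
    """Hypothetical helper: join multi-level column tuples into strings."""
    return [
        # Tuples are joined level by level; plain names are left untouched.
        sep.join(str(level) for level in col) if isinstance(col, tuple) else col
        for col in columns
    ]


flat = flatten([("value", "mean"), ("value", "std"), "id"])
flat_underscore = flatten([("value", "mean")], sep="_")
```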

function lower_columns

lower_columns(df)

Column names of the dataframe to lower-case letters.

Parameters:

  • df (DataFrame): input dataframe.

Returns:

  • df (DataFrame): output dataframe.

function renameby_replace

renameby_replace(
    df: DataFrame,
    replaces: dict,
    ignore: bool = True,
    **kws
)  DataFrame

Rename columns by replacing sub-strings.

Parameters:

  • df (DataFrame): input dataframe.
  • replaces (dict|list): from->to format or list containing substrings to remove.
  • ignore (bool): if True, do not validate that the replacements were successful.

Returns:

  • df (DataFrame): output dataframe.

Keyword Arguments:

  • kws (dict): parameters provided to replacemany function.

function clean_columns

clean_columns(df: DataFrame)  DataFrame

Standardise columns.

Steps: 1. Strip flanking white-spaces. 2. Lower-case letters.

Parameters:

  • df (DataFrame): input dataframe.

Returns:

  • df (DataFrame): output dataframe.

function clean

clean(
    df: DataFrame,
    cols: list = [],
    drop_constants: bool = False,
    drop_unnamed: bool = True,
    verb: bool = False
)  DataFrame

Deletes potentially temporary columns.

Steps: 1. Strip flanking white-spaces. 2. Lower-case letters.

Parameters:

  • df (DataFrame): input dataframe.
  • drop_constants (bool): whether to delete the columns with a single unique value.
  • drop_unnamed (bool): whether to delete the columns with 'Unnamed' prefix.
  • verb (bool): verbose.

Returns:

  • df (DataFrame): output dataframe.

function compress

compress(df1: DataFrame, coff_categories: int = None, verbose: bool = True)

Compress the dataframe by converting columns containing strings/objects to categorical.

Parameters:

  • df1 (DataFrame): input dataframe.
  • coff_categories (int): if the number of unique values is less than the cutoff, the column is converted to categorical.
  • verbose (bool): verbose.

Returns:

  • df1 (DataFrame): output dataframe.

function clean_compress

clean_compress(df: DataFrame, kws_compress: dict = {}, **kws_clean)

Clean and compress the dataframe.

Parameters:

  • df (DataFrame): input dataframe.
  • kws_compress (dict): keyword arguments for the compress function.
  • test (bool): verbose.

Keyword Arguments:

  • kws_clean (dict): parameters provided to clean function.

Returns:

  • df1 (DataFrame): output dataframe.

See Also: clean, compress


function check_na

check_na(df, subset=None, out=True, perc=False, log=True)

Number of missing values in columns.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • out (bool): whether to return the output; not returning it can be useful in chained operations.

Returns:

  • ds (Series): output stats.
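
Counting missing values per column can be illustrated without pandas, using row dictionaries with None as the missing value. The helper name count_missing and the perc behaviour (percentages of the rows) are assumptions for this sketch only.

```python
def count_missing(rows, subset=None, perc=False):
    """Hypothetical helper: number (or percentage) of None values per column."""
    columns = subset if subset is not None else list(rows[0].keys())
    counts = {c: sum(row.get(c) is None for row in rows) for c in columns}
    if perc:
        # Assumed: percentages are relative to the number of rows.
        return {c: 100 * n / len(rows) for c, n in counts.items()}
    return counts


rows = [
    {"a": 1, "b": None},
    {"a": None, "b": None},
    {"a": 3, "b": 2},
    {"a": 4, "b": 1},
]
counts = count_missing(rows)
percentages = count_missing(rows, subset=["b"], perc=True)
```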

function validate_no_na

validate_no_na(df, subset=None)

Validate no missing values in columns.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • perc (bool): output percentages.

Returns:

  • ds (Series): output stats.

function assert_no_na

assert_no_na(df, subset=None)

Assert that there are no missing values in the columns.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • perc (bool): output percentages.

Returns:

  • ds (Series): output stats.

function to_str

to_str(data, log=False)

function check_nunique

check_nunique(
    df: DataFrame,
    subset: list = None,
    groupby: str = None,
    perc: bool = False,
    auto=False,
    out=True,
    log=True
)  Series

Number/percentage of unique values in columns.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • perc (bool): output percentages.

Returns:

  • ds (Series): output stats.

function check_inflation

check_inflation(df1, subset=None)

Occurrences of values in columns.

Parameters:

  • df1 (DataFrame): input dataframe.
  • subset (list): list of columns.

Returns:

  • ds (Series): output stats.

function check_dups

check_dups(df, subset=None, perc=False, out=True)

Check duplicates.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • perc (bool): output percentages.

Returns:

  • ds (Series): output stats.

function check_duplicated

check_duplicated(df, **kws)

Check duplicates (alias of check_dups)


function validate_no_dups

validate_no_dups(df, subset=None, log: bool = True)

Validate that there are no duplicates.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.

function validate_no_duplicates

validate_no_duplicates(df, subset=None, **kws)

Validate that there are no duplicates (alias of validate_no_dups).


function assert_no_dups

assert_no_dups(df, subset=None)

Assert that there are no duplicates.


function validate_dense

validate_dense(
    df01: DataFrame,
    subset: list = None,
    duplicates: bool = True,
    na: bool = True,
    message=None
)  DataFrame

Validate no missing values and no duplicates in the dataframe.

Parameters:

  • df01 (DataFrame): input dataframe.
  • subset (list): list of columns.
  • duplicates (bool): whether to check duplicates.
  • na (bool): whether to check na.
  • message (str): error message

function assert_dense

assert_dense(
    df01: DataFrame,
    subset: list = None,
    duplicates: bool = True,
    na: bool = True,
    message=None
) -> DataFrame

Alias of validate_dense.

Notes:

to be deprecated in future releases.


function assert_len

assert_len(df: DataFrame, count: int) -> DataFrame

Validate length in piped operations.

Example: df.rd.assert_len(10)


function assert_nunique

assert_nunique(df: DataFrame, col: str, count: int) -> DataFrame

Validate unique counts in piped operations.

Example: df.rd.assert_nunique('id', 10)


function classify_mappings

classify_mappings(df1: DataFrame, subset, clean: bool = False) -> DataFrame

Classify mappings between items in two columns.

Parameters:

  • df1 (DataFrame): input dataframe.
  • subset (list): list of the two columns.
  • clean (bool): drop columns with the counts.

Returns:

  • (pd.DataFrame): output.
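To make the mapping classes concrete, here is a plain-Python sketch that labels each pair of items as 1:1, 1:m, m:1 or m:m. It is not roux's implementation; the function `classify_pairs`, the label convention (left side counts items in the first column, right side counts items in the second) and the sample pairs are assumptions.

```python
from collections import defaultdict

# Hypothetical helper: label each (a, b) pair by the type of mapping it belongs to.
def classify_pairs(pairs):
    a_to_b, b_to_a = defaultdict(set), defaultdict(set)
    for a, b in pairs:
        a_to_b[a].add(b)
        b_to_a[b].add(a)
    labels = {}
    for a, b in pairs:
        left = "1" if len(b_to_a[b]) == 1 else "m"   # how many a's map to this b
        right = "1" if len(a_to_b[a]) == 1 else "m"  # how many b's this a maps to
        labels[(a, b)] = f"{left}:{right}"
    return labels

pairs = [
    ("g1", "p1"),                 # one gene, one protein
    ("g2", "p2"), ("g2", "p3"),   # one gene, many proteins
    ("g3", "p4"), ("g4", "p4"),   # many genes, one protein
    ("g5", "p5"), ("g5", "p6"), ("g6", "p5"),
]
labels = classify_pairs(pairs)
```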

function check_mappings

check_mappings(df: DataFrame, subset: list = None, out=True) -> DataFrame

Mapping between items in two columns.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.
  • out (str): format of the output.

Returns:

  • ds (Series): output stats.

function assert_1_1_mappings

assert_1_1_mappings(df: DataFrame, subset: list = None) -> DataFrame

Validate that the mapping between items in two columns is 1:1.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): list of columns.

function get_mappings

get_mappings(
    df1: DataFrame,
    subset=None,
    keep='all',
    clean=False,
    cols=None
) -> DataFrame

Classify the mapping between items in two columns.

Parameters:

  • df1 (DataFrame): input dataframe.
  • subset (list): list of columns.
  • keep (str): type of mapping (1:1|1:m|m:1).
  • clean (bool): whether to remove temporary columns.
  • cols (list): alias of subset.

Returns:

  • df (DataFrame): output dataframe.

function to_map_binary

to_map_binary(df: DataFrame, colgroupby=None, colvalue=None) -> DataFrame

Convert linear mappings to a binary map

Parameters:

  • df (DataFrame): input dataframe.
  • colgroupby (str): name of the column for groupby.
  • colvalue (str): name of the column containing values.

Returns:

  • df1 (DataFrame): output dataframe.
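As an illustration of converting a linear (long-format) mapping to a binary map, the sketch below uses plain Python. The helper `to_binary_map`, the orientation (values as rows, groups as columns) and the sample pairs are assumptions, not roux's actual output format.

```python
# Hypothetical helper: (group, value) pairs -> {value: {group: bool}}.
def to_binary_map(pairs):
    groups = sorted({g for g, _ in pairs})
    values = sorted({v for _, v in pairs})
    present = set(pairs)
    # True where the value is a member of the group, False otherwise.
    return {v: {g: (g, v) in present for g in groups} for v in values}

pairs = [("set1", "a"), ("set1", "b"), ("set2", "b"), ("set2", "c")]
binary_map = to_binary_map(pairs)
```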

function check_intersections

check_intersections(
    df: DataFrame,
    colindex=None,
    colgroupby=None,
    plot=False,
    **kws_plot
) -> DataFrame

Check intersections. The linear dataframe is converted to a binary map and then to a series using groupby.

Parameters:

  • df (DataFrame): input dataframe.
  • colindex (str): name of the index column.
  • colgroupby (str): name of the groupby column.
  • plot (bool): plot or not.

Returns:

  • ds1 (Series): output Series.

Keyword Arguments:

  • kws_plot (dict): parameters provided to the plotting function.
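The idea of counting intersections can be sketched without pandas: record the groups each item belongs to, then count how many items share each membership pattern. The helper `count_intersections` and the sample data are hypothetical, and the output format is an assumption rather than what roux returns.

```python
from collections import Counter, defaultdict

# Hypothetical helper: (item, group) pairs -> number of items per combination of groups.
def count_intersections(pairs):
    groups_of = defaultdict(set)
    for item, group in pairs:
        groups_of[item].add(group)
    return Counter(frozenset(groups) for groups in groups_of.values())

pairs = [
    ("a", "set1"),
    ("b", "set1"), ("b", "set2"),
    ("c", "set2"),
    ("d", "set2"),
]
intersections = count_intersections(pairs)
```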

function get_totals

get_totals(ds1)

Get totals from the output of check_intersections.

Parameters:

  • ds1 (Series): input Series.

Returns:

  • d (dict): output dictionary.

function filter_rows

filter_rows(
    df,
    d,
    sign='==',
    logic='and',
    drop_constants=False,
    test=False,
    verbose=True
)

Filter rows using a dictionary.

Parameters:

  • df (DataFrame): input dataframe.
  • d (dict): dictionary.
  • sign (str): condition within mappings ('==').
  • logic (str): condition between mappings ('and').
  • drop_constants (bool): to drop the columns with single unique value (False).
  • test (bool): testing (False).
  • verbose (bool): more verbose (True).

Returns:

  • df (DataFrame): output dataframe.
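The roles of `sign` (comparison within each key/value mapping) and `logic` (how the comparisons are combined) can be sketched in plain Python as follows. The function `filter_rows_sketch`, the set of supported signs and the sample rows are assumptions made for illustration only.

```python
import operator

# Assumption: only these comparison signs are handled in this sketch.
SIGNS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt, "<": operator.lt}

def filter_rows_sketch(rows, d, sign="==", logic="and"):
    compare = SIGNS[sign]
    # 'and' requires every condition to hold; anything else is treated as 'or'.
    combine = all if logic == "and" else any
    return [
        row for row in rows
        if combine(compare(row[key], value) for key, value in d.items())
    ]

rows = [
    {"gene": "A", "tissue": "liver", "essential": True},
    {"gene": "B", "tissue": "liver", "essential": False},
    {"gene": "C", "tissue": "brain", "essential": True},
    {"gene": "D", "tissue": "brain", "essential": False},
]
both = filter_rows_sketch(rows, {"tissue": "liver", "essential": True})
either = filter_rows_sketch(rows, {"tissue": "liver", "essential": True}, logic="or")
```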

function agg_bools

agg_bools(df1, cols)

Bools to columns. Reverse of one-hot encoder (get_dummies).

Parameters:

  • df1 (DataFrame): input dataframe.
  • cols (list): columns.

Returns:

  • ds (Series): output series.

function melt_paired

melt_paired(
    df: DataFrame,
    cols_index: list = None,
    suffixes: list = None,
    cols_value: list = None,
    clean: bool = False
) -> DataFrame

Melt a paired dataframe.

Parameters:

  • df (DataFrame): input dataframe.
  • cols_index (list): paired index columns (None).
  • suffixes (list): paired suffixes (None).
  • cols_value (list): names of the columns containing the values (None).

Notes:

Partial melt melts selected columns cols_value.

Examples: Paired parameters: cols_value=['value1','value2'], suffixes=['gene1','gene2'].


function get_bin_labels

get_bin_labels(bins: list, dtype: str = 'int')

function get_bins

get_bins(
    df: DataFrame,
    col: str,
    bins: list,
    dtype: str = 'int',
    labels: list = None,
    **kws_cut
)

function get_qbins

get_qbins(df: DataFrame, col: str, bins: list, labels: list = None, **kws_qcut)

function get_chunks

get_chunks(
    df1: DataFrame,
    colindex: str,
    colvalue: str,
    bins: int = None,
    value: str = 'right'
) -> DataFrame

Get chunks of a dataframe.

Parameters:

  • df1 (DataFrame): input dataframe.
  • colindex (str): name of the index column.
  • colvalue (str): name of the column containing values [0-100]
  • bins (int): number of bins.
  • value (str): value to use as the name of the chunk ('right').

Returns:

  • ds (Series): output series.

function sample_near_quantiles

sample_near_quantiles(data: DataFrame, col: str, n: int, clean: bool = False)

Get rows with values closest to the quantiles.


function get_group

get_group(groups, i: int = None, verbose: bool = True) -> DataFrame

Get a dataframe for a group out of the groupby object.

Parameters:

  • groups (object): groupby object.
  • i (int): index of the group. default None returns the largest group.
  • verbose (bool): verbose (True).

Returns:

  • df (DataFrame): output dataframe.

Notes:

Useful for testing groupby.


function groupby_sample

groupby_sample(
    df: DataFrame,
    groupby: list,
    i: int = None,
    **kws_get_group
) -> DataFrame

Samples a group (similar to .sample)

Parameters:

  • df (pd.DataFrame): input dataframe.
  • groupby (list): columns to group by.
  • i (int): index of the group. default None returns the largest group.

Keyword arguments: keyword parameters provided to the get_group function

Returns: pd.DataFrame


function groupby_sort_values

groupby_sort_values(
    df: DataFrame,
    groupby: str,
    col: str,
    func: str,
    col_temp: str = 'temp',
    ascending=True,
    **kws_sort_values
) -> DataFrame

Groupby and sort

Parameters:

  • df (pd.DataFrame): input dataframe.
  • groupby (list): columns to group by.

Keyword arguments: keyword parameters provided to the .sort_values attribute

Returns: pd.DataFrame


function groupby_agg_nested

groupby_agg_nested(
    df1: DataFrame,
    groupby: list,
    subset: list,
    func: dict = None,
    cols_value: list = None,
    verbose: bool = False,
    **kws_agg
) -> DataFrame

Aggregate serially from the lower level subsets to upper level ones.

Parameters:

  • df1 (pd.DataFrame): input dataframe.
  • groupby (list): groupby columns i.e. list of columns to be used as ids in the output.
  • subset (list): nested groups i.e. subsets.
  • func (dict): map between the columns with values to aggregate and the functions for aggregation.
  • cols_value (list): columns with value to aggregate, (optional).
  • verbose (bool): verbose.

Keyword arguments:

  • kws_agg : keyword arguments provided to pandas's .agg function.

Returns: output dataframe with the aggregated values.


function groupby_filter_fast

groupby_filter_fast(
    df1: DataFrame,
    col_groupby,
    fun_agg,
    expr,
    col_agg: str = 'temporary',
    **kws_query
) -> DataFrame

Groupby and filter fast.

Parameters:

  • df1 (DataFrame): input dataframe.
  • col_groupby (str|list): column name/s to groupby with.
  • fun_agg (object): function to aggregate with.
  • expr (str): expression to filter the aggregated values with.
  • col_agg (str): name of the temporary column with the aggregated values ('temporary').

Returns:

  • df1 (DataFrame): output dataframe.

Todo: Deprecation if pandas.core.groupby.DataFrameGroupBy.filter is faster.


function infer_index

infer_index(
    data: DataFrame,
    cols_drop=[],
    include=<class 'object'>,
    exclude=None
) -> list

Infer the index (id) of the table.


function to_multiindex_columns

to_multiindex_columns(df, suffixes, test=False)

Single level columns to multiindex.

Parameters:

  • df (DataFrame): input dataframe.
  • suffixes (list): list of suffixes.
  • test (bool): verbose (False).

Returns:

  • df (DataFrame): output dataframe.

function to_ranges

to_ranges(df1, colindex, colbool, sort=True)

Ranges from boolean columns.

Parameters:

  • df1 (DataFrame): input dataframe.
  • colindex (str): column containing index items.
  • colbool (str): column containing boolean values.
  • sort (bool): sort the dataframe (True).

Returns:

  • df1 (DataFrame): output dataframe.

TODO: compare with io_sets.bools2intervals.
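A minimal plain-Python sketch of deriving ranges from a boolean column: each run of consecutive True values becomes one (start, end) pair. The helper `bools_to_ranges` and the sample data are hypothetical, and the input is assumed to be sorted by index already.

```python
from itertools import groupby

# Hypothetical helper: (start, end) index pairs for each consecutive run of True.
def bools_to_ranges(index, bools):
    ranges = []
    for is_true, run in groupby(zip(index, bools), key=lambda pair: pair[1]):
        if is_true:
            items = [i for i, _ in run]
            ranges.append((items[0], items[-1]))
    return ranges

positions = [1, 2, 3, 4, 5, 6, 7]
flags = [False, True, True, False, True, True, True]
ranges = bools_to_ranges(positions, flags)
```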


function to_boolean

to_boolean(df1)

Boolean from ranges.

Parameters:

  • df1 (DataFrame): input dataframe.

Returns:

  • ds (Series): output series.

TODO: compare with io_sets.bools2intervals.


function to_cat

to_cat(ds1: Series, cats: list, ordered: bool = True)

To series containing categories.

Parameters:

  • ds1 (Series): input series.
  • cats (list): categories.
  • ordered (bool): if the categories are ordered (True).

Returns:

  • ds1 (Series): output series.

function astype_cat

astype_cat(df1: DataFrame, col: str, cats: list)

function sort_valuesby_list

sort_valuesby_list(
    df1: DataFrame,
    by: str,
    cats: list,
    by_more: list = [],
    **kws
)

Sort dataframe by custom order of items in a column.

Parameters:

  • df1 (DataFrame): input dataframe.
  • by (str): column.
  • cats (list): ordered list of items.

Keyword parameters:

  • kws (dict): parameters provided to sort_values.

Returns:

  • df (DataFrame): output dataframe.
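Sorting by a custom order amounts to ranking each item by its position in `cats`. The plain-Python sketch below illustrates this; the helper `sort_by_custom_order` and the sample rows are hypothetical, and every value in the column is assumed to appear in `cats`.

```python
# Hypothetical helper: sort rows by the position of row[by] in the ordered list `cats`.
def sort_by_custom_order(rows, by, cats):
    rank = {cat: i for i, cat in enumerate(cats)}
    # sorted() is stable, so rows with the same category keep their input order.
    return sorted(rows, key=lambda row: rank[row[by]])

rows = [
    {"gene": "A", "effect": "other"},
    {"gene": "B", "effect": "damaging"},
    {"gene": "C", "effect": "conserving"},
    {"gene": "D", "effect": "damaging"},
]
sorted_rows = sort_by_custom_order(rows, by="effect", cats=["damaging", "other", "conserving"])
```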

function agg_by_order

agg_by_order(x, order)

Get first item in the order.

Parameters:

  • x (list): list.
  • order (list): desired order of the items.

Returns:

  • k: first item.

Notes:

Used for sorting strings. e.g. damaging > other non-conserving > other conserving

TODO: Convert categories to numbers and take min
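The "first item in the order" rule can be sketched in a few lines of plain Python, using the ordering from the note above. The helper `first_in_order` is hypothetical, and returning None when no item matches is an assumption.

```python
# Hypothetical helper: return the first entry of `order` that occurs in `items`.
def first_in_order(items, order):
    present = set(items)
    for candidate in order:
        if candidate in present:
            return candidate
    return None  # assumption: no match yields None

order = ["damaging", "other non-conserving", "other conserving"]
top = first_in_order(["other conserving", "damaging", "other non-conserving"], order)
second = first_in_order(["other conserving", "other non-conserving"], order)
```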


function agg_by_order_counts

agg_by_order_counts(x, order)

Get the aggregated counts by order.

Parameters:

  • x (list): list.
  • order (list): desired order of the items.

Returns:

  • df (DataFrame): output dataframe.

Examples:

df = pd.DataFrame({
    'a1': ['a', 'b', 'c', 'a', 'b', 'c', 'd'],
    'b1': ['a1', 'a1', 'a1', 'b1', 'b1', 'b1', 'b1'],
})
df.groupby('b1').apply(lambda df: agg_by_order_counts(x=df['a1'], order=['b', 'c', 'a']))


function swap_paired_cols

swap_paired_cols(df_, suffixes=['gene1', 'gene2'])

Swap suffixes of paired columns.

Parameters:

  • df_ (DataFrame): input dataframe.
  • suffixes (list): suffixes.

Returns:

  • df (DataFrame): output dataframe.

function sort_columns_by_values

sort_columns_by_values(
    df: DataFrame,
    subset: list,
    suffixes: list = None,
    order: list = None,
    clean=False
) -> DataFrame

Sort the values in columns in ascending order.

Parameters:

  • df (DataFrame): input dataframe.
  • subset (list): columns.
  • suffixes (list): suffixes.
  • order (list): ordered list.

Returns:

  • df (DataFrame): output dataframe.

Notes:

In the output dataframe, sorted means values are sorted because gene1>gene2.


function make_ids

make_ids(
    df: DataFrame,
    cols: list,
    ids_have_equal_length: bool,
    sep: str = '--',
    sort: bool = False
) -> Series

Make ids by joining string ids from more than one column.

Parameters:

  • df (DataFrame): input dataframe.
  • cols (list): columns.
  • ids_have_equal_length (bool): ids have equal length, if True faster processing.
  • sep (str): separator between the ids ('--').
  • sort (bool): sort the ids before joining (False).

Returns:

  • ds (Series): output series.
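A plain-Python sketch of joining ids across columns, with optional sorting so that the same pair always yields the same id regardless of column order. The helper `make_ids_sketch` and the sample rows are hypothetical; this does not reproduce roux's faster path for ids of equal length.

```python
# Hypothetical helper: join the string ids found in `cols` for every row.
def make_ids_sketch(rows, cols, sep="--", sort=False):
    ids = []
    for row in rows:
        parts = [row[col] for col in cols]
        if sort:
            parts = sorted(parts)  # order-independent id for the pair
        ids.append(sep.join(parts))
    return ids

rows = [{"gene1": "B", "gene2": "A"}, {"gene1": "A", "gene2": "C"}]
ids = make_ids_sketch(rows, cols=["gene1", "gene2"])
ids_sorted = make_ids_sketch(rows, cols=["gene1", "gene2"], sort=True)
```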

function make_ids_sorted

make_ids_sorted(
    df: DataFrame,
    cols: list,
    ids_have_equal_length: bool,
    sep: str = '--',
    sort: bool = False
) -> Series

Make sorted ids by joining string ids from more than one column.

Parameters:

  • df (DataFrame): input dataframe.
  • cols (list): columns.
  • ids_have_equal_length (bool): ids have equal length, if True faster processing.
  • sep (str): separator between the ids ('--').

Returns:

  • ds (Series): output series.

function get_alt_id

get_alt_id(s1: str, s2: str, sep: str = '--')

Get alternate/partner id from a paired id.

Parameters:

  • s1 (str): joined id.
  • s2 (str): query id.

Returns:

  • s (str): partner id.
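Looking up the partner in a paired id can be sketched as splitting on the separator and dropping the query. The helper `get_partner_id` is hypothetical, and raising an error when the query is absent is an assumption of this sketch.

```python
# Hypothetical helper: return the other id in a joined pair.
def get_partner_id(joined, query, sep="--"):
    parts = joined.split(sep)
    parts.remove(query)  # drops the first occurrence; raises ValueError if absent
    return sep.join(parts)

partner = get_partner_id("geneA--geneB", "geneA")
self_pair = get_partner_id("geneA--geneA", "geneA")
```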

function split_ids

split_ids(df1, col, sep='--', prefix=None)

Split joined ids to individual ones.

Parameters:

  • df1 (DataFrame): input dataframe.
  • col (str): column containing the joined ids.
  • sep (str): separator within the joined ids ('--').
  • prefix (str): prefix of the individual ids (None).

Returns:

  • df1 (DataFrame): output dataframe.

function dict2df

dict2df(d, colkey='key', colvalue='value')

Dictionary to DataFrame.

Parameters:

  • d (dict): dictionary.
  • colkey (str): name of column containing the keys.
  • colvalue (str): name of column containing the values.

Returns:

  • df (DataFrame): output dataframe.

function log_shape_change

log_shape_change(d1, fun='')

Report the changes in the shapes of a DataFrame.

Parameters:

  • d1 (dict): dictionary containing the shapes.
  • fun (str): name of the function.

function log_apply

log_apply(
    df,
    fun,
    validate_equal_length=False,
    validate_equal_width=False,
    validate_equal_shape=False,
    validate_no_decrease_length=False,
    validate_no_decrease_width=False,
    validate_no_increase_length=False,
    validate_no_increase_width=False,
    *args,
    **kwargs
)

Report (log) the changes in the shape of the dataframe before and after one or more operations.

Parameters:

  • df (DataFrame): input dataframe.
  • fun (object): function to apply on the dataframe.
  • validate_equal_length (bool): Validate that the number of rows i.e. length of the dataframe remains the same before and after the operation.
  • validate_equal_width (bool): Validate that the number of columns i.e. width of the dataframe remains the same before and after the operation.
  • validate_equal_shape (bool): Validate that the number of rows and columns i.e. shape of the dataframe remains the same before and after the operation.

Keyword parameters:

  • args (tuple): provided to fun.
  • kwargs (dict): provided to fun.

Returns:

  • df (DataFrame): output dataframe.
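The pattern of applying a function, logging the change in size and optionally validating it can be sketched in plain Python. The wrapper `apply_and_log`, the helper `drop_missing` and the sample rows are hypothetical; only the row count is tracked here, whereas the function documented above also considers the number of columns.

```python
import logging

# Hypothetical wrapper: apply `fun` to a list of rows and log the change in length.
def apply_and_log(rows, fun, *args, validate_equal_length=False, **kwargs):
    before = len(rows)
    out = fun(rows, *args, **kwargs)
    after = len(out)
    logging.info("%s: rows %d -> %d", fun.__name__, before, after)
    if validate_equal_length:
        assert before == after, f"{fun.__name__} changed the number of rows"
    return out

# Hypothetical operation that removes rows with a missing value in `col`.
def drop_missing(rows, col):
    return [row for row in rows if row[col] is not None]

rows = [{"x": 1}, {"x": None}, {"x": 3}]
kept = apply_and_log(rows, drop_missing, col="x")
```

With `validate_equal_length=True`, the same call would raise an AssertionError because one row is dropped.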

class log

Report (log) the changes in the shape of the dataframe before and after one or more operations.

TODO: Create the attributes (attr) using strings, e.g. setattr, with the function name obtained as: import inspect; fun = inspect.currentframe().f_code.co_name

method __init__

__init__(pandas_obj)

method check_dups

check_dups(**kws)

method check_na

check_na(**kws)

method clean

clean(**kws)

method drop

drop(**kws)

method drop_duplicates

drop_duplicates(**kws)

method dropna

dropna(**kws)

method explode

explode(**kws)

method filter_

filter_(**kws)

method filter_rows

filter_rows(**kws)

method groupby

groupby(**kws)

method join

join(**kws)

method melt

melt(**kws)

method melt_paired

melt_paired(**kws)

method merge

merge(**kws)

method pivot

pivot(**kws)

method pivot_table

pivot_table(**kws)

method query

query(**kws)

method stack

stack(**kws)

method unstack

unstack(**kws)

module roux.stat.binary

For processing binary data.


function compare_bools_jaccard

compare_bools_jaccard(x, y)

Compare bools in terms of the jaccard index.

Args:

  • x (list): list of bools.
  • y (list): list of bools.

Returns:

  • float: jaccard index.
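For two lists of bools, the Jaccard index is the number of positions where both are True divided by the number of positions where at least one is True. A plain-Python sketch follows; the helper `jaccard_index` is hypothetical, and returning NaN when neither list has a True value is an assumption.

```python
# Hypothetical helper: |x AND y| / |x OR y| for two equally long lists of bools.
def jaccard_index(x, y):
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Assumption: an empty union is reported as NaN rather than raising an error.
    return both / either if either else float("nan")

x = [True, True, False, False]
y = [True, False, True, False]
jaccard = jaccard_index(x, y)  # 1 shared True out of 3 positions with any True
```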

function compare_bools_jaccard_df

compare_bools_jaccard_df(df: DataFrame) -> DataFrame

Pairwise compare bools in terms of the jaccard index.

Args:

  • df (DataFrame): dataframe with boolean columns.

Returns:

  • DataFrame: matrix with comparisons between the columns.

function classify_bools

classify_bools(l: list) -> str

Classify bools.

Args:

  • l (list): list of bools

Returns:

  • str: classification.

function frac

frac(x: list) -> float

Fraction.

Args:

  • x (list): list of bools.

Returns:

  • float: fraction of True values.

function perc

perc(x: list) -> float

Percentage.

Args:

  • x (list): list of bools.

Returns:

  • float: Percentage of the True values
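The fraction and percentage of True values are related by a factor of 100, as the plain-Python sketch below shows. The helper names `frac_true` and `perc_true` are hypothetical stand-ins, and a non-empty input list is assumed.

```python
# Hypothetical helpers mirroring the fraction and percentage of True values.
def frac_true(x):
    return sum(bool(v) for v in x) / len(x)

def perc_true(x):
    return 100 * frac_true(x)

flags = [True, False, True, True]
fraction = frac_true(flags)
percentage = perc_true(flags)
```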

function get_stats_confusion_matrix

get_stats_confusion_matrix(df_: DataFrame) -> DataFrame

Get stats confusion matrix.

Args:

  • df_ (DataFrame): Confusion matrix.

Returns:

  • DataFrame: stats.

function get_cutoff

get_cutoff(
    y_true,
    y_score,
    method,
    show_diagonal=True,
    show_area=True,
    kws_area: dict = {},
    show_cutoff=True,
    plot_pr=True,
    color='k',
    returns=['ax'],
    ax=None
)

Obtain threshold based on ROC or PR curve.

Returns:

  • Table:
    • columns: values
    • method: ROC, PR
    • variable: threshold (index), TPR, FPR, TP counts, precision, recall
    • values
  • Plots: AUC ROC, TPR vs TP counts, PR, Specificity vs TP counts
  • Dictionary: Thresholds from AUC, PR

TODOs: 1. Separate the plotting functions.

module roux.viz.bar

For bar plots.


function plot_barh

plot_barh(
    df1: DataFrame,
    colx: str,
    coly: str,
    colannnotside: str = None,
    x1: float = None,
    offx: float = 0,
    ax: Axes = None,
    **kws
) -> Axes

Plot horizontal bar plot with text on them.

Args:

  • df1 (pd.DataFrame): input data.
  • colx (str): x column.
  • coly (str): y column.
  • colannnotside (str): column with annotations to show on the right side of the plot.
  • x1 (float): x position of the text.
  • offx (float): x-offset of x1, multiplier.
  • color (str): color of the bars.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:

  • kws: parameters provided to the barh function.

Returns:

  • plt.Axes: plt.Axes object.

function plot_value_counts

plot_value_counts(
    df: DataFrame,
    col: str,
    logx: bool = False,
    kws_hist: dict = {'bins': 10},
    kws_bar: dict = {},
    grid: bool = False,
    axes: list = None,
    fig: object = None,
    hist: bool = True
)

Plot pandas's value_counts.

Args:

  • df (pd.DataFrame): input data value_counts.
  • col (str): column with counts.
  • logx (bool, optional): x-axis on log-scale. Defaults to False.
  • kws_hist (dict, optional): parameters provided to the hist function. Defaults to {'bins':10}.
  • kws_bar (dict, optional): parameters provided to the bar function. Defaults to {}.
  • grid (bool, optional): show grids or not. Defaults to False.
  • axes (list, optional): list of plt.axes. Defaults to None.
  • fig (object, optional): figure object. Defaults to None.
  • hist (bool, optional): show histogram. Defaults to True.

function plot_barh_stacked_percentage

plot_barh_stacked_percentage(
    df1: DataFrame,
    coly: str,
    colannot: str,
    color: str = None,
    yoff: float = 0,
    ax: Axes = None
) -> Axes

Plot horizontal stacked bar plot with percentages.

Args:

  • df1 (pd.DataFrame): input data. values in rows sum to 100%.
  • coly (str): y column. yticklabels, e.g. retained and dropped.
  • colannot (str): column with annotations.
  • color (str, optional): color. Defaults to None.
  • yoff (float, optional): y-offset. Defaults to 0.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function plot_bar_serial

plot_bar_serial(
    d1: dict,
    polygon: bool = False,
    polygon_x2i: float = 0,
    labelis: list = [],
    y: float = 0,
    ylabel: str = None,
    off_arrowy: float = 0.15,
    kws_rectangle={'height': 0.5, 'linewidth': 1},
    ax: Axes = None
) -> Axes

Barplots with serial increase in resolution.

Args:

  • d1 (dict): dictionary with the data.
  • polygon (bool, optional): show polygon. Defaults to False.
  • polygon_x2i (float, optional): connect polygon to this subset. Defaults to 0.
  • labelis (list, optional): label these subsets. Defaults to [].
  • y (float, optional): y position. Defaults to 0.
  • ylabel (str, optional): y label. Defaults to None.
  • off_arrowy (float, optional): offset for the arrow. Defaults to 0.15.
  • kws_rectangle (dict, optional): parameters provided to the rectangle function. Defaults to dict(height=0.5,linewidth=1).
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

function plot_barh_stacked_percentage_intersections

plot_barh_stacked_percentage_intersections(
    df0: DataFrame,
    colxbool: str,
    colybool: str,
    colvalue: str,
    colid: str,
    colalt: str,
    colgroupby: str,
    coffgroup: float = 0.95,
    ax: Axes = None
) -> Axes

Plot horizontal stacked bar plot with percentages and intersections.

Args:

  • df0 (pd.DataFrame): input data.
  • colxbool (str): x column.
  • colybool (str): y column.
  • colvalue (str): column with the values.
  • colid (str): column with ids.
  • colalt (str): column with the alternative subset.
  • colgroupby (str): column with groups.
  • coffgroup (float, optional): cut-off between the groups. Defaults to 0.95.
  • ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:

  • plt.Axes: plt.Axes object.

Examples:

Parameters: colxbool='paralog', colybool='essential', colvalue='value', colid='gene id', colalt='singleton', coffgroup=0.95, colgroupby='tissue'.


function to_input_data_sankey

to_input_data_sankey(
    df0,
    colid,
    cols_groupby=None,
    colall='all',
    remove_all=False
)

function plot_sankey

plot_sankey(
    df1,
    cols_groupby=None,
    hues=None,
    node_color=None,
    link_color=None,
    info=None,
    x=None,
    y=None,
    colors=None,
    hovertemplate=None,
    text_width=20,
    convert=True,
    width=400,
    height=400,
    outp=None,
    validate=True,
    test=False,
    **kws
)
