roux
Convenience functions in Python.
Examples · Explore the API
Installation
pip install roux # with basic dependencies
pip install roux[all] # with all the additional dependencies (recommended).
With additional dependencies as required:
pip install roux[viz] # for visualizations e.g. seaborn etc.
pip install roux[data] # for data operations e.g. reading excel files etc.
pip install roux[stat] # for statistics e.g. statsmodels etc.
pip install roux[fast] # for faster processing e.g. parallelization etc.
pip install roux[workflow] # for workflow operations e.g. omegaconf etc.
pip install roux[interactive] # for interactive operations in jupyter notebook e.g. watermark, icecream etc.
Command-line usage
🗺️ Read configuration.
roux read-config path/to/file
🗺️ Read metadata.
roux read-metadata path/to/file
📁 Find the latest and the oldest file in a list.
roux read-ps list_of_paths
💾 Backup a directory with a timestamp (ISO).
roux backup path/to/directory
⭐ Remove *s (star imports) from a jupyter notebook.
roux removestar path/to/notebook
ℹ️ Available command line tools and their usage.
roux --help
How to cite?
- Using BibTeX:
@software{Dandage_roux,
title = {roux: Streamlined and Versatile Data Processing Toolkit},
author = {Dandage, Rohan},
year = {2023},
url = {https://zenodo.org/doi/10.5281/zenodo.2682670},
version = {v0.1.0},
note = {The URL is a DOI link to the permanent archive of the software.},
}
- Using the citation information from the CITATION.cff file.
Future directions, for which contributions are welcome
- Addition of visualization functions as attributes to rd dataframes.
- Refactoring of the workflow functions.
Similar projects
API
module roux.global_imports
For importing commonly used functions at the development phase.
Usage: in interactive sessions (e.g. in jupyter notebooks) to facilitate faster code development.
Note: Post-development, to remove *s from the code, use removestar (pip install removestar).
removestar file
Global Variables
- FONTSIZE
- PAD
module roux.lib.df
For processing individual pandas DataFrames/Series
function get_name
get_name(df1: DataFrame, cols: list = None, coff: float = 2, out=None)
Gets the name of the dataframe.
Especially useful within a groupby + pandarallel context.
Parameters:
- df1 (DataFrame): input dataframe.
- cols (list): list of groupby columns.
- coff (int): cutoff of unique values to infer the name.
- out (str): format of the output (list|not).
Returns:
- name (tuple|str|list): name of the dataframe.
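The behavior documented above can be sketched in plain pandas. This is a hypothetical re-implementation (`get_name_sketch` is not roux's code) illustrating the idea: within a groupby, each grouping column holds a single unique value, which serves as the group's name.

```python
import pandas as pd

# Hypothetical sketch of the documented behavior of get_name:
# infer a group's name from the constant value(s) of the groupby column(s).
def get_name_sketch(df1, cols):
    values = [df1[c].unique() for c in cols]
    # Within a single group, each groupby column has exactly one unique value.
    assert all(len(v) == 1 for v in values), "not a single group"
    name = tuple(v[0] for v in values)
    return name[0] if len(name) == 1 else name

df = pd.DataFrame({"g": ["a", "a"], "x": [1, 2]})
print(get_name_sketch(df, ["g"]))  # -> a
```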
function get_groupby_columns
get_groupby_columns(df_)
Get the columns supplied to groupby.
Parameters:
- df_ (DataFrame): input dataframe.
Returns:
- columns (list): list of columns.
function get_constants
get_constants(df1)
Get the columns with a single unique value.
Parameters:
- df1 (DataFrame): input dataframe.
Returns:
- columns (list): list of columns.
function drop_unnamedcol
drop_unnamedcol(df)
Deletes the columns with "Unnamed" prefix.
Parameters:
- df (DataFrame): input dataframe.
Returns:
- df (DataFrame): output dataframe.
function drop_levelcol
drop_levelcol(df)
Deletes the potentially temporary columns with the "level" prefix.
Parameters:
- df (DataFrame): input dataframe.
Returns:
- df (DataFrame): output dataframe.
function drop_constants
drop_constants(df)
Deletes columns with a single unique value.
Parameters:
- df (DataFrame): input dataframe.
Returns:
- df (DataFrame): output dataframe.
function dropby_patterns
dropby_patterns(df1, patterns=None, strict=False, test=False, verbose=True)
Deletes columns containing the given substrings (patterns).
Parameters:
- df1 (DataFrame): input dataframe.
- patterns (list): list of substrings.
- test (bool): verbose.
Returns:
- df1 (DataFrame): output dataframe.
function flatten_columns
flatten_columns(df: DataFrame, sep: str = ' ', **kws) → DataFrame
Multi-index columns to single-level.
Parameters:
- df (DataFrame): input dataframe.
- sep (str): separator within the joined tuples (' ').
Returns:
- df (DataFrame): output dataframe.
Keyword Arguments:
- kws (dict): parameters provided to the coltuples2str function.
function lower_columns
lower_columns(df)
Convert the column names of the dataframe to lower-case.
Parameters:
- df (DataFrame): input dataframe.
Returns:
- df (DataFrame): output dataframe.
function renameby_replace
renameby_replace(
df: DataFrame,
replaces: dict,
ignore: bool = True,
**kws
) → DataFrame
Rename columns by replacing sub-strings.
Parameters:
- df (DataFrame): input dataframe.
- replaces (dict|list): from->to mapping, or a list of substrings to remove.
- ignore (bool): if True, do not validate that the replacements succeeded.
Returns:
- df (DataFrame): output dataframe.
Keyword Arguments:
- kws (dict): parameters provided to the replacemany function.
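The substring-based renaming described above can be sketched as follows; `renameby_replace_sketch` is a hypothetical re-implementation, not roux's actual code.

```python
import pandas as pd

# Hypothetical sketch of renameby_replace: rename columns by replacing
# substrings given a from->to mapping.
def renameby_replace_sketch(df, replaces):
    columns = {}
    for c in df.columns:
        new = c
        for old, to in replaces.items():
            new = new.replace(old, to)
        columns[c] = new
    return df.rename(columns=columns)

df = pd.DataFrame({"gene id": [1], "gene name": ["x"]})
out = renameby_replace_sketch(df, {"gene ": ""})
print(list(out.columns))  # -> ['id', 'name']
```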
function clean_columns
clean_columns(df: DataFrame) → DataFrame
Standardise columns.
Steps: 1. Strip flanking white-spaces. 2. Lower-case letters.
Parameters:
- df (DataFrame): input dataframe.
Returns:
- df (DataFrame): output dataframe.
function clean
clean(
df: DataFrame,
cols: list = [],
drop_constants: bool = False,
drop_unnamed: bool = True,
verb: bool = False
) → DataFrame
Deletes potentially temporary columns.
Steps: 1. Strip flanking white-spaces. 2. Lower-case letters.
Parameters:
- df (DataFrame): input dataframe.
- drop_constants (bool): whether to delete the columns with a single unique value.
- drop_unnamed (bool): whether to delete the columns with the 'Unnamed' prefix.
- verb (bool): verbose.
Returns:
- df (DataFrame): output dataframe.
function compress
compress(df1, coff_categories=20, test=False)
Compress the dataframe by converting columns containing strings/objects to categorical.
Parameters:
- df1 (DataFrame): input dataframe.
- coff_categories (int): columns with fewer unique values than this cutoff are converted to categories.
- test (bool): verbose.
Returns:
- df1 (DataFrame): output dataframe.
function clean_compress
clean_compress(df, kws_compress={}, **kws_clean)
clean and compress the dataframe.
Parameters:
- df (DataFrame): input dataframe.
- kws_compress (dict): keyword arguments for the compress function.
- test (bool): verbose.
Keyword Arguments:
- kws_clean (dict): parameters provided to the clean function.
Returns:
- df1 (DataFrame): output dataframe.
See Also: clean, compress
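The compress step described above can be sketched in plain pandas; `compress_sketch` is a hypothetical re-implementation of the documented behavior, not roux's code.

```python
import pandas as pd

# Hypothetical sketch of compress: object columns with fewer unique values
# than the cutoff are converted to the memory-efficient 'category' dtype.
def compress_sketch(df, coff_categories=20):
    df = df.copy()
    for c in df.select_dtypes(include="object").columns:
        if df[c].nunique() < coff_categories:
            df[c] = df[c].astype("category")
    return df

df = pd.DataFrame({"s": ["a", "b", "a"] * 100})
out = compress_sketch(df)
print(out["s"].dtype)  # -> category
```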
function check_na
check_na(df, subset=None, out=True, perc=False, log=True)
Number of missing values in columns.
Parameters:
- df (DataFrame): input dataframe.
- subset (list): list of columns.
- out (bool): whether to return the output; else, usable in chained operations.
Returns:
- ds (Series): output stats.
function validate_no_na
validate_no_na(df, subset=None)
Validate that there are no missing values in the columns.
Parameters:
- df (DataFrame): input dataframe.
- subset (list): list of columns.
Returns:
- ds (Series): output stats.
function assert_no_na
assert_no_na(df, subset=None)
Assert that there are no missing values in the columns.
Parameters:
- df (DataFrame): input dataframe.
- subset (list): list of columns.
Returns:
- ds (Series): output stats.
function to_str
to_str(data, log=False)
function check_nunique
check_nunique(
df: DataFrame,
subset: list = None,
groupby: str = None,
perc: bool = False,
auto=False,
out=True,
log=True
) → Series
Number/percentage of unique values in columns.
Parameters:
- df (DataFrame): input dataframe.
- subset (list): list of columns.
- perc (bool): output percentages.
Returns:
- ds (Series): output stats.
function check_inflation
check_inflation(df1, subset=None)
Occurrences of values in columns.
Parameters:
- df1 (DataFrame): input dataframe.
- subset (list): list of columns.
Returns:
- ds (Series): output stats.
function check_dups
check_dups(df, subset=None, perc=False, out=True)
Check duplicates.
Parameters:
- df (DataFrame): input dataframe.
- subset (list): list of columns.
- perc (bool): output percentages.
Returns:
- ds (Series): output stats.
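The missing-value and duplicate checks described above can be sketched in plain pandas; both functions below are hypothetical re-implementations of the documented behavior, not roux's code.

```python
import pandas as pd

# Hypothetical sketch of check_na: count missing values per column.
def check_na_sketch(df, subset=None):
    return df[subset if subset is not None else df.columns].isna().sum()

# Hypothetical sketch of check_dups: count duplicated rows,
# optionally over a subset of columns.
def check_dups_sketch(df, subset=None):
    return int(df.duplicated(subset=subset).sum())

df = pd.DataFrame({"a": [1, 1, None], "b": ["x", "x", "y"]})
print(check_na_sketch(df)["a"])            # -> 1
print(check_dups_sketch(df, subset=["b"]))  # -> 1
```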
function check_duplicated
check_duplicated(df, **kws)
Check duplicates (alias of check_dups).
function validate_no_dups
validate_no_dups(df, subset=None)
Validate that there are no duplicates.
Parameters:
- df (DataFrame): input dataframe.
- subset (list): list of columns.
function validate_no_duplicates
validate_no_duplicates(df, subset=None)
Validate that there are no duplicates (alias of validate_no_dups).
function assert_no_dups
assert_no_dups(df, subset=None)
Assert that there are no duplicates.
function validate_dense
validate_dense(
df01: DataFrame,
subset: list = None,
duplicates: bool = True,
na: bool = True,
message=None
) → DataFrame
Validate no missing values and no duplicates in the dataframe.
Parameters:
- df01 (DataFrame): input dataframe.
- subset (list): list of columns.
- duplicates (bool): whether to check for duplicates.
- na (bool): whether to check for missing values.
- message (str): error message.
function assert_dense
assert_dense(
df01: DataFrame,
subset: list = None,
duplicates: bool = True,
na: bool = True,
message=None
) → DataFrame
Alias of validate_dense.
Notes:
To be deprecated in future releases.
function classify_mappings
classify_mappings(df1: DataFrame, subset, clean: bool = False) → DataFrame
Classify mappings between items in two columns.
Parameters:
- df1 (DataFrame): input dataframe.
- subset (list): pair of columns (column #1 and column #2).
- clean (bool): drop the columns with the counts.
Returns:
- (pd.DataFrame): output.
function check_mappings
check_mappings(df: DataFrame, subset: list = None, out=True) → DataFrame
Mapping between items in two columns.
Parameters:
- df (DataFrame): input dataframe.
- subset (list): list of columns.
- out (str): format of the output.
Returns:
- ds (Series): output stats.
function assert_1_1_mappings
assert_1_1_mappings(df: DataFrame, subset: list = None) → DataFrame
Validate that the mapping between items in two columns is 1:1.
Parameters:
- df (DataFrame): input dataframe.
- subset (list): list of columns.
function get_mappings
get_mappings(
df1: DataFrame,
subset=None,
keep='all',
clean=False,
cols=None
) → DataFrame
Classify the mapping between items in two columns.
Parameters:
- df1 (DataFrame): input dataframe.
- subset (list): list of columns.
- keep (str): type of mapping to keep (1:1|1:m|m:1).
- clean (bool): whether to remove temporary columns.
- cols (list): alias of subset.
Returns:
- df (DataFrame): output dataframe.
function to_map_binary
to_map_binary(df: DataFrame, colgroupby=None, colvalue=None) → DataFrame
Convert linear mappings to a binary map.
Parameters:
- df (DataFrame): input dataframe.
- colgroupby (str): name of the column to group by.
- colvalue (str): name of the column containing the values.
Returns:
- df1 (DataFrame): output dataframe.
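The linear-to-binary conversion described above can be sketched with pandas' crosstab; `to_map_binary_sketch` is a hypothetical re-implementation of the documented behavior, not roux's code.

```python
import pandas as pd

# Hypothetical sketch of to_map_binary: a linear (long-format) mapping
# table is pivoted into a boolean membership matrix.
def to_map_binary_sketch(df, colgroupby, colvalue):
    return pd.crosstab(df[colvalue], df[colgroupby]).astype(bool)

df = pd.DataFrame({"group": ["g1", "g1", "g2"], "item": ["a", "b", "a"]})
m = to_map_binary_sketch(df, colgroupby="group", colvalue="item")
print(m.loc["a", "g2"])  # -> True
```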
function check_intersections
check_intersections(
df: DataFrame,
colindex=None,
colgroupby=None,
plot=False,
**kws_plot
) → DataFrame
Check intersections. The linear dataframe is converted to a binary map and then to a series using groupby.
Parameters:
- df (DataFrame): input dataframe.
- colindex (str): name of the index column.
- colgroupby (str): name of the groupby column.
- plot (bool): whether to plot.
Returns:
- ds1 (Series): output Series.
Keyword Arguments:
- kws_plot (dict): parameters provided to the plotting function.
function get_totals
get_totals(ds1)
Get totals from the output of check_intersections.
Parameters:
- ds1 (Series): input Series.
Returns:
- d (dict): output dictionary.
function filter_rows
filter_rows(
df,
d,
sign='==',
logic='and',
drop_constants=False,
test=False,
verbose=True
)
Filter rows using a dictionary.
Parameters:
- df (DataFrame): input dataframe.
- d (dict): dictionary of conditions.
- sign (str): condition within mappings ('==').
- logic (str): condition between mappings ('and').
- drop_constants (bool): drop the columns with a single unique value (False).
- test (bool): testing (False).
- verbose (bool): more verbose (True).
Returns:
- df (DataFrame): output dataframe.
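The dictionary-based row filtering described above can be sketched as follows; `filter_rows_sketch` is a hypothetical re-implementation covering only the '==' sign, not roux's code.

```python
import pandas as pd

# Hypothetical sketch of filter_rows: keep rows matching a dictionary of
# column->value conditions, combined with 'and' or 'or' logic.
def filter_rows_sketch(df, d, logic="and"):
    masks = [df[k] == v for k, v in d.items()]
    mask = masks[0]
    for m in masks[1:]:
        mask = (mask & m) if logic == "and" else (mask | m)
    return df[mask]

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "x"]})
out = filter_rows_sketch(df, {"a": 1, "b": "x"})
print(len(out))  # -> 1
```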
function get_bools
get_bools(df, cols, drop=False)
Columns to booleans. One-hot encoder (get_dummies).
Parameters:
- df (DataFrame): input dataframe.
- cols (list): columns to encode.
- drop (bool): drop the cols (False).
Returns:
- df (DataFrame): output dataframe.
function agg_bools
agg_bools(df1, cols)
Booleans to columns. Reverse of the one-hot encoder (get_dummies).
Parameters:
- df1 (DataFrame): input dataframe.
- cols (list): columns.
Returns:
- ds (Series): output series.
function melt_paired
melt_paired(
df: DataFrame,
cols_index: list = None,
suffixes: list = None,
cols_value: list = None,
clean: bool = False
) → DataFrame
Melt a paired dataframe.
Parameters:
- df (DataFrame): input dataframe.
- cols_index (list): paired index columns (None).
- suffixes (list): paired suffixes (None).
- cols_value (list): names of the columns containing the values (None).
Notes:
A partial melt melts only the selected columns (cols_value).
Examples:
Paired parameters: cols_value=['value1','value2'], suffixes=['gene1','gene2']
function get_chunks
get_chunks(
df1: DataFrame,
colindex: str,
colvalue: str,
bins: int = None,
value: str = 'right'
) → DataFrame
Get chunks of a dataframe.
Parameters:
- df1 (DataFrame): input dataframe.
- colindex (str): name of the index column.
- colvalue (str): name of the column containing values [0-100].
- bins (int): number of bins.
- value (str): value to use as the name of the chunk ('right').
Returns:
- ds (Series): output series.
function sample_near_quantiles
sample_near_quantiles(data: DataFrame, col: str, n: int, clean: bool = False)
Get rows with values closest to the quantiles.
function get_group
get_group(groups, i: int = None, verbose: bool = True) → DataFrame
Get the dataframe for a single group out of the groupby object.
Parameters:
- groups (object): groupby object.
- i (int): index of the group; default None returns the largest group.
- verbose (bool): verbose (True).
Returns:
- df (DataFrame): output dataframe.
Notes:
Useful for testing groupby operations.
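The "largest group" behavior described above can be sketched in plain pandas; `get_group_sketch` is a hypothetical re-implementation, not roux's code.

```python
import pandas as pd

# Hypothetical sketch of get_group: with i=None, return the largest group
# from a groupby object; otherwise, the i-th group by descending size.
def get_group_sketch(groups, i=None):
    sizes = groups.size().sort_values(ascending=False)
    name = sizes.index[0] if i is None else sizes.index[i]
    return groups.get_group(name)

df = pd.DataFrame({"g": ["a", "b", "b"], "x": [1, 2, 3]})
out = get_group_sketch(df.groupby("g"))
print(len(out))  # -> 2
```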
function groupby_sample
groupby_sample(
df: DataFrame,
groupby: list,
i: int = None,
**kws_get_group
) → DataFrame
Samples a group (similar to .sample).
Parameters:
- df (pd.DataFrame): input dataframe.
- groupby (list): columns to group by.
- i (int): index of the group; default None returns the largest group.
Keyword arguments: keyword parameters provided to the get_group function.
Returns: pd.DataFrame
function groupby_agg_nested
groupby_agg_nested(
df1: DataFrame,
groupby: list,
subset: list,
func: dict = None,
cols_value: list = None,
verbose: bool = False,
**kws_agg
) → DataFrame
Aggregate serially from the lower-level subsets to the upper-level ones.
Parameters:
- df1 (pd.DataFrame): input dataframe.
- groupby (list): groupby columns, i.e. the columns to be used as ids in the output.
- subset (list): nested groups, i.e. subsets.
- func (dict): map between the columns with values to aggregate and the functions for aggregation.
- cols_value (list): columns with values to aggregate (optional).
- verbose (bool): verbose.
Keyword arguments:
- kws_agg: keyword arguments provided to pandas's .agg function.
Returns: output dataframe with the aggregated values.
function groupby_filter_fast
groupby_filter_fast(
df1: DataFrame,
col_groupby,
fun_agg,
expr,
col_agg: str = 'temporary',
**kws_query
) → DataFrame
Groupby and filter, fast.
Parameters:
- df1 (DataFrame): input dataframe.
- col_groupby (str|list): column name/s to group by.
- fun_agg (object): function to aggregate with.
- expr (str): expression (e.g. a cut-off condition) to filter with.
- col_agg (str): name of the temporary aggregation column ('temporary').
Returns:
- df1 (DataFrame): output dataframe.
Todo:
Deprecate if pandas.core.groupby.DataFrameGroupBy.filter is faster.
function infer_index
infer_index(
data: DataFrame,
cols_drop=[],
include=<class 'object'>,
exclude=None
) → list
Infer the index (id) of the table.
function to_multiindex_columns
to_multiindex_columns(df, suffixes, test=False)
Single-level columns to multiindex.
Parameters:
- df (DataFrame): input dataframe.
- suffixes (list): list of suffixes.
- test (bool): verbose (False).
Returns:
- df (DataFrame): output dataframe.
function to_ranges
to_ranges(df1, colindex, colbool, sort=True)
Ranges from boolean columns.
Parameters:
- df1 (DataFrame): input dataframe.
- colindex (str): column containing the index items.
- colbool (str): column containing the boolean values.
- sort (bool): sort the dataframe (True).
Returns:
- df1 (DataFrame): output dataframe.
TODO: compare with io_sets.bools2intervals.
function to_boolean
to_boolean(df1)
Booleans from ranges.
Parameters:
- df1 (DataFrame): input dataframe.
Returns:
- ds (Series): output series.
TODO: compare with io_sets.bools2intervals.
function to_cat
to_cat(ds1, cats, ordered=True)
To a series containing categories.
Parameters:
- ds1 (Series): input series.
- cats (list): categories.
- ordered (bool): whether the categories are ordered (True).
Returns:
- ds1 (Series): output series.
function astype_cat
astype_cat(df1: DataFrame, col: str, cats: list)
function sort_valuesby_list
sort_valuesby_list(
df1: DataFrame,
by: str,
cats: list,
by_more: list = [],
**kws
)
Sort dataframe by custom order of items in a column.
Parameters:
- df1 (DataFrame): input dataframe.
- by (str): column.
- cats (list): ordered list of items.
Keyword parameters:
- kws (dict): parameters provided to sort_values.
Returns:
- df (DataFrame): output dataframe.
function agg_by_order
agg_by_order(x, order)
Get the first item in the order.
Parameters:
- x (list): list.
- order (list): desired order of the items.
Returns:
- k: first item.
Notes:
Used for sorting strings, e.g. damaging > other non-conserving > other conserving.
TODO: Convert categories to numbers and take the min.
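The order-based aggregation described above can be sketched as follows; `agg_by_order_sketch` is a hypothetical re-implementation of the documented behavior, not roux's code.

```python
# Hypothetical sketch of agg_by_order: among the items present in x,
# return the one that comes first in the desired order.
def agg_by_order_sketch(x, order):
    items = list(x)
    for k in order:
        if k in items:
            return k

print(agg_by_order_sketch(["c", "a"], order=["b", "c", "a"]))  # -> c
```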
function agg_by_order_counts
agg_by_order_counts(x, order)
Get the aggregated counts, by order.
Parameters:
- x (list): list.
- order (list): desired order of the items.
Returns:
- df (DataFrame): output dataframe.
Examples:
df=pd.DataFrame({'a1':['a','b','c','a','b','c','d'],
                 'b1':['a1','a1','a1','b1','b1','b1','b1'],})
df.groupby('b1').apply(lambda df: agg_by_order_counts(x=df['a1'], order=['b','c','a']))
function groupby_sort_values
groupby_sort_values(
df: DataFrame,
col_groupby: list,
col_sortby: list,
subset: list = None,
col_subset: list = None,
func: str = 'mean',
ascending: bool = True
)
Sort groups.
Parameters:
- df (DataFrame): input dataframe.
- col_groupby (str|list): column/s to group by.
- col_sortby (str|list): column/s to sort the values by.
- subset (list): columns (None).
- col_subset (str): column containing the subset (None).
- func (str): aggregate function, provided to numpy ('mean').
- ascending (bool): sort values in ascending order (True).
Returns:
- df (DataFrame): output dataframe.
function swap_paired_cols
swap_paired_cols(df_, suffixes=['gene1', 'gene2'])
Swap the suffixes of paired columns.
Parameters:
- df_ (DataFrame): input dataframe.
- suffixes (list): suffixes.
Returns:
- df (DataFrame): output dataframe.
function sort_columns_by_values
sort_columns_by_values(
df: DataFrame,
subset: list,
suffixes: list = None,
order: list = None,
clean=False
) → DataFrame
Sort the values in columns in ascending order.
Parameters:
- df (DataFrame): input dataframe.
- subset (list): columns.
- suffixes (list): suffixes.
- order (list): ordered list.
Returns:
- df (DataFrame): output dataframe.
Notes:
In the output dataframe, 'sorted' means the values were sorted because gene1 > gene2.
function make_ids
make_ids(
df: DataFrame,
cols: list,
ids_have_equal_length: bool,
sep: str = '--',
sort: bool = False
) → Series
Make ids by joining string ids from more than one column.
Parameters:
- df (DataFrame): input dataframe.
- cols (list): columns.
- ids_have_equal_length (bool): whether the ids have equal length; if True, faster processing.
- sep (str): separator between the ids ('--').
- sort (bool): sort the ids before joining (False).
Returns:
- ds (Series): output series.
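The id-joining behavior described above can be sketched in plain pandas; `make_ids_sketch` is a hypothetical re-implementation (ignoring the equal-length fast path), not roux's code.

```python
import pandas as pd

# Hypothetical sketch of make_ids: join string ids from several columns
# with a separator, optionally sorting the ids within each row first.
def make_ids_sketch(df, cols, sep="--", sort=False):
    def join(row):
        items = sorted(row) if sort else list(row)
        return sep.join(items)
    return df[cols].astype(str).apply(join, axis=1)

df = pd.DataFrame({"gene1": ["b"], "gene2": ["a"]})
print(make_ids_sketch(df, ["gene1", "gene2"], sort=True).iloc[0])  # -> a--b
```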
function make_ids_sorted
make_ids_sorted(
df: DataFrame,
cols: list,
ids_have_equal_length: bool,
sep: str = '--',
sort: bool = False
) → Series
Make sorted ids by joining string ids from more than one column.
Parameters:
- df (DataFrame): input dataframe.
- cols (list): columns.
- ids_have_equal_length (bool): whether the ids have equal length; if True, faster processing.
- sep (str): separator between the ids ('--').
Returns:
- ds (Series): output series.
function get_alt_id
get_alt_id(s1: str, s2: str, sep: str = '--')
Get the alternate/partner id from a paired id.
Parameters:
- s1 (str): joined id.
- s2 (str): query id.
Returns:
- s (str): partner id.
function split_ids
split_ids(df1, col, sep='--', prefix=None)
Split joined ids into individual ones.
Parameters:
- df1 (DataFrame): input dataframe.
- col (str): column containing the joined ids.
- sep (str): separator within the joined ids ('--').
- prefix (str): prefix of the individual ids (None).
Returns:
- df1 (DataFrame): output dataframe.
function dict2df
dict2df(d, colkey='key', colvalue='value')
Dictionary to DataFrame.
Parameters:
- d (dict): dictionary.
- colkey (str): name of the column containing the keys.
- colvalue (str): name of the column containing the values.
Returns:
- df (DataFrame): output dataframe.
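The conversion described above can be sketched as follows; `dict2df_sketch` is a hypothetical re-implementation of the documented behavior (scalar values only), not roux's code.

```python
import pandas as pd

# Hypothetical sketch of dict2df: a dictionary becomes a two-column
# dataframe of keys and values.
def dict2df_sketch(d, colkey="key", colvalue="value"):
    return pd.DataFrame({colkey: list(d.keys()), colvalue: list(d.values())})

df = dict2df_sketch({"a": 1, "b": 2})
print(df.shape)  # -> (2, 2)
```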
function log_shape_change
log_shape_change(d1, fun='')
Report the changes in the shape of a DataFrame.
Parameters:
- d1 (dict): dictionary containing the shapes.
- fun (str): name of the function.
function log_apply
log_apply(
df,
fun,
validate_equal_length=False,
validate_equal_width=False,
validate_equal_shape=False,
validate_no_decrease_length=False,
validate_no_decrease_width=False,
validate_no_increase_length=False,
validate_no_increase_width=False,
*args,
**kwargs
)
Report (log) the changes in the shape of the dataframe before and after an operation/s.
Parameters:
- df (DataFrame): input dataframe.
- fun (object): function to apply on the dataframe.
- validate_equal_length (bool): validate that the number of rows (i.e. the length of the dataframe) remains the same before and after the operation.
- validate_equal_width (bool): validate that the number of columns (i.e. the width of the dataframe) remains the same before and after the operation.
- validate_equal_shape (bool): validate that the number of rows and columns (i.e. the shape of the dataframe) remains the same before and after the operation.
Keyword parameters:
- args (tuple): provided to fun.
- kwargs (dict): provided to fun.
Returns:
- df (DataFrame): output dataframe.
class log
Report (log) the changes in the shape of the dataframe before and after an operation/s.
TODO:
Create the attributes (attr) using strings, e.g. via setattr:
import inspect
fun = inspect.currentframe().f_code.co_name
method __init__
__init__(pandas_obj)
method check_dups
check_dups(**kws)
method check_na
check_na(**kws)
method clean
clean(**kws)
method drop
drop(**kws)
method drop_duplicates
drop_duplicates(**kws)
method dropna
dropna(**kws)
method explode
explode(**kws)
method filter_
filter_(**kws)
method filter_rows
filter_rows(**kws)
method groupby
groupby(**kws)
method join
join(**kws)
method melt
melt(**kws)
method melt_paired
melt_paired(**kws)
method merge
merge(**kws)
method pivot
pivot(**kws)
method pivot_table
pivot_table(**kws)
method query
query(**kws)
method stack
stack(**kws)
method unstack
unstack(**kws)
module roux.lib.dfs
For processing multiple pandas DataFrames/Series
function filter_dfs
filter_dfs(dfs: list, cols: list, how: str = 'inner') → DataFrame
Filter dataframes based on the items in the common columns.
Parameters:
- dfs (list): list of dataframes.
- cols (list): list of columns.
- how (str): how to filter ('inner').
Returns:
- dfs (list): list of dataframes.
function merge_with_many_columns
merge_with_many_columns(
df1: DataFrame,
right: str,
left_on: str,
right_ons: list,
right_id: str,
how: str = 'inner',
validate: str = '1:1',
test: bool = False,
verbose: bool = False,
**kws_merge
) → DataFrame
Merge with many columns, e.g. when the ids in the left table map to ids located in multiple columns of the right table.
Parameters:
- df1 (pd.DataFrame): left table.
- right (pd.DataFrame): right table.
- left_on (str): column in the left table to merge on.
- right_ons (list): columns in the right table to merge on.
- right_id (str): column in the right table containing, for example, the ids to be merged.
Keyword parameters:
- kws_merge: to be supplied to pandas.DataFrame.merge.
Returns: merged table.
function merge_paired
merge_paired(
df1: DataFrame,
df2: DataFrame,
left_ons: list,
right_on: list,
common: list = [],
right_ons_common: list = [],
how: str = 'inner',
validates: list = ['1:1', '1:1'],
suffixes: list = None,
test: bool = False,
verb: bool = True,
**kws
) → DataFrame
Merge unpaired dataframes to a paired dataframe.
Parameters:
- df1 (DataFrame): paired dataframe.
- df2 (DataFrame): unpaired dataframe.
- left_ons (list): columns of df1 (suffixed).
- right_on (str|list): column/s of df2 (to be suffixed).
- common (str|list): common column/s between df1 and df2 (not suffixed).
- right_ons_common (str|list): common column/s of df2 to be used for merging (not to be suffixed).
- how (str): method of merging ('inner').
- validates (list): validate the mappings for the 1st merge between df1 and df2, and the 2nd merge between df1+df2 and df2 (['1:1','1:1']).
- suffixes (list): suffixes to be used (None).
- test (bool): testing (False).
- verb (bool): verbose (True).
Keyword Parameters:
- kws (dict): parameters provided to merge.
Returns:
- df (DataFrame): output dataframe.
Examples:
Parameters:
    how='inner',
    left_ons=['gene id gene1','gene id gene2'], # suffixed
    common='sample id', # not suffixed
    right_on='gene id', # to be suffixed
    right_ons_common=[], # not to be suffixed
function merge_dfs
merge_dfs(dfs: list, **kws) → DataFrame
Merge dataframes from left to right.
Parameters:
- dfs (list): list of dataframes.
Keyword Parameters:
- kws (dict): parameters provided to merge.
Returns:
- df (DataFrame): output dataframe.
Notes:
For example, reduce(lambda x, y: x.merge(y), [1, 2, 3, 4, 5]) merges ((((1.merge(2)).merge(3)).merge(4)).merge(5)).
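The left-to-right merging noted above can be sketched with functools.reduce; `merge_dfs_sketch` is a hypothetical re-implementation illustrating the note, not roux's code.

```python
from functools import reduce
import pandas as pd

# Hypothetical sketch of merge_dfs: merge a list of dataframes from left
# to right, passing keyword arguments through to each merge.
def merge_dfs_sketch(dfs, **kws):
    return reduce(lambda x, y: x.merge(y, **kws), dfs)

df1 = pd.DataFrame({"id": [1, 2], "a": ["x", "y"]})
df2 = pd.DataFrame({"id": [1, 2], "b": [10, 20]})
out = merge_dfs_sketch([df1, df2], on="id")
print(list(out.columns))  # -> ['id', 'a', 'b']
```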
function compare_rows
compare_rows(df1, df2, test=False, **kws)
module roux.lib.dict
For processing dictionaries.
function head_dict
head_dict(d, lines=5)
function sort_dict
sort_dict(d1, by=1, ascending=True)
Sort a dictionary by its values.
Parameters:
- d1 (dict): input dictionary.
- by (int): index of the value among the values.
- ascending (bool): ascending order.
Returns:
- d1 (dict): output dictionary.
function merge_dicts
merge_dicts(l: list) → dict
Merge dictionaries.
Parameters:
- l (list): list containing the dictionaries.
Returns:
- d (dict): output dictionary.
TODOs: 1. In python>=3.9, merged = d1 | d2?
function merge_dicts_deep
merge_dicts_deep(left: dict, right: dict) → dict
Merge nested dictionaries. Overwrites left with right.
Parameters:
- left (dict): dictionary #1.
- right (dict): dictionary #2.
TODOs: 1. In python>=3.9, merged = d1 | d2?
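The deep merge described above, and the shallow `d1 | d2` operator mentioned in the TODO, can be contrasted in a short sketch; `merge_dicts_deep_sketch` is a hypothetical re-implementation (assumes Python >= 3.9 for the `|` operator), not roux's code.

```python
# Hypothetical sketch of merge_dicts_deep: recursively merge nested
# dictionaries, with values from the right overwriting the left.
def merge_dicts_deep_sketch(left, right):
    out = dict(left)
    for k, v in right.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = merge_dicts_deep_sketch(out[k], v)  # recurse into nests
        else:
            out[k] = v  # right overwrites left
    return out

d1 = {"a": {"x": 1}, "b": 1}
d2 = {"a": {"y": 2}, "b": 2}
print(merge_dicts_deep_sketch(d1, d2))  # -> {'a': {'x': 1, 'y': 2}, 'b': 2}
print(d1 | d2)  # shallow merge: {'a': {'y': 2}, 'b': 2}
```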
function merge_dict_values
merge_dict_values(l, test=False)
Merge dictionary values.
Parameters:
- l (list): list containing the dictionaries.
- test (bool): verbose.
Returns:
- d (dict): output dictionary.
function flip_dict
flip_dict(d)
Switch the values with the keys and vice versa.
Parameters:
- d (dict): input dictionary.
Returns:
- d (dict): output dictionary.
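The key-value flip described above is a one-liner in plain Python; `flip_dict_sketch` is a hypothetical equivalent (assumes the values are hashable and unique), not roux's code.

```python
# Hypothetical sketch of flip_dict: swap keys and values.
def flip_dict_sketch(d):
    return {v: k for k, v in d.items()}

print(flip_dict_sketch({"a": 1, "b": 2}))  # -> {1: 'a', 2: 'b'}
```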
module roux.lib.google
Processing files from Google Cloud services.
function get_service
get_service(service_name='drive', access_limit=True, client_config=None)
Creates a google service object.
Parameters:
- service_name: name of the service, e.g. drive.
- access_limit: True if access is limited, else False.
- client_config: custom client config.
Returns: google service object.
Ref: https://developers.google.com/drive/api/v3/about-auth
function list_files_in_folder
list_files_in_folder(service, folderid, filetype=None, fileext=None, test=False)
Lists files in a google drive folder.
Parameters:
- service: service object, e.g. drive.
- folderid: folder id from google drive.
- filetype: specify the file type.
- fileext: specify the file extension.
- test: True if verbose, else False.
Returns: list of files in the folder.
function get_file_id
get_file_id(p)
function download_file
download_file(
p=None,
file_id=None,
service=None,
outd=None,
outp=None,
convert=False,
force=False,
test=False
)
Downloads a specified file.
Parameters:
- service: google service object.
- file_id: file id as on google drive.
- filetypes: specify the file type.
- outp: path to the output file.
- test: True if verbose, else False.
Ref: https://developers.google.com/drive/api/v3/ref-export-formats
function upload_file
upload_file(service, filep, folder_id, test=False)
Uploads a local file onto google drive.
Parameters:
- service: google service object.
- filep: path of the file.
- folder_id: id of the folder on google drive where the file will be uploaded.
- test: True if verbose, else False.
Returns: id of the uploaded file.
function upload_files
upload_files(service, ps, folder_id, **kws)
function download_drawings
download_drawings(folderid, outd, service=None, test=False)
Download specific files: drawings
TODOs: 1. use download_file
function get_comments
get_comments(
fileid,
fields='comments/quotedFileContent/value,comments/content,comments/id',
service=None
)
Get comments.
fields:
    comments/
        kind:
        id:
        createdTime:
        modifiedTime:
        author:
            kind:
            displayName:
            photoLink:
            me: True
        htmlContent:
        content:
        deleted:
        quotedFileContent:
            mimeType:
            value:
        anchor:
        replies: []
function search
search(query, results=1, service=None, **kws_search)
Google search.
Parameters:
- query: exact terms.
Returns: dict
function get_search_strings
get_search_strings(text, num=5, test=False)
Google search.
Parameters:
- text: string.
- num: number of results.
- test: True if verbose, else False.
Returns:
- lines: list.
function get_metadata_of_paper
get_metadata_of_paper(
file_id,
service_drive,
service_search,
metadata=None,
force=False,
test=False
)
Get the metadata of a pdf document.
function share
share(
drive_service,
content_id,
share=False,
unshare=False,
user_permission=None,
permissionId='anyoneWithLink'
)
Parameters:
- user_permission: e.g. user_permission = {'type': 'anyone', 'role': 'reader', 'email': '@'}
Ref: https://developers.google.com/drive/api/v3/manage-sharing
class slides
method create_image
create_image(service, presentation_id, page_id, image_id)
The image must be less than 1.5 MB.
method get_page_ids
get_page_ids(service, presentation_id)
module roux.lib.io
For input/output of data files.
function read_zip
read_zip(p: str, file_open: str = None, fun_read=None, test: bool = False)
Read the contents of a zip file.
Parameters:
- p (str): path of the file.
- file_open (str): path of the file within the zip file to open.
- fun_read (object): function to read the file.
Examples:
- Setting the fun_read parameter for reading a tab-separated table from a zip file:
from io import StringIO ... fun_read=lambda x: pd.read_csv(StringIO(x.decode('utf-8')),sep='\t',header=None),
or
from io import BytesIO ... fun_read=lambda x: pd.read_table(BytesIO(x)),
function to_zip_dir
to_zip_dir(source, destination=None, fmt='zip')
Zip a folder. Ref: https://stackoverflow.com/a/50381250/3521099
function to_zip
to_zip(
p: str,
outp: str = None,
func_rename=None,
fmt: str = 'zip',
test: bool = False
)
Compress a file/directory.
Parameters:
- p (str): path to the file/directory.
- outp (str): path to the output compressed file.
- fmt (str): format of the compressed file.
Returns:
- outp (str): path of the compressed file.
function to_dir
to_dir(
paths: dict,
output_dir_path: str,
rename_basename=None,
force=False,
test=False
)
function get_version
get_version(suffix: str = '') → str
Get the time-based version string.
Parameters:
- suffix (str): suffix.
Returns:
- version (str): version.
function to_version
to_version(
p: str,
outd: str = None,
test: bool = False,
name: str = None,
**kws: dict
) → str
Rename a file/directory to a version.
Parameters:
p
(str): path.outd
(str): output directory.
Keyword parameters:
kws
(dict): provided toget_version
.
Returns:
version
(string): version.
TODOs: 1. Use to_dir
.
function backup
backup(
p: str,
outd: str = None,
versioned: bool = False,
suffix: str = '',
zipped: bool = False,
move_only: bool = False,
test: bool = True,
verbose: bool = False,
no_test: bool = False
)
Backup a directory
Steps: 0. create version dir in outd 1. move ps to version (time) dir with common parents till the level of the version dir 2. zip or not
Parameters:
p
(str): input path.outd
(str): output directory path.versioned
(bool): custom version for the backup (False).suffix
(str): custom suffix for the backup ('').zipped
(bool): whether to zip the backup (False).test
(bool): testing (True).no_test
(bool): no testing. Usage in command line (False).
TODOs: 1. Use to_dir
. 2. Option to remove dirs find and move/zip "find -regex ./_." "find -regex ./test."
function read_url
read_url(url)
Read text from an URL.
Parameters:
url
(str): URL link.
Returns:
s
(string): text content of the URL.
function download
download(
url: str,
path: str = None,
outd: str = None,
force: bool = False,
verbose: bool = True
) → str
Download a file.
Parameters:
url
(str): URL.path
(str): custom output path (None)outd
(str): output directory ('data/database').force
(bool): overwrite output (False).verbose
(bool): verbose (True).
Returns:
path
(str): output path (None)
function read_text
read_text(p)
Read a file. To be called by other functions
Args:
p
(str): path.
Returns:
s
(str): contents.
function to_list
to_list(l1, p)
Save list.
Parameters:
l1
(list): input list.p
(str): path.
Returns:
p
(str): path.
function read_list
read_list(p)
Read the lines in the file.
Args:
p
(str): path.
Returns:
l
(list): list.
function read_list
read_list(p)
Read the lines in the file.
Args:
p
(str): path.
Returns:
l
(list): list.
function is_dict
is_dict(p)
function read_dict
read_dict(p, fmt: str = '', apply_on_keys=None, **kws) → dict
Read dictionary file.
Parameters:
p
(str): path.fmt
(str): format of the file.
Keyword Arguments:
kws
(d): parameters provided to reader function.
Returns:
d
(dict): output dictionary.
function to_dict
to_dict(d, p, **kws)
Save dictionary file.
Parameters:
d
(dict): input dictionary.p
(str): path.
Keyword Arguments:
kws
(d): parameters provided to export function.
Returns:
p
(str): path.
function post_read_table
post_read_table(
df1: DataFrame,
clean: bool,
tables: list,
verbose: bool = True,
**kws_clean: dict
)
Post-reading a table.
Parameters:
df1
(DataFrame): input dataframe.clean
(bool): whether to applyclean
function. tables ()verbose
(bool): verbose.
Keyword parameters:
kws_clean
(dict): paramters provided to theclean
function.
Returns:
df
(DataFrame): output dataframe.
function read_table
read_table(
p: str,
ext: str = None,
clean: bool = True,
filterby_time=None,
params: dict = {},
kws_clean: dict = {},
kws_cloud: dict = {},
check_paths: bool = True,
tables: int = 1,
test: bool = False,
verbose: bool = True,
engine: str = 'fastparquet',
**kws_read_tables: dict
)
Table/s reader.
Parameters:
- <b>`p`</b> (str): path of the file. It could be an input for `read_ps`, which would include strings with wildcards, list etc.
- <b>`ext`</b> (str): extension of the file (default: None meaning infered from the path).
- <b>`clean=(default`</b>: True). filterby_time=None).
- <b>`check_paths`</b> (bool): read files in the path column (default:True).
- <b>`test`</b> (bool): testing (default:False).
- <b>`params`</b>: parameters provided to the 'pd.read_csv' (default:{}). For example
- <b>`params['columns']`</b>: columns to read.
- <b>`kws_clean`</b>: parameters provided to 'rd.clean' (default:{}).
- <b>`kws_cloud`</b>: parameters for reading files from google-drive (default:{}).
- <b>`tables`</b>: how many tables to be read (default:1).
- <b>`verbose`</b>: verbose (default:True).
Keyword parameters:
- kws_read_tables
(dict): parameters provided to read_tables
function. For example:
- to_col={colindex
: replaces_index}
Returns:
- <b>`df`</b> (DataFrame): output dataframe.
Examples:
-
For reading specific columns only set
params=dict(columns=list)
. -
For reading many files, convert paths to a column with corresponding values:
to_col={colindex: replaces_index}
- Reading a vcf file. p='*.vcf|vcf.gz' read_table(p, params_read_csv=dict( #compression='gzip', sep=' ',comment='#',header=None, names=replace_many(get_header(path,comment='#',lineno=-1),['#',' '],'').split(' ')) )
function get_logp
get_logp(ps: list) → str
Infer the path of the log file.
Parameters:
ps
(list): list of paths.
Returns:
p
(str): path of the output file.
function apply_on_paths
apply_on_paths(
ps: list,
func,
replaces_outp: str = None,
to_col: dict = None,
replaces_index=None,
drop_index: bool = True,
colindex: str = 'path',
filter_rows: dict = None,
fast: bool = False,
progress_bar: bool = True,
params: dict = {},
dbug: bool = False,
test1: bool = False,
verbose: bool = True,
kws_read_table: dict = {},
**kws: dict
)
Apply a function on list of files.
Parameters:
ps
(str|list): paths or string to infer paths usingread_ps
.to_col
(dict): convert the paths to a column e.g. {colindex: replaces_index}func
(function): function to be applied on each of the paths.replaces_outp
(dict|function): infer the output path (outp
) by replacing substrings in the input paths (p
).filter_rows
(dict): filter the rows based on dict, usingrd.filter_rows
.fast
(bool): parallel processing (default:False).progress_bar
(bool): show progress bar(default:True).params
(dict): parameters provided to thepd.read_csv
function.dbug
(bool): debug mode on (default:False).test1
(bool): test on one path (default:False).kws_read_table
(dict): parameters provided to theread_table
function (default:{}).replaces_index
(object|dict|list|str): for example, 'basenamenoext' if path to basename.drop_index
(bool): whether to drop the index column e.g.path
(default: True).colindex
(str): the name of the column containing the paths (default: 'path')
Keyword parameters:
kws
(dict): parameters provided to the function.
Example:
- Function: def apply_(p,outd='data/data_analysed',force=False): outp=f"{outd}/{basenamenoext(p)}.pqt' if exists(outp) and not force: return df01=read_table(p) apply_on_paths( ps=glob("data/data_analysed/*"), func=apply_, outd="data/data_analysed/", force=True, fast=False, read_path=True, )
TODOs: Move out of io.
function read_tables
read_tables(
ps: list,
fast: bool = False,
filterby_time=None,
to_dict: bool = False,
params: dict = {},
tables: int = None,
**kws_apply_on_paths: dict
)
Read multiple tables.
Parameters:
ps
(list): list of paths.fast
(bool): parallel processing (default:False)filterby_time
(str): filter by time (default:None)drop_index
(bool): drop index (default:True)to_dict
(bool): output dictionary (default:False)params
(dict): parameters provided to thepd.read_csv
function (default:{})tables
: number of tables (default:None).
Keyword parameters:
kws_apply_on_paths
(dict): parameters provided toapply_on_paths
.
Returns:
df
(DataFrame): output dataframe.
TODOs: Parameter to report the creation dates of the newest and the oldest files.
function to_table
to_table(
df: DataFrame,
p: str,
colgroupby: str = None,
test: bool = False,
**kws
)
Save table.
Parameters:
df
(DataFrame): the input dataframe.p
(str): output path.colgroupby
(str|list): columns to groupby with to save the subsets of the data as separate files.test
(bool): testing on (default:False).
Keyword parameters:
kws
(dict): parameters provided to theto_manytables
function.
Returns:
p
(str): path of the output.
function to_manytables
to_manytables(
df: DataFrame,
p: str,
colgroupby: str,
fmt: str = '',
ignore: bool = False,
kws_get_chunks={},
**kws_to_table
)
Save many table.
Parameters:
df
(DataFrame): the input dataframe.p
(str): output path.colgroupby
(str|list): columns to groupby with to save the subsets of the data as separate files.fmt
(str): if '=' column names in the folder name e.g. col1=True.ignore
(bool): ignore the warnings (default:False).
Keyword parameters:
kws_get_chunks
(dict): parameters provided to theget_chunks
function.
Returns:
p
(str): path of the output.
TODOs:
1. Change in default parameter
:fmt='='
.
function to_table_pqt
to_table_pqt(
df: DataFrame,
p: str,
engine: str = 'fastparquet',
compression: str = 'gzip',
**kws_pqt: dict
) → str
Save a parquet file.
Parameters:
df
(pd.DataFrame): table.p
(str): path.
Keyword parameters: Parameters provided to pd.DataFrame.to_parquet
.
Returns:
function tsv2pqt
tsv2pqt(p: str) → str
Convert tab-separated file to Apache parquet.
Parameters:
p
(str): path of the input.
Returns:
p
(str): path of the output.
function pqt2tsv
pqt2tsv(p: str) → str
Convert Apache parquet file to tab-separated.
Parameters:
p
(str): path of the input.
Returns:
p
(str): path of the output.
function read_excel
read_excel(
p: str,
sheet_name: str = None,
kws_cloud: dict = {},
test: bool = False,
**kws
)
Read excel file
Parameters:
p
(str): path of the file.sheet_name
(str|None): read 1st sheet if None (default:None)kws_cloud
(dict): parameters provided to read the file from the google drive (default:{})test
(bool): if False and sheet_name not provided, return all sheets as a dictionary, else if True, print list of sheets.
Keyword parameters:
kws
: parameters provided to the excel reader.
function to_excel_commented
to_excel_commented(p: str, comments: dict, outp: str = None, author: str = None)
Add comments to the columns of excel file and save.
Args:
p
(str): input path of excel file.comments
(dict): map between column names and comment e.g. description of the column.outp
(str): output path of excel file. Defaults to None.author
(str): author of the comments. Defaults to 'Author'.
TODOs: 1. Increase the limit on comments can be added to number of columns. Currently it is 26 i.e. upto Z1.
function to_excel
to_excel(
sheetname2df: dict,
outp: str,
comments: dict = None,
save_input: bool = False,
author: str = None,
append: bool = False,
adjust_column_width: bool = True,
**kws
)
Save excel file.
Parameters:
sheetname2df
(dict): dictionary mapping the sheetname to the dataframe.outp
(str): output path.append
(bool): append the dataframes (default:False).comments
(dict): map between column names and comment e.g. description of the column.save_input
(bool): additionally save the input tables in text format.
Keyword parameters:
kws
: parameters provided to the excel writer.
function check_chunks
check_chunks(outd, col, plot=True)
Create chunks of the tables.
Parameters:
outd
(str): output directory.col
(str): the column with values that are used for getting the chunks.plot
(bool): plot the chunk sizes (default:True).
Returns:
df3
(DataFrame): output dataframe.
module roux.lib
Global Variables
- df
- set
- str
- sys
- dfs
- text
- io
- dict
function to_class
to_class(cls)
Get the decorator to attach functions.
Parameters:
cls
(class): class object.
Returns:
decorator
(decorator): decorator object.
References:
https
: //gist.github.com/mgarod/09aa9c3d8a52a980bd4d738e52e5b97a
function decorator
decorator(func)
class rd
roux-dataframe
(.rd
) extension.
method __init__
__init__(pandas_obj)
module roux.lib.set
For processing list-like sets.
function union
union(l)
Union of lists.
Parameters:
l
(list): list of lists.
Returns:
l
(list): list.
function union
union(l)
Union of lists.
Parameters:
l
(list): list of lists.
Returns:
l
(list): list.
function intersection
intersection(l)
Intersections of lists.
Parameters:
l
(list): list of lists.
Returns:
l
(list): list.
function intersection
intersection(l)
Intersections of lists.
Parameters:
l
(list): list of lists.
Returns:
l
(list): list.
function nunion
nunion(l)
Count the items in union.
Parameters:
l
(list): list of lists.
Returns:
i
(int): count.
function nintersection
nintersection(l)
Count the items in intersetion.
Parameters:
l
(list): list of lists.
Returns:
i
(int): count.
function check_non_overlaps_with
check_non_overlaps_with(l1: list, l2: list, out_count: bool = False, log=False)
function validate_overlaps_with
validate_overlaps_with(l1, l2)
function assert_overlaps_with
assert_overlaps_with(l1, l2, out_count=False)
function jaccard_index
jaccard_index(l1, l2)
function dropna
dropna(x)
Drop np.nan
items from a list.
Parameters:
x
(list): list.
Returns:
x
(list): list.
function unique
unique(l)
Unique items in a list.
Parameters:
l
(list): input list.
Returns:
l
(list): list.
Notes:
The function can return list of lists if used in
pandas.core.groupby.DataFrameGroupBy.agg
context.
function list2str
list2str(x, ignore=False)
Returns string if single item in a list.
Parameters:
x
(list): list
Returns:
s
(str): string.
function unique_str
unique_str(l, **kws)
Unique single item from a list.
Parameters:
l
(list): input list.
Returns:
l
(list): list.
function nunique
nunique(l, **kws)
Count unique items in a list
Parameters:
l
(list): list
Returns:
i
(int): count.
function flatten
flatten(l)
List of lists to list.
Parameters:
l
(list): input list.
Returns:
l
(list): output list.
function get_alt
get_alt(l1, s)
Get alternate item between two.
Parameters:
l1
(list): list.s
(str): item.
Returns:
s
(str): alternate item.
function intersections
intersections(dn2list, jaccard=False, count=True, fast=False, test=False)
Get intersections between lists.
Parameters:
dn2list
(dist): dictionary mapping to lists.jaccard
(bool): return jaccard indices.count
(bool): return counts.fast
(bool): fast.test
(bool): verbose.
Returns:
df
(DataFrame): output dataframe.
TODOs: 1. feed as an estimator to df.corr()
. 2. faster processing by filling up the symetric half of the adjacency matrix.
function range_overlap
range_overlap(l1, l2)
Overlap between ranges.
Parameters:
l1
(list): start and end integers of one range.l2
(list): start and end integers of other range.
Returns:
l
(list): overlapped range.
function get_windows
get_windows(
a,
size=None,
overlap=None,
windows=None,
overlap_fraction=None,
stretch_last=False,
out_ranges=True
)
Windows/segments from a range.
Parameters:
a
(list): range.size
(int): size of the windows.windows
(int): number of windows.overlap_fraction
(float): overlap fraction.overlap
(int): overlap length.stretch_last
(bool): stretch last window.out_ranges
(bool): whether to output ranges.
Returns:
df1
(DataFrame): output dataframe.
Notes:
- For development, use of
int
providesnp.floor
.
function bools2intervals
bools2intervals(v)
Convert bools to intervals.
Parameters:
v
(list): list of bools.
Returns:
l
(list): intervals.
function list2ranges
list2ranges(l)
function get_pairs
get_pairs(
items: list,
items_with: list = None,
size: int = 2,
with_self: bool = False
) → DataFrame
Creates a dataframe with the paired items.
Parameters:
items
: the list of items to pair. items_with: list of items to pair with. size: size of the combinations. with_self: pair with self or not.
Returns: table with pairs of items.
Notes:
- the ids of the items are sorted e.g. 'a'-'b' not 'b'-'a'. 2. itertools.combinations does not pair self.
module roux.lib.str
For processing strings.
function substitution
substitution(s, i, replaceby)
Substitute character in a string.
Parameters:
s
(string): string.i
(int): location.replaceby
(string): character to substitute with.
Returns:
s
(string): output string.
function substitution
substitution(s, i, replaceby)
Substitute character in a string.
Parameters:
s
(string): string.i
(int): location.replaceby
(string): character to substitute with.
Returns:
s
(string): output string.
function replace_many
replace_many(
s: str,
replaces: dict,
replacewith: str = '',
ignore: bool = False
)
Rename by replacing sub-strings.
Parameters:
s
(str): input string.replaces
(dict|list): from->to format or list containing substrings to remove.replacewith
(str): replace to in casereplaces
is a list.ignore
(bool): if True, not validate the successful replacements.
Returns:
s
(DataFrame): output dataframe.
function replace_many
replace_many(
s: str,
replaces: dict,
replacewith: str = '',
ignore: bool = False
)
Rename by replacing sub-strings.
Parameters:
s
(str): input string.replaces
(dict|list): from->to format or list containing substrings to remove.replacewith
(str): replace to in casereplaces
is a list.ignore
(bool): if True, not validate the successful replacements.
Returns:
s
(DataFrame): output dataframe.
function filter_list
filter_list(l: list, patterns: list, kind='out') → list
Filter a list of strings.
Args:
l
(list): list of strings.patterns
(list): list of regex patterns. patterns are applied after stripping the whitespaces.
Returns: (list) list of filtered strings.
function tuple2str
tuple2str(tup, sep=' ')
Join tuple items.
Parameters:
tup
(tuple|list): input tuple/list.sep
(str): separator between the items.
Returns:
s
(str): output string.
function linebreaker
linebreaker(text, width=None, break_pt=None, sep='\n', **kws)
Insert newline
s within a string.
Parameters:
text
(str): string.width
(int): insertnewline
at this interval.sep
(string): separator to split the sub-strings.
Returns:
s
(string): output string.
References:
1.
textwrap``: https://docs.python.org/3/library/textwrap.html
function findall
findall(s, ss, outends=False, outstrs=False, suffixlen=0)
Find the substrings or their locations in a string.
Parameters:
s
(string): input string.ss
(string): substring.outends
(bool): output end positions.outstrs
(bool): output strings.suffixlen
(int): length of the suffix.
Returns:
l
(list): output list.
function get_marked_substrings
get_marked_substrings(
s,
leftmarker='{',
rightmarker='}',
leftoff=0,
rightoff=0
) → list
Get the substrings flanked with markers from a string.
Parameters:
s
(str): input string.leftmarker
(str): marker on the left.rightmarker
(str): marker on the right.leftoff
(int): offset on the left.rightoff
(int): offset on the right.
Returns:
l
(list): list of substrings.
function get_marked_substrings
get_marked_substrings(
s,
leftmarker='{',
rightmarker='}',
leftoff=0,
rightoff=0
) → list
Get the substrings flanked with markers from a string.
Parameters:
s
(str): input string.leftmarker
(str): marker on the left.rightmarker
(str): marker on the right.leftoff
(int): offset on the left.rightoff
(int): offset on the right.
Returns:
l
(list): list of substrings.
function mark_substrings
mark_substrings(s, ss, leftmarker='(', rightmarker=')') → str
Mark sub-string/s in a string.
Parameters:
s
(str): input string.ss
(str): substring.leftmarker
(str): marker on the left.rightmarker
(str): marker on the right.
Returns:
s
(str): string.
function get_bracket
get_bracket(s, leftmarker='(', righttmarker=')') → str
Get bracketed substrings.
Parameters:
s
(string): string.leftmarker
(str): marker on the left.rightmarker
(str): marker on the right.
Returns:
s
(str): string.
TODOs: 1. Use get_marked_substrings
.
function align
align(
s1: str,
s2: str,
prefix: bool = False,
suffix: bool = False,
common: bool = True
) → list
Align strings.
Parameters:
s1
(str): string #1.s2
(str): string #2.prefix
(str): prefix.suffix
(str): suffix.common
(str): common substring.
Returns:
l
(list): output list.
Notes:
- Code to test: [ get_prefix(source,target,common=False), get_prefix(source,target,common=True), get_suffix(source,target,common=False), get_suffix(source,target,common=True),]
function get_prefix
get_prefix(s1, s2: str = None, common: bool = True, clean: bool = True) → str
Get the prefix of the strings
Parameters:
s1
(str|list): 1st string.s2
(str): 2nd string (default:None).common
(bool): get the common prefix (default:True).clean
(bool): clean the leading and trailing whitespaces (default:True).
Returns:
s
(str): prefix.
function get_suffix
get_suffix(s1, s2: str = None, common: bool = True, clean: bool = True) → str
Get the suffix of the strings
Parameters:
s1
(str|list): 1st string.s2
(str): 2nd string (default:None).common
(bool): get the common prefix (default:True).clean
(bool): clean the leading and trailing whitespaces (default:True).
Returns:
s
(str): prefix.
function get_fix
get_fix(s1: str, s2: str, **kws: dict) → str
Infer common prefix or suffix.
Parameters:
s1
(str): 1st string.s2
(str): 2nd string.
Keyword parameters:
kws
: parameters provided to theget_prefix
andget_suffix
functions.
Returns:
s
(str): prefix or suffix.
function removesuffix
removesuffix(s1: str, suffix: str) → str
Remove suffix.
Paramters: s1 (str): input string. suffix (str): suffix.
Returns:
s1
(str): string without the suffix.
TODOs: 1. Deprecate in py>39 use .removesuffix() instead.
function str2dict
str2dict(
s: str,
reversible: bool = True,
sep: str = ';',
sep_equal: str = '='
) → dict
String to dictionary.
Parameters:
s
(str): string.sep
(str): separator between entries (default:';').sep_equal
(str): separator between the keys and the values (default:'=').
Returns:
d
(dict): dictionary.
References:
1. https
: //stackoverflow.com/a/186873/3521099
function dict2str
dict2str(
d1: dict,
reversible: bool = True,
sep: str = ';',
sep_equal: str = '='
) → str
Dictionary to string.
Parameters:
d
(dict): dictionary.sep
(str): separator between entries (default:';').sep_equal
(str): separator between the keys and the values (default:'=').reversible
(str): use json
Returns:
s
(str): string.
function str2num
str2num(s: str) → float
String to number.
Parameters:
s
(str): string.
Returns:
i
(int): number.
function num2str
num2str(
num: float,
magnitude: bool = False,
coff: float = 10000,
decimals: int = 0
) → str
Number to string.
Parameters:
num
(int): number.magnitude
(bool): use magnitudes (default:False).coff
(int): cutoff (default:10000).decimals
(int): decimal points (default:0).
Returns:
s
(str): string.
TODOs 1. ~ if magnitude else not
function encode
encode(data, short: bool = False, method_short: str = 'sha256', **kws) → str
Encode the data as a string.
Parameters:
data
(str|dict|Series): input data.short
(bool): Outputs short string, compatible with paths but non-reversible. Defaults to False.method_short
(str): method used for encoding when short=True.
Keyword parameters:
kws
: parameters provided to encoding function.
Returns:
s
(string): output string.
function decode
decode(s, out=None, **kws_out)
Decode data from a string.
Parameters:
s
(string): encoded string.out
(str): output format (dict|df).
Keyword parameters:
kws_out
: parameters provided todict2df
.
Returns:
d
(dict|DataFrame): output data.
function to_formula
to_formula(
replaces={' ': 'SPACE', '(': 'LEFTBRACKET', ')': 'RIGHTTBRACKET', '.': 'DOT', ',': 'COMMA', '%': 'PERCENT', "'": 'INVCOMMA', '+': 'PLUS', '-': 'MINUS'},
reverse=False
) → dict
Converts strings to the formula format, compatible with patsy
for example.
module roux.lib.sys
For processing file paths for example.
function basenamenoext
basenamenoext(p)
Basename without the extension.
Args:
p
(str): path.
Returns:
s
(str): output.
function remove_exts
remove_exts(p: str, exts: tuple = None)
Filename without the extension.
Args:
p
(str): path.exts
(tuple): extensions.
Returns:
s
(str): output.
function read_ps
read_ps(ps, test: bool = True, verbose: bool = True) → list
Read a list of paths.
Parameters:
ps
(list|str): list of paths or a string with wildcard/s.test
(bool): testing.verbose
(bool): verbose.
Returns:
ps
(list): list of paths.
function to_path
to_path(s, replacewith='_', verbose=False, coff_len_escape_replacement=100)
Normalise a string to be used as a path of file.
Parameters:
s
(string): input string.replacewith
(str): replace the whitespaces or incompatible characters with.
Returns:
s
(string): output string.
function to_path
to_path(s, replacewith='_', verbose=False, coff_len_escape_replacement=100)
Normalise a string to be used as a path of file.
Parameters:
s
(string): input string.replacewith
(str): replace the whitespaces or incompatible characters with.
Returns:
s
(string): output string.
function makedirs
makedirs(p: str, exist_ok=True, **kws)
Make directories recursively.
Args:
p
(str): path.exist_ok
(bool, optional): no error if the directory exists. Defaults to True.
Returns:
p_
(str): the path of the directory.
function to_output_path
to_output_path(ps, outd=None, outp=None, suffix='')
Infer a single output path for a list of paths.
Parameters:
ps
(list): list of paths.outd
(str): path of the output directory.outp
(str): path of the output file.suffix
(str): suffix of the filename.
Returns:
outp
(str): path of the output file.
function to_output_paths
to_output_paths(
input_paths: list = None,
inputs: list = None,
output_path_base: str = None,
encode_short: bool = True,
replaces_output_path=None,
key_output_path: str = None,
force: bool = False,
verbose: bool = False
) → dict
Infer a output path for each of the paths or inputs.
Parameters:
input_paths (list)
: list of input paths. Defaults to None.inputs (list)
: list of inputs e.g. dictionaries. Defaults to None.output_path_base (str)
: output path with a placeholder '{KEY}' to be replaced. Defaults to None.encode_short
: (bool) : short encoded string, else long encoded string (reversible) is used. Defaults to True.replaces_output_path
: list, dictionary or function to replace the input paths. Defaults to None.key_output_path (str)
: key to be used to incorporate output_path variable among the inputs. Defaults to None.force
(bool): overwrite the outputs. Defaults to False.verbose (bool)
: show verbose. Defaults to False.
Returns: dictionary with the output path mapped to input paths or inputs.
TODOs: 1. Placeholders other than {KEY}.
function get_encoding
get_encoding(p)
Get encoding of a file.
Parameters:
p
(str): file path
Returns:
s
(string): encoding.
function get_all_subpaths
get_all_subpaths(d='.', include_directories=False)
Get all the subpaths.
Args:
d
(str, optional): description. Defaults to '.'.include_directories
(bool, optional): to include the directories. Defaults to False.
Returns:
paths
(list): sub-paths.
function get_env
get_env(env_name: str, return_path: bool = False)
Get the virtual environment as a dictionary.
Args:
env_name
(str): name of the environment.
Returns:
d
(dict): parameters of the virtual environment.
function runbash
runbash(s1, env=None, test=False, **kws)
Run a bash command.
Args:
s1
(str): command.env
(str): environment name.test
(bool, optional): testing. Defaults to False.
Returns:
output
: output of thesubprocess.call
function.
TODOs: 1. logp 2. error ignoring
function runbash_tmp
runbash_tmp(
s1: str,
env: str,
df1=None,
inp='INPUT',
input_type='df',
output_type='path',
tmp_infn='in.txt',
tmp_outfn='out.txt',
outp=None,
force=False,
test=False,
**kws
)
Run a bash command in /tmp
directory.
Args:
s1
(str): command.env
(str): environment name.df1
(DataFrame, optional): input dataframe. Defaults to None.inp
(str, optional): input path. Defaults to 'INPUT'.input_type
(str, optional): input type. Defaults to 'df'.output_type
(str, optional): output type. Defaults to 'path'.tmp_infn
(str, optional): temporary input file. Defaults to 'in.txt'.tmp_outfn
(str, optional): temporary output file.. Defaults to 'out.txt'.outp
(type, optional): output path. Defaults to None.force
(bool, optional): force. Defaults to False.test
(bool, optional): test. Defaults to False.
Returns:
output
: output of thesubprocess.call
function.
function create_symlink
create_symlink(p: str, outp: str, test=False, force=False)
Create symbolic links.
Args:
p
(str): input path.outp
(str): output path.test
(bool, optional): test. Defaults to False.
Returns:
outp
(str): output path.
TODOs:
Use
pathlib``:Path(p).symlink_to(Path(outp))
function input_binary
input_binary(q: str)
Get input in binary format.
Args:
q
(str): question.
Returns:
b
(bool): response.
function is_interactive
is_interactive()
Check if the UI is interactive e.g. jupyter or command line.
function is_interactive_notebook
is_interactive_notebook()
Check if the UI is interactive e.g. jupyter or command line.
Notes:
Reference:
function get_excecution_location
get_excecution_location(depth=1)
Get the location of the function being executed.
Args:
depth
(int, optional): Depth of the location. Defaults to 1.
Returns:
tuple
(tuple): filename and line number.
function get_datetime
get_datetime(outstr: bool = True, fmt='%G%m%dT%H%M%S')
Get the date and time.
Args:
outstr
(bool, optional): string output. Defaults to True.fmt
(str): format of the string.
Returns:
s
: date and time.
function p2time
p2time(filename: str, time_type='m')
Get the creation/modification dates of files.
Args:
filename
(str): filename.time_type
(str, optional): description. Defaults to 'm'.
Returns:
time
(str): time.
function ps2time
ps2time(ps: list, **kws_p2time)
Get the times for a list of files.
Args:
ps
(list): list of paths.
Returns:
ds
(Series): paths mapped to corresponding times.
function get_logger
get_logger(program='program', argv=None, level=None, dp=None)
Get the logging object.
Args:
program
(str, optional): name of the program. Defaults to 'program'.argv
(type, optional): arguments. Defaults to None.level
(type, optional): level of logging. Defaults to None.dp
(type, optional): description. Defaults to None.
function tree
tree(folder_path: str, log=True)
module roux.lib.text
For processing text files.
function get_header
get_header(path: str, comment='#', lineno=None)
Get the header of a file.
Args:
path
(str): path.comment
(str): comment identifier.lineno
(int): line numbers upto.
Returns:
lines
(list): header.
function cat
cat(ps, outp)
Concatenate text files.
Args:
ps
(list): list of paths.outp
(str): output path.
Returns:
outp
(str): output path.
module roux.stat.binary
For processing binary data.
function compare_bools_jaccard
compare_bools_jaccard(x, y)
Compare bools in terms of the jaccard index.
Args:
x
(list): list of bools.y
(list): list of bools.
Returns:
float
: jaccard index.
function compare_bools_jaccard_df
compare_bools_jaccard_df(df: DataFrame) → DataFrame
Pairwise compare bools in terms of the jaccard index.
Args:
df
(DataFrame): dataframe with boolean columns.
Returns:
DataFrame
: matrix with comparisons between the columns.
function classify_bools
classify_bools(l: list) → str
Classify bools.
Args:
l
(list): list of bools
Returns:
str
: classification.
function frac
frac(x: list) → float
Fraction.
Args:
x
(list): list of bools.
Returns:
float
: fraction of True values.
function perc
perc(x: list) → float
Percentage.
Args:
x
(list): list of bools.
Returns:
float
: Percentage of the True values
function get_stats_confusion_matrix
get_stats_confusion_matrix(df_: DataFrame) → DataFrame
Get stats confusion matrix.
Args:
df_
(DataFrame): Confusion matrix.
Returns:
DataFrame
: stats.
function get_cutoff
get_cutoff(
y_true,
y_score,
method,
show_diagonal=True,
show_area=True,
kws_area: dict = {},
show_cutoff=True,
plot_pr=True,
color='k',
returns=['ax'],
ax=None
)
Obtain threshold based on ROC or PR curve.
Returns: Table:
columns
: valuesmethod
: ROC, PRvariable
: threshold (index), TPR, FPR, TP counts, precision, recall values: Plots: AUC ROC, TPR vs TP counts PR Specificity vs TP counts Dictionary: Thresholds from AUC, PR
TODOs: 1. Separate the plotting functions.
module roux.stat.cluster
For clustering data.
function check_clusters
check_clusters(df: DataFrame)
Check clusters.
Args:
df
(DataFrame): dataframe.
function get_clusters
get_clusters(
X: <built-in function array>,
n_clusters: int,
random_state=88,
params={},
test=False
) → dict
Get clusters.
Args:
X
(np.array): vectorn_clusters
(int): intrandom_state
(int, optional): random state. Defaults to 88.params
(dict, optional): parameters for theMiniBatchKMeans
function. Defaults to {}.test
(bool, optional): test. Defaults to False.
Returns: dict:
function get_n_clusters_optimum
get_n_clusters_optimum(df5: DataFrame, test=False) → int
Get n clusters optimum.
Args:
df5
(DataFrame): input dataframe.
test
(bool, optional): test. Defaults to False.
Returns:
int
: knee point.
function plot_silhouette
plot_silhouette(df: DataFrame, n_clusters_optimum=None, ax=None)
Plot silhouette
Args:
df
(DataFrame): input dataframe.
n_clusters_optimum
(int, optional): number of clusters. Defaults to None.
ax
(axes, optional): axes object. Defaults to None.
Returns:
ax
(axes): axes object.
function get_clusters_optimum
get_clusters_optimum(
X: np.array,
n_clusters=range(2, 11),
params_clustering={},
test=False
) → dict
Get optimum clusters.
Args:
X
(np.array): samples to cluster in indexed format.
n_clusters
(range, optional): range of cluster numbers to try. Defaults to range(2,11).
params_clustering
(dict, optional): parameters provided to get_clusters. Defaults to {}.
test
(bool, optional): test. Defaults to False.
Returns:
dict
: optimum clusters.
function get_gmm_params
get_gmm_params(g, x, n_clusters=2, test=False)
Intersection point of the two peak Gaussian mixture Models (GMMs).
Args:
out
(str): 'coff' only, or 'params' for all the parameters.
function get_gmm_intersection
get_gmm_intersection(x, two_pdfs, means, weights, test=False)
function cluster_1d
cluster_1d(
ds: Series,
n_clusters: int,
clf_type='gmm',
random_state=1,
test=False,
returns=['coff'],
**kws_clf
) → dict
Cluster 1D data.
Args:
ds
(Series): input series.
n_clusters
(int): number of clusters.
clf_type
(str, optional): type of classifier. Defaults to 'gmm'.
random_state
(int, optional): random state. Defaults to 1.
test
(bool, optional): test. Defaults to False.
returns
(list, optional): return format. Defaults to ['coff'].
ax
(axes, optional): axes object. Defaults to None.
Raises:
ValueError
: clf_type
Returns:
dict
: description
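For intuition, the cutoff between two 1D Gaussian components sits where their weighted densities cross. A scipy-based sketch with hypothetical component parameters (not roux's implementation, which fits the mixture first):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 2-component GMM parameters: weights, means, standard deviations
w, mu, sd = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]
grid = np.linspace(mu[0], mu[1], 2001)
p0 = w[0] * norm.pdf(grid, mu[0], sd[0])
p1 = w[1] * norm.pdf(grid, mu[1], sd[1])
# The cutoff is the grid point where the two weighted densities are closest
cutoff = grid[np.argmin(np.abs(p0 - p1))]
print(cutoff)  # 3.0: symmetric components cross midway
```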
function get_pos_umap
get_pos_umap(df1, spread=100, test=False, k='', **kws) → DataFrame
Get positions of the umap points.
Args:
df1
(DataFrame): input dataframe.
spread
(int, optional): spread extent. Defaults to 100.
test
(bool, optional): test. Defaults to False.
k
(str, optional): number of clusters. Defaults to ''.
Returns:
DataFrame
: output dataframe.
module roux.stat.compare
For comparison related stats.
function get_comparison
get_comparison(
df1: DataFrame,
d1: dict = None,
coff_p: float = 0.05,
between_ys: bool = False,
verbose: bool = False,
**kws
)
Compare the x and y columns.
Parameters:
df1
(pd.DataFrame): input table.
d1
(dict): columns dict, output of get_cols_x_for_comparison.
between_ys
(bool): compare y's.
Notes:
Column information: d1={'cols_index': ['id'], 'cols_x': {'cont': [], 'desc': []}, 'cols_y': {'cont': [], 'desc': []}}
Comparison types:
1. continuous vs continuous -> correlation
2. discrete vs continuous -> difference
3. discrete vs discrete -> Fisher's exact or chi-square test
function compare_strings
compare_strings(l0: list, l1: list, cutoff: float = 0.5) → DataFrame
Compare two lists of strings.
Parameters:
l0
(list): list of strings.
l1
(list): list of strings to compare with.
cutoff
(float): threshold to filter the comparisons.
Returns: table with the similarity scores.
TODOs: 1. Add option for semantic similarity.
module roux.stat.corr
For correlation stats.
function resampled
resampled(
x: np.array,
y: np.array,
method_fun: object,
method_kws: dict = {},
ci_type: str = 'max',
cv: int = 5,
random_state: int = 1,
verbose: bool = False
) → tuple
Get correlations after resampling.
Args:
x
(np.array): x vector.
y
(np.array): y vector.
method_fun
(object): method function.
method_kws
(dict, optional): parameters provided to the method function. Defaults to {}.
ci_type
(str, optional): confidence interval type. Defaults to 'max'.
cv
(int, optional): number of resamples. Defaults to 5.
random_state
(int, optional): random state. Defaults to 1.
verbose
(bool): verbose.
Returns:
tuple
: mean correlation coefficient, CI and CI type.
function get_corr
get_corr(
x: str,
y: str,
method: str,
df: DataFrame = None,
method_kws: dict = {},
pval: bool = True,
preprocess: bool = True,
n_min=10,
preprocess_kws: dict = {},
resample: bool = False,
cv=5,
resample_kws: dict = {},
verbose: bool = False,
test: bool = False
) → dict
Correlation between vectors. A unifying wrapper around scipy's functions to calculate correlations and distances. Allows application of resampling on those functions.
Usage: 1. Linear table with paired values. For a matrix, use pd.DataFrame.corr instead.
Args:
x
(str): x column name or a vector.
y
(str): y column name or a vector.
method
(str): method name.
df
(pd.DataFrame): input table.
pval
(bool): calculate p-value.
preprocess
(bool): preprocess the input.
preprocess_kws
(dict): parameters provided to the pre-processing function i.e. _pre.
resample
(bool, optional): resampling. Defaults to False.
resample_kws
(dict): parameters provided to the resampling function i.e. resample.
verbose
(bool): verbose.
Returns:
res
(dict): a dictionary containing results.
Notes:
The res dictionary contains the following values:
method: method name
r: correlation coefficient or distance
p: p-value of the correlation
n: sample size
rr: resampled average 'r'
ci: CI
ci_type: CI type
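Underneath, such a wrapper reduces to a scipy call. A minimal sketch of the core computation (the dict keys mirror the documented res, simplified):

```python
import numpy as np
from scipy.stats import spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
# Spearman rank correlation and its p-value
r, p = spearmanr(x, y)
res = {'method': 'spearman', 'r': r, 'p': p, 'n': len(x)}
print(res['r'])  # 0.8
```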
function get_corrs
get_corrs(
data: DataFrame,
method: str,
cols: list = None,
cols_with: list = None,
coff_inflation_min: float = None,
get_pairs_kws={},
fast: bool = False,
test: bool = False,
verbose: bool = False,
**kws_get_corr
) → DataFrame
Correlate many columns of a dataframe.
Parameters:
data
(DataFrame): input dataframe.
method
(str): method of correlation, spearman or pearson.
cols
(list): columns.
cols_with
(list): columns to correlate with i.e. variable2.
fast
(bool): use parallel-processing if True.
Keyword arguments:
kws_get_corr
: parameters provided to the get_corr function.
Returns:
DataFrame
: output dataframe.
Notes:
In the fast mode (fast=True), to set the number of processes, run the following before executing the get_corrs command:
from pandarallel import pandarallel
pandarallel.initialize(nb_workers={}, progress_bar=True, use_memory_fs=False)
function check_collinearity
check_collinearity(
df1: DataFrame,
threshold: float = 0.7,
colvalue: str = 'r',
cols_variable: list = ['variable1', 'variable2'],
coff_pval: float = 0.05,
method: str = 'spearman',
coff_inflation_min: int = 50
) → Series
Check collinearity.
Args:
df1
(DataFrame): input dataframe.
threshold
(float): minimum threshold for the collinearity.
Returns:
Series
: minimum correlation among each correlated subnetwork of columns.
function pairwise_chi2
pairwise_chi2(df1: DataFrame, cols_values: list) → DataFrame
Pairwise chi2 test.
Args:
df1
(DataFrame): input dataframe.
cols_values
(list): list of columns.
Returns:
DataFrame
: output dataframe.
TODOs: 0. Use lib.set.get_pairs to get the combinations.
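A pairwise chi-square test ultimately rests on scipy's contingency-table test. A minimal sketch for one pair of categorical columns, with toy counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# A 2x2 cross-tabulation of two categorical columns (toy counts)
table = np.array([[10, 20],
                  [20, 10]])
stat, p, dof, expected = chi2_contingency(table)
print(dof)  # 1
```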
module roux.stat.diff
For difference related stats.
function compare_classes
compare_classes(x, y, method=None)
Compare classes
function compare_classes_many
compare_classes_many(df1: DataFrame, cols_y: list, cols_x: list) → DataFrame
function get_pval
get_pval(
df: DataFrame,
colvalue='value',
colsubset='subset',
colvalue_bool=False,
colindex=None,
subsets=None,
test=False,
fun=None
) → tuple
Get p-value.
Args:
df
(DataFrame): input dataframe.
colvalue
(str, optional): column with values. Defaults to 'value'.
colsubset
(str, optional): column with subsets. Defaults to 'subset'.
colvalue_bool
(bool, optional): whether the values are boolean. Defaults to False.
colindex
(str, optional): column with the index. Defaults to None.
subsets
(list, optional): subset types. Defaults to None.
test
(bool, optional): test. Defaults to False.
fun
(function, optional): function. Defaults to None.
Raises:
ArgumentError
: colvalue or colsubset not found in df.
ValueError
: need only 2 subsets.
Returns:
tuple
: stat,p-value
function get_stat
get_stat(
df1: DataFrame,
colsubset: str,
colvalue: str,
colindex: str,
subsets=None,
cols_subsets=['subset1', 'subset2'],
df2=None,
stats=[np.mean, np.median, np.var, len],
coff_samples_min=None,
verb=False,
**kws
) → DataFrame
Get statistics.
Args:
df1
(DataFrame): input dataframe.
colvalue
(str, optional): column with values. Defaults to 'value'.
colsubset
(str, optional): column with subsets. Defaults to 'subset'.
colindex
(str, optional): column with the index. Defaults to None.
subsets
(list, optional): subset types. Defaults to None.
cols_subsets
(list, optional): columns with subsets. Defaults to ['subset1', 'subset2'].
df2
(DataFrame, optional): second dataframe. Defaults to None.
stats
(list, optional): summary statistics. Defaults to [np.mean,np.median,np.var]+[len].
coff_samples_min
(int, optional): minimum sample size required. Defaults to None.
verb
(bool, optional): verbose. Defaults to False.
Keyword Arguments:
kws
: parameters provided to the get_pval function.
Raises:
ArgumentError
: colvalue or colsubset not found in df.
ValueError
: len(subsets)<2
Returns:
DataFrame
: output dataframe.
TODOs: 1. Rename to the more specific get_diff; also other get_stat*/get_pval* functions.
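The summary statistics that get_stat computes per subset reduce to a pandas groupby-aggregate. A minimal sketch on toy data (using 'count' in place of len):

```python
import pandas as pd

df1 = pd.DataFrame({'subset': ['a'] * 3 + ['b'] * 3,
                    'value': [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]})
# Mirror the default stats list [np.mean, np.median, np.var] + [len]
stats = df1.groupby('subset')['value'].agg(['mean', 'median', 'var', 'count'])
print(stats.loc['a', 'mean'], stats.loc['b', 'mean'])  # 2.0 8.0
```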
function get_stats
get_stats(
df1: DataFrame,
colsubset: str,
cols_value: list,
colindex: str,
subsets=None,
df2=None,
cols_subsets=['subset1', 'subset2'],
stats=[np.mean, np.median, np.var, len],
axis=0,
test=False,
**kws
) → DataFrame
Get statistics by iterating over columns with values.
Args:
df1
(DataFrame): input dataframe.
colsubset
(str, optional): column with subsets.
cols_value
(list): list of columns with values.
colindex
(str, optional): column with the index.
subsets
(list, optional): subset types. Defaults to None.
df2
(DataFrame, optional): second dataframe, e.g. pd.DataFrame({"subset1":['test'],"subset2":['reference']}). Defaults to None.
cols_subsets
(list, optional): columns with subsets. Defaults to ['subset1', 'subset2'].
stats
(list, optional): summary statistics. Defaults to [np.mean,np.median,np.var]+[len].
axis
(int, optional): 1 if different tests else use 0. Defaults to 0.
Keyword Arguments:
kws
: parameters provided to the get_pval function.
Raises:
ArgumentError
: colvalue or colsubset not found in df.
ValueError
: len(subsets)<2
Returns:
DataFrame
: output dataframe.
TODOs: 1. No column prefix if len(cols_value)==1.
function get_significant_changes
get_significant_changes(
df1: DataFrame,
coff_p=0.025,
coff_q=0.1,
alpha=None,
changeby='mean',
value_aggs=['mean', 'median']
) → DataFrame
Get significant changes.
Args:
df1
(DataFrame): input dataframe.
coff_p
(float, optional): cutoff on p-value. Defaults to 0.025.
coff_q
(float, optional): cutoff on q-value. Defaults to 0.1.
alpha
(float, optional): alias for coff_p. Defaults to None.
changeby
(str, optional): "" to check for change by both mean and median. Defaults to 'mean'.
value_aggs
(list, optional): values to aggregate. Defaults to ['mean','median'].
Returns:
DataFrame
: output dataframe.
function apply_get_significant_changes
apply_get_significant_changes(
df1: DataFrame,
cols_value: list,
cols_groupby: list,
cols_grouped: list,
fast=False,
**kws
) → DataFrame
Apply on dataframe to get significant changes.
Args:
df1
(DataFrame): input dataframe.
cols_value
(list): columns with values.
cols_groupby
(list): columns with groups.
Returns:
DataFrame
: output dataframe.
function get_stats_groupby
get_stats_groupby(
df1: DataFrame,
cols_group: list,
coff_p: float = 0.05,
coff_q: float = 0.1,
alpha=None,
fast=False,
**kws
) → DataFrame
Iterate over groups, to get the differences.
Args:
df1
(DataFrame): input dataframe.
cols_group
(list): columns to iterate over.
coff_p
(float, optional): cutoff on p-value. Defaults to 0.05.
coff_q
(float, optional): cutoff on q-value. Defaults to 0.1.
alpha
(float, optional): alias for coff_p. Defaults to None.
fast
(bool, optional): parallel processing. Defaults to False.
Returns:
DataFrame
: output dataframe.
function get_diff
get_diff(
df1: DataFrame,
cols_x: list,
cols_y: list,
cols_index: list,
cols_group: list,
coff_p: float = None,
test: bool = False,
**kws
) → DataFrame
Wrapper around get_stats_groupby.
Keyword parameters: cols=['variable x','variable y'], coff_p=0.05, coff_q=0.01, colindex=['id'].
function binby_pvalue_coffs
binby_pvalue_coffs(
df1: DataFrame,
coffs=[0.01, 0.05, 0.1],
color=False,
testn='MWU test, FDR corrected',
colindex='genes id',
colgroup='tissue',
preffix='',
colns=None,
palette=None
) → tuple
Bin data by pvalue cutoffs.
Args:
df1
(DataFrame): input dataframe.
coffs
(list, optional): cut-offs. Defaults to [0.01,0.05,0.1].
color
(bool, optional): color assignment. Defaults to False.
testn
(str, optional): test name. Defaults to 'MWU test, FDR corrected'.
colindex
(str, optional): column with the index. Defaults to 'genes id'.
colgroup
(str, optional): column with the groups. Defaults to 'tissue'.
preffix
(str, optional): prefix. Defaults to ''.
colns
(list, optional): columns not counted. Defaults to None.
palette
(list, optional): color palette. Defaults to None.
Returns:
tuple
: output.
Notes:
- To be deprecated in the favor of the functions used for enrichment analysis for example.
module roux.stat.io
For input/output of stats.
function perc_label
perc_label(a, b=None, bracket=True)
function pval2annot
pval2annot(
pval: float,
alternative: str = None,
alpha: float = 0.05,
fmt: str = '*',
power: bool = True,
linebreak: bool = False,
replace_prefix: str = None
)
P/Q-value to annotation.
Parameters:
fmt
(str): *|<|'num'
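The star format maps a p-value to significance annotations. A sketch with the conventional cutoffs (a hypothetical helper, not roux's pval2annot itself, whose thresholds may differ):

```python
def pval_to_stars(pval, alpha=0.05):
    # Map a p-value to star annotations; 'ns' = not significant
    if pval is None or pval > alpha:
        return 'ns'
    for stars, cutoff in [('***', 1e-3), ('**', 1e-2), ('*', alpha)]:
        if pval <= cutoff:
            return stars

print(pval_to_stars(0.0005), pval_to_stars(0.02), pval_to_stars(0.2))  # *** * ns
```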
module roux.stat
Global Variables
- binary
- io
module roux.stat.network
For network related stats.
function get_subgraphs
get_subgraphs(df1: DataFrame, source: str, target: str) → DataFrame
Subgraphs from the edge list.
Args:
df1
(pd.DataFrame): input dataframe containing the edge-list.
source
(str): source node.
target
(str): target node.
Returns:
pd.DataFrame
: output.
module roux.stat.norm
For normalisation.
function norm_by_quantile
norm_by_quantile(X: np.array) → np.array
Quantile normalize the columns of X.
Parameters:
X
: 2D array of float, shape (M, N). The input data, with M rows (genes/features) and N columns (samples).
Returns:
Xn
: 2D array of float, shape (M, N). The normalized data.
Notes:
Faster processing (~5 times compared to other functions tested) because of the use of numpy arrays.
TODOs: Use sklearn.preprocessing.QuantileTransformer with the output_distribution parameter, allowing rescaling back to the same distribution kind.
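The numpy-array approach can be sketched directly: rank each column, then replace each value by the mean of its rank across columns. This is an illustration of the technique, not roux's implementation (ties here are broken by order, not averaged):

```python
import numpy as np

def quantile_normalize(X):
    # Rank each column, then assign the mean of each rank across columns
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    rank_means = np.sort(X, axis=0).mean(axis=1)
    return rank_means[ranks]

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
Xn = quantile_normalize(X)
```

After normalization every column shares the same value distribution (the rank means).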
function norm_by_gaussian_kde
norm_by_gaussian_kde(
values: np.array
) → np.array
Normalise matrix by gaussian KDE.
Args:
values
(np.array): input matrix.
Returns:
np.array
: output matrix.
References:
https://github.com/saezlab/protein_attenuation/blob/6c1e81af37d72ef09835ee287f63b000c7c6663c/src/protein_attenuation/utils.py
function zscore
zscore(df: DataFrame, cols: list = None) → DataFrame
Z-score.
Args:
df
(pd.DataFrame): input table.
Returns:
pd.DataFrame
: output table.
TODOs: 1. Use scipy or sklearn's zscore because of its additional options: from scipy.stats import zscore; df.apply(zscore)
function zscore_robust
zscore_robust(a: np.array) → np.array
Robust Z-score.
Args:
a
(np.array): input data.
Returns:
np.array
: output.
Example:
t = sc.stats.norm.rvs(size=100, scale=1, random_state=123456)
plt.hist(t, bins=40)
plt.hist(apply_zscore_robust(t), bins=40)
print(np.median(t), np.median(apply_zscore_robust(t)))
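A common robust z-score centers by the median and scales by the MAD; a sketch under that assumed definition (roux's exact scaling may differ):

```python
import numpy as np

def zscore_robust_sketch(a):
    # Median-centered, MAD-scaled z-score; 1.4826 makes the MAD
    # consistent with sigma for normally distributed data
    a = np.asarray(a, dtype=float)
    med = np.median(a)
    mad = np.median(np.abs(a - med))
    return (a - med) / (1.4826 * mad)

z = zscore_robust_sketch([1.0, 2.0, 3.0, 4.0, 5.0])
print(z[2])  # 0.0: the median maps to zero
```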
function norm_covariance_PCA
norm_covariance_PCA(
X: np.array,
use_svd: bool = True,
use_sklearn: bool = True,
rescale_centered: bool = True,
random_state: int = 0,
test: bool = False,
verbose: bool = False
) → np.array
Covariance normalization by PCA whitening.
Args:
X
(np.array): input array.
use_svd
(bool, optional): use the SVD method. Defaults to True.
use_sklearn
(bool, optional): use sklearn for the SVD method. Defaults to True.
rescale_centered
(bool, optional): rescale to centered input. Defaults to True.
random_state
(int, optional): random state. Defaults to 0.
test
(bool, optional): test mode. Defaults to False.
verbose
(bool, optional): verbose. Defaults to False.
Returns:
np.array
: transformed data.
module roux.stat.paired
For paired stats.
function get_ratio_sorted
get_ratio_sorted(a: float, b: float, increase=True) → float
Get ratio sorted.
Args:
a
(float): value #1.
b
(float): value #2.
increase
(bool, optional): check for increase. Defaults to True.
Returns:
float
: output.
function diff
diff(a: float, b: float, absolute=True) → float
Get difference
Args:
a
(float): value #1.
b
(float): value #2.
absolute
(bool, optional): get absolute difference. Defaults to True.
Returns:
float
: output.
function get_diff_sorted
get_diff_sorted(a: float, b: float) → float
Difference sorted/absolute.
Args:
a
(float): value #1.
b
(float): value #2.
Returns:
float
: output.
function balance
balance(a: float, b: float, absolute=True) → float
Balance.
Args:
a
(float): value #1.
b
(float): value #2.
absolute
(bool, optional): absolute difference. Defaults to True.
Returns:
float
: output.
function get_paired_sets_stats
get_paired_sets_stats(l1: list, l2: list, test: bool = False) → list
Paired stats comparing two sets.
Args:
l1
(list): set #1.
l2
(list): set #2.
test
(bool): test mode. Defaults to False.
Returns:
list
: tuple (overlap, intersection, union, ratio).
function get_stats_paired
get_stats_paired(
df1: DataFrame,
cols: list,
input_logscale: bool,
prefix: str = None,
drop_cols: bool = False,
unidirectional_stats: list = ['min', 'max'],
fast: bool = False
) → DataFrame
Paired stats, row-wise.
Args:
df1
(pd.DataFrame): input data.
cols
(list): columns.
input_logscale
(bool): if the input data is log-scaled.
prefix
(str, optional): prefix of the output column/s. Defaults to None.
drop_cols
(bool, optional): drop these columns. Defaults to False.
unidirectional_stats
(list, optional): column-wise stats. Defaults to ['min','max'].
fast
(bool, optional): parallel processing. Defaults to False.
Returns:
pd.DataFrame
: output dataframe.
function get_stats_paired_agg
get_stats_paired_agg(
x: np.array,
y: np.array,
ignore: bool = False,
verb: bool = True
) → Series
Paired stats aggregated, for example, to classify 2D distributions.
Args:
x
(np.array): x vector.
y
(np.array): y vector.
ignore
(bool, optional): suppress warnings. Defaults to False.
verb
(bool, optional): verbose. Defaults to True.
Returns:
pd.Series
: output.
function classify_sharing
classify_sharing(
df1: DataFrame,
column_value: str,
bins: list = [0, 25, 75, 100],
labels: list = ['low', 'medium', 'high'],
prefix: str = '',
verbose: bool = False
) → DataFrame
Classify sharing % calculated from Jaccard index.
Parameters:
df1
(pd.DataFrame): input table.
column_value
(str): column with values.
bins
(list): bins. Defaults to [0,25,75,100].
labels
(list): bin labels. Defaults to ['low','medium','high'].
prefix
(str): prefix of the columns.
verbose
(bool): verbose. Defaults to False.
module roux.stat.preprocess
For classification.
function dropna_matrix
dropna_matrix(
df1,
coff_cols_min_perc_na=5,
coff_rows_min_perc_na=5,
test=False,
verbose=False
)
function drop_low_complexity
drop_low_complexity(
df1: DataFrame,
min_nunique: int,
max_inflation: int,
max_nunique: int = None,
cols: list = None,
cols_keep: list = [],
test: bool = False,
verbose: bool = False
) → DataFrame
Remove low-complexity columns from the data.
Args:
df1
(pd.DataFrame): input data.
min_nunique
(int): minimum unique values.
max_inflation
(int): maximum over-representation of the values.
cols
(list, optional): columns. Defaults to None.
cols_keep
(list, optional): columns to keep. Defaults to [].
test
(bool, optional): test mode. Defaults to False.
Returns:
pd.DataFrame
: output data.
function get_cols_x_for_comparison
get_cols_x_for_comparison(
df1: DataFrame,
cols_y: list,
cols_index: list,
cols_drop: list = [],
cols_dropby_patterns: list = [],
dropby_low_complexity: bool = True,
min_nunique: int = 5,
max_inflation: int = 50,
dropby_collinearity: bool = True,
coff_rs: float = 0.7,
dropby_variance_inflation: bool = True,
verbose: bool = False,
test: bool = False
) → dict
Identify X columns.
Parameters:
df1
(pd.DataFrame): input table.
cols_y
(list): y columns.
function to_preprocessed_data
to_preprocessed_data(
df1: DataFrame,
columns: dict,
fill_missing_desc_value: bool = False,
fill_missing_cont_value: bool = False,
normby_zscore: bool = False,
verbose: bool = False,
test: bool = False
) → DataFrame
Preprocess data.
function to_filteredby_samples
to_filteredby_samples(
df1: DataFrame,
colindex: str,
colsample: str,
coff_samples_min: int,
colsubset: str,
coff_subsets_min: int = 2
) → DataFrame
Filter table before calculating differences. (1) Retain minimum number of samples per item representing a subset and (2) Retain minimum number of subsets per item.
Parameters:
df1
(pd.DataFrame): input table.
colindex
(str): column containing items.
colsample
(str): column containing samples.
coff_samples_min
(int): minimum number of samples.
colsubset
(str): column containing subsets.
coff_subsets_min
(int): minimum number of subsets. Defaults to 2.
Returns: pd.DataFrame
Examples:
Parameters: colindex='genes id', colsample='sample id', coff_samples_min=3, colsubset='pLOF or WT', coff_subsets_min=2
function get_cvsplits
get_cvsplits(
X: np.array,
y: np.array,
cv: int = 5,
random_state: int = None,
outtest: bool = True
) → dict
Get cross-validation splits. A friendly wrapper around sklearn.model_selection.KFold.
Args:
X
(np.array): X matrix.
y
(np.array): y vector.
cv
(int, optional): number of cross-validation folds. Defaults to 5.
random_state
(int, optional): random state. Defaults to None.
outtest
(bool, optional): output test data. Defaults to True.
Returns:
dict
: output.
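The mechanics of such splits can be sketched in plain numpy, shuffling indices and holding out each fold once. An illustration of the idea, not roux's KFold-backed implementation (key names are assumptions):

```python
import numpy as np

def cvsplits_sketch(X, y, cv=5, random_state=0):
    # Shuffle indices, split into cv folds, use each fold once as test
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, cv)
    out = {}
    for i, test in enumerate(folds):
        train = np.setdiff1d(idx, test)
        out[i] = {'X train': X[train], 'y train': y[train],
                  'X test': X[test], 'y test': y[test]}
    return out

splits = cvsplits_sketch(np.arange(10), np.arange(10))
print(len(splits))  # 5
```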
module roux.stat.sets
For set related stats.
function get_overlap
get_overlap(
items_set: list,
items_test: list,
output_format: str = 'list'
) → list
Get overlapping items as a string.
Args:
items_set
(list): items in the reference set.
items_test
(list): items to test.
output_format
(str, optional): format of the output. Defaults to 'list'.
Raises:
ValueError
: output_format can be list or str.
function get_overlap_size
get_overlap_size(
items_set: list,
items_test: list,
fraction: bool = False,
perc: bool = False,
by: str = None
) → float
Percentage Jaccard index.
Args:
items_set
(list): items in the reference set.
items_test
(list): items to test.
fraction
(bool, optional): output fraction. Defaults to False.
perc
(bool, optional): output percentage. Defaults to False.
by
(str, optional): fraction by. Defaults to None.
Returns:
float
: overlap size.
function get_item_set_size_by_background
get_item_set_size_by_background(items_set: list, background: int) → float
Item set size by background
Args:
items_set
(list): items in the reference set.
background
(int): background size.
Returns:
float
: Item set size by background
Notes:
Denominator of the fold change.
function get_fold_change
get_fold_change(items_set: list, items_test: list, background: int) → float
Get fold change.
Args:
items_set
(list): items in the reference set.
items_test
(list): items to test.
background
(int): background size.
Returns:
float
: fold change
Notes:
fc = (intersection/(test items))/((items in the item set)/background)
function get_hypergeom_pval
get_hypergeom_pval(items_set: list, items_test: list, background: int) → float
Calculate hypergeometric P-value.
Args:
items_set
(list): items in the reference set.
items_test
(list): items to test.
background
(int): background size.
Returns:
float
: hypergeometric P-value
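The hypergeometric P-value asks how surprising the observed overlap is when drawing the test items from the background without replacement; scipy's survival function gives the upper tail. A sketch with toy sets:

```python
from scipy.stats import hypergeom

items_set = set(range(30))        # 30 reference items
items_test = set(range(20, 40))   # 20 test items; 10 overlap
background = 100
k = len(items_set & items_test)
# P(overlap >= k) when drawing len(items_test) items from the background
pval = hypergeom.sf(k - 1, background, len(items_set), len(items_test))
```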
function get_contigency_table
get_contigency_table(items_set: list, items_test: list, background: int) → list
Get a contingency table required for the Fisher's test.
Args:
items_set
(list): items in the reference set.
items_test
(list): items to test.
background
(int): background size.
Returns:
list
: contingency table
Notes:
Contingency table layout (rows: within the test items; columns: within the item (/reference) set):
[[intersection, test-only], [set-only, background - size of union]]
function get_odds_ratio
get_odds_ratio(items_set: list, items_test: list, background: int) → float
Calculate Odds ratio and P-values using Fisher's exact test.
Args:
items_set
(list): items in the reference set.
items_test
(list): items to test.
background
(int): background size.
Returns:
float
: Odds ratio
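Given the contingency table, scipy's Fisher's exact test yields both the odds ratio and its P-value. A sketch with toy counts (the table layout mirrors the notes above for get_contigency_table):

```python
from scipy.stats import fisher_exact

# Contingency table: [[in set & in test, in test only],
#                     [in set only, in neither]] (toy counts)
table = [[10, 10],
         [20, 60]]
oddsratio, pval = fisher_exact(table)
print(oddsratio)  # 3.0 = (10*60)/(10*20)
```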
function get_enrichment
get_enrichment(
df1: DataFrame,
df2: DataFrame,
colid: str,
colset: str,
background: int,
coltest: str = None,
verbose: bool = False
) → DataFrame
Calculate the enrichments.
Args:
df1
(pd.DataFrame): table containing items to test.
df2
(pd.DataFrame): table containing reference sets and items.
colid
(str): column with IDs of the items.
colset
(str): column with the sets.
coltest
(str): column with the tests.
background
(int): background size.
verbose
(bool): verbose.
Returns:
pd.DataFrame
: output table
module roux.stat.solve
For solving equations.
function get_intersection_locations
get_intersection_locations(
y1: np.array,
y2: np.array,
test: bool = False,
x: np.array = None
) → list
Get co-ordinates of the intersection (x[idx]).
Args:
y1
(np.array): vector.
y2
(np.array): vector.
test
(bool, optional): test mode. Defaults to False.
x
(np.array, optional): vector. Defaults to None.
Returns:
list
: output.
module roux.stat.transform
For transformations.
function plog
plog(x, p: float, base: int)
Pseudo-log.
Args:
x
(float|np.array): input.
p
(float): pseudo-count.
base
(int): base of the log.
Returns: output.
function anti_plog
anti_plog(x, p: float, base: int)
Anti-pseudo-log.
Args:
x
(float|np.array): input.
p
(float): pseudo-count.
base
(int): base of the log.
Returns: output.
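These two transformations are inverses of each other: add the pseudo-count, take the log, and reverse. A minimal sketch:

```python
import numpy as np

def plog(x, p, base):
    # Pseudo-log: add pseudo-count p, then take the log to the given base
    return np.log(x + p) / np.log(base)

def anti_plog(x, p, base):
    # Inverse of plog: exponentiate, then subtract the pseudo-count
    return base ** x - p

print(anti_plog(plog(5.0, 1.0, 2), 1.0, 2))  # 5.0: round-trips exactly
```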
function log_pval
log_pval(
x,
errors: str = 'raise',
replace_zero_with: float = None,
p_min: float = None
)
Transform p-values to Log10.
Parameters:
x: input.
errors (str): Defaults to 'raise', else replace (in case of visualization only).
p_min (float): replace zeros with this value. Note: to be used for visualization only.
Returns: output.
function get_q
get_q(ds1: Series, col: str = None, verb: bool = True, test_coff: float = 0.1)
To FDR corrected P-value.
function glog
glog(x: float, l=2)
Generalised logarithm.
Args:
x
(float): input.
l
(int, optional): pseudo-count. Defaults to 2.
Returns:
float
: output.
function rescale
rescale(
a: np.array,
range1: tuple = None,
range2: tuple = [0, 1]
) → np.array
Rescale within a new range.
Args:
a
(np.array): input vector.
range1
(tuple, optional): existing range. Defaults to None.
range2
(tuple, optional): new range. Defaults to [0,1].
Returns:
np.array
: output.
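Rescaling is a linear map from one range onto another. A minimal sketch of the idea (not roux's implementation):

```python
import numpy as np

def rescale_sketch(a, range1=None, range2=(0.0, 1.0)):
    # Map values linearly from range1 (default: observed min/max) to range2
    a = np.asarray(a, dtype=float)
    lo, hi = range1 if range1 is not None else (a.min(), a.max())
    frac = (a - lo) / (hi - lo)
    return range2[0] + frac * (range2[1] - range2[0])

print(rescale_sketch([0.0, 5.0, 10.0]))  # [0.  0.5 1. ]
```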
function rescale_divergent
rescale_divergent(df1: DataFrame, col: str) → DataFrame
Rescale divergently i.e. two-sided.
Args:
df1
(pd.DataFrame): input dataframe.
col
(str): column.
Returns:
pd.DataFrame
: column.
Notes:
Under development.
module roux.stat.variance
For variance related stats.
function confidence_interval_95
confidence_interval_95(x: np.array) → float
95% confidence interval.
Args:
x
(np.array): input vector.
Returns:
float
: output.
function get_ci
get_ci(rs, ci_type, outstr=False)
function get_variance_inflation
get_variance_inflation(data, coly: str, cols_x: list = None)
Variance Inflation Factor (VIF). A wrapper around statsmodels's variance_inflation_factor function.
Parameters:
data
(pd.DataFrame): input data.
coly
(str): dependent variable.
cols_x
(list): independent variables.
Returns: pd.Series
module roux.viz.annot
For annotations.
function annot_side
annot_side(
ax: Axes,
df1: DataFrame,
colx: str,
coly: str,
cols: str = None,
hue: str = None,
loc: str = 'right',
scatter=False,
scatter_marker='|',
scatter_alpha=0.75,
lines=True,
offx3: float = 0.15,
offymin: float = 0.1,
offymax: float = 0.9,
length_axhline: float = 3,
text=True,
text_offx: float = 0,
text_offy: float = 0,
invert_xaxis: bool = False,
break_pt: int = 25,
va: str = 'bottom',
zorder: int = 2,
color: str = 'gray',
kws_line: dict = {},
kws_scatter: dict = {},
**kws_text
) → Axes
Annotate elements of the plot on the side.
Args:
df1
(pd.DataFrame): input data.
colx
(str): column with x values.
coly
(str): column with y values.
cols
(str): column with labels.
hue
(str): column with colors of the labels.
ax
(plt.Axes, optional): plt.Axes object. Defaults to None.
loc
(str, optional): location. Defaults to 'right'.
invert_xaxis
(bool, optional): invert xaxis. Defaults to False.
offx3
(float, optional): x-offset for the bend position of the arrow. Defaults to 0.15.
offymin
(float, optional): y-offset minimum. Defaults to 0.1.
offymax
(float, optional): y-offset maximum. Defaults to 0.9.
break_pt
(int, optional): break point of the labels. Defaults to 25.
length_axhline
(float, optional): length of the horizontal line i.e. the "underline". Defaults to 3.
zorder
(int, optional): z-order. Defaults to 2.
color
(str, optional): color of the line. Defaults to 'gray'.
kws_line
(dict, optional): parameters for formatting the line. Defaults to {}.
Keyword Args:
kws
: parameters provided to the ax.text function.
Returns:
plt.Axes
: plt.Axes object.
function show_outlines
show_outlines(
data: DataFrame,
colx: str,
coly: str,
column_outlines: str,
outline_colors: dict,
style=None,
legend: bool = True,
kws_legend: dict = {},
zorder: int = 3,
ax: Axes = None,
**kws_scatter
) → Axes
Outline points on the scatter plot by categories.
function show_confidence_ellipse
show_confidence_ellipse(x, y, ax, n_std=3.0, facecolor='none', **kwargs)
Create a plot of the covariance confidence ellipse of x and y.
Parameters:
x, y (array-like, shape (n,)): input data.
ax (matplotlib.axes.Axes): the axes object to draw the ellipse into.
n_std (float): the number of standard deviations to determine the ellipse's radiuses.
**kwargs: forwarded to matplotlib.patches.Ellipse.
Returns: matplotlib.patches.Ellipse
References: https://matplotlib.org/3.5.0/gallery/statistics/confidence_ellipse.html
function show_box
show_box(
ax: Axes,
xy: tuple,
width: float,
height: float,
fill: str = None,
alpha: float = 1,
lw: float = 1.1,
edgecolor: str = 'k',
clip_on: bool = False,
scale_width: float = 1,
scale_height: float = 1,
xoff: float = 0,
yoff: float = 0,
**kws
) → Axes
Highlight sections of a plot e.g. heatmap by drawing boxes.
Args:
xy
(tuple): position of the left, bottom corner of the box.
width
(float): width.
height
(float): height.
ax
(plt.Axes, optional): plt.Axes object. Defaults to None.
fill
(str, optional): fill the box with color. Defaults to None.
alpha
(float, optional): alpha of the color. Defaults to 1.
lw
(float, optional): line width. Defaults to 1.1.
edgecolor
(str, optional): edge color. Defaults to 'k'.
clip_on
(bool, optional): clip the boxes by the axis limit. Defaults to False.
scale_width
(float, optional): scale width. Defaults to 1.
scale_height
(float, optional): scale height. Defaults to 1.
xoff
(float, optional): x-offset. Defaults to 0.
yoff
(float, optional): y-offset. Defaults to 0.
Keyword Args:
kws
: parameters provided to the Rectangle function.
Returns:
plt.Axes
: plt.Axes object.
function color_ax
color_ax(ax: Axes, c: str, linewidth: float = None) → Axes
Color border of plt.Axes.
Args:
ax
(plt.Axes): plt.Axes object.
c
(str): color.
linewidth
(float, optional): line width. Defaults to None.
Returns:
plt.Axes
: plt.Axes object.
function show_n_legend
show_n_legend(ax, df1: DataFrame, colid: str, colgroup: str, **kws)
function show_scatter_stats
show_scatter_stats(
ax: Axes,
data: DataFrame,
x,
y,
z,
method: str,
resample: bool = False,
show_n: bool = True,
show_n_prefix: str = '',
prefix: str = '',
loc=None,
zorder: int = 5,
verbose: bool = True,
**kws_set_label
)
resample (bool, optional): resample data. Defaults to False.
function show_crosstab_stats
show_crosstab_stats(
data: DataFrame,
cols: list,
method: str = None,
alpha: float = 0.05,
loc: str = None,
xoff: float = 0,
yoff: float = 0,
linebreak: bool = False,
ax: Axes = None,
**kws_set_label
) → Axes
Annotate a confusion matrix.
Args:
data
(pd.DataFrame): input data.
cols
(list): list of columns with the categories.
method
(str, optional): method used to calculate the statistical significance.
alpha
(float, optional): alpha for the stats. Defaults to 0.05.
loc
(str, optional): location. Over-rides kws_set_label. Defaults to None.
xoff
(float, optional): x offset. Defaults to 0.
yoff
(float, optional): y offset. Defaults to 0.
ax
(plt.Axes, optional): plt.Axes object. Defaults to None.
Keyword Args:
kws_set_label
: keyword parameters provided toset_label
.
Returns:
plt.Axes
:plt.Axes
object.
function show_confusion_matrix_stats
show_confusion_matrix_stats(
df_: DataFrame,
ax: Axes = None,
off: float = 0.5
) → Axes
Annotate a confusion matrix.
Args:
- df_ (pd.DataFrame): input data.
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.
- off (float, optional): offset. Defaults to 0.5.

Returns:
plt.Axes: plt.Axes object.
function get_logo_ax
get_logo_ax(
ax: Axes,
size: float = 0.5,
bbox_to_anchor: list = None,
loc: str = 1,
axes_kwargs: dict = {'zorder': -1}
) → Axes
Get a plt.Axes for placing the logo.

Args:
- ax (plt.Axes): plt.Axes object.
- size (float, optional): size of the subplot. Defaults to 0.5.
- bbox_to_anchor (list, optional): location. Defaults to None.
- loc (str, optional): location. Defaults to 1.
- axes_kwargs (dict, optional): parameters provided to inset_axes. Defaults to {'zorder': -1}.

Returns:
plt.Axes: plt.Axes object.
function set_logo
set_logo(
imp: str,
ax: Axes,
size: float = 0.5,
bbox_to_anchor: list = None,
loc: str = 1,
axes_kwargs: dict = {'zorder': -1},
params_imshow: dict = {'aspect': 'auto', 'alpha': 1, 'interpolation': 'catrom'},
test: bool = False,
force: bool = False
) → Axes
Set logo.
Args:
- imp (str): path to the logo file.
- ax (plt.Axes): plt.Axes object.
- size (float, optional): size of the subplot. Defaults to 0.5.
- bbox_to_anchor (list, optional): location. Defaults to None.
- loc (str, optional): location. Defaults to 1.
- axes_kwargs (dict, optional): parameters provided to inset_axes. Defaults to {'zorder': -1}.
- params_imshow (dict, optional): parameters provided to the imshow function. Defaults to {'aspect': 'auto', 'alpha': 1, 'interpolation': 'catrom'}.
- test (bool, optional): test mode. Defaults to False.
- force (bool, optional): overwrite file. Defaults to False.

Returns:
plt.Axes: plt.Axes object.
function set_suptitle
set_suptitle(axs, title, offy=0, **kws_text)
Combined title for a list of subplots.
module roux.viz.ax_
For setting up subplots.
function set_axes_minimal
set_axes_minimal(ax, xlabel=None, ylabel=None, off_axes_pad=0) → Axes
Set minimal axes labels, at the lower left corner.
function set_label
set_label(
s: str,
ax: Axes,
x: float = 0,
y: float = 0,
ha: str = 'left',
va: str = 'top',
loc=None,
off_loc=0.01,
title: bool = False,
**kws
) → Axes
Set label on a plot.
Args:
- x (float): x position.
- y (float): y position.
- s (str): label.
- ax (plt.Axes): plt.Axes object.
- ha (str, optional): horizontal alignment. Defaults to 'left'.
- va (str, optional): vertical alignment. Defaults to 'top'.
- loc (int, optional): location of the label. 1: 'upper right', 2: 'upper left', 3: 'lower left', 4: 'lower right'.
- off_loc (float, optional): x and y location offset. Defaults to 0.01.
- title (bool, optional): set as the title. Defaults to False.

Returns:
plt.Axes: plt.Axes object.
function set_ylabel
set_ylabel(
ax: Axes,
s: str = None,
x: float = -0.1,
y: float = 1.02,
xoff: float = 0,
yoff: float = 0
) → Axes
Set a horizontal ylabel.

Args:
- ax (plt.Axes): plt.Axes object.
- s (str, optional): ylabel. Defaults to None.
- x (float, optional): x position. Defaults to -0.1.
- y (float, optional): y position. Defaults to 1.02.
- xoff (float, optional): x offset. Defaults to 0.
- yoff (float, optional): y offset. Defaults to 0.

Returns:
plt.Axes: plt.Axes object.
function get_ax_labels
get_ax_labels(ax: Axes)
function format_labels
format_labels(ax, fmt='cap1', title_fontsize=15, rename_labels=None, test=False)
function rename_ticklabels
rename_ticklabels(
ax: Axes,
axis: str,
rename: dict = None,
replace: dict = None,
ignore: bool = False
) → Axes
Rename the ticklabels.
Args:
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.
- axis (str): axis (x|y).
- rename (dict, optional): replace strings. Defaults to None.
- replace (dict, optional): replace sub-strings. Defaults to None.
- ignore (bool, optional): ignore warnings. Defaults to False.

Raises:
ValueError: either rename or replace should be provided.

Returns:
plt.Axes: plt.Axes object.
function get_ticklabel_position
get_ticklabel_position(ax: Axes, axis: str) → Axes
Get positions of the ticklabels.
Args:
- ax (plt.Axes): plt.Axes object.
- axis (str): axis (x|y).

Returns:
plt.Axes: plt.Axes object.
function set_ticklabels_color
set_ticklabels_color(ax: Axes, ticklabel2color: dict, axis: str = 'y') → Axes
Set colors to ticklabels.
Args:
- ax (plt.Axes): plt.Axes object.
- ticklabel2color (dict): colors of the ticklabels.
- axis (str): axis (x|y). Defaults to 'y'.

Returns:
plt.Axes: plt.Axes object.
function format_ticklabels
format_ticklabels(
ax: Axes,
axes: tuple = ['x', 'y'],
interval: float = None,
n: int = None,
fmt: str = None,
font: str = None
) → Axes
Format the ticklabels.

Args:
- ax (plt.Axes): plt.Axes object.
- axes (tuple, optional): axes. Defaults to ['x', 'y'].
- n (int, optional): number of ticks. Defaults to None.
- fmt (str, optional): format, e.g. '.0f'. Defaults to None.
- font (str, optional): font. Defaults to 'DejaVu Sans Mono'.

Returns:
plt.Axes: plt.Axes object.
TODOs: 1. include color_ticklabels
function split_ticklabels
split_ticklabels(
ax: Axes,
fmt: str,
axis='x',
group_x=-0.45,
group_y=-0.25,
group_prefix=None,
group_suffix=False,
group_loc='center',
group_colors=None,
group_alpha=0.2,
show_group_line=True,
group_line_off_x=0.15,
group_line_off_y=0.1,
show_group_span=False,
group_span_kws={},
sep: str = '-',
pad_major=6,
off: float = 0.2,
test: bool = False,
**kws
) → Axes
Split ticklabels into major and minor. Two minor ticks are created per major tick.
Args:
- ax (plt.Axes): plt.Axes object.
- fmt (str): 'group'-wise or 'pair'-wise splitting of the ticklabels.
- axis (str): name of the axis: x or y.
- sep (str, optional): separator within the tick labels. Defaults to '-'.
- test (bool, optional): test mode. Defaults to False.

Returns:
plt.Axes: plt.Axes object.
function get_axlimsby_data
get_axlimsby_data(
X: Series,
Y: Series,
off: float = 0.2,
equal: bool = False
) → Axes
Infer axis limits from data.
Args:
- X (pd.Series): x values.
- Y (pd.Series): y values.
- off (float, optional): offset. Defaults to 0.2.
- equal (bool, optional): equal limits. Defaults to False.

Returns:
plt.Axes: plt.Axes object.
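The padding logic of such a helper can be sketched in plain Python. This is an illustrative approximation only; `axlims_from_data` is a hypothetical name, not roux's implementation:

```python
def axlims_from_data(values, off=0.2):
    """Pad the data range by a fraction `off` on each side (sketch)."""
    lo, hi = min(values), max(values)
    pad = (hi - lo) * off
    return (lo - pad, hi + pad)

axlims_from_data([0, 10], off=0.2)  # (-2.0, 12.0)
```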
function get_axlims
get_axlims(ax: Axes) → Axes
Get axis limits.
Args:
- ax (plt.Axes): plt.Axes object.

Returns:
plt.Axes: plt.Axes object.
function set_equallim
set_equallim(
ax: Axes,
diagonal: bool = False,
difference: float = None,
format_ticks: bool = True,
**kws_format_ticklabels
) → Axes
Set equal axis limits.
Args:
- ax (plt.Axes): plt.Axes object.
- diagonal (bool, optional): show the diagonal. Defaults to False.
- difference (float, optional): difference. Defaults to None.
- format_ticks (bool, optional): format the ticklabels. Defaults to True.

Returns:
plt.Axes: plt.Axes object.
function set_axlims
set_axlims(
ax: Axes,
off: float,
axes: list = ['x', 'y'],
equal=False,
**kws_set_equallim
) → Axes
Set axis limits.
Args:
- ax (plt.Axes): plt.Axes object.
- off (float): offset.
- axes (list, optional): axis name/s. Defaults to ['x', 'y'].

Returns:
plt.Axes: plt.Axes object.
function set_grids
set_grids(ax: Axes, axis: str = None) → Axes
Show grids based on the shape (aspect ratio) of the plot.
Args:
- ax (plt.Axes): plt.Axes object.
- axis (str, optional): axis name. Defaults to None.

Returns:
plt.Axes: plt.Axes object.
function rename_legends
rename_legends(ax: Axes, replaces: dict, **kws_legend) → Axes
Rename legends.
Args:
- ax (plt.Axes): plt.Axes object.
- replaces (dict): replacements for the legend labels.

Returns:
plt.Axes: plt.Axes object.
function append_legends
append_legends(ax: Axes, labels: list, handles: list, **kws) → Axes
Append to legends.
Args:
- ax (plt.Axes): plt.Axes object.
- labels (list): labels.
- handles (list): handles.

Returns:
plt.Axes: plt.Axes object.
function sort_legends
sort_legends(ax: Axes, sort_order: list = None, **kws) → Axes
Sort or filter legends.
Args:
- ax (plt.Axes): plt.Axes object.
- sort_order (list, optional): order of the legends. Defaults to None.

Returns:
plt.Axes: plt.Axes object.
Notes:
- Filter the legends by providing the indices of the legends to keep.
function drop_duplicate_legend
drop_duplicate_legend(ax, **kws)
function reset_legend_colors
reset_legend_colors(ax)
Reset legend colors.
Args:
- ax (plt.Axes): plt.Axes object.

Returns:
plt.Axes: plt.Axes object.
function set_legends_merged
set_legends_merged(axs)
Merge the legends of multiple subplots.

Args:
- axs (list): list of plt.Axes objects.

Returns:
plt.Axes: the first plt.Axes object in the list.
function set_legend_custom
set_legend_custom(
ax: Axes,
legend2param: dict,
param: str = 'color',
lw: float = 1,
marker: str = 'o',
markerfacecolor: bool = True,
size: float = 10,
color: str = 'k',
linestyle: str = '',
title_ha: str = 'center',
frameon: bool = True,
**kws
) → Axes
Set custom legends.
Args:
- ax (plt.Axes): plt.Axes object.
- legend2param (dict): legend name to the parameter to change, e.g. the name of the color.
- param (str, optional): parameter to change. Defaults to 'color'.
- lw (float, optional): line width. Defaults to 1.
- marker (str, optional): marker type. Defaults to 'o'.
- markerfacecolor (bool, optional): marker face color. Defaults to True.
- size (float, optional): size of the markers. Defaults to 10.
- color (str, optional): color of the markers. Defaults to 'k'.
- linestyle (str, optional): line style. Defaults to ''.
- title_ha (str, optional): horizontal alignment of the title. Defaults to 'center'.
- frameon (bool, optional): show the frame. Defaults to True.

Returns:
plt.Axes: plt.Axes object.
TODOs:
1. Different number of points for each entry:
from matplotlib.legend_handler import HandlerTuple
l1, = plt.plot(-1, -1, lw=0, marker="o", markerfacecolor='k', markeredgecolor='k')
l2, = plt.plot(-0.5, -1, lw=0, marker="o", markerfacecolor="none", markeredgecolor='k')
plt.legend([(l1,), (l1, l2)], ["test 1", "test 2"], handler_map={tuple: HandlerTuple(2)})
References:
- https://matplotlib.org/stable/api/markers_api.html
- http://www.cis.jhu.edu/~shanest/mpt/js/mathjax/mathjax-dev/fonts/Tables/STIX/STIX/All/All.html
function get_line_cap_length
get_line_cap_length(ax: Axes, linewidth: float) → Axes
Get the line cap length.
Args:
- ax (plt.Axes): plt.Axes object.
- linewidth (float): width of the line.

Returns:
plt.Axes: plt.Axes object.
function set_colorbar
set_colorbar(
fig: object,
ax: Axes,
ax_pc: Axes,
label: str,
bbox_to_anchor: tuple = (0.05, 0.5, 1, 0.45),
orientation: str = 'vertical'
)
Set colorbar.
Args:
- fig (object): figure object.
- ax (plt.Axes): plt.Axes object.
- ax_pc (plt.Axes): plt.Axes object for the colorbar.
- label (str): label.
- bbox_to_anchor (tuple, optional): location. Defaults to (0.05, 0.5, 1, 0.45).
- orientation (str, optional): orientation. Defaults to "vertical".

Returns: figure object.
function set_colorbar_label
set_colorbar_label(ax: Axes, label: str) → Axes
Find colorbar and set label for it.
Args:
- ax (plt.Axes): plt.Axes object.
- label (str): label.

Returns:
plt.Axes: plt.Axes object.
module roux.viz.bar
For bar plots.
function plot_barh
plot_barh(
df1: DataFrame,
colx: str,
coly: str,
colannnotside: str = None,
x1: float = None,
offx: float = 0,
ax: Axes = None,
**kws
) → Axes
Plot a horizontal bar plot with text on the bars.

Args:
- df1 (pd.DataFrame): input data.
- colx (str): x column.
- coly (str): y column.
- colannnotside (str): column with the annotations to show on the right side of the plot.
- x1 (float): x position of the text.
- offx (float): x-offset of x1, multiplier.
- color (str): color of the bars.
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:
- kws: parameters provided to the barh function.

Returns:
plt.Axes: plt.Axes object.
function plot_value_counts
plot_value_counts(
df: DataFrame,
col: str,
logx: bool = False,
kws_hist: dict = {'bins': 10},
kws_bar: dict = {},
grid: bool = False,
axes: list = None,
fig: object = None,
hist: bool = True
)
Plot pandas' value_counts.

Args:
- df (pd.DataFrame): input data, i.e. value_counts.
- col (str): column with the counts.
- logx (bool, optional): x-axis on log-scale. Defaults to False.
- kws_hist (dict, optional): parameters provided to the hist function. Defaults to {'bins': 10}.
- kws_bar (dict, optional): parameters provided to the bar function. Defaults to {}.
- grid (bool, optional): show grids or not. Defaults to False.
- axes (list, optional): list of plt.Axes. Defaults to None.
- fig (object, optional): figure object. Defaults to None.
- hist (bool, optional): show the histogram. Defaults to True.
function plot_barh_stacked_percentage
plot_barh_stacked_percentage(
df1: DataFrame,
coly: str,
colannot: str,
color: str = None,
yoff: float = 0,
ax: Axes = None
) → Axes
Plot a horizontal stacked bar plot with percentages.

Args:
- df1 (pd.DataFrame): input data. Values in rows sum to 100%.
- coly (str): y column, i.e. the yticklabels, e.g. retained and dropped.
- colannot (str): column with the annotations.
- color (str, optional): color. Defaults to None.
- yoff (float, optional): y-offset. Defaults to 0.
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:
plt.Axes: plt.Axes object.
function plot_bar_serial
plot_bar_serial(
d1: dict,
polygon: bool = False,
polygon_x2i: float = 0,
labelis: list = [],
y: float = 0,
ylabel: str = None,
off_arrowy: float = 0.15,
kws_rectangle={'height': 0.5, 'linewidth': 1},
ax: Axes = None
) → Axes
Barplots with serial increase in resolution.
Args:
- d1 (dict): dictionary with the data.
- polygon (bool, optional): show the polygon. Defaults to False.
- polygon_x2i (float, optional): connect the polygon to this subset. Defaults to 0.
- labelis (list, optional): label these subsets. Defaults to [].
- y (float, optional): y position. Defaults to 0.
- ylabel (str, optional): y label. Defaults to None.
- off_arrowy (float, optional): offset for the arrow. Defaults to 0.15.
- kws_rectangle (dict, optional): parameters provided to the rectangle function. Defaults to dict(height=0.5, linewidth=1).
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:
plt.Axes: plt.Axes object.
function plot_barh_stacked_percentage_intersections
plot_barh_stacked_percentage_intersections(
df0: DataFrame,
colxbool: str,
colybool: str,
colvalue: str,
colid: str,
colalt: str,
colgroupby: str,
coffgroup: float = 0.95,
ax: Axes = None
) → Axes
Plot a horizontal stacked bar plot with percentages and intersections.

Args:
- df0 (pd.DataFrame): input data.
- colxbool (str): x column.
- colybool (str): y column.
- colvalue (str): column with the values.
- colid (str): column with the ids.
- colalt (str): column with the alternative subset.
- colgroupby (str): column with the groups.
- coffgroup (float, optional): cut-off between the groups. Defaults to 0.95.
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:
plt.Axes: plt.Axes object.

Examples:
Parameters: colxbool='paralog', colybool='essential', colvalue='value', colid='gene id', colalt='singleton', coffgroup=0.95, colgroupby='tissue'
function to_input_data_sankey
to_input_data_sankey(
df0,
colid,
cols_groupby=None,
colall='all',
remove_all=False
)
function plot_sankey
plot_sankey(
df1,
cols_groupby=None,
hues=None,
node_color=None,
link_color=None,
info=None,
x=None,
y=None,
colors=None,
hovertemplate=None,
text_width=20,
convert=True,
width=400,
height=400,
outp=None,
validate=True,
test=False,
**kws
)
module roux.viz.colors
For setting up colors.
function rgbfloat2int
rgbfloat2int(rgb_float)
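Judging by its name, this converts matplotlib-style float RGB tuples (channels in [0, 1]) to integer channels in [0, 255]. A minimal sketch of that conversion (illustrative; not roux's exact code):

```python
def rgb_float_to_int(rgb_float):
    """Convert float RGB channels in [0, 1] to ints in [0, 255]."""
    return tuple(int(round(c * 255)) for c in rgb_float)

rgb_float_to_int((1.0, 0.0, 0.5))  # (255, 0, 128)
```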
function get_colors_default
get_colors_default() → list
Get the default colors.

Returns:
list: colors.
function get_ncolors
get_ncolors(
n: int,
cmap: str = 'Spectral',
ceil: bool = False,
test: bool = False,
N: int = 20,
out: str = 'hex',
**kws_get_cmap_section
) → list
Get colors.
Args:
- n (int): number of colors to get.
- cmap (str, optional): colormap. Defaults to 'Spectral'.
- ceil (bool, optional): ceil. Defaults to False.
- test (bool, optional): test mode. Defaults to False.
- N (int, optional): number of colors in the colormap. Defaults to 20.
- out (str, optional): output. Defaults to 'hex'.

Returns:
list: colors.
function get_val2color
get_val2color(
ds: Series,
vmin: float = None,
vmax: float = None,
cmap: str = 'Reds'
) → dict
Get color for a value.
Args:
- ds (pd.Series): values.
- vmin (float, optional): minimum value. Defaults to None.
- vmax (float, optional): maximum value. Defaults to None.
- cmap (str, optional): colormap. Defaults to 'Reds'.

Returns:
dict: output.
function saturate_color
saturate_color(color, alpha: float) → object
Saturate a color.
Args:
- color: color.
- alpha (float): alpha level.

Returns:
object: output.

References:
- https://stackoverflow.com/a/60562502/3521099
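The linked answer saturates or desaturates by scaling the color's lightness in HLS space; a self-contained sketch of that approach using only the standard library (illustrative, not roux's exact implementation):

```python
import colorsys

def scale_lightness(rgb, alpha):
    """Scale the HLS lightness of an RGB color; alpha < 1 darkens, alpha > 1 lightens."""
    h, l, s = colorsys.rgb_to_hls(*rgb)
    return colorsys.hls_to_rgb(h, max(0.0, min(1.0, l * alpha)), s)

scale_lightness((1.0, 0.0, 0.0), 0.5)  # (0.5, 0.0, 0.0): a darker red
```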
function mix_colors
mix_colors(d: dict) → str
Mix colors.
Args:
- d (dict): colors-to-alpha map.

Returns:
str: hex color.

References:
- https://stackoverflow.com/a/61488997/3521099
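A common way to mix colors, as in the linked answer, is a weighted average of their RGB channels. A hedged sketch of that idea (roux's exact method may differ; `mix_hex_colors` is a hypothetical name):

```python
def mix_hex_colors(color2weight):
    """Mix '#rrggbb' colors by a weighted average of the RGB channels."""
    total = sum(color2weight.values())
    mixed = [0.0, 0.0, 0.0]
    for hexcolor, weight in color2weight.items():
        for i, pos in enumerate((1, 3, 5)):  # r, g, b hex pairs
            mixed[i] += int(hexcolor[pos:pos + 2], 16) * weight / total
    return "#" + "".join(f"{round(c):02x}" for c in mixed)

mix_hex_colors({"#000000": 1, "#ffffff": 1})  # '#808080'
```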
function make_cmap
make_cmap(cs: list, N: int = 20, **kws)
Create a colormap.
Args:
- cs (list): colors.
- N (int, optional): resolution, i.e. the number of colors. Defaults to 20.

Returns: cmap.
function get_cmap_section
get_cmap_section(
cmap,
vmin: float = 0.0,
vmax: float = 1.0,
n: int = 100
) → object
Get section of a colormap.
Args:
- cmap (object|str): colormap.
- vmin (float, optional): minimum value. Defaults to 0.0.
- vmax (float, optional): maximum value. Defaults to 1.0.
- n (int, optional): resolution, i.e. the number of colors. Defaults to 100.

Returns:
object: cmap.
function append_cmap
append_cmap(
cmap: str = 'Reds',
color: str = '#D3DDDC',
cmap_min: float = 0.2,
cmap_max: float = 0.8,
ncolors: int = 100,
ncolors_min: int = 1,
ncolors_max: int = 0
)
Append a color to colormap.
Args:
- cmap (str, optional): colormap. Defaults to 'Reds'.
- color (str, optional): color. Defaults to '#D3DDDC'.
- cmap_min (float, optional): cmap_min. Defaults to 0.2.
- cmap_max (float, optional): cmap_max. Defaults to 0.8.
- ncolors (int, optional): number of colors. Defaults to 100.
- ncolors_min (int, optional): minimum number of colors. Defaults to 1.
- ncolors_max (int, optional): maximum number of colors. Defaults to 0.

Returns: cmap.

References:
- https://matplotlib.org/stable/tutorials/colors/colormap-manipulation.html
module roux.viz.compare
For comparative plots.
function plot_comparisons
plot_comparisons(
plot_data,
x,
ax=None,
output_dir_path=None,
force=False,
return_path=False
)
Parameters:
- plot_data: output of .stat.compare.get_comparison.

Notes:
- sample type: a different sample of the same data.
module roux.viz.diagram
For diagrams e.g. flowcharts
function diagram_nb
diagram_nb(graph: str, out: bool = False)
Show a diagram in jupyter notebook using mermaid.js.
Parameters:
- graph (str): markdown-formatted graph. Please see https://mermaid.js.org/intro/n00b-syntaxReference.html
- out (bool): output the URL. Defaults to False.

References:
1. https://mermaid.js.org/config/Tutorials.html#jupyter-integration-with-mermaid-js

Examples:
graph LR;
i1(["input1"]) & d1[("data1")] --> p1[["process1"]] --> o1(["output1"])
p1 --> o2["output2"]:::ends
classDef ends fill:#fff,stroke:#fff
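The jupyter integration in the referenced tutorial renders a graph by base64-encoding it into a mermaid.ink image URL; a minimal sketch of that mechanism (an assumption based on that tutorial, not necessarily roux's exact code):

```python
import base64

def mermaid_url(graph: str) -> str:
    """Build a mermaid.ink image URL from a mermaid graph definition."""
    encoded = base64.urlsafe_b64encode(graph.encode("utf8")).decode("ascii")
    return "https://mermaid.ink/img/" + encoded

url = mermaid_url('graph LR; i1(["input1"]) --> p1[["process1"]]')
```

Displaying `url` via `IPython.display.Image` is then enough to show the diagram in a notebook.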
module roux.viz.dist
For distribution plots.
function hist_annot
hist_annot(
dplot: DataFrame,
colx: str,
colssubsets: list = [],
bins: int = 100,
subset_unclassified: bool = True,
cmap: str = 'hsv',
ymin=None,
ymax=None,
ylimoff: float = 1,
ywithinoff: float = 1.2,
annotaslegend: bool = True,
annotn: bool = True,
params_scatter: dict = {'zorder': 2, 'alpha': 0.1, 'marker': '|'},
xlim: tuple = None,
ax: Axes = None,
**kws
) → Axes
Annotated histogram.

Args:
- dplot (pd.DataFrame): input dataframe.
- colx (str): x column.
- colssubsets (list, optional): columns indicating the subsets. Defaults to [].
- bins (int, optional): bins. Defaults to 100.
- subset_unclassified (bool, optional): label the non-annotated subset as 'unclassified'. Defaults to True.
- cmap (str, optional): colormap. Defaults to 'hsv'.
- ylimoff (float, optional): y-offset for the y-axis limit. Defaults to 1.
- ywithinoff (float, optional): y-offset for the distance within the labels. Defaults to 1.2.
- annotaslegend (bool, optional): convert labels to legends. Defaults to True.
- annotn (bool, optional): annotate the sample sizes. Defaults to True.
- params_scatter (dict, optional): parameters of the scatter plot. Defaults to {'zorder': 2, 'alpha': 0.1, 'marker': '|'}.
- xlim (tuple, optional): x-axis limits. Defaults to None.
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:
- kws: parameters provided to the hist function.

Returns:
plt.Axes: plt.Axes object.

TODOs: For the scatter, use annot_side with loc='top'.
function plot_gmm
plot_gmm(
x: Series,
coff: float = None,
mix_pdf: object = None,
two_pdfs: tuple = None,
weights: tuple = None,
n_clusters: int = 2,
bins: int = 20,
show_cutoff: bool = True,
show_cutoff_line: bool = True,
colors: list = ['gray', 'gray', 'lightgray'],
out_coff: bool = False,
hist: bool = True,
test: bool = False,
ax: Axes = None,
kws_axvline={'color': 'k'},
**kws
) → Axes
Plot Gaussian Mixture Models (GMMs).

Args:
- x (pd.Series): input vector.
- coff (float, optional): intersection between the two fitted distributions. Defaults to None.
- mix_pdf (object, optional): probability density function of the mixed distribution. Defaults to None.
- two_pdfs (tuple, optional): probability density functions of the separate distributions. Defaults to None.
- weights (tuple, optional): weights of the individual distributions. Defaults to None.
- n_clusters (int, optional): number of distributions. Defaults to 2.
- bins (int, optional): bins. Defaults to 20.
- colors (list, optional): colors of the individual distributions and of the mixed one. Defaults to ['gray', 'gray', 'lightgray'].
- out_coff (bool, optional): return the cutoff. Defaults to False.
- hist (bool, optional): show the histogram. Defaults to True.
- test (bool, optional): test mode. Defaults to False.
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:
- kws: parameters provided to the hist function.
- kws_axvline: parameters provided to the axvline function.

Returns:
plt.Axes: plt.Axes object.
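The `coff` argument is described as the intersection of the two fitted component distributions. How such an intersection can be found numerically is sketched below; this is illustrative only (`find_cutoff` is a hypothetical helper, not roux's API):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def find_cutoff(mu1, s1, w1, mu2, s2, w2, lo, hi, steps=10_000):
    """Scan [lo, hi] for the x where the two weighted component pdfs intersect."""
    best_x, best_gap = lo, float("inf")
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        gap = abs(w1 * normal_pdf(x, mu1, s1) - w2 * normal_pdf(x, mu2, s2))
        if gap < best_gap:
            best_x, best_gap = x, gap
    return best_x

find_cutoff(0, 1, 0.5, 4, 1, 0.5, 0, 4)  # ~2.0: the midpoint of two symmetric components
```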
function plot_normal
plot_normal(x: Series, ax: Axes = None) → Axes
Plot normal distribution.
Args:
- x (pd.Series): input vector.
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Returns:
plt.Axes: plt.Axes object.
function get_jitter_positions
get_jitter_positions(ax, df1, order, column_category, column_position)
function plot_dists
plot_dists(
df1: DataFrame,
x: str,
y: str,
colindex: str,
hue: str = None,
order: list = None,
hue_order: list = None,
kind: str = 'box',
show_p: bool = True,
show_n: bool = True,
show_n_prefix: str = '',
show_n_ha=None,
show_n_ticklabels: bool = True,
show_outlines: bool = False,
kws_outlines: dict = {},
alternative: str = 'two-sided',
offx_n: float = 0,
axis_cont_lim: tuple = None,
axis_cont_scale: str = 'linear',
offs_pval: dict = None,
alpha: float = 0.5,
ax: Axes = None,
test: bool = False,
kws_stats: dict = {},
**kws
) → Axes
Plot distributions.
Args:
- df1 (pd.DataFrame): input data.
- x (str): x column.
- y (str): y column.
- colindex (str): index column.
- hue (str, optional): column with the values to be encoded as hues. Defaults to None.
- order (list, optional): order of the categorical values. Defaults to None.
- hue_order (list, optional): order of the values to be encoded as hues. Defaults to None.
- kind (str, optional): kind of distribution plot. Defaults to 'box'.
- show_p (bool, optional): show p-values. Defaults to True.
- show_n (bool, optional): show sample sizes. Defaults to True.
- show_n_prefix (str, optional): prefix of the sample size label, i.e. n=. Defaults to ''.
- offx_n (float, optional): x-offset for the sample size label. Defaults to 0.
- axis_cont_lim (tuple, optional): x-axis limits. Defaults to None.
- offs_pval (dict, optional): x and y offsets for the p-value labels.
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.
- test (bool, optional): test mode. Defaults to False.
- kws_stats (dict, optional): parameters provided to the stat function. Defaults to {}.

Keyword Args:
- kws: parameters provided to the seaborn function.

Returns:
plt.Axes: plt.Axes object.
TODOs: 1. Sort categories. 2. Change alpha of the boxplot rather than changing saturation of the swarmplot.
function pointplot_groupbyedgecolor
pointplot_groupbyedgecolor(data: DataFrame, ax: Axes = None, **kws) → Axes
Plot seaborn's pointplot grouped by the edgecolor of the points.

Args:
- data (pd.DataFrame): input data.
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:
- kws: parameters provided to seaborn's pointplot function.

Returns:
plt.Axes: plt.Axes object.
module roux.viz.figure
For setting up figures.
function get_children
get_children(fig)
Get all the individual objects included in the figure.
function get_child_text
get_child_text(search_name, all_children=None, fig=None)
Get text object.
function align_texts
align_texts(fig, texts: list, align: str, test=False)
Align text objects.
function labelplots
labelplots(
axes: list = None,
fig=None,
labels: list = None,
xoff: float = 0,
yoff: float = 0,
auto: bool = False,
xoffs: dict = {},
yoffs: dict = {},
va: str = 'center',
ha: str = 'left',
verbose: bool = True,
test: bool = False,
**kws_text
)
Label (sub)plots.
Args:
- fig: plt.figure object.
- axes (list): list of plt.Axes objects.
- xoff (float, optional): x offset. Defaults to 0.
- yoff (float, optional): y offset. Defaults to 0.
- params_alignment (dict, optional): alignment parameters. Defaults to {}.
- params_text (dict, optional): parameters provided to plt.text. Defaults to {'size': 20, 'va': 'bottom', 'ha': 'right'}.
- test (bool, optional): test mode. Defaults to False.

TODOs:
1. Get the x coordinate of the ylabel.
module roux.viz.heatmap
For heatmaps.
function plot_table
plot_table(
df1: DataFrame,
xlabel: str = None,
ylabel: str = None,
annot: bool = True,
cbar: bool = False,
linecolor: str = 'k',
linewidths: float = 1,
cmap: str = None,
sorty: bool = False,
linebreaky: bool = False,
scales: tuple = [1, 1],
ax: Axes = None,
**kws
) → Axes
Plot to show a table.
Args:
- df1 (pd.DataFrame): input data.
- xlabel (str, optional): x label. Defaults to None.
- ylabel (str, optional): y label. Defaults to None.
- annot (bool, optional): show the numbers. Defaults to True.
- cbar (bool, optional): show the colorbar. Defaults to False.
- linecolor (str, optional): line color. Defaults to 'k'.
- linewidths (float, optional): line widths. Defaults to 1.
- cmap (str, optional): colormap. Defaults to None.
- sorty (bool, optional): sort rows. Defaults to False.
- linebreaky (bool, optional): linebreak for the y labels. Defaults to False.
- scales (tuple, optional): scale of the table. Defaults to [1, 1].
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.

Keyword Args:
- kws: parameters provided to the sns.heatmap function.

Returns:
plt.Axes: plt.Axes object.
module roux.viz.image
For visualization of images.
function plot_image
plot_image(
imp: str,
ax: Axes = None,
force=False,
margin=0,
axes=False,
test=False,
**kwarg
) → Axes
Plot image e.g. schematic.
Args:
- imp (str): path of the image.
- ax (plt.Axes, optional): plt.Axes object. Defaults to None.
- force (bool, optional): overwrite the output. Defaults to False.
- margin (int, optional): margins. Defaults to 0.
- test (bool, optional): test mode. Defaults to False.

Keyword Args:
- kwarg: cairosvg: {'dpi': 500, 'scale': 2}; imagemagick: {'trim': False, 'alpha': False}.

Returns:
plt.Axes: plt.Axes object.
module roux.viz.io
For input/output of plots.
function to_plotp
to_plotp(
ax: Axes = None,
prefix: str = 'plot/plot_',
suffix: str = '',
fmts: list = ['png']
) → str
Infer output path for a plot.
Args:
- ax (plt.Axes): plt.Axes object.
- prefix (str, optional): prefix with the directory path for the plot. Defaults to 'plot/plot_'.
- suffix (str, optional): suffix of the filename. Defaults to ''.
- fmts (list, optional): formats of the images. Defaults to ['png'].

Returns:
str: output path for the plot.
function savefig
savefig(
plotp: str,
tight_layout: bool = True,
bbox_inches: list = None,
fmts: list = ['png'],
savepdf: bool = False,
normalise_path: bool = True,
replaces_plotp: dict = None,
dpi: int = 500,
force: bool = True,
kws_replace_many: dict = {},
kws_savefig: dict = {},
**kws
) → str
Wrapper around plt.savefig
.
Args:
- plotp (str): output path or plt.Axes object.
- tight_layout (bool, optional): apply tight_layout. Defaults to True.
- bbox_inches (list, optional): bbox_inches. Defaults to None.
- savepdf (bool, optional): also save as a PDF. Defaults to False.
- normalise_path (bool, optional): normalise the path. Defaults to True.
- replaces_plotp (dict, optional): replacements in the path. Defaults to None.
- dpi (int, optional): dpi. Defaults to 500.
- force (bool, optional): overwrite the output. Defaults to True.
- kws_replace_many (dict, optional): parameters provided to the replace_many function. Defaults to {}.

Keyword Args:
- kws: parameters provided to the to_plotp function.
- kws_savefig: parameters provided to the to_savefig function.

Returns:
str: output path.
function savelegend
savelegend(
plotp: str,
legend: object,
expand: list = [-5, -5, 5, 5],
**kws_savefig
) → str
Save only the legend of the plot/figure.
Args:
- plotp (str): output path.
- legend (object): legend object.
- expand (list, optional): expand. Defaults to [-5, -5, 5, 5].

Returns:
str: output path.

References:
1. https://stackoverflow.com/a/47749903/3521099
function update_kws_plot
update_kws_plot(kws_plot: dict, kws_plotp: dict, test: bool = False) → dict
Update the input parameters.
Args:
- kws_plot (dict): input parameters.
- kws_plotp (dict): saved parameters.
- test (bool, optional): test mode. Defaults to False.

Returns:
dict: updated parameters.
function get_plot_inputs
get_plot_inputs(
plotp: str,
df1: DataFrame = None,
kws_plot: dict = {},
outd: str = None
) → tuple
Get plot inputs.
Args:
- plotp (str): path of the plot.
- df1 (pd.DataFrame): data for the plot.
- kws_plot (dict): parameters of the plot.
- outd (str): output directory.

Returns:
tuple: (path, dataframe, dict)
function log_code
log_code()
Log the code.
function get_lines
get_lines(
logp: str = 'log_notebook.log',
sep: str = 'begin_plot()',
test: bool = False
) → list
Get lines from the log.
Args:
- logp (str, optional): path to the log file. Defaults to 'log_notebook.log'.
- sep (str, optional): label marking the start of the code of the plot. Defaults to 'begin_plot()'.
- test (bool, optional): test mode. Defaults to False.

Returns:
list: lines of code.
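The idea implied here is to split the notebook log on the separator and keep the code after the most recent marker. A plain-Python sketch, assuming a roux-style log layout (not the library's exact implementation):

```python
log_text = """get_ipython().run_line_magic('logstart', 'log_notebook.log over')
x = [1, 2, 3]
begin_plot()
ax = data.plot(x='a', y='b')
begin_plot()
ax = data.plot(x='a', y='c')"""

# Keep only the lines logged after the last 'begin_plot()' marker
latest = log_text.split("begin_plot()")[-1]
lines = [line for line in latest.splitlines() if line.strip()]
# lines == ["ax = data.plot(x='a', y='c')"]
```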
function to_script
to_script(
srcp: str,
plotp: str,
defn: str = 'plot_',
s4: str = ' ',
test: bool = False,
**kws
) → str
Save the script with the code for the plot.
Args:
- srcp (str): path of the script.
- plotp (str): path of the plot.
- defn (str, optional): prefix of the function. Defaults to "plot_".
- s4 (str, optional): a tab (four spaces). Defaults to '    '.
- test (bool, optional): test mode. Defaults to False.

Returns:
str: path of the script.

TODOs:
1. Compatibility with names of the input dataframes other than df1:
get the variable name of the dataframe, e.g.
def get_df_name(df): return [x for x in globals() if globals()[x] is df and not x.startswith('_')][0]
and replace df1 with the variable name of the dataframe.
function to_plot
to_plot(
plotp: str,
data: DataFrame = None,
df1: DataFrame = None,
kws_plot: dict = {},
logp: str = 'log_notebook.log',
sep: str = 'begin_plot()',
validate: bool = False,
show_path: bool = False,
show_path_offy: float = -0.2,
force: bool = True,
test: bool = False,
quiet: bool = True,
**kws
) → str
Save a plot.
Args:
- plotp (str): output path.
- df1 (pd.DataFrame, optional): dataframe with the plotting data. Defaults to None.
- data (pd.DataFrame, optional): dataframe with the plotting data. Defaults to None.
- kws_plot (dict, optional): parameters for plotting. Defaults to dict().
- logp (str, optional): path to the log. Defaults to 'log_notebook.log'.
- sep (str, optional): separator marking the start of the plotting code in the jupyter notebook. Defaults to 'begin_plot()'.
- validate (bool, optional): validate the "readability" using the read_plot function. Defaults to False.
- show_path (bool, optional): show the path on the plot. Defaults to False.
- show_path_offy (float, optional): y-offset for the path label. Defaults to -0.2.
- force (bool, optional): overwrite the output. Defaults to True.
- test (bool, optional): test mode. Defaults to False.
- quiet (bool, optional): quiet mode. Defaults to True.

Returns:
str: output path.

Notes:
Requirement:
1. Start logging in the jupyter notebook:
from IPython import get_ipython
log_notebookp=f'log_notebook.log';open(log_notebookp, 'w').close();get_ipython().run_line_magic('logstart','{log_notebookp} over')
function read_plot
read_plot(p: str, safe: bool = False, test: bool = False, **kws) → Axes
Generate the plot from data, parameters and a script.
Args:
- <b>`p`</b> (str): path of the plot saved using the `to_plot` function.
- <b>`safe`</b> (bool, optional): read as an image. Defaults to False.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
function to_concat
to_concat(
ps: list,
how: str = 'h',
use_imagemagick: bool = False,
use_conda_env: bool = False,
test: bool = False,
**kws_outp
) → str
Concat images.
Args:
- <b>`ps`</b> (list): list of paths.
- <b>`how`</b> (str, optional): horizontal (`h`) or vertical (`v`). Defaults to 'h'.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
Returns:
- <b>`str`</b>: path of the output.
function to_montage
to_montage(
ps: list,
layout: str,
source_path: str = None,
env_name: str = None,
hspace: float = 0,
vspace: float = 0,
output_path: str = None,
test: bool = False,
**kws_outp
) → str
To montage.
Args:
- <b>`ps`</b> (list): list of paths.
- <b>`layout`</b> (str): layout of the images.
- <b>`hspace`</b> (float, optional): horizontal space. Defaults to 0.
- <b>`vspace`</b> (float, optional): vertical space. Defaults to 0.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
Returns:
- <b>`str`</b>: path of the output.
function to_gif
to_gif(
ps: list,
outp: str,
duration: int = 200,
loop: int = 0,
optimize: bool = True
) → str
Convert to GIF.
Args:
- <b>`ps`</b> (list): list of paths.
- <b>`outp`</b> (str): output path.
- <b>`duration`</b> (int, optional): duration of each frame. Defaults to 200.
- <b>`loop`</b> (int, optional): number of loops; 0 loops forever. Defaults to 0.
- <b>`optimize`</b> (bool, optional): optimize the size. Defaults to True.
Returns:
- <b>`str`</b>: output path.
References:
1. https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#gif
2. https://stackoverflow.com/a/57751793/3521099
function to_data
to_data(path: str) → str
Convert to base64 string.
Args:
- <b>`path`</b> (str): path of the input.
Returns:
- <b>`str`</b>: base64 string.
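The conversion that `to_data` documents can be sketched with the standard library alone. This is a minimal, hypothetical re-implementation of the idea (file bytes → base64 string), not roux's actual code:

```python
import base64
import os
import tempfile

def to_data_sketch(path: str) -> str:
    # Read the file's bytes and encode them as a base64 ASCII string.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# Round-trip on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    path = tmp.name
encoded = to_data_sketch(path)
os.remove(path)
print(encoded)  # aGVsbG8=
```

Such base64 strings are useful e.g. for embedding images inline in HTML reports.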
function to_convert
to_convert(filep: str, outd: str = None, fmt: str = 'JPEG') → str
Convert the format of an image using `PIL`.
Args:
- <b>`filep`</b> (str): input path.
- <b>`outd`</b> (str, optional): output directory. Defaults to None.
- <b>`fmt`</b> (str, optional): format of the output. Defaults to "JPEG".
Returns:
- <b>`str`</b>: output path.
function to_raster
to_raster(
plotp: str,
dpi: int = 500,
alpha: bool = False,
trim: bool = False,
force: bool = False,
test: bool = False
) → str
Rasterize a vector plot.
Args:
- <b>`plotp`</b> (str): input path.
- <b>`dpi`</b> (int, optional): DPI. Defaults to 500.
- <b>`alpha`</b> (bool, optional): transparency. Defaults to False.
- <b>`trim`</b> (bool, optional): trim margins. Defaults to False.
- <b>`force`</b> (bool, optional): overwrite output. Defaults to False.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
Returns:
- <b>`str`</b>: output path.
Notes:
- Runs the bash command: `convert -density 300 -trim`.
function to_rasters
to_rasters(plotd, ext='svg')
Convert many images to raster. Uses inkscape.
Args:
- <b>`plotd`</b> (str): directory.
- <b>`ext`</b> (str, optional): extension of the output. Defaults to 'svg'.
module roux.viz.line
For line plots.
function plot_range
plot_range(
df00: DataFrame,
colvalue: str,
colindex: str,
k: str,
headsize: int = 15,
headcolor: str = 'lightgray',
ax: Axes = None,
**kws_area
) → Axes
Plot range/intervals e.g. genome coordinates as lines.
Args:
- <b>`df00`</b> (pd.DataFrame): input data.
- <b>`colvalue`</b> (str): column with values.
- <b>`colindex`</b> (str): column with ids.
- <b>`k`</b> (str): subset name.
- <b>`headsize`</b> (int, optional): margin at top. Defaults to 15.
- <b>`headcolor`</b> (str, optional): color of the margin. Defaults to 'lightgray'.
- <b>`ax`</b> (plt.Axes, optional): `plt.Axes` object. Defaults to None.
Keyword Args:
- <b>`kws`</b>: keyword parameters provided to the `area` function.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
function plot_connections
plot_connections(
dplot: DataFrame,
label2xy: dict,
colval: str = '$r_{s}$',
line_scale: int = 40,
legend_title: str = 'similarity',
label2rename: dict = None,
element2color: dict = None,
xoff: float = 0,
yoff: float = 0,
rectangle: dict = {'width': 0.2, 'height': 0.32},
params_text: dict = {'ha': 'center', 'va': 'center'},
params_legend: dict = {'bbox_to_anchor': (1.1, 0.5), 'ncol': 1, 'frameon': False},
legend_elements: list = [],
params_line: dict = {'alpha': 1},
ax: Axes = None,
test: bool = False
) → Axes
Plot connections between points with annotations.
Args:
- <b>`dplot`</b> (pd.DataFrame): input data.
- <b>`label2xy`</b> (dict): label to position.
- <b>`colval`</b> (str, optional): column with values. Defaults to '$r_{s}$'.
- <b>`line_scale`</b> (int, optional): scale of the lines. Defaults to 40.
- <b>`legend_title`</b> (str, optional): title of the legend. Defaults to 'similarity'.
- <b>`label2rename`</b> (dict, optional): labels to rename. Defaults to None.
- <b>`element2color`</b> (dict, optional): element to color mapping. Defaults to None.
- <b>`xoff`</b> (float, optional): x-offset. Defaults to 0.
- <b>`yoff`</b> (float, optional): y-offset. Defaults to 0.
- <b>`rectangle`</b> (dict, optional): dimensions of the rectangles. Defaults to {'width': 0.2, 'height': 0.32}.
- <b>`params_text`</b> (dict, optional): parameters of the text. Defaults to {'ha': 'center', 'va': 'center'}.
- <b>`params_legend`</b> (dict, optional): parameters of the legend. Defaults to {'bbox_to_anchor': (1.1, 0.5), 'ncol': 1, 'frameon': False}.
- <b>`legend_elements`</b> (list, optional): elements of the legend. Defaults to [].
- <b>`params_line`</b> (dict, optional): parameters of the lines. Defaults to {'alpha': 1}.
- <b>`ax`</b> (plt.Axes, optional): `plt.Axes` object. Defaults to None.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
function plot_kinetics
plot_kinetics(
df1: DataFrame,
x: str,
y: str,
hue: str,
cmap: str = 'Reds_r',
ax: Axes = None,
test: bool = False,
kws_legend: dict = {},
**kws_set
) → Axes
Plot time-dependent kinetic data.
Args:
- <b>`df1`</b> (pd.DataFrame): input data.
- <b>`x`</b> (str): x column.
- <b>`y`</b> (str): y column.
- <b>`hue`</b> (str): hue column.
- <b>`cmap`</b> (str, optional): colormap. Defaults to 'Reds_r'.
- <b>`ax`</b> (plt.Axes, optional): `plt.Axes` object. Defaults to None.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
- <b>`kws_legend`</b> (dict, optional): legend parameters. Defaults to {}.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
function plot_steps
plot_steps(
df1: DataFrame,
col_step_name: str,
col_step_size: str,
ax: Axes = None,
test: bool = False
) → Axes
Plot step-wise changes in numbers, e.g. for a filtering process.
Args:
- <b>`df1`</b> (pd.DataFrame): input data.
- <b>`col_step_name`</b> (str): column containing the step names.
- <b>`col_step_size`</b> (str): column containing the numbers.
- <b>`ax`</b> (plt.Axes, optional): `plt.Axes` object. Defaults to None.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
module roux.viz
Global Variables
- io
- colors
- diagram
module roux.viz.scatter
For scatter plots.
function plot_scatter_agg
plot_scatter_agg(
dplot: DataFrame,
x: str = None,
y: str = None,
z: str = None,
kws_legend={'bbox_to_anchor': [1, 1], 'loc': 'upper left'}
)
UNDER DEV.
function plot_scatter
plot_scatter(
data: DataFrame,
x: str = None,
y: str = None,
z: str = None,
kind: str = 'scatter',
scatter_kws={},
line_kws={},
stat_method: str = 'spearman',
stat_kws={},
hollow: bool = False,
ax: Axes = None,
verbose: bool = True,
**kws
) → Axes
Plot scatter with multiple layers and stats.
Args:
- <b>`data`</b> (pd.DataFrame): input dataframe.
- <b>`x`</b> (str): x column.
- <b>`y`</b> (str): y column.
- <b>`z`</b> (str, optional): z column. Defaults to None.
- <b>`kind`</b> (str, optional): kind of scatter. Defaults to 'scatter'.
- <b>`trendline_method`</b> (str, optional): trendline method ['poly', 'lowess']. Defaults to 'poly'.
- <b>`stat_method`</b> (str, optional): method of the annotated stats ['mlr', 'spearman']. Defaults to 'spearman'.
- <b>`cmap`</b> (str, optional): colormap. Defaults to 'Reds'.
- <b>`label_colorbar`</b> (str, optional): label of the colorbar. Defaults to None.
- <b>`gridsize`</b> (int, optional): number of grids in the hexbin. Defaults to 25.
- <b>`bbox_to_anchor`</b> (list, optional): location of the legend. Defaults to [1, 1].
- <b>`loc`</b> (str, optional): location of the legend. Defaults to 'upper left'.
- <b>`title`</b> (str, optional): title of the plot. Defaults to None.
- <b>`line_kws`</b> (dict, optional): parameters provided to the `plot_trendline` function. Defaults to {}.
- <b>`ax`</b> (plt.Axes, optional): `plt.Axes` object. Defaults to None.
Keyword Args:
- <b>`kws`</b>: parameters provided to the `plot` function.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
Notes:
1. For a rasterized scatter plot, set `scatter_kws={'rasterized': True}`.
2. This function does not apply multiple colors, similar to `sns.regplot`.
function plot_qq
plot_qq(x: Series) → Axes
Plot QQ plot.
Args:
- <b>`x`</b> (pd.Series): input vector.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
function plot_ranks
plot_ranks(
df1: DataFrame,
colid: str,
colx: str,
coly: str = 'rank',
ascending: bool = True,
ax=None,
**kws
) → Axes
Plot rankings.
Args:
- <b>`df1`</b> (pd.DataFrame): input data.
- <b>`colid`</b> (str): column with unique ids.
- <b>`colx`</b> (str): x column.
- <b>`coly`</b> (str, optional): y column. Defaults to 'rank'.
- <b>`ascending`</b> (bool, optional): sort in ascending order. Defaults to True.
- <b>`ax`</b> (plt.Axes, optional): `plt.Axes` object. Defaults to None.
Keyword Args:
- <b>`kws`</b>: parameters provided to the `seaborn.scatterplot` function.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
function plot_volcano
plot_volcano(
data: DataFrame,
colx: str,
coly: str,
colindex: str,
hue: str = 'x',
style: str = 'P=0',
style_order: list = ['o', '^'],
markers: list = ['o', '^'],
show_labels: int = None,
show_outlines: int = None,
outline_colors: list = ['k'],
collabel: str = None,
show_line=True,
line_pvalue=0.1,
line_x: float = 0.0,
line_x_min: float = None,
show_text: bool = True,
text_increase: str = None,
text_decrease: str = None,
text_diff: str = None,
legend: bool = False,
verbose: bool = False,
p_min: float = None,
ax: Axes = None,
outmore: bool = False,
kws_legend: dict = {},
**kws_scatterplot
) → Axes
Volcano plot.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
module roux.viz.sets
For plotting sets.
function plot_venn
plot_venn(
ds1: Series,
ax: Axes = None,
figsize: tuple = [2.5, 2.5],
show_n: bool = True,
outmore=False,
**kws
) → Axes
Plot Venn diagram.
Args:
- <b>`ds1`</b> (pd.Series): input pandas.Series or dictionary, with subsets in the index levels, mapped to counts.
- <b>`ax`</b> (plt.Axes, optional): `plt.Axes` object. Defaults to None.
- <b>`figsize`</b> (tuple, optional): figure size. Defaults to [2.5, 2.5].
- <b>`show_n`</b> (bool, optional): show sample sizes. Defaults to True.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
function plot_intersection_counts
plot_intersection_counts(
df1: DataFrame,
cols: list = None,
kind: str = 'table',
method: str = None,
show_pval: bool = True,
confusion: bool = False,
rename_cols: bool = False,
sort_cols: tuple = [True, True],
order_x: list = None,
order_y: list = None,
cmap: str = 'Reds',
ax: Axes = None,
kws_show_stats: dict = {},
**kws_plot
) → Axes
Plot counts for the intersection between two sets.
Args:
- <b>`df1`</b> (pd.DataFrame): input data.
- <b>`cols`</b> (list, optional): columns. Defaults to None.
- <b>`kind`</b> (str, optional): kind of plot: 'table' or 'barplot'. Defaults to 'table'.
- <b>`method`</b> (str, optional): method to check the association ['chi2', 'FE']. Defaults to None.
- <b>`rename_cols`</b> (bool, optional): rename the columns. Defaults to False.
- <b>`show_pval`</b> (bool, optional): annotate p-values. Defaults to True.
- <b>`cmap`</b> (str, optional): colormap. Defaults to 'Reds'.
- <b>`kws_show_stats`</b> (dict, optional): arguments provided to the stats function. Defaults to {}.
- <b>`ax`</b> (plt.Axes, optional): `plt.Axes` object. Defaults to None.
Raises:
- <b>`ValueError`</b>: the `show_pval` position should be one of the allowed positions.
Keyword Args:
- <b>`kws_plot`</b>: keyword arguments provided to the plotting function.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
TODOs:
1. Use `compare_classes` to get the stats.
function plot_intersections
plot_intersections(
ds1: Series,
item_name: str = None,
figsize: tuple = [4, 4],
text_width: float = 2,
yorder: list = None,
sort_by: str = 'cardinality',
sort_categories_by: str = None,
element_size: int = 40,
facecolor: str = 'gray',
bari_annot: int = None,
totals_bar: bool = False,
totals_text: bool = True,
intersections_ylabel: float = None,
intersections_min: float = None,
test: bool = False,
annot_text: bool = False,
set_ylabelx: float = -0.25,
set_ylabely: float = 0.5,
**kws
) → Axes
Plot upset plot.
Args:
- <b>`ds1`</b> (pd.Series): input vector.
- <b>`item_name`</b> (str, optional): name of the items. Defaults to None.
- <b>`figsize`</b> (tuple, optional): figure size. Defaults to [4, 4].
- <b>`text_width`</b> (float, optional): max. width of the text. Defaults to 2.
- <b>`yorder`</b> (list, optional): order of the y elements. Defaults to None.
- <b>`sort_by`</b> (str, optional): sorting method. Defaults to 'cardinality'.
- <b>`sort_categories_by`</b> (str, optional): sorting method. Defaults to None.
- <b>`element_size`</b> (int, optional): size of the elements. Defaults to 40.
- <b>`facecolor`</b> (str, optional): facecolor. Defaults to 'gray'.
- <b>`bari_annot`</b> (int, optional): annotate the nth bar. Defaults to None.
- <b>`totals_text`</b> (bool, optional): show the totals. Defaults to True.
- <b>`intersections_ylabel`</b> (float, optional): y-label of the intersections. Defaults to None.
- <b>`intersections_min`</b> (float, optional): minimum intersection size to show. Defaults to None.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
- <b>`annot_text`</b> (bool, optional): annotate text. Defaults to False.
- <b>`set_ylabelx`</b> (float, optional): x position of the ylabel. Defaults to -0.25.
- <b>`set_ylabely`</b> (float, optional): y position of the ylabel. Defaults to 0.5.
Keyword Args:
- <b>`kws`</b>: parameters provided to the `upset.plot` function.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
Notes:
- `sort_by`: {'cardinality', 'degree'}. If 'cardinality', subsets are listed from largest to smallest; if 'degree', they are listed in order of the number of categories intersected.
- `sort_categories_by`: {'cardinality', None}. Whether to sort the categories by total cardinality, or leave them in the provided order.
References:
https://upsetplot.readthedocs.io/en/stable/api.html
function plot_enrichment
plot_enrichment(
data: DataFrame,
x: str,
y: str,
s: str,
hue='Q',
xlabel=None,
ylabel='significance\n(-log10(Q))',
size: int = None,
color: str = None,
annots_side: int = 5,
annots_side_labels=None,
coff_fdr: float = None,
xlim: tuple = None,
xlim_off: float = 0.2,
ylim: tuple = None,
ax: Axes = None,
break_pt: int = 25,
annot_coff_fdr: bool = False,
kws_annot: dict = {'loc': 'right', 'offx3': 0.15},
returns='ax',
**kwargs
) → Axes
Plot enrichment stats.
Args:
- <b>`data`</b> (pd.DataFrame): input data.
- <b>`x`</b> (str): x column.
- <b>`y`</b> (str): y column.
- <b>`s`</b> (str): size column.
- <b>`size`</b> (int, optional): size of the points. Defaults to None.
- <b>`color`</b> (str, optional): color of the points. Defaults to None.
- <b>`annots_side`</b> (int, optional): how many labels to show on side. Defaults to 5.
- <b>`coff_fdr`</b> (float, optional): FDR cutoff. Defaults to None.
- <b>`xlim`</b> (tuple, optional): x-axis limits. Defaults to None.
- <b>`xlim_off`</b> (float, optional): x-offset on limits. Defaults to 0.2.
- <b>`ylim`</b> (tuple, optional): y-axis limits. Defaults to None.
- <b>`ax`</b> (plt.Axes, optional): `plt.Axes` object. Defaults to None.
- <b>`break_pt`</b> (int, optional): break point (' ') for the labels. Defaults to 25.
- <b>`annot_coff_fdr`</b> (bool, optional): show FDR cutoff. Defaults to False.
- <b>`kws_annot`</b> (dict, optional): parameters provided to the `annot_side` function. Defaults to dict( loc='right', annot_count_max=5, offx3=0.15, ).
Keyword Args:
- <b>`kwargs`</b>: parameters provided to the `sns.scatterplot` function.
Returns:
- <b>`plt.Axes`</b>: `plt.Axes` object.
function plot_pie
plot_pie(
counts: list,
labels: list,
scales_line_xy: tuple = (1.1, 1.1),
remove_wedges: list = None,
remove_wedges_index: list = [],
line_color: str = 'k',
annot_side: bool = False,
kws_annot_side: dict = {},
ax: Axes = None,
**kws_pie
) → Axes
Pie plot.
Args:
- <b>`counts`</b> (list): counts.
- <b>`labels`</b> (list): labels.
- <b>`scales_line_xy`</b> (tuple, optional): scales for the lines. Defaults to (1.1, 1.1).
- <b>`remove_wedges`</b> (list, optional): wedge/s to remove. Defaults to None.
- <b>`remove_wedges_index`</b> (list, optional): wedge/s to remove, by index. Defaults to [].
- <b>`line_color`</b> (str, optional): line color. Defaults to 'k'.
- <b>`annot_side`</b> (bool, optional): annotations on the side, using the `annot_side` function. Defaults to False.
- <b>`kws_annot_side`</b> (dict, optional): keyword arguments provided to the `annot_side` function. Defaults to {}.
- <b>`ax`</b> (plt.Axes, optional): subplot. Defaults to None.
Keyword Args:
- <b>`kws_pie`</b>: keyword arguments provided to the `pie` chart function.
Returns:
- <b>`plt.Axes`</b>: subplot.
References:
https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html
module roux.vizi
module roux.workflow.checks
For workflow checks.
function grep
grep(p, checks, exclude=[], exclude_str=[], verbose=True)
Get the output of grep as a list of strings.
module roux.workflow.df
For management of tables.
function exclude_items
exclude_items(df1: DataFrame, metadata: dict) → DataFrame
Exclude items from the table with the workflow info.
Args:
- <b>`df1`</b> (pd.DataFrame): input table.
- <b>`metadata`</b> (dict): metadata of the repository.
Returns:
- <b>`pd.DataFrame`</b>: output table.
module roux.workflow.function
For function management.
function get_quoted_path
get_quoted_path(s1: str) → str
Quoted paths.
Args:
- <b>`s1`</b> (str): path.
Returns:
- <b>`str`</b>: quoted path.
function get_path
get_path(
s: str,
validate: bool,
prefixes=['data/', 'metadata/', 'plot/'],
test=False
) → str
Extract paths from a line of code.
Args:
- <b>`s`</b> (str): line of code.
- <b>`validate`</b> (bool): validate the output.
- <b>`prefixes`</b> (list, optional): allowed prefixes. Defaults to ['data/', 'metadata/', 'plot/'].
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
Returns:
- <b>`str`</b>: path.
TODOs: 1. Use wildcards i.e. *'s.
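A heuristic like the one `get_path` documents can be sketched as follows. This is illustrative only: the regex and the tuple-based prefix filter are assumptions, not roux's implementation.

```python
import re

def extract_paths(line: str, prefixes=("data/", "metadata/", "plot/")):
    # Pull quoted strings out of a line of code and keep those that
    # start with one of the allowed prefixes.
    quoted = re.findall(r"['\"]([^'\"]+)['\"]", line)
    return [s for s in quoted if s.startswith(prefixes)]

paths = extract_paths("df = pd.read_csv('data/input.tsv', sep=',')")
print(paths)  # ['data/input.tsv']
```

Note that `str.startswith` accepts a tuple of prefixes, which keeps the filter a one-liner.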
function remove_dirs_from_outputs
remove_dirs_from_outputs(outputs: list, test: bool = False) → list
Remove directories from the output paths.
Args:
- <b>`outputs`</b> (list): output paths.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
Returns:
- <b>`list`</b>: paths.
function get_ios
get_ios(l: list, test=False) → tuple
Get input and output (IO) paths.
Args:
- <b>`l`</b> (list): list of lines of code.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
Returns:
- <b>`tuple`</b>: paths of the inputs and outputs.
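The input/output classification that `get_ios` performs can be sketched with a simple heuristic: quoted paths on lines calling read-like functions are inputs, those on lines calling write-like (`to_`) functions are outputs. This heuristic is a hypothetical stand-in, not roux's actual logic.

```python
import re

def get_ios_sketch(lines):
    # Classify quoted paths found in lines of code: read-like calls
    # contribute inputs, write-like ("to_") calls contribute outputs.
    inputs, outputs = [], []
    for line in lines:
        paths = [q for q in re.findall(r"['\"]([^'\"]+)['\"]", line) if "/" in q]
        if "read_" in line:
            inputs += paths
        elif "to_" in line:
            outputs += paths
    return inputs, outputs

ins, outs = get_ios_sketch([
    "df = read_table('data/in.tsv')",
    "to_table(df, 'data/out.tsv')",
])
print(ins, outs)  # ['data/in.tsv'] ['data/out.tsv']
```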
function get_name
get_name(s: str, i: int, sep_step: str = '## step') → str
Get name of the function.
Args:
- <b>`s`</b> (str): lines in markdown format.
- <b>`i`</b> (int): index of the step.
- <b>`sep_step`</b> (str, optional): separator marking the start of a step. Defaults to "## step".
Returns:
- <b>`str`</b>: name of the function.
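Deriving a function name from a markdown step heading could look like the sketch below. The exact naming rules (snake_casing, the numbered fallback) are assumptions for illustration, not roux's implementation.

```python
def step_heading_to_name(s: str, i: int, sep_step: str = "## step") -> str:
    # Strip the step separator and build a snake_case function name;
    # fall back to a numbered default for unnamed steps.
    text = s.replace(sep_step, "").strip()
    return "_".join(text.lower().split()) if text else f"step{i:02d}"

print(step_heading_to_name("## step Load input tables", 1))  # load_input_tables
print(step_heading_to_name("## step", 2))                    # step02
```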
function get_step
get_step(
l: list,
name: str,
sep_step: str = '## step',
sep_step_end: str = '## tests',
test=False,
tab=' '
) → dict
Get code for a step.
Args:
- <b>`l`</b> (list): list of lines of code.
- <b>`name`</b> (str): name of the function.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
- <b>`tab`</b> (str, optional): tab format. Defaults to ' '.
Returns:
- <b>`dict`</b>: step name to code map.
function to_task
to_task(
notebookp,
task=None,
sep_step: str = '## step',
sep_step_end: str = '## tests',
notebook_suffix: str = '_v',
force=False,
validate=False,
path_prefix=None,
verbose=True,
test=False
) → str
Get the lines of code for a task (a script to be saved as an individual `.py` file).
Args:
- <b>`notebookp`</b> (str): path of the notebook.
- <b>`sep_step`</b> (str, optional): separator marking the start of a step. Defaults to "## step".
- <b>`sep_step_end`</b> (str, optional): separator marking the end of a step. Defaults to "## tests".
- <b>`notebook_suffix`</b> (str, optional): suffix of the notebook file to be considered as a "task".
- <b>`force`</b> (bool, optional): overwrite the output. Defaults to False.
- <b>`validate`</b> (bool, optional): validate the output. Defaults to False.
- <b>`path_prefix`</b> (str, optional): prefix to the path. Defaults to None.
- <b>`verbose`</b> (bool, optional): show verbose output. Defaults to True.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
Returns:
- <b>`str`</b>: lines of the code.
function get_global_imports
get_global_imports() → DataFrame
Get the metadata of the functions imported by `from roux.global_imports import *`.
module roux.workflow.io
For input/output of workflow.
function clear_variables
clear_variables(dtype=None, variables=None)
Clear dataframes from the workspace.
function clear_dataframes
clear_dataframes()
function to_py
to_py(
notebookp: str,
pyp: str = None,
force: bool = False,
**kws_get_lines
) → str
To python script (.py).
Args:
- <b>`notebookp`</b> (str): path to the notebook.
- <b>`pyp`</b> (str, optional): path to the python file. Defaults to None.
- <b>`force`</b> (bool, optional): overwrite the output. Defaults to False.
Returns:
- <b>`str`</b>: path of the output.
function to_nb_cells
to_nb_cells(notebook, outp, new_cells, validate_diff=None)
Replace notebook cells.
function import_from_file
import_from_file(pyp: str)
Import functions from a python (`.py`) file.
Args:
- <b>`pyp`</b> (str): python file (`.py`).
function infer_parameters
infer_parameters(input_value, default_value)
Infer the input values and post warning messages.
Parameters:
- <b>`input_value`</b>: the primary value.
- <b>`default_value`</b>: the default/alternative/inferred value.
Returns:
Inferred value.
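The documented behavior of `infer_parameters` amounts to "use the provided value if given, otherwise fall back to the default and warn". A minimal sketch of that assumed semantics:

```python
import logging

def infer_parameters_sketch(input_value, default_value):
    # Use the provided value when given; otherwise fall back to the
    # default/inferred value and post a warning message.
    if input_value is None:
        logging.warning("input not provided; using inferred value: %s", default_value)
        return default_value
    return input_value

print(infer_parameters_sketch(None, "metadata.yaml"))          # metadata.yaml
print(infer_parameters_sketch("config.yaml", "metadata.yaml")) # config.yaml
```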
function to_parameters
to_parameters(f: object, test: bool = False) → dict
Get function to parameters map.
Args:
- <b>`f`</b> (object): function.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
Returns:
- <b>`dict`</b>: output.
function read_config
read_config(
p: str,
config_base=None,
inputs=None,
append_to_key=None,
convert_dtype: bool = True,
verbose: bool = True
)
Read configuration.
Parameters:
- <b>`p`</b> (str): input path.
- <b>`config_base`</b>: base config with the inputs for the interpolations.
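roux's workflow extras list omegaconf, which handles this kind of config interpolation; as a minimal stand-in for the idea of filling placeholders in config values from a base mapping, here is a sketch using only `str.format` (illustrative, not roux's implementation):

```python
def interpolate_config(config: dict, inputs: dict) -> dict:
    # Fill '{key}' placeholders in string values from a base mapping;
    # non-string values are passed through unchanged.
    return {
        key: (value.format(**inputs) if isinstance(value, str) else value)
        for key, value in config.items()
    }

cfg = interpolate_config(
    {"output_path": "data/{species}/out.tsv", "threads": 4},
    {"species": "yeast"},
)
print(cfg["output_path"])  # data/yeast/out.tsv
```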
function read_metadata
read_metadata(
p: str,
ind: str = None,
max_paths: int = 30,
config_path_key: str = 'config_path',
config_paths: list = [],
config_paths_auto=False,
verbose: bool = False,
**kws_read_config
) → dict
Read metadata.
Args:
- <b>`p`</b> (str, optional): file containing the metadata. Defaults to './metadata.yaml'.
- <b>`ind`</b> (str, optional): directory containing specific settings and other data to be incorporated into the metadata. Defaults to './metadata/'.
Returns:
- <b>`dict`</b>: output.
function to_workflow
to_workflow(df2: DataFrame, workflowp: str, tab: str = ' ') → str
Save workflow file.
Args:
- <b>`df2`</b> (pd.DataFrame): input table.
- <b>`workflowp`</b> (str): path of the workflow file.
- <b>`tab`</b> (str, optional): tab format. Defaults to ' '.
Returns:
- <b>`str`</b>: path of the workflow file.
function create_workflow_report
create_workflow_report(workflowp: str, env: str) → int
Create report for the workflow run.
Parameters:
- <b>`workflowp`</b> (str): path of the workflow file (`snakemake`).
- <b>`env`</b> (str): name of the conda virtual environment where the required workflow dependency, i.e. `snakemake`, is available.
function replacestar
replacestar(
input_path,
output_path=None,
replace_from='from roux.global_imports import *',
in_place: bool = False,
attributes={'pandarallel': ['parallel_apply'], 'rd': ['.rd.', '.log.']},
verbose: bool = False,
test: bool = False,
**kws_fix_code
)
Post-development, replace wildcard (global) import from roux i.e. 'from roux.global_imports import *' with individual imports with accompanying documentation.
Parameters:
- <b>`input_path`</b> (str): path to the .py or .ipynb file.
- <b>`output_path`</b> (str): path to the output.
- <b>`py_path`</b> (str): path to the intermediate .py file.
- <b>`in_place`</b> (bool): whether to carry out the modification in place.
- <b>`return_replacements`</b> (bool): return a dict with the strings to be replaced.
- <b>`attributes`</b> (dict): attribute names mapped to their keywords for searching.
- <b>`verbose`</b> (bool): verbose toggle.
- <b>`test`</b> (bool): test mode, used if the output file is not provided and in-place modification is not allowed.
Returns:
- <b>`output_path`</b> (str): path to the modified notebook.
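The core of what `replacestar` does is a text substitution: swap the wildcard import line for explicit imports. The sketch below shows only that replacement step (the real function also analyzes which attributes are actually used; the replacement import line in the example is illustrative):

```python
def replace_wildcard_import(code: str, replace_from: str, imports: list) -> str:
    # Swap the wildcard import line for a block of explicit imports.
    return code.replace(replace_from, "\n".join(imports))

src = "from roux.global_imports import *\ndf = df.rd.clean()\n"
out = replace_wildcard_import(
    src,
    "from roux.global_imports import *",
    ["import roux.lib.dfs  # hypothetical explicit import enabling the `.rd` accessor"],
)
print(out)
```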
module roux.workflow.knit
For workflow set up.
function nb_to_py
nb_to_py(
notebookp: str,
test: bool = False,
validate: bool = True,
sep_step: str = '## step',
notebook_suffix: str = '_v'
)
Convert a notebook to a script.
Args:
- <b>`notebookp`</b> (str): path to the notebook.
- <b>`sep_step`</b> (str, optional): separator marking the start of a step. Defaults to "## step".
- <b>`notebook_suffix`</b> (str, optional): suffix of the notebook file to be considered as a "task".
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
- <b>`validate`</b> (bool, optional): validate. Defaults to True.
TODOs:
1. Add a `check_outputs` parameter to only filter out non-executable code (i.e. tests) if False, else edit the code.
function sort_stepns
sort_stepns(l: list) → list
Sort steps (functions) of a task (script).
Args:
- <b>`l`</b> (list): list of steps.
Returns:
- <b>`list`</b>: sorted list of steps.
module roux.workflow.log
function print_parameters
print_parameters(d: dict)
Print a dictionary of parameters as lines of code.
Parameters:
- <b>`d`</b> (dict): dictionary of parameters.
module roux.workflow
Global Variables
- io
- log
module roux.workflow.monitor
For workflow monitors.
function plot_workflow_log
plot_workflow_log(dplot: DataFrame) → Axes
Plot workflow log.
Args:
- <b>`dplot`</b> (pd.DataFrame): input data (dparam).
Returns:
- <b>`plt.Axes`</b>: output.
TODOs:
1. Use the statistics tagged as `## stats`.
module roux.workflow.nb
For operations on jupyter notebooks.
function get_lines
get_lines(p: str, keep_comments: bool = True) → list
Get lines of code from notebook.
Args:
- <b>`p`</b> (str): path to the notebook.
- <b>`keep_comments`</b> (bool, optional): keep comments. Defaults to True.
Returns:
- <b>`list`</b>: lines.
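Extracting lines of code from a notebook reduces to walking the notebook's JSON and collecting the `source` of the code cells. A self-contained sketch of that idea (roux reads from a path via nbformat; here the notebook JSON is passed as a string for illustration):

```python
import json

def get_code_lines(nb_json: str, keep_comments: bool = True) -> list:
    # Collect lines from code cells, optionally dropping comment lines.
    nb = json.loads(nb_json)
    lines = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        for line in "".join(cell["source"]).splitlines():
            if not keep_comments and line.lstrip().startswith("#"):
                continue
            lines.append(line)
    return lines

nb = json.dumps({"cells": [
    {"cell_type": "code", "source": ["# load\n", "x = 1\n"]},
    {"cell_type": "markdown", "source": ["## step\n"]},
]})
print(get_code_lines(nb, keep_comments=False))  # ['x = 1']
```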
function read_nb_md
read_nb_md(p: str, n: int = None) → list
Read notebook's documentation in the markdown cells.
Args:
- <b>`p`</b> (str): path of the notebook.
- <b>`n`</b> (int): number of markdown cells to extract.
Returns:
- <b>`list`</b>: lines of the strings.
function to_info
to_info(p: str, outp: str, linkd: str = '') → str
Save README.md file.
Args:
- <b>`p`</b> (str): path of the notebook files that would be converted to "tasks".
- <b>`outp`</b> (str): path of the output file, e.g. 'README.md'.
Returns:
- <b>`str`</b>: path of the output file.
function to_replaced_nb
to_replaced_nb(
nb_path,
output_path,
replaces: dict = {},
cell_type: str = 'code',
drop_lines_with_substrings: list = None,
test=False
)
Replace text in a jupyter notebook.
Parameters:
- <b>`nb`</b>: notebook object obtained from `nbformat.reads`.
- <b>`replaces`</b> (dict): mapping of the text to 'replace from' to the one to 'replace with'.
- <b>`cell_type`</b> (str): the type of the cell.
Returns:
- <b>`new_nb`</b>: notebook object.
function to_filtered_nb
to_filtered_nb(
p: str,
outp: str,
header: str,
kind: str = 'include',
validate_diff: int = None
)
Filter sections in a notebook based on markdown headings.
Args:
- <b>`header`</b> (str): exact first line of a markdown cell marking a section in a notebook.
- <b>`validate_diff`</b>
function to_filter_nbby_patterns
to_filter_nbby_patterns(p, outp, patterns=None, **kws)
Filter out notebook cells if the pattern string is found.
Args:
- <b>`patterns`</b> (list): list of string patterns.
function to_clear_unused_cells
to_clear_unused_cells(
notebook_path,
new_notebook_path,
validate_diff: int = None
)
Remove code cells with all lines commented.
function to_clear_outputs
to_clear_outputs(notebook_path, new_notebook_path)
function to_filtered_outputs
to_filtered_outputs(input_path, output_path, warnings=True, strings=True)
function to_diff_notebooks
to_diff_notebooks(
notebook_paths,
url_prefix='https://localhost:8888/nbdime/difftool?',
remove_prefix='file://',
verbose=True
) → list
"Diff" notebooks using nbdiff
(https://nbdime.readthedocs.io/en/latest/)
Start the nb-diff session by running: nbdiff-web
Todos: 1. Deprecate if functionality added to nbdiff-web
.
module roux.workflow.task
For task management.
function run_task
run_task(
parameters: dict,
input_notebook_path: str,
kernel: str = None,
output_notebook_path: str = None,
test=False,
verbose=False,
force=False,
**kws_papermill
) → str
Run a single task.
Parameters:
- <b>`parameters`</b> (dict): parameters, including the `output_path`s.
- <b>`input_notebook_path`</b> (str): path to the input notebook which is parameterized.
- <b>`kernel`</b> (str): kernel to be used.
- <b>`output_notebook_path`</b> (str): path to the output notebook which is used as a report.
- <b>`test`</b> (bool): test mode.
- <b>`verbose`</b> (bool): verbose.
Keyword parameters:
- <b>`kws_papermill`</b>: parameters provided to the `pm.execute_notebook` function.
Returns:
Output path.
function run_tasks
run_tasks(
input_notebook_path: str,
kernel: str = None,
inputs: list = None,
output_path_base: str = None,
parameters_list: list = None,
fast: bool = False,
fast_workers: int = 6,
to_filter_nbby_patterns_kws=None,
input_notebook_temp_path=None,
out_paths: bool = False,
test1: bool = False,
force: bool = False,
test: bool = False,
verbose: bool = False,
**kws_papermill
) → list
Run a list of tasks.
Parameters:
- <b>`input_notebook_path`</b> (str): path to the input notebook which is parameterized.
- <b>`kernel`</b> (str): kernel to be used.
- <b>`inputs`</b> (list): list of parameters without the output paths, which would be inferred by encoding.
- <b>`output_path_base`</b> (str): output path with a placeholder, e.g. 'path/to/{KEY}/file'.
- <b>`parameters_list`</b> (list): list of parameters including the output paths.
- <b>`fast`</b> (bool): enable parallel processing.
- <b>`fast_workers`</b> (int): number of parallel processes.
- <b>`force`</b> (bool): overwrite the outputs.
- <b>`test`</b> (bool): test mode.
- <b>`verbose`</b> (bool): verbose.
Keyword parameters:
- <b>`kws_papermill`</b>: parameters provided to the `pm.execute_notebook` function.
- <b>`to_filter_nbby_patterns_kws`</b> (dict): dictionary containing parameters to be provided to the `to_filter_nbby_patterns` function. Defaults to None.
Returns:
- <b>`parameters_list`</b> (list): list of parameters including the output paths, inferred if not provided.
TODOs:
0. Ignore temporary parameters, e.g. `test`, `verbose` etc., while encoding the inputs.
1. Integrate with `apply_on_paths` for parallel processing etc.
Notes:
- To resolve `RuntimeError: This event loop is already running in python` from `multiprocessing`, execute: `import nest_asyncio; nest_asyncio.apply()`
module roux.workflow.version
For version control.
function git_commit
git_commit(repop: str, suffix_message: str = '', force=False)
Version control.
Args:
- <b>`repop`</b> (str): path to the repository.
- <b>`suffix_message`</b> (str, optional): add a suffix to the version (commit) message. Defaults to ''.
module roux.workflow.workflow
For workflow management.
function get_scripts
get_scripts(
ps: list,
notebook_prefix: str = '\\d{2}',
notebook_suffix: str = '_v\\d{2}',
test: bool = False,
fast: bool = True,
cores: int = 6,
force: bool = False,
tab: str = ' ',
**kws
) → DataFrame
Get scripts.
Args:
- <b>`ps`</b> (list): paths.
- <b>`notebook_prefix`</b> (str, optional): prefix of the notebook file to be considered as a "task".
- <b>`notebook_suffix`</b> (str, optional): suffix of the notebook file to be considered as a "task".
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
- <b>`fast`</b> (bool, optional): parallel processing. Defaults to True.
- <b>`cores`</b> (int, optional): cores to use. Defaults to 6.
- <b>`force`</b> (bool, optional): overwrite the outputs. Defaults to False.
- <b>`tab`</b> (str, optional): tab in spaces. Defaults to ' '.
Returns:
- <b>`pd.DataFrame`</b>: output table.
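The filename convention implied by the `notebook_prefix`/`notebook_suffix` defaults (`\d{2}` and `_v\d{2}`) can be checked with a small regex. This is a sketch of the convention; roux's matching may differ:

```python
import re

def is_task_notebook(path: str, prefix=r"\d{2}", suffix=r"_v\d{2}") -> bool:
    # A "task" notebook's filename starts with the prefix pattern and
    # ends with the suffix pattern before the .ipynb extension.
    name = path.rsplit("/", 1)[-1]
    return re.fullmatch(prefix + r".*" + suffix + r"\.ipynb", name) is not None

print(is_task_notebook("notebooks/01_analysis_v01.ipynb"))  # True
print(is_task_notebook("notebooks/scratch.ipynb"))          # False
```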
function to_scripts
to_scripts(
packagep: str,
notebooksdp: str,
validate: bool = False,
ps: list = None,
notebook_prefix: str = '\\d{2}',
notebook_suffix: str = '_v\\d{2}',
scripts: bool = True,
workflow: bool = True,
sep_step: str = '## step',
todos: bool = False,
git: bool = True,
clean: bool = False,
test: bool = False,
force: bool = True,
tab: str = ' ',
**kws
)
Convert notebooks to scripts.
Args:
- <b>`packagep`</b> (str): path to the package.
- <b>`notebooksdp`</b> (str, optional): path to the notebooks. Defaults to None.
- <b>`validate`</b> (bool, optional): validate if the functions are formatted correctly. Defaults to False.
- <b>`ps`</b> (list, optional): paths. Defaults to None.
- <b>`notebook_prefix`</b> (str, optional): prefix of the notebook file to be considered as a "task".
- <b>`notebook_suffix`</b> (str, optional): suffix of the notebook file to be considered as a "task".
- <b>`scripts`</b> (bool, optional): make scripts. Defaults to True.
- <b>`workflow`</b> (bool, optional): make the workflow file. Defaults to True.
- <b>`sep_step`</b> (str, optional): separator marking the start of a step. Defaults to "## step".
- <b>`todos`</b> (bool, optional): show todos. Defaults to False.
- <b>`git`</b> (bool, optional): save the version. Defaults to True.
- <b>`clean`</b> (bool, optional): clean the temporary files. Defaults to False.
- <b>`test`</b> (bool, optional): test mode. Defaults to False.
- <b>`force`</b> (bool, optional): overwrite the outputs. Defaults to True.
- <b>`tab`</b> (str, optional): tab size. Defaults to ' '.
Keyword parameters:
- <b>`kws`</b>: parameters provided to the `get_script` function, including `sep_step` and `sep_step_end`.
TODOs:
1. For version control, use https://github.com/jupyterlab/jupyterlab-git.