Extra Functional Tools beyond Standard and Third-Party Libraries
Extra functional tools that go beyond standard libraries `itertools`, `functools`, etc. and popular third-party libraries like `toolz` and `fancy`.
- Like `toolz`, most of the tools are designed to be efficient, pure, and lazy.
This library is under active development, and new functions will be added on a regular basis.
- Any idea or contribution is highly welcome.
Installation
This package is available on PyPI. Just use `pip3 install -U extratools` to install it.
Available Tools
Please check the individual source files for details.
seqtools
Tools for matching sequences (including strings), with or without gaps allowed between matching items. Note that an empty sequence is always a sub-sequence of any other sequence.
- `findsubseq(a, b)` returns the first position where `a` is a sub-sequence of `b`, or `-1` when not found.
- `issubseq(a, b)` checks if `a` is a sub-sequence of `b`.
- `findsubseqwithgap(a, b)` returns the matching positions where `a` is a sub-sequence of `b`, where gaps are allowed, or `None` when not found.
- `issubseqwithgap(a, b)` checks if `a` is a sub-sequence of `b`, where gaps are allowed.
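To make the difference between the two matching modes concrete, here is a minimal sketch in plain Python. It is not the library's implementation; the names `find_subseq_sketch` and `find_subseq_with_gap_sketch` are hypothetical, and the greedy "earliest position" choice for gapped matches is an assumption.

```python
def find_subseq_sketch(a, b):
    """Return the first index where `a` occurs contiguously in `b`, or -1."""
    a, b = list(a), list(b)
    # An empty `a` matches at position 0, in line with the note above.
    for start in range(len(b) - len(a) + 1):
        if b[start:start + len(a)] == a:
            return start
    return -1


def find_subseq_with_gap_sketch(a, b):
    """Return positions in `b` matching the items of `a` in order, or None."""
    b = list(b)
    positions = []
    pos = 0
    for item in a:
        # Assumption: greedily take the earliest remaining occurrence.
        while pos < len(b) and b[pos] != item:
            pos += 1
        if pos == len(b):
            return None
        positions.append(pos)
        pos += 1
    return positions


print(find_subseq_sketch("bc", "abcd"))           # contiguous match
print(find_subseq_with_gap_sketch("bd", "abcd"))  # match with a gap
```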
Tools for comparing sequences (including strings).
- `productcmp(x, y)` compares two sequences `x` and `y` with equal length according to product order. Returns `-1` if smaller, `0` if equal, `1` if greater, and `None` if not comparable.
  - Throws an exception if `x` and `y` have different lengths.
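Product order means `x` is smaller than or equal to `y` only when every item of `x` is smaller than or equal to the item of `y` at the same position. The following is a minimal sketch of that rule, not the library's code; `product_cmp_sketch` is a hypothetical name, and raising `ValueError` is an assumption since the exception type is not stated above.

```python
def product_cmp_sketch(x, y):
    """Compare two equal-length sequences by product order."""
    if len(x) != len(y):
        # Assumption: the concrete exception type is not documented.
        raise ValueError("sequences must have equal length")
    less = any(a < b for a, b in zip(x, y))
    greater = any(a > b for a, b in zip(x, y))
    if less and greater:
        return None  # some items smaller, some greater: not comparable
    if less:
        return -1
    if greater:
        return 1
    return 0


print(product_cmp_sketch((1, 2), (2, 3)))  # every item smaller or equal
print(product_cmp_sketch((1, 3), (2, 2)))  # mixed, so not comparable
```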
Tools for sorting sequences.
- `sortedbyrank(data, ranks, reverse=False)` returns the sorted list of `data`, according to the respective rank of each individual element in `ranks`.
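One way to read this is that `ranks[i]` is the sort key for `data[i]`. A minimal sketch under that assumption follows; `sorted_by_rank_sketch` is a hypothetical name and may differ from the library in tie handling.

```python
def sorted_by_rank_sketch(data, ranks, reverse=False):
    """Sort `data` by the rank stored at the same position in `ranks`."""
    pairs = sorted(zip(ranks, data), key=lambda pair: pair[0], reverse=reverse)
    return [item for _, item in pairs]


print(sorted_by_rank_sketch(["a", "b", "c"], [3, 1, 2]))
```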
Tools for encoding/decoding sequences.
- `compress(data, key=None)` compresses the sequence by encoding consecutive identical `Item` to `(Item, Count)`, according to run-length encoding.
  - Different from `itertools.compress`.

    list(compress([1, 2, 2, 3, 3, 3, 4, 4, 4, 4]))
    # [(1, 1), (2, 2), (3, 3), (4, 4)]

- `decompress(data)` decompresses the sequence by decoding `(Item, Count)` to consecutive identical `Item`, according to run-length encoding.
- `todeltas(data, op=operator.sub)` compresses the sequence by encoding the difference between previous and current items, according to delta encoding.
  - For a custom item type, either define the `-` operator or specify the `op` function computing the difference.

    list(todeltas([1, 2, 2, 3, 3, 3, 4, 4, 4, 4]))
    # [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

- `fromdeltas(data, op=operator.add)` decompresses the sequence by decoding the difference between previous and current items, according to delta encoding.
  - For a custom item type, either define the `+` operator or specify the `op` function merging the difference.
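Both encodings can be sketched with the standard library alone. The code below is an illustration, not the library's implementation; all four function names are hypothetical, and the `key` parameter of `compress` is left out because its exact behavior is not described above.

```python
import operator
from itertools import groupby, repeat


def rle_encode_sketch(data):
    """Yield (item, count) for each run of consecutive equal items."""
    for item, run in groupby(data):
        yield item, sum(1 for _ in run)


def rle_decode_sketch(pairs):
    """Expand (item, count) pairs back into the original items."""
    for item, count in pairs:
        yield from repeat(item, count)


def to_deltas_sketch(data, op=operator.sub):
    """Yield the first item as is, then each item minus its predecessor."""
    first = True
    prev = None
    for item in data:
        yield item if first else op(item, prev)
        prev, first = item, False


def from_deltas_sketch(deltas, op=operator.add):
    """Rebuild the items by accumulating the deltas."""
    first = True
    current = None
    for delta in deltas:
        current = delta if first else op(current, delta)
        first = False
        yield current


data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
print(list(rle_encode_sketch(data)))
print(list(to_deltas_sketch(data)))
```

All four functions are lazy generators, so a round trip such as `from_deltas_sketch(to_deltas_sketch(data))` never materializes an intermediate list.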
sortedtools
Tools for sorted sequences.
- `sortedcommon(a, b, key=None)` returns the common elements between `a` and `b`.
  - When both `a` and `b` are sorted sets with no duplicate element, equal to `sorted(set(a) & set(b))` but more efficient.
- `sortedalone(a, b, key=None)` returns the elements not in both `a` and `b`.
  - When both `a` and `b` are sorted sets with no duplicate element, equal to `sorted((set(a) | set(b)) - (set(a) & set(b)))` but more efficient.
- `sorteddiff(a, b, key=None)` returns the elements only in `a` and not in `b`.
  - When both `a` and `b` are sorted sets with no duplicate element, equal to `sorted(set(a) - set(b))` but more efficient.
- `issubsorted(a, b, key=None)` checks if `a` is a sorted sub-sequence of `b`.
  - When both `a` and `b` are sorted sets with no duplicate element, equal to `set(a) <= set(b)` but more efficient.
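The efficiency gain for sorted inputs typically comes from a single merge-style pass with two cursors instead of building hash sets. A minimal sketch of that idea, assuming sorted lists without duplicates and ignoring the `key` parameter; the function names are hypothetical and this is not the library's code.

```python
def sorted_common_sketch(a, b):
    """Yield items present in both sorted lists, in one linear pass."""
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            yield a[i]
            i += 1
            j += 1


def sorted_diff_sketch(a, b):
    """Yield items of sorted list `a` that do not appear in sorted list `b`."""
    j = 0
    for x in a:
        # Advance the cursor in `b` past everything smaller than `x`.
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            yield x


a, b = [1, 2, 3, 5], [2, 3, 4, 5]
print(list(sorted_common_sketch(a, b)))
print(list(sorted_diff_sketch(a, b)))
```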
strtools
Tools for string transformations.
- `str2grams(s, n, pad=None)` returns the ordered `n`-grams of string `s`.
  - Optional padding at the start and end can be added by specifying `pad`. `\0` is usually a safe choice for `pad` when not displaying.
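As an illustration of what ordered `n`-grams with padding can look like, here is a minimal sketch. It is not the library's implementation; `str_to_grams_sketch` is a hypothetical name, and padding each side with `n - 1` copies of `pad` is an assumption.

```python
def str_to_grams_sketch(s, n, pad=None):
    """Yield the n-grams of `s` from left to right."""
    if pad is not None:
        # Assumption: n - 1 pad characters on each side.
        s = pad * (n - 1) + s + pad * (n - 1)
    for i in range(len(s) - n + 1):
        yield s[i:i + n]


print(list(str_to_grams_sketch("abcd", 2)))
print(list(str_to_grams_sketch("ab", 2, pad="#")))
```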
Tools for checksums.
- `sha1sum(f)`, `sha256sum(f)`, `sha512sum(f)`, `md5sum(f)` compute the respective checksum, accepting string, bytes, text file object, and binary file object.
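Accepting all four input kinds can be sketched on top of `hashlib`. This is an assumption-laden illustration rather than the library's code: `sha256_hex_sketch` is a hypothetical name, a string argument is treated as content (not as a path), and text is encoded as UTF-8.

```python
import hashlib
import io


def sha256_hex_sketch(f):
    """Return the SHA-256 hex digest of a str, bytes, or file object."""
    digest = hashlib.sha256()
    if isinstance(f, str):
        digest.update(f.encode("utf-8"))  # assumption: UTF-8 text
    elif isinstance(f, (bytes, bytearray)):
        digest.update(f)
    else:
        # File object: read in chunks so large files are not loaded at once.
        while True:
            chunk = f.read(65536)
            if not chunk:
                break
            if isinstance(chunk, str):
                chunk = chunk.encode("utf-8")
            digest.update(chunk)
    return digest.hexdigest()


# All four input kinds yield the same digest for the same content.
print(sha256_hex_sketch("abc") == sha256_hex_sketch(io.BytesIO(b"abc")))
```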
Tools for string matching.
- `tagstats(tags, lines, separator=None)` efficiently computes the number of lines containing each tag.
  - TagStats is used for efficient computation, where the common prefixes among tags are matched only once.
  - `separator` is a regex to tokenize each string. By default, when `separator` is `None`, each string is not tokenized.

    tagstats(
        ["a b", "a c", "b c"],
        ["a b c", "b c d", "c d e"]
    )
    # {'a b': 1, 'a c': 0, 'b c': 2}
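For reference, the counts in the example above match a naive substring check of every tag against every line. The sketch below shows only that baseline, assuming no tokenization; it does not share prefix matching across tags the way TagStats does, and `tag_stats_naive` is a hypothetical name.

```python
def tag_stats_naive(tags, lines):
    """Count, for each tag, the lines that contain it as a substring."""
    return {tag: sum(1 for line in lines if tag in line) for tag in tags}


print(tag_stats_naive(
    ["a b", "a c", "b c"],
    ["a b c", "b c d", "c d e"]
))
```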
dicttools
Tools for inverting dictionaries.
- `invertdict(d)` inverts `(Key, Value)` pairs to `(Value, Key)`.
  - If multiple keys share the same value, the inverted dictionary keeps the last of the respective keys.
- `invertdict_multiple(d)` inverts `(Key, List[Value])` pairs to `(Value, Key)`.
  - If multiple keys share the same value, the inverted dictionary keeps the last of the respective keys.
- `invertdict_safe(d)` inverts `(Key, Value)` pairs to `(Value, List[Key])`.
  - If multiple keys share the same value, the inverted dictionary keeps a list of all the respective keys.
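The three inversion behaviors can be sketched with dictionary comprehensions and `setdefault`. These are illustrative equivalents with hypothetical names, not the library's code; "last" here means last in dictionary iteration order.

```python
def invert_sketch(d):
    """(key, value) -> (value, key); later keys overwrite earlier ones."""
    return {value: key for key, value in d.items()}


def invert_multiple_sketch(d):
    """(key, [values]) -> (value, key); later keys overwrite earlier ones."""
    return {value: key for key, values in d.items() for value in values}


def invert_safe_sketch(d):
    """(key, value) -> (value, [keys]); no key is lost."""
    inverted = {}
    for key, value in d.items():
        inverted.setdefault(value, []).append(key)
    return inverted


d = {"a": 1, "b": 1, "c": 2}
print(invert_sketch(d))
print(invert_safe_sketch(d))
```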
Tools for remapping elements.
- `remap(data, mapping, key=None)` remaps each unique element in `data` according to function `key`.
  - `mapping` is a dictionary recording all the mappings, optionally containing previous mappings to reuse.
  - By default, `key` returns integers starting from `0`.

    wordmap = {}
    db = [list(remap(doc, wordmap)) for doc in docs]
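The example above relies on `wordmap` being shared across documents, so the same word always gets the same integer. A minimal sketch of the default behavior, assuming new identifiers are assigned as the current size of `mapping`; `remap_sketch` and the sample `docs` are hypothetical.

```python
def remap_sketch(data, mapping):
    """Yield an integer id per item, recording new ids in `mapping`."""
    for item in data:
        if item not in mapping:
            # Assumption: the next id is the number of items seen so far.
            mapping[item] = len(mapping)
        yield mapping[item]


docs = [["a", "b", "a"], ["b", "c"]]  # hypothetical sample data
wordmap = {}
db = [list(remap_sketch(doc, wordmap)) for doc in docs]
print(db)
print(wordmap)
```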
settools
Tools for set operations.
- `addtoset(s, x)` checks whether adding `x` to set `s` is successful.
Tools for set similarities.
- `jaccard(a, b)` computes the Jaccard similarity between two sets `a` and `b`.
- `multisetjaccard(a, b)` computes the Jaccard similarity between two multi-sets (Counters) `a` and `b`.
- `weightedjaccard(a, b, key=sum)` computes the weighted Jaccard similarity between two sets `a` and `b`, using function `key` to compute the total weight of the elements within a set.
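The three similarities share one shape: a measure of the intersection divided by the same measure of the union. The sketch below illustrates this with hypothetical function names. It is not the library's code; the min/max definition for multi-sets and the value `1.0` for two empty inputs are assumptions.

```python
from collections import Counter


def jaccard_sketch(a, b):
    """|a & b| / |a | b| for plain sets."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def multiset_jaccard_sketch(a, b):
    """Counter version: & takes minimum counts, | takes maximum counts."""
    a, b = Counter(a), Counter(b)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


def weighted_jaccard_sketch(a, b, key=sum):
    """Weight of the intersection over weight of the union."""
    a, b = set(a), set(b)
    return key(a & b) / key(a | b)


print(jaccard_sketch({1, 2, 3}, {2, 3, 4}))
print(multiset_jaccard_sketch("aab", "abb"))
print(weighted_jaccard_sketch({1, 2, 3}, {2, 3, 4}))
```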
tabletools
Tools for tables.
- `transpose(data)` returns the transpose of table `data`, i.e., switches rows and columns.
  - Useful to switch table `data` from row-based to column-based and back.

    transpose([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ])
    # [[1, 4, 7],
    #  [2, 5, 8],
    #  [3, 6, 9]]

- `loadcsv(path)` loads a CSV file, from either a file path or a file object.
- `dumpcsv(path, data)` dumps a table `data` in CSV, to either a file path or a file object.
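For a rectangular table, transposition is a one-liner around `zip`. A minimal sketch follows; `transpose_sketch` is a hypothetical name, and note that `zip` silently truncates ragged rows, which may differ from the library's behavior.

```python
def transpose_sketch(data):
    """Return the columns of a rectangular table as a list of rows."""
    return [list(column) for column in zip(*data)]


table = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(transpose_sketch(table))
```

Applying the function twice returns the original rectangular table, which is what makes it convenient for switching between row-based and column-based layouts.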
mathtools
Tools for math.
- `safediv(a, b)` avoids the division by zero exception by returning infinity with the proper sign.
  - Closely follows IEEE Standard 754.
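A minimal sketch of the idea, not the library's implementation: `safe_div_sketch` is a hypothetical name, and returning NaN for `0 / 0` is an assumption taken from IEEE 754 rather than from the description above.

```python
import math


def safe_div_sketch(a, b):
    """Divide a by b, mapping division by zero to signed infinity."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0:
            return math.nan  # assumption: 0 / 0 follows IEEE 754
        # The sign of the result combines the signs of both operands,
        # so dividing by -0.0 flips the sign.
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)


print(safe_div_sketch(1, 0))
print(safe_div_sketch(-1, 0))
print(safe_div_sketch(1, -0.0))
```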
stattools
Tools for statistics.
- `medianabsdev(data)` computes the median absolute deviation of a list of floats.
- `entropy(data)` computes the entropy of a list of any items.
  - You can also pass a dictionary of `(item, frequency)` as frequency distribution to `data`.
- `histogram(thresholds, data)` computes the histogram over all the floats in `data`.
  - The search space is divided by the thresholds of bins specified in `thresholds`.
  - Each bin of the histogram is labelled by its lower threshold.
    - All values in the bin are no less than the current threshold and less than the next threshold.
    - The first bin is always labelled by `-infinity`.

    histogram(
        [0.1, 0.5, 0.8, 0.9],
        [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
    )
    # {-inf: 1, 0.1: 4, 0.5: 3, 0.8: 1, 0.9: 2}
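The binning rule described above (lower threshold inclusive, next threshold exclusive) maps directly onto `bisect.bisect_right`. The following is a minimal sketch under the assumption that `thresholds` is sorted in ascending order; `histogram_sketch` is a hypothetical name, and whether the library reports empty bins is not stated above (this sketch does).

```python
import bisect
import math


def histogram_sketch(thresholds, data):
    """Count values per bin, labelling each bin by its lower threshold."""
    labels = [-math.inf] + list(thresholds)
    counts = {label: 0 for label in labels}
    for x in data:
        # bisect_right counts the thresholds <= x, which is exactly the
        # index of the bin label in `labels`.
        counts[labels[bisect.bisect_right(thresholds, x)]] += 1
    return counts


print(histogram_sketch(
    [0.1, 0.5, 0.8, 0.9],
    [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
))
```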
disjointsets
Disjoint sets with path compression, based largely on this implementation. After `d = DisjointSets()`:
- `d.add(x)` adds a new disjoint set containing `x`.
- `d[x]` returns the representing element of the disjoint set containing `x`.
- `d.disjoints()` returns all the representing elements and their respective disjoint sets.
- `d.union(*xs)` unions all the elements in `xs` into a single disjoint set.
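To show how path compression fits behind this interface, here is a minimal union-find sketch. It is not the library's code; `DisjointSetsSketch` is a hypothetical class, it requires elements to be added before they are looked up or merged, and returning a dictionary from `disjoints()` is an assumption.

```python
class DisjointSetsSketch:
    def __init__(self):
        self._parent = {}

    def add(self, x):
        """Create a singleton set for x if it is not known yet."""
        self._parent.setdefault(x, x)

    def __getitem__(self, x):
        """Return the representative of x, compressing the path to it."""
        root = x
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path at the root.
        while self._parent[x] != root:
            next_node = self._parent[x]
            self._parent[x] = root
            x = next_node
        return root

    def union(self, *xs):
        """Merge the sets containing all elements of xs."""
        roots = [self[x] for x in xs]
        for root in roots[1:]:
            self._parent[root] = roots[0]

    def disjoints(self):
        """Return {representative: set of members}."""
        groups = {}
        for x in list(self._parent):
            groups.setdefault(self[x], set()).add(x)
        return groups


d = DisjointSetsSketch()
for x in range(5):
    d.add(x)
d.union(0, 1)
d.union(1, 2, 3)
print(d.disjoints())
```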
misctools
Tools for miscellaneous purposes.
- `cmp(a, b)` restores the useful `cmp` function previously in Python 2.
  - Implemented according to What's New in Python 3.0.
- `parsebool(s)` parses a string to a boolean, which is true if its lowercase form equals `1`, `true`, or `yes`.
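What's New in Python 3.0 suggests the expression `(a > b) - (a < b)` as a replacement for `cmp`. The sketch below uses it, together with a straightforward reading of the `parsebool` rule; both function names are hypothetical and this is not the library's code.

```python
def cmp_sketch(a, b):
    """Return -1, 0, or 1, like Python 2's built-in cmp."""
    return (a > b) - (a < b)


def parse_bool_sketch(s):
    """True only if the lowercased string is '1', 'true', or 'yes'."""
    return s.lower() in {"1", "true", "yes"}


print(cmp_sketch(1, 2), cmp_sketch(2, 2), cmp_sketch(3, 2))
print(parse_bool_sketch("Yes"), parse_bool_sketch("no"))
```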
printtools
Tools for non-functional but useful printing purposes.
- `print2(*args, **kwargs)` redirects the output of `print` to standard error.
  - The same parameters are accepted.
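A wrapper of this kind can be sketched in a few lines. `print_err_sketch` is a hypothetical name, and letting an explicit `file=` argument override standard error is an assumption about how "the same parameters" are handled.

```python
import sys


def print_err_sketch(*args, **kwargs):
    """Like print, but writes to standard error unless file= is given."""
    # Look up sys.stderr at call time so later redirections are honored.
    kwargs.setdefault("file", sys.stderr)
    print(*args, **kwargs)


print_err_sketch("warning:", "something happened", sep=" ")
```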
debugtools
Tools for non-functional but useful debugging purposes.
- `stopwatch()` returns both the duration since program start and the duration since last call in seconds.
  - Technically, the stopwatch starts when `debugtools` is imported.
- `peakmem()` returns the peak memory usage since program start.
  - In bytes on macOS, and in kilobytes on Linux.
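The unit difference noted above matches the behavior of `ru_maxrss` from `resource.getrusage`, which reports bytes on macOS and kilobytes on Linux. The sketch below is one possible way to build both tools on Unix-like systems; it is not the library's implementation, the names `StopwatchSketch` and `peak_mem_bytes_sketch` are hypothetical, and normalizing the result to bytes is a design choice of this sketch.

```python
import resource  # Unix-only module
import sys
import time


class StopwatchSketch:
    """Track time since creation and since the previous call."""

    def __init__(self):
        self._start = self._last = time.perf_counter()

    def __call__(self):
        now = time.perf_counter()
        since_start, since_last = now - self._start, now - self._last
        self._last = now
        return since_start, since_last


def peak_mem_bytes_sketch():
    """Return peak resident memory of this process, normalized to bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in bytes on macOS and in kilobytes on Linux.
    return peak if sys.platform == "darwin" else peak * 1024


stopwatch = StopwatchSketch()
total, lap = stopwatch()
print(f"total={total:.6f}s lap={lap:.6f}s peak={peak_mem_bytes_sketch()} bytes")
```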