Extra Functional Tools beyond Standard and Third-Party Libraries
Extra functional tools that go beyond the standard library's itertools, functools, etc. and popular third-party libraries like toolz, funcy, and more-itertools.
- Like toolz and others, most of the tools are designed to be efficient, pure, and lazy.
- Several useful yet non-functional tools are also included.
- While toolz and others target basic scenarios, many tools in this library target more advanced and complete scenarios.
- A few useful CLI tools for respective functions are also installed. They are named extratools-[funcname].
This library is under active development, and new functions will be added on a regular basis.
- Any idea or contribution is highly welcome.
- Currently adopted by TopSim and PrefixSpan-py.
Installation
This package is available on PyPI. Just use pip3 install -U extratools to install it.
Examples
Here are three examples out of dozens of our tools.
compress(data, key=None) compresses the sequence by encoding continuous identical Item to (Item, Count), according to run-length encoding.
list(compress([1, 2, 2, 3, 3, 3, 4, 4, 4, 4]))
# [(1, 1), (2, 2), (3, 3), (4, 4)]
gaps(covered, whole=(-inf, inf)) computes the uncovered ranges of the whole range whole, given the covered ranges covered.
list(gaps(
    [(-inf, 0), (0.1, 0.2), (0.5, 0.7), (0.6, 0.9)],
    (0, 1)
))
# [(0, 0.1), (0.2, 0.5), (0.9, 1)]
flatten(data, force=False) flattens a JSON object by returning (Path, Value) tuples with each path Path from root to each value Value.
flatten(json.loads("""{
    "name": "John",
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York"
    },
    "phoneNumbers": [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "office",
            "number": "646 555-4567"
        }
    ],
    "children": [],
    "spouse": null
}"""))
# {'name': 'John',
# 'address.streetAddress': '21 2nd Street',
# 'address.city': 'New York',
# 'phoneNumbers[0].type': 'home',
# 'phoneNumbers[0].number': '212 555-1234',
# 'phoneNumbers[1].type': 'office',
# 'phoneNumbers[1].number': '646 555-4567',
# 'children': [],
# 'spouse': None}
All Available Tools
Functions:
seqtools
sortedtools
strtools
rangetools
dicttools
jsontools
settools
tabletools
mathtools
stattools
misctools
printtools
debugtools
Data Structures:
disjointsets
defaultlist
CLI Tools:
remap in dicttools
flatten in jsontools
Functions
seqtools
Tools for matching sequences (including strings), with or without gaps allowed between matching items. Note that an empty sequence is always a sub-sequence of any other sequence.
- findsubseq(a, b) returns the first position where a is a sub-sequence of b, or -1 when not found.
- issubseq(a, b) checks if a is a sub-sequence of b.
- findsubseqwithgap(a, b) returns the matching positions where a is a sub-sequence of b, where gaps are allowed, or None when not found.
- issubseqwithgap(a, b) checks if a is a sub-sequence of b, where gaps are allowed.
- nextentries(data, entries) scans the sequences in data from left to right after current entries entries, and returns each item and its respective following entries.
    - Each entry is a pair of (ID, Position) denoting the sequence ID and its respective matching position.
data = [
    s.split() for s in [
        "a b c d e",
        "b b b d e",
        "c b c c a",
        "b b b c c"
    ]
]
entries = [(0, 2), (2, 0), (3, 3)]
# the first positions of `c` among sequences.
nextentries(data, entries)
# {'d': [(0, 3)],
# 'e': [(0, 4)],
# 'b': [(2, 1)],
# 'c': [(2, 2), (3, 4)],
# 'a': [(2, 4)]}
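The difference between contiguous matching and matching with gaps can be illustrated with a small greedy scan. The following is a minimal sketch, not the library's implementation; the function names find_subseq_with_gap and is_subseq_with_gap are hypothetical and chosen only for this illustration.

```python
from typing import List, Optional, Sequence


def find_subseq_with_gap(a: Sequence, b: Sequence) -> Optional[List[int]]:
    """Greedily match each item of a to the earliest unused position in b."""
    positions = []
    start = 0
    for item in a:
        # Scan forward in b, starting right after the previous match.
        for i in range(start, len(b)):
            if b[i] == item:
                positions.append(i)
                start = i + 1
                break
        else:
            # b was exhausted without finding item.
            return None
    return positions


def is_subseq_with_gap(a: Sequence, b: Sequence) -> bool:
    return find_subseq_with_gap(a, b) is not None


print(find_subseq_with_gap("ace", "abcde"))  # [0, 2, 4]
print(find_subseq_with_gap("aec", "abcde"))  # None
print(find_subseq_with_gap("", "abcde"))     # []
```

The last call shows why an empty sequence always counts as a sub-sequence: there is nothing left to match, so the scan succeeds immediately.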
Tools for comparing sequences (including strings).
- productcmp(x, y) compares two sequences x and y with equal length according to product order. Returns -1 if smaller, 0 if equal, 1 if greater, and None if not comparable.
    - Throws an exception if x and y have different lengths.
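Product order compares sequences position by position: x is smaller than y only when no position of x exceeds the matching position of y. A minimal sketch of that rule follows; the name product_cmp is hypothetical, and raising ValueError on a length mismatch is an assumption, since the exception type is not stated above.

```python
from typing import Optional, Sequence


def product_cmp(x: Sequence, y: Sequence) -> Optional[int]:
    if len(x) != len(y):
        # Assumption: the exception type is chosen for this sketch only.
        raise ValueError("sequences must have equal length")
    less = any(a < b for a, b in zip(x, y))
    greater = any(a > b for a, b in zip(x, y))
    if less and greater:
        return None  # some positions are smaller, others greater
    if less:
        return -1
    if greater:
        return 1
    return 0


print(product_cmp((1, 2), (2, 3)))  # -1
print(product_cmp((1, 3), (2, 2)))  # None
```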
Tools for sorting sequences.
- sortedbyrank(data, ranks, reverse=False) returns the sorted list of data, according to the respective rank of each individual element in ranks.
Tools for encoding/decoding sequences.
- compress(data, key=None) compresses the sequence by encoding continuous identical Item to (Item, Count), according to run-length encoding.
    - Different from itertools.compress.
list(compress([1, 2, 2, 3, 3, 3, 4, 4, 4, 4]))
# [(1, 1), (2, 2), (3, 3), (4, 4)]
- decompress(data) decompresses the sequence by decoding (Item, Count) to continuous identical Item, according to run-length encoding.
- todeltas(data, op=operator.sub) compresses the sequence by encoding the difference between previous and current items, according to delta encoding.
    - For a custom type of item, either define the - operator or specify the op function computing the difference.
list(todeltas([1, 2, 2, 3, 3, 3, 4, 4, 4, 4]))
# [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
- fromdeltas(data, op=operator.add) decompresses the sequence by decoding the difference between previous and current items, according to delta encoding.
    - For a custom type of item, either define the + operator or specify the op function merging the difference.
sortedtools
Tools for sorted sequences.
- sortedcommon(a, b, key=None) returns the common elements between a and b.
    - When both a and b are sorted sets with no duplicate element, equal to sorted(set(a) & set(b)) but more efficient.
- sortedalone(a, b, key=None) returns the elements that appear in only one of a and b.
    - When both a and b are sorted sets with no duplicate element, equal to sorted((set(a) | set(b)) - (set(a) & set(b))) but more efficient.
- sorteddiff(a, b, key=None) returns the elements only in a and not in b.
    - When both a and b are sorted sets with no duplicate element, equal to sorted(set(a) - set(b)) but more efficient.
- issubsorted(a, b, key=None) checks if a is a sorted sub-sequence of b.
    - When both a and b are sorted sets with no duplicate element, equal to set(a) <= set(b) but more efficient.
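The efficiency gain over the set-based expressions comes from walking both sorted inputs once instead of building sets and sorting again. A minimal two-pointer sketch of the intersection case is shown below; sorted_common is a hypothetical name, the key parameter is omitted, and the inputs are assumed to be sorted lists without duplicates.

```python
def sorted_common(a, b):
    """Two-pointer intersection of two sorted, duplicate-free sequences."""
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1          # a[i] cannot appear later in b
        elif a[i] > b[j]:
            j += 1          # b[j] cannot appear later in a
        else:
            yield a[i]      # present in both
            i += 1
            j += 1


a = [1, 3, 5, 7]
b = [3, 4, 5, 8]
print(list(sorted_common(a, b)))  # [3, 5]
```

The result is produced lazily and already in order, in time linear in the combined input length.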
strtools
Tools for string transformations.
- str2grams(s, n, pad=None) returns the ordered n-grams of string s.
    - Optional padding at the start and end can be added by specifying pad. \0 is usually a safe choice for pad when not displaying.
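As a rough illustration of what padding changes, the sketch below slides a window of length n over the string. The name str_to_grams is hypothetical, and padding with n - 1 characters on each side is an assumption made for this sketch; the library may pad differently.

```python
def str_to_grams(s, n, pad=None):
    if pad is not None:
        # Assumption: n - 1 pad characters are added on each side.
        s = pad * (n - 1) + s + pad * (n - 1)
    # Slide a window of length n across the (possibly padded) string.
    return [s[i:i + n] for i in range(len(s) - n + 1)]


print(str_to_grams("abcd", 2))         # ['ab', 'bc', 'cd']
print(str_to_grams("ab", 2, pad="#"))  # ['#a', 'ab', 'b#']
```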
Tools for checksums.
- sha1sum(f), sha256sum(f), sha512sum(f), md5sum(f) compute the respective checksum, accepting string, bytes, text file object, and binary file object.
Tools for string matching.
- tagstats(tags, lines, separator=None) efficiently computes the number of lines containing each tag.
    - TagStats is used to compute efficiently, where the common prefixes among tags are matched only once.
    - separator is a regex to tokenize each string. By default, when separator is None, each string is not tokenized.
tagstats(
    ["a b", "a c", "b c"],
    ["a b c", "b c d", "c d e"]
)
# {'a b': 1, 'a c': 0, 'b c': 2}
rangetools
Tools for statistics over ranges. Note that each range is closed on the left side, and open on the right side.
- histogram(thresholds, data, leftmost=-inf) computes the histogram over all the floats in data.
    - The search space is divided by the thresholds of bins specified in thresholds.
    - Each bin of the histogram is labelled by its lower threshold.
    - All values in the bin are no less than the current threshold and less than the next threshold.
    - The first bin is labelled by leftmost, which is -inf by default.
histogram(
    [0.1, 0.5, 0.8, 0.9],
    [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
)
# {-inf: 1, 0.1: 4, 0.5: 3, 0.8: 1, 0.9: 2}
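The binning rule (closed on the left, open on the right) maps directly onto a binary search over the thresholds. The following sketch uses bisect_right from the standard library to reproduce the example; histogram_sketch is a hypothetical name, the thresholds are assumed to be sorted in ascending order, and the library itself may be implemented differently.

```python
from bisect import bisect_right
from math import inf


def histogram_sketch(thresholds, data, leftmost=-inf):
    labels = [leftmost] + list(thresholds)
    counts = {label: 0 for label in labels}
    for value in data:
        # bisect_right counts the thresholds <= value, which is exactly
        # the index of the bin whose lower threshold applies to value.
        counts[labels[bisect_right(thresholds, value)]] += 1
    return counts


print(histogram_sketch(
    [0.1, 0.5, 0.8, 0.9],
    [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
))
# {-inf: 1, 0.1: 4, 0.5: 3, 0.8: 1, 0.9: 2}
```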
Tools for querying ranges.
- rangequery(keyvalues, query, func=min) finds the best value among the values in keyvalues whose keys fall within the query range query.
    - Implemented by RangeMinQuery to solve the range minimum query problem.
    - func defines how the best value is computed, and defaults to min for minimum value.
rangequery(
    {0.1: 1, 0.2: 3, 0.3: 0},
    (0.2, 0.4)
)
# 0
Tools for transformations over ranges. Note that each range is closed on the left side, and open on the right side.
- covers(covered) merges the covered ranges covered to resolve any overlap.
    - Covered ranges in covered are sorted by the left side of each range.
list(covers([(-inf, 0), (0.1, 0.2), (0.5, 0.7), (0.6, 0.9)]))
# [(-inf, 0), (0.1, 0.2), (0.5, 0.9)]
- gaps(covered, whole=(-inf, inf)) computes the uncovered ranges of the whole range whole, given the covered ranges covered.
    - Covered ranges in covered are sorted by the left side of each range.
    - Overlaps among covered ranges covered are resolved, like covers(covered).
list(gaps(
    [(-inf, 0), (0.1, 0.2), (0.5, 0.7), (0.6, 0.9)],
    (0, 1)
))
# [(0, 0.1), (0.2, 0.5), (0.9, 1)]
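The relationship between merging covered ranges and finding gaps can be sketched with a single left-to-right sweep. The code below is an illustration only: covers_sketch and gaps_sketch are hypothetical names, the input is assumed to be sorted by left side, and merging ranges that merely touch is an assumption of this sketch.

```python
from math import inf


def covers_sketch(covered):
    merged = []
    for left, right in covered:  # assumed sorted by left side
        if merged and left <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged


def gaps_sketch(covered, whole=(-inf, inf)):
    lo, hi = whole
    result = []
    cursor = lo  # everything left of cursor is already accounted for
    for left, right in covers_sketch(covered):
        if right <= cursor:
            continue  # range lies entirely before the cursor
        if left >= hi:
            break     # range lies entirely after the whole range
        if left > cursor:
            result.append((cursor, left))
        cursor = max(cursor, right)
    if cursor < hi:
        result.append((cursor, hi))
    return result


ranges = [(-inf, 0), (0.1, 0.2), (0.5, 0.7), (0.6, 0.9)]
print(covers_sketch(ranges))       # [(-inf, 0), (0.1, 0.2), (0.5, 0.9)]
print(gaps_sketch(ranges, (0, 1))) # [(0, 0.1), (0.2, 0.5), (0.9, 1)]
```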
dicttools
Tools for inverting dictionaries.
- invertdict(d) inverts (Key, Value) pairs to (Value, Key).
    - If multiple keys share the same value, the inverted dictionary keeps the last of the respective keys.
- invertdict_multiple(d) inverts (Key, List[Value]) pairs to (Value, Key).
    - If multiple keys share the same value, the inverted dictionary keeps the last of the respective keys.
- invertdict_safe(d) inverts (Key, Value) pairs to (Value, List[Key]).
    - If multiple keys share the same value, the inverted dictionary keeps a list of all the respective keys.
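The three variants differ only in how they treat keys that share a value. A minimal sketch with plain dictionaries makes the difference concrete; the names invert, invert_multiple, and invert_safe are hypothetical stand-ins, not the library's code.

```python
def invert(d):
    # Later keys overwrite earlier ones that share a value.
    return {v: k for k, v in d.items()}


def invert_multiple(d):
    # Each key maps to several values; every value points back to its key.
    return {v: k for k, vs in d.items() for v in vs}


def invert_safe(d):
    inverted = {}
    for k, v in d.items():
        # Collect every key that maps to the same value.
        inverted.setdefault(v, []).append(k)
    return inverted


d = {"a": 1, "b": 1, "c": 2}
print(invert(d))       # {1: 'b', 2: 'c'}
print(invert_safe(d))  # {1: ['a', 'b'], 2: ['c']}
print(invert_multiple({"a": [1, 2], "b": [2, 3]}))  # {1: 'a', 2: 'b', 3: 'b'}
```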
Tools for remapping elements.
- remap(data, mapping, key=None) remaps each unique element in data according to function key.
    - mapping is a dictionary recording all the mappings, optionally containing previous mappings to reuse.
    - By default, key returns integers starting from 0.
# docs is assumed to be an iterable of documents, each a sequence of words.
wordmap = {}
db = [list(remap(doc, wordmap)) for doc in docs]
Tools for flattening/unflattening a dictionary.
- flatten(d, force=False) flattens a dictionary by returning (Path, Value) tuples with each path Path from root to each value Value.
    - For each path, if any array with a nested dictionary is encountered, the index of the array also becomes part of the path.
    - By default, only an array with a nested dictionary is flattened. Instead, parameter force can be specified to flatten any array. Note that an empty array contains no child and disappears after being flattened.
flatten(json.loads("""{
    "name": "John",
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York"
    },
    "phoneNumbers": [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "office",
            "number": "646 555-4567"
        }
    ],
    "children": [],
    "spouse": null
}"""))
# {'name': 'John',
# ('address', 'streetAddress'): '21 2nd Street',
# ('address', 'city'): 'New York',
# (('phoneNumbers', 0), 'type'): 'home',
# (('phoneNumbers', 0), 'number'): '212 555-1234',
# (('phoneNumbers', 1), 'type'): 'office',
# (('phoneNumbers', 1), 'number'): '646 555-4567',
# 'children': [],
# 'spouse': None}
jsontools
Tools for flattening/unflattening a JSON object.
- flatten(data, force=False) flattens a JSON object by returning (Path, Value) tuples with each path Path from root to each value Value.
    - For each path, if any array with a nested dictionary is encountered, the index of the array also becomes part of the path.
    - By default, only an array with a nested dictionary is flattened. Instead, parameter force can be specified to flatten any array. Note that an empty array contains no child and disappears after being flattened.
flatten(json.loads("""{
    "name": "John",
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York"
    },
    "phoneNumbers": [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "office",
            "number": "646 555-4567"
        }
    ],
    "children": [],
    "spouse": null
}"""))
# {'name': 'John',
# 'address.streetAddress': '21 2nd Street',
# 'address.city': 'New York',
# 'phoneNumbers[0].type': 'home',
# 'phoneNumbers[0].number': '212 555-1234',
# 'phoneNumbers[1].type': 'office',
# 'phoneNumbers[1].number': '646 555-4567',
# 'children': [],
# 'spouse': None}
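How the dotted paths and array indices arise can be sketched with a short recursive walk. This is an illustration under assumptions, not the library's implementation: flatten_sketch and its prefix parameter are hypothetical, and "an array with a nested dictionary" is interpreted here as a list containing at least one dictionary.

```python
import json


def flatten_sketch(data, force=False, prefix=""):
    flat = {}
    if isinstance(data, dict):
        for key, value in data.items():
            path = f"{prefix}.{key}" if prefix else key
            flat.update(flatten_sketch(value, force, path))
    elif isinstance(data, list) and (
        force or any(isinstance(x, dict) for x in data)
    ):
        # The index becomes part of the path; an empty list adds nothing.
        for index, value in enumerate(data):
            flat.update(flatten_sketch(value, force, f"{prefix}[{index}]"))
    else:
        # Scalars, and lists that are not flattened, are kept as values.
        flat[prefix] = data
    return flat


doc = json.loads('{"a": {"b": 1}, "c": [{"d": 2}], "e": [3, 4], "f": []}')
print(flatten_sketch(doc))
# {'a.b': 1, 'c[0].d': 2, 'e': [3, 4], 'f': []}
print(flatten_sketch(doc, force=True))
# {'a.b': 1, 'c[0].d': 2, 'e[0]': 3, 'e[1]': 4}
```

With force=True the empty list under "f" has no children to emit, so it disappears from the result, matching the note above.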
settools
Tools for set operations.
- addtoset(s, x) checks whether adding x to set s is successful.
Tools for set similarities.
- jaccard(a, b) computes the Jaccard similarity between two sets a and b.
- multisetjaccard(a, b) computes the Jaccard similarity between two multi-sets (Counters) a and b.
- weightedjaccard(a, b, key=sum) computes the weighted Jaccard similarity between two sets a and b, using function key to compute the total weight of the elements within a set.
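All three similarities share one shape: the size (or weight) of the intersection divided by that of the union. The sketch below spells this out with sets and collections.Counter. The function names are hypothetical, the multiset variant assumes the minimum/maximum count definition, and returning 0.0 for two empty inputs is an assumption of this sketch rather than documented behavior.

```python
from collections import Counter


def jaccard_sketch(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def multiset_jaccard_sketch(a, b):
    # Counter & keeps minimum counts, Counter | keeps maximum counts.
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0


def weighted_jaccard_sketch(a, b, key=sum):
    # key turns a set of elements into its total weight.
    union = key(a | b)
    return key(a & b) / union if union else 0.0


print(jaccard_sketch({1, 2, 3}, {2, 3, 4}))                    # 0.5
print(multiset_jaccard_sketch(Counter("aab"), Counter("abb"))) # 0.5
print(weighted_jaccard_sketch({1, 2, 3}, {2, 3, 4}))           # 0.5
```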
tabletools
Tools for tables.
- transpose(data) returns the transpose of table data, i.e., switch rows and columns.
    - Useful to switch table data from row-based to column-based and backwards.
list(transpose([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
]))
# [[1, 4, 7],
# [2, 5, 8],
# [3, 6, 9]]
- loadcsv(path, delimiter=',') loads a CSV file, from either a file path or a file object.
- dumpcsv(path, data, delimiter=',') dumps a table data in CSV, to either a file path or a file object.
mathtools
Tools for math.
- safediv(a, b) avoids the division by zero exception, by returning infinity with the proper sign.
    - Closely follows IEEE Standard 754.
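A signed-infinity result for division by zero could look roughly like the sketch below. The name safe_div is hypothetical, and returning NaN for 0/0 is an assumption based on IEEE 754 conventions; the library's handling of that case is not described above.

```python
import math


def safe_div(a, b):
    if b != 0:
        return a / b
    if a == 0 or math.isnan(a):
        # Assumption: 0/0 yields NaN, as under IEEE 754.
        return math.nan
    # copysign also distinguishes -0.0 from 0.0 in the divisor.
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return math.copysign(math.inf, sign)


print(safe_div(1, 0))     # inf
print(safe_div(-1, 0))    # -inf
print(safe_div(1, -0.0))  # -inf
```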
stattools
Tools for statistics.
- medianabsdev(data) computes the median absolute deviation of a list of floats.
- entropy(data) computes the entropy of a list of any items.
    - You can also pass a dictionary of (item, frequency) as frequency distribution to data.
- histogram is an alias of a tool in rangetools.
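Both statistics have short textbook definitions, sketched below with the standard library. The names median_abs_dev and entropy_sketch are hypothetical, and using base-2 logarithms (entropy in bits) is an assumption, since the base is not stated above.

```python
from collections import Counter
from math import log2
from statistics import median


def median_abs_dev(data):
    m = median(data)
    # Median of the absolute deviations from the median.
    return median(abs(x - m) for x in data)


def entropy_sketch(data):
    # Accept either raw items or an {item: frequency} dictionary.
    freqs = data if isinstance(data, dict) else Counter(data)
    total = sum(freqs.values())
    return -sum(c / total * log2(c / total) for c in freqs.values() if c)


print(median_abs_dev([1, 1, 2, 2, 4, 6, 9]))  # 1
print(entropy_sketch("aabb"))                 # 1.0
print(entropy_sketch({"a": 1, "b": 1}))       # 1.0
```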
Tools for binary classification.
- teststats(truths, predictions) matches the truth labels and the prediction labels. Returns a tuple of (tp, fp, tn, fn) as true positive, false positive, true negative, and false negative.
- accuracy(tp, fp, tn, fn) returns the accuracy.
    - Note that you can simply call accuracy(*teststats(truths, predictions)).
- precision(tp, fp, tn, fn) and recall(tp, fp, tn, fn) return the precision and recall.
- f1(tp, fp, tn, fn, beta=1) returns the F-1 measure by default, and returns the F-β measure when beta is specified.
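The shared (tp, fp, tn, fn) signature is what makes the star-unpacking call convenient. The sketch below shows the standard formulas behind these measures; confusion_counts and f_beta are hypothetical names, labels are assumed to be truthy for the positive class, and no guard against empty denominators is included.

```python
def confusion_counts(truths, predictions):
    tp = fp = tn = fn = 0
    for truth, pred in zip(truths, predictions):
        if pred and truth:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp, tn, fn):
    return tp / (tp + fp)


def recall(tp, fp, tn, fn):
    return tp / (tp + fn)


def f_beta(tp, fp, tn, fn, beta=1):
    p, r = precision(tp, fp, tn, fn), recall(tp, fp, tn, fn)
    # beta > 1 weighs recall more heavily, beta < 1 favors precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


stats = confusion_counts([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
print(stats)             # (2, 1, 1, 1)
print(accuracy(*stats))  # 0.6
```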
misctools
Tools for miscellaneous purposes.
- cmp(a, b) restores the useful cmp function previously in Python 2.
    - Implemented according to What's New in Python 3.0.
- parsebool(s) parses a string to a boolean, returning True if its lowercase form equals 1, true, or yes.
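Both helpers fit in one line each. The cmp expression below is the replacement suggested in "What's New in Python 3.0"; parse_bool is a hypothetical name for a sketch of the rule described above, and neither is taken from the library's source.

```python
def cmp(a, b):
    # True and False behave as 1 and 0, so this yields -1, 0, or 1.
    return (a > b) - (a < b)


def parse_bool(s):
    return s.lower() in {"1", "true", "yes"}


print(cmp(1, 2), cmp(2, 2), cmp(3, 2))      # -1 0 1
print(parse_bool("Yes"), parse_bool("no"))  # True False
```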
printtools
Tools for non-functional but useful printing purposes.
- print2(*args, **kwargs) redirects the output of print to standard error.
    - The same parameters are accepted.
debugtools
Tools for non-functional but useful debugging purposes.
- stopwatch() returns both the duration since program start and the duration since last call in seconds.
    - Technically, the stopwatch starts when debugtools is imported.
- peakmem() returns the peak memory usage since program start.
    - In bytes on macOS, and in kilobytes on Linux.
Data Structures
disjointsets
Disjoint sets with path compression, largely based on this implementation. After d = DisjointSets():
- d.add(x) adds a new disjoint set containing x.
- d[x] returns the representing element of the disjoint set containing x.
- d.disjoints() returns all the representing elements and their respective disjoint sets.
- d.union(*xs) unions all the elements in xs into a single disjoint set.
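To show what path compression means for an interface like this, here is a small self-contained sketch. DisjointSetsSketch is a hypothetical class written for this illustration; it mirrors the four operations listed above but is not the library's implementation.

```python
class DisjointSetsSketch:
    def __init__(self):
        self._parent = {}

    def add(self, x):
        # A new element starts as the representative of its own set.
        self._parent.setdefault(x, x)

    def __getitem__(self, x):
        # First pass: follow parent links up to the representative.
        root = x
        while self._parent[root] != root:
            root = self._parent[root]
        # Second pass (path compression): point every visited node at it.
        while self._parent[x] != root:
            nxt = self._parent[x]
            self._parent[x] = root
            x = nxt
        return root

    def union(self, *xs):
        roots = [self[x] for x in xs]
        for root in roots[1:]:
            self._parent[root] = roots[0]

    def disjoints(self):
        groups = {}
        for x in self._parent:
            groups.setdefault(self[x], set()).add(x)
        return groups


d = DisjointSetsSketch()
for x in range(5):
    d.add(x)
d.union(0, 1)
d.union(2, 3)
d.union(1, 3)
print(sorted(sorted(group) for group in d.disjoints().values()))
# [[0, 1, 2, 3], [4]]
```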
defaultlist
A sub-class of list that automatically grows when setting an index beyond the list size.
- When creating a list, use DefaultList(default, ...) to specify a function that returns the default value when visiting an unassigned index.
- This class is designed to closely resemble collections.defaultdict in the standard library.
l = DefaultList(lambda: None, range(10))
l[11] = 11
l
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, None, 11]
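The growing behavior can be approximated by overriding item assignment on a list subclass. The following is a minimal sketch under a hypothetical name, DefaultListSketch; it only covers integer assignment beyond the current size and does not attempt to reproduce how the library handles reads of unassigned indices.

```python
class DefaultListSketch(list):
    def __init__(self, default, *args):
        super().__init__(*args)
        self._default = default

    def __setitem__(self, index, value):
        if isinstance(index, int) and index >= len(self):
            # Fill the gap up to and including index with default values.
            missing = index - len(self) + 1
            self.extend(self._default() for _ in range(missing))
        super().__setitem__(index, value)


l = DefaultListSketch(lambda: None, range(10))
l[11] = 11
print(l)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, None, 11]
```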