Extra Functional Tools beyond Standard and Third-Party Libraries

Extra functional tools that go beyond the standard library's itertools, functools, etc. and popular third-party libraries like toolz, funcy, and more-itertools.

Like toolz and others, most of the tools are designed to be efficient, pure, and lazy.
Several useful yet non-functional tools are also included.
While toolz and others target basic scenarios, many tools in this library target more advanced and complete scenarios.

This library is under active development, and new functions will be added on a regular basis.

Ideas and contributions are highly welcome.
Currently adopted by TopSim and PrefixSpan-py.

Installation

This package is available on PyPI. Just use pip3 install -U extratools to install it.

Available Tools

Functions:

seqtools, sortedtools, strtools, rangetools, dicttools, jsontools, settools, tabletools, mathtools, stattools, misctools, printtools, debugtools

Data Structures:

disjointsets, defaultlist

Functions

`seqtools`

Tools for matching sequences (including strings), with or without gaps allowed between matching items. Note that an empty sequence is always a sub-sequence of any other sequence.

findsubseq(a, b) returns the first position where a is a sub-sequence of b, or -1 when not found.
issubseq(a, b) checks if a is a sub-sequence of b.
findsubseqwithgap(a, b) returns the matching positions where a is a sub-sequence of b, where gaps are allowed, or None when not found.
issubseqwithgap(a, b) checks if a is a sub-sequence of b, where gaps are allowed.
nextentries(data, entries) scans the sequences in data from left to right, starting after the current entries given in entries, and returns each item found together with its respective following entries.
- Each entry is a pair of (ID, Position) denoting the sequence ID and its respective matching position.

data = [
    s.split() for s in [
        "a b c d e",
        "b b b d e",
        "c b c c a",
        "b b b c c"
    ]
]

entries = [(0, 2), (2, 0), (3, 3)]
# the first positions of `c` among sequences.

nextentries(data, entries)
# {'d': [(0, 3)],
#  'e': [(0, 4)],
#  'b': [(2, 1)],
#  'c': [(2, 2), (3, 4)],
#  'a': [(2, 4)]}
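
The documented output above can be reproduced with a short scan. The following is a minimal sketch, not the library's implementation: the name next_entries_sketch is hypothetical, and it assumes that only the first occurrence of each item after an entry is recorded, which is what the example output suggests.

```python
from collections import defaultdict


def next_entries_sketch(data, entries):
    """Map each item to the entries (ID, Position) where it first follows a current entry."""
    result = defaultdict(list)
    for seq_id, pos in entries:
        seen = set()
        # Scan the rest of the sequence, right after the current position.
        for i in range(pos + 1, len(data[seq_id])):
            item = data[seq_id][i]
            if item not in seen:  # Assumption: keep the first occurrence only.
                seen.add(item)
                result[item].append((seq_id, i))
    return dict(result)


data = [s.split() for s in ["a b c d e", "b b b d e", "c b c c a", "b b b c c"]]
entries = [(0, 2), (2, 0), (3, 3)]
print(next_entries_sketch(data, entries))
# {'d': [(0, 3)], 'e': [(0, 4)], 'b': [(2, 1)], 'c': [(2, 2), (3, 4)], 'a': [(2, 4)]}
```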

Tools for comparing sequences (including strings).

productcmp(x, y) compares two sequences x and y of equal length according to the product order. Returns -1 if x is smaller, 0 if equal, 1 if greater, and None if the two are not comparable.
- Throws an exception if x and y have different lengths.
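
As an illustration of the product order, here is a minimal sketch under the usual definition (x is no greater than y when every component of x is no greater than the matching component of y). The name product_cmp_sketch and the choice of ValueError are assumptions, not the library's API.

```python
def product_cmp_sketch(x, y):
    """Compare two equal-length sequences under the product order."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    some_less = any(a < b for a, b in zip(x, y))
    some_greater = any(a > b for a, b in zip(x, y))
    if some_less and some_greater:
        return None  # Components disagree, so x and y are not comparable.
    if some_less:
        return -1
    if some_greater:
        return 1
    return 0


print(product_cmp_sketch((1, 2), (2, 3)))  # -1
print(product_cmp_sketch((1, 3), (2, 2)))  # None
```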

Tools for sorting sequences.

sortedbyrank(data, ranks, reverse=False) returns the sorted list of data, according to the respective rank of each individual element in ranks.

Tools for encoding/decoding sequences.

compress(data, key=None) compresses the sequence by encoding each run of consecutive identical items as (Item, Count), according to run-length encoding.
- Not to be confused with itertools.compress, which filters items by a selector sequence.

list(compress([1, 2, 2, 3, 3, 3, 4, 4, 4, 4]))
# [(1, 1), (2, 2), (3, 3), (4, 4)]
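
Run-length encoding and its inverse can be sketched lazily with itertools. This is a sketch under stated assumptions rather than the library's code: the names rle_encode_sketch and rle_decode_sketch are hypothetical, and the key parameter is left out.

```python
from itertools import chain, groupby, repeat


def rle_encode_sketch(data):
    """Yield (item, count) for each run of consecutive identical items."""
    for item, run in groupby(data):
        yield item, sum(1 for _ in run)


def rle_decode_sketch(pairs):
    """Expand (item, count) pairs back into the original sequence."""
    return chain.from_iterable(repeat(item, count) for item, count in pairs)


encoded = list(rle_encode_sketch([1, 2, 2, 3, 3, 3, 4, 4, 4, 4]))
print(encoded)                           # [(1, 1), (2, 2), (3, 3), (4, 4)]
print(list(rle_decode_sketch(encoded)))  # [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
```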

decompress(data) decompresses the sequence by decoding each (Item, Count) back into a run of consecutive identical items, according to run-length encoding.
todeltas(data, op=operator.sub) compresses the sequence by encoding the difference between the previous and current items, according to delta encoding.
- For a custom item type, either define the - operator or specify the op function that computes the difference.

list(todeltas([1, 2, 2, 3, 3, 3, 4, 4, 4, 4]))
# [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

fromdeltas(data, op=operator.add) decompresses the sequence by decoding the difference between the previous and current items, according to delta encoding.
- For a custom item type, either define the + operator or specify the op function that merges the difference.
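
A delta-encoding round trip can be sketched as follows. The function names are hypothetical, and the sketch assumes the first item is emitted unchanged, as in the todeltas example above.

```python
import operator
from itertools import accumulate


def to_deltas_sketch(data, op=operator.sub):
    """Yield the first item, then the difference between each item and its predecessor."""
    it = iter(data)
    try:
        prev = next(it)
    except StopIteration:
        return  # Empty input yields nothing.
    yield prev
    for curr in it:
        yield op(curr, prev)
        prev = curr


def from_deltas_sketch(deltas, op=operator.add):
    """Rebuild the original items by accumulating the differences."""
    return accumulate(deltas, op)


deltas = list(to_deltas_sketch([1, 2, 2, 3, 3, 3, 4, 4, 4, 4]))
print(deltas)                            # [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(list(from_deltas_sketch(deltas)))  # [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
```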

`sortedtools`

Tools for sorted sequences.

sortedcommon(a, b, key=None) returns the common elements between a and b.
- When both a and b are sorted sets with no duplicate element, equal to sorted(set(a) & set(b)) but more efficient.
sortedalone(a, b, key=None) returns the elements that are in exactly one of a and b.
- When both a and b are sorted sets with no duplicate element, equal to sorted((set(a) | set(b)) - (set(a) & set(b))) but more efficient.
sorteddiff(a, b, key=None) returns the elements only in a and not in b.
- When both a and b are sorted sets with no duplicate element, equal to sorted(set(a) - set(b)) but more efficient.
issubsorted(a, b, key=None) checks if a is a sorted sub-sequence of b.
- When both a and b are sorted sets with no duplicate element, equal to set(a) <= set(b) but more efficient.
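
The efficiency gain over the set-based equivalents comes from walking both sorted inputs once. Below is a minimal two-pointer sketch for the intersection and the difference; the names are hypothetical, the key parameter is omitted, and inputs are assumed to be sorted lists without duplicates.

```python
def sorted_common_sketch(a, b):
    """Yield elements present in both sorted sequences, in order."""
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            yield a[i]
            i += 1
            j += 1


def sorted_diff_sketch(a, b):
    """Yield elements of sorted `a` that do not appear in sorted `b`."""
    j = 0
    for x in a:
        # Skip elements of b that are smaller than x.
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            yield x


a, b = [1, 3, 5, 7], [3, 4, 5, 8]
print(list(sorted_common_sketch(a, b)))  # [3, 5]
print(list(sorted_diff_sketch(a, b)))    # [1, 7]
```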

`strtools`

Tools for string transformations.

str2grams(s, n, pad=None) returns the ordered n-grams of string s.
- Optional padding at the start and end can be added by specifying pad. \0 is usually a safe choice for pad when the n-grams are not displayed.
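
A padded n-gram generator can be sketched as below. The name str_to_grams_sketch is hypothetical, and padding with n - 1 copies of pad on each side is an assumption; the library may pad differently.

```python
def str_to_grams_sketch(s, n, pad=None):
    """Yield the n-grams of `s` in order, optionally padding both ends."""
    if pad is not None:
        # Assumption: n - 1 pad characters on each side.
        s = pad * (n - 1) + s + pad * (n - 1)
    return (s[i:i + n] for i in range(len(s) - n + 1))


print(list(str_to_grams_sketch("abcd", 2)))           # ['ab', 'bc', 'cd']
print(list(str_to_grams_sketch("abcd", 2, pad="_")))  # ['_a', 'ab', 'bc', 'cd', 'd_']
```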

Tools for checksums.

sha1sum(f), sha256sum(f), sha512sum(f), md5sum(f) compute the respective checksum, accepting string, bytes, text file object, and binary file object.
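
Accepting strings, bytes, and file objects alike can be handled with a small normalization step before hashing. This is a sketch built on hashlib, not the library's implementation; the name sha256sum_sketch is hypothetical and encoding text as UTF-8 is an assumption.

```python
import hashlib
import io


def sha256sum_sketch(f):
    """Return the SHA-256 hex digest of a string, bytes, or readable file object."""
    data = f if isinstance(f, (str, bytes)) else f.read()
    if isinstance(data, str):
        data = data.encode("utf-8")  # Assumption: text is hashed as UTF-8.
    return hashlib.sha256(data).hexdigest()


# All four input kinds produce the same digest for the same content.
digests = {
    sha256sum_sketch("abc"),
    sha256sum_sketch(b"abc"),
    sha256sum_sketch(io.StringIO("abc")),
    sha256sum_sketch(io.BytesIO(b"abc")),
}
print(len(digests))  # 1
```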

Tools for string matching.

tagstats(tags, lines, separator=None) efficiently computes the number of lines containing each tag.
- TagStats is used for efficiency: common prefixes among tags are matched only once.
- separator is a regex used to tokenize each string. By default, when separator is None, the strings are not tokenized.

tagstats(
    ["a b", "a c", "b c"],
    ["a b c", "b c d", "c d e"]
)
# {'a b': 1, 'a c': 0, 'b c': 2}

`rangetools`

Tools for statistics over ranges. Note that each range is closed on the left side, and open on the right side.

histogram(thresholds, data, leftmost=-inf) computes the histogram over all the floats in data.
- The search space is divided by the thresholds of bins specified in thresholds.
- Each bin of the histogram is labelled by its lower threshold.
  - All values in the bin are no less than the current threshold and less than the next threshold.
  - The first bin is labelled by leftmost, which is -inf by default.

histogram(
    [0.1, 0.5, 0.8, 0.9],
    [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
)
# {-inf: 1, 0.1: 4, 0.5: 3, 0.8: 1, 0.9: 2}
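
The binning rule (closed on the left, open on the right) maps naturally onto bisect. The following is a minimal sketch under that reading; histogram_sketch is a hypothetical name, thresholds is assumed to be a sorted list, and reporting empty bins with a count of 0 is an assumption.

```python
from bisect import bisect_right
from math import inf


def histogram_sketch(thresholds, data, leftmost=-inf):
    """Count values per bin, labelling each bin by its lower threshold."""
    labels = [leftmost] + list(thresholds)
    counts = {label: 0 for label in labels}
    for value in data:
        # bisect_right counts the thresholds that are <= value,
        # which is the index of the bin's label.
        counts[labels[bisect_right(thresholds, value)]] += 1
    return counts


print(histogram_sketch(
    [0.1, 0.5, 0.8, 0.9],
    [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
))
# {-inf: 1, 0.1: 4, 0.5: 3, 0.8: 1, 0.9: 2}
```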

Tools for querying ranges.

rangequery(keyvalues, query, func=min) finds the best value among the values in keyvalues whose keys fall within the query range query.
- Implemented by RangeMinQuery to solve the range minimum query problem.
- func defines how the best value is computed, and defaults to min for minimum value.

rangequery(
    {0.1: 1, 0.2: 3, 0.3: 0},
    (0.2, 0.4)
)
# 0
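
The semantics of the query can be shown with a plain linear scan. Note that this sketch deliberately does not use the RangeMinQuery structure mentioned above; range_query_sketch is a hypothetical name, and returning None when no key is covered is an assumption.

```python
def range_query_sketch(keyvalues, query, func=min):
    """Apply `func` to the values whose keys lie in the half-open range `query`."""
    low, high = query
    covered = [value for key, value in keyvalues.items() if low <= key < high]
    return func(covered) if covered else None


print(range_query_sketch({0.1: 1, 0.2: 3, 0.3: 0}, (0.2, 0.4)))            # 0
print(range_query_sketch({0.1: 1, 0.2: 3, 0.3: 0}, (0.1, 0.3), func=max))  # 3
```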

Tools for transformations over ranges. Note that each range is closed on the left side, and open on the right side.

covers(covered) merges the covered ranges covered to resolve any overlap.
- Covered ranges in covered are sorted by the left side of each range.

list(covers([(-inf, 0), (0.1, 0.2), (0.5, 0.7), (0.6, 0.9)]))
# [(-inf, 0), (0.1, 0.2), (0.5, 0.9)]

gaps(covered, whole=(-inf, inf)) computes the uncovered ranges of the whole range whole, given the covered ranges covered.
- Covered ranges in covered are sorted by the left side of each range.
- Overlaps among covered ranges covered are resolved, like covers(covered).

list(gaps(
    [(-inf, 0), (0.1, 0.2), (0.5, 0.7), (0.6, 0.9)],
    (0, 1)
))
# [(0, 0.1), (0.2, 0.5), (0.9, 1)]
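
Both transformations reduce to one left-to-right pass over the sorted ranges. The sketch below is not the library's code: the names are hypothetical, the input is assumed to be sorted by left side, and merging ranges that merely touch (such as (0, 1) and (1, 2)) is an assumption.

```python
from math import inf


def covers_sketch(covered):
    """Merge overlapping half-open ranges that are sorted by their left side."""
    current = None
    for left, right in covered:
        if current is None:
            current = (left, right)
        elif left <= current[1]:
            current = (current[0], max(current[1], right))
        else:
            yield current
            current = (left, right)
    if current is not None:
        yield current


def gaps_sketch(covered, whole=(-inf, inf)):
    """Yield the parts of `whole` that no covered range touches."""
    start, end = whole
    for left, right in covers_sketch(covered):
        if right <= start:
            continue  # Entirely before the remaining part of `whole`.
        if left >= end:
            break  # Entirely after `whole`.
        if left > start:
            yield (start, left)
        start = max(start, right)
    if start < end:
        yield (start, end)


ranges = [(-inf, 0), (0.1, 0.2), (0.5, 0.7), (0.6, 0.9)]
print(list(covers_sketch(ranges)))        # [(-inf, 0), (0.1, 0.2), (0.5, 0.9)]
print(list(gaps_sketch(ranges, (0, 1))))  # [(0, 0.1), (0.2, 0.5), (0.9, 1)]
```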

`dicttools`

Tools for inverting dictionaries.

invertdict(d) inverts (Key, Value) pairs to (Value, Key).
- If multiple keys share the same value, the inverted dictionary keeps the last of the respective keys.
invertdict_multiple(d) inverts (Key, List[Value]) pairs to (Value, Key).
- If multiple keys share the same value, the inverted dictionary keeps the last of the respective keys.
invertdict_safe(d) inverts (Key, Value) pairs to (Value, List[Key]).
- If multiple keys share the same value, the inverted dictionary keeps a list of all the respective keys.
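
The three inversion behaviors described above can be sketched with dictionary comprehensions. The function names here are hypothetical stand-ins, not the library's implementation.

```python
def invert_dict_sketch(d):
    """(Key, Value) -> (Value, Key); the last key wins on duplicate values."""
    return {value: key for key, value in d.items()}


def invert_dict_multiple_sketch(d):
    """(Key, List[Value]) -> (Value, Key); the last key wins on duplicate values."""
    return {value: key for key, values in d.items() for value in values}


def invert_dict_safe_sketch(d):
    """(Key, Value) -> (Value, List[Key]); all keys are kept."""
    result = {}
    for key, value in d.items():
        result.setdefault(value, []).append(key)
    return result


d = {"a": 1, "b": 1, "c": 2}
print(invert_dict_sketch(d))       # {1: 'b', 2: 'c'}
print(invert_dict_safe_sketch(d))  # {1: ['a', 'b'], 2: ['c']}
print(invert_dict_multiple_sketch({"a": [1, 2], "b": [2, 3]}))  # {1: 'a', 2: 'b', 3: 'b'}
```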

Tools for remapping elements.

remap(data, mapping, key=None) remaps each unique element in data according to function key.
- mapping is a dictionary recording all the mappings, optionally containing previous mappings to reuse.
- By default, key returns integers starting from 0.

wordmap = {}
db = [list(remap(doc, wordmap)) for doc in docs]
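
To make the snippet above concrete, here is a minimal sketch of such a remapping. The name remap_sketch and the sample docs are hypothetical; the sketch assumes that key is called once per unseen element and that the default assigns the next unused integer.

```python
def remap_sketch(data, mapping, key=None):
    """Yield the mapped value of each element, extending `mapping` as needed."""
    if key is None:
        # Assumption: default IDs are consecutive integers starting from 0.
        key = lambda item: len(mapping)
    for item in data:
        if item not in mapping:
            mapping[item] = key(item)
        yield mapping[item]


docs = [["a", "b", "a"], ["b", "c"]]  # Hypothetical sample documents.
wordmap = {}
db = [list(remap_sketch(doc, wordmap)) for doc in docs]
print(db)       # [[0, 1, 0], [1, 2]]
print(wordmap)  # {'a': 0, 'b': 1, 'c': 2}
```

Because wordmap is shared across calls, the same word receives the same ID in every document.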

Tools for flattening/unflattening a dictionary.

flatten(d, force=False) flattens a dictionary by returning (Path, Value) tuples, where Path is the path from the root to each value Value.
- For each path, if any array with a nested dictionary is encountered, the index within the array also becomes part of the path.
- By default, only an array with a nested dictionary is flattened. Alternatively, parameter force can be specified to flatten any array. Note that an empty array contains no child and disappears after being flattened.

flatten(json.loads("""{
  "firstName": "John",
  "lastName": "Smith",
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null
}"""))
# {'firstName': 'John',
#  'lastName': 'Smith',
#  ('address', 'streetAddress'): '21 2nd Street',
#  ('address', 'city'): 'New York',
#  (('phoneNumbers', 0), 'type'): 'home',
#  (('phoneNumbers', 0), 'number'): '212 555-1234',
#  (('phoneNumbers', 1), 'type'): 'office',
#  (('phoneNumbers', 1), 'number'): '646 555-4567',
#  'children': [],
#  'spouse': None}

`jsontools`

Tools for flattening/unflattening a JSON object.

flatten(data, force=False) flattens a JSON object by returning (Path, Value) tuples, where Path is the path from the root to each value Value.
- For each path, if any array with a nested dictionary is encountered, the index within the array also becomes part of the path.
- By default, only an array with a nested dictionary is flattened. Alternatively, parameter force can be specified to flatten any array. Note that an empty array contains no child and disappears after being flattened.

flatten(json.loads("""{
  "firstName": "John",
  "lastName": "Smith",
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null
}"""))
# {'firstName': 'John',
#  'lastName': 'Smith',
#  'address.streetAddress': '21 2nd Street',
#  'address.city': 'New York',
#  'phoneNumbers[0].type': 'home',
#  'phoneNumbers[0].number': '212 555-1234',
#  'phoneNumbers[1].type': 'office',
#  'phoneNumbers[1].number': '646 555-4567',
#  'children': [],
#  'spouse': None}
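
The dotted-path output above can be produced by a short recursive walk. This is a sketch only: flatten_json_sketch and its prefix parameter are hypothetical, and the path syntax is inferred from the example output rather than from the library's source.

```python
def flatten_json_sketch(data, force=False, prefix=""):
    """Flatten nested dicts (and qualifying lists) into {path: value}."""
    result = {}
    if isinstance(data, dict):
        for key, value in data.items():
            path = f"{prefix}.{key}" if prefix else key
            result.update(flatten_json_sketch(value, force, path))
    elif isinstance(data, list) and (
        force or any(isinstance(item, dict) for item in data)
    ):
        # Lists are expanded only if forced or if they contain a dictionary.
        for index, item in enumerate(data):
            result.update(flatten_json_sketch(item, force, f"{prefix}[{index}]"))
    else:
        result[prefix] = data
    return result


doc = {
    "name": "John",
    "address": {"city": "New York"},
    "phones": [{"type": "home"}, {"type": "office"}],
    "children": [],
    "tags": ["a", "b"],
}
print(flatten_json_sketch(doc))
# {'name': 'John', 'address.city': 'New York', 'phones[0].type': 'home',
#  'phones[1].type': 'office', 'children': [], 'tags': ['a', 'b']}
```

With force=True, tags would expand to tags[0] and tags[1], and the empty children array would disappear, matching the note above.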

`settools`

Tools for set operations.

addtoset(s, x) adds x to set s and reports whether the addition was successful.

Tools for set similarities.

jaccard(a, b) computes the Jaccard similarity between two sets a and b.
multisetjaccard(a, b) computes the Jaccard similarity between two multi-sets (Counters) a and b.
weightedjaccard(a, b, key=sum) computes the weighted Jaccard similarity between two sets a and b, using function key to compute the total weight of the elements within a set.
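
All three similarities share the same shape: the size (or weight) of the intersection divided by that of the union. A minimal sketch follows; the names are hypothetical, and returning 0.0 when both inputs are empty is an assumption.

```python
from collections import Counter


def jaccard_sketch(a, b):
    """|a & b| / |a | b| for two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def multiset_jaccard_sketch(a, b):
    """Jaccard over Counters: & takes minimum counts, | takes maximum counts."""
    a, b = Counter(a), Counter(b)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0


def weighted_jaccard_sketch(a, b, key=sum):
    """Jaccard where `key` computes the total weight of a set of elements."""
    a, b = set(a), set(b)
    union = key(a | b)
    return key(a & b) / union if union else 0.0


print(jaccard_sketch({1, 2, 3}, {2, 3, 4}))           # 0.5
print(multiset_jaccard_sketch("aab", "abb"))          # 0.5
print(weighted_jaccard_sketch({1, 2, 3}, {2, 3, 4}))  # 0.5
```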

`tabletools`

Tools for tables.

transpose(data) returns the transpose of table data, i.e., with rows and columns switched.
- Useful to switch table data from row-based to column-based and back.

list(transpose([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
]))
# [[1, 4, 7],
#  [2, 5, 8],
#  [3, 6, 9]]
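
A lazy transpose is commonly written with zip. The sketch below uses a hypothetical name and assumes all rows have the same length (zip would silently truncate to the shortest row otherwise).

```python
def transpose_sketch(data):
    """Lazily yield the columns of a row-based table as lists."""
    return (list(column) for column in zip(*data))


table = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(list(transpose_sketch(table)))  # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```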

loadcsv(path) loads a CSV file, from either a file path or a file object.
dumpcsv(path, data) dumps table data as CSV, to either a file path or a file object.

`mathtools`

Tools for math.

safediv(a, b) avoids the division-by-zero exception by returning infinity with the proper sign.
- Closely follows IEEE Standard 754.
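
One way to follow the IEEE 754 convention in Python is to catch ZeroDivisionError and build the signed infinity by hand. In this sketch, safe_div_sketch is a hypothetical name, and returning NaN for 0 / 0 is an assumption taken from IEEE 754 rather than from the library.

```python
import math


def safe_div_sketch(a, b):
    """Divide a by b, returning signed infinity (or NaN for 0/0) instead of raising."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0:
            return math.nan  # Assumption: 0 / 0 yields NaN as in IEEE 754.
        # The sign of the result combines the signs of a and of the zero divisor.
        return math.copysign(math.inf, a) * math.copysign(1.0, b)


print(safe_div_sketch(1, 0))     # inf
print(safe_div_sketch(-1, 0))    # -inf
print(safe_div_sketch(1, -0.0))  # -inf
```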

`stattools`

Tools for statistics.

medianabsdev(data) computes the median absolute deviation of a list of floats.
entropy(data) computes the entropy of a list of any items.
- You can also pass a dictionary of (item, frequency) as frequency distribution to data.
histogram is an alias of the tool of the same name in rangetools.
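
Both statistics are short to express with the standard library. The names below are hypothetical, and using base-2 logarithms for the entropy is an assumption; the library may use a different base.

```python
import math
from collections import Counter
from statistics import median


def median_abs_dev_sketch(data):
    """Median of the absolute deviations from the median."""
    data = list(data)
    center = median(data)
    return median(abs(x - center) for x in data)


def entropy_sketch(data):
    """Shannon entropy in bits of items or of an (item, frequency) dictionary."""
    counts = data if isinstance(data, dict) else Counter(data)
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


print(median_abs_dev_sketch([1, 1, 2, 2, 4, 6, 9]))  # 1
print(entropy_sketch("aabb"))                        # 1.0
print(entropy_sketch({"a": 1, "b": 1}))              # 1.0
```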

Tools for binary classification.

teststats(truths, predictions) matches the truth labels against the prediction labels. Returns a tuple (tp, fp, tn, fn) of the numbers of true positives, false positives, true negatives, and false negatives.
accuracy(tp, fp, tn, fn) returns the accuracy.
- Note that you can simply call accuracy(*teststats(truths, predictions)).
precision(tp, fp, tn, fn) and recall(tp, fp, tn, fn) return the precision and recall.
f1(tp, fp, tn, fn, beta=1) returns the F-1 measure by default, and returns the F-β measure when beta is specified.
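
The measures above follow the standard definitions over the four counts. A minimal sketch, with hypothetical names and truthy/falsy labels assumed for positive/negative:

```python
def confusion_counts_sketch(truths, predictions):
    """Return (tp, fp, tn, fn) for paired truth and prediction labels."""
    tp = fp = tn = fn = 0
    for truth, prediction in zip(truths, predictions):
        if prediction:
            if truth:
                tp += 1
            else:
                fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy_sketch(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def precision_sketch(tp, fp, tn, fn):
    return tp / (tp + fp)


def recall_sketch(tp, fp, tn, fn):
    return tp / (tp + fn)


def f_beta_sketch(tp, fp, tn, fn, beta=1):
    """F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    p = precision_sketch(tp, fp, tn, fn)
    r = recall_sketch(tp, fp, tn, fn)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


stats = confusion_counts_sketch([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
print(stats)                    # (2, 1, 1, 1)
print(accuracy_sketch(*stats))  # 0.6
```

The sketch does not guard against empty denominators (for example, no positive predictions), which real code would need to handle.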

`misctools`

Tools for miscellaneous purposes.

cmp(a, b) restores the useful cmp function previously in Python 2.
- Implemented according to What's New in Python 3.0.
parsebool(s) parses a string as a boolean, returning true if its lowercase form equals 1, true, or yes.
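
Both helpers fit in one line each. The cmp expression below is the idiom given in What's New in Python 3.0; the function names are hypothetical stand-ins for the library's own.

```python
def cmp_sketch(a, b):
    """Return -1, 0, or 1 depending on whether a is less than, equal to, or greater than b."""
    return (a > b) - (a < b)


def parse_bool_sketch(s):
    """Return True only for '1', 'true', or 'yes' (case-insensitive)."""
    return s.lower() in {"1", "true", "yes"}


print(cmp_sketch(1, 2), cmp_sketch(2, 2), cmp_sketch(3, 2))  # -1 0 1
print(parse_bool_sketch("Yes"), parse_bool_sketch("no"))     # True False
```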

`printtools`

Tools for non-functional but useful printing purposes.

print2(*args, **kwargs) redirects the output of print to standard error.
- The same parameters are accepted.

`debugtools`

Tools for non-functional but useful debugging purposes.

stopwatch() returns both the duration since program start and the duration since the last call, in seconds.
- Technically, the stopwatch starts when debugtools is imported.
peakmem() returns the peak memory usage since program start.
- In bytes on macOS, and in kilobytes on Linux.

Data Structures

`disjointsets`

Disjoint sets with path compression, based largely on this implementation. After d = DisjointSets():

d.add(x) adds a new disjoint set containing x.
d[x] returns the representing element of the disjoint set containing x.
d.disjoints() returns all the representing elements and their respective disjoint sets.
d.union(*xs) unions all the elements in xs into a single disjoint set.
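
The interface above can be sketched with a parent dictionary and path compression. This is an illustrative sketch, not the library's class: DisjointSetsSketch is a hypothetical name, and letting union add unseen elements automatically is an assumption.

```python
class DisjointSetsSketch:
    def __init__(self):
        self._parent = {}

    def add(self, x):
        """Add x as its own singleton set, if not already present."""
        self._parent.setdefault(x, x)

    def __getitem__(self, x):
        """Return the representing element of the set containing x."""
        root = x
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path directly at the root.
        while self._parent[x] != root:
            next_x = self._parent[x]
            self._parent[x] = root
            x = next_x
        return root

    def union(self, *xs):
        """Merge the sets containing all the given elements."""
        if not xs:
            return
        for x in xs:
            self.add(x)  # Assumption: unseen elements are added on the fly.
        first = self[xs[0]]
        for x in xs[1:]:
            self._parent[self[x]] = first

    def disjoints(self):
        """Return {representing element: set of members}."""
        groups = {}
        for x in list(self._parent):
            groups.setdefault(self[x], set()).add(x)
        return groups


d = DisjointSetsSketch()
for x in range(5):
    d.add(x)
d.union(0, 1, 2)
d.union(3, 4)
print(d[1] == d[2], d[0] == d[3])  # True False
```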

`defaultlist`

A sub-class of list that automatically grows when setting an index beyond the list size.

When creating a list, use DefaultList(default, ...) to specify a function that returns the default value when accessing an unassigned index.
This class is designed to be highly similar to collections.defaultdict in the standard library.

l = DefaultList(lambda: None, range(10))

l[11] = 11

l
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, None, 11]
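
The growing behavior shown above can be sketched by overriding __setitem__ on a list sub-class. DefaultListSketch is a hypothetical name, and the sketch covers only assignment beyond the current size; how the library treats reads of unassigned indices is not modelled here.

```python
class DefaultListSketch(list):
    def __init__(self, default, *args):
        super().__init__(*args)
        self._default = default  # Factory producing the filler value.

    def __setitem__(self, index, value):
        if isinstance(index, int) and index >= len(self):
            # Fill the gap (including the target slot) with default values.
            missing = index - len(self) + 1
            self.extend(self._default() for _ in range(missing))
        super().__setitem__(index, value)


l = DefaultListSketch(lambda: None, range(10))
l[11] = 11
print(l)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, None, 11]
```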

Download files

Source Distribution

extratools-0.4.8.1.tar.gz (13.3 kB)

Uploaded May 5, 2018 Source

Hashes for extratools-0.4.8.1.tar.gz
Algorithm	Hash digest
SHA256	`b6a0997205bb3be1925d34d178d23f17a03807c1752c36e8d1bd16fd60ac165e`
MD5	`4d98aa9563ad2a782b68bd4bf5cb688e`
BLAKE2b-256	`44d2c0f1983cd9102fd426b68cc5e63092bf485c479e114740421376a104470e`