Python Data Science Kit for Humans.

Project description

DSKit

DSKit (Data Science Kit) is a Python package that provides tools for routine data science tasks.

Installing

pip install dskit

Tutorial

DSKit consists of two submodules:

dskit.frame - contains a set of functions for pandas.DataFrame and pandas.Series manipulation.
dskit.tensor - contains a set of functions for numpy.ndarray manipulation.

dskit.frame

dummifier

dummifier is a less harmful alternative to pd.get_dummies. It takes a Dict[str, Tuple[object, ...]] and returns a Callable[[pd.DataFrame], pd.DataFrame] that maps a frame to its dummified version. Each dictionary key is treated as a column name, and the corresponding value as the set of unique values of that column. dummifier also accepts an optional name parameter of type Callable[[str, object], str]: given a column name and one of its unique values, it produces the name of the resulting column in the dummified frame. The default is lambda n, x: n + "_" + str(x). Under the hood, dummifier uses the encoder function.

import pandas as pd
from dskit.frame import dummifier

xs = pd.DataFrame({"A": (1, 2, 2, 5, 5), "B": ("a", "a", "b", "c", "d")})

dummify = dummifier(dict(xs))
print(dummify(xs))

#    A_1  A_2  A_5  B_a  B_b  B_c  B_d
# 0  1.0  0.0  0.0  1.0  0.0  0.0  0.0
# 1  0.0  1.0  0.0  1.0  0.0  0.0  0.0
# 2  0.0  1.0  0.0  0.0  1.0  0.0  0.0
# 3  0.0  0.0  1.0  0.0  0.0  1.0  0.0
# 4  0.0  0.0  1.0  0.0  0.0  0.0  1.0

ys = pd.DataFrame({"C": (True, True, False, True), "A": (1, 2, 3, 4)})
print(dummify(ys))

#        C  A_1  A_2  A_5
# 0   True  1.0  0.0  0.0
# 1   True  0.0  1.0  0.0
# 2  False  0.0  0.0  0.0
# 3   True  0.0  0.0  0.0

One reason dummifier is less harmful than pd.get_dummies is that it does not create columns for values it has not seen before. As a result, machine learning models always receive data with the same number of dimensions, even when a new portion of data contains new values.

old_frame = pd.DataFrame({"B": ("a", "a", "b")})
dummify = dummifier(dict(old_frame))

new_frame = pd.DataFrame({"B": ("a", "b", "c")})
print(dummify(new_frame))

#    B_a  B_b
# 0  1.0  0.0
# 1  0.0  1.0
# 2  0.0  0.0

print(pd.get_dummies(new_frame))

#    B_a  B_b  B_c
# 0    1    0    0
# 1    0    1    0
# 2    0    0    1
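
The fixed-column behaviour shown above can be approximated in plain pandas by declaring the known categories up front. This is only a sketch of the idea, not DSKit's implementation (which, per the description above, goes through encoder). The names dummify_column and known_values are hypothetical and introduced purely for illustration.

```python
import pandas as pd


def dummify_column(column: pd.Series, known_values) -> pd.DataFrame:
    # Hypothetical helper, not part of DSKit.
    # Values outside known_values become NaN in the categorical,
    # so they produce all-zero rows instead of new columns.
    categorical = pd.Categorical(column, categories=list(known_values))
    series = pd.Series(categorical, index=column.index)
    return pd.get_dummies(series, prefix=column.name, dtype=float)


new_frame = pd.DataFrame({"B": ("a", "b", "c")})
result = dummify_column(new_frame["B"], known_values=("a", "b"))
print(result)  # columns B_a and B_b only; the row for "c" is all zeros
```

Because the column set is fixed by known_values rather than by the data, the output width stays the same for every new portion of data.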

encoder

encoder takes a set of values and returns a Callable[[Tuple[object, ...]], pd.DataFrame] that one-hot-encodes the values passed to it. encoder also accepts an optional name parameter of type Callable[[object], str], which maps each unique value from the original set to a column name in the one-hot-encoded frame. The default is str. Under the hood, encoder uses sklearn.preprocessing.OneHotEncoder.

import numpy as np
from dskit.frame import encoder

encoded = encoder((1, 2, 3))((1, 2, 3, 4, np.nan))
print(encoded)

#      1    2    3
# 0  1.0  0.0  0.0
# 1  0.0  1.0  0.0
# 2  0.0  0.0  1.0
# 3  0.0  0.0  0.0
# 4  0.0  0.0  0.0

encoded = encoder((1, 2, 3), name=lambda x: "column_" + str(x))((1, 2, 3, 4, np.nan))
print(encoded)

#    column_1  column_2  column_3
# 0       1.0       0.0       0.0
# 1       0.0       1.0       0.0
# 2       0.0       0.0       1.0
# 3       0.0       0.0       0.0
# 4       0.0       0.0       0.0
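
To make the role of the name parameter concrete, here is a dependency-free sketch of the same idea. It is an assumption-laden stand-in, not DSKit's code: make_encoder is a hypothetical name, and it returns plain lists rather than a pd.DataFrame built with OneHotEncoder.

```python
import math


def make_encoder(values, name=str):
    # Hypothetical stand-in for encoder, for illustration only.
    categories = tuple(dict.fromkeys(values))  # unique values, first-seen order
    columns = [name(c) for c in categories]    # one column name per known value

    def encode(xs):
        # A value that matches no known category yields an all-zero row.
        return [[1.0 if x == c else 0.0 for c in categories] for x in xs]

    return columns, encode


columns, encode = make_encoder((1, 2, 3), name=lambda x: "column_" + str(x))
rows = encode((1, 2, 3, 4, math.nan))
print(columns)
print(rows)
```

The unknown value 4 and the NaN both compare unequal to every known category, which is why their rows contain only zeros, as in the outputs above.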

dskit.tensor

batch

batch takes a Tuple[Tuple[np.ndarray, ...], ...], transposes it, and applies np.stack to each element, producing a Tuple[np.ndarray, ...].

import numpy as np
from dskit.tensor import batch

xs = (
  (np.array([1, 2, 3]), np.array([4, 5]), np.ones((2, 3))),
  (np.array([7, 8, 9]), np.array([5, 4]), np.zeros((2, 3)))
)

x, y, z = batch(xs)

print(x)
print("=" * 5)
print(y)
print("=" * 5)
print(z)

# [[1 2 3]
#  [7 8 9]]
# =====
# [[4 5]
#  [5 4]]
# =====
# [[[1. 1. 1.]
#   [1. 1. 1.]]
#
#  [[0. 0. 0.]
#   [0. 0. 0.]]]
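
The "transpose, then np.stack" description can be sketched in a couple of lines of NumPy. The name batch_sketch is hypothetical; the actual implementation in DSKit may differ in details such as validation.

```python
import numpy as np


def batch_sketch(samples):
    # zip(*samples) transposes the tuple of tuples: it groups the i-th
    # array of every sample together, and np.stack joins each group
    # along a new leading axis.
    return tuple(np.stack(group) for group in zip(*samples))


samples = (
    (np.array([1, 2, 3]), np.array([4, 5])),
    (np.array([7, 8, 9]), np.array([5, 4])),
)
x, y = batch_sketch(samples)
print(x.shape, y.shape)  # (2, 3) (2, 2)
```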

batches

batches takes a sliding window length n and a step, and returns a function that takes an Iterable[Tuple[np.ndarray, ...]], slides a window over it, and applies batch to each window. The returned function yields an Iterable[Tuple[np.ndarray, ...]]. By default every window has length exactly n; with exact=False, a window may be shorter than n. step is the shift between consecutive windows and defaults to n.

import numpy as np
from dskit.tensor import batches

xs = np.arange(15).reshape(-1, 3)
ys = np.arange(10).reshape(-1, 2)

print(xs)

# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]
#  [12 13 14]]

print(ys)

# [[0 1]
#  [2 3]
#  [4 5]
#  [6 7]
#  [8 9]]

for x, y in batches(n=3)(zip(xs, ys)):
  print(x)
  print("=" * 5)
  print(y)

  print()

# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]
# =====
# [[0 1]
#  [2 3]
#  [4 5]]
#

for x, y in batches(n=3, step=2, exact=False)(zip(xs, ys)):
  print(x)
  print("=" * 5)
  print(y)

  print()

# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]
# =====
# [[0 1]
#  [2 3]
#  [4 5]]
#
# [[ 6  7  8]
#  [ 9 10 11]
#  [12 13 14]]
# =====
# [[4 5]
#  [6 7]
#  [8 9]]
#
# [[12 13 14]]
# =====
# [[8 9]]
#
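
The windowing rules (n, step, exact) can be illustrated without NumPy. The sketch below reproduces which rows fall into each window in the two examples above; windows is a hypothetical helper, and it materialises the input as a list for simplicity, which the real function may well avoid.

```python
def windows(items, n, step=None, exact=True):
    # Hypothetical helper mirroring the n / step / exact semantics.
    step = n if step is None else step   # by default the windows do not overlap
    items = list(items)                  # simplification: materialise the iterable
    for start in range(0, len(items), step):
        window = items[start:start + n]
        if exact and len(window) < n:
            break                        # drop the trailing short window
        yield window


default_windows = list(windows(range(5), n=3))
overlapping = list(windows(range(5), n=3, step=2, exact=False))
print(default_windows)  # [[0, 1, 2]]
print(overlapping)      # [[0, 1, 2], [2, 3, 4], [4]]
```

Applying batch to the rows selected by each window would then give the stacked arrays printed above.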

cycle

cycle is a multidimensional version of itertools.cycle. It takes a np.ndarray and a Tuple[int, ...] giving the number of repetitions along each axis, and returns the "cycled" np.ndarray.

import numpy as np
from dskit.tensor import cycle

xs = np.arange(4).reshape(-1, 2)
print(xs)

# [[0 1]
#  [2 3]]

cycled_xs = cycle(xs, (3, 3))
print(cycled_xs)

# [[0 1 0 1 0 1]
#  [2 3 2 3 2 3]
#  [0 1 0 1 0 1]
#  [2 3 2 3 2 3]
#  [0 1 0 1 0 1]
#  [2 3 2 3 2 3]]

zeros = cycle(0, (2, 2, 3))
print(zeros)

# [[[0 0 0]
#   [0 0 0]]
#
#  [[0 0 0]
#   [0 0 0]]]
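
For comparison, the two outputs above coincide with what NumPy's own np.tile produces for the same arguments. Whether cycle is implemented in terms of np.tile is not stated here, so treat the snippet below as a cross-check rather than a description of the internals.

```python
import numpy as np

xs = np.arange(4).reshape(-1, 2)

# Repeat the 2x2 block three times along each axis.
tiled = np.tile(xs, (3, 3))
print(tiled.shape)  # (6, 6)

# A scalar is treated as a 0-d array and repeated into the given shape.
zeros = np.tile(0, (2, 2, 3))
print(zeros.shape)  # (2, 2, 3)
```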

gridrange

gridrange is similar to Python's range, except that it operates on Tuple[int, ...] instead of int.

from dskit.tensor import gridrange

for x in gridrange((2, 3)):
  print(x)

# (0, 0)
# (0, 1)
# (0, 2)
# (1, 0)
# (1, 1)
# (1, 2)

for x in gridrange((1, 1), (3, 4)):
  print(x)

# (1, 1)
# (1, 2)
# (1, 3)
# (2, 1)
# (2, 2)
# (2, 3)

for x in gridrange((1, 1), (10, 20), (5, 5)):
  print(x)

# (1, 1)
# (1, 6)
# (1, 11)
# (1, 16)
# (6, 1)
# (6, 6)
# (6, 11)
# (6, 16)
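
One way to think about gridrange is as the Cartesian product of one range per dimension. The sketch below follows that reading and reproduces the three outputs above; gridrange_sketch is a hypothetical name, and the real function may handle arguments differently.

```python
from itertools import product


def gridrange_sketch(*args):
    # Accept (stop,), (start, stop) or (start, stop, step), like range,
    # but with every argument being a tuple of ints.
    if len(args) == 1:
        (stops,) = args
        starts, steps = (0,) * len(stops), (1,) * len(stops)
    elif len(args) == 2:
        starts, stops = args
        steps = (1,) * len(stops)
    elif len(args) == 3:
        starts, stops, steps = args
    else:
        raise TypeError("expected 1 to 3 tuple arguments")
    # One range per dimension; product varies the last dimension fastest.
    ranges = (range(a, b, s) for a, b, s in zip(starts, stops, steps))
    return product(*ranges)


points = list(gridrange_sketch((1, 1), (10, 20), (5, 5)))
print(points)
```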

iteraxis

iteraxis takes a np.ndarray and returns an Iterable[np.ndarray] of the one-dimensional arrays along the passed axis. It is similar to np.apply_along_axis, except that np.apply_along_axis applies a function to those arrays, whereas iteraxis yields the arrays themselves.

import numpy as np
from dskit.tensor import iteraxis

xs = np.arange(27).reshape(-1, 3, 3)

for x in iteraxis(xs, axis=-1):
  print(x)

# [0 1 2]
# [3 4 5]
# [6 7 8]
# [ 9 10 11]
# [12 13 14]
# [15 16 17]
# [18 19 20]
# [21 22 23]
# [24 25 26]
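
A possible way to obtain the same one-dimensional arrays with plain NumPy is to move the chosen axis to the end and flatten all remaining axes. This is a sketch under that assumption; iteraxis_sketch is a hypothetical name, not the library's function.

```python
import numpy as np


def iteraxis_sketch(xs, axis=-1):
    # Bring the requested axis to the last position, then collapse every
    # other axis so that each row is one 1-D array along the original axis.
    moved = np.moveaxis(xs, axis, -1)
    return iter(moved.reshape(-1, moved.shape[-1]))


xs = np.arange(27).reshape(-1, 3, 3)
last_axis = [x.tolist() for x in iteraxis_sketch(xs, axis=-1)]
first_axis = [x.tolist() for x in iteraxis_sketch(xs, axis=0)]
print(last_axis[0])   # [0, 1, 2]
print(first_axis[0])  # [0, 9, 18]
```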

move

move places a source np.ndarray into a destination np.ndarray at a coordinate given as a Tuple[int, ...]. It works on a copy of the destination array unless inplace=True is passed. The default coordinate is (0, 0, ...).

import numpy as np
from dskit.tensor import move

xs = np.arange(4).reshape(-1, 2)
ys = np.zeros((3, 3), dtype=np.uint)

moved = move(xs, ys, coordinate=(1, 1))
print(moved)

# [[0 0 0]
#  [0 0 1]
#  [0 2 3]]

xs = np.arange(4).reshape(-1, 2)
ys = np.zeros((3, 3), dtype=np.uint)

_ = move(xs, ys, inplace=True)
print(ys)

# [[0 1 0]
#  [2 3 0]
#  [0 0 0]]
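
The behaviour shown above amounts to assigning the source into a rectangular region of the destination. The following sketch assumes the source fits entirely inside the destination; move_sketch is a hypothetical name, and how the real move treats out-of-bounds regions is not covered here.

```python
import numpy as np


def move_sketch(source, destination, coordinate=None, inplace=False):
    # Default to the origin, i.e. (0, 0, ...), one zero per source dimension.
    if coordinate is None:
        coordinate = (0,) * source.ndim
    target = destination if inplace else destination.copy()
    # One slice per dimension: from the coordinate to coordinate + source size.
    region = tuple(slice(c, c + n) for c, n in zip(coordinate, source.shape))
    target[region] = source
    return target


source = np.arange(4).reshape(-1, 2)
destination = np.zeros((3, 3), dtype=int)

moved = move_sketch(source, destination, coordinate=(1, 1))
print(moved)
print(destination.sum())  # 0, because the original array was left untouched
```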

slices

slices is simply:

from itertools import starmap
from typing import Iterable, Optional, Tuple, Union

RawSlice = Union[
  Tuple[Optional[int]],
  Tuple[Optional[int], Optional[int]],
  Tuple[Optional[int], Optional[int], Optional[int]]
]

def slices(xs: Iterable[RawSlice]) -> Tuple[slice, ...]:
  return tuple(starmap(slice, xs))

Example of slices usage:

import numpy as np
from dskit.tensor import slices

xs = np.arange(9).reshape(-1, 3)
ys = (1, None), (0, 1)

print(xs[slices(ys)])

# [[3]
#  [6]]

# same as

print(xs[1:, 0:1])

# [[3]
#  [6]]
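
Since each raw tuple is unpacked straight into the built-in slice, the one-, two- and three-element forms follow the same conventions as slice(stop), slice(start, stop) and slice(start, stop, step). A small standard-library illustration (the variable names are arbitrary):

```python
from itertools import starmap

raw = ((1, None), (None, None, 2), (3,))

# starmap(slice, raw) calls slice(*item) for every item in raw.
result = tuple(starmap(slice, raw))
print(result)
# (slice(1, None, None), slice(None, None, 2), slice(None, 3, None))

# Each slice can also be applied on its own to an ordinary sequence.
every_second = list(range(10))[result[1]]
print(every_second)  # [0, 2, 4, 6, 8]
```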

Project details

Release history

0.1.1 - Jan 23, 2022 (this version)
0.1 - Aug 20, 2021
0.0.11 - Jul 3, 2021
0.0.10 - May 3, 2021
0.0.9 - Apr 11, 2021
0.0.8 - Mar 31, 2021
0.0.7 - Mar 6, 2021
0.0.6 - Mar 6, 2021
0.0.5 - Feb 6, 2021
0.0.4 - Feb 6, 2021
0.0.3 - Feb 3, 2021
0.0.2 - Jan 30, 2021
0.0.1 - Jan 6, 2021

Download files

Source Distribution

dskit-0.1.1.tar.gz (8.9 kB)

Uploaded Jan 23, 2022 Source

Built Distribution

dskit-0.1.1-py3-none-any.whl (9.5 kB)

Uploaded Jan 23, 2022 Python 3

File details

Details for the file dskit-0.1.1.tar.gz.

File metadata

Download URL: dskit-0.1.1.tar.gz
Upload date: Jan 23, 2022
Size: 8.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.1

File hashes

Hashes for dskit-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`5c17920baef13ec5389070cc4cb22944c575a3d738416a91b3cef78215f4c41e`
MD5	`977544d3726da4bcadb379e733270ef7`
BLAKE2b-256	`ee0c55137dc98f0dacd80eba793573c3965fb4a943fea348f7e5d2fa5d9acf46`

File details

Details for the file dskit-0.1.1-py3-none-any.whl.

File metadata

Download URL: dskit-0.1.1-py3-none-any.whl
Upload date: Jan 23, 2022
Size: 9.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.1

File hashes

Hashes for dskit-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`97977e0b03eb6ab1d412013c1cbbeffda3be3fdf9c0949fdb3b1ec3d44f3f2ac`
MD5	`12603bcbae8436ca53d407382b4c61af`
BLAKE2b-256	`18be24d61afd6707c724f4a99181de6f39cff12b02d767441f5d5ce35b88cd5e`