Python Data Science Kit for Humans.

DSKit

DSKit (Data Science Kit) is a Python package that provides tools for solving simple Data Science routine problems.

Installing

pip install dskit

Tutorial

DSKit consists of two submodules:

  • dskit.frame - contains a set of functions for pandas.DataFrame and pandas.Series manipulation.
  • dskit.tensor - contains a set of functions for numpy.ndarray manipulation.

dskit.frame

dummifier

dummifier is a less harmful alternative to pd.get_dummies. It takes a Dict[str, Tuple[object, ...]] and returns a Callable[[pd.DataFrame], pd.DataFrame], which takes a frame and returns a dummified frame. Each key of the dictionary is treated as a column name, and the corresponding value as the set of unique values of that column. dummifier also takes an optional parameter name of type Callable[[str, object], str]. The name function takes a column name and a unique value of that column and produces the name of a column in the dummified frame. The default implementation of name is lambda n, x: n + "_" + str(x). dummifier uses the encoder function under the hood.

import pandas as pd
from dskit.frame import dummifier

xs = pd.DataFrame({"A": (1, 2, 2, 5, 5), "B": ("a", "a", "b", "c", "d")})

dummify = dummifier(dict(xs))
print(dummify(xs))

#    A_1  A_2  A_5  B_a  B_b  B_c  B_d
# 0  1.0  0.0  0.0  1.0  0.0  0.0  0.0
# 1  0.0  1.0  0.0  1.0  0.0  0.0  0.0
# 2  0.0  1.0  0.0  0.0  1.0  0.0  0.0
# 3  0.0  0.0  1.0  0.0  0.0  1.0  0.0
# 4  0.0  0.0  1.0  0.0  0.0  0.0  1.0

ys = pd.DataFrame({"C": (True, True, False, True), "A": (1, 2, 3, 4)})
print(dummify(ys))

#        C  A_1  A_2  A_5
# 0   True  1.0  0.0  0.0
# 1   True  0.0  1.0  0.0
# 2  False  0.0  0.0  0.0
# 3   True  0.0  0.0  0.0

One reason dummifier is less harmful than pd.get_dummies is that it does not dummify values it was not given up front. As a result, Machine Learning models operate on data with the same number of dimensions even when a new portion of data contains previously unseen values.

old_frame = pd.DataFrame({"B": ("a", "a", "b")})
dummify = dummifier(dict(old_frame))

new_frame = pd.DataFrame({"B": ("a", "b", "c")})
print(dummify(new_frame))

#    B_a  B_b
# 0  1.0  0.0
# 1  0.0  1.0
# 2  0.0  0.0

print(pd.get_dummies(new_frame))

#    B_a  B_b  B_c
# 0    1    0    0
# 1    0    1    0
# 2    0    0    1
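
The fixed-column behaviour shown above can be approximated with plain pandas, which may help when reasoning about what dummifier guarantees. The sketch below is not DSKit's implementation: fixed_dummies is a hypothetical helper, and it relies on the assumption that casting a column to a categorical dtype with a fixed list of categories is an acceptable stand-in, since values outside that list become NaN and therefore produce all-zero rows.

```python
import pandas as pd


def fixed_dummies(frame, categories):
    """Hypothetical helper: one-hot-encode columns against fixed categories."""
    parts = []
    for column in frame.columns:
        if column in categories:
            # Values outside the known categories become NaN here,
            # so pd.get_dummies emits an all-zero row for them.
            dtype = pd.CategoricalDtype(list(categories[column]))
            fixed = frame[column].astype(dtype)
            parts.append(pd.get_dummies(fixed, prefix=column, dtype=float))
        else:
            # Columns without known categories are passed through unchanged.
            parts.append(frame[[column]])
    return pd.concat(parts, axis=1)


old_frame = pd.DataFrame({"B": ("a", "a", "b")})
new_frame = pd.DataFrame({"B": ("a", "b", "c")})

known = {"B": tuple(old_frame["B"].unique())}
result = fixed_dummies(new_frame, known)
print(result)
```

The unseen value "c" does not add a B_c column, so the output keeps the two columns learned from old_frame.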

encoder

encoder is a function that takes a set of values and returns a Callable[[Tuple[object, ...]], pd.DataFrame]. The returned function one-hot-encodes the values passed to it. encoder also takes an optional parameter name of type Callable[[object], str]. The name function takes a unique value from the passed set and produces the name of a column in the one-hot-encoded frame. The default implementation of name is str. This function uses sklearn.preprocessing.OneHotEncoder under the hood.

import numpy as np
from dskit.frame import encoder

encoded = encoder((1, 2, 3))((1, 2, 3, 4, np.nan))
print(encoded)

#      1    2    3
# 0  1.0  0.0  0.0
# 1  0.0  1.0  0.0
# 2  0.0  0.0  1.0
# 3  0.0  0.0  0.0
# 4  0.0  0.0  0.0

encoded = encoder((1, 2, 3), name=lambda x: "column_" + str(x))((1, 2, 3, 4, np.nan))
print(encoded)

#    column_1  column_2  column_3
# 0       1.0       0.0       0.0
# 1       0.0       1.0       0.0
# 2       0.0       0.0       1.0
# 3       0.0       0.0       0.0
# 4       0.0       0.0       0.0
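
To make the encoding rule concrete, here is a minimal pure-Python sketch of the same idea. It does not use sklearn.preprocessing.OneHotEncoder as encoder does; one_hot is a hypothetical name, and the plain equality comparison is an assumption chosen for readability rather than a description of DSKit's internals.

```python
import pandas as pd


def one_hot(categories, name=str):
    """Hypothetical stand-in: build an encoder for a fixed set of categories."""
    def encode(values):
        # Each row has a 1.0 in the column of the matching category;
        # unknown values (including NaN) match nothing and stay all-zero.
        rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
        return pd.DataFrame(rows, columns=[name(c) for c in categories])
    return encode


encoded = one_hot((1, 2, 3))((1, 2, 3, 4, float("nan")))
print(encoded)
```

Because NaN never compares equal to anything, it falls into the same all-zero case as the unknown value 4.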

dskit.tensor

batch

batch is a function that takes a Tuple[Tuple[np.ndarray, ...], ...], transposes it, and applies np.stack to each element, resulting in a Tuple[np.ndarray, ...].

import numpy as np
from dskit.tensor import batch

xs = (
  (np.array([1, 2, 3]), np.array([4, 5]), np.ones((2, 3))),
  (np.array([7, 8, 9]), np.array([5, 4]), np.zeros((2, 3)))
)

x, y, z = batch(xs)

print(x)
print("=" * 5)
print(y)
print("=" * 5)
print(z)

# [[1 2 3]
#  [7 8 9]]
# =====
# [[4 5]
#  [5 4]]
# =====
# [[[1. 1. 1.]
#   [1. 1. 1.]]
#
#  [[0. 0. 0.]
#   [0. 0. 0.]]]
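
The "transpose, then stack" description can be written in a couple of lines of NumPy. The following is only a sketch of that description, with stack_columns as a hypothetical name; it is not taken from DSKit's source.

```python
import numpy as np


def stack_columns(samples):
    """Hypothetical helper: transpose a tuple of samples and stack each field."""
    # zip(*samples) regroups the i-th array of every sample together,
    # and np.stack adds a new leading axis over those arrays.
    return tuple(np.stack(column) for column in zip(*samples))


samples = (
    (np.array([1, 2, 3]), np.array([4, 5])),
    (np.array([7, 8, 9]), np.array([5, 4])),
)

x, y = stack_columns(samples)
print(x.shape, y.shape)
```

Note that np.stack requires the arrays in each position to share a shape, so every sample must be laid out the same way.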

batches

batches is a function that takes a sliding window length n and a step, and returns a function that takes an Iterable[Tuple[np.ndarray, ...]], applies a sliding window over it, and calls batch on each window. The returned function produces an Iterable[Tuple[np.ndarray, ...]]. By default every window has length exactly n, so a trailing window that is too short is dropped. When exact=False is passed, a window may be shorter than n. step is the shift of the sliding window between consecutive windows. By default step is equal to n.

xs = np.arange(15).reshape(-1, 3)
ys = np.arange(10).reshape(-1, 2)

print(xs)

# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]
#  [12 13 14]]

print(ys)

# [[0 1]
#  [2 3]
#  [4 5]
#  [6 7]
#  [8 9]]

for x, y in batches(n=3)(zip(xs, ys)):
  print(x)
  print("=" * 5)
  print(y)

  print()

# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]
# =====
# [[0 1]
#  [2 3]
#  [4 5]]
#

for x, y in batches(n=3, step=2, exact=False)(zip(xs, ys)):
  print(x)
  print("=" * 5)
  print(y)

  print()

# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]
# =====
# [[0 1]
#  [2 3]
#  [4 5]]
#
# [[ 6  7  8]
#  [ 9 10 11]
#  [12 13 14]]
# =====
# [[4 5]
#  [6 7]
#  [8 9]]
#
# [[12 13 14]]
# =====
# [[8 9]]
#
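
The windowing rule itself (length n, shift step, optional short tail) can be illustrated without NumPy. The sketch below uses a hypothetical sliding_windows function and materialises the input as a list for simplicity, which is an assumption and may differ from how batches consumes its iterable; a batch-like stacking step would then be applied to each window.

```python
def sliding_windows(items, n, step=None, exact=True):
    """Hypothetical helper: yield windows of length n, shifted by step."""
    step = n if step is None else step
    items = list(items)  # materialised so that windows can overlap
    for start in range(0, len(items), step):
        window = items[start:start + n]
        if exact and len(window) < n:
            # With exact=True a short trailing window is dropped.
            break
        yield window


print(list(sliding_windows(range(5), n=3)))
print(list(sliding_windows(range(5), n=3, step=2, exact=False)))
```

With five items, the second call yields windows starting at positions 0, 2, and 4, which mirrors the three batches printed in the example above.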

cycle

cycle is a multidimensional version of the itertools.cycle function. It takes a np.ndarray and a Tuple[int, ...] giving the number of repetitions along each axis, and returns the "cycled" np.ndarray.

xs = np.arange(4).reshape(-1, 2)
print(xs)

# [[0 1]
#  [2 3]]

cycled_xs = cycle(xs, (3, 3))
print(cycled_xs)

# [[0 1 0 1 0 1]
#  [2 3 2 3 2 3]
#  [0 1 0 1 0 1]
#  [2 3 2 3 2 3]
#  [0 1 0 1 0 1]
#  [2 3 2 3 2 3]]

zeros = cycle(0, (2, 2, 3))
print(zeros)

# [[[0 0 0]
#   [0 0 0]]
#
#  [[0 0 0]
#   [0 0 0]]]
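
For comparison, the two outputs above match what np.tile produces for the same arguments. Whether cycle is implemented in terms of np.tile, or behaves identically for other inputs, is an assumption; the snippet only reproduces the two cases shown.

```python
import numpy as np

xs = np.arange(4).reshape(-1, 2)

# Repeat the 2x2 block three times along each axis, giving a 6x6 array.
tiled = np.tile(xs, (3, 3))
print(tiled)

# A scalar is treated as a 0-dimensional array and tiled to shape (2, 2, 3).
zeros = np.tile(0, (2, 2, 3))
print(zeros.shape)
```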

gridrange

gridrange is a function similar to Python's range function. The difference between gridrange and range is that gridrange operates on Tuple[int, ...] instead of int.

for x in gridrange((2, 3)):
  print(x)

# (0, 0)
# (0, 1)
# (0, 2)
# (1, 0)
# (1, 1)
# (1, 2)

for x in gridrange((1, 1), (3, 4)):
  print(x)

# (1, 1)
# (1, 2)
# (1, 3)
# (2, 1)
# (2, 2)
# (2, 3)

for x in gridrange((1, 1), (10, 20), (5, 5)):
  print(x)

# (1, 1)
# (1, 6)
# (1, 11)
# (1, 16)
# (6, 1)
# (6, 6)
# (6, 11)
# (6, 16)
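
One way to picture gridrange is as a Cartesian product of one range per axis. The sketch below follows that idea with itertools.product; grid is a hypothetical name, and the argument handling is an assumption modelled on the three call styles shown above.

```python
from itertools import product


def grid(*args):
    """Hypothetical helper: range-like iteration over tuples of ints."""
    # zip(*args) regroups the arguments per axis, for example
    # ((1, 1), (3, 4)) becomes (1, 3) and (1, 4), one range per axis.
    return product(*(range(*axis_args) for axis_args in zip(*args)))


print(list(grid((2, 3))))
print(list(grid((1, 1), (10, 20), (5, 5))))
```

The last axis varies fastest, which is the same ordering as in the printed examples.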

iteraxis

iteraxis is a function that takes a np.ndarray and returns an Iterable[np.ndarray] of the 1-D arrays along the passed axis. This function is similar to np.apply_along_axis. The difference is that np.apply_along_axis applies a function to those arrays, whereas iteraxis returns the arrays themselves.

xs = np.arange(27).reshape(-1, 3, 3)

for x in iteraxis(xs, axis=-1):
  print(x)

# [0 1 2]
# [3 4 5]
# [6 7 8]
# [ 9 10 11]
# [12 13 14]
# [15 16 17]
# [18 19 20]
# [21 22 23]
# [24 25 26]
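
A rough NumPy equivalent moves the chosen axis to the end and then walks over the resulting rows. This is a sketch under the assumption that iteraxis yields 1-D arrays in row-major order of the remaining axes; vectors_along is a hypothetical name, not part of DSKit.

```python
import numpy as np


def vectors_along(array, axis=-1):
    """Hypothetical helper: yield the 1-D arrays that lie along the given axis."""
    # Put the chosen axis last, then flatten every other axis into one,
    # so that each row of the 2-D result is one vector along that axis.
    moved = np.moveaxis(array, axis, -1)
    yield from moved.reshape(-1, array.shape[axis])


xs = np.arange(27).reshape(-1, 3, 3)

rows = list(vectors_along(xs, axis=-1))
print(rows[0], rows[-1])

columns = list(vectors_along(xs, axis=0))
print(columns[0])
```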

move

move places a source np.ndarray into a destination np.ndarray at a coordinate of type Tuple[int, ...]. move works on a copy of the destination array unless inplace=True is passed. The default coordinate is (0, 0, ...), that is, the origin of the destination array.

xs = np.arange(4).reshape(-1, 2)
ys = np.zeros((3, 3), dtype=np.uint)

moved = move(xs, ys, coordinate=(1, 1))
print(moved)

# [[0 0 0]
#  [0 0 1]
#  [0 2 3]]

xs = np.arange(4).reshape(-1, 2)
ys = np.zeros((3, 3), dtype=np.uint)

_ = move(xs, ys, inplace=True)
print(ys)

# [[0 1 0]
#  [2 3 0]
#  [0 0 0]]
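
The placement shown above amounts to a slice assignment into the destination. The sketch below illustrates that reading with a hypothetical place function; it assumes the source fits entirely inside the destination and says nothing about how move treats out-of-bounds placements.

```python
import numpy as np


def place(source, destination, coordinate=None, inplace=False):
    """Hypothetical helper: write source into destination at coordinate."""
    if coordinate is None:
        coordinate = (0,) * destination.ndim
    target = destination if inplace else destination.copy()
    # One slice per axis, starting at the coordinate and spanning the source.
    region = tuple(slice(c, c + s) for c, s in zip(coordinate, source.shape))
    target[region] = source
    return target


xs = np.arange(4).reshape(-1, 2)
ys = np.zeros((3, 3), dtype=np.uint)

moved = place(xs, ys, coordinate=(1, 1))
print(moved)
```

Since inplace defaults to False here, ys is left untouched and only the returned copy contains the source values.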

slices

slices is simply:

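from itertools import starmap
from typing import Iterable, Optional, Tuple, Union

# Arguments for one slice: (stop,), (start, stop) or (start, stop, step).
RawSlice = Union[
  Tuple[Optional[int]],
  Tuple[Optional[int], Optional[int]],
  Tuple[Optional[int], Optional[int], Optional[int]]
]

def slices(xs: Iterable[RawSlice]) -> Tuple[slice, ...]:
  # Unpack each raw tuple into the built-in slice constructor.
  return tuple(starmap(slice, xs))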

Example of slices usage:

xs = np.arange(9).reshape(-1, 3)
ys = (1, None), (0, 1)

print(xs[slices(ys)])

# [[3]
#  [6]]

# same as

print(xs[1:, 0:1])

# [[3]
#  [6]]
