Skip to main content

Python Data Science Kit for Humans.

Project description

DSKit

DSKit (Data Science Kit) is a Python package that provides tools for solving simple Data Science tasks.

Installing

pip install dskit

Tutorial

DSKit consists of two submodules:

  • dskit.frame - contains a set of functions for pandas.DataFrame and pandas.Series manipulation.
  • dskit.tensor - contains a set of functions for numpy.ndarray manipulation.

dskit.frame

create_dummifier

create_dummifier is less harmful alternative to pd.get_dummies. This function takes a Dict[str, Tuple[object, ...]] and returns Callable[[pd.DataFrame], pd.DataFrame]. Key of the dictionary is treated as a name of a column and value of the dictionary is treated as unique values of that column. Each key and value of the dictionary are passed to create_encoder function. Bases on created encoders a new function is created which takes pd.DataFrame and returns pd.DataFrame. The returned function is a "dummify" function.

frame = pd.DataFrame({"A": (1, 2, 2, 5, 5), "B": ("a", "a", "b", "c", "d")})
uniques: Dict[str, Tuple[object, ...]] = unique(frame)

dummify: Callable[[pd.DataFrame], pd.DataFrame] = create_dummifier(uniques)
print(dummify(frame))

#    A_1  A_2  A_5  B_a  B_b  B_c  B_d
# 0  1.0  0.0  0.0  1.0  0.0  0.0  0.0
# 1  0.0  1.0  0.0  1.0  0.0  0.0  0.0
# 2  0.0  1.0  0.0  0.0  1.0  0.0  0.0
# 3  0.0  0.0  1.0  0.0  0.0  1.0  0.0
# 4  0.0  0.0  1.0  0.0  0.0  0.0  1.0

other_frame = pd.DataFrame({"C": (True, True, False, True), "A": (1, 2, 3, 4)})
print(dummify(other_frame))

#    A_1  A_2  A_5      C
# 0  1.0  0.0  0.0   True
# 1  0.0  1.0  0.0   True
# 2  0.0  0.0  0.0  False
# 3  0.0  0.0  0.0   True

# notice, that columns are sorted alphabetically (C comes after A)
# you can pass sort_columns=False if you want to omit sorting

One of the reasons why create_dummifier is less harmful than pd.get_dummies is that it will not dummify new values. Thanks to that Machine Learning models will operate on data with the same number of dimensions regardless of new values presence in new portion of data.

old_frame = pd.DataFrame({"B": ("a", "a", "b")})
new_frame = pd.DataFrame({"B": ("a", "b", "c")})

uniques: Dict[str, Tuple[object, ...]] = unique(old_frame)
dummify: Callable[[pd.DataFrame], pd.DataFrame] = create_dummifier(uniques)

print(dummify(new_frame))

#    B_a  B_b
# 0  1.0  0.0
# 1  0.0  1.0
# 2  0.0  0.0

print(pd.get_dummies(new_frame))

#    B_a  B_b  B_c
# 0    1    0    0
# 1    0    1    0
# 2    0    0    1

create_encoder

create_encoder is a function used by create_dummifier. create_encoder takes a column name with a set of values and returns Callable[[Tuple[object, ...]], pd.DataFrame]. The returned function one-hot-encodes passed values. This function uses sklearn.preprocessing.OneHotEncoder under the hood.

encoded: pd.DataFrame = create_encoder("A", (1, 2, 3))((1, 2, 3, 4, 5))
print(encoded)

#    A_1  A_2  A_3
# 0  1.0  0.0  0.0
# 1  0.0  1.0  0.0
# 2  0.0  0.0  1.0
# 3  0.0  0.0  0.0
# 4  0.0  0.0  0.0

unique

unique is a function which takes pd.DataFrame and returns Dict[str, Tuple[object, ...]]. Key of the dictionary is a name of a column of the passed frame and value of the dictionary is unique values of that column.

frame = pd.DataFrame({"A": (1, 2, 2, 5, 5), "B": ("a", "a", "b", "c", "d")})
uniques: Dict[str, Tuple[object, ...]] = unique(frame)

print(uniques)

# {'A': (1, 2, 5), 'B': ('a', 'b', 'c', 'd')}

dskit.tensor

create_batches

create_batches is a function which takes Iterable[Tuple[np.ndarray, ...]] with length of batches and returns an Iterable[Tuple[np.ndarray, ...]] of created batches.

xs = np.arange(15).reshape(-1, 3)
ys = np.arange(10).reshape(-1, 2)

print(xs)

# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]
#  [12 13 14]]

print(ys)

# [[0 1]
#  [2 3]
#  [4 5]
#  [6 7]
#  [8 9]]

for sub_xs, sub_ys in create_batches(zip(xs, ys), length=3):
  print(sub_xs)
  print(sub_ys)

  print()

# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]
# [[0 1]
#  [2 3]
#  [4 5]]
#
# [[ 9 10 11]
#  [12 13 14]]
# [[6 7]
#  [8 9]]

cycle_tensor

cycle_tensor is a "cycle" function used for tensors. This function takes a np.ndarray with Tuple[int, ...] and returns "cycled" np.ndarray.

xs = np.arange(4).reshape(-1, 2)
print(xs)

# [[0 1]
#  [2 3]]

cycled_xs = cycle_tensor(xs, (3, 3))
print(cycled_xs)

# [[0 1 0 1 0 1]
#  [2 3 2 3 2 3]
#  [0 1 0 1 0 1]
#  [2 3 2 3 2 3]
#  [0 1 0 1 0 1]
#  [2 3 2 3 2 3]]

zeros = cycle_tensor(0, (2, 2, 3))
print(zeros)

# [[[0 0 0]
#   [0 0 0]]
#
#  [[0 0 0]
#   [0 0 0]]]

expand_tail

expand_tail is simply:

def expand_tail(xs: Tuple[X, ...], length: int, filler: X) -> Tuple[X, ...]:
  count = max(length - len(xs), 0)

  fillers = repeat(filler)
  sliced_fillers = islice(fillers, count)

  expanded = chain(xs, sliced_fillers)
  return tuple(expanded)

Example of expand_tail usage:

xs = expand_tail((1, 2), length=5, filler=3)
print(xs) # (1, 2, 3, 3, 3)

expand_tail_dimensions

expand_tail_dimensions is simply:

def expand_tail_dimensions(tensor: np.ndarray, length: int) -> np.ndarray:
  expanded_shape: Shape = expand_tail(tensor.shape, length, filler=1)
  return tensor.reshape(expanded_shape)

Example of expand_tail_dimensions usage:

xs = np.arange(27).reshape(-1, 3, 3)
ys = expand_tail_dimensions(xs, 5)

print(ys.shape) # (3, 3, 3, 1, 1)

gridrange

gridrange is a function similar to Python's range function. The difference between gridrange and range is that gridrange operates on Tuple[int, ...] instead of int.

for x in gridrange((2, 3)):
  print(x)

# (0, 0)
# (0, 1)
# (0, 2)
# (1, 0)
# (1, 1)
# (1, 2)

for x in gridrange((1, 1), (3, 4)):
  print(x)

# (1, 1)
# (1, 2)
# (1, 3)
# (2, 1)
# (2, 2)
# (2, 3)

for x in gridrange((1, 1), (10, 20), (5, 5)):
  print(x)

# (1, 1)
# (1, 6)
# (1, 11)
# (1, 16)
# (6, 1)
# (6, 6)
# (6, 11)
# (6, 16)

iteraxis

iteraxis is a function which takes np.ndarray and returns Iterable[np.ndarray] along passed axis. This function is similar to np.apply_along_axis. The difference between iteraxis and np.apply_along_axis is that np.apply_along_axis applies some function to arrays and iteraxis returns those arrays.

xs = np.arange(27).reshape(-1, 3, 3)

for x in iteraxis(xs, axis=-1):
  print(x)

# [0 1 2]
# [3 4 5]
# [6 7 8]
# [ 9 10 11]
# [12 13 14]
# [15 16 17]
# [18 19 20]
# [21 22 23]
# [24 25 26]

move_tensor

move_tensor is simply:

def move_tensor(source: np.ndarray, destination: np.ndarray, coordinate: Optional[Coordinate] = None) -> np.ndarray:
  tensor = destination.copy()
  move_tensor_inplace(source, tensor, coordinate)

  return tensor

Example of move_tensor usage:

xs = np.arange(4).reshape(-1, 2)
ys = np.zeros((3, 3), dtype=np.uint)

moved = move_tensor(xs, ys, coordinate=(1, 1))
print(moved)

# [[0 0 0]
#  [0 0 1]
#  [0 2 3]]

move_tensor_inplace

move_tensor_inplace is a procedure which moves source np.ndarray to destination np.ndarray at coordinate Tuple[int, ...]. Only destination np.ndarray is modified. The coordinate is optional.

xs = np.arange(4).reshape(-1, 2)
ys = np.zeros((3, 3), dtype=np.uint)

move_tensor_inplace(xs, ys)
print(ys)

# [[0 1 0]
#  [2 3 0]
#  [0 0 0]]

next_batch

next_batch is a function used by create_batches. next_batch takes an Iterable[Tuple[np.ndarray, ...]] with length of batch and returns a next batch. The next batch might have smaller length than the passed one.

xs = np.arange(15).reshape(-1, 3)
ys = np.arange(10).reshape(-1, 2)

print(xs)

# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]
#  [12 13 14]]

print(ys)

# [[0 1]
#  [2 3]
#  [4 5]
#  [6 7]
#  [8 9]]

sub_xs, sub_ys = next_batch(zip(xs, ys), length=3)

print(sub_xs)
print(sub_ys)

# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]
# [[0 1]
#  [2 3]
#  [4 5]]

slices

slices is simply:

RawSlice = Union[
  Tuple[Optional[int]],
  Tuple[Optional[int], Optional[int]],
  Tuple[Optional[int], Optional[int], Optional[int]]
]

def slices(xs: Iterable[RawSlice]) -> Tuple[slice, ...]:
  ys = starmap(slice, xs)
  return tuple(ys)

Example of slices usage:

xs = np.arange(9).reshape(-1, 3)
ys = (1, None), (0, 1)

print(xs[slices(ys)])

# [[3]
#  [6]]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dskit-0.0.1.tar.gz (8.2 kB view hashes)

Uploaded Source

Built Distribution

dskit-0.0.1-py3-none-any.whl (18.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page