# seipy

Helper functions for the Python data science stack, as well as Spark, AWS, and Jupyter.

## What is it

This library contains helpers and wrappers for common data science libraries in the python stack:
- pandas
- numpy
- scipy
- sklearn
- matplotlib
- pyspark

There are also functions that simplify common manipulations for machine learning and data science
in general, as well as interfacing with the following tools:
- s3
- jupyter
- aws
- spark SQL

## Installation
```
# PyPI
pip install seipy
```

## Here are some examples

### pandas

#### Apply function to unique DataFrame entries only (for speedup)
```
from seipy import apply_uniq
df2 = apply_uniq(df, orig_col, new_col, _func)
```
This will return the same DataFrame as performing:
`df[new_col] = df[orig_col].apply(_func)`
but can be much faster when `orig_col` contains many duplicate entries.

It works by applying `_func` only to the unique entries of `orig_col` and then merging the result back into the original DataFrame.
Originally answered on Stack Overflow:
https://stackoverflow.com/questions/46798532/how-do-you-effectively-use-pd-dataframe-apply-on-rows-with-duplicate-values/
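The unique-then-merge idea can be sketched in plain pandas as follows. This is a minimal illustration, not seipy's actual implementation; the function name `apply_uniq_sketch` and the example data are hypothetical.

```python
import pandas as pd


def apply_uniq_sketch(df, orig_col, new_col, func):
    """Apply func once per unique value of orig_col, then merge back.

    Hypothetical stand-in for apply_uniq; the real function may differ.
    """
    # Keep one row per distinct value so func runs once per value.
    uniq = df[[orig_col]].drop_duplicates()
    uniq = uniq.assign(**{new_col: uniq[orig_col].apply(func)})
    # A left merge keeps the row order of df (the index is reset).
    return df.merge(uniq, on=orig_col, how="left")


calls = []


def slow_upper(value):
    calls.append(value)  # record each call to show how often func runs
    return value.upper()


df = pd.DataFrame({"city": ["oslo", "rome", "oslo", "oslo"]})
out = apply_uniq_sketch(df, "city", "city_upper", slow_upper)
```

Here `slow_upper` is called twice (once per distinct city) rather than four times, which is where the speedup comes from when `_func` is expensive.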

#### Filtering DataFrame with multiple conditions
```
from seipy import filt
# example with keyword arguments
filt(df,
     season="summer",
     age=(">", 18),
     sport=("isin", ["Basketball", "Soccer"]),
     name=("contains", "Armstrong")
     )

# example with dict notation
a = {'season': "summer", 'age': (">", 18)}
filt(df, **a)
```
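To make the keyword convention concrete, here is a rough sketch of how such conditions could be turned into a boolean mask. It is an assumption about the behaviour, not seipy's code: the name `filt_sketch`, the set of supported operators, and the sample data are all hypothetical.

```python
import operator

import pandas as pd

# Comparison operators this sketch understands (an assumed set).
_COMPARISONS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}


def filt_sketch(df, **conditions):
    """Return the rows of df that satisfy every condition (logical AND)."""
    mask = pd.Series(True, index=df.index)
    for col, cond in conditions.items():
        if isinstance(cond, tuple):
            op, value = cond
            if op in _COMPARISONS:
                mask &= _COMPARISONS[op](df[col], value)
            elif op == "isin":
                mask &= df[col].isin(value)
            elif op == "contains":
                # Plain substring match; missing values count as no match.
                mask &= df[col].str.contains(value, regex=False, na=False)
            else:
                raise ValueError(f"unsupported operator: {op!r}")
        else:
            # A bare value means an equality test.
            mask &= df[col] == cond
    return df[mask]


athletes = pd.DataFrame({
    "name": ["Lance Armstrong", "Neil Armstrong", "Serena Williams", "Joe Armstrong"],
    "season": ["summer", "summer", "summer", "winter"],
    "age": [30, 17, 25, 40],
    "sport": ["Basketball", "Soccer", "Tennis", "Soccer"],
})
selected = filt_sketch(
    athletes,
    season="summer",
    age=(">", 18),
    sport=("isin", ["Basketball", "Soccer"]),
    name=("contains", "Armstrong"),
)
```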

### linear algebra

```
from seipy import distmat
distmat()
```
Called with no arguments, this prints the available distance metrics, such as "euclidean", "chebyshev", and "hamming".

```
distmat(fframe, metric)
```
This generates a distance matrix using `metric`.
Note that this function is a wrapper around `scipy.spatial.distance.cdist`.
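For reference, the underlying SciPy call looks like this when used directly. How `distmat` passes its arguments to `cdist` is not documented here, so comparing the array with itself is an assumption, and the sample points are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Three 2-D points, one per row (illustrative data).
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Pairwise distances between every row and every other row.
euclid = cdist(points, points, metric="euclidean")
cheby = cdist(points, points, metric="chebyshev")
```

The result is a square, symmetric matrix with zeros on the diagonal; entry `[i, j]` is the distance between rows `i` and `j`.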


### jupyter

```
from seipy import notebook_contains
notebook_contains(search_str,
                  on_docker=False,
                  git_dir='~/git/experiments/',
                  start_date='2015-01-01', end_date='2018-12-31')
```
Prints a list of notebooks that contain the string `search_str`.
Very useful for those situations: "Where's that notebook where I was trying that one thing that one time?"
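The basic idea of searching notebooks for a string can be sketched with the standard library alone. This is not how `notebook_contains` is implemented: the function name `notebooks_containing` is hypothetical, and the sketch ignores the date range and the `on_docker` option.

```python
import json
from pathlib import Path


def notebooks_containing(search_str, root):
    """Return paths of .ipynb files under root with a cell containing search_str."""
    hits = []
    for path in sorted(Path(root).expanduser().rglob("*.ipynb")):
        try:
            notebook = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # skip unreadable files and files that are not valid JSON
        if not isinstance(notebook, dict):
            continue
        for cell in notebook.get("cells", []):
            source = cell.get("source", "")
            # Cell source is stored either as one string or a list of lines.
            if isinstance(source, list):
                source = "".join(source)
            if search_str in source:
                hits.append(path)
                break  # one match per notebook is enough
    return hits
```

Calling `notebooks_containing("pivot_table", "~/git/experiments/")` would return the matching notebook paths under that directory.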

### s3
```
from seipy import s3zip_func
s3zip_func(s3zip_path, _func, cred_fpath=cred_fpath, **kwargs)
```
This one's kinda nice. It allows one to apply a function `_func` to each subfile in a zip file sitting on s3.
I use it to filter and enrich some csv files that periodically get zipped to s3, for example.
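The per-subfile pattern can be illustrated without S3, using an in-memory zip archive. This is a sketch under stated assumptions rather than seipy's implementation: `zip_apply` and `count_lines` are hypothetical names, and the step that downloads the archive from S3 is left out entirely.

```python
import io
import zipfile


def zip_apply(zip_bytes, func, **kwargs):
    """Call func(fileobj, **kwargs) on each file inside a zip archive.

    Returns a dict mapping member name to func's result.
    """
    results = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories have no content to process
            with archive.open(info) as member:
                results[info.filename] = func(member, **kwargs)
    return results


def count_lines(fileobj, skip_header=False):
    """Example func: count the lines of a file, optionally ignoring a header."""
    n = sum(1 for _ in fileobj)
    return n - 1 if skip_header else n


# Build a small archive in memory to stand in for the zip file on S3.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.csv", "x\n1\n2\n")
    zf.writestr("b.csv", "x\n3\n")

row_counts = zip_apply(buffer.getvalue(), count_lines, skip_header=True)
```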


### spark and s3 on jupyter

```
from seipy import s3spark_init
spark = s3spark_init(cred_fpath)
```
Returns `spark`, a `SparkSession` that makes it possible to interact with s3 from jupyter notebooks.
`cred_fpath` is the file path to the aws credentials file containing your keys.


### Miscellaneous

```
from seipy import merge_two_dicts
merge_two_dicts(dict_1, dict_2)
```
Returns the merged dict `{**dict_1, **dict_2}`.
An extension for multiple dicts is `reduce(lambda d1,d2: {**d1,**d2}, dict_args[0])`, with `reduce` imported from `functools`.
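A self-contained version of the multi-dict merge might look like the sketch below. The name `merge_dicts` is hypothetical and not part of seipy; later dicts override earlier ones on key clashes, as with `{**dict_1, **dict_2}`.

```python
from functools import reduce


def merge_dicts(*dicts):
    """Merge any number of dicts left to right; later values win."""
    # The empty-dict initial value makes merge_dicts() with no arguments return {}.
    return reduce(lambda d1, d2: {**d1, **d2}, dicts, {})


merged = merge_dicts({"a": 1}, {"b": 2, "a": 3}, {"c": 4})
```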

### Getting help

Please either post an issue on this GitHub repo, or email the author `seiji dot armstrong at gmail` with feedback,
feature requests, or to complain that something doesn't work as expected.



