
bpd

Modules · Code structure · Installing the application · Makefile commands · Environments · Running the application · Notebook · Pipeline · Resources

Code structure

from setuptools import setup
from bpd import __version__


setup(
    name="bpd",
    version=__version__,
    packages=[
        "bpd",
        "bpd.dask",
        "bpd.dask.types",
        "bpd.pandas",
        "bpd.pyspark",
        "bpd.pyspark.udf",
        "bpd.tests",
    ],
    # Use the README verbatim as the long description shown on PyPI
    long_description=open("README.md").read(),
    long_description_content_type="text/markdown",
    include_package_data=True,
    package_data={"": ["*.yml"]},
    url="https://github.com/zakuro-ai/bpd",
    license="MIT",
    author="CADIC Jean-Maximilien",
    python_requires=">=3.6",
    # Keep the first whitespace-separated token of each requirements.txt line
    install_requires=[r.rsplit()[0] for r in open("requirements.txt")],
    author_email="git@zakuro.ai",
    description="bpd",
    platforms="linux_debian_10_x86_64",
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
    ],
)
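The `install_requires` line keeps only the first whitespace-separated token of each line in `requirements.txt`, which strips trailing comments from pinned requirements. A minimal sketch of that behaviour (the sample requirement lines are hypothetical; note that an empty line would raise an `IndexError`):

```python
# Hypothetical requirements.txt lines: only the first whitespace-separated
# token of each line survives, so trailing comments are dropped.
lines = ["dask==2021.4.0  # dataframe backend", "pandas>=1.0", "pyspark"]
parsed = [r.rsplit()[0] for r in lines]
print(parsed)  # ['dask==2021.4.0', 'pandas>=1.0', 'pyspark']
```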

Installing the application

To clone and run this application, you'll need Git and Python 3.6+ installed on your computer.

Install bpd:

# Clone this repository
git clone https://github.com/JeanMaximilienCadic/bpd

# Go into the repository
cd bpd

Makefile commands

Exhaustive list of make commands:

install_wheels
sandbox_cpu
sandbox_gpu
build_sandbox
push_environment
push_container_sandbox
push_container_vanilla
pull_container_vanilla
pull_container_sandbox
build_vanilla
build_wheels
auto_branch 

Environments

Docker

Note

Running this application with Docker is recommended.

To build and run the Docker image:

make build
make sandbox

PythonEnv

Warning

Running this application with PythonEnv is possible but not recommended.

make install_wheels

Running the application

make tests
=1= TEST PASSED : bpd
=1= TEST PASSED : bpd.dask
=1= TEST PASSED : bpd.dask.types
=1= TEST PASSED : bpd.pandas
=1= TEST PASSED : bpd.pyspark
=1= TEST PASSED : bpd.pyspark.udf
=1= TEST PASSED : bpd.tests
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
|          0|    137|           40|           35|    168|43.1|                   2.288| 33|      1|
|          5|    116|           74|            0|      0|25.6|                   0.201| 30|      0|
|          3|     78|           50|           32|     88|  31|                   0.248| 26|      1|
|         10|    115|            0|            0|      0|35.3|                   0.134| 29|      0|
|          2|    197|           70|           45|    543|30.5|                   0.158| 53|      1|
|          8|    125|           96|            0|      0|   0|                   0.232| 54|      1|
|          4|    110|           92|            0|      0|37.6|                   0.191| 30|      0|
|         10|    168|           74|            0|      0|  38|                   0.537| 34|      1|
|         10|    139|           80|            0|      0|27.1|                   1.441| 57|      0|
|          1|    189|           60|           23|    846|30.1|                   0.398| 59|      1|
|          5|    166|           72|           19|    175|25.8|                   0.587| 51|      1|
|          7|    100|            0|            0|      0|  30|                   0.484| 32|      1|
|          0|    118|           84|           47|    230|45.8|                   0.551| 31|      1|
|          7|    107|           74|            0|      0|29.6|                   0.254| 31|      1|
|          1|    103|           30|           38|     83|43.3|                   0.183| 33|      0|
|          1|    115|           70|           30|     96|34.6|                   0.529| 32|      1|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
only showing top 20 rows

.
----------------------------------------------------------------------
Ran 1 test in 2.701s

OK

Notebook

Pipeline

from gnutools import fs
from gnutools.remote import gdrivezip
from bpd import cfg
from bpd.dask import DataFrame, udf
from bpd.dask import functions as F
from bpd.dask.pipelines import *
# Import a sample dataset
df = DataFrame({"filename": fs.listfiles(gdrivezip(cfg.gdrive.google_mini)[0], [".wav"])})
df.compute()      
filename
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919...
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2...
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682...
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb...
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37...
# Register a user-defined function
@udf
def word(f):
    return fs.name(fs.parent(f))

@udf
def initial(classe):
    return classe[0]

@udf
def lists(classes):
    return list(set(classes))
    

df.run_pipelines(
    [
        {
            select_cols: ("filename",),
            pipeline: (
                ("classe", word(F.col("filename"))),
                ("name", udf(fs.name)(F.col("filename"))),
            ),
        },
        {
            group_on: "classe",
            select_cols: ("name", ),
            pipeline: (
                ("initial", initial(F.col("classe"))),
            ),
        },
        {
            group_on: "initial",
            select_cols: ("classe", ),
            pipeline: (
                ("_initial", lists(F.col("classe"))),
            ),
        },
    ]
)\
.withColumnRenamed("_initial", "initial")\
.compute()
filename initial
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... [wow]
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... [wow]
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... [wow]
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... [wow]
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... [wow]
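The `word` udf above extracts the parent directory name of each file path. Outside of bpd, the same logic can be sketched with the standard library alone, assuming `gnutools.fs.name` and `fs.parent` behave like their `os.path` counterparts:

```python
import os

def word(f):
    # Name of the parent directory, e.g. ".../wow/919d3c0e_nohash_2.wav" -> "wow"
    return os.path.basename(os.path.dirname(f))

print(word("/tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919d3c0e_nohash_2.wav"))  # wow
```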

Sequential calls

from gnutools import fs
from bpd.dask import DataFrame, udf
from bpd.dask import functions as F
from gnutools.remote import gdrivezip
# Import a sample dataset
gdrivezip("gdrive://1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE")
df = DataFrame({"filename": fs.listfiles("/tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE", [".wav"])})
df.compute()      
filename
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919...
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2...
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682...
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb...
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37...
... ...
145 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a...
146 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3...
147 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68...
148 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7...
149 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65...

150 rows × 1 columns

# Register a user-defined function
@udf
def word(f):
    return fs.name(fs.parent(f))

# Apply a udf function
df\
.withColumn("classe", word(F.col("filename")))\
.compute()    
filename classe
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... wow
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... wow
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... wow
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... wow
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... wow
... ... ...
145 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... left
146 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... left
147 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... left
148 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... left
149 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... left

150 rows × 2 columns

# You can use inline udf functions
df\
.withColumn("name", udf(fs.name)(F.col("filename")))\
.display()
filename name
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... 919d3c0e_nohash_2
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... 6a27a9bf_nohash_0
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... 6823565f_nohash_2
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... beb49c22_nohash_1
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... d37e4bf1_nohash_0
... ... ...
145 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... 6a27a9bf_nohash_0
146 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... e32ff49d_nohash_0
147 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... 6823565f_nohash_2
148 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... e77d88fc_nohash_0
149 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... 659b7fae_nohash_2

150 rows × 2 columns

# Retrieve the first 3 filenames per classe
df\
.withColumn("classe", word(F.col("filename")))\
.aggregate("classe")\
.withColumn("filename", F.top_k(F.col("filename"), 3))\
.explode("filename")\
.compute()
filename
classe
wow /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919...
wow /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2...
wow /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682...
nine /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/nine/0f...
nine /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/nine/6a...
... ...
yes /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/yes/0a9...
yes /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/yes/0a7...
left /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a...
left /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3...
left /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68...

90 rows × 1 columns
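The `aggregate` / `top_k` / `explode` chain keeps the first three filenames of each classe. Assuming `aggregate` groups values into lists per key, the same result can be sketched in plain Python with toy data:

```python
from collections import defaultdict

# Hypothetical rows: (classe, filename) pairs in original order
rows = [("wow", "w1.wav"), ("wow", "w2.wav"), ("wow", "w3.wav"), ("wow", "w4.wav"),
        ("left", "l1.wav"), ("left", "l2.wav")]

# aggregate("classe"): collect filenames per classe
groups = defaultdict(list)
for classe, filename in rows:
    groups[classe].append(filename)

# top_k(..., 3): keep at most the first 3 filenames per classe
top3 = {classe: names[:3] for classe, names in groups.items()}

# explode("filename"): one (classe, filename) row per kept filename
exploded = [(c, f) for c, names in top3.items() for f in names]
print(exploded)
```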

# Add the classe column to the original dataframe
df = df\
.withColumn("classe", word(F.col("filename")))

# Display the modified dataframe
df.display()
filename classe
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... wow
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... wow
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... wow
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... wow
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... wow
... ... ...
145 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... left
146 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... left
147 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... left
148 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... left
149 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... left

150 rows × 2 columns

# Compute the initial of each classe
@udf
def initial(classe):
    return classe[0]
    

_df = df\
.aggregate("classe")\
.reset_index(hard=False)\
.withColumn("initial", initial(F.col("classe")))\
.select(["classe", "initial"])\
.set_index("classe")

# Display the dataframe grouped by classe
_df.compute()
    
initial
classe
bed b
bird b
cat c
dog d
down d
eight e
five f
four f
go g
happy h
house h
left l
marvin m
nine n
no n
off o
on o
one o
right r
seven s
sheila s
six s
stop s
three t
tree t
two t
up u
wow w
yes y
zero z
_df_initial = _df.reset_index(hard=False).aggregate("initial")
_df_initial.compute()
classe
initial
b [bed, bird]
c [cat]
d [dog, down]
e [eight]
f [five, four]
g [go]
h [happy, house]
l [left]
m [marvin]
n [nine, no]
o [off, on, one]
r [right]
s [seven, sheila, six, stop]
t [three, tree, two]
u [up]
w [wow]
y [yes]
z [zero]
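The grouped result above can be reproduced with a plain dictionary: collect each classe under its first letter, exactly what the `initial` udf computes. A sketch over a subset of the classes:

```python
classes = ["bed", "bird", "cat", "dog", "down", "eight"]

by_initial = {}
for classe in classes:
    # initial(classe) == classe[0]
    by_initial.setdefault(classe[0], []).append(classe)

print(by_initial)  # {'b': ['bed', 'bird'], 'c': ['cat'], 'd': ['dog', 'down'], 'e': ['eight']}
```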
# Join the dataframes
df\
.join(_df, on="classe").drop_column("classe")\
.join(_df_initial, on="initial")\
.display()
filename initial classe
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... w [wow]
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... w [wow]
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... w [wow]
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... w [wow]
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... w [wow]
... ... ... ...
13 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... l [left]
14 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... l [left]
15 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... l [left]
16 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... l [left]
17 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... l [left]

150 rows × 3 columns
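Conceptually, the two joins amount to dictionary lookups: filename → classe (the original dataframe), classe → initial (`_df`), then initial → list of classes sharing that initial (`_df_initial`). A plain-Python sketch with hypothetical toy data:

```python
# Toy data mirroring the joined dataframes above
files = {"/wow/a.wav": "wow", "/left/b.wav": "left"}       # filename -> classe
initials = {"wow": "w", "left": "l"}                       # classe -> initial (_df)
classes_by_initial = {"w": ["wow"], "l": ["left"]}         # initial -> classes (_df_initial)

# join on "classe", drop it, then join on "initial"
joined = [
    (filename, initials[classe], classes_by_initial[initials[classe]])
    for filename, classe in files.items()
]
print(joined)
```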

Resources
