
Project description



bpd

Modules | Code structure | Installing the application | Makefile commands | Environments | Running the application | Notebook | Pipeline | Resources

Code structure

from setuptools import setup
from bpd import __version__


setup(
    name="bpd",
    version=__version__,
    short_description="bpd",
    packages=[
        "bpd",
        "bpd.dask",
        "bpd.dask.types",
        "bpd.pandas",
        "bpd.pyspark",
        "bpd.pyspark.udf",
        "bpd.tests",
    ],
    long_description="".join(open("README.md", "r").readlines()),
    long_description_content_type="text/markdown",
    include_package_data=True,
    package_data={"": ["*.yml"]},
    url="https://github.com/zakuro-ai/bpd",
    license="MIT",
    author="CADIC Jean-Maximilien",
    python_requires=">=3.6",
    install_requires=[r.rsplit()[0] for r in open("requirements.txt")],
    author_email="git@zakuro.ai",
    description="bpd",
    platforms="linux_debian_10_x86_64",
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
    ],
)
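
setup.py reads the version string from bpd.__version__, so a quick post-install sanity check can import the same attribute (a minimal sketch; the printed value should match the 0.1.0 wheel published for this release):

import bpd

# Same attribute setup.py imports to stamp the wheel version.
print(bpd.__version__)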

Installing the application

To clone and run this application, you'll need Git and Python 3.6 or later installed on your computer.

Install bpd:

# Clone this repository and install the code
git clone https://github.com/JeanMaximilienCadic/bpd

# Go into the repository
cd bpd

Makefile commands

Exhaustive list of make commands:

install_wheels
sandbox_cpu
sandbox_gpu
build_sandbox
push_environment
push_container_sandbox
push_container_vanilla
pull_container_vanilla
pull_container_sandbox
build_vanilla
build_wheels
auto_branch 

Environments

Docker

Note

Running this application with Docker is recommended.

To build and run the Docker image:

make build
make sandbox

PythonEnv

Warning

Running this application with PythonEnv is possible but not recommended.

make install_wheels

Running the application

make tests
=1= TEST PASSED : bpd
=1= TEST PASSED : bpd.dask
=1= TEST PASSED : bpd.dask.types
=1= TEST PASSED : bpd.pandas
=1= TEST PASSED : bpd.pyspark
=1= TEST PASSED : bpd.pyspark.udf
=1= TEST PASSED : bpd.tests
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
|          0|    137|           40|           35|    168|43.1|                   2.288| 33|      1|
|          5|    116|           74|            0|      0|25.6|                   0.201| 30|      0|
|          3|     78|           50|           32|     88|  31|                   0.248| 26|      1|
|         10|    115|            0|            0|      0|35.3|                   0.134| 29|      0|
|          2|    197|           70|           45|    543|30.5|                   0.158| 53|      1|
|          8|    125|           96|            0|      0|   0|                   0.232| 54|      1|
|          4|    110|           92|            0|      0|37.6|                   0.191| 30|      0|
|         10|    168|           74|            0|      0|  38|                   0.537| 34|      1|
|         10|    139|           80|            0|      0|27.1|                   1.441| 57|      0|
|          1|    189|           60|           23|    846|30.1|                   0.398| 59|      1|
|          5|    166|           72|           19|    175|25.8|                   0.587| 51|      1|
|          7|    100|            0|            0|      0|  30|                   0.484| 32|      1|
|          0|    118|           84|           47|    230|45.8|                   0.551| 31|      1|
|          7|    107|           74|            0|      0|29.6|                   0.254| 31|      1|
|          1|    103|           30|           38|     83|43.3|                   0.183| 33|      0|
|          1|    115|           70|           30|     96|34.6|                   0.529| 32|      1|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
only showing top 20 rows

.
----------------------------------------------------------------------
Ran 1 test in 2.701s

OK
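
The summary above is standard unittest output, so the suite can also be run without the Makefile by pointing unittest discovery at the bpd.tests package (a sketch run from the repository root; assumption: `make tests` does not set up anything beyond this):

import unittest

# Discover and run the test cases shipped in bpd/tests
# (assumed to mirror what `make tests` ultimately invokes).
suite = unittest.defaultTestLoader.discover("bpd/tests")
unittest.TextTestRunner(verbosity=2).run(suite)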

Notebook

Pipeline

from gnutools import fs
from gnutools.remote import gdrivezip
from bpd import cfg
from bpd.dask import DataFrame, udf
from bpd.dask import functions as F
from bpd.dask.pipelines import *
# Import a sample dataset
df = DataFrame({"filename": fs.listfiles(gdrivezip(cfg.gdrive.google_mini)[0], [".wav"])})
df.compute()      
filename
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919...
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2...
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682...
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb...
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37...
# Register a user-defined function
@udf
def word(f):
    return fs.name(fs.parent(f))

@udf
def initial(classe):
    return classe[0]

@udf
def lists(classes):
    return list(set(classes))
    

df.run_pipelines(
    [
        {
            select_cols: ("filename",),
            pipeline: (
                ("classe", word(F.col("filename"))),
                ("name", udf(fs.name)(F.col("filename"))),
            ),
        },
        {
            group_on: "classe",
            select_cols: ("name", ),
            pipeline: (
                ("initial", initial(F.col("classe"))),
            ),
        },
        {
            group_on: "initial",
            select_cols: ("classe", ),
            pipeline: (
                ("_initial", lists(F.col("classe"))),
            ),
        },
    ]
)\
.withColumnRenamed("_initial", "initial")\
.compute()
filename initial
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... [wow]
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... [wow]
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... [wow]
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... [wow]
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... [wow]
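
Each dictionary passed to run_pipelines appears to describe one stage: select_cols names the input columns to keep, group_on (when present) aggregates on that key first, and pipeline is a tuple of (new_column, expression) pairs. As a rough guide, the first stage above corresponds to the sequential withColumn calls below (a sketch inferred from the "Sequential calls" section that follows; the two grouped stages have no such direct one-liner):

# First pipeline stage, written as chained calls (sketch)
df\
    .withColumn("classe", word(F.col("filename")))\
    .withColumn("name", udf(fs.name)(F.col("filename")))\
    .compute()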

Sequential calls

from gnutools import fs
from bpd.dask import DataFrame, udf
from bpd.dask import functions as F
from gnutools.remote import gdrivezip
# Import a sample dataset
gdrivezip("gdrive://1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE")
df = DataFrame({"filename": fs.listfiles("/tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE", [".wav"])})
df.compute()      
filename
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919...
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2...
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682...
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb...
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37...
... ...
145 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a...
146 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3...
147 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68...
148 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7...
149 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65...

150 rows × 1 columns

# Register a user-defined function
@udf
def word(f):
    return fs.name(fs.parent(f))

# Apply a udf function
df\
.withColumn("classe", word(F.col("filename")))\
.compute()    
filename classe
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... wow
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... wow
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... wow
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... wow
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... wow
... ... ...
145 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... left
146 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... left
147 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... left
148 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... left
149 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... left

150 rows × 2 columns

# You can use inline udf functions
df\
.withColumn("name", udf(fs.name)(F.col("filename")))\
.display()
filename name
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... 919d3c0e_nohash_2
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... 6a27a9bf_nohash_0
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... 6823565f_nohash_2
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... beb49c22_nohash_1
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... d37e4bf1_nohash_0
... ... ...
145 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... 6a27a9bf_nohash_0
146 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... e32ff49d_nohash_0
147 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... 6823565f_nohash_2
148 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... e77d88fc_nohash_0
149 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... 659b7fae_nohash_2

150 rows × 2 columns

# Retrieve the first 3 filename per classe
df\
.withColumn("classe", word(F.col("filename")))\
.aggregate("classe")\
.withColumn("filename", F.top_k(F.col("filename"), 3))\
.explode("filename")\
.compute()
filename
classe
wow /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919...
wow /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2...
wow /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682...
nine /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/nine/0f...
nine /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/nine/6a...
... ...
yes /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/yes/0a9...
yes /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/yes/0a7...
left /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a...
left /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3...
left /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68...

90 rows × 1 columns
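
The 90 rows come from keeping the first 3 filenames for each of the 30 classes. For comparison, the same "first k per group" idea in plain pandas (not the bpd API) can be written with groupby/head; a self-contained sketch on toy data:

import pandas as pd

# Toy stand-in for the (classe, filename) columns used above.
pdf = pd.DataFrame({
    "classe":   ["wow", "wow", "wow", "wow", "left", "left"],
    "filename": ["w0.wav", "w1.wav", "w2.wav", "w3.wav", "l0.wav", "l1.wav"],
})

# Keep at most the first 3 rows per classe, indexed by classe,
# analogous to aggregate("classe") + F.top_k(..., 3) + explode above.
top3 = pdf.groupby("classe").head(3).set_index("classe")[["filename"]]
print(top3)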

# Add the classe column to the original dataframe
df = df\
.withColumn("classe", word(F.col("filename")))

# Display the modified dataframe
df.display()
filename classe
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... wow
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... wow
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... wow
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... wow
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... wow
... ... ...
145 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... left
146 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... left
147 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... left
148 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... left
149 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... left

150 rows × 2 columns

# Group by classe and compute the initial letter of each classe
@udf
def initial(classe):
    return classe[0]
    

_df = df\
.aggregate("classe")\
.reset_index(hard=False)\
.withColumn("initial", initial(F.col("classe")))\
.select(["classe", "initial"])\
.set_index("classe")

# Display the dataframe grouped by classe
_df.compute()
    
initial
classe
bed b
bird b
cat c
dog d
down d
eight e
five f
four f
go g
happy h
house h
left l
marvin m
nine n
no n
off o
on o
one o
right r
seven s
sheila s
six s
stop s
three t
tree t
two t
up u
wow w
yes y
zero z
_df_initial = _df.reset_index(hard=False).aggregate("initial")
_df_initial.compute()
classe
initial
b [bed, bird]
c [cat]
d [dog, down]
e [eight]
f [five, four]
g [go]
h [happy, house]
l [left]
m [marvin]
n [nine, no]
o [off, on, one]
r [right]
s [seven, sheila, six, stop]
t [three, tree, two]
u [up]
w [wow]
y [yes]
z [zero]
# Join the dataframes
df\
.join(_df, on="classe").drop_column("classe")\
.join(_df_initial, on="initial")\
.display()
filename initial classe
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... w [wow]
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... w [wow]
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... w [wow]
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... w [wow]
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... w [wow]
... ... ... ...
13 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... l [left]
14 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... l [left]
15 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... l [left]
16 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... l [left]
17 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... l [left]

150 rows × 3 columns

Resources

Download files


Source Distributions

No source distribution files are available for this release.

Built Distribution

bpd-0.1.0-py3-none-any.whl (18.5 kB)

Uploaded for Python 3

File details

Details for the file bpd-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: bpd-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 18.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 colorama/0.4.4 importlib-metadata/4.6.4 keyring/23.5.0 pkginfo/1.8.2 readme-renderer/34.0 requests-toolbelt/0.9.1 requests/2.25.1 rfc3986/1.5.0 tqdm/4.57.0 urllib3/1.26.5 CPython/3.10.4

File hashes

Hashes for bpd-0.1.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       f08f3bc2ecc1b2490dbfda1759a5026b0208f7cf9b96e767a9a112dcacfb9240
MD5          b4e978079315b5ec5a8178c3d383db5c
BLAKE2b-256  735ffeb8f4f8cc31e1509b13667737e22c5cd082bf66ea46a5cfa67ab4650e04

