
bpd

Project description



Contents: Modules · Code structure · Installing the application · Makefile commands · Environments · Running the application · Notebook · Pipeline · Resources

Code structure

from setuptools import setup
from bpd import __version__


setup(
    name="bpd",
    version=__version__,
    short_description="bpd",
    packages=[
        "bpd",
        "bpd.dask",
        "bpd.dask.types",
        "bpd.pandas",
        "bpd.pyspark",
        "bpd.pyspark.udf",
        "bpd.tests",
    ],
    long_description="".join(open("README.md", "r").readlines()),
    long_description_content_type="text/markdown",
    include_package_data=True,
    package_data={"": ["*.yml"]},
    url="https://github.com/zakuro-ai/bpd",
    license="MIT",
    author="ZakuroAI",
    python_requires=">=3.6",
    install_requires=[r.rsplit()[0] for r in open("requirements.txt")],
    author_email="git@zakuro.ai",
    description="bpd",
    platforms="linux_debian_10_x86_64",
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
    ],
)
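The `install_requires` line above keeps the first whitespace-separated token of each line in requirements.txt. A minimal sketch of that parsing, on hypothetical requirement lines (not the project's actual file); note that a blank line would raise an IndexError:

```python
# Sketch of the install_requires parsing in setup.py:
# r.rsplit()[0] keeps the first whitespace-separated token of each line,
# which drops trailing comments but would crash on a blank line.
from io import StringIO

# Hypothetical requirements.txt content, for illustration only.
requirements = StringIO("dask[dataframe]>=2021.4  # heavy dependency\npyspark\n")
parsed = [r.rsplit()[0] for r in requirements]
print(parsed)  # ['dask[dataframe]>=2021.4', 'pyspark']
```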

Installing the application

To clone and run this application, you'll need Git installed on your computer.

Install bpd:

# Clone this repository and install the code
git clone https://github.com/zakuro-ai/bpd

# Go into the repository
cd bpd

Makefile commands

Exhaustive list of make commands:

install_wheels
sandbox_cpu
sandbox_gpu
build_sandbox
push_environment
push_container_sandbox
push_container_vanilla
pull_container_vanilla
pull_container_sandbox
build_vanilla
build_wheels
auto_branch 

Environments

Docker

Note

Running this application by using Docker is recommended.

To pull and run the Docker image:

make pull
make sandbox

PythonEnv

Warning

Running this application using PythonEnv is possible but not recommended. If you wish to install locally, make sure you are using Python >= 3.6 and that JAVA_HOME is set properly.

  • To install Java:
sudo apt install openjdk-8-jre-headless
  • To install bpd:
make install_wheels

Running the application

make tests
=1= TEST PASSED : bpd
=1= TEST PASSED : bpd.dask
=1= TEST PASSED : bpd.dask.types
=1= TEST PASSED : bpd.pandas
=1= TEST PASSED : bpd.pyspark
=1= TEST PASSED : bpd.pyspark.udf
=1= TEST PASSED : bpd.tests
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
|          0|    137|           40|           35|    168|43.1|                   2.288| 33|      1|
|          5|    116|           74|            0|      0|25.6|                   0.201| 30|      0|
|          3|     78|           50|           32|     88|  31|                   0.248| 26|      1|
|         10|    115|            0|            0|      0|35.3|                   0.134| 29|      0|
|          2|    197|           70|           45|    543|30.5|                   0.158| 53|      1|
|          8|    125|           96|            0|      0|   0|                   0.232| 54|      1|
|          4|    110|           92|            0|      0|37.6|                   0.191| 30|      0|
|         10|    168|           74|            0|      0|  38|                   0.537| 34|      1|
|         10|    139|           80|            0|      0|27.1|                   1.441| 57|      0|
|          1|    189|           60|           23|    846|30.1|                   0.398| 59|      1|
|          5|    166|           72|           19|    175|25.8|                   0.587| 51|      1|
|          7|    100|            0|            0|      0|  30|                   0.484| 32|      1|
|          0|    118|           84|           47|    230|45.8|                   0.551| 31|      1|
|          7|    107|           74|            0|      0|29.6|                   0.254| 31|      1|
|          1|    103|           30|           38|     83|43.3|                   0.183| 33|      0|
|          1|    115|           70|           30|     96|34.6|                   0.529| 32|      1|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
only showing top 20 rows

.
----------------------------------------------------------------------
Ran 1 test in 2.701s

OK

Notebook

Pipeline

from gnutools import fs
from gnutools.remote import gdrivezip
from bpd import cfg
from bpd.dask import DataFrame, udf
from bpd.dask import functions as F
from bpd.dask.pipelines import *
# Import a sample dataset
df = DataFrame({"filename": fs.listfiles(gdrivezip(cfg.gdrive.google_mini)[0], [".wav"])})
df.compute()      
filename
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919...
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2...
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682...
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb...
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37...
# Register a user-defined function
@udf
def word(f):
    return fs.name(fs.parent(f))

@udf
def initial(classe):
    return classe[0]

@udf
def lists(classes):
    return list(set(classes))
    

df.run_pipelines(
    [
        {
            select_cols: ("filename",),
            pipeline: (
                ("classe", word(F.col("filename"))),
                ("name", udf(fs.name)(F.col("filename"))),
            ),
        },
        {
            group_on: "classe",
            select_cols: ("name", ),
            pipeline: (
                ("initial", initial(F.col("classe"))),
            ),
        },
        {
            group_on: "initial",
            select_cols: ("classe", ),
            pipeline: (
                ("_initial", lists(F.col("classe"))),
            ),
        },
    ]
)\
.withColumnRenamed("_initial", "initial")\
.compute()
filename initial
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... [wow]
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... [wow]
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... [wow]
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... [wow]
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... [wow]

Sequential calls

from gnutools import fs
from bpd.dask import DataFrame, udf
from bpd.dask import functions as F
from gnutools.remote import gdrivezip
# Import a sample dataset
gdrivezip("gdrive://1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE")
df = DataFrame({"filename": fs.listfiles("/tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE", [".wav"])})
df.compute()      
filename
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919...
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2...
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682...
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb...
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37...
... ...
145 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a...
146 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3...
147 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68...
148 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7...
149 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65...

150 rows × 1 columns

# Register a user-defined function
@udf
def word(f):
    return fs.name(fs.parent(f))
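For intuition, `word` simply extracts the name of a file's parent directory, which encodes the class label in this dataset layout. A plain-Python sketch of the same logic, assuming `fs.parent` and `fs.name` behave like `os.path.dirname` and `os.path.basename`:

```python
import os

def word_plain(f):
    # Equivalent of fs.name(fs.parent(f)) under the stated assumption:
    # take the name of the file's parent directory.
    return os.path.basename(os.path.dirname(f))

print(word_plain("/tmp/dataset/wow/919d3c0e_nohash_2.wav"))  # wow
```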

# Apply a udf function
df\
.withColumn("classe", word(F.col("filename")))\
.compute()    
filename classe
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... wow
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... wow
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... wow
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... wow
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... wow
... ... ...
145 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... left
146 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... left
147 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... left
148 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... left
149 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... left

150 rows × 2 columns

# You can use inline udf functions
df\
.withColumn("name", udf(fs.name)(F.col("filename")))\
.display()
filename name
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... 919d3c0e_nohash_2
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... 6a27a9bf_nohash_0
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... 6823565f_nohash_2
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... beb49c22_nohash_1
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... d37e4bf1_nohash_0
... ... ...
145 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... 6a27a9bf_nohash_0
146 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... e32ff49d_nohash_0
147 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... 6823565f_nohash_2
148 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... e77d88fc_nohash_0
149 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... 659b7fae_nohash_2

150 rows × 2 columns

# Retrieve the first 3 filenames per classe
df\
.withColumn("classe", word(F.col("filename")))\
.aggregate("classe")\
.withColumn("filename", F.top_k(F.col("filename"), 3))\
.explode("filename")\
.compute()
filename
classe
wow /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919...
wow /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2...
wow /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682...
nine /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/nine/0f...
nine /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/nine/6a...
... ...
yes /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/yes/0a9...
yes /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/yes/0a7...
left /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a...
left /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3...
left /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68...

90 rows × 1 columns
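The aggregate/top_k/explode chain above amounts to keeping the first three files of each class. A stdlib sketch of that grouping, on hypothetical paths (in the notebook they come from fs.listfiles):

```python
from collections import defaultdict
import os

# Hypothetical file list, for illustration only.
paths = [
    "/tmp/ds/wow/a.wav", "/tmp/ds/wow/b.wav", "/tmp/ds/wow/c.wav",
    "/tmp/ds/wow/d.wav", "/tmp/ds/left/e.wav", "/tmp/ds/left/f.wav",
]

top3 = defaultdict(list)
for p in paths:
    classe = os.path.basename(os.path.dirname(p))  # parent dir = class label
    if len(top3[classe]) < 3:                      # keep at most 3 per class
        top3[classe].append(p)

print(dict(top3))
```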

# Add the classe column to the original dataframe
df = df\
.withColumn("classe", word(F.col("filename")))

# Display the modified dataframe
df.display()
filename classe
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... wow
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... wow
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... wow
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... wow
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... wow
... ... ...
145 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... left
146 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... left
147 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... left
148 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... left
149 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... left

150 rows × 2 columns

# Map each classe to its initial letter
@udf
def initial(classe):
    return classe[0]
    

_df = df\
.aggregate("classe")\
.reset_index(hard=False)\
.withColumn("initial", initial(F.col("classe")))\
.select(["classe", "initial"])\
.set_index("classe")

# Display the dataframe grouped by classe
_df.compute()
    
initial
classe
bed b
bird b
cat c
dog d
down d
eight e
five f
four f
go g
happy h
house h
left l
marvin m
nine n
no n
off o
on o
one o
right r
seven s
sheila s
six s
stop s
three t
tree t
two t
up u
wow w
yes y
zero z
_df_initial = _df.reset_index(hard=False).aggregate("initial")
_df_initial.compute()
classe
initial
b [bed, bird]
c [cat]
d [dog, down]
e [eight]
f [five, four]
g [go]
h [happy, house]
l [left]
m [marvin]
n [nine, no]
o [off, on, one]
r [right]
s [seven, sheila, six, stop]
t [three, tree, two]
u [up]
w [wow]
y [yes]
z [zero]
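The two aggregations above (classe to initial, then initial to the list of classes sharing it) can be sketched with stdlib dicts, on a subset of the class labels shown:

```python
from collections import defaultdict

# Subset of the class labels from the table above.
classes = ["bed", "bird", "cat", "dog", "down", "wow", "yes", "zero"]

# classe -> initial letter, then group classes by that initial.
initials = {c: c[0] for c in classes}
by_initial = defaultdict(list)
for classe, ini in initials.items():
    by_initial[ini].append(classe)

print(dict(by_initial))  # {'b': ['bed', 'bird'], 'c': ['cat'], ...}
```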
# Join the dataframes
df\
.join(_df, on="classe").drop_column("classe")\
.join(_df_initial, on="initial")\
.display()
filename initial classe
0 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... w [wow]
1 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... w [wow]
2 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... w [wow]
3 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... w [wow]
4 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... w [wow]
... ... ... ...
13 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... l [left]
14 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... l [left]
15 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... l [left]
16 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... l [left]
17 /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... l [left]

150 rows × 3 columns
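The join chain replaces each row's classe with its initial and the list of classes sharing that initial. A dict-based sketch of the same two lookups, on hypothetical inputs mirroring df, _df and _df_initial:

```python
import os

# Hypothetical inputs, for illustration only.
files = ["/tmp/ds/wow/a.wav", "/tmp/ds/left/b.wav"]
classe_to_initial = {"wow": "w", "left": "l"}
initial_to_classes = {"w": ["wow"], "l": ["left"]}

rows = []
for f in files:
    classe = os.path.basename(os.path.dirname(f))
    ini = classe_to_initial[classe]                  # join on classe
    rows.append((f, ini, initial_to_classes[ini]))   # join on initial

print(rows)
```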

Resources

