bpd
Modules • Code structure • Installing the application • Makefile commands • Environments • Running the application • Notebook • Pipeline • Resources
Code structure
from setuptools import setup

from bpd import __version__

# Read the README and the requirements up front so both files are closed properly.
with open("README.md", "r") as f:
    long_description = f.read()
with open("requirements.txt", "r") as f:
    # Keep the first token of each line; skip blank lines and comment lines.
    install_requires = [
        line.split()[0]
        for line in f
        if line.strip() and not line.lstrip().startswith("#")
    ]

setup(
    name="bpd",
    version=__version__,
    description="bpd",
    packages=[
        "bpd",
        "bpd.dask",
        "bpd.dask.types",
        "bpd.pandas",
        "bpd.pyspark",
        "bpd.pyspark.udf",
        "bpd.tests",
    ],
    long_description=long_description,
    long_description_content_type="text/markdown",
    include_package_data=True,
    package_data={"": ["*.yml"]},
    url="https://github.com/zakuro-ai/bpd",
    license="MIT",
    author="CADIC Jean-Maximilien",
    author_email="git@zakuro.ai",
    python_requires=">=3.6",
    install_requires=install_requires,
    platforms="linux_debian_10_x86_64",
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
    ],
)
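The `install_requires` line above depends on how `requirements.txt` is laid out. As a minimal sketch, assuming one requirement per line with optional `#` comments, the parsing step could be isolated and checked on its own. The helper name `parse_requirements` is hypothetical (it is not part of bpd), and the package names in `sample` are made up for illustration.

```python
import re


def parse_requirements(text):
    """Return the requirement specifiers found in the text of a requirements file."""
    requirements = []
    for line in text.splitlines():
        # A comment starts with '#' at the beginning of the line or after whitespace.
        line = re.split(r"(?:^|\s)#", line, maxsplit=1)[0].strip()
        if line:
            requirements.append(line)
    return requirements


# Illustrative content only; bpd's real requirements are not shown in this README.
sample = """
# core dependencies
dask>=2022.1
pandas  # dataframe backend

pyspark
"""
print(parse_requirements(sample))
```

Keeping the parsing in a small function makes blank lines and comments easy to handle without breaking the `setup()` call.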
Installing the application
To clone and run this application, you'll need Git installed on your computer, along with Docker or a Python environment (see Environments).
Install bpd:
# Clone this repository
git clone https://github.com/JeanMaximilienCadic/bpd
# Go into the repository
cd bpd
Makefile commands
Exhaustive list of make commands:
install_wheels
sandbox_cpu
sandbox_gpu
build_sandbox
push_environment
push_container_sandbox
push_container_vanilla
pull_container_vanilla
pull_container_sandbox
build_vanilla
build_wheels
auto_branch
Environments
Docker
Note: Running this application with Docker is recommended.
To build and run the Docker image:
make build
make sandbox
PythonEnv
Warning: Running this application in a plain Python environment is possible but not recommended.
make install_wheels
Running the application
make tests
=1= TEST PASSED : bpd
=1= TEST PASSED : bpd.dask
=1= TEST PASSED : bpd.dask.types
=1= TEST PASSED : bpd.pandas
=1= TEST PASSED : bpd.pyspark
=1= TEST PASSED : bpd.pyspark.udf
=1= TEST PASSED : bpd.tests
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
|          0|    137|           40|           35|    168|43.1|                   2.288| 33|      1|
|          5|    116|           74|            0|      0|25.6|                   0.201| 30|      0|
|          3|     78|           50|           32|     88|  31|                   0.248| 26|      1|
|         10|    115|            0|            0|      0|35.3|                   0.134| 29|      0|
|          2|    197|           70|           45|    543|30.5|                   0.158| 53|      1|
|          8|    125|           96|            0|      0|   0|                   0.232| 54|      1|
|          4|    110|           92|            0|      0|37.6|                   0.191| 30|      0|
|         10|    168|           74|            0|      0|  38|                   0.537| 34|      1|
|         10|    139|           80|            0|      0|27.1|                   1.441| 57|      0|
|          1|    189|           60|           23|    846|30.1|                   0.398| 59|      1|
|          5|    166|           72|           19|    175|25.8|                   0.587| 51|      1|
|          7|    100|            0|            0|      0|  30|                   0.484| 32|      1|
|          0|    118|           84|           47|    230|45.8|                   0.551| 31|      1|
|          7|    107|           74|            0|      0|29.6|                   0.254| 31|      1|
|          1|    103|           30|           38|     83|43.3|                   0.183| 33|      0|
|          1|    115|           70|           30|     96|34.6|                   0.529| 32|      1|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
only showing top 20 rows
.
----------------------------------------------------------------------
Ran 1 test in 2.701s
OK
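The `TEST PASSED` lines list one entry per package. How bpd produces them is not shown here; as a rough sketch, under the assumption that each line reflects a successful import, a similar smoke check can be written with the standard library. The function name `check_imports` is hypothetical, and standard-library modules stand in for the bpd packages so that the snippet runs anywhere.

```python
import importlib


def check_imports(module_names):
    """Try to import each module and record whether the import succeeded."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


# Stand-ins for "bpd", "bpd.dask", ...; the last name is deliberately invalid.
for name, ok in check_imports(["json", "os.path", "not_a_real_module_xyz"]).items():
    print(f"{'TEST PASSED' if ok else 'TEST FAILED'} : {name}")
```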
Notebook
Pipeline
from gnutools import fs
from gnutools.remote import gdrivezip
from bpd import cfg
from bpd.dask import DataFrame, udf
from bpd.dask import functions as F
from bpd.dask.pipelines import *
# Import a sample dataset
df = DataFrame({"filename": fs.listfiles(gdrivezip(cfg.gdrive.google_mini)[0], [".wav"])})
df.compute()
| | filename |
---|---|
0 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... |
1 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... |
2 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... |
3 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... |
4 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... |
# Register user-defined functions
@udf
def word(f):
    return fs.name(fs.parent(f))

@udf
def initial(classe):
    return classe[0]

@udf
def lists(classes):
    return list(set(classes))

# Run the pipeline stages, then rename the temporary column
df.run_pipelines(
    [
        {
            select_cols: ("filename",),
            pipeline: (
                ("classe", word(F.col("filename"))),
                ("name", udf(fs.name)(F.col("filename"))),
            ),
        },
        {
            group_on: "classe",
            select_cols: ("name",),
            pipeline: (
                ("initial", initial(F.col("classe"))),
            ),
        },
        {
            group_on: "initial",
            select_cols: ("classe",),
            pipeline: (
                ("_initial", lists(F.col("classe"))),
            ),
        },
    ]
)\
    .withColumnRenamed("_initial", "initial")\
    .compute()
| | filename | initial |
---|---|---|
0 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... | [wow] |
0 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... | [wow] |
0 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... | [wow] |
0 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... | [wow] |
0 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... | [wow] |
Sequential calls
from gnutools import fs
from bpd.dask import DataFrame, udf
from bpd.dask import functions as F
from gnutools.remote import gdrivezip
# Import a sample dataset
gdrivezip("gdrive://1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE")
df = DataFrame({"filename": fs.listfiles("/tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE", [".wav"])})
df.compute()
| | filename |
---|---|
0 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... |
1 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... |
2 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... |
3 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... |
4 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... |
... | ... |
145 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... |
146 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... |
147 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... |
148 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... |
149 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... |
150 rows × 1 columns
# Register a user-defined function
@udf
def word(f):
    return fs.name(fs.parent(f))
# Apply a udf function
df\
.withColumn("classe", word(F.col("filename")))\
.compute()
| | filename | classe |
---|---|---|
0 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... | wow |
1 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... | wow |
2 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... | wow |
3 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... | wow |
4 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... | wow |
... | ... | ... |
145 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... | left |
146 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... | left |
147 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... | left |
148 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... | left |
149 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... | left |
150 rows × 2 columns
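Judging from the output above, `word` maps each path to the name of its parent directory. The `gnutools.fs` helpers are not documented here, so the following sketch uses `pathlib` from the standard library instead and assumes that `fs.name(fs.parent(f))` behaves like `PurePosixPath(f).parent.name`. The sample paths are shortened placeholders, not real files.

```python
from pathlib import PurePosixPath


def word(path):
    """Return the name of the directory that directly contains the file."""
    return PurePosixPath(path).parent.name


# Placeholder paths that mimic the <root>/<classe>/<file>.wav layout.
paths = ["/tmp/data/wow/a.wav", "/tmp/data/left/b.wav"]
print([word(p) for p in paths])
```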
# You can use inline udf functions
df\
.withColumn("name", udf(fs.name)(F.col("filename")))\
.display()
| | filename | name |
---|---|---|
0 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... | 919d3c0e_nohash_2 |
1 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... | 6a27a9bf_nohash_0 |
2 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... | 6823565f_nohash_2 |
3 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... | beb49c22_nohash_1 |
4 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... | d37e4bf1_nohash_0 |
... | ... | ... |
145 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... | 6a27a9bf_nohash_0 |
146 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... | e32ff49d_nohash_0 |
147 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... | 6823565f_nohash_2 |
148 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... | e77d88fc_nohash_0 |
149 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... | 659b7fae_nohash_2 |
150 rows × 2 columns
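The inline form `udf(fs.name)(...)` wraps an existing function instead of decorating a new one. The sketch below illustrates that idea in plain Python; it is not bpd's implementation. The names `elementwise` and `stem` are hypothetical, a list stands in for a dataframe column, and `stem` assumes that `fs.name` returns the file name without its extension, as the `name` column above suggests.

```python
from functools import wraps
from pathlib import PurePosixPath


def elementwise(func):
    """Lift a function on one value into a function on a list of values."""
    @wraps(func)
    def wrapper(values):
        return [func(v) for v in values]
    return wrapper


def stem(path):
    """File name without directory or extension."""
    return PurePosixPath(path).stem


# Inline use, analogous in spirit to udf(fs.name)(F.col("filename")).
names = elementwise(stem)(["/tmp/data/wow/sample_0.wav", "/tmp/data/left/sample_1.wav"])
print(names)
```

The same wrapper works as a decorator (`@elementwise`), which mirrors the two ways `udf` is used in this README.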
# Retrieve the first 3 filenames per classe
df\
.withColumn("classe", word(F.col("filename")))\
.aggregate("classe")\
.withColumn("filename", F.top_k(F.col("filename"), 3))\
.explode("filename")\
.compute()
classe | filename |
---|---|
wow | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... |
wow | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... |
wow | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... |
nine | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/nine/0f... |
nine | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/nine/6a... |
... | ... |
yes | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/yes/0a9... |
yes | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/yes/0a7... |
left | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... |
left | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... |
left | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... |
90 rows × 1 columns
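The chain above groups the rows by `classe`, keeps three filenames per group, and expands the lists back into rows, which is why 30 classes yield 90 rows. A plain-Python sketch of the same aggregate, top-k, explode sequence follows. It assumes that `F.top_k` keeps the first k items in their original order; the function name `top_k_per_group` and the sample rows are illustrative.

```python
from collections import defaultdict


def top_k_per_group(rows, k):
    """Keep the first k filenames of each classe, one (classe, filename) pair per row."""
    groups = defaultdict(list)
    for classe, filename in rows:
        groups[classe].append(filename)  # aggregate: one list per classe
    return [
        (classe, filename)
        for classe, filenames in groups.items()
        for filename in filenames[:k]  # top-k, then explode back into rows
    ]


# Two illustrative classes with five placeholder files each.
rows = [("wow", f"wow_{i}.wav") for i in range(5)]
rows += [("left", f"left_{i}.wav") for i in range(5)]
result = top_k_per_group(rows, 3)
print(len(result))
```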
# Add the classe column to the original dataframe
df = df\
.withColumn("classe", word(F.col("filename")))
# Display the modified dataframe
df.display()
| | filename | classe |
---|---|---|
0 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... | wow |
1 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... | wow |
2 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... | wow |
3 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... | wow |
4 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... | wow |
... | ... | ... |
145 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... | left |
146 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... | left |
147 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... | left |
148 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... | left |
149 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... | left |
150 rows × 2 columns
# Compute the initial of each classe
@udf
def initial(classe):
    return classe[0]
_df = df\
.aggregate("classe")\
.reset_index(hard=False)\
.withColumn("initial", initial(F.col("classe")))\
.select(["classe", "initial"])\
.set_index("classe")
# Display the dataframe grouped by classe
_df.compute()
classe | initial |
---|---|
bed | b |
bird | b |
cat | c |
dog | d |
down | d |
eight | e |
five | f |
four | f |
go | g |
happy | h |
house | h |
left | l |
marvin | m |
nine | n |
no | n |
off | o |
on | o |
one | o |
right | r |
seven | s |
sheila | s |
six | s |
stop | s |
three | t |
tree | t |
two | t |
up | u |
wow | w |
yes | y |
zero | z |
_df_initial = _df.reset_index(hard=False).aggregate("initial")
_df_initial.compute()
initial | classe |
---|---|
b | [bed, bird] |
c | [cat] |
d | [dog, down] |
e | [eight] |
f | [five, four] |
g | [go] |
h | [happy, house] |
l | [left] |
m | [marvin] |
n | [nine, no] |
o | [off, on, one] |
r | [right] |
s | [seven, sheila, six, stop] |
t | [three, tree, two] |
u | [up] |
w | [wow] |
y | [yes] |
z | [zero] |
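The table above is the result of collecting every `classe` under its first letter. As a small sketch of that grouping in plain Python (the function name `group_by_initial` is hypothetical, and a subset of the class names is used as input):

```python
from collections import defaultdict


def group_by_initial(classes):
    """Map each initial letter to the sorted list of class names starting with it."""
    groups = defaultdict(list)
    for classe in sorted(classes):
        groups[classe[0]].append(classe)
    return dict(groups)


print(group_by_initial(["dog", "bed", "cat", "bird", "down"]))
```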
# Join the dataframes
df\
.join(_df, on="classe").drop_column("classe")\
.join(_df_initial, on="initial")\
.display()
| | filename | initial | classe |
---|---|---|---|
0 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/919... | w | [wow] |
1 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/6a2... | w | [wow] |
2 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/682... | w | [wow] |
3 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/beb... | w | [wow] |
4 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/wow/d37... | w | [wow] |
... | ... | ... | ... |
13 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/6a... | l | [left] |
14 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e3... | l | [left] |
15 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/68... | l | [left] |
16 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/e7... | l | [left] |
17 | /tmp/1y4gwaS7LjYUhwTex1-lNHJJ71nLEh3fE/left/65... | l | [left] |
150 rows × 3 columns
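The two joins above first replace each row's `classe` with its `initial`, then attach the list of classes that share that initial. The following sketch reproduces the effect with dictionary lookups; it illustrates the data flow and is not bpd's join implementation. The function name `join_rows` and the sample rows are hypothetical.

```python
def join_rows(rows, classe_to_initial, initial_to_classes):
    """Attach the initial and the list of classes sharing it to every row."""
    joined = []
    for row in rows:
        initial = classe_to_initial[row["classe"]]  # first join, on "classe"
        joined.append(
            {
                "filename": row["filename"],
                "initial": initial,
                "classe": initial_to_classes[initial],  # second join, on "initial"
            }
        )
    return joined


# Placeholder rows and lookup tables in the shape of df, _df and _df_initial.
rows = [
    {"filename": "/tmp/data/wow/a.wav", "classe": "wow"},
    {"filename": "/tmp/data/left/b.wav", "classe": "left"},
]
classe_to_initial = {"wow": "w", "left": "l"}
initial_to_classes = {"w": ["wow"], "l": ["left"]}
joined = join_rows(rows, classe_to_initial, initial_to_classes)
print(joined[0])
```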
Resources
- Vanilla: https://en.wikipedia.org/wiki/Vanilla_software
- Sandbox: https://en.wikipedia.org/wiki/Sandbox_(software_development)
- All you need is docker: https://www.theregister.com/2014/05/23/google_containerization_two_billion/
- Dev in containers: https://code.visualstudio.com/docs/remote/containers
- Delta lake partitions: https://k21academy.com/microsoft-azure/data-engineer/delta-lake/