A tool for end2end data tests

These details have not been verified by PyPI

Project links

Project description

DDataFlow

DDataFlow is an end-to-end testing and local development solution for machine learning and data pipelines that use PySpark. Check out this blog post if you want a deeper understanding of its design motivation.

[Figure: ddataflow overview]

You can find our documentation under this link.

Features

Read a subset of your data to speed up pipeline runs during tests
Write your artifacts to a test location so you don't pollute production
Download data to enable development on your local machine
Run your pipelines in CI
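The idea of writing artifacts to a test location can be pictured with a small sketch. The `artifact_path` helper and the `/tmp/ddataflow_demo` root are hypothetical names chosen for illustration; this is not DDataFlow's API, only one way such a redirection could work.

```python
import posixpath


def artifact_path(path: str, enabled: bool, test_root: str = "/tmp/ddataflow_demo") -> str:
    """Return `path` unchanged for production runs, or a mirror of it under `test_root`.

    Hypothetical helper: the name, signature, and default root are assumptions.
    """
    if not enabled:
        return path
    # Drop the leading slash so the original layout is mirrored under test_root.
    return posixpath.join(test_root, path.lstrip("/"))


if __name__ == "__main__":
    print(artifact_path("/warehouse/model.parquet", enabled=False))
    print(artifact_path("/warehouse/model.parquet", enabled=True))
```

Keeping the original directory layout under the test root makes it easy to compare test artifacts with their production counterparts.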

1. Install DDataflow

pip install ddataflow

ddataflow --help will give you an overview of the available commands.
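To confirm from Python that the install succeeded, a minimal check along these lines may help. It uses only the standard library's `importlib.metadata`; the `installed_version` helper is a name made up for this sketch.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "ddataflow") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"ddataflow version: {found}" if found else "ddataflow is not installed")
```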

Getting Started (<5min Tutorial)

1. Set up some synthetic data

See the examples folder.

2. Create a ddataflow_config.py file

The command ddataflow setup_project creates a file like this for you.

from ddataflow import DDataflow

config = {
    # Add your tables or paths here, each with its own sampling logic
    "data_sources": {
        "demo_tours": {
            "source": lambda spark: spark.table('demo_tours'),
            # Custom sampling: keep at most 500 rows
            "filter": lambda df: df.limit(500),
        },
        "demo_locations": {
            "source": lambda spark: spark.table('demo_locations'),
            "default_sampling": True,
        },
    },
    "project_folder_name": "ddataflow_demo",
}

# Initialize the application and validate the configuration
ddataflow = DDataflow(**config)
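Before handing a config to DDataflow, it can be useful to sanity-check its shape yourself. The sketch below is a hypothetical pre-check, not DDataFlow's own validation: the `find_config_problems` function and the rules it applies are assumptions inferred from the example config above.

```python
from typing import Any, Dict, List


def find_config_problems(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks fine.

    Hypothetical rules, inferred from the example config rather than from
    DDataFlow's source code.
    """
    problems: List[str] = []
    if not isinstance(config.get("project_folder_name"), str):
        problems.append("project_folder_name must be a string")
    sources = config.get("data_sources")
    if not isinstance(sources, dict):
        problems.append("data_sources must be a dict")
        return problems
    for name, spec in sources.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: entry must be a dict")
            continue
        if not callable(spec.get("source")):
            problems.append(f"{name}: 'source' must be callable")
        if "filter" in spec and not callable(spec["filter"]):
            problems.append(f"{name}: 'filter' must be callable")
    return problems


if __name__ == "__main__":
    demo = {
        "data_sources": {"demo_tours": {"source": lambda spark: None}},
        "project_folder_name": "ddataflow_demo",
    }
    print(find_config_problems(demo))
```

Returning a list of problems instead of raising on the first one lets you fix a whole config in one pass.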

3. Use ddataflow in a pipeline

from pyspark.sql import SparkSession

from ddataflow_config import ddataflow

# Reuse the active Spark session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Replace spark.table with ddataflow.source; it returns a Spark DataFrame
print(ddataflow.source('demo_locations').count())
# For SQL queries, replace only the table name with the sample data source name provided by ddataflow
print(spark.sql(f"""SELECT COUNT(1) FROM {ddataflow.name('demo_tours')}""").collect()[0]['count(1)'])

Save the code above as pipeline.py, then run it twice and compare the number of records:

python pipeline.py
ENABLE_DDATAFLOW=True python pipeline.py

You will see that the dataframes are sampled when ddataflow is enabled and full when the tool is disabled.
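The effect of the toggle can be pictured with a small, Spark-free sketch. Plain lists stand in for DataFrames, and the `source` and `is_enabled` helpers, the table contents, and the set of accepted truthy values are all hypothetical; only the ENABLE_DDATAFLOW variable name comes from the command above.

```python
import os
from typing import Callable, Dict, List, Mapping, Optional

Rows = List[dict]

# Hypothetical stand-in for a full table; a plain list replaces a Spark DataFrame.
FULL_TABLES: Dict[str, Rows] = {
    "demo_tours": [{"tour_id": i} for i in range(2000)],
}

# Sampling rules, mirroring the "filter" idea from the config file.
FILTERS: Dict[str, Callable[[Rows], Rows]] = {
    "demo_tours": lambda rows: rows[:500],
}


def is_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the toggle; the accepted truthy spellings are an assumption."""
    env = os.environ if env is None else env
    return env.get("ENABLE_DDATAFLOW", "").strip().lower() in {"1", "true", "yes"}


def source(name: str, env: Optional[Mapping[str, str]] = None) -> Rows:
    """Return sampled rows when the toggle is on, and the full table otherwise."""
    rows = FULL_TABLES[name]
    return FILTERS[name](rows) if is_enabled(env) else rows


if __name__ == "__main__":
    print(len(source("demo_tours")))
```

Because the pipeline code calls the same `source` function in both modes, nothing in the pipeline itself has to change between a test run and a production run.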

You completed the short demo!

How to develop

The recommended way to use ddataflow is offline mode, which lets you test your pipelines without an active cluster. This is especially important for development and debugging, because it lets you quickly test your pipelines and identify any issues.

Alternatively, you can use Databricks Connect to test your pipelines on an active cluster. However, our experience with this approach has not been great: memory issues are common and there is a risk of overwriting production data. We therefore recommend using offline mode instead.

If you have any questions or need any help, please don't hesitate to reach out. We are here to help you get the most out of ddataflow.

Support

In case of questions feel free to reach out or create an issue.

Check out our FAQ if you run into problems.

Contributing

We welcome contributions to DDataFlow! If you would like to contribute, please follow these guidelines:

Fork the repository and create a new branch for your contribution.
Make your changes and ensure that the code passes all tests.
Submit a pull request with a clear description of your changes and the problem they solve.

Please note that all contributions are subject to review and approval by the project maintainers. We appreciate your help in making DDataFlow even better!

If you have any questions or need any help, please don't hesitate to reach out. We are here to assist you throughout the contribution process.

License

DDataFlow is licensed under the MIT License.

Release history

Version	Release date
1.1.16 (this version)	Jun 19, 2024
1.1.15	May 2, 2023
1.1.14	May 2, 2023
1.1.12	Mar 23, 2023
1.1.11	Mar 23, 2023
1.1.10	Mar 23, 2023
1.1.9	Nov 7, 2022
1.1.8	Oct 31, 2022
1.1.7	Oct 20, 2022
1.1.6	Oct 19, 2022
1.1.5	Oct 12, 2022
1.1.4	Oct 12, 2022
1.1.3	Oct 11, 2022
1.1.2	Oct 5, 2022
1.1.1	Oct 5, 2022
1.0.0	Aug 26, 2022
0.2.0	Aug 26, 2022
0.1.12	Jul 27, 2022
0.1.11	Jul 7, 2022
0.1.10	Jul 4, 2022
0.1.9	Jul 4, 2022
0.1.8	Jun 29, 2022
0.1.7	Jun 29, 2022
0.1.6	Jun 29, 2022
0.1.5	Jun 29, 2022
0.1.4	Jun 28, 2022
0.1.3	Jun 17, 2022
0.1.2	Jun 17, 2022
0.1.1	Jun 17, 2022
0.1.0	Jun 17, 2022

Download files

Source Distribution

ddataflow-1.1.16.tar.gz (17.0 kB), uploaded Jun 19, 2024, Source

Built Distribution

ddataflow-1.1.16-py3-none-any.whl (20.3 kB), uploaded Jun 19, 2024, Python 3

File details

Details for the file ddataflow-1.1.16.tar.gz.

File metadata

Download URL: ddataflow-1.1.16.tar.gz
Upload date: Jun 19, 2024
Size: 17.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.3 CPython/3.10.12 Linux/6.5.0-1022-azure

File hashes

Hashes for ddataflow-1.1.16.tar.gz
Algorithm	Hash digest
SHA256	`d65f295cae0910e2a5cddf30d2f164d3ae6b6ebdb7bdfad3295e4e6641ab7de3`
MD5	`294640ee12cb148d09e20e9141a95cc6`
BLAKE2b-256	`41c5eb9c33e26fdadc64910e4967d9eb6e35804da4a4e506f3aa4f0631f79ddb`

File details

Details for the file ddataflow-1.1.16-py3-none-any.whl.

File metadata

Download URL: ddataflow-1.1.16-py3-none-any.whl
Upload date: Jun 19, 2024
Size: 20.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.3 CPython/3.10.12 Linux/6.5.0-1022-azure

File hashes

Hashes for ddataflow-1.1.16-py3-none-any.whl
Algorithm	Hash digest
SHA256	`72586f43267e017578da52df6fb2129fac57df4c904c6cce271d2d1aaecf699b`
MD5	`f09185b270f9fc31f313ec17946428f8`
BLAKE2b-256	`e23a5273fc1ebcc4f5c82459b4fde3cbf117fdc05870c5c02dabcc33f18f41a2`
