Package for working with pandas DataFrames, with specialized functions used at Energinet

Datamazing

The Datamazing package provides an interface for various data transformations (filtering, aggregation, merging, etc.).

Interface

The interface closely resembles that of most DataFrame libraries and query languages (pandas, pyspark, SQL, etc.). For example, a group-by is implemented as group(df, by=["..."]), and a merge is implemented as merge([df1, df2], on=["..."], how="inner"). So, why not just use native pandas, pyspark, etc.?

  1. Parts of the native libraries have awkward interfaces (such as pandas' inconsistent use of indexes).
  2. The package makes it possible to add custom operations used specifically in the Energinet domain.
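
To illustrate the first point, here is a small sketch using only native pandas (not Datamazing). It shows one indexing quirk: by default, a group-by moves the grouping key into the index, so the key has to be restored as a column before it can be used like any other column. The data and variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "dog"],
    "age": [1.0, 3.0, 5.0],
})

# Default behaviour: the grouping key ends up in the index of the result.
by_index = df.groupby("animal")["age"].mean()

# With as_index=False, the key stays a regular column instead.
as_column = df.groupby("animal", as_index=False)["age"].mean()

print(list(by_index.index))     # the keys live in the index
print(list(as_column.columns))  # the keys live in an ordinary column
```

Whether a result carries its keys in the index or in columns depends on the operation and its arguments, which is the kind of inconsistency a uniform, column-only interface avoids.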

Backends

The package provides methods with the same interface for different backends. Currently, two backends are supported: pandas and pyspark (though not all methods are available for both). For example, when working with pandas DataFrames, one would use

import pandas as pd
import datamazing.pandas as pdz

df = pd.DataFrame([
    {"animal": "cat", "time": pd.Timestamp("2020-01-01"), "age": 1.0},
    {"animal": "cat", "time": pd.Timestamp("2020-01-02"), "age": 3.0},
    {"animal": "dog", "time": pd.Timestamp("2020-01-01"), "age": 5.0},
])

# Parentheses avoid backslash continuations, which break on trailing whitespace
(
    pdz.group(df, by="animal")
    .resample(on="time", resolution=pd.Timedelta(hours=12))
    .agg("interpolate")
)

whereas, when working with pyspark DataFrames, one would instead use

import datetime as dt
import pandas as pd  # needed for pd.Timedelta below
import pyspark.sql as ps
import datamazing.pyspark as psz

spark = ps.SparkSession.getActiveSession()

df = spark.createDataFrame([
    {"animal": "cat", "time": dt.datetime(2020, 1, 1), "age": 1.0},
    {"animal": "cat", "time": dt.datetime(2020, 1, 2), "age": 3.0},
    {"animal": "dog", "time": dt.datetime(2020, 1, 1), "age": 5.0},
])

# Parentheses avoid backslash continuations, which break on trailing whitespace
(
    psz.group(df, by="animal")
    .resample(on="time", resolution=pd.Timedelta(hours=12))
    .agg("interpolate")
)
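
For comparison, the following is a rough sketch of how a similar result could be produced with native pandas alone. It assumes that "interpolate" means linear interpolation onto a regular 12-hour grid within each group; that is a guess at the intent, not a description of how Datamazing implements it. The helper name resample_interpolate is hypothetical.

```python
import pandas as pd

df = pd.DataFrame([
    {"animal": "cat", "time": pd.Timestamp("2020-01-01"), "age": 1.0},
    {"animal": "cat", "time": pd.Timestamp("2020-01-02"), "age": 3.0},
    {"animal": "dog", "time": pd.Timestamp("2020-01-01"), "age": 5.0},
])


def resample_interpolate(group: pd.DataFrame) -> pd.DataFrame:
    """Upsample one group to a 12-hour grid and fill gaps linearly."""
    return (
        group.set_index("time")[["age"]]
        .resample(pd.Timedelta(hours=12))
        .interpolate()  # linear interpolation by default
    )


# Process each group separately, then restore the key as a regular column.
parts = []
for animal, group in df.groupby("animal"):
    part = resample_interpolate(group).reset_index()
    part.insert(0, "animal", animal)
    parts.append(part)

result = pd.concat(parts, ignore_index=True)
print(result)
```

Note how much of this code only moves "time" and "animal" in and out of the index; the chained group, resample, and agg calls in the examples above hide that bookkeeping.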

Development

To set up the Python environment, run

$ pip install poetry
$ poetry install

Running the tests locally requires Java. On Debian or Ubuntu, it can be installed with:

$ sudo apt install default-jdk

To execute the unit tests, run

$ pytest .
