data-toolz

This repository contains reusable Python code for data projects.

The motivation for this project was to create a package that abstracts dataset read/write operations away from

  • the destination type (local, s3, <tbd...>) and
  • the target file type (delimiter-separated values, jsonlines, parquet)

This makes it easy to write code that is transferable between local and cloud applications.

installation

pip install data-toolz

usage

The datatoolz.filesystem.FileSystem class gives you an abstraction for accessing both local and remote objects using the well-known pythonic open() interface.

from datatoolz.filesystem import FileSystem

for fs_type in ("local", "s3"):
    fs = FileSystem(name=fs_type)

    # common pythonic interface for both local and remote file systems
    with fs.open("my-folder-or-bucket/my-file", mode="wt") as fo:
        fo.write("Hello World!")
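
Reading the data back uses the same open() interface; a minimal sketch, assuming the file written above still exists (the "local" file system is shown here):

from datatoolz.filesystem import FileSystem

fs = FileSystem(name="local")

# read the file back with the same pythonic interface
with fs.open("my-folder-or-bucket/my-file", mode="rt") as fo:
    print(fo.read())  # prints "Hello World!"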

The datatoolz.io.DataIO class gives you a versatile Reader/Writer interface for handling typical data file types (jsonlines, dsv, parquet).

import pandas as pd
from datatoolz.io import DataIO

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})

dio = DataIO()  # defaults to "local" FileSystem

# write as parquet
dio.write(dataframe=df, path="my-file.parquet", filetype="parquet")
dio.read(path="my-file.parquet", filetype="parquet")

# write as gzip-compressed jsonlines
dio.write(dataframe=df, path="my-file.json.gz", filetype="jsonlines", gzip=True)
dio.read(path="my-file.json.gz", filetype="jsonlines", gzip=True)

# write as delimiter-separated-values in multiple partitions
dio.write(dataframe=df, path="my-file.tsv", filetype="dsv", sep="\t", partition_by=["col1"])
dio.read(path="my-file.tsv", filetype="dsv", sep="\t")

# write output in multiple chunks per partition
dio.write(dataframe=df, path="my-prefix", filetype="dsv", sep="\t", partition_by=["col1"], suffix=["chunk01.tsv", "chunk02.tsv"])
dio.read(path="my-prefix", filetype="dsv", sep="\t")
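
Assuming read() returns a pandas.DataFrame (as the writer interface above suggests), a round trip can be verified directly; a minimal sketch using the parquet file written above:

# read the parquet file back and check it matches the original frame
df_roundtrip = dio.read(path="my-file.parquet", filetype="parquet")
pd.testing.assert_frame_equal(df_roundtrip, df)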

datatoolz.logging.JsonLogger is a wrapper logger that outputs JSON-structured logs.

from datatoolz.logging import JsonLogger

logger = JsonLogger(name="my-custom-logger", env="dev")
logger.info(msg="what is my purpose?", meaning_of_life=42)
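# example output (timestamp will vary):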
{"logger": {"application": "my-custom-logger", "environment": "dev"}, "level": "info", "timestamp": "2020-11-03 18:31:07.757534", "message": "what is my purpose?", "extra": {"meaning_of_life": 42}}

It can also be used to decorate functions and log their execution details.

from datatoolz.logging import JsonLogger

logger = JsonLogger(name="my-custom-logger", env="dev")

@logger.decorate(msg="my-custom-log", duration=True, memory=True, my_value="my-value", output_length=lambda x: len(x))
def my_func(x, y):
    return x + y, x * y

print(my_func(42, 2))
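# example log output, followed by the decorated function's return value: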
{"logger": {"application": "my-custom-logger", "environment": "dev"}, "level": "info", "timestamp": "2021-03-24 18:10:47.054703", "message": "my-custom-log", "extra": {"function": "my_func", "memory": {"current": 432, "peak": 432}, "duration": 2.5980000000203063e-06, "my_value": "my-value", "output_length": 2}}
(44, 84)
