Data helper package
data-toolz
This repository contains reusable Python code for data projects.
The motivation for this project was to create a package that abstracts dataset read/write operations from:
- destination type (local, s3, <tbd...>)
- target file type (delimiter-separated values, jsonlines, parquet)
This makes it easy to write code that is transferable between local and cloud applications.
installation
pip install data-toolz
usage
The datatoolz.filesystem.FileSystem class gives you an abstraction for accessing both local and remote objects using the well-known Pythonic open() interface.
from datatoolz.filesystem import FileSystem

for fs_type in ("local", "s3"):
    fs = FileSystem(name=fs_type)
    # common pythonic interface for both local and remote file systems
    with fs.open("my-folder-or-bucket/my-file", mode="wt") as fo:
        fo.write("Hello World!")
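To illustrate the idea of selecting a backend by name while keeping a single open() call, here is a minimal standard-library sketch. It is not how data-toolz is implemented: `LocalBackend`, `BACKENDS`, `make_filesystem`, and `roundtrip` are hypothetical names invented for this example, and only a local backend is shown.

```python
import os
import tempfile


class LocalBackend:
    """Hypothetical backend that resolves paths under a root directory."""

    def __init__(self, root):
        self.root = root

    def open(self, path, mode="rt"):
        full_path = os.path.join(self.root, path)
        # Create parent folders for write/append/create modes so callers
        # do not have to prepare the directory tree themselves.
        if any(flag in mode for flag in "wax"):
            os.makedirs(os.path.dirname(full_path), exist_ok=True)
        return open(full_path, mode)


# Hypothetical registry mapping a backend name to its factory.
# A remote backend would be registered here under its own name.
BACKENDS = {"local": LocalBackend}


def make_filesystem(name, **kwargs):
    """Look up a backend by name and fail early on unknown names."""
    try:
        factory = BACKENDS[name]
    except KeyError:
        raise ValueError(f"unsupported file system: {name!r}") from None
    return factory(**kwargs)


def roundtrip(text):
    """Write text through the backend, then read it back."""
    with tempfile.TemporaryDirectory() as root:
        fs = make_filesystem("local", root=root)
        with fs.open("my-folder/my-file", mode="wt") as fo:
            fo.write(text)
        with fs.open("my-folder/my-file", mode="rt") as fi:
            return fi.read()
```

Because callers only depend on the name passed to the factory and on open(), swapping the destination does not require touching the code that reads or writes.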
The datatoolz.io.DataIO class gives you a versatile Reader/Writer interface for handling typical data files (jsonlines, dsv, parquet).
import pandas as pd
from datatoolz.io import DataIO
df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
dio = DataIO() # defaults to "local" FileSystem
# write as parquet
dio.write(dataframe=df, path="my-file.parquet", filetype="parquet")
dio.read(path="my-file.parquet", filetype="parquet")
# write as gzip-compressed jsonlines
dio.write(dataframe=df, path="my-file.json.gz", filetype="jsonlines", gzip=True)
dio.read(path="my-file.json.gz", filetype="jsonlines", gzip=True)
# write as delimiter-separated-values in multiple partitions
dio.write(dataframe=df, path="my-file.tsv", filetype="dsv", sep="\t", partition_by=["col1"])
dio.read(path="my-file.tsv", filetype="dsv", sep="\t")
# write output in multiple chunks per partition
dio.write(dataframe=df, path="my-prefix", filetype="dsv", sep="\t", partition_by=["col1"], suffix=["chunk01.tsv", "chunk02.tsv"])
dio.read(path="my-prefix", filetype="dsv", sep="\t")
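For readers unfamiliar with the gzip-compressed jsonlines format used above, the following sketch shows what such a file contains, using only the standard library: one JSON object per line, written through a gzip text stream. The helper names `write_jsonlines_gz` and `read_jsonlines_gz` are hypothetical and are not part of data-toolz; how DataIO handles this internally is not shown here.

```python
import gzip
import json
import os
import tempfile


def write_jsonlines_gz(records, path):
    # "wt" opens the gzip stream in text mode, so str lines can be written.
    with gzip.open(path, "wt", encoding="utf-8") as fo:
        for record in records:
            fo.write(json.dumps(record) + "\n")


def read_jsonlines_gz(path):
    # Each non-empty line is parsed as an independent JSON document.
    with gzip.open(path, "rt", encoding="utf-8") as fi:
        return [json.loads(line) for line in fi if line.strip()]


records = [
    {"col1": 1, "col2": "a"},
    {"col1": 2, "col2": "b"},
    {"col1": 3, "col2": "c"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my-file.json.gz")
    write_jsonlines_gz(records, path)
    restored = read_jsonlines_gz(path)
```

Since every line is a complete JSON document, such files can be appended to or processed line by line without loading the whole file.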
datatoolz.logging.JsonLogger is a wrapper logger that outputs JSON-structured logs.
from datatoolz.logging import JsonLogger
logger = JsonLogger(name="my-custom-logger", env="dev")
logger.info(msg="what is my purpose?", meaning_of_life=42)
{"logger": {"application": "my-custom-logger", "environment": "dev"}, "level": "info", "timestamp": "2020-11-03 18:31:07.757534", "message": "what is my purpose?", "extra": {"meaning_of_life": 42}}
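A comparable log shape can be produced with the standard logging module alone, which may help when reasoning about what a JSON-structured logger does. This is a sketch under assumptions, not the JsonLogger implementation: `JsonFormatter` and the `fields` attribute passed through `extra=` are hypothetical names chosen for this example.

```python
import io
import json
import logging
from datetime import datetime


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter emitting one JSON object per log record."""

    def __init__(self, application, environment):
        super().__init__()
        self.application = application
        self.environment = environment

    def format(self, record):
        payload = {
            "logger": {
                "application": self.application,
                "environment": self.environment,
            },
            "level": record.levelname.lower(),
            "timestamp": str(datetime.fromtimestamp(record.created)),
            "message": record.getMessage(),
            # "fields" is a hypothetical attribute supplied via `extra=`.
            "extra": getattr(record, "fields", {}),
        }
        return json.dumps(payload)


# Log into an in-memory stream so the output can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(application="my-custom-logger", environment="dev"))

logger = logging.getLogger("json-sketch")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the record away from the root logger
logger.addHandler(handler)

logger.info("what is my purpose?", extra={"fields": {"meaning_of_life": 42}})
entry = json.loads(stream.getvalue())
```

Emitting one JSON object per line keeps the logs machine-parseable, so extra key/value pairs can be queried downstream without regular expressions.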