dcraft

License: MIT

A data management library based on the data lake concept, aimed especially at data science and machine learning.
It helps you manage data in your daily work using the raw, trusted and refined layers of a data lake. Data is versioned and saved for each layer on the storage and tables you specify.

Concept

For both individual daily work and team work, we need to manage and organize datasets to keep the workflow clean. This library helps with that, based on the data lake's layer concept.
For each layer, you can save the data and metadata to the places you choose, such as the local file system, GCP and MongoDB.
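
The layer idea can be illustrated without the library at all. The following is a minimal sketch in plain Python, assuming a hypothetical cleaning rule (drop records with missing values) and a hypothetical refinement (per-column totals); the helper names to_trusted and to_refined are illustrative and are not part of dcraft.

```python
from typing import Any

Record = dict[str, Any]


def to_trusted(raw_records: list[Record]) -> list[Record]:
    """Keep only records with no missing values (a hypothetical cleaning rule)."""
    return [r for r in raw_records if all(v is not None for v in r.values())]


def to_refined(trusted_records: list[Record]) -> Record:
    """Sum each column over the trusted records (a hypothetical refinement)."""
    totals: Record = {}
    for record in trusted_records:
        for key, value in record.items():
            totals[key] = totals.get(key, 0) + value
    return totals


# Raw layer: data exactly as it arrived, including a missing value.
raw = [{"a": 1, "b": None}, {"a": 2, "b": 4}]
# Trusted layer: cleaned records.
trusted = to_trusted(raw)
# Refined layer: data shaped for a specific use.
refined = to_refined(trusted)

print(trusted)
print(refined)
```

Keeping each stage as a separate, saved dataset is what makes it possible to trace a refined result back to the raw input it came from.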

Covered Data Type

  • pd.DataFrame
  • Dict
  • List of Dict

Covered Format

  • csv
  • parquet
  • json
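
The choice of format affects what comes back when the data is read again. As a rough illustration using only the Python standard library (this is not how dcraft serializes data internally), the sketch below writes a list of dicts to JSON and CSV and reads both back. Parquet is left out because it needs a third-party package.

```python
import csv
import io
import json

records = [{"a": 1, "b": 3}, {"a": 2, "b": 4}]

# JSON keeps basic types such as integers.
json_text = json.dumps(records)
restored_from_json = json.loads(json_text)

# CSV is plain text, so every value comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["a", "b"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
restored_from_csv = list(csv.DictReader(io.StringIO(csv_text)))

print(restored_from_json)
print(restored_from_csv)
```

If column types matter downstream, a typed format such as parquet or JSON is usually the safer choice than CSV.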

Covered Storage and Table

You can save the metadata and data in several places. The lists below show the current coverage.

Metadata

  • Local File System
  • BigQuery
  • MongoDB

Data

  • Local File System
  • Google Cloud Storage
  • MinIO

Installation

pip install dcraft

To use GCP resources, install the gcp extra:

pip install dcraft[gcp]

Example

Create data for a layer. create_trusted and create_refined are also available.

from dcraft import create_raw
import pandas as pd

data = pd.DataFrame({"a": [1,2], "b": [None, 4]})
raw_layer_data = create_raw(
    data,
    "fake-project",
    "Shuhei Kishi",
    "This is fake project",
    {"version": "0.0.1"}
)

You can choose where the data and metadata are saved. In this example, both are saved locally.

import os
from dcraft import LocalDataRepository, LocalMetadataRepository

CURRENT_DIR = os.getcwd()
DATA_DIR_PATH = os.path.join(CURRENT_DIR, "data")
METADATA_DIR_PATH = os.path.join(CURRENT_DIR, "metadata")

data_repository = LocalDataRepository(DATA_DIR_PATH)
metadata_repository = LocalMetadataRepository(METADATA_DIR_PATH)
raw_layer_data.save("parquet", data_repository, metadata_repository)

The data was saved to the raw layer, and its information was saved as metadata.
You can read the saved data back using the id from the metadata. The format is preserved.

from dcraft import read_layer_data
loaded_raw_layer_data = read_layer_data(<id-from-metadata>, data_repository, metadata_repository)
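
To show why an id is enough to find the data again, here is a toy sketch of the pattern using only the standard library. It assumes a layout where each metadata file records the format and path of its data file. The functions save_records and read_records and the file layout are hypothetical and do not reflect how dcraft stores things.

```python
import json
import tempfile
import uuid
from pathlib import Path


def save_records(records, fmt: str, data_dir: Path, metadata_dir: Path) -> str:
    """Write the data file and a metadata file, then return the new id."""
    if fmt != "json":
        raise ValueError("this sketch only handles json")
    record_id = uuid.uuid4().hex
    data_path = data_dir / f"{record_id}.{fmt}"
    data_path.write_text(json.dumps(records))
    # The metadata remembers the format and where the data lives.
    metadata = {"id": record_id, "format": fmt, "path": str(data_path)}
    (metadata_dir / f"{record_id}.json").write_text(json.dumps(metadata))
    return record_id


def read_records(record_id: str, metadata_dir: Path):
    """Look up the metadata by id, then load the data it points to."""
    metadata = json.loads((metadata_dir / f"{record_id}.json").read_text())
    return json.loads(Path(metadata["path"]).read_text())


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    metadata_dir = Path(tmp) / "metadata"
    data_dir.mkdir()
    metadata_dir.mkdir()
    saved_id = save_records([{"a": 1}], "json", data_dir, metadata_dir)
    loaded = read_records(saved_id, metadata_dir)

print(loaded)
```

Because the format is recorded in the metadata at save time, the reader does not need to be told it again.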

If you want to save the metadata and data in different places, such as BigQuery and Google Cloud Storage, you can use a different Repository class.

from dcraft import BqMetadataRepository, GcsDataRepository

GCP_PROJECT = "your-project-id"
GCS_BUCKET = "your-bucket-name"

data_repository = GcsDataRepository(GCP_PROJECT, GCS_BUCKET)
metadata_repository = BqMetadataRepository(GCP_PROJECT, "test_dataset", "test_table")

raw_layer_data.save("csv", data_repository, metadata_repository)
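
Swapping backends this way works because the calling code depends only on a shared interface. The sketch below shows that general pattern with in-memory stand-ins. BlobStore, MemoryStore, BucketStore and round_trip are hypothetical names for illustration and are not dcraft classes.

```python
from typing import Protocol


class BlobStore(Protocol):
    """Hypothetical interface; not dcraft's actual base class."""

    def write(self, name: str, payload: bytes) -> None: ...

    def read(self, name: str) -> bytes: ...


class MemoryStore:
    """Keeps blobs in a dict, standing in for a local backend."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def write(self, name: str, payload: bytes) -> None:
        self.blobs[name] = payload

    def read(self, name: str) -> bytes:
        return self.blobs[name]


class BucketStore(MemoryStore):
    """Prefixes every key with a bucket name, standing in for a remote backend."""

    def __init__(self, bucket: str) -> None:
        super().__init__()
        self.bucket = bucket

    def write(self, name: str, payload: bytes) -> None:
        super().write(f"{self.bucket}/{name}", payload)

    def read(self, name: str) -> bytes:
        return super().read(f"{self.bucket}/{name}")


def round_trip(store: BlobStore, name: str, payload: bytes) -> bytes:
    """Caller code depends only on the interface, not on the backend."""
    store.write(name, payload)
    return store.read(name)


local_result = round_trip(MemoryStore(), "raw.csv", b"a,b\n1,2\n")
bucket_store = BucketStore("your-bucket-name")
remote_result = round_trip(bucket_store, "raw.csv", b"a,b\n1,2\n")
print(local_result == remote_result)
```

The same call site serves both backends; only the object passed in changes.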
