
Exchange data between databases and Parquet files

delibird: a transformer between databases and Parquet files

Introduction:

delibird is a Python tool library built on pyarrow that supports multithreaded and asynchronous calls. It helps users transform data between databases and Parquet files.

Features:

  • Multithread: supports batch reading/writing and multithreaded operations on database tables and Parquet files.
  • Read directory: reads all Parquet files in a given directory and loads them into the database. One directory maps to one database table.
  • Mock data: creates Parquet files or database tables with a customized schema.
  • Workflow: given a YAML file containing your customized configuration, delibird creates a workflow that executes multiple jobs.
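
The "Multithread" and "Read directory" features can be pictured with a minimal sketch that uses only the Python standard library. It is not delibird's implementation: `process_directory` and the `load_one` callback are hypothetical names, and the real per-file work (reading with pyarrow and inserting into the table) is left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_directory(directory, load_one, max_workers=4):
    """Apply load_one to every *.parquet file in directory using a thread pool.

    load_one is a hypothetical callback (path -> result); in a real tool it
    would read the file and write its rows into the single target table.
    """
    files = sorted(Path(directory).glob("*.parquet"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with files.
        results = list(pool.map(load_one, files))
    return dict(zip((f.name for f in files), results))
```

Files that do not end in `.parquet` are ignored, which matches the idea that one directory of Parquet files maps to one table.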

Limits:

  • Only PostgreSQL and Oracle databases are supported for now.

Installation

source code

git clone https://gitee.com/lipicoder/delibird.git
cd delibird
pip install -e .

PyPI

$ python -m build

pip

$ pip install delibird

Usage

Run 'delibird' on the command line to display the usage hint.

(.context) % delibird
Usage: delibird [OPTIONS] COMMAND [ARGS]...

  delibird command line interface.

Options:
  -h, --help  Show this message and exit.

Commands:
  mock      Mock data to directory , file, or database.
  parquet   Write or read Parquet file or directory
  workflow  Database and parquet data transform workflow.

mock

Example:

# mock data workflow

# workflow lists
mocks:
  - name: "mock-to-directory"
    row-number: 2048
    direction: "directory" # directory ,file or  table
    directory: "./datasets/mock_data/mock_stocks"
    columns: {
      # stock code as a type
      "sec_code": "code",  # "600001"
      "date": "date",  # 2022-08-24
      "close": "float",  # 16.87
      "open": "float",  # 16.65
      "high": "float",  # 16.95
      "low": "float",  # 16.55
      "hold": "decimal(10,5)",  # 123.25515
      "time": "timestample(unit=s,tz=Asia/Shanghai)",
      "volume": "int",  # 1530231
      "amount": "int",  # 2571196416
      "memo": "string", # hello
    }

  - name: "mock-to-file"
    row-number: 2048
    direction: "file" # directory, file or table
    filepath: "./datasets/mock_data/mock_stocks.parquet"
    columns: {
      # stock code as a type
      "sec_code": "code",  # "600001"
      "date": "date",  # 2022-08-24
      "close": "float",  # 16.87
      "open": "float",  # 16.65
      "high": "float",  # 16.95
      "low": "float",  # 16.55
      "hold": "decimal(10,5)",  # 123.25515
      "time": "timestample(unit=s,tz=Asia/Shanghai)",
      "volume": "int",  # 1530231
      "amount": "int",  # 2571196416
      "memo": "string", # hello
    }

  - name: "mock-to-table"
    row-number: 204800
    direction: "table" # directory ,file or table
    engine: "postgresql"
    dsn: "postgresql://test:test123@localhost:5432/delibird"
    table-name: "mock_stocks"
    columns: {
      # stock code as a type
      "sec_code": "code",  # "600001"
      "date": "date",  # 2022-08-24
      "close": "float",  # 16.87
      "open": "float",  # 16.65
      "high": "float",  # 16.95
      "low": "float",  # 16.55
      "hold": "decimal(10,5)",  # 123.25515
      # datetime.datetime(2022,10,25).timestamp()
      "time": "timestample(unit=s,tz=Asia/Shanghai)",
      "volume": "int",  # 1530231
      "amount": "int",  # 2571196416
    }

direction: the target format to transform to. 'directory': a directory path. 'file': a file path. 'table': a database table name.

columns: the column definitions of the database table. The standard data types of PostgreSQL or Oracle are supported, depending on which database you choose. delibird automatically maps each database data type to a pyarrow data type. 'code' means stock code; this type will be removed later.
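
As a rough illustration of the first step such a mapping needs, the sketch below splits a column type string from the examples above into a name plus arguments. `parse_column_type` is a hypothetical helper, not part of delibird; how the parsed pieces are then turned into pyarrow types (for example, a decimal with precision 10 and scale 5) is up to the library and is not shown here.

```python
import re

# A type name, optionally followed by a parenthesized argument list.
_TYPE_RE = re.compile(r"^\s*(\w+)\s*(?:\((.*)\))?\s*$")


def parse_column_type(spec):
    """Split a type string into (name, positional_args, keyword_args)."""
    match = _TYPE_RE.match(spec)
    if match is None:
        raise ValueError(f"unrecognized column type: {spec!r}")
    name, raw_args = match.group(1), match.group(2)
    args, kwargs = [], {}
    if raw_args:
        for part in raw_args.split(","):
            part = part.strip()
            if "=" in part:
                # key=value arguments, e.g. unit=s or tz=Asia/Shanghai
                key, value = part.split("=", 1)
                kwargs[key.strip()] = value.strip()
            else:
                # positional arguments, e.g. precision and scale
                args.append(part)
    return name, args, kwargs
```

All arguments are returned as strings; converting "10" and "5" to integers would be the job of whatever code consumes the parsed result.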

Execute the mock workflow:

(.context) % delibird mock tests/yaml/mock_file.yaml
write directory finished
write parquet finished

parquet

Read data from a database table and write it to a Parquet file or to Parquet files in a directory, or read data from a Parquet file or from Parquet files in a directory and write it to a database table.

(.context) % delibird parquet
Usage: delibird parquet [OPTIONS] COMMAND [ARGS]...

  Write or read Parquet file or directory.

Options:
  -h, --help  Show this message and exit.

Commands:
  read   Read parquet file and write to database.
  write  Read from database and write to parquet file.

parquet read

Read data from a Parquet file or from Parquet files in a directory and write it to a database table.

(.context) % delibird parquet read -h
Usage: delibird parquet read [OPTIONS] [-e ENGINE] PATH DSN TABLE_NAME

  Read parquet file, write to database.

  dsn sample:postgresql://user:password@host:port/dbname.

  engine [postgresql/oracle]

Options:
  -h, --help  Show this message and exit.

Example:

delibird parquet read datasets/mock_data/mock_stocks.parquet postgresql://test:test123@localhost:5432/delibird mock_stocks -e postgresql
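
The DSN argument follows the URL shape shown in the help text (`postgresql://user:password@host:port/dbname`). The sketch below shows how such a string breaks down into its parts using the standard library; `parse_dsn` is a hypothetical helper for illustration, and delibird may handle the DSN differently (for example, by passing it straight to a database driver).

```python
from urllib.parse import urlsplit


def parse_dsn(dsn):
    """Break a URL-style DSN into its components."""
    parts = urlsplit(dsn)
    return {
        "engine": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,  # int, or None if the DSN has no port
        "dbname": parts.path.lstrip("/"),
    }


info = parse_dsn("postgresql://test:test123@localhost:5432/delibird")
print(info["host"], info["port"], info["dbname"])  # localhost 5432 delibird
```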

parquet write

Read data from a database table and write it to a Parquet file or to Parquet files in a directory.

(.context) % delibird parquet write -h
Usage: delibird parquet write [OPTIONS]  [-e ENGINE] PATH DSN TABLE_NAME

  Read from database and write to parquet file.

  dsn sample:postgresql://user:password@host:port/dbname.

  engine [postgresql/oracle]

Options:
  -s, --batch_size INTEGER
  -h, --help                Show this message and exit.

Example:

delibird parquet write datasets/mock_data/mock_stocks_tmp postgresql://test:test123@localhost:5432/delibird mock_stocks -e postgresql

parquet write supports configuring the batch size:

delibird parquet write -s 1024 -e postgresql datasets/mock_data/mock_stocks postgresql://test:test123@localhost:5432/delibird mock_stocks

In this case, each Parquet file holds at most 1024 rows, and four files appear in the directory.

(.context) % ls datasets/mock_data/mock_stocks
ea6c445914824cae8ef171bbafd3a58f.parquet
604a63ccf14343c39bcc5bc0d1b3907d.parquet
9c7150d9821c46c78054d87ae23d900f.parquet
2ba1952316344b01a2a2f8e6faf41c31.parquet
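
The relationship between batch size and file count is simple arithmetic: the number of files is the row count divided by the batch size, rounded up. The sketch below works this out under two assumptions that are not stated in the original: that the table holds 4096 rows (any count from 3073 to 4096 would also give four files), and that file names are random UUID hex strings, as the listing above suggests. `plan_files` is a hypothetical helper.

```python
import math
import uuid


def plan_files(total_rows, batch_size):
    """Return (filename, first_row, last_row_exclusive) for each output file."""
    n_files = math.ceil(total_rows / batch_size)
    plan = []
    for i in range(n_files):
        start = i * batch_size
        # The last file may hold fewer than batch_size rows.
        stop = min(start + batch_size, total_rows)
        plan.append((f"{uuid.uuid4().hex}.parquet", start, stop))
    return plan


for name, start, stop in plan_files(4096, 1024):
    print(name, stop - start)
```

With these inputs the loop prints four lines, each reporting 1024 rows; the file names differ on every run.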

file

delibird parquet write -e postgresql datasets/mock_data/mock_stocks_tmp.parquet postgresql://test:test123@localhost:5432/delibird mock_stocks;

To reduce memory usage and speed up writing, writing to a single file also supports configuring the batch size.

workflow

Create and execute a workflow using a YAML configuration file.

(.context) % delibird workflow  -h
Usage: delibird workflow [OPTIONS] YAML_FILE

  Execute yaml workflow.

Options:
  -h, --help  Show this message and exit.

Example:

workflows:
  - name: "read-workflow" # workflow name
    direction: "table" # table or file or directory
    table-name: "mock_stocks" # table name
    engine: "postgresql"
    dsn: "postgresql://test:test123@localhost:5432/delibird"
    read-type: "file" # file or directory
    filepath: "./datasets/mock_data/mock_stocks.parquet" # filepath

  - name: "write-directory-workflow" # workflow name
    direction: "directory"
    table-name: "mock_stocks" # table name
    engine: "postgresql"
    dsn: "postgresql://test:test123@localhost:5432/delibird"
    directory: "./datasets/mock_data/mock_stocks" # directory path
    batch-size: 1024 # batch size

  - name: "write-file-workflow" # workflow name
    direction: "file"
    table-name: "mock_stocks" # table name
    engine: "postgresql"
    dsn: "postgresql://test:test123@localhost:5432/delibird"
    filepath: "./datasets/mock_data/mock_stocks_rewrite.parquet"
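
One way to read the three entries above is as a dispatch on `direction`: "table" loads Parquet data into a table, while "directory" and "file" export a table to Parquet. The sketch below encodes that reading for an already-parsed entry (a plain dict). It is an interpretation inferred from the example, not delibird's code, and `describe_workflow` is a hypothetical helper.

```python
def describe_workflow(entry):
    """Return (action, source, target) for one parsed workflow entry."""
    direction = entry["direction"]
    table = entry["table-name"]
    if direction == "table":
        # Parquet -> database; read-type selects a single file or a directory.
        key = "filepath" if entry.get("read-type", "file") == "file" else "directory"
        return ("load", entry[key], table)
    if direction == "directory":
        # Database -> directory of Parquet files.
        return ("export", table, entry["directory"])
    if direction == "file":
        # Database -> single Parquet file.
        return ("export", table, entry["filepath"])
    raise ValueError(f"unknown direction: {direction!r}")


entry = {
    "name": "write-directory-workflow",
    "direction": "directory",
    "table-name": "mock_stocks",
    "directory": "./datasets/mock_data/mock_stocks",
    "batch-size": 1024,
}
print(describe_workflow(entry))
```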

TODO

  • Remove the 'code' type from delibird mock. Add new supported types such as random string and random digit string.

Dependency

pyarrow >=9.0.0

python >= 3.10

License

Apache License 2.0
