A simple way to work with datasets for DSM.
DSM Library
DataNode
- init DataNode
from dsmlibrary.datanode import DataNode
data = DataNode(token)
- upload file
data.upload_file(directory_id=<directory_id>, file_path='<file_path>', description="<description(optional)>")
- download file
data.download_file(file_id=<file_id>, download_path="<path to save the downloaded file> (default: ./dsm.tmp)")
- get file
meta, file = data.get_file(file_id="<file_id>")
# meta -> dict
# file -> io bytes
# example: read a CSV with pandas
import pandas as pd
meta, file = data.get_file(file_id="<file_id>")
df = pd.read_csv(file)
...
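Since `get_file` returns the file content as an in-memory bytes object, it behaves like an open binary file and pandas can read it directly. A runnable sketch, using a hypothetical in-memory CSV as a stand-in for a real download:

```python
import io
import pandas as pd

# Stand-in for the bytes object returned by data.get_file (hypothetical content)
file = io.BytesIO(b"id,name\n1,alpha\n2,beta\n")

# pandas reads the file-like object exactly as it would a path
df = pd.read_csv(file)
print(df.shape)  # (2, 2)
```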
- read df
df = data.read_df(file_id="<file_id>")
# df return as pandas dataframe
- read ddf
.parquet files must be read with this function
ddf = data.read_ddf(file_id="<file_id>")
# ddf return as dask dataframe
- write parquet file
df = ... # pandas dataframe or dask dataframe
data.write(df=df, directory=<directory_id>, name="<save_file_name>", description="<description>", replace=<replace if file exists. default False>, profiling=<True or False default False>, lineage=<list of file id. eg [1,2,3]>)
- writeListDataNode
df = ... # pandas dataframe or dask dataframe
data.writeListDataNode(df=df, directory_id=<directory_id>, name="<save_file_name>", description="<description>", replace=<replace if file exists. default False>, profiling=<True or False default False>, lineage=<list of file id. eg [1,2,3]>)
- get file id
file_id = data.get_file_id(name=<file name>, directory_id=<directory id>)
# file_id return int fileID
- get directory id
directory_id = data.get_directory_id(parent_dir_id=<parent directory id>, name=<directory name>)
# directory_id return int directoryID
- get file version
used for files written with writeListDataNode
fileVersion = data.get_file_version(file_id=<file id>)
# returns a dict with `file_id` and `timestamp`
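If the returned `timestamp` is a Unix epoch (an assumption; verify against a real payload), it can be converted to a readable datetime with the standard library:

```python
from datetime import datetime, timezone

# Hypothetical payload shaped like the documented return value
fileVersion = {"file_id": 123, "timestamp": 1700000000}

# Assumes the timestamp is Unix seconds (UTC); check the actual API response
dt = datetime.fromtimestamp(fileVersion["timestamp"], tz=timezone.utc)
print(dt.isoformat())
```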
ClickHouse
- import data into ClickHouse
from dsmlibrary.clickhouse import ClickHouse
ddf = ... # pandas dataframe or dask dataframe
# to warehouse
table_name = <your_table_name>
partition_by = <your_partition_by>
connection = {
'host': '',
'port': <port>,
'database': '',
'user': '',
'password': '',
'settings':{
'use_numpy': True
},
'secure': False
}
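For example, a filled-in connection dict might look like the following. The hostname, database, and credentials are placeholders, and 9000 is ClickHouse's default native-protocol port:

```python
connection = {
    'host': 'clickhouse.example.com',  # hypothetical host
    'port': 9000,                      # default ClickHouse native port
    'database': 'default',
    'user': 'default',
    'password': '',                    # placeholder credentials
    'settings': {
        'use_numpy': True
    },
    'secure': False
}
```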
warehouse = ClickHouse(connection=connection)
tableName = warehouse.get_or_createTable(ddf=ddf, tableName=table_name, partition_by=partition_by)
warehouse.write(ddf=ddf, tableName=tableName)
- query data from clickhouse
query = f"""
SELECT * FROM {tableName} LIMIT 10
"""
warehouse.read(sqlQuery=query)
- drop table
warehouse.dropTable(tableName=table_name)
- optional
Use a custom config to control how data is inserted into ClickHouse
config = {
'n_partition_per_block': 10,
'n_row_per_loop': 1000
}
warehouse = ClickHouse(connection=connection, config=config)
- truncate table
warehouse.truncateTable(tableName=table_name)
API
dsmlibrary
dsmlibrary.datanode.DataNode
- upload_file
- download_file
- get_file
- read_df
- read_ddf
- write
- writeListDataNode
- get_file_id
- get_directory_id
- get_file_version
dsmlibrary.clickhouse.ClickHouse
- get_or_createTable
- write
- read
- dropTable
- truncateTable
Source Distribution
dsmlibrary-1.0.30.tar.gz (16.2 kB)