A simple way to use datasets for DSM.
DSM Library
DataNode
- init DataNode
```python
from dsmlibrary.datanode import DataNode

data = DataNode(
    token="<token>",
    apikey="<apikey>",
    dataplatform_api_uri="<dataplatform_api_uri>",
    object_storage_uri="<object_storage_uri>",
    use_env=<True/False (default: True)>
)
```
- upload file
```python
data.upload_file(directory_id=<directory_id>, file_path="<file_path>", description="<description (optional)>")
```
- download file
```python
data.download_file(file_id=<file_id>, download_path="<path to save the downloaded file (default: ./dsm.tmp)>")
```
- get file
```python
meta, file = data.get_file(file_id="<file_id>")
# meta -> dict of file metadata
# file -> file content as a bytes IO object

# example: read a CSV file with pandas
import pandas as pd

meta, file = data.get_file(file_id="<file_id>")
df = pd.read_csv(file)
```
- read df
```python
df = data.read_df(file_id="<file_id>")
# df is returned as a pandas DataFrame
```
- read ddf
Files stored as .parquet must be read with this function.
```python
ddf = data.read_ddf(file_id="<file_id>")
# ddf is returned as a dask DataFrame
```
- write parquet file
```python
df = ...  # pandas DataFrame or dask DataFrame

data.write(
    df=df,
    directory=<directory_id>,
    name="<save_file_name>",
    description="<description>",
    replace=<replace if file exists, default False>,
    datadict=<True or False, default False>,
    profiling=<True or False, default False>,
    lineage=<list of file ids, e.g. [1, 2, 3]>
)
```
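As a concrete illustration, here is a minimal sketch of writing a small pandas DataFrame, assuming `data` is the authenticated DataNode from the init step above; the directory ID, file name, and lineage IDs are made-up placeholders.

```python
import pandas as pd

# build a small example DataFrame to store as a parquet DataNode
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# hypothetical directory ID, file name, and lineage IDs, for illustration only
data.write(
    df=df,
    directory=123,
    name="example_output",
    description="example parquet written from a pandas DataFrame",
    replace=True,
    lineage=[456]  # IDs of the source files this output was derived from
)
```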
- writeListDataNode
```python
df = ...  # pandas DataFrame or dask DataFrame

data.writeListDataNode(
    df=df,
    directory_id=<directory_id>,
    name="<save_file_name>",
    description="<description>",
    replace=<replace if file exists, default False>,
    datadict=<True or False, default False>,
    profiling=<True or False, default False>,
    lineage=<list of file ids, e.g. [1, 2, 3]>
)
```
- get file id
```python
file_id = data.get_file_id(name="<file name>", directory_id=<directory id>)
# returns the file ID as an int
```
- get directory id
```python
directory_id = data.get_directory_id(parent_dir_id=<parent directory id>, name="<directory name>")
# returns the directory ID as an int
```
- get file version
Use with ListDataNode files.

```python
fileVersion = data.get_file_version(file_id=<file id>)
# returns a dict with `file_id` and `timestamp`
```
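A minimal sketch tying these lookups together; the parent directory ID and the directory and file names below are made-up placeholders.

```python
# hypothetical names and IDs, for illustration only
directory_id = data.get_directory_id(parent_dir_id=1, name="my_project")
file_id = data.get_file_id(name="my_data.parquet", directory_id=directory_id)

# read the resolved file as a dask DataFrame
ddf = data.read_ddf(file_id=file_id)
```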
ClickHouse
- import data to ClickHouse
```python
from dsmlibrary.clickhouse import ClickHouse

ddf = ...  # pandas DataFrame or dask DataFrame

## to warehouse
table_name = "<your_table_name>"
partition_by = "<your_partition_by>"

connection = {
    'host': '',
    'port': <port>,
    'database': '',
    'user': '',
    'password': '',
    'settings': {
        'use_numpy': True
    },
    'secure': False
}

warehouse = ClickHouse(connection=connection)
tableName = warehouse.get_or_createTable(ddf=ddf, tableName=table_name, partition_by=partition_by)
warehouse.write(ddf=ddf, tableName=tableName)
```
- query data from ClickHouse

```python
query = f"""
    SELECT * FROM {tableName} LIMIT 10
"""
warehouse.read(sqlQuery=query)
```
- drop table
```python
warehouse.dropTable(tableName=table_name)
```
- optional
Use a custom config when inserting data into ClickHouse:

```python
config = {
    'n_partition_per_block': 10,
    'n_row_per_loop': 1000
}
warehouse = ClickHouse(connection=connection, config=config)
```
- truncate table
```python
warehouse.truncateTable(tableName=table_name)
```
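Putting the pieces together, a minimal sketch that loads a DataNode file into ClickHouse and queries it back; the connection values, API key, file ID, table name, and partition column are made-up placeholders.

```python
from dsmlibrary.datanode import DataNode
from dsmlibrary.clickhouse import ClickHouse

# hypothetical credentials and file ID, for illustration only
data = DataNode(apikey="<apikey>")
ddf = data.read_ddf(file_id=42)

# hypothetical local ClickHouse connection settings
connection = {
    'host': 'localhost',
    'port': 9000,
    'database': 'default',
    'user': 'default',
    'password': '',
    'settings': {'use_numpy': True},
    'secure': False
}

warehouse = ClickHouse(connection=connection)
table = warehouse.get_or_createTable(ddf=ddf, tableName="sales", partition_by="year")
warehouse.write(ddf=ddf, tableName=table)

# read a quick row count back out of the warehouse
result = warehouse.read(sqlQuery=f"SELECT count() FROM {table}")
```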
API
dsmlibrary
dsmlibrary.datanode.DataNode
- upload_file
- download_file
- read_df
- read_ddf
- write
- get_file_id
dsmlibrary.clickhouse.ClickHouse
- get_or_createTable
- write
- read
- dropTable
Use for pipeline

```python
data = DataNode(apikey="<APIKEY>")
```

Use an API key for authentication.
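For example, a minimal pipeline sketch assuming API-key authentication; the file and directory IDs and the column names are made-up placeholders.

```python
from dsmlibrary.datanode import DataNode

# hypothetical API key and IDs, for illustration only
data = DataNode(apikey="<APIKEY>")

# read an upstream file, transform it, and write the result back
df = data.read_df(file_id=101)
df["total"] = df["price"] * df["qty"]  # assumes these columns exist in the source file

data.write(
    df=df,
    directory=202,
    name="daily_totals",
    description="totals computed in a scheduled pipeline",
    replace=True,
    lineage=[101]  # record the upstream file for lineage tracking
)
```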
MDM
- semantic similarity
```bash
pip install "dsmlibrary[mdm]"
```
see example here
Gendatadict PDF
```python
from dsmlibrary.datadict import GenerateDatadict

gd = GenerateDatadict(
    token="<token>",
    apikey="<apikey>",
    dataplatform_api_uri="<dataplatform_api_uri>",
    object_storage_uri="<object_storage_uri>"
)
gd.generate_datadict(name="<NAME>", directory_id=<DIR_ID for datadict file>, file_ids=[<FILE_ID>, <FILE_ID>, ...])
```
- use either token or apikey for authentication
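A minimal sketch of generating a data dictionary, assuming API-key-only authentication works the same way as for DataNode; the name and the directory and file IDs are made-up placeholders.

```python
from dsmlibrary.datadict import GenerateDatadict

# hypothetical IDs, for illustration only
gd = GenerateDatadict(apikey="<apikey>")
gd.generate_datadict(
    name="customer_tables_datadict",
    directory_id=10,           # directory where the data dictionary PDF will be saved
    file_ids=[11, 12, 13]      # files to include in the data dictionary
)
```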