Treasure Data Driver for Python
pytd provides user-friendly interfaces to Treasure Data’s REST APIs, Presto query engine, and Plazma primary storage.
This seamless connection lets your Python code efficiently read and write large volumes of data from and to Treasure Data, which makes day-to-day data analytics work more productive.
Installation
pip install pytd
Usage
Set the environment variables TD_API_KEY and TD_API_SERVER to your API key and endpoint, respectively, then create a client instance:
import pytd
client = pytd.Client(database='sample_datasets')
# or, hard-code your API key, endpoint, and/or query engine:
# >>> pytd.Client(apikey='1/XXX', endpoint='https://api.treasuredata.com/', database='sample_datasets', default_engine='presto')
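If you prefer to fail early when the environment is not configured, the two variables can be checked before the client is created. The following is a minimal sketch: the helper name client_kwargs_from_env is hypothetical and not part of pytd, and it only relies on the apikey, endpoint, and database keyword arguments shown above.

```python
import os


def client_kwargs_from_env(database, environ=None):
    """Build keyword arguments for pytd.Client from environment variables.

    Hypothetical helper (not part of pytd). Raises RuntimeError if
    TD_API_KEY or TD_API_SERVER is missing or empty.
    """
    env = os.environ if environ is None else environ
    required = ("TD_API_KEY", "TD_API_SERVER")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "apikey": env["TD_API_KEY"],
        "endpoint": env["TD_API_SERVER"],
        "database": database,
    }


# Usage (needs pytd installed and valid credentials):
# import pytd
# client = pytd.Client(**client_kwargs_from_env("sample_datasets"))
```

Passing the values explicitly is equivalent to letting pytd read the environment itself; the only gain is a clear error message before any request is made.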
Query in Treasure Data
Issue a Presto query and retrieve the result:
client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
# {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82], ['AAME', 9252], ..., ['ZUMZ', 2364]]}
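The result is a plain dictionary with a 'columns' list and a 'data' list of rows, so it can be reshaped with ordinary Python. As a small sketch, the hypothetical helper rows_as_dicts below (not part of pytd) pairs each row with the column names; the sample values are taken from the output shown above.

```python
def rows_as_dicts(result):
    """Turn {'columns': [...], 'data': [[...], ...]} into a list of dicts."""
    columns = result["columns"]
    return [dict(zip(columns, row)) for row in result["data"]]


# A shortened copy of the result shown above, used here instead of a live query.
sample = {"columns": ["symbol", "cnt"], "data": [["AAIT", 590], ["AAL", 82]]}
records = rows_as_dicts(sample)
print(records[0])  # {'symbol': 'AAIT', 'cnt': 590}

# With pandas installed, the same result can be loaded into a DataFrame:
# import pandas as pd
# df = pd.DataFrame(sample["data"], columns=sample["columns"])
```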
To run a query with Hive instead:
client.query('select hivemall_version()', engine='hive')
# {'columns': ['_c0'], 'data': [['0.6.0-SNAPSHOT-201901-r01']]} (as of Feb, 2019)
It is also possible to explicitly initialize pytd.Client for Hive:
client_hive = pytd.Client(database='sample_datasets', default_engine='hive')
client_hive.query('select hivemall_version()')
Write data to Treasure Data
Data represented as pandas.DataFrame can be written to Treasure Data as follows:
import pandas as pd
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 10]})
client.load_table_from_dataframe(df, 'takuti.foo', writer='bulk_import', if_exists='overwrite')
For the writer option, pytd supports three ways to ingest data into Treasure Data:
Bulk Import API: bulk_import (default)
Converts the data into a CSV file and uploads it in a batch.
Presto INSERT INTO query: insert_into
Inserts every row of the DataFrame by issuing an INSERT INTO query through the Presto query engine.
Recommended only for a small volume of data.
td-spark: spark
A local, customized Spark instance writes the DataFrame directly to Treasure Data’s primary storage system.
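One way to apply the guidance above is to pick the writer name from the size of the DataFrame. The sketch below is only an illustration: the function choose_writer, the 1000-row cutoff, and the rule of preferring spark whenever it is enabled are all assumptions made for this example, not pytd behavior or official recommendations.

```python
SMALL_TABLE_ROWS = 1000  # hypothetical cutoff, not a pytd constant


def choose_writer(n_rows, spark_enabled=False):
    """Return a value for the writer option based on the number of rows.

    Assumed policy: row-by-row INSERT INTO only for small data, td-spark
    when it has been enabled for the account, and bulk import otherwise.
    """
    if n_rows <= SMALL_TABLE_ROWS:
        return "insert_into"
    if spark_enabled:
        return "spark"
    return "bulk_import"


# Usage (needs pytd, pandas, and valid credentials):
# client.load_table_from_dataframe(
#     df, 'takuti.foo', writer=choose_writer(len(df)), if_exists='overwrite')
```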
Enabling Spark Writer
Because td-spark gives special access to the main storage system via PySpark, it must be enabled before use:
Contact support@treasuredata.com to enable the permission for your Treasure Data account.
Install pytd with the [spark] extra to use the spark writer: pip install pytd[spark]
To reuse an existing td-spark JAR file, create a SparkWriter with the td_spark_path option:
from pytd.writer import SparkWriter
writer = SparkWriter(apikey='1/XXX', endpoint='https://api.treasuredata.com/', td_spark_path='/path/to/td-spark-assembly.jar')
client.load_table_from_dataframe(df, 'mydb.bar', writer=writer, if_exists='overwrite')
How to replace pandas-td
pytd offers pandas-td-compatible functions that provide the same functionality more efficiently. If you are still using pandas-td, we recommend switching to pytd as follows.
First, install the package from PyPI:
pip install pytd
# or, `pip install pytd[spark]` if you wish to use `to_td`
Next, make the following changes to the import statements.
Before:
import pandas_td as td
In [1]: %load_ext pandas_td.ipython
After:
import pytd.pandas_td as td
In [1]: %load_ext pytd.pandas_td.ipython
After these changes, existing pandas_td code should keep running correctly with pytd. Please report an issue if you notice any incompatible behavior.
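For a codebase with many files, the two import changes can be applied mechanically. The following is a rough sketch using only the standard library; migrate_source is a hypothetical helper, and it handles only the exact `import pandas_td as <alias>` and `load_ext pandas_td.ipython` forms shown above, so other import styles would still need manual edits.

```python
import re

# Matches "import pandas_td as <alias>" on its own line, keeping indentation.
_IMPORT = re.compile(r"^([ \t]*)import pandas_td as (\w+)[ \t]*$", re.MULTILINE)
# Matches the IPython extension name that follows "load_ext".
_MAGIC = re.compile(r"(load_ext[ \t]+)pandas_td\.ipython\b")


def migrate_source(text):
    """Rewrite pandas_td imports in source text to their pytd equivalents."""
    text = _IMPORT.sub(r"\1import pytd.pandas_td as \2", text)
    return _MAGIC.sub(r"\1pytd.pandas_td.ipython", text)


before = "import pandas_td as td\n%load_ext pandas_td.ipython\n"
print(migrate_source(before))
```

Running the helper on already migrated text leaves it unchanged, so it is safe to apply more than once.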