
Treasure Data Driver for Python

Project description


pytd provides user-friendly interfaces to Treasure Data’s REST APIs, Presto query engine, and Plazma primary storage.

The seamless connection allows your Python code to efficiently read and write large volumes of data from and to Treasure Data. As a result, pytd makes your day-to-day data analytics work more productive.

Installation

pip install pytd

Requirements

  • Python 3.10 or later

  • pandas 2.1 or later

Usage

Set your API key and endpoint in the environment variables TD_API_KEY and TD_API_SERVER, respectively, and create a client instance:

import pytd

client = pytd.Client(database='sample_datasets')
# or, hard-code your API key, endpoint, and/or query engine:
# >>> pytd.Client(apikey='1/XXX', endpoint='https://api.treasuredata.com/', database='sample_datasets', default_engine='presto')

Query in Treasure Data

Issue a Presto query and retrieve the result:

client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
# {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82], ['AAME', 9252], ..., ['ZUMZ', 2364]]}
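
Since the return value is a plain dictionary of column names and rows, it maps directly onto a pandas.DataFrame; a minimal sketch:

import pandas as pd

res = client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
# 'columns' and 'data' line up with the DataFrame constructor arguments.
df = pd.DataFrame(res['data'], columns=res['columns'])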

For Hive:

client.query('select hivemall_version()', engine='hive')
# {'columns': ['_c0'], 'data': [['0.6.0-SNAPSHOT-201901-r01']]} (as of Feb, 2019)

It is also possible to explicitly initialize pytd.Client for Hive:

client_hive = pytd.Client(database='sample_datasets', default_engine='hive')
client_hive.query('select hivemall_version()')

Here is an example of generator-based iterative retrieval using the DB-API interface. For details, please refer to the documentation.

from pytd.dbapi import connect

conn = connect(pytd.Client(database='sample_datasets'))
# or, connect with Hive:
# >>> conn = connect(pytd.Client(database='sample_datasets', default_engine='hive'))

def iterrows(sql, connection):
    """Yield (index, row) pairs, fetching one row at a time."""
    cur = connection.cursor()
    cur.execute(sql)
    index = 0
    columns = None
    while True:
        row = cur.fetchone()
        if row is None:
            break
        if columns is None:
            # Column names from the cursor become the keys of each row dict.
            columns = [desc[0] for desc in cur.description]
        yield index, dict(zip(columns, row))
        index += 1

for index, row in iterrows('select symbol, count(1) as cnt from nasdaq group by 1 order by 1', conn):
    print(index, row)

When you face an unexpected timeout error with Presto, this iterative way of retrieving data is worth trying.

Write data to Treasure Data

Data represented as pandas.DataFrame can be written to Treasure Data as follows:

import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 10]})
client.load_table_from_dataframe(df, 'takuti.foo', writer='bulk_import', if_exists='overwrite')

For the writer option, pytd supports three different ways to ingest data to Treasure Data:

  1. Bulk Import API: bulk_import (default)

    • Converts data into a CSV file and uploads it in a batch fashion.

  2. Presto INSERT INTO query: insert_into

    • Inserts each row of the DataFrame by issuing an INSERT INTO query through the Presto query engine.

    • Recommended only for a small volume of data (see the example after this list).

  3. td-spark: spark (No longer available)

    • A locally customized Spark instance writes the DataFrame directly to Treasure Data's primary storage system.
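
For instance, switching to the insert_into writer is just a matter of passing its name; a minimal sketch, assuming a small DataFrame df and a writable table mydb.bar:

# Row-by-row INSERT INTO via Presto; suitable only for small data.
client.load_table_from_dataframe(df, 'mydb.bar', writer='insert_into', if_exists='append')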

Characteristics of each of these methods can be summarized as follows:

                                      bulk_import   insert_into   spark (No longer available)
  Scalable against data volume             ✓                                  ✓
  Write performance for larger data                                           ✓
  Memory efficient                         ✓
  Disk efficient                                          ✓
  Minimal package dependency               ✓             ✓

Enabling Spark Writer

Since td-spark gives special access to the main storage system via PySpark, follow the instructions below:

  1. Contact support@treasuredata.com to activate the permission for your Treasure Data account. Note that the underlying component, Plazma Public API, limits its free tier to 100GB Read and 100TB Write.

  2. Install pytd with the [spark] option if you use this writer: pip install pytd[spark]

If you want to use an existing td-spark JAR file, create a SparkWriter with the td_spark_path option:

from pytd.writer import SparkWriter

writer = SparkWriter(td_spark_path='/path/to/td-spark-assembly.jar')
client.load_table_from_dataframe(df, 'mydb.bar', writer=writer, if_exists='overwrite')

Comparison between pytd, td-client-python, and pandas-td

Treasure Data offers three different Python clients on GitHub, and the following list summarizes their characteristics.

  1. td-client-python

    • Basic REST API wrapper for Treasure Data.

  2. pytd

    • Access to Plazma via td-spark as introduced above.

    • Efficient connection to Presto based on trino-python-client.

    • Multiple data ingestion methods and a variety of utility functions.

  3. pandas-td (deprecated)

    • Old tool optimized for pandas and Jupyter Notebook.

    • pytd offers a compatible function set (see below for details).

An optimal choice of package depends on your specific use case, but common guidelines can be listed as follows:

  • Use td-client-python if you want to execute basic CRUD operations from Python applications.

  • Use pytd for (1) analytical purposes relying on pandas and Jupyter Notebook, and (2) more efficient data access with less effort.

  • Do not use pandas-td. If you are using pandas-td, replace the code with pytd based on the following guidance as soon as possible.

How to replace pandas-td

pytd offers pandas-td-compatible functions that provide the same functionality more efficiently. If you are still using pandas-td, we recommend switching to pytd as follows.

First, install the package from PyPI:

pip install pytd
# or, `pip install pytd[spark]` if you wish to use `to_td`

Next, make the following modifications to the import statements.

Before:

import pandas_td as td
In [1]: %load_ext pandas_td.ipython

After:

import pytd.pandas_td as td
In [1]: %load_ext pytd.pandas_td.ipython

Consequently, all pandas_td code should keep running correctly with pytd. Report an issue here if you notice any incompatible behaviors.
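
For reference, here is a minimal sketch of the compatible interface, assuming the TD_API_KEY and TD_API_SERVER environment variables are set; the 'presto:sample_datasets' engine URL follows the pandas-td convention:

import pytd.pandas_td as td

# Create a query engine bound to a database, as in pandas-td.
engine = td.create_engine('presto:sample_datasets')

# Read a query result directly into a pandas.DataFrame.
df = td.read_td('select symbol, count(1) as cnt from nasdaq group by 1', engine)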

Development

For contributors, please see Contributing Guide.

This project uses uv for fast Python package management:

# Install uv
pip install uv

# Sync dependencies
uv sync

# Run tests with nox
uvx nox
