DuckTables: Python Functions for DuckDB
DuckTables is a Python library that provides out-of-the-box table functions for the DuckDB PyTables extension.
Installation
pip install ducktables
Authentication
DuckTables interacts with a number of different services, and each has its own authentication and configuration mechanism. Note that you only have to configure the services that you plan on using.
AWS
Follow the auth instructions for Boto3, the library that DuckTables uses to interface with AWS APIs. Note that if you're already using Boto3 or the AWS CLI, your existing authentication configuration will be used, and you don't need to do anything.
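If you haven't configured Boto3 before, one option among the several it supports is to pass credentials as environment variables when starting DuckDB; the values below are placeholders:
AWS_ACCESS_KEY_ID=<key-id> AWS_SECRET_ACCESS_KEY=<secret-key> duckdb -unsigned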
GitHub
When starting your DuckDB client (such as the DuckDB CLI), set an environment variable named GITHUB_ACCESS_TOKEN. From the command line, this will look like:
GITHUB_ACCESS_TOKEN=<token> duckdb -unsigned
To obtain an API token, log into github.com, and create one here.
OpenAI
To use OpenAI functions you will need a valid API key and organization ID. If you're logged in, you can create an API key here, and your organization ID can be found here.
To specify these values, set two environment variables when running DuckDB: OPENAI_ORG_ID and OPENAI_API_KEY. This may look something like:
OPENAI_ORG_ID=<org-id> OPENAI_API_KEY=<api-key> duckdb -unsigned
Google
Google-related functions assume you have a JSON application credentials file on disk where you're running DuckDB. The user or service account associated with the file will need the necessary permissions to interact with the services behind any functions you use.
To specify the path to the credentials file, set an environment variable named GOOGLE_APPLICATION_CREDENTIALS when running DuckDB. That may look something like:
GOOGLE_APPLICATION_CREDENTIALS=~/.gcloud/some-file.json duckdb -unsigned
Example Queries
These queries assume you've already installed the PyTables extension in a DuckDB session. See the extension's installation guide if you have not.
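As a rough sketch, loading a locally built, unsigned extension in a session started with -unsigned generally looks like the line below; the path is a placeholder, so check the installation guide for the actual file location:
LOAD 'path/to/pytables.duckdb_extension';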
AWS
Note that all queries run in the region specified in your AWS configuration file. Set the AWS_DEFAULT_REGION environment variable to override this behavior.
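For example, to run queries against us-west-2 regardless of your configured default (any region name works in its place):
AWS_DEFAULT_REGION=us-west-2 duckdb -unsigned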
EC2 Instances
Returns the results of a DescribeInstances API call.
SELECT * FROM pytable('aws:ec2_instances',
columns = {
'instance_id': 'VARCHAR',
'name': 'VARCHAR',
'instance_type': 'VARCHAR',
'state': 'VARCHAR',
'key_pair': 'VARCHAR',
'platform': 'VARCHAR',
'architecture': 'VARCHAR',
'vpc_id': 'VARCHAR',
'subnet_id': 'VARCHAR',
'public_dns': 'VARCHAR',
'public_ip': 'VARCHAR',
'private_dns': 'VARCHAR',
'private_ip': 'VARCHAR'
});
S3 Buckets
Returns the results of a ListBuckets API call.
SELECT * FROM pytable('aws:s3_buckets',
columns={'name': 'VARCHAR', 'creation_date': 'VARCHAR'}
);
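Because the result is an ordinary table, standard SQL applies. For example, to list buckets newest-first (this relies on creation_date being an ISO-8601 string, which sorts lexicographically):
SELECT name, creation_date
FROM pytable('aws:s3_buckets',
    columns={'name': 'VARCHAR', 'creation_date': 'VARCHAR'}
)
ORDER BY creation_date DESC;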
S3 Objects
This only includes metadata about the objects themselves, not their contents. See the httpfs DuckDB extension if you're interested in loading data from objects on S3.
Note that the second argument, the prefix path, is optional and can be omitted.
SELECT * FROM pytable('aws:s3_objects', 'bucket-name', 'foo/bar/prefix',
columns = { 'key': 'VARCHAR', 'last_modified': 'VARCHAR', 'size': 'INT', 'storage_class': 'VARCHAR'}
);
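For example, omitting the prefix enumerates every object in the bucket:
SELECT * FROM pytable('aws:s3_objects', 'bucket-name',
    columns = { 'key': 'VARCHAR', 'last_modified': 'VARCHAR', 'size': 'INT', 'storage_class': 'VARCHAR'}
);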
GitHub
Enumerates all repositories for the named user or organization.
SELECT * FROM pytable('ghub:repos_for', 'duckdb',
columns = {'repo': 'VARCHAR', 'description': 'VARCHAR', 'language': 'VARCHAR'}
);
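Since the output is a regular table, you can aggregate over it as usual; for example, counting the organization's repositories by primary language:
SELECT language, COUNT(*) AS repo_count
FROM pytable('ghub:repos_for', 'duckdb',
    columns = {'repo': 'VARCHAR', 'description': 'VARCHAR', 'language': 'VARCHAR'}
)
GROUP BY language
ORDER BY repo_count DESC;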
OpenAI
ChatGPT
Given a prompt and a number of desired responses, generates a table with the model's responses.
SELECT * FROM pytable('chatgpt:prompt', 'Write a limerick poem about how much you love SQL', 2,
columns = {'index': 'INT', 'message': 'VARCHAR', 'message_role': 'VARCHAR', 'finish_reason': 'VARCHAR'}
);
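For example, to keep only the generated text from three completions (the prompt here is arbitrary):
SELECT message
FROM pytable('chatgpt:prompt', 'Suggest three names for a data warehouse', 3,
    columns = {'index': 'INT', 'message': 'VARCHAR', 'message_role': 'VARCHAR', 'finish_reason': 'VARCHAR'}
);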
Google
Sheets
Pulls data from a Google Sheet. Note that you'll need either user-account authorization or, if you're using a service account, read permissions for that account on the sheet in question if it is private.
This example query pulls from a Google provided example spreadsheet. You can see the sheet here. Note that despite this sheet being public, you will still need Google Cloud authorization configured on your local machine.
SELECT *
FROM pytable('google:sheet', '1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms', 'Class Data!A2:F31',
columns = {
'name': 'VARCHAR', 'gender': 'VARCHAR', 'class_level': 'VARCHAR',
'state': 'VARCHAR', 'major': 'VARCHAR', 'extracurricular': 'VARCHAR'}
);
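To avoid re-fetching the sheet on every query, you can materialize the result into a local table; the table name class_data below is arbitrary:
CREATE TABLE class_data AS
SELECT *
FROM pytable('google:sheet', '1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms', 'Class Data!A2:F31',
    columns = {
        'name': 'VARCHAR', 'gender': 'VARCHAR', 'class_level': 'VARCHAR',
        'state': 'VARCHAR', 'major': 'VARCHAR', 'extracurricular': 'VARCHAR'}
);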
Analytics
This query has three required parameters: the View ID you wish to query, and the start and end dates of your query's date range. See this post for an explanation of finding your View ID. Start and end dates should be strings in 'YYYY-MM-DD' format. Note that you can also override the set of Dimensions and Metrics, but this requires adjusting the columns argument to account for the resulting change in schema.
SELECT * FROM pytable('ducktables.google.analytics:metrics', '<view-id>', '<start-date>', '<end-date>',
columns = {
-- Dimensions
'date': 'VARCHAR', 'hour': 'VARCHAR', 'pagePath': 'VARCHAR', 'source': 'VARCHAR', 'medium': 'VARCHAR',
-- Metrics
'sessions': 'INT', 'users': 'INT', 'pageviews': 'INT', 'avgSessionDuration': 'FLOAT', 'bounceRate': 'FLOAT'
});
File details
Details for the file ducktables-0.1.3.tar.gz.
File metadata
- Download URL: ducktables-0.1.3.tar.gz
- Upload date:
- Size: 9.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 291912e4e0c542504e5f0d83876900e44e86628c09be973abc1c3d674268297c
MD5 | a8893bd64cfb328dc32e83c6b95ce894
BLAKE2b-256 | d5142128e25c60cc33a12ddb8a22ba838bbfa0df8af22a6ca82363dd0dbdc421
File details
Details for the file ducktables-0.1.3-py3-none-any.whl.
File metadata
- Download URL: ducktables-0.1.3-py3-none-any.whl
- Upload date:
- Size: 9.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 23e8eb8d52f90e871eb05a8fd238881ba309b89cde7d181536e084bf09127db4
MD5 | ae1d7674a816ec6a2803865025a5cace
BLAKE2b-256 | bf705fdc721d7e1b38bd4642a81af777ea3af87b95257a901839df4b7dc6305d