Multi-source/engine querying tool
Project description
honeycomb
honeycomb
serves as an abstraction around connections to a Hadoop/Hive-based
data lake and other databases, built with extensibility to additional services in
mind. It handles connections, type conversions, a degree of query generation,
and more, greatly reducing the technical expertise and Python knowledge
required to interact with the NHDS data lake and other data repositories.
In doing this, we aim to increase data availability throughout the
organization and empower every individual member to interact with data.
Usage
Much of the functionality of honeycomb is handled completely under the hood,
so the average use case is very straightforward.
General
- Supported engines are currently:
  - Runs against data lake:
    - Hive - Queries are slightly slower than Presto, but there is effectively no limit on query size, and Hive has advanced recovery/reliability functionality. This is the suggested engine for most queries. Must provide a schema in queries.
    - Presto - Runs queries more quickly than other engines, but has query size limitations. Returns all columns as strings. Useful for quick, ad-hoc queries, but Hive is recommended over Presto in almost all situations. Must provide a schema in queries.
  - Does not run against data lake:
    - Google BigQuery - Runs against data stored in Google BigQuery. Currently, the only data stored there is clickstream data and cost information on our GA/GBQ accounts. Must provide a project ID, dataset, and table name in queries. NOTE: BigQuery uses different SQL syntax than the lake, and this must be reflected in your queries.

      from honeycomb.bigquery import run_gbq_query
      query = "SELECT * from `places-clickstream`.9317736.ga_sessions_20200728 LIMIT 5"
      df = run_gbq_query(query, project_id='places-clickstream')

    - Salesforce - Runs against data stored in the Salesforce Object Manager. NOTE: Salesforce uses SOQL, which has different syntax than the lake.

      from honeycomb import salesforce as hcsf
      query = "SELECT Id FROM Lead LIMIT 5"
      df = hcsf.run_sf_query(query)
      df_star = hcsf.sf_select_star(
          object_name='Lead',
          where_clause='WHERE CreatedDate <= 2020-08-01T00:00-00:00',
          limit_clause='LIMIT 5')
Running Queries
The run_lake_query
function allows for running queries through either the
presto
or hive
querying engines.
import honeycomb as hc
df0 = hc.run_lake_query('SELECT COUNT(*) FROM experimental.test_table',
engine='presto')
df1 = hc.run_lake_query('SELECT COUNT(*) FROM experimental.test_table',
engine='hive')
Table Creation
honeycomb
only supports table creation using hive
as the engine. To create a table in the data lake, all that is required is a
dataframe and a table name; the schema defaults to the 'experimental' zone.
create_table_from_df
will infer the file type to create the table with based
on the extension of the provided filename
. If filename
is not provided or
does not contain an extension, it will default to Parquet.
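For example, under the assumption that CSV is among the supported extensions, the following sketch would create a table backed by a CSV file instead of the Parquet default:
import pandas as pd
import honeycomb as hc

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# The '.csv' extension tells honeycomb to store the table as CSV;
# omitting 'filename' (or its extension) would fall back to Parquet.
hc.create_table_from_df(df, table_name='test_table_csv',
                        filename='test_table_csv.csv')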
NOTE: If your dataframe contains a high degree of nesting, generating DDL for its table can be very time-consuming. It may be faster to generate the table with a subset of the dataframe's rows, and append the remaining rows after the fact.
An empty directory is expected for a table's underlying storage. If files are already present in the S3 path designated to store files for the new table, table creation will fail.
Additional parameters:
- schema: The schema to write to. Most users will only have access to write to the 'experimental' zone.
- dtypes: A dictionary from column names to data types. honeycomb automatically converts Python/pandas data types into Hive-compliant data types, but this parameter allows for manually overriding that conversion. For example, pandas datetimes are normally converted to hive datetimes, but if specified in this parameter, a column could be saved as a string instead.
- path: Where the files that this table references should be stored in S3. It defaults to the table's name, and under most circumstances should be kept this way.
- filename: The name of the file to store the DataFrame in. If writing to the experimental zone, this is optional, and a name based on a timestamp will be generated. However, when writing to any other lake zone, a specified name is required, in order to ensure that the table's underlying filesystem will be navigable in the future.
- table_comment: A plaintext description of what data the table contains and what its purpose is. When writing to the experimental zone, this is optional, but for any other zone it is required.
- column_comments: A dictionary from column names to plaintext descriptions of what each column contains. This is optional when writing to the experimental zone, but for any other zone it is required.
- timezones: A dictionary from datetime columns to the timezones they represent. If a column is timezone-naive, the timezone will be added to its metadata, leaving the times themselves unmodified. If a column is timezone-aware and is in a different timezone than the one specified, the column will be converted, modifying the original times.
- copy_df: Whether the operations performed on df should be performed on the original or a copy. Keep in mind that if this is set to False, the original df passed in will be modified as well - twice as memory-efficient, but possibly undesirable if the df is needed again later.
- partitioned_by: Dictionary or list of tuples containing a partition name and type. Cannot be a vanilla dictionary if using a Python version < 3.6.
- partition_values: Required if 'partitioned_by' is used and 'auto_upload_df' is True. List of tuples containing the partition name and value to store the dataframe under.
- overwrite: If a table already exists with the specified name, states whether to drop it and completely remove its underlying files, or to throw an error. In any other case, this parameter does nothing.
- auto_upload_df: Whether the df that the table's structure will be based off of should be automatically uploaded to the table, or just used to generate and execute the DDL.
import pandas as pd
import honeycomb as hc
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
hc.create_table_from_df(df, table_name='test_table')
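A fuller sketch using several of the parameters described above (the filename, comments, and partition values here are hypothetical):
import pandas as pd
import honeycomb as hc

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Comments are optional in the experimental zone but shown here for
# illustration; they are required when writing to any other zone.
hc.create_table_from_df(
    df,
    table_name='test_table_full',
    schema='experimental',
    filename='test_table_full.parquet',
    table_comment='Toy table used for honeycomb examples',
    column_comments={'col1': 'First example column',
                     'col2': 'Second example column'})

# A partitioned variant: 'partitioned_by' gives partition names and types,
# 'partition_values' the values this particular dataframe is stored under.
hc.create_table_from_df(
    df,
    table_name='test_table_partitioned',
    partitioned_by=[('ds', 'string')],
    partition_values=[('ds', '2020-08-01')])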
Table Appending
honeycomb
only supports using hive
as the engine for table appending.
To append a DataFrame to a table, all that is needed is the table name.
Schema can optionally be provided and defaults to 'experimental'.
import pandas as pd
import honeycomb as hc
df = pd.DataFrame({'col1': [7, 8, 9], 'col2': [10, 11, 12]})
hc.append_table(df, table_name='test_table')
As with create_table_from_df
, a filename can be provided as well. If one is not,
a name is generated based on a timestamp. However, if writing anywhere other than
the experimental zone, a specified filename is required.
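As a sketch, assuming append_table exposes the same filename parameter as create_table_from_df (the filename here is hypothetical):
import pandas as pd
import honeycomb as hc

df = pd.DataFrame({'col1': [13, 14], 'col2': [15, 16]})

# 'filename' is assumed to mirror create_table_from_df's parameter name;
# a specific name is required when appending outside the experimental zone.
hc.append_table(df, table_name='test_table',
                schema='experimental',
                filename='test_table_batch2.parquet')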
Table Describing
honeycomb
can be used to obtain information on tables in the lake, such
as column names and dtypes, and, if include_metadata is set to True,
a table description and column comments.
import honeycomb as hc
hc.describe_table('test_table', schema='curated', include_metadata=True)
Session-Level Configuration
honeycomb
outputs certain messages to the screen to help interactive users
maintain awareness of what is being performed behind-the-scenes. If this
is not desirable (as may be the case for notebooks, pipelines, usage of
honeycomb
within other packages, etc.), all non-logging output can be
disabled with hc.set_option('verbose', False)
. This also sets the
corresponding verbose
option in rivet
to False.
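For example, at the top of a notebook or pipeline script:
import honeycomb as hc

# Disable all non-logging output for the rest of the session
# (also sets rivet's 'verbose' option to False).
hc.set_option('verbose', False)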
File details
Details for the file honeycomb-1.7.2.tar.gz.
File metadata
- Download URL: honeycomb-1.7.2.tar.gz
- Upload date:
- Size: 43.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5881999d90e6dcb25d6f50c08c2efee6d23b4f539cae2ad19745503b5fc1e4fb
MD5 | 8caed6eb52df6193d3f07da2830aa25a
BLAKE2b-256 | f4193c4d64092af7ea51d5f86798b47f01f12b87a8d62a9d5d3a988123de3336
File details
Details for the file honeycomb-1.7.2-py2.py3-none-any.whl.
File metadata
- Download URL: honeycomb-1.7.2-py2.py3-none-any.whl
- Upload date:
- Size: 48.4 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.7.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 848e1d46105f17f44af50448d3fe2b5a4bec079013e01d0107d06bb73d0405b5
MD5 | 050afd10be87a9cce10ffc0707697751
BLAKE2b-256 | 23952516a30a603d7631d3ebfffd876f9a953dd2d458e3489c0719713f306a4d