
# da_gcp [![PyPI Package](https://img.shields.io/pypi/v/da_gcp.svg)](https://pypi.org/project/da_gcp/)

da_gcp is a Python package of functions and wrappers for Google Cloud projects.

It allows easy reading and writing of data between a Python environment (e.g. Google Cloud Datalab) and a Google Cloud project. The classes in this package follow best practices for moving large amounts of data by staging it in Google Cloud Storage rather than streaming it directly.

## Installation
This package is available on PyPI as `da_gcp`: <br>

`pip install da_gcp`

## Using in a Google Cloud Datalab Notebook

`!pip install da_gcp`

## Documentation

To import into your environment:

`from da_gcp import to_gbq, read_gbq` or <br>
`import da_gcp as da`
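
A minimal end-to-end sketch of the two entry points documented below, assuming a hypothetical bucket `'my-bucket'`, project `'my-project'`, and table `'MY_DATASET.MY_TABLE'` (whether the trailing `from_df()` arguments are optional is not confirmed here):

```python
import pandas as pd
import da_gcp as da

# Hypothetical names; replace with your own bucket, project, dataset, and table.
BUCKET = 'my-bucket'
PROJECT = 'my-project'
DATASET_TABLE = 'MY_DATASET.MY_TABLE'

# Write a DataFrame to BigQuery, staged through Google Cloud Storage.
df_out = pd.DataFrame({'id': [1, 2, 3], 'value': [0.1, 0.2, 0.3]})
table = da.to_gbq(bucket=BUCKET).from_df(df_out, DATASET_TABLE, PROJECT, if_exists='create')

# Read it back into a DataFrame via a Standard SQL query, also staged through GCS.
sql = 'SELECT * FROM `{}.{}`'.format(PROJECT, DATASET_TABLE)
df_in = da.read_gbq(bucket=BUCKET).from_sql(sql=sql, extract_name='MY_EXTRACT',
                                            sample_type='whole', dtype_srce=True)
```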

### Table of contents
- [to_gbq](#to_gbq)
- [read_gbq](#read_gbq)
- [Dependencies](#dependencies)

### to_gbq() ###
Initializes the common information needed to write data to a Google Cloud project, including Google Cloud Storage and Google BigQuery. The only argument to pass is the `bucket` name.

`to_gbq(bucket= 'bucket name')`

#### Methods ####

#### from_df(df, dataset_table, project_name, if_exists, dates_formats) ####
Load data from a DataFrame into a BigQuery Table via Google Cloud Storage.

Args:
- `df` (DataFrame): the DataFrame you want to load into a BigQuery table.
- `dataset_table` (string): name of the dataset and table the data will be loaded into, in the format `'DATASET_NAME.TABLE_NAME'`.
- `project_name` (string): ID of the Google Cloud project that contains the destination dataset.
- `if_exists` (string): default is `'overwrite'`.
  - `'create'` - create a new table
  - `'overwrite'` - overwrite the current table with data from the DataFrame
  - `'append'` - append data from the DataFrame to the table
- `dates_formats` (dict): dictionary of column names and their date formats, e.g. `{'col': 'DATE'}`. Format options are `'DATE'`, `'DATETIME'`, `'TIME'`, and `'TIMESTAMP'`.

Returns:
A Results object with the following attributes:
- `.restore` - SQL statement to revert the table to its prior state
- `.table_name` - full table name in the format `project_id.dataset_id.table_id`
- `.rows` - the number of rows in the table; with `'append'`, this is the row count of the entire table
- `.extract` - the extract folder located in GCS

Raises:
- `ProjectExistError`: the project does not exist, or the user does not have access to write to the table.
- `CreateError`: the user passed `'create'` and the table already exists.
- `SchemaError`: the schemas of the DataFrame and the table do not match.
- `OverwriteError`: the table does not exist and the user passed `'overwrite'`.

##### Example #####
```python
df = pd.DataFrame()
table = da.to_gbq(bucket='bucket_name').from_df(df, 'DATASET_NAME.TABLE_NAME', 'analytics-project-thd',
                                                if_exists='create',
                                                dates_formats={'Date_column': 'DATE', 'Datetime_column': 'DATETIME'})
```

By assigning `to_gbq().from_df()` to the variable `table`, the user can retrieve attributes such as the DataFrame schema, the restore SQL statement, and the GCS extract folder.

```python
# Schema of the DataFrame
table.df_schema
```

```
BigQuery Schema - Fields:
[{'mode': 'NULLABLE', 'name': 'int', 'type': 'INTEGER'},
 {'mode': 'NULLABLE', 'name': 'float', 'type': 'FLOAT'},
 {'mode': 'NULLABLE', 'name': 'datetime', 'type': 'TIMESTAMP'},
 {'mode': 'NULLABLE', 'name': 'date', 'type': 'DATE'},
 {'mode': 'NULLABLE', 'name': 'string', 'type': 'STRING'},
 {'mode': 'NULLABLE', 'name': 'bool', 'type': 'BOOLEAN'},
 {'mode': 'NULLABLE', 'name': 'int_2', 'type': 'INTEGER'}]
```

```python
# SQL statement to restore the table
table.restore
```

```
To undo this action, input the following SQL statement to restore the table:
'SELECT * FROM [analytics-finance-thd:MXT5604_TEST.CLASS_TEST@1532373747088]'
```

```python
# GCS folder for the DataFrame extract
table.extract
```

```
'DataFrames/CLASS_TEST_df-1532373799688'
```
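
The other attributes of the Results object documented above can be read the same way (continuing the example; the values returned depend on your table):

```python
# Full table name in the form project_id.dataset_id.table_id
table.table_name

# Number of rows in the table after the load
table.rows
```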

### read_gbq() ###

Initializes the common information needed to get data from Google BigQuery into an environment, such as Google Cloud DataLab, via Google Cloud Storage. The only argument to pass is the `bucket` name.

`read_gbq(bucket= 'bucket name')`

#### Methods ####

#### from_sql(sql, extract_name, sample_type, dtype_srce) ####
Construct a DataFrame by querying tables/views in Google BigQuery, similar to using the GUI. The class checks for available space by executing a dry run of the SQL statement and comparing the result size to the space available in the environment.

Args:
- `sql` (string): a valid Standard SQL statement (`#STANDARDSQL`)
- `extract_name` (string): sub-folder name for the extract; this will be combined with a timestamp
- `sample_type` (string):
  - `'whole'` - returns all the data, if space is available
  - `'random'` - a randomly ordered sample, sized as a percentage determined by the available space
  - `'default'` - a sample based on row count, determined by the available space
- `dtype_srce` (boolean): whether to return the DataFrame with dtypes matching the source data; default is `True`. Passing `False` lets pandas infer dtypes, which can improve performance.

Returns:
a Pandas DataFrame

Raises:
- `SQLStatementError`: provide a valid Standard SQL statement.
- `DataSizeError`: the query results are larger than the available space in the environment.
- `NestedDataError`: the query results have a nested structure which cannot be placed into a CSV file.

##### Example #####
sql = """ SELECT * FROM `analytics-project-thd.DATASET_NAME.TABLE_NAME` """
df = read_gbq(bucket= 'bucket_name').from_sql(sql= sql, extract_name= 'PROJECT_NAME', sample_type= 'whole', dtype_srce= True)

Because a Pandas DataFrame object allows additional attributes to be assigned, the user can retrieve the attribute `extract` by typing:

```python
# Extract folder
df.extract
```

```
'EXTRACTNAME_TIMESTAMP'
```

#### from_extract(extract_folder) ####
Construct a DataFrame by reading CSV files that were previously extracted to Google Cloud Storage.

Args:
- `extract_folder` (string): sub-folder name for the extract, in the form `"AAAAA_#############"`

Returns:
a Pandas DataFrame

Raises:
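
##### Example #####
A minimal sketch, assuming a hypothetical extract folder name of the form returned by `from_sql()` above:

```python
# Re-load a previously extracted result set from Google Cloud Storage.
df = da.read_gbq(bucket='bucket_name').from_extract(extract_folder='PROJECT_NAME_1532373799688')
```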


### Dependencies ###

[datalab](https://github.com/googledatalab) - Google Cloud DataLab <br>
[pandas](https://pandas.pydata.org/) - pandas <br>
[NumPy](http://www.numpy.org/) - NumPy <br>
[gcsfs](https://github.com/dask/gcsfs) - gcsfs <br>
