Python SDK for Trifacta
Integrate your Python-based environment with Trifacta to rapidly transform your datasets. Please complete the following steps to install and set up the Python SDK for Trifacta.
- Alpha release: software and supported capabilities may change without notice. Do not deploy in production environments.
- Available for the following product editions:
- Trifacta Enterprise
- Trifacta Professional
- Trifacta Premium
- Some Wrangle functions and transformations are not supported by Python Pandas. Known limitations:
- NUMFORMAT function
- String comparison functions
- Transformations that use Array or Map data types are not supported for Python Pandas generation.
- Uploaded files must be in CSV file format.
- Listed commands are for Mac OSX.
- Examples below assume that you are using Jupyter Notebooks for Python flow development.
- A valid account for a project or workspace in one of the above product editions.
- A valid access token to the project or workspace. Instructions are provided below.
- To export your Trifacta recipe as Python code, a workspace administrator must enable the Wrangle to Python Conversion feature in the application. For more information, please visit Workspace Settings Page.
- Python 3.7, Python 3.8
- For version requirements of specific Python components, please see requirements.txt in this package.
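Before installing, it can help to confirm that the interpreter behind your notebook matches one of the listed versions. The following is a minimal sketch; the helper name `is_supported_python` is hypothetical and is not part of the SDK.

```python
import sys

# Versions listed in the requirements above.
SUPPORTED_VERSIONS = {(3, 7), (3, 8)}


def is_supported_python(version_info=None):
    """Return True if the (major, minor) pair is one of the listed versions.

    Hypothetical helper for illustration; it is not provided by the SDK.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if not is_supported_python():
    # Only a warning: this sketch does not know how strictly the SDK enforces versions.
    print("Warning: running Python %d.%d; the SDK lists 3.7 and 3.8." % sys.version_info[:2])
```

Run this in the same kernel that you plan to use for flow development, since a notebook kernel can differ from the interpreter on your shell path.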
pip install trifacta
Enable access to your Trifacta workspace
- Log in to your Trifacta workspace.
- In the left menu, select User menu > Preferences > Access tokens.
- To create a new access token, click Generate new token. Copy the token to the clipboard. You cannot retrieve a token after you exit the modal.
- Paste this token into a text file. Instructions for using it with the SDK are provided later.
Configure Trifacta package
Before you can use it to interact with your Trifacta environment, the Python SDK for Trifacta requires the following configuration:
- In your home directory, create the following configuration file:
- Open the file in a text editor, and insert the following configuration. Replace values as needed:
[CONFIGURATION]
username = <username_for_trifacta_account>  # example: email@example.com
endpoint = <uri_for_your_trifacta_workspace>  # example: https://test-workspace.example.com
token = <copied_token_from_steps_above>
- Save the file.
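To catch a missing or empty setting before the first SDK call, you could parse the file yourself. The sketch below uses Python's standard `configparser` and assumes the file follows the `[CONFIGURATION]` layout shown above. The helper name `load_sdk_config` and the sample values are hypothetical, and this is not necessarily how the SDK itself reads the file.

```python
import configparser
import os
import tempfile

REQUIRED_KEYS = ("username", "endpoint", "token")


def load_sdk_config(path):
    """Parse the [CONFIGURATION] section and return its three settings as a dict.

    Hypothetical helper for illustration; it is not provided by the SDK.
    """
    # interpolation=None keeps any '%' characters in a token literal.
    # inline_comment_prefixes drops trailing '# example: ...' notes.
    parser = configparser.ConfigParser(
        interpolation=None, inline_comment_prefixes=("#",)
    )
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    if not parser.has_section("CONFIGURATION"):
        raise ValueError("missing [CONFIGURATION] section")
    section = parser["CONFIGURATION"]
    missing = [key for key in REQUIRED_KEYS if not section.get(key)]
    if missing:
        raise ValueError("missing or empty settings: " + ", ".join(missing))
    return {key: section[key] for key in REQUIRED_KEYS}


# Demonstration against a throwaway file with made-up values.
sample = """[CONFIGURATION]
username = email@example.com  # example account
endpoint = https://test-workspace.example.com
token = not-a-real-token
"""
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.conf")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    demo_config = load_sdk_config(sample_path)

print(demo_config["endpoint"])
```

To check your real settings, pass the path of the configuration file that you created in your home directory.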
For this release, you can use the Python SDK to upload a CSV file for transformation in a new flow. Additional file formats and workflows may be supported in future releases.
Upload to new flow
Create a new Python 3 notebook and import the Trifacta package:
import trifacta as tf
tf is your handler for interacting with your Trifacta workspace.
Insert the following code, which uploads a specified CSV for transformation in Trifacta:
import pandas as pd

df = pd.read_csv(<path_to_csv_dataset>)
wf = tf.wrangle(df)
The wrangle function lets you upload a dataset to Trifacta and create a flow for it. This flow is then available through the Trifacta application, where you can transform the dataset through the user interface.
wf is returned as a handle for the created flow, with which you can perform other operations on your dataset.
Run the notebook.
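Because uploads must be in CSV format, a small guard before the upload can surface a wrong file type or an empty file early. This is a minimal sketch using only pandas. The helper name `load_csv_for_upload` and the sample data are hypothetical, and the SDK call is left as a comment because it needs a configured workspace.

```python
import os
import tempfile

import pandas as pd


def load_csv_for_upload(path):
    """Read a CSV file into a DataFrame, rejecting other extensions and empty data.

    Hypothetical helper for illustration; it is not provided by the SDK.
    """
    if not str(path).lower().endswith(".csv"):
        raise ValueError("expected a .csv file, got: %s" % path)
    df = pd.read_csv(path)
    if df.empty:
        raise ValueError("CSV file contains no rows: %s" % path)
    return df


# Demonstration with a small throwaway file.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write("id,city\n1,Oslo\n2,Lima\n")
    sample_df = load_csv_for_upload(csv_path)

print(sample_df.shape)

# With the SDK configured, the upload step from this section would follow:
# wf = tf.wrangle(sample_df)
```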
Launch Trifacta in browser
- After the upload completes, execute the following to open Trifacta in a browser window.
- In the Trifacta window, navigate to the flow that was created. This flow is likely to be named Untitled and to be listed at the top of the Flows page when sorted by timestamp.
- In the created flow, create a recipe connected to your imported dataset.
- Edit the recipe.
- In the Transformer page, you can identify issues with your dataset and add transformations to your recipe through a point-and-click interface. Click Add to add the corresponding transformation step to your recipe. For more information on using Trifacta, please visit Trifacta Documentation.
- When you have finished defining your recipe steps, return to your Python notebook window.
Generate Pandas code
- In the Python SDK, you use the get_pandas() method to export the Wrangle recipe steps to Python code.
The Wrangle to Python Conversion setting must be enabled in Trifacta by your workspace administrator. See above.
- Use the following to get pandas code for the recipe that you created in Trifacta. This code can be applied to transform
get_pandas translates Trifacta's transform recipe into pandas code.
True ensures that the generated code is added to the next cell of the notebook.
<recipe_name> can be specified to generate pandas code for a specific recipe. If not specified, code is generated for the default recipe. To retrieve a list of available recipes, use
- Execute the generated code.
- In a new cell, run the following to transform the DataFrame using the Pandas code generated above.
wrangled_df = run_transforms(df)
wrangled_df
- This returns your cleansed and transformed Pandas DataFrame.
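To make the calling pattern concrete without a workspace, the sketch below uses a hand-written stand-in for `run_transforms`. It is not output from `get_pandas()`: the code the SDK generates depends entirely on your recipe, and the column names and cleanup steps here are invented for illustration.

```python
import pandas as pd


def run_transforms(df):
    """Hand-written stand-in for generated code; real output depends on your recipe."""
    out = df.copy()  # leave the caller's DataFrame untouched
    # Hypothetical recipe step 1: trim whitespace and normalize capitalization.
    out["city"] = out["city"].str.strip().str.title()
    # Hypothetical recipe step 2: drop rows with a missing amount.
    out = out.dropna(subset=["amount"])
    return out.reset_index(drop=True)


df = pd.DataFrame(
    {"city": [" oslo", "LIMA ", "paris"], "amount": [10.0, None, 7.5]}
)
wrangled_df = run_transforms(df)
print(wrangled_df)
```

Copying the input first means that the original `df` stays available, so you can compare it with `wrangled_df` in the notebook.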
Wrangle multiple datasets
The following example describes how to wrangle multiple datasets. In this example, violations and violations_actions are reference names for violations_df and violations_actions_df, respectively. This mapping enables users to correctly map Pandas DataFrames to arguments/variables in the generated Pandas code for the Wrangle recipe.
import pandas as pd
import trifacta as tf

violations_df = pd.read_csv('../test/data/violations.csv')
violations_actions_df = pd.read_csv('../test/data/violations_actions.csv')
wrangle_flow = tf.wrangle(
    (violations_df, 'violations'),
    (violations_actions_df, 'violations_actions'),
    flow_name='Example Flow',
)
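Since each reference name becomes a variable in the generated code, it may be worth checking the `(DataFrame, name)` pairs before uploading. The sketch below is a local check only. The helper `reference_names` is hypothetical, and whether the SDK itself rejects duplicate names is not stated in this document.

```python
import pandas as pd


def reference_names(*datasets):
    """Return the reference names from (DataFrame, name) tuples.

    Hypothetical helper for illustration; it is not provided by the SDK.
    Bare DataFrames are accepted and contribute no name.
    """
    names = []
    for item in datasets:
        if isinstance(item, pd.DataFrame):
            continue  # bare DataFrame: no reference name supplied
        is_pair = (
            isinstance(item, tuple)
            and len(item) == 2
            and isinstance(item[0], pd.DataFrame)
            and isinstance(item[1], str)
        )
        if not is_pair:
            raise TypeError("expected a DataFrame or a (DataFrame, name) tuple")
        if item[1] in names:
            raise ValueError("duplicate reference name: %s" % item[1])
        names.append(item[1])
    return names


# Small in-memory stand-ins for the CSV files used in the example above.
violations_df = pd.DataFrame({"id": [1, 2]})
violations_actions_df = pd.DataFrame({"id": [1], "action": ["fine"]})
names = reference_names(
    (violations_df, "violations"),
    (violations_actions_df, "violations_actions"),
)
print(names)
```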
Wrangle existing flow
From your notebook, you can also begin wrangling an existing flow. The following example launches the Trifacta application in a flow whose internal identifier (
import trifacta as tf
import pandas as pd

flow_id = 13
wf = tf.wrangle_existing(flow_id)

# The following call opens the flow corresponding to 'flow_id' in Flow View.
# If a 'recipe_name' arg is provided, the recipe is opened in the Transformer page.
wf.open()
For additional examples, please see the
notebooks directory in this package.
Wrangle function reference
The following wrangling functions are available through the SDK.
Trifacta module functions
tf is an alias to the Trifacta module.
||Upload one or more datasets to the Trifacta application and create a flow for them. This flow is then available through the Trifacta application, where you can transform the datasets through the user interface.||*datasets: Pandas DataFrames to be wrangled. Each can also be a tuple, where the first element is a Pandas DataFrame and the second element is the reference name (string) for the DataFrame.|
WrangleFlow module functions
All of the below functions are available for the
WrangleFlow object in your Python environment. So, you must call them using a
wf is a reference to the
||Add Pandas DataFrames to a flow, where
||*datasets: Pandas DataFrames to be added to a flow. It could also be a tuple, where the first element in the tuple is a Pandas DataFrame, and second element is the reference name (string) for the DataFrame.|
||Generates Python Pandas code for your Wrangle recipe.||add_to_next_cell: Set to True if you are using Jupyter Notebook and want the generated Pandas code added to the next cell. If False, the Pandas code is returned as a string. recipe_name: Recipe for which you want to get the Pandas code. If not specified, the default recipe is used. Use
||Run a job for a specified recipe.||pbar: can be ignored. execution: Running environment in the Trifacta platform where you want to execute the job. Possible values:
||Generate a profile for a specified recipe.||recipe_name: Recipe for which you want to generate profile. If set to
||Lists the names of the recipes present in the Trifacta application.||N/A|
||Open a profile that you have previously generated for the specified recipe.||recipe_name: Recipe for which you want to open the profile. If set to
Data profiling functions
The SDK enables generation of data profiles based on the output of your Trifacta recipe:
||Returns a table of summary statistics per column||recipe_name: Recipe name for which you want to generate the summary. If set to
||Returns the valid/invalid/missing ratio per column||show_types: Show column types information along with data quality bars for the column. recipe_name: Recipe name for which you want to generate the data quality bar. If set to
||Lists the inferred data type for each column||recipe_name: Recipe name for which you want to infer data types for each column. If set to
||Returns a list of dataframes, one per column, representing a bar-chart for that column||N/A|
||Returns a snazzy PDF report with all the statistics||filename: Name of the file to which PDF profile results are written. If set to
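For a quick local sanity check before or after running the profiling functions, plain pandas can approximate part of this information. The sketch below is not the SDK's profiling output: it only computes the share of missing values per column and prints the dtypes that pandas infers, using made-up sample data and a hypothetical helper named `missing_ratio`.

```python
import pandas as pd


def missing_ratio(df):
    """Fraction of missing (NaN/None) values per column, returned as a Series.

    Hypothetical helper for illustration; it is not provided by the SDK.
    """
    return df.isna().mean()


sample = pd.DataFrame(
    {
        "city": ["Oslo", None, "Lima", "Paris"],
        "amount": [10.0, 7.5, None, None],
    }
)
ratios = missing_ratio(sample)
print(ratios)
print(sample.dtypes)  # dtypes as inferred by pandas, not Trifacta data types
```

Note that pandas dtypes and Trifacta's inferred data types are separate notions, so the two listings may not agree.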