DO NOT USE - This is a sample program
dataframez
An extension to pandas that enables simple cross-cloud-platform interactions with data, use of data versioning tools, and more. The idea is to make it very simple for pandas users to interact with named data sources.
A named data source is one whose name alone can be used to retrieve the data, without supplying additional access criteria such as would be necessary when connecting to a database, a cloud resource, and so on. Those details are a burden for a data scientist: who wants to track where their data resides?
In modern environments a data catalog is often used to track data assets, but interacting with these catalogs is also bothersome. The use of a named asset abstracts the interface with such catalogs by providing all the interactions needed to identify and retrieve the data. Cataloging in this sense can also mean a data versioning utility. In general, this means that the catalog interactions of dataframez can work across many catalogs in tandem; that is, with enterprise catalogs and data scientist catalogs at the same time.
## About

dataframez is a pandas wrapper designed to provide an abstraction between a data catalog (or catalog for short) and the user's standard interaction with pandas.
Intent
The purpose of the catalog is twofold:
- To abstract away the need to know where data is stored, and to simplify reading without necessarily having to know what kind of data you are reading.
- To enable enterprise governance controls that mandate where data is stored and what kinds of data persistence are allowed. Governance teams can also work with IT to make sure the correct interface is available if it does not currently exist.
Configuration
The configuration identifies what kind of catalog is being used (the name of the catalog class, preferably lowercase). For each type of persistence, it also identifies where the data will be persisted (or another appropriate abstraction) and whether that kind of persistence is allowed.
The configuration file is a YAML file with the following format:
```yaml
version: VERSION_NUMBER
configurations:
  catalog:
    type: CATALOG_TYPE
    conf: SOME_CONFIGURATION
  writers:
    csv:
      type: csv
      conf:
        enabled: BOOLEAN
        OTHER_CONF: values
    parquet:
      type: parquet
      conf:
        enabled: BOOLEAN
        OTHER_CONF: values
    # etc...
```
The values that are in all CAPS are to be filled in with appropriate values. At this time there is only one configuration version.
Example Configuration
```yaml
version: 1
configurations:
  catalog:
    type: local
    conf:
      location: $HOME/.dataframez
      name: default_catalog.dfz
  writers:
    csv:
      type: csv
      conf:
        path: $HOME/.dataframez
        allowed: true
    parquet:
      type: parquet
      conf:
        allowed: false
```
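A configuration like this might be consumed as follows. This is a minimal sketch, assuming PyYAML is installed; the `load_config` helper and the configuration file path are illustrative assumptions, not part of the dataframez API:

```python
import os
import yaml  # PyYAML

def load_config(path: str) -> dict:
    """Load a dataframez-style YAML configuration (illustrative helper)."""
    with open(path, 'r') as handle:
        return yaml.safe_load(handle)

# Assumed location of the configuration file (not specified by dataframez).
config = load_config(os.path.expanduser('~/.dataframez/config.yaml'))
writers = config['configurations']['writers']

# Governance check: is CSV persistence permitted by this configuration?
csv_allowed = writers['csv']['conf'].get('allowed', False)
print(f"CSV writes allowed: {csv_allowed}")
```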
## API

The intent has been to keep the API as simple as possible by minimally extending the pandas API and supporting, for the most part, the same functionality for saving data outputs as pandas itself.
Reading from a Catalog
```python
pandas.from_catalog(name: str, version: int, **kwargs) -> pandas.DataFrame
```

This method extends the read capabilities of pandas to read from a cataloged asset.
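For example (a minimal usage sketch; the asset name 'my_asset' and integer versioning starting at 1 are illustrative assumptions):

```python
import pandas as pd
import dataframez  # importing registers the pandas extensions

# Retrieve the latest version of a cataloged asset by name.
df = pd.from_catalog(name='my_asset')

# Or pin a specific, previously registered version of the asset.
df_v1 = pd.from_catalog(name='my_asset', version=1)
```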
### Extended Write Capabilities
The write capabilities of pandas have been extended to cataloged entry points by providing methods in the pandas namespace extension 'dataframez'. In this namespace the standard pandas write methods are added, with an asset registration name taking the place of common persistence identifiers such as a path. In some cases default parameters are changed to make the integration of read and write seamless.
In addition to the standard methods, methods have been added for specialized data source interactions.

To discover cataloged resources, call the `list_assets()` method to retrieve a list of all asset names.
Supported Methods
- pandas.DataFrame.dataframez.to_csv
- pandas.DataFrame.dataframez.to_parquet
- pandas.DataFrame.dataframez.to_pickle
NOTE: across all of the write methods, `entry_name` is used both in the name of the persisted source and as the name of the entry in the catalog.
`to_csv(entry_name: str, **kwargs)`

Writes the data out to persistent storage in CSV format while logging the asset to the catalog under `entry_name`. `**kwargs` represents the standard pandas write parameters, which can be used here in the same way.

Note that the default value of `index_col` has been changed to 0 so that the write and read defaults are as seamless as possible.
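A minimal round-trip sketch (the entry name 'daily_report' is an illustrative assumption):

```python
import pandas as pd
import dataframez

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Persist as CSV and register the asset under the given entry name.
df.dataframez.to_csv(entry_name='daily_report')

# Reading back recovers the same frame; the adjusted index_col default
# means the written index is restored without extra arguments.
df_again = pd.from_catalog(name='daily_report')
```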
`to_parquet(entry_name: str, **kwargs)`

Writes the data out to persistent storage in Parquet format while logging the asset to the catalog under `entry_name`. `**kwargs` represents the standard pandas write parameters, which can be used here in the same way.
`to_pickle(entry_name: str, **kwargs)`

Writes the data out to persistent storage as a pickle while logging the asset to the catalog under `entry_name`. `**kwargs` represents the standard pandas write parameters, which can be used here in the same way.
Examples
Reading and Writing
```python
import pandas as pd
import dataframez  # importing registers the 'dataframez' accessor

# Write a DataFrame to the catalog as parquet under the entry name 'my_asset'.
df_to_write = pd.DataFrame.from_dict({'a': [1, 2, 3], 'b': [2, 3, 5]})
df_to_write.dataframez.to_parquet(entry_name='my_asset')

# Read the cataloged asset back by name.
df_read_from_catalog = pd.from_catalog(name='my_asset')
```
Getting a List of Assets
```python
import pandas as pd
import dataframez

# List the names of all assets registered in the configured catalog.
asset_list = pd.list_assets()
```
Future Features
- Extended support of read/write IO types
- Extension to Dask
- Extension to pySpark