The version of this library and document is V 0.0.2
Project description
datup
The version of this library and document is V 0.0.2 This library has 3 methods and 1 Class
How is it work?
import datup as dt
To Instance the Class
job = dt.DataIO("aws_acces_key_id","aws_secret_access_key","datalake")
Can I test my updates?
Yes, there is a file called _testing.ipynb where you can test your changes. The variables, must be initialized always for modularity.
Class DataIO:
A group of methods for I/O data from a AWS S3 Datalake
Parameters
----------
aws_acces_key_id : str
The class must be intialized with aws_acces_key_id
aws_secret_access_key : str
The class must be intialized with aws_secret_access_key
datalake : str
The class must be intialized with the name of the datalake
prefix_s3 : str, default "s3://"
The s3 prefix used to denote a s3 address
local_path : str, default "/tmp/"
The local path used by boto3 for upload from Temp folder to S3 bucket.
Methdos
-------
download_csv(self,stage=None,filename=None,datecols=False,sep=",",encoding="ISO-8859-1",infer_datetime_format=True,low_memory=False,indexcol=None,ts_csv=False,freq=None)
Return a dataframe downloaded from a specified datalake
This function takes the aws credentials from DataIO class and use it for download
the required data.
Parameters
----------
stage : str, default None
It is the set of folders after the datalake to the file that is required to download
filename : str, default None
Is the name of the filename to download without csv suffix
datecols: bool or list of int or names or list of lists or dict, default False
Took it from Pandas read_csv parse_dates description
The behavior is as follows:
* boolean. If True -> try parsing the index.
* list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
* list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
* dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
If a column or index cannot be represented as an array of datetimes, say because of an unparseable value
or a mixture of timezones, the column or index will be returned unaltered as an object data type. For
non-standard datetime parsing, use pd.to_datetime after pd.read_csv. To parse an index or column with
a mixture of timezones, specify date_parser to be a partially-applied pandas.to_datetime() with
utc=True. See Parsing a CSV with mixed timezones for more.
Note: A fast-path exists for iso8601-formatted dates.
sep: str, default ‘,’
Took it from Pandas read_csv sep description
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python
parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s
builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+'
will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note
that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
encoding: str, default ‘ISO-8859-1’
Took it from Pandas read_csv encoding description
Encoding to use for UTF when reading/writing (ex. ‘utf-8’).
infer_datetime_format: bool, default True
Took it from Pandas read_csv infer_datetime_format description
If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the
columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can
increase the parsing speed by 5-10x.
low_memory: bool, default True
Took it from Pandas read_csv low_memory description
Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed
type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter.
Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator
parameter to return the data in chunks. (Only valid with C parser).
indexcol : int, str, sequence of int / str, or False, default None
Took it from Pandas read_csv index_col description
Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a
sequence of int / str is given, a MultiIndex is used.
Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when
you have a malformed file with delimiters at the end of each line.
ts_csv : bool, default False
If ts_csv is True then activates the ts_csv downloaded. If True, indexcol and freq, both must
be different to None
freq : str, default W-MON
Is the frequency of getting data
Returns
-------
DataFrame
A DataFrame is return as two-dimensional data structure
Examples
--------
>>> object.download_csv(stage='stage',filename='filename') # doctest: +SKIP
download_csvm(self,uris)
Return a set of dataframes downloaded from a specified datalake in a list
This function takes the aws credentials from DataIO class and use it for download
the required data through download_csv method.
Parameters
----------
uris : dict
Is the dictionary with all parameters necessary for the download_csv method
Returns
-------
List of DataFrames
A list of DataFrames are returned
List of DataFrames Names
A list of DataFrames Names are return in string type
Examples
--------
>>> uris = {
"uri_1":{
"stage":"stage",
"filename":"filename",
"datecols":False,
"sep":";",
"encoding":"ISO-8859-1"
},
"uri_2":{
"stage":"stage",
"filename":"filename",
"datecols":False,
"sep":";",
"encoding":"ISO-8859-1"
}
}
>>> df, df_names =object.download_csvm(uris=uris) # doctest: +SKIP
upload_csv(self,df,stage=None,filename=None,index=False,header=True,date_format="%Y-%m-%d",ts_csv=False)
Return a uri where the dataframes was uploaded in csv format
This function takes the aws credentials from DataIO class and use it for upload
the required data.
Parameters
----------
df : DataFrame
Is the DataFrame for uploading to S3 datalake
stage : str, default None
It is the set of folders after the datalake to the file that is required to upload
filename : str, default None
Is the name of the filename to upload without csv suffix
index : bool, default False
Took it from Pandas to_csv index description
Write row names (index).
header : bool or list of str, default True
Took it from Pandas to_csv index description
Write out the column names. If a list of strings is given it is assumed to be aliases for
the column names.
date_format : str, default %Y-%m-%d
Took it from Pandas to_csv index description
Format string for datetime objects.
ts_csv : bool, default False
If ts_csv is True then activates the ts_csv upload. If True, index must be different to False
Returns
-------
Str
The uri where DataFrame was uploaded into S3
Examples
--------
>>> object.upload_csv(df,stage="stage",filename="filename") # doctest: +SKIP
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
datup-0.0.2.tar.gz
(6.0 kB
view hashes)
Built Distribution
datup-0.0.2-py3-none-any.whl
(8.5 kB
view hashes)