s3-netcdf
Create partitioned NetCDF files on S3
S3-NetCDF is a Python library for reading and writing NetCDF files on S3. It partitions large NetCDF files into smaller chunks so that data can be retrieved from S3 cost-effectively.
Installation
```bash
pip install meracan-s3netcdf
```
From a local folder:
```bash
git clone https://github.com/meracan/s3-netcdf.git
pip install -e ./s3-netcdf
```
With a conda environment and testing:
```bash
conda create -n s3netcdf python=3.8
conda activate s3netcdf
git clone https://github.com/meracan/s3-netcdf.git
pip install -e ./s3-netcdf
```
Methodology
S3-NetCDF creates a master file (".nca") from an input object. The input contains S3 info, metadata, dimensions, partition groups, variables, etc. Data is stored in the partition files (".nc"); no data is stored in the master file itself.
Variables must be stored in a partition group. Each partition group has a unique set of dimensions, and multiple variables can be stored under the same partition group if they share those dimensions.
The maximum size of a partition file (uncompressed) is set with the input option ncSize=1.0 (MB). The actual size is approximate and depends on the shape of the array. Partition files are automatically compressed (roughly 100 times smaller). The attribute least_significant_digit={number} can be added to a variable object to further reduce file size; remember that f4 and f8 hold about 7 and 16 significant digits, respectively. S3 HTTP compression (gzip) is not used since partition files are already compressed.
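For example, a hedged sketch of adding least_significant_digit to the bed variable from the input example below (a value of 3 truncates data beyond three decimal places):
```json
"bed":{"type":"f4", "units":"m", "standard_name":"Bed Elevation, m", "long_name":"Description of data source", "least_significant_digit":3}
```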
Input
The input for creating a master file contains S3 info, metadata, dimensions, partition groups, variables, etc.
Metadata attributes are stored in the metadata object. It is recommended to use title, institution, source, history, references, and comment.
Dimensions, groups and variables are stored in the nca object.
The input JSON file needs to be converted into a Python object, e.g. import json; input=json.load(open(filePath)). Input example to create a master file:
```json
{
  "name":"input1",
  "cacheLocation":"../s3",
  "localOnly":true,
  "bucket":"merac-dev",
  "cacheSize":10.0,
  "ncSize":1.0,
  "metadata":{"title":"title-input1"},
  "nca": {
    "dimensions" : {"npe":3,"nelem":500,"nnode":1000,"ntime":2},
    "groups":{
      "elem":{"dimensions":["nelem","npe"],"variables":{
        "elem":{"type":"i4", "units":"", "standard_name":"Elements", "long_name":"Connectivity table (mesh elements)"}
      }},
      "time":{"dimensions":["ntime"],"variables":{
        "time":{"type":"f8", "units":"hours since 1970-01-01 00:00:00.0", "calendar":"gregorian", "standard_name":"Datetime", "long_name":"Datetime"}
      }},
      "nodes":{"dimensions":["nnode"],"variables":{
        "bed":{"type":"f4", "units":"m", "standard_name":"Bed Elevation, m", "long_name":"Description of data source"},
        "friction":{"type":"f4", "units":"", "standard_name":"Bed Friction (Manning's)", "long_name":"Description of data source"}
      }},
      "s":{"dimensions":["ntime","nnode"],"variables":{
        "a":{"type":"f4", "units":"m", "standard_name":"a variable", "long_name":"Description of a"}
      }},
      "t":{"dimensions":["nnode","ntime"],"variables":{
        "a":{"type":"f4", "units":"m", "standard_name":"a variable", "long_name":"Description of a"}
      }}
    }
  }
}
```
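A minimal sketch of creating the master file from this input, assuming the JSON above is saved as input1.json:
```python
import json
from s3netcdf import NetCDF2D

# Load the input object from the JSON file and create/open the master file
with open("input1.json") as f:
    input = json.load(f)

netcdf2d = NetCDF2D(input)
```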
The input for opening a master file can be simplified. As a minimum, the input should contain name, cacheLocation and bucket (if using S3). Input example to open a master file:
```json
{
  "name":"input1",
  "cacheLocation":"../s3",
  "bucket":"merac-dev",
  "localOnly":true,
  "cacheSize":10.0,
  "ncSize":1.0
}
```
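A hedged sketch of opening an existing master file with the minimal input, passed directly as a Python dict:
```python
from s3netcdf import NetCDF2D

# Only name, cacheLocation and bucket (if using S3) are needed to open
netcdf2d = NetCDF2D({
    "name": "input1",
    "cacheLocation": "../s3",
    "bucket": "merac-dev",
})
```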
S3, caching and localOnly
Partition files are saved locally (cached) while reading and writing. By default, cacheLocation={path} is the current working directory.
The input option cacheSize=1.0 defines the maximum cache size in MB. If the cache size is exceeded, the oldest cached partition files are removed.
The input option localOnly=True ignores all S3 and caching commands; this is used for testing.
The bucket={str} option must be set in the input if files are uploaded to S3.
Usage
Basic
```python
import numpy as np
from s3netcdf import NetCDF2D

# Create/Open master file
netcdf2d = NetCDF2D(input)

# Writing
netcdf2d["{groupname}","{variablename}",{...indices...}] = np.array(...)

# Reading
netcdf2d["{groupname}","{variablename}",{...indices...}]
```
Assigning values to indexed arrays works the same as in NumPy. Note: string values have not been tested.
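A concrete sketch using the input example above, where group "s" has dimensions ["ntime","nnode"] = (2, 1000); the indexing semantics are assumed to mirror NumPy:
```python
import numpy as np

# Write one timestep (1000 node values) of variable "a" in group "s"
netcdf2d["s", "a", 0] = np.zeros(1000, dtype="f4")

# Read the same slice back
values = netcdf2d["s", "a", 0]
print(values.shape)  # expected: (1000,)
```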
Commands
```python
# Get information inside the master file
netcdf2d.info()

# Get group dimensional shape
netcdf2d.groups["{groupname}"].shape

# Get group dimensional partition shape
netcdf2d.groups["{groupname}"].child

# Get a variable's attributes
netcdf2d.groups["{groupname}"].attributes["{variablename}"]
```
Caching commands
```python
# List partition files stored locally
netcdf2d.cache.getNCs()

# Clear/delete all local partition files
# Warning!
netcdf2d.cache.clearNCs()

# Delete the NetCDF locally (master file and partition files)
# Warning!
netcdf2d.cache.delete()
```
S3 commands
```python
# List master and partition files, including metadata
netcdf2d.s3.list()

# Clear/delete all partition files in S3
# Warning!
netcdf2d.s3.clearNCs()

# Delete the NetCDF in S3 (master file and partition files)
# Warning!
netcdf2d.s3.delete()
```
Testing
```bash
conda install pytest
mkdir ../s3
pytest
```
For developers and debugging:
```bash
mkdir ../s3
PYTHONPATH=../s3-netcdf/ python3 test/test_netcdf2d_func.py
PYTHONPATH=../s3-netcdf/ python3 test/test_netcdf2d1.py
PYTHONPATH=../s3-netcdf/ python3 test/test_netcdf2d2.py
```
AWS S3 Credentials
Credentials, for example AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION, need to be saved as environment variables. For more information, see the AWS documentation on configuring credentials.
The credentials need get, put and delete access (if deleting is required) to the bucket.
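A minimal sketch of supplying credentials from Python, assuming the library's S3 client honors the standard AWS environment variables (values below are placeholders):
```python
import os

# Placeholder values; in practice set these in your shell or AWS config,
# before creating the NetCDF2D instance
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
```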
Performance and Benchmark
TODO
- Revise code on the value-parsing side (comparing shape, value type, etc.); this should be in a separate function rather than in dataWrapper.
- Check operations when assigning by index: + - * /
- Fix the bench folder and create better performance tests.
- Find the optimal partition shape for uploading.
- Set up travis-ci and encryption keys.
- Complete the documentation in the code.