Utility for aggregation of NetCDF data.
NetCDF Aggregation (ncagg)
So... you want to aggregate time series NetCDF files?
TL;DR
Install the utility with pip:
pip install ncagg
On the command line, use ncagg:
Usage: ncagg [OPTIONS] DST [SRC]...
Aggregate NetCDF files.
Options:
-v, --version Show the version and exit.
--generate_template FILE Print the default template generated for
PATH and exit.
-u TEXT Give an Unlimited Dimension Configuration as
udim:ivar[:hz[:hz]]
-c TEXT Give an Chunksize Configuration as
udim:chunksize to chunk the ulimited
dimension udim by chunksize
-b TEXT If -u given, specify bounds for ivar as
min:max or Tstart[:[T]stop]. min and max are
numerical, otherwise T indicates start and
stop are times.start and stop are of the
form YYYY[MM[DD[HH[MM]]]] and of stop is
omitted,it will be inferred to be the least
significantly specified date + 1.
-l [DEBUG|INFO|WARNING|ERROR|CRITICAL]
log level
-t FILENAME Specify a configuration template
--help Show this message and exit.
Notes:
- DST is the filename for the NetCDF output. If it already exists, it will be overwritten.
- SRC is a list of input NetCDF files to aggregate. It can be passed on the command line or piped to ncagg.
- -u can specify an Unlimited Dimension Configuration. See below for details.
- Taking tens of minutes to aggregate a day of data is normal; a progress bar will indicate time remaining.
- For fine grained control over the output, specify a configuration template (-t). See below for details.
- Set the unlimited dimension chunksize with -c for smaller output filesize.
Examples:
- Explicitly list files to aggregate:
ncagg output_filename.nc file_0.nc file_02.nc #...
- Aggregate by globbing files:
ncagg output_filename.nc path_to_files/*.nc
- Sort the unlimited dimension record_number, according to the variable time:
ncagg -u record_number:time output_filename.nc path_to_files/*.nc
- Sort the unlimited dimension record_number, according to the variable time, and insert or remove
fill values to ensure time occurs at 10 Hz:
ncagg -u record_number:time:10 output_filename.nc path_to_files/*.nc
- Sort the unlimited dimension record_number, according to the variable time, and insert or remove
fill values to ensure time occurs every minute (note: 0.0166666 ~ 1.0/60.0):
ncagg -u record_number:time:0.0166666 output_filename.nc path_to_files/*.nc
- Only include time values between 2017-06-01 and 2017-06-02 (bounds), including sorting and
filling, as above:
ncagg -u record_number:time:10 -b T20170601:T20170602 output_filename.nc path_to_files/*.nc
- Or equivalently, if only one bound is specified, the end is inferred to be the start with its least significant specified field + 1:
ncagg -u record_number:time:10 -b T20170601 output_filename.nc path_to_files/*.nc
- Aggregate more files than fit on the command line... (in case of: Argument list too long):
find /path/to/files -type f -name "*.nc" | ncagg output.nc
For more information, see the Unlimited Dimension Configuration below. The ncagg Command Line Interface (CLI) builds a Config based on the arguments specified. Fine grained control over the config can be exercised by providing a config template.
High level overview
Aggregation works in two stages:
- Create an Aggregation List describing steps and order of aggregation.
- Evaluate the Aggregation List.
The Aggregation List object is just a list that describes the order in which to combine components of an aggregation. The objects within the list represent source files, or segments of fill values. Source files are associated with sorting and filling instructions within the file. Fill values indicate where, and how many fill values to create.
During stage 1, the Aggregation List is generated. The level of configuration given determines how much is done here. At most, each file is inspected according to its unlimited dimension and the variable that indexes it to determine sorting and filling. No data except for index_by variables is read, and nothing is written to disk during this stage. If an expected cadence is not provided, filling is not done. If bounds are provided, the unlimited dimension is clipped to ensure only data within the bounds is included. For the minimum configuration, files are simply assembled in order of sorted filename.
During stage 2, the Aggregation List is evaluated. Evaluating the Aggregation List means simply iterating over the components contained and copying data from these into the output aggregation file, while keeping track of global attributes.
Reasons for using this approach:
- Possible to aggregate more data than fits in memory.
- Sort once per unlimited dimension.
- Modular code, easier to maintain, extend, and debug.
Configuration
The sophistication of the aggregation is determined by how much configuration information is given on generation of the Aggregation List.
- No Config -> agg files along unlimited dims, sorted by filename.
- Config with index_by -> agg such that index_by is in ascending order.
- Config with index_by and bounds -> agg such that index_by is in ascending order within bounds.
- Config with index_by and expected_cadences -> agg and regularize, removing duplicates/inserting fills if needed.
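The last level in the list above can be pictured with a small pure-Python sketch. This is not ncagg's implementation: `regularize` and its half-step duplicate rule are hypothetical, chosen only to show how sorting, duplicate removal, and fill insertion interact for a cadence given in Hz.

```python
def regularize(times, hz, fill=None):
    """Sort times, drop near-duplicates, and insert fills where samples are missing.

    Hypothetical helper for illustration; ncagg's real rules may differ.
    """
    step = 1.0 / hz
    out = []
    last = None  # last real (non-fill) timestamp that was kept
    for t in sorted(times):
        if last is None:
            out.append(t)
            last = t
            continue
        gap = t - last
        if gap < 0.5 * step:
            continue  # closer than half a step to the previous sample: duplicate
        # number of whole cadence steps skipped between last and t
        missing = int(round(gap / step)) - 1
        out.extend([fill] * missing)
        out.append(t)
        last = t
    return out


print(regularize([3.0, 0.0, 1.0, 1.0, 5.0], hz=1))
# [0.0, 1.0, None, 3.0, None, 5.0]
```

With only index_by configured, the equivalent would be the `sorted(times)` step alone.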
The Config contains information that a NetCDF CDL specification would, but in json format, extended with aggregation configuration information. If not provided, a default version will be created using the first file in the list to aggregate.
The Config contains three properties (keys):
- dimensions
- variables
- attributes
Each property is associated with a list of objects so as to preserve ordering. The order of the objects corresponds to the order of appearance in the output. Objects of all sections have a "name" property.
Dimension objects specify the dimensions of the file and have at minimum a "name" and a "size", which can be null for an unlimited dimension. Unlimited dimensions may also have an Unlimited Dimension Configuration, which is described in a dedicated section below.
Variable objects contain a "name", "dimensions", "datatype", "attributes", and "chunksizes". The dimensions property is a list of dimension names on which the variable depends; each must be configured in the dimensions section. datatype is something like int8, float32, string, etc. Finally, attributes is another property containing keys and values corresponding to variable attributes, commonly including "units", "valid_min", "_FillValue", etc.
Attributes objects contain "name", "strategy", and optionally "value" for NetCDF Global Attributes. The strategies are described below.
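As a rough illustration of the shape described above, here is a hypothetical minimal Config written as a Python dict and serialized to JSON. The variable names, datatypes, and values are invented for the example, and a real template may contain more fields than shown here.

```python
import json

# Hypothetical minimal Config, following the three keys listed above.
config = {
    "dimensions": [
        {"name": "record_number", "size": None},  # None serializes to null: unlimited
        {"name": "channel", "size": 4},
    ],
    "variables": [
        {
            "name": "time",
            "dimensions": ["record_number"],
            "datatype": "float64",
            "attributes": {"units": "seconds since 2000-01-01 12:00:00"},
            "chunksizes": [1024],
        },
        {
            "name": "flux",  # invented variable, for illustration only
            "dimensions": ["record_number", "channel"],
            "datatype": "float32",
            "attributes": {"units": "W m-2", "_FillValue": -9999.0},
            "chunksizes": [1024, 4],  # one chunk size per dimension
        },
    ],
    "attributes": [
        {"name": "production_site", "strategy": "unique_list"},
        {"name": "creator", "strategy": "static", "value": "example"},
    ],
}

text = json.dumps(config, indent=2)
print(text)
```

Lists are used throughout so that the order written here is the order that would appear in the output.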
Unlimited Dimension Configuration
The Unlimited Dimension Configuration associates a particular unlimited dimension with a variable by which it can be indexed. Commonly, a dimension named time is associated with a variable also named time which indicates some epoch value for all data associated with that index of the dimension.
For example, a file may have a dimension "record_number" which is indexed by a variable "time". Using the Unlimited Dimension Configuration, we can specify to aggregate record_number such that the variable "time" forms a monotonic sequence increasing at some expected frequency.
Here is what the Unlimited Dimension Configuration for a typical GOES-R L1b product aggregation looks like:
{
"name": "report_number",
"size": null,
"index_by": "time",
"expected_cadence": {"report_number": 1}
}
In English, the configuration above says "Order the dimension report_number by the values in the variable time, where time values are expected to increase along the dimension report_number, incrementing at 1 Hz." This would be specified to the ncagg CLI using ncagg -u report_number:time:1 output.nc in1.nc in2.nc.
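To make the mapping between the -u string and the JSON form concrete, here is a hypothetical parser for the simple udim:ivar[:hz] case. `parse_udim` is not ncagg's CLI code. It deliberately ignores a second hz value, because the name of the other dimension cannot be derived from the string alone.

```python
def parse_udim(spec):
    """Turn 'udim:ivar[:hz]' into an unlimited dimension config dict.

    Hypothetical helper; handles only the one-dimensional case.
    """
    parts = spec.split(":")
    if len(parts) not in (2, 3):
        raise ValueError("expected udim:ivar[:hz], got %r" % spec)
    udim, ivar = parts[0], parts[1]
    config = {"name": udim, "size": None, "index_by": ivar}
    if len(parts) == 3:
        # the cadence applies along the unlimited dimension itself
        config["expected_cadence"] = {udim: float(parts[2])}
    return config


print(parse_udim("report_number:time:1"))
```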
The configuration even allows indexing by multidimensional time (ahem, mag with 10 samples per report). On the command line this is specified as -u report_number:OB_time:1:10, or as json:
{
"name": "report_number",
"size": null,
"index_by": "OB_time",
"other_dim_indicies": {"samples_per_record": 0},
"expected_cadence": {"report_number": 1, "number_samples_per_report": 10}
}
One design constraint was to not reshape the data, so above, we order the data by looking at index 0 of samples_per_record for every value along the report_number dimension. We assume that the other timestamps along samples_per_record are correct. Also, given the configuration above, we only insert fill records of OB_time if a full report_number record is missing (all 10 values along the number_samples_per_report dimension missing).
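The "order by index 0, never reshape" rule can be sketched with plain lists. The data and the `order_records` helper below are hypothetical. Each record is moved as a whole and the samples inside it are left untouched.

```python
# Each record holds 10 sample timestamps; records arrive out of order.
records = [
    [20.0 + 0.1 * i for i in range(10)],
    [0.0 + 0.1 * i for i in range(10)],
    [10.0 + 0.1 * i for i in range(10)],
]


def order_records(records, sample_index=0):
    """Return record positions sorted by the timestamp at sample_index.

    Records are moved whole; samples within a record are never reordered.
    """
    return sorted(range(len(records)), key=lambda i: records[i][sample_index])


order = order_records(records)
sorted_records = [records[i] for i in order]
print(order)  # [1, 2, 0]
```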
Indexing an unlimited dimension was described above. In addition to simply indexing by a variable, when the variable represents time, a common operation is to restrict values to some range, for example to create a day file. The Unlimited Dimension Configuration would look like:
{
"name": "report_number",
"size": null,
"index_by": "time",
"min": 14000000, # in units of the variable "time", expected
"max": 14000060, # something like "seconds since 2000-01-01 12:00:00"
"expected_cadence": {"report_number": 1}
}
This would be specified on the command line as ... -u report_number:time:1 -b14000000:14000060 ... where the -b option stands for "bounds".
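A minimal sketch of clipping an already sorted index variable to numerical bounds follows. Whether ncagg treats the bounds as inclusive or exclusive is not stated here, so the half-open interval used by the hypothetical `clip_to_bounds` is an assumption.

```python
from bisect import bisect_left


def clip_to_bounds(sorted_times, tmin, tmax):
    """Keep values with tmin <= t < tmax from an already sorted list.

    The half-open interval is an assumption made for this sketch.
    """
    lo = bisect_left(sorted_times, tmin)
    hi = bisect_left(sorted_times, tmax)
    return sorted_times[lo:hi]


times = [13999998, 13999999, 14000000, 14000001, 14000059, 14000060]
print(clip_to_bounds(times, 14000000, 14000060))
# [14000000, 14000001, 14000059]
```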
As min and max almost exclusively indicate datetime values, for convenience they are accepted as any of three types: numerical, string, or Python datetime. In string representation, they must start with "T" and can be of the form "TYYYY[MM[DD[HH[MM]]]]", where brackets indicate optional fields that, if omitted, are inferred to be the minimum valid value, i.e. 01 for MM (month). A units attribute must be available for the index_by variable in the form "<units> since <reference time>". On the command line, string times can be given as ... -u report_number:time:1 -bT20170101:T20170102 ... or, equivalently, the end bound can be omitted and will be inferred to be the rightmost specified field of the beginning YYYY[MM[DD[HH[MM]]]] incremented by one, i.e. ... -u report_number:time:1 -bT20170101 ...
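The string form of a bound and the "rightmost specified field plus one" rule can be sketched with the standard library. `parse_start` and `infer_stop` are hypothetical names, not functions from ncagg.

```python
from datetime import datetime, timedelta

# strptime formats keyed by the number of digits given after "T"
_FORMATS = {4: "%Y", 6: "%Y%m", 8: "%Y%m%d", 10: "%Y%m%d%H", 12: "%Y%m%d%H%M"}


def parse_start(text):
    """Parse 'TYYYY[MM[DD[HH[MM]]]]'; omitted fields take their minimum value."""
    digits = text[1:] if text.startswith("T") else text
    if len(digits) not in _FORMATS or not digits.isdigit():
        raise ValueError("bad time bound: %r" % text)
    return datetime.strptime(digits, _FORMATS[len(digits)])


def infer_stop(text):
    """Increment the least significant field given in text by one."""
    start = parse_start(text)
    n = len(text.lstrip("T"))
    if n == 4:
        return start.replace(year=start.year + 1)
    if n == 6:
        # roll over to January of the next year after December
        return start.replace(year=start.year + start.month // 12,
                             month=start.month % 12 + 1)
    step = {8: timedelta(days=1), 10: timedelta(hours=1), 12: timedelta(minutes=1)}[n]
    return start + step


print(parse_start("T20170101"), infer_stop("T20170101"))
```

Under these assumptions, -bT20170101 and -bT20170101:T20170102 describe the same one-day interval.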
Consider the suvi-l2-flloc (flare location) product, which has two unlimited dimensions, time and feature_number. At any time record, there can exist an arbitrary number of features. Consider a variable reporting the flux from each feature at each time: flux(time, feature_number). Although feature_number is unlimited, it is unique to each time and thus needs to be "flattened":
flux([0], [0]) -> [[3.2e-6]]
flux([0], [0, 1]) -> [[3.3e-6, 5.4e-7]]
undesired_aggregated_flux(time, feature_number):
[[3.2e-6, _, _],
[ _, 3.3e-6, 5.4e-7]]
desired_aggregated_flux(time, feature_number):
[[3.2e-6, _],
[3.3e-6, 5.4e-7]]
The desired_aggregated_flux is achieved by setting {"flatten": true} within the unlimited dimension configuration for feature_number.
[{
"name": "time",
"size": null,
"index_by": "time"
}, {
"name": "feature_number",
"size": null,
"flatten": true
}]
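The difference between the two layouts shown earlier can be reproduced with a small sketch. `aggregate_flux` is a hypothetical stand-in operating on nested lists, with None playing the role of the fill value "_".

```python
FILL = None  # stands in for "_" in the layouts above


def aggregate_flux(granules, flatten):
    """Stack per-granule rows of flux(time, feature_number) into one 2-D list.

    Each granule is a list of rows. Hypothetical illustration only.
    """
    if flatten:
        width = max(len(row) for g in granules for row in g)
        # every row starts at column 0 and is padded on the right
        return [row + [FILL] * (width - len(row)) for g in granules for row in g]
    # Without flatten, feature_number is concatenated like any other unlimited
    # dimension, so each granule's features occupy their own block of columns.
    width = sum(max(len(row) for row in g) for g in granules)
    out, offset = [], 0
    for g in granules:
        gwidth = max(len(row) for row in g)
        for row in g:
            padded = [FILL] * width
            padded[offset:offset + len(row)] = row
            out.append(padded)
        offset += gwidth
    return out


granules = [[[3.2e-6]], [[3.3e-6, 5.4e-7]]]
print(aggregate_flux(granules, flatten=False))
print(aggregate_flux(granules, flatten=True))
```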
Specify Global Attribute Aggregation Strategies
The aggregated NetCDF file contains global attributes formed from the constituent granules. A number of strategies exist to aggregate Global Attributes across the granules. Most are self-explanatory:
- "static": use the configured "value" in the template, ignoring any values that may be in the file.
- "first": first value seen will be taken as the output value for this global attribute
- "last": the last value seen will be taken as global attribute
- "unique_list": compile values into a unique list "first, second, etc"
- "int_sum": resulting in integer sum of the inputs
- "float_sum": StratFloatSum, resulting in floating-point sum of the inputs
- "constant": StratAssertConst, similar to first, but raises an error if value changes among input files.
- "date_created": simply yields the current date when finalized, standard dt fmt
- "time_coverage_start": start bound, if specified, standard dt fmt
- "time_coverage_end": end bound, if specified, standard dt fmt
- "filename": StratOutputFilename, set attribute to name of output file
- "remove": remove/do not include this global attribute
- "first_input": Filename of first file included in aggregate
- "last_input": Filename of last file included in aggregate
- "input_count": Number of files included in aggregate
- "ncagg_version": Version number for the ncagg software running
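A few of the simpler strategies can be sketched as plain functions over the list of values seen across input files. These are hypothetical re-implementations written from the descriptions above, not ncagg's strategy classes, and the site names are invented.

```python
def strat_first(values):
    return values[0]


def strat_last(values):
    return values[-1]


def strat_unique_list(values):
    seen = []
    for v in values:
        if v not in seen:  # keep first occurrence, preserve order
            seen.append(v)
    return ", ".join(str(v) for v in seen)


def strat_int_sum(values):
    return sum(int(v) for v in values)


def strat_constant(values):
    if any(v != values[0] for v in values):
        raise ValueError("attribute changed between input files")
    return values[0]


STRATEGIES = {
    "first": strat_first,
    "last": strat_last,
    "unique_list": strat_unique_list,
    "int_sum": strat_int_sum,
    "constant": strat_constant,
}

sites = ["site_a", "site_a", "site_b"]  # invented values from three granules
print(STRATEGIES["unique_list"](sites))  # site_a, site_b
```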
The configuration format expects a key "global attributes" associated with a list of objects each containing a global attribute name, strategy, and possible value (for static). A list is used to preserve order, as the order in the configuration will be the resulting order in the output NetCDF.
{
"global attributes": [
{
"name": "production_site",
"strategy": "unique_list"
}, {
"name": "creator",
"strategy": "static",
"value": "Stefan Codrescu"
}, {
...
}
]
}
Specify Dimension Indices to Extract and Flatten
NOT IMPLEMENTED. IN PROGRESS. SUBJECT TO CHANGE.
Consider SEIS SGPS files which contain the data from two sensor units, +X and -X. Most variables are of the form var[record_number, sensor_unit, channel, ...]. It is possible to create an aggregate file for the +X and -X sensor units individually using the take_dim_indicies configuration key.
{
"take_dim_indicies": {
"sensor_unit": 0
}
}
With the above configuration, sensor_unit must be removed from the dimensions configuration. Please also ensure that variables do not list sensor_unit as a dimension, and also update chunk sizes accordingly. Chunk sizes must be a list of values of the same length as dimensions.
Configuration Template
ncagg can be configured to output files into a format specified by a configuration template file. It is expected that this is a json format file. A generic template can be created using the ncagg --generate_template [SAMPLE_NC] command. The output of the template command is the default template that is used internally if no template is specified.
Example usage
Use ncagg --generate_template example_netcdf.nc > my_template.json to save the default template for an example_netcdf.nc file into my_template.json. Edit my_template.json to your liking, then run aggregation using ncagg -t my_template.json [...].
Template syntax
The template syntax is verbose, but hopefully straightforward and clear. The incoming template will be validated upon initiating an aggregation, but some issues may only be found at runtime.
Attributes
The attributes section is a list of objects containing global attributes:
- name: name of global attribute
- strategy: aggregation strategy to use for attribute.
- value: value used by strategy, if required. E.g. static, where the value is "test".
Dimensions
The dimensions section is a list of objects containing the dimensions of the file. Most configuration options are covered in the Unlimited Dimension Configuration section, but to clarify:
- size: integer if dimension has a fixed size. null if it's unlimited.
Variables
Similarly, the variables section is a list of objects configuring output variables. Remove the object corresponding to some variable to remove it from the output.
Important notes:
- The dimensions referenced must exist.
- Chunksizes must be the same number of elements as dimensions.
- Use copy_from_alt to specify a list of alternative variables to copy data from if a variable with name isn't found.
Take care that everything is consistent when doing heavy modifications.
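Before running a heavily edited template, a quick consistency check along the lines of the notes above can catch the two most common mistakes. `check_template` is a hypothetical helper operating on the parsed JSON. It is not ncagg's own validation and covers only the dimension and chunksize rules.

```python
def check_template(template):
    """Return a list of human-readable consistency problems (empty if none).

    Hypothetical checker, not ncagg's own validation.
    """
    problems = []
    dim_names = {d["name"] for d in template.get("dimensions", [])}
    for var in template.get("variables", []):
        dims = var.get("dimensions", [])
        for d in dims:
            if d not in dim_names:
                problems.append("%s: unknown dimension %s" % (var["name"], d))
        chunks = var.get("chunksizes")
        if chunks is not None and len(chunks) != len(dims):
            problems.append("%s: %d chunksizes for %d dimensions"
                            % (var["name"], len(chunks), len(dims)))
    return problems


template = {
    "dimensions": [{"name": "record_number", "size": None}],
    "variables": [
        {"name": "time", "dimensions": ["record_number"], "chunksizes": [1024]},
        {"name": "flux", "dimensions": ["record_number", "channel"], "chunksizes": [1024]},
    ],
}
print(check_template(template))
```

In practice the template dict would come from `json.load` on the edited template file.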
Use from code
In addition to the CLI, ncagg exposes an API which makes it possible to call from Python code:
from ncagg import aggregate
aggregate(["file1.nc", "file2.nc"], "output.nc")
aggregate optionally accepts a configuration template as a third argument. If none is given, the default template created from the first input file is used. Thus the code above is equivalent to:
from ncagg import aggregate, Config
config = Config.from_nc("file1.nc")
aggregate(["file1.nc", "file2.nc"], "output.nc", config)
This allows for the possibility of programmatically manipulating the configuration at runtime before performing aggregation.
Limitations
- Does not support netCDF4 enum types.
Technical and Implementation details
An Aggregation List is composed of two types of objects, InputFileNode and FillNode objects. These inherit in common from an AbstractNode and must implement the get_size_along(unlimited_dim) and data_for(var, dim) methods. Evaluating an aggregation list is simply going through the Aggregation List and calling something like:
nc_out.variable[var][write_slice] = node.data_for(var)
The data_for method must return data consistent with the shape promised by node.get_size_along(dim).
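The node interface and the evaluation loop can be sketched in a few lines. Only the names AbstractNode, FillNode, get_size_along, and data_for come from the description above. `ListNode`, `evaluate`, and the use of plain lists instead of NetCDF variables are assumptions made to keep the example self-contained.

```python
from abc import ABC, abstractmethod


class AbstractNode(ABC):
    @abstractmethod
    def get_size_along(self, unlimited_dim):
        """Number of records this node contributes along unlimited_dim."""

    @abstractmethod
    def data_for(self, var, dim):
        """Data for var, with length matching get_size_along(dim)."""


class ListNode(AbstractNode):
    """Hypothetical stand-in for an InputFileNode, backed by an in-memory dict."""

    def __init__(self, data):
        self.data = data

    def get_size_along(self, unlimited_dim):
        return len(next(iter(self.data.values())))

    def data_for(self, var, dim):
        return self.data[var]


class FillNode(AbstractNode):
    def __init__(self, count, fill_value):
        self.count = count
        self.fill_value = fill_value

    def get_size_along(self, unlimited_dim):
        return self.count

    def data_for(self, var, dim):
        return [self.fill_value] * self.count


def evaluate(agg_list, var, dim):
    """Copy each node's data into its slice of the output, in list order."""
    out = [None] * sum(node.get_size_along(dim) for node in agg_list)
    start = 0
    for node in agg_list:
        size = node.get_size_along(dim)
        write_slice = slice(start, start + size)
        out[write_slice] = node.data_for(var, dim)
        start += size
    return out


agg_list = [ListNode({"flux": [1, 2]}), FillNode(2, -9999), ListNode({"flux": [3]})]
print(evaluate(agg_list, "flux", "record_number"))  # [1, 2, -9999, -9999, 3]
```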
The complexity of aggregation comes in handling the dimensions and building the aggregation list. In addition to the interface exposed by an AbstractNode, InputFileNode and FillNode each implement their own specific functionality.
A FillNode is simpler: it needs to be told how many fills to insert along a certain unlimited dimension and, optionally, can be configured to return values from data_for that are increasing along multiple dimensions according to configured expected_cadence values from a certain start value.
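For an index variable such as a multidimensional time, "increasing along multiple dimensions" might look like the sketch below. `fill_times` is a hypothetical helper: records advance at one cadence and samples within a record at another, both in Hz.

```python
def fill_times(start, n_records, record_hz, samples_per_record=1, sample_hz=1.0):
    """Generate timestamps for n_records fill records.

    Record i starts at start + i / record_hz; within a record, sample j is
    offset by a further j / sample_hz. Hypothetical helper for illustration.
    """
    return [
        [start + i / record_hz + j / sample_hz for j in range(samples_per_record)]
        for i in range(n_records)
    ]


# Two missing records at 1 Hz, each holding 10 samples at 10 Hz.
rows = fill_times(100.0, 2, record_hz=1, samples_per_record=10, sample_hz=10)
print(rows[0][0], rows[1][0])  # 100.0 101.0
```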
An InputFileNode is more complicated and exposes methods to find the time bounds of the file. Additionally, it is internally capable of sorting itself and inserting fill values into itself. Of course, it doesn't modify the actual input file; this is all done on the fly as data is being read out through data_for. Implementation wise, an InputFileNode may contain within itself a mini aggregation list containing two types of objects: slice and FillNode objects. Similarly to the large scale process of aggregating, an InputFileNode returns data that has been assembled according to its internal aggregation list and internal sorting.
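The "mini aggregation list" idea can be sketched as a plan made of slice objects and fill counts that is applied while reading. Representing a fill step as a bare integer is a simplification for this example; `read_with_plan` is hypothetical.

```python
def read_with_plan(values, plan, fill):
    """Assemble output from a plan of slice objects and integer fill counts.

    `values` stands in for a variable already sorted in memory; an int n in the
    plan means "insert n fill values here". Hypothetical structure.
    """
    out = []
    for step in plan:
        if isinstance(step, slice):
            out.extend(values[step])
        else:
            out.extend([fill] * step)
    return out


values = [0.0, 1.0, 3.0, 4.0]  # the sample at t=2.0 is missing from the file
plan = [slice(0, 2), 1, slice(2, 4)]
print(read_with_plan(values, plan, fill=-9999.0))
# [0.0, 1.0, -9999.0, 3.0, 4.0]
```

The input list is never modified, mirroring how the input file itself is left untouched.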
Testing
This software is written for aggregation of GOES-R series Space Weather data products (L1b and L2+). As such, it contains extensive tests against real GOES-16 satellite data. Many "features" in this code are intended to address "quirks" in the ground processing (implemented by a certain contractor...).
Tests are in the test subdirectory. Run all tests with:
python -m unittest discover
The code is compatible with Python2 (2.7) and Python3, so unittests should be run with both. One interesting thing I've noticed is the test suite appears to be about 20% faster in Python3 than in Python2.
Note: currently it is expected that 1 test(s) fail.
- test.seis.SEISL1bSGPSEAST_5min.test_SEISL1bSGPS fails because dimension subsetting has not been reimplemented after a refactor that removed the feature.
Development
Setting up a virtualenv is recommended for development.
virtualenv venv
. venv/bin/activate
pip install --editable .
Deploy to pip after running the unit tests with both Python 2 and Python 3. The git stash is important so that the build is from a clean repo! We don't want any dev or debug changes that are sitting unstaged to be included.
git stash
rm -r dist/
python setup.py bdist_wheel --universal
twine upload dist/*