
A python library containing automated data validation tools for the Atmos Data Service


atmos-validation


A library containing validation checks to be run on hindcast or measurement data to ensure API compliance and standardization.

Atmos is a project to streamline discoverability and data access for weather data sources through APIs. The first step to building a lean and efficient API is converting data to the standard format.

Data Validation

In order to ingest measurement or hindcast data into the Atmos Data Store, each source file needs to pass validation. Validation ensures that the file is compliant with the data conventions specified in Conventions. The standard file formats for source files are NetCDF files for hindcasts and ASCII or NetCDF for measurements.

Measurements by definition contain data for a single geographical location, whereas hindcasts are larger models containing time series for a multitude of locations in (possibly rotated) grids. A measurement is therefore expected to contain all of its data in a single file. A hindcast comprises a set of NetCDF files, all with the same coordinates, attributes, and variables, where each file contains data for a unique time period. Depending on the size of the files, the time separation should be either monthly or yearly. As a rule of thumb, a file that grows beyond 4 GB should be split into smaller files. See docs or example for how to name hindcast files.

To run validation on NetCDF and ASCII source files, we have built the atmos_validation CLI/library. The documentation below describes these checks, the standard format, and how to run validation using the CLI tool.

Documentation

Examples

Contributing

We welcome different types of contributions, including code, bug reports, issues, feature requests, and documentation. The preferred method of submitting a contribution is either to make an issue on GitHub or to fork the project on GitHub and make a pull request. In the case of bug reports, please provide a detailed explanation describing how to reproduce before submitting.

Conventions

Content

1 Dimensions and Coordinate Variables

  • All dimensions shall be attached to at least one variable
  • Dimensions on variables shall match the following naming conventions and sequences:
    • ["south_north", "west_east"]
    • ["Time", "south_north", "west_east"]
    • ["height_[variable_name]", "south_north", "west_east"]
    • ["Time", "height_[variable_name]", "south_north", "west_east"]
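The accepted dimension sequences above can be checked with a small lookup. The following is a minimal sketch, not the library's own check; the function name `dims_are_valid` is hypothetical, and it assumes that "[variable_name]" stands for the key of the variable the dimensions belong to.

```python
def dims_are_valid(var_key: str, dims) -> bool:
    """Return True if dims matches one of the four accepted sequences.

    Assumption: the height dimension is "height_" + the variable's own key.
    """
    height = f"height_{var_key}"
    allowed = {
        ("south_north", "west_east"),
        ("Time", "south_north", "west_east"),
        (height, "south_north", "west_east"),
        ("Time", height, "south_north", "west_east"),
    }
    return tuple(dims) in allowed


ok = dims_are_valid("WS", ["Time", "height_WS", "south_north", "west_east"])
swapped = dims_are_valid("WS", ["Time", "west_east", "south_north"])
wrong_height = dims_are_valid("WS", ["height_P", "south_north", "west_east"])
```

Here `ok` is accepted, while `swapped` and `wrong_height` are rejected because the order and the height name, respectively, do not match.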

1.1 Time

  • Shall be named exactly "Time"
  • Shall not be empty (length > 0)
  • Timestamps shall appear in increasing order
  • Shall only contain unique values
  • The "units" attribute shall be "microseconds since 1900-01-01"
  • If source data is a hindcast or single point hindcast, then:
    • The file shall be named according to the time values it contains and shall include the length of the time index, in the following format: "datasetName_startTime_endTime_TtimeLength.nc", with startTime and endTime in the format "YYYYMMDD". For example, a Nora10 file covering the whole year of 1982 and containing 2920 timestamps (a whole year at 3-hour resolution) will be named "Nora10_19820101_19821231_T2920.nc". Not adhering to the startTime and endTime convention yields a WARNING.
    • Strictly enforced: The files must be named such that they can be sorted in chronological order by timestamp values. For example, naming two files "Nora10_part1.nc" and "Nora10_part2.nc" would yield an error IF part2 contains data for a period before part1. The time-length marker "T[time_length]" is non-optional and not adhering to this convention also yields an error.
  • If source data is a measurement, then:
    • The file shall include values for the whole measurement period
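As an illustration of the Time rules above, the sketch below encodes timestamps as microseconds since 1900-01-01, checks that they are non-empty, increasing and unique, and derives the file name from them. The helper names are hypothetical and the timestamps are assumed to be naive datetimes in UTC; this is not how the library itself implements the checks.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1900, 1, 1)  # reference date from the "units" attribute


def encode_time(timestamps):
    """Convert naive (assumed UTC) datetimes to integer microseconds since EPOCH."""
    return [(t - EPOCH) // timedelta(microseconds=1) for t in timestamps]


def check_time(values):
    """Raise ValueError if the Time values are empty, unordered or duplicated."""
    if len(values) == 0:
        raise ValueError("Time shall not be empty")
    # Strictly increasing covers both "increasing order" and "unique values".
    if any(later <= earlier for earlier, later in zip(values, values[1:])):
        raise ValueError("Time values shall be strictly increasing")


def hindcast_filename(dataset_name, timestamps):
    """Build datasetName_startTime_endTime_TtimeLength.nc from the time index."""
    start, end = timestamps[0], timestamps[-1]
    return f"{dataset_name}_{start:%Y%m%d}_{end:%Y%m%d}_T{len(timestamps)}.nc"


# A whole year at 3-hour resolution: 365 * 8 = 2920 timestamps.
times = [datetime(1982, 1, 1) + timedelta(hours=3 * i) for i in range(2920)]
values = encode_time(times)
check_time(values)
name = hindcast_filename("Nora10", times)
```

With these inputs `name` reproduces the Nora10 example given above.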

1.2 Height / Depth

  • Shall be named according to the data variable it is attached to. For example, air pressure (with corresponding key "P") will be named: "height_P".
  • Depths will still be named "height_XX" but shall be signified by negative values.
  • Attribute "units" shall be "m"
  • Attribute "CF_standard_name" shall be "height"
  • Attribute "long_name" shall be "Height for parameter XX", where XX is e.g. CS, WS, WD etc.

1.3 Spatial coordinates

  • All data variables shall have "south_north" at the second-to-last index of their dimensions list, in accordance with the accepted dimensions seen in Dimensions
  • All data variables shall have "west_east" at the last index of their dimensions list, in accordance with the accepted dimensions seen in Dimensions
  • Attribute "units" shall be "degree_north" / "degree_east"
  • Attribute "CF_standard_name" shall be "latitude" / "longitude"
  • Attribute "long_name" shall be "latitude" / "longitude"

1.4 Latitude, Longitude

  • Datasets that contain data with spatial coordinates shall contain "LAT" as a coordinate variable
  • Datasets that contain data with spatial coordinates shall contain "LON" as a coordinate variable
  • "LAT" and "LON" shall both be defined by dimensions ["south_north", "west_east"]

2 Data Variables

2.1 Naming

  • Shall have valid "key", i.e. be listed in the official database document: [https://atmos.app.radix.equinor.com/config/parameters]

2.2 Attributes for data variables

  • Shall have attribute "units", "CF_standard_name" and "long_name" with values according to database document.

2.3 Measurements

If data source is "measurement", then:

  • Shall have attribute "instruments".
  • "instruments" shall be a stringified dictionary, i.e. a string that can be evaluated in Python as a dictionary, with keys given by "{instrument_type}, {instrument_specification}" to ensure uniqueness. Each key shall map to a list of numbers describing the heights at which that instrument was used. (The reason for using a stringified dict is that NetCDF does not allow objects as attribute values. Carrying information over from ASCII files to NetCDF requires some "nesting" of the information, hence the choice to use a dict with the described keys.)
  • Example value for key WS (wind speed): '{"PROPELLER ANEMOMETER, some spec": [10.0], "PULSE LIDAR, some spec": [100.0, 150.0, 200.0]}'.
  • All instrument types are validated with respect to the "allowed_instruments" entry from the database document.
  • Instrument specifications are not validated and can therefore be any string value as long as the keys in the dictionary are unique.
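A stringified dictionary of this shape can be read back safely with `ast.literal_eval` from the standard library. The sketch below is an assumption about how one might parse it, with a hypothetical function name; it does not check instrument types against the "allowed_instruments" entry.

```python
import ast


def parse_instruments(attr: str) -> dict:
    """Parse an "instruments" attribute into {(type, specification): [heights]}."""
    value = ast.literal_eval(attr)  # evaluates literals only, never arbitrary code
    if not isinstance(value, dict):
        raise ValueError("instruments shall evaluate to a dictionary")
    parsed = {}
    for key, heights in value.items():
        instrument_type, sep, spec = key.partition(", ")
        if not sep:
            raise ValueError(f"key {key!r} is not '<type>, <specification>'")
        if not isinstance(heights, list) or not all(
            isinstance(h, (int, float)) for h in heights
        ):
            raise ValueError(f"heights for {key!r} shall be a list of numbers")
        parsed[(instrument_type, spec)] = [float(h) for h in heights]
    return parsed


example = (
    '{"PROPELLER ANEMOMETER, some spec": [10.0], '
    '"PULSE LIDAR, some spec": [100.0, 150.0, 200.0]}'
)
instruments = parse_instruments(example)
```

Note that a Python dict literal silently keeps only the last of two identical keys, so uniqueness of keys has to be ensured when the string is written.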

2.4 Data integrity

  • Shall contain no values outside the range given in the database document by entries "min" and "max". E.g. "AT" (air temperature) shall have values between -50 and 50 degC.
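The range rule can be pictured as a simple filter. This is a sketch only: the function name is hypothetical, the bounds are assumed to be inclusive, and NaN is assumed to mark missing data and is therefore skipped.

```python
import math


def out_of_range(values, minimum, maximum):
    """Return the values outside [minimum, maximum]; NaN is treated as missing."""
    return [
        v for v in values
        if not math.isnan(v) and not (minimum <= v <= maximum)
    ]


# "AT" (air temperature) with the -50 to 50 degC limits quoted above.
bad = out_of_range([12.5, -60.0, float("nan"), 49.9], -50.0, 50.0)
```

Only `-60.0` is reported here; a non-empty result would mean the variable fails the check.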

3 Global attributes

General instruction: When information is not available "NA" shall be used in place.

3.1 Common

The required common attributes are shown below:

# ../atmos_validation/schemas/metadata.py#L20-L30

class CommonMetadata(BaseModel, use_enum_values=True):
    """Common required attributes for all data types"""

    comments: Union[List[str], str]
    contractor: str
    classification_level: ClassificationLevel = Field(default="Internal")
    data_type: DataType
    data_history: str
    final_reports: List[str]
    project_name: str
    qc_provider: str

comments: Any relevant comments related to how the data has been treated shall be provided. It could be basic preprocessing steps etc.

contractor: Name of data provider

classification_level: Signifies data access according to classification level

data_type: If source data is hindcast, single point hindcast, or measurement

data_history: Any information about the origin of the data (if not measured directly by the contractor) or changes made to the data (if there have been previous versions of the same data) shall be stated here. If the data has been measured/created directly by the contractor and it is the first version delivered, “Original data” shall be stated.

final_reports: A list of report file names, separated by comma, shall be provided. All stated report files shall follow the data

project_name: Name of the project requesting data

qc_provider: Company responsible for the QC. It can be different from the contractor.

where data_type should take one of the values from the enum:

# ../atmos_validation/schemas/metadata.py#L9-L12

class DataType(str, Enum):
    HINDCAST = "Hindcast"
    MEASUREMENT = "Measurement"
    SP_HINDCAST = "SinglePointHindcast"

The data_type value defines secondary requirements on the global attributes on the data file.

The classification level should take one of the values from the enum:

# ../atmos_validation/schemas/classification_level.py#L36-L39

class ClassificationLevel(OrderedEnum):
    OPEN = "Open"
    INTERNAL = "Internal"
    RESTRICTED = "Restricted"

3.2 Hindcast

Single point hindcast and hindcast both use the hindcast metadata schema.

# ../atmos_validation/schemas/metadata.py#L33-L49

class HindcastMetadata(CommonMetadata, UnprotectedNamespaceModel):
    """Extra global attributes required if data_type == "Hindcast" or data_type == "SinglePointHindcast"."""

    calibration: str
    delivery_date: str
    forcing_data: str
    memos: Union[str, List[str]]
    modelling_software: str
    model_name: str
    nests: Union[str, List[str]]
    setup: str
    spatial_resolution: Union[str, List[str]]
    sst_source: str
    task_manager_external: Union[str, List[str]]
    task_manager_internal: Union[str, List[str]]
    time_resolution: str
    topography_source: str

calibration: Indicate whether calibration is applied to the data: ‘yes’ / ‘no’

delivery_date: Date of the hindcast delivery

forcing_data: Data used as the boundary conditions

memos: Filenames of memos shall be specified

model_name: Name of the model. It shall be unique in the project

modelling_software: Software and version used in hindcast computation

nests: Nests used to create given data

setup: Setup storage place in the cold storage. Valid for internal hindcasts only. For external hindcasts ‘NA’ shall be specified

spatial_resolution: Spatial resolution in km

sst_source: Source of SST data

task_manager_external: Hindcast provider task manager

task_manager_internal: Equinor task manager handling the project

time_resolution: Temporal resolution

3.3 Measurement

# ../atmos_validation/schemas/metadata.py#L52-L65

class MeasurementMetadata(CommonMetadata):
    """Extra global attributes if data_type == "Measurement"."""

    asset: Optional[str] = Field(default=None)
    averaging_period: str
    country: str = Field(default="NA")
    data_usability: str
    instrument_types: str
    instrument_specifications: str
    installation_type: str
    location: Union[str, List[str]]
    mooring_name: str
    source_file: str
    total_water_depth: Union[str, float]

asset: Name of the asset which paid for the data. In case of sharing data to the third party, permission from the asset is required.

averaging_period: Averaging period of measurements in minutes

country: Name of the country in whose territory the data were acquired. When sharing data with a third party, the country's regulations on data sharing must be obeyed.

data_usability: Level of the data readiness

location: Latitude and longitude of the measurements in degrees (at least three decimals are required after ‘.’) and corresponding reference datum.
Format of the location: lat lon, reference

instrument_specifications: Instrument specifications for given measurement locations. Instrument specifications shall be listed in the same order as the corresponding instrument types.

installation_type: Measurement installation type.

instrument_types: Types of instruments used for the measurement location.

mooring_name: The mooring name shall be unique for each delivered measurement file (across projects, instruments, data deliveries etc) and shall be constructed as follows: project_name + mooring_name + instrument + phase, for example FOXTROT_MOOR1_GPS_Ph1. Only put single instrumentation in the name in cases where there are multiple instruments.

total_water_depth: Total water depth in meters. For wind data total water depth is NA (this parameter is not in the list, it should be included)

source_file: A reference to the original data file the NetCDF file was generated from. "NA" can be used if not applicable.
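The location format described above ("lat lon, reference", with at least three decimals) lends itself to a regular expression. The following is a minimal sketch under the assumption that latitude and longitude are plain decimal numbers separated by whitespace; the function name and the sample coordinates and datum are illustrative only.

```python
import re

# Two decimal numbers with at least three decimals, then a comma and the datum.
LOCATION_RE = re.compile(
    r"^\s*(-?\d+\.\d{3,})\s+(-?\d+\.\d{3,})\s*,\s*(\S.*?)\s*$"
)


def parse_location(text: str):
    """Split "lat lon, reference" into (latitude, longitude, reference)."""
    match = LOCATION_RE.match(text)
    if match is None:
        raise ValueError(f"location {text!r} is not 'lat lon, reference'")
    return float(match[1]), float(match[2]), match[3]


location = parse_location("61.200 2.273, WGS84")
```

A value with fewer than three decimals, such as "61.2 2.273, WGS84", is rejected by this pattern.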

To avoid ambiguous terminology, "data_usability" and "installation_type" are validated against database documents, respectively:

  • [https://atmos.app.radix.equinor.com/config/data-usability]
  • [https://atmos.app.radix.equinor.com/config/installation-types]

3.4 Extras

Extra attributes relevant for the data source can be added using snake_case.

ASCII format

The ASCII format is only applicable for measurements. Hindcasts or single point hindcasts shall be delivered in netCDF (See conventions for details).

ASCII format contains three sections: Common Metadata, Parameters Related Metadata and Data. See example for a template to follow.

Common Metadata

Common Metadata shall be given in the header lines of the timeseries. All header lines must start with a percentage symbol, %. There shall be one space between the percentage sign and the following text. There shall be at least two spaces after ‘:’ in each line of the header. In case of missing data, "NA" should be used.

The following metadata shall be included in all Metocean timeseries:

  • Contractor/Data Responsible: Name of data provider
  • Project name: Name of the project requesting data
  • Location: Latitude and longitude of the measurements in degrees (at least three decimals are required after ‘.’) and corresponding reference datum. Format of the location: "% Location: lat lon, reference"
  • Duration of data: Start and end dates of the measurements in format dd.mm.yyyy -dd.mm.yyyy (Dates shall be in UTC and agree with Start and End dates of the data).
  • Averaging period: Averaging period of measurements in minutes
  • TotalWaterDepth: Total water depth in meters. For wind data total water depth is NA
  • Mooring Name: The mooring name shall be unique for each delivered measurement file (across projects, instruments, data deliveries etc) and shall be constructed as follows: project_name + mooring_name + instrument + phase, for example FOXTROT_MOOR1_GPS_Ph1. Only put single instrumentation in the name in cases where there are multiple instruments
  • Type of instrument: Types of instruments used for the measurement location. If there are several instrument types, please, list them with comma separator. If there are no instruments available, use NA. Instrument types shall be chosen from the database list https://atmos.app.radix.equinor.com/config/instrument-types. If appropriate type is not available, please, contact CR to include it in the database
  • Specification of Instrument: Instrument specifications for given measurement locations. Instrument specifications shall be listed in the same order as the corresponding instrument types
  • Measurement installation type: Measurement installation type. One of the options listed in https://atmos.app.radix.equinor.com/config/installation-types database shall be specified
  • QC provider: Company responsible for the QC. It can be different from the contractor.
  • Data Usability Level: Level of the data readiness. One or more options from https://atmos.app.radix.equinor.com/config/data-usability shall be chosen. These metadata reflect how good the data are.
  • Data history: Any information about the origin of the data (if not measured directly by the contractor) or changes made to the data (if there has been previous versions of the same data) shall be stated here. If the data has been measured directly by the contractor and it is the first version delivered “Original data” shall be stated.
  • Missing data: Element reflecting missing data in the measurements shall be stated. NaN is preferred.
  • Final Reports: A list of report file names, separated by comma, shall be provided. All stated report files shall follow the data
  • Data type: word ‘Measurement’ shall be stated in this field
  • Classification level: Signifies data access according to classification level: Open, Internal or Restricted
  • Comments: Any relevant comments related to how the data has been treated shall be provided. It could be basic preprocessing steps etc
  • Asset: Name of the asset which paid for the data. In case of sharing data to the third party, permission from the asset is required. NA can be used if asset is not applicable
  • Country: Name of the country in whose territory the data were acquired. When sharing data with a third party, the country's regulations on data sharing must be obeyed
  • Auxiliary information: Other information about measurements shall be specified in the section below the line % Comments. The format shall be the same as in the section above: each line starts with a % sign and the metadata name shall be separated from the content by ‘:’. The metadata names in this section can be specified by the data provider.
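The header rules (a leading "% ", then a name, a colon, and at least two spaces before the value) can be sketched as a line parser. This is an illustration rather than the library's parser; the function name and the sample header values are hypothetical, and lines that break the spacing rule are simply collected instead of raising.

```python
import re

# "% " + name + ":" + two or more spaces + value
HEADER_RE = re.compile(r"^% (?P<name>[^:]+):\s{2,}(?P<value>.*\S)\s*$")


def parse_header(lines):
    """Return (metadata, rejected) for a list of header lines."""
    metadata, rejected = {}, []
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            metadata[match["name"].strip()] = match["value"]
        else:
            rejected.append(line)
    return metadata, rejected


sample = [
    "% Contractor/Data Responsible:  Example Contractor",
    "% Project name:  FOXTROT",
    "% Location:  61.200 2.273, WGS84",
    "% Missing data:  NaN",
    "% Asset: NA",  # only one space after ':' so it is rejected
]
metadata, rejected = parse_header(sample)
```

The "Missing data" entry recovered this way is the marker to look for when reading the data columns.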

Parameter Related Metadata

Parameter Related Metadata shall be given in a table. Each line of the table shall start with a % sign. The table shall list all the measurement parameters and their metadata, so the number of rows in this table should equal the number of data columns in the Data section. The metadata of each parameter, except time quantities, shall contain the following attributes, indicated in the header of the table (see ASCII example):

  • Parameters: Description of the measured quantity.
  • Abbrev: Short name of the measured quantity. This name will be used in the data table as a header. Therefore, each abbreviation shall be unique in the entire file. Abbrev for the measured parameters shall be constructed as the key from the database with standardized parameter names (see Section 5.3) and absolute value of height/depth. For example, WS100, WD100, CS10, CD10 etc.
  • Unit: The units of all the corresponding measurements. All units shall follow the units given in Section 5.3 for the corresponding parameter.
  • Height: Height of the corresponding measurement in m (number only) above/below MSL. For currents, sea temperature etc. the depth below mean sea level shall be given with the sign ‘-’.
  • Base: Key from the database with standardized parameter names (see https://atmos.app.radix.equinor.com/config/parameters)
  • Instrument: Instrument type used to measure given parameter at given height. If given parameter is the merge of measurements from several instrument types, please, list them with comma separator. If there are no instruments available, use NA. Instrument types shall be chosen from the list in Section 5.4.
  • InstrSpec: Instrument specifications for given quantity. Instrument specifications shall be listed in the same order as the corresponding instrument types.

First 5 rows of the table shall contain time quantities, where only the columns Parameters and Abbrev are filled. Time quantities shall be represented as Year, Month, Day, Hour, Minute in Parameters with corresponding abbreviations YY, MM, DD, HH and Min in Abbrev column.
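The Abbrev rule (database key followed by the absolute value of the height or depth) can be written as a one-line helper. This is a sketch with a hypothetical function name; how non-integer heights should be rendered is not stated above, so the compact "g" number format used here is an assumption.

```python
def make_abbrev(base_key: str, height: float) -> str:
    """Join the parameter key and the absolute height, e.g. WS + 100 -> WS100."""
    return f"{base_key}{abs(height):g}"


wind = make_abbrev("WS", 100.0)
current = make_abbrev("CS", -10.0)  # depths are negative; the sign is dropped
```

Because the sign is dropped, a parameter measured at both +10 m and -10 m would produce the same abbreviation, which the uniqueness requirement would then flag.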

Data

The measured data shall be provided in columns below the metadata. In the first row of the data section the column names shall be given. The number of columns with data shall be the same as the number of parameters listed in Parameter Related Metadata section. The first 5 columns shall provide the time as YY (year), MM (month), DD (day), HH (hour), Min (minute). The following columns shall be given names corresponding to the Abbrev name from the header. All parameters shall be in the units specified in the header. Missing values shall be represented by the value given in the header line Missing data.
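Reading the data section then amounts to combining the five time columns into a timestamp and mapping the remaining columns to their Abbrev names. The sketch below assumes whitespace-separated columns, a four-digit year in the YY column, and UTC timestamps; the function name and sample rows are hypothetical.

```python
from datetime import datetime, timezone

TIME_COLUMNS = ["YY", "MM", "DD", "HH", "Min"]


def parse_data_section(header_row, rows, missing="NaN"):
    """Return a list of (timestamp, {abbrev: value or None}) records."""
    names = header_row.split()
    if names[:5] != TIME_COLUMNS:
        raise ValueError("the first 5 columns shall be YY MM DD HH Min")
    records = []
    for row in rows:
        fields = row.split()
        if len(fields) != len(names):
            raise ValueError(f"expected {len(names)} columns, got {len(fields)}")
        year, month, day, hour, minute = (int(f) for f in fields[:5])
        timestamp = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
        # Values equal to the "Missing data" marker become None.
        values = {
            name: None if field == missing else float(field)
            for name, field in zip(names[5:], fields[5:])
        }
        records.append((timestamp, values))
    return records


records = parse_data_section(
    "YY MM DD HH Min WS100 WD100",
    ["2020 1 1 0 10 8.5 270.0", "2020 1 1 0 20 NaN 265.0"],
)
```

The `missing` argument would normally be taken from the "Missing data" header line.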

Running CLI

Installation

Prerequisites: Python >=3.8.

Run in your preferred environment:

pip install atmos_validation

Or, if using poetry as package manager, replace pip install with poetry add.

If using conda, run before the pip install command above:

conda install git pip

Run

After installing, run python -m atmos_validation to see docstring for available commands and options.

Example usage using the example datasets (need to clone/download repository and run from root for this to work):

  • Validating hindcast NetCDF format: python -m atmos_validation validate-netcdf examples/hindcast_example
  • Validating measurement NetCDF format: python -m atmos_validation validate-netcdf examples/example_netcdf_measurement.nc
  • Validating measurement ascii format: python -m atmos_validation validate-ascii examples/example_ascii_measurement.dat
  • Convert an ascii file to NetCDF: python -m atmos_validation convert-ascii examples/example_ascii_measurement.dat

All commands can be run without arguments to trigger docstring output to list args and options documentation.
