A package to spatially join and concatenate Satellite data into time series.

Satellite Data Preprocessor

A package for preprocessing netCDF, netCDF4, GeoJSON, or atomic CSV files. It uses spatial joins and concatenation to create time-series NumPy and CSV datasets.

Keyword Arguments

| Kwarg | Description |
| --- | --- |
| extra_config | Path to the configuration file used for loading netCDF (.nc) or CSV files. Provide this only when loading netCDF or CSV files. |
| record_out | A function that generates a label from a file name. |
| region_out | A function that generates a label from a region file name. |
| suffix_out | A function that returns a string to be appended to the output file. |
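
The callback kwargs are ordinary Python callables. As a minimal sketch, assuming record_out and region_out each receive a file name as a string and return a string label, and that suffix_out takes no arguments (as the lambda in the usage example does), they might look like this. The function names and the labelling rules are hypothetical and are not part of mlossp.

```python
from pathlib import Path


def record_label(file_name: str) -> str:
    """Label a data file by its name without directory or extension."""
    return Path(file_name).stem


def region_label(file_name: str) -> str:
    """Label a region file by its second underscore-separated token.

    For a name like gadm41_LKA_1.json this is the country code.
    """
    parts = Path(file_name).stem.split("_")
    return parts[1] if len(parts) > 1 else parts[0]


def output_suffix() -> str:
    """Return a fixed suffix to append to the output file name."""
    return "_1"
```

Under these assumptions the functions would be passed as record_out=record_label, region_out=region_label, and suffix_out=output_suffix.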

Usage:

  • Define a directory structure and create the required configuration files. (see Configuration Files)
  • Create an instance of Preprocessor and provide the path to a config json file along with keyword arguments.
  • Call Preprocessor.preprocess to preprocess data according to the provided config.
  • To define configurations for different data sources, use Preprocessor.preprocess_multi. config_list[i] and kwargs_list[i] apply to the i-th data source.
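
import mlossp as sp

# Main config (see Configuration Files) and the extra config for netCDF/CSV loading.
config_path = "config.json"
extra_config_path = "nc_config.json"

# suffix_out callback: returns the string appended to the output file.
f = lambda: "_1"

# Preprocess a single data source.
preprocessor = sp.Preprocessor(
    config_path,
    extra_config=extra_config_path,
    suffix_out=f
)
preprocessor.preprocess()

# Preprocess several data sources: config_list[i] is paired with kwargs_list[i].
config_list = ["config_1.json", "../configs/config_2.json"]
kwargs_list = [{"extra_config": extra_config_path, "suffix_out": f}, {}]
preprocessor.preprocess_multi(config_list, kwargs_list)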
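
Because the two lists passed to preprocess_multi are matched by position, a length mismatch is an easy mistake to make. The helper below is a hypothetical pre-flight check, not part of mlossp, that pairs each config path with its kwargs and fails early if the lists differ in length.

```python
def pair_sources(config_list, kwargs_list):
    """Pair each config path with the kwargs dict at the same index."""
    if len(config_list) != len(kwargs_list):
        raise ValueError(
            f"{len(config_list)} configs but {len(kwargs_list)} kwargs dicts"
        )
    return list(zip(config_list, kwargs_list))


pairs = pair_sources(
    ["config_1.json", "../configs/config_2.json"],
    [{"extra_config": "nc_config.json"}, {}],
)
```

Use an empty dict for a data source that needs no keyword arguments, as in the second entry above.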

Configuration Files

Every category of data (that is, the data within one directory) should have its own JSON configuration file. The following keys should be defined in the configuration file:

| Key | Description | Optional |
| --- | --- | --- |
| regions_dir | Path to the directory where the geographical data is stored. | No |
| data_dir | Path to the directory where the satellite data is stored. | No |
| out_dir | Path to the directory where the output files will be stored. | No |
| crs | A coordinate reference system supported by geopandas. | Yes |
| regions_file_map | A mapping from each region name to its file path. | No |
| selected_regions | The list of regions to extract from the satellite data. Each must be a key in regions_file_map. | No |
| join_on | The feature on which duplicate data entries within a timestamp are aggregated. | No |
| joins | The aggregate metrics to store from the aggregation on join_on. | No |
| file_extension | The format of all data being layered. One of *.pkl, *.json, *.geojson, *.nc4, or *.nc. | No |
| compress_to | Path to the directory where visualization files will be stored. | Yes |
| chunk | The number of files aggregated together into one chunk. For example, 365 files with a chunk of 7 result in 53 chunks. | Yes |
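
The chunk arithmetic can be checked with a short sketch. It assumes files are grouped in order and that leftover files form a final, smaller chunk, which is consistent with 365 files and a chunk of 7 giving 53 chunks. The helper and the file names are hypothetical, and this is not necessarily how mlossp groups files internally.

```python
def chunk_files(files, chunk):
    """Split a list of files into consecutive groups of at most `chunk` items."""
    if chunk < 1:
        raise ValueError("chunk must be a positive integer")
    return [files[i:i + chunk] for i in range(0, len(files), chunk)]


# 365 made-up daily files grouped by 7: 52 full chunks plus one leftover file.
files = [f"day_{i:03d}.nc" for i in range(365)]
groups = chunk_files(files, 7)
```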

Given the following directory structure:

- Data
    - Visualizations
    - EO
        - data1.nc
        - data2.nc
    - EO_OUT
    - Countries
        - gadm41_LKA_1.json
        - gadm41_USA_1.json
- preprocesser.py

And region and satellite data with the following features:

Region: NAME, NAME_1, NAME_2, POINT, ...
Satellite Data: longitude, latitude, precipitationCal, ...

A sample configuration would be defined as follows:

{
  "compress_to": "Data/Visualizations",
  "crs": "EPSG:4326",
  "regions_dir": "Data/Countries",
  "data_dir": "Data/EO",
  "out_dir": "Data/EO_OUT",
  "regions_file_map":  {
    "Sri Lanka": "gadm41_LKA_1.json",
    "USA": "gadm41_USA_1.json"
  },
  "selected_regions": ["Sri Lanka"],
  "join_on": "NAME_1",
  "joins": {
    "precipitationCal": ["mean", "min", "max"]
  },
  "file_extension": "*.nc",
  "chunk": 5
}
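
To make join_on and joins concrete, the sketch below aggregates a few made-up rows in plain Python: rows sharing the same NAME_1 value are grouped, and the mean, min, and max of precipitationCal are computed per group. It only illustrates the idea behind the two keys; the helper, the sample values, and the output key names are assumptions, not the package's implementation or output format.

```python
from collections import defaultdict
from statistics import mean

# Metric names as they appear in the "joins" config entry.
AGGREGATORS = {"mean": mean, "min": min, "max": max}


def aggregate(rows, join_on, joins):
    """Group rows by `join_on` and compute the requested metrics per group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[join_on]].append(row)

    result = {}
    for key, members in grouped.items():
        result[key] = {
            f"{column}_{metric}": AGGREGATORS[metric]([m[column] for m in members])
            for column, metrics in joins.items()
            for metric in metrics
        }
    return result


# Made-up rows for one timestamp after the spatial join.
rows = [
    {"NAME_1": "Colombo", "precipitationCal": 2.0},
    {"NAME_1": "Colombo", "precipitationCal": 4.0},
    {"NAME_1": "Kandy", "precipitationCal": 1.0},
]
summary = aggregate(rows, "NAME_1", {"precipitationCal": ["mean", "min", "max"]})
```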

Preprocessing netCDF data

To use this package with netCDF data, a separate configuration must be provided for each data source. A sample configuration is shown below.

{
	"latVar":"latitude",
	"lonVar":"longitude",
	"is360":false,
	"extraVars": ["NDVI"]
}

Configuration Keys:

| Key | Description |
| --- | --- |
| latVar | The name of the latitude variable. |
| lonVar | The name of the longitude variable. |
| is360 | Set to true if the netCDF data uses the 0 to 360 degree range. |
| extraVars | The features to retrieve from the netCDF file. |
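
The is360 flag exists because some datasets store longitude in the 0 to 360 degree range rather than -180 to 180. A common way to convert between the two conventions is sketched below. Whether mlossp applies exactly this formula is an assumption, and the function name is hypothetical.

```python
def to_signed_longitude(lon: float) -> float:
    """Map a longitude in [0, 360) to the [-180, 180) convention."""
    return ((lon + 180.0) % 360.0) - 180.0
```

Values in the eastern hemisphere are unchanged (90 stays 90), while values above 180 wrap to negative longitudes (190 becomes -170).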

Preprocessing CSV data

To use this package to preprocess CSV data, a configuration file must be defined for each data source. The CSV data provided is aggregated across the given time period, so information at finer temporal resolutions is lost.

The number of CSV files provided determines the final number of timestamps. If you would like to atomicize your CSV files, see Formatting CSVs.

{
	"latVar":"Lattitude",
	"lonVar":"Longitude",
	"is360":false,
	"filter": [["Disease Name", "Dengue Fever", "e"], ["Location Name", "SRILANKA", "n"], ["Location Name", "Kalmune", "n"]],
	"extraVars": ["Cases", "Location Name"]
}

Configuration Keys:

| Key | Description |
| --- | --- |
| latVar | The name of the latitude column. |
| lonVar | The name of the longitude column. |
| is360 | Set to true if the longitude and latitude are measured on a 0 to 360 degree scale. |
| filter | A list of 3-element arrays of the form [column name, target value, mode], where mode is "e" (equals) or "n" (not equals). Rows are excluded according to whether their value in the named column equals or does not equal the target value. |
| extraVars | The features to keep in the output. |
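
The filter entries can be read in more than one way, so the sketch below states its interpretation explicitly: each entry is treated as a condition a row must satisfy to be kept, with "e" requiring the column to equal the target and "n" requiring it to differ. This is an assumption, chosen because it makes the sample filter keep Dengue Fever rows outside SRILANKA and Kalmune, and the package may behave differently. The helper and the rows are hypothetical.

```python
def keep_row(row, filters):
    """Return True if the row satisfies every [column, target, mode] entry."""
    for column, target, mode in filters:
        matches = row[column] == target
        if mode == "e":
            if not matches:
                return False  # "e": value must equal the target
        elif mode == "n":
            if matches:
                return False  # "n": value must differ from the target
        else:
            raise ValueError(f"unknown filter mode: {mode!r}")
    return True


filters = [
    ["Disease Name", "Dengue Fever", "e"],
    ["Location Name", "SRILANKA", "n"],
    ["Location Name", "Kalmune", "n"],
]

# Made-up rows for illustration only.
rows = [
    {"Disease Name": "Dengue Fever", "Location Name": "Colombo", "Cases": 12},
    {"Disease Name": "Dengue Fever", "Location Name": "SRILANKA", "Cases": 40},
    {"Disease Name": "Malaria", "Location Name": "Colombo", "Cases": 3},
]
kept = [row for row in rows if keep_row(row, filters)]
```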

Formatting CSVs

CSVs are inherently designed to support tabular, 2D data. Hence, storing higher-dimensional data such as tensors and ndarrays as CSVs results in inconsistent formats.

mlossp.formatters parses one 3D CSV file, or a list of them, into a 3D numpy tensor. The tensor is then formatted and written to a .npy file and a .csv file that align_csv and align_npy support. mlossp.formatters also supports vertical alignment of pandas DataFrames and CSVs. These must be sliced manually so that they have an equal number of timestamps.
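
The manual slicing step can be as simple as truncating every dataset to the length of the shortest one. The sketch below does this in plain Python, assuming each dataset is a sequence ordered by timestamp and that dropping trailing entries is acceptable. The helper is hypothetical and is not part of mlossp.formatters.

```python
def trim_to_common_length(*series):
    """Truncate every sequence to the length of the shortest one."""
    if not series:
        return []
    shortest = min(len(s) for s in series)
    return [list(s[:shortest]) for s in series]


# Two made-up series with 4 and 3 timestamps are both cut to 3.
trimmed = trim_to_common_length([1, 2, 3, 4], [10, 20, 30])
```

The same slicing works on pandas DataFrames with df.iloc[:shortest] when each row is one timestamp.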

Citations

This package incorporates modified code from https://github.com/podaac/netcdf_to_geojson_vectors, which is licensed under Apache 2.0. You may obtain a copy of the license at: http://www.apache.org/licenses/LICENSE-2.0.
