Generate non-overlapping DART obs_seq files from pluggable observation sources.

dartobsgen

A pip-installable Python package that generates non-overlapping DART obs_seq files from pluggable observation data sources.

Install

cd /path/to/dartobsgen
pip install -e .

Quick Start

import datetime
from dartobsgen import ObsGenConfig, CrocLakeSource, generate_obs_sequences

config = ObsGenConfig(
    start=datetime.datetime(2010, 5, 1),
    end=datetime.datetime(2010, 5, 3),
    lat_min=5,   lat_max=60,
    lon_min=-100, lon_max=-30,
    obs_types=["ARGO_TEMPERATURE", "ARGO_SALINITY"],
    assimilation_frequency=datetime.timedelta(hours=6),
    output_dir="./obs_output",
)

source = CrocLakeSource(
    crocolake_path="/path/to/crocolake/",
    dart_path="/path/to/DART/",
)

# Sequential (single worker)
written_files = generate_obs_sequences(config, source, max_workers=1)

# Parallel (all CPUs)
written_files = generate_obs_sequences(config, source, max_workers=None)

# Parallel (fixed number of workers)
written_files = generate_obs_sequences(config, source, max_workers=4)

print(written_files)

Package Structure

dartobsgen/
├── pyproject.toml
├── README.md
└── src/
    └── dartobsgen/
        ├── __init__.py           # Public API
        ├── config.py             # ObsGenConfig dataclass
        ├── generate.py           # generate_obs_sequences(), _make_windows()
        └── sources/
            ├── __init__.py
            ├── base.py           # DataSource ABC + ObsSeqSource stub
            └── crocolake.py      # CrocLakeSource + DEFAULT_OBS_TYPE_MAP

Output file naming

Files are named {output_prefix}.{timestamp}.out where the timestamp is formatted using output_timestamp_format (default: "%Y-%m-%d-{S}").

The special token {S} is replaced with the seconds-of-day of the window start (0–86399, zero-padded to 5 digits), matching DART's standard obs_seq naming convention. All other tokens follow Python strftime format.

Window start        Default filename
2010-05-01 00:00    obs_seq.2010-05-01-00000.out
2010-05-01 06:00    obs_seq.2010-05-01-21600.out
2010-05-01 12:00    obs_seq.2010-05-01-43200.out
2010-05-01 18:00    obs_seq.2010-05-01-64800.out

To use a custom format (e.g. DART's compact YYYYMMDDHH):

config = ObsGenConfig(..., output_timestamp_format="%Y%m%d%H")
# produces: obs_seq.2010050100.out, obs_seq.2010050106.out, ...
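
As a rough illustration of the naming rule described in this section, the sketch below expands {S} and then applies strftime. The helper name format_obs_seq_name and its parameters are hypothetical and not part of dartobsgen; the default prefix "obs_seq" is inferred from the filenames listed above.

```python
import datetime


def format_obs_seq_name(window_start, prefix="obs_seq", fmt="%Y-%m-%d-{S}"):
    """Build an obs_seq filename for a window start (illustrative helper)."""
    # Seconds elapsed since midnight of the window's start date.
    seconds_of_day = (
        window_start.hour * 3600 + window_start.minute * 60 + window_start.second
    )
    # Expand the special {S} token first (zero-padded to 5 digits),
    # then let strftime handle the remaining standard tokens.
    expanded = fmt.replace("{S}", f"{seconds_of_day:05d}")
    return f"{prefix}.{window_start.strftime(expanded)}.out"


print(format_obs_seq_name(datetime.datetime(2010, 5, 1, 6, 0)))
# obs_seq.2010-05-01-21600.out
print(format_obs_seq_name(datetime.datetime(2010, 5, 1, 6, 0), fmt="%Y%m%d%H"))
# obs_seq.2010050106.out
```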

Observation types

obs_types accepts three styles, which can be freely mixed:

Style                 Example               Meaning
DART compound name    "ARGO_TEMPERATURE"    TEMP from ARGO only
DART variable name    "TEMPERATURE"         TEMP from all sources
CrocoLake var name    "TEMP"                TEMP from all sources

Supported obs types

DART compound name         CrocoLake var     DB source
ARGO_TEMPERATURE           TEMP              ARGO
ARGO_SALINITY              PSAL              ARGO
ARGO_OXYGEN                DOXY              ARGO
BOTTLE_TEMPERATURE         TEMP              GLODAP
BOTTLE_SALINITY            PSAL              GLODAP
BOTTLE_OXYGEN              DOXY              GLODAP
BOTTLE_ALKALINITY          TOT_ALKALINITY    GLODAP
BOTTLE_INORGANIC_CARBON    TCO2              GLODAP
BOTTLE_NITRATE             NITRATE           GLODAP
BOTTLE_SILICATE            SILICATE          GLODAP
BOTTLE_PHOSPHATE           PHOSPHATE         GLODAP
GLIDER_TEMPERATURE         TEMP              SprayGliders
GLIDER_SALINITY            PSAL              SprayGliders
TEMPERATURE                TEMP              all
SALINITY                   PSAL              all
OXYGEN                     DOXY              all

Pass a custom obs_type_map dict to ObsGenConfig to override or extend:

my_map = {
    "MY_CUSTOM_TEMP": {"crocolake_var": "TEMP", "db_name": "MyDB"},
}
config = ObsGenConfig(..., obs_type_map=my_map)
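
To make the three naming styles and the override behaviour more concrete, here is a minimal sketch of how a name might be resolved to a (CrocoLake variable, DB source) pair. It is not the package's implementation: EXAMPLE_MAP, merge_maps and resolve_obs_type are hypothetical names, the use of None to mean "all sources" is an assumption, and so is the idea that a custom map is merged over the defaults.

```python
# Hypothetical subset of a default map, using the entry layout shown above.
# None as db_name stands for "all sources" (an assumption of this sketch).
EXAMPLE_MAP = {
    "ARGO_TEMPERATURE": {"crocolake_var": "TEMP", "db_name": "ARGO"},
    "BOTTLE_TEMPERATURE": {"crocolake_var": "TEMP", "db_name": "GLODAP"},
    "ARGO_SALINITY": {"crocolake_var": "PSAL", "db_name": "ARGO"},
    "TEMPERATURE": {"crocolake_var": "TEMP", "db_name": None},
}


def merge_maps(default_map, custom_map):
    """Entries in custom_map override or extend those in default_map."""
    return {**default_map, **custom_map}


def resolve_obs_type(name, type_map):
    """Map a user-supplied name to (crocolake_var, db_name)."""
    # Styles 1 and 2: DART compound or variable names are direct keys.
    if name in type_map:
        entry = type_map[name]
        return entry["crocolake_var"], entry["db_name"]
    # Style 3: a bare CrocoLake variable name selects that variable
    # from every source.
    known_vars = {entry["crocolake_var"] for entry in type_map.values()}
    if name in known_vars:
        return name, None
    raise ValueError(f"unknown observation type: {name!r}")


my_map = {"MY_CUSTOM_TEMP": {"crocolake_var": "TEMP", "db_name": "MyDB"}}
type_map = merge_maps(EXAMPLE_MAP, my_map)

print(resolve_obs_type("ARGO_TEMPERATURE", type_map))  # ('TEMP', 'ARGO')
print(resolve_obs_type("TEMP", type_map))              # ('TEMP', None)
print(resolve_obs_type("MY_CUSTOM_TEMP", type_map))    # ('TEMP', 'MyDB')
```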

Time windows

Windows are half-open: [t0, t0 + freq). Adjacent windows share no observations. The last window may extend beyond end to keep all window widths uniform.
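
The windowing rule above can be sketched as follows. This is a simplified stand-in, not the package's _make_windows(): the names make_windows and in_window are hypothetical, and the sketch assumes that no window starts at or after end.

```python
import datetime


def make_windows(start, end, freq):
    """Return (t0, t1) pairs for uniform half-open windows [t0, t1)."""
    if freq <= datetime.timedelta(0):
        raise ValueError("freq must be a positive timedelta")
    windows = []
    t0 = start
    while t0 < end:
        # Every window has the same width, so the last one may run past end.
        windows.append((t0, t0 + freq))
        t0 += freq
    return windows


def in_window(obs_time, window):
    """Half-open membership test: the start is included, the end is not."""
    t0, t1 = window
    return t0 <= obs_time < t1


windows = make_windows(
    datetime.datetime(2010, 5, 1, 0),
    datetime.datetime(2010, 5, 1, 10),
    datetime.timedelta(hours=6),
)
# Two windows: 00:00-06:00 and 06:00-12:00; the second extends beyond 10:00.
boundary = datetime.datetime(2010, 5, 1, 6)
print([in_window(boundary, w) for w in windows])  # [False, True]
```

An observation stamped exactly on a boundary belongs only to the later window, which is why adjacent windows never share observations.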

assimilation_frequency accepts any datetime.timedelta, so sub-hourly windows are fully supported:

import datetime
from dartobsgen import ObsGenConfig

# 6-hour windows (default)
config = ObsGenConfig(..., assimilation_frequency=datetime.timedelta(hours=6))

# 30-minute windows
config = ObsGenConfig(..., assimilation_frequency=datetime.timedelta(minutes=30))

Parallel generation

generate_obs_sequences runs windows in parallel using concurrent.futures.ProcessPoolExecutor. Control parallelism with the max_workers argument:

# All available CPUs (default)
written = generate_obs_sequences(config, source)

# Fixed number of worker processes
written = generate_obs_sequences(config, source, max_workers=4)

# Sequential (useful for debugging)
written = generate_obs_sequences(config, source, max_workers=1)

Each worker process independently opens the CrocoLake parquet database and writes its own output file, so there are no shared-state conflicts.

Note: scripts that call generate_obs_sequences with max_workers != 1 must make the call under an if __name__ == "__main__": guard. This is the standard Python multiprocessing requirement on platforms that start worker processes with the spawn method, such as macOS and Windows.
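
A minimal sketch of the guard pattern, using only the standard library so that it runs without dartobsgen installed. The process_window function is a hypothetical stand-in for the per-window work; in a real script the body of main() would instead build the ObsGenConfig and source and call generate_obs_sequences.

```python
import concurrent.futures


def process_window(index):
    """Stand-in for per-window work; returns a placeholder filename."""
    return f"window_{index:02d}.out"


def main(max_workers=2):
    # Worker processes are created only inside main(). With the spawn
    # start method each worker re-imports this module, so nothing at
    # module level may start new processes.
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_window, range(4)))


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when a
    # worker process imports it.
    print(main())
```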

Adding a new data source

Subclass dartobsgen.DataSource and implement write_obs_seq():

from dartobsgen import DataSource

class MySource(DataSource):
    def write_obs_seq(self, output_file, date0, date1,
                      lat_min, lat_max, lon_min, lon_max,
                      obs_types, obs_type_map) -> bool:
        # fetch data, write output_file, return True if written
        ...

ObsSeqSource in dartobsgen.sources.base is a pre-wired stub for a future data source backed by a bank of existing obs_seq files.
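
For orientation, here is a self-contained sketch of what a simple source might look like. It defines its own stand-in DataSource ABC that mirrors only the signature shown above, so it runs without dartobsgen installed. InMemorySource, its record layout, and the comma-separated output are all hypothetical; in particular, the file it writes is placeholder text, not the DART obs_seq format. Treating the time range as half-open and the bounding box as inclusive is also an assumption.

```python
import abc
import datetime


class DataSource(abc.ABC):
    """Stand-in for dartobsgen.DataSource (signature only)."""

    @abc.abstractmethod
    def write_obs_seq(self, output_file, date0, date1,
                      lat_min, lat_max, lon_min, lon_max,
                      obs_types, obs_type_map) -> bool:
        ...


class InMemorySource(DataSource):
    """Hypothetical source backed by a list of record dicts."""

    def __init__(self, records):
        # Each record: {"time", "lat", "lon", "obs_type", "value"}.
        self.records = records

    def write_obs_seq(self, output_file, date0, date1,
                      lat_min, lat_max, lon_min, lon_max,
                      obs_types, obs_type_map) -> bool:
        # Keep records inside the half-open time window and the bounding box.
        selected = [
            r for r in self.records
            if date0 <= r["time"] < date1
            and lat_min <= r["lat"] <= lat_max
            and lon_min <= r["lon"] <= lon_max
            and r["obs_type"] in obs_types
        ]
        if not selected:
            return False  # nothing to write for this window
        with open(output_file, "w") as fh:
            for r in selected:
                # Placeholder text output, NOT the DART obs_seq format.
                fh.write(f'{r["time"].isoformat()},{r["lat"]},{r["lon"]},'
                         f'{r["obs_type"]},{r["value"]}\n')
        return True
```

Returning False when no observations fall in the window lets the caller distinguish windows that produced a file from those that did not.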

