
Process oceanographic measurements from USVs and write partitioned GeoParquet datasets with geospatial metadata.


Oceanstream

This project processes oceanographic measurements collected from Unmanned Surface Vehicles (USVs). The pipeline reads CSV files from a specified directory, consolidates the data, and writes it as a GeoParquet dataset partitioned by latitude and longitude bins. Optionally, it can upload the resulting GeoParquet files to Azure Blob Storage.
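
The read-and-consolidate step can be pictured with a small standard-library sketch. This is not Oceanstream's implementation: the function name `consolidate_csvs` and the `source_file` column are hypothetical, chosen only for illustration, and the real pipeline presumably relies on the pandas/geopandas stack listed under the geotrack extra.

```python
import csv
from pathlib import Path


def consolidate_csvs(input_dir):
    """Read every *.csv file in input_dir and return all rows as one list of dicts.

    Hypothetical helper for illustration; not part of the oceanstream API.
    """
    rows = []
    # Sort the paths so the row order is deterministic across platforms.
    for csv_path in sorted(Path(input_dir).glob("*.csv")):
        with csv_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                # Record which file each row came from (assumed column name).
                row["source_file"] = csv_path.name
                rows.append(row)
    return rows
```

All values stay as strings here; type conversion and geometry creation would happen in later steps.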

Project Structure

oceanstream
├── src
│   ├── app.py                # Main entry point for the application
│   ├── cli.py                # Command-line interface for running the pipeline
│   ├── config                # Configuration settings
│   │   ├── __init__.py
│   │   └── settings.py
│   ├── pipeline              # Data processing pipeline
│   │   ├── __init__.py
│   │   ├── csv_reader.py     # Functions for reading CSV files
│   │   ├── binning.py        # Functions for data partitioning
│   │   └── geoparquet_writer.py # Functions for writing geoparquet files
│   ├── storage               # Storage handling
│   │   ├── __init__.py
│   │   ├── local.py          # Local storage functions
│   │   └── azure_blob.py     # Azure Blob Storage functions
│   └── types                 # Data models and types
│       ├── __init__.py
│       └── models.py
├── data
│   └── raw_data              # Directory for raw CSV data
│       └── .gitkeep
├── tests                     # Unit tests for the application
│   ├── __init__.py
│   └── test_pipeline.py
├── .env.example              # Template for environment variables
├── .gitignore                # Git ignore file
├── .python-version           # Python version specification
├── pyproject.toml            # Project dependencies and configuration
└── README.md                 # Project documentation

Setup Instructions

  1. Clone the Repository:

    git clone <repository-url>
    cd oceanstream
    
  2. Create a Virtual Environment:

    python3.12 -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
    
  3. Install Dependencies:

    Oceanstream is organized into optional processing modules. Install only what you need:

    # Install core + geotrack processing (GPS/navigation data → GeoParquet)
    pip install -e ".[geotrack]"
    
    # Install core + echodata processing (echosounder data → Zarr)
    pip install -e ".[echodata]"
    
    # Install all processing modules
    pip install -e ".[all]"
    
    # Install for development (includes all modules + dev tools)
    pip install -e ".[all]" -r requirements-dev.txt
    

    Available extras:

    • geotrack - GPS/navigation track processing (pandas, geopandas, shapely)
    • echodata - Echosounder data processing (echopype, xarray, zarr, netcdf4)
    • multibeam - Multibeam sonar processing (planned)
    • adcp - ADCP current profiler processing (planned)
    • all - All processing modules
    • geo - Legacy alias for geotrack
  4. Configure Environment Variables: Copy .env.example to .env and fill in the necessary configuration values.

  5. Run Processing Commands:

    Oceanstream provides separate commands for each data type:

    # Process geotrack data (CSV → GeoParquet)
    oceanstream process geotrack --input-dir raw_data --output-dir out/geoparquet -v
    
    # Process echosounder data (planned - requires echodata extra)
    oceanstream process echodata --input-dir raw_echodata --output-dir out/echodata -v
    
    # Process multibeam data (planned - requires multibeam extra)
    oceanstream process multibeam --input-dir raw_multibeam --output-dir out/multibeam -v
    
    # Process ADCP data (planned - requires adcp extra)
    oceanstream process adcp --input-dir raw_adcp --output-dir out/adcp -v
    
    # List available data providers
    oceanstream providers
    

    All processing commands support --provider flag to specify the data source:

    oceanstream process --provider saildrone geotrack --input-dir data -v
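
For step 4, it may help to see how a .env file of KEY=VALUE lines ends up in the process environment. The sketch below is a minimal standard-library parser written for illustration; the helper name `load_env_file` is hypothetical, and which variables Oceanstream actually reads is defined by .env.example, not by this example.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines and copy them into os.environ.

    Hypothetical helper for illustration; returns the parsed values as a dict.
    """
    loaded = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    os.environ.update(loaded)
    return loaded
```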
    

Usage

Geotrack Processing (GPS/Navigation Data)

CLI usage examples:

# Process sample fixture data bundled with tests
oceanstream process geotrack --input-dir oceanstream/tests/data/raw_data --output-dir out/geoparquet -v

# Process your raw_data directory at repo root (default input-dir is ./raw_data)
oceanstream process geotrack --output-dir out/geoparquet -v

# Dry run to see what would be processed
oceanstream process geotrack --input-dir raw_data --dry-run -v

# List available columns in the data
oceanstream process geotrack --input-dir raw_data --list-columns

The CLI reads the CSV files, automatically derives coarse 5° latitude/longitude bins, and writes a partitioned GeoParquet dataset with geospatial metadata. Pass -v to see progress logs.
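
As a rough illustration of 5° binning, the sketch below maps a position to a partition directory. It assumes each bin is labelled by its lower edge (floor to the nearest multiple of 5), which is consistent with the lat_bin=30 / lon_bin=-120 values used in the DuckDB example but is not confirmed by the source. The function names are hypothetical.

```python
import math

BIN_SIZE_DEG = 5  # coarse bin width used by the CLI


def bin_lower_edge(value, size=BIN_SIZE_DEG):
    """Return the lower edge of the bin containing value (assumed labelling)."""
    return math.floor(value / size) * size


def partition_dir(lat, lon):
    """Build a Hive-style partition directory for one position."""
    return f"lat_bin={bin_lower_edge(lat)}/lon_bin={bin_lower_edge(lon)}"


print(partition_dir(32.7, -117.2))  # lat_bin=30/lon_bin=-120
```

Flooring rather than truncating keeps negative coordinates in the correct bin: -117.2 falls in the bin starting at -120, not -115.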

Processing Modules

Oceanstream is organized into separate processing modules:

  • oceanstream.geotrack - Process GPS/navigation track data into GeoParquet format
  • oceanstream.echodata - Process echosounder data (EK60/EK80) into Zarr (coming soon)
  • oceanstream.multibeam - Process multibeam sonar data (coming soon)
  • oceanstream.adcp - Process ADCP current profiler data (coming soon)

Each module can be installed independently using pip extras (see Install Dependencies under Setup Instructions).

Using OceanStream Data in GIS Tools

OceanStream writes cloud-optimized GeoParquet files intended for use with current GIS tools and data analysis frameworks. The output includes:

  • GeoParquet: Columnar format with embedded geometry and spatial partitioning
  • STAC Metadata: Standard catalog format for discovery and integration
  • PMTiles (optional): Vector tiles for web-based visualization

Comprehensive GIS Integration Guides

We provide detailed integration guides for popular GIS tools and frameworks:

Desktop GIS:

  • QGIS - Open-source desktop GIS
  • ArcGIS Pro - Professional ESRI platform

Data Analysis:

  • DuckDB - Fast in-process SQL analytics
  • GeoPandas - Python spatial data analysis

Web GIS (coming soon):

  • Leaflet + PMTiles
  • Mapbox GL JS
  • STAC Browser

See GIS Integration Documentation for complete guides with:

  • Installation instructions
  • Step-by-step usage examples
  • Code samples and workflows
  • Performance optimization tips
  • Troubleshooting guides

Quick Start Examples

Load in QGIS:

# Generate data
oceanstream process geotrack --input-source ./data/sample.csv --output-dir ./output

# Open QGIS and drag-and-drop .parquet files from:
# output/campaign_id/lat_bin=X/lon_bin=Y/*.parquet
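
The lat_bin=X/lon_bin=Y directory names follow the Hive partitioning convention, so the bin values can be recovered from a file path alone. A minimal sketch, assuming integer bin values; the helper name `partition_values` and the file name `part-0.parquet` are hypothetical:

```python
from pathlib import PurePosixPath


def partition_values(parquet_path):
    """Extract Hive-style key=value path segments into a dict of ints."""
    values = {}
    for part in PurePosixPath(parquet_path).parts:
        key, sep, value = part.partition("=")
        if sep:  # only segments of the form key=value carry partition info
            values[key] = int(value)
    return values


path = "output/campaign_id/lat_bin=30/lon_bin=-120/part-0.parquet"
print(partition_values(path))  # {'lat_bin': 30, 'lon_bin': -120}
```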

Query with DuckDB:

INSTALL spatial;
LOAD spatial;

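-- lat_bin and lon_bin come from the lat_bin=X/lon_bin=Y directory names;
-- enabling Hive partitioning explicitly exposes them as columns
SELECT time, latitude, longitude, temperature_sea_water
FROM read_parquet('output/campaign_id/**/*.parquet', hive_partitioning = true)
WHERE lat_bin = 30 AND lon_bin = -120
LIMIT 10;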
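
Filtering on lat_bin and lon_bin, as the WHERE clause above does, restricts the query to specific partitions. To cover a bounding box instead of a single bin, the relevant bin pairs can be enumerated first. The sketch below assumes 5° bins labelled by their lower edge and does not handle boxes that cross the antimeridian; `bins_for_bbox` is a hypothetical helper, not part of oceanstream.

```python
import math


def bins_for_bbox(min_lat, min_lon, max_lat, max_lon, size=5):
    """List (lat_bin, lon_bin) pairs whose bins overlap the bounding box.

    Assumes bins are labelled by their lower edge; no antimeridian handling.
    """
    lat_start = math.floor(min_lat / size) * size
    lon_start = math.floor(min_lon / size) * size
    lat_stop = math.floor(max_lat / size) * size
    lon_stop = math.floor(max_lon / size) * size
    return [
        (lat, lon)
        for lat in range(lat_start, lat_stop + 1, size)
        for lon in range(lon_start, lon_stop + 1, size)
    ]


print(bins_for_bbox(31, -121, 36, -117))
# [(30, -125), (30, -120), (35, -125), (35, -120)]
```

The resulting pairs can be turned into a filter on lat_bin and lon_bin so that only the overlapping partitions are read.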

Analyze with GeoPandas:

import geopandas as gpd

# Read all spatial partitions
gdf = gpd.read_parquet('output/campaign_id/')

# Filter and analyze
warm_water = gdf[gdf['temperature_sea_water'] > 25]
print(f"Found {len(warm_water)} warm water measurements")

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any enhancements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for details.
