# Oceanstream

Process oceanographic measurements from USVs and write partitioned GeoParquet datasets with geospatial metadata.

This project processes oceanographic measurements collected from Unmanned Surface Vehicles (USVs). The pipeline reads CSV files from a specified directory, consolidates the data, and stores it in GeoParquet format partitioned by latitude and longitude bins. Optionally, it can upload the resulting GeoParquet files to Azure Blob Storage.
## Project Structure

```
oceanstream
├── src
│   ├── app.py                    # Main entry point for the application
│   ├── cli.py                    # Command-line interface for running the pipeline
│   ├── config                    # Configuration settings
│   │   ├── __init__.py
│   │   └── settings.py
│   ├── pipeline                  # Data processing pipeline
│   │   ├── __init__.py
│   │   ├── csv_reader.py         # Functions for reading CSV files
│   │   ├── binning.py            # Functions for data partitioning
│   │   └── geoparquet_writer.py  # Functions for writing GeoParquet files
│   ├── storage                   # Storage handling
│   │   ├── __init__.py
│   │   ├── local.py              # Local storage functions
│   │   └── azure_blob.py         # Azure Blob Storage functions
│   └── types                     # Data models and types
│       ├── __init__.py
│       └── models.py
├── data
│   └── raw_data                  # Directory for raw CSV data
│       └── .gitkeep
├── tests                         # Unit tests for the application
│   ├── __init__.py
│   └── test_pipeline.py
├── .env.example                  # Template for environment variables
├── .gitignore                    # Git ignore file
├── .python-version               # Python version specification
├── pyproject.toml                # Project dependencies and configuration
└── README.md                     # Project documentation
```
## Setup Instructions

1. **Clone the repository:**

   ```bash
   git clone <repository-url>
   cd oceanstream
   ```

2. **Create a virtual environment:**

   ```bash
   python3.12 -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```

3. **Install dependencies:**

   Oceanstream is organized into optional processing modules. Install only what you need:

   ```bash
   # Install core + geotrack processing (GPS/navigation data → GeoParquet)
   pip install -e ".[geotrack]"

   # Install core + echodata processing (echosounder data → Zarr)
   pip install -e ".[echodata]"

   # Install all processing modules
   pip install -e ".[all]"

   # Install for development (includes all modules + dev tools)
   pip install -e ".[all]" -r requirements-dev.txt
   ```

   Available extras:

   - `geotrack` - GPS/navigation track processing (pandas, geopandas, shapely)
   - `echodata` - Echosounder data processing (echopype, xarray, zarr, netcdf4)
   - `multibeam` - Multibeam sonar processing (planned)
   - `adcp` - ADCP current profiler processing (planned)
   - `all` - All processing modules
   - `geo` - Legacy alias for `geotrack`

4. **Configure environment variables:**

   Copy `.env.example` to `.env` and fill in the necessary configuration values.

5. **Run processing commands:**

   Oceanstream provides separate commands for each data type:

   ```bash
   # Process geotrack data (CSV → GeoParquet)
   oceanstream process geotrack --input-dir raw_data --output-dir out/geoparquet -v

   # Process echosounder data (planned - requires echodata extra)
   oceanstream process echodata --input-dir raw_echodata --output-dir out/echodata -v

   # Process multibeam data (planned - requires multibeam extra)
   oceanstream process multibeam --input-dir raw_multibeam --output-dir out/multibeam -v

   # Process ADCP data (planned - requires adcp extra)
   oceanstream process adcp --input-dir raw_adcp --output-dir out/adcp -v

   # List available data providers
   oceanstream providers
   ```

   All processing commands support a `--provider` flag to specify the data source:

   ```bash
   oceanstream process --provider saildrone geotrack --input-dir data -v
   ```
## Usage

### Geotrack Processing (GPS/Navigation Data)

CLI usage examples:

```bash
# Process sample fixture data bundled with tests
oceanstream process geotrack --input-dir oceanstream/tests/data/raw_data --output-dir out/geoparquet -v

# Process your raw_data directory at repo root (default input-dir is ./raw_data)
oceanstream process geotrack --output-dir out/geoparquet -v

# Dry run to see what would be processed
oceanstream process geotrack --input-dir raw_data --dry-run -v

# List available columns in the data
oceanstream process geotrack --input-dir raw_data --list-columns
```

The CLI reads CSVs, auto-derives coarse 5° latitude/longitude bins, and writes a partitioned GeoParquet dataset with metadata. Use `-v` for progress logs.
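The coarse 5° binning can be expressed as a simple floor to the lower bin edge. This is an assumed sketch of the logic (the actual implementation lives in `src/pipeline/binning.py`):

```python
import math

BIN_SIZE = 5  # degrees; matches the coarse bins described above


def to_bin(value: float, bin_size: int = BIN_SIZE) -> int:
    """Floor a coordinate to the lower edge of its 5-degree bin."""
    return int(math.floor(value / bin_size) * bin_size)


# A point at (32.7, -117.2) lands in the lat_bin=30 / lon_bin=-120 partition.
print(to_bin(32.7), to_bin(-117.2))  # → 30 -120
```

Note that flooring (rather than truncating toward zero) keeps negative coordinates in the correct bin: -117.2 maps to -120, not -115.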
## Processing Modules

Oceanstream is organized into separate processing modules:

- `oceanstream.geotrack` - Process GPS/navigation track data into GeoParquet format
- `oceanstream.echodata` - Process echosounder data (EK60/EK80) into Zarr (coming soon)
- `oceanstream.multibeam` - Process multibeam sonar data (coming soon)
- `oceanstream.adcp` - Process ADCP current profiler data (coming soon)

Each module can be installed independently using pip extras (see Setup Instructions).
## Using OceanStream Data in GIS Tools

OceanStream generates cloud-optimized GeoParquet files designed to work seamlessly with modern GIS tools and data analysis frameworks. The output includes:

- **GeoParquet**: Columnar format with embedded geometry and spatial partitioning
- **STAC metadata**: Standard catalog format for discovery and integration
- **PMTiles (optional)**: Vector tiles for web-based visualization
### Comprehensive GIS Integration Guides

We provide detailed integration guides for popular GIS tools and frameworks:

**Desktop GIS:**

- QGIS - Open-source desktop GIS
- ArcGIS Pro - Professional ESRI platform

**Data Analysis:**

**Web GIS (coming soon):**

- Leaflet + PMTiles
- Mapbox GL JS
- STAC Browser

See the GIS Integration Documentation for complete guides with:

- Installation instructions
- Step-by-step usage examples
- Code samples and workflows
- Performance optimization tips
- Troubleshooting guides
### Quick Start Examples

**Load in QGIS:**

```bash
# Generate data
oceanstream process geotrack --input-source ./data/sample.csv --output-dir ./output

# Open QGIS and drag-and-drop .parquet files from:
# output/campaign_id/lat_bin=X/lon_bin=Y/*.parquet
```

**Query with DuckDB:**

```sql
INSTALL spatial;
LOAD spatial;

SELECT time, latitude, longitude, temperature_sea_water
FROM read_parquet('output/campaign_id/**/*.parquet')
WHERE lat_bin = 30 AND lon_bin = -120
LIMIT 10;
```

**Analyze with GeoPandas:**

```python
import geopandas as gpd

# Read all spatial partitions
gdf = gpd.read_parquet('output/campaign_id/')

# Filter and analyze
warm_water = gdf[gdf['temperature_sea_water'] > 25]
print(f"Found {len(warm_water)} warm water measurements")
```
## Contributing

Contributions are welcome! Please open an issue or submit a pull request for any enhancements or bug fixes.

## License

This project is licensed under the MIT License. See the LICENSE file for details.