Microlibrary for interacting with the public bottom trawl surveys data from the NOAA AFSC GAP API.

AFSC GAP for Python

Microlibrary for pythonic interaction with the public bottom trawl surveys data from the NOAA AFSC GAP.

Purpose

Unofficial microlibrary for interacting with the API for bottom trawl surveys from the Groundfish Assessment Program (GAP), a dataset produced by the Resource Assessment and Conservation Engineering (RACE) Division of the Alaska Fisheries Science Center (AFSC) as part of the National Oceanic and Atmospheric Administration's Fisheries organization (NOAA Fisheries).

Need

Scientists and developers working on ocean health have an interest in survey data from organizations like NOAA Fisheries. However, interacting with the GAP API from NOAA AFSC in Python requires understanding a complex schema, learning to interact with a proprietary REST data service, forming long query URLs, and navigating pagination. Together, these elements may raise the barrier to working with these data, limiting their reach within the Python community.

Goal

This low-dependency library provides a type-annotated and documented Python interface to these data with the ability to query with filters and pagination, providing results in various formats compatible with different Python usage modalities (Pandas, pure-Python, etc.).

It adapts the Oracle REST Data Service used by the agency with Python type hints for easy query and interface. Furthermore, Python docstrings annotate the data structures provided by the API to help users navigate the various fields available, offering contextual documentation when supported by Python IDEs.

Though not intended to be general, this project also provides an example for working with Oracle REST Data Services (ORDS) APIs in Python.

Installation

This open source library is available for install via PyPI / pip:

$ pip install afscgap

Note that its only dependency is requests; Pandas and numpy are not required.

Usage

This library provides access to the public API endpoints with query keywords matching the column names described in the official metadata repository. Records are parsed into plain old Python objects with optional access to a dictionary representation.

Basic Usage

For example, this requests all records of Pasiphaea pacifica in 2021 from the Gulf of Alaska to get the median bottom temperature when they were observed:

import statistics

import afscgap

results = afscgap.query(
    year=2021,
    srvy='GOA',
    scientific_name='Pasiphaea pacifica'
)

temperatures = [record.get_bottom_temperature_c() for record in results]
print(statistics.median(temperatures))

Note that afscgap.query is the main entry point which returns Record objects whose fields and methods are defined in the data structure section.

Using an iterator will have the library negotiate pagination behind the scenes. You can do this with list comprehensions, maps, etc or with a good old for loop like in this example which gets a histogram of temperatures:

count_by_temperature_c = {}

results = afscgap.query(
    year=2021,
    srvy='GOA',
    scientific_name='Pasiphaea pacifica'
)

for record in results:
    temp = record.get_bottom_temperature_c()
    temp_rounded = round(temp)
    count = count_by_temperature_c.get(temp_rounded, 0) + 1
    count_by_temperature_c[temp_rounded] = count

print(count_by_temperature_c)

Note that this operation will cause multiple HTTP requests while the iterator runs.
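
The tallying loop above can also be written with collections.Counter from the standard library. The sketch below uses a hard-coded list of sample temperatures (made up for illustration, not survey data) in place of query results so that it runs without network access:

```python
import collections

# Hypothetical bottom temperatures in Celsius, standing in for the values
# returned by record.get_bottom_temperature_c() in the loop above.
sample_temperatures_c = [3.2, 3.9, 4.1, 4.4, 5.8, 6.1]

# Counter tallies each rounded temperature without manual get / set calls.
count_by_temperature_c = collections.Counter(
    round(temp) for temp in sample_temperatures_c
)

print(dict(count_by_temperature_c))
```

With real data, the generator expression would iterate over the query results instead of the sample list.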

Pagination

By default, the library will iterate through all results and handle pagination behind the scenes. However, one can also request an individual page:

results = afscgap.query(
    year=2021,
    srvy='GOA',
    scientific_name='Pasiphaea pacifica'
)

results_for_page = results.get_page(offset=20, limit=100)
print(len(results_for_page))  # Will print 32 (results contains 52 records)
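
The count in the comment above follows from how offset and limit interact: a page starts at the offset and stops at the limit or at the end of the result set, whichever comes first. A minimal sketch of that arithmetic, assuming conventional offset / limit semantics (the page_size helper is hypothetical and not part of the library):

```python
def page_size(total_records: int, offset: int, limit: int) -> int:
    """Number of records expected on a page under offset / limit semantics."""
    # Records left after skipping the offset; never negative.
    remaining = max(total_records - offset, 0)
    return min(remaining, limit)


# 52 matching records, skipping the first 20, asking for up to 100.
print(page_size(total_records=52, offset=20, limit=100))  # 32
```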

Client code can also change the pagination behavior used when iterating:

results = afscgap.query(
    year=2021,
    srvy='GOA',
    scientific_name='Pasiphaea pacifica',
    start_offset=10,
    limit=200
)

for record in results:
    print(record.get_bottom_temperature_c())

Note that records are only requested once during iteration and only after the prior page has been returned via the iterator ("lazy" loading).
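
To illustrate what lazy loading means here, the following sketch pages through an in-memory list with a generator. It is not the library's implementation: fetch_page and iterate_lazily are hypothetical names, the fake fetcher only records which offsets were requested, and stopping on an empty page is an assumption made for simplicity.

```python
import typing

# Stand-in for the remote dataset; the real library issues HTTP requests.
FAKE_RECORDS = list(range(25))
requested_offsets: typing.List[int] = []


def fetch_page(offset: int, limit: int) -> typing.List[int]:
    """Hypothetical page fetcher that logs each call instead of using HTTP."""
    requested_offsets.append(offset)
    return FAKE_RECORDS[offset:offset + limit]


def iterate_lazily(start_offset: int, limit: int) -> typing.Iterator[int]:
    """Yield records one at a time, fetching a page only when it is needed."""
    offset = start_offset
    while True:
        page = fetch_page(offset, limit)
        if not page:
            return  # An empty page ends iteration in this sketch.
        yield from page
        offset += limit


records = iterate_lazily(start_offset=0, limit=10)
first = next(records)
print(requested_offsets)  # Only the first page has been requested so far.

rest = list(records)
print(requested_offsets)  # Later pages were requested as iteration advanced.
```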

Serialization

Users may request a dictionary representation:

results = afscgap.query(
    year=2021,
    srvy='GOA',
    scientific_name='Pasiphaea pacifica'
)

# Get dictionary from individual record
for record in results:
    dict_representation = record.to_dict()
    print(dict_representation['bottom_temperature_c'])

results = afscgap.query(
    year=2021,
    srvy='GOA',
    scientific_name='Pasiphaea pacifica'
)

# Get dictionary for all records
results_dicts = results.to_dicts()

for record in results_dicts:
    print(record['bottom_temperature_c'])

Note to_dicts returns an iterator by default, but it can be realized as a full list using the list() command.
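
The distinction matters because a Python iterator can generally be consumed only once. The sketch below shows that general behavior with a hypothetical generator standing in for results.to_dicts(); the values are made up and whether the library's own iterator can be restarted is not covered here.

```python
import typing


def to_dicts_sketch() -> typing.Iterator[dict]:
    """Hypothetical stand-in for results.to_dicts() yielding two records."""
    yield {'bottom_temperature_c': 4.1}
    yield {'bottom_temperature_c': 5.3}


lazy = to_dicts_sketch()
first_pass = [row['bottom_temperature_c'] for row in lazy]
second_pass = [row['bottom_temperature_c'] for row in lazy]  # Already consumed.

# Realizing a list up front allows the records to be reused freely.
realized = list(to_dicts_sketch())

print(first_pass, second_pass, len(realized))
```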

Pandas

The dictionary form of the data can be used to create a Pandas dataframe:

import pandas

import afscgap

results = afscgap.query(
    year=2021,
    srvy='GOA',
    scientific_name='Pasiphaea pacifica'
)

pandas.DataFrame(results.to_dicts())

Note that Pandas is not required to use this library.

Advanced Filtering

Finally, users may provide advanced queries using Oracle's REST API query parameters. For example, this queries for 2021 records with hauls from the Gulf of Alaska in a specific geographic area:

import afscgap

results = afscgap.query(
    year=2021,
    latitude_dd={'$between': [56, 57]},
    longitude_dd={'$between': [-161, -160]}
)

count_by_common_name = {}

for record in results:
    common_name = record.get_common_name()
    count = count_by_common_name.get(common_name, 0) + 1
    count_by_common_name[common_name] = count

For more info about the options available, consider the Oracle docs or a helpful unaffiliated getting started tutorial from Jeff Smith.
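
For orientation, ORDS filters are typically expressed as a JSON filter object in which operators such as $between appear as nested keys. The sketch below serializes keyword filters into that shape. The build_filter helper is hypothetical, and how the library itself assembles its filter is an assumption rather than something taken from its source.

```python
import json


def build_filter(**fields) -> str:
    """Serialize keyword filters as a JSON filter object, skipping None."""
    given = {
        name: value for name, value in fields.items() if value is not None
    }
    return json.dumps(given, sort_keys=True)


filter_json = build_filter(
    year=2021,
    srvy=None,  # Unset filters are left out of the filter object.
    latitude_dd={'$between': [56, 57]},
    longitude_dd={'$between': [-161, -160]}
)

print(filter_json)
```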

Incomplete or invalid records

Metadata fields such as year are always required to make a Record whereas others such as catch weight cpue_kgkm2 are not present on all records returned by the API and are optional. See the Schema section below for additional details. For fields with optional values:

A maybe getter (get_cpue_kgkm2_maybe) is provided which will return None without error if the value is not provided or could not be parsed.
A regular getter (get_cpue_kgkm2) is provided which asserts the value is not None before it is returned.
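
The difference between the two getters can be sketched with a small stand-in class. RecordSketch is hypothetical and only mirrors the behavior described above for a single optional field; it is not the library's Record.

```python
import typing


class RecordSketch:
    """Hypothetical record with one optional field, for illustration only."""

    def __init__(self, cpue_kgkm2: typing.Optional[float]):
        self._cpue_kgkm2 = cpue_kgkm2

    def get_cpue_kgkm2_maybe(self) -> typing.Optional[float]:
        """Return the value or None if it was missing or unparseable."""
        return self._cpue_kgkm2

    def get_cpue_kgkm2(self) -> float:
        """Return the value, asserting that it is present."""
        value = self._cpue_kgkm2
        assert value is not None
        return value


missing = RecordSketch(None)
print(missing.get_cpue_kgkm2_maybe())  # None, no error raised.

# The regular getter fails loudly when the value is absent.
try:
    missing.get_cpue_kgkm2()
    raised = False
except AssertionError:
    raised = True

print(raised)
```

The maybe getter suits exploratory code that tolerates gaps, while the regular getter keeps downstream arithmetic free of None checks.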

Record objects also have an is_complete method which returns true if all optional fields on the Record are non-None and the date_time field on the Record is a valid ISO 8601 string. By default, records for which is_complete is false are still returned when iterating or through get_page, but this can be overridden with the filter_incomplete keyword argument like so:

results = afscgap.query(
    year=2021,
    srvy='GOA',
    scientific_name='Pasiphaea pacifica',
    filter_incomplete=True
)

for result in results:
    assert result.is_complete()
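
The completeness rule described above can be sketched as follows. The function names are hypothetical and datetime.fromisoformat is used as a stand-in ISO 8601 check; it accepts only a subset of ISO 8601 on older Python versions, and the library's own check may differ.

```python
import datetime
import typing


def is_iso_8601(candidate: str) -> bool:
    """Check if a string parses as an ISO 8601 timestamp."""
    try:
        datetime.datetime.fromisoformat(candidate)
        return True
    except ValueError:
        return False


def is_complete_sketch(
    optional_values: typing.Iterable[typing.Optional[float]],
    date_time: str
) -> bool:
    """True if every optional value is present and the date parses."""
    all_present = all(value is not None for value in optional_values)
    return all_present and is_iso_8601(date_time)


print(is_complete_sketch([1.0, 2.0], '2021-07-16T11:30:22'))
print(is_complete_sketch([1.0, None], '2021-07-16T11:30:22'))
print(is_complete_sketch([1.0, 2.0], '07/16/2021 11:30:22'))
```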

Results returned by the API for which non-Optional fields could not be parsed (like a missing year) are considered "invalid" and are always excluded during iteration. Those raw unreadable records are kept in a queue.Queue[dict] that can be accessed via get_invalid like so:

results = afscgap.query(year=2021, srvy='GOA')
valid = list(results)

invalid_queue = results.get_invalid()
percent_invalid = invalid_queue.qsize() / len(valid) * 100
print('Percent invalid: %.2f%%' % percent_invalid)

complete = filter(lambda x: x.is_complete(), valid)
num_complete = sum(map(lambda x: 1, complete))
percent_complete = num_complete / len(valid) * 100
print('Percent complete: %.2f%%' % percent_complete)

Note that this queue is filled during iteration (like for result in results or list(results)) but not by get_page, whose handling of invalid records can be specified via the ignore_invalid keyword.
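
As a rough illustration of this pattern, the sketch below sets aside raw dictionaries whose required year field cannot be parsed while yielding the rest. The helper names and sample payloads are hypothetical, and only a single field is checked to keep the example short.

```python
import queue
import typing


def parse_year(raw: dict) -> typing.Optional[float]:
    """Return the required year field as a float or None if unreadable."""
    try:
        return float(raw['year'])
    except (KeyError, TypeError, ValueError):
        return None


def iterate_valid(
    raw_records: typing.Iterable[dict],
    invalid_queue: queue.Queue
) -> typing.Iterator[float]:
    """Yield parsed years, diverting unreadable raw records to the queue."""
    for raw in raw_records:
        year = parse_year(raw)
        if year is None:
            invalid_queue.put(raw)
        else:
            yield year


# Made-up payloads: two readable, one missing year, one unparseable year.
raw_records = [{'year': '2021'}, {'srvy': 'GOA'}, {'year': 'n/a'}, {'year': 2019}]
invalid_queue: queue.Queue = queue.Queue()

valid_years = list(iterate_valid(raw_records, invalid_queue))
print(valid_years, invalid_queue.qsize())
```

Because the queue is filled as a side effect of iteration, it stays empty until the iterator is actually consumed.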

Debugging

For investigating issues or evaluating the underlying operations, you can also request a full URL for a query:

results = afscgap.query(
    year=2021,
    latitude_dd={'$between': [56, 57]},
    longitude_dd={'$between': [-161, -160]}
)

print(results.get_page_url(limit=10, offset=0))

The query can be executed by making an HTTP GET request at the provided location.
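
When inspecting such a URL, it can help to decode its query string back into Python values. The sketch below does this with the standard library. The example.com address is a placeholder, and the q, limit, and offset parameter names follow the usual ORDS convention; the exact URL produced by get_page_url may differ.

```python
import json
import urllib.parse

# Placeholder URL for illustration; a real one comes from get_page_url.
example_url = (
    'https://example.com/ords/sketch/?'
    + urllib.parse.urlencode({
        'q': json.dumps({'year': 2021, 'latitude_dd': {'$between': [56, 57]}}),
        'limit': 10,
        'offset': 0
    })
)

# Split the URL and decode the percent-encoded query parameters.
parsed = urllib.parse.urlparse(example_url)
params = urllib.parse.parse_qs(parsed.query)

# The filter travels as JSON text, so it needs one more decoding step.
filter_object = json.loads(params['q'][0])

print(filter_object['latitude_dd'])
print(params['limit'][0], params['offset'][0])
```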

Data structure

The schema drives the getters and filters available in the library.

Schema

A Python-typed description of the fields is provided below.

Field	Python Type	Description
year	float	Year for the survey in which this observation was made.
srvy	str	The name of the survey in which this observation was made. NBS (N Bering Sea), EBS (SE Bering Sea), BSS (Bering Sea Slope), or GOA (Gulf of Alaska)
survey	str	Long form description of the survey in which the observation was made.
survey_id	float	Unique numeric ID for the survey.
cruise	float	An ID uniquely identifying the cruise in which the observation was made. A survey may include multiple cruises.
haul	float	An ID uniquely identifying the haul in which this observation was made. A cruise may include multiple hauls.
stratum	float	Unique ID for statistical area / survey combination as described in the metadata or 0 if an experimental tow.
station	str	Station associated with the survey.
vessel_name	str	Name of the vessel at the time the observation was made, with multiple names potentially associated with a vessel ID.
vessel_id	float	Unique ID describing the vessel that made this observation.
date_time	str	The date and time of the haul, converted to an ISO 8601 string without timezone info where possible. If it could not be converted, the original string is reported.
latitude_dd	float	Latitude in decimal degrees associated with the haul.
longitude_dd	float	Longitude in decimal degrees associated with the haul.
species_code	float	Unique ID associated with the species observed.
common_name	str	The “common name” associated with the species observed. Example: Pacific glass shrimp
scientific_name	str	The “scientific name” associated with the species observed. Example: Pasiphaea pacifica
taxon_confidence	str	Confidence flag regarding ability to identify species (High, Moderate, Low). In practice, this can also be Unassessed.
cpue_kgha	Optional[float]	Catch weight divided by net area (kg / hectares) if available. See metadata. None if could not interpret as a float.
cpue_kgkm2	Optional[float]	Catch weight divided by net area (kg / km^2) if available. See metadata. None if could not interpret as a float.
cpue_kg1000km2	Optional[float]	Catch weight divided by net area (kg / km^2 * 1000) if available. See metadata. None if could not interpret as a float.
cpue_noha	Optional[float]	Catch number divided by net sweep area if available (count / hectares). See metadata. None if could not interpret as a float.
cpue_nokm2	Optional[float]	Catch number divided by net sweep area if available (count / km^2). See metadata. None if could not interpret as a float.
cpue_no1000km2	Optional[float]	Catch number divided by net sweep area if available (count / km^2 * 1000). See metadata. None if could not interpret as a float.
weight_kg	Optional[float]	Taxon weight (kg) if available. See metadata. None if could not interpret as a float.
count	Optional[float]	Total number of organism individuals in haul. None if could not interpret as a float.
bottom_temperature_c	Optional[float]	Bottom temperature associated with observation if available in Celsius. None if not given or could not interpret as a float.
surface_temperature_c	Optional[float]	Surface temperature associated with observation if available in Celsius. None if not given or could not interpret as a float.
depth_m	float	Depth of the bottom in meters.
distance_fished_km	float	Distance of the net fished as km.
net_width_m	float	Width of the net fished as m.
net_height_m	float	Height of the net fished as m.
area_swept_ha	float	Area covered by the net while fishing in hectares.
duration_hr	float	Duration of the haul as number of hours.
tsn	Optional[int]	Taxonomic information system species code.
ak_survey_id	int	AK identifier for the survey.

For more information on the schema, see the metadata repository but note that the fields may be slightly different in the Python library per what is actually returned by the API.

Filters and getters

These fields are available as getters on afscgap.model.Record (result.get_srvy()) and may be used as optional filters on the query (afscgap.query(srvy='GOA')). Fields which are Optional have two getters. First, the "regular" getter (result.get_count()) will assert that the field is not None before returning a non-optional. Second, the "maybe" getter (result.get_count_maybe()) will return None if the value was not provided or could not be parsed.

Filter keyword	Regular Getter	Maybe Getter
year	get_year() -> float
srvy	get_srvy() -> str
survey	get_survey() -> str
survey_id	get_survey_id() -> float
cruise	get_cruise() -> float
haul	get_haul() -> float
stratum	get_stratum() -> float
station	get_station() -> str
vessel_name	get_vessel_name() -> str
vessel_id	get_vessel_id() -> float
date_time	get_date_time() -> str
latitude_dd	get_latitude_dd() -> float
longitude_dd	get_longitude_dd() -> float
species_code	get_species_code() -> float
common_name	get_common_name() -> str
scientific_name	get_scientific_name() -> str
taxon_confidence	get_taxon_confidence() -> str
cpue_kgha	get_cpue_kgha() -> float	get_cpue_kgha_maybe() -> Optional[float]
cpue_kgkm2	get_cpue_kgkm2() -> float	get_cpue_kgkm2_maybe() -> Optional[float]
cpue_kg1000km2	get_cpue_kg1000km2() -> float	get_cpue_kg1000km2_maybe() -> Optional[float]
cpue_noha	get_cpue_noha() -> float	get_cpue_noha_maybe() -> Optional[float]
cpue_nokm2	get_cpue_nokm2() -> float	get_cpue_nokm2_maybe() -> Optional[float]
cpue_no1000km2	get_cpue_no1000km2() -> float	get_cpue_no1000km2_maybe() -> Optional[float]
weight_kg	get_weight_kg() -> float	get_weight_kg_maybe() -> Optional[float]
count	get_count() -> float	get_count_maybe() -> Optional[float]
bottom_temperature_c	get_bottom_temperature_c() -> float	get_bottom_temperature_c_maybe() -> Optional[float]
surface_temperature_c	get_surface_temperature_c() -> float	get_surface_temperature_c_maybe() -> Optional[float]
depth_m	get_depth_m() -> float
distance_fished_km	get_distance_fished_km() -> float
net_width_m	get_net_width_m() -> float
net_height_m	get_net_height_m() -> float
area_swept_ha	get_area_swept_ha() -> float
duration_hr	get_duration_hr() -> float
tsn	get_tsn() -> int	get_tsn_maybe() -> Optional[int]
ak_survey_id	get_ak_survey_id() -> int

Record objects also have an is_complete method which returns true if all the fields with an Optional type are non-None and the date_time could be parsed and made into an ISO 8601 string.

License

We are happy to make this library available under the BSD 3-Clause license. See LICENSE for more details. (c) 2023 The Eric and Wendy Schmidt Center for Data Science and the Environment at UC Berkeley.

Community

Thanks for your support! Pull requests and issues are very welcome.

Contribution guidelines

We invite contributions via our project Github. We have a few guidelines:

Please follow the Google Python Style Guide where possible for compatibility with the existing codebase.
Tests are encouraged and we aim for 80% coverage where feasible.
Type hints are encouraged and we aim for 80% coverage where feasible.
Docstrings are encouraged and we aim for 80% coverage.
Please check that you have no mypy errors when contributing.
Please check that you have no linting (pycodestyle, pyflakes) errors when contributing.
As contributors may be periodic, please do not re-write history / squash commits for ease of fast forward.
Open source is an act of love. Please be kind and respectful of all contributors at all levels.

Note that imports should be in alphabetical order in groups of standard library, third-party, and then first party. It is an explicit goal to provide a class with type hints for all record fields. Getters on an immutable record object are encouraged so as to enable use of the type system and docstrings for understanding the data structures. Data structures have been used that could allow for threaded requests, but everything is currently single threaded.

Contacts

Sam Pottinger is the primary contact. Thanks to Giulia Zarpellon and Carl Boettiger for their contributions. This is a project of The Eric and Wendy Schmidt Center for Data Science and the Environment at UC Berkeley. Please contact us via dse@berkeley.edu.

Open Source

We are happy to be part of the open source community.

At this time, the only open source dependency used by this microlibrary is Requests which is available under the Apache v2 License from Kenneth Reitz and other contributors.

Our build and documentation systems also use the following but are not distributed with or linked to the project itself:

mypy under the MIT License from Jukka Lehtosalo, Dropbox, and other contributors.
nose2 under a BSD license from Jason Pellerin and other contributors.
pdoc under the Unlicense from Andrew Gallant and Maximilian Hils.
pycodestyle under the Expat License from Johann C. Rocholl, Florent Xicluna, and Ian Lee.
pyflakes under the MIT License from Divmod, Florent Xicluna, and other contributors.

Thank you to all of these projects for their contribution.

Release history

1.0.4: Jun 28, 2023
1.0.3: Jun 28, 2023
1.0.2: Jun 2, 2023
1.0.1: Jun 1, 2023
1.0.0: May 31, 2023
0.0.9: Apr 18, 2023
0.0.8: Apr 18, 2023
0.0.7: Mar 17, 2023
0.0.6: Mar 6, 2023
0.0.5: Mar 5, 2023
0.0.4: Mar 4, 2023
0.0.3: Feb 24, 2023
0.0.2: Feb 24, 2023 (this version)
