An algorithm for detecting duplicate ocean in-situ profiles (Song et al., 2024, FMS)

Project description

duplicated_checking_IQuOD

Author: Zhetao Tan (IAP/CAS), Xinyi Song (IAP/CAS)

Contributor: International Quality-controlled Ocean Database (IQuOD) members

Overview

This algorithm aims to detect duplicate ocean in-situ profiles in a cost-effective way that reduces computational intensity.

It uses a so-called 'DNA' method: a 'DNA' is defined for each profile from the primary metadata (e.g., latitude, longitude, instrument type) and the secondary data (e.g., sum of depth, sum of temperature, and standard deviation of temperature in the profile).

The underlying assumption of this check is that if two profiles form a duplicate pair, most of their metadata and observational data are identical.
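The screening idea above can be sketched in a few lines of Python. This is an illustration only, not the package's actual implementation: the function names, the numeric instrument code, and the tolerance are all hypothetical.

```python
import numpy as np

def profile_dna(lat, lon, instrument_code, depths, temps):
    """Build a hypothetical 'DNA' feature vector for one profile:
    primary metadata (latitude, longitude, instrument type) plus
    secondary summary statistics of the observed data."""
    depths = np.asarray(depths, dtype=float)
    temps = np.asarray(temps, dtype=float)
    return np.array([
        lat,                   # primary metadata
        lon,
        float(instrument_code),
        depths.sum(),          # secondary data: sum of depth
        temps.sum(),           # sum of temperature
        temps.std(),           # standard deviation of temperature
    ])

def is_potential_duplicate(dna_a, dna_b, tol=1e-6):
    # Profiles whose DNA vectors (nearly) coincide are flagged as a
    # potential duplicate pair for further checking.
    return bool(np.allclose(dna_a, dna_b, atol=tol))
```

In the real algorithm the components are normalized and weighted in several different ways (the N02 series of scripts below); this sketch only shows the unweighted idea.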

The duplicate checking algorithm has been contributed to the IQuOD group.

The code needs to be run with MATLAB and Python 3.

A scientific paper introducing this algorithm is in preparation.

A detailed user manual can be found in: DC_OCEAN_user_document-v1.0.md

Below is a brief description of the algorithm. Further information will be provided in the scientific paper.

Running order:

(1) ./support/N00_read_metadata.m

(2) ./support/N01_formatted_metadata.m

(3) ./support/N02_1_duplicate_check_main_2_mapstd_average.m

(4) ./support/N02_2_duplicate_check_main_2_onlydepthtemp_compair.m

(5) ./support/N02_3_duplicate_check_main_2_weight_meta.m

(6) ./support/N02_4_duplicate_check_main_2_weighted_allinfo.m

(7) ./support/N02_5_duplicate_check_weight_meta_noLATLON.m

(8) ./support/N02_6_duplicate_check_main_2_weight_meta_noDepthinfo.m

(9) ./support/N02_7_duplicate_check_main_2_mapminmax_average.m

(10) ./support/N02_8_duplicate_check_main_2_mapminmax_onlydepthtemp_compair.m

(11) ./support/N02_9_duplicate_check_main_2_mapminmax_weight_meta.m

(12) ./support/N02_10_duplicate_check_main_2_mapminmax_weighted_allinfo.m

(13) ./support/N02_11_duplicate_check_mapminmax_weight_meta_noLATLON.m

(14) ./support/N02_12_duplicate_check_main_2_mapminmax_weight_meta_noDepthinfo.m

(15) ./support/N02_13_duplicate_check_main_2_PAC_90_allinfo.m

(16) ./support/N02_14_duplicate_check_main_2_PAC_95_allinfo.m

(17) ./support/N03_potential_duplicate_unique_list_3.m

(18) M01_MAIN_check_nc_duplicate_manual.py

(19) M02_MAIN_check_nc_duplicate_list.py
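The running order above could be driven from a small wrapper. The sketch below only assembles the 19 commands in sequence and executes nothing; it assumes MATLAB's `-batch` command-line flag is available, and is not part of the DC_OCEAN package itself.

```python
# Hypothetical driver that lists the pipeline commands in the running
# order given above. Each command can then be passed to subprocess.run().
MATLAB_STEPS = [
    "N00_read_metadata.m",
    "N01_formatted_metadata.m",
    "N02_1_duplicate_check_main_2_mapstd_average.m",
    "N02_2_duplicate_check_main_2_onlydepthtemp_compair.m",
    "N02_3_duplicate_check_main_2_weight_meta.m",
    "N02_4_duplicate_check_main_2_weighted_allinfo.m",
    "N02_5_duplicate_check_weight_meta_noLATLON.m",
    "N02_6_duplicate_check_main_2_weight_meta_noDepthinfo.m",
    "N02_7_duplicate_check_main_2_mapminmax_average.m",
    "N02_8_duplicate_check_main_2_mapminmax_onlydepthtemp_compair.m",
    "N02_9_duplicate_check_main_2_mapminmax_weight_meta.m",
    "N02_10_duplicate_check_main_2_mapminmax_weighted_allinfo.m",
    "N02_11_duplicate_check_mapminmax_weight_meta_noLATLON.m",
    "N02_12_duplicate_check_main_2_mapminmax_weight_meta_noDepthinfo.m",
    "N02_13_duplicate_check_main_2_PAC_90_allinfo.m",
    "N02_14_duplicate_check_main_2_PAC_95_allinfo.m",
    "N03_potential_duplicate_unique_list_3.m",
]
PYTHON_STEPS = [
    "M01_MAIN_check_nc_duplicate_manual.py",
    "M02_MAIN_check_nc_duplicate_list.py",
]

def build_commands():
    # MATLAB scripts live under ./support/; the M0* scripts sit at the top level.
    cmds = [["matlab", "-batch", f"run('./support/{s}')"] for s in MATLAB_STEPS]
    cmds += [["python3", s] for s in PYTHON_STEPS]
    return cmds
```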

Input: A test file folder ./input_files/WOD18_sample_1995

Output: the duplicate and non-duplicate list files DuplicateList_potential_duplicate_ALL_1995_unique.txt and Unduplicatelist_potential_duplicate_ALL_1995_unique.txt

The above two output files can be opened with Excel.
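For processing the lists programmatically instead, a minimal loader can be sketched as below. The exact column layout inside the .txt files is not documented here, so this only assumes whitespace-delimited records with one candidate pair per line; the function name is hypothetical.

```python
# Minimal sketch of loading one of the output list files.
def load_pair_list(path):
    pairs = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.split()
            if fields:  # skip blank lines
                pairs.append(fields)
    return pairs
```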

Here, ./support/N00_read_metadata.m reads the input test file (netCDF format).

The N02***.m and N03 scripts calculate the 'DNA' for each profile under different weightings and then detect the potential duplicate pairs.

M02_MAIN_check_nc_duplicate_list.py automatically checks whether the potential duplicate pairs output from N03 are real duplicates.

M01_MAIN_check_nc_duplicate_manual.py checks whether the potential duplicate pairs output from N03 are real duplicates under manual control: the filename of the potential duplicate pairs must be entered by hand.
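As a rough illustration of what such a confirmation step involves (the function name, the dict layout, and the tolerance are assumptions, not the actual interface of M01/M02), a candidate pair can be confirmed by comparing the full depth and temperature records rather than only the DNA summaries:

```python
import numpy as np

def confirm_duplicate(profile_a, profile_b, rtol=1e-5):
    """Hypothetical final check on a candidate pair flagged by the DNA
    screening: compare the complete observed records before declaring
    a real duplicate."""
    da, ta = np.asarray(profile_a["depth"]), np.asarray(profile_a["temp"])
    db, tb = np.asarray(profile_b["depth"]), np.asarray(profile_b["temp"])
    if da.shape != db.shape:
        # Different numbers of levels cannot be an exact duplicate.
        return False
    return bool(np.allclose(da, db, rtol=rtol) and np.allclose(ta, tb, rtol=rtol))
```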

Update logs

(1) January 15th, 2023: updated the N04 program with minor revisions.

(2) February 3rd, 2023: expanded the N02 series of programs. At present, the N02_1** to N02_6** programs normalize the data by row; the N02_7** to N02_12** programs normalize the data by column; the N02_13** and N02_14** programs are based on the principal component analysis method.

(3) March 29th, 2023: updated the N04 program with minor revisions; added a program that outputs only the duplicate data file names and accession numbers to facilitate sensitivity checks; added a program that outputs non-duplicate data for manual inspection; added procedures for checking sensitivity.
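The mapstd and mapminmax names in the N02 scripts refer to MATLAB's standard normalization functions: z-score scaling to zero mean and unit variance, and linear scaling to [-1, 1], respectively. For readers following along in Python, a rough NumPy equivalent (the `_like` function names are ours, not MATLAB's or the package's):

```python
import numpy as np

def mapstd_like(x):
    # Z-score normalization, analogous to MATLAB's mapstd:
    # zero mean, unit standard deviation.
    x = np.asarray(x, dtype=float)
    s = x.std()
    return (x - x.mean()) / s if s else x - x.mean()

def mapminmax_like(x):
    # Linear scaling to [-1, 1], analogous to MATLAB's mapminmax
    # with its default output range.
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    return 2 * (x - x.min()) / rng - 1 if rng else np.zeros_like(x)
```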

(4) August 22nd, 2023: finalized the first version of the duplicate checking algorithm (v1.0).

(5) January 5th, 2024: improved version (v1.1).

Citation

[1] Xinyi Song, Zhetao Tan, Lijing Cheng, et al., 2024: An open-source algorithm of duplicate checking for ocean in-situ profiles. (In preparation)

[2] Xinyi Song, Zhetao Tan, Lijing Cheng, 2023: A benchmark dataset for ocean profile duplicate checking. http://dx.doi.org/10.12157/IOCAS.20230821.001

Download files

Source Distribution

DC-OCEAN-1.1.tar.gz (11.6 MB)

Built Distribution

DC_OCEAN-1.1-py3-none-any.whl (15.4 MB)

File details

Details for the file DC-OCEAN-1.1.tar.gz.

File metadata

  • Download URL: DC-OCEAN-1.1.tar.gz
  • Size: 11.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.7.6

File hashes

Hashes for DC-OCEAN-1.1.tar.gz:

  • SHA256: 2242747a871c15427e8ecafb7ed76bedbb34601473879c60207a3398baee446b
  • MD5: 0ea759035d4c654eec4f81d56295f228
  • BLAKE2b-256: de4d139d4912a78d45ae5a5eae55483971f1b6b51f816100722444946ed29dbc

File details

Details for the file DC_OCEAN-1.1-py3-none-any.whl.

File metadata

  • Download URL: DC_OCEAN-1.1-py3-none-any.whl
  • Size: 15.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.7.6

File hashes

Hashes for DC_OCEAN-1.1-py3-none-any.whl:

  • SHA256: d62fe1f31c025ec139fa85d6dc14646e8e3119ca861f8b1d65714164c32e1a5e
  • MD5: 41c133b409923a6e2c304de4df97c712
  • BLAKE2b-256: cd17e7327435d39d96e66d98b59cfab5b99fb36c2e0fb19222cde022986bcef5
