
Python package to handle Darwin Core Archive (DwCA) operations. These include creating a DwCA zip file from one or more CSV files, reading a DwCA, merging two DwCAs, validating a DwCA, and deleting records from a DwCA based on one or more key columns.

dwcahandler

About

Python package to handle Darwin Core Archive (DwCA) operations. These include creating a DwCA zip file from CSV, reading a DwCA, merging two DwCAs, validating a DwCA, and deleting records from a DwCA based on one or more key columns.

Motivation

This package was developed from a module in ALA's data preingestion that produces a DwCA for pipelines ingestion. ALA receives data from various data providers in different forms (CSV and text files, API harvests, and DwCA), and this package is needed to package that data up into a DwCA.

The operations provided by dwcahandler include creating a DwCA from CSV/text files, merging two DwCAs, deleting records in a DwCA, and performing core key validations such as checking one or more keys for empty and duplicate values.

Technologies

This package is developed in Python. Tested with Python 3.12, 3.11, 3.10 and 3.9

 

Setup

  • Clone the repository.
  • If using pyenv, install the required Python version and activate it locally:
pyenv local <python version>
  • Install the dependencies in a local virtual environment:
poetry shell
poetry install
  • Import the Darwin Core and all the GBIF extension class row types and terms into dwcahandler:
poetry run update-terms

 

Build

To build the dwcahandler package:

poetry build

 

Installation

Install the published package:

pip install dwcahandler

To use a locally built package in a virtual environment:

pip install <folder>/dwcahandler/dist/dwcahandler-<version>.tar.gz

To install the published package from TestPyPI:

pip install -i https://test.pypi.org/simple/ dwcahandler
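
After any of the install commands above, a quick way to confirm which version ended up in the active environment is to query the package metadata. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name and not part of dwcahandler.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if dwcahandler is installed, otherwise None.
print(installed_version("dwcahandler"))
```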

 

DwcaHandler currently supports the latest GBIF extensions.

DwCAs with the following extensions have been ingested and tested in ALA:

Terms

from dwcahandler import DwcaHandler

df_terms, df_class = DwcaHandler.list_terms()
print(df_terms, df_class)

Class

from dwcahandler import MetaElementTypes

print(MetaElementTypes.OCCURRENCE)
print(MetaElementTypes.MULTIMEDIA)

To list all the class row types:

from dwcahandler import DwcaHandler

DwcaHandler.list_class_rowtypes()

 

Examples of dwcahandler usage:

  • Create a Darwin Core Archive from a CSV file.
  • Keys in the core content are used as the id/core id for a DwCA with extensions, and must be present in the data for both the core and the extensions.
  • If the core data has more than one key (for example institutionCode, collectionCode and catalogNumber), the resulting DwCA gets a generated id/core id for the extensions.
  • By default, validation is performed to make sure that the keys are unique in the core of the DwCA.
  • If keys are supplied for an extension's content, validation is also run to check the uniqueness of those keys in that content.
  • If keys are not provided, the default key is eventID for event content and occurrenceID for occurrence content.
  • When creating a DwCA with a multimedia extension, provide format and type values in the Simple Multimedia extension. Otherwise, dwcahandler attempts to fill in these values by guessing the MIME type from the URL.
  • For convenience, if the occurrence text file contains the dwc term associatedMedia and no multimedia extension is supplied, dwcahandler attempts to extract the multimedia URLs from associatedMedia into a Simple Multimedia extension.
from dwcahandler import ContentData
from dwcahandler import DwcaHandler
from dwcahandler import MetaElementTypes
from dwcahandler import Dataset, Eml, Description, Contact, Name

core_csv = ContentData(data=["/tmp/occurrence.csv"], type=MetaElementTypes.OCCURRENCE, keys=["occurrenceID"])
ext_csvs = [ContentData(data=["/tmp/multimedia.csv"], type=MetaElementTypes.MULTIMEDIA)]

eml = Eml(
    dataset=Dataset(
        dataset_name="A Dataset",
        abstract=Description("Description of dataset"),
        creator=Contact(individual_name=Name(first_name="Jane", last_name="Doe"), email="jane.doe@org.com"),
    )
)

DwcaHandler.create_dwca(core_csv=core_csv, ext_csv_list=ext_csvs, eml_content=eml, output_dwca="/tmp/dwca.zip")
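
To make the key validation described above concrete, here is a minimal standard-library sketch of checking a composite key for empty and duplicate values. It only illustrates the idea; it is not dwcahandler's implementation, and `check_keys` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def check_keys(rows, keys):
    """Return (row numbers with an empty key value, duplicated key tuples)."""
    empty_rows = []
    counts = Counter()
    for row_no, row in enumerate(rows, start=1):
        values = tuple((row.get(k) or "").strip() for k in keys)
        if any(v == "" for v in values):
            empty_rows.append(row_no)
        else:
            counts[values] += 1
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    return empty_rows, duplicates


# Hypothetical occurrence data keyed on three columns.
sample = io.StringIO(
    "institutionCode,collectionCode,catalogNumber,scientificName\n"
    "ALA,BIRDS,001,Dromaius novaehollandiae\n"
    "ALA,BIRDS,001,Dromaius novaehollandiae\n"
    "ALA,BIRDS,,Menura novaehollandiae\n"
    "ALA,PLANTS,001,Acacia dealbata\n"
)
rows = list(csv.DictReader(sample))
empty_rows, duplicates = check_keys(rows, ["institutionCode", "collectionCode", "catalogNumber"])
print(empty_rows)   # [3]
print(duplicates)   # [('ALA', 'BIRDS', '001')]
```

Running a check like this on the input files before building the archive can help locate the offending rows when validation fails.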

 

  • Create a Darwin Core Archive from pandas dataframes.
  • When creating a DwCA with a multimedia extension, provide format and type values in the Simple Multimedia extension. Otherwise, dwcahandler attempts to fill in these values by guessing the MIME type from the URL.
from dwcahandler import DwcaHandler
from dwcahandler.dwca import ContentData
from dwcahandler import MetaElementTypes
from dwcahandler import Dataset, Eml, Description, Contact, Name
import pandas as pd

core_df = pd.read_csv("/tmp/occurrence.csv")
core_frame = ContentData(data=core_df, type=MetaElementTypes.OCCURRENCE, keys=["occurrenceID"])

ext_df = pd.read_csv("/tmp/multimedia.csv")
ext_frame = [ContentData(data=ext_df, type=MetaElementTypes.MULTIMEDIA)]

eml = Eml(
    dataset=Dataset(
        dataset_name="A Dataset",
        abstract=Description("Description of dataset"),
        creator=Contact(individual_name=Name(first_name="Jane", last_name="Doe"), email="jane.doe@org.com"),
    )
)

DwcaHandler.create_dwca(core_csv=core_frame, ext_csv_list=ext_frame, eml_content=eml, output_dwca="/tmp/dwca.zip")
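
The MIME type guessing mentioned above can be pictured with the standard library's `mimetypes` module. This is a sketch under stated assumptions: `guess_format_and_type` is a hypothetical helper, and the mapping from MIME type to DCMI type terms (StillImage, MovingImage, Sound) is an assumption about what sensible values look like, not a description of dwcahandler's internals.

```python
import mimetypes
from urllib.parse import urlparse


def guess_format_and_type(url):
    """Guess a (format, type) pair from a media URL; both are None when unknown."""
    # Guess from the URL path only, so query strings do not hide the extension.
    path = urlparse(url).path
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return None, None
    # Assumed mapping from the MIME top-level type to a DCMI type term.
    kind_map = {"image": "StillImage", "video": "MovingImage", "audio": "Sound"}
    return mime, kind_map.get(mime.split("/")[0])


print(guess_format_and_type("https://example.org/photos/emu.jpg?size=large"))
print(guess_format_and_type("https://example.org/media/12345"))
```

The second URL has no file extension, so nothing can be guessed from it. This is why supplying format and type explicitly is the more reliable option.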

 

  • Convenience helper function to build a Darwin Core Archive from a list of CSV files.
  • Builds an event core DwCA if an event.txt file is supplied; otherwise, builds an occurrence core DwCA if occurrence.txt is supplied.
  • Raises an error if neither event.txt nor occurrence.txt is in the list.
  • Class row types are determined by the file names of the text files.
  • If no content keys are provided, the default keys are eventID for event content and occurrenceID for occurrence content.
  • Text files are assumed to be comma-delimited by default. For tab-delimited files, supply CsvEncoding.
from dwcahandler import DwcaHandler
from dwcahandler import Dataset, Eml, Description, Contact, Name

eml = Eml(
    dataset=Dataset(
        dataset_name="A Dataset",
        abstract=Description("Description of dataset"),
        creator=Contact(individual_name=Name(first_name="Jane", last_name="Doe"), email="jane.doe@org.com"),
    )
)

DwcaHandler.create_dwca_from_file_list(
    files=["/tmp/event.csv", "/tmp/occurrence.csv"], eml_content=eml, output_dwca="/tmp/dwca.zip"
)
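
The core selection rule described above (event first, then occurrence, otherwise an error) can be sketched in a few lines. This is illustrative only: `pick_core` is a hypothetical function, and matching on the file stem regardless of extension is an assumption rather than documented dwcahandler behaviour.

```python
from pathlib import Path


def pick_core(files):
    """Return (core_file, extension_files), preferring event over occurrence."""
    # Index the files by lower-cased stem, e.g. "/tmp/event.csv" -> "event".
    by_stem = {Path(f).stem.lower(): f for f in files}
    for core_name in ("event", "occurrence"):
        if core_name in by_stem:
            core = by_stem[core_name]
            return core, [f for f in files if f != core]
    raise ValueError("Expected an event or occurrence file in the list")


core, extensions = pick_core(["/tmp/event.csv", "/tmp/occurrence.csv"])
print(core)        # /tmp/event.csv
print(extensions)  # ['/tmp/occurrence.csv']
```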

 

  • Convenience helper function to create a Darwin Core Archive from CSV files in a zip file.
  • Builds an event core DwCA if an event.txt file is supplied; otherwise, builds an occurrence core DwCA if occurrence.txt is supplied in the zip file.
  • Raises an error if neither event.txt nor occurrence.txt is in the zip file.
  • Class row types are determined by the file names of the text files.
  • If no content keys are provided, the default keys are eventID for event content and occurrenceID for occurrence content.
  • Text files are assumed to be comma-delimited by default. For tab-delimited files, supply CsvEncoding.
from dwcahandler import DwcaHandler
from dwcahandler import Dataset, Eml, Description, Contact, Name

eml = Eml(
    dataset=Dataset(
        dataset_name="A Dataset",
        abstract=Description("Description of dataset"),
        creator=Contact(individual_name=Name(first_name="Jane", last_name="Doe"), email="jane.doe@org.com"),
    )
)

DwcaHandler.create_dwca_from_zip_content(zip_file="/tmp/txt_files.zip", eml_content=eml, output_dwca="/tmp/dwca.zip")
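
For experimenting with the zip-based helper, an input zip of text files can be assembled with the standard library. The sketch below builds one in memory; `build_txt_zip` is a hypothetical helper, the sample rows are made up, and the assumption that a file named multimedia.txt maps to the multimedia row type is not confirmed here.

```python
import io
import zipfile


def build_txt_zip(contents):
    """Pack {file name: text} pairs into an in-memory zip and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, text in contents.items():
            archive.writestr(name, text)
    return buffer.getvalue()


zip_bytes = build_txt_zip({
    "occurrence.txt": "occurrenceID,scientificName\nocc-1,Acacia dealbata\n",
    "multimedia.txt": "occurrenceID,identifier\nocc-1,https://example.org/acacia.jpg\n",
})

# Read the archive back to confirm what it contains.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    names = sorted(archive.namelist())
print(names)  # ['multimedia.txt', 'occurrence.txt']
```

Writing `zip_bytes` to a path such as /tmp/txt_files.zip would give a file of the shape the example above passes as `zip_file`.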

 

  • Merge two Darwin Core Archives into a single file.
  • Set extension sync to True to remove existing extension records before merging. The default for extension sync is False.
from dwcahandler import DwcaHandler, MetaElementTypes

DwcaHandler.merge_dwca(dwca_file="/tmp/dwca.zip", delta_dwca_file="/tmp/delta-dwca.zip",
                       output_dwca="/tmp/new-dwca.zip", 
                       keys_lookup={MetaElementTypes.OCCURRENCE:["occurrenceID"]})
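
One way to think about a key-based merge is as an upsert: delta records replace existing records with the same key, and records with new keys are appended. The sketch below shows that idea on plain dictionaries. The upsert semantics are an assumption made for illustration, and `merge_by_key` is a hypothetical function, not dwcahandler's implementation.

```python
def merge_by_key(existing, delta, keys):
    """Upsert delta records into existing records, matching on the key columns."""
    # dict preserves insertion order, so updated records keep their position.
    merged = {tuple(r[k] for k in keys): r for r in existing}
    for record in delta:
        merged[tuple(record[k] for k in keys)] = record
    return list(merged.values())


existing = [
    {"occurrenceID": "occ-1", "scientificName": "Acacia"},
    {"occurrenceID": "occ-2", "scientificName": "Eucalyptus"},
]
delta = [
    {"occurrenceID": "occ-2", "scientificName": "Eucalyptus globulus"},
    {"occurrenceID": "occ-3", "scientificName": "Banksia serrata"},
]
merged = merge_by_key(existing, delta, ["occurrenceID"])
print([r["occurrenceID"] for r in merged])  # ['occ-1', 'occ-2', 'occ-3']
print(merged[1]["scientificName"])          # Eucalyptus globulus
```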

 

  • Delete rows from the core file in a Darwin Core Archive.
from dwcahandler import ContentData
from dwcahandler import DwcaHandler, MetaElementTypes

delete_csv = ContentData(data=["/tmp/old-records.csv"], type=MetaElementTypes.OCCURRENCE, keys=["occurrenceID"])

DwcaHandler.delete_records(dwca_file="/tmp/dwca.zip",
                           records_to_delete=delete_csv,
                           output_dwca="/tmp/new-dwca.zip")
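
Conceptually, deleting by key means dropping every core record whose key values appear in the delete list. A minimal sketch of that filter on plain dictionaries follows; `delete_by_key` and the sample records are hypothetical and only illustrate the idea.

```python
def delete_by_key(records, to_delete, keys):
    """Return the records whose key values do not appear in to_delete."""
    doomed = {tuple(r[k] for k in keys) for r in to_delete}
    return [r for r in records if tuple(r[k] for k in keys) not in doomed]


records = [
    {"occurrenceID": "occ-1", "scientificName": "Acacia"},
    {"occurrenceID": "occ-2", "scientificName": "Eucalyptus"},
    {"occurrenceID": "occ-3", "scientificName": "Banksia serrata"},
]
remaining = delete_by_key(records, [{"occurrenceID": "occ-2"}], ["occurrenceID"])
print([r["occurrenceID"] for r in remaining])  # ['occ-1', 'occ-3']
```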

 

Support for building Ecological Metadata Language (EML) via the DwcaHandler Eml class

  • DwcaHandler supports generating EML for dataset metadata.
  • An Eml object can be passed into DwcaHandler to create a DwCA.
  • The Eml class takes a mandatory dataset and optional additional metadata to generate the EML.
  • If the dataset is not provided, or empty additional metadata is supplied, the Eml class still generates the EML string but displays an error.
  • For more information on the Eml class, see eml.py.
  • For a sample EML document, see eml-sample.xml.
from dwcahandler import (
    Name,
    Address,
    Contact,
    Description,
    BoundingCoordinates,
    GeographicCoverage,
    DateRange,
    CalendarDate,
    TemporalCoverage,
    TaxonomicClassification,
    TaxonomicCoverage,
    Coverage,
    KeywordSet,
    Dataset,
    AdditionalMetadata,
    GBIFMetadata,
    Eml,
)

contact_person: Contact = Contact(
    individual_name=Name(first_name="John", last_name="Doe"),
    address=Address(city="City", postal_code="ABC-123", country="Country"),
    organization_name="An Organization",
    email="john.doe@org.com",
    userid="https://orcid.org/0000-0000-0000-0000",
)

dataset = Dataset(
    dataset_name="Test Dataset",
    alternate_identifier=[
        "https://ipt/eml.do?r=a-resource",
        "https://another-website/dataset-info",
    ],
    keyword_set=[
        KeywordSet(
            keyword="Occurrence",
            keyword_thesaurus="http://rs.gbif.org/vocabulary/gbif/dataset_type_2015-07-10.xml",
        )
    ],
    abstract=Description(
        description="Lorem Ipsum is simply dummy text of the printing and typesetting industry. \n"
                    "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when \n"
                    "an unknown printer took a galley of type and scrambled it to make a type specimen book"
    ),
    creator=contact_person,
    published_date="2020-05-01",
    coverage=Coverage(
        geographic_coverage=GeographicCoverage(
            description="The data set contains records of herbarium specimens",
            bounding_coordinates=BoundingCoordinates(west="1.0", east="2.0", north="3.0", south="4.0"),
        ),
        temporal_coverage=TemporalCoverage(
            DateRange(
                begin_date=CalendarDate(calendar_date="2020-01-01"),
                end_date=CalendarDate(calendar_date="2020-01-31"),
            )
        ),
        taxonomic_coverage=TaxonomicCoverage(
            general_taxonomic_coverage="All vascular plants are identified as species or genus",
            taxonomic_classification=[
                TaxonomicClassification(taxon_rank_name="Genus", taxon_rank_value="Acacia"),
                TaxonomicClassification(taxon_rank_name="Genus", taxon_rank_value="Acacia"),
            ],
        ),
    ),
    intellectual_rights=[
        Description(
            description=
    """
    This work is licensed under a 
    <ulink url='http://creativecommons.org/licenses/by/4.0/legalcode'><citetitle>Creative Commons Attribution (CC-BY) 4.0 License</citetitle></ulink>
    """
        )
    ],
    contact=contact_person,
)

# GBIF Additional Metadata require at least a citation. 
# If gbif field is supplied, it must be a list of dictionaries
additional_metadata = AdditionalMetadata(
    metadata=GBIFMetadata(
        citation="Researchers should cite this work as follows: xxxxx",
        gbif=[{"resourceLogoUrl": "http://logo_url"}, {"hierarchyLevel":"dataset"}]
    )
)

eml = Eml(dataset=dataset,
          additional_metadata=additional_metadata)

eml_xml_str = eml.build_eml_xml()
print(eml_xml_str)
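
A generated EML string can be sanity-checked by parsing it with the standard library. The sketch below parses a small hand-written stand-in document, not real dwcahandler output. The namespace and element layout are assumptions based on common EML usage, and `dataset_title` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an EML document (not produced by dwcahandler).
sample_eml = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
  <dataset>
    <title>Test Dataset</title>
    <pubDate>2020-05-01</pubDate>
  </dataset>
</eml:eml>"""


def dataset_title(eml_xml):
    """Parse an EML string and return the dataset title, or None if absent."""
    root = ET.fromstring(eml_xml)  # raises ParseError if the XML is malformed
    title = root.find("./dataset/title")
    return title.text if title is not None else None


print(dataset_title(sample_eml))  # Test Dataset
```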
