dwcahandler

About

A Python package for Darwin Core Archive (DwCA) operations: creating a DwCA zip file from one or more CSV files, reading a DwCA, merging two DwCAs, validating a DwCA, and deleting records from a DwCA based on one or more key columns.

Motivation

This package was developed from a module in ALA's data pre-ingestion process, which produces DwCAs for pipeline ingestion. ALA receives data from various providers in different forms (CSV and text files, API harvests, and DwCAs), and all of it needs to be packaged into a DwCA.

The operations provided by dwcahandler include creating a DwCA from CSV/text files, merging two DwCAs, deleting records from a DwCA, and performing core key validations, such as checking for empty and duplicate keys across one or more key columns.
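To make the idea of a core key validation concrete, here is a minimal standard-library sketch that reports empty and duplicate keys in a CSV. It is an illustration only and not dwcahandler's implementation; the `check_keys` helper and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def check_keys(rows, keys):
    """Return (rows_with_empty_keys, duplicated_key_values) for the key columns.

    Hypothetical helper for illustration; not part of dwcahandler.
    """
    empty_rows = []
    seen = Counter()
    for row_no, row in enumerate(rows, start=1):
        values = tuple((row.get(k) or "").strip() for k in keys)
        if any(v == "" for v in values):
            # A missing value in any key column makes the whole key unusable.
            empty_rows.append(row_no)
        else:
            seen[values] += 1
    duplicates = sorted(k for k, count in seen.items() if count > 1)
    return empty_rows, duplicates


# Made-up sample data: one repeated key and one empty key.
sample = io.StringIO(
    "occurrenceID,scientificName\n"
    "occ-1,Acacia dealbata\n"
    "occ-2,Eucalyptus regnans\n"
    "occ-1,Acacia dealbata\n"
    ",Banksia serrata\n"
)
empty_rows, duplicates = check_keys(list(csv.DictReader(sample)), keys=["occurrenceID"])
print(empty_rows)   # 1-based data row numbers with an empty key
print(duplicates)   # key tuples that occur more than once
```

Passing several column names in `keys` checks the combination of those columns as a composite key.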

The module uses and maintains the standard DwC terms from a point-in-time versioned copy of https://dwc.tdwg.org/terms/ and of extensions such as https://rs.gbif.org/extension/gbif/1.0/multimedia.xml.

Technologies

This package is developed in Python and has been tested with Python 3.12, 3.11, 3.10 and 3.9.

 

Setup

  • Clone the repository.
  • If using pyenv, install the required Python version and activate it locally:
pyenv local <python version>
  • Install the dependencies in a local virtual environment:
poetry shell
poetry install
  • To update the Darwin Core terms supported by the dwcahandler package:
poetry run update-dwc-terms

 

Build

To build the dwcahandler package:

poetry build

 

Installation

To install the published package:

pip install dwcahandler

To use a locally built package in a virtual environment:

pip install <folder>/dwcahandler/dist/dwcahandler-<version>.tar.gz

To install the published package from TestPyPI:

pip install -i https://test.pypi.org/simple/ dwcahandler
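After any of the install commands above, one way to confirm which version ended up in the active environment is to query the installed distribution metadata. This is a small sketch using only the standard library; the `installed_version` helper is hypothetical and not part of dwcahandler.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the installed dwcahandler version, or None if it is not installed.
print(installed_version("dwcahandler"))
```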

 

Usage examples:

  • Create a Darwin Core Archive from CSV files
  • When creating a DwCA with the multimedia extension, provide format and type values in the Simple Multimedia extension; otherwise, dwcahandler will attempt to fill them in by guessing the MIME type from the URL.
from dwcahandler import CsvFileType
from dwcahandler import DwcaHandler
from dwcahandler import Eml

core_csv = CsvFileType(files=['/tmp/occurrence.csv'], type='occurrence', keys=['occurrenceID'])
ext_csvs = [CsvFileType(files=['/tmp/multimedia.csv'], type='multimedia', keys=['occurrenceID'])]

eml = Eml(dataset_name='Test Dataset',
          description='Dataset description',
          license='Creative Commons Attribution (International) (CC-BY 4.0 (Int) 4.0)',
          citation="test citation",
          rights="test rights")

DwcaHandler.create_dwca(core_csv=core_csv, ext_csv_list=ext_csvs, eml_content=eml, output_dwca_path='/tmp/dwca.zip')
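The note about guessing the MIME type from a URL can be illustrated with Python's standard `mimetypes` module. This is only a sketch of the general technique; whether dwcahandler uses `mimetypes` internally is an assumption, and the `guess_format` helper and example URLs are hypothetical.

```python
import mimetypes
from urllib.parse import urlparse


def guess_format(url):
    """Guess a MIME type from the path of a URL, or return None if unknown.

    Hypothetical helper for illustration; not part of dwcahandler.
    """
    # Use only the path so that query strings do not hide the file extension.
    path = urlparse(url).path
    mime_type, _encoding = mimetypes.guess_type(path)
    return mime_type


print(guess_format("https://example.org/media/photo.jpg?size=large"))
print(guess_format("https://example.org/media/12345"))
```

A URL without a recognisable file extension gives no guess, which is why supplying format and type values explicitly in the multimedia CSV is the more reliable option.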

 

  • Create a Darwin Core Archive from pandas DataFrames
  • When creating a DwCA with the multimedia extension, provide format and type values in the Simple Multimedia extension; otherwise, dwcahandler will attempt to fill them in by guessing the MIME type from the URL.
from dwcahandler import DwcaHandler
from dwcahandler.dwca import DataFrameType
from dwcahandler import Eml
import pandas as pd

core_df = pd.read_csv("/tmp/occurrence.csv")
core_frame = DataFrameType(df=core_df, type='occurrence', keys=['occurrenceID'])

ext_df = pd.read_csv("/tmp/multimedia.csv")
ext_frame = [DataFrameType(df=ext_df, type='multimedia', keys=['occurrenceID'])]

eml = Eml(dataset_name='Test Dataset',
          description='Dataset description',
          license='Creative Commons Attribution (International) (CC-BY 4.0 (Int) 4.0)',
          citation="test citation",
          rights="test rights")

DwcaHandler.create_dwca(core_csv=core_frame, ext_csv_list=ext_frame, eml_content=eml, output_dwca_path='/tmp/dwca.zip')

 

  • Merge two Darwin Core Archives
from dwcahandler import DwcaHandler

DwcaHandler.merge_dwca(dwca_file='/tmp/dwca.zip', delta_dwca_file='/tmp/delta-dwca.zip',
                       output_dwca_path='/tmp/new-dwca.zip', 
                       keys_lookup={'occurrence':'occurrenceID'})
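As a rough mental model of a key-based merge, the sketch below treats the delta as an upsert: delta rows replace existing rows with the same key, and rows with new keys are appended. This behaviour is an assumption made for illustration and is not confirmed as the exact semantics of `merge_dwca`; the `merge_by_key` helper and the sample rows are hypothetical.

```python
def merge_by_key(base_rows, delta_rows, key):
    """Upsert delta_rows into base_rows, matching rows on a single key column.

    Assumed semantics for illustration only; not dwcahandler's implementation.
    """
    merged = {row[key]: row for row in base_rows}
    for row in delta_rows:
        # An existing key keeps its position but takes the delta's values;
        # a new key is appended at the end.
        merged[row[key]] = row
    return list(merged.values())


base = [
    {"occurrenceID": "occ-1", "scientificName": "Acacia dealbata"},
    {"occurrenceID": "occ-2", "scientificName": "Eucalyptus regnans"},
]
delta = [
    {"occurrenceID": "occ-2", "scientificName": "Eucalyptus obliqua"},
    {"occurrenceID": "occ-3", "scientificName": "Banksia serrata"},
]
merged_rows = merge_by_key(base, delta, key="occurrenceID")
for row in merged_rows:
    print(row["occurrenceID"], row["scientificName"])
```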

 

  • Delete rows from the core file in a Darwin Core Archive
from dwcahandler import CsvFileType
from dwcahandler import DwcaHandler

delete_csv = CsvFileType(files=['/tmp/old-records.csv'], type='occurrence', keys=['occurrenceID'])

DwcaHandler.delete_records(dwca_file='/tmp/dwca.zip',
                           records_to_delete=delete_csv, 
                           output_dwca_path='/tmp/new-dwca.zip')
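Conceptually, deleting by key means dropping every core row whose key values appear in the file of records to delete. The sketch below shows that idea on plain dictionaries; it is an illustration under that assumption rather than dwcahandler's implementation, and the `delete_by_key` helper and sample rows are hypothetical.

```python
def delete_by_key(rows, rows_to_delete, keys):
    """Return the rows whose key values do not appear in rows_to_delete.

    Hypothetical helper for illustration; not part of dwcahandler.
    """
    doomed = {tuple(row[k] for k in keys) for row in rows_to_delete}
    return [row for row in rows if tuple(row[k] for k in keys) not in doomed]


core_rows = [
    {"occurrenceID": "occ-1", "scientificName": "Acacia dealbata"},
    {"occurrenceID": "occ-2", "scientificName": "Eucalyptus regnans"},
    {"occurrenceID": "occ-3", "scientificName": "Banksia serrata"},
]
old_records = [{"occurrenceID": "occ-2"}]
remaining = delete_by_key(core_rows, old_records, keys=["occurrenceID"])
print([row["occurrenceID"] for row in remaining])
```

Only the key columns need to be present in the records to delete; other columns are ignored in this sketch.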

 

  • List the Darwin Core terms supported by the dwcahandler package
from dwcahandler import DwcaHandler

df = DwcaHandler.list_dwc_terms()
print(df)

 

  • Other uses include subclassing the Dwca class, modifying the core DataFrame content, and rebuilding the DwCA.
from dwcahandler import Dwca

class DerivedDwca(Dwca):
    """
    Derived class to perform other custom operations that are not included in the core operations
    """
    def drop_columns(self):
        """
        Drop existing columns from the core content
        """
        self.core_content.df_content.drop(columns=['column1', 'column2'], inplace=True)
        self._update_meta_fields(self.core_content)


dwca = DerivedDwca(dwca_file_loc='/tmp/dwca.zip')
dwca.extract_dwca()
dwca.drop_columns()
dwca.generate_eml()
dwca.generate_meta()
dwca.write_dwca('/tmp/newdwca.zip')
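A rebuilt archive is an ordinary zip file, so its contents can be checked with the standard library. The sketch below builds a tiny in-memory stand-in instead of reading a real dwcahandler output; the data file name and the placeholder XML are assumptions for illustration, and the `list_archive` helper is hypothetical. A Darwin Core Archive normally carries a meta.xml descriptor and usually an eml.xml metadata file alongside the data files.

```python
import io
import zipfile


def list_archive(zip_source):
    """Return the sorted member names of a zip file (path or file-like object)."""
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(archive.namelist())


# Build a small in-memory stand-in for an archive (contents are placeholders).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("meta.xml", "<archive/>")
    archive.writestr("eml.xml", "<eml/>")
    archive.writestr("occurrence.csv", "occurrenceID\nocc-1\n")
buffer.seek(0)

names = list_archive(buffer)
print(names)
print("meta.xml" in names)
```

To inspect a real output, pass its path (for example the one given to write_dwca) to `list_archive` instead of the in-memory buffer.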
