
HDP Declarative Programming toolchain (working draft)


Data Science files exported from HXL (The Humanitarian Exchange Language)

[Proof of concept] Common file formats used for Data Science exported from HXL (The Humanitarian Exchange Language)




HXL-Data-Science-file-formats

In addition to this GitHub repository, check also the EticaAI-Data_HXL-Data-Science-file-formats Google Drive folder.

1. The main focus

1.1 Vocabulary, Taxonomies and URNs

1.1.1 Vocabulary & Taxonomies on HXL

This project either uses explicit HXL +attributes (easy to implement, but more verbose) or makes inferences on well-known HXLated datasets used in humanitarian areas. To make this work, the main reference is not the software implementation but the reference tables.
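As a rough illustration of what "explicit +attributes" means in practice, the sketch below splits a hashtag into its base tag and its attributes. The function name `parse_hashtag` and the regular expression are assumptions made for this example; they are not taken from this project or from libhxl, which has its own parser.

```python
import re

# Hypothetical helper (not part of this project or libhxl): split an HXL
# hashtag such as "#adm2+code" into its base tag and its +attributes.
HASHTAG_PATTERN = re.compile(
    r"^\s*#([A-Za-z]\w*)((?:\s*\+[A-Za-z]\w*)*)\s*$"
)


def parse_hashtag(raw):
    """Return (tag, [attributes]) or raise ValueError if `raw` is not a hashtag."""
    match = HASHTAG_PATTERN.match(raw)
    if match is None:
        raise ValueError(f"not an HXL hashtag: {raw!r}")
    tag = "#" + match.group(1).lower()
    # Attributes may be written with or without spaces before each "+".
    attributes = [a.lower() for a in re.findall(r"\+(\w+)", match.group(2))]
    return tag, attributes


if __name__ == "__main__":
    print(parse_hashtag("#adm2+code"))
    print(parse_hashtag("#vocab +id +v_iso6393_3letter"))
```

A conversion tool could then look up each attribute in a reference table instead of hard-coding rules in the software.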

1.1.2 Uniform Resource Name on URN:DATA
Why using URNs to identify resources is more than a naming convention

Finding good URN conventions for the typical datasets used in humanitarian contexts is more complex than for the ISO URN or even the LEX URN (the latter already used in Brazil). One goal of the urnresolver is to accept that most shared data is VERY sensitive and private, so this is the actual challenge. So, in addition to converting some well-known public datasets related to HXL, we are already designing URNs to eventually serve as an abstraction for scripts and tools that would otherwise need access to the real datasets.

By using URNs, in the worst case we are creating documentation and scripts in which a new user needs to replace the URNs with the real ones for their use case. The ideal case is to allow the exchange of scripts or, when an issue happens in a new region, to let the personnel who prepare the data publish it on a private URN listing so that others can reuse it.

Note that even if the URN Resolver has links to resources and not just to a contact page, the links to download the real data could still require authentication on a case-by-case basis. Also, if you manage to have contact with several peers, the same URN is likely to resolve to more than one option, especially for datasets that are not already a COD but are often needed.
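The idea that one URN may map to several candidates (a contact page, a download link that may still need authentication, or links from different peers) can be sketched with a plain lookup table. This is only an illustration: `LISTING`, `resolve`, the second URN and the `example.org` addresses are invented for this sketch and do not describe how urnresolver stores or loads its listings.

```python
# Hypothetical in-memory listing. A real setup would load local or private
# listing files instead. The first URN and URL are the ones used in this
# README's urnresolver example; the second entry is invented.
LISTING = {
    "urn:data:xz:hxl:standard:core:hashtag": [
        "https://docs.google.com/spreadsheets/d/"
        "1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/"
        "pub?gid=319251406&single=true&output=csv",
    ],
    "urn:data:xz:example:private-dataset": [
        "https://peer-a.example.org/contact",   # contact page only
        "https://peer-b.example.org/data.csv",  # may require authentication
    ],
}


def resolve(urn, listing=LISTING):
    """Return every known candidate IRI for a URN (possibly none or several)."""
    return list(listing.get(urn.strip(), []))


if __name__ == "__main__":
    for candidate in resolve("urn:data:xz:example:private-dataset"):
        print(candidate)
```

Because the URN itself carries no secret, a script that only mentions the URN can be shared, while each site keeps its own listing private.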

Deeper integration with CKAN instances and/or awareness of encrypted data is not yet implemented in the current version (v0.7.3).

Security (and privacy) considerations (for URN:DATA)

One main goal of URNs is to help with auditing, with sharing scripts, and even with referencing the "best acceptable use" of exchanged data (with a special focus on private/sensitive data). While the URN:DATA identifiers themselves are meant NOT to be secret and could be published in official documents, the local implementations (that is, how these URNs are resolved/redirected to real data) need to take into account that the goals of a "perfect optimization" (think "secure from misuse" vs "protect privacy from legitimate use") are often contradictory.

TODO: add more context

Disclaimer (for URN:DATA)

Note: in addition to the CLI tools that convert URNs into usable resources ("the implementation"), this project also drafts the logic for constructing potentially useful URNs that are reusable at the international level (which may look like a draft of "a standard", think ISO, or a Best Current Practice, think IETF). Please do not take EticaAI/HXL-Data-Science-file-formats... as endorsed by any organization.

Also, the authors from @EticaAI / @HXL-CPLP (both past and future ones who cooperate directly with this project) explicitly release both the software and the drafted "how to implement" under public domain-like licenses. Under ideal circumstances, a global data namespace (the ZZ in urn:data:ZZ:example) may have more specific rules.

1.2 HXL2 Command line tools

1.2.1 hxl2example: create your own exporter/importer

The hxl2example is an example Python script with generic functionality that allows you to create your own custom functions. Feel free to add your name, edit the license, etc.

What it does: hxl2example accepts one HXLated dataset and saves it as .CSV.

Quick examples

### Basic examples

# This will output a local file to stdout (tip: you can disable local files)
hxl2example tests/files/iris_hxlated-csv.csv

# This will save to a local file
hxl2example tests/files/iris_hxlated-csv.csv my-local-file.example

# Since we use libhxl-python, remote HXLated URLs work too!
hxl2example https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/edit#gid=319251406

### Advanced usage (if you need to share work with others)

## Quick ad-hoc web proxy, local usage
# @see https://github.com/hugapi/hug

hug -f bin/hxl2example
# http://localhost:8000/ will show a JSON documentation of the hug endpoints. TL;DR:
# http://localhost:8000/hxl2example.csv?source_url=http://example.com/remote-file.csv

## Expose local web proxy to others
# @see https://ngrok.com/
ngrok http 8000
1.2.2 hxl2tab: tab format, focused for compatibility with Orange Data Mining

What it does: hxl2tab uses an already HXLated dataset and then, based on #hashtag+attributes, generates an Orange Data Mining .tab format with extra hints.

hxl2tab v2.0 has some usable functionality for generating the file through a web interface instead of the CLI. It uses hug 🐨 🤗.

If you want to quickly expose it outside localhost, try ngrok.
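To give an idea of what "a .tab format with extra hints" can look like, the sketch below builds the three header rows that Orange's tab-delimited format uses (column names, column types, optional flags) from hashtags. The attribute names in `TYPE_HINTS` and `FLAG_HINTS` are placeholders chosen for this example; the attributes hxl2tab actually recognizes are defined in the project's reference tables and may differ.

```python
import csv
import io

# Hypothetical attribute-to-hint tables, for illustration only.
TYPE_HINTS = {"number": "continuous", "category": "discrete", "text": "string"}
FLAG_HINTS = {"class": "class", "meta": "meta", "ignore": "ignore"}


def orange_header(hashtags):
    """Build the three Orange header rows: names, types and flags."""
    names, types, flags = [], [], []
    for hashtag in hashtags:
        tag, *attributes = hashtag.lstrip("#").replace(" ", "").split("+")
        names.append("_".join([tag] + attributes))
        # Fall back to "string" when no attribute gives a type hint.
        types.append(next((TYPE_HINTS[a] for a in attributes if a in TYPE_HINTS), "string"))
        flags.append(next((FLAG_HINTS[a] for a in attributes if a in FLAG_HINTS), ""))
    return names, types, flags


def to_tab(hashtags, rows):
    """Serialize the header rows plus the data rows as tab-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(orange_header(hashtags))
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(to_tab(["#item+number", "#item+category+class"], [["5.1", "setosa"]]))
```

Keeping the mapping in plain tables mirrors the project's stated preference for reference tables over logic buried in code.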

1.2.3 hxlquickmeta: output information about local/remote datasets (even non HXLated yet)

What it does: hxlquickmeta outputs information about a local or remote dataset. If the file is already HXLated, it prints even more information.

v1.1.0 added support for giving an overview by default, similar to what users of Python pandas would expect.

Quick examples

#### Inline result for a hashtag and (optional) value __________________________

hxlquickmeta --hxlquickmeta-hashtag="#adm2+code" --hxlquickmeta-value="BR3106200"
# > get_hashtag_info
# >> hashtag: #adm2+code
# >>> HXLMeta._parse_heading: #adm2+code
# >>> HXLMeta.is_hashtag_base_valid: None
# >>> libhxl_is_token None
# >> value: BR3106200
# >>> libhxl_is_empty False
# >>> libhxl_is_date False
# >>> libhxl_is_number False
# >>> libhxl_is_string True
# >>> libhxl_is_token None
# >>> libhxl_is_truthy False
# >>> libhxl_typeof string

#### Output information for a file and (if any) HXLated information ____________
# Local file
hxlquickmeta tests/files/iris_hxlated-csv.csv

# Remote file
hxlquickmeta https://docs.google.com/spreadsheets/u/1/d/1l7POf1WPfzgJb-ks4JM86akFSvaZOhAUWqafSJsm3Y4/edit#gid=634938833
1.2.4 hxlquickimport: (like the hxltag)

What it does: hxlquickimport is similar to hxltag (one of the CLI tools installed with libhxl), but by default it mostly just tries to slugify whatever was in the old headers and add it as an HXL attribute. Please consider using the HXL-Proxy for serious usage. This quick script is more for internal testing.
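The "slugify the old header and add it as an attribute" step could look roughly like the sketch below. The names `slugify` and `quick_hashtag`, the default base tag `#item` and the `x_` prefix for slugs that do not start with a letter are all assumptions made for this example, not the behavior of hxlquickimport itself.

```python
import re
import unicodedata


def slugify(text):
    """Lowercase, strip accents and collapse other characters into underscores."""
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "_", ascii_text.lower()).strip("_")


def quick_hashtag(header, base="#item"):
    """Turn an old text header into `base` plus a slugified attribute.

    The base tag is a placeholder; pick one that fits your data.
    """
    slug = slugify(header)
    if not slug:
        return base
    if not slug[0].isalpha():
        # Assumption: keep the attribute starting with a letter.
        slug = "x_" + slug
    return f"{base}+{slug}"


if __name__ == "__main__":
    for old_header in ["Sepal Length (cm)", "População Total", "2020 total"]:
        print(old_header, "->", quick_hashtag(old_header))
```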

1.3 URN Command line tools

1.3.1 urnresolver: convert Uniform Resource Name of datasets to real IRIs (URLs)

The urnresolver is a proof of concept of a URN resolver (see Uniform Resource Name (URN) on Wikipedia).

Examples (note: early working draft!)

# Basic usage: based on local and (to be implemented) remote listing pages,
# it translates one readable URN to one or more datasets
urnresolver urn:data:xz:hxl:standard:core:hashtag
# https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/pub?gid=319251406&single=true&output=csv

# Now, a more practical example: using the result as input to other commands:
hxlselect "$(urnresolver urn:data:xz:hxl:standard:core:hashtag)" --query '#valid_vocab=+v_pcode'
#    Hashtag,Hashtag one-liner,Hashtag long description,Release status,Data type restriction,First release,Default taxonomy,Category,Sample HXL,Sample description
#    #valid_tag,#description+short+en,#description+long+en,#status,#valid_datatype,#meta+release,#valid_vocab+default,#meta+category,#meta+example+hxl,#meta+example+description+en
#    #adm1,Level 1 subnational area,Top-level subnational administrative area (e.g. a governorate in Syria).,Released,,1.0,+v_pcode,1.1. Places,#adm1 +code,administrative level 1 P-code
#    #adm2,Level 2 subnational area,Second-level subnational administrative area (e.g. a subdivision in Bangladesh).,Released,,1.0,+v_pcode,1.1. Places,#adm2 +name,administrative level 2 name
#    #adm3,Level 3 subnational area,Third-level subnational administrative area (e.g. a subdistrict in Afghanistan).,Released,,1.0,+v_pcode,1.1. Places,#adm3 +code,administrative level 3 P-code
#    #adm4,Level 4 subnational area,Fourth-level subnational administrative area (e.g. a barangay in the Philippines).,Released,,1.0,+v_pcode,1.1. Places,#adm4 +name,administrative level 4 name
#    #adm5,Level 5 subnational area,Fifth-level subnational administrative area (e.g. a ward of a city).,Released,,1.0,+v_pcode,1.1. Places,#adm5 +code,administrative level 5 name

hxlselect "$(urnresolver urn:data:xz:hxlcplp:fod:lang)" --query '#vocab+id+v_iso6393_3letter=por'
#    Id,Part2B,Part2T,Part1,Scope,Language_Type,Ref_Name,Comment
#    #vocab+id+v_iso6393_3letter,#vocab+code+v_iso3692_3letter+z_bibliographic,#vocab+code+v_3692_3letter+z_terminology,#vocab+code+v_6391,#status,#vocab+type,#vocab+name,#description+comment+i_en
#    por,por,por,pt,I,L,Portuguese,
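The URNs used in these examples share a visible shape: the literal prefix `urn:data`, then a namespace segment (the `ZZ` position mentioned in the disclaimer, `xz` here), then further colon-separated segments. A minimal sketch of splitting such a string follows; the `DataUrn` type and its field names are invented for this illustration and say nothing about how urnresolver parses URNs internally.

```python
from typing import NamedTuple, Tuple


class DataUrn(NamedTuple):
    """Hypothetical container for the parts of a urn:data:... string."""

    namespace: str
    path: Tuple[str, ...]


def parse_data_urn(urn):
    """Split 'urn:data:<namespace>:<segment>[:<segment>...]' into its parts."""
    parts = urn.strip().split(":")
    if len(parts) < 4 or parts[0].lower() != "urn" or parts[1].lower() != "data":
        raise ValueError(f"not a urn:data URN: {urn!r}")
    return DataUrn(namespace=parts[2].lower(), path=tuple(parts[3:]))


if __name__ == "__main__":
    print(parse_data_urn("urn:data:xz:hxl:standard:core:hashtag"))
    print(parse_data_urn("urn:data:xz:hxlcplp:fod:lang"))
```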

1.4 HDP HDP Declarative Programming (early draft)

1.4.1 HDP conventions (The YAML/JSON file structure)
1.4.2 hdpcli (command line interface)
1.4.3 HXLm.HDP (python library subpackage) usage

2. Reasons behind

2.1 Why?

HXL is already used in production, especially in humanitarian areas (see The Humanitarian Data Exchange). With a one-line change, it is possible to make most already-used spreadsheet-like data machine readable without disturbing end users the way other alternatives do. One notable implementation (data visualization) powered by HXL is HXLDash (see this HXLDash example video).

The idea of this project is to draft strategies that let already HXLated datasets be used directly in open source desktop tools like Orange Data Mining and WEKA ("The workbench for machine learning"), with the minimum extra explanation on how to convert existing HXL datasets AND with tools that solve known issues that are likely to be found.

2.2 How?

NOTE: it is already possible to use HXLated CSVs in these tools! For those who are learning HXL or using it in production with humanitarian intent, the HXL-proxy (https://proxy.hxlstandard.org/) with "Strip text headers" can serve live-updated CSV-like files. Other usages can still use the HXL CLI tools or run unocha/hxl-proxy with Docker on your machine or on a private or public server.
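For a local file, the effect of dropping the text headers so that the hashtag row becomes the first line can be approximated with the standard library alone. This is a simplified sketch, not the HXL-proxy's implementation: the detection rule in `is_hashtag_row` (every non-empty cell starts with `#`) is an assumption, and the sample hashtags are made up.

```python
import csv
import io


def is_hashtag_row(row):
    """Simplified rule: every non-empty cell starts with '#'."""
    cells = [cell.strip() for cell in row if cell.strip()]
    return bool(cells) and all(cell.startswith("#") for cell in cells)


def strip_text_headers(lines):
    """Yield the hashtag row and every row after it, skipping earlier rows."""
    rows = csv.reader(lines)
    for row in rows:
        if is_hashtag_row(row):
            yield row
            break
    else:
        raise ValueError("no HXL hashtag row found")
    # The reader is consumed lazily, so large files are streamed row by row.
    yield from rows


if __name__ == "__main__":
    sample = io.StringIO(
        "Sepal length,Species\n"
        "#item+sepal_length,#item+species\n"
        "5.1,setosa\n"
    )
    for row in strip_text_headers(sample):
        print(row)
```

Tools that expect a single header line can then read the result as an ordinary CSV whose column names happen to be hashtags.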

One way to implement this is to create minimum usable conversion tools that are able to export already HXLated datasets with additional hints to file formats used by default by their applications.

In practice this goes beyond plain file conversion (like XLSX to CSV), since it includes both the "variable type" AND the "intent to use (on data mining)". This is why this project also has the taxonomy/vocabulary reference table (and this is actually more important than the implementation itself!). Without some extra step, HXLated datasets work like an average CSV (good, but just not great).

But yes, some of the target formats, especially Weka's (at least compared to Orange), are stricter about the tabular format they accept, and this can be infuriating EVEN for those who would actually know how to debug these issues! But this issue, at least, is more automatable.

Note: one practical reason to use HXLated files as the base instead of plain CSV or XLSX (beyond obviously being available in humanitarian contexts) is that the grammar of HXL +attributes is flexible enough to export to several different formats with the freedom to choose other aspects of the tagging.

2.3 Non-goals

  • The software implementation for file formats not typically used by easy-to-use desktop applications is a non-goal
    • Yet, as part of the HXL +attributes conversion tables, some of these proposed implementations may already be drafted. These reference tables are released under public domain licenses.
    • Note that people who already use these formats are likely to have the skill to manually convert from CSVs (so they could convert from HXL)
  • The software implementation (at least at the start) will not optimize for speed or low local disk usage
    • but it should work to convert large datasets with reasonably low memory usage
  • The software implementations assume an already HXLated input dataset to keep things simple
    • Note that it is possible to quickly convert already well-formatted CSVs to HXL by changing the header line (first line of the CSV).
  • While it is technically possible to import back (reconstruct the original HXLated file) from exported files, 100% compatibility is a non-goal
    • This applies especially to .arff exports: the default export may need to clean known issues with exported strings.
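Two of the points above, converting a well-formatted CSV to HXL by changing its header line and keeping memory usage low, can be sketched together by streaming the file one row at a time. The function name `hxlate` and the `keep_text_header` option are made up for this example; the hashtags passed in are the caller's choice.

```python
import csv
import io


def hxlate(source, target, hashtags, keep_text_header=False):
    """Stream `source` to `target`, replacing (or following) the header row
    with a row of HXL hashtags. Only one row is held in memory at a time."""
    reader = csv.reader(source)
    writer = csv.writer(target, lineterminator="\n")
    header = next(reader, None)
    if header is None:
        raise ValueError("empty input")
    if len(header) != len(hashtags):
        raise ValueError("one hashtag per column is required")
    if keep_text_header:
        writer.writerow(header)
    writer.writerow(hashtags)
    for row in reader:
        writer.writerow(row)


if __name__ == "__main__":
    plain = io.StringIO("Province,Population\nA,10\nB,20\n")
    out = io.StringIO()
    hxlate(plain, out, ["#adm1+name", "#population"])
    print(out.getvalue())
```

With real files, open both with `newline=""` as the `csv` module documentation recommends.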

HXLated datasets to test

Production data on The Humanitarian Data Exchange ("HDX")

The Humanitarian Data Exchange ("HDX") contains public datasets, and part of them are already HXLated and ready to test.

PROTIP: on https://proxy.hxlstandard.org/data/source, "Option 2: choose from the cloud" also has an "HDX" icon that can be used. This can be helpful if you are just browsing several datasets.

Files from EticaAI-Data_HXL-Data-Science-file-formats

Both the Google Drive folder and this repository have some test files. The not-so-documented manual tests may also give a quick idea of how it works.

Additional Guides

Note: these additional guides are not part of the main focus of this project

Command line tools for CSV

NOTE: Often people who work with HXL simply use the HXL-proxy, including to convert from non-HXLated sources.

Here is a quick overview of different command line tools that are worth at least a mention, especially if you are dealing with raw formats that are not yet HXLated.

Alternatives to preview spreadsheets with over 1.000.000 rows

90% of the time, 1.000.000 rows are likely to be enough, even if you are dealing with data science projects. This means there is no need to use command line tools or more complex solutions, like importing into a database or paying for enterprise solutions.

This guide is for when you need to go over these limits without changing your tools too much.
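When a file is too large for a spreadsheet application, a preview and a row count can still be obtained by streaming it. The sketch below uses only the Python standard library; the helper names `preview` and `count_rows` are chosen for this example and are not part of this project.

```python
import csv
import io
from itertools import islice


def preview(lines, limit=5):
    """Return the first `limit` rows without reading the rest of the input."""
    return list(islice(csv.reader(lines), limit))


def count_rows(lines):
    """Count CSV rows one at a time, so memory use stays flat."""
    return sum(1 for _ in csv.reader(lines))


if __name__ == "__main__":
    # A small in-memory stand-in for a very large file.
    data = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))
    print(preview(io.StringIO(data), limit=3))
    print(count_rows(io.StringIO(data)))
    # With a real file (the path is a placeholder):
    # with open("big-file.csv", newline="", encoding="utf-8") as handle:
    #     print(preview(handle))
```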

License

Public Domain Dedication

The EticaAI has dedicated the work to the public domain by waiving all of their rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

