HDP Declarative Programming toolchain (working draft)
Project description

[Proof of concept] Common file formats used for Data Science, exported from HXL (The Humanitarian Exchange Language).

- 1. The main focus
  - 1.1 Vocabulary, Taxonomies and URNs
  - 1.2 HXL2 command line tools
  - 1.3 URN command line tools
  - 1.4 HDP Declarative Programming (early draft)
  - 1.5 HXLTM: HXL Trānslātiōnem Memoriam
- 2. Reasons behind
- HXLated datasets to test
- Additional Guides

In addition to this GitHub repository, check also the EticaAI-Data_HXL-Data-Science-file-formats Google Drive folder.
1. The main focus
1.1 Vocabulary, Taxonomies and URNs
1.1.1 Vocabulary & Taxonomies on HXL
This project either uses explicit HXL +attributes (easy to implement, but more verbose) or makes inferences on well-known HXLated datasets used in humanitarian areas. To make this work, the main reference is not a software implementation, but reference tables.
1.1.2 Uniform Resource Name on URN:DATA
- Extra content: urn-data-specification/ (warning: it's complicated)
Why use URN to identify resources: it is more than a naming convention

Finding good URN conventions for the typical datasets used in a humanitarian context is more complex than the ISO URN or even the LEX URN (the latter already used in Brazil). One goal of the urnresolver is to accept that most shared data is VERY sensitive and private; this is actually the challenge. So in addition to converting some well-known public datasets related to HXL, we are already designing it to eventually be used as an abstraction for scripts and tools that would otherwise need access to the real datasets.
By using URNs, in the worst case we are creating documentation and scripts that a new user would need to replace with the real ones for their use case. But in the ideal case it allows exchanging scripts; when an issue happens in a new region, the people who prepare the data could do so and then also publish on a private URN listing so others could reuse it.
Note that even if the URN resolver has links to resources (and not just a contact page), the links to download the real data could still require authentication case by case. Also, the same URN, if you manage to be in contact with several peers, is likely to resolve to more than one usable option, especially for datasets that are not already a COD but are often needed.
Deeper integration with CKAN instances and/or awareness of encrypted data is not yet implemented in the current version (v0.7.3).
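The core idea described above, one URN mapping to one or more candidate URLs that each organization maintains locally, can be sketched in a few lines. This is a conceptual illustration only: the `URN_INDEX` mapping and `resolve` helper below are hypothetical, not the real urnresolver internals.

```python
# Minimal sketch of the URN -> URL idea behind urnresolver.
# URN_INDEX is hypothetical; a real deployment would load it from a
# (possibly private) listing maintained by each organization.
URN_INDEX = {
    "urn:data:xz:hxl:standard:core:hashtag": [
        "https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/pub?gid=319251406&single=true&output=csv",
    ],
}

def resolve(urn: str) -> list:
    """Return candidate URLs for a data URN (empty list if unknown)."""
    if not urn.startswith("urn:data:"):
        raise ValueError("not a urn:data URN: " + urn)
    # The same URN may resolve to several mirrors/peers; return all of them.
    return URN_INDEX.get(urn, [])

print(resolve("urn:data:xz:hxl:standard:core:hashtag"))
```

Because the URN (not the URL) is what appears in shared scripts, swapping the private listing is enough to point the same script at a different region's data.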
Security (and privacy) considerations (for URN:DATA)
Since the main goal of URNs is also to help with auditing and sharing of scripts, and even with referencing the "best acceptable use" of exchanged data (with special focus on private/sensitive data), while the URN:DATA identifiers themselves are meant NOT to be a secret and could be published on official documents, the local implementations (i.e. how to resolve/redirect these URNs to real data) need to take into account that a "perfect optimization" is often contradictory (think "secure from misuse" vs "protect privacy from legitimate use").
TODO: add more context
Disclaimer (for URN:DATA)
Note: while this project, in addition to CLI tools to convert URNs into something usable ("the implementation"), also drafts the logic of how to construct potentially useful URNs reusable at an international level (i.e. what may seem like a drafted standard, think ISO, or a Best Current Practice, think IETF), please do not take EticaAI/HXL-Data-Science-file-formats... as endorsed by any organization.
Also, authors from @EticaAI / @HXL-CPLP (both past and future ones who cooperate directly with this project) explicitly release both the software and the drafted 'how to implement' under public domain-like licenses. Under ideal circumstances a data global namespace (the ZZ on urn:data:ZZ:example) may have more specific rules.
1.1.3 Ontologia
See ontologia/
"In computer science and information science, an ontology encompasses a representation, formal naming and definition of the categories, properties and relations between the concepts, data and entities that substantiate one, many, or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject." -- [Wikipedia: Ontology (information science)](https://en.wikipedia.org/wiki/Ontology_(information_science))
The contents of ontologia/ include both some selected datasets and (while not 100% converted) the main parts of what the command line tools and libraries released by this repository use.
Why: abstract away complexity for users AND allow reuse by other projects
When feasible, even if it makes the initial implementation harder or a bit less efficient than dedicated "advanced" strategies with state-of-the-art tools, the internal parts of hxlm.core that deal with ontology will be stored in this folder.
This strategy is likely to make it easier for non-developers to update internals, like individuals interested in adding new languages or proposing corrections.
Distribution channels
For production usage, these files are available via:
- Installable with the Python PyPI package hdp-toolchain
- The GitHub repository https://github.com/EticaAI/HXL-Data-Science-file-formats
- Public "CDN": GitHub hosted + CloudFlare cached endpoint at https://hdp.etica.ai/ontologia/
1.2 HXL2 command line tools
- See folder bin/
- See discussions at
- See (not so documented) tests: tests/manual-tests.sh
1.2.1 hxl2example: create your own exporter/importer
- Source code: bin/hxl2example
The hxl2example is an example Python script with generic functionality that allows you to create your own custom functions. Feel free to add your name, edit the license, etc.
What it does: hxl2example accepts one HXLated dataset and saves it as .CSV.
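The skeleton of such an exporter is small: find the HXL hashtag row, then stream rows out in the target format. The sketch below uses inline sample data and the standard library only; the real bin/hxl2example uses libhxl to do the HXL parsing.

```python
# Sketch of the pattern a custom hxl2example-style exporter follows:
# locate the HXL hashtag row, then write the data out in the target format.
# The sample data and column hashtags here are illustrative.
import csv
import io

SAMPLE = """Sepal length,Species
#item+eng_sepal_length,#item+class
5.1,setosa
4.9,setosa
"""

def export_csv(hxlated_text: str) -> str:
    rows = list(csv.reader(io.StringIO(hxlated_text)))
    # Row 1: human headers; row 2: HXL hashtags; the rest: data.
    hashtags, data = rows[1], rows[2:]
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(hashtags)   # keep hashtags as the machine-readable header
    writer.writerows(data)
    return out.getvalue()

print(export_csv(SAMPLE))
```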
Quick examples
### Basic examples
# This will output a local file to stdout (tip: you can disable local files)
hxl2example tests/files/iris_hxlated-csv.csv
# This will save to a local file
hxl2example tests/files/iris_hxlated-csv.csv my-local-file.example
# Since we use libhxl-python, remote HXLated URLs work too!
hxl2example https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/edit#gid=319251406
### Advanced usage (if you need to share work with others)
## Quick ad-hoc web proxy, local usage
# @see https://github.com/hugapi/hug
hug -f bin/hxl2example
# http://localhost:8000/ will show JSON documentation of the hug endpoints. TL;DR:
# http://localhost:8000/hxl2example.csv?source_url=http://example.com/remote-file.csv
## Expose local web proxy to others
# @see https://ngrok.com/
ngrok http 8000
1.2.2 hxl2tab: tab format, focused on compatibility with Orange Data Mining
- Main issue: https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/2
- Orange File Specification: https://orange-data-mining-library.readthedocs.io/en/latest/reference/data.io.html
- Source code: bin/hxl2tab
What it does: hxl2tab takes an already HXLated dataset and then, based on #hashtag+attributes, generates an Orange Data Mining .tab format with extra hints.
hxl2tab v2.0 has some usable functionality to use a web interface instead of the CLI to generate the file. It uses hug 🐨 🤗. If you want to quickly expose it outside localhost, try ngrok.
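The "extra hints" idea can be sketched as a mapping from HXL attributes to the header rows of Orange's .tab format (which uses a type row such as continuous/discrete/string and a flag row such as class/meta). The attribute names used below (+number, +vt_categorical, +vt_class) are illustrative, not necessarily the exact convention hxl2tab implements.

```python
# Sketch: map one HXL hashtag to Orange .tab (type, flag) header hints.
# Attribute names here are hypothetical examples.
def tab_hints(hashtag: str):
    """Return (type, flag) Orange .tab hints for one HXL hashtag."""
    attrs = hashtag.split("+")[1:]
    if "number" in attrs:
        ttype = "continuous"
    elif "vt_categorical" in attrs:
        ttype = "discrete"
    else:
        ttype = "string"
    # A hypothetical +vt_class attribute marks the prediction target.
    flag = "class" if "vt_class" in attrs else ""
    return ttype, flag

print(tab_hints("#item+number"))                   # ('continuous', '')
print(tab_hints("#item+vt_categorical+vt_class"))  # ('discrete', 'class')
```

This is why the reference tables matter more than the implementation: the mapping itself is the reusable part.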
Installation
This package can be installed either by copying bin/hxl2tab to a place on your executable path and installing its dependencies manually, or the automated way, as part of the Python PyPI package hdp-toolchain with the extra dependencies:
python3 -m pip install hdp-toolchain[hxl2tab]
# python3 -m pip install hdp-toolchain[full]
1.2.3 hxlquickmeta: output information about local/remote datasets (even ones not yet HXLated)
- Main issue: https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/6
- Source code: bin/hxlquickmeta
What it does: hxlquickmeta outputs information about a local or remote dataset. If the file already is HXLated, it will print even more information. v1.1.0 added support for giving an overview by default, equivalent to what users of Python Pandas would get.
Installation
This package can be installed either by copying bin/hxlquickmeta to a place on your executable path and installing its dependencies manually, or the automated way, as part of the Python PyPI package hdp-toolchain with the extra dependencies:
python3 -m pip install hdp-toolchain[hxlquickmeta]
# python3 -m pip install hdp-toolchain[full]
Quick examples
#### Inline result for a hashtag and (optional) value ________________________
hxlquickmeta --hxlquickmeta-hashtag="#adm2+code" --hxlquickmeta-value="BR3106200"
# > get_hashtag_info
# >> hashtag: #adm2+code
# >>> HXLMeta._parse_heading: #adm2+code
# >>> HXLMeta.is_hashtag_base_valid: None
# >>> libhxl_is_token None
# >> value: BR3106200
# >>> libhxl_is_empty False
# >>> libhxl_is_date False
# >>> libhxl_is_number False
# >>> libhxl_is_string True
# >>> libhxl_is_token None
# >>> libhxl_is_truthy False
# >>> libhxl_typeof string
#### Output information for a file, and (if any) HXLated information __________
# Local file
hxlquickmeta tests/files/iris_hxlated-csv.csv
# Remote file
hxlquickmeta https://docs.google.com/spreadsheets/u/1/d/1l7POf1WPfzgJb-ks4JM86akFSvaZOhAUWqafSJsm3Y4/edit#gid=634938833
1.2.4 hxlquickimport: (like hxltag)
- Main issue: https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/6
- Source code: bin/hxlquickimport
What it does: hxlquickimport is similar to hxltag (a CLI tool installed with libhxl); by default it mostly just slugifies whatever was in the old headers and adds it as an HXL attribute. Please consider using the HXL-Proxy for serious usage. This quick script is more for internal testing.
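The slugify step mentioned above is simple to illustrate: lowercase the old header, replace non-alphanumeric runs, and attach the result as an attribute. The base hashtag `#item` and the helper name below are hypothetical, not the exact choice hxlquickimport makes.

```python
# Sketch of the "slugify old header into an HXL attribute" step.
# The base hashtag (#item) is an illustrative placeholder.
import re

def slugify_header(header: str, base: str = "#item") -> str:
    slug = re.sub(r"[^a-z0-9]+", "_", header.strip().lower()).strip("_")
    return f"{base}+{slug}"

print(slugify_header("Sepal length (cm)"))  # #item+sepal_length_cm
```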
Installation
This package can be installed either by copying bin/hxlquickimport to a place on your executable path and installing its dependencies manually, or the automated way, as part of the Python PyPI package hdp-toolchain with the extra dependencies:
python3 -m pip install hdp-toolchain[hxlquickimport]
# python3 -m pip install hdp-toolchain[full]
1.3 URN command line tools
Installation
The automated way to install is using the Python PyPI package hdp-toolchain; urnresolver is installed by default.
python3 -m pip install hdp-toolchain
1.3.1 urnresolver: convert Uniform Resource Names of datasets to real IRIs (URLs)
- Main issue: https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/13
- Source code: hxlm/core/bin/urnresolver.py
The urnresolver is a proof of concept of a URN resolver (see Uniform Resource Name (URN) on Wikipedia).
Examples (note: early working draft!)
# Basic usage: based on local and (to be implemented) remote listing pages,
# it translates one readable URN to one or more datasets
urnresolver urn:data:xz:hxl:standard:core:hashtag
# https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/pub?gid=319251406&single=true&output=csv
# Now, the more practical example: using to translate to other commands:
hxlselect "$(urnresolver urn:data:xz:hxl:standard:core:hashtag)" --query '#valid_vocab=+v_pcode'
# Hashtag,Hashtag one-liner,Hashtag long description,Release status,Data type restriction,First release,Default taxonomy,Category,Sample HXL,Sample description
# #valid_tag,#description+short+en,#description+long+en,#status,#valid_datatype,#meta+release,#valid_vocab+default,#meta+category,#meta+example+hxl,#meta+example+description+en
# #adm1,Level 1 subnational area,Top-level subnational administrative area (e.g. a governorate in Syria).,Released,,1.0,+v_pcode,1.1. Places,#adm1 +code,administrative level 1 P-code
# #adm2,Level 2 subnational area,Second-level subnational administrative area (e.g. a subdivision in Bangladesh).,Released,,1.0,+v_pcode,1.1. Places,#adm2 +name,administrative level 2 name
# #adm3,Level 3 subnational area,Third-level subnational administrative area (e.g. a subdistrict in Afghanistan).,Released,,1.0,+v_pcode,1.1. Places,#adm3 +code,administrative level 3 P-code
# #adm4,Level 4 subnational area,Fourth-level subnational administrative area (e.g. a barangay in the Philippines).,Released,,1.0,+v_pcode,1.1. Places,#adm4 +name,administrative level 4 name
# #adm5,Level 5 subnational area,Fifth-level subnational administrative area (e.g. a ward of a city).,Released,,1.0,+v_pcode,1.1. Places,#adm5 +code,administrative level 5 name
hxlselect "$(urnresolver urn:data:xz:hxlcplp:fod:lang)" --query '#vocab+id+v_iso6393_3letter=por'
# Id,Part2B,Part2T,Part1,Scope,Language_Type,Ref_Name,Comment
# #vocab+id+v_iso6393_3letter,#vocab+code+v_iso3692_3letter+z_bibliographic,#vocab+code+v_3692_3letter+z_terminology,#vocab+code+v_6391,#status,#vocab+type,#vocab+name,#description+comment+i_en
# por,por,por,pt,I,L,Portuguese,
1.4 HDP: HDP Declarative Programming (early draft)
- [Big Picture] The main GitHub issue:
- https://en.wikipedia.org/wiki/Non-English-based_programming_languages#International_programming_languages
- Note: most of the logic that matters in HDP is likely to be in Knowledge Graphs (YAML files that expand in memory).
- See hxlm/ontologia/
  - In particular ontologia/core.vkg.yml
Installation
The automated way to install is using the Python PyPI package hdp-toolchain. All the relevant parts, including the bare minimal ontologia, are part of the default installation.
python3 -m pip install hdp-toolchain
1.4.1 HDP conventions (The YAML/JSON file structure)
1.4.2 hdpcli (command line interface)
1.4.3 HXLm.HDP (Python library subpackage) usage
- GitHub Gist
- Google Colab (Jupyter Notebook)
- File
- Folder: HXL-CPLP-Publico/Datasets/EticaAI-Data/EticaAI-Data_HXL-Data-Science-file-formats/HDP-playbooks
1.5 HXLTM: HXL Trānslātiōnem Memoriam
The Humanitarian Exchange Language Trānslātiōnem Memoriam (abbreviation: "HXLTM") is an HXLated, valid HXL tabular format by HXL-CPLP to store community-contributed translations and glossaries. The hxltmcli public domain Python CLI tool allows reuse by others interested in exporting HXLTM files to common formats used by professional translators.
1.5.1 Common hxltm FAQ
1.5.1.1 hxltmcli installation
hxltmcli uses Python 3. While it is possible to just copy the hxltmcli file and install its dependencies manually, like HXLStandard/libhxl-python, you can install it with the hdp-toolchain:
# hxltmcli is installed with the hdp-toolchain, no extras required.
# @see https://pypi.org/project/hdp-toolchain/
pip install hdp-toolchain
hxltmcli --help
1.5.1.2 Save entire Translations Memory on Excel files
1.5.1.3 Example data
HXLTM-Exemplum: generic test files
- Input files: tests/hxltm/
- Output files: tests/hxltm/resultatum/

Production files:
- HXL-CPLP-Vocab_Auxilium-Humanitarium-API: Hapi project
  - GitHub:
  - Live Spreadsheet: https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=470146486
  - Note: the project may eventually use other sources of data (and this link here may eventually become outdated)
1.5.1.4 Advanced filter with HXL cli tools
Since an HXLTM (before export) is a valid HXL file, advanced selecting is possible by using hxlcut input.hxl.csv --exclude (...) | hxltmcli output.hxl.csv instead of hxltmcli input.hxl.csv output.hxl.csv.
# libhxl already is installed with hdp-toolchain
hxlselect --help
# Filter rows in a HXL dataset. (...)
hxlcut --help
# Cut columns from a HXL dataset.
## Examples with HXL TM (used before pass data to hxltmcli)
hxlcut --exclude item+i_la+i_lat+is_Latn --sheet 6 HXL-CPLP-Vocab_Auxilium-Humanitarium-API.xlsx | hxltmcli
# Excludes Latin before passing to hxltmcli, from a Microsoft Excel file
hxlcut --exclude item+i_la+i_lat+is_Latn https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1292720422 | hxltmcli
# Excludes Latin before passing to hxltmcli, from Google Sheets
1.5.1.5 Advanced filter with HXL-Proxy (integration with Google Sheets and CSV/XLSX/etc available on the web)
Especially if you are contributing to tools for HXL, testing this tool, or helping in production (e.g. real-time disaster response), please consider using the public HXL-Proxy at https://proxy.hxlstandard.org/. Most advanced features of the libhxl CLI tools are available via the HXL-Proxy.
Note about heavy usage: use cache. Both https://hapi.etica.ai/ and https://github.com/HXL-CPLP/Auxilium-Humanitarium-API (and some links used in this documentation) may use the HXL-Proxy with the default 1-hour cache disabled. This is necessary because the HXL-Proxy is used to build static content based on the latest translations.
If you are not only testing but deploying in production, it is good practice not to disable the HXL-Proxy cache (it is the default option if you are not copy-pasting the HXL-CPLP/Auxilium-Humanitarium-API internal build script links).
Also, even if you do not use the HXL-Proxy (but use hxltm directly on your own Google Spreadsheets), if you keep making too many calls in a short time, Google Docs may eventually raise 400 errors, since hxltm requests are not authenticated. Our recommendations in this case are:
- Download the entire Spreadsheet as an .xlsx file and process the .xlsx file locally.
- Download individual sheets as CSV files and save them locally (this consumes less CPU than processing .xlsx).
1.5.2 TMX: Translation Memory eXchange v1.4b
- Wikipedia: https://en.wikipedia.org/wiki/Translation_Memory_eXchange
- Specification:
- TMX 1.4b DTD
- Relevant GitHub issues:
## The next 2 examples are equivalent: both print the result to stdout
hxltmcli hxltm-exemplum-linguam.tm.hxl.csv --TMX
# (will print the TMX result of the input HXLTM file)
cat hxltm-exemplum-linguam.tm.hxl.csv | hxltmcli --TMX
# (will print the TMX result of the input HXLTM file)
hxltmcli hxltm-exemplum-linguam.tm.hxl.csv resultatum/hxltm-exemplum-linguam.tmx --TMX
# (Instead of printing to stdout, save the contents to a TMX file)
1.5.3 XLIFF: XML Localization Interchange File Format v2.1
- Wikipedia: https://en.wikipedia.org/wiki/XLIFF
- Specification:
- Relevant GitHub issues:
- Extra links
- Okapi about XLIFF: https://okapiframework.org/wiki/index.php/XLIFF
## The next 2 examples are equivalent: both print the result to stdout
hxltmcli hxltm-exemplum-linguam.tm.hxl.csv --XLIFF
# (will print the XLIFF result of the input HXLTM file)
cat hxltm-exemplum-linguam.tm.hxl.csv | hxltmcli --XLIFF
# (will print the XLIFF result of the input HXLTM file)
hxltmcli hxltm-exemplum-linguam.tm.hxl.csv resultatum/hxltm-exemplum-linguam.xlf --XLIFF
# (Instead of printing to stdout, save the contents to an XLIFF file)
Extras: VSCode XLIFF extension. Check also this VSCode extension: https://marketplace.visualstudio.com/items?itemName=rvanbekkum.xliff-sync. While we have not checked it yet, it seems to allow "merging" new translations from one XLIFF file into another.
1.5.3.1 HXLTM supported features of XLIFF
TODO: improve documentation of the XLIFF features HXLTM supports exporting to
1.5.4 Google Sheets
The hxltmcli supports reading directly from Google Sheets (no extra plugins required).
Read HXL TM data saved on Google Sheets
hxltmcli https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1292720422
# (will print out contents of Google Sheets, without exporting to other formats)
hxltmcli https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1292720422 | grep UN_codicem_anglicum_IOM_HTCDS_nomen
# UN_codicem_anglicum_IOM_HTCDS_nomen,,,,13,1,UN,UN,codicem_anglicum,IOM,HTCDS,,,nomen,,,,,,,,,,,,,,∅,∅,Padrão de Dados de Casos de Tráfico Humano,∅,Revisão de texto requerida,Human Trafficking Case Data Standard,∅,∅,,∅,∅,,∅,∅,,∅,∅,,∅,∅
hxltmcli https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1292720422 schemam-un-htcds.tm.hxl.csv
# (Instead of print to stdout, save the contents to a single CSV file)
Write HXL TM data on Google Sheets
Writing to Google Sheets is possible by using an external tool to import the CSV versions.
TODO: document some external CLI script that allows uploading CSV to Google Sheets.
1.5.5 Microsoft Excel
Read HXL TM data saved on Excel
The hxltmcli supports reading directly from Microsoft Excel (no extra plugins required).
# The HXL-CPLP-Vocab_Auxilium-Humanitarium-API.xlsx is a downloaded version of
# the Google Sheets entire groups of HXL TMs on 2021-06-29. New versions are
# likely to be a different number than --sheet 6
hxltmcli --sheet 6 HXL-CPLP-Vocab_Auxilium-Humanitarium-API.xlsx
# (will print out contents of --sheet 6, without exporting to other formats)
hxltmcli --sheet 6 HXL-CPLP-Vocab_Auxilium-Humanitarium-API.xlsx | grep UN_codicem_anglicum_IOM_HTCDS_nomen
# UN_codicem_anglicum_IOM_HTCDS_nomen,,,,13,1,UN,UN,codicem_anglicum,IOM,HTCDS,,,nomen,,,,,,,,,,,,,,∅,∅,Padrão de Dados de Casos de Tráfico Humano,∅,Revisão de texto requerida,Human Trafficking Case Data Standard,∅,∅,,∅,∅,,∅,∅,,∅,∅,,∅,∅
hxltmcli --sheet 6 HXL-CPLP-Vocab_Auxilium-Humanitarium-API.xlsx schemam-un-htcds.tm.hxl.csv
# (Instead of print to stdout, save the contents to a single CSV file)
Write HXL TM data on Microsoft Excel
Writing to Microsoft Excel is possible by using an external tool to import the CSV versions. Here is just one example, but you are free to use alternatives.
Example using unoconv. Tested with Ubuntu 20.04 LTS and LibreOffice 6.4.
# One recommended way to install unoconv is via operating system packages,
# not with pip.
sudo apt install unoconv
# Test data at EticaAI/HXL-Data-Science-file-formats/tests/hxltm/
unoconv --format xlsx hxltm-exemplum-linguam.tm.hxl.csv
# Note: in our tests, unoconv may have exporting bugs with unicode, see
# @see https://github.com/unoconv/unoconv/issues/271
1.5.6 CSV
1.5.6.1 CSV reference format, HXLated CSV (multilingual)
The default output of hxltmcli already is a valid HXLated CSV without data changes (with the notable exception of normalizing HXL hashtags, like converting #item +i_ar +i_arb +is_Arab to #item+i_ar+i_arb+is_arab).
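The normalization described above, stripping the spaces between attributes and lowercasing, can be sketched in one line. The helper name is hypothetical; hxltmcli does this via libhxl rather than string manipulation.

```python
# Sketch of the hashtag normalization the text describes:
# remove whitespace between attributes and lowercase the result.
def normalize_hashtag(hashtag: str) -> str:
    return "".join(hashtag.split()).lower()

print(normalize_hashtag("#item +i_ar +i_arb +is_Arab"))
# #item+i_ar+i_arb+is_arab
```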
## The next 2 examples are equivalent: will print to stdout the result
hxltmcli hxltm-exemplum-linguam.tm.hxl.csv
# (will print out contents of hxltm-exemplum-linguam.tm.hxl.csv)
cat hxltm-exemplum-linguam.tm.hxl.csv | hxltmcli
# (will print out contents of hxltm-exemplum-linguam.tm.hxl.csv)
hxltmcli hxltm-exemplum-linguam.tm.hxl.csv output-file.tm.hxl.csv
# (Instead of print to stdout, save the contents to a single CSV file)
PROTIP: you can chain several commands; ideally hxltmcli should be the last command (to export) or the first command (to import from something that is not already HXL). For advanced processing, see HXLTM-libhxl-cli-tools.
1.5.6.2 CSV source + target format (bilingual)
TODO: document minimal usage
1.5.7 UTX
- https://aamt.info/english/utx/
- Specification: https://aamt.info/wp-content/uploads/2019/06/utx1.20-specification-e.pdf
TODO: maybe implement exporting to UTX (it is not more complex than what is already done with CSV)
1.5.8 PO, TBX, SRX
hxltmcli does not import or export PO files directly. The Okapi Framework can be used to export from XLIFF created by hxltmcli.
hxltmcli does not import or export TBX and SRX files directly. It is not clear whether it is possible to use any external tool to import/export from the already supported formats (like TMX and XLIFF) created by hxltmcli without implementing this feature directly in hxltmcli.
TODO: we could consider supporting TBX (see https://en.wikipedia.org/wiki/TermBase_eXchange) since IATE seems to export glossaries on this format. See also https://termcoord.eu/iate/download-iate-tbx/.
1.5.9 Notable alternatives to HXL TM and hxltmcli to manage Translation Memories
1.5.9.1 Okapi Framework
TODO: this is a draft. Improve it.
2. Reasons behind
2.1 Why?
HXL is already used in production, especially in humanitarian areas (see The Humanitarian Data Exchange). With a one-line change it is possible to make most already-used spreadsheet-like data machine readable, without needing to disturb end users as other alternatives do. One notable implementation (data visualization) powered by HXL is HXLDash (see this HXLDash example video).
The idea of this project is to find strategies to use already HXLated datasets directly in open source desktop tools like Orange Data Mining and WEKA, "The workbench for machine learning", with the minimum extra explanation on how to convert already existing HXL datasets, AND to provide tools that solve known issues likely to be found along the way.
2.2 How?
NOTE: it is already possible to use HXLated CSVs in these tools! For those either learning HXL or using it in production for humanitarian purposes, the HXL-Proxy (https://proxy.hxlstandard.org/) with "Strip text headers" can serve live-updated CSV-like files. Other usages can still rely on the HXL CLI tools or run unocha/hxl-proxy with Docker on your machine or a private server.
One way to implement this is to create minimum usable conversion tools that export already HXLated datasets, with additional hints, to the file formats those applications use by default.
In practice this goes beyond simple file conversion (like XLSX to CSV), since it includes both the "variable type" AND the "intent to use (in data mining)". This is why this project also has the taxonomy/vocabulary reference tables (and these actually are more important than the implementation itself!). Without this extra step, HXLated datasets work like an average CSV (good, but just not great).
But yes, some of these target formats, in particular Weka's (at least compared to Orange's), are stricter about the tabular format they accept, and this can be infuriating EVEN for those who would actually know how to debug these issues! This issue, at least, is more automatable.
Note: one practical reason to use HXLated files as a base instead of plain CSV or XLSX (beyond obviously being available in a humanitarian context) is that the grammar of HXL +attributes is flexible enough to export to several different formats, with freedom to choose other aspects of the tagging.
2.3 Non-goals
- Software implementations of file formats not typically used by easy-to-use desktop applications are a non-goal.
  - Yet, since they are part of the HXL +attributes conversion tables, some of these proposed implementations may already be drafted. These reference tables are released under public domain licenses.
  - Note that humans who already use these formats are likely to have the skill to manually convert from CSVs (and so could convert from HXL).
- The software implementation (at least at the start) will not optimize for speed or low local disk usage,
  - but it should work for converting large datasets with reasonably low memory usage.
- The software implementations assume an already HXLated input dataset, to keep it simple.
  - Note that it is possible to quickly convert already well-formatted CSVs to HXL by changing the header line (the first line of the CSV).
- While it is technically possible to import back (reconstruct the original HXLated file) from exported files, 100% compatibility here is a non-goal.
  - This applies in special cases of .arff exports: the default export may need to clean known issues with exported strings.
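The "just change the header line" shortcut mentioned in the non-goals above amounts to inserting a row of HXL hashtags after the human-readable header. A minimal sketch with the standard library (the hashtags chosen here are illustrative):

```python
# Sketch: make a plain CSV "HXLated" by inserting a row of HXL hashtags
# right after the human-readable header. Hashtags below are examples.
import csv
import io

def hxlate(csv_text: str, hashtags: list) -> str:
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(rows[0])       # keep the human header
    writer.writerow(hashtags)      # add the HXL hashtag row
    writer.writerows(rows[1:])     # data rows are unchanged
    return out.getvalue()

plain = "Province,Population\nCabinda,716076\n"
print(hxlate(plain, ["#adm1+name", "#population"]))
```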
HXLated datasets to test
Production data on The Humanitarian Data Exchange ("HDX")
- Generic search query: https://data.humdata.org/search?vocab_Topics=hxl
- HXL data on HDX
The Humanitarian Data Exchange ("HDX") contains public datasets, and part of them are already HXLated and ready to test.
PROTIP: on https://proxy.hxlstandard.org/data/source, under "Option 2: choose from the cloud", there is also an "HDX" icon that can be used. This can be helpful if you are just looking around several datasets.
Files from EticaAI-Data_HXL-Data-Science-file-formats
- tests/files
- tests/manual-tests.sh
- Google Drive Folder: https://drive.google.com/drive/u/1/folders/1qyTPaDgm7Ca-62blkdQjUox47WWKRwD3
Both the Google Drive folder and this repository have some test files. The not-so-documented manual tests may also give a quick idea of how it works.
Additional Guides
Note: these additional guides are not part of the main focus of this project
Command line tools for CSV
NOTE: often people who work with HXL simply use the HXL-Proxy, including to convert from non-HXLated sources.
Here is a quick overview of different command line tools that are worth at least a mention, in particular if you are dealing with raw formats that are not yet HXLated.
Alternatives to preview spreadsheets with over 1.000.000 rows
90% of the time, 1,000,000 rows are likely to be enough even if you are dealing with data science projects. So there is no need to use command line tools or more complex solutions, like importing into a database or paying for enterprise solutions. This guide is for when you need to go over these limits without changing your tools too much.
License
The EticaAI has dedicated the work to the public domain by waiving all of their rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.