HDP Declarative Programming toolchain (working draft)
Project description

[Proof of concept] Common file formats used for Data Science, exported from HXL (The Humanitarian Exchange Language).

- 1. The main focus
  - 1.1 Vocabulary, Taxonomies and URNs
  - 1.2 HXL2 command line tools
  - 1.3 URN command line tools
  - 1.4 HDP Declarative Programming (early draft)
  - 1.5 HXLTM: HXL Trānslātiōnem Memoriam
- 2. Reasons behind
- HXLated datasets to test
- Additional Guides

In addition to this GitHub repository, check also the EticaAI-Data_HXL-Data-Science-file-formats Google Drive folder.
1. The main focus
1.1 Vocabulary, Taxonomies and URNs
1.1.1 Vocabulary & Taxonomies on HXL
This project either uses explicit HXL +attributes (easy to implement, but more verbose) or makes inferences on well-known HXLated datasets used in humanitarian areas. To make this work, the main reference is not a software implementation, but reference tables.
1.1.2 Uniform Resource Name on URN:DATA
- Extra content: urn-data-specification/ (warning: it's complicated)
Why use URN to identify resources: it is more than a naming convention

Finding good URN conventions for the typical datasets used in a humanitarian context is more complex than the ISO URN or even the LEX URN (the latter already used in Brazil). One goal of the urnresolver is to accept that most shared data is VERY sensitive and private; this is actually the challenge. So in addition to converting some well-known public datasets related to HXL, we are already designing it to eventually be used as an abstraction for scripts and tools that would otherwise need access to the real datasets.
By using URNs, in the worst case we are creating documentation and scripts that a new user would need to replace with the real ones for their use case. But in the ideal case it allows exchanging scripts; when an issue happens in a new region, the people who prepare the data could do so and then also publish on a private URN listing so others could reuse it.
Note that even if the URN resolver has links to resources (and not just a contact page), the links to download the real data could still require authentication case by case. Also, the same URN, if you manage to be in contact with several peers, is likely to resolve to more than one usable option, especially for datasets that are not already a COD but are often needed.
Deeper integration with CKAN instances and/or awareness of encrypted data is not yet implemented in the current version (v0.7.3).
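The core idea described above, one URN mapping to one or more candidate URLs that each organization maintains locally, can be sketched in a few lines. This is a conceptual illustration only: the `URN_INDEX` mapping and `resolve` helper below are hypothetical, not the real urnresolver internals.

```python
# Minimal sketch of the URN -> URL idea behind urnresolver.
# URN_INDEX is hypothetical; a real deployment would load it from a
# (possibly private) listing maintained by each organization.
URN_INDEX = {
    "urn:data:xz:hxl:standard:core:hashtag": [
        "https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/pub?gid=319251406&single=true&output=csv",
    ],
}

def resolve(urn: str) -> list:
    """Return candidate URLs for a data URN (empty list if unknown)."""
    if not urn.startswith("urn:data:"):
        raise ValueError("not a urn:data URN: " + urn)
    # The same URN may resolve to several mirrors/peers; return all of them.
    return URN_INDEX.get(urn, [])

print(resolve("urn:data:xz:hxl:standard:core:hashtag"))
```

Because the URN (not the URL) is what appears in shared scripts, swapping the private listing is enough to point the same script at a different region's data.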
Security (and privacy) considerations (for URN:DATA)
Since the main goal of URNs is also to help with auditing and sharing of scripts, and even with referencing the "best acceptable use" of exchanged data (with special focus on private/sensitive data), while the URN:DATA identifiers themselves are meant NOT to be a secret and could be published on official documents, the local implementations (i.e. how to resolve/redirect these URNs to real data) need to take into account that a "perfect optimization" is often contradictory (think "secure from misuse" vs "protect privacy from legitimate use").
TODO: add more context
Disclaimer (for URN:DATA)
Note: while this project, in addition to CLI tools to convert URNs into something usable ("the implementation"), also drafts the logic of how to construct potentially useful URNs reusable at an international level (i.e. what may seem like a drafted standard, think ISO, or a Best Current Practice, think IETF), please do not take EticaAI/HXL-Data-Science-file-formats... as endorsed by any organization.
Also, authors from @EticaAI / @HXL-CPLP (both past and future ones who cooperate directly with this project) explicitly release both the software and the drafted 'how to implement' under public domain-like licenses. Under ideal circumstances a data global namespace (the ZZ on urn:data:ZZ:example) may have more specific rules.
1.1.3 Ontologia
See ontologia/
"In computer science and information science, an ontology encompasses a representation, formal naming and definition of the categories, properties and relations between the concepts, data and entities that substantiate one, many, or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject." -- [Wikipedia: Ontology (information science)](https://en.wikipedia.org/wiki/Ontology_(information_science))
The contents of ontologia/ include both some selected datasets and (while not 100% converted) the main parts of what the command line tools and libraries released by this repository use.
Why: abstract away complexity for users AND allow reuse by other projects
When feasible, even if it makes the initial implementation harder or a bit less efficient than dedicated "advanced" strategies with state-of-the-art tools, the internal parts of hxlm.core that deal with ontology will be stored in this folder.
This strategy is likely to make it easier for non-developers to update internals, like individuals interested in adding new languages or proposing corrections.
Distribution channels
For production usage, these files are available via:
- Installable with the Python PyPI package hdp-toolchain
- The GitHub repository https://github.com/EticaAI/HXL-Data-Science-file-formats
- Public "CDN": GitHub hosted + CloudFlare cached endpoint at https://hdp.etica.ai/ontologia/
1.2 HXL2 command line tools
- See folder bin/
- See discussions at
- See (not so documented) tests: tests/manual-tests.sh
1.2.1 hxl2example: create your own exporter/importer
- Source code: bin/hxl2example
The hxl2example is an example Python script with generic functionality that allows you to create your own custom functions. Feel free to add your name, edit the license, etc.
What it does: hxl2example accepts one HXLated dataset and saves it as .CSV.
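The skeleton of such an exporter is small: find the HXL hashtag row, then stream rows out in the target format. The sketch below uses inline sample data and the standard library only; the real bin/hxl2example uses libhxl to do the HXL parsing.

```python
# Sketch of the pattern a custom hxl2example-style exporter follows:
# locate the HXL hashtag row, then write the data out in the target format.
# The sample data and column hashtags here are illustrative.
import csv
import io

SAMPLE = """Sepal length,Species
#item+eng_sepal_length,#item+class
5.1,setosa
4.9,setosa
"""

def export_csv(hxlated_text: str) -> str:
    rows = list(csv.reader(io.StringIO(hxlated_text)))
    # Row 1: human headers; row 2: HXL hashtags; the rest: data.
    hashtags, data = rows[1], rows[2:]
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(hashtags)   # keep hashtags as the machine-readable header
    writer.writerows(data)
    return out.getvalue()

print(export_csv(SAMPLE))
```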
Quick examples
### Basic examples
# This will output a local file to stdout (tip: you can disable local files)
hxl2example tests/files/iris_hxlated-csv.csv
# This will save to a local file
hxl2example tests/files/iris_hxlated-csv.csv my-local-file.example
# Since we use libhxl-python, remote HXLated URLs work too!
hxl2example https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/edit#gid=319251406
### Advanced usage (if you need to share work with others)
## Quick ad-hoc web proxy, local usage
# @see https://github.com/hugapi/hug
hug -f bin/hxl2example
# http://localhost:8000/ will show JSON documentation of the hug endpoints. TL;DR:
# http://localhost:8000/hxl2example.csv?source_url=http://example.com/remote-file.csv
## Expose local web proxy to others
# @see https://ngrok.com/
ngrok http 8000
1.2.2 hxl2tab: tab format, focused on compatibility with Orange Data Mining
- Main issue: https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/2
- Orange File Specification: https://orange-data-mining-library.readthedocs.io/en/latest/reference/data.io.html
- Source code: bin/hxl2tab
What it does: hxl2tab takes an already HXLated dataset and then, based on #hashtag+attributes, generates an Orange Data Mining .tab format with extra hints.
hxl2tab v2.0 has some usable functionality to use a web interface instead of the CLI to generate the file. It uses hug 🐨 🤗. If you want to quickly expose it outside localhost, try ngrok.
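The "extra hints" idea can be sketched as a mapping from HXL attributes to the header rows of Orange's .tab format (which uses a type row such as continuous/discrete/string and a flag row such as class/meta). The attribute names used below (+number, +vt_categorical, +vt_class) are illustrative, not necessarily the exact convention hxl2tab implements.

```python
# Sketch: map one HXL hashtag to Orange .tab (type, flag) header hints.
# Attribute names here are hypothetical examples.
def tab_hints(hashtag: str):
    """Return (type, flag) Orange .tab hints for one HXL hashtag."""
    attrs = hashtag.split("+")[1:]
    if "number" in attrs:
        ttype = "continuous"
    elif "vt_categorical" in attrs:
        ttype = "discrete"
    else:
        ttype = "string"
    # A hypothetical +vt_class attribute marks the prediction target.
    flag = "class" if "vt_class" in attrs else ""
    return ttype, flag

print(tab_hints("#item+number"))                   # ('continuous', '')
print(tab_hints("#item+vt_categorical+vt_class"))  # ('discrete', 'class')
```

This is why the reference tables matter more than the implementation: the mapping itself is the reusable part.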
Installation
This package can be installed either by copying bin/hxl2tab to a place on your executable path and installing its dependencies manually, or the automated way, as part of the Python PyPI package hdp-toolchain with the extra dependencies:
python3 -m pip install hdp-toolchain[hxl2tab]
# python3 -m pip install hdp-toolchain[full]
1.2.3 hxlquickmeta: output information about local/remote datasets (even ones not yet HXLated)
- Main issue: https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/6
- Source code: bin/hxlquickmeta
What it does: hxlquickmeta outputs information about a local or remote dataset. If the file already is HXLated, it will print even more information. v1.1.0 added support for giving an overview by default, equivalent to what users of Python Pandas would get.
Installation
This package can be installed either by copying bin/hxlquickmeta to a place on your executable path and installing its dependencies manually, or the automated way, as part of the Python PyPI package hdp-toolchain with the extra dependencies:
python3 -m pip install hdp-toolchain[hxlquickmeta]
# python3 -m pip install hdp-toolchain[full]
Quick examples
#### Inline result for a hashtag and (optional) value ________________________
hxlquickmeta --hxlquickmeta-hashtag="#adm2+code" --hxlquickmeta-value="BR3106200"
# > get_hashtag_info
# >> hashtag: #adm2+code
# >>> HXLMeta._parse_heading: #adm2+code
# >>> HXLMeta.is_hashtag_base_valid: None
# >>> libhxl_is_token None
# >> value: BR3106200
# >>> libhxl_is_empty False
# >>> libhxl_is_date False
# >>> libhxl_is_number False
# >>> libhxl_is_string True
# >>> libhxl_is_token None
# >>> libhxl_is_truthy False
# >>> libhxl_typeof string
#### Output information for a file, and (if any) HXLated information __________
# Local file
hxlquickmeta tests/files/iris_hxlated-csv.csv
# Remote file
hxlquickmeta https://docs.google.com/spreadsheets/u/1/d/1l7POf1WPfzgJb-ks4JM86akFSvaZOhAUWqafSJsm3Y4/edit#gid=634938833
1.2.4 hxlquickimport: (like hxltag)
- Main issue: https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/6
- Source code: bin/hxlquickimport
What it does: hxlquickimport is similar to hxltag (a CLI tool installed with libhxl); by default it mostly just slugifies whatever was in the old headers and adds it as an HXL attribute. Please consider using the HXL-Proxy for serious usage. This quick script is more for internal testing.
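The slugify step mentioned above is simple to illustrate: lowercase the old header, replace non-alphanumeric runs, and attach the result as an attribute. The base hashtag `#item` and the helper name below are hypothetical, not the exact choice hxlquickimport makes.

```python
# Sketch of the "slugify old header into an HXL attribute" step.
# The base hashtag (#item) is an illustrative placeholder.
import re

def slugify_header(header: str, base: str = "#item") -> str:
    slug = re.sub(r"[^a-z0-9]+", "_", header.strip().lower()).strip("_")
    return f"{base}+{slug}"

print(slugify_header("Sepal length (cm)"))  # #item+sepal_length_cm
```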
Installation
This package can be installed either by copying bin/hxlquickimport to a place on your executable path and installing its dependencies manually, or the automated way, as part of the Python PyPI package hdp-toolchain with the extra dependencies:
python3 -m pip install hdp-toolchain[hxlquickimport]
# python3 -m pip install hdp-toolchain[full]
1.3 URN command line tools
Installation
The automated way to install is using the Python PyPI package hdp-toolchain; urnresolver is installed by default.
python3 -m pip install hdp-toolchain
1.3.1 urnresolver: convert Uniform Resource Names of datasets to real IRIs (URLs)
- Main issue: https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/13
- Source code: hxlm/core/bin/urnresolver.py
The urnresolver is a proof of concept of a URN resolver (see Uniform Resource Name (URN) on Wikipedia).
Examples (note: early working draft!)
# Basic usage: based on local and (to be implemented) remote listing pages,
# it translates one readable URN to one or more datasets
urnresolver urn:data:xz:hxl:standard:core:hashtag
# https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/pub?gid=319251406&single=true&output=csv
# Now, the more practical example: using to translate to other commands:
hxlselect "$(urnresolver urn:data:xz:hxl:standard:core:hashtag)" --query '#valid_vocab=+v_pcode'
# Hashtag,Hashtag one-liner,Hashtag long description,Release status,Data type restriction,First release,Default taxonomy,Category,Sample HXL,Sample description
# #valid_tag,#description+short+en,#description+long+en,#status,#valid_datatype,#meta+release,#valid_vocab+default,#meta+category,#meta+example+hxl,#meta+example+description+en
# #adm1,Level 1 subnational area,Top-level subnational administrative area (e.g. a governorate in Syria).,Released,,1.0,+v_pcode,1.1. Places,#adm1 +code,administrative level 1 P-code
# #adm2,Level 2 subnational area,Second-level subnational administrative area (e.g. a subdivision in Bangladesh).,Released,,1.0,+v_pcode,1.1. Places,#adm2 +name,administrative level 2 name
# #adm3,Level 3 subnational area,Third-level subnational administrative area (e.g. a subdistrict in Afghanistan).,Released,,1.0,+v_pcode,1.1. Places,#adm3 +code,administrative level 3 P-code
# #adm4,Level 4 subnational area,Fourth-level subnational administrative area (e.g. a barangay in the Philippines).,Released,,1.0,+v_pcode,1.1. Places,#adm4 +name,administrative level 4 name
# #adm5,Level 5 subnational area,Fifth-level subnational administrative area (e.g. a ward of a city).,Released,,1.0,+v_pcode,1.1. Places,#adm5 +code,administrative level 5 name
hxlselect "$(urnresolver urn:data:xz:hxlcplp:fod:lang)" --query '#vocab+id+v_iso6393_3letter=por'
# Id,Part2B,Part2T,Part1,Scope,Language_Type,Ref_Name,Comment
# #vocab+id+v_iso6393_3letter,#vocab+code+v_iso3692_3letter+z_bibliographic,#vocab+code+v_3692_3letter+z_terminology,#vocab+code+v_6391,#status,#vocab+type,#vocab+name,#description+comment+i_en
# por,por,por,pt,I,L,Portuguese,
1.4 HDP: HDP Declarative Programming (early draft)
- [Big Picture] The main GitHub issue:
- https://en.wikipedia.org/wiki/Non-English-based_programming_languages#International_programming_languages
- Note: most of the logic that matters in HDP is likely to be in Knowledge Graphs (YAML files that expand in memory).
- See hxlm/ontologia/
  - In particular ontologia/core.vkg.yml
Installation
The automated way to install is using the Python PyPI package hdp-toolchain. All the relevant parts, including the bare minimal ontologia, are part of the default installation.
python3 -m pip install hdp-toolchain
1.4.1 HDP conventions (The YAML/JSON file structure)
1.4.2 hdpcli (command line interface)
1.4.3 HXLm.HDP (Python library subpackage) usage
- GitHub Gist
- Google Colab (Jupyter Notebook)
- File
- Folder: HXL-CPLP-Publico/Datasets/EticaAI-Data/EticaAI-Data_HXL-Data-Science-file-formats/HDP-playbooks
1.5 HXLTM: HXL Trānslātiōnem Memoriam
The Humanitarian Exchange Language Trānslātiōnem Memoriam (abbreviation: "HXLTM") is an HXLated, valid HXL tabular format by HXL-CPLP to store community-contributed translations and glossaries. The hxltmcli public domain Python CLI tool allows reuse by others interested in exporting HXLTM files to common formats used by professional translators.
1.5.1 Common hxltm FAQ
1.5.1.1 hxltmcli installation
hxltmcli uses Python 3. While it is possible to just copy the hxltmcli file and install its dependencies manually, like HXLStandard/libhxl-python, you can install it with the hdp-toolchain:
# hxltmcli is installed with the hdp-toolchain, no extras required.
# @see https://pypi.org/project/hdp-toolchain/
pip install hdp-toolchain
hxltmcli --help
1.5.1.2 Save entire Translations Memory on Excel files
1.5.1.3 Example data
HXLTM-Exemplum: generic test files
- Input files: tests/hxltm/
- Output files: tests/hxltm/resultatum/

Production files:
- HXL-CPLP-Vocab_Auxilium-Humanitarium-API: Hapi project
  - GitHub:
  - Live Spreadsheet: https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=470146486
  - Note: the project may eventually use other sources of data (and this link here may eventually become outdated)
1.5.1.4 Advanced filter with HXL cli tools
Since an HXLTM (before export) is a valid HXL file, advanced selecting is possible by using hxlcut input.hxl.csv --exclude (...) | hxltmcli output.hxl.csv instead of hxltmcli input.hxl.csv output.hxl.csv.
# libhxl already is installed with hdp-toolchain
hxlselect --help
# Filter rows in a HXL dataset. (...)
hxlcut --help
# Cut columns from a HXL dataset.
## Examples with HXL TM (used before pass data to hxltmcli)
hxlcut --exclude item+i_la+i_lat+is_Latn --sheet 6 HXL-CPLP-Vocab_Auxilium-Humanitarium-API.xlsx | hxltmcli
# Excludes Latin before passing to hxltmcli, from a Microsoft Excel file
hxlcut --exclude item+i_la+i_lat+is_Latn https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1292720422 | hxltmcli
# Excludes Latin before passing to hxltmcli, from Google Sheets
1.5.1.5 Advanced filter with HXL-Proxy (integration with Google Sheets and CSV/XLSX/etc available on the web)
Especially if you are contributing to tools for HXL, testing this tool, or helping in production (e.g. real-time disaster response), please consider using the public HXL-Proxy at https://proxy.hxlstandard.org/. Most advanced features of the libhxl CLI tools are available via the HXL-Proxy.
Note about heavy usage: use cache. Both https://hapi.etica.ai/ and https://github.com/HXL-CPLP/Auxilium-Humanitarium-API (and some links used in this documentation) may use the HXL-Proxy with the default 1-hour cache disabled. This is necessary because the HXL-Proxy is used to build static content based on the latest translations.
If you are not only testing but deploying in production, it is good practice not to disable the HXL-Proxy cache (it is the default option if you are not copy-pasting the HXL-CPLP/Auxilium-Humanitarium-API internal build script links).
Also, even if you do not use the HXL-Proxy (but use hxltm directly on your own Google Spreadsheets), if you keep making too many calls in a short time, Google Docs may eventually raise 400 errors, since hxltm requests are not authenticated. Our recommendations in this case are:
- Download the entire Spreadsheet as an .xlsx file and process the .xlsx file locally.
- Download individual sheets as CSV files and save them locally (this consumes less CPU than processing .xlsx).
1.5.2 TMX: Translation Memory eXchange v1.4b
- Wikipedia: https://en.wikipedia.org/wiki/Translation_Memory_eXchange
- Specification:
- TMX 1.4b DTD
- Relevant GitHub issues:
## The next 2 examples are equivalent: both print the result to stdout
hxltmcli hxltm-exemplum-linguam.tm.hxl.csv --TMX
# (will print the TMX result of the input HXLTM file)
cat hxltm-exemplum-linguam.tm.hxl.csv | hxltmcli --TMX
# (will print the TMX result of the input HXLTM file)
hxltmcli hxltm-exemplum-linguam.tm.hxl.csv resultatum/hxltm-exemplum-linguam.tmx --TMX
# (Instead of printing to stdout, save the contents to a TMX file)
1.5.3 XLIFF: XML Localization Interchange File Format v2.1
- Wikipedia: https://en.wikipedia.org/wiki/XLIFF
- Specification:
- Relevant GitHub issues:
- Extra links
- Okapi about XLIFF: https://okapiframework.org/wiki/index.php/XLIFF
## The next 2 examples are equivalent: both print the result to stdout
hxltmcli hxltm-exemplum-linguam.tm.hxl.csv --XLIFF
# (will print the XLIFF result of the input HXLTM file)
cat hxltm-exemplum-linguam.tm.hxl.csv | hxltmcli --XLIFF
# (will print the XLIFF result of the input HXLTM file)
hxltmcli hxltm-exemplum-linguam.tm.hxl.csv resultatum/hxltm-exemplum-linguam.xlf --XLIFF
# (Instead of printing to stdout, save the contents to an XLIFF file)
Extras: VSCode XLIFF extension. Check also this VSCode extension: https://marketplace.visualstudio.com/items?itemName=rvanbekkum.xliff-sync. While we have not checked it yet, it seems to allow "merging" new translations from one XLIFF file into another.
1.5.3.1 HXLTM supported features of XLIFF
TODO: improve documentation of the XLIFF features HXLTM supports exporting to
1.5.4 Google Sheets
The hxltmcli supports reading directly from Google Sheets (no extra plugins required).
Read HXL TM data saved on Google Sheets
hxltmcli https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1292720422
# (will print out contents of Google Sheets, without exporting to other formats)
hxltmcli https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1292720422 | grep UN_codicem_anglicum_IOM_HTCDS_nomen
# UN_codicem_anglicum_IOM_HTCDS_nomen,,,,13,1,UN,UN,codicem_anglicum,IOM,HTCDS,,,nomen,,,,,,,,,,,,,,∅,∅,Padrão de Dados de Casos de Tráfico Humano,∅,Revisão de texto requerida,Human Trafficking Case Data Standard,∅,∅,,∅,∅,,∅,∅,,∅,∅,,∅,∅
hxltmcli https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1292720422 schemam-un-htcds.tm.hxl.csv
# (Instead of print to stdout, save the contents to a single CSV file)
Write HXL TM data on Google Sheets
Writing to Google Sheets is possible by using an external tool to import the CSV versions.
TODO: document some external CLI script that allows uploading CSV to Google Sheets.
1.5.5 Microsoft Excel
Read HXL TM data saved on Excel
The hxltmcli supports reading directly from Microsoft Excel (no extra plugins required).
# The HXL-CPLP-Vocab_Auxilium-Humanitarium-API.xlsx is a downloaded version of
# the Google Sheets entire groups of HXL TMs on 2021-06-29. New versions are
# likely to be a different number than --sheet 6
hxltmcli --sheet 6 HXL-CPLP-Vocab_Auxilium-Humanitarium-API.xlsx
# (will print out contents of --sheet 6, without exporting to other formats)
hxltmcli --sheet 6 HXL-CPLP-Vocab_Auxilium-Humanitarium-API.xlsx | grep UN_codicem_anglicum_IOM_HTCDS_nomen
# UN_codicem_anglicum_IOM_HTCDS_nomen,,,,13,1,UN,UN,codicem_anglicum,IOM,HTCDS,,,nomen,,,,,,,,,,,,,,∅,∅,Padrão de Dados de Casos de Tráfico Humano,∅,Revisão de texto requerida,Human Trafficking Case Data Standard,∅,∅,,∅,∅,,∅,∅,,∅,∅,,∅,∅
hxltmcli --sheet 6 HXL-CPLP-Vocab_Auxilium-Humanitarium-API.xlsx schemam-un-htcds.tm.hxl.csv
# (Instead of print to stdout, save the contents to a single CSV file)
Write HXL TM data on Microsoft Excel
Writing to Microsoft Excel is possible by using an external tool to import the CSV versions. Here is just one example, but you are free to use alternatives.
Example using unoconv. Tested with Ubuntu 20.04 LTS and LibreOffice 6.4.
# One recommended way to install unoconv is via operating system packages,
# not with pip.
sudo apt install unoconv
# Test data at EticaAI/HXL-Data-Science-file-formats/tests/hxltm/
unoconv --format xlsx hxltm-exemplum-linguam.tm.hxl.csv
# Note: in our tests, unoconv may have exporting bugs with unicode, see
# @see https://github.com/unoconv/unoconv/issues/271
1.5.6 CSV
1.5.6.1 CSV reference format, HXLated CSV (multilingual)
The default output of hxltmcli already is a valid HXLated CSV without data changes (with the notable exception of normalizing HXL hashtags, like converting #item +i_ar +i_arb +is_Arab to #item+i_ar+i_arb+is_arab).
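The normalization described above, stripping the spaces between attributes and lowercasing, can be sketched in one line. The helper name is hypothetical; hxltmcli does this via libhxl rather than string manipulation.

```python
# Sketch of the hashtag normalization the text describes:
# remove whitespace between attributes and lowercase the result.
def normalize_hashtag(hashtag: str) -> str:
    return "".join(hashtag.split()).lower()

print(normalize_hashtag("#item +i_ar +i_arb +is_Arab"))
# #item+i_ar+i_arb+is_arab
```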
## The next 2 examples are equivalent: will print to stdout the result
hxltmcli hxltm-exemplum-linguam.tm.hxl.csv
# (will print out contents of hxltm-exemplum-linguam.tm.hxl.csv)
cat hxltm-exemplum-linguam.tm.hxl.csv | hxltmcli
# (will print out contents of hxltm-exemplum-linguam.tm.hxl.csv)
hxltmcli hxltm-exemplum-linguam.tm.hxl.csv output-file.tm.hxl.csv
# (Instead of print to stdout, save the contents to a single CSV file)
PROTIP: you can chain several commands; ideally hxltmcli should be the last command (to export) or the first command (to import from something that is not already HXL). For advanced processing, see HXLTM-libhxl-cli-tools.
1.5.6.2 CSV source + target format (bilingual)
TODO: document minimal usage
1.5.7 UTX
- https://aamt.info/english/utx/
- Specification: https://aamt.info/wp-content/uploads/2019/06/utx1.20-specification-e.pdf
TODO: maybe implement exporting to UTX (it is not more complex than what is already done with CSV)
1.5.8 PO, TBX, SRX
hxltmcli does not import or export PO files directly. The Okapi Framework can be used to export from XLIFF created by hxltmcli.
hxltmcli does not import or export TBX and SRX files directly. It is not clear whether it is possible to use any external tool to import/export from the already supported formats (like TMX and XLIFF) created by hxltmcli without implementing this feature directly in hxltmcli.
TODO: we could consider supporting TBX (see https://en.wikipedia.org/wiki/TermBase_eXchange) since IATE seems to export glossaries on this format. See also https://termcoord.eu/iate/download-iate-tbx/.
1.5.9 Notable alternatives to HXL TM and hxltmcli to manage Translation Memories
1.5.9.1 Okapi Framework
TODO: this is a draft. Improve it.
2. Reasons behind
2.1 Why?
HXL is already used in production, especially in humanitarian areas (see The Humanitarian Data Exchange). With a one-line change it is possible to make most already-used spreadsheet-like data machine readable, without needing to disturb end users as other alternatives do. One notable implementation (data visualization) powered by HXL is HXLDash (see this HXLDash example video).
The idea of this project is to find strategies to use already HXLated datasets directly in open source desktop tools like Orange Data Mining and WEKA, "The workbench for machine learning", with the minimum extra explanation on how to convert already existing HXL datasets, AND to provide tools that solve known issues likely to be found along the way.
2.2 How?
NOTE: it is already possible to use HXLated CSVs in these tools! For those either learning HXL or using it in production for humanitarian purposes, the HXL-Proxy (https://proxy.hxlstandard.org/) with "Strip text headers" can serve live-updated CSV-like files. Other usages can still rely on the HXL CLI tools or run unocha/hxl-proxy with Docker on your machine or a private server.
One way to implement this is to create minimum usable conversion tools that export already HXLated datasets, with additional hints, to the file formats those applications use by default.
In practice this goes beyond simple file conversion (like XLSX to CSV), since it includes both the "variable type" AND the "intent to use (in data mining)". This is why this project also has the taxonomy/vocabulary reference tables (and these actually are more important than the implementation itself!). Without this extra step, HXLated datasets work like an average CSV (good, but just not great).
But yes, some of these target formats, in particular Weka's (at least compared to Orange's), are stricter about the tabular format they accept, and this can be infuriating EVEN for those who would actually know how to debug these issues! This issue, at least, is more automatable.
Note: one practical reason to use HXLated files as a base instead of plain CSV or XLSX (beyond obviously being available in a humanitarian context) is that the grammar of HXL +attributes is flexible enough to export to several different formats, with freedom to choose other aspects of the tagging.
2.3 Non-goals
- Software implementations of file formats not typically used by easy-to-use desktop applications are a non-goal.
  - Yet, since they are part of the HXL +attributes conversion tables, some of these proposed implementations may already be drafted. These reference tables are released under public domain licenses.
  - Note that humans who already use these formats are likely to have the skill to manually convert from CSVs (and so could convert from HXL).
- The software implementation (at least at the start) will not optimize for speed or low local disk usage,
  - but it should work for converting large datasets with reasonably low memory usage.
- The software implementations assume an already HXLated input dataset, to keep it simple.
  - Note that it is possible to quickly convert already well-formatted CSVs to HXL by changing the header line (the first line of the CSV).
- While it is technically possible to import back (reconstruct the original HXLated file) from exported files, 100% compatibility here is a non-goal.
  - This applies in special cases of .arff exports: the default export may need to clean known issues with exported strings.
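The "just change the header line" shortcut mentioned in the non-goals above amounts to inserting a row of HXL hashtags after the human-readable header. A minimal sketch with the standard library (the hashtags chosen here are illustrative):

```python
# Sketch: make a plain CSV "HXLated" by inserting a row of HXL hashtags
# right after the human-readable header. Hashtags below are examples.
import csv
import io

def hxlate(csv_text: str, hashtags: list) -> str:
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(rows[0])       # keep the human header
    writer.writerow(hashtags)      # add the HXL hashtag row
    writer.writerows(rows[1:])     # data rows are unchanged
    return out.getvalue()

plain = "Province,Population\nCabinda,716076\n"
print(hxlate(plain, ["#adm1+name", "#population"]))
```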
HXLated datasets to test
Production data on The Humanitarian Data Exchange ("HDX")
- Generic search query: https://data.humdata.org/search?vocab_Topics=hxl
- HXL data on HDX
The Humanitarian Data Exchange ("HDX") contains public datasets, and part of them are already HXLated and ready to test.
PROTIP: on https://proxy.hxlstandard.org/data/source, under "Option 2: choose from the cloud", there is also an "HDX" icon that can be used. This can be helpful if you are just looking around several datasets.
Files from EticaAI-Data_HXL-Data-Science-file-formats
- tests/files
- tests/manual-tests.sh
- Google Drive Folder: https://drive.google.com/drive/u/1/folders/1qyTPaDgm7Ca-62blkdQjUox47WWKRwD3
Both the Google Drive folder and this repository have some test files. The not-so-documented manual tests may also give a quick idea of how it works.
Additional Guides
Note: these additional guides are not part of the main focus of this project
Command line tools for CSV
NOTE: often people who work with HXL simply use the HXL-Proxy, including to convert from non-HXLated sources.
Here is a quick overview of different command line tools that are worth at least a mention, in particular if you are dealing with raw formats that are not yet HXLated.
Alternatives to preview spreadsheets with over 1.000.000 rows
90% of the time, 1,000,000 rows are likely to be enough even if you are dealing with data science projects. So there is no need to use command line tools or more complex solutions, like importing into a database or paying for enterprise solutions. This guide is for when you need to go over these limits without changing your tools too much.
License
The EticaAI has dedicated the work to the public domain by waiving all of their rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.