Export UNIHAN to Python, Data Package, CSV, JSON and YAML
*unihan-etl* - `ETL`_ tool for Unicode's Han Unification (`UNIHAN`_) database
releases. unihan-etl retrieves (downloads), extracts (unzips), and transforms the
database from Unicode's website to a flat, tabular or structured, tree-like
format.
unihan-etl can be used as a Python library through its `API`_ to retrieve data
as a Python object, or through the `CLI`_ to produce a CSV, JSON, or YAML file.
Part of the `cihai`_ project. Similar project: `libUnihan <http://libunihan.sourceforge.net/>`_.
UNIHAN Version compatibility (as of unihan-etl v0.10.0):
`11.0.0 <https://www.unicode.org/reports/tr38/tr38-25.html#History>`__
(released 2018-05-08, revision 25).
|pypi| |docs| |build-status| |coverage| |license|
`UNIHAN`_'s data is dispersed across multiple files in the format of::
U+3400 kCantonese jau1
U+3400 kDefinition (same as U+4E18 丘) hillock or mound
U+3400 kMandarin qiū
U+3401 kCantonese tim2
U+3401 kDefinition to lick; to taste, a mat, bamboo bark
U+3401 kHanyuPinyin 10019.020:tiàn
U+3401 kMandarin tiàn
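As a rough illustration of how lines in this shape can be grouped per codepoint, here is a minimal sketch. It assumes the raw UNIHAN files are tab-separated with ``#`` comment lines; ``load_records`` and ``ucn_to_char`` are hypothetical helper names, not part of unihan-etl's API.

```python
from collections import defaultdict

# A few raw lines in the UNIHAN layout: codepoint, field, value (tab-separated).
SAMPLE = (
    "# comment lines start with '#'\n"
    "U+3400\tkCantonese\tjau1\n"
    "U+3400\tkDefinition\t(same as U+4E18 丘) hillock or mound\n"
    "U+3400\tkMandarin\tqiū\n"
    "U+3401\tkCantonese\ttim2\n"
)


def load_records(text):
    """Group raw lines into one {field: value} dict per codepoint."""
    records = defaultdict(dict)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split at most twice so tabs inside the value (if any) survive.
        ucn, field, value = line.split("\t", 2)
        records[ucn][field] = value
    return dict(records)


def ucn_to_char(ucn):
    """Turn a 'U+3400' style codepoint into the character itself."""
    return chr(int(ucn[2:], 16))


records = load_records(SAMPLE)
```

Grouping by codepoint first is what makes the flat, one-row-per-character output below possible.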
Values vary in shape and structure depending on their field type.
`kHanyuPinyin <http://www.unicode.org/reports/tr38/#kHanyuPinyin>`_
maps Unicode codepoints to `Hànyǔ Dà Zìdiǎn <https://en.wikipedia.org/wiki/Hanyu_Da_Zidian>`_,
where ``10019.020:tiàn`` represents an entry. Other values are more
complicated still::
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng
U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
*kHanyuPinyin* supports multiple entries delimited by spaces. Within an
entry, ":" (colon) separates the locations in the work from the pinyin
readings, and "," (comma) separates multiple locations or multiple readings.
This is just one of 90 fields contained in the database.
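The delimiter rules above can be sketched in a few lines of Python. This is an illustration only, not unihan-etl's own parser: ``parse_location`` and ``parse_hanyu_pinyin`` are hypothetical names, and the split of a location into volume, page, character and virtual position is inferred from the structured output shown later in this document.

```python
def parse_location(token):
    """Split a location such as '10019.020' into its parts.

    Assumed layout: 1 digit volume, 4 digits page, then after the dot
    2 digits character position and 1 digit "virtual" flag.
    """
    volume_page, position = token.split(".")
    return {
        "volume": int(volume_page[0]),
        "page": int(volume_page[1:]),
        "character": int(position[:2]),
        "virtual": int(position[2]),
    }


def parse_hanyu_pinyin(value):
    """Expand a raw kHanyuPinyin value into a list of entries."""
    entries = []
    for entry in value.split():  # entries are space-delimited
        locations, readings = entry.split(":")  # locations : readings
        entries.append(
            {
                "locations": [parse_location(t) for t in locations.split(",")],
                "readings": readings.split(","),
            }
        )
    return entries
```

For example, ``parse_hanyu_pinyin("10093.130:xī,lǔ 74609.020:lǔ,xī")`` yields two entries, each with one location and two readings.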
.. _API: https://unihan-etl.git-pull.com/en/latest/api.html
.. _CLI: https://unihan-etl.git-pull.com/en/latest/cli.html
Tabular, "Flat" output
----------------------
CSV (default), ``$ unihan-etl``::
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
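A flat export like this can be read back with Python's standard ``csv`` module. The sketch below parses the sample rows from an in-memory string; ``read_flat_export`` is a hypothetical helper, and mapping empty cells to ``None`` is a choice made here to mirror the ``null`` values in the flat YAML and JSON output.

```python
import csv
import io

# The CSV sample from above, embedded so the example is self-contained.
SAMPLE_CSV = (
    "char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin\n"
    "㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū\n"
    '㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn\n'
)


def read_flat_export(handle):
    """Return one dict per row, with empty cells converted to None."""
    return [
        {key: (value or None) for key, value in row.items()}
        for row in csv.DictReader(handle)
    ]


rows = read_flat_export(io.StringIO(SAMPLE_CSV))
```

For a real export, open the file with ``encoding="utf-8"`` and ``newline=""`` and pass the handle instead of the ``StringIO`` object.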
With ``$ unihan-etl -F yaml --no-expand``:

.. code-block:: yaml

    - char: 㐀
      kCantonese: jau1
      kDefinition: (same as U+4E18 丘) hillock or mound
      kHanyuPinyin: null
      kMandarin: qiū
      ucn: U+3400
    - char: 㐁
      kCantonese: tim2
      kDefinition: to lick; to taste, a mat, bamboo bark
      kHanyuPinyin: 10019.020:tiàn
      kMandarin: tiàn
      ucn: U+3401

With ``$ unihan-etl -F json --no-expand``:
.. code-block:: json
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": "(same as U+4E18 丘) hillock or mound",
"kCantonese": "jau1",
"kHanyuPinyin": null,
"kMandarin": "qiū"
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": "to lick; to taste, a mat, bamboo bark",
"kCantonese": "tim2",
"kHanyuPinyin": "10019.020:tiàn",
"kMandarin": "tiàn"
}
]
"Structured" output
-------------------
Codepoints can pack a lot more detail than a flat table shows. unihan-etl
extracts these values in a uniform manner and prunes empty ones.
To make this possible, unihan-etl exports to JSON, YAML, and Python
lists/dicts.
.. admonition:: Why not CSV?
Unfortunately, CSV is only suitable for storing table-like
information. File formats such as JSON and YAML accept key-values and
hierarchical entries.
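To make the difference between flat and structured records concrete, here is a minimal sketch of the idea for three fields. It is not unihan-etl's implementation: the function names are hypothetical, and the rules (split *kDefinition* on ``;``, split *kCantonese* on whitespace, map one or two *kMandarin* readings to ``zh-Hans``/``zh-Hant``) are assumptions chosen to reproduce the sample output in this section.

```python
def expand_definition(value):
    """Split a definition string into its ';'-separated senses."""
    return [part.strip() for part in value.split(";")]


def expand_mandarin(value):
    """Assumed rule: a single reading serves both locales; with two,
    the first is zh-Hans and the second zh-Hant."""
    readings = value.split()
    return {"zh-Hans": readings[0], "zh-Hant": readings[-1]}


# Hypothetical per-field dispatch table; unlisted fields pass through.
EXPANDERS = {
    "kDefinition": expand_definition,
    "kCantonese": str.split,
    "kMandarin": expand_mandarin,
}


def expand_record(flat):
    """Expand known fields and prune empty values from a flat record."""
    expanded = {}
    for key, value in flat.items():
        if value in (None, ""):
            continue  # empty values are pruned
        expanded[key] = EXPANDERS.get(key, lambda v: v)(value)
    return expanded
```

Applied to the flat record for U+3400, this drops the empty *kHanyuPinyin* key and turns the remaining fields into lists and dictionaries, which is exactly the kind of nesting CSV cannot hold.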
JSON, ``$ unihan-etl -F json``:
.. code-block:: json
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": [
"(same as U+4E18 丘) hillock or mound"
],
"kCantonese": [
"jau1"
],
"kMandarin": {
"zh-Hans": "qiū",
"zh-Hant": "qiū"
}
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": [
"to lick",
"to taste, a mat, bamboo bark"
],
"kCantonese": [
"tim2"
],
"kHanyuPinyin": [
{
"locations": [
{
"volume": 1,
"page": 19,
"character": 2,
"virtual": 0
}
],
"readings": [
"tiàn"
]
}
],
"kMandarin": {
"zh-Hans": "tiàn",
"zh-Hant": "tiàn"
}
}
]
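Once exported, the structured JSON can be consumed with the standard ``json`` module. The sketch below embeds a trimmed copy of the sample above and walks the nested *kHanyuPinyin* entries; ``format_location`` and ``describe`` are hypothetical helpers written for this example.

```python
import json

# Trimmed copy of the structured sample; a real export would be read from disk.
SAMPLE_JSON = """
[
  {"char": "㐀", "ucn": "U+3400",
   "kMandarin": {"zh-Hans": "qiū", "zh-Hant": "qiū"}},
  {"char": "㐁", "ucn": "U+3401",
   "kHanyuPinyin": [
     {"locations": [{"volume": 1, "page": 19, "character": 2, "virtual": 0}],
      "readings": ["tiàn"]}
   ],
   "kMandarin": {"zh-Hans": "tiàn", "zh-Hant": "tiàn"}}
]
"""


def format_location(location):
    """Render one dictionary location as human-readable text."""
    return "vol. {volume}, p. {page}, char. {character}".format(**location)


def describe(record):
    """List each kHanyuPinyin entry of a record; empty if the field was pruned."""
    lines = []
    for entry in record.get("kHanyuPinyin", []):
        places = "; ".join(format_location(loc) for loc in entry["locations"])
        lines.append(
            "{} {}: {}".format(record["char"], "/".join(entry["readings"]), places)
        )
    return lines


records = json.loads(SAMPLE_JSON)
```

Because empty values are pruned, code reading the structured output should use ``record.get(...)`` rather than assume every field is present.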
YAML, ``$ unihan-etl -F yaml``:

.. code-block:: yaml

    - char: 㐀
      kCantonese:
      - jau1
      kDefinition:
      - (same as U+4E18 丘) hillock or mound
      kMandarin:
        zh-Hans: qiū
        zh-Hant: qiū
      ucn: U+3400
    - char: 㐁
      kCantonese:
      - tim2
      kDefinition:
      - to lick
      - to taste, a mat, bamboo bark
      kHanyuPinyin:
      - locations:
        - character: 2
          page: 19
          virtual: 0
          volume: 1
        readings:
        - tiàn
      kMandarin:
        zh-Hans: tiàn
        zh-Hant: tiàn
      ucn: U+3401

Features
--------
* automatically downloads UNIHAN from the internet
* strives for accuracy with the specifications described in `UNIHAN's database
design <http://www.unicode.org/reports/tr38/>`_
* export to JSON, CSV and YAML (requires `pyyaml`_) via ``-F``
* configurable to export specific fields via ``-f``
* accounts for encoding conflicts due to the Unicode-heavy content
* designed as a technical proof for future CJK (Chinese, Japanese,
Korean) datasets
* core component and dependency of `cihai`_, a CJK library
* `data package`_ support
* expansion of multi-value delimited fields in YAML, JSON and python
dictionaries
* supports Python 2.7, >= 3.5 and PyPy
If you encounter a problem or have a question, please `create an
issue`_.
.. _cihai: https://cihai.git-pull.com
.. _cihai-handbook: https://github.com/cihai/cihai-handbook
.. _cihai team: https://github.com/cihai?tab=members
.. _cihai-python: https://github.com/cihai/cihai-python
Usage
-----
``unihan-etl`` offers customizable builds via its command line arguments.
See `unihan-etl CLI arguments`_ for information on how you can specify
columns, files, download URLs, and output destination.
Install unihan-etl to download and build your own UNIHAN export:
.. code-block:: bash
$ pip install --user unihan-etl
To output CSV, the default format:
.. code-block:: bash
$ unihan-etl
To output JSON::
$ unihan-etl -F json
To output YAML::
$ pip install --user pyyaml
$ unihan-etl -F yaml
To output only the kDefinition field in a CSV::
$ unihan-etl -f kDefinition
To output multiple fields, separate with spaces::
$ unihan-etl -f kCantonese kDefinition
To output to a custom file::
$ unihan-etl --destination ./exported.csv
To output to a custom file (templated file extension)::
$ unihan-etl --destination ./exported.{ext}
See `unihan-etl CLI arguments`_ for advanced usage examples.
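If you want to drive these commands from a script, one option is to assemble the argument list in Python and hand it to ``subprocess``. The sketch below uses only the flags shown in this README (``-F``, ``-f``, ``--destination``, ``--no-expand``); ``build_command`` and ``run_export`` are hypothetical helpers, not part of unihan-etl.

```python
import subprocess


def build_command(fmt=None, fields=(), destination=None, expand=True):
    """Assemble a unihan-etl command line from keyword options."""
    command = ["unihan-etl"]
    if fmt:
        command += ["-F", fmt]  # output format: csv, json or yaml
    if fields:
        command += ["-f", *fields]  # space-separated field names
    if destination:
        command += ["--destination", destination]
    if not expand:
        command.append("--no-expand")  # keep values flat
    return command


def run_export(**options):
    """Run the export; requires unihan-etl to be installed and on PATH."""
    return subprocess.run(build_command(**options), check=True)
```

For instance, ``build_command(fmt="json", fields=["kCantonese", "kDefinition"])`` produces the same arguments as typing ``unihan-etl -F json -f kCantonese kDefinition``.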
.. _unihan-etl CLI arguments: https://unihan-etl.git-pull.com/en/latest/cli.html
Code layout
-----------
.. code-block:: bash

    # cache dir (Unihan.zip is downloaded, contents extracted)
    {XDG cache dir}/unihan_etl/

    # output dir
    {XDG data dir}/unihan_etl/
      unihan.json
      unihan.csv
      unihan.yaml   # (requires pyyaml)

    # package dir
    unihan_etl/
      process.py    # argparse, download, extract, transform UNIHAN's data
      constants.py  # immutable data vars (field to filename mappings, etc)
      expansion.py  # extracting details baked inside of fields
      _compat.py    # python 2/3 compatibility module
      util.py       # utility / helper functions

    # test suite
    tests/*

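To see where the cache and output directories would land on a given machine, the following sketch resolves them using the defaults from the XDG Base Directory specification. It is an approximation: unihan-etl may resolve these paths differently (for example on non-Linux platforms), and ``xdg_dir``, ``cache_dir`` and ``data_dir`` are hypothetical names.

```python
import os


def xdg_dir(variable, fallback, environ=None):
    """Resolve an XDG base directory and append the unihan_etl folder.

    `variable` is the environment variable to consult; `fallback` is the
    path (relative to the home directory) used when it is unset or empty.
    """
    environ = os.environ if environ is None else environ
    base = environ.get(variable) or os.path.join(
        os.path.expanduser("~"), *fallback
    )
    return os.path.join(base, "unihan_etl")


def cache_dir(environ=None):
    """Where Unihan.zip would be downloaded and extracted."""
    return xdg_dir("XDG_CACHE_HOME", (".cache",), environ)


def data_dir(environ=None):
    """Where unihan.csv, unihan.json and unihan.yaml would be written."""
    return xdg_dir("XDG_DATA_HOME", (".local", "share"), environ)
```

Passing a plain dictionary as ``environ`` makes the helpers easy to check without touching the real environment.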
.. _UNIHAN: http://www.unicode.org/charts/unihan.html
.. _ETL: https://en.wikipedia.org/wiki/Extract,_transform,_load
.. _create an issue: https://github.com/cihai/unihan-etl/issues/new
.. _Data Package: http://frictionlessdata.io/data-packages/
.. _pyyaml: http://pyyaml.org/
.. |pypi| image:: https://img.shields.io/pypi/v/unihan-etl.svg
:alt: Python Package
:target: http://badge.fury.io/py/unihan-etl
.. |build-status| image:: https://img.shields.io/travis/cihai/unihan-etl.svg
:alt: Build Status
:target: https://travis-ci.org/cihai/unihan-etl
.. |coverage| image:: https://codecov.io/gh/cihai/unihan-etl/branch/master/graph/badge.svg
:alt: Code Coverage
:target: https://codecov.io/gh/cihai/unihan-etl
.. |license| image:: https://img.shields.io/github/license/cihai/unihan-etl.svg
:alt: License
.. |docs| image:: https://readthedocs.org/projects/unihan-etl/badge/?version=latest
:alt: Documentation Status
:scale: 100%
:target: https://readthedocs.org/projects/unihan-etl/