Export UNIHAN to Python, Data Package, CSV, JSON and YAML
Project description
unihan-etl - ETL tool for Unicode’s Han Unification (UNIHAN) database releases. unihan-etl retrieves (downloads), extracts (unzips), and transforms the database from Unicode’s website to a flat, tabular or structured, tree-like format.
unihan-etl can be used as a python library through its API, to retrieve data as a python object, or through the CLI to retrieve a CSV, JSON, or YAML file.
Part of the cihai project. Similar project: libUnihan.
UNIHAN Version compatibility (as of unihan-etl v0.10.0): 11.0.0 (released 2018-05-08, revision 25).
UNIHAN’s data is dispersed across multiple files in the format of:
U+3400 kCantonese jau1 U+3400 kDefinition (same as U+4E18 丘) hillock or mound U+3400 kMandarin qiū U+3401 kCantonese tim2 U+3401 kDefinition to lick; to taste, a mat, bamboo bark U+3401 kHanyuPinyin 10019.020:tiàn U+3401 kMandarin tiàn
Values vary in shape and structure depending on their field type. kHanyuPinyin maps Unicode codepoints to Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn represents an entry. Complicating it further, more variations:
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
kHanyuPinyin supports multiple entries delimited by spaces. “:” (colon) separate locations in the work from pinyin readings. “,” (comma) separate multiple entries/readings. This is just one of 90 fields contained in the database.
Tabular, “Flat” output
CSV (default), $ unihan-etl:
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin 㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū 㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
With $ unihan-etl -F yaml --no-expand:
- char: 㐀
kCantonese: jau1
kDefinition: (same as U+4E18 丘) hillock or mound
kHanyuPinyin: null
kMandarin: qiū
ucn: U+3400
- char: 㐁
kCantonese: tim2
kDefinition: to lick; to taste, a mat, bamboo bark
kHanyuPinyin: 10019.020:tiàn
kMandarin: tiàn
ucn: U+3401
With $ unihan-etl -F json --no-expand:
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": "(same as U+4E18 丘) hillock or mound",
"kCantonese": "jau1",
"kHanyuPinyin": null,
"kMandarin": "qiū"
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": "to lick; to taste, a mat, bamboo bark",
"kCantonese": "tim2",
"kHanyuPinyin": "10019.020:tiàn",
"kMandarin": "tiàn"
}
]
“Structured” output
Codepoints can pack a lot more detail, unihan-etl carefully extracts these values in a uniform manner. Empty values are pruned.
To make this possible, unihan-etl exports to JSON, YAML, and python list/dicts.
JSON, $ unihan-etl -F json:
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": [
"(same as U+4E18 丘) hillock or mound"
],
"kCantonese": [
"jau1"
],
"kMandarin": {
"zh-Hans": "qiū",
"zh-Hant": "qiū"
}
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": [
"to lick",
"to taste, a mat, bamboo bark"
],
"kCantonese": [
"tim2"
],
"kHanyuPinyin": [
{
"locations": [
{
"volume": 1,
"page": 19,
"character": 2,
"virtual": 0
}
],
"readings": [
"tiàn"
]
}
],
"kMandarin": {
"zh-Hans": "tiàn",
"zh-Hant": "tiàn"
}
}
]
YAML $ unihan-etl -F yaml:
- char: 㐀
kCantonese:
- jau1
kDefinition:
- (same as U+4E18 丘) hillock or mound
kMandarin:
zh-Hans: qiū
zh-Hant: qiū
ucn: U+3400
- char: 㐁
kCantonese:
- tim2
kDefinition:
- to lick
- to taste, a mat, bamboo bark
kHanyuPinyin:
- locations:
- character: 2
page: 19
virtual: 0
volume: 1
readings:
- tiàn
kMandarin:
zh-Hans: tiàn
zh-Hant: tiàn
ucn: U+3401
Features
automatically downloads UNIHAN from the internet
strives for accuracy with the specifications described in UNIHAN’s database design
export to JSON, CSV and YAML (requires pyyaml) via -F
configurable to export specific fields via -f
accounts for encoding conflicts due to the Unicode-heavy content
designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
core component and dependency of cihai, a CJK library
data package support
expansion of multi-value delimited fields in YAML, JSON and python dictionaries
supports >= 3.6 and pypy
If you encounter a problem or have a question, please create an issue.
Usage
unihan-etl offers customizable builds via its command line arguments.
See unihan-etl CLI arguments for information on how you can specify columns, files, download URL’s, and output destination.
To download and build your own UNIHAN export:
$ pip install --user unihan-etl
To output CSV, the default format:
$ unihan-etl
To output JSON:
$ unihan-etl -F json
To output YAML:
$ pip install --user pyyaml $ unihan-etl -F yaml
To only output the kDefinition field in a csv:
$ unihan-etl -f kDefinition
To output multiple fields, separate with spaces:
$ unihan-etl -f kCantonese kDefinition
To output to a custom file:
$ unihan-etl --destination ./exported.csv
To output to a custom file (templated file extension):
$ unihan-etl --destination ./exported.{ext}
See unihan-etl CLI arguments for advanced usage examples.
Code layout
# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/
# output dir
{XDG data dir}/unihan_etl/
unihan.json
unihan.csv
unihan.yaml # (requires pyyaml)
# package dir
unihan_etl/
process.py # argparse, download, extract, transform UNIHAN's data
constants.py # immutable data vars (field to filename mappings, etc)
expansion.py # extracting details baked inside of fields
_compat.py # python 2/3 compatibility module
util.py # utility / helper functions
# test suite
tests/*
Developing
poetry is a required package to develop.
git clone https://github.com/cihai/unihan-etl.git
cd unihan-etl
poetry install -E "docs test coverage lint format"
Makefile commands prefixed with watch_ will watch files and rerun.
Tests
poetry run py.test
Helpers: make test Rerun tests on file change: make watch_test (requires entr(1))
Documentation
Default preview server: http://localhost:8039
cd docs/ and make html to build. make serve to start http server.
Helpers: make build_docs, make serve_docs
Rebuild docs on file change: make watch_docs (requires entr(1))
Rebuild docs and run server via one terminal: make dev_docs (requires above, and a make(1) with -J support, e.g. GNU Make)
Formatting / Linting
The project uses black and isort (one after the other) and runs flake8 via CI. See the configuration in pyproject.toml and setup.cfg:
make black isort: Run black first, then isort to handle import nuances make flake8, to watch (requires entr(1)): make watch_flake8
Releasing
As of 0.11, poetry handles virtualenv creation, package requirements, versioning, building, and publishing. Therefore there is no setup.py or requirements files.
Update __version__ in __about__.py and pyproject.toml:
git commit -m 'build(unihan-etl): Tag v0.1.1' git tag v0.1.1 git push git push --tags poetry build poetry deploy
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for unihan_etl-0.12.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7ebe239f113ffe31631d3de0853ea15c5a2dd23da4610de7bc20cc4ad4ab1825 |
|
MD5 | 06389335d31567414ad2d064ae354950 |
|
BLAKE2b-256 | 6fc312c5bc12272cf439f8444637dc69f51ac6760fbf67ba9c65e538d30c3c99 |