Export UNIHAN to Python, Data Package, CSV, JSON and YAML
Project description
unihan-etl - ETL tool UNIHAN. Retrieve, extract and transform UNIHAN into tabular or structured format. Load into python objects, JSON, CSV, and YAML. Part of the cihai project. See also: libUnihan.
UNIHAN’s data is dispersed across multiple files in the format of:
U+3400 kCantonese jau1 U+3400 kDefinition (same as U+4E18 丘) hillock or mound U+3400 kMandarin qiū U+3401 kCantonese tim2 U+3401 kDefinition to lick; to taste, a mat, bamboo bark U+3401 kHanyuPinyin 10019.020:tiàn U+3401 kMandarin tiàn
Field types contain additional information to extract. For example, kHanyuPinyin, which maps Unicode codepoints to Hànyǔ Dà Zìdiǎn, 10019.020:tiàn represents a minimal case. More:
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
The kHanyuPinyin field supports multiple entries, delimited by spaces. Within an entry, a “:” (colon) separates locations in the work and pinyin readings. Within these split values, a “,” (comma) can separate multiple values. This is just one of 90 fields contained in the database.
Tabular, “Flat” output
CSV (default), $ unihan-etl:
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin 㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū 㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
With $ unihan-etl -F yaml --no-expand:
- char: 㐀
kCantonese: jau1
kDefinition: (same as U+4E18 丘) hillock or mound
kHanyuPinyin: null
kMandarin: qiū
ucn: U+3400
- char: 㐁
kCantonese: tim2
kDefinition: to lick; to taste, a mat, bamboo bark
kHanyuPinyin: 10019.020:tiàn
kMandarin: tiàn
ucn: U+3401
With $ unihan-etl -F json --no-expand:
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": "(same as U+4E18 丘) hillock or mound",
"kCantonese": "jau1",
"kHanyuPinyin": null,
"kMandarin": "qiū"
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": "to lick; to taste, a mat, bamboo bark",
"kCantonese": "tim2",
"kHanyuPinyin": "10019.020:tiàn",
"kMandarin": "tiàn"
}
]
“Structured” output
UNIHAN database’s documentation specifies how multiple values (lists) and structured information (hashes/dicts) are packed into fields. unihan-etl carefully handles these fields in a uniform output. Support only available on JSON, YAML and python output.
JSON, $ unihan-etl -F json:
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": [
"(same as U+4E18 丘) hillock or mound"
],
"kCantonese": [
"jau1"
],
"kMandarin": {
"zh-Hans": "qiū",
"zh-Hant": "qiū"
}
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": [
"to lick",
"to taste, a mat, bamboo bark"
],
"kCantonese": [
"tim2"
],
"kHanyuPinyin": [
{
"locations": [
{
"volume": 1,
"page": 19,
"character": 2,
"virtual": 0
}
],
"readings": [
"tiàn"
]
}
],
"kMandarin": {
"zh-Hans": "tiàn",
"zh-Hant": "tiàn"
}
}
]
YAML $ unihan-etl -F yaml:
- char: 㐀
kCantonese:
- jau1
kDefinition:
- (same as U+4E18 丘) hillock or mound
kMandarin:
zh-Hans: qiū
zh-Hant: qiū
ucn: U+3400
- char: 㐁
kCantonese:
- tim2
kDefinition:
- to lick
- to taste, a mat, bamboo bark
kHanyuPinyin:
- locations:
- character: 2
page: 19
virtual: 0
volume: 1
readings:
- tiàn
kMandarin:
zh-Hans: tiàn
zh-Hant: tiàn
ucn: U+3401
Features
automatically downloads UNIHAN from the internet
strives for accuracy with the specifications described in UNIHAN’s database design
export to JSON, CSV and YAML (requires pyyaml) via -F
configurable to export specific fields via -f
accounts for encoding conflicts due to the Unicode-heavy content
designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
core component and dependency of cihai, a CJK library
data package support
expansion of multi-value delimited fields in YAML, JSON and python dictionaries
supports python 2.7, >= 3.5 and pypy
If you encounter a problem or have a question, please create an issue.
Usage
unihan-etl supports command line arguments. See unihan-etl CLI arguments for information on how you can specify custom columns, files, download URL’s and output destinations.
To download and build your own UNIHAN export:
$ pip install unihan-etl
To output CSV, the default format:
$ unihan-etl
To output JSON:
$ unihan-etl -F json
To output YAML:
$ pip install pyyaml $ unihan-etl -F yaml
To only output the kDefinition field in a csv:
$ unihan-etl -f kDefinition
To output multiple fields, separate with spaces:
$ unihan-etl -f kCantonese kDefinition
To output to a custom file:
$ unihan-etl --destination ./exported.csv
To output to a custom file (templated file extension):
$ unihan-etl --destination ./exported.{ext}
See unihan-etl CLI arguments for advanced usage examples.
Structure
# output w/ JSON
{XDG data dir}/unihan_etl/unihan.json
# output w/ CSV
{XDG data dir}/unihan_etl/unihan.csv
# output w/ yaml (requires pyyaml)
{XDG data dir}/unihan_etl/unihan.yaml
# script to download + build a SDF csv of unihan.
unihan_etl/process.py
# unit tests to verify behavior / consistency of builder
tests/*
# python 2/3 compatibility module
unihan_etl/_compat.py
# utility / helper functions
unihan_etl/util.py
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.