Export UNIHAN data of Chinese, Japanese, Korean to CSV, JSON or YAML
unihan-etl

An ETL tool for the Unicode Han Unification (UNIHAN) database releases. unihan-etl is designed to fetch (download), unpack (unzip), and convert the database from the Unicode website into either a flattened, tabular format or a structured, hierarchical format.
unihan-etl serves dual purposes: as a Python library offering an API for accessing data as Python objects, and as a command-line interface (CLI) for exporting data into CSV, JSON, or YAML formats.
This tool is a component of the cihai suite of CJK-related projects. For a similar tool, see libUnihan.
As of v0.31.0, unihan-etl is compatible with UNIHAN Version 15.1.0 (released on 2023-09-01, revision 35).
The UNIHAN database
The UNIHAN database organizes data across multiple files, exemplified below:
U+3400 kCantonese jau1
U+3400 kDefinition (same as U+4E18 丘) hillock or mound
U+3400 kMandarin qiū
U+3401 kCantonese tim2
U+3401 kDefinition to lick; to taste, a mat, bamboo bark
U+3401 kHanyuPinyin 10019.020:tiàn
U+3401 kMandarin tiàn
Values vary in shape and structure depending on their field type. kHanyuPinyin maps Unicode codepoints to Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn represents an entry. Other values are more complex:
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng
U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
kHanyuPinyin supports multiple entries delimited by spaces. A colon (":") separates locations in the work from pinyin readings, and a comma (",") separates multiple locations or readings. This is just one of 90 fields contained in the database.
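The delimiter rules above can be sketched in a few lines of Python. This is an illustrative parser, not unihan-etl's own implementation: the function names are hypothetical, and the location layout (one volume digit, four page digits, a dot, two character digits, one virtual digit) is an assumption inferred from how 10019.020 appears in the structured output further down.

```python
import re

# Assumed layout of a location such as "10019.020":
# volume (1 digit), page (4 digits), ".", character (2 digits), virtual (1 digit).
LOCATION_RE = re.compile(
    r"(?P<volume>\d)(?P<page>\d{4})\.(?P<character>\d{2})(?P<virtual>\d)"
)


def parse_location(location):
    """Turn one location string into a dict of integers."""
    match = LOCATION_RE.fullmatch(location)
    if match is None:
        raise ValueError(f"unexpected location format: {location!r}")
    return {name: int(digits) for name, digits in match.groupdict().items()}


def parse_hanyu_pinyin(value):
    """Split a raw kHanyuPinyin value into entries of locations and readings."""
    entries = []
    for entry in value.split():  # entries are space-delimited
        locations, readings = entry.split(":")  # colon: locations vs. readings
        entries.append(
            {
                "locations": [parse_location(loc) for loc in locations.split(",")],
                "readings": readings.split(","),
            }
        )
    return entries


if __name__ == "__main__":
    print(parse_hanyu_pinyin("10093.130:xī,lǔ 74609.020:lǔ,xī"))
```

Each space-delimited entry becomes one dict, so the U+5364 value yields two entries with two readings each.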
Tabular, "Flat" output
CSV (default)
$ unihan-etl
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
To preview in the CLI, try tabview or csvlens.
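For a programmatic look at the flat CSV export, the standard library's csv module is enough. The sketch below parses the sample rows shown above from an in-memory string; read_flat_export is a hypothetical helper name, not part of unihan-etl.

```python
import csv
import io

# Sample rows copied from the CSV output above.
SAMPLE_CSV = """\
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
"""


def read_flat_export(text):
    """Return one dict per character, keyed by the CSV header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_flat_export(SAMPLE_CSV)
by_char = {row["char"]: row for row in rows}

if __name__ == "__main__":
    print(by_char["㐁"]["kDefinition"])
```

When reading an exported file instead of a string, opening it with newline="" and encoding="utf-8" follows the csv module's recommendations. Note that missing values arrive as empty strings in CSV, not as null.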
JSON
$ unihan-etl -F json --no-expand
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kCantonese": "jau1",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kCantonese": "tim2",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]
Tools:
YAML
$ unihan-etl -F yaml --no-expand
- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401
Filter via the CLI with yq.
"Structured" output
Codepoints can pack a lot more detail. unihan-etl extracts these values in a uniform manner and prunes empty ones.
To make this possible, unihan-etl exports to JSON, YAML, and Python lists/dicts.
Why not CSV?
Unfortunately, CSV is only suitable for storing table-like information. File formats such as JSON and YAML accept key-values and hierarchical entries.
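To illustrate why expanded values need a hierarchical format, here is a minimal sketch that turns two flat values into the nested shapes shown in the structured examples. The helper names are hypothetical and this is not unihan-etl's expansion code. The kMandarin rule (a single reading serves both zh-Hans and zh-Hant; with two readings, the first is zh-Hans and the second zh-Hant) is an assumption based on the UNIHAN field description.

```python
def expand_definition(value):
    """Split a flat kDefinition string on semicolons into a list of glosses."""
    return [part.strip() for part in value.split(";")]


def expand_mandarin(value):
    """Map space-delimited kMandarin readings onto zh-Hans / zh-Hant keys.

    Assumption: one reading covers both locales; with two readings,
    the first is zh-Hans and the second is zh-Hant.
    """
    readings = value.split()
    return {"zh-Hans": readings[0], "zh-Hant": readings[-1]}


flat = {
    "char": "㐁",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kMandarin": "tiàn",
}

structured = {
    **flat,
    "kDefinition": expand_definition(flat["kDefinition"]),
    "kMandarin": expand_mandarin(flat["kMandarin"]),
}

if __name__ == "__main__":
    print(structured)
```

The resulting list and dict values have no natural single-cell representation, which is the limitation of CSV described above.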
JSON
$ unihan-etl -F json
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": ["(same as U+4E18 丘) hillock or mound"],
    "kCantonese": ["jau1"],
    "kMandarin": {
      "zh-Hans": "qiū",
      "zh-Hant": "qiū"
    }
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
    "kCantonese": ["tim2"],
    "kHanyuPinyin": [
      {
        "locations": [
          {
            "volume": 1,
            "page": 19,
            "character": 2,
            "virtual": 0
          }
        ],
        "readings": ["tiàn"]
      }
    ],
    "kMandarin": {
      "zh-Hans": "tiàn",
      "zh-Hant": "tiàn"
    }
  }
]
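Because empty values are pruned from structured output, code that reads it should not assume every key is present. The sketch below loads a trimmed copy of the JSON above and walks the nested kHanyuPinyin entries; hanyu_pages is a hypothetical helper, not part of unihan-etl.

```python
import json

# Trimmed copy of the structured JSON above; 㐀 has no kHanyuPinyin key at all.
SAMPLE_JSON = """
[
  {"char": "㐀", "ucn": "U+3400", "kMandarin": {"zh-Hans": "qiū", "zh-Hant": "qiū"}},
  {"char": "㐁", "ucn": "U+3401",
   "kHanyuPinyin": [
     {"locations": [{"volume": 1, "page": 19, "character": 2, "virtual": 0}],
      "readings": ["tiàn"]}
   ],
   "kMandarin": {"zh-Hans": "tiàn", "zh-Hant": "tiàn"}}
]
"""

records = json.loads(SAMPLE_JSON)


def hanyu_pages(record):
    """Yield (volume, page) pairs, tolerating a pruned kHanyuPinyin field."""
    for entry in record.get("kHanyuPinyin", []):
        for location in entry["locations"]:
            yield location["volume"], location["page"]


if __name__ == "__main__":
    for record in records:
        print(record["char"], list(hanyu_pages(record)))
```

Using record.get with a default keeps the loop safe for characters whose field was pruned.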
YAML
$ unihan-etl -F yaml
- char: 㐀
  kCantonese:
    - jau1
  kDefinition:
    - (same as U+4E18 丘) hillock or mound
  kMandarin:
    zh-Hans: qiū
    zh-Hant: qiū
  ucn: U+3400
- char: 㐁
  kCantonese:
    - tim2
  kDefinition:
    - to lick
    - to taste, a mat, bamboo bark
  kHanyuPinyin:
    - locations:
        - character: 2
          page: 19
          virtual: 0
          volume: 1
      readings:
        - tiàn
  kMandarin:
    zh-Hans: tiàn
    zh-Hant: tiàn
  ucn: U+3401
Features
- automatically downloads UNIHAN from the internet
- strives for accuracy with the specifications described in UNIHAN's database design
- export to JSON, CSV and YAML (requires pyyaml) via -F
- configurable to export specific fields via -f
- accounts for encoding conflicts due to the Unicode-heavy content
- designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
- core component and dependency of cihai, a CJK library
- data package support
- expansion of multi-value delimited fields in YAML, JSON and python dictionaries
- supports Python >= 3.7 and PyPy
If you encounter a problem or have a question, please create an issue.
Installation
To download and build your own UNIHAN export:
Using uv to add the CLI to your project:
$ uv add unihan-etl
Using pip:
$ pip install --user unihan-etl
Run the tool without a persistent install via uvx:
$ uvx unihan-etl
Or install it with pipx:
$ pipx install unihan-etl
Developmental releases
Using uv, opt-in to pre-release versions:
$ uv add --prerelease=allow unihan-etl
To pin a specific pre-release (for example 0.27.0a1):
$ uv add --prerelease=allow 'unihan-etl==0.27.0a1'
pip:
$ pip install --user --upgrade --pre unihan-etl
pipx:
$ pipx install --suffix=@next 'unihan-etl' --pip-args '\--pre' --force
Then run unihan-etl@next.
Run pre-release builds without installing with uvx:
$ uvx --prerelease=allow unihan-etl
Or pinned to that example version:
$ uvx --from 'unihan-etl==0.27.0a1' unihan-etl
Swap 0.27.0a1 for whichever pre-release you plan to use.
Usage
unihan-etl offers customizable builds via its command line arguments.
See unihan-etl CLI arguments for information on how you can specify columns, files, download URLs, and output destination.
To output CSV, the default format:
$ unihan-etl
To output JSON:
$ unihan-etl -F json
To output YAML:
Add PyYAML with uv:
$ uv add pyyaml
Or install it with pip:
$ pip install --user pyyaml
Then run:
$ unihan-etl -F yaml
To output only the kDefinition field in a CSV:
$ unihan-etl -f kDefinition
To output multiple fields, separate with spaces:
$ unihan-etl -f kCantonese kDefinition
To output to a custom file:
$ unihan-etl --destination ./exported.csv
To output to a custom file (templated file extension):
$ unihan-etl --destination ./exported.{ext}
See unihan-etl CLI arguments for advanced usage examples.
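The templated destination shown above can be pictured with ordinary Python string formatting. This sketch assumes the {ext} placeholder is filled with the name of the chosen output format; resolve_destination is a hypothetical helper and may not match how unihan-etl resolves the path internally.

```python
def resolve_destination(template, fmt):
    """Fill an {ext} placeholder with the output format name (assumed behavior)."""
    return template.format(ext=fmt)


if __name__ == "__main__":
    for fmt in ("csv", "json", "yaml"):
        print(resolve_destination("./exported.{ext}", fmt))
```

A destination without the placeholder passes through unchanged, so a fixed file name such as ./exported.csv is unaffected.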
Code layout
# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/
# output dir
{XDG data dir}/unihan_etl/
  unihan.json
  unihan.csv
  unihan.yaml    # (requires pyyaml)
# package dir
unihan_etl/
  core.py        # argparse, download, extract, transform UNIHAN's data
  options.py     # configuration object
  constants.py   # immutable data vars (field to filename mappings, etc)
  expansion.py   # extracting details baked inside of fields
  types.py       # type annotations
  util.py        # utility / helper functions
# test suite
tests/*
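To locate the cache and output directories from the layout above on a Linux-style system, the XDG Base Directory defaults can be resolved by hand. This is a sketch under that assumption only: xdg_dir is a hypothetical helper, and unihan-etl itself may rely on a platform-directories library whose results differ on macOS and Windows.

```python
import os
from pathlib import Path


def xdg_dir(env_var, fallback, env=None):
    """Resolve an XDG base directory and append the unihan_etl subfolder.

    env_var:  e.g. "XDG_CACHE_HOME" or "XDG_DATA_HOME"
    fallback: path relative to the home directory, used when the variable
              is unset or empty (the XDG spec's default behavior)
    """
    env = os.environ if env is None else env
    base = env.get(env_var) or str(Path.home() / fallback)
    return Path(base) / "unihan_etl"


if __name__ == "__main__":
    print("cache dir: ", xdg_dir("XDG_CACHE_HOME", ".cache"))
    print("output dir:", xdg_dir("XDG_DATA_HOME", ".local/share"))
```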
API
The package is Python under the hood, so you can use its full API. Example:
>>> from unihan_etl.core import Packager
>>> pkgr = Packager()
>>> hasattr(pkgr.options, 'destination')
True
Developing
$ git clone https://github.com/cihai/unihan-etl.git
$ cd unihan-etl
Bootstrap your environment and learn more about contributing. We use the same conventions / tools across all cihai projects: pytest, sphinx, mypy, ruff, tmuxp, and file watcher helpers (e.g. entr(1)).