Skip to main content

Export UNIHAN data of Chinese, Japanese, Korean to CSV, JSON or YAML

Project description

unihan-etl

Python Package Docs Build Status Code Coverage License

ETL tool for Unicode's Han Unification (UNIHAN) database releases. unihan-etl retrieves (downloads), extracts (unzips), and transforms the database from Unicode's website to a flat, tabular or structured, tree-like format.

unihan-etl can be used as a python library through its API, to retrieve data as a python object, or through the CLI to retrieve a CSV, JSON, or YAML file.

Part of the cihai project. Similar project: libUnihan.

UNIHAN Version compatibility (as of unihan-etl v0.10.0): 11.0.0 (released 2018-05-08, revision 25).

UNIHAN's data is dispersed across multiple files in the format of:

U+3400  kCantonese  jau1
U+3400  kDefinition (same as U+4E18 丘) hillock or mound
U+3400  kMandarin   qiū
U+3401  kCantonese  tim2
U+3401  kDefinition to lick; to taste, a mat, bamboo bark
U+3401  kHanyuPinyin    10019.020:tiàn
U+3401  kMandarin   tiàn

Values vary in shape and structure depending on their field type. kHanyuPinyin maps Unicode codepoints to Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn represents an entry. Complicating it further, more variations:

U+5EFE  kHanyuPinyin    10513.110,10514.010,10514.020:gǒng
U+5364  kHanyuPinyin    10093.130:xī,lǔ 74609.020:lǔ,xī

kHanyuPinyin supports multiple entries delimited by spaces. ":" (colon) separate locations in the work from pinyin readings. "," (comma) separate multiple entries/readings. This is just one of 90 fields contained in the database.

Tabular, "Flat" output

CSV (default), $ unihan-etl:

char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn

With $ unihan-etl -F yaml --no-expand:

- char: 
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401

With $ unihan-etl -F json --no-expand:

[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kCantonese": "jau1",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kCantonese": "tim2",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]

"Structured" output

Codepoints can pack a lot more detail, unihan-etl carefully extracts these values in a uniform manner. Empty values are pruned.

To make this possible, unihan-etl exports to JSON, YAML, and python list/dicts.

Why not CSV?

Unfortunately, CSV is only suitable for storing table-like information. File formats such as JSON and YAML accept key-values and hierarchical entries.

JSON, $ unihan-etl -F json:

[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": ["(same as U+4E18 丘) hillock or mound"],
    "kCantonese": ["jau1"],
    "kMandarin": {
      "zh-Hans": "qiū",
      "zh-Hant": "qiū"
    }
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
    "kCantonese": ["tim2"],
    "kHanyuPinyin": [
      {
        "locations": [
          {
            "volume": 1,
            "page": 19,
            "character": 2,
            "virtual": 0
          }
        ],
        "readings": ["tiàn"]
      }
    ],
    "kMandarin": {
      "zh-Hans": "tiàn",
      "zh-Hant": "tiàn"
    }
  }
]

YAML $ unihan-etl -F yaml:

- char: 
  kCantonese:
    - jau1
  kDefinition:
    - (same as U+4E18 丘) hillock or mound
  kMandarin:
    zh-Hans: qiū
    zh-Hant: qiū
  ucn: U+3400
- char: 
  kCantonese:
    - tim2
  kDefinition:
    - to lick
    - to taste, a mat, bamboo bark
  kHanyuPinyin:
    - locations:
        - character: 2
          page: 19
          virtual: 0
          volume: 1
      readings:
        - tiàn
  kMandarin:
    zh-Hans: tiàn
    zh-Hant: tiàn
  ucn: U+3401

Features

  • automatically downloads UNIHAN from the internet
  • strives for accuracy with the specifications described in UNIHAN's database design
  • export to JSON, CSV and YAML (requires pyyaml) via -F
  • configurable to export specific fields via -f
  • accounts for encoding conflicts due to the Unicode-heavy content
  • designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
  • core component and dependency of cihai, a CJK library
  • data package support
  • expansion of multi-value delimited fields in YAML, JSON and python dictionaries
  • supports >= 3.6 and pypy

If you encounter a problem or have a question, please create an issue.

Usage

unihan-etl offers customizable builds via its command line arguments.

See unihan-etl CLI arguments for information on how you can specify columns, files, download URL's, and output destination.

To download and build your own UNIHAN export:

$ pip install --user unihan-etl

To output CSV, the default format:

$ unihan-etl

To output JSON:

$ unihan-etl -F json

To output YAML:

$ pip install --user pyyaml
$ unihan-etl -F yaml

To only output the kDefinition field in a csv:

$ unihan-etl -f kDefinition

To output multiple fields, separate with spaces:

$ unihan-etl -f kCantonese kDefinition

To output to a custom file:

$ unihan-etl --destination ./exported.csv

To output to a custom file (templated file extension):

$ unihan-etl --destination ./exported.{ext}

See unihan-etl CLI arguments for advanced usage examples.

Code layout

# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/

# output dir
{XDG data dir}/unihan_etl/
  unihan.json
  unihan.csv
  unihan.yaml   # (requires pyyaml)

# package dir
unihan_etl/
  process.py    # argparse, download, extract, transform UNIHAN's data
  constants.py  # immutable data vars (field to filename mappings, etc)
  expansion.py  # extracting details baked inside of fields
  _compat.py    # python 2/3 compatibility module
  util.py       # utility / helper functions

# test suite
tests/*

Developing

poetry is a required package to develop.

git clone https://github.com/cihai/unihan-etl.git

cd unihan-etl

poetry install -E "docs test coverage lint format"

Makefile commands prefixed with watch_ will watch files and rerun.

Tests

poetry run py.test

Helpers: make test Rerun tests on file change: make watch_test (requires entr(1))

Documentation

Default preview server: http://localhost:8039

cd docs/ and make html to build. make serve to start http server.

Helpers: make build_docs, make serve_docs

Rebuild docs on file change: make watch_docs (requires entr(1))

Rebuild docs and run server via one terminal: make dev_docs (requires above, and a make(1) with -J support, e.g. GNU Make)

Formatting / Linting

The project uses black and isort (one after the other) and runs flake8 via CI. See the configuration in pyproject.toml and setup.cfg:

make black isort: Run black first, then isort to handle import nuances make flake8, to watch (requires entr(1)): make watch_flake8

Releasing

As of 0.11, poetry handles virtualenv creation, package requirements, versioning, building, and publishing. Therefore there is no setup.py or requirements files.

Update __version__ in __about__.py and pyproject.toml:

$ git commit -m 'build(unihan-etl): Tag v0.1.1'
$ git tag v0.1.1
$ git push
$ git push --tags
$ poetry build
$ poetry deploy

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

unihan-etl-0.14.0a0.tar.gz (58.3 kB view hashes)

Uploaded Source

Built Distribution

unihan_etl-0.14.0a0-py3-none-any.whl (17.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page