Learning Using Texts - Chinese Parser

lute3-mandarin

A Mandarin parser for Lute (lute3) using the jieba library, and pypinyin for readings.

Installation

See the Lute manual.

Usage

When this parser is installed, you can add "Mandarin Chinese" as a language to Lute, which comes with a simple story.

Parsing exceptions

jieba sometimes groups too many characters into one token when parsing. For example, it returns "清华大学" as a single four-character word, which may not be the split you want.

You can specify how Lute should correct these cases by adding some simple "rules" to the file plugins/lute_mandarin/parser_exceptions.txt found in your Lute data directory. This file is automatically created when Lute starts. Each rule contains the characters of the word as parsed by jieba, with regular commas added where the word should be split.

Some examples:

File content: (empty file)
Result when parsing "清华大学": "清华大学" (unchanged, a single token)

File content:
清华,大学
Result: two tokens, "清华" and "大学" (the single token is split in two)

File content:
清,华,大,学
Result: four tokens, "清", "华", "大", "学"

File content (two rules, one per line):
清华,大学
大,学
Result: three tokens, "清华", "大", "学" (results are recursively broken down if rules are found)
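
The recursive behaviour described above can be sketched in a few lines of Python. This is only an illustration of the rule semantics, not the plugin's actual code: the names load_rules and apply_rules are hypothetical, and it assumes one rule per line, ASCII commas, and that blank lines are ignored.

```python
def load_rules(text):
    """Map each jieba token (the rule with commas removed) to its parts.

    Hypothetical helper: assumes one rule per line and ASCII commas.
    """
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no rule
        parts = [p for p in line.split(",") if p]
        rules["".join(parts)] = parts
    return rules


def apply_rules(token, rules):
    """Split a token by its rule, then re-check each part for further rules."""
    parts = rules.get(token)
    if parts is None or len(parts) < 2:
        return [token]  # no rule for this token: keep it whole
    result = []
    for part in parts:
        # Each part is shorter than the token, so the recursion terminates.
        result.extend(apply_rules(part, rules))
    return result


rules = load_rules("清华,大学\n大,学")
tokens = apply_rules("清华大学", rules)
```

With the two rules shown, tokens holds "清华", "大", "学", matching the last example: the first rule splits the jieba token, and the second rule is then applied to the resulting "大学".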

Download files

Source Distribution

lute3_mandarin-0.0.3.tar.gz (6.3 kB)

Built Distribution

lute3_mandarin-0.0.3-py3-none-any.whl (4.0 kB, Python 3)
