Learning Using Texts - Chinese Parser
lute3-mandarin
A Mandarin parser for Lute (`lute3`) using the `jieba` library, and `pypinyin` for readings.
Installation
See the Lute manual.
Usage
Once this parser is installed, you can add "Mandarin Chinese" as a language in Lute. The language comes with a simple story.
Parsing exceptions
Sometimes `jieba` groups too many characters together when parsing. For example, it returns "清华大学" as a single word of four characters, which might not be correct.
You can specify how Lute should correct these cases by adding some simple "rules" to the file `plugins/lute_mandarin/parser_exceptions.txt` found in your Lute `data` directory. This file is automatically created when Lute starts. Each rule contains the characters of the word as parsed by `jieba`, with regular commas added where the word should be split.
Some examples:
File content | Results when parsing "清华大学" |
---|---|
(empty file) | "清华大学" |
`清华,大学` | Two tokens, "清华" and "大学" (the single token is split in two) |
`清,华,大,学` | Four tokens, "清", "华", "大", "学" |
`清华,大学` and `大,学` (two rules) | Three tokens, "清华", "大", "学" (results are recursively broken down if rules are found) |
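The recursive behaviour described in the table can be sketched in a few lines of Python. This is only an illustration, not the plugin's actual code: the function names `load_rules` and `apply_rules` are hypothetical, and the sketch assumes the file holds one rule per line.

```python
def load_rules(text):
    """Build a lookup from rule text (assumed format: one rule per line).

    A rule such as "清华,大学" maps the joined word "清华大学"
    to its parts ["清华", "大学"].
    """
    rules = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",") if p.strip()]
        # A rule needs at least two parts to describe a split.
        if len(parts) > 1:
            rules["".join(parts)] = parts
    return rules


def apply_rules(token, rules):
    """Split a token using the rules, then split each part again if a rule matches it."""
    parts = rules.get(token)
    if parts is None:
        return [token]
    result = []
    for part in parts:
        # Each part is shorter than the token, so the recursion always ends.
        result.extend(apply_rules(part, rules))
    return result


rules = load_rules("清华,大学\n大,学")
print(apply_rules("清华大学", rules))  # ['清华', '大', '学']
```

With an empty rule set, `apply_rules` returns the token unchanged, which matches the "(empty file)" row.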
Built Distribution
Hashes for lute3_mandarin-0.0.3-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 5bfeaf7da14aa8cdab1a98f3a1ee4ba93ac6beb95a551d9098e157ebc51f050d |
MD5 | 81b9066e9c00b3e0297dfb5fc4a51bb5 |
BLAKE2b-256 | 08c37c0abea77e46f66b9168e61672da0e7679eaf4a072248b17df37bcca3c79 |