Learning Using Texts - Chinese Parser
lute3-mandarin
A Mandarin parser for Lute (lute3) using the jieba library, and pypinyin for readings.
Installation
See the Lute manual.
Usage
When this parser is installed, you can add "Mandarin Chinese" as a language to Lute, which comes with a simple story.
Parsing exceptions
Sometimes jieba groups too many characters together when parsing. For example, it returns "清华大学" as a single word of four characters, which might not be correct.
You can specify how Lute should correct these cases by adding some simple "rules" to the file plugins/lute_mandarin/parser_exceptions.txt found in your Lute data directory. This file is automatically created when Lute starts. Each rule contains the characters of the word as parsed by jieba, with regular commas added where the word should be split.
Some examples:
File content | Results when parsing "清华大学" |
---|---|
(empty file) | "清华大学" |
清华,大学 | Two tokens, "清华" and "大学" (the single token is split in two) |
清,华,大,学 | Four tokens, "清", "华", "大", "学" |
Two rules, 清华,大学 and 大,学 | Three tokens, "清华", "大", "学" (results are recursively broken down if rules are found) |
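The rule format described above can be illustrated with a small Python sketch. This is not the plugin's actual code: the function names load_rules and apply_rules are hypothetical, and the sketch assumes one rule per line, which the description above does not state explicitly. It only shows how comma-separated rules could be applied recursively to a token that jieba has already produced.

```python
def load_rules(text):
    """Build a lookup from each unsplit word to the parts its rule names.

    Assumption: one rule per line, parts separated by regular (ASCII) commas.
    """
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = [p for p in line.split(",") if p]
        # The word as parsed by jieba is the rule with its commas removed.
        rules["".join(parts)] = parts
    return rules


def apply_rules(token, rules):
    """Split a token using the rules, then re-check each resulting part."""
    parts = rules.get(token)
    if parts is None or len(parts) < 2:
        # No rule for this token (or a rule without a split): keep it whole.
        return [token]
    result = []
    for part in parts:
        # Each part is shorter than the token, so the recursion terminates.
        result.extend(apply_rules(part, rules))
    return result


if __name__ == "__main__":
    rules = load_rules("清华,大学\n大,学")
    print(apply_rules("清华大学", rules))  # ['清华', '大', '学']
```

With an empty rules text the token comes back unchanged, and with the single rule 清华,大学 it is split into two tokens, matching the table above.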
Hashes for lute3_mandarin-0.0.3b1-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 815abe024677bdb31c54b9eb28cb5bf3b644f6baf89355c1361424abcc8e6312 |
MD5 | 406da4c7da1bf085f2c3e235eb291926 |
BLAKE2b-256 | b09c4a229afbcfd80d6f12434e0210cfffce6c10c88925a2d879aa5ff9717564 |