Skip to main content

Tokenizer POS-tagger and Dependency-parser for Chinese (简体/繁體/文言文)

Project description

Current PyPI packages

UD-Chinese

Tokenizer, POS-Tagger, and Dependency-Parser for Chinese (简体/繁體/文言文), working on Universal Dependencies.

Basic usage

>>> import udchinese
>>> zh=udchinese.load()
>>> s=zh("我把这本书看完了。吾既讀是書也。")
>>> print(s)
# newdoc
# newpar
# sent_id = 1
# text = 我把这本书看完了。
1			PRON	n,代名詞,人称,止格	Person=1|PronType=Prs	6	nsubj	_	SpaceAfter=No
2			ADP	BB	_	5	case	_	SpaceAfter=No
3			DET	DT	_	4	det	_	SpaceAfter=No
4			NOUN	n,名詞,描写,形質	_	5	clf	_	SpaceAfter=No
5			NOUN	n,名詞,主体,書物	_	6	obl:patient	_	SpaceAfter=No
6			VERB	v,動詞,行為,動作	_	0	root	_	SpaceAfter=No
7			VERB	v,動詞,変化,性質	_	6	flat:vv	_	SpaceAfter=No
8			PART	UH	_	6	discourse	_	SpaceAfter=No
9			PUNCT	s,記号,句点,*	_	6	punct	_	SpacesAfter=\n

# sent_id = 2
# text = 吾既讀是書也。
1			PRON	n,代名詞,人称,起格	Person=1|PronType=Prs	3	nsubj	_	SpaceAfter=No
2			ADV	v,副詞,時相,完了	AdvType=Tim|Aspect=Perf	3	advmod	_	SpaceAfter=No
3			VERB	v,動詞,行為,動作	_	0	root	_	SpaceAfter=No
4			PRON	n,代名詞,指示,*	PronType=Dem	5	det	_	SpaceAfter=No
5			NOUN	n,名詞,主体,書物	_	3	obj	_	SpaceAfter=No
6			PART	p,助詞,句末,*	_	3	discourse:sp	_	SpaceAfter=No
7			PUNCT	s,記号,句点,*	_	3	punct	_	SpacesAfter=\n

Usage via spaCy

If you have already installed spaCy 2.1.0 or later, you can use UD-Chinese via spaCy Language pipeline.

>>> import udchinese.spacy
>>> zh=udchinese.spacy.load()
>>> d=zh("我把这本书看完了。吾既讀是書也。")
>>> print(type(d))
<class 'spacy.tokens.doc.Doc'>
>>> print(udchinese.spacy.to_conllu(d))
# text = 我把这本书看完了。
1			PRON	n,代名詞,人称,止格	_	6	nsubj	_	SpaceAfter=No
2			ADP	BB	_	5	case	_	SpaceAfter=No
3			DET	DT	_	4	det	_	SpaceAfter=No
4			NOUN	n,名詞,描写,形質	_	5	clf	_	SpaceAfter=No
5			NOUN	n,名詞,主体,書物	_	6	obl:patient	_	SpaceAfter=No
6			VERB	v,動詞,行為,動作	_	0	root	_	SpaceAfter=No
7			VERB	v,動詞,変化,性質	_	6	flat:vv	_	SpaceAfter=No
8			PART	UH	_	6	discourse	_	SpaceAfter=No
9			PUNCT	s,記号,句点,*	_	6	punct	_	_

# text = 吾既讀是書也。
1			PRON	n,代名詞,人称,起格	_	3	nsubj	_	SpaceAfter=No
2			ADV	v,副詞,時相,完了	_	3	advmod	_	SpaceAfter=No
3			VERB	v,動詞,行為,動作	_	0	root	_	SpaceAfter=No
4			PRON	n,代名詞,指示,*	_	5	det	_	SpaceAfter=No
5			NOUN	n,名詞,主体,書物	_	3	obj	_	SpaceAfter=No
6			PART	p,助詞,句末,*	_	3	discourse:sp	_	SpaceAfter=No
7			PUNCT	s,記号,句点,*	_	3	punct	_	_

Installation for Linux

Binary-wheel is available for Linux, and is installed by default when you use pip:

pip install udchinese

Installation for Cygwin

Make sure to get gcc-g++ python37-pip python37-devel packages, and then:

pip3.7 install udchinese

Use python3.7 command in Cygwin instead of python.

Installation for Jupyter Notebook (Google Colaboratory)

!pip install udchinese

Author

Koichi Yasuoka (安岡孝一)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

udchinese-0.6.1-py3-none-any.whl (28.4 MB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page