
Tokenizer, POS-tagger, and Dependency-parser for Chinese (简体/繁體/文言文)


UD-Chinese

Tokenizer, POS-tagger, and dependency parser for Chinese (简体/繁體/文言文: Simplified, Traditional, and Classical Chinese), based on Universal Dependencies.

Basic usage

>>> import udchinese
>>> zh=udchinese.load()
>>> s=zh("我把这本书看完了。吾既讀是書也。")
>>> print(s)
# newdoc
# newpar
# sent_id = 1
# text = 我把这本书看完了。
1	我		PRON	n,代名詞,人称,止格	Person=1|PronType=Prs	6	nsubj	_	SpaceAfter=No
2	把		ADP	BB	_	5	case	_	SpaceAfter=No
3	这		DET	DT	_	4	det	_	SpaceAfter=No
4	本		NOUN	n,名詞,描写,形質	_	5	clf	_	SpaceAfter=No
5	书		NOUN	n,名詞,主体,書物	_	6	obl:patient	_	SpaceAfter=No
6	看		VERB	v,動詞,行為,動作	_	0	root	_	SpaceAfter=No
7	完		VERB	v,動詞,変化,性質	_	6	flat:vv	_	SpaceAfter=No
8	了		PART	UH	_	6	discourse	_	SpaceAfter=No
9	。		PUNCT	s,記号,句点,*	_	6	punct	_	SpacesAfter=\n

# sent_id = 2
# text = 吾既讀是書也。
1	吾		PRON	n,代名詞,人称,起格	Person=1|PronType=Prs	3	nsubj	_	SpaceAfter=No
2	既		ADV	v,副詞,時相,完了	AdvType=Tim|Aspect=Perf	3	advmod	_	SpaceAfter=No
3	讀		VERB	v,動詞,行為,動作	_	0	root	_	SpaceAfter=No
4	是		PRON	n,代名詞,指示,*	PronType=Dem	5	det	_	SpaceAfter=No
5	書		NOUN	n,名詞,主体,書物	_	3	obj	_	SpaceAfter=No
6	也		PART	p,助詞,句末,*	_	3	discourse:sp	_	SpaceAfter=No
7	。		PUNCT	s,記号,句点,*	_	3	punct	_	SpacesAfter=\n
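
The printed result is CoNLL-U text: ten tab-separated columns per token, `#` comment lines, and a blank line between sentences. As a rough sketch of how such text could be post-processed, the snippet below splits it into token dictionaries and lists each token with its head. `parse_conllu` is a hypothetical helper, not part of udchinese, and the sample rows are adapted from the second sentence above with `_` standing in for the LEMMA column.

```python
# Hypothetical helper for illustration; udchinese does not provide parse_conllu.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")

def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # skip comment lines such as "# text = ..."
            continue
        current.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences

# Sample rows adapted from the output above; LEMMA is a "_" placeholder.
ROWS = [
    ["1", "吾", "_", "PRON", "n,代名詞,人称,起格", "Person=1|PronType=Prs", "3", "nsubj", "_", "SpaceAfter=No"],
    ["2", "既", "_", "ADV", "v,副詞,時相,完了", "AdvType=Tim|Aspect=Perf", "3", "advmod", "_", "SpaceAfter=No"],
    ["3", "讀", "_", "VERB", "v,動詞,行為,動作", "_", "0", "root", "_", "SpaceAfter=No"],
    ["4", "是", "_", "PRON", "n,代名詞,指示,*", "PronType=Dem", "5", "det", "_", "SpaceAfter=No"],
    ["5", "書", "_", "NOUN", "n,名詞,主体,書物", "_", "3", "obj", "_", "SpaceAfter=No"],
    ["6", "也", "_", "PART", "p,助詞,句末,*", "_", "3", "discourse:sp", "_", "SpaceAfter=No"],
    ["7", "。", "_", "PUNCT", "s,記号,句点,*", "_", "3", "punct", "_", "SpacesAfter=\\n"],
]
SAMPLE = "# text = 吾既讀是書也。\n" + "\n".join("\t".join(row) for row in ROWS) + "\n"

sentences = parse_conllu(SAMPLE)
for tok in sentences[0]:
    # HEAD is a 1-based token ID within the sentence; 0 marks the root.
    if tok["head"] == "0":
        head = "ROOT"
    else:
        head = sentences[0][int(tok["head"]) - 1]["form"]
    print(tok["form"], tok["upos"], tok["deprel"], head)
```

The same function should work on the text shown by `print(s)`, assuming that text is first obtained as a plain Python string.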

Usage via spaCy

If you have already installed spaCy 2.1.0 or later, you can use UD-Chinese via the spaCy Language pipeline.

>>> import udchinese.spacy
>>> zh=udchinese.spacy.load()
>>> d=zh("我把这本书看完了。吾既讀是書也。")
>>> print(type(d))
<class 'spacy.tokens.doc.Doc'>
>>> print(udchinese.spacy.to_conllu(d))
# text = 我把这本书看完了。
1	我		PRON	n,代名詞,人称,止格	_	6	nsubj	_	SpaceAfter=No
2	把		ADP	BB	_	5	case	_	SpaceAfter=No
3	这		DET	DT	_	4	det	_	SpaceAfter=No
4	本		NOUN	n,名詞,描写,形質	_	5	clf	_	SpaceAfter=No
5	书		NOUN	n,名詞,主体,書物	_	6	obl:patient	_	SpaceAfter=No
6	看		VERB	v,動詞,行為,動作	_	0	root	_	SpaceAfter=No
7	完		VERB	v,動詞,変化,性質	_	6	flat:vv	_	SpaceAfter=No
8	了		PART	UH	_	6	discourse	_	SpaceAfter=No
9	。		PUNCT	s,記号,句点,*	_	6	punct	_	_

# text = 吾既讀是書也。
1	吾		PRON	n,代名詞,人称,起格	_	3	nsubj	_	SpaceAfter=No
2	既		ADV	v,副詞,時相,完了	_	3	advmod	_	SpaceAfter=No
3	讀		VERB	v,動詞,行為,動作	_	0	root	_	SpaceAfter=No
4	是		PRON	n,代名詞,指示,*	_	5	det	_	SpaceAfter=No
5	書		NOUN	n,名詞,主体,書物	_	3	obj	_	SpaceAfter=No
6	也		PART	p,助詞,句末,*	_	3	discourse:sp	_	SpaceAfter=No
7	。		PUNCT	s,記号,句点,*	_	3	punct	_	_
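
Because `d` is an ordinary spaCy `Doc`, the parse can presumably also be read through the standard spaCy token attributes (`token.text`, `token.dep_`, `token.head`, `token.i`) rather than via CoNLL-U text. The sketch below collects (child, relation, head) triples that way. `dependency_edges` and `fake_doc` are hypothetical names introduced here; `fake_doc` builds duck-typed stand-in tokens so the example runs without spaCy or udchinese installed, and the relation labels are copied from the second sentence above.

```python
from types import SimpleNamespace

def dependency_edges(doc):
    """Return (child, relation, head) triples; the root is paired with None."""
    edges = []
    for token in doc:
        # In spaCy, the root token of a sentence is its own head.
        head = None if token.head.i == token.i else token.head.text
        edges.append((token.text, token.dep_, head))
    return edges

def fake_doc(rows):
    """Stand-in for a spaCy Doc: rows are (form, relation, 0-based head index)."""
    tokens = [SimpleNamespace(i=i, text=form, dep_=dep)
              for i, (form, dep, _) in enumerate(rows)]
    for token, (_, _, head) in zip(tokens, rows):
        token.head = tokens[head]
    return tokens

demo = fake_doc([
    ("吾", "nsubj", 2),
    ("既", "advmod", 2),
    ("讀", "root", 2),
    ("是", "det", 4),
    ("書", "obj", 2),
    ("也", "discourse:sp", 2),
    ("。", "punct", 2),
])
edges = dependency_edges(demo)
for child, relation, head in edges:
    print(child, relation, head)
```

With a real pipeline, `dependency_edges(d)` would take the place of the stand-in; the exact label spaCy reports for the root relation may differ from the CoNLL-U `root` shown here.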

Installation for Linux

A binary wheel is available for Linux and is installed by default when you use pip:

pip install udchinese

Installation for Cygwin

Make sure the gcc-g++, python37-pip, and python37-devel packages are installed, then run:

pip3.7 install udchinese

In Cygwin, use the python3.7 command instead of python.

Installation for Jupyter Notebook (Google Colaboratory)

!pip install udchinese

Author

Koichi Yasuoka (安岡孝一)


