HanLP: Han Language Processing
中文 | docs | 1.x | forum | docker
The multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x, for advancing state-of-the-art deep learning techniques in both academia and industry. HanLP was designed from day one to be efficient, user friendly and extendable. It comes with pretrained models for 104 human languages including English, Chinese and many others.
Thanks to open-access corpora like Universal Dependencies and OntoNotes, HanLP 2.1 now offers joint models for 104 languages covering tokenization, lemmatization, part-of-speech tagging, token feature extraction, dependency parsing, constituency parsing, semantic role labeling, semantic dependency parsing, and abstract meaning representation (AMR) parsing.
For end users, HanLP offers light-weighted RESTful APIs and native Python APIs.
RESTful APIs
Tiny packages of only a few KB, suited to agile development and mobile applications. An auth key is required; a free one can be applied for here under the CC BY-NC-SA 4.0 license.
Python
pip install hanlp_restful
Create a client with our API endpoint and your auth.
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth='your_auth', language='mul')
Java
Insert the following dependency into your pom.xml.
<dependency>
  <groupId>com.hankcs.hanlp.restful</groupId>
  <artifactId>hanlp-restful</artifactId>
  <version>0.0.2</version>
</dependency>
Create a client with our API endpoint and your auth.
HanLPClient HanLP = new HanLPClient("https://hanlp.hankcs.com/api", "your_auth", "mul");
Quick Start
No matter which language you use, the same interface can be used to parse a document.
HanLP.parse("In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environment. 2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。")
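The call returns a Document, a dict-like object mapping task names to one annotation list per sentence. The snippet below walks such a structure using a hand-made stand-in rather than a live API response; the keys and values are illustrative only:

```python
# Stand-in for a Document returned by HanLP.parse(): task name -> one list
# of annotations per sentence. Values here are illustrative, not real output.
doc = {
    'tok': [['In', '2021', ',', 'HanLP', 'delivers', 'techniques']],
    'pos': [['ADP', 'NUM', 'PUNCT', 'PROPN', 'VERB', 'NOUN']],
}

# Joint tasks are aligned token-by-token, so parallel lists can be zipped.
for tokens, tags in zip(doc['tok'], doc['pos']):
    pairs = [f'{w}/{t}' for w, t in zip(tokens, tags)]
    print(' '.join(pairs))
```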
Native APIs
pip install hanlp
HanLP requires Python 3.6 or later. A GPU or TPU is suggested but not mandatory.
Quick Start
import hanlp
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)
HanLP(['In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environment.',
'2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。',
'2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。'])
In particular, the Python HanLPClient can also be used as a callable function following the same semantics. See docs for more details.
Train Your Own Models
Writing DL models is not hard; the hard part is writing a model that reproduces the scores reported in papers. The snippet below shows how to surpass the state-of-the-art tokenizer in 9 minutes.
# Imports as used in HanLP's training demos:
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
from hanlp.datasets.tokenization.sighan2005.pku import SIGHAN2005_PKU_TRAIN_ALL, SIGHAN2005_PKU_TEST

tokenizer = TransformerTaggingTokenizer()
save_dir = 'data/model/cws/sighan2005_pku_bert_base_96.61'
tokenizer.fit(
SIGHAN2005_PKU_TRAIN_ALL,
SIGHAN2005_PKU_TEST, # Conventionally, no devset is used. See Tian et al. (2020).
save_dir,
'bert-base-chinese',
max_seq_len=300,
char_level=True,
hard_constraint=True,
sampler_builder=SortingSamplerBuilder(batch_size=32),
epochs=3,
adam_epsilon=1e-6,
warmup_steps=0.1,
weight_decay=0.01,
word_dropout=0.1,
seed=1609422632,
)
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)
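evaluate reports precision, recall, and F1 over the predicted segmentation. As a hedged illustration of how segmentation F1 is conventionally computed (this helper is not part of HanLP's API), segmented words can be compared as character spans:

```python
def seg_f1(gold_sentences, pred_sentences):
    """Span-based F1 for word segmentation (illustrative, not HanLP API)."""
    def spans(words):
        # Convert a word sequence to a set of (start, end) character spans.
        out, offset = set(), 0
        for w in words:
            out.add((offset, offset + len(w)))
            offset += len(w)
        return out

    tp = fp = fn = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = spans(gold), spans(pred)
        tp += len(g & p)   # spans both agree on
        fp += len(p - g)   # predicted but not gold
        fn += len(g - p)   # gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

For example, `seg_f1([['商品', '和', '服务']], [['商品', '和服', '务']])` scores one correct span out of three on each side, giving an F1 of 1/3.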
The result is guaranteed to be 96.66, as the random seed is fixed. Unlike some overclaiming papers and projects, HanLP promises that every digit in our scores is reproducible. Any issue with reproducibility will be treated and solved as a top-priority fatal bug.
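The reproducibility promise rests on fixing every source of randomness via the seed argument (HanLP's training presumably seeds numpy and torch as well). A minimal sketch of the principle using Python's random module:

```python
import random

def sample_run(seed):
    # Fixing the seed makes every draw in the "run" deterministic.
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(5)]

# Two runs with the same seed see identical randomness,
# hence identical batching, dropout, etc. in real training.
assert sample_run(1609422632) == sample_run(1609422632)
```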
Performance
| lang | corpora | model | tok (fine) | tok (coarse) | pos (ctb) | pos (pku) | pos (863) | pos (ud) | ner (pku) | ner (msra) | ner (ontonotes) | dep | con | srl | sdp (SemEval16) | sdp (DM) | sdp (PAS) | sdp (PSD) | lem | fea | amr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mul | UD2.7 OntoNotes5 | small | 98.30 | - | - | - | - | 91.72 | - | - | 74.86 | 74.66 | 74.29 | 65.73 | - | 88.52 | 92.56 | 83.84 | 84.65 | 81.13 | - |
| mul | UD2.7 OntoNotes5 | base | 99.59 | - | - | - | - | 95.95 | - | - | 80.31 | 85.84 | 80.22 | 74.61 | - | 93.23 | 95.16 | 86.57 | 92.91 | 90.30 | - |
| zh | open | small | 97.25 | - | 96.66 | - | - | - | - | - | 95.00 | 84.57 | 87.62 | 73.40 | 84.57 | - | - | - | - | - | - |
| zh | open | base | 97.50 | - | 97.07 | - | - | - | - | - | 96.04 | 87.11 | 89.84 | 77.78 | 87.11 | - | - | - | - | - | - |
| zh | close | small | 96.70 | 95.93 | 96.87 | 97.56 | 95.05 | - | 96.22 | 95.74 | 76.79 | 84.44 | 88.13 | 75.81 | 74.28 | - | - | - | - | - | - |
| zh | close | base | 97.52 | 96.44 | 96.99 | 97.59 | 95.29 | - | 96.48 | 95.72 | 77.77 | 85.29 | 88.57 | 76.52 | 73.76 | - | - | - | - | - | - |
- Multilingual models are temporary ones which will be replaced in one week.
- AMR models will be released once our paper gets accepted.
Citing
If you use HanLP in your research, please cite this repository.
@software{hanlp2,
author = {Han He},
title = {{HanLP: Han Language Processing}},
year = {2020},
url = {https://github.com/hankcs/HanLP},
}
License
Codes
HanLP is licensed under Apache License 2.0. You can use HanLP in your commercial products for free. We would appreciate it if you add a link to HanLP on your website.
Models
Unless otherwise specified, all models in HanLP are licensed under CC BY-NC-SA 4.0.