HanLP: Han Language Processing

The multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x, for advancing state-of-the-art deep learning techniques in both academia and industry. HanLP was designed from day one to be efficient, user-friendly and extensible.

Thanks to open-access corpora like Universal Dependencies and OntoNotes, HanLP 2.1 now offers 10 joint tasks on 104 languages: tokenization, lemmatization, part-of-speech tagging, token feature extraction, named entity recognition, dependency parsing, constituency parsing, semantic role labeling, semantic dependency parsing, and abstract meaning representation (AMR) parsing.

For end users, HanLP offers lightweight RESTful APIs and native Python APIs.

RESTful APIs

Tiny packages of only a few KB for agile development and mobile applications. Although anonymous users are welcome, an auth key is recommended; a free one can be applied for under the CC BY-NC-SA 4.0 license.

Python

pip install hanlp_restful

Create a client with our API endpoint and your auth.

from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='mul')
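
Rather than hard-coding the key, it can be read from the environment. The sketch below assumes a hypothetical environment variable named HANLP_AUTH (this name is chosen here for illustration and is not defined by HanLP's documentation) and falls back to anonymous access (auth=None) when the variable is unset or empty. The helper names resolve_auth and build_client are likewise illustrative.

```python
import os


def resolve_auth(environ=os.environ, var="HANLP_AUTH"):
    """Return the auth key stored in `var`, or None for anonymous access.

    HANLP_AUTH is a hypothetical variable name chosen for this sketch.
    """
    key = environ.get(var, "").strip()
    return key or None


def build_client(url="https://hanlp.hankcs.com/api", language="mul"):
    """Create a client like the one above, taking the key from the environment."""
    # Imported lazily so that resolve_auth stays usable without the package installed.
    from hanlp_restful import HanLPClient
    return HanLPClient(url, auth=resolve_auth(), language=language)
```

Keeping the key out of the source file makes it harder to commit it to version control by accident.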

Java

Insert the following dependency into your pom.xml.

<dependency>
  <groupId>com.hankcs.hanlp.restful</groupId>
  <artifactId>hanlp-restful</artifactId>
  <version>0.0.6</version>
</dependency>

Create a client with our API endpoint and your auth.

HanLPClient HanLP = new HanLPClient("https://hanlp.hankcs.com/api", null, "mul");

Quick Start

No matter which language you use, the same interface can be used to parse a document.

HanLP.parse("In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments. 2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。")
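
The returned document can then be consumed like a dictionary. A minimal sketch, assuming the result maps task names such as 'tok' and 'pos' to one list per sentence; the key names are an assumption, pair_tokens_with_tags is a hypothetical helper, and the sample values are hand-written for illustration rather than actual HanLP output.

```python
def pair_tokens_with_tags(doc, tok_key="tok", tag_key="pos"):
    """Zip each sentence's tokens with its tags, sentence by sentence."""
    return [list(zip(tokens, tags)) for tokens, tags in zip(doc[tok_key], doc[tag_key])]


# Hand-written stand-in for a parse result; NOT real HanLP output.
sample = {
    "tok": [["HanLP", "parses", "text", "."]],
    "pos": [["PROPN", "VERB", "NOUN", "PUNCT"]],
}

pairs = pair_tokens_with_tags(sample)
print(pairs[0])
```

With a real client, the result of HanLP.parse(...) would take the place of the sample dictionary, provided it exposes the same keys.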

See docs for visualization, annotation guidelines and more details.

Native APIs

pip install hanlp

HanLP requires Python 3.6 or later. A GPU or TPU is recommended but not required.

Quick Start

import hanlp
HanLP = hanlp.load(hanlp.pretrained.mtl.UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE)
print(HanLP(['In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments.',
             '2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。',
             '2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。']))

In particular, the Python HanLPClient can also be used as a callable function following the same semantics. See docs for visualization, annotation guidelines and more details.
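
The "callable" behaviour can be pictured with Python's __call__ protocol. The class below is a hypothetical stand-in, not HanLPClient's actual implementation: it only shows how calling an object can forward to its parse method so that both spellings return the same result.

```python
class CallableClientSketch:
    """Hypothetical stand-in that forwards calls to parse()."""

    def parse(self, text):
        # A real client would send `text` to the server; here we just split on spaces.
        return {"tok": [text.split()]}

    def __call__(self, text):
        # Calling the object is the same as calling parse().
        return self.parse(text)


client = CallableClientSketch()
assert client("hello world") == client.parse("hello world")
```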

Train Your Own Models

Writing deep learning models is not hard; the hard part is writing a model that reproduces the scores reported in papers. The snippet below shows how to surpass the state-of-the-art tokenizer in 6 minutes.

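# Import paths follow the HanLP 2.1 package layout and may differ between releases.
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
from hanlp.datasets.tokenization.sighan2005.pku import SIGHAN2005_PKU_TRAIN_ALL, SIGHAN2005_PKU_TEST

tokenizer = TransformerTaggingTokenizer()
save_dir = 'data/model/cws/sighan2005_pku_bert_base_96.70'
# Fine-tune bert-base-chinese on the SIGHAN 2005 PKU corpus.
tokenizer.fit(
    SIGHAN2005_PKU_TRAIN_ALL,
    SIGHAN2005_PKU_TEST,  # Conventionally, no devset is used. See Tian et al. (2020).
    save_dir,
    'bert-base-chinese',
    max_seq_len=300,
    char_level=True,
    hard_constraint=True,
    sampler_builder=SortingSamplerBuilder(batch_size=32),
    epochs=3,
    adam_epsilon=1e-6,
    warmup_steps=0.1,
    weight_decay=0.01,
    word_dropout=0.1,
    seed=1609836303,
)
# Score the saved model on the test set.
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)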

The result is guaranteed to be 96.70 because the random seed is fixed. Unlike some overclaiming papers and projects, HanLP promises that every single digit in our scores is reproducible. Any reproducibility issue will be treated and solved as a top-priority fatal bug.
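
Why a fixed seed makes a run repeatable can be shown with Python's standard random module. This sketch is not HanLP's training code; shuffled_order is a hypothetical helper that mimics a seed-dependent step such as shuffling the training data.

```python
import random


def shuffled_order(n_items, seed):
    """Return a seed-dependent permutation of range(n_items)."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    order = list(range(n_items))
    rng.shuffle(order)
    return order


SEED = 1609836303  # the seed passed to tokenizer.fit in the training snippet
first = shuffled_order(10, SEED)
second = shuffled_order(10, SEED)
assert first == second  # same seed, same order
```

In a real training run every source of randomness (weight initialization, dropout, batch order) has to be seeded in the same way for the final score to match exactly.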

Performance

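| lang | corpora | model | tok fine | tok coarse | pos ctb | pos pku | pos 863 | pos ud | ner pku | ner msra | ner ontonotes | dep | con | srl | sdp SemEval16 | sdp DM | sdp PAS | sdp PSD | lem | fea | amr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mul | UD2.7, OntoNotes5 | small | 98.62 | - | - | - | - | 93.23 | - | - | 74.42 | 79.10 | 76.85 | 70.63 | - | 91.19 | 93.67 | 85.34 | 87.71 | 84.51 | - |
| mul | UD2.7, OntoNotes5 | base |  | - | - | - | - |  | - | - |  |  |  |  | - |  |  |  |  |  | - |
| zh | open | small | 97.25 | - | 96.66 | - | - | - | - | - | 95.00 | 84.57 | 87.62 | 73.40 | 84.57 | - | - | - | - | - | - |
| zh | open | base | 97.50 | - | 97.07 | - | - | - | - | - | 96.04 | 87.11 | 89.84 | 77.78 | 87.11 | - | - | - | - | - | - |
| zh | close | small | 96.70 | 95.93 | 96.87 | 97.56 | 95.05 | - | 96.22 | 95.74 | 76.79 | 84.44 | 88.13 | 75.81 | 74.28 | - | - | - | - | - | - |
| zh | close | base | 97.52 | 96.44 | 96.99 | 97.59 | 95.29 | - | 96.48 | 95.72 | 77.77 | 85.29 | 88.57 | 76.52 | 73.76 | - | - | - | - | - | - |
| zh | close | ernie | 96.95 | 97.29 | 96.76 | 97.64 | 95.22 | - | 97.31 | 96.47 | 77.95 | 85.67 | 89.17 | 78.51 | 74.10 | - | - | - | - | - | - |
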
  • AMR models will be released once our paper gets accepted.

Citing

If you use HanLP in your research, please cite this repository.

@software{hanlp2,
  author = {Han He},
  title = {{HanLP: Han Language Processing}},
  year = {2020},
  url = {https://github.com/hankcs/HanLP},
}

License

Code

HanLP is licensed under Apache License 2.0. You can use HanLP in your commercial products for free. We would appreciate it if you add a link to HanLP on your website.

Models

Unless otherwise specified, all models in HanLP are licensed under CC BY-NC-SA 4.0.

References

https://hanlp.hankcs.com/docs/references.html
