A word segmentation tool for Ancient Chinese

Project description

AnChinSeg: Word Segmentation and Part-of-speech Tagging for Ancient Chinese

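This tool performs word segmentation and part-of-speech (POS) tagging for Ancient Chinese, building on our 2022 segmentation paper (see Citation). It is a deliberately simple, rough implementation. Evaluation results (P = precision, R = recall, F = F-score):

  • POS tagging: P 92.82, R 92.85, F 92.84
  • Word segmentation: P 97.19, R 97.22, F 97.20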

Citation

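The POS tagging component has not been published as a paper. If you use this tool in academic research, please cite the following paper, on which our implementation is based: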

@inproceedings{tang-su-2022-slepen,
    title = "That Slepen Al the Nyght with Open Ye! Cross-era Sequence Segmentation with Switch-memory",
    author = "Tang, Xuemei  and
      Su, Qi",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.540",
    doi = "10.18653/v1/2022.acl-long.540",
    pages = "7830--7840",
}

Requirements

For environment setup, see requriments.txt.

How to use it

1) Download the model file model.dt from Baidu Netdisk or Google Drive and place it in the model folder.
Baidu link: https://pan.baidu.com/s/1jIbqk5b4GYBEMAdBPVJwYg (extraction code: dac4)
Google Drive: https://drive.google.com/drive/folders/1zFK30h6PQYRDDZ2uEScLy0l5VoC7jXHU?usp=sharing

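2) From the project folder, run:

    python segmenter.py --predict_data ./data/sample_data.txt --output_path ./data/output.txt

Replace ./data/sample_data.txt with the path of the file you want to segment (one sentence per line), and ./data/output.txt with the location where the result should be stored.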
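
The step above can also be scripted. The following is a minimal sketch, not part of the tool itself: it writes sentences to a file with one sentence per line, then calls segmenter.py with the two documented options. The helper names (write_sentences, build_command, run_segmenter), the UTF-8 encoding, and the repo_dir argument are assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path


def write_sentences(sentences, path):
    """Write one sentence per line, skipping empty entries.

    UTF-8 is an assumption; the README does not state the expected encoding.
    """
    lines = [s.strip() for s in sentences if s.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def build_command(predict_data, output_path, script="segmenter.py"):
    """Assemble the documented command line as an argument list."""
    return [
        sys.executable, script,
        "--predict_data", str(predict_data),
        "--output_path", str(output_path),
    ]


def run_segmenter(predict_data, output_path, repo_dir="."):
    """Run segmenter.py inside the project folder (hypothetical wrapper).

    check=True raises CalledProcessError if the script exits with an error.
    """
    subprocess.run(build_command(predict_data, output_path),
                   cwd=repo_dir, check=True)
```

Calling run_segmenter("./data/sample_data.txt", "./data/output.txt") from the project folder would be equivalent to typing the command by hand, provided model.dt is already in the model folder.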

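3) The segmented and tagged output has the following format, with each word followed by an underscore and its POS tag:
端明殿_NA 学士_NA 兼_VT 翰林侍读_NA 学士_NA 朝散大夫_NA 右谏议大夫_NA 充_VT 集贤院_NA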
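
For downstream processing, an output line in this format can be split back into (word, tag) pairs. This is a small sketch that assumes tokens are separated by whitespace and that the tag follows the last underscore of each token, as in the sample line; parse_tagged_line is a hypothetical helper, not a function provided by the tool.

```python
def parse_tagged_line(line):
    """Split 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits at the LAST underscore, so a word that itself
        # contains an underscore keeps it.
        word, sep, tag = token.rpartition("_")
        if not sep:
            # No underscore found: keep the token as the word, with no tag.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


sample = "端明殿_NA 学士_NA 兼_VT 翰林侍读_NA 学士_NA 朝散大夫_NA 右谏议大夫_NA 充_VT 集贤院_NA"
pairs = parse_tagged_line(sample)
words = [word for word, _ in pairs]   # segmentation only
tags = [tag for _, tag in pairs]      # POS tags only
```

For the sample line this yields nine pairs, starting with ("端明殿", "NA").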

4) The POS tags follow the tag sets of Academia Sinica, Taiwan:
https://lingcorpus.iis.sinica.edu.tw/kiwi/dkiwi/middle_chinese_c_wordtype.html
https://lingcorpus.iis.sinica.edu.tw/kiwi/akiwi/ancient_mandarin_chinese_c_wordtype.html

Contact

Please contact us at tangxuemei@polyu.edu.hk if you have any questions. You are also welcome to visit the Research Center for Digital Humanities of Peking University: https://pkudh.org

Download files

Built Distribution

anchinsegmenter-0.9-py3-none-any.whl (2.7 MB view details)

Uploaded Python 3

File details

Details for the file anchinsegmenter-0.9-py3-none-any.whl.

File metadata

  • Download URL: anchinsegmenter-0.9-py3-none-any.whl
  • Upload date:
  • Size: 2.7 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for anchinsegmenter-0.9-py3-none-any.whl
Algorithm Hash digest
SHA256 455e5c92cddbcdffa56589a526858036a365b9f05c74a2bfc251ad2fdfe18949
MD5 10c579e63bcbe688bf98422cea47f76a
BLAKE2b-256 be30f69e4616eda4195a865d5132a7f7d84350689092fbefd43beeac53e5b658
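
To check a downloaded wheel against the SHA256 digest listed above, something like the following sketch can be used. It relies only on the standard-library hashlib module; the function names and the local file path are assumptions for illustration.

```python
import hashlib

# SHA256 digest published above for anchinsegmenter-0.9-py3-none-any.whl
EXPECTED_SHA256 = "455e5c92cddbcdffa56589a526858036a365b9f05c74a2bfc251ad2fdfe18949"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_published_hash("anchinsegmenter-0.9-py3-none-any.whl") should return True for an intact download saved under that name in the current folder.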
