
A word segmentation tool for Ancient Chinese

Project description

AnChinSeg: Word Segmentation and Part-of-Speech Tagging for Ancient Chinese

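Building on our 2022 paper on word segmentation, this tool performs word segmentation and part-of-speech tagging for Ancient Chinese. It is a very rough and simple tool.

Part-of-speech tagging results: P: 92.82 R: 92.85 F: 92.84
Word segmentation results: P: 97.19 R: 97.22 F: 97.20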
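As a quick sanity check on the scores above, the sketch below recomputes F from P and R, assuming F is the balanced F1 score (the harmonic mean of precision and recall). The function name `f1` is illustrative and not part of the tool. Because the published P and R are themselves rounded, the recomputed value can differ from the reported F in the last digit.

```python
def f1(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Scores as reported in this README (percentages).
pos_f = f1(92.82, 92.85)  # part-of-speech tagging
seg_f = f1(97.19, 97.22)  # word segmentation

print(f"POS F1: {pos_f:.3f}")
print(f"Segmentation F1: {seg_f:.3f}")
```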

Citation

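The part-of-speech tagging work has not been published as a paper. If you use this tool in academic research, you can cite the following paper, on which our implementation is based.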

@inproceedings{tang-su-2022-slepen,
    title = "That Slepen Al the Nyght with Open Ye! Cross-era Sequence Segmentation with Switch-memory",
    author = "Tang, Xuemei  and
      Su, Qi",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.540",
    doi = "10.18653/v1/2022.acl-long.540",
    pages = "7830--7840",
}

Requirements

For environment setup, see requriments.txt.

How to use it

1) Download the model file model.dt from Baidu Netdisk or Google Drive and place it in the model folder.
Baidu link: https://pan.baidu.com/s/1jIbqk5b4GYBEMAdBPVJwYg (extraction code: dac4)
Google Drive: https://drive.google.com/drive/folders/1zFK30h6PQYRDDZ2uEScLy0l5VoC7jXHU?usp=sharing

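2) From the project folder, run:
python segmenter.py --predict_data ./data/sample_data.txt --output_path ./data/output.txt
Replace ./data/sample_data.txt with the path to the file you want to segment (one sentence per line), and ./data/output.txt with the path where the segmentation results should be stored.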
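The command above can also be driven from Python. The following is a minimal sketch, not part of the tool: the helper names `write_sentences`, `build_command`, and `run_segmenter` are hypothetical, and UTF-8 input encoding is an assumption. Only the script name and the two flags (`--predict_data`, `--output_path`) come from this README.

```python
import subprocess
import sys
from pathlib import Path


def write_sentences(sentences, path):
    """Write one sentence per line, skipping blank entries.

    UTF-8 is assumed here; the README does not state the expected encoding.
    Returns the number of sentences written.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = [s.strip() for s in sentences if s.strip()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def build_command(predict_data, output_path):
    """Build the segmenter.py invocation shown in the README."""
    return [
        sys.executable, "segmenter.py",
        "--predict_data", str(predict_data),
        "--output_path", str(output_path),
    ]


def run_segmenter(predict_data, output_path, project_dir="."):
    """Run segmenter.py from the project folder; raises if it exits non-zero."""
    subprocess.run(build_command(predict_data, output_path),
                   cwd=project_dir, check=True)
```

Calling `run_segmenter("./data/sample_data.txt", "./data/output.txt")` from the project folder would then be equivalent to the command line above, provided model.dt is already in the model folder.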

3) The segmented and tagged output has the following format:
端明殿_NA 学士_NA 兼_VT 翰林侍读_NA 学士_NA 朝散大夫_NA 右谏议大夫_NA 充_VT 集贤院_NA
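For downstream processing, an output line in this format can be split into (word, tag) pairs. The sketch below assumes tokens are separated by whitespace and that the tag follows the last underscore, as in the sample line; `parse_tagged_line` is a hypothetical helper, not a function provided by the tool.

```python
def parse_tagged_line(line):
    """Split a 'word_TAG word_TAG ...' line into (word, tag) pairs.

    Tokens without an underscore are kept with a tag of None.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("_")
        if not sep:
            # No underscore found: rpartition puts the whole token in `tag`.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


sample = "端明殿_NA 学士_NA 兼_VT 翰林侍读_NA 学士_NA"
for word, tag in parse_tagged_line(sample):
    print(word, tag)
```

Splitting on the last underscore, instead of the first, keeps a word intact if it ever contains an underscore itself.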

4) The part-of-speech tag set follows that of Academia Sinica, Taiwan:
https://lingcorpus.iis.sinica.edu.tw/kiwi/dkiwi/middle_chinese_c_wordtype.html
https://lingcorpus.iis.sinica.edu.tw/kiwi/akiwi/ancient_mandarin_chinese_c_wordtype.html

Contact

Please contact us at tangxuemei@polyu.edu.hk if you have any questions. Welcome to Research Center for Digital Humanities of Peking University! https://pkudh.org


Download files

Source Distribution

classicalsplit-1.4.tar.gz (209.4 kB view details)

Uploaded Source

Built Distribution

classicalsplit-1.4-py3-none-any.whl (238.0 kB view details)

Uploaded Python 3

File details

Details for the file classicalsplit-1.4.tar.gz.

File metadata

  • Download URL: classicalsplit-1.4.tar.gz
  • Size: 209.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for classicalsplit-1.4.tar.gz
Algorithm Hash digest
SHA256 989a3f5d49a2f2cd06d5e50341a9fe4ccc149566b49abdd4ac88f132aed8efe1
MD5 7279bedaa8fad915e0208521ede9305c
BLAKE2b-256 bd43c1e9f1552e9d94d1fc95d8f69de64a04986e09f37b468b9923c66af463d0

File details

Details for the file classicalsplit-1.4-py3-none-any.whl.

File metadata

  • Download URL: classicalsplit-1.4-py3-none-any.whl
  • Size: 238.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for classicalsplit-1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 3715a863d0116a3c0ac47ec5778df60d074533beda21c64a886751f9019553df
MD5 de566e2b6586259005ed62b461567c4b
BLAKE2b-256 58c072f70f2820fa6580fec86b3bc1078fbff2d8b3e7651bb82316b9f798f12e
