
A word segmentation tool for Ancient Chinese

Project description

(AnChinSeg) Word Segmentation and Part-of-speech Tagging for Ancient Chinese

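Building on our 2022 segmentation paper, this tool performs word segmentation and part-of-speech tagging for Ancient Chinese. It is a very rough and simple tool for both tasks.

Part-of-speech tagging evaluation: P: 92.82 R: 92.85 F: 92.84
Word segmentation evaluation: P: 97.19 R: 97.22 F: 97.20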
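As a quick sanity check on the scores above, the sketch below recomputes F from P and R. It assumes F is the usual balanced F-measure (the harmonic mean of precision and recall); the source does not state which formula was used, and the function name f_score is only illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported scores, in percent (assumption: F is the harmonic mean of P and R).
pos_f = f_score(92.82, 92.85)   # part-of-speech tagging
seg_f = f_score(97.19, 97.22)   # word segmentation

print(f"POS F: {pos_f:.2f}  segmentation F: {seg_f:.2f}")
```

Because P and R are themselves rounded to two decimals, the recomputed values can differ from the reported F in the last digit.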

Citation

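The part-of-speech tagging work has not been published as a paper. If you use this tool in academic research, you can cite the following paper, on which our implementation is based.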

@inproceedings{tang-su-2022-slepen,
    title = "That Slepen Al the Nyght with Open Ye! Cross-era Sequence Segmentation with Switch-memory",
    author = "Tang, Xuemei  and
      Su, Qi",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.540",
    doi = "10.18653/v1/2022.acl-long.540",
    pages = "7830--7840",
}

Requirements

For environment setup, see requriments.txt.

How to use it

1) Download the model file model.dt from Baidu Netdisk or Google Drive and place it in the model folder.
Baidu link: https://pan.baidu.com/s/1jIbqk5b4GYBEMAdBPVJwYg (extraction code: dac4)
Google Drive: https://drive.google.com/drive/folders/1zFK30h6PQYRDDZ2uEScLy0l5VoC7jXHU?usp=sharing

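2) In the project folder, run:
python segmenter.py --predict_data ./data/sample_data.txt --output_path ./data/output.txt
(Replace ./data/sample_data.txt with the path to the file you want to segment, with one sentence per line, and replace ./data/output.txt with the location where the segmentation result should be stored.)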
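The command above can also be driven from a script. The following is a minimal sketch, not part of the tool itself: the helper names (write_sentences, build_command, run_segmenter) are hypothetical, UTF-8 input encoding is an assumption, and only the segmenter.py script and its --predict_data and --output_path flags come from the instructions above.

```python
import subprocess
import sys
from pathlib import Path
from typing import Iterable, List


def write_sentences(sentences: Iterable[str], path: Path) -> None:
    """Write one sentence per line, skipping empty ones (UTF-8 is an assumption)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = [s.strip() for s in sentences if s.strip()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def build_command(predict_data: Path, output_path: Path) -> List[str]:
    """Build the documented command line for segmenter.py."""
    return [
        sys.executable, "segmenter.py",
        "--predict_data", str(predict_data),
        "--output_path", str(output_path),
    ]


def run_segmenter(predict_data: Path, output_path: Path, repo_dir: Path) -> None:
    """Run segmenter.py from the project folder; raises if the command fails."""
    subprocess.run(build_command(predict_data, output_path), cwd=repo_dir, check=True)
```

A call such as run_segmenter(Path("data/sample_data.txt"), Path("data/output.txt"), Path(".")) would then mirror the command above, provided model.dt is already in the model folder.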

3) The segmented output has the following format:
端明殿_NA 学士_NA 兼_VT 翰林侍读_NA 学士_NA 朝散大夫_NA 右谏议大夫_NA 充_VT 集贤院_NA
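To work with this output in code, each line can be split into (word, tag) pairs. The sketch below assumes tokens are separated by whitespace and that the tag follows the last underscore, as in the sample line; parse_tagged_line is a hypothetical helper, not a function provided by the tool.

```python
from typing import List, Tuple


def parse_tagged_line(line: str) -> List[Tuple[str, str]]:
    """Split one output line of word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last underscore so a word containing "_" stays intact.
        word, sep, tag = token.rpartition("_")
        if not sep:
            # No underscore found: keep the token as a word with an empty tag.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


sample = "端明殿_NA 学士_NA 兼_VT 翰林侍读_NA 学士_NA 朝散大夫_NA 右谏议大夫_NA 充_VT 集贤院_NA"
pairs = parse_tagged_line(sample)
words = [word for word, _ in pairs]   # segmentation only, tags dropped
print(pairs[:3])
```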

4) The part-of-speech tags follow the tagsets of Academia Sinica, Taiwan:
https://lingcorpus.iis.sinica.edu.tw/kiwi/dkiwi/middle_chinese_c_wordtype.html
https://lingcorpus.iis.sinica.edu.tw/kiwi/akiwi/ancient_mandarin_chinese_c_wordtype.html

Contact

Please contact us at tangxuemei@polyu.edu.hk if you have any questions. Welcome to Research Center for Digital Humanities of Peking University! https://pkudh.org
