Preprocessing and evaluation tools for Japanese cohesion analysis

Cohesion Tools

Requirements

Installation

pip install cohesion-tools  # or cohesion-tools[eval]
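
After installing, it can be handy to confirm which version ended up in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of cohesion-tools.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string if cohesion-tools is installed, otherwise None
print(installed_version("cohesion-tools"))
```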

Usage

Evaluating Predicted Documents

from pathlib import Path
from typing import List

from rhoknp import Document
from rhoknp.cohesion import ExophoraReferentType
from cohesion_tools.evaluators import CohesionEvaluator, CohesionScore

documents: List[Document] = [Document.from_knp(path.read_text(encoding="utf-8")) for path in Path("your/dataset").glob("*.knp")]
predicted_documents = your_model(documents)

scorer = CohesionEvaluator(
    exophora_referent_types=[ExophoraReferentType(t) for t in ("著者", "読者", "不特定:人", "不特定:物")],
    pas_cases=["ガ", "ヲ", "ニ"],
)

score: CohesionScore = scorer.run(predicted_documents=predicted_documents, gold_documents=documents)
score.to_dict()  # Convert the evaluation result to a dictionary
score.export_csv("score.csv")  # Export the evaluation result to `score.csv`
score.export_txt("score.txt")  # Export the evaluation result to `score.txt`
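
The snippet above does not spell out what the evaluation result contains. Assuming the scores are precision/recall/F1-style metrics (common for this kind of task, but an assumption about this library), the arithmetic can be sketched as follows. `PrfCounts` and its field names are hypothetical and are not the classes used by `cohesion_tools`.

```python
from dataclasses import dataclass


@dataclass
class PrfCounts:
    """Hypothetical container for match counts; not part of cohesion_tools."""

    num_correct: int    # predictions that match a gold annotation
    num_predicted: int  # all predictions made by the system
    num_gold: int       # all gold annotations

    @property
    def precision(self) -> float:
        # Fraction of predictions that are correct (0.0 when nothing was predicted)
        return self.num_correct / self.num_predicted if self.num_predicted else 0.0

    @property
    def recall(self) -> float:
        # Fraction of gold annotations that were recovered (0.0 when there is no gold)
        return self.num_correct / self.num_gold if self.num_gold else 0.0

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall, written in terms of raw counts
        denominator = self.num_predicted + self.num_gold
        return 2 * self.num_correct / denominator if denominator else 0.0


counts = PrfCounts(num_correct=6, num_predicted=8, num_gold=12)
print(f"P={counts.precision:.3f} R={counts.recall:.3f} F1={counts.f1:.3f}")
# prints "P=0.750 R=0.500 F1=0.600"
```

Computing F1 from raw counts rather than from the two ratios avoids a division by zero when both precision and recall are 0.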

Extracting Labels From Base Phrases

from pathlib import Path
from typing import Dict, List

from rhoknp import Document
from rhoknp.cohesion import ExophoraReferentType, Argument
from cohesion_tools.extractors import PasExtractor

pas_extractor = PasExtractor(
    cases=["ガ", "ヲ", "ニ"],
    exophora_referent_types=[ExophoraReferentType(t) for t in ("著者", "読者", "不特定:人", "不特定:物")],
)

examples = []
documents: List[Document] = [Document.from_knp(path.read_text(encoding="utf-8")) for path in Path("your/dataset").glob("*.knp")]
for document in documents:
    for base_phrase in document.base_phrases:
        if pas_extractor.is_target(base_phrase):
            rels: Dict[str, List[Argument]] = pas_extractor.extract_rels(base_phrase)
            examples.append(rels)

your_trainer.train(your_model, examples)
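
Before training, it may help to check how the extracted labels are distributed. The sketch below assumes each element of `examples` is a dictionary from a case name to a list of arguments, as the `Dict[str, List[Argument]]` annotation above suggests. The helper `count_arguments_per_case` is hypothetical, and the toy data uses plain strings in place of `Argument` objects so that the block runs without rhoknp.

```python
from collections import Counter
from typing import Dict, List, Sequence


def count_arguments_per_case(examples: Sequence[Dict[str, List[object]]]) -> Counter:
    """Count the total number of arguments extracted for each case."""
    counts: Counter = Counter()
    for rels in examples:
        for case, arguments in rels.items():
            counts[case] += len(arguments)
    return counts


# Toy stand-in for `examples`; real entries would hold rhoknp Argument objects
toy_examples = [
    {"ガ": ["太郎"], "ヲ": ["本"], "ニ": []},
    {"ガ": ["著者"], "ヲ": [], "ニ": ["学校"]},
]
case_counts = count_arguments_per_case(toy_examples)
for case, count in case_counts.items():
    print(case, count)
```

A heavily skewed distribution, for example very few ニ arguments, is worth knowing about before interpreting per-case evaluation scores.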

Reference

@inproceedings{ueda-etal-2020-bert,
  title     = {{BERT}-based Cohesion Analysis of {J}apanese Texts},
  author    = {Ueda, Nobuhiro  and
               Kawahara, Daisuke  and
               Kurohashi, Sadao},
  booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
  month     = dec,
  year      = {2020},
  address   = {Barcelona, Spain (Online)},
  publisher = {International Committee on Computational Linguistics},
  url       = {https://aclanthology.org/2020.coling-main.114},
  doi       = {10.18653/v1/2020.coling-main.114},
  pages     = {1323--1333},
  abstract  = {The meaning of natural language text is supported by cohesion among various kinds of entities, including coreference relations, predicate-argument structures, and bridging anaphora relations. However, predicate-argument structures for nominal predicates and bridging anaphora relations have not been studied well, and their analyses have been still very difficult. Recent advances in neural networks, in particular, self training-based language models including BERT (Devlin et al., 2019), have significantly improved many natural language processing tasks, making it possible to dive into the study on analysis of cohesion in the whole text. In this study, we tackle an integrated analysis of cohesion in Japanese texts. Our results significantly outperformed existing studies in each task, especially about 10 to 20 point improvement both for zero anaphora and coreference resolution. Furthermore, we also showed that coreference resolution is different in nature from the other tasks and should be treated specially.}
}
@inproceedings{ueda-etal-2023-kwja,
  title     = {{KWJA}: A Unified {J}apanese Analyzer Based on Foundation Models},
  author    = {Ueda, Nobuhiro  and
               Omura, Kazumasa  and
               Kodama, Takashi  and
               Kiyomaru, Hirokazu  and
               Murawaki, Yugo  and
               Kawahara, Daisuke  and
               Kurohashi, Sadao},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)},
  month     = jul,
  year      = {2023},
  address   = {Toronto, Canada},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.acl-demo.52},
  pages     = {538--548},
  abstract  = {We present KWJA, a high-performance unified Japanese text analyzer based on foundation models. KWJA supports a wide range of tasks, including typo correction, word segmentation, word normalization, morphological analysis, named entity recognition, linguistic feature tagging, dependency parsing, PAS analysis, bridging reference resolution, coreference resolution, and discourse relation analysis, making it the most versatile among existing Japanese text analyzers. KWJA solves these tasks in a multi-task manner but still achieves competitive or better performance compared to existing analyzers specialized for each task. KWJA is publicly available under the MIT license at https://github.com/ku-nlp/kwja.}
}

License

This software is released under the MIT License; see LICENSE.

