Marathi Coreference Resolution using Hypergraphs

This project performs coreference resolution in Marathi using a hypergraph-based approach.

It works in the following steps:

  1. Mention Detection: All possible noun/pronoun mentions are extracted from each sentence.
  2. 🧬 Gender Detection with Suffix Rules: Gender is predicted from common Marathi suffix patterns (like -ई, -का, -श), with Stanza-based linguistic analysis used to classify names the suffix rules do not cover.
  3. Similarity Scoring: Each mention pair is assigned a similarity score based on:
    • Gender match
    • Lexical overlap
    • Exact word match
    • Pronoun boosting
  4. Hyperedge Construction: All related mentions with high similarity scores are connected via hyperedges.
  5. Clustering: All related mention pairs with high scores (not just the single highest-scoring pair) are grouped together to form coreference clusters.

This method allows resolving pronouns like "तो", "तिने", or "त्यांनी" back to the correct noun (e.g., "राम", "सारिका", "मित्रांनी") using both linguistic signals and graph-based relationships.
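The gender detection and similarity scoring steps above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the suffix-to-gender mapping, the case markers, the pronoun table, and all weights are assumptions chosen so that the pair from the example below scores 2.0 (gender match plus pronoun boost). Lexical overlap is approximated here by a shared stem prefix, which is also an assumption.

```python
# Hypothetical rule tables, for illustration only.
PRONOUN_GENDER = {"तो": "male", "त्याने": "male", "ती": "female", "तिने": "female"}
CASE_MARKERS = ("ने", "ला")          # assumed case endings stripped before matching
FEMALE_SUFFIXES = ("ई", "ी", "का")   # assumed suffix-to-gender mapping
MALE_SUFFIXES = ("श",)               # assumed suffix-to-gender mapping


def stem(word):
    """Strip one assumed case marker so the suffix rules see the bare name."""
    for marker in CASE_MARKERS:
        if word.endswith(marker) and len(word) > len(marker):
            return word[: -len(marker)]
    return word


def detect_gender(word):
    """Return 'male', 'female', or 'unknown' using lookup, then suffix rules."""
    if word in PRONOUN_GENDER:
        return PRONOUN_GENDER[word]
    base = stem(word)
    if base.endswith(FEMALE_SUFFIXES):
        return "female"
    if base.endswith(MALE_SUFFIXES):
        return "male"
    return "unknown"


def similarity(a, b):
    """Sum the four signals with assumed weights."""
    score = 0.0
    gender_a, gender_b = detect_gender(a), detect_gender(b)
    gender_match = gender_a == gender_b and gender_a != "unknown"
    if gender_match:
        score += 1.0                                  # gender match
    stem_a, stem_b = stem(a), stem(b)
    if stem_a.startswith(stem_b) or stem_b.startswith(stem_a):
        score += 0.5                                  # lexical overlap (shared stem prefix)
    if a == b:
        score += 1.0                                  # exact word match
    exactly_one_pronoun = (a in PRONOUN_GENDER) != (b in PRONOUN_GENDER)
    if exactly_one_pronoun and gender_match:
        score += 1.0                                  # pronoun boost
    return score


print(detect_gender("सारिकाने"))       # female
print(similarity("सारिकाने", "तिने"))  # 2.0 under these assumed weights
```

In this sketch a name with no matching suffix, such as "राम", stays "unknown"; the description above indicates that Stanza-based analysis covers such cases, which is not reproduced here.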

Example:

Step 1: Gender Detection

  • सारिकाने → female
  • तिने → female

Step 2: Hyperedge Creation

  • {'सारिकाने', 'तिने'} → Score: 2.0 (gender + pronoun boost)
  • {'तिने', 'बनवले'} → Score: 0.6
  • ...

Step 3: Clustering

  • Top cluster pair: तिने ↔ सारिकाने (Score: 2.0)
  • This pair resolves "तिने" to "सारिकाने".

Final output: सारिकाने जेवण बनवले, तिने चांगले जेवण बनवले.
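The hyperedge and clustering steps in this example can be sketched with a small union-find. The two scored hyperedges come from the example above; the function name `cluster_hyperedges` and the threshold of 1.0 are assumptions made for illustration, not values documented by the project. Every hyperedge at or above the threshold is merged, which mirrors the idea of keeping all high-scoring pairs rather than only the maximum.

```python
def cluster_hyperedges(hyperedges, threshold=1.0):
    """Merge the members of every hyperedge scoring >= threshold into clusters."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for members, score in hyperedges:
        if score < threshold or len(members) < 2:
            continue  # weak or trivial hyperedges do not link mentions
        ordered = sorted(members)
        root = find(ordered[0])
        for other in ordered[1:]:
            parent[find(other)] = root

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Scored hyperedges taken from the example above.
hyperedges = [
    ({"सारिकाने", "तिने"}, 2.0),
    ({"तिने", "बनवले"}, 0.6),
]

clusters = cluster_hyperedges(hyperedges)
print(clusters)  # one cluster containing "तिने" and "सारिकाने"
```

With the assumed threshold, the 0.6 edge to "बनवले" is dropped, so the verb never joins the cluster.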

📢 Data Source & Acknowledgements

We explicitly acknowledge and thank the L3Cube-Pune team for providing the underlying raw text used in this annotation project.

The raw news articles were sourced from their open-source repository, which acts as a foundational resource for Marathi NLP tasks. Our work builds upon this by adding the layer of semantic coreference annotations.


📊 Dataset Statistics

The following statistics describe the scale and density of the annotated corpus:

Metric | Count
Total Processed Documents | 490
Total Sampled Sentences | 9,994
Unique Tokens (Vocabulary) | 5,053
Annotated Coreference Pairs | 12,963
Average Sentence Length | 29.83 words
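To give a feel for annotation density, the reported counts can be combined into a few ratios. The figures computed below are derived estimates, not numbers reported by the authors, and the variable names are illustrative.

```python
# Counts as reported in the statistics above.
documents = 490
sentences = 9_994
coref_pairs = 12_963
avg_sentence_length = 29.83  # words

# Derived ratios (estimates only).
pairs_per_sentence = coref_pairs / sentences
pairs_per_document = coref_pairs / documents
sentences_per_document = sentences / documents
approx_total_words = sentences * avg_sentence_length

print(f"Coreference pairs per sentence: {pairs_per_sentence:.2f}")      # 1.30
print(f"Coreference pairs per document: {pairs_per_document:.2f}")      # 26.46
print(f"Sentences per document:         {sentences_per_document:.2f}")  # 20.40
print(f"Approximate total words:        {approx_total_words:,.0f}")
```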

🧪 Data Structure & Format

The dataset is provided in JSON (JavaScript Object Notation) format, optimized for Hypergraph-based approaches.

JSON Schema Fields

Each file in the dataset follows this structure:

  • document_id: Unique identifier for the document.
  • sentences: A list containing the raw text of the sentences.
  • mentions: A list of all identified entities (Nouns/Pronouns) with the following metadata:
    • id: Unique mention ID.
    • text: The surface word (e.g., "पंतप्रधान").
    • sentence_index: Index of the sentence containing the mention.
    • start_char, end_char: Character-level spans of the mention.
  • clusters: A list of coreference chains. Each chain is a list of mention_ids that refer to the same underlying entity.
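A document following these fields might look like the sketch below, which also shows how cluster IDs can be mapped back to mention text. The ID values ("example_001", "m1", "m2") are hypothetical, and the sketch assumes that `start_char` and `end_char` are offsets within the mention's own sentence with an exclusive end; the schema above does not state either point, so check schema.md before relying on it.

```python
import json

SENTENCE = "सारिकाने जेवण बनवले, तिने चांगले जेवण बनवले."


def make_mention(mention_id, text, sentence_index, sentence):
    """Build a mention record, deriving the span from the sentence text."""
    start = sentence.find(text)
    return {
        "id": mention_id,
        "text": text,
        "sentence_index": sentence_index,
        "start_char": start,
        "end_char": start + len(text),  # assumed exclusive end offset
    }


document = {
    "document_id": "example_001",  # hypothetical identifier
    "sentences": [SENTENCE],
    "mentions": [
        make_mention("m1", "सारिकाने", 0, SENTENCE),
        make_mention("m2", "तिने", 0, SENTENCE),
    ],
    "clusters": [["m1", "m2"]],
}


def resolve_clusters(doc):
    """Return each coreference chain as a list of mention texts."""
    by_id = {mention["id"]: mention for mention in doc["mentions"]}
    chains = []
    for chain in doc["clusters"]:
        texts = []
        for mention_id in chain:
            mention = by_id[mention_id]
            sentence = doc["sentences"][mention["sentence_index"]]
            span = sentence[mention["start_char"]:mention["end_char"]]
            if span != mention["text"]:
                raise ValueError(f"Span mismatch for mention {mention_id}")
            texts.append(mention["text"])
        chains.append(texts)
    return chains


# Round-trip through JSON to mimic loading a file from disk.
loaded = json.loads(json.dumps(document, ensure_ascii=False))
print(resolve_clusters(loaded))
```

The span check is a cheap sanity test: if the offsets turn out to be document-level rather than sentence-level, it fails loudly instead of silently returning the wrong text.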

📂 Dataset Contents

The repository includes:

  • processed_documents/: Raw Marathi text documents (sourced from L3Cube-MahaCorpus).
  • annotated_documents/: Gold-standard coreference annotations in JSON+CoNLL format.
  • schema.md: Annotation guidelines and tag definitions.

🎯 Annotation Guidelines

Each document is manually annotated for:

  • Named Entities
  • Pronouns (explicit + pro-drop)
  • Nominal mentions
  • Hyperedges / clusters representing entity chains

Annotations follow:

  • Gender agreement rules
  • Number consistency
  • Semantic context checks
  • Cross-sentence reference tracking

A full description of the annotation scheme is provided in schema.md.


🔍 Use Cases

This dataset is suitable for:

  • Coreference resolution model training/testing
  • Hypergraph-based NLP research
  • Benchmarking for low-resource Indian languages
  • Linguistic analysis
  • Fine-tuning transformer models (e.g., IndicBERT, MahaBERT)

📜 License

This dataset is released under the CC BY-NC 4.0 License (non-commercial research use permitted).


🤝 Contributions

If you wish to add more annotations or help expand this corpus, feel free to open an issue or submit a pull request.


📧 Contact

For questions, collaboration, or academic use cases:

Tanishq Shinde
Department of Computer Engineering
Pune Institute of Computer Technology


⭐ Citation

If you use this dataset in academic work, please cite:

Shinde, T., Jangle, M., and Bagwan, M. "Coreference Resolution for Marathi Text Using Hypergraph Method." PICT, 2025.

Download files

Source Distribution

marathi_coref-0.1.1.tar.gz (6.0 kB view details)

Uploaded Source

Built Distribution

marathi_coref-0.1.1-py3-none-any.whl (6.7 kB view details)

Uploaded Python 3

File details

Details for the file marathi_coref-0.1.1.tar.gz.

File metadata

  • Download URL: marathi_coref-0.1.1.tar.gz
  • Upload date:
  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for marathi_coref-0.1.1.tar.gz
Algorithm Hash digest
SHA256 67ae7945850b44698b4dea6db4740b8efa4d3dafcedd3d25e8d237898a87e1ef
MD5 e1d298e53e9a79320cce0a7b06cd070c
BLAKE2b-256 4a4aaabfd918014182b75255cf98a53ca902a882fb7213b99a773c9f9dcbd1f2

File details

Details for the file marathi_coref-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: marathi_coref-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 6.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for marathi_coref-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 163b7419f0ef08d09c1385908af195b10762cc6e80ffe0703cb2f26208b35208
MD5 c8f4a9bdea8f33d36fc0f6707160cf5b
BLAKE2b-256 e9e5045a03b429899cd3894197bcff166325bd7f72bc22b1fbe722a64fa22ca8
