Marathi Coreference Resolution using Hypergraphs

This project performs coreference resolution in Marathi using a hypergraph-based approach.

It works in the following steps:

  1. Mention Detection: All possible noun/pronoun mentions are extracted from each sentence.
  2. 🧬 Gender Detection with Suffix Rules: Gender is predicted from common Marathi suffix patterns (such as -ई, -का, -श), combined with Stanza-based linguistic analysis to classify names that are not already known.
  3. Similarity Scoring: Each mention pair is assigned a similarity score based on:
    • Gender match
    • Lexical overlap
    • Exact word match
    • Pronoun boosting
  4. Hyperedge Construction: All related mentions with high similarity scores are connected via hyperedges.
  5. Clustering: Related mention pairs with high scores (all of them, not only the single highest-scoring pair) are grouped together to form coreference clusters.
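
The scoring in step 3 can be sketched as a sum of weighted signals. This is a minimal illustration, not the package's actual implementation: the weights, the `PRONOUNS` set, the prefix-based definition of lexical overlap, and the function names are all assumptions chosen so that a gender match plus a pronoun boost adds up to 2.0.

```python
# Hypothetical weights and pronoun list; the real package may use different values.
GENDER_WEIGHT = 1.0
OVERLAP_WEIGHT = 0.5
EXACT_WEIGHT = 1.0
PRONOUN_BOOST = 1.0
PRONOUNS = {"तो", "ती", "तिने", "त्याने", "त्यांनी"}


def lexical_overlap(a: str, b: str) -> float:
    """Share of the longer word covered by a common prefix of 2+ characters (assumed definition)."""
    prefix_len = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        prefix_len += 1
    if prefix_len < 2:
        return 0.0
    return prefix_len / max(len(a), len(b))


def pair_score(a: str, gender_a: str, b: str, gender_b: str) -> float:
    """Combine the four signals listed in step 3 into one score."""
    same_gender = gender_a != "unknown" and gender_a == gender_b
    score = 0.0
    if same_gender:
        score += GENDER_WEIGHT
    score += OVERLAP_WEIGHT * lexical_overlap(a, b)
    if a == b:
        score += EXACT_WEIGHT
    # Boost a pronoun paired with a non-pronoun of the same gender.
    if same_gender and ((a in PRONOUNS) != (b in PRONOUNS)):
        score += PRONOUN_BOOST
    return score


example_score = pair_score("सारिकाने", "female", "तिने", "female")
```

With these assumed weights, `example_score` is the gender weight plus the pronoun boost, since the two words share no common prefix.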

This method allows resolving pronouns like "तो", "तिने", or "त्यांनी" back to the correct noun (e.g., "राम", "सारिका", "मित्रांनी") using both linguistic signals and graph-based relationships.

Example:

Step 1: Gender Detection

  • सारिकाने → female
  • तिने → female
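
A suffix-based gender guess like the one in Step 1 could look roughly like the sketch below. The suffix-to-gender mapping, the pronoun table, the list of case markers, and the function names are assumptions made for illustration; the source names the suffixes but not the gender each one signals, and the Stanza-based analysis is not reproduced here.

```python
# Assumed mapping from name suffix to gender (illustrative only).
# "ई" is the independent vowel; "ी" is its dependent (matra) form.
SUFFIX_GENDER = {"ई": "female", "ी": "female", "का": "female", "श": "male"}

# Assumed pronoun lookup table.
PRONOUN_GENDER = {"तो": "male", "त्याने": "male", "ती": "female", "तिने": "female"}

# Assumed case markers removed before suffix matching.
CASE_MARKERS = ("ने", "ला", "चा", "ची", "चे")


def strip_case_marker(word: str) -> str:
    """Remove one trailing case marker, if present."""
    for marker in CASE_MARKERS:
        if word.endswith(marker) and len(word) > len(marker):
            return word[: -len(marker)]
    return word


def guess_gender(word: str) -> str:
    """Return 'male', 'female', or 'unknown' for a mention."""
    if word in PRONOUN_GENDER:
        return PRONOUN_GENDER[word]
    stem = strip_case_marker(word)
    for suffix, gender in SUFFIX_GENDER.items():
        if stem.endswith(suffix):
            return gender
    return "unknown"


genders = {w: guess_gender(w) for w in ["सारिकाने", "तिने", "राजेश", "राम"]}
```

Words that match no rule come back as "unknown", which is where a fallback such as the Stanza-based analysis would take over.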

Step 2: Hyperedge Creation

  • {'सारिकाने', 'तिने'} → Score: 2.0 (gender + pronoun boost)
  • {'तिने', 'बनवले'} → Score: 0.6
  • ...
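
One way to hold the scored pairs from Step 2 is a dictionary keyed by `frozenset`, so that the order of the two mentions does not matter. The sketch below is an assumption about the data layout, not the package's API; `build_hyperedges`, `min_score`, and the toy lookup table are hypothetical, and the two scores are copied from the example above.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List


def build_hyperedges(
    mentions: List[str],
    score_fn: Callable[[str, str], float],
    min_score: float = 0.0,
) -> Dict[FrozenSet[str], float]:
    """Score every unordered pair of distinct mentions and keep those above min_score."""
    edges: Dict[FrozenSet[str], float] = {}
    for a, b in combinations(mentions, 2):
        score = score_fn(a, b)
        if score > min_score:
            edges[frozenset((a, b))] = score
    return edges


# Toy scores taken from the example; a real scorer would compute them.
toy_scores = {
    frozenset(("सारिकाने", "तिने")): 2.0,
    frozenset(("तिने", "बनवले")): 0.6,
}


def toy_score(a: str, b: str) -> float:
    return toy_scores.get(frozenset((a, b)), 0.0)


edges = build_hyperedges(["सारिकाने", "तिने", "बनवले"], toy_score)
```

This sketch only builds two-element edges, matching the pairs shown in the example; a hyperedge with more members could use the same `frozenset` key.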

Step 3: Clustering

  • Top cluster pair: तिने ↔ सारिकाने (Score: 2.0)
  • This pair resolves "तिने" to "सारिकाने".
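
Grouping all high-scoring pairs, rather than only the best one, can be sketched with a small union-find. This is a substitute technique chosen for illustration; the source does not say how the package merges pairs. The `threshold` value and the extra pair {'राम', 'तो'} are assumptions.

```python
from typing import Dict, FrozenSet, List, Set


def cluster_pairs(edges: Dict[FrozenSet[str], float], threshold: float = 1.0) -> List[Set[str]]:
    """Merge every pair scoring at or above threshold into connected clusters."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair, score in edges.items():
        if score >= threshold and len(pair) == 2:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for mention in list(parent):
        groups.setdefault(find(mention), set()).add(mention)
    return [members for members in groups.values() if len(members) > 1]


scored_pairs = {
    frozenset(("सारिकाने", "तिने")): 2.0,
    frozenset(("तिने", "बनवले")): 0.6,  # below threshold, so not merged
    frozenset(("राम", "तो")): 2.0,      # hypothetical second pair
}
clusters = cluster_pairs(scored_pairs)
```

Because low-scoring pairs are skipped before merging, the verb "बनवले" never joins a cluster even though it shares an edge with "तिने".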

Final output: सारिकाने जेवण बनवले, तिने चांगले जेवण बनवले. ("Sarika made a meal; she made a good meal.")

📢 Data Source & Acknowledgements

We explicitly acknowledge and thank the L3Cube-Pune team for providing the underlying raw text used in this annotation project.

The raw news articles were sourced from their open-source repository, which acts as a foundational resource for Marathi NLP tasks. Our work builds upon this by adding the layer of semantic coreference annotations.


📊 Dataset Statistics

The following statistics describe the scale and density of the annotated corpus:

Metric | Count
Total Processed Documents | 490
Total Sampled Sentences | 9,994
Unique Tokens (Vocabulary) | 5,053
Annotated Coreference Pairs | 12,963
Average Sentence Length | 29.83 words
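
A few density ratios can be derived from these counts. The short calculation below is only arithmetic on the reported figures; it assumes the annotated pairs are spread over the sampled sentences, and the total word count is an estimate from the average sentence length, not a reported number.

```python
documents = 490
sentences = 9_994
coreference_pairs = 12_963
average_sentence_length = 29.83  # words per sentence, as reported

sentences_per_document = sentences / documents
pairs_per_sentence = coreference_pairs / sentences
estimated_total_words = sentences * average_sentence_length

print(f"Sentences per document: {sentences_per_document:.1f}")
print(f"Coreference pairs per sentence: {pairs_per_sentence:.2f}")
print(f"Estimated total words: {estimated_total_words:,.0f}")
```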

🧪 Data Structure & Format

The dataset is provided in JSON (JavaScript Object Notation) format, optimized for Hypergraph-based approaches.

JSON Schema Fields

Each file in the dataset follows this structure:

  • document_id: Unique identifier for the document.
  • sentences: A list containing the raw text of the sentences.
  • mentions: A list of all identified entities (Nouns/Pronouns) with the following metadata:
    • id: Unique mention ID.
    • text: The surface word (e.g., "पंतप्रधान").
    • sentence_index: Index of the sentence containing the mention.
    • start_char, end_char: Character-level spans of the mention.
  • clusters: A list of coreference chains. Each chain is a list of mention_ids that refer to the same underlying entity.
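
To make the schema concrete, the sketch below builds one small document with these fields, round-trips it through JSON, and turns each cluster of mention IDs back into text. The ID values ("doc_001", "m1", "m2"), the helper names, and the choice of sentence-relative character offsets with an exclusive end are assumptions; the source does not specify them.

```python
import json

sentence = "सारिकाने जेवण बनवले, तिने चांगले जेवण बनवले."


def make_mention(mention_id: str, text: str, sentence_index: int) -> dict:
    """Build a mention record; offsets are computed from the first occurrence of the text."""
    start = sentence.index(text)
    return {
        "id": mention_id,
        "text": text,
        "sentence_index": sentence_index,
        "start_char": start,
        "end_char": start + len(text),  # exclusive end (assumed convention)
    }


document = {
    "document_id": "doc_001",  # hypothetical identifier
    "sentences": [sentence],
    "mentions": [make_mention("m1", "सारिकाने", 0), make_mention("m2", "तिने", 0)],
    "clusters": [["m1", "m2"]],
}

# Round-trip through JSON, as if the document had been read from a file.
loaded = json.loads(json.dumps(document, ensure_ascii=False))


def clusters_as_text(doc: dict) -> list:
    """Replace each mention ID in every cluster with its surface text."""
    by_id = {m["id"]: m for m in doc["mentions"]}
    return [[by_id[mid]["text"] for mid in chain] for chain in doc["clusters"]]


def spans_are_consistent(doc: dict) -> bool:
    """Check that each character span reproduces the mention text."""
    return all(
        doc["sentences"][m["sentence_index"]][m["start_char"]:m["end_char"]] == m["text"]
        for m in doc["mentions"]
    )


resolved = clusters_as_text(loaded)
```

A span check like `spans_are_consistent` is a cheap way to catch offset errors before training on the annotations.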

📂 Dataset Contents

The repository includes:

  • processed_documents/: Raw Marathi text documents (sourced from L3Cube-MahaCorpus).
  • annotated_documents/: Gold-standard coreference annotations in JSON+CoNLL format.
  • schema.md: Annotation guidelines and tag definitions.

🎯 Annotation Guidelines

Each document is manually annotated for:

  • Named Entities
  • Pronouns (explicit + pro-drop)
  • Nominal mentions
  • Hyperedges / clusters representing entity chains

Annotations follow:

  • Gender agreement rules
  • Number consistency
  • Semantic context checks
  • Cross-sentence reference tracking

A full description of the annotation scheme is provided in schema.md.


🔍 Use Cases

This dataset is suitable for:

  • Coreference resolution model training/testing
  • Hypergraph-based NLP research
  • Benchmarking for low-resource Indian languages
  • Linguistic analysis
  • Fine-tuning transformer models (e.g., IndicBERT, MahaBERT)

📜 License

This dataset is released under the CC BY-NC 4.0 license (non-commercial research use permitted).


🤝 Contributions

If you wish to add more annotations or help expand this corpus, feel free to open an issue or submit a pull request.


📧 Contact

For questions, collaboration, or academic use cases:

Mansi Jangle, Department of Computer Engineering, Pune Institute of Computer Technology


⭐ Citation

If you use this dataset in academic work, please cite:

Shinde, T., Jangle, M., Bagwan, M. "Coreference Resolution for Marathi Text Using Hypergraph Method." PICT, 2025.
