Marathi Coreference Resolution using Hypergraphs
This project performs coreference resolution in Marathi using a hypergraph-based approach.
It works in the following steps:
- Mention Detection: All possible noun/pronoun mentions are extracted from each sentence.
- 🧬 Gender Detection with Suffix Rules: Gender is predicted using common Marathi suffix patterns (like -ई, -का, -श) and Stanza-based linguistic analysis for accurate classification of unknown names.
- Similarity Scoring: Each mention pair is assigned a similarity score based on:
  - Gender match
  - Lexical overlap
  - Exact word match
  - Pronoun boosting
- Hyperedge Construction: All related mentions with high similarity scores are connected via hyperedges.
- Clustering: All related pairs with high scores (not just the single highest-scoring pair) are grouped together to form coreference clusters.
This method allows resolving pronouns like "तो", "तिने", or "त्यांनी" back to the correct noun (e.g., "राम", "सारिका", "मित्रांनी") using both linguistic signals and graph-based relationships.
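The gender detection and similarity scoring steps can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the suffix-to-gender mapping, the case-marker list, the pronoun table, and all weights are assumptions, and the Stanza-based analysis is left out. The weights were picked only so that the pair सारिकाने / तिने scores 2.0 (gender match + pronoun boost), as in the example in this README; other scores from that example are not reproduced.

```python
# Assumed mapping from the suffixes named in this README to a gender label.
SUFFIX_GENDER = {"ई": "female", "का": "female", "श": "male"}
# Assumed pronoun lookup (a small hypothetical subset).
PRONOUN_GENDER = {"तो": "male", "त्याने": "male", "ती": "female", "तिने": "female"}
# Hypothetical list of case markers stripped before suffix matching.
CASE_MARKERS = ("ने", "ला", "चा", "ची", "चे")


def strip_case(word):
    """Remove one trailing case marker, if present."""
    for marker in CASE_MARKERS:
        if word.endswith(marker) and len(word) > len(marker):
            return word[: -len(marker)]
    return word


def guess_gender(word):
    """Return 'male', 'female', or 'unknown' from pronoun lookup or suffix rules."""
    if word in PRONOUN_GENDER:
        return PRONOUN_GENDER[word]
    stem = strip_case(word)
    for suffix, gender in SUFFIX_GENDER.items():
        if stem.endswith(suffix):
            return gender
    return "unknown"


def similarity(a, b):
    """Sum assumed feature weights for a mention pair."""
    score = 0.0
    gender_a, gender_b = guess_gender(a), guess_gender(b)
    gender_match = gender_a != "unknown" and gender_a == gender_b
    if gender_match:
        score += 1.0  # gender match
    if strip_case(a) == strip_case(b):
        score += 0.5  # lexical overlap: same stem after case stripping
    if a == b:
        score += 1.0  # exact word match
    # Pronoun boost: exactly one side is a pronoun and genders agree.
    if gender_match and (a in PRONOUN_GENDER) != (b in PRONOUN_GENDER):
        score += 1.0
    return score


print(guess_gender("सारिकाने"), guess_gender("तिने"))
print(similarity("सारिकाने", "तिने"))
```

A real implementation would likely need a richer suffix inventory and a morphological analyzer rather than plain string stripping.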
Example:
Step 1: Gender Detection
- सारिकाने → female
- तिने → female
Step 2: Hyperedge Creation
- {'सारिकाने', 'तिने'} → Score: 2.0 (gender + pronoun boost)
- {'तिने', 'बनवले'} → Score: 0.6
- ...
Step 3: Clustering
- Top cluster pair: तिने ↔ सारिकाने (Score: 2.0)
- This pair is used to resolve "तिने" to "सारिकाने".
Final output: सारिकाने जेवण बनवले, तिने चांगले जेवण बनवले.
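One way to turn scored pairs like these into clusters is to keep every pair above a cutoff and merge the pairs that share a mention. The sketch below does this with a small union-find. The threshold of 1.0 and the function name cluster_mentions are assumptions made for illustration; the package may use a different cutoff or grouping rule.

```python
def cluster_mentions(pair_scores, threshold=1.0):
    """Group mentions linked by pairs scoring at or above the threshold.

    pair_scores maps (mention_a, mention_b) tuples to a similarity score.
    The threshold value is an assumed cutoff, not taken from the package.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            parent[root_a] = root_b  # merge the two groups

    groups = {}
    for mention in list(parent):
        groups.setdefault(find(mention), set()).add(mention)
    # Each remaining set acts as one hyperedge / coreference cluster.
    return [group for group in groups.values() if len(group) > 1]


scores = {
    ("सारिकाने", "तिने"): 2.0,
    ("तिने", "बनवले"): 0.6,
}
print(cluster_mentions(scores))
```

Because every qualifying pair is merged, a pronoun can join a cluster through any sufficiently strong link, not only its single best match.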
📢 Data Source & Acknowledgements
We explicitly acknowledge and thank the L3Cube-Pune team for providing the underlying raw text used in this annotation project.
- Source Corpus: L3Cube-MahaCorpus (news)
- Repository: L3Cube-Pune MarathiNLP
The raw news articles were sourced from their open-source repository, which acts as a foundational resource for Marathi NLP tasks. Our work builds upon this by adding the layer of semantic coreference annotations.
📊 Dataset Statistics
The following statistics describe the scale and density of the annotated corpus:
| Metric | Count |
|---|---|
| Total Processed Documents | 490 |
| Total Sampled Sentences | 9,994 |
| Unique Tokens (Vocabulary) | 5,053 |
| Annotated Coreference Pairs | 12,963 |
| Average Sentence Length | 29.83 words |
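The density of the corpus can be derived from the counts in the table. The snippet below is simple arithmetic on those published figures; the approximate total word count assumes the average sentence length applies to all sampled sentences.

```python
documents = 490
sentences = 9_994
coref_pairs = 12_963
avg_sentence_len = 29.83  # words per sentence, from the table

pairs_per_sentence = coref_pairs / sentences
sentences_per_document = sentences / documents
pairs_per_document = coref_pairs / documents
# Assumption: the average length holds across all sampled sentences.
approx_total_words = avg_sentence_len * sentences

print(f"pairs per sentence:     {pairs_per_sentence:.2f}")
print(f"sentences per document: {sentences_per_document:.2f}")
print(f"pairs per document:     {pairs_per_document:.2f}")
print(f"approx. total words:    {approx_total_words:,.0f}")
```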
🧪 Data Structure & Format
The dataset is provided in JSON (JavaScript Object Notation) format, optimized for Hypergraph-based approaches.
JSON Schema Fields
Each file in the dataset follows this structure:
- document_id: Unique identifier for the document.
- sentences: A list containing the raw text of the sentences.
- mentions: A list of all identified entities (nouns/pronouns) with the following metadata:
  - id: Unique mention ID.
  - text: The surface word (e.g., "पंतप्रधान").
  - sentence_index: Index of the sentence containing the mention.
  - start_char, end_char: Character-level spans of the mention.
- clusters: A list of coreference chains. Each chain is a list of mention_ids that refer to the same underlying entity.
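As a rough illustration of this schema, the sketch below builds one hypothetical document and maps each cluster back to its surface text. The document ID, the mention IDs ("m1", "m2"), and the helper names are invented for the example. It also assumes that start_char and end_char are offsets into the mention's own sentence and that end_char is exclusive; check schema.md for the actual convention.

```python
import json

sentence = "सारिकाने जेवण बनवले, तिने चांगले जेवण बनवले."


def make_mention(mention_id, text, sentence_index, sentence_text):
    """Build a mention record; offsets are computed, not hand-written."""
    start = sentence_text.index(text)  # first occurrence in the sentence
    return {
        "id": mention_id,
        "text": text,
        "sentence_index": sentence_index,
        "start_char": start,
        "end_char": start + len(text),  # assumed end-exclusive
    }


# Hypothetical document following the field names listed above.
doc = {
    "document_id": "example_doc",
    "sentences": [sentence],
    "mentions": [
        make_mention("m1", "सारिकाने", 0, sentence),
        make_mention("m2", "तिने", 0, sentence),
    ],
    "clusters": [["m1", "m2"]],
}


def cluster_texts(document):
    """Return each coreference chain as a list of surface strings."""
    by_id = {m["id"]: m for m in document["mentions"]}
    chains = []
    for chain in document["clusters"]:
        texts = []
        for mention_id in chain:
            m = by_id[mention_id]
            source = document["sentences"][m["sentence_index"]]
            texts.append(source[m["start_char"]:m["end_char"]])
        chains.append(texts)
    return chains


# Round-trip through JSON to mimic reading a file from the dataset.
loaded = json.loads(json.dumps(doc, ensure_ascii=False))
print(cluster_texts(loaded))
```

Slicing by span rather than reading the text field directly doubles as a check that the offsets and the stored surface form agree.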
📂 Dataset Contents
The repository includes:
- processed_documents/: Raw Marathi text documents (sourced from L3Cube-MahaCorpus).
- annotated_documents/: Gold-standard coreference annotations in JSON+CoNLL format.
- schema.md: Annotation guidelines and tag definitions.
🎯 Annotation Guidelines
Each document is manually annotated for:
- Named Entities
- Pronouns (explicit + pro-drop)
- Nominal mentions
- Hyperedges / clusters representing entity chains
Annotations follow:
- Gender agreement rules
- Number consistency
- Semantic context checks
- Cross-sentence reference tracking
A full description of the annotation scheme is provided in schema.md.
🔍 Use Cases
This dataset is suitable for:
- Coreference resolution model training/testing
- Hypergraph-based NLP research
- Benchmarking for low-resource Indian languages
- Linguistic analysis
- Fine-tuning transformer models (e.g., IndicBERT, MahaBERT)
📜 License
This dataset is released under the CC BY-NC 4.0 License (non-commercial research use permitted).
🤝 Contributions
If you wish to add more annotations or help expand this corpus, feel free to open an issue or submit a pull request.
📧 Contact
For questions, collaboration, or academic use cases:
Mansi Jangle, Department of Computer Engineering, Pune Institute of Computer Technology
⭐ Citation
If you use this dataset in academic work, please cite:
Shinde, T., Jangle, M., Bagwan, M. "Coreference Resolution for Marathi Text Using Hypergraph Method" PICT, 2025.