ParsiPy: NLP Toolkit for Historical Persian Texts in Python


Overview

ParsiPy is an NLP toolkit for analyzing historical Persian texts, including texts in languages such as Parsig (Pahlavi). It provides modules for lemmatization, POS tagging, tokenization, and phoneme-to-grapheme conversion, making it a useful resource for researchers working with low-resource languages. Beyond its practical applications, ParsiPy serves as a model for developing NLP tools tailored to linguistically rich yet underrepresented languages.

Installation

PyPI

Source code

Usage

To analyze texts in the Pahlavi language with ParsiPy's modules, you need to provide your text in phonetic form.
To simplify the process, we have developed a pipeline module that runs the requested tasks in a single call, as shown below.

Pipeline

In the following example, we use a passage from an ancient Parsig text containing advice for people at that time. Its rough English translation is: "Forget what is gone and do not worry about what has not yet come." [1]

You can easily apply tokenization, lemmatization, POS tagging, and phoneme-to-grapheme conversion to this text using the following code:

>>> from parsipy import pipeline, Task
>>> result = pipeline(sentence='ān uzīd frāmōš kun ud ān nē mad ēstēd rāy tēmār bēš ma bar',
...                   tasks=[Task.TOKENIZER, Task.LEMMA, Task.POS, Task.P2T])

The result is a dictionary containing the outputs of all requested tasks:

{
    "tokenizer": [
        {"id": 0, "text": "ān"},
        {"id": 1, "text": "uzīd"},
        {"id": 2, "text": "frāmōš"},
        {"id": 3, "text": "kun"},
        {"id": 4, "text": "ud"},
        {"id": 5, "text": "ān"},
        {"id": 6, "text": "nē"},
        {"id": 7, "text": "mad"},
        {"id": 8, "text": "ēstēd"},
        {"id": 9, "text": "rāy"},
        {"id": 10, "text": "tēmār"},
        {"id": 11, "text": "bēš"},
        {"id": 12, "text": "ma"},
        {"id": 13, "text": "bar"}
    ],
    "lemma": [
        {"stem": "ān", "text": "ān"},
        {"stem": "uzīd", "text": "uzīd"},
        {"stem": "frāmōš", "text": "frāmōš"},
        {"stem": "kun", "text": "kun"},
        {"stem": "ud", "text": "ud"},
        {"stem": "ān", "text": "ān"},
        {"stem": "nē", "text": "nē"},
        {"stem": "mad", "text": "mad"},
        {"stem": "ēst", "text": "ēstēd"},
        {"stem": "rāy", "text": "rāy"},
        {"stem": "tēmār", "text": "tēmār"},
        {"stem": "bēš", "text": "bēš"},
        {"stem": "ma", "text": "ma"},
        {"stem": "bar", "text": "bar"}
    ],

    "POS": [
        {"POS": "DET", "text": "ān"},
        {"POS": "N", "text": "uzīd"},
        {"POS": "N", "text": "frāmōš"},
        {"POS": "V", "text": "kun"},
        {"POS": "CONJ", "text": "ud"},
        {"POS": "DET", "text": "ān"},
        {"POS": "ADV", "text": "nē"},
        {"POS": "V", "text": "mad"},
        {"POS": "V", "text": "ēstēd"},
        {"POS": "POST", "text": "rāy"},
        {"POS": "N", "text": "tēmār"},
        {"POS": "N", "text": "bēš"},
        {"POS": "ADV", "text": "ma"},
        {"POS": "N", "text": "bar"}
    ],
    "P2T": [
        {"text": "ān", "transliteration": "ZK"},
        {"text": "uzīd", "transliteration": "ʾwcyt"},
        {"text": "frāmōš", "transliteration": "plʾmwš"},
        {"text": "kun", "transliteration": "OḆYDWNt͟y"},
        {"text": "ud", "transliteration": "W"},
        {"text": "ān", "transliteration": "ZK"},
        {"text": "nē", "transliteration": "LA"},
        {"text": "mad", "transliteration": "mt"},
        {"text": "ēstēd", "transliteration": "YKOYMWyt'"},
        {"text": "rāy", "transliteration": "lʾd"},
        {"text": "tēmār", "transliteration": "tymʾl"},
        {"text": "bēš", "transliteration": "byš"},
        {"text": "ma", "transliteration": "AL"},
        {"text": "bar", "transliteration": "YḆLWN"}
    ]
}
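
The task outputs are parallel lists in token order, so they can be merged into one record per token. The following is a minimal sketch, assuming the result has exactly the shape shown above; `merge_by_token` is a hypothetical helper and not part of ParsiPy, and the `result` literal holds only the first three tokens copied from the example output so the snippet runs on its own.

```python
def merge_by_token(result):
    """Combine the per-task lists into one dict per token (hypothetical helper)."""
    rows = []
    # The four lists are aligned by position, so zip pairs up entries per token.
    for tok, lem, pos, p2t in zip(
        result["tokenizer"], result["lemma"], result["POS"], result["P2T"]
    ):
        rows.append({
            "id": tok["id"],
            "text": tok["text"],
            "stem": lem["stem"],
            "POS": pos["POS"],
            "transliteration": p2t["transliteration"],
        })
    return rows


# First three tokens, copied from the example output above.
result = {
    "tokenizer": [
        {"id": 0, "text": "ān"},
        {"id": 1, "text": "uzīd"},
        {"id": 2, "text": "frāmōš"},
    ],
    "lemma": [
        {"stem": "ān", "text": "ān"},
        {"stem": "uzīd", "text": "uzīd"},
        {"stem": "frāmōš", "text": "frāmōš"},
    ],
    "POS": [
        {"POS": "DET", "text": "ān"},
        {"POS": "N", "text": "uzīd"},
        {"POS": "N", "text": "frāmōš"},
    ],
    "P2T": [
        {"text": "ān", "transliteration": "ZK"},
        {"text": "uzīd", "transliteration": "ʾwcyt"},
        {"text": "frāmōš", "transliteration": "plʾmwš"},
    ],
}

rows = merge_by_token(result)
for row in rows:
    print(row)
```

With a real `pipeline` call, the same helper only works when all four tasks were requested, since it reads all four keys.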

Below is a brief explanation of each task:

Tokenization

This module splits a sentence into individual tokens, making it easier to process each word separately. Tokenization is a crucial first step for many NLP tasks.
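
In the example output above, the tokens coincide with the whitespace-separated words of the input. The sketch below reproduces that output format with a plain whitespace split; it is an illustration only, `simple_tokenize` is a hypothetical name, and ParsiPy's own tokenizer may apply additional rules.

```python
def simple_tokenize(sentence):
    """Split on whitespace and number the tokens (illustrative, not ParsiPy's tokenizer)."""
    return [{"id": i, "text": tok} for i, tok in enumerate(sentence.split())]


tokens = simple_tokenize("ān uzīd frāmōš kun")
for token in tokens:
    print(token)
```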

Lemmatization

Lemmatization reduces words to their base or root forms by removing prefixes and suffixes. This standardizes different variants of the same word; in the example above, ēstēd is reduced to ēst.
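
The idea of affix removal can be sketched with a simple suffix stripper. This is a toy under stated assumptions: the suffix list contains a single entry chosen only to reproduce the ēstēd → ēst case from the example output, and `strip_suffix` is a hypothetical function, not ParsiPy's lemmatizer.

```python
import unicodedata

# Hypothetical suffix list for illustration; not ParsiPy's actual rules.
SUFFIXES = ("ēd",)


def strip_suffix(word, suffixes=SUFFIXES, min_stem_len=2):
    """Remove the longest matching suffix, keeping at least min_stem_len characters."""
    # Normalize so that precomposed and combining forms of ā, ē, ī compare equal.
    word = unicodedata.normalize("NFC", word)
    for suffix in sorted(suffixes, key=len, reverse=True):
        suffix = unicodedata.normalize("NFC", suffix)
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


stems = [{"stem": strip_suffix(w), "text": w} for w in ["ēstēd", "kun", "mad"]]
for entry in stems:
    print(entry)
```

The `min_stem_len` guard keeps short words from being reduced to nothing when they happen to end in a listed suffix.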

POS

This module assigns a part-of-speech (POS) tag to each word in a sentence based on its grammatical role. The output provides essential grammatical information for further text analysis.
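
As one example of further analysis, the tagged output can be summarized by tag or filtered for a given tag. The sketch below uses only the standard library and the `"POS"` list copied from the example output above; the variable names are illustrative.

```python
from collections import Counter

# The "POS" list copied from the example output above.
pos_output = [
    {"POS": "DET", "text": "ān"},
    {"POS": "N", "text": "uzīd"},
    {"POS": "N", "text": "frāmōš"},
    {"POS": "V", "text": "kun"},
    {"POS": "CONJ", "text": "ud"},
    {"POS": "DET", "text": "ān"},
    {"POS": "ADV", "text": "nē"},
    {"POS": "V", "text": "mad"},
    {"POS": "V", "text": "ēstēd"},
    {"POS": "POST", "text": "rāy"},
    {"POS": "N", "text": "tēmār"},
    {"POS": "N", "text": "bēš"},
    {"POS": "ADV", "text": "ma"},
    {"POS": "N", "text": "bar"},
]

# How often each tag occurs in the sentence.
tag_counts = Counter(item["POS"] for item in pos_output)

# All words tagged as verbs, in sentence order.
verbs = [item["text"] for item in pos_output if item["POS"] == "V"]

print(tag_counts.most_common())
print(verbs)
```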

P2T

Since there is no widely accepted Unicode representation for the original Pahlavi script, digital texts are often written in a phonetic form. This module maps each phonetic form to its transliteration, an intermediate representation between the phonetic form and the original characters. We also provide a tool for converting the transliteration into the original script.

To convert a transliteration to the Parsig font, you can use this exe file and font on Windows.
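
The phonetic-to-transliteration mapping can be pictured as a lookup from each phonetic word to its transliterated form. The sketch below is a simplification under that assumption: the table holds a few pairs copied from the example output, `lookup_transliteration` is a hypothetical function, and how the P2T module actually computes its output is not described here.

```python
# Phonetic -> transliteration pairs copied from the example output above.
P2T_TABLE = {
    "ān": "ZK",
    "ud": "W",
    "nē": "LA",
    "mad": "mt",
    "ma": "AL",
}


def lookup_transliteration(sentence, table=P2T_TABLE, unknown="?"):
    """Look up each whitespace-separated word; mark words missing from the table."""
    return [
        {"text": word, "transliteration": table.get(word, unknown)}
        for word in sentence.split()
    ]


pairs = lookup_transliteration("ān nē mad")
for pair in pairs:
    print(pair)
```

A fixed table like this cannot handle unseen words, which is why the sketch marks them with a placeholder instead of guessing.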

Issues & bug reports

Just file an issue and describe the problem. We'll check it ASAP! Alternatively, send an email to parsipy@openscilab.com.

  • Please complete the issue template

References

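1- Goshtasb, Farzaneh, and Hajipour. "Description and explanation of the nature of the justice of Khosrow Anushirvan in Persian texts and a search for its background in Middle Persian texts" [توصیف و تبیین ماهیت عدالت خسرو انوشیروان در متون فارسی و جستجوی پیشینه آن در متون فارسی میانه]. Cultural History Studies Quarterly (Journal of the Iranian Society of History) 14.53 (2022): 101-125. In Persian.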

Show your support

Star this repo

Give a ⭐️ if this project helped you!

Donate to our project

If you like our project, and we hope that you do, please consider supporting us. Our project is not and never will be run for profit. We need the money only so we can continue doing what we do ;-) .

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

Unreleased

0.1 - 2025-03-21

Added

  • word_stemmer module
  • tokenizer module
  • p2t module
  • pos_tagger module
  • POSTaggerRuleBased class
  • POSTagger class
