tnkeeh (تنقيح) is an Arabic preprocessing library for Python. It is built on the re module and provides quick replacement expressions for common text-cleaning cases.

Installation

pip install tnkeeh

Features

  • Quick cleaning
  • Segmentation
  • Normalization
  • Data splitting

Examples

Data Cleaning

import tnkeeh as tn
tn.clean_data(file_path='data.txt', save_path='cleaned_data.txt')

Arguments

  • segment uses farasa for segmentation.
  • remove_diacritics removes all diacritics.
  • remove_special_chars removes all special characters.
  • remove_english removes English letters and digits.
  • normalize unifies digits that are written the same way but have different encodings.
  • remove_tatweel removes the tatweel character ـ, which is used a lot in Arabic writing.
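
To make the options above concrete, here is a minimal sketch of how such cleaning steps can be written with re. It is not tnkeeh's implementation: the function name sketch_clean, the exact character ranges, the choice to map Extended Arabic-Indic digits onto Arabic-Indic digits, and the final whitespace cleanup are all assumptions made for illustration.

```python
import re

# Assumed character classes; tnkeeh's own patterns may differ.
DIACRITICS = re.compile(r"[\u064B-\u0652]")   # fathatan through sukun
TATWEEL = re.compile(r"\u0640")               # the elongation character
ENGLISH = re.compile(r"[A-Za-z0-9]+")         # ASCII letters and digits

# Extended Arabic-Indic digits (U+06F0..U+06F9) look like the
# Arabic-Indic digits (U+0660..U+0669) but are encoded differently.
DIGIT_MAP = {0x06F0 + i: chr(0x0660 + i) for i in range(10)}


def sketch_clean(text, remove_diacritics=True, remove_tatweel=True,
                 remove_english=False, normalize=True):
    """Hypothetical helper that mirrors a few of the cleaning options."""
    if remove_diacritics:
        text = DIACRITICS.sub("", text)
    if remove_tatweel:
        text = TATWEEL.sub("", text)
    if remove_english:
        text = ENGLISH.sub("", text)
    if normalize:
        text = text.translate(DIGIT_MAP)
    # Collapse whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

Each option is a single substitution, so the steps can be switched on and off independently, much like the keyword arguments listed above.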

Data Splitting

Splits raw data into training and testing sets according to split_ratio

import tnkeeh as tn
tn.split_raw_data(data_path, split_ratio = 0.8)
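
As an illustration of what a ratio-based split means, the sketch below divides a list of lines so that a split_ratio fraction goes to training and the rest to testing. The helper sketch_split_lines, the seeded shuffle, and the rounding rule are assumptions; whether tnkeeh shuffles or how it rounds is not stated here.

```python
import random


def sketch_split_lines(lines, split_ratio=0.8, seed=0):
    """Hypothetical helper: return (train, test) lists of lines."""
    lines = list(lines)
    # Shuffle with a fixed seed so the split is reproducible (an assumption).
    random.Random(seed).shuffle(lines)
    cut = int(len(lines) * split_ratio)
    return lines[:cut], lines[cut:]
```

With ten lines and split_ratio = 0.8, eight lines end up in the training list and two in the testing list.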

Splits data and labels into training and testing sets according to split_ratio

import tnkeeh as tn
tn.split_classification_data(data_path, lbls_path, split_ratio = 0.8)

Splits input and target data according to split_ratio. Commonly used for translation.

tn.split_parallel_data('ar_data.txt','en_data.txt')
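
For paired data (text with labels, or source with target sentences) the important property is that each pair stays together after the split. The sketch below shows one way to do that; sketch_split_pairs and its seeded shuffle are assumptions for illustration and not tnkeeh's code.

```python
import random


def sketch_split_pairs(inputs, targets, split_ratio=0.8, seed=0):
    """Hypothetical helper: split aligned sequences without breaking pairs."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    # Zip first, then shuffle, so each input keeps its own target.
    pairs = list(zip(inputs, targets))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * split_ratio)
    return pairs[:cut], pairs[cut:]
```

Shuffling the zipped pairs rather than the two lists separately is what keeps line i of the input aligned with line i of the target.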

Data Reading

Reads the split data. The mode depends on whether the data was split as raw, classification, or parallel data.

import tnkeeh as tn
train_data, test_data = tn.read_data(mode = 0)

Arguments

  • mode = 0 reads raw data.
  • mode = 1 reads labeled data.
  • mode = 2 reads parallel data.

Contribution

This is an open source project where we encourage contributions from the community.

License

MIT license.

Citation

@misc{tnkeeh2020,
  author = {Zaid Alyafeai and Maged Saeed},
  title = {tnkeeh: A Preprocessing Library for Arabic.},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ARBML/tnkeeh}}
}

