Skip to main content

No project description provided

Project description

tnkeeh (تنقيح) is an Arabic preprocessing library for python. It was designed using re for creating quick replacement expressions for several examples.

Installation

pip install tnkeeh

Features

  • Quick cleaning
  • Segmentation
  • Normalization
  • Data splitting

Examples

Data Cleaning

import tnkeeh as tn
tn.clean_data(file_path = 'data.txt', save_path = 'cleaned_data.txt',)

Arguments

  • segment uses farasa for segmentation.
  • remove_diacritics removes all diacritics.
  • remove_special_chars removes all sepcial chars.
  • remove_english removes english alphabets and digits.
  • normalize match digits that have the same writing but different encodings.
  • remove tatweel tatweel character ـ is used a lot in arabic writing.

Data Splitting

Splits raw data into training and testing using the split_ratio

import tnkeeh as tn
tn.split_raw_data(data_path, split_ratio = 0.8)

Splits data and labels into training and testing using the split_ratio

import tnkeeh as tn
tn.split_classification_data(data_path, lbls_path, split_ratio = 0.8)

Splits input and target data with ration split_ratio. Commonly used for translation

tn.split_parallel_data('ar_data.txt','en_data.txt')

Data Reading

Read split data, depending if it was raw or classification

import tnkeeh as tn
train_data, test_data = tn.read_data(mode = 0)

Arguments

  • mode = 0 read raw data.
  • mode = 1 read labeled data.
  • mode = 2 read parallel data.

Contribution

This is an open source project where we encourage contributions from the community.

License

MIT license.

Citation

@misc{tnkeeh2020,
  author = {Zaid Alyafeai and Maged Saeed},
  title = {tkseem: A Preprocessing Library for Arabic.},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ARBML/tnkeeh}}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tnkeeh-0.0.2.tar.gz (4.3 kB view details)

Uploaded Source

Built Distribution

tnkeeh-0.0.2-py3-none-any.whl (5.0 kB view details)

Uploaded Python 3

File details

Details for the file tnkeeh-0.0.2.tar.gz.

File metadata

  • Download URL: tnkeeh-0.0.2.tar.gz
  • Upload date:
  • Size: 4.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.2

File hashes

Hashes for tnkeeh-0.0.2.tar.gz
Algorithm Hash digest
SHA256 3b15ae9582d475580d6970566c5d32dc8470e849405b2d2559329650ba882224
MD5 a7685ab758f0447fe73e56373dcb861a
BLAKE2b-256 35bcd79133e6f86b5986c0084ea6c46599cf1ea7e82f7c0cde36f07338acf103

See more details on using hashes here.

File details

Details for the file tnkeeh-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: tnkeeh-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 5.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.2

File hashes

Hashes for tnkeeh-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 03aaf7279ea7b0121f0da500b8962cb07a703fa48f2f32bd489285d9664d91af
MD5 c8d144d462cce8e29fe5fac0a0b67fb0
BLAKE2b-256 f74cbc588193a98a5901b2b5dbd0387bf76565b877fe3b1956f91d805103d03e

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page