tnkeeh (تنقيح) is an Arabic preprocessing library for Python. It is built on the re module and provides quick replacement expressions for common text-cleaning tasks.

Installation

pip install tnkeeh

Features

  • Quick cleaning
  • Segmentation
  • Normalization
  • Data splitting

Examples

Data Cleaning

import tnkeeh as tn
tn.clean_data(file_path='data.txt', save_path='cleaned_data.txt')

Arguments

  • segment segments the text using Farasa.
  • remove_diacritics removes all diacritics.
  • remove_special_chars removes all special characters.
  • remove_english removes English letters and digits.
  • normalize unifies digits that are written the same way but have different encodings.
  • remove_tatweel removes the tatweel character ـ, which is used a lot in Arabic writing.
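
To make the cleaning options above more concrete, here is a minimal sketch of what diacritic removal, tatweel removal, and digit normalization can look like with the standard re module. It is not tnkeeh's actual implementation: the function names, the exact character ranges, and the choice to map Extended Arabic-Indic digits onto Arabic-Indic digits are assumptions made for illustration.

```python
import re

# Assumption: "diacritics" means the core tashkeel marks,
# fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile(r'[\u064B-\u0652]')

# The tatweel (kashida) elongation character, U+0640.
TATWEEL = re.compile(r'\u0640')

# Assumption: digit normalization maps Extended Arabic-Indic digits
# (U+06F0-U+06F9) onto Arabic-Indic digits (U+0660-U+0669).
DIGIT_MAP = {0x06F0 + i: 0x0660 + i for i in range(10)}


def strip_diacritics(text):
    """Remove tashkeel marks, leaving the base letters untouched."""
    return DIACRITICS.sub('', text)


def strip_tatweel(text):
    """Remove every tatweel character."""
    return TATWEEL.sub('', text)


def normalize_digits(text):
    """Map look-alike digits from one Unicode block onto the other."""
    return text.translate(DIGIT_MAP)


if __name__ == '__main__':
    sample = '\u0633\u0640\u0640\u0644\u0627\u0645'  # "salam" with two tatweels
    print(strip_tatweel(sample))
```

Each helper is a single pass over the text, so they can be chained in any order without affecting one another.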

Data Splitting

Splits raw data into training and testing sets according to split_ratio.

import tnkeeh as tn
tn.split_raw_data(data_path, split_ratio = 0.8)
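
As a rough illustration of what a ratio-based split does, the sketch below divides a list of lines so that a split_ratio fraction goes to training and the rest to testing. The helper name split_lines, the shuffling step, and the seed parameter are hypothetical; whether tnkeeh shuffles before splitting is not stated here.

```python
import random


def split_lines(lines, split_ratio=0.8, seed=0):
    """Return (train, test) lists; train holds split_ratio of the lines.

    Hypothetical helper, not part of tnkeeh.
    """
    shuffled = list(lines)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # assumption: shuffle before splitting
    cut = int(len(shuffled) * split_ratio)  # index of the first test example
    return shuffled[:cut], shuffled[cut:]


if __name__ == '__main__':
    train, test = split_lines([f'line {i}' for i in range(10)], split_ratio=0.8)
    print(len(train), len(test))
```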

Splits data and labels into training and testing sets according to split_ratio.

import tnkeeh as tn
tn.split_classification_data(data_path, lbls_path, split_ratio = 0.8)

Splits input and target data according to split_ratio. Commonly used for translation.

import tnkeeh as tn
tn.split_parallel_data('ar_data.txt', 'en_data.txt')
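
When splitting parallel data, each input line has to stay paired with its target line. The sketch below shows one way to guarantee that by shuffling and splitting the pairs together. The helper name split_parallel, the default ratio of 0.8, and the seed parameter are assumptions for illustration, not tnkeeh's API.

```python
import random


def split_parallel(src_lines, tgt_lines, split_ratio=0.8, seed=0):
    """Split aligned source/target lines into train and test pairs.

    Hypothetical helper, not part of tnkeeh.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError('source and target must have the same number of lines')
    pairs = list(zip(src_lines, tgt_lines))  # keep each pair together
    random.Random(seed).shuffle(pairs)       # assumption: shuffle before splitting
    cut = int(len(pairs) * split_ratio)
    return pairs[:cut], pairs[cut:]


if __name__ == '__main__':
    ar = [f'ar {i}' for i in range(5)]
    en = [f'en {i}' for i in range(5)]
    train_pairs, test_pairs = split_parallel(ar, en)
    print(len(train_pairs), len(test_pairs))
```

Zipping before shuffling is the key design choice: shuffling the two lists independently would break the alignment between sentences.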

Data Reading

Reads the split data. The mode argument selects whether the data is raw, labeled, or parallel.

import tnkeeh as tn
train_data, test_data = tn.read_data(mode=0)

Arguments

  • mode = 0 reads raw data.
  • mode = 1 reads labeled data.
  • mode = 2 reads parallel data.

Contribution

This is an open-source project, and we encourage contributions from the community.

License

MIT license.

Citation

@misc{tnkeeh2020,
  author = {Zaid Alyafeai and Maged Saeed},
  title = {tnkeeh: A Preprocessing Library for Arabic.},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ARBML/tnkeeh}}
}
