tnkeeh (تنقيح) is an Arabic preprocessing library for Python. It is built on Python's re module, which it uses to define quick replacement expressions for common cleaning tasks.

Features

  • Quick cleaning
  • Segmentation
  • Normalization
  • Data splitting

Examples

Data Cleaning

import tnkeeh as tn
tn.clean_data(file_path = 'data.txt', save_path = 'cleaned_data.txt')

Arguments

  • segment uses Farasa for segmentation.
  • remove_diacritics removes all diacritics.
  • remove_special_chars removes all special characters.
  • remove_english removes English letters and digits.
  • normalize unifies digits that are written the same way but have different encodings.
  • remove_tatweel removes the tatweel character ـ, which is used frequently in Arabic writing.
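
To make the options above concrete, here is a minimal sketch of how such replacements can be written with re. It is not tnkeeh's actual implementation: the function name sketch_clean, the chosen Unicode ranges, and the reading of normalize as "map Extended Arabic-Indic digits to Arabic-Indic digits" are all assumptions made for illustration.

```python
import re

# Assumption: "diacritics" means the tashkeel marks U+064B (fathatan) to U+0652 (sukun).
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# The tatweel (kashida) elongation character.
TATWEEL = re.compile(r"\u0640")
# Basic Latin letters and ASCII digits.
ENGLISH = re.compile(r"[A-Za-z0-9]")
# Assumption: normalization maps Extended Arabic-Indic digits (U+06F0-U+06F9)
# onto the Arabic-Indic digits (U+0660-U+0669).
EXTENDED_TO_ARABIC_DIGITS = {0x06F0 + i: 0x0660 + i for i in range(10)}


def sketch_clean(text, remove_diacritics=True, remove_tatweel=True,
                 remove_english=True, normalize=True):
    """Hypothetical helper that mimics a few of the cleaning options."""
    if remove_diacritics:
        text = DIACRITICS.sub("", text)
    if remove_tatweel:
        text = TATWEEL.sub("", text)
    if remove_english:
        text = ENGLISH.sub("", text)
    if normalize:
        text = text.translate(EXTENDED_TO_ARABIC_DIGITS)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

Each option is an independent substitution, so the flags can be switched on or off without affecting the others.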

Data Splitting

Splits raw data into training and testing sets according to split_ratio.

import tnkeeh as tn
tn.split_raw_data(data_path, split_ratio = 0.8)
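
As an illustration of what a ratio-based split does, the following sketch cuts a list of lines at split_ratio. The helper name sketch_split is hypothetical, and keeping the lines in their original order is an assumption; whether tnkeeh shuffles before splitting is not stated here.

```python
def sketch_split(lines, split_ratio=0.8):
    """Hypothetical helper: the first part is training data, the rest is testing data."""
    if not 0.0 <= split_ratio <= 1.0:
        raise ValueError("split_ratio must be between 0 and 1")
    # Index of the first test line; int() truncates toward zero.
    cut = int(len(lines) * split_ratio)
    return lines[:cut], lines[cut:]
```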

Splits data and labels into training and testing sets according to split_ratio.

import tnkeeh as tn
tn.split_classification_data(data_path, lbls_path, split_ratio = 0.8)

Splits input and target data with ratio split_ratio. Commonly used for translation.

import tnkeeh as tn
tn.split_parallel_data('ar_data.txt','en_data.txt')
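
Both the classification split and the parallel split work on two files that must stay aligned line by line. The sketch below shows one way to keep the pairs together while splitting. The helper name sketch_split_aligned, the seed parameter, and the shuffling step are assumptions for illustration and do not describe tnkeeh's internals.

```python
import random


def sketch_split_aligned(inputs, targets, split_ratio=0.8, seed=0):
    """Hypothetical helper: split two aligned lists without breaking the pairing."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same number of lines")
    # Pair each input with its target so shuffling cannot separate them.
    pairs = list(zip(inputs, targets))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * split_ratio)
    train, test = pairs[:cut], pairs[cut:]
    # Unzip the pairs back into separate input and target lists.
    train_x = [x for x, _ in train]
    train_y = [y for _, y in train]
    test_x = [x for x, _ in test]
    test_y = [y for _, y in test]
    return (train_x, train_y), (test_x, test_y)
```

Shuffling the zipped pairs, rather than each list separately, is what guarantees that a sentence and its label or translation end up in the same split.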

Data Reading

Reads previously split data. The mode argument selects whether the data was raw, labeled, or parallel.

import tnkeeh as tn
train_data, test_data = tn.read_data(mode = 0)

Arguments

  • mode = 0 reads raw data.
  • mode = 1 reads labeled data.
  • mode = 2 reads parallel data.

Contribution

This is an open source project where we encourage contributions from the community.

License

MIT license.

Citation

@misc{tnkeeh2020,
  author = {Zaid Alyafeai and Maged Saeed},
  title = {tnkeeh: A Preprocessing Library for Arabic.},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ARBML/tnkeeh}}
}
