tnkeeh (تنقيح) is an Arabic preprocessing library for Python. It was designed using the re module
to create quick replacement expressions for several common cleaning cases.
Features
- Quick cleaning
- Segmentation
- Normalization
- Data splitting
Examples
Data Cleaning
import tnkeeh as tn
tn.clean_data(file_path='data.txt', save_path='cleaned_data.txt')
Arguments
- segment: uses Farasa for segmentation.
- remove_diacritics: removes all diacritics.
- remove_special_chars: removes all special characters.
- remove_english: removes English letters and digits.
- normalize: matches digits that look the same but have different encodings.
- remove_tatweel: removes the tatweel character ـ, which is used a lot in Arabic writing.
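To make the arguments above more concrete, here is a minimal sketch of the kind of re-based replacements they describe. It is not tnkeeh's actual implementation. The function name sketch_clean, the character ranges, the digit mapping, and the treatment of each argument as a boolean flag are all assumptions made for illustration.

```python
import re

# Assumed ranges, chosen for illustration only (not taken from tnkeeh).
DIACRITICS = re.compile(r"[\u064B-\u0652]")   # fathatan through sukun
TATWEEL = re.compile(r"\u0640")               # the elongation character
ENGLISH = re.compile(r"[A-Za-z0-9]+")         # English letters and ASCII digits
# Assumed normalization: Extended Arabic-Indic digits (U+06F0..U+06F9)
# are mapped to the look-alike Arabic-Indic digits (U+0660..U+0669).
DIGIT_TABLE = {0x06F0 + i: chr(0x0660 + i) for i in range(10)}


def sketch_clean(text, remove_diacritics=False, remove_english=False,
                 normalize=False, remove_tatweel=False):
    """Hypothetical helper that mirrors the argument names listed above."""
    if remove_diacritics:
        text = DIACRITICS.sub("", text)
    if remove_tatweel:
        text = TATWEEL.sub("", text)
    if remove_english:
        text = ENGLISH.sub("", text)
    if normalize:
        text = text.translate(DIGIT_TABLE)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(sketch_clean("مَرْحَبـــاً hello 123", remove_diacritics=True,
                       remove_tatweel=True, remove_english=True))
```

Each flag maps to one compiled pattern, so a new cleaning step in this sketch would only need another pattern and another keyword argument.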
Data Splitting
Splits raw data into training and testing using the split_ratio
import tnkeeh as tn
tn.split_raw_data(data_path, split_ratio = 0.8)
Splits data and labels into training and testing using the split_ratio
import tnkeeh as tn
tn.split_classification_data(data_path, lbls_path, split_ratio = 0.8)
Splits input and target data using the split_ratio. Commonly used for translation.
import tnkeeh as tn
tn.split_parallel_data('ar_data.txt', 'en_data.txt')
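The idea behind a ratio-based parallel split can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: the function name sketch_split_parallel, the seed parameter, and the shuffling step are hypothetical, and tnkeeh may or may not shuffle before splitting.

```python
import random


def sketch_split_parallel(src_lines, tgt_lines, split_ratio=0.8, seed=0):
    """Hypothetical helper: split aligned source/target lines into train and test."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    # Pair the lines first so each source line stays with its target line.
    pairs = list(zip(src_lines, tgt_lines))
    # Shuffling is an assumption; a fixed seed keeps the split reproducible.
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * split_ratio)
    return pairs[:cut], pairs[cut:]


if __name__ == "__main__":
    ar = [f"ar{i}" for i in range(10)]
    en = [f"en{i}" for i in range(10)]
    train, test = sketch_split_parallel(ar, en, split_ratio=0.8)
    print(len(train), len(test))
```

Pairing before shuffling is what keeps inputs and targets aligned. The same pattern applies to data and labels in a classification split.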
Data Reading
Reads the split data. The mode selects whether it was raw, labeled, or parallel.
import tnkeeh as tn
train_data, test_data = tn.read_data(mode = 0)
Arguments
- mode = 0: read raw data.
- mode = 1: read labeled data.
- mode = 2: read parallel data.
Contribution
This is an open-source project, and we encourage contributions from the community.
License
MIT license.
Citation
@misc{tnkeeh2020,
author = {Zaid Alyafeai and Maged Saeed},
title = {tnkeeh: A Preprocessing Library for Arabic.},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ARBML/tnkeeh}}
}