Project description
tnkeeh (تنقيح) is an Arabic preprocessing library for Python. It is built on the re module, which it uses to define quick replacement expressions for common text-cleaning tasks.
Installation
pip install tnkeeh
Features
- Quick cleaning
- Segmentation
- Normalization
- Data splitting
Examples
Data Cleaning
import tnkeeh as tn
tn.clean_data(file_path = 'data.txt', save_path = 'cleaned_data.txt')
Arguments
- segment uses farasa for segmentation.
- remove_diacritics removes all diacritics.
- remove_special_chars removes all special characters.
- remove_english removes English letters and digits.
- normalize matches digits that are written the same way but have different encodings.
- remove_tatweel removes the tatweel character ـ, which is used a lot in Arabic writing.
- remove_repeated_chars removes characters that appear three times in sequence.
- remove_html_elements removes HTML elements together with their attributes.
- remove_links removes links.
- remove_twitter_meta removes Twitter mentions, links and hashtags.
- remove_long_words removes words longer than 15 characters.
- by_chunk reads files in chunks of size chunk_size.
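The cleaning options above are described as regular-expression replacements. The following is a minimal sketch of how a few of them could be written with re; it is not tnkeeh's actual implementation. The function names mirror the option names for readability only, and the diacritic range (U+064B to U+0652), the link pattern, the digit mapping, and the choice to collapse a repeated run to a single character are all assumptions.

```python
import re

# Assumption: diacritics (tashkeel) are the code points U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# The tatweel (kashida) character, U+0640.
TATWEEL = re.compile(r"\u0640")
# Any single character repeated three or more times in a row.
REPEATED = re.compile(r"(.)\1{2,}")
# Assumption: a link starts with http:// or https:// and runs to whitespace.
LINKS = re.compile(r"https?://\S+")
# Assumption: map Extended Arabic-Indic digits (U+06F0..U+06F9) onto the
# visually similar Arabic-Indic digits (U+0660..U+0669).
DIGIT_MAP = {0x06F0 + i: 0x0660 + i for i in range(10)}


def remove_diacritics(text):
    """Delete every diacritic mark in the assumed range."""
    return DIACRITICS.sub("", text)


def remove_tatweel(text):
    """Delete every tatweel character."""
    return TATWEEL.sub("", text)


def remove_repeated_chars(text):
    """Collapse a run of three or more identical characters to one."""
    return REPEATED.sub(r"\1", text)


def remove_links(text):
    """Delete http(s) links, leaving the surrounding text untouched."""
    return LINKS.sub("", text)


def normalize_digits(text):
    """Rewrite look-alike digits to a single encoding."""
    return text.translate(DIGIT_MAP)


if __name__ == "__main__":
    sample = "heeeello https://example.com"
    print(remove_links(remove_repeated_chars(sample)))
```

Each step is a single substitution, so the steps can be chained in any order that suits the data.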
HuggingFace datasets
import tnkeeh as tn
from datasets import load_dataset
dataset = load_dataset('metrec')
cleaner = tn.Tnkeeh(remove_diacritics = True)
cleaned_dataset = cleaner.clean_hf_dataset(dataset, 'text')
Data Splitting
Splits raw data into training and testing sets according to split_ratio:
import tnkeeh as tn
tn.split_raw_data(data_path, split_ratio = 0.8)
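As an illustration of what a ratio-based split does, here is a minimal sketch in plain Python. It is not the library's code: the helper name split_lines, the shuffling step, and the seed parameter are assumptions made for the example.

```python
import random


def split_lines(lines, split_ratio=0.8, seed=0):
    """Shuffle a copy of `lines` and split it into (train, test).

    The first `split_ratio` share of the shuffled lines goes to train,
    the rest to test. Shuffling with a fixed seed is an assumption of
    this sketch; it keeps the split reproducible.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = split_lines([f"line {i}" for i in range(10)], split_ratio=0.8)
    print(len(train), len(test))
```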
Splits data and labels into training and testing sets according to split_ratio:
import tnkeeh as tn
tn.split_classification_data(data_path, lbls_path, split_ratio = 0.8)
Splits input and target data according to split_ratio. Commonly used for translation.
import tnkeeh as tn
tn.split_parallel_data('ar_data.txt','en_data.txt')
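When two files are split together, each input line has to stay paired with its target line. The sketch below shows one way to guarantee that by shuffling the pairs rather than the two lists separately. It is an assumption-laden illustration, not tnkeeh's implementation: split_parallel, its seed parameter, and the length check are hypothetical. The same idea applies to a data file and its labels.

```python
import random


def split_parallel(src_lines, tgt_lines, split_ratio=0.8, seed=0):
    """Split aligned source/target lines into (train_pairs, test_pairs)."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    # Zip first so that shuffling moves each pair as a unit.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * split_ratio)
    return pairs[:cut], pairs[cut:]


if __name__ == "__main__":
    ar = [f"ar {i}" for i in range(5)]
    en = [f"en {i}" for i in range(5)]
    train, test = split_parallel(ar, en)
    print(len(train), len(test))
```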
Data Reading
Reads previously split data. The mode argument selects whether raw, labeled, or parallel data is read.
import tnkeeh as tn
train_data, test_data = tn.read_data(mode = 0)
Arguments
- mode = 0 reads raw data.
- mode = 1 reads labeled data.
- mode = 2 reads parallel data.
Contribution
This is an open-source project, and we encourage contributions from the community.
License
MIT license.
Citation
@misc{tnkeeh2020,
author = {Zaid Alyafeai and Maged Saeed},
title = {tnkeeh: A Preprocessing Library for Arabic.},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ARBML/tnkeeh}}
}