Project description
tnkeeh (تنقيح) is an Arabic preprocessing library for Python. It is built on the re module, which it uses to define quick replacement expressions for common cleaning tasks.
Installation
pip install tnkeeh
Features
- Quick cleaning
- Segmentation
- Normalization
- Data splitting
Examples
Data Cleaning
import tnkeeh as tn
tn.clean_data(file_path = 'data.txt', save_path = 'cleaned_data.txt')
Arguments
- segment: uses Farasa for segmentation.
- remove_diacritics: removes all diacritics.
- remove_special_chars: removes all special characters.
- remove_english: removes English letters and digits.
- normalize: matches digits that look the same but have different encodings.
- remove_tatweel: removes the tatweel character (ـ), which is used a lot in Arabic writing.
- remove_repeated_chars: removes characters that appear three times in sequence.
- remove_html_elements: removes HTML elements together with their attributes.
- remove_links: removes links.
- remove_twitter_meta: removes Twitter mentions, links and hashtags.
- remove_long_words: removes words longer than 15 characters.
- by_chunk: reads files in chunks of size chunk_size.
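To make a few of the options above concrete, here is a minimal sketch of regex-based cleaning written with the standard re module only. It is not tnkeeh's actual implementation: the helper name clean_text, the Unicode ranges, and the choice to collapse repeated characters to a single occurrence are all assumptions made for illustration.

```python
import re

# Assumed patterns, not taken from tnkeeh's source code.
# U+064B..U+0652 covers the common Arabic diacritics (tashkeel).
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# U+0640 is the tatweel (kashida) character.
TATWEEL = re.compile(r"\u0640")
# Any character followed by two or more copies of itself.
REPEATED = re.compile(r"(.)\1{2,}")
# A simple http/https link pattern.
LINKS = re.compile(r"https?://\S+")


def clean_text(text, max_word_len=15):
    """Hypothetical helper that mimics a few of the options listed above."""
    text = LINKS.sub(" ", text)        # similar in spirit to remove_links
    text = DIACRITICS.sub("", text)    # similar in spirit to remove_diacritics
    text = TATWEEL.sub("", text)       # similar in spirit to remove_tatweel
    text = REPEATED.sub(r"\1", text)   # assumption: collapse runs of 3+ to one
    # Similar in spirit to remove_long_words: drop words above the limit.
    words = [w for w in text.split() if len(w) <= max_word_len]
    return " ".join(words)
```

The order of the substitutions matters in this sketch: tatweel is removed before repeated characters are collapsed, so a stretched word loses its tatweel run in one step.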
HuggingFace datasets
import tnkeeh as tn
from datasets import load_dataset
dataset = load_dataset('metrec')
cleaner = tn.Tnkeeh(remove_diacritics = True)
cleaned_dataset = cleaner.clean_hf_dataset(dataset, 'text')
Data Splitting
Splits raw data into training and testing sets using split_ratio.
import tnkeeh as tn
tn.split_raw_data(data_path, split_ratio = 0.8)
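As an illustration of what a ratio-based split means, the sketch below divides a list of lines so that a split_ratio fraction goes to training and the rest to testing. The helper split_lines, the seed parameter, and the shuffling step are assumptions for this example; whether tnkeeh shuffles before splitting is not stated here.

```python
import random


def split_lines(lines, split_ratio=0.8, seed=0):
    """Hypothetical helper: shuffle the lines, then cut them at split_ratio."""
    lines = list(lines)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(lines)   # assumption: shuffle with a fixed seed
    cut = int(len(lines) * split_ratio)  # number of training lines
    return lines[:cut], lines[cut:]


train, test = split_lines([f"line {i}" for i in range(10)], split_ratio=0.8)
print(len(train), len(test))
```

With ten lines and a ratio of 0.8, eight lines end up in the training part and two in the testing part.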
Splits data and labels into training and testing sets using split_ratio.
import tnkeeh as tn
tn.split_classification_data(data_path, lbls_path, split_ratio = 0.8)
Splits input and target data with ratio split_ratio. This is commonly used for translation.
tn.split_parallel_data('ar_data.txt','en_data.txt')
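Both the classification split and the parallel split have to keep each input line paired with its label or translation. A minimal sketch of that idea follows; the helper split_pairs, its seed parameter, and the shuffling step are hypothetical and are not part of tnkeeh's API.

```python
import random


def split_pairs(inputs, targets, split_ratio=0.8, seed=0):
    """Hypothetical helper: split two aligned lists without breaking pairs."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    pairs = list(zip(inputs, targets))   # bind each input to its target
    random.Random(seed).shuffle(pairs)   # assumption: shuffle pairs together
    cut = int(len(pairs) * split_ratio)
    train, test = pairs[:cut], pairs[cut:]
    # Unzip back into separate input and target lists.
    train_x = [x for x, _ in train]
    train_y = [y for _, y in train]
    test_x = [x for x, _ in test]
    test_y = [y for _, y in test]
    return (train_x, train_y), (test_x, test_y)
```

Shuffling the zipped pairs, rather than each list on its own, is what keeps line i of the inputs aligned with line i of the targets after the split.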
Data Reading
Reads the split data, depending on whether it was raw, labeled, or parallel.
import tnkeeh as tn
train_data, test_data = tn.read_data(mode = 0)
Arguments
- mode = 0: read raw data.
- mode = 1: read labeled data.
- mode = 2: read parallel data.
Contribution
This is an open source project where we encourage contributions from the community.
License
MIT license.
Citation
@misc{tnkeeh2020,
author = {Zaid Alyafeai and Maged Saeed},
title = {tnkeeh: A Preprocessing Library for Arabic.},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ARBML/tnkeeh}}
}