Project description
tkseem (تقسيم) is a tokenization library for Arabic text. It brings together several preprocessing, cleaning, normalization, segmentation and tokenization algorithms in one package.
Features
- Cleaning
- Normalization
- Segmentation
- Tokenization
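To make the cleaning and normalization steps listed above more concrete, here is a minimal sketch in plain Python. It does not use tkseem and is not its implementation; the helper names `strip_diacritics` and `normalize_alef` are hypothetical, and the rules tkseem actually applies may differ.

```python
import re

# Arabic short-vowel marks (tashkeel, U+064B to U+0652) plus tatweel (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")

# Alef with madda above, hamza above, and hamza below.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def strip_diacritics(text: str) -> str:
    """Remove diacritics and tatweel, leaving the base letters untouched."""
    return _DIACRITICS.sub("", text)


def normalize_alef(text: str) -> str:
    """Map the alef variants matched above to a bare alef (U+0627)."""
    return _ALEF_VARIANTS.sub("\u0627", text)


if __name__ == "__main__":
    print(normalize_alef(strip_diacritics("السَّلامُ عليكم")))
```

Keeping each rule in its own small function makes it easy to switch steps on or off, similar in spirit to the `clean` and `normalize` flags shown under Usage.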
Documentation
Please visit readthedocs for the full documentation.
Installation
pip install tkseem
Usage
Preprocessors
import tkseem as tk
tokenizer = tk.WordTokenizer()
tokenizer.process_data('samples/data.txt', clean = True, segment = True, normalize = True)
Tokenization
import tkseem as tk
tokenizer = tk.WordTokenizer()
tokenizer.process_data('samples/data.txt')
tokenizer.train()
tokenizer.tokenize("السلام عليكم")
tokenizer.encode("السلام عليكم")
tokenizer.decode([536, 829])
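The calls above follow a train, tokenize, encode, decode cycle. As a rough illustration of what such a cycle involves for a word-level tokenizer, here is a self-contained sketch. `TinyWordTokenizer` is a hypothetical class written for this example and is not part of tkseem; the whitespace splitting, the `<UNK>` token, and the id assignment are all assumptions.

```python
from collections import Counter


class TinyWordTokenizer:
    """Hypothetical whitespace tokenizer; an illustration, not tkseem code."""

    def __init__(self, unk_token="<UNK>"):
        self.unk_token = unk_token
        self.vocab = {unk_token: 0}    # token -> id
        self.inverse = {0: unk_token}  # id -> token

    def train(self, text):
        counts = Counter(text.split())
        # Frequent words get small ids; ties are broken by code point order
        # so that training is deterministic.
        for word, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            if word not in self.vocab:
                idx = len(self.vocab)
                self.vocab[word] = idx
                self.inverse[idx] = word

    def tokenize(self, text):
        # Words never seen during training are replaced by the unknown token.
        return [w if w in self.vocab else self.unk_token for w in text.split()]

    def encode(self, text):
        return [self.vocab[token] for token in self.tokenize(text)]

    def decode(self, ids):
        return [self.inverse[i] for i in ids]


if __name__ == "__main__":
    tok = TinyWordTokenizer()
    tok.train("السلام عليكم ورحمة الله السلام")
    ids = tok.encode("السلام عليكم")
    print(ids, tok.decode(ids))
```

The ids returned by a real tkseem tokenizer depend on its training data, so they will not match the ids from this sketch.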
Large Files
import tkseem as tk
# initialize
tokenizer = tk.WordTokenizer()
tokenizer.process_data('samples/data.txt')
# training
tokenizer.train(large_file = True)
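The source does not say what `large_file = True` changes internally. One common way to handle a corpus that does not fit in memory is to stream it line by line instead of reading it whole. The sketch below illustrates that general idea with a hypothetical `count_words_streaming` helper; it is an assumption about the technique, not tkseem's implementation.

```python
from collections import Counter


def count_words_streaming(path, encoding="utf-8"):
    """Count whitespace-separated words without loading the whole file."""
    counts = Counter()
    with open(path, encoding=encoding) as handle:
        # Iterating over a file object yields one line at a time,
        # so memory use is bounded by the longest line plus the counter.
        for line in handle:
            counts.update(line.split())
    return counts
```

Calling `count_words_streaming('samples/data.txt')` would return a `Counter` from which a vocabulary could be built.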
Caching
with open('data/raw/train.txt', encoding = 'utf-8') as f:
    tokenizer.tokenize(f.read(), cache = True)
Save and Load
import tkseem as tk
tokenizer = tk.WordTokenizer()
tokenizer.process_data('samples/data.txt')
tokenizer.train()
# save the model
tokenizer.save_model('vocab.pl')
# load the model
tokenizer = tk.WordTokenizer()
tokenizer.load_model('vocab.pl')
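The source does not document the file format behind `save_model` and `load_model`. As a generic illustration of persisting a trained vocabulary and restoring it later, the sketch below round-trips a plain dictionary through Python's `pickle` module. The helper names `save_vocab` and `load_vocab` are hypothetical, and the choice of pickle is an assumption.

```python
import pickle


def save_vocab(vocab, path):
    """Write a token -> id mapping to disk."""
    with open(path, "wb") as handle:
        pickle.dump(vocab, handle)


def load_vocab(path):
    """Read a mapping written by save_vocab. Only load files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Pickle can execute arbitrary code while loading, so a file from an untrusted source should not be loaded this way.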
Model Agnostic
import tkseem as tk
import time
import seaborn as sns
import pandas as pd
def calc_time(fun):
    start_time = time.time()
    fun().train()
    return time.time() - start_time
running_times = {}
running_times['Word'] = calc_time(tk.WordTokenizer)
running_times['SP'] = calc_time(tk.SentencePieceTokenizer)
running_times['Random'] = calc_time(tk.RandomTokenizer)
running_times['Auto'] = calc_time(tk.AutoTokenizer)
running_times['Disjoint'] = calc_time(tk.DisjointLetterTokenizer)
running_times['Char'] = calc_time(tk.CharacterTokenizer)
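The comparison above works because every tokenizer class is driven through the same `train()` call. A slightly more general variant of the timing helper is sketched below. The names `time_call` and `fastest_first` are hypothetical; the sketch uses `time.perf_counter`, which is monotonic and better suited to measuring intervals than `time.time`, and the lambda workloads in the usage lines are placeholders rather than tkseem tokenizers.

```python
import time


def time_call(fn):
    """Return the elapsed seconds for one call of a zero-argument callable."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def fastest_first(timings):
    """Sort a {name: seconds} mapping into (name, seconds) pairs, fastest first."""
    return sorted(timings.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    timings = {
        "small": time_call(lambda: sum(range(1_000))),
        "large": time_call(lambda: sum(range(1_000_000))),
    }
    for name, seconds in fastest_first(timings):
        print(f"{name}: {seconds:.6f} s")
```

With tkseem, a call such as `time_call(lambda: tk.WordTokenizer().train())` would mirror `calc_time` above.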
Notebooks
We show how to use tkseem to train some NLP models.
Citation
@misc{tkseem2020,
author = {Zaid Alyafeai and Maged Saeed},
title = {tkseem: A Tokenization Library for Arabic.},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ARBML/tkseem}}
}
Contribution
This is an open source project where we encourage contributions from the community.
License
MIT license.