Project description
tkseem (تقسيم) is a tokenization library that encapsulates different approaches for tokenization and preprocessing of Arabic text.
Documentation
Please visit readthedocs for the full documentation.
Installation
pip install tkseem
Usage
Tokenization
import tkseem as tk
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')
tokenizer.tokenize("السلام عليكم")
tokenizer.encode("السلام عليكم")
tokenizer.decode([536, 829])
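To make the train / tokenize / encode / decode cycle above concrete, here is a minimal pure-Python sketch of a word-level tokenizer. It is an illustration only, not tkseem's implementation: the class `ToyWordTokenizer`, its `vocab_size` and `unk_token` parameters, and the choice to train on a string rather than a file path are all assumptions made for this example.

```python
from collections import Counter


class ToyWordTokenizer:
    """Illustrative whitespace tokenizer (hypothetical; not tkseem's code)."""

    def __init__(self, vocab_size=1000, unk_token="<UNK>"):
        self.vocab_size = vocab_size
        self.unk_token = unk_token
        self.vocab = [unk_token]            # id -> token
        self.token_to_id = {unk_token: 0}   # token -> id

    def train(self, text):
        # Keep the most frequent words, reserving id 0 for the unknown token.
        counts = Counter(text.split())
        most_common = [w for w, _ in counts.most_common(self.vocab_size - 1)]
        self.vocab = [self.unk_token] + most_common
        self.token_to_id = {tok: i for i, tok in enumerate(self.vocab)}

    def tokenize(self, text):
        # Words never seen during training are replaced by the unknown token.
        return [w if w in self.token_to_id else self.unk_token
                for w in text.split()]

    def encode(self, text):
        return [self.token_to_id[tok] for tok in self.tokenize(text)]

    def decode(self, ids):
        return [self.vocab[i] for i in ids]


toy = ToyWordTokenizer(vocab_size=10)
toy.train("السلام عليكم ورحمة الله السلام عليكم")
ids = toy.encode("السلام عليكم")
tokens = toy.decode(ids)
```

In this sketch, `decode(encode(text))` returns the original words whenever every word is in the vocabulary, and any out-of-vocabulary word is encoded as id 0.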
Caching
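# read the file inside a context manager so the handle is closed afterwards
with open('data/raw/train.txt', encoding='utf-8') as f:
    tokenizer.tokenize(f.read(), use_cache=True)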
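The README does not describe what `use_cache` does internally. One common way to speed up tokenization of a large text is to memoize the result for each distinct word, so that repeated words are processed only once. The sketch below shows that general idea with the standard library; `split_word`, `split_word_cached`, and this `tokenize` function are hypothetical stand-ins, not tkseem functions.

```python
from functools import lru_cache


def split_word(word):
    """Stand-in for an expensive per-word segmentation step (hypothetical)."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


@lru_cache(maxsize=None)
def split_word_cached(word):
    # Return a tuple so callers cannot mutate the cached value.
    return tuple(split_word(word))


def tokenize(text, use_cache=False):
    pieces = []
    for word in text.split():
        pieces.extend(split_word_cached(word) if use_cache else split_word(word))
    return pieces


text = "السلام عليكم " * 1000
cached = tokenize(text, use_cache=True)
uncached = tokenize(text, use_cache=False)
cache_info = split_word_cached.cache_info()
```

Both calls return the same pieces; with caching, only the two distinct words are actually segmented and the remaining lookups are served from the cache.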
Save and Load
import tkseem as tk
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')
# save the model
tokenizer.save_model('vocab.pl')
# load the model
tokenizer = tk.WordTokenizer()
tokenizer.load_model('vocab.pl')
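The `.pl` extension suggests a pickled vocabulary, but the README does not state the file format, so treat that as an assumption. Under that assumption, a save/load round trip for a vocabulary mapping could look like the following sketch; the `vocab` dictionary here is made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical vocabulary: token -> id.
vocab = {"<UNK>": 0, "السلام": 1, "عليكم": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.pl")

    # Save: serialize the mapping to disk in binary mode.
    with open(path, "wb") as f:
        pickle.dump(vocab, f)

    # Load: read the mapping back into a fresh object.
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

Pickle files can execute arbitrary code when loaded, so only load model files that come from a source you trust.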
Model Agnostic
import tkseem as tk
import time
import seaborn as sns
import pandas as pd
def calc_time(fun):
    # instantiate the tokenizer class and time its training run
    start_time = time.time()
    fun().train()
    return time.time() - start_time
running_times = {}
running_times['Word'] = calc_time(tk.WordTokenizer)
running_times['SP'] = calc_time(tk.SentencePieceTokenizer)
running_times['Random'] = calc_time(tk.RandomTokenizer)
running_times['Disjoint'] = calc_time(tk.DisjointLetterTokenizer)
running_times['Char'] = calc_time(tk.CharacterTokenizer)
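The comparison above works because every tokenizer class exposes the same `train` method, so one helper can time any of them. Below is a self-contained sketch of that pattern using stand-in classes; `FastTokenizer`, `SlowTokenizer`, and the extra `train_args` parameter are hypothetical and exist only so the example runs without tkseem or a training file.

```python
import time


def calc_time(factory, *train_args):
    """Return the seconds spent in factory().train(*train_args)."""
    tokenizer = factory()
    start = time.perf_counter()
    tokenizer.train(*train_args)
    return time.perf_counter() - start


class FastTokenizer:
    """Hypothetical stand-in: builds a word set."""

    def train(self, text):
        self.vocab = set(text.split())


class SlowTokenizer:
    """Hypothetical stand-in: simulates a slower training procedure."""

    def train(self, text):
        time.sleep(0.05)
        self.vocab = set(text)


corpus = "السلام عليكم ورحمة الله"
running_times = {
    "Fast": calc_time(FastTokenizer, corpus),
    "Slow": calc_time(SlowTokenizer, corpus),
}

# Tokenizer names ordered from fastest to slowest.
ranking = sorted(running_times, key=running_times.get)
```

The resulting dictionary has the same shape as `running_times` in the snippet above, so it can be sorted or passed to a plotting library in the same way.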
Notebooks
We show how to use tkseem to train some NLP models.
Citation
@misc{tkseem2020,
  author = {Zaid Alyafeai and Maged Saeed},
  title = {tkseem: A Tokenization Library for Arabic.},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ARBML/tkseem}}
}
Contribution
This is an open-source project, and we encourage contributions from the community.
License
MIT license.