tkseem (تقسيم) is a tokenization library that encapsulates different approaches for tokenization and preprocessing of Arabic text.
Documentation
The full documentation is hosted on Read the Docs.
Installation
pip install tkseem
Usage
Tokenization
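import tkseem as tk
# Train a word-level tokenizer on a text file
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')
# Split text into tokens
tokenizer.tokenize("السلام عليكم")
# Map text to token ids
tokenizer.encode("السلام عليكم")
# Map token ids back to tokens (the ids depend on the trained vocabulary)
tokenizer.decode([536, 829])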
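To make the train / tokenize / encode / decode cycle above concrete, here is a minimal, self-contained sketch of a word-level tokenizer. It is not tkseem's implementation: the class `TinyWordTokenizer`, the `<UNK>` token, and the choice to train on a string rather than a file are all assumptions made for illustration.

```python
from collections import Counter


class TinyWordTokenizer:
    """Hypothetical stand-in for illustration; not tkseem's implementation."""

    def __init__(self, unk_token="<UNK>"):
        self.unk_token = unk_token
        self.vocab = {unk_token: 0}     # token -> id
        self.id_to_token = [unk_token]  # id -> token

    def train(self, text):
        # Assign ids by descending frequency; ties keep first-seen order.
        counts = Counter(text.split())
        for token, _ in counts.most_common():
            if token not in self.vocab:
                self.vocab[token] = len(self.id_to_token)
                self.id_to_token.append(token)

    def tokenize(self, text):
        # Whitespace split; words outside the vocabulary become the unknown token.
        return [t if t in self.vocab else self.unk_token for t in text.split()]

    def encode(self, text):
        return [self.vocab[t] for t in self.tokenize(text)]

    def decode(self, ids):
        return [self.id_to_token[i] for i in ids]


tok = TinyWordTokenizer()
tok.train("السلام عليكم ورحمة الله السلام")
ids = tok.encode("السلام عليكم")
tokens = tok.decode(ids)
```

The ids returned by `encode` are only meaningful relative to the vocabulary built during training, which is why the same text can map to different ids after training on a different corpus.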
Caching
# Read the file with an explicit encoding and close it when done
with open('data/raw/train.txt', encoding='utf-8') as f:
    tokenizer.tokenize(f.read(), use_cache=True)
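This page does not describe how `use_cache` works internally. As a general illustration of why caching can help when the same input recurs, the sketch below memoizes a stand-in tokenization function with the standard library's `functools.lru_cache`. The function `tokenize_cached` and the `calls` counter are hypothetical and unrelated to tkseem's internals.

```python
from functools import lru_cache

calls = 0  # counts how many times the underlying work actually runs


@lru_cache(maxsize=1024)
def tokenize_cached(text):
    """Hypothetical stand-in tokenizer: whitespace split, memoized per input."""
    global calls
    calls += 1
    # Return a tuple so the cached value cannot be mutated by callers.
    return tuple(text.split())


first = tokenize_cached("السلام عليكم")
second = tokenize_cached("السلام عليكم")  # served from the cache
```

After the two calls, `calls` is 1 because the second lookup never reaches the function body.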
Save and Load
import tkseem as tk
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')
# save the model
tokenizer.save_model('vocab.pl')
# load the model
tokenizer = tk.WordTokenizer()
tokenizer.load_model('vocab.pl')
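The on-disk format written by `save_model` is not documented on this page. As a rough sketch of the save-then-load round trip, the example below serializes a small vocabulary dictionary with Python's `pickle` module. Both the use of pickle and the shape of the `vocab` dictionary are assumptions for illustration only.

```python
import os
import pickle
import tempfile

# Hypothetical vocabulary: token -> id
vocab = {"<UNK>": 0, "السلام": 1, "عليكم": 2}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.pl")
    # Save the vocabulary
    with open(path, "wb") as f:
        pickle.dump(vocab, f)
    # Load it back into a fresh object
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.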
Model Agnostic
import tkseem as tk
import time
import seaborn as sns
import pandas as pd

def calc_time(fun):
    # Time how long one tokenizer class takes to train.
    # The path is assumed to be the same sample file used in the examples above.
    start_time = time.time()
    fun().train('samples/data.txt')
    return time.time() - start_time

running_times = {}
running_times['Word'] = calc_time(tk.WordTokenizer)
running_times['SP'] = calc_time(tk.SentencePieceTokenizer)
running_times['Random'] = calc_time(tk.RandomTokenizer)
running_times['Disjoint'] = calc_time(tk.DisjointLetterTokenizer)
running_times['Char'] = calc_time(tk.CharacterTokenizer)
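Once a `running_times` dictionary like the one above is filled in, it can be ranked and printed without any plotting library. The sketch below shows one way to do that. To stay runnable without tkseem or a training file, it times two hypothetical stand-in workloads rather than real tokenizers; `time_call` and `workloads` are illustrative names, not part of the library.

```python
import time


def time_call(fn, *args, **kwargs):
    """Return the wall-clock seconds that fn(*args, **kwargs) takes."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start


# Hypothetical stand-ins; in practice each entry would train one tokenizer.
workloads = {
    "split": lambda: "a b c".split(),
    "chars": lambda: list("a b c"),
}
running_times = {name: time_call(fn) for name, fn in workloads.items()}

# Rank from fastest to slowest and print a small report.
ranking = sorted(running_times, key=running_times.get)
for name in ranking:
    print(f"{name:<10}{running_times[name]:.6f}s")
```

`time.perf_counter` is used here because it is intended for measuring short durations; a single run is still noisy, so repeating each measurement and taking the minimum or median gives a fairer comparison.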
Notebooks
The notebooks show how to use tkseem to train some NLP models.
Citation
@misc{tkseem2020,
author = {Zaid Alyafeai and Maged Saeed},
title = {tkseem: A Tokenization Library for Arabic.},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ARBML/tkseem}}
}
Contribution
This is an open-source project, and we encourage contributions from the community.
License
MIT license.