
tkseem (تقسيم) is a tokenization library that encapsulates different approaches to tokenizing and preprocessing Arabic text. We provide cleaning, normalization, segmentation, and tokenization algorithms.

Features

  • Cleaning
  • Normalization
  • Segmentation
  • Tokenization

Documentation

Please visit readthedocs for the full documentation.

Installation

pip install tkseem

Usage

Preprocessors

import tkseem as tk
tokenizer = tk.WordTokenizer()
tokenizer.process_data('samples/data.txt', clean=True, segment=True, normalize=True)
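
For intuition, cleaning and normalizing Arabic text usually mean stripping diacritics and unifying letter variants. The following standalone sketch only illustrates the idea; it is not tkseem's internal implementation:

import re

def normalize_arabic(text):
    # unify alef variants (illustrative rule, not tkseem's exact behavior)
    text = re.sub('[إأآ]', 'ا', text)
    # strip tashkeel (diacritics) in the range U+064B..U+0652
    text = re.sub('[\u064B-\u0652]', '', text)
    return text

print(normalize_arabic('أَهْلاً'))  # اهلا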

Tokenization

import tkseem as tk
tokenizer = tk.WordTokenizer()
tokenizer.process_data('samples/data.txt')
tokenizer.train()

tokenizer.tokenize("السلام عليكم")
tokenizer.encode("السلام عليكم")
tokenizer.decode([536, 829])
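
tokenize returns the tokens as strings, while encode and decode map between text and integer ids from the trained vocabulary. A round-trip sketch, assuming the usual list-of-tokens and list-of-ids conventions (the exact ids depend on your training data):

tokens = tokenizer.tokenize("السلام عليكم")   # e.g. ['السلام', 'عليكم']
ids = tokenizer.encode("السلام عليكم")        # integer ids, vocabulary-dependent
decoded = tokenizer.decode(ids)               # maps the ids back to tokens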

Large Files

import tkseem as tk

# initialize
tokenizer = tk.WordTokenizer()
tokenizer.process_data('samples/data.txt')

# training 
tokenizer.train(large_file=True)
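
As rough intuition for why a dedicated large-file mode helps (a generic sketch, not tkseem's actual implementation), vocabulary counting can stream the corpus line by line instead of loading it into memory at once:

from collections import Counter

# generic streaming word count; illustrative only
counts = Counter()
with open('samples/data.txt') as f:
    for line in f:
        counts.update(line.split())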

Caching

with open('data/raw/train.txt') as f:
    tokenizer.tokenize(f.read(), cache=True)
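
Caching pays off because natural text repeats the same words many times, so each distinct word only needs to be tokenized once. A generic memoization sketch (not tkseem's internals):

from functools import lru_cache

@lru_cache(maxsize=None)
def tokenize_word(word):
    # stand-in for a real subword split; repeated words hit the cache
    return tuple(word)

tokens = [t for w in "السلام عليكم السلام".split() for t in tokenize_word(w)]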

Save and Load

import tkseem as tk

tokenizer = tk.WordTokenizer()
tokenizer.process_data('samples/data.txt')
tokenizer.train()

# save the model
tokenizer.save_model('vocab.pl')

# load the model
tokenizer = tk.WordTokenizer()
tokenizer.load_model('vocab.pl')
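
After loading, the tokenizer is ready to use without retraining:

tokenizer.tokenize("السلام عليكم")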

Model Agnostic

import tkseem as tk
import time 
import seaborn as sns
import pandas as pd

def calc_time(fun):
    # build the tokenizer, process the sample data, then time training
    tokenizer = fun()
    tokenizer.process_data('samples/data.txt')
    start_time = time.time()
    tokenizer.train()
    return time.time() - start_time

running_times = {}

running_times['Word'] = calc_time(tk.WordTokenizer)
running_times['SP'] = calc_time(tk.SentencePieceTokenizer)
running_times['Random'] = calc_time(tk.RandomTokenizer)
running_times['Auto'] = calc_time(tk.AutoTokenizer)
running_times['Disjoint'] = calc_time(tk.DisjointLetterTokenizer)
running_times['Char'] = calc_time(tk.CharacterTokenizer)
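
Since seaborn and pandas are imported, a natural continuation (our addition, not part of the original snippet) is to plot the collected timings:

# continues the snippet above: visualize the measured training times
df = pd.DataFrame(list(running_times.items()), columns=['tokenizer', 'seconds'])
sns.barplot(data=df, x='tokenizer', y='seconds')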

Notebooks

We show how to use tkseem to train some NLP models.

  • Demo: explains the syntax of all tokenizers.
  • Sentiment Classification: uses WordTokenizer to process sentences, then trains a sentiment classifier.
  • Meter Classification: uses CharacterTokenizer for meter classification with bidirectional GRUs.

Citation

@misc{tkseem2020,
  author = {Zaid Alyafeai and Maged Saeed},
  title = {tkseem: A Tokenization Library for Arabic.},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ARBML/tkseem}}
}

Contribution

This is an open-source project, and we encourage contributions from the community.

License

MIT license.
