TS Tokenizer is a hybrid (lexicon-based and rule-based) tokenizer designed specifically for tokenizing Turkish texts.

These details have not been verified by PyPI

Project links

Project description

TS Tokenizer is a hybrid tokenizer designed specifically for tokenizing Turkish texts. It uses a hybrid (lexicon-based and rule-based) approach to split text into tokens.

Key Features:

Hybrid Tokenization: Combines lexicon-based and rule-based techniques to tokenize complex Turkish texts with precision.
Special Token Handling: Detects and processes mentions, hashtags, emails, URLs, dates, numbers, smileys, emoticons, and more.
Configurable Outputs: Offers multiple output formats, including plain tokens, tagged tokens, tokenized lines, and tagged lines, to suit diverse NLP workflows.
Multi-core Processing: Speeds up tokenization for large files with parallel processing.
Preprocess Handling: Handles corrupted Turkish text and punctuation gracefully using built-in fixes.
Command-Line Friendly: Use it directly from the terminal for file-based or piped input workflows.

On natural language processing (NLP), information retrieval, or text mining for Turkish, TS Tokenizer offers a reliable solution for tokenization.

Installation

You can install the ts-tokenizer package using pip.

pip install ts-tokenizer

Ensure you have Python 3.9 or higher installed on your system.

You can update current version using pip

pip install --upgrade ts-tokenizer

To remove package, use:

pip uninstall ts-tokenizer

You can also clone the repo locally.

git clone https://github.com/tanerim/ts_tokenizer.git
cd ts-tokenizer
pip install -e .

License

This project is licensed under the MIT License. See the LICENSE file for details.

Command line tool

You can use TS Tokenizer directly from the command line for both file inputs and pipeline processing:

Tokenize from a File:

ts-tokenizer input.txt

cat input.txt | ts-tokenizer

zcat input.txt.gz | ts-tokenizer

Help

Get detailed help for available options using:

Argument	Short	Description	Default
`--output`	`-o`	Specify the output format: `tokenized`, `lines`, `tagged`, `tagged_lines`.	`tokenized`
`--num-workers`	`-n`	Set the number of parallel workers for processing.	`CPU cores-1`
`--verbose`	`-v`	Enable verbose mode to display additional processing details.	Disabled
`--version`	`-V`	Display the current version of `ts-tokenizer`.	N/A
`--help`	`-h`	Show the help message and exit.

CLI Arguments

You can specify the output format using the -o option:

tokenized (default): Returns plain tokens, one per line.
tagged: Returns tokens with their tags.
lines: Returns tokenized lines as lists.
tagged_lines: Returns tokenized lines as a list of tuples (token, tag).

input_text = "Queen , 31.10.1975 tarihinde çıkardıðı A Night at the Opera albÃ¼mÃ¼yle dÃ¼nya mÃ¼ziðini deðiåÿtirdi ."

$ ts-tokenizer input_text

Queen
,
31.10.1975
tarihinde
çıkardığı
A
Night
at
the
Opera
albümüyle
dünya
müziğini
değiştirdi
.

Note that tags are not part-of-speech tags but they define the given string.

$ ts-tokenizer -o tagged input.txt

Queen	English_Word
,	Punc
31.10.1975	Date
tarihinde	Valid_Word
çıkardığı	Valid_Word
A	OOV
Night	English_Word
at	Valid_Word
the	English_Word
Opera	Valid_Word
albümüyle	Valid_Word
dünya	Valid_Word
müziğini	Valid_Word
değiştirdi	Valid_Word
.	Punc

The other two arguments are "lines" and "tagged_lines". The "lines" parameter reads input file line-by-line and returns a list for each line. Note that each line is defined by end-of-line markers in the given text.

$ ts-tokenizer -o lines input.txt

['Queen', ',', '31.10.1975', 'tarihinde', 'çıkardığı', 'A', 'Night', 'at', 'the', 'Opera', 'albümüyle', 'dünya', 'müziğini', 'değiştirdi', '.']

The "tagged_lines" parameter reads input file line-by-line and returns a list of tuples for each line. Note that each line is defined by end-of-line markers in the given text.

$ ts-tokenizer -o tagged_lines input.txt

 [('Queen', 'English_Word'), (',', 'Punc'), ('31.10.1975', 'Date'), ('tarihinde', 'Valid_Word'), ('çıkardığı', 'Valid_Word'), ('A', 'OOV'), ('Night', 'English_Word'), ('at', 'Valid_Word'), ('the', 'English_Word'), ('Opera', 'Valid_Word'), ('albümüyle', 'Valid_Word'), ('dünya', 'Valid_Word'), ('müziğini', 'Valid_Word'), ('değiştirdi', 'Valid_Word'), ('.', 'Punc')]

Parallel Processing

Use the -n option to set the number of parallel workers:

$ ts-tokenizer -n 2 -o tagged input_file

By default, TS Tokenizer uses [number of CPU cores - 1].

Using CLI Arguments with pipelines

You can use TS Tokenizer in bash pipelines, such as counting word frequencies:

Following sample returns calculated frequencies for the given file:

$ ts-tokenizer input.txt | sort | uniq -c | sort -n

To count tags:

$ ts-tokenizer -o tagged input.txt | cut -f2 | sort | uniq -c

1 Date
3 English_Word
2 Hashtag
2 Mention
1 Multi_Hyphenated
3 Numbered_Title
1 OOV
9 Punc
1 Single_Hyphenated
17 Valid_Word

To find a specific tag following command could be used.

$ ts-tokenizer -o tagged input.txt | cut -f1,2 | grep "Web_URL"

www.wikipedia.org	Web_URL
www.wim-wenders.com	Web_URL
www.winterwar.com.	Web_URL
www.wissenschaft.de:	Web_URL
www.wittingen.de	Web_URL
www.wlmqradio.com	Web_URL
www.worldstadiums.com	Web_URL
www.worldstatesmen.org	Web_URL

Classes

Below are samples to implement ts-tokenizer in Python

TSTokenizer

tokenized : Outputs a list of plain tokens extracted from the input text.

from ts_tokenizer import tokenize
Single_Line_Sample = "ParÃ§a ve bÃ¼tÃ¼n iliåÿkisi her zaman iåÿlevsel deðildir."
tokens = tokenize(Single_Line_Sample, "tokenized")
print(tokens)

Generated output is as follows:

Parça
ve
bütün
ilişkisi
her
zaman
işlevsel
değildir

tagged : Outputs tokens with associated tags. Please note that these are not POSTags. Check TokenHandler below for tag set.

from ts_tokenizer import tokenize
Single_Line_Sample = "ParÃ§a ve bÃ¼tÃ¼n iliåÿkisi her zaman iåÿlevsel deðildir."
tagged_tokens = tokenize(Single_Line_Sample, "tagged")
print(tokens)

Generated output is as follows:

Parça	Valid_Word
ve	Valid_Word
bütün	Valid_Word
ilişkisi	Valid_Word
her	Valid_Word
zaman	Valid_Word
işlevsel	Valid_Word
değildir	Valid_Word
.	Punc

lines : Maintains the structure of the input text, with each line's tokens grouped together. Please note that "line" is defined with end-of-line markers.

from ts_tokenizer import tokenize
Multi_Line_Sample = """
ATATÜRK'ün GENÇLÝÐE HÝTABESÝ 
Ey Türk gençliði! Birinci vazifen, Türk istiklâlini, Türk Cumhuriyet'ini, ilelebet, muhafaza ve müdafaa etmektir. 
Mevcudiyetinin ve istikbalinin yegâne temeli budur. 
Bu temel, senin, en kýymetli hazinendir.
Ýstikbalde dahi, seni bu hazineden mahrum etmek isteyecek, dahilî ve haricî bedhahlarýn olacaktýr. 
Bir gün, istiklâl ve cumhuriyeti müdafaa mecburiyetine düþersen, vazifeye atýlmak için, içinde bulunacaðýn vaziyetin imkân ve þeraitini düþünmeyeceksin! 
Bu imkân ve þerait, çok nâmüsait bir mahiyette tezahür edebilir. 
Ýstiklâl ve cumhuriyetine kastedecek düþmanlar, bütün dünyada emsali görülmemiþ bir galibiyetin mümessili olabilirler. 
Cebren ve hile ile aziz vatanýn, bütün kaleleri zaptedilmiþ, bütün tersanelerine girilmiþ, bütün ordularý daðýtýlmýþ ve memleketin her köþesi bilfiil iþgal edilmiþ olabilir. 
Bütün bu þeraitten daha elîm ve daha vahim olmak üzere, memleketin dahilinde, iktidara sahip olanlar gaflet ve dalâlet ve hattâ hýyanet içinde bulunabilirler.
Hatta bu iktidar sahipleri þahsî menfaatlerini, müstevlilerin siyasî emelleriyle tevhit edebilirler.
Millet, fakruzaruret içinde harap ve bîtap düþmüþ olabilir. Ey Türk istikbalinin evladý! Ýþte, bu ahval ve þerait içinde dahi, vazifen; Türk istiklâl ve cumhuriyetini kurtarmaktýr! 
Muhtaç olduðun kudret, damarlarýndaki asil kanda, mevcuttur!
"""
lines = tokenize(Multi_Line_Sample, "lines")
print(lines)

Generated output is as follows:

ATATÜRK'ün GENÇLİĞE HİTABESİ
Ey Türk gençliği ! Birinci vazifen , Türk istiklâlini , Türk Cumhuriyet'ini , ilelebet , muhafaza ve müdafaa etmektir .
Mevcudiyetinin ve istikbalinin yegâne temeli budur .
Bu temel , senin , en kıymetli hazinendir .
İstikbalde dahi , seni bu hazineden mahrum etmek isteyecek , dahilî ve haricî bedhahların olacaktır .
Bir gün , istiklâl ve cumhuriyeti müdafaa mecburiyetine düşersen , vazifeye atılmak için , içinde bulunacağın vaziyetin imkân ve şeraitini düşünmeyeceksin !
Bu imkân ve şerait , çok nâmüsait bir mahiyette tezahür edebilir .
İstiklâl ve cumhuriyetine kastedecek düşmanlar , bütün dünyada emsali görülmemiş bir galibiyetin mümessili olabilirler .
Cebren ve hile ile aziz vatanın , bütün kaleleri zaptedilmiş , bütün tersanelerine girilmiş , bütün orduları dağıtılmış ve memleketin her köşesi bilfiil işgal edilmiş olabilir .
Bütün bu şeraitten daha elîm ve daha vahim olmak üzere , memleketin dahilinde , iktidara sahip olanlar gaflet ve dalâlet ve hattâ hıyanet içinde bulunabilirler .
Hatta bu iktidar sahipleri şahsî menfaatlerini , müstevlilerin siyasî emelleriyle tevhit edebilirler .
Millet , fakruzaruret içinde harap ve bîtap düşmüş olabilir . Ey Türk istikbalinin evladı ! İşte , bu ahval ve şerait içinde dahi , vazifen ; Türk istiklâl ve cumhuriyetini kurtarmaktır !
Muhtaç olduğun kudret , damarlarındaki asil kanda , mevcuttur !

tagged_lines: Same as lines but includes tags for each token.

from ts_tokenizer import tokenize
Multi_Line_Sample = """
ATATÜRK'ün GENÇLÝÐE HÝTABESÝ 
Ey Türk gençliði! Birinci vazifen, Türk istiklâlini, Türk Cumhuriyet'ini, ilelebet, muhafaza ve müdafaa etmektir. 
Mevcudiyetinin ve istikbalinin yegâne temeli budur. 
Bu temel, senin, en kýymetli hazinendir.
Ýstikbalde dahi, seni bu hazineden mahrum etmek isteyecek, dahilî ve haricî bedhahlarýn olacaktýr. 
Bir gün, istiklâl ve cumhuriyeti müdafaa mecburiyetine düþersen, vazifeye atýlmak için, içinde bulunacaðýn vaziyetin imkân ve þeraitini düþünmeyeceksin! 
Bu imkân ve þerait, çok nâmüsait bir mahiyette tezahür edebilir. 
Ýstiklâl ve cumhuriyetine kastedecek düþmanlar, bütün dünyada emsali görülmemiþ bir galibiyetin mümessili olabilirler. 
Cebren ve hile ile aziz vatanýn, bütün kaleleri zaptedilmiþ, bütün tersanelerine girilmiþ, bütün ordularý daðýtýlmýþ ve memleketin her köþesi bilfiil iþgal edilmiþ olabilir. 
Bütün bu þeraitten daha elîm ve daha vahim olmak üzere, memleketin dahilinde, iktidara sahip olanlar gaflet ve dalâlet ve hattâ hýyanet içinde bulunabilirler.
Hatta bu iktidar sahipleri þahsî menfaatlerini, müstevlilerin siyasî emelleriyle tevhit edebilirler.
Millet, fakruzaruret içinde harap ve bîtap düþmüþ olabilir. Ey Türk istikbalinin evladý! Ýþte, bu ahval ve þerait içinde dahi, vazifen; Türk istiklâl ve cumhuriyetini kurtarmaktýr! 
Muhtaç olduðun kudret, damarlarýndaki asil kanda, mevcuttur!
"""
tagged_lines = tokenize(Multi_Line_Sample, "tagged_lines")
print(tagged_lines)

Generated output is as follows:

[("ATATÜRK'ün", 'Apostrophed'), ('GENÇLİĞE', 'Valid_Word'), ('HİTABESİ', 'Valid_Word')]
[('Ey', 'Valid_Word'), ('Türk', 'Valid_Word'), ('gençliği', 'Valid_Word'), ('!', 'Punc'), ('Birinci', 'Valid_Word'), ('vazifen', 'Valid_Word'), (',', 'Punc'), ('Türk', 'Valid_Word'), ('istiklâlini', 'Valid_Word'), (',', 'Punc'), ('Türk', 'Valid_Word'), ("Cumhuriyet'ini", 'Apostrophed'), (',', 'Punc'), ('ilelebet', 'Valid_Word'), (',', 'Punc'), ('muhafaza', 'Valid_Word'), ('ve', 'Valid_Word'), ('müdafaa', 'Valid_Word'), ('etmektir', 'Valid_Word'), ('.', 'Punc')]
[('Mevcudiyetinin', 'Valid_Word'), ('ve', 'Valid_Word'), ('istikbalinin', 'Valid_Word'), ('yegâne', 'Valid_Word'), ('temeli', 'Valid_Word'), ('budur', 'Valid_Word'), ('.', 'Punc')]
[('Bu', 'Valid_Word'), ('temel', 'Valid_Word'), (',', 'Punc'), ('senin', 'Valid_Word'), (',', 'Punc'), ('en', 'Valid_Word'), ('kıymetli', 'Valid_Word'), ('hazinendir', 'Valid_Word'), ('.', 'Punc')]
[('İstikbalde', 'Valid_Word'), ('dahi', 'Valid_Word'), (',', 'Punc'), ('seni', 'Valid_Word'), ('bu', 'Valid_Word'), ('hazineden', 'Valid_Word'), ('mahrum', 'Valid_Word'), ('etmek', 'Valid_Word'), ('isteyecek', 'Valid_Word'), (',', 'Punc'), ('dahilî', 'Valid_Word'), ('ve', 'Valid_Word'), ('haricî', 'Valid_Word'), ('bedhahların', 'Valid_Word'), ('olacaktır', 'Valid_Word'), ('.', 'Punc')]
[('Bir', 'Valid_Word'), ('gün', 'Valid_Word'), (',', 'Punc'), ('istiklâl', 'Valid_Word'), ('ve', 'Valid_Word'), ('cumhuriyeti', 'Valid_Word'), ('müdafaa', 'Valid_Word'), ('mecburiyetine', 'Valid_Word'), ('düşersen', 'Valid_Word'), (',', 'Punc'), ('vazifeye', 'Valid_Word'), ('atılmak', 'Valid_Word'), ('için', 'Valid_Word'), (',', 'Punc'), ('içinde', 'Valid_Word'), ('bulunacağın', 'Valid_Word'), ('vaziyetin', 'Valid_Word'), ('imkân', 'Valid_Word'), ('ve', 'Valid_Word'), ('şeraitini', 'Valid_Word'), ('düşünmeyeceksin', 'Valid_Word'), ('!', 'Punc')]
[('Bu', 'Valid_Word'), ('imkân', 'Valid_Word'), ('ve', 'Valid_Word'), ('şerait', 'Valid_Word'), (',', 'Punc'), ('çok', 'Valid_Word'), ('nâmüsait', 'Valid_Word'), ('bir', 'Valid_Word'), ('mahiyette', 'Valid_Word'), ('tezahür', 'Valid_Word'), ('edebilir', 'Valid_Word'), ('.', 'Punc')]
[('İstiklâl', 'Valid_Word'), ('ve', 'Valid_Word'), ('cumhuriyetine', 'Valid_Word'), ('kastedecek', 'Valid_Word'), ('düşmanlar', 'Valid_Word'), (',', 'Punc'), ('bütün', 'Valid_Word'), ('dünyada', 'Valid_Word'), ('emsali', 'Valid_Word'), ('görülmemiş', 'Valid_Word'), ('bir', 'Valid_Word'), ('galibiyetin', 'Valid_Word'), ('mümessili', 'Valid_Word'), ('olabilirler', 'Valid_Word'), ('.', 'Punc')]
[('Cebren', 'Valid_Word'), ('ve', 'Valid_Word'), ('hile', 'Valid_Word'), ('ile', 'Valid_Word'), ('aziz', 'Valid_Word'), ('vatanın', 'Valid_Word'), (',', 'Punc'), ('bütün', 'Valid_Word'), ('kaleleri', 'Valid_Word'), ('zaptedilmiş', 'OOV'), (',', 'Punc'), ('bütün', 'Valid_Word'), ('tersanelerine', 'Valid_Word'), ('girilmiş', 'Valid_Word'), (',', 'Punc'), ('bütün', 'Valid_Word'), ('orduları', 'Valid_Word'), ('dağıtılmış', 'Valid_Word'), ('ve', 'Valid_Word'), ('memleketin', 'Valid_Word'), ('her', 'Valid_Word'), ('köşesi', 'Valid_Word'), ('bilfiil', 'Valid_Word'), ('işgal', 'Valid_Word'), ('edilmiş', 'Valid_Word'), ('olabilir', 'Valid_Word'), ('.', 'Punc')]
[('Bütün', 'Valid_Word'), ('bu', 'Valid_Word'), ('şeraitten', 'Valid_Word'), ('daha', 'Valid_Word'), ('elîm', 'Valid_Word'), ('ve', 'Valid_Word'), ('daha', 'Valid_Word'), ('vahim', 'Valid_Word'), ('olmak', 'Valid_Word'), ('üzere', 'Valid_Word'), (',', 'Punc'), ('memleketin', 'Valid_Word'), ('dahilinde', 'Valid_Word'), (',', 'Punc'), ('iktidara', 'Valid_Word'), ('sahip', 'Valid_Word'), ('olanlar', 'Valid_Word'), ('gaflet', 'Valid_Word'), ('ve', 'Valid_Word'), ('dalâlet', 'Valid_Word'), ('ve', 'Valid_Word'), ('hattâ', 'Valid_Word'), ('hıyanet', 'Valid_Word'), ('içinde', 'Valid_Word'), ('bulunabilirler', 'Valid_Word'), ('.', 'Punc')]
[('Hatta', 'Valid_Word'), ('bu', 'Valid_Word'), ('iktidar', 'Valid_Word'), ('sahipleri', 'Valid_Word'), ('şahsî', 'Valid_Word'), ('menfaatlerini', 'Valid_Word'), (',', 'Punc'), ('müstevlilerin', 'Valid_Word'), ('siyasî', 'Valid_Word'), ('emelleriyle', 'Valid_Word'), ('tevhit', 'Valid_Word'), ('edebilirler', 'Valid_Word'), ('.', 'Punc')]
[('Millet', 'Valid_Word'), (',', 'Punc'), ('fakruzaruret', 'Valid_Word'), ('içinde', 'Valid_Word'), ('harap', 'Valid_Word'), ('ve', 'Valid_Word'), ('bîtap', 'Valid_Word'), ('düşmüş', 'Valid_Word'), ('olabilir', 'Valid_Word'), ('.', 'Punc'), ('Ey', 'Valid_Word'), ('Türk', 'Valid_Word'), ('istikbalinin', 'Valid_Word'), ('evladı', 'Valid_Word'), ('!', 'Punc'), ('İşte', 'Valid_Word'), (',', 'Punc'), ('bu', 'Valid_Word'), ('ahval', 'Valid_Word'), ('ve', 'Valid_Word'), ('şerait', 'Valid_Word'), ('içinde', 'Valid_Word'), ('dahi', 'Valid_Word'), (',', 'Punc'), ('vazifen', 'Valid_Word'), (';', 'Punc'), ('Türk', 'Valid_Word'), ('istiklâl', 'Valid_Word'), ('ve', 'Valid_Word'), ('cumhuriyetini', 'Valid_Word'), ('kurtarmaktır', 'Valid_Word'), ('!', 'Punc')]
[('Muhtaç', 'Valid_Word'), ('olduğun', 'Valid_Word'), ('kudret', 'Valid_Word'), (',', 'Punc'), ('damarlarındaki', 'Valid_Word'), ('asil', 'Valid_Word'), ('kanda', 'Valid_Word'), (',', 'Punc'), ('mevcuttur', 'Valid_Word'), ('!', 'Punc')]

CharFix

CharFix offers methods to correct corrupted Turkish text:

Fix Characters

from ts_tokenizer import CharFix

line = "ParÃ§a ve bÃ¼tÃ¼n iliåÿkisi her zaman iåÿlevsel deðildir."
print(CharFix.fix(line))  # Fixes corrupted characters

$ Parça ve bütün ilişkisi her zaman işlevsel değildir.

Lowercase

from ts_tokenizer import CharFix

line = "İstanbul ve Iğdır ''arası'' 1528 km'dir."
print(CharFix.tr_lowercase(line))

$ istanbul ve ığdır ''arası'' 1528 km'dir.

Fix Quotes

from ts_tokenizer import CharFix

line = "İstanbul ve Iğdır ''arası'' 1528 km'dir."
print(CharFix.fix_quote(line))

$ İstanbul ve Iğdır "arası" 1528 km'dir.

TokenHandler

TokenHandler gets each given string and process it using methods defined under TokenPreProcess class. This process follows a strictly defined order and it is recursive. Each method could be called

from ts_tokenizer import TokenPreProcess

Below are the list of tags generated by tokenizer to process tokens:

#	Function	Sample	Used As Output	Tag
01	is_mention	@ts-tokenizer	Yes	Mention
02	is_hashtag	#ts-tokenizer	Yes	Hashtag
03	is_in_quotes	"ts-tokenizer"	No	-----
04	is_numbered_title	(1)	Yes	Numbered_Title
05	is_in_paranthesis	(bilgisayar)	No	-----
06	is_date_range	01.01.2024-01.01.2025	Yes	Date_Range
07	is_complex_punc	-yeniden,sonradan..	No	-----
08	is_date	22.02.2016	Yes	Date
09	is_hour	14.05	Yes	Hour
10	is_percentage_numbers	%75	Yes	Percentage_Numbers
11	is_percentage_numbers_chars	%75'lik	Yes	Percentage_Numbers
12	is_roman_number	XI	Yes	Roman_Number
13	is_bullet_list	•Giriş	Yes	Bullet_List
14	is_email	tanersezerr@gmail.com	Yes	Email
15	is_email_punc	tanersezerr@gmail.com.	No	-----
16	is_full_url	https://tscorpus.com	Yes	Full_URL
17	is_web_url	www.tscorpus.com	Yes	Web_URL
--	is_full_url	www.example.com'un	Yes	URL_Suffix
18	is_copyright	©tscorpus	Yes	Copyright
19	is_registered	tscorpus®	Yes	Registered
20	is_trademark	tscorpus™	Yes	Trademark
21	is_currency	100$	Yes	Currency
22	is_num_char_sequence	380A	No	-----
23	is_abbr	TBMM	Yes	Abbr
24	is_in_lexicon	bilgisayar	Yes	Valid_Word
25	is_in_exceptions	e-mail	Yes	Exception
26	is_in_eng_words	computer	Yes	English_Word
27	is_smiley	:)	Yes	Smiley
28	is_multiple_smiley	:):)	No	-----
29	is_emoticon	🍻	Yes	Emoticon
30	is_multiple_emoticon	🍻🍻	No	-----
31	is_multiple_smiley_in	hey:):)	No	-----
32	is_number	175.01	Yes	Number
33	is_apostrophed	Türkiye'nin	Yes	Apostrophed
34	is_single_punc	!	Yes	Punc
35	is_multi_punc	!!	No	-----
36	is_single_hyphenated	sabah-akşam	Yes	Single_Hyphenated
37	is_multi_hyphenated	çay-su-kahve	Yes	Multi-Hyphenated
38	is_single_underscored	Gel_Git	Yes	Single_Underscored
39	is_multi_underscored	Yarı_Yapılandırılmış_Mülakat	Yes	Multi_Underscored
40	is_one_char_fixable	bilgisa¬yar	Yes	One_Char_Fixed
42	is_three_or_more	heyyyyy	No	-----
43	is_fsp	bilgisayar.	No	-----
44	is_isp	.bilgisayar	No	-----
45	is_fmp	bilgisayar..	No	-----
46	is_imp	..bilgisayar	No	-----
47	is_msp	--bilgisayar--	No	-----
48	is_mssp	-bilgisayar-	No	-----
49	is_midsp	okul,öğrenci	No	-----
50	is_midmp	okul,öğrenci, öğretmen	No	-----
51	is_non_latin	한국드	No	Non_Latin

Performance

ts-tokenizer is optimized for efficient tokenization and takes advantage of multi-core processing for large-scale text. By default, the script utilizes all available CPU cores minus one, ensuring your system remains responsive while processing large datasets.

Performance Benchmarks:

The following benchmarks were conducted on different machines with the following specifications:

Processor	Cores	RAM	1 Million Tokens (Multi-Core)	Throughput (Multi-Core)	1 Million Tokens (Single-Core)	Throughput (Single-Core)
AMD Ryzen 7 5800H with Radeon Graphics (Laptop) 3.2 GHz / 4.4 Ghz	8 physical cores (16 threads)	16GB DDR4	~170 seconds	~5,800 tokens/second	~715 seconds	~1,400 tokens/second
AMD Ryzen 9 7950X3D with Radeon Graphics (Desktop) 4.2 Ghz / 5.7 Ghz	16 physical cores (32 threads)	96GB DDR5	~14 seconds	~71,500 tokens/second	~110 seconds	~9,090 tokens/second

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.23

May 13, 2026

0.1.22

Nov 27, 2025

0.1.21

Nov 27, 2025

0.1.20

Apr 28, 2025

0.1.19

Jan 30, 2025

0.1.18

Dec 25, 2024

0.1.17

Dec 3, 2024

0.1.16

Dec 3, 2024

0.1.15

Dec 2, 2024

0.1.14

Nov 29, 2024

0.1.13

Nov 28, 2024

0.1.12

Oct 18, 2024

0.1.11

Oct 17, 2024

0.1.10

Oct 17, 2024

0.1.9

Oct 17, 2024

0.1.8

Oct 17, 2024

0.1.7

Oct 17, 2024

0.1.6

Oct 17, 2024

0.1.5

Oct 17, 2024

0.1.4

Oct 17, 2024

0.1.3

Sep 4, 2024

0.1.2

Aug 31, 2024

0.1.1

Aug 31, 2024

0.1.0

May 21, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ts_tokenizer-0.1.23.tar.gz (10.1 MB view details)

Uploaded May 13, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ts_tokenizer-0.1.23-py3-none-any.whl (10.0 MB view details)

Uploaded May 13, 2026 Python 3

File details

Details for the file ts_tokenizer-0.1.23.tar.gz.

File metadata

Download URL: ts_tokenizer-0.1.23.tar.gz
Upload date: May 13, 2026
Size: 10.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for ts_tokenizer-0.1.23.tar.gz
Algorithm	Hash digest
SHA256	`79fd5be9deb22a357885ee649383664db389751f31a038d6a988495fef326dfc`
MD5	`8c675ae7360351f3df5873df5770c88b`
BLAKE2b-256	`c5ecd50e2c22820011e5d778effed9297893e64824c92d79cb698f075568b864`

See more details on using hashes here.

File details

Details for the file ts_tokenizer-0.1.23-py3-none-any.whl.

File metadata

Download URL: ts_tokenizer-0.1.23-py3-none-any.whl
Upload date: May 13, 2026
Size: 10.0 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for ts_tokenizer-0.1.23-py3-none-any.whl
Algorithm	Hash digest
SHA256	`666e98e95cd45c6c7dc3a0ad8c1fe1c045537b5d9f55d1edd2cb784219fb82ce`
MD5	`cf445140c9dd180b2418cc953627777b`
BLAKE2b-256	`17eca416035a4d486cc8af98ad2f23cca87ee1ba116acbe83912e557a4683ea0`

See more details on using hashes here.

ts-tokenizer 0.1.23

Navigation

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Project description

Key Features:

Installation

License

Command line tool

Tokenize from a File:

Help

CLI Arguments

Parallel Processing

Using CLI Arguments with pipelines

Classes

TSTokenizer

CharFix

Fix Characters

Lowercase

Fix Quotes

TokenHandler

Below are the list of tags generated by tokenizer to process tokens:

Performance

Performance Benchmarks:

Project details

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes