
Unified Tokenizer

Introduction

When feeding textual data into models such as BERT, the first step is tokenization. Transformers provides BertTokenizer, which splits (multi-lingual) text with the WordPiece algorithm. In some cases, however, a text field is a single entity that should not be split even though it contains multiple sub-words, and other fields are arrays of entities joined by separator characters (e.g. | or ,).

Different cases call for different tokenizers, and that is where Unified Tokenizer (UniTok) comes in. You can either customize your own tokenizer or use one of our pre-defined tokenizers.
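
For example, a sub-word tokenizer would break an entity-list field into many meaningless pieces, while what we usually want is one id per entity (the field value and the ent_id helper below are made up for illustration):

# With a sub-word tokenizer, the field is shattered into pieces, roughly:
#   'Elon_Musk|SpaceX'  ->  ['elon', '_', 'mus', '##k', '|', 'space', '##x']
# With an entity tokenizer, we want one id per entity instead:
#   'Elon_Musk|SpaceX'  ->  [ent_id('Elon_Musk'), ent_id('SpaceX')]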

Installation

pip install UniTok

Usage

Here we use the first 10 lines of the MINDlarge training set as an example (see the news-sample.tsv file). Assume the data path is /home/ubuntu/news-sample.tsv.

Data Declaration (see the MIND GitHub repository for more information)

Each line in the file is the information about one piece of news.

It has 8 columns, separated by the tab character:

  • News ID
  • Category
  • SubCategory
  • Title
  • Abstract
  • URL
  • Title Entities (entities contained in the title of this news)
  • Abstract Entities (entities contained in the abstract of this news)

We only use its first 5 columns for demonstration.
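
As a quick sanity check (optional; the path is the one assumed above), you can print the first five fields of the first raw line to confirm the tab-separated layout:

with open('/home/ubuntu/news-sample.tsv') as f:
    first_line = f.readline()
print(first_line.split('\t')[:5])  # nid, cat, subCat, title, abs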

Imports

import pandas as pd


from UniTok import UniTok, Column
from UniTok.tok import IdTok, EntTok, BertTok, SingT, ListT

Read data

df = pd.read_csv(
    filepath_or_buffer='/home/ubuntu/news-sample.tsv',
    sep='\t',
    names=['nid', 'cat', 'subCat', 'title', 'abs', 'url', 'titEnt', 'absEnt'],
    usecols=['nid', 'cat', 'subCat', 'title', 'abs'],
)
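
A quick look at the loaded frame (optional) confirms that only the five columns we need were read:

print(df.shape)   # expected (10, 5) for the 10-line sample
print(df.head())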

Initialize Tokenizers

id_tok = IdTok(name='news')  # for news id
cat_tok = EntTok(name='cat')  # for category & subcategory
txt_tok = BertTok(name='english', vocab_dir='bert-base-uncased')  # for title & abstract

# Reserve the first 100 token ids for special usage in the downstream model, if any.
# Note that the first token (id 0) is always PAD.
cat_tok.vocab.reserve(100)
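
The reserved range is only a convention between this preprocessing step and the downstream model. A minimal sketch of how such a convention might look (the names below are our own examples, not part of UniTok):

# Hypothetical downstream convention: ids 0-99 are kept free for
# model-specific special tokens, so real category ids start at 100.
PAD = 0           # always token id 0, as noted above
CLS = 1           # example special token, not defined by UniTok
UNK_CATEGORY = 2  # example special token, not defined by UniTok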

Construct Unified Tokenizer

A SingleTokenizer emits only one token id per cell, while a ListTokenizer generates a sequence of ids.
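
Conceptually (the token ids below are made up for illustration only):

cat_ids = 103                          # SingleTokenizer: one id for the whole cell, e.g. 'sports'
title_ids = [2019, 2733, 1020, 20932]  # ListTokenizer: a sequence of ids for a tokenized title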

ut = UniTok()
ut.add_col(Column(
    name='nid',
    tokenizer=id_tok.as_sing(),
)).add_col(Column(
    name='cat',
    tokenizer=cat_tok.as_sing()
)).add_col(Column(
    name='subCat',
    tokenizer=cat_tok.as_sing(),
)).add_col(Column(
    name='title',
    tokenizer=txt_tok.as_list(),
)).add_col(Column(
    name='abs',
    tokenizer=txt_tok.as_list(),
)).read_file(df)

For now we deliberately leave out the max_length of the ListTokenizer outputs; we will set it after analysing the data.

Analyse Data

ut.analyse()

This shows the distribution of token sequence lengths for each column that uses a ListTokenizer, which helps us determine a suitable max_length for each column.

[ COLUMNS ]
[ COL: nid ]
[NOT ListTokenizer]

[ COL: cat ]
[NOT ListTokenizer]

[ COL: subCat ]
[NOT ListTokenizer]

[ COL: title ]
[ MIN: 6 ]
[ MAX: 16 ]
[ AVG: 12 ]
[ X-INT: 1 ]
[ Y-INT: 0 ]
       |   
       |   
       |   
       |   
       || |
       || |
       || |
| |  | || |
| |  | || |
| |  | || |
-----------

[ COL: abs ]
100%|██████████| 10/10 [00:00<00:00, 119156.36it/s]
100%|██████████| 10/10 [00:00<00:00, 166440.63it/s]
100%|██████████| 10/10 [00:00<00:00, 164482.51it/s]
100%|██████████| 10/10 [00:00<00:00, 2172.09it/s]
100%|██████████| 10/10 [00:00<00:00, 1552.30it/s]
[ MIN: 0 ]
[ MAX: 46 ]
[ AVG: 21 ]
[ X-INT: 1 ]
[ Y-INT: 0 ]
|                                              
|                                              
|                                              
|                                              
|                                              
|               | | ||    ||               |  |
|               | | ||    ||               |  |
|               | | ||    ||               |  |
|               | | ||    ||               |  |
|               | | ||    ||               |  |
-----------------------------------------------

[ VOCABS ]
[ VOC: news with  10 tokens ]
[ COL: nid ]

[ VOC: cat with  112 tokens ]
[ COL: cat, subCat ]

[ VOC: english with  30522 tokens ]
[ COL: title, abs ]

Reconstruct Unified Tokenizer

ut = UniTok()
ut.add_col(Column(
    name='nid',
    tokenizer=id_tok.as_sing(),
)).add_col(Column(
    name='cat',
    tokenizer=cat_tok.as_sing()
)).add_col(Column(
    name='subCat',
    tokenizer=cat_tok.as_sing(),
)).add_col(Column(
    name='title',
    tokenizer=txt_tok.as_list(max_length=10),
)).add_col(Column(
    name='abs',
    tokenizer=txt_tok.as_list(max_length=30),
)).read_file(df)

In this step, we set max_length for each list column (10 for titles and 30 for abstracts, guided by the length distributions above). If max_length is not set, the whole sequence is kept and never truncated.
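
As a rough mental model of what max_length does here (a sketch, not UniTok's exact internals):

def truncate(ids, max_length=None):
    # Keep at most max_length ids; with max_length=None the sequence is untouched.
    return ids if max_length is None else ids[:max_length]

print(truncate([5, 8, 13, 21], max_length=3))  # [5, 8, 13]
print(truncate([5, 8]))                        # [5, 8]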

Tokenize and Store

ut.tokenize()  # run every column's tokenizer over the dataframe
ut.store_data('TokenizedData')  # write the tokenized results to the TokenizedData directory
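
To confirm that something was written (the exact file layout depends on the UniTok version), you can list the output directory:

import os
print(os.listdir('TokenizedData'))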
