
Unified Tokenizer

Introduction

Unified Tokenizer, UniTok for short, offers a variety of pre-defined tokenizers for processing textual data. It is a central data processing tool that lets algorithm engineers focus on the algorithm itself instead of on tedious data preprocessing.

It incorporates the BERT tokenizer from the transformers library, and it also supports custom tokenizers via the general word segmentation module (i.e., the BaseTok class).
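For a quick illustration of the custom route, the sketch below subclasses BaseTok to split a cell on commas. This is a minimal sketch, not the library's documented API: the single t method (mapping a raw cell value to its tokens) and the name constructor argument are assumptions, so check the BaseTok interface of the installed UniTok version before relying on them.

from UniTok.tok import BaseTok

class CommaTok(BaseTok):
    # Hypothetical custom tokenizer: splits one cell on commas (sketch only).
    def t(self, obj):
        # Assumption: a subclass returns the tokens extracted from a single cell value.
        if not obj:
            return []
        return [token.strip() for token in str(obj).split(',')]

tag_tok = CommaTok(name='tags')  # 'tags' is a hypothetical vocabulary name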

Installation

pip install UnifiedTokenizer

Usage

We use the head of the training set of the MINDlarge dataset as an example (see the news-sample.tsv file).

Data Declaration (for more information, see the MIND GitHub repository)

Each line in the file is a piece of news with 8 features, separated by the tab (\t) symbol:

  • News ID
  • Category
  • SubCategory
  • Title
  • Abstract
  • URL
  • Title Entities (entities contained in the title of this news)
  • Abstract Entities (entities contained in the abstract of this news)

We only use its first 5 columns for demonstration.

Pre-defined Tokenizers

| Tokenizer | Description | Parameters |
| --- | --- | --- |
| BertTok | Provided by the transformers library, using the WordPiece strategy | vocab_dir |
| EntTok | The whole cell of the column is regarded as a single token | / |
| IdTok | A specific version of EntTok whose tokens are required to be unique; used for index columns | / |
| SplitTok | The cell contains tokens joined by a separator such as tab or space | sep |
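The snippet below sketches how each pre-defined tokenizer in the table might be constructed. The name and vocab_dir arguments mirror the usage later in this README; the SplitTok import path, its name keyword, and the sep keyword are assumptions based on the Parameters column, so verify them against the installed version.

from UniTok.tok import IdTok, EntTok, SplitTok, BertTok

id_tok = IdTok(name='nid')                                          # index column: one unique token per row
cat_tok = EntTok(name='cat')                                        # the whole cell becomes a single token
text_tok = BertTok(name='english', vocab_dir='bert-base-uncased')   # WordPiece tokenization
word_tok = SplitTok(name='words', sep=' ')                          # assumed: splits the cell on sep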

Imports

import pandas as pd

from UniTok import UniTok, Column
from UniTok.tok import IdTok, EntTok, BertTok

Read Data

df = pd.read_csv(
    filepath_or_buffer='path/news-sample.tsv',
    sep='\t',  # the MIND news file is tab-separated
    names=['nid', 'cat', 'subCat', 'title', 'abs', 'url', 'titEnt', 'absEnt'],
    usecols=['nid', 'cat', 'subCat', 'title', 'abs'],  # keep only the first 5 columns
)

Construct UniTok

from UniTok import UniTok, Column
from UniTok.tok import EntTok, BertTok

cat_tok = EntTok(name='cat')  # one tokenizer for both cat and subCat
text_tok = BertTok(name='english', vocab_dir='bert-base-uncased')  # specify the bert vocab

unitok = UniTok().add_index_col(
    name='nid'  # index column: each news id becomes a unique token
).add_col(Column(
    name='cat',
    tokenizer=cat_tok.as_sing()  # the whole cell is a single token
)).add_col(Column(
    name='subCat',
    tokenizer=cat_tok.as_sing(),  # shares the 'cat' vocabulary
)).add_col(Column(
    name='title',
    tokenizer=text_tok.as_list(),  # the cell is tokenized into a token sequence
)).add_col(Column(
    name='abs',
    tokenizer=text_tok.as_list(),
)).read_file(df)

Analyse Data

unitok.analyse()

It shows the distribution of token-sequence lengths for each column (if using a ListTokenizer), which helps us determine a suitable max_length for each column.

[ COLUMNS ]
[ COL: nid ]
[NOT ListTokenizer]

[ COL: cat ]
[NOT ListTokenizer]

[ COL: subCat ]
[NOT ListTokenizer]

[ COL: title ]
[ MIN: 6 ]
[ MAX: 16 ]
[ AVG: 12 ]
[ X-INT: 1 ]
[ Y-INT: 0 ]
       |   
       |   
       |   
       |   
       || |
       || |
       || |
| |  | || |
| |  | || |
| |  | || |
-----------

[ COL: abs ]
100%|██████████| 10/10 [00:00<00:00, 119156.36it/s]
100%|██████████| 10/10 [00:00<00:00, 166440.63it/s]
100%|██████████| 10/10 [00:00<00:00, 164482.51it/s]
100%|██████████| 10/10 [00:00<00:00, 2172.09it/s]
100%|██████████| 10/10 [00:00<00:00, 1552.30it/s]
[ MIN: 0 ]
[ MAX: 46 ]
[ AVG: 21 ]
[ X-INT: 1 ]
[ Y-INT: 0 ]
|                                              
|                                              
|                                              
|                                              
|                                              
|               | | ||    ||               |  |
|               | | ||    ||               |  |
|               | | ||    ||               |  |
|               | | ||    ||               |  |
|               | | ||    ||               |  |
-----------------------------------------------

[ VOCABS ]
[ VOC: news with  10 tokens ]
[ COL: nid ]

[ VOC: cat with  112 tokens ]
[ COL: cat, subCat ]

[ VOC: english with  30522 tokens ]
[ COL: title, abs ]

Reconstruct the Unified Tokenizer

unitok = UniTok().add_index_col(
    name='nid'
).add_col(Column(
    name='cat',
    tokenizer=cat_tok.as_sing()
)).add_col(Column(
    name='subCat',
    tokenizer=cat_tok.as_sing(),
)).add_col(Column(
    name='title',
    tokenizer=text_tok.as_list(max_length=10),  # truncate titles to at most 10 tokens
)).add_col(Column(
    name='abs',
    tokenizer=text_tok.as_list(max_length=30),  # truncate abstracts to at most 30 tokens
)).read_file(df)

In this step, we set the max_length of each list column. If max_length is not set, the whole sequence is kept without truncation.
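For example, with max_length=10, the longest title in the sample (16 tokens according to the analysis above) will be truncated to 10 tokens, while abstracts (up to 46 tokens) are truncated to 30.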

Tokenize and Store

unitok.tokenize()
unitok.store_data('TokenizedData')
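The stored directory can later be loaded back for model training. A minimal sketch, assuming this version of UniTok exposes the UniDep reader for data produced by store_data (check the documentation for the exact loader API):

from UniTok import UniDep  # assumption: UniDep reads directories written by store_data

depot = UniDep('TokenizedData')  # the directory written above
print(len(depot))                # number of tokenized news items
print(depot[0])                  # token ids of the first item, keyed by column name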
