Unified Tokenizer

UniTok v4

Unified preprocessing for heterogeneous ML tables: text, categorical, and numerical columns in one pipeline.

Python package: unitok
Current package version: 4.4.2 (from setup.py)
Legacy v3 docs: README_v3.md

Why UniTok

UniTok turns raw tabular data into model-ready numeric tables while preserving:

Consistent vocabularies across multiple datasets
Clear feature definitions (column -> tokenizer -> output feature)
Reproducible metadata and saved artifacts
Simple unions across datasets via shared keys

Core Ideas

UniTok: Orchestrates the preprocessing lifecycle and holds the processed data.
Feature: Binds a column to a tokenizer and an output name.
Tokenizer: Encodes raw values to ids (entity, split, digit, transformers).
Vocab: Global token index, shared across datasets.
Meta: Stores schema, tokenizers, vocabularies, and feature definitions.
State: Tracks progress through initialized -> tokenized -> organized.
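
To make the shared-vocabulary idea concrete, here is a minimal pure-Python sketch. ToyVocab and tokenize_column are hypothetical names used only for illustration; they are not UniTok's Vocab or Feature classes, and the real implementation may differ.

```python
from typing import Callable, Dict, List


class ToyVocab:
    """Illustrative token -> id index; NOT UniTok's Vocab class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._index: Dict[str, int] = {}
        self._editable = True

    def encode(self, token: str) -> int:
        # Known tokens always map to the same id.
        if token in self._index:
            return self._index[token]
        if not self._editable:
            raise KeyError(f"{token!r} is not in frozen vocab {self.name!r}")
        # New tokens get the next free id while the vocab is editable.
        self._index[token] = len(self._index)
        return self._index[token]

    def freeze(self) -> None:
        # After freezing, unknown tokens raise instead of being added.
        self._editable = False

    def __len__(self) -> int:
        return len(self._index)


def tokenize_column(rows: List[dict], column: str,
                    encode: Callable[[str], int]) -> List[int]:
    """Apply one encoder to one column: column -> tokenizer -> output feature."""
    return [encode(row[column]) for row in rows]


nid_vocab = ToyVocab("nid")

items = [{"nid": "N1"}, {"nid": "N2"}]
interactions = [{"nid": "N2"}, {"nid": "N1"}, {"nid": "N2"}]

item_ids = tokenize_column(items, "nid", nid_vocab.encode)
nid_vocab.freeze()  # later tables may look ids up but not add new ones
inter_ids = tokenize_column(interactions, "nid", nid_vocab.encode)

print(item_ids)   # [0, 1]
print(inter_ids)  # [1, 0, 1]
```

Because both tables encode nid through the same index, an id in the interaction table refers to the same item as that id in the item table.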

Install

pip install unitok

Requirements: Python 3.7+, pandas, transformers, tqdm, rich.

Quickstart

import pandas as pd
from unitok import UniTok, Vocab
from unitok.tokenizer import BertTokenizer, TransformersTokenizer, EntityTokenizer, SplitTokenizer, DigitTokenizer

item = pd.read_csv(
    'news-sample.tsv', sep='\t',
    names=['nid', 'category', 'subcategory', 'title', 'abstract'],
    usecols=['nid', 'category', 'subcategory', 'title', 'abstract'],
)
item['abstract'] = item['abstract'].fillna('')

user = pd.read_csv(
    'user-sample.tsv', sep='\t',
    names=['uid', 'history'],
)

interaction = pd.read_csv(
    'interaction-sample.tsv', sep='\t',
    names=['uid', 'nid', 'click'],
)

# Vocabularies shared across tables: item ids (nid) and user ids (uid).
item_vocab = Vocab(name='nid')
user_vocab = Vocab(name='uid')

# Item table: nid is the key; title and abstract are each tokenized with both BERT and LLaMA.
with UniTok() as item_ut:
    bert = BertTokenizer(vocab='bert')
    llama = TransformersTokenizer(vocab='llama', key='huggyllama/llama-7b')

    item_ut.add_feature(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid', key=True)
    item_ut.add_feature(tokenizer=bert, column='title', name='title@bert', truncate=20)
    item_ut.add_feature(tokenizer=llama, column='title', name='title@llama', truncate=20)
    item_ut.add_feature(tokenizer=bert, column='abstract', name='abstract@bert', truncate=50)
    item_ut.add_feature(tokenizer=llama, column='abstract', name='abstract@llama', truncate=50)
    item_ut.add_feature(tokenizer=EntityTokenizer(vocab='category'), column='category')
    item_ut.add_feature(tokenizer=EntityTokenizer(vocab='subcategory'), column='subcategory')

# User table: uid is the key; history is a comma-separated list of nids encoded with item_vocab.
with UniTok() as user_ut:
    user_ut.add_feature(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid', key=True)
    user_ut.add_feature(tokenizer=SplitTokenizer(vocab=item_vocab, sep=','), column='history', truncate=30)

# Interaction table: an index feature plus uid, nid, and click, reusing the shared vocabularies.
with UniTok() as inter_ut:
    inter_ut.add_index_feature(name='index')
    inter_ut.add_feature(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid')
    inter_ut.add_feature(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid')
    inter_ut.add_feature(tokenizer=DigitTokenizer(vocab='click', vocab_size=2), column='click')

# Tokenize the item table first, then close item_vocab to further edits
# before tokenizing the tables that reuse it.
item_ut.tokenize(item).save('sample-ut/item')
item_vocab.deny_edit()
user_ut.tokenize(user).save('sample-ut/user')
inter_ut.tokenize(interaction).save('sample-ut/interaction')
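
The quickstart reads three headerless TSV files. If the original sample files are not at hand, toy files with the same column layout can be generated as sketched below. The rows are made up for illustration, and write_tsv and write_toy_samples are hypothetical helpers, not part of unitok.

```python
import csv
import tempfile
from pathlib import Path


def write_tsv(path: Path, rows: list) -> None:
    # No header row: the quickstart supplies column names via `names=`.
    with path.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)


def write_toy_samples(directory: Path) -> None:
    directory.mkdir(parents=True, exist_ok=True)
    # Columns: nid, category, subcategory, title, abstract
    write_tsv(directory / "news-sample.tsv", [
        ["N1", "sports", "soccer", "Local team wins final", "A short made-up abstract."],
        ["N2", "tech", "gadgets", "New phone released", ""],
    ])
    # Columns: uid, history (comma-separated nids, matching sep=',')
    write_tsv(directory / "user-sample.tsv", [
        ["U1", "N1,N2"],
        ["U2", "N2"],
    ])
    # Columns: uid, nid, click (0 or 1, matching vocab_size=2)
    write_tsv(directory / "interaction-sample.tsv", [
        ["U1", "N2", 1],
        ["U2", "N1", 0],
    ])


# Written to a temporary directory here; pass Path(".") to write next to a script.
sample_dir = Path(tempfile.mkdtemp())
write_toy_samples(sample_dir)
print(sorted(p.name for p in sample_dir.iterdir()))
```

The empty abstract in the second news row exercises the fillna('') step from the quickstart.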

Loading Saved Data

from unitok import UniTok

ut = UniTok.load('sample-ut/item')
print(len(ut))
print(ut[0])

Combining Datasets (Union)

with inter_ut:
    inter_ut.union(user_ut)
    print(inter_ut[0])

Soft union (default): links the tables and resolves the joined columns on access.
Hard union: materializes the merged columns.
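
The difference between the two modes can be sketched with plain dictionaries. This is only a conceptual illustration under the assumption that a union joins rows on a shared key (uid here); soft_get and hard_union are hypothetical names, not UniTok functions.

```python
from typing import Dict, List

# Interaction rows reference users by key; user rows carry the extra column.
interactions: List[Dict[str, object]] = [
    {"uid": 0, "nid": 1, "click": 1},
    {"uid": 1, "nid": 0, "click": 0},
]
users: Dict[int, Dict[str, object]] = {
    0: {"history": [0, 1]},
    1: {"history": [1]},
}


def soft_get(index: int) -> Dict[str, object]:
    """Soft union: keep the tables separate and join when a row is read."""
    row = dict(interactions[index])
    row.update(users[row["uid"]])
    return row


def hard_union() -> List[Dict[str, object]]:
    """Hard union: copy the joined columns into every row up front."""
    return [{**row, **users[row["uid"]]} for row in interactions]


print(soft_get(0))   # joined on access
merged = hard_union()
print(merged[1])     # joined ahead of time
```

The soft variant stores nothing extra and pays a lookup per access; the hard variant duplicates the user columns into each interaction row.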

CLI

Summarize a saved table:

unitok path/to/data

Add a feature to an existing saved table (integrate):

unitok integrate path/to/data --file data.tsv --column title --name title@bert \
  --vocab bert --tokenizer transformers --t.key bert-base-uncased

Remove a feature from a saved table:

unitok remove path/to/data --name title@bert

Data Artifacts

Saved directories include:

meta.json: schema, tokenizers, and vocabularies
data.pkl: tokenized columns
*.vocab: pickled vocabularies
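
A quick way to check a saved directory against this layout is to group its files by kind. The sketch below uses only the file names listed above; list_artifacts is a hypothetical helper, and the vocab file names in the demo (nid.vocab, category.vocab) are assumed examples of the *.vocab pattern.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def list_artifacts(directory: Path) -> Dict[str, List[str]]:
    """Group the files in a saved directory by the artifact kinds named above."""
    names = sorted(p.name for p in directory.iterdir() if p.is_file())
    return {
        "meta": [n for n in names if n == "meta.json"],
        "data": [n for n in names if n == "data.pkl"],
        "vocabs": [n for n in names if n.endswith(".vocab")],
    }


# Demo on empty stand-in files; real files would come from a UniTok save.
demo = Path(tempfile.mkdtemp())
for name in ["meta.json", "data.pkl", "nid.vocab", "category.vocab"]:
    (demo / name).touch()

print(list_artifacts(demo))
```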

Migration From v3

If you have v3 artifacts:

unidep-upgrade-v4 <path>

Notes and Constraints

The key feature must be atomic (its tokenizer returns a single id, not a list).
Shared vocabularies must match for unions.
truncate=None marks a feature as atomic; list features must set truncate.
Feature supersedes the deprecated Job class.
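
The atomic versus list distinction can be illustrated with a small sketch. The helpers split_encode and finalize are hypothetical and not part of unitok, and the sketch assumes that truncation keeps the first ids of a list, which the notes above do not specify.

```python
from typing import Dict, List, Optional, Union


def split_encode(value: str, vocab: Dict[str, int], sep: str = ",") -> List[int]:
    """Encode a separator-joined string as a list of ids (a list feature)."""
    return [vocab[token] for token in value.split(sep) if token]


def finalize(ids: Union[int, List[int]],
             truncate: Optional[int]) -> Union[int, List[int]]:
    """Enforce the rule above: list features must set truncate.

    Assumption: truncation keeps the first `truncate` ids.
    """
    if isinstance(ids, list):
        if truncate is None:
            raise ValueError("list features must set truncate")
        return ids[:truncate]
    # A single id is an atomic feature and is returned unchanged.
    return ids


nid_vocab = {"N1": 0, "N2": 1, "N3": 2}

history = finalize(split_encode("N3,N1,N2", nid_vocab), truncate=2)
key = finalize(nid_vocab["N2"], truncate=None)

print(history)  # [2, 0]
print(key)      # 1
```

A key feature fits the atomic case: each row maps to exactly one id, so it can serve as a lookup key for unions.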

Repository Layout (High-Level)

unitok/ core library
UniTokv3/ legacy v3 code
dist/ built distributions
setup.py, requirements.txt

License

MIT License. See LICENSE.

Release history

4.4.4 (this version): Feb 4, 2026
4.4.3: Jul 1, 2025
4.4.2: Jun 25, 2025
4.4.1: Jun 25, 2025
4.4.0: Jun 25, 2025
4.3.9: Jun 14, 2025
4.3.8: Jun 1, 2025
4.3.7: Mar 27, 2025
4.3.6: Jan 30, 2025
4.3.5: Jan 30, 2025
4.3.4: Jan 30, 2025
4.3.3: Jan 24, 2025
4.3.2: Jan 18, 2025
4.3.1: Jan 12, 2025
4.3.0: Jan 12, 2025
4.2.5: Jan 7, 2025
4.0.3: Dec 26, 2024
4.0.0: Dec 23, 2024
3.5.3: Nov 24, 2024
3.5.2: Mar 25, 2024
3.5.1: Jan 15, 2024
3.5.0: Nov 19, 2023
3.4.9: Nov 19, 2023
3.4.8: Nov 4, 2023
3.4.5: Oct 17, 2023
3.4.0: Sep 14, 2023
3.3.8: Sep 9, 2023
3.2.2: Sep 4, 2023
3.1.9: Jul 10, 2023
3.1.7: Apr 21, 2023
3.1.6: Apr 21, 2023
3.1.5: Apr 21, 2023
3.1.4: Apr 20, 2023
3.1.3: Apr 20, 2023
3.1.2: Apr 20, 2023
3.1.1: Apr 18, 2023
3.1.0: Apr 18, 2023
3.0.13: Apr 17, 2023
3.0.12: Mar 28, 2023
3.0.11: Mar 27, 2023
2.4.3.2: Jan 12, 2023
2.4.2.4: Jan 11, 2023
2.4.2.3: Jan 10, 2023
2.4.2.1: Jan 8, 2023
2.4.1a0 (pre-release): Oct 25, 2022
2.3.1.2: Jul 14, 2022
2.3.1.1: Jun 20, 2022
2.3.1.0: Jun 15, 2022
2.3.0.3: May 26, 2022
2.3.0.2: May 7, 2022
2.2.2.5: Jan 23, 2022
2.2.2.4: Jan 17, 2022
2.2.2.3: Jan 17, 2022
2.2.2.2: Dec 11, 2021
2.2.2.0: Dec 9, 2021
2.2.1.4: Nov 30, 2021
2.2.1.3: Nov 25, 2021
2.2.1.2: Nov 25, 2021
2.2.1.1: Nov 25, 2021
2.2.0.3: Nov 18, 2021
2.2.0.2: Nov 18, 2021
2.2.0.1: Nov 16, 2021
2.2.0: Nov 16, 2021
2.1.9.4: Oct 22, 2021
2.1.9.3: Oct 1, 2021
2.1.9.2: Oct 1, 2021
2.1.9.1: Sep 29, 2021
2.1.9: Sep 29, 2021
2.1.8: Sep 29, 2021
2.1.7: Sep 28, 2021
2.1.3: Sep 28, 2021
0.0.4: Sep 12, 2021
0.0.3: Sep 12, 2021

Download files

Source Distribution

unitok-4.4.4.tar.gz (38.3 kB)

Uploaded Feb 4, 2026 Source

File details

Details for the file unitok-4.4.4.tar.gz.

File metadata

Download URL: unitok-4.4.4.tar.gz
Upload date: Feb 4, 2026
Size: 38.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.10.6

File hashes

Hashes for unitok-4.4.4.tar.gz
Algorithm	Hash digest
SHA256	`83dcbcac99959f7aec3d391c1a0b0d18f83adb213c5599cbc995ba1053cf1403`
MD5	`a5f7a273021e9085f15172e13ade6a47`
BLAKE2b-256	`6aef3d962ae5395f6405e70eed4bb6c4a9c9b982b0e5b4ebca981879f14ef399`

