A Python library for Unicode text processing, including punctuation handling and CJK character detection

Project description

uni-text

A lightweight Python library for Unicode text processing, including punctuation detection, punctuation removal, and CJK character detection.

Features

Simple API: Static methods for all text processing operations
Punctuation handling: Detect and remove punctuation marks with customizable rules
CJK support: Detect Chinese, Japanese, and Korean characters
Unicode-based: Uses Unicode general categories for accurate character classification
Zero dependencies: Uses only the Python standard library (unicodedata)
Type safety: Full type hints for better IDE support
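
To illustrate what category-based classification means, here is a minimal sketch using only the standard library. The helper name `looks_like_punctuation` is hypothetical and is not part of uni-text. It assumes that "punctuation" means any character whose Unicode general category starts with "P", which may differ from the rules the library actually applies.

```python
import unicodedata


def looks_like_punctuation(char: str) -> bool:
    """Hypothetical helper: True if the general category starts with 'P'."""
    if len(char) != 1:
        raise ValueError("expected a single character")
    return unicodedata.category(char).startswith("P")


# Print each character with its general category and the helper's verdict.
# "," and "。" are Po, "a" is Ll, "%" is Po, "+" is Sm (a symbol, not punctuation).
for ch in [",", "。", "a", "%", "+"]:
    print(repr(ch), unicodedata.category(ch), looks_like_punctuation(ch))
```

Note that "%", "-", and "'" all fall into punctuation categories, which is why a category-based remover needs explicit exceptions to keep them.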

Installation

pip install uni-text

Quick Start

Check if a character is punctuation

from uni_text import UniText

# Check if a character is punctuation
is_punc = UniText.is_punctuation(",")  # True
is_punc = UniText.is_punctuation("a")  # False

Remove punctuation from text

from uni_text import UniText

# Remove punctuation (preserves apostrophes in contractions, %, and -)
text = "Hello, world! Don't worry."
cleaned = UniText.remove_punctuations(text)
# Result: "Hello world Don't worry"

Remove punctuation in Chinese context

from uni_text import UniText

# Remove punctuation (also preserves decimal points)
text = "价格是 99.5 元，很便宜。"
cleaned = UniText.remove_punctuations_in_zh(text)
# Result: "价格是 99.5 元很便宜"

Remove consecutive punctuation

from uni_text import UniText

# Remove consecutive punctuation marks, keeping only the first
text = "你好，，，世界"
cleaned = UniText.remove_consecutive_punctuations(text)
# Result: "你好，世界"

Check if a code point is a CJK character

from uni_text import UniText

# Check if a Unicode code point is a CJK character
code_point = ord("中")
is_cjk = UniText.is_cjk_character(code_point)  # True

code_point = ord("A")
is_cjk = UniText.is_cjk_character(code_point)  # False

API Reference

UniText

Main utility class for Unicode text processing. All methods are static.

`UniText.is_punctuation(char)`

Check if a character is a punctuation mark.

Parameters:

char (str): Character to check (must be a single character).

Returns:

bool: True if the character is punctuation, False otherwise.

Example:

from uni_text import UniText

is_punc = UniText.is_punctuation(",")  # True
is_punc = UniText.is_punctuation("a")  # False

`UniText.remove_punctuations(text)`

Remove punctuation marks from text.

Uses each character's Unicode general category to identify punctuation, and removes it except in the following special cases:

Apostrophes (') in English contractions, e.g., "don't"
Percent signs (%)
Hyphens/dashes (-)

Parameters:

text (str): Original text (may contain punctuation).

Returns:

str: Text with punctuation removed (special characters preserved).

Example:

from uni_text import UniText

text = "Hello, world! Don't worry - it's 100% safe."
cleaned = UniText.remove_punctuations(text)
# Result: "Hello world Don't worry - it's 100% safe"
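
The special cases above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: the name `strip_punctuation` is hypothetical, "%" and "-" are assumed to be kept unconditionally, and an apostrophe is assumed to be kept only when it sits between two letters.

```python
import unicodedata

# Assumption: these characters are always preserved.
PRESERVED = {"%", "-"}


def strip_punctuation(text: str) -> str:
    """Hypothetical sketch: drop category-P characters except preserved cases."""
    kept = []
    for i, ch in enumerate(text):
        if not unicodedata.category(ch).startswith("P"):
            kept.append(ch)  # not punctuation: always keep
        elif ch in PRESERVED:
            kept.append(ch)  # percent sign or hyphen
        elif (
            ch == "'"
            and 0 < i < len(text) - 1
            and text[i - 1].isalpha()
            and text[i + 1].isalpha()
        ):
            kept.append(ch)  # apostrophe inside a word, e.g. "don't"
    return "".join(kept)


print(strip_punctuation("Hello, world! Don't worry - it's 100% safe."))
# Hello world Don't worry - it's 100% safe
```

Under these assumptions a leading or trailing apostrophe (as in quoted text) is removed, while one inside a contraction survives.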

`UniText.remove_punctuations_in_zh(text)`

Remove punctuation marks from text (Chinese context).

Similar to remove_punctuations(), but additionally preserves decimal points (.), alongside the apostrophes, percent signs, and hyphens that remove_punctuations() already keeps.

Parameters:

text (str): Original text (may contain punctuation).

Returns:

str: Text with punctuation removed (special characters preserved).

Example:

from uni_text import UniText

text = "价格是 99.5 元，很便宜。"
cleaned = UniText.remove_punctuations_in_zh(text)
# Result: "价格是 99.5 元很便宜"
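
The decimal-point rule can be sketched as follows. This is a hedged illustration, not the library's code: the name `strip_punctuation_keep_decimals` is hypothetical, and it assumes a "decimal point" is a "." with a digit on both sides. The other preserved characters are left out to keep the sketch short.

```python
import unicodedata


def strip_punctuation_keep_decimals(text: str) -> str:
    """Hypothetical sketch: remove category-P characters, keep '.' between digits."""
    kept = []
    for i, ch in enumerate(text):
        is_punct = unicodedata.category(ch).startswith("P")
        between_digits = (
            ch == "."
            and 0 < i < len(text) - 1
            and text[i - 1].isdigit()
            and text[i + 1].isdigit()
        )
        if not is_punct or between_digits:
            kept.append(ch)
    return "".join(kept)


print(strip_punctuation_keep_decimals("价格是 99.5 元，很便宜。"))
# 价格是 99.5 元很便宜
```

A sentence-final "." is still removed under this assumption, because it is not followed by a digit.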

`UniText.remove_consecutive_punctuations(text)`

Remove consecutive punctuation marks, keeping only the first one.

When several punctuation marks appear in a row (whether or not they are the same mark), only the first is kept and the rest of the run is removed.

Parameters:

text (str): Original text (may contain repeated punctuation).

Returns:

str: Text with consecutive punctuation removed.

Examples:

from uni_text import UniText

# Same punctuation repeated
text = "你好，，，世界"
cleaned = UniText.remove_consecutive_punctuations(text)
# Result: "你好，世界"

# Different punctuation marks
text = "测试！？。结束"
cleaned = UniText.remove_consecutive_punctuations(text)
# Result: "测试！结束"
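
The run-collapsing behavior described above can be sketched in a single pass. The function name `collapse_punctuation_runs` is hypothetical, and the sketch assumes punctuation means any character in a Unicode "P" category. The library's own rules may differ.

```python
import unicodedata


def collapse_punctuation_runs(text: str) -> str:
    """Hypothetical sketch: keep only the first mark of each punctuation run."""
    kept = []
    previous_was_punct = False
    for ch in text:
        is_punct = unicodedata.category(ch).startswith("P")
        # Skip a punctuation mark only when it directly follows another one.
        if not (is_punct and previous_was_punct):
            kept.append(ch)
        previous_was_punct = is_punct
    return "".join(kept)


print(collapse_punctuation_runs("你好，，，世界"))   # 你好，世界
print(collapse_punctuation_runs("测试！？。结束"))  # 测试！结束
```

In this sketch any non-punctuation character, including a space, ends a run, so "a, , b" is left unchanged.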

`UniText.is_cjk_character(code_point)`

Check if a Unicode code point is a CJK character.

CJK character ranges include:

CJK Unified Ideographs: 0x4E00-0x9FFF
CJK Extension A: 0x3400-0x4DBF
CJK Compatibility Ideographs: 0xF900-0xFAFF
Hiragana: 0x3040-0x309F
Katakana: 0x30A0-0x30FF
Hangul Syllables: 0xAC00-0xD7AF

Parameters:

code_point (int): Unicode code point, for example the value returned by ord() for a single character.

Returns:

bool: True if the code point is a CJK character, False otherwise.

Example:

from uni_text import UniText

# Chinese character
code_point = ord("中")
is_cjk = UniText.is_cjk_character(code_point)  # True

# Japanese hiragana
code_point = ord("あ")
is_cjk = UniText.is_cjk_character(code_point)  # True

# Korean character
code_point = ord("한")
is_cjk = UniText.is_cjk_character(code_point)  # True

# Latin character
code_point = ord("A")
is_cjk = UniText.is_cjk_character(code_point)  # False
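
The ranges listed above translate directly into a table lookup. The sketch below is an assumption about how such a check could be written, not the library's source. The names `CJK_RANGES`, `in_cjk_ranges`, and `contains_cjk` are hypothetical. The second helper shows one way to apply a code-point check to a whole string.

```python
# Inclusive (start, end) pairs copied from the ranges documented above.
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Extension A
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def in_cjk_ranges(code_point: int) -> bool:
    """Hypothetical helper: True if the code point falls in any listed range."""
    return any(start <= code_point <= end for start, end in CJK_RANGES)


def contains_cjk(text: str) -> bool:
    """Hypothetical helper: True if any character of text is in a listed range."""
    return any(in_cjk_ranges(ord(ch)) for ch in text)


print(in_cjk_ranges(ord("中")))       # True
print(in_cjk_ranges(ord("A")))        # False
print(contains_cjk("Hello, 世界"))    # True
```

CJK punctuation such as "。" (U+3002) lies outside these ranges, so this check reports it as non-CJK.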

Requirements

Python >= 3.10

No external dependencies required. This package uses only Python standard library modules (unicodedata).

License

MIT License

Project details

Release history

1.1.2 (Mar 8, 2026)
1.1.1 (Mar 5, 2026)
1.1.0 (Mar 5, 2026)
1.0.0 (Jan 11, 2026)
0.3.0 (Jan 3, 2026)
0.2.1 (Jan 3, 2026)
0.2.0 (Jan 3, 2026)
0.1.0 (Jan 3, 2026, this version)

Download files

Source Distribution

uni_text-0.1.0.tar.gz (5.2 kB view details)

Uploaded Jan 3, 2026 Source

Built Distribution

uni_text-0.1.0-py3-none-any.whl (5.8 kB view details)

Uploaded Jan 3, 2026 Python 3

File details

Details for the file uni_text-0.1.0.tar.gz.

File metadata

Download URL: uni_text-0.1.0.tar.gz
Upload date: Jan 3, 2026
Size: 5.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for uni_text-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0d9633a64f2ca9c7720ffcbea2f62fa5ff02001591248699d451d963a0b67a21`
MD5	`1559889813f02c8493933223adc6f64c`
BLAKE2b-256	`755a3dd8df8cb6982e0d7d952f17348b0b737ec6c2608243243a9a6c93ea4002`

File details

Details for the file uni_text-0.1.0-py3-none-any.whl.

File metadata

Download URL: uni_text-0.1.0-py3-none-any.whl
Upload date: Jan 3, 2026
Size: 5.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for uni_text-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`52870803a5660a6eb242fc85886a95f01edbc79f3da293df9e6adea9da26567b`
MD5	`73d8ebefcab5358e206eec2fadb2e867`
BLAKE2b-256	`c437c1b570d22f6d31174223241680944252510430c78bef2c3e39c495a8d50d`
