A Python library for Unicode text processing, including punctuation handling and CJK character detection
uni-text
A lightweight Python library for Unicode text processing, including punctuation detection, punctuation removal, and CJK character detection.
Features
- Simple API: Static methods for all text processing operations
- Punctuation handling: Detect and remove punctuation marks with customizable rules
- CJK support: Detect Chinese, Japanese, and Korean characters
- Unicode-based: Uses Unicode categories for accurate character classification
- Zero dependencies: Uses only the Python standard library (unicodedata)
- Type safety: Full type hints for better IDE support
Installation
pip install uni-text
Quick Start
Check if a character is punctuation
from uni_text import UniText
# Check if a character is punctuation
is_punc = UniText.is_punctuation(",") # True
is_punc = UniText.is_punctuation("a") # False
Remove punctuation from text
from uni_text import UniText
# Remove punctuation (preserves apostrophes in contractions, %, and -)
text = "Hello, world! Don't worry."
cleaned = UniText.remove_punctuations(text)
# Result: "Hello world Don't worry"
Remove punctuation in Chinese context
from uni_text import UniText
# Remove punctuation (also preserves decimal points)
text = "价格是 99.5 元,很便宜。"
cleaned = UniText.remove_punctuations_in_zh(text)
# Result: "价格是 99.5 元很便宜"
Remove consecutive punctuation
from uni_text import UniText
# Remove consecutive punctuation marks, keeping only the first
text = "你好,,,世界"
cleaned = UniText.remove_consecutive_punctuations(text)
# Result: "你好,世界"
Check if a code point is a CJK character
from uni_text import UniText
# Check if a Unicode code point is a CJK character
code_point = ord("中")
is_cjk = UniText.is_cjk_character(code_point) # True
code_point = ord("A")
is_cjk = UniText.is_cjk_character(code_point) # False
API Reference
UniText
Main utility class for Unicode text processing. All methods are static.
UniText.is_punctuation(char)
Check if a character is a punctuation mark.
Parameters:
- char (str): Character to check (must be a single character).
Returns:
- bool: True if the character is punctuation, False otherwise.
Example:
from uni_text import UniText
is_punc = UniText.is_punctuation(",") # True
is_punc = UniText.is_punctuation("a") # False
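For readers curious how a category-based check can work, here is a minimal sketch built only on the standard library's unicodedata module. It is not taken from the uni-text source: the function name looks_like_punctuation is hypothetical, and the assumption that "punctuation" means any Unicode general category starting with "P" may differ from what UniText.is_punctuation actually does.

```python
import unicodedata


def looks_like_punctuation(char: str) -> bool:
    """Return True if `char` falls in a Unicode punctuation category (P*).

    Hypothetical helper for illustration; not part of uni-text.
    """
    if len(char) != 1:
        raise ValueError("expected a single character")
    # General categories Pc, Pd, Ps, Pe, Pi, Pf and Po all start with "P".
    return unicodedata.category(char).startswith("P")


print(looks_like_punctuation(","))   # ASCII comma, category Po
print(looks_like_punctuation("。"))  # ideographic full stop, category Po
print(looks_like_punctuation("a"))   # lowercase letter, category Ll
```

Under this assumption, symbols such as "$" (category Sc) and "+" (category Sm) are not reported as punctuation, while "%", "-", and "'" are.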
UniText.remove_punctuations(text)
Remove punctuation marks from text.
Uses Unicode categories to identify punctuation, but preserves the following special cases:
- Apostrophes (') in English contractions, e.g., "don't"
- Percent signs (%)
- Hyphens/dashes (-)
Parameters:
- text (str): Original text (may contain punctuation).
Returns:
- str: Text with punctuation removed (special characters preserved).
Example:
from uni_text import UniText
text = "Hello, world! Don't worry - it's 100% safe."
cleaned = UniText.remove_punctuations(text)
# Result: "Hello world Don't worry - it's 100% safe"
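The preservation rules described for remove_punctuations() can be approximated as follows. This is a sketch under stated assumptions, not the library's implementation: strip_punctuation is a hypothetical name, punctuation is taken to mean Unicode categories starting with "P", and an apostrophe counts as part of a contraction only when it sits directly between two letters.

```python
import unicodedata


def strip_punctuation(text: str, keep: frozenset[str] = frozenset({"%", "-"})) -> str:
    """Drop punctuation, keeping `keep` characters and in-word apostrophes.

    Hypothetical helper for illustration; not part of uni-text.
    """
    kept = []
    for i, ch in enumerate(text):
        if not unicodedata.category(ch).startswith("P") or ch in keep:
            kept.append(ch)
        elif (
            ch == "'"
            and 0 < i < len(text) - 1
            and text[i - 1].isalpha()
            and text[i + 1].isalpha()
        ):
            # Assumed contraction rule: apostrophe flanked by letters.
            kept.append(ch)
    return "".join(kept)


print(strip_punctuation("Hello, world! Don't worry - it's 100% safe."))
# Hello world Don't worry - it's 100% safe
```

Whitespace is left untouched, so removing a mark that was surrounded by spaces leaves both spaces in place.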
UniText.remove_punctuations_in_zh(text)
Remove punctuation marks from text (Chinese context).
Behaves like remove_punctuations(), preserving the same special characters plus decimal points (.).
Parameters:
- text (str): Original text (may contain punctuation).
Returns:
- str: Text with punctuation removed (special characters preserved).
Example:
from uni_text import UniText
text = "价格是 99.5 元,很便宜。"
cleaned = UniText.remove_punctuations_in_zh(text)
# Result: "价格是 99.5 元很便宜"
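To illustrate the decimal-point rule on its own, the sketch below keeps a "." only when it sits between two digits and removes all other punctuation. The name strip_punctuation_keep_decimals is hypothetical, and the "between two digits" test is an assumption: the description above does not say whether the library preserves every "." or only those inside numbers. The other preserved characters (%, -, apostrophes) are left out here to keep the example short.

```python
import unicodedata


def strip_punctuation_keep_decimals(text: str) -> str:
    """Drop punctuation but keep a '.' that separates two digits.

    Hypothetical helper for illustration; not part of uni-text.
    """
    kept = []
    for i, ch in enumerate(text):
        is_punct = unicodedata.category(ch).startswith("P")
        is_decimal_point = (
            ch == "."
            and 0 < i < len(text) - 1
            and text[i - 1].isdigit()
            and text[i + 1].isdigit()
        )
        if not is_punct or is_decimal_point:
            kept.append(ch)
    return "".join(kept)


print(strip_punctuation_keep_decimals("价格是 99.5 元,很便宜。"))
# 价格是 99.5 元很便宜
```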
UniText.remove_consecutive_punctuations(text)
Remove consecutive punctuation marks, keeping only the first one.
When consecutive punctuation marks are encountered (regardless of whether they are the same), only the first one is kept, and all subsequent consecutive punctuation marks are removed.
Parameters:
- text (str): Original text (may contain repeated punctuation).
Returns:
- str: Text with consecutive punctuation removed.
Examples:
from uni_text import UniText
# Same punctuation repeated
text = "你好,,,世界"
cleaned = UniText.remove_consecutive_punctuations(text)
# Result: "你好,世界"
# Different punctuation marks
text = "测试!?。结束"
cleaned = UniText.remove_consecutive_punctuations(text)
# Result: "测试!结束"
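The "keep only the first mark in a run" behaviour can be sketched with a single pass that remembers whether the previous character was punctuation. This is an illustration, not the library's code: collapse_punctuation_runs is a hypothetical name, and punctuation is assumed to mean Unicode categories starting with "P".

```python
import unicodedata


def collapse_punctuation_runs(text: str) -> str:
    """Keep the first mark of every run of adjacent punctuation characters.

    Hypothetical helper for illustration; not part of uni-text.
    """
    kept = []
    previous_was_punct = False
    for ch in text:
        is_punct = unicodedata.category(ch).startswith("P")
        # Skip a mark only when it directly follows another mark.
        if not (is_punct and previous_was_punct):
            kept.append(ch)
        previous_was_punct = is_punct
    return "".join(kept)


print(collapse_punctuation_runs("你好,,,世界"))   # 你好,世界
print(collapse_punctuation_runs("测试!?。结束"))  # 测试!结束
```

In this sketch a space between two marks ends the run, so "a, ,b" is returned unchanged.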
UniText.is_cjk_character(code_point)
Check if a Unicode code point is a CJK character.
CJK character ranges include:
- CJK Unified Ideographs: 0x4E00-0x9FFF
- CJK Extension A: 0x3400-0x4DBF
- CJK Compatibility Ideographs: 0xF900-0xFAFF
- Hiragana: 0x3040-0x309F
- Katakana: 0x30A0-0x30FF
- Hangul Syllables: 0xAC00-0xD7AF
Parameters:
- code_point (int): Unicode code point (integer).
Returns:
- bool: True if the code point is a CJK character, False otherwise.
Example:
from uni_text import UniText
# Chinese character
code_point = ord("中")
is_cjk = UniText.is_cjk_character(code_point) # True
# Japanese hiragana
code_point = ord("あ")
is_cjk = UniText.is_cjk_character(code_point) # True
# Korean character
code_point = ord("한")
is_cjk = UniText.is_cjk_character(code_point) # True
# Latin character
code_point = ord("A")
is_cjk = UniText.is_cjk_character(code_point) # False
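The ranges listed for is_cjk_character() translate directly into a table lookup. The sketch below uses exactly those six ranges; the names CJK_RANGES and in_cjk_ranges are hypothetical, and the library's internal representation may differ.

```python
# Inclusive (start, end) code point ranges, copied from the list above.
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Extension A
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def in_cjk_ranges(code_point: int) -> bool:
    """Return True if `code_point` lies in one of the ranges above.

    Hypothetical helper for illustration; not part of uni-text.
    """
    return any(start <= code_point <= end for start, end in CJK_RANGES)


def contains_cjk(text: str) -> bool:
    """Return True if any character of `text` is in a listed range."""
    return any(in_cjk_ranges(ord(ch)) for ch in text)


print(contains_cjk("Hello 世界"))  # True
print(contains_cjk("Hello"))       # False
```

Because the check takes a code point rather than a string, callers convert each character with ord() first, as contains_cjk does. CJK punctuation such as "。" (U+3002) lies outside all six ranges and is therefore not reported as a CJK character by this sketch.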
Requirements
- Python >= 3.10
No external dependencies required. This package uses only Python standard library modules (unicodedata).
License
MIT License
Download files
File details
Details for the file uni_text-0.1.0.tar.gz.
File metadata
- Download URL: uni_text-0.1.0.tar.gz
- Upload date:
- Size: 5.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0d9633a64f2ca9c7720ffcbea2f62fa5ff02001591248699d451d963a0b67a21 |
| MD5 | 1559889813f02c8493933223adc6f64c |
| BLAKE2b-256 | 755a3dd8df8cb6982e0d7d952f17348b0b737ec6c2608243243a9a6c93ea4002 |
File details
Details for the file uni_text-0.1.0-py3-none-any.whl.
File metadata
- Download URL: uni_text-0.1.0-py3-none-any.whl
- Upload date:
- Size: 5.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 52870803a5660a6eb242fc85886a95f01edbc79f3da293df9e6adea9da26567b |
| MD5 | 73d8ebefcab5358e206eec2fadb2e867 |
| BLAKE2b-256 | c437c1b570d22f6d31174223241680944252510430c78bef2c3e39c495a8d50d |