Tool to analyze Unicode content
Unicode Stats
Fast analysis of Unicode symbols
Purpose: quickly detect hallucinations, such as stray CJK ideographs or Ukrainian text, in model outputs across various benchmarks
Fast and convenient wrapper for https://www.unicode.org/Public/UNIDATA/Blocks.txt
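To illustrate what a wrapper around Blocks.txt has to do, here is a minimal sketch of a block lookup. It is not the package's implementation: `SAMPLE_BLOCKS`, `parse_blocks` and `block_of` are hypothetical names, and the sample holds only a few lines in the `START..END; Block Name` format that Blocks.txt uses. Note that the standard file has a single `Cyrillic` block; the Russian/Ukrainian split shown in the usage examples is something this package adds on top.

```python
import bisect

# A small excerpt in the Blocks.txt format (hypothetical sample, not the full file).
SAMPLE_BLOCKS = """\
# Start Code..End Code; Block Name
0000..007F; Basic Latin
0300..036F; Combining Diacritical Marks
0400..04FF; Cyrillic
4E00..9FFF; CJK Unified Ideographs
"""


def parse_blocks(text):
    """Parse Blocks.txt-style lines into sorted (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, name = line.split(";", 1)
        start, end = span.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    blocks.sort()
    return blocks


def block_of(char, blocks):
    """Return the block name for a single character, or 'No_Block'."""
    code_point = ord(char)
    starts = [start for start, _, _ in blocks]
    # Find the last block whose start is <= code_point, then check its end.
    i = bisect.bisect_right(starts, code_point) - 1
    if i >= 0 and blocks[i][0] <= code_point <= blocks[i][1]:
        return blocks[i][2]
    return "No_Block"


blocks = parse_blocks(SAMPLE_BLOCKS)
for ch in ["l", "\u0445", "\u0301", "\u4e2d"]:
    print(repr(ch), block_of(ch, blocks))
```

Binary search over sorted block starts keeps each lookup logarithmic in the number of blocks, which matters when every character of a large benchmark file is classified.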
Usage
Extract Unicode Blocks / Detect Language
from unicode_stats import unicode_block_parser
# Get all symbols
example_text = 'краї́中land'
print(unicode_block_parser.get_stats(example_text))
# > {'Cyrillic (Russian)': {'n': 3, 'symbols': 'арк'}, 'Cyrillic (Ukranian)': {'n': 1, 'symbols': 'ї'}, 'Combining Diacritical Marks': {'n': 1, 'symbols': '́'}, 'CJK Unified Ideographs': {'n': 1, 'symbols': '中'}, 'Basic Latin': {'n': 4, 'symbols': 'lnda'}}
# Get main language
example_text = 'краї́'
print(unicode_block_parser.get_lang(example_text))
# > Cyrillic (Ukranian)
# Get all languages
example_text = 'краї́'
print(unicode_block_parser.get_lang(example_text, return_main_lang=False))
# > ru,Cyrillic (Ukranian),Combining Diacritical Marks
# Get the block of a single character
print(unicode_block_parser.get_single_block("х"))
# > Cyrillic (Russian)
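One way to use such statistics for hallucination detection is to compare the reported blocks against the set you expect. The sketch below assumes only the dictionary shape printed by `get_stats` above; `unexpected_blocks` and `allowed` are hypothetical names, and the `stats` literal is copied from the example output rather than produced by calling the package.

```python
def unexpected_blocks(stats, allowed):
    """Return {block: symbols} for every block that is not in `allowed`."""
    return {
        block: info["symbols"]
        for block, info in stats.items()
        if block not in allowed
    }


# Same shape as the get_stats output shown above (copied, not computed here).
stats = {
    "Cyrillic (Russian)": {"n": 3, "symbols": "арк"},
    "Cyrillic (Ukranian)": {"n": 1, "symbols": "ї"},
    "Combining Diacritical Marks": {"n": 1, "symbols": "\u0301"},
    "CJK Unified Ideographs": {"n": 1, "symbols": "中"},
    "Basic Latin": {"n": 4, "symbols": "lnda"},
}

# Assumption: a Russian-language benchmark where Latin characters are acceptable.
allowed = {"Cyrillic (Russian)", "Basic Latin"}

flagged = unexpected_blocks(stats, allowed)
print(flagged)
```

Any non-empty result marks the text for manual review. The block names must match the strings the package reports, including its spelling of `Cyrillic (Ukranian)`.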
Generate Statistics for JSONL Files
Python
from unicode_stats.aggregation import AggregatedUnicodeBlockParser
aggregated_parser = AggregatedUnicodeBlockParser(columns="qwen", max_lines=1)
aggregated_parser.get_stats("3model_cp.jsonl")
| | block | column | n | rate | symbols | n_symbols | rows | example_first | example_last |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Cyrillic (Russian) | qwen | 2161 | 0.777 | оеаи | 57234 | [0, 1 | Чтобы | Для р |
| 1 | Basic | qwen | 2778 | 1 | оеаи | 57234 | [0, 1 | Чтобы | Для р |
Bash
unicode_stats 3model_cp.jsonl --columns="qwen"
Aggregated statistics saved to 3model_cp.csv
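The aggregated columns can be read as follows, under an assumption inferred from the sample table rather than from the package source: `n` looks like the number of rows containing the block and `rate` like `n` divided by the total number of rows (2161 / 2778 ≈ 0.778). The sketch below reproduces that reading with the standard library only; `aggregate`, `block_of` and `RANGES` are hypothetical names, and the toy classifier knows just three blocks.

```python
import json
from collections import defaultdict

# Toy classifier with three standard Unicode block ranges (illustration only).
RANGES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]


def block_of(char):
    code_point = ord(char)
    for start, end, name in RANGES:
        if start <= code_point <= end:
            return name
    return "Other"


def aggregate(lines, column):
    """Aggregate per-block row counts and symbol counts for one JSONL column."""
    rows_with_block = defaultdict(set)
    symbol_counts = defaultdict(int)
    total_rows = 0
    for row_idx, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue
        total_rows += 1
        text = str(json.loads(line).get(column, ""))
        for char in text:
            block = block_of(char)
            rows_with_block[block].add(row_idx)
            symbol_counts[block] += 1
    return {
        block: {
            "n": len(rows),                # rows that contain the block
            "rate": len(rows) / total_rows,  # share of all rows (assumed meaning)
            "n_symbols": symbol_counts[block],
            "rows": sorted(rows),
        }
        for block, rows in rows_with_block.items()
    }


sample_lines = [
    '{"qwen": "Привет"}',
    '{"qwen": "hello 中"}',
    '{"qwen": "ok"}',
]
result = aggregate(sample_lines, "qwen")
for block, info in result.items():
    print(block, info)
```

In a real run the lines would come from the JSONL file, for example `open("3model_cp.jsonl", encoding="utf-8")`.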
Installation
pip install dist/unicode_stats-{version}-py3-none-any.whl
Building from Source
python -m build
If the tests fail, the package will not build.
Running Tests
pytest
File details
Details for the file unicode_stats-0.3.0.tar.gz.
File metadata
- Download URL: unicode_stats-0.3.0.tar.gz
- Upload date:
- Size: 17.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7ef87d5a2f44f010711e0c38b85fd17d453f4c6e6463080de895201eab291ed3 |
| MD5 | ecb7fcecf039bb746c534ea44420d605 |
| BLAKE2b-256 | 4f038490a9eae6a95c7ff79ac12fe98f54a6c5b60e9dbf29c1a759c87600d8d8 |
File details
Details for the file unicode_stats-0.3.0-py3-none-any.whl.
File metadata
- Download URL: unicode_stats-0.3.0-py3-none-any.whl
- Upload date:
- Size: 16.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba14f084c9fed69388a28909604da974cfaf5754cf2ece76b9affd4286c9cb03 |
| MD5 | aa686499ae17e63be81b8f049e057a22 |
| BLAKE2b-256 | b02bc805d640a4fe0843dfd0b1cacab5666c42a0e75b1d50ee9a83c16f841b97 |