Tool to analyze Unicode content

Unicode Stats

Fast analysis of Unicode symbols

Purpose: quickly detect hallucinations, such as unexpected hieroglyphs or Ukrainian-language text, in various benchmarks.

A fast and convenient wrapper around https://www.unicode.org/Public/UNIDATA/Blocks.txt
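
To illustrate what wrapping Blocks.txt involves, here is a minimal sketch that parses lines in the "START..END; Block Name" format and looks up a character's block with a binary search. It is not the package's implementation: the names parse_blocks and find_block are hypothetical, the embedded excerpt stands in for the real file, and the Russian/Ukrainian split inside the Cyrillic block that the package reports is not reproduced.

```python
import bisect

# A short excerpt in the Blocks.txt format ("START..END; Block Name").
SAMPLE_BLOCKS_TXT = """\
# Excerpt in the format of Blocks.txt
0000..007F; Basic Latin
0300..036F; Combining Diacritical Marks
0400..04FF; Cyrillic
4E00..9FFF; CJK Unified Ideographs
"""


def parse_blocks(text):
    """Return a sorted list of (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_range, name = line.split(";", 1)
        start, end = code_range.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    return sorted(blocks)


def find_block(blocks, char):
    """Binary-search the block containing `char`; None if no block matches."""
    starts = [b[0] for b in blocks]
    i = bisect.bisect_right(starts, ord(char)) - 1
    if i >= 0 and blocks[i][0] <= ord(char) <= blocks[i][1]:
        return blocks[i][2]
    return None


BLOCKS = parse_blocks(SAMPLE_BLOCKS_TXT)
print(find_block(BLOCKS, "к"))   # Cyrillic
print(find_block(BLOCKS, "中"))  # CJK Unified Ideographs
```

Keeping the ranges sorted makes each lookup logarithmic in the number of blocks, which matters when every character of a large benchmark is classified.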

Usage

Extract Unicode Blocks / Detect Language

from unicode_stats import unicode_block_parser

# Get all symbols
example_text = 'краї́中land'
print(unicode_block_parser.get_stats(example_text))
# > {'Cyrillic (Russian)': {'n': 3, 'symbols': 'арк'}, 'Cyrillic (Ukranian)': {'n': 1, 'symbols': 'ї'}, 'Combining Diacritical Marks': {'n': 1, 'symbols': '́'}, 'CJK Unified Ideographs': {'n': 1, 'symbols': '中'}, 'Basic Latin': {'n': 4, 'symbols': 'lnda'}}

# Get main language
example_text = 'краї́'
print(unicode_block_parser.get_lang(example_text))
# > Cyrillic (Ukranian)

# Get all languages
example_text = 'краї́'
print(unicode_block_parser.get_lang(example_text, return_main_lang=False))
# > ru,Cyrillic (Ukranian),Combining Diacritical Marks

# Get the block of a single symbol
print(unicode_block_parser.get_single_block("х"))
# > Cyrillic (Russian)
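
One way to use these results for the stated purpose is to compare the reported blocks against an allow-list. The sketch below assumes only the dictionary shape printed by get_stats above; the dictionary is hard-coded (with the block names spelled exactly as in that output) so the example runs without the package, and both the unexpected_blocks helper and the ALLOWED set are hypothetical.

```python
# Hard-coded copy of the get_stats() output shown above.
stats = {
    "Cyrillic (Russian)": {"n": 3, "symbols": "арк"},
    "Cyrillic (Ukranian)": {"n": 1, "symbols": "ї"},
    "Combining Diacritical Marks": {"n": 1, "symbols": "\u0301"},
    "CJK Unified Ideographs": {"n": 1, "symbols": "中"},
    "Basic Latin": {"n": 4, "symbols": "lnda"},
}

# Hypothetical allow-list for a Russian-language benchmark.
ALLOWED = {"Cyrillic (Russian)", "Basic Latin", "Combining Diacritical Marks"}


def unexpected_blocks(stats, allowed):
    """Return {block: symbols} for every block outside the allow-list."""
    return {
        block: info["symbols"]
        for block, info in stats.items()
        if block not in allowed
    }


print(unexpected_blocks(stats, ALLOWED))
# {'Cyrillic (Ukranian)': 'ї', 'CJK Unified Ideographs': '中'}
```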

Generate Statistics for JSONL Files

Python

from unicode_stats.aggregation import AggregatedUnicodeBlockParser
aggregated_parser = AggregatedUnicodeBlockParser(columns="qwen", max_lines=1)
aggregated_parser.get_stats("3model_cp.jsonl")
   block              column  n     rate   symbols  n_symbols  rows   example_first  example_last
0  Cyrilic (Russian)  qwen    2161  0.777  оеаи     57234      [0, 1  Чтобы          Для р
1  Basic              qwen    2778  1      оеаи     57234      [0, 1  Чтобы          Для р
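
The meaning of the n and rate columns is not spelled out here. As a rough illustration only, the sketch below assumes that n is the number of rows whose column text contains at least one symbol from a block and that rate is n divided by the total number of rows. It uses the standard library alone; column_rate, is_cyrillic, and the three-row sample file are hypothetical and not part of the package.

```python
import json
import os
import tempfile


def is_cyrillic(ch):
    # Cyrillic block range from Blocks.txt: 0400..04FF
    return 0x0400 <= ord(ch) <= 0x04FF


def column_rate(path, column, predicate):
    """Count rows whose `column` text contains a matching character.

    Returns (n, rate) with rate = n / total rows (an assumed definition).
    """
    n = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            total += 1
            text = str(json.loads(line).get(column, ""))
            if any(predicate(ch) for ch in text):
                n += 1
    return n, (n / total if total else 0.0)


# Hypothetical three-row JSONL file with a "qwen" column.
rows = [{"qwen": "Чтобы"}, {"qwen": "hello"}, {"qwen": "Для land"}]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for row in rows:
        tmp.write(json.dumps(row, ensure_ascii=False) + "\n")

n, rate = column_rate(tmp.name, "qwen", is_cyrillic)
os.remove(tmp.name)
print(n, round(rate, 3))  # 2 0.667
```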

Bash

unicode_stats 3model_cp.jsonl --columns="qwen"

Aggregated statistics saved to 3model_cp.csv

Installation

pip install dist/unicode_stats-{version}-py3-none-any.whl

Building from Source

python -m build

If the tests fail, the package will not build.

Running Tests

pytest
