
Tool to analyze Unicode content


Unicode Stats

Fast analysis of Unicode symbols

Purpose: quickly detect hallucinations, such as stray hieroglyphs or Ukrainian text, in outputs from various benchmarks

A fast and convenient wrapper around the Unicode block definitions in https://www.unicode.org/Public/UNIDATA/Blocks.txt
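
To make the idea of wrapping Blocks.txt concrete, here is a minimal sketch of how a block lookup over that file format could work. It is not the package's implementation: the names `parse_blocks` and `make_lookup` are hypothetical, the sample embeds only four lines in the Blocks.txt format ("start..end; block name"), and it does not split Cyrillic into Russian and Ukrainian the way the package's output does.

```python
import bisect
import re

# A few lines in the Blocks.txt format ("start..end; block name").
# Subset for illustration only; the real file lists every Unicode block.
SAMPLE_BLOCKS = """\
# comment lines start with '#'
0000..007F; Basic Latin
0300..036F; Combining Diacritical Marks
0400..04FF; Cyrillic
4E00..9FFF; CJK Unified Ideographs
"""

LINE_RE = re.compile(r"^([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+);\s*(.+)$")


def parse_blocks(text):
    """Return a sorted list of (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # comments and blank lines do not match
            start, end, name = match.groups()
            blocks.append((int(start, 16), int(end, 16), name.strip()))
    blocks.sort()
    return blocks


def make_lookup(blocks):
    """Build a function mapping one character to its block name."""
    starts = [start for start, _, _ in blocks]

    def lookup(char):
        code = ord(char)
        # Find the last block whose start is <= code, then check its end.
        index = bisect.bisect_right(starts, code) - 1
        if index >= 0 and blocks[index][0] <= code <= blocks[index][1]:
            return blocks[index][2]
        return "No_Block"

    return lookup


blocks = parse_blocks(SAMPLE_BLOCKS)
lookup = make_lookup(blocks)
print(lookup("中"))  # CJK Unified Ideographs
print(lookup("х"))   # Cyrillic
print(lookup("a"))   # Basic Latin
```

Because the ranges are sorted and do not overlap, a binary search over the start points finds the candidate block in logarithmic time, which is one plausible way to keep per-character lookups fast.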

Usage

Extract Unicode Blocks / Detect Language

from unicode_stats import unicode_block_parser

# Count the symbols found in each Unicode block
example_text = 'краї́中land'
print(unicode_block_parser.get_stats(example_text))
# > {'Cyrillic (Russian)': {'n': 3, 'symbols': 'арк'}, 'Cyrillic (Ukranian)': {'n': 1, 'symbols': 'ї'}, 'Combining Diacritical Marks': {'n': 1, 'symbols': '́'}, 'CJK Unified Ideographs': {'n': 1, 'symbols': '中'}, 'Basic Latin': {'n': 4, 'symbols': 'lnda'}}

# Get main language
example_text = 'краї́'
print(unicode_block_parser.get_lang(example_text))
# > Cyrillic (Ukranian)

# Get all languages
example_text = 'краї́'
print(unicode_block_parser.get_lang(example_text, return_main_lang=False))
# > ru,Cyrillic (Ukranian),Combining Diacritical Marks

# Get the block of a single character
print(unicode_block_parser.get_single_block("х"))
# > Cyrillic (Russian)
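
As an illustration of the stated purpose, the sketch below takes a dictionary shaped like the `get_stats` output shown above and reports the blocks that are not expected in a benchmark answer. The helper `unexpected_blocks` and the `allowed` set are hypothetical and not part of the package; the dictionary is copied from the example output rather than computed.

```python
# Dictionary copied from the get_stats example output above
# (block names are kept exactly as the package prints them).
stats = {
    "Cyrillic (Russian)": {"n": 3, "symbols": "арк"},
    "Cyrillic (Ukranian)": {"n": 1, "symbols": "ї"},
    "Combining Diacritical Marks": {"n": 1, "symbols": "\u0301"},
    "CJK Unified Ideographs": {"n": 1, "symbols": "中"},
    "Basic Latin": {"n": 4, "symbols": "lnda"},
}


def unexpected_blocks(stats, allowed):
    """Return {block: count} for every block outside the allowed set."""
    return {
        block: info["n"]
        for block, info in stats.items()
        if block not in allowed
    }


# Hypothetical policy: a Russian-language benchmark that tolerates Latin text.
allowed = {"Cyrillic (Russian)", "Basic Latin"}
flagged = unexpected_blocks(stats, allowed)
print(flagged)
```

A non-empty result marks the text for manual review; which blocks count as acceptable depends on the benchmark.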

Generate Statistics for JSONL Files

Python

from unicode_stats.aggregation import AggregatedUnicodeBlockParser

# Aggregate Unicode block statistics for the "qwen" column of a JSONL file
aggregated_parser = AggregatedUnicodeBlockParser(columns="qwen", max_lines=1)
aggregated_parser.get_stats("3model_cp.jsonl")
   block               column  n     rate   symbols  n_symbols  rows   example_first  example_last
0  Cyrillic (Russian)  qwen    2161  0.777  оеаи     57234      [0, 1  Чтобы          Для р
1  Basic               qwen    2778  1      оеаи     57234      [0, 1  Чтобы          Для р

Bash

unicode_stats 3model_cp.jsonl --columns="qwen"

Aggregated statistics saved to 3model_cp.csv
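
The aggregation step can be pictured with the following self-contained sketch. It rests on assumptions: `n` is read as the number of rows containing at least one symbol from a block and `rate` as `n` divided by the total number of rows (consistent with 2161 / 2778 ≈ 0.777 in the table above, but not confirmed by the package). The block table, `block_of`, and `aggregate` are hypothetical stand-ins, not the package's API.

```python
import io
import json
from collections import defaultdict

# Tiny hypothetical block table (start, end, name).
# The real package relies on the full Unicode block data.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]


def block_of(char):
    """Return the block name for one character, or "Other"."""
    code = ord(char)
    for start, end, name in BLOCKS:
        if start <= code <= end:
            return name
    return "Other"


def aggregate(lines, column):
    """Count, per block, the rows whose `column` text contains that block."""
    rows_with_block = defaultdict(list)
    total = 0
    for row_index, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        text = str(record.get(column, ""))
        for block in sorted({block_of(ch) for ch in text}):
            rows_with_block[block].append(row_index)
    return {
        block: {"n": len(rows), "rate": len(rows) / total, "rows": rows}
        for block, rows in rows_with_block.items()
    }


# In-memory stand-in for a JSONL file with a "qwen" column.
sample = io.StringIO(
    '{"qwen": "Чтобы start"}\n'
    '{"qwen": "plain ascii"}\n'
    '{"qwen": "Для 中"}\n'
)
result = aggregate(sample, column="qwen")
for block, info in result.items():
    print(block, info)
```

Tracking row indices alongside the counts makes it possible to jump straight to the offending records, which is presumably what the `rows` column in the table above is for.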

Installation

pip install dist/unicode_stats-{version}-py3-none-any.whl

Building from Source

python -m build

If the tests fail, the package will not build.

Running Tests

pytest
