Tool to analyze Unicode content

Project description

Unicode Stats

Fast analysis of Unicode symbols

Purpose: quickly detect hallucinated output in benchmark data, such as unexpected hieroglyphs (CJK ideographs) or Ukrainian text

Fast and convenient wrapper around the Unicode block ranges defined in https://www.unicode.org/Public/UNIDATA/Blocks.txt
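
To make the idea of "wrapping Blocks.txt" concrete, here is a minimal sketch of how a file in that format (`START..END; Block Name`) can be parsed and queried. It is not the package's implementation: `parse_blocks` and `block_of` are hypothetical names, and the embedded excerpt stands in for the full file.

```python
from bisect import bisect_right

# A short excerpt in the Blocks.txt format ("START..END; Block Name").
# The real file is much longer; this sample is for illustration only.
SAMPLE_BLOCKS_TXT = """\
# Blocks-style excerpt
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0300..036F; Combining Diacritical Marks
0400..04FF; Cyrillic
4E00..9FFF; CJK Unified Ideographs
"""


def parse_blocks(text):
    """Return a sorted list of (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_range, name = line.split(";", 1)
        start, end = code_range.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    return sorted(blocks)


BLOCKS = parse_blocks(SAMPLE_BLOCKS_TXT)
STARTS = [start for start, _, _ in BLOCKS]


def block_of(ch):
    """Hypothetical helper: name of the block containing ch, or None."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0 and BLOCKS[i][0] <= cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None


print(block_of("a"), "|", block_of("\u4e2d"))
```

Binary search over the sorted range starts keeps each lookup logarithmic in the number of blocks, which matters when scanning large benchmark files character by character.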

Usage

Extract Unicode Blocks / Detect Language

from unicode_stats import unicode_block_parser

# Get per-block statistics (count and symbols found) for all characters
example_text = 'краї́中land'
print(unicode_block_parser.get_stats(example_text))
# > {'Cyrillic (Russian)': {'n': 3, 'symbols': 'арк'}, 'Cyrillic (Ukranian)': {'n': 1, 'symbols': 'ї'}, 'Combining Diacritical Marks': {'n': 1, 'symbols': '́'}, 'CJK Unified Ideographs': {'n': 1, 'symbols': '中'}, 'Basic Latin': {'n': 4, 'symbols': 'lnda'}}

# Get main language
example_text = 'краї́'
print(unicode_block_parser.get_lang(example_text))
# > Cyrillic (Ukranian)

# Get all languages
example_text = 'краї́'
print(unicode_block_parser.get_lang(example_text, return_main_lang=False))
# > ru,Cyrillic (Ukranian),Combining Diacritical Marks

# Get the block of a single character
print(unicode_block_parser.get_single_block("х"))
# > Cyrillic (Russian)
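
The examples above group characters by block and split Cyrillic into finer labels. The following is a rough sketch of how statistics of that shape could be computed. It is not the library's code: the function names, the labels, the hard-coded ranges, and the rule "these eight letters count as Ukrainian-specific" are all assumptions made for illustration (some of these letters also occur in other languages).

```python
from collections import defaultdict

# Assumption: letters treated as Ukrainian-specific (є, і, ї, ґ and capitals).
# The package's real rule may differ.
UKRAINIAN_ONLY = {
    "\u0454", "\u0456", "\u0457", "\u0491",
    "\u0404", "\u0406", "\u0407", "\u0490",
}


def label(ch):
    """Hypothetical labelling of one character by block or language hint."""
    cp = ord(ch)
    if ch in UKRAINIAN_ONLY:
        return "Cyrillic (Ukrainian-specific)"
    if 0x0400 <= cp <= 0x04FF:
        return "Cyrillic"
    if 0x0300 <= cp <= 0x036F:
        return "Combining Diacritical Marks"
    if 0x4E00 <= cp <= 0x9FFF:
        return "CJK Unified Ideographs"
    if cp <= 0x7F:
        return "Basic Latin"
    return "Other"


def sketch_stats(text):
    """Group characters by label; n counts occurrences in this sketch."""
    groups = defaultdict(list)
    for ch in text:
        groups[label(ch)].append(ch)
    return {
        name: {"n": len(chars), "symbols": "".join(sorted(set(chars)))}
        for name, chars in groups.items()
    }


# Same characters as the example text above, written as escapes.
stats = sketch_stats("\u043a\u0440\u0430\u0457\u0301\u4e2dland")
for name, info in stats.items():
    print(name, info["n"])
```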

Generate Statistics for JSONL Files

Python

from unicode_stats.aggregation import AggregatedUnicodeBlockParser

# Analyze the "qwen" column of the JSONL file
aggregated_parser = AggregatedUnicodeBlockParser(columns="qwen", max_lines=1)
aggregated_parser.get_stats("3model_cp.jsonl")
   block               column  n     rate   symbols  n_symbols  rows   example_first  example_last
0  Cyrillic (Russian)  qwen    2161  0.777  оеаи     57234      [0, 1  Чтобы          Для р
1  Basic               qwen    2778  1      оеаи     57234      [0, 1  Чтобы          Для р
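
As a rough illustration of what a per-row aggregation over a JSONL column might look like, the sketch below counts, for each block, how many rows contain at least one character from it. This is an assumption about the meaning of `n`, `rate`, and `rows`, not a description of `AggregatedUnicodeBlockParser`; `coarse_block` and `aggregate_rows` are hypothetical helpers, and only two blocks are distinguished.

```python
import json
from collections import defaultdict


def coarse_block(ch):
    """Hypothetical, very coarse classifier used only for this sketch."""
    cp = ord(ch)
    if cp <= 0x7F:
        return "Basic Latin"
    if 0x0400 <= cp <= 0x04FF:
        return "Cyrillic"
    return "Other"


def aggregate_rows(lines, column):
    """For each block, list the rows whose `column` text contains it.

    Assumed semantics: n = number of such rows, rate = n / total rows.
    """
    records = [json.loads(line) for line in lines if line.strip()]
    rows_by_block = defaultdict(list)
    for row_idx, record in enumerate(records):
        text = str(record.get(column, ""))
        for block in sorted({coarse_block(ch) for ch in text}):
            rows_by_block[block].append(row_idx)
    total = len(records)
    return {
        block: {"n": len(rows), "rate": len(rows) / total, "rows": rows}
        for block, rows in rows_by_block.items()
    }


# Three in-memory JSONL lines stand in for a real file.
sample_lines = [
    json.dumps({"qwen": "\u0427\u0442\u043e"}),
    json.dumps({"qwen": "hello"}),
    json.dumps({"qwen": "\u0414\u043b\u044f test"}),
]
for block, info in aggregate_rows(sample_lines, "qwen").items():
    print(block, info["n"], round(info["rate"], 3), info["rows"])
```

To run this on a real file, pass an open file handle instead of the list, since iterating over a text file yields one line at a time.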

Bash

unicode_stats 3model_cp.jsonl --columns="qwen"

Aggregated statistics saved to 3model_cp.csv
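
Once the CSV exists, one way to use it for the stated purpose is to flag blocks that should not appear in a benchmark at all. The sketch below assumes the CSV has `block` and `rate` columns, as the table above suggests; the sample content, its values, and the `unexpected_blocks` helper are made up for illustration.

```python
import csv
import io

# Hypothetical CSV content with made-up values; the real file may differ.
SAMPLE_CSV = """\
block,column,n,rate
Cyrillic (Russian),qwen,8,0.8
Basic Latin,qwen,10,1.0
CJK Unified Ideographs,qwen,1,0.1
"""


def unexpected_blocks(csv_file, allowed):
    """Return (block, rate) pairs for blocks outside the allowed set."""
    reader = csv.DictReader(csv_file)
    return [
        (row["block"], float(row["rate"]))
        for row in reader
        if row["block"] not in allowed
    ]


flagged = unexpected_blocks(
    io.StringIO(SAMPLE_CSV),
    allowed={"Cyrillic (Russian)", "Basic Latin"},
)
print(flagged)
```

For a file on disk, replace the `io.StringIO` object with `open(path, newline="", encoding="utf-8")`.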

Installation

pip install dist/unicode_stats-{version}-py3-none-any.whl

Building from Source

python -m build

If the tests fail, the package will not build.

Running Tests

pytest

Download files

Source Distribution

unicode_stats-0.3.0.tar.gz (17.1 kB)

Uploaded Source

Built Distribution

unicode_stats-0.3.0-py3-none-any.whl (16.6 kB)

Uploaded Python 3
