Skip to main content

Text to Unicode code points breakdown

Project description

Downloads PyPI Coverage Status Code style: black

CLI UTF-8 decomposer for text analysis capable of displaying Unicode code point names and categories, along with ASCII control characters, UTF-16 surrogate pair pieces, invalid UTF-8 sequences parts as separate bytes, etc.

Motivation

A necessity for a tool that can quickly identify otherwise indistinguishable Unicode code points.

Installation

pipx install holms

Basic usage

Usage: holms [OPTIONS] [INPUT]

  Read data from INPUT file, find all valid UTF-8 byte sequences, decode them and display as
  separate Unicode code points. Use '-' as INPUT to read from stdin instead.
example001 example004 example002 example003
Plain text output
> holms -S -u -
1₂³⅘↉⏨
0  U+  31 ▕ 1 ▏Nd DIGIT ONE                   
1  U+2082 ▕ ₂ ▏No SUBSCRIPT TWO               
4  U+  B3 ▕ ³ ▏No SUPERSCRIPT THREE           
6  U+2158 ▕ ⅘ ▏No VULGAR FRACTION FOUR FIFTHS
9  U+2189 ▕ ↉ ▏No VULGAR FRACTION ZERO THIRDS
c  U+23E8 ▕ ⏨ ▏So DECIMAL EXPONENT SYMBOL

> holms -S -u -
aаͣāãâȧäåₐᵃa
00  U+  61 ▕ a ▏Ll LATIN SMALL LETTER A                 
01  U+ 430 ▕ а ▏Ll CYRILLIC SMALL LETTER A              
03  U+ 363 ▕  ͣ ▏Mn COMBINING LATIN SMALL LETTER A       
05  U+ 101 ▕ ā ▏Ll LATIN SMALL LETTER A WITH MACRON     
07  U+  E3 ▕ ã ▏Ll LATIN SMALL LETTER A WITH TILDE      
09  U+  E2 ▕ â ▏Ll LATIN SMALL LETTER A WITH CIRCUMFLEX
0b  U+ 227 ▕ ȧ ▏Ll LATIN SMALL LETTER A WITH DOT ABOVE  
0d  U+  E4 ▕ ä ▏Ll LATIN SMALL LETTER A WITH DIAERESIS  
0f  U+  E5 ▕ å ▏Ll LATIN SMALL LETTER A WITH RING ABOVE
11  U+2090 ▕ ₐ ▏Lm LATIN SUBSCRIPT SMALL LETTER A       
14  U+1D43 ▕ ᵃ ▏Lm MODIFIER LETTER SMALL A              
17  U+FF41 ▕a ▏Ll FULLWIDTH LATIN SMALL LETTER A

> holms -S -u -
🌯👄🤡🎈🐳🐍
00  U+1F32F ▕🌯 ▏So BURRITO          
04  U+1F444 ▕👄 ▏So MOUTH            
08  U+1F921 ▕🤡 ▏So CLOWN FACE       
0c  U+1F388 ▕🎈 ▏So BALLOON          
10  U+1F433 ▕🐳 ▏So SPOUTING WHALE   
14  U+1F40D ▕🐍 ▏So SNAKE

> holms -S -u -
%‰∞8᪲?¿‽⚠⚠️
00  U+  25 ▕ % ▏Po PERCENT SIGN           
01  U+2030 ▕ ‰ ▏Po PER MILLE SIGN         
04  U+221E ▕ ∞ ▏Sm INFINITY               
07  U+  38 ▕ 8 ▏Nd DIGIT EIGHT            
08  U+1AB2 ▕  ᪲ ▏Mn COMBINING INFINITY     
0b  U+  3F ▕ ? ▏Po QUESTION MARK          
0c  U+  BF ▕ ¿ ▏Po INVERTED QUESTION MARK
0e  U+203D ▕ ‽ ▏Po INTERROBANG            
11  U+26A0 ▕ ⚠ ▏So WARNING SIGN           
14  U+26A0 ▕ ⚠ ▏So WARNING SIGN           
17  U+FE0F ▕  ️ ▏Mn VARIATION SELECTOR-16

Buffering

The application works in two modes: buffered (the default if INPUT is a file) and unbuffered (default when reading from stdin). Options -b/-u explicitly override output mode regardless of the default setting.

In buffered mode the result begins to appear only after EOF is encountered (i.e., the WHOLE file has been read to the buffer). This is suitable for short and predictable inputs and produces the most compact output with fixed column sizes.

The unbuffered mode comes in handy when input is an endless piped stream: the results will be displayed in real time, as soon as the type of each byte sequence is determined, but the output column widths are not fixed and can vary as the process goes further.

Despite the name, the app actually uses tiny (4 bytes) input buffer, but it's the only way to handle UTF-8 stream and distinguish valid sequences from broken ones; in truly unbuffered mode the output would consist of ASCII-7 characters (0x00-0x7F) and unrecognized binary data (0x80-0xFF) only, which is not something the application was made for.

Configuration / Advanced usage

Options:
  -b, --buffered / -u, --unbuffered
                                  Explicitly set to wait for EOF before processing the output
                                  (buffered), or to stream the results in parallel with reading, as
                                  soon as possible (unbuffered). See BUFFERING section above for the
                                  details.
  -m, --merge                     Replace all sequences of repeating characters with one of each,
                                  together with initial length of the sequence.
  -g, --group                     Group the input by code points (=count unique), sort descending
                                  and display counts instead of normal output. Implies '--merge' and
                                  forces buffered mode. Specifying the option twice ('-gg') results
                                  in grouping by code point category instead, while doing it thrice
                                  ('-ggg') makes the app group the input by super categories.
  -f, --format                    Comma-separated list of columns to show (order is preserved). Run
                                  'holms --legend' to see the details.
  -F, --full                      Display ALL columns.
  -S, --static                    Do not shrink columns by collapsing the prefix when possible.
  -c, --color / -C, --no-color    Explicitly turn colored results on or off; if not specified, will
                                  be selected automatically depending on the type and capabilities
                                  of receiving device (e.g. colors will be enabled for a terminal
                                  emulator and disabled for piped/redirected output).
  --decimal                       Use decimal byte offsets instead of hexadecimal.
  -L, --legend                    Show detailed info on an output format and code point category
                                  chromacoding, and exit.
  -V, --version                   Show the version and exit.
  -?, --help                      Show this message and exit. 

Examples

Output column selection

Option -f/--filter can be used to specify what columns to display. As an alternative, there is an -F/--full option that enables displaying of all currently available columns.

Column availability depending on operating mode
example010

Also -m/--merge option is demonstrated, which tells the app to collapse repetitive characters into one line of the output while counting them:

example005
Plain text output
> holms -m -S phpstan.txt
 
000  U+2B ▕ + ▏    Sm PLUS SIGN               
001+ U+2D ▕ - ▏27× Pd HYPHEN-MINUS            
01c  U+2B ▕ + ▏    Sm PLUS SIGN               
01d  U+20 ▕ ␣ ▏    Zs SPACE                   
01e  U+2B ▕ + ▏    Sm PLUS SIGN               
01f+ U+2D ▕ - ▏27× Pd HYPHEN-MINUS            
03a  U+2B ▕ + ▏    Sm PLUS SIGN               
03b  U+ A ▕ ↵ ▏    Cc ASCII C0 [LF] LINE FEED 
03c  U+7C ▕ | ▏    Sm VERTICAL LINE           
03d+ U+20 ▕ ␣ ▏27× Zs SPACE                   
...

Reading from pipeline

There is an official Unicode Consortium data file included in the repository for test purposes, named confusables.txt. In the next example we extract line #3620 using sed, delete all TAB (0x08) characters and feed the result to the application. The result demonstrates various Unicode dot/bullet code points:

example006
Plain text output
> sed confusables.txt -Ee 'sg' -e '3620!d' |
    holms -S -
00  U+   B7 ▕ · ▏Po MIDDLE DOT                          
02  U+ 1427 ▕ ᐧ ▏Lo CANADIAN SYLLABICS FINAL MIDDLE DOT
05  U+  387 ▕ · ▏Po GREEK ANO TELEIA                    
07  U+ 2022 ▕ • ▏Po BULLET                              
0a  U+ 2027 ▕ ‧ ▏Po HYPHENATION POINT                   
0d  U+ 2219 ▕ ∙ ▏Sm BULLET OPERATOR                     
10  U+ 22C5 ▕ ⋅ ▏Sm DOT OPERATOR                        
13  U+ 30FB ▕・ ▏Po KATAKANA MIDDLE DOT                 
16  U+10101 ▕ 𐄁 ▏Po AEGEAN WORD SEPARATOR DOT           
1a  U+ FF65 ▕ ・ ▏Po HALFWIDTH KATAKANA MIDDLE DOT       
1d  U+    A ▕ ↵ ▏Cc ASCII C0 [LF] LINE FEED

Code points / categories statistics

-g/--group option can be used to count unique code points, and to compute the occurrence rate of each one:

example008
Plain text output
> holms -g -S ./tests/data/confusables.txt
   
U+   20 ▕ ␣ ▏   13% ▍   62732× Zs SPACE                                                                               
U+    9 ▕ ⇥ ▏  7.3% ▏   36745× Cc ASCII C0 [HT] HORIZONTAL TABULATION                                                 
U+   41 ▕ A ▏  6.1% ▏   30555× Lu LATIN CAPITAL LETTER A                                                              
U+   49 ▕ I ▏  5.2% ▏   26063× Lu LATIN CAPITAL LETTER I                                                              
U+   45 ▕ E ▏  5.0% ▏   24992× Lu LATIN CAPITAL LETTER E                                                              
U+   54 ▕ T ▏  3.7%     18776× Lu LATIN CAPITAL LETTER T                                                              
U+   4C ▕ L ▏  3.7%     18763× Lu LATIN CAPITAL LETTER L                                                              
U+ 200E ▕   ▏  3.7%     18494× Cf LEFT-TO-RIGHT MARK                                                                  
U+    A ▕ ↵ ▏  2.9%     14609× Cc ASCII C0 [LF] LINE FEED                                                             
U+   43 ▕ C ▏  2.9%     14450× Lu LATIN CAPITAL LETTER C                                                              
...

When used twice (-gg) or thrice (-ggg), the application groups the input by code point category or code point super category, respectively, which can be used e.g. for frequency domain analysis:

example011 example012
Plain text output
> holms -gg -S ./tests/data/confusables.txt
 
  53% █████▎     266233× Lu Uppercase_Letter      
  13% █▎          62748× Zs Space_Separator       
  10% █           51356× Cc Control               
 8.5% ▊           42511× Nd Decimal_Number        
 3.7% ▎           18497× Cf Format                
 3.0% ▎           14832× Lo Other_Letter          
 2.0% ▏            9778× Sm Math_Symbol           
 1.8% ▏            9261× Pe Close_Punctuation     
 1.8% ▏            9259× Ps Open_Punctuation      
 1.5% ▏            7525× Po Other_Punctuation     
...

> holms -ggg -S ./tests/data/confusables.txt
 
  57% █████▋     284074× L Letter      
  14% █▍          69853× C Other       
  13% █▎          62750× Z Separator   
 8.5% ▊           42796× N Number      
 5.9% ▌           29571× P Punctuation 
 2.2% ▏           11072× S Symbol      
 0.2%               965× M Mark        

In-place type highlighting

When --format is specified exactly as a single char column: --format=char, the application omits all the columns and prints the original file contents, while highligting each character with a color that indicates its' Unicode category. Note that ASCII control codes, as well as Unicode ones, are kept untouched and invisible.

example007
Plain text output
> sed chars.txt -nEe 150,159p |
  holms --format=char -S -
‰ ‱ ′ ″ ‴ ‵ ‶ ‷ ‸ ‹ › ※ ‼ ‽ ‾ ‿
⁀ ⁁ ⁂ ⁃ ⁄ ⁅ ⁆ ⁇ ⁈ ⁉ ⁊ ⁋ ⁌ ⁍ ⁎ ⁏
⁐ ⁑ ⁒ ⁓ ⁔ ⁕ ⁖ ⁗ ⁘ ⁙ ⁚ ⁛ ⁜ ⁝ ⁞
⁰ ⁱ   ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ⁺ ⁻ ⁼ ⁽ ⁾ ⁿ
₀ ₁ ₂ ₃ ₄ ₅ ₆ ₇ ₈ ₉ ₊ ₋ ₌ ₍ ₎
ₐ ₑ ₒ ₓ ₔ ₕ ₖ ₗ ₘ ₙ ₚ ₛ ₜ
      ₣ ₤         ₩ ₪ ₫ €
  ₱ ₲ ₳   ₵       ₹ ₺     ₽   ₿
  ⃐   ⃑   ⃒   ⃓   ⃔   ⃕   ⃖   ⃗   ⃘   ⃙   ⃚   ⃛   ⃜  ⃝   ⃞   ⃟
  ⃠   ⃡  ⃢   ⃣   ⃤   ⃥   ⃦   ⃧    ⃨   ⃩   ⃪   ⃫   ⃬   ⃭   ⃮   ⃯

ASCII C0 / C1 details

While developing the application I encountered strange (as it seemed to be at the beginning) behaviour of Python interpreter, which encoded C1 control bytes as two bytes of UTF-8, while C0 control bytes were displayed as sole bytes, like it would have been encoded in a plain ASCII. Then there was a bit of researching done.

According to ISO/IEC 6429 (ECMA-48), there are two types of ASCII control codes (to be precise, much more, but for our purposes it's mostly irrelevant) — C0 and C1. The first one includes ASCII code points 0x00-0x1F and 0x7F (some authors also include a regular space character 0x20 in this list), and the characteristic property of this type is that all C0 code points are encoded in UTF-8 exactly the same as they do in 7-bit US-ASCII (ISO/IEC 646). This helps to disambiguate exactly what type of encoding is used even for broken byte sequences, considering the task is to tell if a byte represents sole code point or is actually a part of multibyte UTF-8 sequence.

However, C1 control codes are represented by 0x80-0x9F bytes, which also are valid bytes for multibyte UTF-8 sequences. In order to distinguish the first type from the second UTF-8 encodes them as two-byte sequences instead (0x800xC280, etc.); also this applies not only to control codes, but to all other ISO/IEC 8859 code points starting from 0x80.

With this in mind, let's see how the application reflects these differences. First command produces several 8-bit ASCII C1 control codes, which are classified as raw binary/non-UTF-8 data, while the second command's output consists of the very same code points but being encoded in UTF-8 (thanks to Python's full transparent Unicode support, we don't even need to bother much about the encodings and such):

example013
Plain text output
> printf '\x80\x90\x9f' |
holms --format=raw,number,char,type,name -S -
0x       80      --  ▕ ▯ ▏-- NON UTF-8 BYTE 0x80
0x       90      --  ▕ ▯ ▏-- NON UTF-8 BYTE 0x90
0x       9f      --  ▕ ▯ ▏-- NON UTF-8 BYTE 0x9F
> python -c 'print("\x80\x90\x9f", end="")' |
    holms --format=raw,number,char,type,name -S -
0x    c2 80 U+    80 ▕ ▯ ▏Cc ASCII C1 [PC] PADDING CHARACTER
0x    c2 90 U+    90 ▕ ▯ ▏Cc ASCII C1 [DCS] DEVICE CONTROL STRING
0x    c2 9f U+    9F ▕ ▯ ▏Cc ASCII C1 [APC] APPLICATION PROGRAM COMMAND

Legend

The image below illustrates the color scheme developed for the app specifically, to simplify distinguishing code points of one category from others.

example009

Most frequently encountering control codes also have a unique character replacements, which allows to recognize them without reading the label or memorizing code point identifiers:

example014

Changelog

CHANGES.rst

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

holms-1.0.1.tar.gz (176.2 kB view hashes)

Uploaded Source

Built Distribution

holms-1.0.1-py3-none-any.whl (24.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page