
Text to Unicode code points breakdown

Project description



@ DESC

Motivation

@ DESC

Installation

pipx install holms

Basic usage

@ IMG

Configuration / Advanced usage

Usage: holms [OPTIONS] FILE

  Read data from FILE, find all valid UTF-8 byte sequences, decode them and display as separate Unicode code points.
  Use '-' as FILE to read from stdin instead.

Options:
  -f, --format [offset|number|char|count|category|name]
                                  Comma-separated list of columns to show. The order of items determines the order of
                                  columns in the output. Default is to show all columns in the order specified above.
                                  Note that 'count' column is visible only when '-s' is specified. 'number' is the ID
                                  of code point (U+xxxx).
  -u, --unbuffered                Start streaming the result as soon as possible, do not read the whole input
                                  preemptively. See BUFFERING paragraph above for the details.
  -s, --squash                    Replace all sequences of repeating characters with the first character from each,
                                  followed by a length of the sequence.
  --decimal                       Use decimal offsets instead of hexadecimal.
  -V, --version                   Show the version and exit.
  --help                          Show this message and exit.
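
The breakdown and the '-s' squashing described in the help text can be approximated with the Python standard library alone. The following is a minimal sketch, not the actual holms implementation: the function names `breakdown` and `squash` are hypothetical, the input is assumed to be fully valid UTF-8, and offsets are assumed to be byte offsets.

```python
import unicodedata
from itertools import groupby


def breakdown(data: bytes):
    """Yield (offset, number, char, category, name) for each code point.

    Assumption: `data` is valid UTF-8 and `offset` counts bytes.
    """
    offset = 0
    for ch in data.decode("utf-8"):
        yield (
            offset,
            f"U+{ord(ch):04X}",                 # code point ID
            ch,
            unicodedata.category(ch),           # e.g. 'Ll', 'Sc'
            unicodedata.name(ch, "<unnamed>"),  # control chars have no name
        )
        offset += len(ch.encode("utf-8"))


def squash(text: str):
    """Collapse runs of a repeated character into (char, run_length) pairs."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(text)]


if __name__ == "__main__":
    for row in breakdown("aé€".encode("utf-8")):
        print(*row)
    print(squash("aaab  c"))
```

Each row carries the same kinds of values as the 'offset', 'number', 'char', 'category' and 'name' columns; the run length returned by `squash` plays the role of the 'count' column.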

Examples

Buffering

The application works in two modes: buffered (the default) and unbuffered.

In buffered mode the result begins to appear only after EOF is encountered. This suits relatively short and predictable inputs (e.g. a file) and produces the most compact output, because all column widths are known from the start.
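
To illustrate why reading everything first gives the most compact output, here is a small sketch of width-fitted column rendering. It is an illustration only, not the holms code: `render_buffered` is a hypothetical helper, and all rows are assumed to have the same number of cells.

```python
def render_buffered(rows):
    """Pad each column to the width of its longest cell.

    This is only possible once every row is known, i.e. after EOF.
    """
    rows = [[str(cell) for cell in row] for row in rows]
    if not rows:
        return []
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return [
        " ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    ]


if __name__ == "__main__":
    for line in render_buffered([("0", "U+61", "a"), ("10", "U+20AC", "€")]):
        print(line)
```

A streaming renderer cannot compute `widths` in advance, so it would have to reserve a fixed, worst-case width for every column.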

When the input is not a file and can continue indefinitely (e.g. a piped stream), unbuffered mode comes in handy: the application prints results in real time, as soon as the type of each byte sequence is determined.

Despite the name, unbuffered mode still uses a tiny input buffer (4 bytes), because that is the only way to handle a UTF-8 stream and distinguish valid sequences from broken ones. In a truly unbuffered mode the output would consist only of 7-bit ASCII characters (0x00-0x7F) and unrecognized binary data (0x80-0xFF), which is not what the application was made for.
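
The need for a 4-byte buffer follows from UTF-8 itself: a single code point occupies at most 4 bytes, so a decoder has to hold a sequence until it is either complete or provably broken. The sketch below shows one way to do this and is an assumption about the general technique, not a copy of the holms internals; `expected_length`, `decode_stream` and `_drain` are hypothetical names.

```python
def expected_length(first: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if first < 0x80:
        return 1
    if 0xC2 <= first <= 0xDF:
        return 2
    if 0xE0 <= first <= 0xEF:
        return 3
    if 0xF0 <= first <= 0xF4:
        return 4
    return 0  # stray continuation byte or a byte never valid in UTF-8


def _drain(buf: bytearray, final: bool):
    """Emit everything in `buf` that can already be classified."""
    while buf:
        need = expected_length(buf[0])
        if need == 0:
            yield ("raw", buf.pop(0))
            continue
        if len(buf) < need:
            broken = any(b & 0xC0 != 0x80 for b in buf[1:])
            if broken or final:
                yield ("raw", buf.pop(0))
                continue
            return  # incomplete but still plausible: wait for more input
        try:
            ch = bytes(buf[:need]).decode("utf-8")
        except UnicodeDecodeError:  # overlong form, surrogate, bad tail byte
            yield ("raw", buf.pop(0))
            continue
        del buf[:need]
        yield ("char", ch)


def decode_stream(chunks):
    """Yield ('char', str) or ('raw', int) items as soon as they are known."""
    buf = bytearray()  # never grows beyond 4 bytes
    for chunk in chunks:
        for byte in chunk:
            buf.append(byte)
            yield from _drain(buf, final=False)
    yield from _drain(buf, final=True)


if __name__ == "__main__":
    for item in decode_stream([b"a\xe2\x82", b"\xac\xff\xc3"]):
        print(item)
```

In this sketch the euro sign is split across two chunks and is still decoded as one character, while the stray 0xFF and the truncated 0xC3 at the end of the input are reported as raw bytes.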

Changelog

@ WIP

Download files

Source Distribution

holms-0.5.0.tar.gz (40.9 kB)

Built Distribution

holms-0.5.0-py3-none-any.whl (14.0 kB)
