hexdump for your unicode data
A Unicode codepoint dump.
Think of it as hexdump(1) for Unicode. The command analyses the input and
then prints three columns: the byte offset of the first codepoint in each
row, the codepoints in hex notation, and finally the raw input characters
with control and whitespace characters replaced by a dot.
Invalid byte sequences are represented with an “X” and with the hex value
enclosed in question marks, e.g., “?F5?”.
You can pipe in data from stdin, select several files at once, or even mix
all those input methods together.
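The three-column layout described above can be approximated in a few lines of Python. This is only a minimal sketch under stated assumptions, not unidump's actual implementation: the names `decode_units`, `dump` and `per_row` are hypothetical, the column spacing is arbitrary, and "control and whitespace" is assumed to mean every Unicode general category starting with "C" or "Z".

```python
import unicodedata


def decode_units(data: bytes):
    """Yield (byte_offset, hex_repr, display_char) for each decoded unit."""
    pos = 0
    while pos < len(data):
        # A UTF-8 sequence is at most 4 bytes long; try the shortest first.
        for width in range(1, 5):
            try:
                char = data[pos:pos + width].decode("utf-8")
            except UnicodeDecodeError:
                continue
            # Assumption: categories C* and Z* are rendered as a dot.
            category = unicodedata.category(char)
            shown = "." if category[0] in "CZ" else char
            yield pos, f"{ord(char):04X}", shown
            pos += width
            break
        else:
            # No valid sequence starts here: report the single raw byte.
            yield pos, f"?{data[pos]:02X}?", "X"
            pos += 1


def dump(data: bytes, per_row: int = 16):
    """Return one formatted line per row of `per_row` decoded units."""
    units = list(decode_units(data))
    lines = []
    for start in range(0, len(units), per_row):
        row = units[start:start + per_row]
        offset = row[0][0]  # byte offset of the first unit in this row
        hexes = " ".join(unit[1] for unit in row)
        shown = "".join(unit[2] for unit in row)
        lines.append(f"{offset:7d}    {hexes}    {shown}")
    return lines


if __name__ == "__main__":
    for line in dump(b"ABCDEFGHIJKLMNOP", per_row=4):
        print(line)
```

Trying the shortest slice first keeps the sketch simple: a longer slice is only attempted when the shorter one fails to decode, so each successful decode yields exactly one codepoint.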
Examples:
* Basic usage with stdin:
echo -n 'ABCDEFGHIJKLMNOP' | unidump -n 4
0 0041 0042 0043 0044 ABCD
4 0045 0046 0047 0048 EFGH
8 0049 004A 004B 004C IJKL
12 004D 004E 004F 0050 MNOP
* Dump the code points translated from another encoding:
unidump -c latin-1 some-legacy-file
* Dump many files at the same time:
unidump foo-*.txt
* Control characters and whitespace are safely rendered:
echo -n -e '\x01' | unidump -n 1
0 0001 .
* Finally learn what your favorite Emoji is composed of:
( echo -n -e '\xf0\x9f\xa7\x9d\xf0\x9f\x8f\xbd\xe2' ; \
echo -n -e '\x80\x8d\xe2\x99\x82\xef\xb8\x8f' ; ) | \
unidump -n 5
0 1F9DD 1F3FD 200D 2642 FE0F .🏽.♂️
See <http://emojipedia.org/man-elf-medium-skin-tone/> for images. The “elf”
emoji (the first character) is replaced with a dot here, because the current
version of Python’s unicodedata doesn’t know of this character yet.
* Use it like strings(1):
unidump -e '{data}' some-file.bin
This will replace every unknown byte from the input file with “X” and every
control and whitespace character with “.”.
* Only print the code points of the input:
unidump -e '{repr}'$'\n' -n 1 some-file.txt
This results in a stream of codepoints in hex notation, each on a new line,
without the byte counter or a rendering of the actual data. You can use this
to count the number of codepoints (as opposed to raw bytes) in a file, if you
pipe it through `wc -l`.
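To make the difference between raw bytes and codepoints concrete, here is a small Python sketch that counts both. It is independent of unidump, the helper name `count_bytes_and_codepoints` is hypothetical, and it assumes the input is valid in the given encoding (strict decoding raises an error otherwise). The sample bytes are the emoji sequence from the example above.

```python
def count_bytes_and_codepoints(data: bytes, encoding: str = "utf-8"):
    """Return (number of raw bytes, number of codepoints) for `data`."""
    # Strict decoding: invalid input raises UnicodeDecodeError.
    text = data.decode(encoding)
    return len(data), len(text)


# The "man elf, medium skin tone" sequence from the emoji example.
emoji = (
    b"\xf0\x9f\xa7\x9d\xf0\x9f\x8f\xbd\xe2"
    b"\x80\x8d\xe2\x99\x82\xef\xb8\x8f"
)

if __name__ == "__main__":
    n_bytes, n_codepoints = count_bytes_and_codepoints(emoji)
    print(f"{n_bytes} bytes, {n_codepoints} codepoints")
```

Note that a codepoint count is still not a count of what a reader perceives as characters: the whole sequence above is displayed as a single emoji.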