Remove non-UTF8 bytes from an input file and write a cleaned-up version
Project description
utf8cleaner
Read a file byte-by-byte and strip any byte sequences that are not valid UTF8
Installation
pip install utf8cleaner
Note: requires Python 3
Usage
utf8cleaner --input FILENAME
Will read FILENAME and write the cleaned output to FILENAME.clean
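For readers who want the same effect from inside a script, the following is a minimal sketch of the read-then-write behaviour described above. It is not the utf8cleaner source: the function name `clean_file` is hypothetical, and it relies on Python's built-in `errors="ignore"` decoding rather than a hand-written byte loop.

```python
from pathlib import Path


def clean_file(filename: str) -> Path:
    """Write FILENAME.clean with invalid UTF-8 bytes dropped (hypothetical helper)."""
    source = Path(filename)
    target = Path(filename + ".clean")
    raw = source.read_bytes()
    # errors="ignore" silently drops bytes that cannot be decoded as UTF-8;
    # correctly encoded multi-byte sequences pass through unchanged.
    cleaned = raw.decode("utf-8", errors="ignore")
    target.write_bytes(cleaned.encode("utf-8"))
    return target
```

Decoding the whole buffer in one call lets the decoder see complete multi-byte sequences, so a valid `0xC2 0xA9` pair is kept while a lone `0xA9` is removed.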
Why would I want to do this?
Sometimes when exporting and importing data, there are byte sequences that prevent data being imported. To fix this you would otherwise have to do one or more of:
- Manually edit the source data in its native application (eg backspace over invisible characters in JIRA fields)
- Edit the file with a hex editor and look for known-bad values (eg copyright symbol)
- Do something smart with perl/vi/sed to find and replace known bad byte patterns
This simple utility fixes these problems in one hit.
Where do these strange characters come from?
Number one culprit: copy and paste from Outlook. This often introduces invisible whitespace errors (spaces that are not spaces...) along with "pretty" quotes, etc.
Other sources include copying and pasting from files with the old ISO8859 character encodings
Can I see an example file that demonstrates this issue?
There is a copyright symbol at the end of the file that needs replacing
What exactly is the problem?
ISO8859 represents symbols as a single byte, eg the copyright symbol would be represented by the single hex byte:
0xA9
UTF8 uses two bytes to represent such characters, eg:
0xC2 0xA9
Since UTF-8 is a variable width character encoding scheme, it will use from 1 to 4 bytes to encode a single symbol. This is how it is able to represent all kinds of new symbols we take for granted such as emoji and CJK characters.
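The byte values above can be checked from a Python prompt. This short sketch uses only the standard `str.encode` and `bytes.decode` methods. The sample characters are chosen here for illustration and do not come from the project's example files.

```python
symbol = "\u00a9"  # copyright sign

latin1_bytes = symbol.encode("iso-8859-1")
utf8_bytes = symbol.encode("utf-8")
print(latin1_bytes.hex())  # a9
print(utf8_bytes.hex())    # c2a9

# The single ISO8859 byte is not valid UTF-8 on its own.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode as UTF-8:", exc.reason)

# UTF-8 width grows with the code point: ASCII, Latin-1 symbol, CJK, emoji.
samples = ["A", "\u00a9", "\u4e2d", "\U0001F600"]
widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)  # [1, 2, 3, 4]
```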
TODO
- Make sure we don't break correctly encoded sequences, eg by processing 0xC2 and 0xA9 independently
  - Test: examples/good.txt should not error - it currently does
- Lookup table to "fix" known trouble makers such as copyright symbol
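One possible way to approach both TODO items is sketched below. It is an assumption, not the project's implementation. The table `KNOWN_FIXES`, the handler name `fix_or_drop`, and the function `clean_bytes` are all hypothetical, and the byte-to-character mappings assume the stray bytes came from ISO8859-1 or Windows-1252 text. The sketch uses Python's `codecs.register_error` hook, so the decoder only calls the handler for bytes that are actually invalid, and a correctly encoded `0xC2 0xA9` pair is never split.

```python
import codecs

# Hypothetical lookup table: stray single bytes mapped to the intended character.
KNOWN_FIXES = {
    0xA9: "\u00a9",  # copyright sign
    0x93: "\u201c",  # left "pretty" quote (Windows-1252)
    0x94: "\u201d",  # right "pretty" quote (Windows-1252)
    0xA0: " ",       # non-breaking space becomes a plain space
}


def fix_or_drop(error):
    """Decode error handler: substitute known bad bytes, drop unknown ones."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = "".join(KNOWN_FIXES.get(byte, "") for byte in bad)
    # Resume decoding immediately after the offending bytes.
    return replacement, error.end


codecs.register_error("fix_or_drop", fix_or_drop)


def clean_bytes(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, repairing or removing invalid bytes."""
    return raw.decode("utf-8", errors="fix_or_drop")
```

With this handler, `clean_bytes(b"\xc2\xa9 ok \xa9")` keeps the valid two-byte sequence and converts the lone `0xA9` to a copyright sign as well, while bytes missing from the table are removed.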
Hashes for utf8cleaner-0.0.1-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | e6f21264fc6b75bc96a8c11d2568bf37d96f92e6b978d2f26cc4c8386d39807b |
MD5 | 26e8f2e0f203b42a75c52dc67f098134 |
BLAKE2b-256 | 493302adabb396faa3c153d51dc63c30256903d95b223eb3a6eb161429c43c57 |