Remove non-UTF8 bytes from an input file and write a cleaned-up version

utf8cleaner

Read a file byte-by-byte and strip any byte sequences that are not valid UTF8

Installation

pip install utf8cleaner

Note: requires Python 3

Usage

utf8cleaner --input FILENAME

This reads FILENAME and writes the cleaned output to FILENAME.clean
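The behaviour described above can be approximated in a few lines of Python. This is a minimal sketch, not the package's actual implementation: the function name clean_file is hypothetical, and it relies on Python's built-in UTF-8 decoder with errors="ignore" to drop invalid bytes rather than inspecting each byte by hand.

```python
from pathlib import Path


def clean_file(filename: str) -> Path:
    """Write a copy of `filename` with invalid UTF-8 bytes removed.

    Hypothetical helper for illustration only; the real utf8cleaner
    may work differently.
    """
    source = Path(filename)
    target = Path(filename + ".clean")

    raw = source.read_bytes()
    # errors="ignore" silently drops any byte that is not part of a
    # valid UTF-8 sequence; valid multi-byte sequences are kept whole.
    text = raw.decode("utf-8", errors="ignore")
    target.write_bytes(text.encode("utf-8"))
    return target
```

Calling clean_file("examples/test.txt") would then produce examples/test.txt.clean next to the input, mirroring the naming used by the command line tool.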

Why would I want to do this?

Sometimes when exporting and importing data, there are byte sequences that prevent data being imported. To fix this you would otherwise have to do one or more of:

  • Manually edit the source data in its native application (eg backspace over invisible characters in JIRA fields)
  • Edit the file with a hex editor and look for known-bad values (eg copyright symbol)
  • Do something smart with perl/vi/sed to find and replace known bad byte patterns

This simple utility fixes these problems in one hit.

Where do these strange characters come from?

Number one culprit: copy and paste from Outlook. This often introduces invisible whitespace errors (spaces that are not spaces...) along with "pretty" quotes, etc.

Other sources include copying and pasting from files that use the old ISO8859 character encodings

Can I see an example file that demonstrates this issue?

examples/test.txt

There is a copyright symbol at the end of the file that needs replacing

What exactly is the problem?

ISO8859 represents each symbol as a single byte, eg the copyright symbol is represented by the single hex byte:

0xA9

UTF8 uses two bytes to represent such characters, eg:

0xC2 0xA9

Since UTF-8 is a variable-width character encoding scheme, it will use from 1 to 4 bytes to encode a single symbol. This is how it is able to represent all kinds of new symbols we take for granted such as emoji and CJK characters.
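The difference between the two encodings is easy to check from a Python prompt. The snippet below is only an illustration using the standard library codecs; the variable names are made up for this example.

```python
# The copyright symbol as a single ISO8859-1 byte.
latin1_bytes = b"\xa9"
symbol = latin1_bytes.decode("iso-8859-1")

# The same symbol re-encoded as UTF-8 takes two bytes.
utf8_bytes = symbol.encode("utf-8")
print(utf8_bytes)  # b'\xc2\xa9'

# The lone ISO8859-1 byte is not valid UTF-8 on its own.
try:
    latin1_bytes.decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False
print(lone_byte_is_valid)  # False

# UTF-8 is variable width: 1 to 4 bytes per symbol.
widths = {ch: len(ch.encode("utf-8")) for ch in ["A", "\u00a9", "\u20ac", "\U0001F600"]}
print(list(widths.values()))  # [1, 2, 3, 4]
```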

TODO

  • Make sure we don't break correctly encoded sequences, eg by processing 0xC2 and 0xA9 independently
    • Test: examples/good.txt should not error - it currently does
  • Lookup table to "fix" known troublemakers such as the copyright symbol
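One possible way to approach both TODO items is sketched below. It is an assumption about how this could be done, not the project's planned design: the names KNOWN_FIXES, _fix_or_strip and clean_bytes are hypothetical, and the table contents are examples only. The idea is to let Python's UTF-8 decoder consume valid multi-byte sequences as a unit, and to register a custom error handler with codecs.register_error that is consulted only for the bytes the decoder rejects.

```python
import codecs

# Hypothetical lookup table: stray single bytes mapped to the
# character they most likely stood for in an ISO8859-1/Windows export.
KNOWN_FIXES = {
    0xA9: "\u00a9",  # copyright symbol
    0x93: "\u201c",  # left "pretty" quote (Windows-1252)
    0x94: "\u201d",  # right "pretty" quote (Windows-1252)
}


def _fix_or_strip(error):
    """Replace known bad bytes via the table, drop all others."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    fixed = "".join(KNOWN_FIXES.get(byte, "") for byte in bad)
    # Resume decoding right after the rejected bytes.
    return fixed, error.end


codecs.register_error("fix_or_strip", _fix_or_strip)


def clean_bytes(data: bytes) -> bytes:
    """Return valid UTF-8: good sequences kept, bad bytes fixed or removed."""
    return data.decode("utf-8", errors="fix_or_strip").encode("utf-8")
```

With this approach a correctly encoded 0xC2 0xA9 pair passes through untouched, because the error handler is never called for it, while a lone 0xA9 is rewritten to the two-byte UTF-8 form and an unknown bad byte such as 0xFF is dropped.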
