Three functions to normalize strings, repair bad encoding, and replace non-printable characters
The functions use Numba under the hood, so the first call is very slow (compile time), but subsequent calls are much faster.
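One way to see this warm-up effect is to time the first call separately from the later ones. The following is only a minimal sketch: `time_calls` and `fake_jit_function` are hypothetical names, and the stand-in function simulates a one-time compile cost with a short sleep rather than calling charchef or Numba.

```python
import time


def time_calls(func, *args, repeats=3):
    """Call func(*args) `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        durations.append(time.perf_counter() - start)
    return durations


# Hypothetical stand-in for a JIT-compiled function: the first call pays a
# simulated one-time "compile" cost, later calls do not.
_compiled = False


def fake_jit_function(text):
    global _compiled
    if not _compiled:
        time.sleep(0.05)  # simulated compile time (assumption, not measured)
        _compiled = True
    return text.lower()


timings = time_calls(fake_jit_function, "ABC")
print("first call:", timings[0], "later calls:", timings[1:])
```

The same helper can be pointed at any of the three charchef functions to compare the first call with the following ones on your own machine.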
pip install charchef
from charchef import (
    aa_convert_utf8_to_ascii_,
    aa_repair_bad_conversion_to_utf8,
    aa_replace_non_printable_chars,
)
text = r"""ąćęłńóśźż ĄĆĘŁŃÓŚŹ\x00Ż Junto à Estação de Carcavelos; Bragança Situado
en el núcleo de Es Caló de Sant Agustí frente al Hostal Rafalet. Cartão MOBI.E R.
Conselheiro Emídio Navarro (frente ao ISEL) R. Conselheiro Emidio Navarro (frente ao ISEL)
àáâãäåa èéêëe ìíîïi òóôõöo ùúûüu ýÿy Suzy & John " £682m
\u00FF\u00FF\u00F0\u00f0\x95\xFF SmörgÃ¥s Non ti suscita niente la parola pietÃ\xa0 RosŽ RUF MICH ZURÃœCK.
aqu\195\173 09. Bát Nhã Tâm Kinh criança Koç University Technische Universität Dresden Universität
für Musik und darstellende Kunst Wien Technische Universität Wien Ã\x89cole Nationale Supérieure
des Beaux-Arts Paris Universidad Simón BolÃ\xadvar (USB) 240 Åland Islands 2014.0
MARIEHAMN 11437.0 1 240 Åland Islands 2010.0 MARIEHAMN 5829.5 1 240
Albania 2011.0 Durrës 113249.0 240 Albania 2011.0 TIRANA
418495.0 240 Albania 2011.0 Durrës 56511.0 "Tutu Au Mic' – dumbéa"
""".splitlines()
bigc1 = aa_convert_utf8_to_ascii_(
    str_=text,
    preprocessing_functions=(
        "8x_3_lower_case_escaped",
        "8x_3_upper_case_escaped",
        "8u_4_upper_case_escaped",
        "8u_4_lower_case_escaped",
        "8x_69_upper_case_escaped",
        "8x_69_lower_case_escaped",
        "8n_escaped",
        "8wrong_chars",
        "8zerox_unescaped_lower",
        "8zerox_unescaped_upper",
        "8html_entity",
    ),
    preprocessing_function_non_printable=(
        "substitute_allcontrols_s",
        "substitute_allcontrols",
        "substitute_allcontrols2",
        "substitute_allcontrols2_s",
        "substitute_allcontrols3",
        "substitute_allcontrols3_s",
    ),
    respect_german_letters=True,
)
bigc2 = aa_repair_bad_conversion_to_utf8(
    str_=text,
    functions=(
        "8x_3_lower_case_escaped",
        "8x_3_upper_case_escaped",
        "8u_4_upper_case_escaped",
        "8u_4_lower_case_escaped",
        "8x_69_upper_case_escaped",
        "8x_69_lower_case_escaped",
        "8n_escaped",
        "8wrong_chars",
        "8zerox_unescaped_lower",
        "8zerox_unescaped_upper",
        "8html_entity",
    ),
)
bigc3 = aa_replace_non_printable_chars(
    str_="\x00rsi\\x00d\x00ad \x0aSimón BolÃ\xadvar",
    functions=(
        "substitute_allcontrols_s",
        "substitute_allcontrols",
        "substitute_allcontrols2",
        "substitute_allcontrols2_s",
    ),
    removex0a=False,
)
bigc1 # replaces all accents, special characters ...
Out[3]:
['acelnoszz ACELNOSZZ Junto a Estacao de Carcavelos; Braganca Situado ',
'en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet. Cartao MOBI.E R. ',
'Conselheiro Emidio Navarro (frente ao ISEL) R. Conselheiro Emidio Navarro (frente ao ISEL)',
' aaaaaeaa eeeee iiiii oooooeo uuuueu yyy Suzy & John " PS682m ',
' yydd*y Smorgas Non ti suscita niente la parola pieti RosZ RUF MICH ZURUCK.',
' aqui 09. Bat Nha Tam Kinh crianca Koc University Technische Universitat Dresden Universitat ',
' fur Musik und darstellende Kunst Wien Technische Universitat Wien Ecole Nationale Superieure ',
' des Beaux-Arts Paris Universidad Simon Bolivar (USB) 240 Sland Islands 2014.0 ',
' MARIEHAMN 11437.0 1 240 Sland Islands 2010.0 MARIEHAMN 5829.5 1 240 ',
' Albania 2011.0 Durres 113249.0 240 Albania 2011.0 TIRANA ',
' 418495.0 240 Albania 2011.0 Durres 56511.0 "Tutu Au Mic\' - dumbea"',
' ']
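For comparison, the general idea of reducing accented letters to their base letters can be sketched with the standard library alone. This is not how charchef is implemented; `strip_accents` and `EXTRA_MAP` are hypothetical names, and the small mapping table is an illustrative assumption for letters that have no Unicode decomposition.

```python
import unicodedata

# Letters such as "ł" have no canonical decomposition, so they need an
# explicit mapping (illustrative and far from exhaustive).
EXTRA_MAP = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O"})


def strip_accents(s: str) -> str:
    """Map special letters, decompose with NFKD, then drop combining marks."""
    s = s.translate(EXTRA_MAP)
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ"))
print(strip_accents("Junto à Estação de Carcavelos"))
```

Unlike the output shown above, this sketch does not guarantee pure ASCII: symbols without a decomposition, such as "£", pass through unchanged.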
bigc2 # Repairs messed up Unicode
Out[4]:
['ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ Junto à Estação de Carcavelos; Bragança Situado ',
'en el núcleo de Es Caló de Sant Agustí frente al Hostal Rafalet. Cartão MOBI.E R. ',
'Conselheiro Emídio Navarro (frente ao ISEL) R. Conselheiro Emidio Navarro (frente ao ISEL)',
' àáâãäåa èéêëe ìíîïi òóôõöo ùúûüu ýÿy Suzy & John " £682m ',
' ÿÿðð•ÿ Smörgås Non ti suscita niente la parola pietí RosŽ RUF MICH ZURÜCK.',
' aquí 09. Bát Nhã Tâm Kinh criança Koç University Technische Universität Dresden Universität ',
' für Musik und darstellende Kunst Wien Technische Universität Wien École Nationale Supérieure ',
' des Beaux-Arts Paris Universidad Simón Bolívar (USB) 240 Šland Islands 2014.0 ',
' MARIEHAMN 11437.0 1 240 Šland Islands 2010.0 MARIEHAMN 5829.5 1 240 ',
' Albania 2011.0 Durrës 113249.0 240 Albania 2011.0 TIRANA ',
' 418495.0 240 Albania 2011.0 Durrës 56511.0 "Tutu Au Mic\' – dumbéa"',
' ']
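The most common kind of damage in the sample, UTF-8 bytes that were decoded with a single-byte codec, can often be undone by reversing that step. The sketch below assumes the wrong codec was Windows-1252 or Latin-1; `fix_mojibake` is a hypothetical helper and not part of charchef, which handles many more cases (escaped sequences, HTML entities and so on).

```python
def fix_mojibake(s: str) -> str:
    """Try to undo a UTF-8 -> single-byte-codec mis-decoding.

    Returns the input unchanged when the round trip is not possible.
    """
    for codec in ("cp1252", "latin-1"):
        try:
            return s.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return s


# "\u00c3\u0153" displays as "Ãœ", the cp1252 view of the UTF-8 bytes for "Ü".
print(fix_mojibake("RUF MICH ZUR\u00c3\u0153CK"))
# "\u00c3\x89" is the Latin-1 view of the UTF-8 bytes for "É".
print(fix_mojibake("\u00c3\x89cole"))
```

The round trip works on the whole string at once, so a string that mixes intact accented letters with damaged ones (for example "Simón BolÃ\xadvar") is returned unchanged; handling such mixed input requires working on smaller pieces.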
bigc3 # Removes non-printable characters
Out[5]: ['rsidad Simón BolÃ\xadvar']
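A simplified version of control-character removal can be written with `unicodedata`. This is only a sketch under stated assumptions: `drop_control_chars` and its `keep_newline` flag are hypothetical, and it removes only real control characters (Unicode category `Cc`), not escaped text such as a literal backslash followed by `x00`.

```python
import unicodedata


def drop_control_chars(s: str, keep_newline: bool = True) -> str:
    """Remove characters in Unicode category Cc, optionally keeping newlines."""
    return "".join(
        ch
        for ch in s
        if unicodedata.category(ch) != "Cc" or (keep_newline and ch == "\n")
    )


print(drop_control_chars("\x00rsi\x00d\x00ad"))
print(repr(drop_control_chars("line1\nline2", keep_newline=False)))
```

Characters such as the soft hyphen (`\xad`) belong to category `Cf` rather than `Cc`, so this sketch leaves them in place.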