Skip to main content

Utility for string normalization

Project description

sic

pypi

(Latin) so, thus, such, in such a way, in this way
(English) spelling is correct

sic is a module for string normalization. Given a string, it separates sequences of alphabetical, numeric, and punctuation characters, as well as performs more complex transformation (i.e. separates or replaces specific words or individual symbols). It comes with set of default normalization rules to transliterate and tokenize greek letters and replace accented characters with their base form. It also allows using custom normalization configurations.

Basic usage:

import sic

builder = sic.Builder()
machine = builder.build_normalizer()
x = machine.normalize('abc123xyzalphabetagammag')
print(x)

The output will be:

abc 123 xyz alpha beta gamma g

Installation

pip install sic
  • sic is designed to work in Python 3 environment.
  • sic only needs Python Standard Library (no other packages).

Tokenization configs

sic implements tokenization, i.e. it splits a given string into tokens and processes those tokens according to the rules specified in a configuration file. Basic tokenization includes separating groups of alphabetical, numerical, and punctuation characters within a string, thus turning them into separate words (for future reference, we'll call such words tokens). For instance, abc-123 will be transformed into abc - 123, having tokens abc, -, and 123.

What happens next to initially tokenized string must be defined using XML in configuration file(s). Entry point to default tokenizer applied to a string is sic/tokenizer.standard.xml.

Below is the template and description for tokenizer config.

<!-- tokenizer.config.xml -->
<!--
  This is the description of config file for tokenizer.
  General structure:
  <tokenizer>
  +-<import>
  +-...
  +-<import>
  +-<setting>
  +-...
  +-<setting>
  +-<split>
  +-...
  +-<split>
  +-<token>
  +-...
  +-<token>
  +-<character>
  +-...
  +-<character>
-->

<?xml version="1.0" encoding="UTF-8"?>

<!-- There must be single root element, and it must be <tokenizer>: -->
<tokenizer name="$name">
<!-- $name: string label for this tokenizer -->

  <!--
    Direct children of <tokenizer> are <import>, <setting>, <split>,
      <token>, and/or <character> elements (there can be zero to many
      declarations of any of those)
  -->

  <!-- <import> elements point at other tokenizer configs to merge with -->
  <import file="$file" />
  <!-- $file: path to file with another tokenizer config -->

  <!-- <setting> elements define high-level tokenizer settings -->
  <setting name="$name" value="$value" />
  <!--
    Names and value requirements for /tokenizer/setting elements:
    $name="cs": $value="0"|"1" (if "1", this tokenizer will be case-sensitive)
    $name="bypass": $value="0"|"1" (if "1", this tokenizer will do nothing,
      regardless the rest content of this file)
  -->

  <!--
    <split> elements define substrings that should be separated from text as
      tokens
  -->
  <split where="$where" value="$value" />
  <!--
    $where="l"|"r"|"m" ("l" for left, "r" for right, "m" for middle)
    $value: string that will be handled as token when it's either in the
      beginning of word ($where="l"), at the end of word ($where="r"), or in
      the middle ($where="m")
  -->

  <!--
    <token> elements define tokens that should be replaced with other tokens
      (or with nothing => removed)
  -->
  <token to="$to" from="$from" />
  <!--
    $to: string that should replace the token specified in $from
    $from: string that is the token to be replaced by another string specified
      in $to
  -->

  <!--
    <character> elements define single characters that should be replaced with
      other single characters
  -->
  <character to="$to" from="$from" />
  <!--
    $to: character that should replace the another character specified in $from
    $from: character that should be replaced by another character specified in
      $to
  -->

</tokenizer>

Below are descriptions and examples of tokenizer config elements.

ELEMENT ATTRIBUTES DESCRIPTION EXAMPLE
<import> file="path/to/another/config.xml" Import tokenization rules from another tokenizer config.
<setting> name="bypass" value="?" If present and value="1", all tokenization rules are ignored, as if there was no tokenization at all (left for debug purposes).
<setting> name="cs" value="?" If value="1", string is processed case-sensitively; if value="0" - case-insensitively; if not present, tokenizer is case-insensitive.
<split> where="l" value="?" Separates token specified in value from left part of a bigger token. where="l" value="kappa": nf kappab --> nf kappa b
<split> where="m" value="?" Separates token specified in value when it is found in the middle of a bigger token. where="m" value="kappa": nfkappab --> nf kappa b
<split> where="r" value="?" Separates token specified in value from right part of a bigger token. where="r" value="gamma": ifngamma --> ifn gamma
<token> to="?" from="?" Replaces token specified in to with another token specified in from. to="protein" from="gene": nf kappa b gene --> nf kappa b protein
<character> to="?" from="?" Replaces character specified in to with another character specified in from. to="e" from="ë": citroën --> citroen

Attribute where of <split> element may have any combination of l, m, or r literals if the specified substring is required to be separated in different places of a bigger string. So, instead of three different elements

<split where="l" value="word">
<split where="m" value="word">
<split where="r" value="word">

using the following single one

<split where="lmr" value="word">

will achieve the same result.

Transformation is applied in the following order:

  1. Replacing characters
  2. Splitting tokens
  3. Replacing tokens

When splitting tokens, longer ones shadow shorter ones.

Usage

import sic

For detailed description of all function and methods, see comments in the source code.

Class Builder

Function Builder.build_normalizer() reads tokenization config, instantiates Normalizer class that would perform tokenization according to rules specified in a given config, and returns this Normalizer class instance.

ARGUMENT TYPE DEFAULT DESCRIPTION
filename str None Path to tokenizer configuration file.
# create Builder object
builder = sic.Builder()

# create Normalizer object with default set of rules
machine = builder.build_normalizer()

# create Normalizer object with custom set of rules
machine = builder.build_normalizer('/path/to/config.xml')

Class Normalizer

Function Normalizer.normalize() performs string normalization according to the rules ingested at the time of class initialization, and returns normalized string.

ARGUMENT TYPE DEFAULT DESCRIPTION
source_string str n/a String to normalize.
word_separator str ' ' Word delimiter (single character).
normalizer_option int 0 Mode of post-processing.

word_separator: Specified character will be considered a boundary between tokens. The default value is ' ' (space) which seems reasonable choice for natural language. However any character can be specified, which might be more useful in certain context.

normalizer_option: The value can be either one of 0, 1, or 2 and controls the way tokenized string is post-processed:

VALUE MODE
0 No post-processing.
1 Rearrange tokens in alphabetical order.
2 Rearrange tokens in alphabetical order and remove duplicates.

Property Normalizer.result retains the result of last call for Normalizer.normalize function as dict object with the following keys:

KEY VALUE TYPE DESCRIPTION
'original' str Original string value that was processed.
'normalized' str Returned normalized string value.
'map' list(int) Map between original and normalized strings.
'r_map' list(list(int)) Reverse map between original and normalized strings.

Normalizer.result['map']: Not only Normalizer.normalize() generates normalized string out of originally provided, it also tries to map character indexes in normalized string back on those in the original one. This map is represented as list of integers where item index is character position in normalized string and item value is character position in original string. This is only valid when normalizer_option argument for Normalizer.normalize() call has been set to 0.

Normalizer.result['r_map']: Reverse map between character locations in original string and its normalized reflection (item index is character position in original string; item value is list [x, y] where x and y are respectively lowest and highest indexes of mapped characted in normalized string.

# using default word_separator and normalizer_option
x = machine.normalize('alpha-2-macroglobulin-p')
print(x) # 'alpha - 2 - macroglobulin - p'
print(machine.result)
"""
{
  'original': 'alpha-2-macroglobulin-p',
  'normalized': 'alpha - 2 - macroglobulin - p',
  'map': [
    0, 1, 2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 21, 22, 22
  ],
  'r_map: [
    [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 26], [27, 28]
  ]
}
"""

# using custom word separator
x = machine.normalize('alpha-2-macroglobulin-p', word_separator='|')
print(x) # 'alpha|-|2|-|macroglobulin|-|p'

# using normalizer_option=1
x = machine.normalize('alpha-2-macroglobulin-p', normalizer_option=1)
print(x) # '- - - 2 alpha macroglobulin p'

# using normalizer_option=2
x = machine.normalize('alpha-2-macroglobulin-p', normalizer_option=2)
print(x) # '- 2 alpha macroglobulin p'

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sic-1.0.4a1.tar.gz (17.0 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

sic-1.0.4a1-cp38-cp38-win_amd64.whl (167.5 kB view details)

Uploaded CPython 3.8Windows x86-64

sic-1.0.4a1-cp37-cp37m-win_amd64.whl (164.9 kB view details)

Uploaded CPython 3.7mWindows x86-64

sic-1.0.4a1-cp36-cp36m-win_amd64.whl (165.0 kB view details)

Uploaded CPython 3.6mWindows x86-64

File details

Details for the file sic-1.0.4a1.tar.gz.

File metadata

  • Download URL: sic-1.0.4a1.tar.gz
  • Upload date:
  • Size: 17.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.6

File hashes

Hashes for sic-1.0.4a1.tar.gz
Algorithm Hash digest
SHA256 41760c79341bfba505be31aa6f2f294582e5b54728530222545b6cfdd9f5933e
MD5 9cd3b86427a23f2f6713a9651e77857e
BLAKE2b-256 75bd1f97cb4d21e7416ed13ddda5569a266f3d84573932a434b41d54cd418e47

See more details on using hashes here.

File details

Details for the file sic-1.0.4a1-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: sic-1.0.4a1-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 167.5 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.6

File hashes

Hashes for sic-1.0.4a1-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 88e638365f7e29dfce6c002d3623d1cd9b7dad85771e03ac090bd1ff48d302a0
MD5 0a8521bbdbc8d4019df9c9a2adb9b436
BLAKE2b-256 5a04131f667cdba1ca6c659ee1243e517904f8f9f0a50af4af733d50b9a25356

See more details on using hashes here.

File details

Details for the file sic-1.0.4a1-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: sic-1.0.4a1-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 164.9 kB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.6

File hashes

Hashes for sic-1.0.4a1-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 7a74422d1403eaf521d8ea9b9edb00a0d8d20d28333594a8c692562882950804
MD5 3fea34ed3da53ca736d433a15de0d5c2
BLAKE2b-256 a6e5e2cc315e8978b0fa8504877197a26a53d01cf573341f4e615340640ff41f

See more details on using hashes here.

File details

Details for the file sic-1.0.4a1-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: sic-1.0.4a1-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 165.0 kB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.6

File hashes

Hashes for sic-1.0.4a1-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 1add4c6c9d3f00a9900692ede48aa8cda1fe8d3000884e326293a0616114627b
MD5 3954f55d121ba3b184f4e241764eefcf
BLAKE2b-256 213216c52fba42cdee4ed58d7e80787543c54fd2a0e6a0ba0daea035e425b5cb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page