spoteno is a library for spoken text normalization for ASR
spoteno
spoteno (Spoken-Text-Normalization) is a tool to clean up text transcripts for speech recognition systems. These systems normally expect target transcripts to contain only characters from a restricted set.
Installation
Install the latest development version:
pip install git+https://github.com/ynop/spoteno.git
Examples
The default use case is to normalize a sentence. This forces the output string to contain only valid characters (as defined by the configuration).
import spoteno
sentence = ('Am 11. Januar geht er um 5m nach links,'
'weshalb er $d schon "ziemlich" müde ist.')
norm = spoteno.Normalizer.de()
outsent = norm.normalize(sentence)
print(outsent)
# >>> am elfte januar geht er um fünf m nach links weshalb er d schon ziemlich müde ist
With force=False, the final cleanup can be disabled. Invalid characters may then occur in the output if the configuration hasn't handled them specifically.
outsent = norm.normalize(sentence, force=False)
print(outsent)
# >>> am elfte januar geht er um fünf m nach links weshalb er $d schon ziemlich müde ist
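To make the notion of "invalid characters" concrete, here is a minimal sketch in plain Python that does not use spoteno at all. The character set `VALID_CHARS` and the helpers `find_invalid_chars` and `drop_invalid_chars` are hypothetical names chosen for illustration. They are not part of spoteno's API, and the real set of valid characters is whatever the configuration defines.

```python
import string

# Hypothetical allowed character set for German transcripts. This is an
# assumption for illustration, not spoteno's actual configuration.
VALID_CHARS = set(string.ascii_lowercase) | set("äöüß ")


def find_invalid_chars(text, valid_chars=VALID_CHARS):
    """Return the set of characters in `text` that are not in `valid_chars`."""
    return set(text) - set(valid_chars)


def drop_invalid_chars(text, valid_chars=VALID_CHARS):
    """Remove every character not in `valid_chars`, then collapse whitespace."""
    kept = "".join(ch for ch in text if ch in valid_chars)
    return " ".join(kept.split())


# The output of the force=False example above.
unforced = ("am elfte januar geht er um fünf m nach links "
            "weshalb er $d schon ziemlich müde ist")

print(find_invalid_chars(unforced))
print(drop_invalid_chars(unforced))
```

Under these assumptions, the first call reports the `$` that survived normalization, and the second removes it, which mirrors the difference between the two outputs shown above.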
With the debug method, one can retrieve the set of invalid characters in the final output. This can be used to create or debug a configuration. Additionally, the outputs of the individual configuration steps can be printed.
outsent, error = norm.debug(sentence)
print(error)
# >>> START Am 11. Januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist.
# >>> Strip ['Am 11. Januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist.']
# >>> Lower ['am 11. januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist.']
# >>> StripChar ['am 11. januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceIfNotSurroundedByDigits['am 11. januar geht er um 5m nach links weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceIfNotPrecededByDigit['am 11. januar geht er um 5m nach links weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceRegex ['am 11. januar geht er um 5m nach links weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceChar ['am 11. januar geht er um 5m nach links weshalb er $d schon ziemlich müde ist']
# >>> ReplaceChar ['am 11. januar geht er um 5m nach links weshalb er $d schon ziemlich müde ist']
# >>> WhitespaceTokenize ['am', '11.', 'januar', 'geht', 'er', 'um', '5m', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> SplitNumberSuffix ['am', '11.', 'januar', 'geht', 'er', 'um', '5', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> NumberToWords ['am', '11.', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> OrdinalNumberToWords['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> ReplaceChar ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> ReplaceFull ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> RemoveDiacritics ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> Strip ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> END ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> {'$'}
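The trace above suggests a pipeline in which each step receives the output of the previous one, first on the whole string and later on a list of tokens. The following is a rough sketch of that idea in plain Python. It is not spoteno's implementation: the step functions, the `STEPS` list and `run_with_trace` are hypothetical names, and only a handful of the steps from the trace are imitated.

```python
import re


def strip(text):
    """Remove leading and trailing whitespace."""
    return text.strip()


def lower(text):
    """Convert the text to lowercase."""
    return text.lower()


def whitespace_tokenize(text):
    """Split the text into tokens on runs of whitespace."""
    return text.split()


def split_number_suffix(tokens):
    """Split tokens such as '5m' into the two tokens '5' and 'm'."""
    out = []
    for tok in tokens:
        # Digits followed directly by letters (Unicode-aware).
        match = re.fullmatch(r"(\d+)([^\W\d_]+)", tok)
        out.extend(match.groups() if match else [tok])
    return out


# Hypothetical step list; the names only echo those in the trace above.
STEPS = [
    ("Strip", strip),
    ("Lower", lower),
    ("WhitespaceTokenize", whitespace_tokenize),
    ("SplitNumberSuffix", split_number_suffix),
]


def run_with_trace(text, steps=STEPS):
    """Apply each step in order and record every intermediate value."""
    trace = [("START", text)]
    value = text
    for name, step in steps:
        value = step(value)
        trace.append((name, value))
    return value, trace


tokens, trace = run_with_trace("  Er geht um 5m nach links  ")
for name, value in trace:
    print(f"{name:<20}{value!r}")
```

Recording every intermediate value makes it easy to see which step introduced or failed to remove an unwanted character, which is the same kind of inspection the debug output supports.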
Development
Prerequisites
It's recommended to use a virtual environment when developing spoteno. To create one, execute the following command in the project's root directory:
python -m venv .
To install spoteno and all its dependencies, execute:
pip install -e .
Running the test suite
pip install -e .[dev]
python setup.py test
With PyCharm you might have to change the default test runner. Otherwise, it might only suggest using nose. To do so, go to File > Settings > Tools > Python Integrated Tools (on the Mac it's PyCharm > Preferences > Settings > Tools > Python Integrated Tools) and change the test runner to py.test.
Versions
Versions are handled using bump2version. To bump the version:
bump2version [major,minor,patch,release,num]
To go directly to a final release version (skipping .dev/.rc/...):
bump2version [major,minor,patch] --new-version x.x.x
Release
Commands to create a new release on PyPI:
rm -rf build
rm -rf dist
python setup.py sdist
python setup.py bdist_wheel
twine upload dist/*