The amazing Murre will normalize non-standard Finnish

🐶 Murre 🐕

The amazing Murre (genitive Murren 🐕) will normalize non-standard Finnish (puhekieli) to standard Finnish (kirjakieli). This repository is maintained by Mika Hämäläinen.

Installation

This library is designed for Python 3 and may not work on Python 2.

pip3 install murre
python3 -m murre.download

Usage

To normalize a Finnish sentence, pass it as a list of tokens to normalize_sentence:

from murre import normalize_sentence

print(normalize_sentence("mä syön paljo karkkii".split(" ")))
# >> minä syön paljon karkkia

To use the same chunk-level BRNN model as described in the paper, pass wnut19_model=True. Note that this model might only work on Linux.
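Since the paper's model might only work on Linux, one option is to request it only on that platform and use the default model elsewhere. The following is a minimal sketch and not part of Murre: the helper name normalize_tokens and its parameters normalizer and prefer_paper_model are hypothetical, and it assumes that normalize_sentence accepts wnut19_model as a keyword argument.

```python
import platform


def normalize_tokens(tokens, normalizer, prefer_paper_model=True):
    """Call `normalizer` on a token list, requesting the paper's model only on Linux.

    `normalizer` is expected to behave like murre's normalize_sentence
    (assumption: it accepts wnut19_model as a keyword argument).
    """
    use_paper_model = prefer_paper_model and platform.system() == "Linux"
    if use_paper_model:
        return normalizer(tokens, wnut19_model=True)
    # On other platforms, fall back to the default model.
    return normalizer(tokens)
```

With Murre installed, this could be called as normalize_tokens("mä syön paljo karkkii".split(" "), normalize_sentence).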

You can normalize multiple sentences at the same time by running:

from murre import normalize_sentences

sents = ["kissa syö karkkii", "jok laulaa tuol puole", "en tiiä oikee et kuka se o", "kyl on hölömöö"]
# Tokenize each sentence: [["kissa", "syö", "karkkii"], ["jok", "laulaa", ...], ...]
sentences = [x.split(" ") for x in sents]

print(normalize_sentences(sentences))
# >> ['kissa syö karkkia', 'joka laulaa tuolla puolen', 'en tiedä oikein että kuka se on', 'kyllä on hölmöä']
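Both functions take pre-tokenized input, so raw strings have to be split first. The following is a minimal sketch of a wrapper that does this for a batch of strings. The name normalize_texts and its normalizer parameter are hypothetical and not part of Murre. The normalizer is passed in explicitly so that the helper makes no further assumptions about Murre's API.

```python
def normalize_texts(texts, normalizer):
    """Whitespace-tokenize raw strings and normalize them in one batch.

    `normalizer` is expected to behave like murre's normalize_sentences:
    it takes a list of token lists and returns the normalized sentences.
    """
    # str.split() without arguments collapses repeated whitespace,
    # so no empty tokens are passed to the normalizer.
    tokenized = [text.split() for text in texts]
    return normalizer(tokenized)
```

With Murre installed, this could be called as normalize_texts(sents, normalize_sentences). Unlike split(" "), calling split() with no argument does not produce empty tokens when the input contains double spaces or leading and trailing whitespace.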

Cite

Niko Partanen, Mika Hämäläinen, and Khalid Alnajjar. 2019. Dialect Text Normalization to Normative Standard Finnish. In the Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT).
