A small Python library to generate random credible words based on a list of words by estimating the probability of the next character from the frequency of the previous ones
markov-word-generator
A small Python library to generate random plausible words from a list of words by estimating the probability of the next character from the frequency of the N previous ones. It uses a Markov chain.
Installation
pip install markov-word-generator
Principle
To generate random words that sound like real words, we need to analyze the character distribution in a corpus of a given language. We can start by analyzing how often each character appears given the previous character.
Here are heatmaps showing the distribution of each character (column) given the previous one (row).
$ = End of a word
^ = Start of a word
In English:
And in French:
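The numbers behind such a heatmap can be sketched in a few lines. This is a minimal illustration, not the library's code: the function name `transition_probabilities` and the tiny word list are assumptions made for the example. It counts every pair of adjacent characters, using `^` and `$` as start and end markers, and normalises each row so that it reads as the probability of the next character given the previous one.

```python
from collections import Counter, defaultdict

START, END = "^", "$"  # same markers as in the heatmaps


def transition_probabilities(words):
    """Return {previous_char: {next_char: probability}} for a list of words."""
    counts = defaultdict(Counter)
    for word in words:
        padded = START + word.lower() + END
        # Count every (previous, next) pair of adjacent characters.
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    # Normalise each row so that its probabilities sum to 1.
    return {
        prev: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for prev, row in counts.items()
    }


# Hypothetical three-word corpus, only for illustration.
probs = transition_probabilities(["cat", "car", "cot"])
print(probs["c"])  # 'a' follows 'c' in 2 of 3 cases, 'o' in 1 of 3
```

Each key of the outer dictionary corresponds to a row of the heatmap and each inner key to a column.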
Estimating the probability of a character from the single previous character works, but the results are still unreliable. We can make the word more likely to sound real by conditioning on the N previous characters instead.
The generator parses an input text file containing one word per line (a dictionary), counts each character's occurrences given the N previous characters, and builds a mapping table from each character combination to its associated frequency in the corpus.
Usage
The example below parses the English dictionary and creates a pseudo-word that sounds English by generating characters one at a time. Here, the probability of each character is estimated from the 4 previous ones.
from markov_word_generator import MarkovWordGenerator, WordType

# Generate a random English word by predicting each new character from the 4 previous ones
generator = MarkovWordGenerator(
    markov_length=4,
    language='en',
    word_type=WordType.WORD,
)
print(generator.generate_word())
output:
rebutaneously
Parameters
- MarkovWordGenerator():
  - markov_length: int. Number of previous characters the generator takes into account to compute the probability of the next character.
  - language: str. Language used to generate the word. Must be one of the supported languages.
  - word_type: str. Type of word to generate. Must be one of the supported word types.
  - dictionary_filename: str. Corpus the generator parses to analyze character frequencies. Must be used only if language and word_type are not set.
  - ignore_accents: Optional boolean. If set to True, accents are not considered while parsing dictionary_filename. Defaults to False.
- generate_word():
  - seed: Optional str. If set, the generated word starts with this seed.
from markov_word_generator import MarkovWordGenerator, WordType, AllowedLanguages

# Generate a random German name by predicting each new character from the 3 previous ones
generator = MarkovWordGenerator(
    markov_length=3,
    language=AllowedLanguages.DE,
    word_type=WordType.NAME,
)
print(generator.generate_word())
output:
ludgerten
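The description of ignore_accents above does not say how accents are dropped. One common way to do it, shown here purely as an assumption and not as the library's method, is to decompose each character with Unicode NFD normalisation and discard the combining marks. The helper name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(word):
    """Return word with combining accent marks removed (for example 'é' becomes 'e')."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("théophie"))  # theophie
```

With such a step applied to every line of the corpus, "é" and "e" would share the same row in the frequency table.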
Supported languages and word_types
import markov_word_generator
# List supported languages
print(markov_word_generator.get_supported_languages())
# ['EN', 'FR', 'DE', 'FI', 'IT', 'PT', 'SE']
# List supported word types
print(markov_word_generator.get_supported_word_types())
# ['WORD', 'NAME']
More languages and word types (plants, movie names, cities...) can be added in the future.
Impact of the markov_length parameter
- The higher the number of characters N we take into account, the more credible the word will be. We may end up with already existing words (see Benchmarks below).
- Lowering N will lead to words that sound less real. Some words will also be either very short (1-2 chars) or very long (>20 chars).
from markov_word_generator import MarkovWordGenerator, WordType, AllowedLanguages

N = 3  # set N to 1, 2, 3, 4 or 5 to reproduce the examples below
generator = MarkovWordGenerator(
    markov_length=N,
    language=AllowedLanguages.EN,
    word_type=WordType.WORD
)
# Print 10 generated words
for _ in range(10):
    print(generator.generate_word())
Length 1
output:
eroun
unteticakreatintes
sucle
erarums
eablatirlac
e
ghils
rllig
beseleforuat
de
Length 2
output:
malle
dallintathilight
boaddly
nobtiousle
ing
alaymplaings
rusle
sprevircirdbages
bant
ritablegruphicalls
Length 3
output:
blungalinther
super
solder
degreetricked
mittlessly
out
hearf
fracertory
gyny
locious
Length 4
output:
authering
negligented
manoeistical
bleat
lover
confusions
dest
hand
display
entwinkle
Length 5
output:
significative
contention
grandmaidens
aidesdecamped
paralleled
contradicate
thereby
numskull
crises
battlegro
Benchmarks
To measure this empirically, 5000 random words were generated for each test, and the percentage of them that exist as actual valid words was recorded. Ten tests were run: N=1 to N=5, in both English and French. The results are the following:
N\Language | EN | FR* |
---|---|---|
1 | 4.61% | 6.15% |
2 | 8.89% | 10.60% |
3 | 14.80% | 10.04% |
4 | 33.08% | 33.88% |
5 | 62.84% | 65.68% |
Percentage of generated words that are real words (present in the dictionary) as a function of the number of characters N taken into account in the Markov chain, measured over 5000 samples.
*Accents were ignored in the French benchmarks.
From N=5 onward, the chance of generating an existing word exceeds 50%.
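A measurement of this kind can be sketched as below. The benchmark script itself is not part of this README, so this is only an assumed outline: `existing_word_rate` is a hypothetical helper that takes any zero-argument callable returning a word (for example a generator's `generate_word` method) and the list of dictionary words.

```python
import itertools


def existing_word_rate(generate_word, dictionary, samples=5000):
    """Percentage of generated words that are already present in the dictionary."""
    known = set(dictionary)  # set lookup keeps each membership check fast
    hits = sum(generate_word() in known for _ in range(samples))
    return 100.0 * hits / samples


# Stand-in generator that cycles through fixed outputs, used only to show the call.
fake_outputs = itertools.cycle(["hand", "dest", "lover", "entwinkle"])
rate = existing_word_rate(lambda: next(fake_outputs), ["hand", "lover"], samples=8)
print(f"{rate:.2f}%")  # 50.00%
```

Passing the generator as a callable keeps the measurement independent of how the words are produced.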
More examples
Random generated words
EN | FR | ES | DE | IT | SE |
---|---|---|---|---|---|
duplicables | chouchonnées | inflamandando | regenfreunden | scommissari | medmännens |
feathenism | fumigents | diacontenderá | rechtsbeleuchtes | insortiti | metallösningens |
convolutionalist | saponisassiez | transnacionarán | unerschieben | immalintenziale | stationskligt |
jinglehand | pareraient | abundeo | unstimme | pronometro | arbetslöftenas |
stariness | toniciens | encuestionó | überredete | acconciliani | utredningsviljande |
trellish | challe | abombearán | zwischere | afferrofilia | tributionsverktygs |
subsidiariest | potames | banderolasteis | plädiertem | dispiacerete | slappningarnas |
discourself | rudoyers | construéis | wolken | trisecchererai | tidsnärings |
melanchorist | reluisionnés | desagüense | kompetentenzeichnen | riappavia | spagatellig |
cleavagery | sacagneuse | desvergonzaremos | dümmst | sgancializzando | yngstakternas |
Random generated names
EN | FR | ES | DE | IT | SE |
---|---|---|---|---|---|
charlena | arian | sandro | germann | severonica | brittan |
sorrell | clementin | uliseo | gunde | evarissa | kristin |
austinee | théophie | teofilomena | werthold | florena | frid |
hardine | augustine | herina | hannelia | tizia | torstein |
shantal | jeanninette | amilo | helmar | leonardinanda | gitta |
kristian | flavier | leandra | tatja | fortunatale | kerstina |
lessica | isidonie | dolorencio | sieghardt | simondo | sigfrida |
reana | clothaire | dion | anelia | geltrudenzio | thorsten |
leanoreen | fabriel | anuncia | trud | battia | gunils |
roslyn | bastienne | calis | eleonhard | lorentina | jerkel |
Given other types of dictionaries, the generator can create random words on specific topics: jobs, plants, animals, cities...