Library for procedurally generating text that resembles a particular language.
ipsum
Ipsum is a Python library for the generation of international placeholder text.
Unlike most other generators, which work by scrambling a particular text (e.g. Lorem Ipsum generators with Cicero's "De Finibus Bonorum et Malorum"), it uses Markov models to generate a vocabulary of meaningless new words that resemble the language it was trained on. This allows for the generation of text that is typographically similar to a specified language (i.e. it uses the same alphabet and punctuation, in the same manner and at the same frequency), but is semantically meaningless.
You can read more about how Ipsum works here.
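As a rough illustration of the Markov-model idea, here is a minimal character-level sketch. It is not Ipsum's actual implementation: the `train` and `generate_word` helpers, the boundary markers, the order of 2 and the tiny corpus are all assumptions made for this example.

```python
import random
from collections import defaultdict

# Hypothetical boundary markers for this sketch; not part of Ipsum's API.
START, END = "^", "$"


def train(words, order=2):
    """Record which character follows each `order`-character context."""
    transitions = defaultdict(list)
    for word in words:
        padded = START * order + word.lower() + END
        for i in range(len(padded) - order):
            context = padded[i:i + order]
            transitions[context].append(padded[i + order])
    return transitions


def generate_word(transitions, rng, order=2, max_length=12):
    """Walk the chain from the start context until END or max_length."""
    context = START * order
    letters = []
    while len(letters) < max_length:
        # Repeated entries in the list make frequent successors more likely.
        next_char = rng.choice(transitions[context])
        if next_char == END:
            break
        letters.append(next_char)
        context = context[1:] + next_char
    return "".join(letters)


corpus = "the quick brown fox jumps over the lazy dog and then rests".split()
model = train(corpus)
rng = random.Random(0)  # seeded so repeated runs give the same words
new_words = [generate_word(model, rng) for _ in range(5)]
print(new_words)
```

Because each character is drawn only from the characters that followed the same context in the training words, the output uses the corpus alphabet and letter combinations without necessarily reproducing real words.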
You can use Ipsum directly from your browser by accessing the web app at ipsum.trifunovski.me.
It currently supports the following languages:
- Albanian
- Bulgarian
- Dutch
- English
- French
- German
- Greek
- Italian
- Macedonian
- Serbian
- Spanish
- Swedish
Installing
Note that ipsum requires Python >= 3.8.1.
Run
pip install ipsum
to install the latest published version of the library, or clone the repo and use poetry
git clone git@github.com:dtrifuno/ipsum
cd ipsum/ipsum
poetry install
to install a development copy.
Usage
import ipsum
# Load the English language model
model = ipsum.load_model("en")
# Returns a list of 3 strings, each resembling a paragraph of English
paragraphs = model.generate_paragraphs(3)
# Returns a list of 10 strings, each resembling a full sentence of English
sentences = model.generate_sentences(10)
# Returns a list of 50 words (does not include any punctuation)
words = model.generate_words(50)
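One possible way to use the generated paragraphs is to wrap them in HTML for a page mock-up. The sketch below relies only on the `generate_paragraphs` method shown above; `placeholder_html`, `ParagraphModel` and `FakeModel` are hypothetical names introduced for this example, and `FakeModel` stands in for a loaded model so the snippet runs without the library installed.

```python
from html import escape
from typing import List, Protocol


class ParagraphModel(Protocol):
    """Anything exposing generate_paragraphs(n), as the model above does."""

    def generate_paragraphs(self, n: int) -> List[str]: ...


def placeholder_html(model: ParagraphModel, count: int) -> str:
    """Render `count` generated paragraphs as escaped <p> elements."""
    paragraphs = model.generate_paragraphs(count)
    return "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)


class FakeModel:
    """Stand-in for a loaded model (an assumption for this sketch)."""

    def generate_paragraphs(self, n: int) -> List[str]:
        return [f"Paragraph {i} <placeholder>." for i in range(1, n + 1)]


markup = placeholder_html(FakeModel(), 2)
print(markup)
```

With the library installed, passing the object returned by `ipsum.load_model("en")` in place of `FakeModel()` should work the same way, assuming it returns a list of strings as described above.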
Development
Typechecking, linting and testing
You can run
poetry run mypy ./src ./tests
to typecheck,
poetry run flake8
to lint, or
poetry run pytest --cov
to test the code.
Additional scripts
This repository contains several scripts that are useful in development, but are
not included with the PyPI package. If you want to make a change to this library,
please clone the repository instead. You can check out these scripts and what
they do by running poetry run dev.
Adding a language
1. Find out the two-letter ISO 639-1 code of the language you want to add (xx for the rest of this subsection). Add the full English name and ISO 639-1 code of the language to supported_languages.py.
2. Prepare a corpus of texts in the language. The corpus should be packaged as a zip archive of .txt files.
3. Write a parser for the language (look at src/ipsum/parse/en_parser.py for an example). Name the Parser instance xx_parser and save it as src/ipsum/parse/language/xx.py. Add the parser instance to load_parser in src/ipsum/parse/__init__.py.
4. Run poetry run dev parser-diagnostics xx. Ideally, the parser should detect around 100,000 sentences and be able to parse more than 50–60% of them into skeletons.
5. Run poetry run dev build_model xx && poetry run model_diagnostics xx.
6. Inspect diagnostics/xx.png. If it looks good, congrats, you are done! Otherwise, return to Step 2 and try to figure out what went wrong.
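The corpus packaging in Step 2 can be sketched with the standard library. This is only one way to build the archive: the `package_corpus` helper, the directory layout and the `xx.zip` file name are assumptions, since the steps above do not say where the development scripts expect the archive to live.

```python
import tempfile
import zipfile
from pathlib import Path


def package_corpus(text_dir: Path, archive_path: Path) -> int:
    """Zip every .txt file in text_dir and return how many were added."""
    txt_files = sorted(text_dir.glob("*.txt"))
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in txt_files:
            # Store files flat, by name, at the top level of the archive.
            archive.write(path, arcname=path.name)
    return len(txt_files)


# Demonstration on a throwaway directory with two texts and one non-text file.
with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = Path(tmp) / "texts"
    corpus_dir.mkdir()
    (corpus_dir / "a.txt").write_text("First sample text.", encoding="utf-8")
    (corpus_dir / "b.txt").write_text("Second sample text.", encoding="utf-8")
    (corpus_dir / "notes.md").write_text("ignored", encoding="utf-8")

    archive_path = Path(tmp) / "xx.zip"  # hypothetical archive name
    added = package_corpus(corpus_dir, archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        names = sorted(archive.namelist())

print(added, names)
```

Only the `.txt` files end up in the archive; anything else in the directory is skipped.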
Corpora
The models were trained on the following corpora:
- Albanian: Leipzig Corpora Collection - 2020 Albanian News 100k Sentences
- Bulgarian: Bulgarian National Corpus - Diachronic corpus for the period of 1951–2021
- Dutch: Leipzig Corpora Collection - 2020 Dutch News 100k Sentences
- English: Selections from Computational Stylistics Group - 100 English Novels ver. 1.4
- French: Leipzig Corpora Collection - 2018 French News 100k Sentences
- German: Computational Stylistics Group - 68 German Novels
- Greek: Monolingual Greek corpus in the culture domain
- Italian: Leipzig Corpora Collection - 2019 Italian News 100k Sentences
- Macedonian: Selections from Electronic Corpus of Macedonian Literary Texts - 135 Тома Македонска Книжевност
- Serbian: Leipzig Corpora Collection - 2016 Serbian Wikipedia 100k Sentences
- Spanish: Leipzig Corpora Collection - 2016 Spanish News 100k Sentences
- Swedish: Leipzig Corpora Collection - 2019 Swedish News 100k Sentences
License