ilo li moku e toki li pana e sona ni: ni li toki ala toki pona? ("A tool takes in text and gives this knowledge: is this Toki Pona or not?")

sona toki

What is sona toki?

This library, "Language Knowledge," helps you identify whether a message is in Toki Pona. It does so by determining whether a large enough number of words in a statement are "in Toki Pona". No grammar checking, yet.
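As a rough illustration of that idea, and not the library's actual implementation, the sketch below scores a message by the fraction of its words found in a small word list and compares that fraction to a threshold. The word list, the `looks_like_toki_pona` name, and the 0.8 threshold are all assumptions made for this example.

```python
import re

# Hypothetical, tiny word list for illustration; a real list would be much larger.
KNOWN_WORDS = {
    "mi", "sina", "ona", "li", "e", "o", "ni",
    "toki", "pona", "moku", "sona", "pilin", "sewi", "insa",
}


def looks_like_toki_pona(message: str, threshold: float = 0.8) -> bool:
    """Return True if enough of the message's words are in KNOWN_WORDS."""
    words = re.findall(r"[a-z]+", message.lower())
    if not words:
        # Assumption for this sketch: a message with no words does not pass.
        return False
    hits = sum(1 for word in words if word in KNOWN_WORDS)
    return hits / len(words) >= threshold
```

With this word list, "o pilin insa e ni: sina pilin e sewi" passes and "imagine how is touch the sky" does not. The threshold is the knob: lowering it tolerates more unknown words per message.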

I wrote this library with a variety of scraps and lessons learned from a prior project, ilo pi toki pona taso, "toki-pona-only tool". That tool now uses this library to great success!

If you've ever worked on a similar project, you know the question "is this message in [language]" does not have a consistent answer: the environment, the topic, the speaker's preferences, and much more can all alter whether a given message is "in" any specific language. This complexity applies to Toki Pona too.

So, this project "solves" that complex problem by offering an opinionated tokenizer and a configurable parser, allowing you to tune its output to your preferences and goals. Even silly ones.

Quick Start

Install with your preferred Python package manager. Example:

pdm init  # if your pyproject.toml doesn't exist yet
pdm add sonatoki

Then get started with a script along these lines:

from sonatoki.ilo import Ilo
from sonatoki.Configs import PrefConfig

def main():
    ilo = Ilo(**PrefConfig)
    ilo.is_toki_pona("imagine how is touch the sky")  # False
    ilo.is_toki_pona("o pilin insa e ni: sina pilin e sewi")  # True
    ilo.is_toki_pona("I Think I Can Evade Detection")  # False

if __name__ == "__main__":
    main()

Or if you'd prefer to configure on your own:

from copy import deepcopy
from sonatoki.ilo import Ilo
from sonatoki.Configs import BaseConfig
from sonatoki.Filters import NimiLinkuCore, NimiLinkuCommon, Phonotactic, ProperName, Or
from sonatoki.Scorers import SoftPassFail

def main():
    config = deepcopy(BaseConfig)
    config["scoring_filters"].extend([Or(NimiLinkuCore, NimiLinkuCommon), Phonotactic, ProperName])
    config["scorer"] = SoftPassFail

    ilo = Ilo(**config)
    ilo.is_toki_pona("mu mu!")  # True
    ilo.is_toki_pona("mi namako e moku mi")  # True
    ilo.is_toki_pona("ma wulin")  # False

if __name__ == "__main__":
    main()

Ilo is highly configurable by necessity, so I recommend looking through the premade configs in Configs as well as the individual Preprocessors, Filters, and Scorers. In Cleaners, all you need is ConsecutiveDuplicates. In Tokenizers, the preferred tokenizers WordTokenizer and SentTokenizer are already the default in Ilo.

Development

Install pdm
pdm install --dev
Open any file you like!

FAQ

Why isn't this README/library written in Toki Pona?

The intent is to show our methodology to the Unicode Consortium, particularly to the Script Encoding Working Group (previously the Script Ad Hoc Group). As far as we're aware, zero members of the committee know Toki Pona, which unfortunately means we fall back on English.

I originally intended to translate this file and library into Toki Pona once Unicode had reviewed our proposal, but this library has picked up some interest outside of the Toki Pona community, so this library and README will remain accessible to them.

What's the deal with the tokenizers?

The Toki Pona tokenizer sonatoki.Tokenizers.WordTokenizer attempts to tokenize statements such that every token either represents a word candidate ("toki", "mumumu") or a complete non-candidate ("..!", "123"). A design like this would be highly undesirable in a tokenizer such as NLTK's English tokenizer, because words in languages other than Toki Pona can have punctuation characters in or around them that are part of the word. Toki Pona doesn't have any mid-word symbols when rendered in the Latin alphabet or in Private Use Area Unicode characters, so a more aggressive tokenizer is highly desirable.

However, this tokenizer doesn't ignore intra-word punctuation entirely. Instead, exactly one of - or ' is allowed at a time, so long as both of its neighbors are writing characters. This significantly increases the accuracy of the tokenizer, and in turn makes identifying Toki Pona sentences among arbitrary ones more accurate.
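The splitting rule described above can be approximated with a single regular expression. This is a minimal sketch under stated assumptions, not the code behind WordTokenizer: the `tokenize` name is hypothetical, and "writing character" is taken to mean any Unicode letter.

```python
import re

# Assumption: a "writing character" is any Unicode letter (no digits, no underscore).
LETTER = r"[^\W\d_]"
# Word candidate: letters, optionally joined by a single - or ' that has a letter on both sides.
WORD = rf"{LETTER}+(?:[-']{LETTER}+)*"
# Non-candidate: a run of non-space characters, none of which is a letter.
NON_WORD = rf"(?:(?!{LETTER})\S)+"
TOKEN_RE = re.compile(rf"{WORD}|{NON_WORD}")


def tokenize(text: str) -> list[str]:
    """Split text into word candidates and complete non-candidates."""
    return TOKEN_RE.findall(text)
```

Under these assumptions, "toki! mumumu ..! 123" splits into "toki", "!", "mumumu", "..!", and "123", while "isn't" and "half-baked" each stay whole. A doubled mark such as "a--b" is split into three tokens, because only one - or ' is accepted between letters.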

Splitting into word candidates and non-candidates matters because any encoding of Toki Pona's logographic script will require each character to be split into its own token, where the default behavior would be to leave consecutive non-punctuation characters together.
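To make the per-character requirement concrete, here is a small sketch that emits each logograph as its own token while leaving other runs of characters together. It assumes the logographs live in the Private Use Area block U+F1900 to U+F19FF; both that range and the `split_logographs` helper are assumptions for illustration, not part of the library.

```python
# Assumed Private Use Area range for the logographic script (an assumption of this sketch).
LOGOGRAPH_START = 0xF1900
LOGOGRAPH_END = 0xF19FF


def is_logograph(ch: str) -> bool:
    return LOGOGRAPH_START <= ord(ch) <= LOGOGRAPH_END


def split_logographs(text: str) -> list[str]:
    """Emit every logograph as its own token; keep other non-space runs together."""
    tokens: list[str] = []
    buffer: list[str] = []

    def flush() -> None:
        if buffer:
            tokens.append("".join(buffer))
            buffer.clear()

    for ch in text:
        if is_logograph(ch):
            flush()
            tokens.append(ch)  # one token per logograph, even when adjacent
        elif ch.isspace():
            flush()
        else:
            buffer.append(ch)
    flush()
    return tokens
```

Two adjacent logographs followed by " toki" come out as three tokens: one per logograph, then "toki".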

Aren't there a lot of false positives?

For any individual filter, yes. Here are some examples:

ProperName will errantly match text in languages without a capital/lowercase distinction.
Alphabetic matches words so long as they are only made of letters in Toki Pona's alphabet, which is 14 letters of the Latin alphabet.
Syllabic and Phonotactic, despite imposing more structure than Alphabetic, will match a surprising number of English words. For example, every word in "an awesome joke!" matches.
NimiPu and NimiLinkuCore will match a, mute, open regardless of the surrounding language.
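The false positives listed above are easy to reproduce with simplified stand-ins for two of these filters. The functions below are assumptions written for this example, not the library's Alphabetic and Syllabic classes; they expect lowercase input and model a syllable as an optional consonant, a vowel, and an optional final n.

```python
import re

# The 14 Latin letters used by Toki Pona.
ALPHABET = set("aeijklmnopstuw")

# Simplified syllable structure: only the first syllable may start with a vowel.
SYLLABIC_RE = re.compile(r"[jklmnpstw]?[aeiou]n?(?:[jklmnpstw][aeiou]n?)*")


def alphabetic(word: str) -> bool:
    """True if every letter of the word is in Toki Pona's alphabet."""
    return bool(word) and set(word) <= ALPHABET


def syllabic(word: str) -> bool:
    """True if the whole word is a sequence of (C)V(n) syllables."""
    return SYLLABIC_RE.fullmatch(word) is not None
```

Every word of "an awesome joke" passes both checks, even though the sentence is English. A word like "test" passes `alphabetic` but fails `syllabic`, which shows the extra structure helping in some cases but not all.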

This is the point of Ilo and the Scorers: none of these filters would individually be able to correctly identify a Toki Pona statement, but all of them working together, with some tuning, are able to achieve surprisingly high accuracy.

Don't some of the cleaners/filters conflict?

Yes, though not terribly much.

ConsecutiveDuplicates may errantly change a word's validity. For example, "manna" is phonotactically invalid in Toki Pona, but would become "mana" which is valid.
ConsecutiveDuplicates will not work correctly with syllabaries, though this should not change the validity of the analyzed word unless you attempt to dictionary match these words.
If you build your own MemberFilter with words that have capital letters or consecutive duplicates, they will never match unless you use prep_dictionary.
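The first and third conflicts can be seen in a short sketch. The helpers below are hypothetical and only imitate the behavior described here: `collapse_duplicates` stands in for a ConsecutiveDuplicates-style cleaner, and `prepare_words` applies the same normalization to a custom word list. Neither is the library's code, and the word "Waaa" is made up for the example.

```python
import re


def collapse_duplicates(word: str) -> str:
    """Replace any run of a repeated character with a single copy."""
    return re.sub(r"(.)\1+", r"\1", word)


def prepare_words(words: set[str]) -> set[str]:
    """Normalize dictionary entries the same way incoming tokens are cleaned."""
    return {collapse_duplicates(word.lower()) for word in words}


# A custom dictionary containing a capital letter and consecutive duplicates.
raw_words = {"Waaa"}
prepared_words = prepare_words(raw_words)

# An incoming token after lowercasing and duplicate removal.
token = collapse_duplicates("Waaa".lower())
```

Here `collapse_duplicates("manna")` gives "mana", which is how an invalid word can become valid. The cleaned token is "wa", so it can never match the raw entry "Waaa", but it does match the prepared entry.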

You'll notice these are mostly caused by applying Latin alphabet filters to non-Latin text. Working on it!

sonatoki 0.9.2
