Linguistic Identification of Morphosyntactic and Expressive Snags

Reason this release was yanked:

import structure between docs and library was misaligned

Project description

LIMES: Linguistic Identification of Morphosyntactic and Expressive Snags

LIMES is a library for analyzing the linguistic complexity of texts. The goal of this project is to create a tool that provides actionable insights on how to make written texts easier to comprehend.

Refer to the project documentation for in-depth information about concepts, API, and more.

Please note that the actual logic for identifying language barriers is completely language-specific. Because it is a lot of work to develop these heuristics, the library currently only ships with implemented analyzers for German texts. However, we encourage you to build your own analyzers based on the provided class templates, either for your own use or to contribute to the project.

Installation

You can install this package via pip by running:

pip install limes

Note: You currently can't install via pip because the package isn't published yet. If you can't wait to try it out for yourself, try building from source using uv.

Additional Dependencies

The library requires a Parser. Currently, we only ship a parser based on spaCy's NLP pipeline. This means that you need to install a spaCy model that supports the language you are working with.
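spaCy models installed via pip are regular Python packages, so you can check for one before constructing a parser. The following is a minimal sketch using only the standard library; the helper name `model_is_installed` is our own and not part of LIMES or spaCy.

```python
import importlib.util


def model_is_installed(model_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    # Hypothetical helper: pip-installed spaCy models are importable
    # packages, so looking up the import spec is enough for a quick check.
    return importlib.util.find_spec(model_name) is not None


name = "de_core_news_sm"
if model_is_installed(name):
    print(f"{name} is available")
else:
    # spaCy's own CLI can fetch the model.
    print(f"{name} is missing; try: python -m spacy download {name}")
```

This only checks that the package exists, not that it is compatible with your installed spaCy version.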

Example Usage

You must use a string container to wrap the text you want to analyze. Because our analyses work at the sentence level, you can either split the text into sentences yourself and create separate Sentence objects, or pass your whole text to a Text object that takes care of sentencization for you.

We will do the latter for the purpose of this example.

from limes import Text, SpacyParser, GermanAnalyzer

analyzer = GermanAnalyzer()

# You can also pass a spacy NLP object instead of the model name.
# Make sure the model you want to use is installed.
parser = SpacyParser(model="de_core_news_sm")

text = Text(
    raw="Das hier ist ein Text. Dieser Text hat mehrere Sätze.",
    analyzer=analyzer,
    parser=parser,
)

Identifying Barriers

Barriers are detected lazily, and results are cached to avoid redundant computations. Barriers themselves are a property of the Text object.

# You can iterate over all barriers in the entire text if you want.
for barrier in text.barriers:
    print(barrier.title)
    # Print the tokens affected by this barrier.
    print(barrier.affected_tokens)
    # Print the positions of those tokens in the source text.
    print([token.i for token in barrier.affected_tokens])

# You can also iterate over each sentence.
for sentence in text:
    print(sentence.barriers)

# Alternatively, you can also inspect a specific sentence by index.
print(text[1].barriers)
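To illustrate what "detected lazily and cached" means in practice, here is a minimal sketch built on `functools.cached_property`. The class `LazySentence` and its toy long-word heuristic are hypothetical stand-ins for illustration; this is not how LIMES implements its Text or Sentence objects.

```python
from functools import cached_property


class LazySentence:
    """Hypothetical stand-in, not a LIMES class."""

    def __init__(self, raw: str) -> None:
        self.raw = raw
        self.detection_runs = 0  # counts how often detection actually ran

    @cached_property
    def barriers(self) -> list[str]:
        # Runs only on first access; the result is then stored on the
        # instance and returned directly on every later access.
        self.detection_runs += 1
        # Toy heuristic (an assumption): flag words longer than 12 characters.
        words = (w.strip(".,!?") for w in self.raw.split())
        return [w for w in words if len(w) > 12]


sentence = LazySentence("Die Donaudampfschifffahrt ist ein langes Wort.")
first = sentence.barriers   # triggers detection
second = sentence.barriers  # served from the cache
print(first, sentence.detection_runs)
```

Nothing is computed until `barriers` is first read, and repeated reads do not repeat the work.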

Please note that barriers are also language-specific, because languages differ in what makes a text difficult to comprehend.

Calculating Complexities

There are multiple ways in which you can try to approximate language complexity (see our documentation for more information).

from limes import ComplexityAlgorithm

# Get the average complexity of the text. You can manually set the heuristic.
avg_complexity = text.average_complexity(
    heuristic=ComplexityAlgorithm.AGGREGATED_LOCAL,
)
print(avg_complexity)

# Alternatively, you can get phrase-level complexities.
# These are also lazily computed and cached.
for phrase, complexity in text.local_complexities:
    print(phrase)
    print(complexity)

# You could also iterate over all sentences in the text and get each sentence's
# global complexity.
for sentence in text:
    complexity = sentence.global_complexity(
        heuristic=ComplexityAlgorithm.AGGREGATED_LOCAL,
    )
    print(sentence)
    print(complexity)
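As a rough intuition for averaging sentence-level scores into one text-level value, consider the sketch below. Both `toy_sentence_score` and `average_score` are hypothetical, and mean word length is only a placeholder metric; the actual heuristics behind `ComplexityAlgorithm` are described in the project documentation and may differ substantially.

```python
from statistics import fmean


def toy_sentence_score(sentence: str) -> float:
    """Hypothetical placeholder score: mean word length in characters."""
    words = sentence.split()
    return fmean([len(w) for w in words]) if words else 0.0


def average_score(sentences: list[str]) -> float:
    """Average the per-sentence scores; an empty text scores 0.0."""
    scores = [toy_sentence_score(s) for s in sentences]
    return fmean(scores) if scores else 0.0


print(average_score(["ab cd", "abcd"]))
```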

Next Steps

A good place to start is to get an overview of the concepts used to build and configure the whole processing pipeline.

Currently Supported Languages

Language | Contributors
DE | Katja Grosch & Susanne Wagner (IFTO GmbH), Jannik Schmitt (deepsight GmbH)

Additional Resources

Word Frequency Lists

German

The frequency list for German words was kindly provided by Projekt Deutscher Wortschatz of the Universität Leipzig. The unprocessed list included in this repository (data/deu_words_2024.txt) is based on [1]. Please note that it is not based on the publicly available "Normgrößenkorpora" but was provided on request by the Leipzig Corpora team under a CC BY 4.0 license.
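If you want to inspect such a list yourself, a small parser can be sketched as follows. It assumes three tab-separated columns per line (an id or rank, the word, and its count), which is how Leipzig word lists are commonly laid out; the file shipped in this repository may use a different layout, so check it first. The function name `parse_frequency_lines` is our own.

```python
from collections.abc import Iterable


def parse_frequency_lines(lines: Iterable[str]) -> dict[str, int]:
    """Map each word to its count, skipping lines that do not fit."""
    frequencies: dict[str, int] = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # blank or malformed line
        _rank, word, count = parts
        if not count.isdigit():
            continue  # count column is not a plain integer
        frequencies[word] = frequencies.get(word, 0) + int(count)
    return frequencies


# Made-up sample data in the assumed layout, not taken from the real list.
sample = ["1\tder\t1000\n", "2\tdie\t900\n", "broken line\n"]
table = parse_frequency_lines(sample)
print(table)
```

To read from disk, pass an open file handle (opened with `encoding="utf-8"`) instead of the sample list.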

References

[1] Leipzig Corpora Collection (2024). German news corpus based on material from 2024. Leipzig Corpora Collection. Dataset. https://corpora.uni-leipzig.de/en?corpusId=deu_news_2024

Download files

Source Distribution

lang_limes-0.1.0.tar.gz (53.2 MB)

Uploaded Source

Built Distribution

lang_limes-0.1.0-py3-none-any.whl (19.8 MB)

Uploaded Python 3

File details

Details for the file lang_limes-0.1.0.tar.gz.

File metadata

  • Download URL: lang_limes-0.1.0.tar.gz
  • Upload date:
  • Size: 53.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.16

File hashes

Hashes for lang_limes-0.1.0.tar.gz
Algorithm Hash digest
SHA256 957b88ce030f24f77b73d3a0756c627660eb60f87d2208e1a2d87f789e919835
MD5 9b76f18c9edf8f663ef337d8f41e36d8
BLAKE2b-256 2a8e97fbd1db0301a5b961f4667553852cddd7712b33bf37b119acdbb0383dea
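A downloaded file can be checked against a published SHA256 digest with Python's standard `hashlib` module. The sketch below is one way to do it; the helper names `sha256_of_file` and `matches_expected` are our own, and the path you pass in is wherever you saved the download.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, `expected_hex` would be the SHA256 value listed above.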

File details

Details for the file lang_limes-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: lang_limes-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 19.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.16

File hashes

Hashes for lang_limes-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b3171babf6267ca7ccce665bee7eef575f3fc8c7392d413875ba31b03cce7173
MD5 ea9b5fc850e09bcc73c7d5238bce50bd
BLAKE2b-256 3f4936df3405b84b755286eb79a862b67eab02cfd5b0f5bdab7aa77def725cee
