StyloMetrix tool

StyloMetrix

Zakład Inżynierii Lingwistycznej i Analizy Tekstu, NASK PIB

📌 Quick

💡 Stylometry tool (beta) for Polish and English, distributed as a Python package

💡 Tutorial notebook

💡 List of built-in metrics for Polish, English

💡 Helper functions and extensions

🔖 Citation

Please cite this article when referring to StyloMetrix:

Okulska, I., & Zawadzka, A. Styles with Benefits. The StyloMetrix Vectors for Stylistic and Semantic Text Classification of Small-Scale Datasets and Different Sample Length.

🔔 About

StyloMetrix is a tool for creating text representations as StyloMetrix vectors. Each metric in the vector quantifies one linguistic feature of the text, so detailed information about a text's style can be translated into numeric values and used for whatever you want!

The metrics are:

  • interpretable - each metric represents an aspect of linguistic knowledge
  • normalized - each metric expresses the number of occurrences of a given feature per number of tokens in the text, which removes the scaling effect between texts of different lengths
  • reproducible - metric values can be recalculated, or even counted manually, and always give the same output. The representation does not depend on any random factor or seeding
  • customizable - if your needs exceed the scope of built-in metrics, create your own! Don't forget to share your work and contribute to the community of StyloMetrix!
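The normalization property can be illustrated with a minimal sketch in plain Python. The helper `feature_rate` and the predicate `is_capitalized` are hypothetical names made up for this illustration, and whitespace splitting stands in for real tokenization; this is not StyloMetrix code.

```python
def feature_rate(tokens, predicate):
    """Share of tokens that satisfy `predicate`; 0.0 for an empty text."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if predicate(token))
    return hits / len(tokens)


def is_capitalized(token):
    """Toy linguistic feature: the token starts with an uppercase letter."""
    return token[:1].isupper()


# Whitespace splitting is a stand-in for a real tokenizer.
short_text = "Anna reads books".split()
long_text = short_text * 10  # same style, ten times the length

# Both calls give the same rate, because counts are divided by text length.
print(feature_rate(short_text, is_capitalized))
print(feature_rate(long_text, is_capitalized))
```

Because the count is divided by the number of tokens, repeating the same text does not change the value, which is what makes vectors of texts with different lengths comparable.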

A StyloMetrix vector can be used as:

  • a stylometric signature that encodes the writing style of the author and the genre
  • input for supervised or unsupervised learning, for example a Random Forest classifier or feature selection algorithms
  • values for statistical analyses in research
  • a set of linguistic data for manual reference
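One way to treat vectors as stylometric signatures is to compare them with a similarity measure. The sketch below uses cosine similarity, which is our choice for illustration and not something StyloMetrix prescribes. The three vectors are invented toy values, not output of the tool.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented toy vectors (three metrics each), for illustration only.
author_a_text_1 = [0.10, 0.02, 0.30]
author_a_text_2 = [0.11, 0.02, 0.28]
author_b_text = [0.01, 0.20, 0.05]

same_author = cosine_similarity(author_a_text_1, author_a_text_2)
other_author = cosine_similarity(author_a_text_1, author_b_text)
print(same_author > other_author)
```

The same lists of numbers could be passed as feature rows to any classifier that accepts numeric input.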

The tool offers customization of vectors by selecting from built-in metrics or creating new metrics according to user's needs. We provide a user-friendly interface to support these tasks. See instructions below! ⬇

StyloMetrix is currently available for Polish and English!

📢 Release

Our most recent release is:

v0.0.6

  • Add categories Syntactic and Lexical for English

Previous releases

v0.0.4

  • Add English beta with built-in metrics in category Grammatical Forms

v0.0.3

  • Add StyloMetrix structure
  • Add tutorial
  • Add 6 built-in metrics categories for Polish beta: Grammatical Forms, Inflection, Lexical, Psycholinguistic, Syntactic, Word Formation
  • Specify license & citation

🔨 Installation

1. Install spaCy

Install spaCy by following the spaCy installation instructions.

2. Install model

For English:

Install en_core_web_trf by following the spaCy installation instructions.

For Polish:

Download and install the model pl_nask v0.0.5.

📍 pl_nask is the new HerBERT-based model from IPI PAN; it requires spacy==3.3

python -m pip install <PATH_TO_MODEL/pl_nask-0.0.5.tar.gz> 

3. Install StyloMetrix

python -m pip install stylo_metrix
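Since pl_nask is pinned to spaCy 3.3, it can help to check the installed version before loading the model. The sketch below uses only the standard library. The helper names are hypothetical, and reading the pin as "any 3.3.x release" is our assumption, not a statement from the StyloMetrix authors.

```python
from importlib import metadata


def matches_minor(version, required="3.3"):
    """True when `version` has the required major.minor prefix (assumed reading of the pin)."""
    return version.split(".")[:2] == required.split(".")


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("spacy")
if found is None:
    print("spaCy is not installed")
elif not matches_minor(found):
    print(f"spaCy {found} found; pl_nask expects 3.3")
else:
    print(f"spaCy {found} looks compatible")
```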

🪁 How to use

  1. Add the StyloMetrix pipe to the spaCy pipeline (load only the model for your language):
import spacy
import stylo_metrix
nlp = spacy.load('pl_nask')           # for Polish
# nlp = spacy.load('en_core_web_trf') # for English
nlp.add_pipe("stylo_metrix")
  2. Use it on any text:
# Polish for: "On this beautiful day there was not a single cloud in the sky."
doc = nlp("W ten piękny dzień na niebie nie było ani jednej chmurki.")
doc._.stylo_metrix_vector
  3. Find your results in the doc._.stylo_metrix_vector extension, or doc._.smv for convenience.

That's it! Find more usage and customization options in the Extended use section or the notebook tutorial.

📈 Metrics

We have put care into creating a set of powerful built-in metrics. See the list below ⬇. However, since flexibility is strength, we provide an easy way to create new metrics and mix existing groups. See the Extended use section!

Polish (see full list):

Group               Import
Grammatical Forms   grammatical_forms_group
Inflection          inflection_group
Lexical             lexical_group
Psycholinguistic    psycholinguistic_group
Punctuation         punctuation_group
Syntactic           syntactic_group
Word Formation      word_formation_group
All ⬆               original_group

English (see full list):

Group               Import
Grammatical Forms   grammatical_forms_group
Syntactic           syntactic_group
Lexical             lexical_group
All ⬆               original_group

🚀 Extended use

See our notebook tutorial for complete instructions!

Imports that you will use:

from stylo_metrix.structures import CustomMetric, MetricsGroup
from stylo_metrix.utils import incidence, ratio

1. Create custom metrics

Quickest way: write a function that returns a value and decorate it with CustomMetric(). You can use all spaCy features:

# Metric name in Polish: "Number of non-empty tokens"
@CustomMetric("Liczba niepustych tokenów")
def METRIC(doc):
    result = doc._.n_tokens
    return result

Or add more details and debug information to keep your metrics clean:

@CustomMetric(name_pl="Występowanie czasowników w 3 os. l. poj.", name_en="Third person singular verb incidence")
def VERBS_3S(doc):
    verbs = [token for token in doc
            if token._.pos == "v" and token._.verb_person == "s3"]
    result = ratio(len(verbs), doc._.n_tokens)
    debug = {"verbs": verbs, "n_tokens": doc._.n_tokens}
    return result, debug
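To make the decorator pattern concrete, here is a minimal stand-alone sketch of how a decorator could wrap a metric function and package its result. The name `custom_metric`, the dictionary keys, and the toy metric are hypothetical; this is not the implementation of `CustomMetric` in stylo_metrix, and it works on a plain list of strings instead of a spaCy doc.

```python
import functools


def custom_metric(name=None):
    """Hypothetical decorator that packages a metric's result as a dictionary."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(tokens):
            outcome = func(tokens)
            # A metric may return a bare value or a (value, debug) pair.
            if isinstance(outcome, tuple):
                value, debug = outcome
            else:
                value, debug = outcome, {}
            return {
                "value": value,
                "code": func.__name__,
                "name": name or f"Metric {func.__name__}",
                "debug": debug,
            }
        return wrapper
    return decorator


@custom_metric("Share of long words")
def LONG_WORDS(tokens):
    long_words = [token for token in tokens if len(token) > 6]
    value = len(long_words) / len(tokens) if tokens else 0.0
    return value, {"long_words": long_words}


result = LONG_WORDS("a stylometric signature".split())
print(result["code"], result["debug"])
```

Returning the debug payload next to the value makes it easy to see which tokens were counted when a metric gives an unexpected number.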

2. Use new metrics

Put your metrics in a group and update the nlp object so that it uses your new group:

my_group = MetricsGroup(METRIC, VERBS_3S)
nlp.metrics_group = my_group

Now run nlp(text) and that's it! Find the metric in doc._.stylo_metrix_vector or doc._.smv.

3. Create groups

Put custom metrics in groups to manage them. Create a new MetricsGroup or concatenate existing groups:

group = MetricsGroup(METRIC, VERBS_3S)
# <MetricsGroup [METRIC, VERBS_3S]>

Import groups of metrics from our built-in set:

from stylo_metrix.metrics.pl import verbs_tenses_group, verbs_aspects_group
large_group = group + verbs_tenses_group + verbs_aspects_group
# <MetricsGroup [METRIC, VERBS_3S, IN_V_PAST, IN_V_PRES, IN_V_FUT, IN_V_FUTS, IN_V_FUTC, IN_V_PERF, IN_V_IMPERF]>
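Concatenating groups with `+` can be pictured with a small stand-alone class. The sketch below is an assumption about how such a container might behave, written in plain Python; `MetricGroupSketch` and the two toy metrics are hypothetical and are not part of stylo_metrix.

```python
class MetricGroupSketch:
    """Hypothetical ordered container of metric functions that supports `+`."""

    def __init__(self, *metrics):
        self.metrics = list(metrics)

    def __add__(self, other):
        if not isinstance(other, MetricGroupSketch):
            return NotImplemented
        # Concatenation keeps the order: left group first, then right group.
        return MetricGroupSketch(*self.metrics, *other.metrics)

    def codes(self):
        return [metric.__name__ for metric in self.metrics]

    def vector(self, tokens):
        """Apply every metric in order; the list position identifies the metric."""
        return [metric(tokens) for metric in self.metrics]

    def __repr__(self):
        return f"<MetricGroupSketch [{', '.join(self.codes())}]>"


def TOKEN_COUNT(tokens):
    return len(tokens)


def CAPITALIZED(tokens):
    return sum(1 for token in tokens if token[:1].isupper())


combined = MetricGroupSketch(TOKEN_COUNT) + MetricGroupSketch(CAPITALIZED)
print(combined)
print(combined.vector(["Anna", "reads"]))
```

Keeping the metrics in a fixed order is what lets each position of the resulting vector be traced back to a metric code.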

4. Save documentation

Keep your work clean by saving a record of your metrics. Use get_codes() or get_descriptions() to get a list of strings for tagging, get_md() or get_txt() to print a neatly formatted table of metrics, or save_txt(path) and save_md(path) to generate and save the list in one line:

group.get_txt()
# Nr   Kategoria            Kod              Nazwa                                   
# -----------------------------------------------------------------------------------
# Dodane metryki       METRIC           Metric METRIC                           
# Dodane metryki       VERBS_3S         Metric VERBS_3S                         
# ...
# Fleksja              IN_V_IMPERF      Występowanie czasowników w aspekcie niedokonanym
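For readers curious how such a plain-text table can be produced, here is a minimal sketch of column-aligned formatting. The function `format_table` is hypothetical, the column layout is our own, and the rows are illustrative; the real get_txt() output may be laid out differently.

```python
def format_table(rows, headers):
    """Render rows as a left-aligned plain-text table with a dashed rule under the header."""
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(str(cell)) for cell in column)
              for column in zip(headers, *rows)]

    def render(row):
        cells = (str(cell).ljust(width) for cell, width in zip(row, widths))
        return "  ".join(cells).rstrip()

    rule = "-" * (sum(widths) + 2 * (len(widths) - 1))
    return "\n".join([render(headers), rule, *(render(row) for row in rows)])


# Illustrative rows only.
rows = [
    ("Custom", "METRIC", "Metric METRIC"),
    ("Custom", "VERBS_3S", "Metric VERBS_3S"),
]
table = format_table(rows, ("Category", "Code", "Name"))
print(table)
```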

5. Use built-in extensions and functions

We share some features to facilitate your work. See the full list of helper functions and extensions.

Extensions

Skip repetitive searches by using built-in extensions, such as token._.pos for the part of speech or doc._.n_tokens for the number of tokens.

Functions

Use built-in functions to replace the most frequently written lines of code and to avoid the most common errors (like zero division). Currently we provide the following functions: incidence, mean, median, ratio, stdev.

Let's use them to compute the incidence of verbs that start with the letter A in a text.

@CustomMetric("Czasowniki rozpoczynające się na A")
def A_VERBS(doc):
    search = [token for token in doc 
              if token._.pos == 'v' and token.prefix_ == 'a']
    result = incidence(doc, search)
    debug = {'verbs': search}
    return result, debug

# Polish for: "Aneta often got involved in absorbing activities, but she could not swim."
A_VERBS(nlp("Aneta często angażowała się w absorbujące aktywności, ale nie potrafiła pływać."))
# {'value': 0.15384615384615385,
#  'code': 'A_VERBS',
#  'name_pl': 'Metric A_VERBS',
#  'category_pl': 'Dodane metryki',
#  'debug': {'verbs': [angażowała, absorbujące]}}
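The zero-division guard mentioned above can be sketched in a few lines. The helpers `safe_ratio` and `safe_mean` are hypothetical stand-ins written for this illustration; returning 0.0 for an empty denominator or an empty list is our assumption and may differ from what the library's ratio and mean do.

```python
import statistics


def safe_ratio(numerator, denominator):
    """Divide, but return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def safe_mean(values):
    """Arithmetic mean that returns 0.0 for an empty sequence."""
    values = list(values)
    return statistics.mean(values) if values else 0.0


# Two matching tokens out of thirteen, as in the A_VERBS example.
print(safe_ratio(2, 13))
# An empty document no longer raises ZeroDivisionError.
print(safe_ratio(0, 0))
```

Centralizing the guard in one helper keeps every metric free of repeated `if n_tokens == 0` checks.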

📪 Contact

Zakład Inżynierii Lingwistycznej i Analizy Tekstu, Naukowa i Akademicka Sieć Komputerowa – Państwowy Instytut Badawczy

Anna Zawadzka anna.zawadzka@nask.pl | Inez Okulska inez.okulska@nask.pl

Copyright (C) 2022 NASK PIB
