StyloMetrix tool

StyloMetrix

Zakład Inżynierii Lingwistycznej i Analizy Tekstu, NASK PIB

📌 Quick

💡 A stylometry tool, currently in beta, for English, German, Polish, Russian and Ukrainian, distributed as a Python package

💡 Tutorial notebook

💡 List of built-in metrics for Polish, English, German, Ukrainian, Russian

🔖 Citation

Please cite this article when referring to StyloMetrix:

Okulska, I., Stetsenko, D., Kołos, A., Karlińska, A., Głąbińska, K., & Nowakowski, A. (2023). StyloMetrix: An Open-Source Multilingual Tool for Representing Stylometric Vectors. arXiv preprint arXiv:2309.12810.

🔔 About

StyloMetrix is a tool for representing texts as StyloMetrix vectors. Each metric in the vector quantifies one linguistic feature of the text. Detailed information about a text's style can therefore be translated into numeric values and used for whatever you want!

The metrics are:

interpretable - each metric represents an aspect of linguistic knowledge
normalized - each metric expresses the number of occurrences of a given feature per number of tokens in the text, which removes the scaling effect when comparing texts of different lengths
reproducible - metric values can be recalculated, or even counted manually, and always give the same output. The representation does not depend on any random factor or seeding
customizable - if your needs exceed the scope of the built-in metrics, create your own! Don't forget to share your work and contribute to the StyloMetrix community!
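
The normalization property can be illustrated with a minimal sketch. This is not StyloMetrix's implementation: the whitespace tokenizer, the `normalized_frequency` helper and the "starts with an uppercase letter" feature are hypothetical stand-ins chosen only to show why a per-token ratio stays the same when a text gets longer.

```python
def normalized_frequency(tokens, has_feature):
    """Share of tokens for which has_feature(token) is true (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if has_feature(tok)) / len(tokens)


def starts_uppercase(token):
    # Hypothetical feature: the token begins with an uppercase letter.
    return token[:1].isupper()


# The same sentence once, and repeated ten times (naive whitespace tokens).
short_text = "Anna met Piotr".split()
long_text = ("Anna met Piotr " * 10).split()

short_score = normalized_frequency(short_text, starts_uppercase)
long_score = normalized_frequency(long_text, starts_uppercase)

# Raw counts differ (2 vs 20), but the normalized values agree.
print(short_score, long_score)
```

Because both values are ratios of feature occurrences to token count, a short and a long text written in the same way receive the same score.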

A StyloMetrix vector can be used as:

a stylometric signature that encodes the writing style of the author and the genre
input for supervised or unsupervised learning, for example a Random Forest classifier or feature selection algorithms
values for statistical analyses in research
a set of linguistic data for manual reference
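
As a rough illustration of the "stylometric signature" use, the sketch below compares vectors with cosine similarity. The numbers are made up for the example and are not real StyloMetrix output, and `cosine_similarity` is a helper written here, not part of the package.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical metric vectors (four made-up metrics per text).
author_a_text1 = [0.12, 0.30, 0.05, 0.00]
author_a_text2 = [0.10, 0.28, 0.06, 0.01]
author_b_text = [0.02, 0.05, 0.40, 0.22]

same_author = cosine_similarity(author_a_text1, author_a_text2)
different_author = cosine_similarity(author_a_text1, author_b_text)

print(f"same author: {same_author:.3f}, different author: {different_author:.3f}")
```

With these invented values the two texts attributed to the same author score much closer to 1 than the pair from different authors, which is the kind of signal a classifier or clustering method would pick up.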

The tool offers customization of vectors by selecting from built-in metrics or creating new metrics according to user's needs. We provide a user-friendly interface to support these tasks. See instructions below! ⬇

StyloMetrix is currently available for English, German, Polish, Russian and Ukrainian.

📢 Release

Our most recent release is:

v0.1.0

Changing the structure of StyloMetrix
Works much faster!
New metrics and categories for Polish and English
German language in beta version
Russian language in beta version
Ukrainian language in beta version
Metrics or metric categories to use can be specified as a list of strings containing their names.
Intermediate steps can be saved, so even if something crashes you still keep part of your work.
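
The idea behind saving intermediate steps can be sketched generically. How StyloMetrix does this internally is not described here, so everything below is an assumption: `compute_vector` is a hypothetical stand-in for the real metric computation, and the batch size and CSV layout are arbitrary choices.

```python
import csv
import os
import tempfile


def compute_vector(text):
    # Hypothetical stand-in for a real metric computation.
    return [len(text), len(text.split())]


def transform_with_checkpoints(texts, out_path, batch_size=2):
    """Write results batch by batch to out_path; return the number of rows written."""
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text_index", "n_chars", "n_words"])
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            for offset, text in enumerate(batch):
                writer.writerow([start + offset, *compute_vector(text)])
                written += 1
            # Flush after each batch so finished rows are handed to the OS
            # even if a later batch raises an exception.
            handle.flush()
    return written


texts = ["one two three", "four five", "six"]
out_path = os.path.join(tempfile.mkdtemp(), "partial_results.csv")
rows_written = transform_with_checkpoints(texts, out_path)
print(rows_written, "rows saved to", out_path)
```

If processing fails partway through, the rows from completed batches are already in the file and only the remaining texts need to be recomputed.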

Please note that support for Russian and Ukrainian will no longer be available.

Previous releases ⌛

v0.0.6

Add categories Syntactic and Lexical for English

v0.0.4

Add English beta with built-in metrics in category Grammatical Forms

v0.0.3

Add StyloMetrix structure
Add tutorial
Add 6 built-in metrics categories for Polish beta: Grammatical Forms, Inflection, Lexical, Psycholinguistic, Syntactic, Word Formation
Specify license & citation

🔨 Installation

1. Install spaCy

Install spaCy by following the spaCy install instructions.

2. Install model

▶ For Polish:

Download and install the pl_nask model, v0.0.7.

📍 pl_nask is the new HerBERT-based model from IPI PAN; it requires spacy==3.3

python -m pip install <PATH_TO_MODEL/pl_nask-0.0.7.tar.gz>

▶ For other languages:

For English, install en_core_web_trf by following the spaCy install instructions
For German, install de_core_news_lg by following the spaCy install instructions
For Russian, install ru_core_news_lg by following the spaCy install instructions
For Ukrainian, install uk_core_web_trf by following the spaCy install instructions

3. Install StyloMetrix

pip install stylo_metrix
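
After the three steps, a quick check that the packages are visible to your interpreter can save some debugging. The sketch below uses only the standard library; it assumes the Polish setup and that the model installs as an importable package named `pl_nask` (swap in the model name for your language).

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Module names taken from the installation steps; pl_nask is assumed
# to be importable under its model name.
required = ["spacy", "stylo_metrix", "pl_nask"]
status = {name: is_installed(name) for name in required}

for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any line reporting MISSING points to the installation step that needs to be repeated.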

🪁 How to use

Get your texts and import StyloMetrix:

import stylo_metrix as sm

texts = ['Panno święta, co Jasnej bronisz Częstochowy I w Ostrej świecisz Bramie!',
        'Ofiarowany, martwą podniosłem powiekę; I zaraz mogłem pieszo, do Twych świątyń progu...',
        'W ludziach straty nie było. Ale wszystkie ławy Miały zwichnione nogi;']

Create a StyloMetrix object and apply it to these texts:

stylo = sm.StyloMetrix('pl')
metrics = stylo.transform(texts)
print(metrics)

Your results are now in the metrics object.

That's it! Find out about more usages and customization options in the tutorial notebook.

Find out how to use StyloMetrix for classification or clustering in the example notebook.

📈 Metrics

We have put care into creating a set of powerful built-in metrics. See the list below ⬇. However, since flexibility is strength, we provide an easy way to create new metrics.

Polish (see full list)

English (see full list)

German (see full list)

Russian (see full list)

Ukrainian (see full list)

📚 We use

spaCy (MIT License)
spacy-syllables (MIT License)
pl_nask model (GNU GPL 3.0 License), Ryszard Tuora and Łukasz Kobyliński, "Integrating Polish Language Tools and Resources in spaCy". In: Proceedings of PP-RAI'2019 Conference, 16-18.10.2019, Wrocław, Poland.
experimental data from Imbir, K. K. (2016). Affective Norms for 4900 Polish Words Reload (ANPW_R): Assessments for valence, arousal, dominance, origin, significance, concreteness, imageability and, age of acquisition. Frontiers in Psychology, 7, Article 1081. https://doi.org/10.3389/fpsyg.2016.01081

📪 Contact

Zakład Inżynierii Lingwistycznej i Analizy Tekstu, Naukowa i Akademicka Sieć Komputerowa – Państwowy Instytut Badawczy

Inez Okulska inez.okulska@nask.pl | ziliat@nask.pl

Release history

0.1.9.1 (this version) - Mar 19, 2024
0.1.9 - Mar 14, 2024
0.1.8 - Feb 21, 2024
0.1.7 - Sep 25, 2023
0.1.6 - Aug 24, 2023
0.1.5 - Aug 23, 2023
0.1.4 - Aug 23, 2023
0.1.3 - May 12, 2023
0.1.2 - May 8, 2023
0.1.1 - Apr 21, 2023
0.1.0 - Apr 21, 2023
0.0.7 - Sep 2, 2022
0.0.6 - Jun 18, 2022
0.0.5 - Jun 2, 2022
0.0.4 - Jun 1, 2022
0.0.3 - May 18, 2022
0.0.2 - May 16, 2022
0.0.1 - May 16, 2022
0.0.0 - May 15, 2022

Download files

Source Distribution: stylo_metrix-0.1.9.1.tar.gz (175.7 kB), uploaded Mar 19, 2024

Built Distribution: stylo_metrix-0.1.9.1-py3-none-any.whl (214.2 kB), uploaded Mar 19, 2024, Python 3

Hashes for stylo_metrix-0.1.9.1.tar.gz
Algorithm	Hash digest
SHA256	`d5359b2bfe336e9983c09d28c6d8f081f099d9639de9c3b0a64c021880f4a734`
MD5	`367892d5b577244a4fe2c37fd8044fd2`
BLAKE2b-256	`dc4aa5df0e960c08e5c71d8783ae2a9bda3d33770a66aa6d11cfa8916544758a`

Hashes for stylo_metrix-0.1.9.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`e865db192b9dd094c1cd6e31abc9add14cd1dc444542843cae17718d7db4db70`
MD5	`ebe0d51d6781f126f48759c37fbad825`
BLAKE2b-256	`c5b58b1e30de5bcaea235a788620049c58ea3aeb4147381e4d593333027a5938`
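
If you download a file manually, you can compare it against the SHA256 digests listed above. The following is a minimal sketch using only the standard library; the helper names are made up for this example, and the file path in the usage note is a placeholder for wherever you saved the download.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("stylo_metrix-0.1.9.1.tar.gz", "d5359b2bfe336e9983c09d28c6d8f081f099d9639de9c3b0a64c021880f4a734")` should return True for an intact copy of the source distribution.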