StyloMetrix tool
StyloMetrix
Zakład Inżynierii Lingwistycznej i Analizy Tekstu, NASK PIB
📌 Quick
💡 Stylometry tool in beta version for the Polish, English and Ukrainian languages, distributed as a Python package
💡 List of built-in metrics for Polish and English
💡 Helper functions and extensions
🔖 Citation
Please cite this article when referring to StyloMetrix:
Okulska, I., & Zawadzka, A. Styles with Benefits. The StyloMetrix Vectors for Stylistic and Semantic Text Classification of Small-Scale Datasets and Different Sample Length.
🔔 About
StyloMetrix is a tool for creating text representations as StyloMetrix vectors. Each metric in the vector quantifies a linguistic feature of the text, so detailed information about the style of a text can be translated into numeric values and used for whatever you want!
The metrics are:
- interpretable - each metric represents an aspect of linguistic knowledge
- normalized - metrics express the number of occurrences of a given feature per number of tokens in the text, which avoids scaling effects between texts of different lengths
- reproducible - values of metrics can be recalculated or even counted manually, always giving the same output. The representation doesn't depend on any random factor or seeding
- customizable - if your needs exceed the scope of built-in metrics, create your own! Don't forget to share your work and contribute to the community of StyloMetrix!
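To make the normalization property concrete, here is a minimal sketch. It is not StyloMetrix's implementation: it assumes naive whitespace tokenization, and the helper `normalized_metric` and the feature `is_long_word` are hypothetical names invented for illustration (StyloMetrix itself works on spaCy-analyzed tokens).

```python
from typing import Callable, List


def normalized_metric(tokens: List[str], matches: Callable[[str], bool]) -> float:
    """Return the share of tokens for which `matches` is true (0.0 for empty input)."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if matches(token))
    return hits / len(tokens)


def is_long_word(token: str) -> bool:
    # Hypothetical feature for illustration: words longer than 6 characters.
    return len(token) > 6


short_text = "Stylometry quantifies style".split()
long_text = short_text * 10  # same style, ten times the length

# Both calls print the same value because counts are divided by text length.
print(normalized_metric(short_text, is_long_word))
print(normalized_metric(long_text, is_long_word))
```

Because the count is divided by the number of tokens, the short and the long text receive the same value, which is what makes metrics comparable across texts of different lengths.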
A StyloMetrix vector can be used as:
- stylometric signature that encodes the writing style of the author and the genre
- input for supervised or unsupervised learning, for example a Random Forest classifier or feature selection algorithms
- values for statistical analyses in science
- set of linguistic data for manual reference
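As one illustration of the "stylometric signature" use, the sketch below compares vectors with cosine similarity. This is an assumption about how one might compare signatures, not a feature of StyloMetrix; the function `cosine_similarity` is a hypothetical helper and the numbers are made up rather than real StyloMetrix output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up metric values for three texts (not real StyloMetrix output).
text_a = [0.12, 0.30, 0.05, 0.08]
text_b = [0.11, 0.28, 0.06, 0.09]
text_c = [0.40, 0.02, 0.25, 0.01]

print(cosine_similarity(text_a, text_b))  # stylistically close pair
print(cosine_similarity(text_a, text_c))  # stylistically distant pair
```

Since the metrics are normalized, vectors from texts of different lengths can be compared directly in this way.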
The tool offers customization of vectors by selecting from built-in metrics or creating new metrics according to user's needs. We provide a user-friendly interface to support these tasks. See instructions below! ⬇
Currently StyloMetrix is available for the Polish, English and Ukrainian languages!
📢 Release
Our most recent release is:
v0.1.0
- Changing the structure of StyloMetrix
- Works much faster!
- New metrics and categories for Polish and English
- Ukrainian language in beta version
Previous releases ⌛
v0.0.6
- Add categories Syntactic and Lexical for English
v0.0.4
- Add English beta with built-in metrics in category Grammatical Forms
v0.0.3
- Add StyloMetrix structure
- Add tutorial
- Add 6 built-in metric categories for Polish beta: Grammatical Forms, Inflection, Lexical, Psycholinguistic, Syntactic, Word Formation
- Specify license & citation
🔨 Installation
1. Install spaCy
Install spacy according to the spaCy install instructions
2. Install model
▶ For English:
Install en_core_web_trf following the spaCy install instructions
▶ For Polish:
Download and install model pl_nask v0.0.7
📍 pl_nask is the new HerBERT-based model from IPI PAN; it requires spacy==3.3
python -m pip install <PATH_TO_MODEL/pl_nask-0.0.7.tar.gz>
3. Install StyloMetrix
pip install stylo_metrix
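After the three steps, a quick sanity check of the environment can help. The sketch below uses only the standard library; the distribution names in `REQUIRED` are assumptions taken from the install commands above, and `installed_versions` is a hypothetical helper, not part of StyloMetrix.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence

# Distribution names are assumptions based on the install commands above.
REQUIRED = ("spacy", "stylo_metrix", "pl_nask")


def installed_versions(names: Sequence[str] = REQUIRED) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

If you use the Polish model, compare the reported spacy version with the spacy==3.3 requirement noted for pl_nask.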
🪁 How to use
- Get your texts and import StyloMetrix:
import stylo_metrix as sm
texts = ['Panno święta, co Jasnej bronisz Częstochowy I w Ostrej świecisz Bramie!',
'Ofiarowany, martwą podniosłem powiekę; I zaraz mogłem pieszo, do Twych świątyń progu...',
'W ludziach straty nie było. Ale wszystkie ławy Miały zwichnione nogi;']
- Use a StyloMetrix object for these texts:
stylo = sm.StyloMetrix('pl')
metrics = stylo.transform(texts)
print(metrics)
- Your results are now in the metrics object.
That's it! Learn about more uses and customization options in the notebook tutorial.
📈 Metrics
We have put care into creating a set of powerful built-in metrics. See the list below ⬇. However, since flexibility is strength, we provide an easy way to create new metrics.
Polish (see full list)
English (see full list)
Ukrainian (see full list)
📚 We use
- spaCy (MIT License)
- spacy-syllables (MIT License)
- pl_nask model (GNU GPL 3.0 License), Ryszard Tuora and Łukasz Kobyliński, "Integrating Polish Language Tools and Resources in spaCy". In: Proceedings of PP-RAI'2019 Conference, 16-18.10.2019, Wrocław, Poland.
- experimental data from Imbir, K. K. (2016). Affective Norms for 4900 Polish Words Reload (ANPW_R): Assessments for valence, arousal, dominance, origin, significance, concreteness, imageability and, age of acquisition. Frontiers in Psychology, 7, Article 1081. https://doi.org/10.3389/fpsyg.2016.01081
📪 Contact
Zakład Inżynierii Lingwistycznej i Analizy Tekstu, Naukowa i Akademicka Sieć Komputerowa – Państwowy Instytut Badawczy
Adam Nowakowski adam.nowakowski@nask.pl | Inez Okulska inez.okulska@nask.pl
Copyright (C) 2023 NASK PIB