A library for calculating a variety of features from text using spaCy
Project description
TextDescriptives
A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.
🔧 Installation
pip install textdescriptives
📰 News
- Version 2.0 out with a new API, a new component, updated documentation, and tutorials! Components are now called by "
textdescriptives/{metric_name}
. Newcoherence
component for calculating the semantic coherence between sentences. See the documentation for tutorials and more information!
⚡ Quick Start
Import the library and add the component(s) to your pipeline using the standard spaCy syntax. Available components are descriptive_stats, readability, dependency_distance, pos_proportions, coherence, and quality prefixed with textdescriptives/
.
If you want to add all components you can use the shorthand textdescriptives/all
.
import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/all")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
# access some of the values
doc._.readability
doc._.token_length
TextDescriptives includes convenience functions for extracting metrics to a Pandas DataFrame or a dictionary.
td.extract_dict(doc)
td.extract_df(doc)
text | first_order_coherence | second_order_coherence | pos_prop_DET | pos_prop_NOUN | pos_prop_AUX | pos_prop_VERB | pos_prop_PUNCT | pos_prop_PRON | pos_prop_ADP | pos_prop_ADV | pos_prop_SCONJ | flesch_reading_ease | flesch_kincaid_grade | smog | gunning_fog | automated_readability_index | coleman_liau_index | lix | rix | n_stop_words | alpha_ratio | mean_word_length | doc_length | proportion_ellipsis | proportion_bullet_points | duplicate_line_chr_fraction | duplicate_paragraph_chr_fraction | duplicate_5-gram_chr_fraction | duplicate_6-gram_chr_fraction | duplicate_7-gram_chr_fraction | duplicate_8-gram_chr_fraction | duplicate_9-gram_chr_fraction | duplicate_10-gram_chr_fraction | top_2-gram_chr_fraction | top_3-gram_chr_fraction | top_4-gram_chr_fraction | symbol_#_to_word_ratio | contains_lorem ipsum | passed_quality_check | dependency_distance_mean | dependency_distance_std | prop_adjacent_dependency_relation_mean | prop_adjacent_dependency_relation_std | token_length_mean | token_length_median | token_length_std | sentence_length_mean | sentence_length_median | sentence_length_std | syllables_per_token_mean | syllables_per_token_median | syllables_per_token_std | n_tokens | n_unique_tokens | proportion_unique_tokens | n_characters | n_sentences | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The world is changed(...) | 0.633002 | 0.573323 | 0.097561 | 0.121951 | 0.0731707 | 0.170732 | 0.146341 | 0.195122 | 0.0731707 | 0.0731707 | 0.0487805 | 107.879 | -0.0485714 | 5.68392 | 3.94286 | -2.45429 | -0.708571 | 12.7143 | 0.4 | 24 | 0.853659 | 2.95122 | 41 | 0 | 0 | 0 | 0 | 0.232258 | 0.232258 | 0 | 0 | 0 | 0 | 0.0580645 | 0.174194 | 0 | 0 | False | False | 1.77524 | 0.553188 | 0.457143 | 0.0722806 | 3.28571 | 3 | 1.54127 | 7 | 6 | 3.09839 | 1.08571 | 1 | 0.368117 | 35 | 23 | 0.657143 | 121 | 5 |
📖 Documentation
TextDescriptives has a detailed documentation as well as a series of Jupyter notebook tutorials.
All the tutorials are located in the docs/tutorials
folder and can also be found on the documentation webiste.
Documentation | |
---|---|
📚 Getting started | Guides and instructions on how to use TextDescriptives and its features. |
😎 Tutorials | Detailed tutorials on how to make the most of TextDescriptives |
📰 News and changelog | New additions, changes and version history. |
🎛 API References | The detailed reference for TextDescriptive's API. Including function documentation |
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for textdescriptives-2.0.4-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 475faad042bc309b104e8b1a0359d8d5d2182b133b3284848c2302cba6a10ef3 |
|
MD5 | fc6ac4682d4f0e025c669efc1271cd16 |
|
BLAKE2b-256 | f012a4ead6f4298cd87d9bc90b34440662fcbd03bf0e8bdd59e2eb62bc61f5ae |