A library for calculating a variety of features from text using spaCy
Project description
TextDescriptives
A Python library for calculating a large variety of metrics from text(s) using spaCy v.3 pipeline components and extensions.
🔧 Installation
pip install textdescriptives
📰 News
- We now have a TextDescriptives-powered web-app so you can extract and downloads metrics without a single line of code! Check it out here
- Version 2.0 out with a new API, a new component, updated documentation, and tutorials! Components are now called by "
textdescriptives/{metric_name}. Newcoherencecomponent for calculating the semantic coherence between sentences. See the documentation for tutorials and more information!
⚡ Quick Start
Use extract_metrics to quickly extract your desired metrics. To see available methods you can simply run:
import textdescriptives as td
td.get_valid_metrics()
# {'quality', 'readability', 'all', 'descriptive_stats', 'dependency_distance', 'pos_proportions', 'information_theory', 'coherence'}
Set the spacy_model parameter to specify which spaCy model to use, otherwise, TextDescriptives will auto-download an appropriate one based on lang. If lang is set, spacy_model is not necessary and vice versa.
Specify which metrics to extract in the metrics argument. None extracts all metrics.
import textdescriptives as td
text = "The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it."
# will automatically download the relevant model (´en_core_web_lg´) and extract all metrics
df = td.extract_metrics(text=text, lang="en", metrics=None)
# specify spaCy model and which metrics to extract
df = td.extract_metrics(text=text, spacy_model="en_core_web_lg", metrics=["readability", "coherence"])
Usage with spaCy
To integrate with other spaCy pipelines, import the library and add the component(s) to your pipeline using the standard spaCy syntax. Available components are descriptive_stats, readability, dependency_distance, pos_proportions, coherence, and quality prefixed with textdescriptives/.
If you want to add all components you can use the shorthand textdescriptives/all.
import spacy
import textdescriptives as td
# load your favourite spacy model (remember to install it first using e.g. `python -m spacy download en_core_web_sm`)
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/all")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
# access some of the values
doc._.readability
doc._.token_length
TextDescriptives includes convenience functions for extracting metrics from a Doc to a Pandas DataFrame or a dictionary.
td.extract_dict(doc)
td.extract_df(doc)
| text | first_order_coherence | second_order_coherence | pos_prop_DET | pos_prop_NOUN | pos_prop_AUX | pos_prop_VERB | pos_prop_PUNCT | pos_prop_PRON | pos_prop_ADP | pos_prop_ADV | pos_prop_SCONJ | flesch_reading_ease | flesch_kincaid_grade | smog | gunning_fog | automated_readability_index | coleman_liau_index | lix | rix | n_stop_words | alpha_ratio | mean_word_length | doc_length | proportion_ellipsis | proportion_bullet_points | duplicate_line_chr_fraction | duplicate_paragraph_chr_fraction | duplicate_5-gram_chr_fraction | duplicate_6-gram_chr_fraction | duplicate_7-gram_chr_fraction | duplicate_8-gram_chr_fraction | duplicate_9-gram_chr_fraction | duplicate_10-gram_chr_fraction | top_2-gram_chr_fraction | top_3-gram_chr_fraction | top_4-gram_chr_fraction | symbol_#_to_word_ratio | contains_lorem ipsum | passed_quality_check | dependency_distance_mean | dependency_distance_std | prop_adjacent_dependency_relation_mean | prop_adjacent_dependency_relation_std | token_length_mean | token_length_median | token_length_std | sentence_length_mean | sentence_length_median | sentence_length_std | syllables_per_token_mean | syllables_per_token_median | syllables_per_token_std | n_tokens | n_unique_tokens | proportion_unique_tokens | n_characters | n_sentences | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The world is changed(...) | 0.633002 | 0.573323 | 0.097561 | 0.121951 | 0.0731707 | 0.170732 | 0.146341 | 0.195122 | 0.0731707 | 0.0731707 | 0.0487805 | 107.879 | -0.0485714 | 5.68392 | 3.94286 | -2.45429 | -0.708571 | 12.7143 | 0.4 | 24 | 0.853659 | 2.95122 | 41 | 0 | 0 | 0 | 0 | 0.232258 | 0.232258 | 0 | 0 | 0 | 0 | 0.0580645 | 0.174194 | 0 | 0 | False | False | 1.77524 | 0.553188 | 0.457143 | 0.0722806 | 3.28571 | 3 | 1.54127 | 7 | 6 | 3.09839 | 1.08571 | 1 | 0.368117 | 35 | 23 | 0.657143 | 121 | 5 |
📖 Documentation
TextDescriptives has a detailed documentation as well as a series of Jupyter notebook tutorials.
All the tutorials are located in the docs/tutorials folder and can also be found on the documentation website.
| Documentation | |
|---|---|
| 📚 Getting started | Guides and instructions on how to use TextDescriptives and its features. |
| 👩💻 Demo | A live demo of TextDescriptives. |
| 😎 Tutorials | Detailed tutorials on how to make the most of TextDescriptives |
| 📰 News and changelog | New additions, changes and version history. |
| 🎛 API References | The detailed reference for TextDescriptive's API. Including function documentation |
| 📄 Paper | The preprint of the TextDescriptives paper. |
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file textdescriptives-2.8.4.tar.gz.
File metadata
- Download URL: textdescriptives-2.8.4.tar.gz
- Upload date:
- Size: 1.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
872baf8d3ee88d3f0cd076003a265b4788215126b5dcbeb6f7d439a50df40ba1
|
|
| MD5 |
a28fd921989d4f1aa8ca34cdb6ec8e2a
|
|
| BLAKE2b-256 |
d1ce2b3174f2105e4b16639fc7d2af000c3b4abc3c09f0b2a7f250b3165fbb40
|
File details
Details for the file textdescriptives-2.8.4-py3-none-any.whl.
File metadata
- Download URL: textdescriptives-2.8.4-py3-none-any.whl
- Upload date:
- Size: 254.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
de97af21c622918196523f0692059c0680a8afe0ca4a63d41bd0789c69899d78
|
|
| MD5 |
fd904e9a8bd1903747f7f17f75848b1b
|
|
| BLAKE2b-256 |
09f168407ea9f6451d76aac115b81a4199f10b79c4385399e892dfddc7397875
|