Skip to main content

A library for calculating a variety of features from text using spaCy

Project description

TextDescriptives

spacy github actions pytest github actions docs arXiv status Open in Streamlit

A Python library for calculating a large variety of metrics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.

🔧 Installation

pip install textdescriptives

📰 News

  • We now have a TextDescriptives-powered web-app so you can extract and downloads metrics without a single line of code! Check it out here
  • Version 2.0 out with a new API, a new component, updated documentation, and tutorials! Components are now called by "textdescriptives/{metric_name}. New coherence component for calculating the semantic coherence between sentences. See the documentation for tutorials and more information!

⚡ Quick Start

Use extract_metrics to quickly extract your desired metrics. To see available methods you can simply run:

import textdescriptives as td
td.get_valid_metrics()
# {'quality', 'readability', 'all', 'descriptive_stats', 'dependency_distance', 'pos_proportions', 'information_theory', 'coherence'}

Set the spacy_model parameter to specify which spaCy model to use, otherwise, TextDescriptives will auto-download an appropriate one based on lang. If lang is set, spacy_model is not necessary and vice versa.

Specify which metrics to extract in the metrics argument. None extracts all metrics.

import textdescriptives as td

text = "The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it."
# will automatically download the relevant model (´en_core_web_lg´) and extract all metrics
df = td.extract_metrics(text=text, lang="en", metrics=None)

# specify spaCy model and which metrics to extract
df = td.extract_metrics(text=text, spacy_model="en_core_web_lg", metrics=["readability", "coherence"])

Usage with spaCy

To integrate with other spaCy pipelines, import the library and add the component(s) to your pipeline using the standard spaCy syntax. Available components are descriptive_stats, readability, dependency_distance, pos_proportions, coherence, and quality prefixed with textdescriptives/.

If you want to add all components you can use the shorthand textdescriptives/all.

import spacy
import textdescriptives as td
# load your favourite spacy model (remember to install it first using e.g. `python -m spacy download en_core_web_sm`)
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/all") 
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# access some of the values
doc._.readability
doc._.token_length

TextDescriptives includes convenience functions for extracting metrics from a Doc to a Pandas DataFrame or a dictionary.

td.extract_dict(doc)
td.extract_df(doc)
text first_order_coherence second_order_coherence pos_prop_DET pos_prop_NOUN pos_prop_AUX pos_prop_VERB pos_prop_PUNCT pos_prop_PRON pos_prop_ADP pos_prop_ADV pos_prop_SCONJ flesch_reading_ease flesch_kincaid_grade smog gunning_fog automated_readability_index coleman_liau_index lix rix n_stop_words alpha_ratio mean_word_length doc_length proportion_ellipsis proportion_bullet_points duplicate_line_chr_fraction duplicate_paragraph_chr_fraction duplicate_5-gram_chr_fraction duplicate_6-gram_chr_fraction duplicate_7-gram_chr_fraction duplicate_8-gram_chr_fraction duplicate_9-gram_chr_fraction duplicate_10-gram_chr_fraction top_2-gram_chr_fraction top_3-gram_chr_fraction top_4-gram_chr_fraction symbol_#_to_word_ratio contains_lorem ipsum passed_quality_check dependency_distance_mean dependency_distance_std prop_adjacent_dependency_relation_mean prop_adjacent_dependency_relation_std token_length_mean token_length_median token_length_std sentence_length_mean sentence_length_median sentence_length_std syllables_per_token_mean syllables_per_token_median syllables_per_token_std n_tokens n_unique_tokens proportion_unique_tokens n_characters n_sentences
0 The world is changed(...) 0.633002 0.573323 0.097561 0.121951 0.0731707 0.170732 0.146341 0.195122 0.0731707 0.0731707 0.0487805 107.879 -0.0485714 5.68392 3.94286 -2.45429 -0.708571 12.7143 0.4 24 0.853659 2.95122 41 0 0 0 0 0.232258 0.232258 0 0 0 0 0.0580645 0.174194 0 0 False False 1.77524 0.553188 0.457143 0.0722806 3.28571 3 1.54127 7 6 3.09839 1.08571 1 0.368117 35 23 0.657143 121 5

📖 Documentation

TextDescriptives has a detailed documentation as well as a series of Jupyter notebook tutorials. All the tutorials are located in the docs/tutorials folder and can also be found on the documentation website.

Documentation
📚 Getting started Guides and instructions on how to use TextDescriptives and its features.
👩‍💻 Demo A live demo of TextDescriptives.
😎 Tutorials Detailed tutorials on how to make the most of TextDescriptives
📰 News and changelog New additions, changes and version history.
🎛 API References The detailed reference for TextDescriptive's API. Including function documentation
📄 Paper The preprint of the TextDescriptives paper.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

textdescriptives-2.4.5.tar.gz (1.3 MB view details)

Uploaded Source

Built Distribution

textdescriptives-2.4.5-py3-none-any.whl (251.7 kB view details)

Uploaded Python 3

File details

Details for the file textdescriptives-2.4.5.tar.gz.

File metadata

  • Download URL: textdescriptives-2.4.5.tar.gz
  • Upload date:
  • Size: 1.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/37.3 requests/2.28.2 requests-toolbelt/0.10.1 urllib3/1.26.15 tqdm/4.65.0 importlib-metadata/6.5.0 keyring/23.13.1 rfc3986/2.0.0 colorama/0.4.6 CPython/3.10.11

File hashes

Hashes for textdescriptives-2.4.5.tar.gz
Algorithm Hash digest
SHA256 3e9470a63a6df62593effbe95f03eb9fbc9db209ce7fd47a8956ca95287b5726
MD5 09b88f726282f00b1520753332ff9b90
BLAKE2b-256 c0f970d1ccc82a88473c55ebf2450a636ed6573ab0fa8acf084feba47b13d128

See more details on using hashes here.

File details

Details for the file textdescriptives-2.4.5-py3-none-any.whl.

File metadata

  • Download URL: textdescriptives-2.4.5-py3-none-any.whl
  • Upload date:
  • Size: 251.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/37.3 requests/2.28.2 requests-toolbelt/0.10.1 urllib3/1.26.15 tqdm/4.65.0 importlib-metadata/6.5.0 keyring/23.13.1 rfc3986/2.0.0 colorama/0.4.6 CPython/3.10.11

File hashes

Hashes for textdescriptives-2.4.5-py3-none-any.whl
Algorithm Hash digest
SHA256 35796a77bf309ce9588d3ac5f7368f3dc15af1d64843d860cede4f151da66751
MD5 89147c3f123c286353d55ead1e06ca9c
BLAKE2b-256 b76c6597866cedf3d57ef4aea08b845c692148d8d6effbe940ca4e07ed2b6c4b

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page