A library for calculating a variety of features from text using spaCy

TextDescriptives

A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.

🔧 Installation

pip install textdescriptives

📰 News

  • TextDescriptives has been completely re-implemented using spaCy v.3.0. The stanza implementation can be found in the stanza_version branch and will no longer be maintained.
  • Check out the brand new documentation here! See NEWS.md for release notes (v. 1.0.5 and onwards)

👩‍💻 Usage

Import the library and add the component to your pipeline using the string name of the "textdescriptives" component factory:

import spacy
import textdescriptives as td

# load a spaCy pipeline and add all TextDescriptives components at once
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# access some of the values
doc._.readability
doc._.token_length

TextDescriptives includes convenience functions for extracting metrics to a Pandas DataFrame or a dictionary.

td.extract_df(doc)
# td.extract_dict(doc)
text token_length_mean token_length_median token_length_std sentence_length_mean sentence_length_median sentence_length_std syllables_per_token_mean syllables_per_token_median syllables_per_token_std n_tokens n_unique_tokens proportion_unique_tokens n_characters n_sentences flesch_reading_ease flesch_kincaid_grade smog gunning_fog automated_readability_index coleman_liau_index lix rix dependency_distance_mean dependency_distance_std prop_adjacent_dependency_relation_mean prop_adjacent_dependency_relation_std pos_prop_DT pos_prop_NN pos_prop_VBZ pos_prop_VBN pos_prop_. pos_prop_PRP pos_prop_VBP pos_prop_IN pos_prop_RB pos_prop_VBD pos_prop_, pos_prop_WP
0 The world (...) 3.28571 3 1.54127 7 6 3.09839 1.08571 1 0.368117 35 23 0.657143 121 5 107.879 -0.0485714 5.68392 3.94286 -2.45429 -0.708571 12.7143 0.4 1.69524 0.422282 0.44381 0.0863679 0.097561 0.121951 0.0487805 0.0487805 0.121951 0.170732 0.121951 0.121951 0.0731707 0.0243902 0.0243902 0.0243902
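
The descriptive statistics in the row above can be checked by hand. The following is a minimal sketch using only the Python standard library, not the library's implementation. It assumes a naive regex tokenizer that drops punctuation, lowercasing before counting unique tokens, a population standard deviation, and a character count that excludes whitespace. These assumptions were inferred from the numbers in the example output.

```python
import re
from statistics import mean, median, pstdev

text = ("The world is changed. I feel it in the water. I feel it in the earth. "
        "I smell it in the air. Much that once was is lost, for none now live "
        "who remember it.")

# Assumption: split sentences after each period, keep only word characters as tokens.
sentences = [s for s in re.split(r"(?<=\.)\s+", text) if s]
words_per_sentence = [re.findall(r"\w+", s) for s in sentences]
words = [w for sentence in words_per_sentence for w in sentence]

token_lengths = [len(w) for w in words]
sentence_lengths = [len(sentence) for sentence in words_per_sentence]

stats = {
    "token_length_mean": mean(token_lengths),
    "token_length_median": median(token_lengths),
    "token_length_std": pstdev(token_lengths),        # population std (ddof=0)
    "sentence_length_mean": mean(sentence_lengths),
    "sentence_length_median": median(sentence_lengths),
    "sentence_length_std": pstdev(sentence_lengths),
    "n_tokens": len(words),                            # 35
    "n_unique_tokens": len({w.lower() for w in words}),  # 23
    "proportion_unique_tokens": len({w.lower() for w in words}) / len(words),
    "n_characters": len(re.sub(r"\s", "", text)),      # 121, whitespace excluded
    "n_sentences": len(sentences),                     # 5
}

for name, value in stats.items():
    print(f"{name}: {value:.6g}")
```

With these assumptions the values agree with the corresponding columns of the example row; other texts, or a different tokenizer, may of course give different results than spaCy.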

Use the metrics parameter to select which group(s) of metrics to extract: one or more of readability, dependency_distance, descriptive_stats, and pos_stats. By default, all groups are extracted.

If extract_df is called on the output of nlp.pipe, it returns one row per document and one column per metric. Similarly, extract_dict returns a key for each metric, with a list of values (one per document).

docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
            'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])

td.extract_df(docs, metrics="dependency_distance")
   text             dependency_distance_mean  dependency_distance_std  prop_adjacent_dependency_relation_mean  prop_adjacent_dependency_relation_std
0  The world (...)  1.69524                   0.422282                 0.44381                                 0.0863679
1  He felt (...)    2.56                      0                        0.44                                    0
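
In the output above, the single-sentence document has a standard deviation of 0. That would be consistent with the Doc-level values being a mean and standard deviation over per-sentence values; this is an inference from the output, not a statement about the library's code. The sketch below illustrates that aggregation in plain Python. The function names and the head indices are hypothetical and hand-assigned (not parser output), and it assumes the root token is its own head.

```python
from statistics import mean, pstdev


def sentence_dependency_stats(heads):
    """heads[i] is the sentence-internal index of the head of token i.

    Assumption: the root token points at itself, so its distance is 0.
    """
    distances = [abs(head - i) for i, head in enumerate(heads)]
    adjacent = [1 if d == 1 else 0 for d in distances]
    return {
        "dependency_distance": mean(distances),
        "prop_adjacent_dependency_relation": mean(adjacent),
    }


def doc_dependency_stats(sentences):
    """Aggregate per-sentence values into a mean and population std."""
    per_sentence = [sentence_dependency_stats(heads) for heads in sentences]
    dist = [s["dependency_distance"] for s in per_sentence]
    adj = [s["prop_adjacent_dependency_relation"] for s in per_sentence]
    return {
        "dependency_distance_mean": mean(dist),
        "dependency_distance_std": pstdev(dist),
        "prop_adjacent_dependency_relation_mean": mean(adj),
        "prop_adjacent_dependency_relation_std": pstdev(adj),
    }


# Hypothetical head indices for "The world is changed ." and "I feel it ."
sent_1 = [1, 3, 3, 3, 3]
sent_2 = [1, 1, 1, 1]

two_sentences = doc_dependency_stats([sent_1, sent_2])
one_sentence = doc_dependency_stats([sent_1])
print(two_sentences)
print(one_sentence)  # both std values are 0.0 with a single sentence
```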

The text column can be excluded by setting include_text to False.

Using specific components

The specific components (descriptive_stats, readability, dependency_distance and pos_stats) can be loaded individually. This can be helpful if you're only interested in e.g. readability metrics or descriptive statistics and don't want to run the dependency parser or part-of-speech tagger.

nlp = spacy.blank("da")
nlp.add_pipe("descriptive_stats")
docs = nlp.pipe(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning',
            "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])

# extract_df is clever enough to only extract metrics that are in the Doc
td.extract_df(docs, include_text=False)
token_length_mean token_length_median token_length_std sentence_length_mean sentence_length_median sentence_length_std syllables_per_token_mean syllables_per_token_median syllables_per_token_std n_tokens n_unique_tokens proportion_unique_tokens n_characters n_sentences
0 4.4 3 2.59615 10 10 1 1.65 1 0.852936 20 19 0.95 90 2
1 4 3.5 2.44949 6 6 3 1.58333 1 0.862007 12 12 1 53 2

Available attributes

The table below shows the metrics included in TextDescriptives and their attributes on spaCy's Doc, Span, and Token objects. For more information, see the docs.

Attribute                    Component            Description
Doc._.token_length           descriptive_stats    Dict containing mean, median, and std of token length.
Doc._.sentence_length        descriptive_stats    Dict containing mean, median, and std of sentence length.
Doc._.syllables              descriptive_stats    Dict containing mean, median, and std of number of syllables per token.
Doc._.counts                 descriptive_stats    Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the Doc.
Doc._.pos_proportions        pos_stats            Dict of {pos_prop_POSTAG: proportion of all tokens tagged with POSTAG}. Does not create a key if no tokens in the document fit the POSTAG.
Doc._.readability            readability          Dict containing Flesch Reading Ease, Flesch-Kincaid Grade, SMOG, Gunning-Fog, Automated Readability Index, Coleman-Liau Index, LIX, and RIX readability metrics for the Doc.
Doc._.dependency_distance    dependency_distance  Dict containing the mean and standard deviation of the dependency distance and proportion adjacent dependency relations in the Doc.
Span._.token_length          descriptive_stats    Dict containing mean, median, and std of token length in the span.
Span._.counts                descriptive_stats    Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the span.
Span._.pos_proportions       pos_stats            Dict of {pos_prop_POSTAG: proportion of all tokens tagged with POSTAG}. Does not create a key if no tokens in the span fit the POSTAG.
Span._.dependency_distance   dependency_distance  Dict containing the mean dependency distance and proportion adjacent dependency relations in the span.
Token._.dependency_distance  dependency_distance  Dict containing the dependency distance and whether the head word is adjacent for a Token.
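
Several of the readability metrics listed above follow directly from simple counts. As a rough illustration, the sketch below applies the standard published formulas for Flesch Reading Ease, Flesch-Kincaid Grade, LIX, and RIX to the counts of the Usage example (35 words, 5 sentences). The syllable count of 38 is inferred from the reported mean syllables per token, and the two long words ("changed", "remember") follow the usual LIX definition of more than six letters. The function names are illustrative and are not part of the library's API.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def lix(n_words, n_sentences, n_long_words):
    # long words: more than six letters
    return n_words / n_sentences + 100 * n_long_words / n_words


def rix(n_sentences, n_long_words):
    return n_long_words / n_sentences


# Counts for the example text; n_syllables is inferred, not measured here.
n_words, n_sentences, n_syllables, n_long_words = 35, 5, 38, 2

scores = {
    "flesch_reading_ease": flesch_reading_ease(n_words, n_sentences, n_syllables),
    "flesch_kincaid_grade": flesch_kincaid_grade(n_words, n_sentences, n_syllables),
    "lix": lix(n_words, n_sentences, n_long_words),
    "rix": rix(n_sentences, n_long_words),
}

for name, value in scores.items():
    print(f"{name}: {value:.6g}")
```

Under these assumptions the four scores agree with the flesch_reading_ease, flesch_kincaid_grade, lix, and rix columns of the example output (107.879, -0.0485714, 12.7143, and 0.4).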

Authors

Developed by Lasse Hansen (@HLasse) at the Center for Humanities Computing Aarhus
