A library for calculating a variety of features from text using spaCy

These details have not been verified by PyPI

Project links

Homepage

Project description

TextDescriptives

A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.

🔧 Installation

pip install textdescriptives

📰 News

TextDescriptives has been completely re-implemented using spaCy v.3.0. The stanza implementation can be found in the stanza_version branch and will no longer be maintained.
Check out the brand new documentation here! See NEWS.md for release notes (v. 1.0.5 and onwards)

👩‍💻 Usage

Import the library and add the component to your pipeline using the string name of the "textdescriptives" component factory:

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives") 
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# access some of the values
doc._.readability
doc._.token_length

TextDescriptives includes convenience functions for extracting metrics to a Pandas DataFrame or a dictionary.

td.extract_df(doc)
# td.extract_dict(doc)

	text	token_length_mean	token_length_median	token_length_std	sentence_length_mean	sentence_length_median	sentence_length_std	syllables_per_token_mean	syllables_per_token_median	syllables_per_token_std	n_tokens	n_unique_tokens	proportion_unique_tokens	n_characters	n_sentences	flesch_reading_ease	flesch_kincaid_grade	smog	gunning_fog	automated_readability_index	coleman_liau_index	lix	rix	dependency_distance_mean	dependency_distance_std	prop_adjacent_dependency_relation_mean	prop_adjacent_dependency_relation_std	pos_prop_DT	pos_prop_NN	pos_prop_VBZ	pos_prop_VBN	pos_prop_.	pos_prop_PRP	pos_prop_VBP	pos_prop_IN	pos_prop_RB	pos_prop_VBD	pos_prop_,	pos_prop_WP
0	The world (...)	3.28571	3	1.54127	7	6	3.09839	1.08571	1	0.368117	35	23	0.657143	121	5	107.879	-0.0485714	5.68392	3.94286	-2.45429	-0.708571	12.7143	0.4	1.69524	0.422282	0.44381	0.0863679	0.097561	0.121951	0.0487805	0.0487805	0.121951	0.170732	0.121951	0.121951	0.0731707	0.0243902	0.0243902	0.0243902
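To make the descriptive columns above concrete, here is a minimal sketch that recomputes a few of them by hand. It is not the library's implementation: it assumes regex-based word and sentence splitting instead of spaCy, lowercases words before counting unique tokens, and uses the population standard deviation. The `describe` helper is a hypothetical name used only for illustration.

```python
import re
import statistics

TEXT = (
    "The world is changed. I feel it in the water. I feel it in the earth. "
    "I smell it in the air. Much that once was is lost, for none now live "
    "who remember it."
)


def describe(text: str) -> dict:
    """Rough stand-in for descriptive statistics, using regexes instead of spaCy."""
    # Assumption: sentences end at '.', '!' or '?'; words are runs of word characters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words_per_sentence = [re.findall(r"\w+", s) for s in sentences]
    words = [w for sent in words_per_sentence for w in sent]

    token_lengths = [len(w) for w in words]
    sentence_lengths = [len(sent) for sent in words_per_sentence]
    unique = {w.lower() for w in words}  # assumption: case-insensitive uniqueness

    return {
        "token_length_mean": statistics.fmean(token_lengths),
        "token_length_median": statistics.median(token_lengths),
        "token_length_std": statistics.pstdev(token_lengths),
        "sentence_length_mean": statistics.fmean(sentence_lengths),
        "sentence_length_median": statistics.median(sentence_lengths),
        "sentence_length_std": statistics.pstdev(sentence_lengths),
        "n_tokens": len(words),
        "n_unique_tokens": len(unique),
        "proportion_unique_tokens": len(unique) / len(words),
        "n_sentences": len(sentences),
    }


stats = describe(TEXT)
for name, value in stats.items():
    print(f"{name}: {value}")
```

For this particular text the sketch agrees with the table above (35 tokens, 23 unique tokens, 5 sentences, mean sentence length 7). Other texts may differ wherever spaCy's tokenizer and sentence segmentation disagree with the regexes.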

Use the metrics parameter to select which group(s) of metrics to extract: one or more of readability, dependency_distance, descriptive_stats, and pos_stats. By default, all groups are extracted.

If extract_df is called on the output of nlp.pipe, the result has one row per document and one column per metric. Similarly, extract_dict returns a dictionary with one key per metric, where each value is a list holding that metric for every document (one entry per doc).

docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
            'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])

td.extract_df(docs, metrics="dependency_distance")

	text	dependency_distance_mean	dependency_distance_std	prop_adjacent_dependency_relation_mean	prop_adjacent_dependency_relation_std
0	The world (...)	1.69524	0.422282	0.44381	0.0863679
1	He felt (...)	2.56	0	0.44	0
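The difference between the row-per-document layout and the key-per-metric layout can be sketched with plain dictionaries. This does not call the library: the `rows` data is copied by hand from the table above, `rows_to_columns` is a hypothetical helper, and the exact keys that extract_dict returns are not shown here, so only the general shape is illustrated.

```python
# Per-document results, copied by hand from the table above (two metrics only).
rows = [
    {
        "text": "The world (...)",
        "dependency_distance_mean": 1.69524,
        "dependency_distance_std": 0.422282,
    },
    {
        "text": "He felt (...)",
        "dependency_distance_mean": 2.56,
        "dependency_distance_std": 0.0,
    },
]


def rows_to_columns(rows):
    """Turn a list of per-document dicts into one dict of per-metric lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = rows_to_columns(rows)
print(columns["dependency_distance_mean"])  # one entry per document
```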

The text column can be excluded by setting include_text to False.
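As a rough illustration of what the dependency distance columns measure, the sketch below computes them from hand-written head indices. Several things here are assumptions rather than the library's documented behaviour: the distance is taken as the absolute difference between a token's position and its head's position, the root counts as distance 0, values are averaged per sentence first, and the document-level mean and standard deviation are taken across sentences. The last point is at least consistent with the one-sentence document above having a standard deviation of 0. The head indices are made up for illustration and are not parser output.

```python
import statistics


def sentence_dependency_stats(heads):
    """heads[i] is the index of token i's head; the root points to itself."""
    distances = [abs(head - i) for i, head in enumerate(heads)]
    mean_distance = statistics.fmean(distances)
    # A relation is "adjacent" when the head is the neighbouring token.
    prop_adjacent = sum(d == 1 for d in distances) / len(distances)
    return mean_distance, prop_adjacent


# Hypothetical head indices for two short sentences:
# "I feel it":            I -> feel, feel is root, it -> feel
# "The world is changed": The -> world, world -> changed, is -> changed, changed is root
sentences = [[1, 1, 1], [1, 3, 3, 3]]

per_sentence = [sentence_dependency_stats(heads) for heads in sentences]
means = [mean for mean, _ in per_sentence]
adjacent = [prop for _, prop in per_sentence]

doc_stats = {
    "dependency_distance_mean": statistics.fmean(means),
    "dependency_distance_std": statistics.pstdev(means),
    "prop_adjacent_dependency_relation_mean": statistics.fmean(adjacent),
    "prop_adjacent_dependency_relation_std": statistics.pstdev(adjacent),
}
print(doc_stats)
```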

Using specific components

The specific components (descriptive_stats, readability, dependency_distance and pos_stats) can be loaded individually. This can be helpful if you're only interested in e.g. readability metrics or descriptive statistics and don't want to run the dependency parser or part-of-speech tagger.

nlp = spacy.blank("da")
nlp.add_pipe("descriptive_stats")
docs = nlp.pipe(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning',
            "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])

# extract_df is clever enough to only extract metrics that are in the Doc
td.extract_df(docs, include_text=False)

	token_length_mean	token_length_median	token_length_std	sentence_length_mean	sentence_length_median	sentence_length_std	syllables_per_token_mean	syllables_per_token_median	syllables_per_token_std	n_tokens	n_unique_tokens	proportion_unique_tokens	n_characters	n_sentences
0	4.4	3	2.59615	10	10	1	1.65	1	0.852936	20	19	0.95	90	2
1	4	3.5	2.44949	6	6	3	1.58333	1	0.862007	12	12	1	53	2

Available attributes

The table below shows the metrics included in TextDescriptives and their attributes on spaCy's Doc, Span, and Token objects. For more information, see the docs.

Attribute	Component	Description
`Doc._.token_length`	`descriptive_stats`	Dict containing mean, median, and std of token length.
`Doc._.sentence_length`	`descriptive_stats`	Dict containing mean, median, and std of sentence length.
`Doc._.syllables`	`descriptive_stats`	Dict containing mean, median, and std of number of syllables per token.
`Doc._.counts`	`descriptive_stats`	Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the Doc.
`Doc._.pos_proportions`	`pos_stats`	Dict of `{pos_prop_POSTAG: proportion of all tokens tagged with POSTAG}`. Does not create a key if no tokens in the document fit the POSTAG.
`Doc._.readability`	`readability`	Dict containing Flesch Reading Ease, Flesch-Kincaid Grade, SMOG, Gunning-Fog, Automated Readability Index, Coleman-Liau Index, LIX, and RIX readability metrics for the Doc.
`Doc._.dependency_distance`	`dependency_distance`	Dict containing the mean and standard deviation of the dependency distance and proportion adjacent dependency relations in the Doc.
`Span._.token_length`	`descriptive_stats`	Dict containing mean, median, and std of token length in the span.
`Span._.counts`	`descriptive_stats`	Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the span.
`Span._.pos_proportions`	`pos_stats`	Dict of `{pos_prop_POSTAG: proportion of all tokens tagged with POSTAG}`. Does not create a key if no tokens in the span fit the POSTAG.
`Span._.dependency_distance`	`dependency_distance`	Dict containing the mean dependency distance and proportion adjacent dependency relations in the span.
`Token._.dependency_distance`	`dependency_distance`	Dict containing the dependency distance and whether the head word is adjacent for a Token.
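For readers unfamiliar with the readability metrics listed in the table, the sketch below writes out the standard textbook formulas for four of them. It is a sketch under stated assumptions, not the library's code: the function names are hypothetical, a "long word" is assumed to be one with more than six letters, and the syllable count of 38 is inferred from the example output earlier on this page (syllables_per_token_mean of 1.08571 over 35 tokens).

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch Reading Ease formula."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Standard Flesch-Kincaid Grade Level formula."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def lix(n_words, n_sentences, n_long_words):
    """LIX: mean sentence length plus the percentage of long words."""
    return n_words / n_sentences + 100 * n_long_words / n_words


def rix(n_sentences, n_long_words):
    """RIX: number of long words per sentence."""
    return n_long_words / n_sentences


# Counts for the English example text used in the Usage section:
# 35 words, 5 sentences, 38 syllables (inferred), and 2 long words
# ("changed" and "remember").
n_words, n_sentences, n_syllables, n_long_words = 35, 5, 38, 2

scores = {
    "flesch_reading_ease": flesch_reading_ease(n_words, n_sentences, n_syllables),
    "flesch_kincaid_grade": flesch_kincaid_grade(n_words, n_sentences, n_syllables),
    "lix": lix(n_words, n_sentences, n_long_words),
    "rix": rix(n_sentences, n_long_words),
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

With these counts the four formulas reproduce the flesch_reading_ease, flesch_kincaid_grade, lix, and rix values shown in the example output table. The remaining metrics (SMOG, Gunning-Fog, Automated Readability Index, Coleman-Liau Index) are not covered here.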

Authors

Developed by Lasse Hansen (@HLasse) at the Center for Humanities Computing Aarhus

Collaborators:

Ludvig Renbo Olsen (@ludvigolsen, ludvigolsen.dk)
Kenneth Enevoldsen (@KennethEnevoldsen)

Release history

2.8.2

May 31, 2024

2.8.1

May 7, 2024

2.8.0

Apr 9, 2024

2.7.3

Feb 6, 2024

2.7.2

Feb 6, 2024

2.7.1

Oct 31, 2023

2.7.0

Oct 12, 2023

2.6.2

Jul 31, 2023

2.6.1

May 3, 2023

2.6.0

Apr 28, 2023

2.5.1

Apr 26, 2023

2.5.0

Apr 26, 2023

2.4.6

Apr 24, 2023

2.4.5

Apr 19, 2023

2.4.4

Mar 28, 2023

2.4.3

Mar 1, 2023

2.4.2

Mar 1, 2023

2.4.1

Feb 8, 2023

2.4.0

Jan 31, 2023

2.3.0

Jan 23, 2023

2.2.0

Jan 16, 2023

2.1.0

Jan 6, 2023

2.0.10

Jan 3, 2023

2.0.4

Jan 3, 2023

1.1.1

Dec 5, 2022

1.1.0

Sep 26, 2022

This version

1.0.7

May 4, 2022

1.0.6

Oct 28, 2021

1.0.5

Oct 4, 2021

1.0.4

Aug 31, 2021

1.0.3

Aug 17, 2021

1.0.2

Aug 16, 2021

1.0.1

Aug 9, 2021

1.0.0

Aug 9, 2021

0.2.0

Aug 9, 2021

0.1.1

Mar 6, 2020

Download files

Source Distribution

textdescriptives-1.0.7.tar.gz (33.5 kB)

Uploaded May 4, 2022 Source

Built Distribution

textdescriptives-1.0.7-py2.py3-none-any.whl (35.7 kB)

Uploaded May 4, 2022 Python 2 Python 3

File details

Details for the file textdescriptives-1.0.7.tar.gz.

File metadata

Download URL: textdescriptives-1.0.7.tar.gz
Upload date: May 4, 2022
Size: 33.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.9.12

File hashes

Hashes for textdescriptives-1.0.7.tar.gz
Algorithm	Hash digest
SHA256	`a0835b836019a7c197a292c731153ef3f138113373da4f257ba61490a8969b20`
MD5	`706d3853de05a1e140dee8d3df55d6b5`
BLAKE2b-256	`c19cc5a5e4d4d740d34f255edcd960aba51bc1099d81c7f893800bfe9f154eae`

File details

Details for the file textdescriptives-1.0.7-py2.py3-none-any.whl.

File metadata

Download URL: textdescriptives-1.0.7-py2.py3-none-any.whl
Upload date: May 4, 2022
Size: 35.7 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.9.12

File hashes

Hashes for textdescriptives-1.0.7-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`75545490270f62cff3ec055ee29a59045be6e0ad6119fab256a9893f0cb0359e`
MD5	`8123175d1e55558f69b3effdb0c48f8d`
BLAKE2b-256	`2b38a6e66c189781d23f618ed45d25908288c933d66e3e74412ad7f9300e3f44`