A package for calculating a wide variety of features from text
A Python package for calculating a large variety of statistics from text(s).
Clone the Github directory, navigate to it in a terminal, and call
pip install .
To calculate all possible metrics:
import textdescriptives # Input can be either a string, list of strings, or pandas Series en_test = ['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.', 'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'] textdescriptives.all_metrics(en_test, lang = 'en', snlp_path = snlp_path)
|0||The world is changed.(...)||3.28571||3||1.54127||7||6||3.09839||1.08571||1||0.368117||0.657143||12.7143||0.4||24||5||35||121||3.94286||5.68392||107.879||-0.0485714||-2.45429||-0.708571||75||25||0.333333||1.60381||0.36493||0.695238||0.0481871|
|1||He felt that his whole (...)||4.16667||4||1.97203||24||24||0||1.16667||1||0.471405||0.833333||40.6667||4||21||1||24||101||11.2667||0||83.775||7.53667||10.195||7.46667||83.3333||16.6667||0.2||2.16||0||0.64||0|
To calculate one category at a time:
textdescriptives.basic_stats(texts, lang = 'en', metrics = 'all') textdescriptives.readability(texts, lang = 'en') textdescriptives.etymology(texts, lang = 'en') textdescriptives.dependency_distance(texsts, lang = 'en', snlp_path = None)
Textdescriptives works for most languages, simply change the country code:
da_test = pd.Series(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning', "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"]) textdescriptives.all_metrics(da_test, lang = 'da', snlp_path=snlp_path)
If you only want a subset of the basic statistics
textdescriptives.basic_stats(en_test, lang = 'en', metrics=['avg_word_length', 'n_chars'])
|0||The world is changed.(...)||3.28571||121|
|1||He felt that his whole (...)||4.16667||101|
The readability measures are largely derived from the textstat library and are thoroughly defined there.
The etymology measures are calculated using macroetym only slightly rewritten to be called from a script. They are calculated since in English, a greater frequency of words with a Latinate origin tends to indicate a more formal language register.
Mean dependency distance can be used as a way of measuring the average syntactic complexity of a text. Requres the
The dependency distance function requires stanfordnlp, and their language models. If you have already downloaded these models, the path to the folder can be specified in the snlp_path paramter. Otherwise, the models will be downloaded to your working directory + /snlp_resources.
Depending on which measures you want to calculate, the dependencies differ.
- Basic and readability: numpy, pandas, pyphen, pycountry
- Etymology: nltk and the following models
python3 -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')"
- Depedency distance: snlp
Metrics currently implemented:
- Basic descriptive statistics - mean, median, standard deviation of the following:
- Word length
- Sentence length, words
- Sentence length, characters (TODO)
- Syllables per word
- Number of characters
- Number of sentences
- Number of types (unique words)
- Number of tokens (total words)
- Type/toḱen ratio
- Readability metrics:
- Flesch reading ease
- Flesch-Kincaid grade
- Automated readability index
- Coleman-Liau index
- Etymology-related metrics:
- Percentage words with Germanic origin
- Percentage words with Latinate origin
- Latinate/Germanic origin ratio
- Dependency distance metrics:
- Mean dependency distance, sentence level (mean, standard deviation)
- Mean proportion adjacent dependency relations, sentence level (mean, standard devaiation)
Developed by Lasse Hansen at the Center for Humanities Computing Aarhus
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
|Filename, size||File type||Python version||Upload date||Hashes|
|Filename, size textdescriptives-0.1.1-py3-none-any.whl (11.3 MB)||File type Wheel||Python version py3||Upload date||Hashes View|
|Filename, size textdescriptives-0.1.1.tar.gz (11.2 MB)||File type Source||Python version None||Upload date||Hashes View|
Hashes for textdescriptives-0.1.1-py3-none-any.whl