
A simple Python toolkit for corpus analyses



The corpus-toolkit package grew out of courses in corpus linguistics and learner corpus research. The toolkit attempts to balance simplicity of use, broad application, and scalability. Common corpus analyses such as the calculation of word and n-gram frequency and range, keyness, and collocation are included. In addition, more advanced analyses such as the identification of dependency bigrams (e.g., verb-direct object combinations) and their frequency, range, and strength of association are also included.

Install corpus-toolkit

The package can be installed using pip:

pip install corpus-toolkit


The corpus-toolkit package uses spaCy for tagging and parsing. However, the package also includes a tokenization and lemmatization function that does not require spaCy. If you want to tag or parse your files, you will need to install spaCy (and an appropriate spaCy language model):

pip install -U spacy
python -m spacy download en_core_web_sm

Quickstart guide

There are three corpus pre-processing options. The first is to use the tokenize() function, which does not rely on a part-of-speech tagger. The second is to use the tag() function, which uses spaCy to tokenize and tag the corpus. The third is to pre-process the corpus in any way you like before using the other functions of the corpus-toolkit package.

This tutorial presumes that you have downloaded and extracted the tutorial corpus, which is a version of the Brown corpus. After extraction, the folder "brown_single" should be in your working directory.

Load, tokenize, and generate a frequency list

from corpus_toolkit import corpus_tools as ct
brown_corp = ct.ldcorpus("brown_single") #load and read corpus
tok_corp = ct.tokenize(brown_corp) #tokenize corpus - by default this lemmatizes as well
brown_freq = ct.frequency(tok_corp) #creates a frequency dictionary
#note that range can be calculated instead of frequency using the argument calc = "range"
ct.head(brown_freq, hits = 10) #print top 10 items
the     69836
be      37689
of      36365
a       30475
and     28826
to      26126
in      21318
he      19417
have    11938
it      10932
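
The difference between frequency and range can be illustrated without the toolkit. The following is a minimal sketch, not the package's implementation: it assumes a corpus is simply a list of tokenized documents, and the helper names count_frequency() and count_range() are hypothetical.

```python
from collections import Counter

def count_frequency(docs):
    """Total number of occurrences of each item across all documents."""
    freq = Counter()
    for tokens in docs:
        freq.update(tokens)
    return freq

def count_range(docs):
    """Number of documents in which each item occurs at least once."""
    rng = Counter()
    for tokens in docs:
        rng.update(set(tokens))  # count each item once per document
    return rng

# Toy corpus of three tokenized (lemmatized) documents
docs = [["the", "dog", "bark", "the"], ["the", "cat"], ["a", "dog"]]
freq = count_frequency(docs)
rng = count_range(docs)
print(freq["the"], rng["the"])  # 3 occurrences, in 2 documents
```

Frequency rewards items that are repeated often within a few texts, whereas range shows how widely an item is spread across the corpus.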

The functions ldcorpus() and tokenize() are Python generators, which means that they can be iterated over only once and must be re-created for each new analysis. A less readable (but more robust) way to achieve the results above is to nest the calls, so that fresh generators are created each time:

brown_freq = ct.frequency(ct.tokenize(ct.ldcorpus("brown_single")))
ct.head(brown_freq, hits = 10)
the     69836
be      37689
of      36365
a       30475
and     28826
to      26126
in      21318
he      19417
have    11938
it      10932
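
The generator behaviour described above is a general Python property and can be seen with a toy example. This sketch uses a hypothetical load_texts() generator as a stand-in for ldcorpus(); it does not use the toolkit itself.

```python
def load_texts(texts):
    """Yield one tokenized text at a time, like a loader that streams files."""
    for text in texts:
        yield text.lower().split()

corpus = load_texts(["The dog barks", "The cat sleeps"])

first_pass = list(corpus)   # consumes the generator
second_pass = list(corpus)  # nothing is left to yield

print(len(first_pass))   # 2
print(len(second_pass))  # 0

# Re-creating the generator (or nesting the calls) gives a fresh pass
fresh_pass = list(load_texts(["The dog barks", "The cat sleeps"]))
print(len(fresh_pass))   # 2
```

This is why reusing a variable such as brown_corp for a second analysis would silently produce empty results, while nested calls always start from the files again.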

Create a tagged version of your corpus

The most efficient way to conduct multiple analyses with a tagged corpus is to write a tagged version of your corpus to file and then conduct subsequent analyses with the tagged files. If this is not possible for some reason, one can always run the tagger each time an analysis is conducted.

tagged_brown = ct.tag(ct.ldcorpus("brown_single"))
ct.write_corpus("tagged_brown_single",tagged_brown) #the first argument is the folder where the tagged files will be written

The function tag() is also a Python generator, so the preferred way to write a corpus is to nest the calls:

ct.write_corpus("tagged_brown_single",ct.tag(ct.ldcorpus("brown_single")))

Now we can reload our tagged corpus using the reload() function and generate a part-of-speech-sensitive frequency list.

tagged_freq = ct.frequency(ct.reload("tagged_brown_single"))
ct.head(tagged_freq, hits = 10)
the_DET 69861
be_VERB 37800
of_ADP  36322
and_CCONJ       28889
a_DET   23069
in_ADP  20967
to_PART 15409
have_VERB       11978
to_ADP  10800
he_PRON 9801
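
Because each item in the tagged list has the form lemma_TAG, the list can be filtered by part of speech with ordinary dictionary operations. A minimal sketch, assuming the frequency list behaves like a plain dict; the counts are copied from the output above, and filter_by_tag() is a hypothetical helper rather than part of the toolkit.

```python
def filter_by_tag(freq, tag):
    """Keep only items whose part-of-speech suffix matches `tag`."""
    kept = {}
    for item, count in freq.items():
        lemma, _, item_tag = item.rpartition("_")  # split on the last underscore
        if item_tag == tag:
            kept[lemma] = count
    return kept

tagged_sample = {"the_DET": 69861, "be_VERB": 37800, "of_ADP": 36322,
                 "to_PART": 15409, "have_VERB": 11978, "to_ADP": 10800}

verbs = filter_by_tag(tagged_sample, "VERB")
print(verbs)  # {'be': 37800, 'have': 11978}
```

Note how "to" appears twice in the tagged list (to_PART and to_ADP): the tag keeps the infinitive marker and the preposition apart, which an untagged list cannot do.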


Collocation

Use the collocator() function to find collocates for a particular word:

collocates = ct.collocator(ct.tokenize(ct.ldcorpus("brown_single")),"go",stat = "MI")
#stat options include: "MI", "T", "freq", "left", and "right"

ct.head(collocates, hits = 10)
downstairs      7.875170389265524
upstairs        6.915812373762869
bedroom 6.627242875821938
abroad  6.273134375185426
re      6.21620730710059
m       6.211322724303333
forever 6.174730671124432
stanley 6.174730671124432
let     5.938347287580174
wrong   5.868744120106091
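
For readers unfamiliar with the MI statistic, one common formulation of pointwise mutual information is the base-2 logarithm of observed over expected co-occurrence frequency. The sketch below illustrates that general formulation only; it is not a description of how collocator() computes its scores (for example, it ignores any window-size correction), and the function name and counts are invented for illustration.

```python
import math

def mutual_information(cooc_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise MI: log2(observed / expected co-occurrence frequency)."""
    expected = (node_freq * collocate_freq) / corpus_size
    return math.log2(cooc_freq / expected)

# Invented counts: the node occurs 1,000 times, the collocate 50 times,
# and they co-occur 10 times in a corpus of 1,000,000 tokens.
score = mutual_information(10, 1000, 50, 1_000_000)
print(round(score, 3))  # 7.644
```

Because the expected frequency sits in the denominator, MI tends to favour rare collocates, which is consistent with low-frequency items such as "downstairs" heading the list above.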


Keyness

Keyness is calculated using two frequency dictionaries (consisting of raw frequency values). Only effect sizes are reported (p-values are arguably not particularly useful for keyness analyses). Keyness calculation options include "log-ratio", "%diff", and "odds-ratio".

#First, generate frequency lists for each corpus
corp1freq = ct.frequency(ct.tokenize(ct.ldcorpus("corp1")))
corp2freq = ct.frequency(ct.tokenize(ct.ldcorpus("corp2")))

#then calculate Keyness
corp_key = ct.keyness(corp1freq,corp2freq, effect = "log-ratio")
ct.head(corp_key, hits = 10) #to display top hits
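
The effect sizes named above are all built from frequencies normalized by corpus size. The following sketch shows common textbook formulations of log ratio and %DIFF; it is an illustration under stated assumptions, not the toolkit's code. In particular, the function names, the invented counts, and the smoothing value used for zero frequencies are all assumptions.

```python
import math

def log_ratio(freq1, size1, freq2, size2, smoothing=0.5):
    """log2 of the ratio of relative frequencies (target / reference)."""
    # A zero frequency is replaced by `smoothing` so the ratio stays defined
    # (this choice of constant is an assumption).
    rel1 = (freq1 or smoothing) / size1
    rel2 = (freq2 or smoothing) / size2
    return math.log2(rel1 / rel2)

def percent_diff(freq1, size1, freq2, size2):
    """Percentage difference of relative frequencies (freq2 must be non-zero)."""
    rel1 = freq1 / size1
    rel2 = freq2 / size2
    return (rel1 - rel2) * 100 / rel2

# Invented counts: 200 hits in a 100,000-word target corpus,
# 100 hits in a 200,000-word reference corpus.
lr = log_ratio(200, 100_000, 100, 200_000)     # about 2.0 (fourfold difference)
pd = percent_diff(200, 100_000, 100, 200_000)  # about 300.0
print(lr, pd)
```

A log ratio of 1 means the item is twice as frequent (relative to corpus size) in the target corpus, 2 means four times as frequent, and negative values mean the item is more typical of the reference corpus.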


N-grams

N-grams are contiguous sequences of n words. The tokenize() function can be used to create an n-gram version of a corpus by employing the ngram argument. By default, words in an n-gram are separated by two underscores ("__"):

trigramfreq = ct.frequency(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False, ngram = 3))
ct.head(trigramfreq, hits = 10)
one__of__the    404
the__united__states     339
as__well__as    237
some__of__the   179
out__of__the    172
the__fact__that 167
i__do__nt       162
the__end__of    149
part__of__the   144
it__was__a      143
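
The n-gram format shown above can be reproduced with a sliding window over a token list. This is a minimal sketch of the general technique; make_ngrams() is a hypothetical helper and is not how tokenize() is necessarily implemented.

```python
def make_ngrams(tokens, n, separator="__"):
    """Return contiguous n-grams from a token list, joined by `separator`."""
    return [separator.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["one", "of", "the", "best", "players"]
trigrams = make_ngrams(tokens, 3)
print(trigrams)
# ['one__of__the', 'of__the__best', 'the__best__players']
```

A text with fewer than n tokens yields no n-grams, and applying the function one document at a time keeps n-grams from spanning document boundaries.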

Dependency bigrams

Dependency bigrams consist of two words that are syntactically connected via a head-dependent relationship. For example, in the clause "The player kicked the ball", the main verb kicked is connected to the noun ball via a direct object relationship, wherein kicked is the head and ball is the dependent.

The function dep_bigram() generates frequency dictionaries for the dependent, the head, and the dependency bigram. In addition, range is calculated along with a complete list of sentences in which the relationship occurs.

bg_dict = ct.dep_bigram(ct.ldcorpus("brown_single"),"dobj")
ct.head(bg_dict["bi_freq"], hits = 10)
#other keys include "dep_freq", "head_freq", and "range"
#also note that the key "samples" can be used to obtain a list of sample sentences
#but, this is not compatible with the ct.head() function (see ct.dep_conc() instead)
#all dependency bigrams are formatted as dependent_head
what_do 247
place_take      84
what_say        80
him_told        67
it_do   63
that_do 51
time_have       49
what_mean       46
this_do 46
what_call       42
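
The counting step behind a dependency bigram list can be sketched on a few hand-annotated tokens, without running a parser. In this illustration each token is a (lemma, relation, head lemma) tuple; the data, the count_dep_bigrams() helper, and the decision to count dependents and heads only within the chosen relation are assumptions, not a description of dep_bigram() itself.

```python
from collections import Counter

# Each token: (lemma, dependency relation, lemma of its head), annotated by hand
parsed_docs = [
    [("the", "det", "player"), ("player", "nsubj", "kick"), ("kick", "ROOT", "kick"),
     ("the", "det", "ball"), ("ball", "dobj", "kick")],
    [("she", "nsubj", "take"), ("take", "ROOT", "take"), ("place", "dobj", "take")],
    [("he", "nsubj", "kick"), ("kick", "ROOT", "kick"), ("ball", "dobj", "kick")],
]

def count_dep_bigrams(docs, relation):
    """Count dependents, heads, and dependent_head pairs for one relation."""
    counts = {"bi_freq": Counter(), "dep_freq": Counter(), "head_freq": Counter()}
    for doc in docs:
        for lemma, dep, head in doc:
            if dep == relation:
                counts["bi_freq"][lemma + "_" + head] += 1
                counts["dep_freq"][lemma] += 1
                counts["head_freq"][head] += 1
    return counts

result = count_dep_bigrams(parsed_docs, "dobj")
print(result["bi_freq"].most_common())  # [('ball_kick', 2), ('place_take', 1)]
```

As in the toolkit output, each pair is written as dependent_head, so the direct object comes first and the verb second.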

Strength of association

Various measures of strength of association can be calculated between dependents and heads. The soa() function takes a dictionary generated by the dep_bigram() function and calculates the strength of association for each dependency bigram.

soa_mi = ct.soa(bg_dict,stat = "MI")
#other stat options include: "T", "faith_dep", "faith_head","dp_dep", and "dp_head"
ct.head(soa_mi, hits = 10)
radiation_ionize        12.037110123486007
B_paragraph     12.037110123486007
suicide_commit  10.648544835568353
nose_scratch    10.39700606857239
calendar_adjust 9.972979786066292
imagination_capture     9.774075717652213
nose_blow       9.672113306706759
English_speak   9.496541742123304
throat_clear    9.367258725178337
expense_deduct  9.256227412789594
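
Directional association measures such as faithfulness and delta P can be computed from the same three kinds of counts (pair, dependent, and head frequencies). The sketch below uses common textbook definitions with generic "cue" and "outcome" arguments; how these map onto the toolkit's "faith_dep", "faith_head", "dp_dep", and "dp_head" options is not asserted here, and the function names and counts are invented.

```python
def faithfulness(bigram_freq, cue_freq):
    """Proportion of the cue's occurrences that appear in this pairing."""
    return bigram_freq / cue_freq

def delta_p(bigram_freq, cue_freq, outcome_freq, total):
    """P(outcome | cue present) minus P(outcome | cue absent)."""
    p_given_cue = bigram_freq / cue_freq
    p_given_no_cue = (outcome_freq - bigram_freq) / (total - cue_freq)
    return p_given_cue - p_given_no_cue

# Invented counts for one dependent-head pair within one relation:
# the pair occurs 8 times, the cue 10 times, the outcome 40 times,
# and there are 1,000 pairs overall.
faith = faithfulness(8, 10)
dp = delta_p(8, 10, 40, 1000)
print(faith)  # 0.8
print(dp)
```

Unlike MI, both measures are asymmetric: swapping which word is treated as the cue generally changes the score, which is why separate dependent-based and head-based variants exist.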

Concordance lines for dependency bigrams

A number of excellent cross-platform GUI-based concordancers such as AntConc are freely available, and are likely the preferred tools for most concordancing.

However, it is difficult to get concordance lines for dependency bigrams without a more advanced program. The dep_conc() function takes the samples generated by the dep_bigram() function and creates a random sample of hits (50 hits by default) formatted as an HTML file.
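
The sample-then-format idea can be sketched with the standard library alone. This is not the dep_conc() implementation: sample_to_html() is a hypothetical helper, it returns a string instead of writing a file, and it does not highlight the dependent or the head as a real concordancer would.

```python
import html
import random

def sample_to_html(sentences, hits=50, seed=None):
    """Return an HTML page containing a random sample of the sentences."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    chosen = rng.sample(sentences, min(hits, len(sentences)))
    rows = ["<p>" + html.escape(sentence) + "</p>" for sentence in chosen]
    return "<html><body>\n" + "\n".join(rows) + "\n</body></html>"

sentences = ["The player kicked the ball .", "She took <her> place ."]
page = sample_to_html(sentences, hits=50, seed=1)
print(page.count("<p>"))  # 2, because only two sentences are available
```

Escaping each sentence before wrapping it in markup keeps characters such as "<" in the corpus text from being interpreted as HTML.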

The following example will write an html file named "dobj_results.html" to your working directory.


When opened, the resulting file shows the randomly sampled concordance lines for the dependency bigram hits. [Screenshot of the HTML output]


Files for corpus-toolkit, version 0.29

Filename                                Size      File type   Python version
corpus_toolkit-0.29-py3-none-any.whl    1.7 MB    Wheel       py3
corpus_toolkit-0.29.tar.gz              1.7 MB    Source      None
