A simple Python toolkit for corpus analyses
Corpus-toolkit
The corpus-toolkit package grew out of courses in corpus linguistics and learner corpus research. The toolkit attempts to balance simplicity of use, broad application, and scalability. Common corpus analyses such as the calculation of word and n-gram frequency and range, keyness, and collocation are included. In addition, more advanced analyses such as the identification of dependency bigrams (e.g., verb-direct object combinations) and their frequency, range, and strength of association are also included.
More details on each function in the package (including various option settings) can be found on the corpus-toolkit resource page.
Install corpus-toolkit
The package can be installed using pip:
pip install corpus-toolkit
Dependencies
The corpus-toolkit package makes use of spaCy for tagging and parsing. However, the package also includes a tokenization and lemmatization function that does not require spaCy. If you want to tag or parse your files, you will need to install spaCy (and an appropriate spaCy language model).
pip install -U spacy
python -m spacy download en_core_web_sm
Quickstart guide
There are three corpus pre-processing options. The first is to use the tokenize() function, which does not rely on a part-of-speech tagger. The second is to use the tag() function, which uses spaCy to tokenize and tag the corpus. The third option is to pre-process the corpus in any way you like before using the other functions of the corpus-toolkit package.
This tutorial presumes that you have downloaded and extracted brown_single.zip, which contains a version of the Brown corpus. The folder "brown_single" should be in your working directory.
Load, tokenize, and generate a frequency list
from corpus_toolkit import corpus_tools as ct
brown_corp = ct.ldcorpus("brown_single") #load and read corpus
tok_corp = ct.tokenize(brown_corp) #tokenize corpus - by default this lemmatizes as well
brown_freq = ct.frequency(tok_corp) #creates a frequency dictionary
#note that range can be calculated instead of frequency using the argument calc = "range"
ct.head(brown_freq, hits = 10) #print top 10 items
the 69836
be 37689
of 36365
a 30475
and 28826
to 26126
in 21318
he 19417
have 11938
it 10932
The functions ldcorpus() and tokenize() are Python generators, which means that they must be re-declared each time they are used (iterated over). A slightly messier (but more appropriate) way to achieve the results above is to nest the commands.
brown_freq = ct.frequency(ct.tokenize(ct.ldcorpus("brown_single")))
ct.head(brown_freq, hits = 10)
the 69836
be 37689
of 36365
a 30475
and 28826
to 26126
in 21318
he 19417
have 11938
it 10932
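The one-shot nature of generators can be illustrated in plain Python. Below, a toy generator stands in for ldcorpus(); once it has been iterated over, a second pass yields nothing:

```python
def load_texts():
    # toy stand-in for a corpus-loading generator such as ct.ldcorpus()
    for text in ["the cat sat", "the dog ran"]:
        yield text

corpus = load_texts()
first_pass = list(corpus)   # consumes the generator
second_pass = list(corpus)  # generator is now exhausted

print(len(first_pass))   # 2
print(len(second_pass))  # 0
```

This is why nesting the calls (so a fresh generator is created on each run) is the safer pattern.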
Note that the frequency() function can also calculate range and normalized frequency figures. See the resource page for details.
Generate concordance lines
Concordance lines can be generated using the concord() function. By default, a random sample of 25 hits will be generated, with 10 tokens of left and right context.
conc_results1 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False),["run","ran","running","runs"],nhits = 10)
for x in conc_results1:
print(x)
[['buckle', 'drag', 'the', 'wagons', 'to', 'the', 'spring', 'lew', 'durkin', 'yelled'], 'run', ['em', 'right', 'into', 'the', 'spring', 'hustle', 'one', 'of', 'the', 'wagons']]
[['his', 'sweater', 'soaking', 'into', 'a', 'dark', 'streak', 'of', 'dirt', 'that'], 'ran', ['diagonally', 'across', 'the', 'white', 'wool', 'on', 'his', 'shoulder', 'as', 'though']]
[['took', 'a', 'hasty', 'shot', 'then', 'fled', 'without', 'knowing', 'the', 'result'], 'ran', ['until', 'breath', 'was', 'a', 'pain', 'in', 'his', 'chest', 'and', 'his']]
[['back', 'to', 'new', 'york', 'as', 'maude', 'suggested', 'she', 'would', 'nt'], 'run', ['like', 'a', 'scared', 'cat', 'but', 'well', 'she', 'd', 'be', 'very']]
[['with', 'that', 'soap', 'i', 'was', 'loaded', 'with', 'suds', 'when', 'i'], 'ran', ['away', 'and', 'i', 'have', 'nt', 'had', 'a', 'chance', 'to', 'wash']]
[['conditions', 'of', 'international', 'law', 'are', 'met', 'countries', 'that', 'try', 'to'], 'run', ['the', 'blockade', 'do', 'so', 'at', 'their', 'own', 'risk', 'blockade', 'runners']]
[['produce', 'something', 'which', 'has', 'not', 'previously', 'existed', 'thus', 'creativity', 'may'], 'run', ['all', 'the', 'way', 'from', 'making', 'a', 'cake', 'building', 'a', 'chicken']]
[['from', 'the', 'school', 'he', 'did', 'nt', 'look', 'back', 'and', 'he'], 'ran', ['until', 'he', 'was', 'out', 'of', 'sight', 'of', 'the', 'schoolhouse', 'and']]
[['in', 'my', 'body', 'i', 'could', 'light', 'all', 'the', 'lights', 'and'], 'run', ['all', 'the', 'factories', 'in', 'the', 'entire', 'united', 'states', 'for', 'some']]
[['in', 'any', 'time', 'they', 'please', 'sergeant', 'no', 'sir', 'running', 'in'], 'running', ['out', 'ca', 'nt', 'have', 'it', 'makes', 'for', 'confusion', 'and', 'congestion']]
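Under the hood, a concordancer is essentially a windowed search over the token stream. A minimal sketch of the idea (kwic() is a hypothetical helper, not the package's implementation):

```python
def kwic(tokens, targets, context=10):
    # collect (left context, hit, right context) tuples for each match
    hits = []
    for i, tok in enumerate(tokens):
        if tok in targets:
            left = tokens[max(0, i - context):i]
            right = tokens[i + 1:i + 1 + context]
            hits.append((left, tok, right))
    return hits

tokens = "the dog ran after the cat and the cat ran away".split()
for left, hit, right in kwic(tokens, {"ran"}, context=3):
    print(left, hit, right)
```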
Collocates can also be added as secondary search terms:
conc_results2 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False),["run","ran","running","runs"],collocates = ["quick","quickly"], nhits = 10)
for x in conc_results2:
print(x)
[['range', 'and', 'in', 'marlin', 's', 'underground', 'test', 'gallery', 'we', 'quickly'], 'ran', ['into', 'the', 'same', 'trouble', 'that', 'plagued', 'bill', 'ruger', 'in', 'his']]
[['s', 'nest', 'to', 'the', 'rocky', 'ribs', 'of', 'the', 'canyonside', 'russ'], 'ran', ['up', 'the', 'steps', 'quickly', 'to', 'the', 'plank', 'porch', 'the', 'front']]
[['hands', 'and', 'feet', 'keeping', 'the', 'hands', 'in', 'the', 'starting', 'position'], 'run', ['in', 'place', 'to', 'a', 'quick', 'rhythm', 'after', 'this', 'has', 'become']]
[['engine', 'up', 'to', 'operating', 'temperature', 'quickly', 'and', 'to', 'keep', 'it'], 'running', ['at', 'its', 'most', 'efficient', 'temperature', 'through', 'the', 'proper', 'circulation', 'of']]
Search terms (and collocate search terms) can also be interpreted as regular expressions:
conc_results3 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False),["run.*","ran"],collocates = ["quick.*"], nhits = 10, regex = True)
for x in conc_results3:
print(x)
[['impact', 'we', 'fired', 'this', 'little', '20-inch-barrel', 'job', 'on', 'my', 'home'], 'range', ['and', 'in', 'marlin', 's', 'underground', 'test', 'gallery', 'we', 'quickly', 'ran']]
[['range', 'and', 'in', 'marlin', 's', 'underground', 'test', 'gallery', 'we', 'quickly'], 'ran', ['into', 'the', 'same', 'trouble', 'that', 'plagued', 'bill', 'ruger', 'in', 'his']]
[['minutes', 'the', 'gallery', 'leaders', 'had', 'given', 'the', 'students', 'a', 'quick'], 'rundown', ['on', 'art', 'from', 'the', 'renaissance', 'to', 'the', 'late', '19th', 'century']]
[['s', 'nest', 'to', 'the', 'rocky', 'ribs', 'of', 'the', 'canyonside', 'russ'], 'ran', ['up', 'the', 'steps', 'quickly', 'to', 'the', 'plank', 'porch', 'the', 'front']]
[['hands', 'and', 'feet', 'keeping', 'the', 'hands', 'in', 'the', 'starting', 'position'], 'run', ['in', 'place', 'to', 'a', 'quick', 'rhythm', 'after', 'this', 'has', 'become']]
[['engine', 'up', 'to', 'operating', 'temperature', 'quickly', 'and', 'to', 'keep', 'it'], 'running', ['at', 'its', 'most', 'efficient', 'temperature', 'through', 'the', 'proper', 'circulation', 'of']]
Concordance lines can also be written to a file for easier analysis (e.g., using spreadsheet software). By default, items are separated by tab characters ("\t").
#write concordance lines to a file called "run_25.txt"
conc_results4 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False),["run","ran","running","runs"], outname = "run_25.txt")
Create a tagged version of your corpus
The most efficient way to conduct multiple analyses with a tagged corpus is to write a tagged version of your corpus to file and then conduct subsequent analyses with the tagged files. If this is not possible for some reason, one can always run the tagger each time an analysis is conducted.
tagged_brown = ct.tag(ct.ldcorpus("brown_single"))
ct.write_corpus("tagged_brown_single",tagged_brown) #the first argument is the folder where the tagged files will be written
The function tag() is also a Python generator, so the preferred way to write a corpus is:
ct.write_corpus("tagged_brown_single",ct.tag(ct.ldcorpus("brown_single")))
Now, we can reload our tagged corpus using the reload() function and generate a part-of-speech-sensitive frequency list.
tagged_freq = ct.frequency(ct.reload("tagged_brown_single"))
ct.head(tagged_freq, hits = 10)
the_DET 69861
be_VERB 37800
of_ADP 36322
and_CCONJ 28889
a_DET 23069
in_ADP 20967
to_PART 15409
have_VERB 11978
to_ADP 10800
he_PRON 9801
Collocation
Use the collocator() function to find collocates for a particular word.
collocates = ct.collocator(ct.tokenize(ct.ldcorpus("brown_single")),"go",stat = "MI")
#stat options include: "MI", "T", "freq", "left", and "right"
ct.head(collocates, hits = 10)
downstairs 7.875170389265524
upstairs 6.915812373762869
bedroom 6.627242875821938
abroad 6.273134375185426
re 6.21620730710059
m 6.211322724303333
forever 6.174730671124432
stanley 6.174730671124432
let 5.938347287580174
wrong 5.868744120106091
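The MI scores above follow the standard pointwise mutual information formula: the binary log of observed co-occurrence over the co-occurrence expected under independence. A sketch of that calculation with hypothetical counts (this ignores the package's span/window details):

```python
import math

def mutual_information(cooccur, node_freq, coll_freq, corpus_size):
    # expected co-occurrence if node and collocate were independent
    expected = (node_freq * coll_freq) / corpus_size
    return math.log2(cooccur / expected)

# hypothetical counts: node and collocate in a 1,000,000-token corpus
print(round(mutual_information(20, 2000, 100, 1_000_000), 3))
```

Because the denominator shrinks with the collocate's overall frequency, MI tends to reward rare, exclusive pairings.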
Keyness
Keyness is calculated using two frequency dictionaries (consisting of raw frequency values). Only effect sizes are reported (p values are arguably not particularly useful for keyness analyses). Keyness calculation options include "log-ratio", "%diff", and "odds-ratio".
#First, generate frequency lists for each corpus
corp1freq = ct.frequency(ct.tokenize(ct.ldcorpus("corp1")))
corp2freq = ct.frequency(ct.tokenize(ct.ldcorpus("corp2")))
#then calculate Keyness
corp_key = ct.keyness(corp1freq,corp2freq, effect = "log-ratio")
ct.head(corp_key, hits = 10) #to display top hits
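The "log-ratio" effect size is the binary log of the ratio of the word's normalized frequencies in the two corpora. A minimal sketch with hypothetical counts (real implementations also smooth zero frequencies, which is omitted here):

```python
import math

def log_ratio(freq1, size1, freq2, size2):
    # binary log of the ratio of normalized frequencies across two corpora
    norm1 = freq1 / size1
    norm2 = freq2 / size2
    return math.log2(norm1 / norm2)

# hypothetical: 200 vs. 50 occurrences per million tokens -> log-ratio of 2,
# i.e., the word is four times more frequent in corpus 1
print(log_ratio(200, 1_000_000, 50, 1_000_000))
```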
N-grams
N-grams are contiguous sequences of n words. The tokenize() function can be used to create an n-gram version of a corpus by employing the ngram argument. By default, words in an n-gram are joined by two underscores ("__").
trigramfreq = ct.frequency(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False, ngram = 3))
ct.head(trigramfreq, hits = 10)
one__of__the 404
the__united__states 339
as__well__as 237
some__of__the 179
out__of__the 172
the__fact__that 167
i__do__nt 162
the__end__of 149
part__of__the 144
it__was__a 143
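Building such joined n-grams amounts to sliding a window of n tokens across each text, which can be sketched as follows (an illustration, not the package's code):

```python
def ngrams(tokens, n, joiner="__"):
    # slide a window of n tokens across the list and join each window
    return [joiner.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("one of the best".split(), 3))
# ['one__of__the', 'of__the__best']
```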
Dependency bigrams
Dependency bigrams consist of two words that are syntactically connected via a head-dependent relationship. For example, in the clause "The player kicked the ball", the main verb kicked is connected to the noun ball via a direct object relationship, wherein kicked is the head and ball is the dependent.
The function dep_bigram() generates frequency dictionaries for the dependent, the head, and the dependency bigram. In addition, range is calculated along with a complete list of sentences in which the relationship occurs.
bg_dict = ct.dep_bigram(ct.ldcorpus("brown_single"),"dobj")
ct.head(bg_dict["bi_freq"], hits = 10)
#other keys include "dep_freq", "head_freq", and "range"
#also note that the key "samples" can be used to obtain a list of sample sentences
#but, this is not compatible with the ct.head() function (see ct.dep_conc() instead)
#all dependency bigrams are formatted as dependent_head
what_do 247
place_take 84
what_say 80
him_told 67
it_do 63
that_do 51
time_have 49
what_mean 46
this_do 46
what_call 42
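Extracting such pairs from a parsed sentence reduces to matching on the dependency label and looking up the head's lemma. A sketch over a hand-parsed example (plain tuples stand in for spaCy tokens here, so no model is required; the real dep_bigram() function runs the spaCy parser):

```python
# each token: (lemma, dependency label, index of its head in the sentence)
parsed = [
    ("the", "det", 1),
    ("player", "nsubj", 2),
    ("kick", "ROOT", 2),
    ("the", "det", 4),
    ("ball", "dobj", 2),
]

def dep_bigrams(tokens, label):
    # pair each dependent carrying the target label with its head's lemma,
    # formatted dependent_head as in the package's output
    return [f"{lemma}_{tokens[head][0]}"
            for lemma, dep, head in tokens if dep == label]

print(dep_bigrams(parsed, "dobj"))  # ['ball_kick']
```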
Strength of association
Various measures of strength of association can be calculated between dependents and heads. The soa() function takes a dictionary generated by the dep_bigram() function and calculates the strength of association for each dependency bigram.
soa_mi = ct.soa(bg_dict,stat = "MI")
#other stat options include: "T", "faith_dep", "faith_head","dp_dep", and "dp_head"
ct.head(soa_mi, hits = 10)
radiation_ionize 12.037110123486007
B_paragraph 12.037110123486007
suicide_commit 10.648544835568353
nose_scratch 10.39700606857239
calendar_adjust 9.972979786066292
imagination_capture 9.774075717652213
nose_blow 9.672113306706759
English_speak 9.496541742123304
throat_clear 9.367258725178337
expense_deduct 9.256227412789594
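For the "T" option, the usual T-score formula compares observed co-occurrence against the expected value, scaled by the square root of the observed count. A sketch with hypothetical counts (the package's exact variant may differ):

```python
import math

def t_score(cooccur, dep_freq, head_freq, corpus_size):
    # observed minus expected co-occurrence, divided by sqrt(observed)
    expected = (dep_freq * head_freq) / corpus_size
    return (cooccur - expected) / math.sqrt(cooccur)

# hypothetical counts for a dependent-head pair
print(round(t_score(100, 500, 2000, 1_000_000), 3))  # 9.9
```

Unlike MI, the T-score grows with raw frequency, so it favors frequent, reliable pairings over rare, exclusive ones.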
Concordance lines for dependency bigrams
A number of excellent cross-platform GUI-based concordancers such as AntConc are freely available, and are likely the preferred method for most concordancing.
However, it is difficult to get concordance lines for dependency bigrams without a more advanced program. The dep_conc() function takes the samples generated by the dep_bigram() function and creates a random sample of hits (50 hits by default) formatted as an html file.
The following example will write an html file named "dobj_results.html" to your working directory.
ct.dep_conc(bg_dict["samples"],"dobj_results")
When opened, the resulting file will include the following:
A fringe of housing and gardens bearded_dobj_head the top_dobj_dep of the heights , and behind it were sandy roads leading past farms and hayfields . 39
A man with insomnia had better avoid_dobj_head bad dreams_dobj_dep of that kind if he knew what was good for him . 241
He simply would not work_dobj_head his arithmetic problems_dobj_dep when the teacher held his class . 192
You may be sure he marries her in the end and has_dobj_head a fine old knockdown fight_dobj_dep with the brother , and that there are plenty of minor scraps along the way to ensure that you understand what the word Donnybrook means . 198
Anyone familiar with the details of the McClellan hearings must at once realize that the sweetheart arrangements augmented_dobj_head employer profits_dobj_dep far more than they augmented the earnings of the corruptible labor leaders . 407
If the transferor has_dobj_head substantial assets_dobj_dep other than the claim , it seems reasonable to assume no corporation would be willing to acquire all of its properties in the dim hope of collecting a claim for refund of taxes . 433
For the first few months of their marriage she had tried to be nice about Gunny , going out with him to watch_dobj_head this pearl_dobj_dep without price stamp imperiously around in her stall . 441
If the site is on a reservoir , the level of the water at various seasons as it affects_dobj_head recreation_dobj_dep should be studied . 471
She thrust forward through the shadows and the trees that resisted_dobj_head her_dobj_dep and tried to fling her back . 226
The most infamous of all was launched by the explosion of the island of Krakatoa in 1883 ; ; it raced across the Pacific at 300 miles an hour , devastated_dobj_head the coasts_dobj_dep of Java and Sumatra with waves 100 to 130 feet high , and pounded the shore as far away as San Francisco . 40