CWB wrapper to extract concordances and score frequency lists
Collocation and Concordance Computation
Introduction
This module is a Python wrapper around the IMS Open Corpus Workbench (CWB). Its main purpose is to run queries, extract concordance lines, and score frequency lists (e.g. collocates and keywords).
Prerequisites
The module needs a working installation of the CWB and operates on CWB-indexed corpora.
If you want to run queries with more than two anchor points, the module requires CWB version 3.4.16 or later.
Installation
You can install this module with pip from PyPI:
pip3 install cwb-ccc
You can also clone the source from GitHub, cd into the respective folder, and build your own wheel:
pipenv install --dev
python3 setup.py bdist_wheel
Accessing Corpora
To list all available corpora, you can use
from ccc import Corpora
corpora = Corpora(registry_path="/usr/local/share/cwb/registry/")
print(corpora)
corpora.show() # returns a DataFrame
All further methods rely on the Corpus class, which establishes the connection to your CWB-indexed corpus. You can activate a corpus with
corpus = corpora.activate(corpus_name="GERMAPARL1386")
or directly use the respective class:
from ccc import Corpus
corpus = Corpus(
corpus_name="GERMAPARL1386",
registry_path="/usr/local/share/cwb/registry/"
)
This will raise a KeyError if the named corpus is not in the specified registry.
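If you want to fail gracefully instead, here is a minimal sketch (the corpus name is just an example):
from ccc import Corpus
try:
    corpus = Corpus(corpus_name="GERMAPARL1386", registry_path="/usr/local/share/cwb/registry/")
except KeyError:
    print("corpus not found in the given registry")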
If you are using macros and wordlists, you have to store them in a separate folder (with subfolders "wordlists/" and "macros/"). Specify this folder via the lib_path parameter when initializing the corpus. You can use the cqp_bin parameter to point the module to a specific version of cqp (this is also helpful if cqp is not in your PATH).
By default, the data_path points to "/tmp/ccc-{version}/". Make sure that "/tmp/" exists and appropriate rights are granted. Otherwise, change the parameter when initializing the corpus. Note that each corpus will have its own subdirectory for each library.
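Putting these options together, corpus initialization might look as follows (a sketch; the paths are placeholders for your own setup):
from ccc import Corpus
corpus = Corpus(
    corpus_name="GERMAPARL1386",
    registry_path="/usr/local/share/cwb/registry/",
    lib_path="/path/to/ccc-library/",  # must contain "wordlists/" and "macros/"
    cqp_bin="/usr/local/bin/cqp",      # explicit path to the cqp binary
    data_path="/tmp/ccc-data/"         # cache directory
)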
If everything is set up correctly, you can list all available attributes of the activated corpus:
corpus.attributes_available
type | attribute | annotation | active |
---|---|---|---|
p-Att | word | False | True |
p-Att | pos | False | False |
p-Att | lemma | False | False |
s-Att | corpus | False | False |
s-Att | corpus_name | True | False |
s-Att | sitzung | False | False |
s-Att | sitzung_date | True | False |
s-Att | sitzung_period | True | False |
s-Att | sitzung_session | True | False |
s-Att | div | False | False |
s-Att | div_desc | True | False |
s-Att | div_n | True | False |
s-Att | div_type | True | False |
s-Att | div_what | True | False |
s-Att | text | False | False |
s-Att | text_id | True | False |
s-Att | text_name | True | False |
s-Att | text_parliamentary_group | True | False |
s-Att | text_party | True | False |
s-Att | text_position | True | False |
s-Att | text_role | True | False |
s-Att | text_who | True | False |
s-Att | p | False | False |
s-Att | p_type | True | False |
s-Att | s | False | False |
Usage
Queries and Dumps
The usual starting point for using this module is to run a query with corpus.query(), which accepts valid CQP queries such as
query = r'"\[" ([word="[A-Z0-9]+.?"%d]+ "/"?)+ "\]"'
dump = corpus.query(query)
The result is a Dump object. Its core is a pandas DataFrame (dump.df) similar to a CQP dump and multi-indexed by "match" and "matchend" of the query. All entries of the DataFrame, including the index, are integers representing corpus positions:
dump.df
match | matchend | context | contextend |
---|---|---|---|
2313 | 2319 | 2293 | 2339 |
8213 | 8217 | 8193 | 8237 |
8438 | 8444 | 8418 | 8464 |
15999 | 16001 | 15979 | 16021 |
24282 | 24288 | 24262 | 24308 |
... | ... | ... | ... |
You can provide one or more parameters to define the context around the matches: context specifies the size of the context window (defaults to 20) and context_break names an s-attribute to limit the context. You can specify asymmetric windows via context_left and context_right.
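Assuming corpus.query() accepts the same context parameters as set_context() shown below, an asymmetric window can be requested directly when querying (a sketch):
dump = corpus.query(query, context_left=5, context_right=10, context_break='s')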
When providing an s-attribute limiting the context, the module additionally retrieves the CWB-id of this attribute, the corpus positions of the respective span start and end, as well as the actual context spans:
dump = corpus.query(
cqp_query=query,
context=20,
context_break='s'
)
dump.df
match | matchend | s_cwbid | s_span | s_spanend | contextid | context | contextend |
---|---|---|---|---|---|---|---|
2313 | 2319 | 161 | 2304 | 2320 | 161 | 2308 | 2320 |
8213 | 8217 | 489 | 8187 | 8218 | 489 | 8208 | 8218 |
8438 | 8444 | 500 | 8425 | 8445 | 500 | 8433 | 8445 |
15999 | 16001 | 905 | 15992 | 16002 | 905 | 15994 | 16002 |
24282 | 24288 | 1407 | 24273 | 24289 | 1407 | 24277 | 24289 |
... | ... | ... | ... | ... | ... | ... | ... |
There are two reasons for defining the context when running a query:
- If you provide a context_break parameter, the query will be automatically confined to spans delimited by this s-attribute; this is equivalent to formulating a query that ends on a respective "within" clause.
- Subsequent analyses (concordancing, collocation) will all work on the same context.
Notwithstanding (1), the context can also be set after having run the query:
dump.set_context(context_left=5, context_right=10, context_break='s')
Note that this works "inplace".
You can set CQP's matching strategy ("standard", "longest", "shortest", "traditional") via the match_strategy parameter.
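For example (re-using the query defined above):
dump = corpus.query(query, context_break='s', match_strategy='longest')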
By default, the result is cached: the query parameters are used to create an appropriate identifier. This way, the result can be accessed directly by later queries with the same parameters on the same (sub)corpus, without the need for CQP to run again.
We are set up to analyze the query result. Here's the frequency breakdown:
dump.breakdown()
word | freq |
---|---|
[ SPD ] | 18 |
[ F. D. P. ] | 14 |
[ CDU / CSU ] | 13 |
[ BÜNDNIS 90 / DIE GRÜNEN ] | 12 |
[ PDS ] | 6 |
Concordancing
You can access concordance lines via the concordance() method of the dump. This method returns a DataFrame with information about the query matches in context:
dump.concordance()
match | matchend | word |
---|---|---|
2313 | 2319 | Joseph Fischer [ Frankfurt ] [ BÜNDNIS 90 / DIE GRÜNEN ] ) |
8213 | 8217 | Widerspruch des Abg. Wolfgang Zöller [ CDU / CSU ] ) |
8438 | 8444 | Joseph Fischer [ Frankfurt ] [ BÜNDNIS 90 / DIE GRÜNEN ] ) |
15999 | 16001 | des Abg. Dr. Peter Struck [ SPD ] ) |
24282 | 24288 | Joseph Fischer [ Frankfurt ] [ BÜNDNIS 90 / DIE GRÜNEN ] ) |
... | ... | ... |
By default, the output is a "simple" format, i.e. a DataFrame indexed by "match" and "matchend" with a column "word" showing the matches in context. You can choose which p-attributes to retrieve via the p_show parameter. Similarly, you can retrieve s-attributes (at match-position):
dump.concordance(p_show=["word", "lemma"], s_show=["text_id"])
match | matchend | word | lemma | text_id |
---|---|---|---|---|
2313 | 2319 | Joseph Fischer [ Frankfurt ] [ BÜNDNIS 90 / DIE GRÜNEN ] ) | Joseph Fischer [ Frankfurt ] [ Bündnis 90 / die Grünen ] ) | i13_86_1_2 |
8213 | 8217 | Widerspruch des Abg. Wolfgang Zöller [ CDU / CSU ] ) | Widerspruch die Abg. Wolfgang Zöller [ CDU / CSU ] ) | i13_86_1_4 |
8438 | 8444 | Joseph Fischer [ Frankfurt ] [ BÜNDNIS 90 / DIE GRÜNEN ] ) | Joseph Fischer [ Frankfurt ] [ Bündnis 90 / die Grünen ] ) | i13_86_1_4 |
15999 | 16001 | des Abg. Dr. Peter Struck [ SPD ] ) | die Abg. Dr. Peter Struck [ SPD ] ) | i13_86_1_8 |
24282 | 24288 | Joseph Fischer [ Frankfurt ] [ BÜNDNIS 90 / DIE GRÜNEN ] ) | Joseph Fischer [ Frankfurt ] [ Bündnis 90 / die Grünen ] ) | i13_86_1_24 |
... | ... | ... | ... | ... |
The format can be changed using the form parameter. The "kwic" format e.g. returns three columns for each requested p-attribute:
dump.concordance(form="kwic")
match | matchend | left_word | node_word | right_word |
---|---|---|---|---|
2313 | 2319 | Joseph Fischer [ Frankfurt ] | [ BÜNDNIS 90 / DIE GRÜNEN ] | ) |
8213 | 8217 | Widerspruch des Abg. Wolfgang Zöller | [ CDU / CSU ] | ) |
8438 | 8444 | Joseph Fischer [ Frankfurt ] | [ BÜNDNIS 90 / DIE GRÜNEN ] | ) |
15999 | 16001 | des Abg. Dr. Peter Struck | [ SPD ] | ) |
24282 | 24288 | Joseph Fischer [ Frankfurt ] | [ BÜNDNIS 90 / DIE GRÜNEN ] | ) |
If you want to inspect each query result in detail, use form="dataframe"; here, every concordance line is verticalized text formatted as a DataFrame with the cpos of each token as index:
lines = dump.concordance(p_show=['word', 'pos', 'lemma'], form='dataframe')
lines.iloc[0]['dataframe']
cpos | offset | word | pos | lemma |
---|---|---|---|---|
2308 | -5 | Joseph | NE | Joseph |
2309 | -4 | Fischer | NE | Fischer |
2310 | -3 | [ | XY | [ |
2311 | -2 | Frankfurt | NE | Frankfurt |
2312 | -1 | ] | APPRART | ] |
2313 | 0 | [ | ADJA | [ |
2314 | 0 | BÜNDNIS | NN | Bündnis |
2315 | 0 | 90 | CARD | 90 |
2316 | 0 | / | $( | / |
2317 | 0 | DIE | ART | die |
2318 | 0 | GRÜNEN | NN | Grünen |
2319 | 0 | ] | $. | ] |
2320 | 1 | ) | $( | ) |
Further forms are "slots" (see below) and "dict". In the latter case, every entry in the "dict" column is a dictionary with the following keys:
- "match" (int): the cpos of the match (serves as an identifier)
- "cpos" (list): the cpos of all tokens in the concordance line
- "offset" (list): the offset to match/matchend of all tokens
- "word" (list): the words of all tokens
- "anchors" (dict): a dictionary of {anchor: cpos} (see below)
This format is especially suitable for serialization purposes.
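For example, a sketch of serializing lines to JSON via this format (assuming the "dict" column described above):
import json

lines = dump.concordance(form='dict')
records = lines['dict'].tolist()  # one dictionary per concordance line
print(json.dumps(records[0], ensure_ascii=False, indent=2))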
You can decide which and how many concordance lines you want to retrieve by means of the parameters order ("first", "last", or "random") and cut_off. You can also provide a list of matches to get only specific concordance lines.
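For example, to draw a random sample of 10 concordance lines (a minimal sketch using the parameters just described):
sample = dump.concordance(order='random', cut_off=10)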
Anchored Queries
The concordancer detects anchored queries automatically. The following query
dump = corpus.query(
cqp_query=r'@1[pos="NE"]? @2[pos="NE"] @3"\[" ([word="[A-Z0-9]+.?"%d]+ "/"?)+ @4"\]"',
context=None, context_break='s', match_strategy='longest'
)
lines = dump.concordance(form='dataframe')
thus returns DataFrames with additional columns for each anchor point:
lines.iloc[0]['dataframe']
cpos | offset | word | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|---|
8187 | -24 | ( | False | False | False | False |
8188 | -23 | Anhaltender | False | False | False | False |
8189 | -22 | lebhafter | False | False | False | False |
8190 | -21 | Beifall | False | False | False | False |
8191 | -20 | bei | False | False | False | False |
8192 | -19 | der | False | False | False | False |
8193 | -18 | SPD | False | False | False | False |
8194 | -17 | -- | False | False | False | False |
8195 | -16 | Beifall | False | False | False | False |
8196 | -15 | bei | False | False | False | False |
8197 | -14 | Abgeordneten | False | False | False | False |
8198 | -13 | des | False | False | False | False |
8199 | -12 | BÜNDNISSES | False | False | False | False |
8200 | -11 | 90 | False | False | False | False |
8201 | -10 | / | False | False | False | False |
8202 | -9 | DIE | False | False | False | False |
8203 | -8 | GRÜNEN | False | False | False | False |
8204 | -7 | und | False | False | False | False |
8205 | -6 | der | False | False | False | False |
8206 | -5 | PDS | False | False | False | False |
8207 | -4 | -- | False | False | False | False |
8208 | -3 | Widerspruch | False | False | False | False |
8209 | -2 | des | False | False | False | False |
8210 | -1 | Abg. | False | False | False | False |
8211 | 0 | Wolfgang | True | False | False | False |
8212 | 0 | Zöller | False | True | False | False |
8213 | 0 | [ | False | False | True | False |
8214 | 0 | CDU | False | False | False | False |
8215 | 0 | / | False | False | False | False |
8216 | 0 | CSU | False | False | False | False |
8217 | 0 | ] | False | False | False | True |
8218 | 1 | ) | False | False | False | False |
For an analysis of certain spans of your query matches, you can use anchor points to define "slots", i.e. single anchors or spans between anchors that define sub-parts of your matches. Use the "slots" format to extract these parts from each match:
dump = corpus.query(
r'@1[pos="NE"]? @2[pos="NE"] @3"\[" ([word="[A-Z0-9]+.?"%d]+ "/"?)+ @4"\]"',
context=0, context_break='s', match_strategy='longest',
)
lines = dump.concordance(
form='slots', p_show=['word', 'lemma'],
slots={"name": [1, 2], "party": [3, 4]}
)
lines
match | matchend | word | name_word | party_word |
---|---|---|---|---|
8211 | 8217 | Wolfgang Zöller [ CDU / CSU ] | Wolfgang Zöller | [ CDU / CSU ] |
15997 | 16001 | Peter Struck [ SPD ] | Peter Struck | [ SPD ] |
25512 | 25516 | Jörg Tauss [ SPD ] | Jörg Tauss | [ SPD ] |
32808 | 32814 | Ina Albowitz [ F. D. P. ] | Ina Albowitz | [ F. D. P. ] |
36980 | 36984 | Christa Luft [ PDS ] | Christa Luft | [ PDS ] |
... | ... | ... | ... | ... |
The module allows for correction of anchor points by integer offsets. This is especially helpful if the query contains optional parts (defined by ?, + or *) – note that this works inplace:
dump.correct_anchors({3: +1, 4: -1})
lines = dump.concordance(
form='slots', p_show=['word', 'lemma'],
slots={"name": [1, 2], "party": [3, 4]}
)
lines
match | matchend | word | name_word | party_word |
---|---|---|---|---|
8211 | 8217 | Wolfgang Zöller [ CDU / CSU ] | Wolfgang Zöller | CDU / CSU |
15997 | 16001 | Peter Struck [ SPD ] | Peter Struck | SPD |
25512 | 25516 | Jörg Tauss [ SPD ] | Jörg Tauss | SPD |
32808 | 32814 | Ina Albowitz [ F. D. P. ] | Ina Albowitz | F. D. P. |
36980 | 36984 | Christa Luft [ PDS ] | Christa Luft | PDS |
... | ... | ... | ... | ... |
Collocation Analyses
After executing a query, you can use dump.collocates() to extract collocates for a given window size (symmetric windows around the corpus matches). The result will be a DataFrame with lexical items (e.g. lemmata) as index and frequency signatures and association measures as columns.
dump = corpus.query('[lemma="SPD"]', context=10, context_break='s')
dump.collocates()
item | O11 | O12 | O21 | O22 | E11 | E12 | E21 | E22 | z_score | t_score | log_likelihood | simple_ll | dice | log_ratio | mutual_information | local_mutual_information | conservative_log_ratio | ipm | ipm_expected | ipm_reference | ipm_reference_expected | in_nodes | marginal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
die | 813 | 4373 | 12952 | 131030 | 478.556 | 4707.44 | 13286.4 | 130696 | 15.2882 | 11.7295 | 226.513 | 192.823 | 0.0858 | 0.801347 | 0.230157 | 187.118 | 0.605981 | 156768 | 92278.5 | 89955.7 | 92278.5 | 0 | 13765 |
bei | 366 | 4820 | 991 | 142991 | 47.1777 | 5138.82 | 1309.82 | 142672 | 46.4174 | 16.6651 | 967.728 | 862.013 | 0.111875 | 3.35808 | 0.889744 | 325.646 | 3.00871 | 70574.6 | 9097.13 | 6882.8 | 9097.13 | 0 | 1357 |
( | 314 | 4872 | 1444 | 142538 | 61.1189 | 5124.88 | 1696.88 | 142285 | 32.3466 | 14.2709 | 574.854 | 522.005 | 0.090438 | 2.59389 | 0.710754 | 223.177 | 2.23788 | 60547.6 | 11785.4 | 10029 | 11785.4 | 0 | 1758 |
[ | 221 | 4965 | 477 | 143505 | 24.2668 | 5161.73 | 673.733 | 143308 | 39.9366 | 13.2337 | 654.834 | 582.935 | 0.075119 | 3.68518 | 0.95938 | 212.023 | 3.21474 | 42614.7 | 4679.29 | 3312.91 | 4679.29 | 0 | 698 |
) | 207 | 4979 | 1620 | 142362 | 63.5178 | 5122.48 | 1763.48 | 142219 | 18.0032 | 9.9727 | 218.341 | 202.135 | 0.059033 | 1.82683 | 0.513075 | 106.207 | 1.40153 | 39915.2 | 12247.9 | 11251.4 | 12247.9 | 0 | 1827 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
By default, collocates are calculated on the "lemma"-layer, assuming that this is an available p-attribute in the corpus. The corresponding parameter is p_query (which will fall back to "word" if the specified attribute is not annotated in the corpus).
New in version 0.9.14: You can now perform collocation analyses on combinations of p-attribute layers, the most prominent use case being POS-disambiguated lemmata:
dump.collocates(['lemma', 'pos'], order='log_likelihood')
item | O11 | O12 | O21 | O22 | E11 | E12 | E21 | E22 | z_score | t_score | log_likelihood | simple_ll | dice | log_ratio | mutual_information | local_mutual_information | conservative_log_ratio | ipm | ipm_expected | ipm_reference | ipm_reference_expected | in_nodes | marginal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bei APPR | 360 | 4826 | 869 | 143113 | 42.7276 | 5143.27 | 1186.27 | 142796 | 48.5376 | 16.7217 | 1014.28 | 899.961 | 0.112237 | 3.52376 | 0.925594 | 333.214 | 3.16388 | 69417.7 | 8239.03 | 6035.48 | 8239.03 | 0 | 1229 |
( $( | 314 | 4872 | 1444 | 142538 | 61.1189 | 5124.88 | 1696.88 | 142285 | 32.3466 | 14.2709 | 574.854 | 522.005 | 0.090438 | 2.59389 | 0.710754 | 223.177 | 2.23649 | 60547.6 | 11785.4 | 10029 | 11785.4 | 0 | 1758 |
Beifall NN | 199 | 4987 | 471 | 143511 | 23.2933 | 5162.71 | 646.707 | 143335 | 36.406 | 12.4555 | 561.382 | 502.351 | 0.067964 | 3.55216 | 0.931621 | 185.393 | 3.06089 | 38372.5 | 4491.58 | 3271.24 | 4491.58 | 0 | 670 |
[ $( | 161 | 5025 | 259 | 143723 | 14.6018 | 5171.4 | 405.398 | 143577 | 38.3118 | 11.5378 | 545.131 | 480.087 | 0.057438 | 4.10923 | 1.04242 | 167.83 | 3.52364 | 31045.1 | 2815.62 | 1798.84 | 2815.62 | 0 | 420 |
]: $( | 139 | 5047 | 340 | 143642 | 16.653 | 5169.35 | 462.347 | 143520 | 29.9811 | 10.3773 | 383.895 | 345.19 | 0.049073 | 3.50467 | 0.921522 | 128.092 | 2.91721 | 26802.9 | 3211.14 | 2361.41 | 3211.14 | 0 | 479 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
For improved performance, all hapax legomena in the context are dropped after calculating the context size. You can change this behaviour via the min_freq parameter.
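For example, to keep only collocates that appear at least five times in the context (a sketch using the min_freq parameter):
collocates = dump.collocates(min_freq=5)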
By default, the dataframe contains the counts, namely
- observed and expected absolute frequencies (columns O11, ..., E22),
- observed and expected relative frequencies (instances per million, IPM),
- marginal frequencies, and
- instances within nodes.
You can drop the counts by specifying freq=False. By default, the dataframe is annotated with all available association measures in the pandas-association-measures package (parameter ams). For notation and further information regarding association measures, see collocations.de.
The dataframe is sorted by co-occurrence frequency (column "O11"), and only the 100 most frequently co-occurring collocates are retrieved. You can (and should) change this behaviour via the order and cut_off parameters.
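Putting these parameters together, a sketch of ranking by log-likelihood (the ams list format and cut_off=None meaning "no cut-off" are assumptions):
collocates = dump.collocates(
    order='log_likelihood',              # sort by association measure instead of O11
    cut_off=None,                        # assumption: disables the cut-off
    freq=False,                          # drop the frequency columns
    ams=['log_likelihood', 'log_ratio']  # restrict the association measures
)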
Subcorpora
In cwb-ccc terms, every instance of a Dump is a subcorpus. There are two ways to get a dump. The first is running a traditional query as outlined above; the following query e.g. defines a subcorpus of all sentences that contain "SPD":
dump = corpus.query('"SPD" expand to s')
Alternatively, you can define subcorpora via values stored in s-attributes. A subcorpus of all noun phrases (assuming they are indexed as structural attribute np) can e.g. be extracted using
dump = corpus.query_s_att("np")
You can also query the respective annotations:
dump = corpus.query_s_att("text_party", {"CDU", "CSU"})
will e.g. retrieve all text spans with respective constraints on the party annotation.
Implementation note: While the CWB does allow storage of arbitrary metadata in s-attributes, it does not index these attributes. corpus.query_s_att() thus creates a dataframe with the spans of the s-attribute encoded as matches and caches the result. Consequently, the first query of an s-attribute will be comparatively slow, and subsequent queries will be faster.
Note also that the CWB does not allow complex queries on s-attributes. It is thus reasonable to store metadata in separate spreadsheets or relational databases and link to text spans via simple identifiers (see the sketch below). This way (1) you can work with natural metadata queries and (2) working with a small number of s-attributes also unburdens the cache.
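A sketch of this approach: keep metadata in an external table keyed by "text_id" and join it onto the concordance with pandas (the CSV file and its columns are hypothetical):
import pandas as pd

meta = pd.read_csv("text_metadata.csv")       # hypothetical table with columns text_id, party, role, ...
lines = dump.concordance(s_show=["text_id"])  # matches together with their text_id
lines = lines.reset_index().merge(meta, on="text_id", how="left")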
In CWB terms, subcorpora are named query results (NQRs), which consist of the corpus positions of match and matchend (and optional anchor points called anchor and keyword). If you give a name when using corpus.query() or corpus.query_s_att(), the respective matches of the dump will also be available as NQRs in CQP.
This way you can run queries on NQRs in CQP (a.k.a. subqueries). Compare e.g. the frequency breakdown for a query on the whole corpus
corpus.query('[lemma="sagen"]').breakdown()
word | freq |
---|---|
sagen | 234 |
gesagt | 131 |
sage | 69 |
sagt | 39 |
Sagen | 12 |
sagte | 6 |
with the one on a subcorpus:
corpus.query_s_att("text_party", values={"CDU", "CSU"}, name="Union")
corpus.activate_subcorpus("Union")
print(corpus.subcorpus)
> 'Union'
corpus.query('[lemma="sagen"]').breakdown()
word | freq |
---|---|
sagen | 64 |
gesagt | 45 |
sage | 30 |
sagt | 12 |
Sagen | 6 |
sagte | 3 |
Don't forget to switch back to the main corpus when you are done with the analysis on the activated NQR:
corpus.activate_subcorpus() # switches to main corpus when given no name
print(corpus.subcorpus)
> None
You can access all available NQRs via
corpus.show_nqr()
corpus | subcorpus | size | storage |
---|---|---|---|
GERMAPARL1386 | Union | 82 | md- |
Keyword Analyses
Having created a subcorpus (a dump)
dump = corpus.query_s_att("text_party", values={"CDU", "CSU"})
you can use its keywords() method for retrieving keywords:
dump.keywords()
item | O11 | O12 | O21 | O22 | E11 | E12 | E21 | E22 | z_score | t_score | log_likelihood | simple_ll | dice | log_ratio | mutual_information | local_mutual_information | conservative_log_ratio | ipm | ipm_expected | ipm_reference | ipm_reference_expected |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
, | 2499 | 38852 | 5371 | 103078 | 2172.45 | 39178.6 | 5697.55 | 102751 | 7.00617 | 6.53239 | 69.6474 | 46.7967 | 0.101542 | 0.287183 | 0.060817 | 151.982 | 0.131751 | 60433.8 | 52536.7 | 49525.6 | 52536.7 |
in | 867 | 40484 | 1631 | 106818 | 689.551 | 40661.4 | 1808.45 | 106641 | 6.75755 | 6.02647 | 61.2663 | 42.1849 | 0.039545 | 0.47937 | 0.099452 | 86.2253 | 0.204192 | 20966.8 | 16675.6 | 15039.3 | 16675.6 |
CSU | 255 | 41096 | 380 | 108069 | 175.286 | 41175.7 | 459.714 | 107989 | 6.02087 | 4.99187 | 46.6543 | 31.7425 | 0.012147 | 0.81552 | 0.162792 | 41.512 | 0.281799 | 6166.72 | 4238.99 | 3503.95 | 4238.99 |
CDU | 260 | 41091 | 390 | 108059 | 179.427 | 41171.6 | 470.573 | 107978 | 6.01515 | 4.99693 | 46.6055 | 31.7289 | 0.012381 | 0.80606 | 0.161086 | 41.8823 | 0.27822 | 6287.64 | 4339.12 | 3596.16 | 4339.12 |
deswegen | 55 | 41296 | 37 | 108412 | 25.3958 | 41325.6 | 66.6042 | 108382 | 5.87452 | 3.99183 | 41.5308 | 25.794 | 0.002654 | 1.96293 | 0.335601 | 18.458 | 0.558012 | 1330.08 | 614.152 | 341.174 | 614.152 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Just as with collocates, the result is a DataFrame with lexical items (p_query layer) as index and frequency signatures and association measures as columns.
New in version 0.9.14: Keywords for p-attribute combinations:
dump.keywords(["lemma", "pos"], order="log_likelihood")
item | O11 | O12 | O21 | O22 | E11 | E12 | E21 | E22 | z_score | t_score | log_likelihood | simple_ll | dice | log_ratio | mutual_information | local_mutual_information | conservative_log_ratio | ipm | ipm_expected | ipm_reference | ipm_reference_expected |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
, $, | 2499 | 38288 | 5371 | 103642 | 2142.82 | 38644.2 | 5727.18 | 103286 | 7.69455 | 7.12512 | 83.3197 | 56.1738 | 0.102719 | 0.314479 | 0.066782 | 166.887 | 0.158575 | 61269.5 | 52536.7 | 49269.4 | 52536.7 |
F. NN | 161 | 40626 | 192 | 108821 | 96.1136 | 40690.9 | 256.886 | 108756 | 6.61853 | 5.11377 | 54.4564 | 36.3385 | 0.007827 | 1.16427 | 0.224041 | 36.0706 | 0.456631 | 3947.34 | 2356.48 | 1761.26 | 2356.48 |
CSU NE | 255 | 40532 | 380 | 108633 | 172.895 | 40614.1 | 462.105 | 108551 | 6.24418 | 5.14158 | 49.731 | 33.9649 | 0.012312 | 0.842817 | 0.168757 | 43.0329 | 0.307344 | 6251.99 | 4238.99 | 3485.82 | 4238.99 |
CDU NE | 260 | 40527 | 390 | 108623 | 176.98 | 40610 | 473.02 | 108540 | 6.24055 | 5.1487 | 49.7162 | 33.9757 | 0.012549 | 0.833356 | 0.16705 | 43.433 | 0.303785 | 6374.58 | 4339.12 | 3577.55 | 4339.12 |
die ART | 3443 | 37344 | 8026 | 100987 | 3122.74 | 37664.3 | 8346.26 | 100667 | 5.7311 | 5.45805 | 47.9751 | 31.7769 | 0.131774 | 0.197304 | 0.042402 | 145.988 | 0.067797 | 84414.2 | 76562.1 | 73624.2 | 76562.1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Implementation note: dump.keywords() looks at all unigrams at the corpus positions in match..matchend and compares the frequencies of their surface realizations with their marginal frequencies. Similarly, dump.collocates() uses the union of the corpus positions in context..contextend, excluding all corpus positions contained in any match..matchend.
Testing
The module ships with a small test corpus ("GERMAPARL1386"), which contains all speeches of the 86th session of the 13th German Bundestag on February 8, 1996.
The corpus consists of 149,800 tokens in 7,332 paragraphs (s-attribute "p" with annotation "type", either "regular" or "interjection") split into 11,364 sentences (s-attribute "s"). The p-attributes are "pos" and "lemma"; the full set of attributes is the one shown above (see corpus.attributes_available).
The corpus is located in this repository. All tests are written using this corpus as a reference. Make sure you install all development dependencies:
pip install pipenv
pipenv install --dev
You can then simply run
make test
and
make coverage
which use pytest to check that all methods work reliably.
Note that the make commands above update the path to the binary data files (line 10 of the registry file) in order to run the tests, since the CWB requires an absolute path here.
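If you need to adapt the registry manually, here is a sketch of rewriting the HOME line with an absolute path (file locations are assumptions about your checkout):
import re
from pathlib import Path

registry_file = Path("tests/corpora/registry/germaparl1386")   # hypothetical location of the registry entry
data_dir = Path("tests/corpora/data/germaparl1386").resolve()  # absolute path to the binary data

text = registry_file.read_text()
text = re.sub(r'(?m)^HOME .*$', f'HOME {data_dir}', text)
registry_file.write_text(text)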
Acknowledgements
The module includes a slight adaptation of cwb-python, a Python port of Perl's CWB::CL; thanks to Yannick Versley for the implementation. Special thanks to Markus Opolka for the original implementation of association-measures and for forcing me to write tests.
The test corpus was extracted from the GermaParl corpus (see the PolMine Project); many thanks to Andreas Blätte.
This work was supported by the Emerging Fields Initiative (EFI) of Friedrich-Alexander-Universität Erlangen-Nürnberg, project title Exploring the Fukushima Effect (2017-2020).
Further development of the package is funded by the Deutsche Forschungsgemeinschaft (DFG) within the project Reconstructing Arguments from Noisy Text, grant number 377333057 (2018-2023), as part of the Priority Program Robust Argumentation Machines (SPP-1999).