CWB wrapper to extract concordances and collocates
Collocation and Concordance Computation
Introduction
This module is a wrapper around the IMS Open Corpus Workbench (CWB). Its main purpose is to run queries, extract concordance lines, and calculate collocates.
Prerequisites
The module needs a working installation of the CWB and operates on CWB-indexed corpora.
If you want to run queries with more than two anchor points, the module requires CWB version 3.4.16 or later.
Installation
You can install this module with pip from PyPI:
pip3 install cwb-ccc
You can also clone the source from GitHub, cd into the respective folder, and use setup.py:
python3 setup.py install
Corpus Setup
All methods rely on the Corpus class, which establishes the
connection to your CWB-indexed corpus:
from ccc import Corpus
corpus = Corpus(
    corpus_name="EXAMPLE_CORPUS",
    registry_path='/usr/local/share/cwb/registry/'
)
print(corpus)
This will raise a KeyError if the named corpus is not in the
specified registry.
If you are using macros and wordlists, you have to store them in a
separate folder (with subfolders "wordlists/" and "macros/"). Make
sure you specify this folder via lib_path when initializing the
corpus.
You can use the cqp_bin parameter to point the module to a specific version of cqp (this is also helpful if cqp is not in your PATH).
By default, the data_path points to "/tmp/ccc-data/". Make sure that "/tmp/" exists and that you have the appropriate write permissions; otherwise, change the parameter when initializing the corpus.
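Putting these options together, a fully parametrized initialization might look as follows (a sketch; all paths are placeholder values to be adapted to your own setup):

```python
from ccc import Corpus

# all paths below are illustrative; adapt them to your installation
corpus = Corpus(
    corpus_name="EXAMPLE_CORPUS",
    registry_path='/usr/local/share/cwb/registry/',
    lib_path='/home/user/ccc-lib/',    # contains "wordlists/" and "macros/"
    cqp_bin='/usr/local/bin/cqp',      # use if cqp is not in your PATH
    data_path='/tmp/ccc-data/'         # must exist and be writable
)
```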
Usage
Queries and Dumps
Before you can display anything, you have to run a query with the
corpus.query() method, which accepts valid CQP queries such as
query = '[lemma="Angela"]? [lemma="Merkel"] [word="\("] [lemma="CDU"] [word="\)"]'
dump = corpus.query(
    cqp_query=query
)
print(dump)
The result is a Dump object. Its core is a pandas DataFrame
multi-indexed by CQP's "match" and "matchend" (similar to a
CQP dump). All entries of the DataFrame, including the index, are
integers representing corpus positions:
print(dump.df)
You can provide one or more parameters to define the context around
the matches: a parameter context specifying the context window
(defaults to 20) and an s-attribute defining the context
(context_break). You can specify asymmetric windows via
context_left and context_right.
dump = corpus.query(
    cqp_query=query,
    context=20,
    context_break='s'
)
In this case, the dump.df will contain two further columns,
specifying the context: "context" and "contextend".
Note that queries may end in a "within" clause, which limits the matches to regions defined by the corresponding structural attribute. If you provide a context_break parameter, the query will automatically be confined by this s-attribute.
You can set CQP's matching strategy ("standard", "longest",
"shortest") via the match_strategy parameter.
By default, the result is cached: the query parameters are used to create an identifier, which the resulting Dump object stores in its name_cache attribute. The resulting subcorpus is saved to disk by CQP, and the extended dump containing the context is put into a cache. This way, later queries with the same parameters on the same (sub)corpus can access the result directly, without CQP having to run again. You can disable caching by providing a name other than "mnemosyne".
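For example (a sketch; the name itself is arbitrary):

```python
# any name other than "mnemosyne" disables caching
dump = corpus.query(
    cqp_query=query,
    name='merkel_query'
)
```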
Now you are set up to analyze your query result. Let's start with the frequency breakdown:
print(dump.breakdown())
| word | freq |
|---|---|
| Angela Merkel ( CDU ) | 2253 |
| Merkel ( CDU ) | 29 |
| Angela Merkels ( CDU ) | 2 |
Concordancing
You can directly access concordance lines via the concordance method
of the dump. This method returns a dataframe with information about
the query matches in context:
lines = dump.concordance()
print(lines)
| match | matchend | context | contextend | raw |
|---|---|---|---|---|
| 676 | 680 | 656 | 700 | {'cpos': [656, 657, 658, 659, 660, 661, 662, 6... |
| 1190 | 1194 | 1170 | 1214 | {'cpos': [1170, 1171, 1172, 1173, 1174, 1175, ... |
| 543640 | 543644 | 543620 | 543664 | {'cpos': [543620, 543621, 543622, 543623, 5436... |
| ... | ... | ... | ... | ... |
Column raw contains a dictionary with the following keys:
- "match" (int): the cpos of the match
- "cpos" (list): the cpos of all tokens in the concordance line
- "offset" (list): the offset to match/matchend of all tokens
- "word" (list): the words of all tokens
- "anchors" (dict): a dictionary of {anchor: cpos} (see below)
You can create your own formatting from this, or use the form
parameter to define how your lines should be formatted ("raw",
"simple", "kwic", "dataframes" or "extended"). If form="dataframes"
or form="extended", the dataframe contains a column df with each
concordance line being formatted as a DataFrame with the cpos of
each token as index:
lines = dump.concordance(form="dataframes")
print(lines['df'].iloc[0])
| cpos | offset | word | match | matchend | context | contextend |
|---|---|---|---|---|---|---|
| 48344 | -5 | Eine | False | False | True | False |
| 48345 | -4 | entsprechende | False | False | False | False |
| 48346 | -3 | Steuererleichterung | False | False | False | False |
| 48347 | -2 | hat | False | False | False | False |
| 48348 | -1 | Kanzlerin | False | False | False | False |
| 48349 | 0 | Angela | True | False | False | False |
| 48350 | 0 | Merkel | False | False | False | False |
| 48351 | 0 | ( | False | False | False | False |
| 48352 | 0 | CDU | False | False | False | False |
| 48353 | 0 | ) | False | True | False | False |
| 48354 | 1 | bisher | False | False | False | False |
| 48355 | 2 | ausgeschlossen | False | False | False | False |
| 48356 | 3 | . | False | False | False | True |
Attribute selection is controlled via the p_show and s_show
parameters (lists of p-attributes and s-attributes, respectively):
lines = dump.concordance(
    form="dataframes",
    p_show=['word', 'lemma'],
    s_show=['text_id']
)
print(lines)
| match | matchend | context | contextend | df | text_id |
|---|---|---|---|---|---|
| 676 | 680 | 656 | 700 | ... | A113224 |
| 1190 | 1194 | 1170 | 1214 | ... | A124124 |
| 543640 | 543644 | 543620 | 543664 | ... | A423523 |
| ... | ... | ... | ... | ... | ... |
print(lines['df'].iloc[0])
| cpos | offset | word | lemma | match | matchend | context | contextend |
|---|---|---|---|---|---|---|---|
| 48344 | -5 | Eine | eine | False | False | True | False |
| 48345 | -4 | entsprechende | entsprechende | False | False | False | False |
| 48346 | -3 | Steuererleichterung | Steuererleichterung | False | False | False | False |
| 48347 | -2 | hat | haben | False | False | False | False |
| 48348 | -1 | Kanzlerin | Kanzlerin | False | False | False | False |
| 48349 | 0 | Angela | Angela | True | False | False | False |
| 48350 | 0 | Merkel | Merkel | False | False | False | False |
| 48351 | 0 | ( | ( | False | False | False | False |
| 48352 | 0 | CDU | CDU | False | False | False | False |
| 48353 | 0 | ) | ) | False | True | False | False |
| 48354 | 1 | bisher | bisher | False | False | False | False |
| 48355 | 2 | ausgeschlossen | ausschließen | False | False | False | False |
| 48356 | 3 | . | . | False | False | False | True |
You can decide which and how many concordance lines you want to
retrieve by means of the parameters order ("first", "last", or
"random") and cut_off. You can also provide a list of matches to
get only specific concordance lines.
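For example (a sketch; the parameter values are illustrative):

```python
# retrieve 10 randomly selected concordance lines
lines = dump.concordance(
    order='random',
    cut_off=10
)
```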
Anchored Queries
The concordancer detects anchored queries automatically. The following query
dump = corpus.query(
    '@0[lemma="Angela"]? @1[lemma="Merkel"] [word="\("] @2[lemma="CDU"] [word="\)"]',
)
dump.concordance(form='dataframes')
thus returns DataFrames with additional columns for each anchor point.
| cpos | offset | word | match | matchend | context | contextend | 0 | 1 | 2 |
|---|---|---|---|---|---|---|---|---|---|
| 48344 | -5 | Eine | False | False | True | False | False | False | False |
| 48345 | -4 | entsprechende | False | False | False | False | False | False | False |
| 48346 | -3 | Steuererleichterung | False | False | False | False | False | False | False |
| 48347 | -2 | hat | False | False | False | False | False | False | False |
| 48348 | -1 | Kanzlerin | False | False | False | False | False | False | False |
| 48349 | 0 | Angela | True | False | False | False | True | False | False |
| 48350 | 0 | Merkel | False | False | False | False | False | True | False |
| 48351 | 0 | ( | False | False | False | False | False | False | False |
| 48352 | 0 | CDU | False | False | False | False | False | False | True |
| 48353 | 0 | ) | False | True | False | False | False | False | False |
| 48354 | 1 | bisher | False | False | False | False | False | False | False |
| 48355 | 2 | ausgeschlossen | False | False | False | False | False | False | False |
| 48356 | 3 | . | False | False | False | True | False | False | False |
Collocation Analyses
After executing a query, you can use the dump.collocates() method to
extract collocates for a given window size (symmetric windows around
the corpus matches). The result will be a DataFrame with lexical
items as index and frequency signatures and association measures as
columns.
dump = corpus.query(
    '[lemma="Angela"] [lemma="Merkel"]',
    context=10, context_break='s'
)
collocates = dump.collocates()
print(collocates)
| item | O11 | O12 | O21 | O22 | E11 | E12 | E21 | E22 | log_likelihood | ... |
|---|---|---|---|---|---|---|---|---|---|---|
| die | 1189 | 13461 | 22082331 | 233975249 | 1263.407469 | 13386.592531 | 2.208226e+07 | 2.339753e+08 | -4.883922 | ... |
| Bundeskanzlerin | 1165 | 13485 | 5783 | 256051797 | 0.397498 | 14649.602502 | 6.947603e+03 | 2.560506e+08 | 16573.570027 | ... |
| , | 603 | 14047 | 14436277 | 241621303 | 825.939978 | 13824.060022 | 1.443605e+07 | 2.416215e+08 | -70.046255 | ... |
| Kanzlerin | 492 | 14158 | 13274 | 256044306 | 0.787559 | 14649.212441 | 1.376521e+04 | 2.560438e+08 | 5386.275148 | ... |
| haben | 379 | 14271 | 2433866 | 253623714 | 139.264180 | 14510.735820 | 2.434106e+06 | 2.536235e+08 | 283.416865 | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
By default, collocates are calculated on the "lemma"-layer, assuming
that this is a valid p-attribute in the corpus. The corresponding
parameter is p_query (which will fall back to "word" if the
specified attribute is not annotated in the corpus).
For improved performance, all hapax legomena in the context are
dropped after calculating the context size. You can change this
behaviour via the min_freq parameter.
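Both parameters are passed directly to the method (a sketch; the values are illustrative):

```python
# calculate collocates on the word layer, keeping only
# context items that occur at least 5 times
collocates = dump.collocates(
    p_query='word',
    min_freq=5
)
```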
By default, the dataframe is annotated with "z_score", "t_score",
"dice", "log_likelihood", and "mutual_information" (parameter ams).
For notation and further information regarding association measures, see collocations.de. Availability of association measures depends on their implementation in the pandas-association-measures package.
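To illustrate the frequency signatures, here is a minimal sketch of how the expected frequencies and the log-likelihood score can be derived from the four observed frequencies, following the notation used on collocations.de (in practice, the computation is delegated to the pandas-association-measures package; the numbers below are made up):

```python
from math import log

# observed contingency table of one item (illustrative numbers)
O11, O12, O21, O22 = 10, 90, 100, 9800

# marginal frequencies and sample size
R1, R2 = O11 + O12, O21 + O22
C1, C2 = O11 + O21, O12 + O22
N = R1 + R2

# expected frequencies under independence: E_ij = R_i * C_j / N
E11, E12, E21, E22 = R1 * C1 / N, R1 * C2 / N, R2 * C1 / N, R2 * C2 / N

# log-likelihood (Dunning 1993): G2 = 2 * sum(O * ln(O / E))
g2 = 2 * sum(o * log(o / e) for o, e in
             [(O11, E11), (O12, E12), (O21, E21), (O22, E22)])
```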
The dataframe is sorted by co-occurrence frequency (column "O11"), and only the 100 most frequently co-occurring collocates are retrieved. You can (and should) change this behaviour via the order and cut_off parameters.
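For example, to rank by association score instead (a sketch; the values are illustrative):

```python
# sort by log-likelihood and retrieve the 50 highest-scoring collocates
collocates = dump.collocates(
    order='log_likelihood',
    cut_off=50
)
```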
Keyword Analyses
For keyword analyses, you have to define a subcorpus. The natural way
of doing so is by selecting text identifiers via spreadsheets or
relational databases. If you have collected an appropriate set of
ids, you can use the corpus.dump_from_s_att() method:
dump = corpus.dump_from_s_att('text_id', ids)
keywords = dump.keywords()
Just as with collocates, the result is a DataFrame with lexical
items (p_query layer) as index and frequency signatures and
association measures as columns.
You can of course also define a subcorpus via a corpus query, e.g.
dump = corpus.query('"Atomkraft" expand to s')
keywords = dump.keywords()
Acknowledgements
The module relies on cwb-python, thanks to Yannick Versley and Jorg Asmussen for the implementation. Special thanks to Markus Opolka for the implementation of association-measures and for forcing me to write tests.
This work was supported by the Emerging Fields Initiative (EFI) of Friedrich-Alexander-Universität Erlangen-Nürnberg, project title Exploring the Fukushima Effect.
Further development of the package has been funded by the Deutsche Forschungsgemeinschaft (DFG) within the project Reconstructing Arguments from Noisy Text, grant number 377333057, as part of the Priority Program Robust Argumentation Machines (SPP-1999).