CWB wrapper to extract concordances and collocates
Collocation and Concordance Computation
Introduction
This module is a wrapper around the IMS Open Corpus Workbench (CWB). Its main purpose is to extract concordance lines, calculate keywords and collocates, and run queries with several anchor points.
If you want to extract the results of queries with more than two anchor points, the module requires CWB version 3.4.16 or later.
Installation
You can install this module with pip from PyPI:
pip3 install cwb-ccc
You can also clone the repository from GitHub, cd into the
respective folder, and use setup.py:
python3 setup.py install
Usage
Corpus Setup
All methods rely on the Corpus class, which establishes the
connection to your CWB-indexed corpus:
from ccc import Corpus
corpus = Corpus(
corpus_name="EXAMPLE_CORPUS",
registry_path="/path/to/your/cwb/registry/"
)
This will raise a KeyError if the named corpus is not in the
specified registry.
If you are using macros and wordlists, you have to store them in a
separate folder (with subfolders wordlists and macros). Make sure
you specify this folder via lib_path when initializing the
corpus.
You can use the cqp_bin to point the module to a specific version of
cqp (this is also helpful if cqp is not in your PATH).
By default, the cache_path points to "/tmp/ccc-data". Make sure
that "/tmp/" exists and that you have appropriate write permissions;
otherwise, change the parameter when initializing the corpus (or set
it to None to disable caching).
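Putting these options together, a fully specified setup might look like the following sketch (all paths are placeholders for your local installation):

```python
def make_corpus():
    """Connect to a CWB-indexed corpus with explicit settings.

    All paths are placeholders; adjust them to your local setup.
    """
    from ccc import Corpus
    return Corpus(
        corpus_name="EXAMPLE_CORPUS",
        registry_path="/path/to/your/cwb/registry/",
        lib_path="/path/to/your/lib/",  # contains subfolders "wordlists" and "macros"
        cqp_bin="/usr/local/bin/cqp",   # explicit path if cqp is not in your PATH
        cache_path="/tmp/ccc-data"      # or None to disable caching
    )
```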
Concordancing
Before you can display concordances, you have to run a query with the
corpus.query() method, which accepts valid CQP queries such as
angela = corpus.query(
'[lemma="Angela"]? [lemma="Merkel"] [word="\("] [lemma="CDU"] [word="\)"]'
)
The default context window is 20 tokens to the left and 20 tokens to
the right of the query match and matchend, respectively. You can
change this via the context parameter (use context_left and
context_right for asymmetric windows).
Note that queries may end in a "within" clause, which limits the
matches to regions defined by this structural attribute.
Alternatively, you can specify this attribute via s_query. Additionally,
you can specify an s_context parameter, which cuts the context at
the boundaries of this structural attribute. NB: The implementation
assumes that s_query regions are confined by s_context regions.
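For illustration, a query with an asymmetric context window cut at sentence boundaries might look like this (a sketch using the parameters described above; the s-attribute "s" is assumed to be annotated in the corpus):

```python
def run_query(corpus):
    # 5 tokens of left context, 15 tokens of right context,
    # with the context cut at sentence regions ("s")
    return corpus.query(
        '[lemma="Angela"]? [lemma="Merkel"]',
        context_left=5,
        context_right=15,
        s_context='s'
    )
```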
Now you are set up to get the query concordance:
concordance = corpus.concordance(angela)
You can access the query frequency breakdown via
concordance.breakdown:
| type | freq |
|---|---|
| Angela Merkel ( CDU ) | 2253 |
| Merkel ( CDU ) | 29 |
| Angela Merkels ( CDU ) | 2 |
You can use concordance.lines() to retrieve concordance lines. This
method returns a dataframe; the information it contains for each line
is determined by p_show and s_show (lists of positional and
structural attributes to be retrieved) as well as the form parameter,
which is one of "raw", "simple", "kwic", "dataframes", or "extended".
You can decide which and how many concordance lines you want to
retrieve by means of the parameters order ("first", "last", or
"random") and cut_off. You can also provide a list of matches to
get only specific concordance lines.
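As a sketch, the following call retrieves 50 random concordance lines, each formatted as a DataFrame with word and lemma annotation (parameter names as described above):

```python
def get_lines(concordance):
    # 50 randomly selected lines, one DataFrame per line,
    # showing the p-attributes "word" and "lemma"
    return concordance.lines(
        form="dataframes",
        p_show=["word", "lemma"],
        order="random",
        cut_off=50
    )
```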
If form="dataframes" or form="extended", the dataframe contains a
column df with each concordance line being formatted as a
DataFrame with the cpos of each token as index:
| cpos | offset | word | anchor |
|---|---|---|---|
| 48344 | -5 | Eine | None |
| 48345 | -4 | entsprechende | None |
| 48346 | -3 | Steuererleichterung | None |
| 48347 | -2 | hat | None |
| 48348 | -1 | Kanzlerin | None |
| 48349 | 0 | Angela | None |
| 48350 | 0 | Merkel | None |
| 48351 | 0 | ( | None |
| 48352 | 0 | CDU | None |
| 48353 | 0 | ) | None |
| 48354 | 1 | bisher | None |
| 48355 | 2 | ausgeschlossen | None |
| 48356 | 3 | . | None |
Anchored Queries
The concordancer detects anchored queries automatically. The following query
angela = corpus.query(
'@0[lemma="Angela"]? @1[lemma="Merkel"] [word="\("] @2[lemma="CDU"] [word="\)"]',
)
concordance = corpus.concordance(angela)
thus returns DataFrames with additional columns for each anchor point.
| cpos | offset | word | 0 | 1 | 2 |
|---|---|---|---|---|---|
| 48344 | -5 | Eine | F | F | F |
| 48345 | -4 | entsprechende | F | F | F |
| 48346 | -3 | Steuererleichterung | F | F | F |
| 48347 | -2 | hat | F | F | F |
| 48348 | -1 | Kanzlerin | F | F | F |
| 48349 | 0 | Angela | T | F | F |
| 48350 | 0 | Merkel | F | T | F |
| 48351 | 0 | ( | F | F | F |
| 48352 | 0 | CDU | F | F | T |
| 48353 | 0 | ) | F | F | F |
| 48354 | 1 | bisher | F | F | F |
| 48355 | 2 | ausgeschlossen | F | F | F |
| 48356 | 3 | . | F | F | F |
Collocation Analyses
After executing a query, you can use the corpus.collocates() method
to extract collocates for a given window size (symmetric windows
around the corpus matches):
angela = corpus.query(
'[lemma="Angela"] [lemma="Merkel"]',
s_context='s', context=20
)
collocates = corpus.collocates(angela)
collocates() will create a dataframe of the contexts of the query
matches. You can specify a smaller maximum window size via the mws
parameter (which might be reasonable for queries with many hits);
collocates can then only be scored for windows up to this value. Note
that mws must not be larger than the context parameter of your
initial query.
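For instance, to restrict scoring to windows of at most 10 tokens (a sketch; the query above was run with context=20, so any mws up to 20 is valid):

```python
def get_collocates(corpus, angela):
    # maximum window size of 10 tokens; collocates can
    # subsequently be scored for any window up to 10
    return corpus.collocates(angela, mws=10)
```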
By default, collocates are calculated on the "lemma"-layer, assuming
that this is a valid p-attribute in the corpus. The corresponding
parameter is p_query (which will fall back to "word" if the
specified attribute is not annotated in the corpus).
Using the marginal frequencies of the items in the whole corpus as a reference, you can directly annotate the co-occurrence counts in a given window:
collocates.show(window=5)
The result will be a DataFrame with lexical items (p_query layer)
as index and frequency signatures and association measures as columns:
| item | O11 | f2 | N | f1 | O12 | O21 | O22 | E11 | E12 | E21 | E22 | log_likelihood | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| die | 1799 | 25125329 | 300917702 | 22832 | 21033 | 25123530 | 275771340 | 1906.373430 | 20925.626570 | 2.512342e+07 | 2.757714e+08 | -2.459194 | ... |
| Bundeskanzlerin | 1491 | 8816 | 300917702 | 22832 | 21341 | 7325 | 300887545 | 0.668910 | 22831.331090 | 8.815331e+03 | 3.008861e+08 | 1822.211827 | ... |
| . | 1123 | 13677811 | 300917702 | 22832 | 21709 | 13676688 | 287218182 | 1037.797972 | 21794.202028 | 1.367677e+07 | 2.872181e+08 | 2.644804 | ... |
| , | 814 | 17562059 | 300917702 | 22832 | 22018 | 17561245 | 283333625 | 1332.513602 | 21499.486398 | 1.756073e+07 | 2.833341e+08 | -14.204447 | ... |
| Kanzlerin | 648 | 17622 | 300917702 | 22832 | 22184 | 16974 | 300877896 | 1.337062 | 22830.662938 | 1.762066e+04 | 3.008772e+08 | 559.245198 | ... |
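The columns follow standard contingency-table notation: O11 is the observed co-occurrence frequency of the item within the window, f1 the total size of the context, f2 the marginal frequency of the item in the corpus, and N the corpus size. The remaining observed cells and the expected frequency E11 can be derived directly, as this check of the "Bundeskanzlerin" row shows:

```python
# frequency signature of "Bundeskanzlerin" (values from the table above)
O11 = 1491       # co-occurrences within the window
f1 = 22832       # context size
f2 = 8816        # marginal frequency in the corpus
N = 300917702    # corpus size

# remaining observed cells of the 2x2 contingency table
O12 = f1 - O11           # 21341
O21 = f2 - O11           # 7325
O22 = N - f1 - f2 + O11  # 300887545

# expected co-occurrence frequency under independence
E11 = f1 * f2 / N        # ~0.6689
```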
For improved performance, all hapax legomena in the context are
dropped after calculating the context size. You can change this
behaviour via the min_freq parameter of collocates.show().
By default, the dataframe is annotated with "z_score", "t_score",
"dice", "log_likelihood", and "mutual_information" (parameter ams).
For notation and further information regarding association measures,
see
collocations.de. Available
association measures depend on their implementation in the
association-measures
module.
The dataframe is sorted by co-occurrence frequency (column "f"), and
only the 100 most frequently co-occurring collocates are retrieved.
You can change this behaviour via the order and cut_off parameters.
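Combining these parameters, a call that keeps items occurring at least five times in the context, computes only log-likelihood, and returns the 50 top-ranked collocates could look like this (a sketch; it assumes that order accepts a column name to sort by, as suggested above):

```python
def show_top_collocates(collocates):
    return collocates.show(
        window=5,
        min_freq=5,               # drop items occurring fewer than 5 times
        ams=["log_likelihood"],   # association measures to compute
        order="log_likelihood",   # column to sort by (assumption)
        cut_off=50                # number of collocates to retrieve
    )
```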
Keyword Analyses
For keyword analyses, you will have to define a subcorpus. The natural
way of doing so is by selecting text identifiers (on the s_meta
annotations) via spreadsheets or relational databases. If you have
collected a set of identifiers, you can create a subcorpus via the
corpus.subcorpus_from_s_att() method:
corpus.subcorpus_from_s_att('text_id', ids, name="Panorama")
keywords = corpus.keywords("Panorama")
keywords.show()
Just as with collocates, the result will be a DataFrame with lexical
items (p_query layer) as index and frequency signatures and
association measures as columns.
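The identifiers themselves can come from anywhere; a common pattern is reading them from a spreadsheet export. The following sketch assumes a CSV file with a column "text_id" (file name and column are placeholders):

```python
def panorama_keywords(corpus):
    import csv
    # collect text identifiers from a spreadsheet export
    with open("ids.csv", newline="") as f:
        ids = [row["text_id"] for row in csv.DictReader(f)]
    corpus.subcorpus_from_s_att('text_id', ids, name="Panorama")
    return corpus.keywords("Panorama")
```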
Acknowledgements
The module relies on cwb-python, thanks to Yannick Versley and Jörg Asmussen for the implementation. Special thanks to Markus Opolka for the implementation of association-measures and for forcing me to write tests.
This work was supported by the Emerging Fields Initiative (EFI) of Friedrich-Alexander-Universität Erlangen-Nürnberg, project title Exploring the Fukushima Effect.
Further development of the package has been funded by the Deutsche Forschungsgemeinschaft (DFG) within the project Reconstructing Arguments from Noisy Text, grant number 377333057, as part of the Priority Program Robust Argumentation Machines (SPP-1999).