Collocation and Concordance Computation
A CWB wrapper to extract concordances and score frequency lists
cwb-ccc is a Python 3 wrapper around the IMS Open Corpus Workbench (CWB). The main purpose of the module is to run queries (including queries with more than two anchor points), extract concordance lines, and score frequency lists (in particular to extract collocates and keywords).
The Quickstart below gives a rough overview. For a more detailed dive into the functionality, see the Vignette.
Installation
System requirements: The module is developed for Ubuntu (currently 24.04 LTS) but also runs on other Debian-based systems and macOS. On a fresh install of Ubuntu, you will need to install the following packages:
sudo apt install libncurses5-dev libglib2.0-dev libpcre3 libpcre3-dev
CWB: The module needs a working installation of CWB and operates on CWB-indexed corpora. If you want to run queries with more than two anchor points, you will need CWB version 3.4.16 or later. We recommend installing the 3.5.x package.
On Ubuntu, you will also need to install the corresponding cwb-dev package:
wget https://sourceforge.net/projects/cwb/files/cwb/cwb-3.5/deb/cwb_3.5.0-1_amd64.deb
wget https://sourceforge.net/projects/cwb/files/cwb/cwb-3.5/deb/cwb-dev_3.5.0-1_amd64.deb
sudo apt install ./cwb_3.5.0-1_amd64.deb
sudo apt install ./cwb-dev_3.5.0-1_amd64.deb
On macOS, you can simply run
brew install cwb3
Python dependencies: Python dependencies are specified in requirements.txt and will be installed automatically if you follow the instructions below. Note that since version v0.13.0, cwb-ccc uses pandas 2 and numpy 2, which require Python 3.9 or above.
In all cases, we recommend installing dependencies in a virtual environment to avoid conflicts with other installs on your machine.
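For example, using the standard library's venv module:
python3 -m venv venv
source venv/bin/activate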
Installation using pip: You can install cwb-ccc with pip from PyPI:
python3 -m pip install cwb-ccc
Installation from source: You can also clone the source from GitHub, cd into the cloned folder, and install all dependencies:
python3 -m pip install -U pip setuptools wheel twine
python3 -m pip install -r requirements-dev.txt
compile the C extension:
python3 -m cython -2 ccc/cl.pyx
and build it:
python3 setup.py build_ext --inplace
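To check that the build succeeded, you can try importing the module; note that the __version__ attribute used here is an assumption about the package layout:
python3 -c "import ccc; print(ccc.__version__)"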
Quickstart
Accessing Corpora
To list all available corpora, you can use
from ccc import Corpora
corpora = Corpora(registry_dir="/usr/local/share/cwb/registry/")
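The resulting object can then be inspected; the show() method used here is an assumption and may differ between versions (see the vignette):
corpora.show()  # assumed to return a DataFrame listing the corpora in the registry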
Most functionality is tied to the Corpus class, which establishes the connection to your CWB-indexed corpus:
from ccc import Corpus
corpus = Corpus(corpus_name="GERMAPARL1386", registry_dir="tests/corpora/registry/")
This will raise a KeyError if the named corpus is not in the specified registry.
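Since a missing corpus raises a KeyError, a minimal defensive sketch looks like this:
from ccc import Corpus
try:
    corpus = Corpus(corpus_name="GERMAPARL1386", registry_dir="tests/corpora/registry/")
except KeyError:
    print("corpus not found in registry")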
Queries and SubCorpora
The usual starting point is to run a query with corpus.query(). This method accepts valid CQP queries such as
subcorpus = corpus.query('[lemma="Arbeit"]', context_break='s')
The result is a SubCorpus; at its core, this is a pandas DataFrame with corpus positions (similar to CWB dumps of NQRs).
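You can inspect these corpus positions directly; the .df accessor used here is an assumption based on the DataFrame-centred design and may differ between versions:
subcorpus.df.head()  # assumed accessor: DataFrame with match/matchend corpus positions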
You can also query structural attributes, e.g.
corpus.query(s_query='text_party', s_values={'CDU', 'CSU'})
Concordancing
You can access concordance lines via the concordance() method of subcorpora. This method returns a DataFrame with information about the query matches in context:
subcorpus.concordance()
match | matchend | word |
---|---|---|
151 | 151 | Er brachte diese Erfahrung in seine Arbeit im Ausschuß für Familie , Senioren , Frauen und Jugend sowie im Petitionsausschuß ein , wo er sich vor allem |
227 | 227 | Seine Arbeit und sein Rat werden uns fehlen . |
1493 | 1493 | Ausschuß für Arbeit und Sozialordnung |
1555 | 1555 | Ausschuß für Arbeit und Sozialordnung |
1598 | 1598 | Ausschuß für Arbeit und Sozialordnung |
... | ... | ... |
By default, it retrieves concordance lines in 'simple' format, in the order in which they appear in the corpus. In most situations, it is more useful to get concordance lines in random order and in KWIC format:
subcorpus.concordance(form='kwic', order='random')
match | matchend | left_word | node_word | right_word |
---|---|---|---|---|
81769 | 81769 | Ich unterstütze daher nachträglich die Forderung , daß die Durchführung des Gesetzes auch künftig durch die Bundesanstalt für | Arbeit | vorgenommen wird ; denn beim Bund gibt es die entsprechend ausgebildeten Sachbearbeiter . |
8774 | 8774 | Glauben Sie im Ernst , Sie könnten am Ende ein Bündnis für | Arbeit | , eine Wende in der deutschen Politik , die Bekämpfung der Arbeitslosigkeit erreichen , wenn Sie nicht die Länder , |
8994 | 8994 | alle Entscheidungen gemeinsam zu treffen , die sich gegen Schwarzarbeit und illegale | Arbeit | wenden , und gemeinsam nach einem Weg zu suchen , |
80098 | 80098 | : Was der Vermittlungsausschuß mit Mehrheit zum Meister-BAföG beschlossen hat , heißt , daß die bewährten Institutionen der Bundesanstalt für | Arbeit | , die die Ausbildungsförderung für Meister bis zum Jahr 1993 durchgeführt haben , die darin große Erfahrung haben , die |
61056 | 61056 | Selbst wenn Sie ein Konstrukt anbieten , das tendenziell die zusätzliche Belastung der Bundesanstalt für | Arbeit | etwas geringer hielte als die Entlastung bei der gesetzlichen Rentenversicherung , so wäre dies bei einem deutlichen Aufwuchs der Arbeitslosigkeit |
... | ... | ... | ... | ... |
Use cut_off to specify the maximum number of lines.
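For example, to retrieve at most 100 random KWIC lines:
subcorpus.concordance(form='kwic', order='random', cut_off=100)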
Collocation Analyses
After executing a query, you can use subcorpus.collocates() to extract collocates (see the vignette for parameter settings). The result is a DataFrame with lemmata as index and frequency signatures and association measures as columns:
subcorpus.collocates()
item | O11 | O12 | O21 | O22 | R1 | R2 | C1 | C2 | N | E11 | E12 | E21 | E22 | z_score | t_score | log_likelihood | simple_ll | min_sensitivity | liddell | dice | log_ratio | conservative_log_ratio | mutual_information | local_mutual_information | ipm | ipm_reference | ipm_expected | in_nodes | marginal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
für | 46 | 730 | 831 | 148102 | 776 | 148933 | 877 | 148832 | 149709 | 4.54583 | 771.454 | 872.454 | 148061 | 19.4429 | 6.11208 | 134.301 | 130.019 | 0.052452 | 0.047547 | 0.055656 | 3.40925 | 2.26335 | 1.00514 | 46.2366 | 59278.4 | 5579.69 | 5858.03 | 0 | 877 |
, | 43 | 733 | 7827 | 141106 | 776 | 148933 | 7870 | 141839 | 149709 | 40.7933 | 735.207 | 7829.21 | 141104 | 0.345505 | 0.336523 | 0.124564 | 0.117278 | 0.005464 | 0.000296 | 0.009947 | 0.076412 | 0 | 0.02288 | 0.983836 | 55412.4 | 52553.8 | 52568.6 | 0 | 7870 |
. | 33 | 743 | 5626 | 143307 | 776 | 148933 | 5659 | 144050 | 149709 | 29.3328 | 746.667 | 5629.67 | 143303 | 0.677108 | 0.638378 | 0.461005 | 0.440481 | 0.005831 | 0.000673 | 0.010256 | 0.170891 | 0 | 0.05116 | 1.68829 | 42525.8 | 37775.4 | 37800 | 0 | 5659 |
und | 32 | 744 | 2848 | 146085 | 776 | 148933 | 2880 | 146829 | 149709 | 14.9282 | 761.072 | 2865.07 | 146068 | 4.41852 | 3.0179 | 15.1452 | 14.6555 | 0.011111 | 0.006044 | 0.017505 | 1.10866 | 0 | 0.331144 | 10.5966 | 41237.1 | 19122.7 | 19237.3 | 0 | 2880 |
in | 24 | 752 | 2474 | 146459 | 776 | 148933 | 2498 | 147211 | 149709 | 12.9481 | 763.052 | 2485.05 | 146448 | 3.07138 | 2.25596 | 7.72813 | 7.51722 | 0.009608 | 0.004499 | 0.014661 | 0.896724 | 0 | 0.268005 | 6.43212 | 30927.8 | 16611.5 | 16685.7 | 0 | 2498 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Setting p_query allows calculating scores for arbitrary combinations of positional attributes, e.g. p_query=['lemma', 'pos']. The DataFrame contains the observed counts in contingency notation and is annotated with all available association measures from the pandas-association-measures package (parameter ams).
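For instance, to score lemma-POS combinations and restrict the output to selected measures (the exact values accepted by ams are documented in the vignette; the two used here appear as columns in the table above):
subcorpus.collocates(p_query=['lemma', 'pos'], ams=['log_likelihood', 'conservative_log_ratio'])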
Keyword Analyses
Having created a subcorpus
subcorpus = corpus.query(s_query='text_party', s_values={'CDU', 'CSU'})
you can use its keywords() method for retrieving keywords:
subcorpus.keywords(order='conservative_log_ratio')
item | O11 | O12 | O21 | O22 | R1 | R2 | C1 | C2 | N | E11 | E12 | E21 | E22 | z_score | t_score | log_likelihood | simple_ll | min_sensitivity | liddell | dice | log_ratio | conservative_log_ratio | mutual_information | local_mutual_information | ipm | ipm_reference | ipm_expected |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
deswegen | 55 | 41296 | 37 | 108412 | 41351 | 108449 | 92 | 149708 | 149800 | 25.3958 | 41325.6 | 66.6042 | 108382 | 5.87452 | 3.99183 | 41.5308 | 25.794 | 0.00133 | 0.321982 | 0.002654 | 1.96293 | 0.404166 | 0.335601 | 18.458 | 1330.08 | 341.174 | 614.152 |
CSU | 255 | 41096 | 380 | 108069 | 41351 | 108449 | 635 | 149165 | 149800 | 175.286 | 41175.7 | 459.714 | 107989 | 6.02087 | 4.99187 | 46.6543 | 31.7425 | 0.006167 | 0.126068 | 0.012147 | 0.81552 | 0.212301 | 0.162792 | 41.512 | 6166.72 | 3503.95 | 4238.99 |
CDU | 260 | 41091 | 390 | 108059 | 41351 | 108449 | 650 | 149150 | 149800 | 179.427 | 41171.6 | 470.573 | 107978 | 6.01515 | 4.99693 | 46.6055 | 31.7289 | 0.006288 | 0.124499 | 0.012381 | 0.80606 | 0.209511 | 0.161086 | 41.8823 | 6287.64 | 3596.16 | 4339.12 |
in | 867 | 40484 | 1631 | 106818 | 41351 | 108449 | 2498 | 147302 | 149800 | 689.551 | 40661.4 | 1808.45 | 106641 | 6.75755 | 6.02647 | 61.2663 | 42.1849 | 0.020967 | 0.072241 | 0.039545 | 0.47937 | 0.168901 | 0.099452 | 86.2253 | 20966.8 | 15039.3 | 16675.6 |
Wirtschaft | 39 | 41312 | 25 | 108424 | 41351 | 108449 | 64 | 149736 | 149800 | 17.6666 | 41333.3 | 46.3334 | 108403 | 5.07554 | 3.41607 | 30.9328 | 19.1002 | 0.000943 | 0.333476 | 0.001883 | 2.03257 | 0.150982 | 0.34391 | 13.4125 | 943.145 | 230.523 | 427.236 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Just as with collocates, the result is a DataFrame with lemmata as index and frequency signatures and association measures as columns.
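Assuming keywords() accepts the same scoring parameters as collocates() (an assumption not spelled out above; see the vignette), a combined call might look like:
subcorpus.keywords(p_query='lemma', order='log_likelihood', cut_off=50)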
Testing
The module ships with a small test corpus ("GERMAPARL1386"), which contains all speeches of the 86th session of the 13th German Bundestag on February 8, 1996.
corpus = Corpus("GERMAPARL1386", registry_dir="tests/corpora/registry/")
This corpus consists of 149,800 tokens in 7,332 paragraphs (s-attribute "p" with annotation "type" ("regular" or "interjection")), split into 11,364 sentences (s-attribute "s"). The p-attributes are "pos" and "lemma":
corpus.available_attributes()
type | attribute | annotation | active |
---|---|---|---|
p-Att | word | False | True |
p-Att | pos | False | False |
p-Att | lemma | False | False |
s-Att | corpus | False | False |
s-Att | corpus_name | True | False |
s-Att | sitzung | False | False |
s-Att | sitzung_date | True | False |
s-Att | sitzung_period | True | False |
s-Att | sitzung_session | True | False |
s-Att | div | False | False |
s-Att | div_desc | True | False |
s-Att | div_n | True | False |
s-Att | div_type | True | False |
s-Att | div_what | True | False |
s-Att | text | False | False |
s-Att | text_id | True | False |
s-Att | text_name | True | False |
s-Att | text_parliamentary_group | True | False |
s-Att | text_party | True | False |
s-Att | text_position | True | False |
s-Att | text_role | True | False |
s-Att | text_who | True | False |
s-Att | p | False | False |
s-Att | p_type | True | False |
s-Att | s | False | False |
The corpus is located in this repository. All tests are written using this corpus, as well as some reference counts and scores obtained from the UCS toolkit and some additional frequency lists. Make sure you install all development dependencies (especially pytest). You can then run the tests:
pytest -m "not benchmark"                       # run the test suite without benchmarks
pytest -m benchmark                             # run only the benchmarks
pytest --cov-report term-missing -v --cov=ccc/  # run with a coverage report
Acknowledgements
- The module includes a slight adaptation of cwb-python, a Python port of Perl's CWB::CL; thanks to Yannick Versley for the implementation.
- Special thanks to Markus Opolka for the original implementation of association-measures and for forcing me to write tests.
- The test corpus was extracted from the GermaParl corpus (see the PolMine Project); many thanks to Andreas Blätte.
- This work was supported by the Emerging Fields Initiative (EFI) of Friedrich-Alexander-Universität Erlangen-Nürnberg, project title Exploring the Fukushima Effect (2017-2020).
- Further development of the package was funded by the Deutsche Forschungsgemeinschaft (DFG) within the projects Reconstructing Arguments from Noisy Text (2018-2021) and Newsworthy Debates (2021-2024), grant number 377333057, as part of the Priority Program Robust Argumentation Machines (SPP-1999).