CWB wrapper to extract concordances and score frequency lists
Collocation and Concordance Computation
cwb-ccc is a Python 3 wrapper around the IMS Open Corpus Workbench (CWB). The main purpose of the module is to run queries, extract concordance lines, and score frequency lists (particularly to extract collocates and keywords).
The Quickstart should get you started. For a more detailed overview of the functionality, see the Vignette.
Installation
The module needs a working installation of the CWB and operates on CWB-indexed corpora. If you want to run queries with more than two anchor points, you will need CWB version 3.4.16 or later.
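Before installing, it can help to confirm that the CWB command-line tools are reachable. The following is a minimal sketch, assuming the CWB binaries cqp and cwb-config are expected on your PATH; the helper find_cwb_tools is hypothetical and not part of cwb-ccc. It only locates the binaries and does not check the CWB version.

```python
import shutil


def find_cwb_tools(tools=("cqp", "cwb-config")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    # Report which of the expected CWB binaries can be found
    for tool, path in find_cwb_tools().items():
        print(f"{tool}: {path or 'not found'}")
```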
You can install this module with pip from PyPI:
python -m pip install cwb-ccc
You can also clone the source from GitHub, cd into the respective folder, and build your own wheel:
python -m pip install pipenv
pipenv install --dev
pipenv run python3 setup.py bdist_wheel
Quickstart
Accessing Corpora
To list all available corpora, you can use
from ccc import Corpora
corpora = Corpora(registry_path="/usr/local/share/cwb/registry/")
All further methods rely on the Corpus class, which establishes the connection to your CWB-indexed corpus:
from ccc import Corpus
corpus = Corpus(
corpus_name="GERMAPARL1386",
registry_path="/usr/local/share/cwb/registry/"
)
This will raise a KeyError if the named corpus is not in the specified registry.
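To see why a lookup might fail, you can inspect the registry directory yourself. The sketch below assumes the usual CWB convention of one registry file per corpus, named after the lower-cased corpus ID; the helpers registry_corpus_names and check_corpus are hypothetical and only mimic the KeyError behaviour described above, they are not the module's implementation.

```python
from pathlib import Path


def registry_corpus_names(registry_path):
    """Return the upper-cased names of all registry files in registry_path."""
    return sorted(
        entry.name.upper()
        for entry in Path(registry_path).iterdir()
        # skip sub-directories, hidden files, and editor backup files
        if entry.is_file()
        and not entry.name.startswith(".")
        and not entry.name.endswith("~")
    )


def check_corpus(corpus_name, registry_path):
    """Raise a KeyError if no registry file matches corpus_name."""
    available = registry_corpus_names(registry_path)
    if corpus_name.upper() not in available:
        raise KeyError(f"corpus {corpus_name!r} not found; available: {available}")
    return corpus_name.upper()
```

Calling check_corpus("GERMAPARL1386", "/usr/local/share/cwb/registry/") before constructing the Corpus object gives an error message that lists the corpora actually found in the directory.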
Queries and Dumps
The usual starting point for using this module is to run a query with corpus.query(), which accepts valid CQP queries such as
dump = corpus.query('[lemma="Arbeit"]', context_break='s')
The result is a Dump object; at its core is a pandas DataFrame with corpus positions (similar to CWB dumps).
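To make the idea of a DataFrame of corpus positions concrete, here is a small stand-alone sketch. All values are made up, and the column names context and contextend are assumptions for illustration; it does not reproduce the exact layout of a Dump.

```python
import pandas as pd

# Made-up corpus positions: each row is one query match, identified by the
# position of its first (match) and last (matchend) token.
df_dump = pd.DataFrame(
    {
        "match": [10, 25, 40],
        "matchend": [10, 25, 41],
        "context": [5, 20, 36],       # assumed column: start of the context region
        "contextend": [14, 31, 45],   # assumed column: end of the context region
    }
).set_index(["match", "matchend"])

# Corpus positions are inclusive, so a span covers (end - start + 1) tokens
positions = df_dump.reset_index()
positions["match_length"] = positions["matchend"] - positions["match"] + 1
positions["context_length"] = positions["contextend"] - positions["context"] + 1
```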
Concordancing
You can access concordance lines via the concordance() method of the dump. This method returns a DataFrame with information about the query matches in context:
dump.concordance()
match | matchend | word |
---|---|---|
151 | 151 | Er brachte diese Erfahrung in seine Arbeit im Ausschuß für Familie , Senioren , Frauen und Jugend sowie im Petitionsausschuß ein , wo er sich vor allem |
227 | 227 | Seine Arbeit und sein Rat werden uns fehlen . |
1493 | 1493 | Ausschuß für Arbeit und Sozialordnung |
1555 | 1555 | Ausschuß für Arbeit und Sozialordnung |
1598 | 1598 | Ausschuß für Arbeit und Sozialordnung |
... | ... | ... |
This retrieves concordance lines in simple format in the order in which they appear in the corpus. A keyword-in-context (KWIC) view in random order is often more useful:
dump.concordance(form='kwic', order='random')
match | matchend | left_word | node_word | right_word |
---|---|---|---|---|
81769 | 81769 | Ich unterstütze daher nachträglich die Forderung , daß die Durchführung des Gesetzes auch künftig durch die Bundesanstalt für | Arbeit | vorgenommen wird ; denn beim Bund gibt es die entsprechend ausgebildeten Sachbearbeiter . |
8774 | 8774 | Glauben Sie im Ernst , Sie könnten am Ende ein Bündnis für | Arbeit | , eine Wende in der deutschen Politik , die Bekämpfung der Arbeitslosigkeit erreichen , wenn Sie nicht die Länder , |
8994 | 8994 | alle Entscheidungen gemeinsam zu treffen , die sich gegen Schwarzarbeit und illegale | Arbeit | wenden , und gemeinsam nach einem Weg zu suchen , |
80098 | 80098 | : Was der Vermittlungsausschuß mit Mehrheit zum Meister-BAföG beschlossen hat , heißt , daß die bewährten Institutionen der Bundesanstalt für | Arbeit | , die die Ausbildungsförderung für Meister bis zum Jahr 1993 durchgeführt haben , die darin große Erfahrung haben , die |
61056 | 61056 | Selbst wenn Sie ein Konstrukt anbieten , das tendenziell die zusätzliche Belastung der Bundesanstalt für | Arbeit | etwas geringer hielte als die Entlastung bei der gesetzlichen Rentenversicherung , so wäre dies bei einem deutlichen Aufwuchs der Arbeitslosigkeit |
... | ... | ... | ... | ... |
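The KWIC layout simply splits each context into the tokens to the left of the match, the match itself (the node), and the tokens to its right. The following sketch illustrates this on a plain token list; the function kwic and its window parameter are hypothetical, and positions are indices into the list rather than corpus positions.

```python
def kwic(tokens, match, matchend, window=5):
    """Split a token list into left context, node, and right context."""
    left = tokens[max(0, match - window):match]
    node = tokens[match:matchend + 1]  # matchend is inclusive
    right = tokens[matchend + 1:matchend + 1 + window]
    return {
        "left_word": " ".join(left),
        "node_word": " ".join(node),
        "right_word": " ".join(right),
    }


# Sentence taken from the concordance output above
tokens = "Seine Arbeit und sein Rat werden uns fehlen .".split()
line = kwic(tokens, match=1, matchend=1, window=3)
```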
Collocation Analyses
After executing a query, you can use dump.collocates() to extract collocates for a given window size (symmetric windows around the corpus matches). The result will be a DataFrame with lemmata as index and frequency signatures and association measures as columns:
dump.collocates()
item | O11 | O12 | O21 | O22 | R1 | R2 | C1 | C2 | N | E11 | E12 | E21 | E22 | z_score | t_score | log_likelihood | simple_ll | min_sensitivity | liddell | dice | log_ratio | conservative_log_ratio | mutual_information | local_mutual_information | ipm | ipm_reference | ipm_expected | in_nodes | marginal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
für | 46 | 730 | 831 | 148102 | 776 | 148933 | 877 | 148832 | 149709 | 4.54583 | 771.454 | 872.454 | 148061 | 19.4429 | 6.11208 | 134.301 | 130.019 | 0.052452 | 0.047547 | 0.055656 | 3.40925 | 2.26335 | 1.00514 | 46.2366 | 59278.4 | 5579.69 | 5858.03 | 0 | 877 |
, | 43 | 733 | 7827 | 141106 | 776 | 148933 | 7870 | 141839 | 149709 | 40.7933 | 735.207 | 7829.21 | 141104 | 0.345505 | 0.336523 | 0.124564 | 0.117278 | 0.005464 | 0.000296 | 0.009947 | 0.076412 | 0 | 0.02288 | 0.983836 | 55412.4 | 52553.8 | 52568.6 | 0 | 7870 |
. | 33 | 743 | 5626 | 143307 | 776 | 148933 | 5659 | 144050 | 149709 | 29.3328 | 746.667 | 5629.67 | 143303 | 0.677108 | 0.638378 | 0.461005 | 0.440481 | 0.005831 | 0.000673 | 0.010256 | 0.170891 | 0 | 0.05116 | 1.68829 | 42525.8 | 37775.4 | 37800 | 0 | 5659 |
und | 32 | 744 | 2848 | 146085 | 776 | 148933 | 2880 | 146829 | 149709 | 14.9282 | 761.072 | 2865.07 | 146068 | 4.41852 | 3.0179 | 15.1452 | 14.6555 | 0.011111 | 0.006044 | 0.017505 | 1.10866 | 0 | 0.331144 | 10.5966 | 41237.1 | 19122.7 | 19237.3 | 0 | 2880 |
in | 24 | 752 | 2474 | 146459 | 776 | 148933 | 2498 | 147211 | 149709 | 12.9481 | 763.052 | 2485.05 | 146448 | 3.07138 | 2.25596 | 7.72813 | 7.51722 | 0.009608 | 0.004499 | 0.014661 | 0.896724 | 0 | 0.268005 | 6.43212 | 30927.8 | 16611.5 | 16685.7 | 0 | 2498 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
The DataFrame contains the counts and is annotated with all available association measures in the pandas-association-measures package (parameter ams).
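To show how the columns relate to each other, the sketch below recomputes a few of them from the four observed frequencies of the first row (für). It uses the standard textbook definitions, which reproduce the rounded values in the table; it is not taken from the package's source, and the function names are hypothetical.

```python
import math


def frequency_signature(o11, o12, o21, o22):
    """Derive marginals and the expected co-occurrence frequency E11."""
    r1 = o11 + o12            # row total: size of the collocation window
    c1 = o11 + o21            # column total: overall frequency of the item
    n = o11 + o12 + o21 + o22
    return {"R1": r1, "C1": c1, "N": n, "E11": r1 * c1 / n}


def simple_measures(o11, o12, o21, o22):
    """A few association measures based on observed vs. expected frequency."""
    sig = frequency_signature(o11, o12, o21, o22)
    e11 = sig["E11"]
    return {
        "E11": e11,
        "z_score": (o11 - e11) / math.sqrt(e11),
        "t_score": (o11 - e11) / math.sqrt(o11),
        "dice": 2 * o11 / (sig["R1"] + sig["C1"]),
        "mutual_information": math.log10(o11 / e11),
    }


# Observed frequencies of the first row in the table above
scores = simple_measures(o11=46, o12=730, o21=831, o22=148102)
```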
Keyword Analyses
Having created a subcorpus (a dump)
dump = corpus.query(s_query='text_party', s_values={'CDU', 'CSU'})
you can use its keywords() method for retrieving keywords:
dump.keywords(order='conservative_log_ratio')
item | O11 | O12 | O21 | O22 | R1 | R2 | C1 | C2 | N | E11 | E12 | E21 | E22 | z_score | t_score | log_likelihood | simple_ll | min_sensitivity | liddell | dice | log_ratio | conservative_log_ratio | mutual_information | local_mutual_information | ipm | ipm_reference | ipm_expected |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
deswegen | 55 | 41296 | 37 | 108412 | 41351 | 108449 | 92 | 149708 | 149800 | 25.3958 | 41325.6 | 66.6042 | 108382 | 5.87452 | 3.99183 | 41.5308 | 25.794 | 0.00133 | 0.321982 | 0.002654 | 1.96293 | 0.404166 | 0.335601 | 18.458 | 1330.08 | 341.174 | 614.152 |
CSU | 255 | 41096 | 380 | 108069 | 41351 | 108449 | 635 | 149165 | 149800 | 175.286 | 41175.7 | 459.714 | 107989 | 6.02087 | 4.99187 | 46.6543 | 31.7425 | 0.006167 | 0.126068 | 0.012147 | 0.81552 | 0.212301 | 0.162792 | 41.512 | 6166.72 | 3503.95 | 4238.99 |
CDU | 260 | 41091 | 390 | 108059 | 41351 | 108449 | 650 | 149150 | 149800 | 179.427 | 41171.6 | 470.573 | 107978 | 6.01515 | 4.99693 | 46.6055 | 31.7289 | 0.006288 | 0.124499 | 0.012381 | 0.80606 | 0.209511 | 0.161086 | 41.8823 | 6287.64 | 3596.16 | 4339.12 |
in | 867 | 40484 | 1631 | 106818 | 41351 | 108449 | 2498 | 147302 | 149800 | 689.551 | 40661.4 | 1808.45 | 106641 | 6.75755 | 6.02647 | 61.2663 | 42.1849 | 0.020967 | 0.072241 | 0.039545 | 0.47937 | 0.168901 | 0.099452 | 86.2253 | 20966.8 | 15039.3 | 16675.6 |
Wirtschaft | 39 | 41312 | 25 | 108424 | 41351 | 108449 | 64 | 149736 | 149800 | 17.6666 | 41333.3 | 46.3334 | 108403 | 5.07554 | 3.41607 | 30.9328 | 19.1002 | 0.000943 | 0.333476 | 0.001883 | 2.03257 | 0.150982 | 0.34391 | 13.4125 | 943.145 | 230.523 | 427.236 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Just as with collocates, the result is a DataFrame with lemmata as index and frequency signatures and association measures as columns.
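For keywords, O11 and O21 are the item's frequencies in the subcorpus and in the reference, and R1 and R2 are the respective sizes. The sketch below recomputes the relative frequencies (instances per million) and the plain log ratio for the first row (deswegen); the formulas are the standard definitions and match the rounded table values, but the function names are hypothetical and the code assumes non-zero counts.

```python
import math


def ipm(freq, size):
    """Relative frequency in instances per million tokens."""
    return freq / size * 1_000_000


def log_ratio(o11, o21, r1, r2):
    """Binary log of the ratio of relative frequencies (non-zero counts only)."""
    return math.log2((o11 / r1) / (o21 / r2))


# Counts for the first row in the table above
ipm_target = ipm(55, 41351)
ipm_reference = ipm(37, 108449)
lr = log_ratio(o11=55, o21=37, r1=41351, r2=108449)
```

A log ratio of about 2 means the item is roughly four times as frequent (relative to size) in the subcorpus as in the reference.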
Testing
The module ships with a small test corpus ("GERMAPARL1386"), which contains all speeches of the 86th session of the 13th German Bundestag on February 8, 1996.
The corpus consists of 149,800 tokens in 7,332 paragraphs (s-attribute "p" with annotation "type" ("regular" or "interjection")) split into 11,364 sentences (s-attribute "s"). The p-attributes are "pos" and "lemma":
corpus.attributes_available
type | attribute | annotation | active |
---|---|---|---|
p-Att | word | False | True |
p-Att | pos | False | False |
p-Att | lemma | False | False |
s-Att | corpus | False | False |
s-Att | corpus_name | True | False |
s-Att | sitzung | False | False |
s-Att | sitzung_date | True | False |
s-Att | sitzung_period | True | False |
s-Att | sitzung_session | True | False |
s-Att | div | False | False |
s-Att | div_desc | True | False |
s-Att | div_n | True | False |
s-Att | div_type | True | False |
s-Att | div_what | True | False |
s-Att | text | False | False |
s-Att | text_id | True | False |
s-Att | text_name | True | False |
s-Att | text_parliamentary_group | True | False |
s-Att | text_party | True | False |
s-Att | text_position | True | False |
s-Att | text_role | True | False |
s-Att | text_who | True | False |
s-Att | p | False | False |
s-Att | p_type | True | False |
s-Att | s | False | False |
The corpus is located in this repository. All tests are written using this corpus (as well as some reference counts and scores from the UCS toolkit and some additional frequency lists). Make sure you install all development dependencies (especially pytest):
python -m pip install pipenv
pipenv install --dev
You can then build the package, run the tests, and check test coverage:
make build
make test
make coverage
Acknowledgements
- The module includes a slight adaptation of cwb-python, a Python port of Perl's CWB::CL; thanks to Yannick Versley for the implementation.
- Special thanks to Markus Opolka for the original implementation of association-measures and for forcing me to write tests.
- The test corpus was extracted from the GermaParl corpus (see the PolMine Project); many thanks to Andreas Blätte.
- This work was supported by the Emerging Fields Initiative (EFI) of Friedrich-Alexander-Universität Erlangen-Nürnberg, project title Exploring the Fukushima Effect (2017-2020).
- Further development of the package is funded by the Deutsche Forschungsgemeinschaft (DFG) within the project Reconstructing Arguments from Noisy Text, grant number 377333057 (2018-2023), as part of the Priority Program Robust Argumentation Machines (SPP-1999).