CWB wrapper to extract concordances and collocates
Collocation and Concordance Computation
Introduction
This module is a wrapper around the IMS Open Corpus Workbench (CWB). It requires CWB version 3.4.16 or newer for anchored queries. The main purpose of the module is to extract concordance lines, to calculate collocates, and to extract the results of queries with more than two anchors.
Installation
The recommended way to install the module is to clone the repository and use setup.py:
python3 setup.py install
Alternatively, you can just install the requirements and make sure the ccc subfolder can be found by Python by including it in your PYTHONPATH.
Usage
CWBEngine
All methods rely on the CWBEngine from ccc.cwb, which you first have to initialize with your system-specific settings:
from ccc.cwb import CWBEngine
engine = CWBEngine(
corpus_name="EXAMPLE_CORPUS",
registry_path="/path/to/your/cwb/registry"
)
NB: this will raise a KeyError if the named corpus is not in the specified registry.
You can use the cqp_bin parameter to point the engine to a specific version of cqp (this is also helpful if cqp is not in your PATH):
engine = CWBEngine(
corpus_name="EXAMPLE_CORPUS",
registry_path="/path/to/your/cwb/registry",
cqp_bin="/usr/local/cwb-3.4.16/bin/cqp"
)
If you are using macros and wordlists, you have to store them in a separate folder (with subfolders wordlists and macros). Make sure you specify this folder via lib_path when initializing the engine:
engine = CWBEngine(
corpus_name="EXAMPLE_CORPUS",
registry_path="/path/to/your/cwb/registry",
lib_path="/path/to/your/lib/"
)
Concordancing
You can use the Concordance class from ccc.concordances for concordancing. The concordancer has to be initialized with the engine and accepts valid CQP queries:
from ccc.concordances import Concordance
# initialize the concordancer with the engine
concordance = Concordance(engine)
# extract concordance lines
concordance.query('[lemma="Angela"] [lemma="Merkel"]')
The result is a dictionary that maps the corpus position (cpos) of each match to one concordance line. Each concordance line is formatted as a pandas.DataFrame with the cpos of each token as index:
cpos | word | match | offset |
---|---|---|---|
188530363 | , | False | -5 |
188530364 | dass | False | -4 |
188530365 | die | False | -3 |
188530366 | Tage | False | -2 |
188530367 | von | False | -1 |
188530368 | Angela | True | 0 |
188530369 | Merkel | True | 0 |
188530370 | gezählt | False | 1 |
188530371 | sind | False | 2 |
188530372 | . | False | 3 |
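To make the role of the match column concrete, here is a minimal sketch that renders one concordance line in keyword-in-context form. It works on plain (word, match) pairs copied from the table above rather than on the DataFrame itself; the helper name kwic is hypothetical and not part of the module. With a real concordance line, passing zip(df["word"], df["match"]) should yield the same pairs.

```python
def kwic(tokens):
    """Render (word, is_match) pairs as 'left [match] right'."""
    left, node, right = [], [], []
    for word, is_match in tokens:
        if is_match:
            node.append(word)
        elif node:
            # the match has already started, so this token follows it
            right.append(word)
        else:
            left.append(word)
    return f"{' '.join(left)} [{' '.join(node)}] {' '.join(right)}"


# (word, match) pairs taken from the example table above
line = [
    (",", False), ("dass", False), ("die", False), ("Tage", False),
    ("von", False), ("Angela", True), ("Merkel", True),
    ("gezählt", False), ("sind", False), (".", False),
]

print(kwic(line))
```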
Queries must not end with a "within" clause. If you want to restrict your concordance lines by a structural attribute, use the s_break parameter (defaults to "text"). The default context window is 20 tokens to the left of the query match and 20 tokens to the right of the matchend.
concordance = Concordance(engine, context=50, s_break='s')
concordance.query('[lemma="Angela"] [lemma="Merkel"]')
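The interplay of context window, s_break and the offset column can be sketched as follows. This is an illustration read off from the example table above, not the module's actual implementation: the function context_rows and its parameters s_start and s_end (the boundaries of the enclosing structural region) are assumptions made for this sketch.

```python
def context_rows(match, matchend, context=20, s_start=None, s_end=None):
    """Return (cpos, offset, in_match) triples for one concordance line.

    The window spans `context` tokens on either side of the match and is
    clipped to the enclosing region [s_start, s_end] if boundaries are given.
    """
    start = match - context
    end = matchend + context
    if s_start is not None:
        start = max(start, s_start)
    if s_end is not None:
        end = min(end, s_end)

    rows = []
    for cpos in range(start, end + 1):
        if cpos < match:
            offset = cpos - match      # negative: left context
        elif cpos > matchend:
            offset = cpos - matchend   # positive: right context
        else:
            offset = 0                 # inside the match
        rows.append((cpos, offset, match <= cpos <= matchend))
    return rows


# match and matchend from the first example table; the region is assumed
# to end at cpos 188530372 (the full stop), which clips the right context
rows = context_rows(188530368, 188530369, context=5, s_end=188530372)
offsets = [offset for _, offset, _ in rows]
```

Under these assumptions the right context is cut to three tokens, as in the table, while the left context keeps all five.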
Further parameters for the Concordance class are order (one of "random", "first", or "last"), cut_off (the number of concordance lines to extract), and p_show (a list of additional p-attributes to show besides the primary word layer, e.g. "lemma" or "pos"; these will be added as additional columns).
Anchored Queries
Concordance detects anchored queries by default. The following query
concordance.query(
'@0[lemma="Angela"]? @1[lemma="Merkel"] '
'[word="\\("] @2[lemma="CDU"] [word="\\)"]'
)
will thus return DataFrames with an additional column indicating the anchor positions:
cpos | word | match | offset | anchor |
---|---|---|---|---|
298906425 | auch | False | -5 | None |
298906426 | das | False | -4 | None |
298906427 | Handy | False | -3 | None |
298906428 | von | False | -2 | None |
298906429 | Kanzlerin | False | -1 | None |
298906430 | Angela | True | 0 | 0 |
298906431 | Merkel | True | 0 | 1 |
298906432 | ( | True | 0 | None |
298906433 | CDU | True | 0 | 2 |
298906434 | ) | True | 0 | None |
298906435 | sowie | False | 1 | None |
298906436 | ihres | False | 2 | None |
298906437 | Vorgängers | False | 3 | None |
298906438 | Gerhard | False | 4 | None |
298906439 | Schröder | False | 5 | None |
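A short sketch of how the anchor column can be read: it collects, for each anchor number, the corpus position and word it points to. The rows are copied from the table above as plain tuples, and the helper anchor_positions is hypothetical, not part of the module.

```python
def anchor_positions(rows):
    """Map each anchor number to the (cpos, word) it is attached to."""
    return {
        anchor: (cpos, word)
        for cpos, word, anchor in rows
        if anchor is not None
    }


# (cpos, word, anchor) triples taken from the example table above
rows = [
    (298906429, "Kanzlerin", None),
    (298906430, "Angela", 0),
    (298906431, "Merkel", 1),
    (298906432, "(", None),
    (298906433, "CDU", 2),
    (298906434, ")", None),
]

anchors = anchor_positions(rows)
```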
Argument Queries
Argument queries are anchored queries with additional information. (1) Each anchor can be modified by an offset (usually used to capture underspecified tokens near an anchor point). (2) Anchors can be mapped to external identifiers for further logical processing, and (3) be given a clear name:
anchor | offset | idx | clear name |
---|---|---|---|
0 | 0 | None | None |
1 | -1 | None | None |
2 | 0 | None | None |
3 | -1 | None | None |
Furthermore, several anchor queries can be combined to form regions, which in turn can be mapped to identifiers and be given a clear name:
start | end | idx | clear name |
---|---|---|---|
0 | 1 | "0" | "person X" |
2 | 3 | "1" | "person Y" |
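The two tables can be combined as in the following sketch, which first corrects each anchor by its offset and then cuts out the token sequence of each region. The token list and anchor positions are made-up illustration data, and resolve_regions is a hypothetical helper; the sketch assumes that anchors 1 and 3 land on the token following a noun phrase, which is why their offset of -1 moves them back onto its last token.

```python
# anchor -> offset, as in the anchor table above
anchor_offsets = {0: 0, 1: -1, 2: 0, 3: -1}

# (start anchor, end anchor, idx, clear name), as in the region table above
regions = [(0, 1, "0", "person X"), (2, 3, "1", "person Y")]


def resolve_regions(tokens, anchor_cpos, anchor_offsets, regions):
    """Return idx -> {"name", "tokens"} for each region (end inclusive)."""
    corrected = {
        anchor: cpos + anchor_offsets.get(anchor, 0)
        for anchor, cpos in anchor_cpos.items()
    }
    result = {}
    for start, end, idx, name in regions:
        span = tokens[corrected[start]:corrected[end] + 1]
        result[idx] = {"name": name, "tokens": span}
    return result


# hypothetical data: list indices stand in for corpus positions
tokens = ["the", "similarity", "between", "Angela", "Merkel",
          "and", "Gerhard", "Schröder", "is", "striking"]
anchor_cpos = {0: 3, 1: 5, 2: 6, 3: 8}

holes = resolve_regions(tokens, anchor_cpos, anchor_offsets, regions)
```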
Example: Given the definition of anchors and regions above, the following complex query extracts corpus positions where there is some correlation between "person X" (the region from anchor 0 to anchor 1) and "person Y" (the region from anchor 2 to anchor 3):
query = (
"<np> []* /ap[]* [lemma = $nouns_similarity] "
"[]*</np> \"between\" @0:[::](<np>[pos_simple=\"D|A\"]* "
"([pos_simple=\"Z|P\" | lemma = $nouns_person_common | "
"lemma = $nouns_person_origin | lemma = $nouns_person_support | "
"lemma = $nouns_person_negative | "
"lemma = $nouns_person_profession] |/region[ner])+ "
"[]*</np>)+@1:[::] \"and\" @2:[::](<np>[pos_simple=\"D|A\"]* "
"([pos_simple=\"Z|P\" | lemma = $nouns_person_common | "
"lemma = $nouns_person_origin | lemma = $nouns_person_support | "
"lemma = $nouns_person_negative | "
"lemma = $nouns_person_profession] | /region[ner])+ "
"[]*</np>) (/region[np] | <vp>[lemma!=\"be\"]</vp> | "
"/region[pp] |/be_ap[])* @3:[::]"
)
NB: the set of identifiers defined in the table of anchors and in the table of regions, respectively, should not overlap.
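A quick way to check this constraint before running a query is sketched below. The helper overlapping_identifiers is hypothetical and simply compares the idx columns of the two tables, ignoring entries without an identifier.

```python
def overlapping_identifiers(anchor_idx, region_idx):
    """Return the identifiers used in both the anchor and the region table."""
    anchors = {idx for idx in anchor_idx if idx is not None}
    regions = {idx for idx in region_idx if idx is not None}
    return anchors & regions


# idx columns of the anchor and region tables shown above
anchor_idx = [None, None, None, None]
region_idx = ["0", "1"]

clash = overlapping_identifiers(anchor_idx, region_idx)
```

An empty set means the two tables are compatible in this respect.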
It is customary to store these queries in JSON query files such as the example. You can directly process these files using the process_argmin_file function from ccc.argmin:
from ccc.argmin import process_argmin_file
# process the query file
query_path = "tests/gold/query-example.json"
result = process_argmin_file(engine, query_path)
The result is a dict with the same keys as specified in the query file as well as an entry "result" with the following keys:
- "nr_matches": the number of query matches in the corpus.
- "matches": the actual concordance lines as returned from Concordance().query() (see above), converted to a dict. An additional entry "holes" contains a mapping from the idx specified in the anchor and region tables to the tokens or token sequences, respectively, for each concordance line (default: lemma layer).
- "holes": a global list of all tokens of the entities specified in the "idx" columns (default: lemma layer).
Acknowledgements
The module relies on several other Python modules (see the requirements). Special thanks to Yannick Versley and Jorg Asmussen for the implementation of cwb-python.
This work has been funded by the Deutsche Forschungsgemeinschaft (DFG) within the project "Reconstructing Arguments from Noisy Text", grant number 377333057, as part of the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999).