
CWB wrapper to extract concordances and collocates


Collocation and Concordance Computation

Introduction

This module is a wrapper around the IMS Open Corpus Workbench (CWB). Anchored queries require CWB version 3.4.16 or newer. The main purpose of the module is to extract concordance lines, to calculate collocates, and to extract the results of queries with more than two anchors.
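The version requirement can be checked before any anchored query is run. The following is a minimal sketch that only compares purely numeric, dotted version strings; the helper names `parse_version` and `meets_minimum` are hypothetical and not part of the module, and how you obtain the installed version string is left open here.

```python
def parse_version(version):
    """Turn a dotted version string such as '3.4.16' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, minimum="3.4.16"):
    """Return True if `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)
```

Comparing tuples of integers avoids the pitfall of plain string comparison, where "3.4.9" would sort after "3.4.16".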

Installation

The recommended way to install the module is to clone the repository and use setup.py.

python3 setup.py install

Alternatively, you can just install the requirements and make sure the ccc subfolder can be found by Python by including it in your PYTHONPATH.
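Besides setting PYTHONPATH in the shell, the search path can also be extended at runtime. The sketch below assumes that the directory passed in is the one that contains the ccc subfolder (for example the root of the cloned repository); the helper name `add_repo_to_path` is hypothetical.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir):
    """Prepend the directory that contains the ccc subfolder to sys.path."""
    repo = str(Path(repo_dir).expanduser().resolve())
    if repo not in sys.path:
        sys.path.insert(0, repo)
    return repo
```

After calling this function, imports such as the ones in the Usage section should resolve without installing the package.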

Usage

CWBEngine

All methods rely on the CWBEngine from ccc.cwb, which you first have to initialize with your system-specific settings:

from ccc.cwb import CWBEngine

engine = CWBEngine(
	corpus_name="EXAMPLE_CORPUS",
	registry_path="/path/to/your/cwb/registry"
)

NB: this will raise a KeyError if the named corpus is not in the specified registry.

You can use the cqp_bin parameter to point the engine to a specific version of cqp (this is also helpful if cqp is not in your PATH):

engine = CWBEngine(
	corpus_name="EXAMPLE_CORPUS",
	registry_path="/path/to/your/cwb/registry", 
	cqp_bin="/usr/local/cwb-3.4.16/bin/cqp"
)
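To decide whether cqp_bin is needed at all, you can check whether cqp is found in your PATH. This is a minimal sketch using only the standard library; the helper name `resolve_cqp` is hypothetical and not part of the module.

```python
import shutil


def resolve_cqp(cqp_bin=None):
    """Return an explicit cqp path if given, else look cqp up in PATH.

    Returns None if no explicit path is given and cqp is not in PATH.
    """
    if cqp_bin is not None:
        return cqp_bin
    return shutil.which("cqp")
```

If the function returns None, cqp could not be located and an explicit cqp_bin has to be supplied.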

If you are using macros and wordlists, you have to store them in a separate folder (with subfolders wordlists and macros). Make sure you specify this folder via lib_path when initializing the engine:

engine = CWBEngine(
	corpus_name="EXAMPLE_CORPUS", 
	registry_path="/path/to/your/cwb/registry",
	lib_path="/path/to/your/lib/"
)
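The folder layout described above can be created as follows. This sketch only sets up the two subfolders; the helper name `make_lib_dir` is hypothetical, and the naming of the files inside the subfolders is not covered here.

```python
from pathlib import Path


def make_lib_dir(lib_path):
    """Create lib_path with the subfolders wordlists and macros."""
    root = Path(lib_path)
    for name in ("wordlists", "macros"):
        (root / name).mkdir(parents=True, exist_ok=True)
    return root
```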

Concordancing

You can use the Concordance class from ccc.concordances for concordancing. The concordancer has to be initialized with the engine and accepts valid CQP queries:

from ccc.concordances import Concordance

# initialize the concordancer with the engine
concordance = Concordance(engine)

# extract concordance lines
concordance.query('[lemma="Angela"] [lemma="Merkel"]')

The result is a dictionary that maps the cpos (corpus position) of each match to one concordance line. Each concordance line is formatted as a pandas.DataFrame with the cpos of each token as index:

cpos       word     match  offset
188530363  ,        False  -5
188530364  dass     False  -4
188530365  die      False  -3
188530366  Tage     False  -2
188530367  von      False  -1
188530368  Angela   True   0
188530369  Merkel   True   0
188530370  gezählt  False  1
188530371  sind     False  2
188530372  .        False  3
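A concordance line of this shape can be turned into a keyword-in-context string. The sketch below works on plain dicts instead of a pandas.DataFrame so that it runs without CWB or pandas; with an actual DataFrame, `df.to_dict("records")` should yield rows of the same shape. The helper name `kwic` is hypothetical.

```python
def kwic(rows):
    """Render one concordance line as 'left [match] right'.

    `rows` is a list of dicts with the keys 'word' and 'offset', ordered by
    corpus position; offset 0 marks the match, negative offsets the left
    context, positive offsets the right context.
    """
    left = " ".join(r["word"] for r in rows if r["offset"] < 0)
    node = " ".join(r["word"] for r in rows if r["offset"] == 0)
    right = " ".join(r["word"] for r in rows if r["offset"] > 0)
    return f"{left} [{node}] {right}".strip()


# the example line from above, written out by hand
words = [",", "dass", "die", "Tage", "von", "Angela", "Merkel", "gezählt", "sind", "."]
offsets = [-5, -4, -3, -2, -1, 0, 0, 1, 2, 3]
rows = [{"word": w, "offset": o} for w, o in zip(words, offsets)]
print(kwic(rows))
```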

Queries must not end in a "within" clause. To restrict concordance lines by a structural attribute, use the s_break parameter (defaults to "text"). The default context window is 20 tokens to the left of the match and 20 tokens to the right of the matchend.

concordance = Concordance(engine, context=50, s_break='s')
concordance.query('[lemma="Angela"] [lemma="Merkel"]')
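The interplay of context and s_break can be pictured as a window that is clipped at the boundaries of the enclosing region. The function below is an illustration of that reading of the two parameters, not the module's implementation; the name `context_window` and its arguments are hypothetical.

```python
def context_window(match, matchend, context, region_start, region_end):
    """Return (first, last) corpus positions of a concordance line.

    The window extends `context` tokens to the left of `match` and to the
    right of `matchend`, but never beyond the enclosing region.
    """
    first = max(match - context, region_start)
    last = min(matchend + context, region_end)
    return first, last
```

For a match at positions 100 to 101 inside a region spanning 90 to 105, a context of 50 would thus be cut down to the region itself.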

Further parameters for the Concordance class are order (one of "random", "first", or "last"), cut_off (the number of concordance lines to extract), and p_show (a list of additional p-attributes to show besides the primary word layer, e.g. "lemma" or "pos"; these will be added as additional columns).
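One plausible reading of order and cut_off is sketched below: the matches are sorted by corpus position and then reduced to at most cut_off entries. This is an assumption about the semantics, not the module's code; the function name `select_matches` and the `seed` argument are hypothetical.

```python
import random


def select_matches(cpos_list, order, cut_off, seed=None):
    """Pick at most `cut_off` match positions according to `order`."""
    ordered = sorted(cpos_list)
    if order == "first":
        return ordered[:cut_off]
    if order == "last":
        # guard against cut_off == 0, where ordered[-0:] would return everything
        return ordered[-cut_off:] if cut_off else []
    if order == "random":
        rng = random.Random(seed)
        return rng.sample(ordered, min(cut_off, len(ordered)))
    raise ValueError(f"unknown order: {order!r}")
```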

Anchored Queries

Concordance detects anchored queries by default. The following query

concordance.query(
	'@0[lemma="Angela"]? @1[lemma="Merkel"] '
	'[word="\\("] @2[lemma="CDU"] [word="\\)"]'
)

will thus return DataFrames with an additional column indicating the anchor positions:

cpos       word        match  offset  anchor
298906425  auch        False  -5      None
298906426  das         False  -4      None
298906427  Handy       False  -3      None
298906428  von         False  -2      None
298906429  Kanzlerin   False  -1      None
298906430  Angela      True   0       0
298906431  Merkel      True   0       1
298906432  (           True   0       None
298906433  CDU         True   0       2
298906434  )           True   0       None
298906435  sowie       False  1       None
298906436  ihres       False  2       None
298906437  Vorgängers  False  3       None
298906438  Gerhard     False  4       None
298906439  Schröder    False  5       None
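The anchor column makes it easy to pull out the tokens at the anchor points. The sketch below again uses plain dicts as a stand-in for the rows of the DataFrame; the helper name `anchor_tokens` is hypothetical.

```python
def anchor_tokens(rows):
    """Map each anchor number to the word at that anchor."""
    return {r["anchor"]: r["word"] for r in rows if r["anchor"] is not None}


# the matched tokens of the example line above, written out by hand
rows = [
    {"word": "Angela", "anchor": 0},
    {"word": "Merkel", "anchor": 1},
    {"word": "(", "anchor": None},
    {"word": "CDU", "anchor": 2},
    {"word": ")", "anchor": None},
]
print(anchor_tokens(rows))
```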

Argument Queries

Argument queries are anchored queries with additional information: (1) each anchor can be modified by an offset (usually used to capture underspecified tokens near an anchor point); (2) anchors can be mapped to external identifiers for further logical processing; and (3) anchors can be given a clear name:

anchor  offset  idx   clear name
0       0       None  None
1       -1      None  None
2       0       None  None
3       -1      None  None

Furthermore, several anchors can be combined to form regions, which in turn can be mapped to identifiers and be given a clear name:

start  end  idx  clear name
0      1    "0"  "person X"
2      3    "1"  "person Y"
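The two tables can be read as a small two-step computation: first shift each anchor by its offset, then span each region from its start anchor to its end anchor. The sketch below illustrates this under the assumption that regions include both endpoints; the function names and the raw anchor positions are made up for illustration and do not come from the module.

```python
def apply_offsets(anchor_cpos, offsets):
    """Shift each anchor's corpus position by its offset."""
    return {a: cpos + offsets.get(a, 0) for a, cpos in anchor_cpos.items()}


def resolve_regions(anchor_cpos, regions):
    """Map each region idx to the corpus positions it spans (inclusive).

    `regions` is a list of (start_anchor, end_anchor, idx) tuples.
    """
    return {
        idx: list(range(anchor_cpos[start], anchor_cpos[end] + 1))
        for start, end, idx in regions
    }


offsets = {0: 0, 1: -1, 2: 0, 3: -1}    # from the anchor table
regions = [(0, 1, "0"), (2, 3, "1")]    # from the region table
raw = {0: 10, 1: 13, 2: 14, 3: 17}      # made-up anchor positions
corrected = apply_offsets(raw, offsets)
print(resolve_regions(corrected, regions))
```

With these made-up positions, region "0" covers positions 10 to 12 and region "1" covers positions 14 to 16.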

Example: Given the definition of anchors and regions above, the following complex query extracts corpus positions where there is some correlation between "person X" (the region from anchor 0 to anchor 1) and "person Y" (the region from anchor 2 to anchor 3):

query = (
	"<np> []* /ap[]* [lemma = $nouns_similarity] "
	"[]*</np> \"between\" @0:[::](<np>[pos_simple=\"D|A\"]* "
	"([pos_simple=\"Z|P\" | lemma = $nouns_person_common | "
	"lemma = $nouns_person_origin | lemma = $nouns_person_support | "
	"lemma = $nouns_person_negative | "
	"lemma = $nouns_person_profession] |/region[ner])+ "
	"[]*</np>)+@1:[::] \"and\" @2:[::](<np>[pos_simple=\"D|A\"]* "
	"([pos_simple=\"Z|P\" | lemma = $nouns_person_common | "
	"lemma = $nouns_person_origin | lemma = $nouns_person_support | "
	"lemma = $nouns_person_negative | "
	"lemma = $nouns_person_profession] | /region[ner])+ "
	"[]*</np>) (/region[np] | <vp>[lemma!=\"be\"]</vp> | "
	"/region[pp] |/be_ap[])* @3:[::]"
)
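For long queries like this one it can help to list the anchors a query string actually contains and compare them with the anchor table. The following is a rough regular-expression sketch, not the way the module detects anchors; it would also pick up an "@" followed by a digit inside a quoted string. The helper name `anchors_in_query` is hypothetical.

```python
import re

# an "@" followed by a single digit, as in "@0" or "@3:"
ANCHOR = re.compile(r"@(\d)")


def anchors_in_query(query):
    """Return the sorted anchor numbers that occur in a CQP query string."""
    return sorted({int(n) for n in ANCHOR.findall(query)})


# the shorter anchored query from the previous section
query = (
    '@0[lemma="Angela"]? @1[lemma="Merkel"] '
    '[word="\\("] @2[lemma="CDU"] [word="\\)"]'
)
print(anchors_in_query(query))
```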

NB: the set of identifiers defined in the table of anchors and in the table of regions, respectively, should not overlap.
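This constraint is easy to verify before a query is run. The sketch below assumes that the idx columns of both tables are available as plain lists, with None for entries without an identifier; the helper name `check_disjoint` is hypothetical.

```python
def check_disjoint(anchor_idx, region_idx):
    """Raise ValueError if anchors and regions share an identifier."""
    anchors = {i for i in anchor_idx if i is not None}
    regions = {i for i in region_idx if i is not None}
    shared = anchors & regions
    if shared:
        raise ValueError(
            f"identifiers used for both anchors and regions: {sorted(shared)}"
        )
```

For the tables above, `check_disjoint([None, None, None, None], ["0", "1"])` passes silently, since no anchor carries an identifier.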

It is customary to store these queries in JSON query files such as tests/gold/query-example.json. You can process these files directly using the process_argmin_file function from ccc.argmin:

from ccc.argmin import process_argmin_file

# process the query file
query_path = "tests/gold/query-example.json"
result = process_argmin_file(engine, query_path)

The result is a dict with the same keys as specified in the query file as well as an entry "result" with the following keys:

  • "nr_matches": the number of query matches in the corpus.
  • "matches": the actual concordance lines as returned from Concordance().query() (see above) converted to a dict. An additional entry "holes" contains a mapping from the idx specified in the anchor and region tables to the tokens or token sequences, respectively, for each concordance line (default: lemma layer).
  • "holes": a global list of all tokens of the entities specified in the "idx" columns (default: lemma layer).
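The per-line "holes" entries lend themselves to simple frequency counts. The sketch below assumes that "matches" is a dict of dicts and that each "holes" entry maps an idx to a token or a list of tokens; this layout is inferred from the description above and may differ in detail. The data is hand-made and the helper name `count_fillers` is hypothetical.

```python
from collections import Counter


def count_fillers(matches, idx):
    """Count how often each token sequence fills `idx` across all matches."""
    counts = Counter()
    for line in matches.values():
        filler = line.get("holes", {}).get(idx)
        if filler is None:
            continue
        if isinstance(filler, (list, tuple)):
            filler = " ".join(filler)
        counts[filler] += 1
    return counts


# hand-made stand-in for result["result"]["matches"]
matches = {
    "298906430": {"holes": {"0": ["Angela", "Merkel"], "1": ["CDU"]}},
    "188530368": {"holes": {"0": ["Angela", "Merkel"]}},
}
print(count_fillers(matches, "0"))
```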

Acknowledgements

The module relies on several other Python modules (see the requirements). Special thanks to Yannick Versley and Jorg Asmussen for the implementation of cwb-python.

This work has been funded by the Deutsche Forschungsgemeinschaft (DFG) within the project "Reconstructing Arguments from Noisy Text", grant number 377333057, as part of the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999).
