CWB wrapper to extract concordances and collocates

Collocation and Concordance Computation

Introduction

This module is a wrapper around the IMS Open Corpus Workbench (CWB). Its main purpose is to extract concordance lines, calculate collocates, and run queries with several anchor points.

If you want to extract the results of anchored queries with more than two anchor points, the module requires CWB version 3.4.16 or newer.

Installation

You can install this module with pip from PyPI:

pip3 install cwb-ccc

You can also clone the repository from gitlab.cs.fau.de and use setup.py:

python3 setup.py install

Alternatively, you can install all requirements specified in setup.py and make sure the ccc subfolder can be found by Python, e.g. by including it in your PYTHONPATH.
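
From within Python, a minimal sketch of the equivalent (the path is a placeholder for wherever you cloned the repository; setting the PYTHONPATH environment variable achieves the same):

import sys
# make the ccc subfolder importable; replace with your local path
sys.path.insert(0, "/path/to/cwb-ccc")
import ccc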

Usage

CWBEngine

All methods rely on the CWBEngine from ccc.cwb, which you first have to initialize with your system-specific settings:

from ccc.cwb import CWBEngine
engine = CWBEngine(
	corpus_name="EXAMPLE_CORPUS",
	registry_path="/path/to/your/cwb/registry/"
)

This will raise a KeyError if the named corpus is not in the specified registry.

If you are using macros and wordlists, you have to store them in a separate folder (with subfolders wordlists and macros). Make sure you specify this folder via lib_path when initializing the engine.

Point your engine to metadata stored as a tab-separated (and possibly gzipped) file via the meta_path parameter. The first column must be a unique identifier, and the corresponding text regions have to be annotated in the corpus as a structural attribute (whose name you can specify via the meta_s parameter, e.g. "text_id").

You can use the cqp_bin parameter to point the engine to a specific version of cqp (this is also helpful if cqp is not in your PATH).

By default, the cache_path points to "/tmp/ccc-cache". Make sure that "/tmp/" exists and the appropriate rights are granted. Otherwise, change the parameter when invoking the engine (or set it to None).
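
A sketch combining the optional parameters described above; all paths and values are placeholders, and the exact signature may differ between versions:

from ccc.cwb import CWBEngine
engine = CWBEngine(
	corpus_name="EXAMPLE_CORPUS",
	registry_path="/path/to/your/cwb/registry/",
	lib_path="/path/to/your/lib/",     # folder with wordlists/ and macros/
	meta_path="/path/to/meta.tsv.gz",  # tab-separated metadata
	meta_s="text_id",                  # s-attribute holding the identifiers
	cqp_bin="/usr/local/bin/cqp",      # explicit cqp binary
	cache_path="/tmp/ccc-cache"        # set to None to disable caching
)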

Concordancing

You can use the Concordance class from ccc.concordances for concordancing. The concordancer has to be initialized with the engine and accepts valid CQP queries:

from ccc.concordances import Concordance
# initialize the concordancer with the engine
concordance = Concordance(engine)
# extract concordance lines
concordance.query('[lemma="Angela"] [lemma="Merkel"]')

The result will be a dictionary with the cpos of each match as keys and one concordance line per entry. Each concordance line is formatted as a pandas.DataFrame with the cpos of each token as index:

cpos        word     match  offset
188530363   ,        False      -5
188530364   dass     False      -4
188530365   die      False      -3
188530366   Tage     False      -2
188530367   von      False      -1
188530368   Angela   True        0
188530369   Merkel   True        0
188530370   gezählt  False       1
188530371   sind     False       2
188530372   .        False       3
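
Given this structure, the matches can be inspected with plain pandas operations; a minimal sketch:

# loop over all matches and print each line as plain text
result = concordance.query('[lemma="Angela"] [lemma="Merkel"]')
for match_cpos, df in result.items():
	print(match_cpos, " ".join(df["word"]))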

Queries must not end in a "within" clause. If you want to restrict your concordance lines by a structural attribute, use the s_break parameter (defaults to "text"). The default context window is 20 tokens to the left of the match and 20 tokens to the right of the matchend.

Further parameters for the Concordance class are order (one of "random", "first", or "last"), cut_off (for the number of concordance lines to extract), and p_show (a list of additional p-attributes besides the primary word layer to show, e.g. "lemma" or "pos"; these will be added as additional columns).
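
A sketch of these parameters in use, assuming they are keyword arguments of the Concordance class (the exact signature may differ between versions):

from ccc.concordances import Concordance
concordance = Concordance(
	engine,
	order="random",          # sample lines at random instead of corpus order
	cut_off=50,              # extract at most 50 concordance lines
	p_show=["lemma", "pos"]  # add lemma and POS columns to each DataFrame
)
lines = concordance.query('[lemma="Angela"] [lemma="Merkel"]')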

Collocation Analyses

You can use the Collocates class to extract collocates for a given window size (symmetric windows around the corpus matches):

from ccc.collocates import Collocates
# initialize the collocation calculator with the engine
collocates = Collocates(engine)
# extract collocates
collocates.query('[lemma="Angela"] [lemma="Merkel"]', window=5)

The result will be a DataFrame with lexical items (lemmas by default) as index and frequency signatures and association measures as columns:

item             O11   f2        N          f1     O12    O21       O22        E11          E12           E21           E22           log_likelihood  ...
die              1799  25125329  300917702  22832  21033  25123530  275771340  1906.373430  20925.626570  2.512342e+07  2.757714e+08       -2.459194  ...
Bundeskanzlerin  1491  8816      300917702  22832  21341  7325      300887545     0.668910  22831.331090  8.815331e+03  3.008861e+08     1822.211827  ...
.                1123  13677811  300917702  22832  21709  13676688  287218182  1037.797972  21794.202028  1.367677e+07  2.872181e+08        2.644804  ...
,                 814  17562059  300917702  22832  22018  17561245  283333625  1332.513602  21499.486398  1.756073e+07  2.833341e+08      -14.204447  ...
Kanzlerin         648  17622     300917702  22832  22184  16974     300877896     1.337062  22830.662938  1.762066e+04  3.008772e+08      559.245198  ...

By default, the dataframe is sorted by co-occurrence frequency (O11), and only the 100 most frequently co-occurring collocates are retrieved. You can change this behaviour via the order and cut_off parameters when invoking the Collocates class.

By default, collocates are calculated on the "lemma"-layer (assuming that this is a valid p-attribute in the corpus) and windows are cut at the "text" s-attribute. The corresponding parameters are p_query and s_break.

By default, the dataframe is annotated with "z_score", "t_score", "dice", "log_likelihood", and "mutual_information" (parameter ams). For notation and further information regarding association measures, see collocations.de. Available association measures depend on their implementation in the association-measures module.
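
A sketch combining these parameters, assuming they are keyword arguments of the Collocates class and that result columns can serve as sort keys (check the signature of your installed version):

from ccc.collocates import Collocates
collocates = Collocates(
	engine,
	p_query="lemma",         # p-attribute to calculate collocates on
	s_break="s",             # cut windows at sentence boundaries
	order="log_likelihood",  # assumption: sort by an association measure
	cut_off=200,             # retrieve the 200 highest-ranked items
	ams=["log_likelihood", "dice"]
)
df = collocates.query('[lemma="Angela"] [lemma="Merkel"]', window=5)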

A further parameter of the Collocates class is the max_window_size. This is an internal parameter that determines the result of the initial query via the CWBEngine. The result of the initial query and its co-occurrence dataframe are cached (assuming the engine was not initialized with cache_path=None), which means that you can extract collocates for windows from 0 to max_window_size quickly after the first run.
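
Since the initial query is cached up to max_window_size, varying the window afterwards is cheap; a minimal sketch:

# the first call triggers the actual CQP query; subsequent calls with
# smaller windows are served from the cache
collocates = Collocates(engine, max_window_size=10)
for window in range(1, 6):
	df = collocates.query('[lemma="Angela"] [lemma="Merkel"]', window=window)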

Anchored Queries

The Concordance class detects anchored queries by default. The following query

concordance.query(
	'@0[lemma="Angela"]? @1[lemma="Merkel"] [word="\\("] @2[lemma="CDU"] [word="\\)"]'
)

will thus return DataFrames with an additional column indicating the anchor positions:

cpos        word        match  offset  anchor
298906425   auch        False      -5  None
298906426   das         False      -4  None
298906427   Handy       False      -3  None
298906428   von         False      -2  None
298906429   Kanzlerin   False      -1  None
298906430   Angela      True        0  0
298906431   Merkel      True        0  1
298906432   (           True        0  None
298906433   CDU         True        0  2
298906434   )           True        0  None
298906435   sowie       False       1  None
298906436   ihres       False       2  None
298906437   Vorgängers  False       3  None
298906438   Gerhard     False       4  None
298906439   Schröder    False       5  None
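
A sketch of pulling the anchored tokens out of such a line, assuming unanchored rows hold None in the anchor column (as shown above):

# take one concordance line and select the rows with anchor points
result = concordance.query(
	'@0[lemma="Angela"]? @1[lemma="Merkel"] [word="\\("] @2[lemma="CDU"] [word="\\)"]'
)
df = next(iter(result.values()))
print(df[df["anchor"].notna()][["word", "anchor"]])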

Argument Queries

Argument queries are anchored queries with additional information: (1) each anchor can be modified by an offset (usually used to capture underspecified tokens near an anchor point), (2) anchors can be mapped to external identifiers for further logical processing, and (3) anchors may be given a clear name:

anchor  offset  idx   clear name
0            0  None  None
1           -1  None  None
2            0  None  None
3           -1  None  None

Furthermore, several anchor points can be combined to form regions, which in turn can be mapped to identifiers and be given a clear name:

start  end  idx  clear name
0      1    "0"  "person X"
2      3    "1"  "person Y"

Example: Given the definition of anchors and regions above as well as suitable wordlists, the following complex query extracts corpus positions where there's some correlation between "person X" (the region from anchor 0 to anchor 1) and "person Y" (anchor 2 to 3):

query = (
	"<np> []* /ap[]* [lemma = $nouns_similarity] "
	"[]*</np> \"between\" @0:[::](<np>[pos_simple=\"D|A\"]* "
	"([pos_simple=\"Z|P\" | lemma = $nouns_person_common | "
	"lemma = $nouns_person_origin | lemma = $nouns_person_support | "
	"lemma = $nouns_person_negative | "
	"lemma = $nouns_person_profession] |/region[ner])+ "
	"[]*</np>)+@1:[::] \"and\" @2:[::](<np>[pos_simple=\"D|A\"]* "
	"([pos_simple=\"Z|P\" | lemma = $nouns_person_common | "
	"lemma = $nouns_person_origin | lemma = $nouns_person_support | "
	"lemma = $nouns_person_negative | "
	"lemma = $nouns_person_profession] | /region[ner])+ "
	"[]*</np>) (/region[np] | <vp>[lemma!=\"be\"]</vp> | "
	"/region[pp] |/be_ap[])* @3:[::]"
)

NB: the sets of identifiers defined in the anchor table and in the region table should not overlap.

It is customary to store these queries in JSON objects (see an example in the repository). You can process argument queries with the argmin_query method from ccc.argmin:

import json
from ccc.argmin import argmin_query
# read the query file
query_path = "tests/gold/query-example.json"
with open(query_path, "rt") as f:
	query = json.loads(f.read())
# query the corpus
result = argmin_query(
	engine,
	query=query['query'],
	anchors=query['anchors'],
	regions=query['regions']
)
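
For orientation, the query object loaded above might look as follows. This is a hypothetical sketch mirroring the anchor and region tables; the authoritative layout is the example file in the repository:

# hypothetical sketch of a query object; the rows follow the anchor
# and region tables above, the query string is abbreviated
query = {
	"query": '<np> []* /ap[]* [lemma = $nouns_similarity] ...',
	"anchors": [[0, 0, None, None], [1, -1, None, None],
	            [2, 0, None, None], [3, -1, None, None]],
	"regions": [[0, 1, "0", "person X"], [2, 3, "1", "person Y"]]
}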

Further parameters for argmin_query are s_break, context, p_show, and match_strategy (one of "longest" or "standard", see documentation of CQP).
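
A sketch with these optional parameters spelled out (values are placeholders):

result = argmin_query(
	engine,
	query=query['query'],
	anchors=query['anchors'],
	regions=query['regions'],
	s_break="s",              # cut lines at sentence boundaries
	context=25,               # tokens of context around the match
	p_show=["lemma"],         # additional p-attributes to include
	match_strategy="longest"  # CQP match strategy
)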

The result is a dict with the following keys:

  • "nr_matches": the number of query matches in the corpus.
  • "matches": a list of concordance lines. Each concordance line contains
    • "position": the corpus position of the match
    • "df": the actual concordance line as returned from Concordance().query() (see above) converted to a dict
    • "holes": a mapping from the IDs specified in the anchor and region tables to the tokens or token sequences, respectively (default: lemma layer)
    • "full": a reconstruction of the concordance line as a sequence of tokens (word layer)
  • "holes": a global list of all tokens of the entities specified in the "idx" columns (default: lemma layer).

Acknowledgements

The module relies on cwb-python; special thanks to Yannick Versley and Jorg Asmussen for the implementation. Thanks to Markus Opolka for the implementation of association-measures and for forcing me to write tests.

This work has been funded by the Deutsche Forschungsgemeinschaft (DFG) within the project "Reconstructing Arguments from Noisy Text", grant number 377333057, as part of the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999).
