CWB wrapper to extract concordances and collocates

Collocation and Concordance Computation

Introduction

This module is a wrapper around the IMS Open Corpus Workbench (CWB). The main purpose of the module is to run queries, extract concordance lines, and calculate collocates.

Prerequisites

The module needs a working installation of the CWB and operates on CWB-indexed corpora.

If you want to run queries with more than two anchor points, the module requires CWB version 3.4.16 or later.

Installation

You can install this module with pip from PyPI:

pip3 install cwb-ccc

You can also clone the source from GitHub, cd into the respective folder, and use setup.py:

python3 setup.py install

Corpus Setup

All methods rely on the Corpus class, which establishes the connection to your CWB-indexed corpus:

from ccc import Corpus
corpus = Corpus(
	corpus_name="EXAMPLE_CORPUS",
	registry_path='/usr/local/share/cwb/registry/'
)
print(corpus)

This will raise a KeyError if the named corpus is not in the specified registry.
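If you prefer to fail early with your own message, you can check the registry before creating the Corpus object. The following is a minimal sketch, assuming the usual CWB convention that each corpus has a registry file named after its lowercase identifier; the helper corpus_in_registry is hypothetical and not part of this module.

```python
import os
import tempfile


def corpus_in_registry(corpus_name, registry_path):
    """Return True if a registry file for corpus_name exists.

    Assumption: registry files are named after the lowercase corpus id.
    """
    registry_file = os.path.join(registry_path, corpus_name.lower())
    return os.path.isfile(registry_file)


# Demo with a throw-away registry directory instead of a real CWB registry
with tempfile.TemporaryDirectory() as registry:
    open(os.path.join(registry, "example_corpus"), "w").close()
    found = corpus_in_registry("EXAMPLE_CORPUS", registry)
    missing = corpus_in_registry("OTHER_CORPUS", registry)
```

With a real installation you would pass the same registry_path that you hand to Corpus.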

If you are using macros and wordlists, you have to store them in a separate folder (with subfolders "wordlists/" and "macros/"). Make sure you specify this folder via lib_path when initializing the corpus.
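The folder layout described above can be prepared with a few lines of standard-library code. This is only a sketch: the helper make_lib_dir and the folder name "ccc-lib" are made up for illustration, and the format of the files inside the subfolders is not covered here.

```python
import tempfile
from pathlib import Path


def make_lib_dir(lib_path):
    """Create lib_path with the subfolders "wordlists/" and "macros/"."""
    lib = Path(lib_path)
    for sub in ("wordlists", "macros"):
        (lib / sub).mkdir(parents=True, exist_ok=True)
    return lib


# Demo in a temporary location; use a permanent folder in practice
with tempfile.TemporaryDirectory() as tmp:
    lib = make_lib_dir(Path(tmp) / "ccc-lib")
    created = sorted(p.name for p in lib.iterdir())
```

The resulting path is what you would pass as lib_path when initializing the corpus.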

You can use the cqp_bin parameter to point the module to a specific version of cqp (this is also helpful if cqp is not in your PATH).

By default, the data_path points to "/tmp/ccc-data/". Make sure that "/tmp/" exists and appropriate rights are granted. Otherwise, change the parameter when initializing the corpus.
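A quick way to find out whether a data_path is usable is to test it before initializing the corpus. The sketch below uses only the standard library; is_usable_data_path is a hypothetical helper, and it only looks one level up when the folder does not exist yet.

```python
import os
import tempfile


def is_usable_data_path(data_path):
    """Return True if data_path is a writable folder or could be created."""
    path = os.path.normpath(data_path)
    if os.path.isdir(path):
        return os.access(path, os.W_OK | os.X_OK)
    # Folder does not exist yet: its parent must exist and be writable
    parent = os.path.dirname(path)
    return os.path.isdir(parent) and os.access(parent, os.W_OK | os.X_OK)


with tempfile.TemporaryDirectory() as tmp:
    creatable = is_usable_data_path(os.path.join(tmp, "ccc-data", ""))
    orphaned = is_usable_data_path(os.path.join(tmp, "missing", "ccc-data", ""))
```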

Usage

Queries and Dumps

Before you can display anything, you have to run a query with the corpus.query() method, which accepts valid CQP queries such as

# raw string, so that the backslashes reach CQP unchanged
query = r'[lemma="Angela"]? [lemma="Merkel"] [word="\("] [lemma="CDU"] [word="\)"]'
dump = corpus.query(
	cqp_query=query
)
print(dump)

The result is a Dump object. Its core is a pandas DataFrame multi-indexed by CQP's "match" and "matchend" (similar to a CQP dump). All entries of the DataFrame, including the index, are integers representing corpus positions:

print(dump.df)

You can define the context around the matches with two parameters: context sets the size of the context window in tokens (defaults to 20), and context_break names an s-attribute whose region boundaries the context does not cross. You can specify asymmetric windows via context_left and context_right.

dump = corpus.query(
	cqp_query=query,
	context=20,
	context_break='s'
)

In this case, the dump.df will contain two further columns, specifying the context: "context" and "contextend".
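The arithmetic behind these two columns can be illustrated in a few lines. This is a sketch of the idea, not the module's implementation: the function context_window and its region argument are hypothetical, and the region bounds in the second call are taken from the example concordance line shown further below.

```python
def context_window(match, matchend, context=20, region=None):
    """Return (context, contextend) for a single match.

    region is an optional (start, end) pair of corpus positions delimiting
    the s-attribute region (e.g. the sentence) that contains the match.
    """
    start = max(match - context, 0)
    end = matchend + context
    if region is not None:
        region_start, region_end = region
        start = max(start, region_start)
        end = min(end, region_end)
    return start, end


# Window of 20 tokens without a context break
unbounded = context_window(676, 680, context=20)  # (656, 700)

# Same window size, but confined to a sentence spanning cpos 48344 to 48356
bounded = context_window(48349, 48353, context=20, region=(48344, 48356))
```

In the second call the sentence is shorter than the window on both sides, so the context collapses to the sentence boundaries.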

Note that queries may end with a "within" clause, which limits the matches to regions defined by the given structural attribute. If you provide a context_break parameter, the query will automatically be confined to regions of this s-attribute.

You can set CQP's matching strategy ("standard", "longest", "shortest") via the match_strategy parameter.

By default, the result is cached: the query parameters will be used to create an identifier. The resulting Dump object contains the appropriate identifier as attribute name_cache. The resulting subcorpus will be saved to disk by CQP, and the extended dump containing the context put into a cache. This way, the result can be accessed directly by later queries with the same parameters on the same (sub)corpus, without the need for CQP to run again. You can disable caching by providing a name other than "mnemosyne".
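To make the idea of a parameter-based identifier concrete, here is a small sketch. It is an assumption for illustration only: the module's actual identifier scheme may differ, and cache_identifier is a hypothetical function.

```python
import hashlib
import json


def cache_identifier(corpus_name, **query_params):
    """Derive a short, stable identifier from the query parameters."""
    # sort_keys makes the result independent of the keyword order
    payload = json.dumps([corpus_name, query_params], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


id_a = cache_identifier(
    "EXAMPLE_CORPUS", cqp_query='[lemma="Merkel"]', context=20, context_break="s"
)
id_b = cache_identifier(
    "EXAMPLE_CORPUS", context_break="s", context=20, cqp_query='[lemma="Merkel"]'
)
id_c = cache_identifier(
    "EXAMPLE_CORPUS", cqp_query='[lemma="Merkel"]', context=10, context_break="s"
)
```

Identical parameters yield the same identifier regardless of their order (id_a equals id_b), while any changed parameter yields a new one (id_c), which is what allows a cached result to be found again.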

Now you are set up to analyze your query result. Let's start with the frequency breakdown:

print(dump.breakdown())

word	freq
Angela Merkel ( CDU )	2253
Merkel ( CDU )	29
Angela Merkels ( CDU )	2
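Conceptually, the breakdown counts the surface realizations of all matches. The following sketch reproduces this on a toy token list; it does not use the module, and the function name and data are made up for illustration.

```python
from collections import Counter


def breakdown(corpus_words, matches):
    """Count the token sequences covered by (match, matchend) pairs."""
    counts = Counter(
        " ".join(corpus_words[match:matchend + 1]) for match, matchend in matches
    )
    return counts.most_common()


tokens = (
    "Kanzlerin Angela Merkel ( CDU ) sprach . "
    "Merkel ( CDU ) kam . "
    "Angela Merkel ( CDU ) ging ."
).split()

# (match, matchend) pairs as corpus positions, both inclusive
matches = [(1, 5), (8, 11), (14, 18)]
result = breakdown(tokens, matches)
# [('Angela Merkel ( CDU )', 2), ('Merkel ( CDU )', 1)]
```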

Concordancing

You can directly access concordance lines via the concordance method of the dump. This method returns a dataframe with information about the query matches in context:

lines = dump.concordance()
print(lines)

match	matchend	context	contextend	raw
676	680	656	700	{'cpos': [656, 657, 658, 659, 660, 661, 662, 6...
1190	1194	1170	1214	{'cpos': [1170, 1171, 1172, 1173, 1174, 1175, ...
543640	543644	543620	543664	{'cpos': [543620, 543621, 543622, 543623, 5436...
...	...	...	...	...

Column raw contains a dictionary with the following keys:

"match" (int): the cpos of the match
"cpos" (list): the cpos of all tokens in the concordance line
"offset" (list): the offset to match/matchend of all tokens
"word" (list): the words of all tokens
"anchors" (dict): a dictionary of {anchor: cpos} (see below)
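As an example of custom formatting on top of such a dictionary, the sketch below splits one line into left context, node, and right context using the offsets. The function kwic and the keys "left", "node", and "right" are chosen for illustration and are not necessarily what the module's own formats return; the data is copied from the example line shown below.

```python
def kwic(raw):
    """Split a raw concordance line into left context, node, and right context."""
    left, node, right = [], [], []
    for offset, word in zip(raw["offset"], raw["word"]):
        if offset < 0:
            left.append(word)
        elif offset == 0:
            node.append(word)  # tokens inside the match have offset 0
        else:
            right.append(word)
    return {
        "left": " ".join(left),
        "node": " ".join(node),
        "right": " ".join(right),
    }


raw = {
    "match": 48349,
    "cpos": list(range(48344, 48357)),
    "offset": [-5, -4, -3, -2, -1, 0, 0, 0, 0, 0, 1, 2, 3],
    "word": [
        "Eine", "entsprechende", "Steuererleichterung", "hat", "Kanzlerin",
        "Angela", "Merkel", "(", "CDU", ")",
        "bisher", "ausgeschlossen", ".",
    ],
    "anchors": {},
}
line = kwic(raw)
```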

You can create your own formatting from this, or use the form parameter to define how your lines should be formatted ("raw", "simple", "kwic", "dataframes" or "extended"). If form="dataframes" or form="extended", the dataframe contains a column df with each concordance line being formatted as a DataFrame with the cpos of each token as index:

lines = dump.concordance(form="dataframes")
print(lines['df'].iloc[0])

cpos	offset	word	match	matchend	context	contextend
48344	-5	Eine	False	False	True	False
48345	-4	entsprechende	False	False	False	False
48346	-3	Steuererleichterung	False	False	False	False
48347	-2	hat	False	False	False	False
48348	-1	Kanzlerin	False	False	False	False
48349	0	Angela	True	False	False	False
48350	0	Merkel	False	False	False	False
48351	0	(	False	False	False	False
48352	0	CDU	False	False	False	False
48353	0	)	False	True	False	False
48354	1	bisher	False	False	False	False
48355	2	ausgeschlossen	False	False	False	False
48356	3	.	False	False	False	True

Attribute selection is controlled via the p_show and s_show parameters (lists of p-attributes and s-attributes, respectively):

lines = dump.concordance(
	form="dataframes",
	p_show=['word', 'lemma'],
	s_show=['text_id']
)
print(lines)

match	matchend	context	contextend	df	text_id
676	680	656	700	...	A113224
1190	1194	1170	1214	...	A124124
543640	543644	543620	543664	...	A423523
...	...	...	...	...	...

print(lines['df'].iloc[0])

cpos	offset	word	lemma	match	matchend	context	contextend
48344	-5	Eine	eine	False	False	True	False
48345	-4	entsprechende	entsprechende	False	False	False	False
48346	-3	Steuererleichterung	Steuererleichterung	False	False	False	False
48347	-2	hat	haben	False	False	False	False
48348	-1	Kanzlerin	Kanzlerin	False	False	False	False
48349	0	Angela	Angela	True	False	False	False
48350	0	Merkel	Merkel	False	False	False	False
48351	0	(	(	False	False	False	False
48352	0	CDU	CDU	False	False	False	False
48353	0	)	)	False	True	False	False
48354	1	bisher	bisher	False	False	False	False
48355	2	ausgeschlossen	ausschließen	False	False	False	False
48356	3	.	.	False	False	False	True

You can decide which and how many concordance lines you want to retrieve by means of the parameters order ("first", "last", or "random") and cut_off. You can also provide a list of matches to get only specific concordance lines.
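The effect of the two parameters can be pictured as a selection over the match positions. This is a sketch under assumptions: select_matches and its seed argument are hypothetical, and the module may handle edge cases differently.

```python
import random


def select_matches(matches, order="first", cut_off=None, seed=None):
    """Pick at most cut_off matches in the requested order."""
    matches = sorted(matches)
    if cut_off is None or cut_off >= len(matches):
        return matches
    if order == "first":
        return matches[:cut_off]
    if order == "last":
        return matches[len(matches) - cut_off:]
    if order == "random":
        # a fixed seed makes the random sample reproducible
        return random.Random(seed).sample(matches, cut_off)
    raise ValueError(f"unknown order: {order!r}")


positions = [676, 1190, 543640]
first_two = select_matches(positions, order="first", cut_off=2)
last_two = select_matches(positions, order="last", cut_off=2)
random_two = select_matches(positions, order="random", cut_off=2, seed=42)
```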

Anchored Queries

The concordancer detects anchored queries automatically. The following query

dump = corpus.query(
	r'@0[lemma="Angela"]? @1[lemma="Merkel"] [word="\("] @2[lemma="CDU"] [word="\)"]',
)
dump.concordance(form='dataframes')

thus returns DataFrames with additional columns for each anchor point.

cpos	offset	word	match	matchend	context	contextend	0	1	2
48344	-5	Eine	False	False	True	False	False	False	False
48345	-4	entsprechende	False	False	False	False	False	False	False
48346	-3	Steuererleichterung	False	False	False	False	False	False	False
48347	-2	hat	False	False	False	False	False	False	False
48348	-1	Kanzlerin	False	False	False	False	False	False	False
48349	0	Angela	True	False	False	False	True	False	False
48350	0	Merkel	False	False	False	False	False	True	False
48351	0	(	False	False	False	False	False	False	False
48352	0	CDU	False	False	False	False	False	False	True
48353	0	)	False	True	False	False	False	False	False
48354	1	bisher	False	False	False	False	False	False	False
48355	2	ausgeschlossen	False	False	False	False	False	False	False
48356	3	.	False	False	False	True	False	False	False
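If you work with the raw format instead, the "anchors" dictionary described above gives you the corpus position of each anchor, which you can map back to the tokens. A minimal sketch with the values from this example; anchor_words is a hypothetical helper, and it returns None for anchors that do not point into the line.

```python
def anchor_words(raw):
    """Map each anchor to the word at its corpus position."""
    word_at = dict(zip(raw["cpos"], raw["word"]))
    return {anchor: word_at.get(cpos) for anchor, cpos in raw["anchors"].items()}


raw = {
    "cpos": list(range(48344, 48357)),
    "word": [
        "Eine", "entsprechende", "Steuererleichterung", "hat", "Kanzlerin",
        "Angela", "Merkel", "(", "CDU", ")",
        "bisher", "ausgeschlossen", ".",
    ],
    "anchors": {0: 48349, 1: 48350, 2: 48352},
}
words = anchor_words(raw)
# {0: 'Angela', 1: 'Merkel', 2: 'CDU'}
```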

Collocation Analyses

After executing a query, you can use the dump.collocates() method to extract collocates for a given window size (symmetric windows around the corpus matches). The result will be a DataFrame with lexical items as index and frequency signatures and association measures as columns.

dump = corpus.query(
    '[lemma="Angela"] [lemma="Merkel"]',
    context=10, context_break='s'
)
collocates = dump.collocates()
print(collocates)

item	O11	O12	O21	O22	E11	E12	E21	E22	log_likelihood	...
die	1189	13461	22082331	233975249	1263.407469	13386.592531	2.208226e+07	2.339753e+08	-4.883922	...
Bundeskanzlerin	1165	13485	5783	256051797	0.397498	14649.602502	6.947603e+03	2.560506e+08	16573.570027	...
,	603	14047	14436277	241621303	825.939978	13824.060022	1.443605e+07	2.416215e+08	-70.046255	...
Kanzlerin	492	14158	13274	256044306	0.787559	14649.212441	1.376521e+04	2.560438e+08	5386.275148	...
haben	379	14271	2433866	253623714	139.264180	14510.735820	2.434106e+06	2.536235e+08	283.416865	...
...	...	...	...	...	...	...	...	...	...	...

By default, collocates are calculated on the "lemma"-layer, assuming that this is a valid p-attribute in the corpus. The corresponding parameter is p_query (which will fall back to "word" if the specified attribute is not annotated in the corpus).

For improved performance, all hapax legomena in the context are dropped after calculating the context size. You can change this behaviour via the min_freq parameter.
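The order of the two steps matters: the context size is determined first, and only then are rare items removed. The sketch below illustrates this on a toy list of context tokens; the function name is made up, and the default min_freq=2 is an assumption that mirrors "dropping hapax legomena".

```python
from collections import Counter


def count_context_items(context_tokens, min_freq=2):
    """Count items in the context and drop those rarer than min_freq."""
    counts = Counter(context_tokens)
    context_size = sum(counts.values())  # computed before filtering
    kept = {item: freq for item, freq in counts.items() if freq >= min_freq}
    return kept, context_size


tokens = ["die", "Kanzlerin", "die", "haben", "Kanzlerin", "die", "bisher"]
kept, context_size = count_context_items(tokens)
# kept == {'die': 3, 'Kanzlerin': 2}, context_size == 7
```

Because the size is computed on the full context, dropping rare items does not change the frequency signatures of the items that remain.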

By default, the dataframe is annotated with "z_score", "t_score", "dice", "log_likelihood", and "mutual_information" (parameter ams). For notation and further information regarding association measures, see collocations.de. Availability of association measures depends on their implementation in the pandas-association-measures package.
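To see how the columns of the table above relate to each other, the following sketch recomputes the expected frequencies and the log-likelihood value from the observed frequencies O11 to O22. It uses the standard formulas and is not the code of the pandas-association-measures package; the negative sign for items that occur less often than expected is inferred from the table (see the row for "die").

```python
import math


def expected_frequencies(o11, o12, o21, o22):
    """Expected frequencies (E11, E12, E21, E22) under independence."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22  # row totals
    c1, c2 = o11 + o21, o12 + o22  # column totals
    return r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n


def log_likelihood(o11, o12, o21, o22, signed=True):
    """Log-likelihood ratio; negative if O11 is below E11 and signed=True."""
    observed = (o11, o12, o21, o22)
    expected = expected_frequencies(*observed)
    ll = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    if signed and o11 < expected[0]:
        ll = -ll
    return ll


# Rows "Bundeskanzlerin" and "die" from the table above
e11_kanzlerin = expected_frequencies(1165, 13485, 5783, 256051797)[0]
ll_kanzlerin = log_likelihood(1165, 13485, 5783, 256051797)
ll_die = log_likelihood(1189, 13461, 22082331, 233975249)
```

The results agree with the table: E11 is about 0.3975 and the log-likelihood about 16573.57 for "Bundeskanzlerin", and about -4.88 for "die".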

The dataframe is sorted by co-occurrence frequency (column "O11"), and only the 100 most frequently co-occurring collocates are retrieved. You can (and should) change this behaviour via the order and cut_off parameters.

Keyword Analyses

For keyword analyses, you have to define a subcorpus. The natural way of doing so is by selecting text identifiers via spreadsheets or relational databases. If you have collected an appropriate set of ids, you can use the corpus.dump_from_s_att() method:

# ids: the collection of text_id values you selected beforehand
dump = corpus.dump_from_s_att('text_id', ids)
keywords = dump.keywords()

Just as with collocates, the result is a DataFrame with lexical items (p_query layer) as index and frequency signatures and association measures as columns.
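For keywords, the frequency signature of an item compares its frequency in the subcorpus with its frequency elsewhere. The sketch below builds such a signature on toy data, assuming that the reference is the remainder of the corpus; whether the module uses the remainder or the whole corpus as reference is not stated here, and keyword_signature is a hypothetical function.

```python
from collections import Counter


def keyword_signature(item, subcorpus_tokens, corpus_tokens):
    """Observed frequencies of item in the subcorpus vs. the rest of the corpus."""
    f = Counter(subcorpus_tokens)[item]   # frequency in the subcorpus
    f1 = len(subcorpus_tokens)            # size of the subcorpus
    f2 = Counter(corpus_tokens)[item]     # frequency in the whole corpus
    n = len(corpus_tokens)                # size of the whole corpus
    return {"O11": f, "O12": f1 - f, "O21": f2 - f, "O22": n - f1 - f2 + f}


corpus_tokens = [
    "Atomkraft", "ist", "gut", "Atomkraft",
    "nein", "danke", "das", "ist", "alles", ".",
]
subcorpus_tokens = corpus_tokens[:4]
signature = keyword_signature("Atomkraft", subcorpus_tokens, corpus_tokens)
# {'O11': 2, 'O12': 2, 'O21': 0, 'O22': 6}
```

The four cells always add up to the corpus size, and the same association measures as for collocates can then be computed from them.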

You can of course also define a subcorpus via a corpus query, e.g.

dump = corpus.query('"Atomkraft" expand to s')
keywords = dump.keywords()

Acknowledgements

The module relies on cwb-python; thanks to Yannick Versley and Jorg Asmussen for the implementation. Special thanks to Markus Opolka for the implementation of association-measures and for forcing me to write tests.

This work was supported by the Emerging Fields Initiative (EFI) of Friedrich-Alexander-Universität Erlangen-Nürnberg, project title Exploring the Fukushima Effect.

Further development of the package has been funded by the Deutsche Forschungsgemeinschaft (DFG) within the project Reconstructing Arguments from Noisy Text, grant number 377333057, as part of the Priority Program Robust Argumentation Machines (SPP-1999).
