CWB wrapper to extract concordances and score frequency lists

Collocation and Concordance Computation

cwb-ccc is a Python 3 wrapper around the IMS Open Corpus Workbench (CWB). The main purpose of the module is to run queries, extract concordance lines, and score frequency lists (particularly to extract collocates and keywords).

The Quickstart should get you started. For a more detailed overview of the functionality, see the Vignette.

Installation
Quickstart
Testing
Acknowledgements

Installation

The module needs a working installation of the CWB and operates on CWB-indexed corpora. If you want to run queries with more than two anchor points, you will need CWB version 3.4.16 or later.
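Before installing, it can help to check that the CWB binaries are reachable and that your version meets the 3.4.16 threshold. The following is a minimal sketch, not part of cwb-ccc: `cqp` is the CWB query processor binary, the helper names `parse_version` and `supports_many_anchors` are hypothetical, and the version string is assumed to be something you read off your own CWB installation.

```python
import shutil

MIN_CWB_FOR_MANY_ANCHORS = (3, 4, 16)  # threshold stated above


def parse_version(version):
    """Turn a dotted version string such as '3.4.33' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def supports_many_anchors(version):
    """True if this CWB version allows queries with more than two anchor points."""
    return parse_version(version) >= MIN_CWB_FOR_MANY_ANCHORS


# shutil.which returns None when the binary is not on the PATH
cqp_path = shutil.which("cqp")
print("cqp found at:", cqp_path if cqp_path else "not found on PATH")
print("3.4.16 is sufficient:", supports_many_anchors("3.4.16"))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexicographically, where "3.4.9" would wrongly sort after "3.4.16".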

You can install this module with pip from PyPI:

python -m pip install cwb-ccc

You can also clone the source from GitHub, cd into the respective folder, and build your own wheel:

python -m pip install pipenv
pipenv install --dev
pipenv run python3 setup.py bdist_wheel

Quickstart

Accessing Corpora

To list all available corpora, you can use

from ccc import Corpora

corpora = Corpora(registry_path="/usr/local/share/cwb/registry/")

All further methods rely on the Corpus class, which establishes the connection to your CWB-indexed corpus:

from ccc import Corpus

corpus = Corpus(
  corpus_name="GERMAPARL1386",
  registry_path="/usr/local/share/cwb/registry/"
)

This will raise a KeyError if the named corpus is not in the specified registry.
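If you open corpora by name from user input, you may want to catch that KeyError. The sketch below shows one way to do so. The wrapper `open_corpus` and the stand-in `fake_corpus` are hypothetical names introduced here so that the example runs without a CWB installation; in real use you would pass `Corpus` from `ccc` as the factory.

```python
def open_corpus(corpus_factory, corpus_name, registry_path):
    """Return a corpus object, or None if the name is not in the registry.

    corpus_factory is any callable that accepts the keyword arguments
    corpus_name and registry_path (hypothetical wrapper, not part of cwb-ccc).
    """
    try:
        return corpus_factory(corpus_name=corpus_name, registry_path=registry_path)
    except KeyError:
        print(f"corpus {corpus_name!r} not found in {registry_path}")
        return None


def fake_corpus(corpus_name, registry_path):
    """Stand-in for ccc.Corpus: raises KeyError for unknown corpus names."""
    known = {"GERMAPARL1386"}
    if corpus_name not in known:
        raise KeyError(corpus_name)
    return {"name": corpus_name, "registry": registry_path}


registry = "/usr/local/share/cwb/registry/"
found = open_corpus(fake_corpus, "GERMAPARL1386", registry)
missing = open_corpus(fake_corpus, "NO_SUCH_CORPUS", registry)
```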

Queries and Dumps

The usual starting point for using this module is to run a query with corpus.query(), which accepts valid CQP queries such as

dump = corpus.query('[lemma="Arbeit"]', context_break='s')

The result is a Dump object; at its core is a pandas DataFrame with corpus positions (similar to CWB dumps).

Concordancing

You can access concordance lines via the concordance() method of the dump. This method returns a DataFrame with information about the query matches in context:

dump.concordance()

match	matchend	word
151	151	Er brachte diese Erfahrung in seine Arbeit im Ausschuß für Familie , Senioren , Frauen und Jugend sowie im Petitionsausschuß ein , wo er sich vor allem
227	227	Seine Arbeit und sein Rat werden uns fehlen .
1493	1493	Ausschuß für Arbeit und Sozialordnung
1555	1555	Ausschuß für Arbeit und Sozialordnung
1598	1598	Ausschuß für Arbeit und Sozialordnung
...	...	...

This retrieves concordance lines in simple format, in the order in which they appear in the corpus. It is often more informative to retrieve them in KWIC format and in random order:

dump.concordance(form='kwic', order='random')

match	matchend	left_word	node_word	right_word
81769	81769	Ich unterstütze daher nachträglich die Forderung , daß die Durchführung des Gesetzes auch künftig durch die Bundesanstalt für	Arbeit	vorgenommen wird ; denn beim Bund gibt es die entsprechend ausgebildeten Sachbearbeiter .
8774	8774	Glauben Sie im Ernst , Sie könnten am Ende ein Bündnis für	Arbeit	, eine Wende in der deutschen Politik , die Bekämpfung der Arbeitslosigkeit erreichen , wenn Sie nicht die Länder ,
8994	8994	alle Entscheidungen gemeinsam zu treffen , die sich gegen Schwarzarbeit und illegale	Arbeit	wenden , und gemeinsam nach einem Weg zu suchen ,
80098	80098	: Was der Vermittlungsausschuß mit Mehrheit zum Meister-BAföG beschlossen hat , heißt , daß die bewährten Institutionen der Bundesanstalt für	Arbeit	, die die Ausbildungsförderung für Meister bis zum Jahr 1993 durchgeführt haben , die darin große Erfahrung haben , die
61056	61056	Selbst wenn Sie ein Konstrukt anbieten , das tendenziell die zusätzliche Belastung der Bundesanstalt für	Arbeit	etwas geringer hielte als die Entlastung bei der gesetzlichen Rentenversicherung , so wäre dies bei einem deutlichen Aufwuchs der Arbeitslosigkeit
...	...	...	...	...
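To make the KWIC layout concrete, here is a small sketch of how a line can be split into left context, node, and right context from inclusive match and matchend positions. This is an illustration only, not how cwb-ccc is implemented: `kwic_line` is a hypothetical helper, the positions are relative to the toy sentence rather than corpus positions, and the context break is not modelled because the toy input is a single sentence.

```python
def kwic_line(tokens, match, matchend, window=5):
    """Split a token list into left context, node and right context.

    match and matchend are inclusive positions, as in the tables above.
    Hypothetical helper for illustration; not part of cwb-ccc.
    """
    left = tokens[max(0, match - window):match]
    node = tokens[match:matchend + 1]
    right = tokens[matchend + 1:matchend + 1 + window]
    return " ".join(left), " ".join(node), " ".join(right)


# One sentence from the concordance above, tokenised on whitespace.
sentence = "Seine Arbeit und sein Rat werden uns fehlen .".split()
left, node, right = kwic_line(sentence, match=1, matchend=1, window=3)
print(left, "|", node, "|", right)
```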

Collocation Analyses

After executing a query, you can use dump.collocates() to extract collocates for a given window size (symmetric windows around the corpus matches). The result will be a DataFrame with lemmata as index and frequency signatures and association measures as columns:

dump.collocates()

item	O11	O12	O21	O22	R1	R2	C1	C2	N	E11	E12	E21	E22	z_score	t_score	log_likelihood	simple_ll	min_sensitivity	liddell	dice	log_ratio	conservative_log_ratio	mutual_information	local_mutual_information	ipm	ipm_reference	ipm_expected	in_nodes	marginal
für	46	730	831	148102	776	148933	877	148832	149709	4.54583	771.454	872.454	148061	19.4429	6.11208	134.301	130.019	0.052452	0.047547	0.055656	3.40925	2.26335	1.00514	46.2366	59278.4	5579.69	5858.03	0	877
,	43	733	7827	141106	776	148933	7870	141839	149709	40.7933	735.207	7829.21	141104	0.345505	0.336523	0.124564	0.117278	0.005464	0.000296	0.009947	0.076412	0	0.02288	0.983836	55412.4	52553.8	52568.6	0	7870
.	33	743	5626	143307	776	148933	5659	144050	149709	29.3328	746.667	5629.67	143303	0.677108	0.638378	0.461005	0.440481	0.005831	0.000673	0.010256	0.170891	0	0.05116	1.68829	42525.8	37775.4	37800	0	5659
und	32	744	2848	146085	776	148933	2880	146829	149709	14.9282	761.072	2865.07	146068	4.41852	3.0179	15.1452	14.6555	0.011111	0.006044	0.017505	1.10866	0	0.331144	10.5966	41237.1	19122.7	19237.3	0	2880
in	24	752	2474	146459	776	148933	2498	147211	149709	12.9481	763.052	2485.05	146448	3.07138	2.25596	7.72813	7.51722	0.009608	0.004499	0.014661	0.896724	0	0.268005	6.43212	30927.8	16611.5	16685.7	0	2498
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
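The idea of a symmetric window can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the cwb-ccc implementation: `window_counts` is a hypothetical helper, node tokens are excluded from the counts, and a token that falls into the windows of two matches is counted once per match, which the library may handle differently.

```python
from collections import Counter


def window_counts(tokens, matches, window=5):
    """Count tokens within `window` positions left and right of each match.

    matches is a list of inclusive (match, matchend) position pairs.
    Node tokens themselves are not counted (hypothetical helper).
    """
    counts = Counter()
    for match, matchend in matches:
        start = max(0, match - window)
        end = min(len(tokens), matchend + 1 + window)
        for position in range(start, end):
            if position < match or position > matchend:
                counts[tokens[position]] += 1
    return counts


# A line from the concordance above; the node "Arbeit" is at position 2.
tokens = "Ausschuß für Arbeit und Sozialordnung".split()
counts = window_counts(tokens, matches=[(2, 2)], window=1)
print(counts)
```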

The DataFrame contains the frequency counts and is annotated with all association measures available in the pandas-association-measures package (see parameter ams).
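Some of the columns can be reproduced by hand from the observed frequencies. The sketch below does this for the "für" row of the table, using the standard textbook definitions of expected frequency, z-score, t-score, and Dice coefficient. That these are exactly the formulas used by pandas-association-measures is an assumption, although the results agree with the table to the precision shown.

```python
from math import sqrt

# Observed frequencies for "für", copied from the table above.
O11, O12, O21, O22 = 46, 730, 831, 148102

R1 = O11 + O12                # row total: tokens in the windows
C1 = O11 + O21                # column total: all occurrences of the item
N = O11 + O12 + O21 + O22     # total number of tokens considered

E11 = R1 * C1 / N             # expected co-occurrence frequency
z_score = (O11 - E11) / sqrt(E11)
t_score = (O11 - E11) / sqrt(O11)
dice = 2 * O11 / (R1 + C1)

print(f"E11={E11:.5f} z={z_score:.4f} t={t_score:.5f} dice={dice:.6f}")
```

The marginals computed here (R1, C1, N) are the same values that appear as separate columns in the table, so only the four observed frequencies are needed as input.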

Keyword Analyses

Having created a subcorpus (a dump)

dump = corpus.query(s_query='text_party', s_values={'CDU', 'CSU'})

you can use its keywords() method for retrieving keywords:

dump.keywords(order='conservative_log_ratio')

item	O11	O12	O21	O22	R1	R2	C1	C2	N	E11	E12	E21	E22	z_score	t_score	log_likelihood	simple_ll	min_sensitivity	liddell	dice	log_ratio	conservative_log_ratio	mutual_information	local_mutual_information	ipm	ipm_reference	ipm_expected
deswegen	55	41296	37	108412	41351	108449	92	149708	149800	25.3958	41325.6	66.6042	108382	5.87452	3.99183	41.5308	25.794	0.00133	0.321982	0.002654	1.96293	0.404166	0.335601	18.458	1330.08	341.174	614.152
CSU	255	41096	380	108069	41351	108449	635	149165	149800	175.286	41175.7	459.714	107989	6.02087	4.99187	46.6543	31.7425	0.006167	0.126068	0.012147	0.81552	0.212301	0.162792	41.512	6166.72	3503.95	4238.99
CDU	260	41091	390	108059	41351	108449	650	149150	149800	179.427	41171.6	470.573	107978	6.01515	4.99693	46.6055	31.7289	0.006288	0.124499	0.012381	0.80606	0.209511	0.161086	41.8823	6287.64	3596.16	4339.12
in	867	40484	1631	106818	41351	108449	2498	147302	149800	689.551	40661.4	1808.45	106641	6.75755	6.02647	61.2663	42.1849	0.020967	0.072241	0.039545	0.47937	0.168901	0.099452	86.2253	20966.8	15039.3	16675.6
Wirtschaft	39	41312	25	108424	41351	108449	64	149736	149800	17.6666	41333.3	46.3334	108403	5.07554	3.41607	30.9328	19.1002	0.000943	0.333476	0.001883	2.03257	0.150982	0.34391	13.4125	943.145	230.523	427.236
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

Just as with collocates, the result is a DataFrame with lemmata as index and frequency signatures and association measures as columns.
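The relative-frequency columns and the log ratio can be checked by hand as well. The following sketch recomputes them for the "deswegen" row, assuming that ipm stands for instances per million tokens and that log_ratio is the binary logarithm of the ratio of the two relative frequencies. These definitions are assumptions, but they reproduce the values in the table to the precision shown.

```python
from math import log2

# Observed frequencies for "deswegen", copied from the table above.
O11, O12, O21, O22 = 55, 41296, 37, 108412

R1 = O11 + O12   # size of the subcorpus (target)
R2 = O21 + O22   # size of the reference

ipm = O11 / R1 * 1_000_000            # instances per million in the subcorpus
ipm_reference = O21 / R2 * 1_000_000  # instances per million in the reference
log_ratio = log2(ipm / ipm_reference)

print(f"ipm={ipm:.2f} ipm_reference={ipm_reference:.3f} log_ratio={log_ratio:.5f}")
```

Note that R1 + R2 equals 149,800, the size of the whole corpus, so in this example the reference is the part of the corpus outside the subcorpus.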

Testing

The module ships with a small test corpus ("GERMAPARL1386"), which contains all speeches of the 86th session of the 13th German Bundestag on February 8, 1996.

The corpus consists of 149,800 tokens in 7,332 paragraphs (s-attribute "p" with annotation "type", either "regular" or "interjection") split into 11,364 sentences (s-attribute "s"). In addition to the default "word", the p-attributes are "pos" and "lemma":

corpus.attributes_available

type	attribute	annotation	active
p-Att	word	False	True
p-Att	pos	False	False
p-Att	lemma	False	False
s-Att	corpus	False	False
s-Att	corpus_name	True	False
s-Att	sitzung	False	False
s-Att	sitzung_date	True	False
s-Att	sitzung_period	True	False
s-Att	sitzung_session	True	False
s-Att	div	False	False
s-Att	div_desc	True	False
s-Att	div_n	True	False
s-Att	div_type	True	False
s-Att	div_what	True	False
s-Att	text	False	False
s-Att	text_id	True	False
s-Att	text_name	True	False
s-Att	text_parliamentary_group	True	False
s-Att	text_party	True	False
s-Att	text_position	True	False
s-Att	text_role	True	False
s-Att	text_who	True	False
s-Att	p	False	False
s-Att	p_type	True	False
s-Att	s	False	False

The corpus is located in this repository. All tests are written using this corpus (as well as some reference counts and scores from the UCS toolkit and some additional frequency lists). Make sure you install all development dependencies (especially pytest):

python -m pip install pipenv
pipenv install --dev

You can then simply run

make build
make test
make coverage

Acknowledgements

The module includes a slight adaptation of cwb-python, a Python port of Perl's CWB::CL; thanks to Yannick Versley for the implementation.
Special thanks to Markus Opolka for the original implementation of association-measures and for forcing me to write tests.
The test corpus was extracted from the GermaParl corpus (see the PolMine Project); many thanks to Andreas Blätte.
This work was supported by the Emerging Fields Initiative (EFI) of Friedrich-Alexander-Universität Erlangen-Nürnberg, project title Exploring the Fukushima Effect (2017-2020).
Further development of the package is funded by the Deutsche Forschungsgemeinschaft (DFG) within the project Reconstructing Arguments from Noisy Text, grant number 377333057 (2018-2023), as part of the Priority Program Robust Argumentation Machines (SPP-1999).


This version: 0.11.0 (Oct 17, 2022)
