A Python library for efficient corpus analysis, enabling corpus linguistic analysis in Jupyter notebooks.
Conc
Introduction to Conc
Conc is a Python library that brings corpus linguistic analysis to Jupyter notebooks. A staple of data science, Jupyter notebooks are a great model for presenting analysis that combines code, reporting and discussion in a way that can be reproduced. Conc aims to allow researchers to analyse large corpora in efficient ways using standard hardware, with the ability to produce clear, publication-ready reports and extend analysis where required using standard Python libraries.
Conc uses spaCy for tokenising texts. More spaCy functionality will be supported in future releases.
Conc Principles
- use standard Python libraries for data analysis (e.g. NumPy, SciPy, JupyterLab)
- use vector operations where possible
- use fast code libraries over slow ones (e.g. Conc uses Polars rather than Pandas; you can still output Pandas dataframes if you want to use them)
- provide important information when reporting results
- pre-compute time-intensive and repeatedly used views of the data
- work with smaller slices of the data where possible
- cache specific analysis during a session to reduce computation for repeated calls
- document corpus representations so that they can be worked with directly
- provide access to Conc results for further processing with standard Python libraries
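The session-caching principle can be illustrated with Python's standard `functools.lru_cache`. This is a sketch of the idea only, with a made-up mini-corpus; Conc's internal caching mechanism may work differently:

```python
from functools import lru_cache

# Hypothetical mini-corpus for illustration only.
CORPUS = ["the cat sat", "the dog sat", "the cat ran"]

@lru_cache(maxsize=None)
def token_frequency(token: str) -> int:
    """Count occurrences of `token` across the corpus.

    The expensive scan runs once per distinct token; repeated calls
    during a session are served from the cache.
    """
    return sum(text.split().count(token) for text in CORPUS)

print(token_frequency("the"))  # computed: 3
print(token_frequency("the"))  # served from cache: 3
print(token_frequency.cache_info().hits)  # 1
```

The same trade-off applies at corpus scale: recomputation is avoided at the cost of holding results in memory for the session.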
Development Status
Conc is in active development and is currently released for beta testing. The GitHub repository may be ahead of the PyPI version, so for the latest functionality install from GitHub (see below); note that the GitHub code is pre-release and may change. For the latest stable release, install from PyPI (pip install conc). The documentation reflects the most recent functionality. See the CHANGELOG for notes on releases and the Roadmap below for upcoming features.
Acknowledgements
Conc is developed by Dr Geoff Ford.
Conc originated in my PhD research, which included development of a web-based corpus browser to handle analysis of large corpora. I’ve been developing Conc through my subsequent research.
Work to create this Python library has been made possible by funding/support from:
- “Mapping LAWS: Issue Mapping and Analyzing the Lethal Autonomous Weapons Debate” (Royal Society of New Zealand’s Marsden Fund Grant 19-UOC-068)
- “Into the Deep: Analysing the Actors and Controversies Driving the Adoption of the World’s First Deep Sea Mining Governance” (Royal Society of New Zealand’s Marsden Fund Grant 22-UOC-059)
- Sabbatical, University of Canterbury, Semester 1 2025.
Thanks to the Mapping LAWS project team for their support and feedback as first users of ConText (a web-based application built on an earlier version of Conc).
Dr Ford is a researcher with Te Pokapū Aronui ā-Matihiko | UC Arts Digital Lab (ADL). Thanks to the ADL team and the ongoing support of the University of Canterbury’s Faculty of Arts who make work like this possible.
Installation
Install via pip
You can install Conc from PyPI using this command:
$ pip install conc
To install the latest development version of Conc, which may be ahead of the version on PyPI, you can install from the GitHub repository:
$ pip install git+https://github.com/polsci/conc.git
Install a language model
The first releases of Conc require a spaCy language model for tokenisation. After installing Conc, install a model. Here’s an example of how to install spaCy’s small English model, which is Conc’s default language model:
python -m spacy download en_core_web_sm
If you are working with a different language or want to use a different ‘en’ model, check the spaCy models documentation for the relevant model name.
Install optional dependencies
Conc has some optional dependencies for downloading source texts to create sample corpora. These are primarily intended for development, so to keep Conc’s requirements minimal they are not installed by default. If you want sample corpora to try out Conc’s functionality, install them with the following command.
$ pip install nltk requests datasets
Pre-2013 CPU? Install Polars with support for older machines
Polars is optimized for modern CPUs with support for AVX2 instructions. If you get kernel crashes running Conc on an older machine (probably pre-2013), this is likely to be an issue with Polars. Polars has an alternate installation option to support older machines, which installs a Polars build compiled without AVX2 support. Replace the standard Polars package with the legacy-support package to use Conc on older machines.
$ pip uninstall polars
$ pip install polars-lts-cpu
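If you are unsure whether your CPU supports AVX2, a quick check on Linux is to look for the flag in `/proc/cpuinfo`. This helper is illustrative and not part of Conc; it conservatively returns `False` where that file does not exist (e.g. macOS, Windows):

```python
from pathlib import Path

def has_avx2() -> bool:
    """Return True if /proc/cpuinfo lists the avx2 flag (Linux only).

    Illustrative helper, not part of Conc. On macOS or Windows the file
    does not exist, so this returns False; check your CPU model's
    specifications there instead.
    """
    cpuinfo = Path("/proc/cpuinfo")
    if not cpuinfo.exists():
        return False
    return "avx2" in cpuinfo.read_text()

print(has_avx2())
```

If this prints `False` on a Linux machine, the polars-lts-cpu build above is the safer choice.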
Using Conc
A good place to start is TODO, which demonstrates how to build a corpus and output Conc reports.
The documentation site provides a reference for Conc functionality and examples of how to create reports for analysis. The current Conc components are listed below.
| Class / Function | Module | Functionality | Note |
|---|---|---|---|
| Corpus | conc.corpus | Build, load and get information on a corpus; methods to work with a corpus | Required |
| Conc | conc.conc | Interface to Conc reports for corpus analysis | Recommended way to access reports for analysis; requires a corpus created by the Corpus module |
| Text | conc.text | Output text from the corpus | Access via Corpus |
| Frequency | conc.frequency | Frequency reporting | Access via Conc |
| Ngrams | conc.ngrams | Reporting on ngram frequencies across the corpus and ngrams containing specific tokens | Access via Conc |
| Concordance | conc.concordance | Concordancing | Access via Conc |
| Keyness | conc.keyness | Reporting for keyness analysis | Access via Conc |
| Collocates | conc.collocates | Reporting for collocation analysis | Access via Conc |
| Result | conc.result | Handles report results; output a result as a table or get a dataframe | Used by all reports |
| ConcLogger | conc.core | Logger | Logging implemented in all modules |
| CorpusMetadata | conc.core | Class to validate corpus metadata JSON | Used by Corpus class |
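To make the report types concrete, here is a minimal keyword-in-context (KWIC) concordance in plain Python. This is not Conc’s implementation (Conc’s Concordance report works over an indexed corpus, not raw strings); it simply illustrates the kind of view a concordance presents:

```python
def kwic(text: str, node: str, window: int = 3) -> list[str]:
    """Return keyword-in-context lines for `node`: up to `window` tokens
    of left and right context around each occurrence.

    Illustrative sketch only; Conc's concordancing is built on an
    indexed corpus representation.
    """
    tokens = text.lower().split()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>30} | {tok} | {right}")
    return lines

sample = "the cat sat on the mat and the cat ran off"
for line in kwic(sample, "cat"):
    print(line)
```

Each printed line aligns the node token in a centre column with its left and right context, which is the standard concordance layout.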
The conc.core module implements a number of helpful functions:

| Function | Functionality |
|---|---|
| list_corpora | Scan a directory for corpora and return a summary |
| get_stop_words | Get a spaCy stop word list for a specific model |
| Various (see Get data sources) | Functions to download source texts to create sample corpora. Primarily intended for development/testing. To minimize requirements, not all libraries are installed by default; the functions raise errors with information on installing the required libraries. |
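Keyness analysis, as provided by the conc.keyness report above, typically compares a token’s frequency in a study corpus against a reference corpus. A widely used statistic is the G2 log-likelihood (Dunning 1993); the sketch below shows the standard two-corpus formulation of that technique only, and the statistics Conc actually reports may differ, so check the conc.keyness documentation:

```python
import math

def log_likelihood(freq_study: int, size_study: int,
                   freq_ref: int, size_ref: int) -> float:
    """Two-corpus G2 log-likelihood keyness statistic (Dunning 1993).

    Expected frequencies are allocated in proportion to corpus sizes;
    a higher G2 means the token's observed frequency departs more from
    expectation. Illustrative only; not Conc's implementation.
    """
    total = size_study + size_ref
    combined = freq_study + freq_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# A token occurring 150 times in a 100k-token study corpus vs
# 100 times in a 200k-token reference corpus.
print(round(log_likelihood(150, 100_000, 100, 200_000), 2))
```

When the token’s relative frequency is identical in both corpora the statistic is zero; larger values indicate stronger keyness.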
Roadmap
Short-term
- add tutorial / getting started notebook
- add citation information
- extend caching support to all intensive reports, revise storage of cached results for in-memory/disk option
- relegate some logger warnings to debug level and audit logger messages for consistency and clarity for users
- add support for build from datasets library
- anatomy - explain token2doc_index -1 and has_spaces on tokens display and various other fields for vocab.
- Corpus tokenize support for functionality from earlier versions of Conc for wildcards, multiple strings, case insensitive tokenization
- ngrams method - implement case handling
- get_ngrams_by_index - implement case handling
- improve concordance ordering so it is not limited to fixed options (e.g. include 3R1R2R)
- improve ngram support for ngram token position beyond LEFT/RIGHT (i.e. define positions relative to ngram, or ANY)
- concordancing - add in ordering by metadata columns or doc
- annotations support for spaCy POS, TAG, SENT_START, LEMMA
- move tokens sort order to build process - takes > 1 second for large corpora, but not needed for all results
- shift more processing from in-memory to polars with support for streaming or in-memory processing
- revisit polars streaming - potentially implement a batched write for very large files i.e. splitting vocab/tokens files into smaller chunks to reduce memory usage.
Medium-term
- Support for processing backends other than spaCy (i.e. other tokenizers)
Developer Guide
The instructions below are only relevant if you want to contribute to Conc. The nbdev library is used for development. If you are new to nbdev, here are some useful pointers to get you started (or visit the nbdev website).
Install conc in Development mode
# make sure conc package is installed in development mode
$ pip install -e .
# make changes under nbs/ directory
# ...
# compile to have changes apply to conc
$ nbdev_prepare