A Python library for efficient corpus analysis, enabling corpus linguistic analysis in Jupyter notebooks.

Conc

Introduction to Conc

Conc is a Python library that brings corpus linguistic analysis to Jupyter notebooks. A staple of data science, Jupyter notebooks are a great model for presenting analysis that combines code, reporting and discussion in a way that can be reproduced. Conc aims to allow researchers to analyse large corpora efficiently on standard hardware, produce clear, publication-ready reports, and extend analysis where required using standard Python libraries.

Conc uses spaCy for tokenising texts. More spaCy functionality will be supported in future releases.

Conc Principles

  • use standard Python libraries for data analysis (e.g. NumPy, SciPy, JupyterLab)
  • use vector operations where possible
  • use fast code libraries over slow ones (e.g. Conc uses Polars rather than Pandas, though you can still output Pandas dataframes if you want to use them)
  • provide important information when reporting results
  • pre-compute time-intensive and repeatedly used views of the data
  • work with smaller slices of the data where possible
  • cache specific analysis during a session to reduce computation for repeated calls
  • document corpus representations so that they can be worked with directly
  • provide a way to access Conc results for further processing with standard Python libraries
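The session-caching principle above can be sketched with the standard library alone: a stand-in for an expensive frequency computation is memoised with `functools.lru_cache`, so repeated calls within a session cost nothing. The names here are illustrative, not Conc's actual internals.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def token_frequencies(corpus_id: str) -> tuple:
    """Hypothetical expensive computation, cached for the session."""
    # Stand-in for scanning a tokenised corpus; imagine this is slow.
    tokens = ("the", "cat", "sat", "on", "the", "mat")
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    # Return a hashable, immutable result so it can be cached safely.
    return tuple(sorted(counts.items()))


first = token_frequencies("example-corpus")
second = token_frequencies("example-corpus")  # served from the cache
print(first)
print(token_frequencies.cache_info().hits)  # → 1
```

Returning an immutable tuple matters: cached results are shared between calls, so a mutable return value could be corrupted by one caller for all the others.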

Development Status

Conc is in active development and is currently released for beta testing. The GitHub repository may be ahead of the PyPI version, so for the latest functionality install from GitHub (see below); note that the GitHub code is pre-release and may change. For the latest stable release, install from PyPI (pip install conc). The documentation reflects the most recent functionality. See the CHANGELOG for notes on releases and the Roadmap below for upcoming features.

Acknowledgements

Conc is developed by Dr Geoff Ford.

Conc originated in my PhD research, which included development of a web-based corpus browser to handle analysis of large corpora. I’ve been developing Conc through my subsequent research.

Work to create this Python library has been made possible by funding/support from:

  • “Mapping LAWS: Issue Mapping and Analyzing the Lethal Autonomous Weapons Debate” (Royal Society of New Zealand’s Marsden Fund Grant 19-UOC-068)
  • “Into the Deep: Analysing the Actors and Controversies Driving the Adoption of the World’s First Deep Sea Mining Governance” (Royal Society of New Zealand’s Marsden Fund Grant 22-UOC-059)
  • Sabbatical, University of Canterbury, Semester 1 2025.

Thanks to the Mapping LAWS project team for their support and feedback as first users of ConText (a web-based application built on an earlier version of Conc).

Dr Ford is a researcher with Te Pokapū Aronui ā-Matihiko | UC Arts Digital Lab (ADL). Thanks to the ADL team and the ongoing support of the University of Canterbury’s Faculty of Arts who make work like this possible.

Installation

Install via pip

You can install Conc from PyPI using this command:

$ pip install conc

To install the latest development version of Conc, which may be ahead of the version on PyPI, you can install from the repository:

$ pip install git+https://github.com/polsci/conc.git

Install a language model

The first releases of Conc require a spaCy language model for tokenisation. After installing Conc, install a model. Here’s an example of how to install spaCy’s small English model, which is Conc’s default language model:

$ python -m spacy download en_core_web_sm

If you are working with a different language or want to use a different ‘en’ model, check the spaCy models documentation for the relevant model name.
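If you want to check at runtime that the model is in place before building a corpus, a small guard like the following works: `spacy.load` raises `OSError` when the model package has not been downloaded. The helper name is ours, not part of Conc.

```python
def model_available(name: str = "en_core_web_sm") -> bool:
    """Return True if spaCy and the named model can both be loaded."""
    try:
        import spacy
        spacy.load(name)
        return True
    except (ImportError, OSError):
        # spaCy is missing, or the model has not been downloaded yet.
        return False


if not model_available():
    print("Run: python -m spacy download en_core_web_sm")
```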

Install optional dependencies

Conc has some optional dependencies you can install to download source texts for creating sample corpora. These are primarily intended for development. To minimise Conc’s requirements, they are not installed by default. If you want sample corpora to try out Conc’s functionality, install them with the following command.

$ pip install nltk requests datasets

Pre-2013 CPU? Install Polars with support for older machines

Polars is optimised for modern CPUs with support for AVX2 instructions. If you get kernel crashes running Conc on an older machine (roughly pre-2013), the likely cause is Polars. Polars provides an alternative build compiled without AVX2 support; replace the standard Polars package with the legacy-support package to use Conc on older machines.

$ pip uninstall polars
$ pip install polars-lts-cpu
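If you are unsure whether your machine is affected, you can check whether the CPU advertises AVX2 before installing. This is a Linux-only sketch that reads /proc/cpuinfo; on other platforms it simply reports False rather than guessing.

```python
def cpu_has_avx2() -> bool:
    """Best-effort check for AVX2 support via /proc/cpuinfo (Linux only)."""
    try:
        with open("/proc/cpuinfo") as f:
            # CPU feature flags appear as a space-separated list per core.
            return "avx2" in f.read()
    except OSError:
        # Not Linux (or /proc unavailable), so we cannot tell.
        return False


print("AVX2 supported:", cpu_has_avx2())
```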

Using Conc

A good place to start is TODO, which demonstrates how to build a corpus and output Conc reports.

The documentation site provides a reference for Conc functionality and examples of how to create reports for analysis. The current Conc components are listed below.

| Class / Function | Module | Functionality | Note |
| --- | --- | --- | --- |
| Corpus | conc.corpus | Build, load and get information on a corpus; methods to work with a corpus | Required |
| Conc | conc.conc | Interface to Conc reports for corpus analysis | Recommended way to access reports for analysis; requires a corpus created by the Corpus module |
| Text | conc.text | Output text from the corpus | Access via Corpus |
| Frequency | conc.frequency | Frequency reporting | Access via Conc |
| Ngrams | conc.ngrams | Reporting on ngram frequencies across the corpus and ngrams containing specific tokens | Access via Conc |
| Concordance | conc.concordance | Concordancing | Access via Conc |
| Keyness | conc.keyness | Reporting for keyness analysis | Access via Conc |
| Collocates | conc.collocates | Reporting for collocation analysis | Access via Conc |
| Result | conc.result | Handles report results; output a result as a table or get a dataframe | Used by all reports |
| ConcLogger | conc.core | Logger | Logging implemented in all modules |
| CorpusMetadata | conc.core | Class to validate corpus metadata JSON | Used by Corpus class |
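Based on the component table above, a session typically starts from the Corpus and Conc classes. The sketch below only checks that those imports resolve; the method calls in the comments are assumptions for illustration, not Conc’s documented API — see the documentation site for the actual calls.

```python
# Import paths follow the module names in the component table.
try:
    from conc.corpus import Corpus
    from conc.conc import Conc
    installed = True
except ImportError:
    installed = False

print("Conc importable:", installed)

# A session might then look like this (method names assumed, see the docs):
#   corpus = Corpus(...)      # build, or load a previously built, corpus
#   reports = Conc(corpus)    # recommended interface to reports
#   reports.frequencies()     # e.g. a frequency report for the corpus
```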

The conc.core module also implements a number of helpful functions:

| Function | Functionality |
| --- | --- |
| list_corpora | Scan a directory for corpora and return a summary |
| get_stop_words | Get a spaCy stop word list for a specific model |
| Various - see Get data sources | Functions to download source texts to create sample corpora. Primarily intended for development/testing. To minimise requirements, not all libraries are installed by default; functions will raise errors with information on installing required libraries. |
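As an illustration of what a list_corpora-style scan involves, here is a standard-library sketch — not Conc’s implementation. It assumes each corpus is a subdirectory containing a metadata JSON file, which is a guess suggested by the CorpusMetadata class above.

```python
import json
import tempfile
from pathlib import Path


def scan_corpora(directory: str) -> list:
    """Return (name, description) pairs for subdirectories with metadata JSON."""
    found = []
    for meta in sorted(Path(directory).glob("*/metadata.json")):
        info = json.loads(meta.read_text())
        found.append((meta.parent.name, info.get("description", "")))
    return found


# Demonstrate against a throwaway directory layout.
with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = Path(tmp) / "toy-corpus"
    corpus_dir.mkdir()
    (corpus_dir / "metadata.json").write_text(json.dumps({"description": "demo"}))
    print(scan_corpora(tmp))  # → [('toy-corpus', 'demo')]
```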

Roadmap

Short-term

  • add tutorial / getting started notebook
  • add citation information
  • extend caching support to all intensive reports, revise storage of cached results for in-memory/disk option
  • relegate some logger warnings to debug level and audit logger messages for consistency and clarity for users
  • add support for build from datasets library
  • anatomy - explain token2doc_index -1 and has_spaces on tokens display and various other fields for vocab.
  • Corpus tokenize support for functionality from earlier versions of Conc for wildcards, multiple strings, case insensitive tokenization
  • ngrams method - implement case handling
  • get_ngrams_by_index - implement case handling
  • improve concordance ordering so not fixed options e.g. include 3R1R2R
  • improve ngram support for ngram token position beyond LEFT/RIGHT (i.e. define positions relative to ngram, or ANY)
  • concordancing - add in ordering by metadata columns or doc
  • annotations support for spaCy POS, TAG, SENT_START, LEMMA
  • move tokens sort order to build process - takes > 1 second for large corpora, but not needed for all results
  • shift more processing from in-memory to polars with support for streaming or in-memory processing
  • revisit polars streaming - potentially implement a batched write for very large files i.e. splitting vocab/tokens files into smaller chunks to reduce memory usage.

Medium-term

  • Support for processing backends other than spaCy (i.e. other tokenizers)

Developer Guide

The instructions below are only relevant if you want to contribute to Conc. The nbdev library is being used for development. If you are new to nbdev, visit the nbdev website for useful pointers to get you started.

Install conc in Development mode

# make sure conc package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to conc
$ nbdev_prepare
