A Python library for efficient corpus analysis, enabling corpus linguistic analysis in Jupyter notebooks.

Conc

Introduction to Conc

Conc is a Python library that brings corpus linguistic analysis to Jupyter notebooks. A staple of data science, Jupyter notebooks are a great model for presenting analysis: an interactive form that combines code, reporting and discussion, allowing other researchers to reproduce and interact with your analysis. Conc aims to let researchers analyse large corpora efficiently on standard hardware, produce clear, publication-ready reports, and extend analysis where required using standard Python libraries.

Conc uses spaCy for tokenising texts. Support for spaCy's text annotation functionality is coming soon.

Conc Principles

  • use standard Python libraries for data processing, analysis and visualisation (e.g. NumPy, SciPy, Polars, Plotly)
  • use vector operations where possible
  • use fast code libraries and fast data structures (e.g. Conc uses Polars rather than Pandas internally - you can still output Pandas dataframes if you want to use them)
  • provide clear and complete information when reporting results
  • pre-compute time-intensive and repeatedly used views of the data
  • work with smaller slices of the data where possible
  • cache specific analyses during a session to reduce computation for repeated calls
  • document corpus representations so that they can be worked with directly
  • TODO - document this in walkthrough and results module and link here - allow researchers to work with Conc results and extend analysis using other Python libraries

Acknowledgements

Conc is developed by Dr Geoff Ford.

Conc originated in my PhD research, which included development of a web-based corpus browser to handle analysis of large corpora. I’ve been developing Conc through my subsequent research.

Work to create this Python library has been made possible by funding/support from:

  • “Mapping LAWS: Issue Mapping and Analyzing the Lethal Autonomous Weapons Debate” (Royal Society of New Zealand’s Marsden Fund Grant 19-UOC-068)
  • “Into the Deep: Analysing the Actors and Controversies Driving the Adoption of the World’s First Deep Sea Mining Governance” (Royal Society of New Zealand’s Marsden Fund Grant 22-UOC-059)
  • Sabbatical, University of Canterbury, Semester 1 2025.

Thanks to the Mapping LAWS project team for their support and feedback as first users of ConText (a web-based application built on an earlier version of Conc).

Dr Ford is a researcher with Te Pokapū Aronui ā-Matihiko | UC Arts Digital Lab (ADL). Thanks to the ADL team and the ongoing support of the University of Canterbury’s Faculty of Arts who make work like this possible.

Thanks to Dr Chris Thomson and Karin Stahel for their feedback on early versions of Conc.

Development Status

Conc is in active development. It is currently released for beta testing. The GitHub repository may be ahead of the PyPI version, so for the latest functionality install from GitHub (see below); the GitHub code is pre-release and may change. For the latest stable release, install from PyPI (pip install conc). The documentation reflects the most recent functionality. See the CHANGELOG for notes on releases and the Roadmap below for upcoming features.

Installation

Install via pip

You can install Conc from PyPI using this command:

$ pip install conc

To install the latest development version of Conc, which may be ahead of the version on PyPI, you can install from the repository:

$ pip install git+https://github.com/polsci/conc.git

Install a spaCy model for tokenization

The first releases of Conc require a spaCy language model for tokenization. After installing Conc, install a model. Here’s an example of how to install spaCy’s small English model, which is Conc’s default language model:

python -m spacy download en_core_web_sm

If you are working with a different language or want to use a different ‘en’ model, check the spaCy models documentation for the relevant model name.
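Because spaCy models install as ordinary Python packages, you can check whether a model is available before building a corpus. This is a general spaCy convention rather than a Conc-specific check:

```python
import importlib.util

# spaCy models are distributed as pip packages, so their presence can be
# checked without importing spaCy itself
model = "en_core_web_sm"
if importlib.util.find_spec(model) is None:
    print(f"{model} is not installed; run: python -m spacy download {model}")
else:
    print(f"{model} is installed")
```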

Install optional dependencies

Conc has some optional dependencies you can install to download source texts for creating sample corpora. These are primarily intended for development. To minimize Conc’s requirements, they are not installed by default. If you want sample corpora to test out Conc’s functionality, you can install them with the following command.

$ pip install nltk requests datasets

Pre-2013 CPU? Install Polars with support for older machines

Polars is optimized for modern CPUs with support for AVX2 instructions. If you get kernel crashes running Conc on an older machine (probably pre-2013), this is likely to be an issue with Polars. Polars has an alternate installation option to support older machines, which installs a Polars build compiled without AVX2 support. Replace the standard Polars package with the legacy-support package to use Conc on older machines.

$ pip uninstall polars
$ pip install polars-lts-cpu

Using Conc

A good place to start is TODO, which demonstrates how to build a corpus and output Conc reports.

The documentation site provides a reference for Conc functionality and examples of how to create reports for analysis. The current Conc components are listed below.

Class / Function | Module | Functionality | Note
Corpus | conc.corpus | Build, load and get information on a corpus; methods to work with a corpus | Required
Conc | conc.conc | Interface to Conc reports for corpus analysis | Recommended way to access reports for analysis; requires a corpus created by the Corpus module
Text | conc.text | Output text from the corpus | Access via Corpus
Frequency | conc.frequency | Frequency reporting | Access via Conc
Ngrams | conc.ngrams | Reporting on ngram frequencies across the corpus and ngrams containing specific tokens | Access via Conc
Concordance | conc.concordance | Concordancing | Access via Conc
Keyness | conc.keyness | Reporting for keyness analysis | Access via Conc
Collocates | conc.collocates | Reporting for collocation analysis | Access via Conc
Result | conc.result | Handles report results; output result as table or get dataframe | Used by all reports
ConcLogger | conc.core | Logger | Logging implemented in all modules
CorpusMetadata | conc.core | Class to validate corpus metadata JSON | Used by Corpus class
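The keyness module above reports statistics comparing token frequencies across corpora. As an illustration of the kind of vectorised computation involved (a sketch of the standard log-likelihood keyness measure, not Conc's actual implementation), keyness for several tokens can be computed at once with NumPy:

```python
import numpy as np

# Hypothetical counts for two tokens in a study corpus (10,000 tokens)
# and a reference corpus (20,000 tokens)
study = np.array([100.0, 5.0])      # token counts in study corpus
ref = np.array([50.0, 40.0])        # token counts in reference corpus
study_total, ref_total = 10_000.0, 20_000.0

# Expected counts under the null hypothesis of no frequency difference
total = study_total + ref_total
e_study = study_total * (study + ref) / total
e_ref = ref_total * (study + ref) / total

# Log-likelihood (G2) for all tokens in one vectorised expression
g2 = 2 * (study * np.log(study / e_study) + ref * np.log(ref / e_ref))
print(g2.round(2))  # → [69.31 12.03]
```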

The conc.core module implements a number of helpful functions:

Function | Functionality
list_corpora | Scan a directory for corpora and return a summary
get_stop_words | Get a spaCy stop word list for a specific model
Various - see Get data sources | Functions to download source texts to create sample corpora. Primarily intended for development/testing. To minimize requirements, not all libraries are installed by default; functions will raise errors with information on installing required libraries.
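The list_corpora function scans a directory and summarises the corpora found there. As an illustration of the idea only, here is a minimal stand-in: the metadata.json filename and the returned fields are assumptions for this sketch, not Conc's actual format or function:

```python
import json
import tempfile
from pathlib import Path

def scan_for_corpora(path):
    """Sketch of a list_corpora-style scan: find subdirectories containing
    a metadata JSON file and summarise them. Filename and fields here are
    illustrative assumptions, not Conc's representation."""
    summaries = []
    for meta_file in sorted(Path(path).glob("*/metadata.json")):
        meta = json.loads(meta_file.read_text())
        summaries.append({"name": meta.get("name"), "path": str(meta_file.parent)})
    return summaries

# Demonstrate with a throwaway corpus directory
with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = Path(tmp) / "toy-corpus"
    corpus_dir.mkdir()
    (corpus_dir / "metadata.json").write_text(json.dumps({"name": "Toy Corpus"}))
    print(scan_for_corpora(tmp))
```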

Roadmap

Short-term

  • add tutorial / getting started notebook
  • add citation information
  • Corpus tokenization: restore support from earlier versions of Conc for wildcards, multiple strings and case-insensitive tokenization
  • extend caching support to all intensive reports, revise storage of cached results for in-memory/disk option
  • relegate some logger warnings to debug level and audit logger messages for consistency and clarity for users
  • add support for build from datasets library
  • ngrams method - implement case handling
  • get_ngrams_by_index - implement case handling
  • improve concordance ordering so options are not fixed (e.g. include orderings like 3R1R2R)
  • improve ngram support for ngram token position beyond LEFT/RIGHT (i.e. define positions relative to ngram, or ANY)
  • concordancing - add in ordering by metadata columns or doc
  • annotations support for spaCy POS, TAG, SENT_START, LEMMA
  • shift more processing from in-memory to Polars, with support for streaming or in-memory processing
  • revisit Polars streaming - potentially implement a batched write for very large files, i.e. splitting vocab/tokens files into smaller chunks to reduce memory usage

Medium-term

  • Support for processing backends other than spaCy (e.g. other tokenizers)

Developer Guide

The instructions below are only relevant if you want to contribute to Conc. The nbdev library is being used for development. If you are new to using nbdev, here are some useful pointers to get you started (or visit the nbdev website).

Install conc in Development mode

# make sure conc package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to conc
$ nbdev_prepare
