
tools for loading corpora


Corpus Interface

License: GPL v3

Basic functionality to maintain and load corpora.


pip install corpusinterface

Managing Corpora

Adding your own corpus

Say you have packaged a number of files into a corpus

  |- file_1.txt
  |- file_2.txt
  |- dir_1
    |- file_3.txt
    |- file_4.txt

and let's assume you made it available as a zip archive at some URL. Using the corpus interface, these files can be accessed as follows:

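from corpusinterface import config, load

# initialise config
config.init_config()

# add your corpus (replace the placeholder with the URL of your zip archive)
config.add_corpus("Your Corpus",
                  url="<URL of your zip archive>",
                  access="zip")

# load the corpus
corpus = load("Your Corpus", download=True)

# access the data (using a file_reader of your choice)
for file in corpus.data(file_reader=lambda file, **kwargs: f"reading: {file}"):
    print(file)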

This will print

reading: ~/corpora/Your Corpus/file_1.txt
reading: ~/corpora/Your Corpus/file_2.txt
reading: ~/corpora/Your Corpus/dir_1/file_3.txt
reading: ~/corpora/Your Corpus/dir_1/file_4.txt

with ~ being replaced with your home directory (paths might be displayed differently, depending on your operating system).
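To make the traversal concrete, here is a minimal stand-alone sketch of what a file-based corpus loader might do. It does not use corpusinterface; `iter_corpus_files` and `demo` are hypothetical helpers written for illustration, and the sorted traversal order is an assumption rather than documented behaviour.

```python
import os
import tempfile
from pathlib import Path


def iter_corpus_files(root, file_reader, **kwargs):
    """Hypothetical helper: yield file_reader(path, **kwargs) for every file under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit sub-directories in a deterministic order
        for name in sorted(filenames):
            yield file_reader(os.path.join(dirpath, name), **kwargs)


def demo():
    """Recreate the example layout in a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "Your Corpus"
        (root / "dir_1").mkdir(parents=True)
        for rel in ("file_1.txt", "file_2.txt", "dir_1/file_3.txt", "dir_1/file_4.txt"):
            (root / rel).write_text("content")

        # Report paths relative to the corpus directory to keep the output short.
        def reader(file, **kwargs):
            return f"reading: {Path(file).relative_to(root).as_posix()}"

        return list(iter_corpus_files(root, reader))


for line in demo():
    print(line)
```

Files in the top-level directory are visited first, followed by those in `dir_1`, which mirrors the order of the listing above.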


Instead of specifying the necessary information from within Python, you can also put it in a config file:

[Your Corpus]
access: zip
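The file uses INI-style sections with `key: value` pairs. As a stand-alone illustration of the format, the snippet below parses the same text with Python's standard `configparser` module; whether corpusinterface parses its config files in exactly this way is an assumption.

```python
import configparser

# The config text from above, embedded as a string for illustration.
CONFIG_TEXT = """
[Your Corpus]
access: zip
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Each section is a corpus name; its options are that corpus's parameters.
settings = dict(parser["Your Corpus"])
print(parser.sections())
print(settings)
```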

If you put this file at the default location ~/corpora/corpora.ini in your home directory, or name it corpora.ini and place it in the current working directory, it is automatically loaded when calling config.init_config(). Otherwise, you can load any config file either by providing it to init_config or by loading it manually later.

Default config

A default config file is shipped with the corpusinterface package and automatically loaded by init_config. It defines some useful defaults that are used for newly added corpora if no corpus-specific values are specified. You can see all the config information associated with your corpus by printing a summary:

print(config.summary(corpus="Your Corpus"))
[Your Corpus]
    access: zip
    info: None
    root: ~/corpora
    path: ~/corpora/Your Corpus
    parent: None
    loader: FileCorpus

In particular, the default root directory ~/corpora was added and the corpus is stored in a path that is a subdirectory ~/corpora/Your Corpus according to its name (more on root and path below). Moreover, by default we assume to have a FileCorpus consisting of a simple collection of files.

Special parameters

The parameters root, path, parent, download, loader, access, and url are special and their values are treated in a particular way.


root

Root directory to store the corpus in. This should be an absolute path; ~ is expanded to the user home. If a relative path is specified, a warning is issued and it is interpreted relative to the current working directory. If parent is non-empty, the value of root is ignored and the parent's path is used instead. A call to config.get(Name, 'root') returns the effective value.


path

Directory to store the corpus in. This can be

  1. an absolute path (~ is expanded to the user home), in which case root is ignored,
  2. a relative path, in which case it is appended to root, or
  3. empty, in which case the corpus name [Name] is appended to root.

A call to config.get(Name, 'path') returns the effective value. Note that for sub-corpora (with non-empty parent) the parent's path is used instead of root.
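The resolution rules above can be sketched as a small function. This is an illustration of the documented behaviour, not the library's implementation: `effective_path` and its `home` parameter are hypothetical, and POSIX-style paths are used so that the results are the same on every platform.

```python
import posixpath


def effective_path(name, root="~/corpora", path=None, parent_path=None, home="/home/user"):
    """Sketch: resolve the directory of corpus `name` following the rules above."""

    def expand(p):
        # Expand a leading ~ to the (hypothetical) home directory.
        if p == "~" or p.startswith("~/"):
            return home + p[1:]
        return p

    # For sub-corpora, the parent's path takes the place of root.
    base = expand(parent_path if parent_path else root)
    if not path:
        # Rule 3: empty path, so the corpus name is appended to root.
        return posixpath.join(base, name)
    path = expand(path)
    if posixpath.isabs(path):
        # Rule 1: absolute path, so root is ignored.
        return path
    # Rule 2: relative path, so it is appended to root.
    return posixpath.join(base, path)


print(effective_path("Your Corpus"))
print(effective_path("Your Corpus", path="/data/corpus"))
print(effective_path("Your Corpus", path="sub/dir"))
print(effective_path("Part", parent_path="~/corpora/Your Corpus"))
```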


parent

The name of a parent corpus, or empty. If non-empty, the parent corpus should be defined separately, and the value of root is ignored and replaced by the parent's path.

Initialisation (e.g. downloading from url with access method) is delegated to the parent corpus when loading a sub-corpus.


Additional parameters

You can specify additional parameters that are handed over to the loader and (in the case of the FileCorpus loader) passed on to your file_reader function. For instance, you could specify

prefix: my prefix

in the config file or equivalently

config.add_corpus("Your Corpus",
                  prefix="my prefix")

from within Python. Your file reader can then make use of this parameter (provided as a keyword argument, so you have to refer to it by the correct name)

file_reader=lambda file, prefix, **kwargs: f"{prefix}: {file}"
my prefix: ~/corpora/Your Corpus/file_1.txt

This is also the reason why we always need **kwargs in a reader function to accept all keyword arguments that are provided, even if we decide to not use them.
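The need for **kwargs can be demonstrated in plain Python without the library. In this sketch, `call_reader` is a hypothetical stand-in for a loader that forwards all of its parameters as keyword arguments; which parameters the real loader forwards is an assumption here.

```python
def call_reader(file_reader, file, **params):
    """Hypothetical stand-in for a loader: forward every parameter as a keyword argument."""
    return file_reader(file, **params)


# Parameters as they might arrive from the config (illustrative values).
params = {"prefix": "my prefix", "access": "zip"}

# This reader accepts and ignores any extra keyword arguments.
with_kwargs = lambda file, prefix, **kwargs: f"{prefix}: {file}"
# This reader only accepts the arguments it names.
without_kwargs = lambda file, prefix: f"{prefix}: {file}"

result = call_reader(with_kwargs, "file_1.txt", **params)
print(result)

try:
    call_reader(without_kwargs, "file_1.txt", **params)
    error = None
except TypeError as exc:
    # The unexpected keyword argument 'access' is rejected.
    error = exc
print(type(error).__name__)
```

The reader without **kwargs fails as soon as any parameter it does not name is forwarded to it.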

The config values can be dynamically overwritten in the load function

corpus = load("Your Corpus",
              prefix="other prefix")
other prefix: ~/corpora/Your Corpus/file_1.txt

or in the data function:

for file in corpus.data(file_reader=lambda file, prefix, **kwargs: f"{prefix}: {file}",
                        prefix="still different"):
    print(file)
still different: ~/corpora/Your Corpus/file_1.txt
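The layering of values can be pictured as successive dictionary updates. The sketch below is not the library's code: `merge_params` is a hypothetical helper, and it assumes that arguments to the data function take precedence over those given to load, which in turn take precedence over the config.

```python
def merge_params(config_values, load_kwargs=None, data_kwargs=None):
    """Sketch: later sources overwrite earlier ones (config, then load, then data)."""
    merged = dict(config_values)
    merged.update(load_kwargs or {})
    merged.update(data_kwargs or {})
    return merged


config_values = {"prefix": "my prefix"}

print(merge_params(config_values)["prefix"])
print(merge_params(config_values, {"prefix": "other prefix"})["prefix"])
print(merge_params(config_values, {"prefix": "other prefix"},
                   {"prefix": "still different"})["prefix"])
```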

Controlling initialisation

You have full control over how the config is initialised. A call to config.init_config() without any arguments will load the default config, look for corpora.ini in ~/corpora and the current working directory and load them, too, if present. This is equivalent to calling

config.init_config(default=None, home=None, local=None)

For each of these parameters you may alternatively specify True (meaning that you expect the respective config file to be loaded, and an error is raised otherwise) or False (meaning that the respective config file is not loaded). In addition, you may specify one or more further files to load:

config.init_config("/path/to/file_1.ini", "/path/to/file_2.ini", ...)
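The three-valued flags (None, True, False) described above can be sketched as follows. `should_load` is a hypothetical helper that only decides whether a file would be loaded; the use of FileNotFoundError is an assumption, since the text only says that an error is raised.

```python
import os
import tempfile


def should_load(path, flag=None):
    """Sketch of the None/True/False semantics for one config file."""
    exists = os.path.isfile(path)
    if flag is False:
        # False: never load the file, even if it exists.
        return False
    if flag is True and not exists:
        # True: the file is required, so a missing file is an error.
        raise FileNotFoundError(path)
    # None: load the file only if it is present.
    return exists


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "corpora.ini")
    missing = os.path.join(tmp, "missing.ini")
    open(present, "w").close()

    decisions = [
        should_load(present),         # None, file present
        should_load(missing),         # None, file missing
        should_load(present, False),  # explicitly disabled
        should_load(present, True),   # required and present
    ]
    try:
        should_load(missing, True)    # required but missing
        raised = False
    except FileNotFoundError:
        raised = True

print(decisions, raised)
```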

