Skip to main content

tools for loading corpora

Project description

Corpus Interface

build PyPI version

tests codecov

License: GPL v3

Basic functionality to maintain and load corpora.

Installation

pip install corpusinterface

Managing Corpora

Adding your own corpus

Say, you packaged a number of files into a corpus

your-corpus
  |- file_1.txt
  |- file_2.txt
  |- dir_1
    |- file_3.txt
    |- file_4.txt

and let's assume you made it available as a zip archive at http://your-website.com/your-corpus.zip. Using the corpus interface these file can be accessed as follows:

from corpusinterface import config, load

# initialise config
config.init_config()

# add your corpus
config.add_corpus("Your Corpus",
                  access="zip",
                  url="http://your-website.com/your-corpus.zip")

# load the corpus
corpus = load("Your Corpus", download=True)

# access the data (using a file_reader of your choice)
for file in corpus.data(file_reader=lambda file, **kwargs: f"reading: {file}"):
    print(file)

This will print

reading: ~/corpora/Your Corpus/file_1.txt
reading: ~/corpora/Your Corpus/file_2.txt
reading: ~/corpora/Your Corpus/dir_1/file_3.txt
reading: ~/corpora/Your Corpus/dir_1/file_4.txt

with ~ being replaced with your home directory (paths might be displayed differently, depending on your operating system).

Config

Instead of specifying the necessary information from within Python, you can also put it in a config file:

[Your Corpus]
access: zip
url: http://your-website.com/your-corpus.zip

If you put this file at the default location ~/corpora/corpora.ini in your home directory or a file corpora.ini in the current working directory, it is automatically loaded when calling config.init_config(). Otherwise, you can load any config file by either providing it to init_config

config.init_config("your-config-file.ini")

or loading it manually later

config.load_config("your-config-file.ini")

Default config

A default config file is shipped with the corpusinterface package and automatically loaded by init_config. It defines some useful defaults that are used for newly added corpora if no corpus-specific values are specified. You can see all the config information associated to your corpus by printing a summary:

print(config.summary(corpus="Your Corpus"))
[Your Corpus]
    access: zip
    url: http://your-website.com/your-corpus.zip
    info: None
    root: ~/corpora
    path: ~/corpora/Your Corpus
    parent: None
    loader: FileCorpus

In particular, the default root directory ~/corpora was added and the corpus is stored in a path that is a subdirectory ~/corpora/Your Corpus according to its name (more on root and path below). Moreover, by default we assume to have a FileCorpus consisting of a simple collection of files.

Special parameters

The parameters root, path, parent, download, loader, access, and url are special and their values are treated in a particular way.

root

Root directory to store the corpus in. This should be an absolute path, ~ is expanded to the user home. If a relative path is specified, a warning is issued and it is interpreted relative to the current working directory. If parent is non-empty, the value of root is ignored and instead the parent's path is used. A call to config.get(Name, 'root') returns the effective value.

path

Directory to store the corpus in. This can be

  1. an absolute path (~ is expanded to the user home), in which case root is ignored
  2. a relative path, in which case it is appended to root or
  3. be empty, in which case the corpus [Name] is appended to root.

A call to config.get(Name, 'path') returns the effective value. Note that for sub-corpora (with non-empty parent) the parent's path is used instead of root.

parent

A parent corpus name or empty. If non-emtpy, the parent corpus should be defined separately and the value of root is ignored and replaced by the parent's path.

Initialisation (e.g. downloading from url with access method) is delegated to the parent corpus when loading a sub-corpus.

download
loader
access
url

Additional parameters

You can specify additional parameters that are handed over to the loader and (in case of the FileCorpus loader) further passed on the your file_reader function. For instance, you could specify

prefix: my prefix

in the config file or equivalently

config.add_corpus("Your Corpus",
                  ...,
                  prefix="my prefix")

from within Python. Your file reader can then make use of this parameter (provided as a keyword argument, so you have to refer to it by the correct name)

file_reader=lambda file, prefix, **kwargs: f"{prefix}: {file}"
my prefix: ~/corpora/Your Corpus/file_1.txt
...

This is also the reason why we always need **kwargs in a reader function to accept all keyword arguments that are provided, even if we decide to not use them.

The config values can be dynamically overwritten in the load function

corpus = load("Your Corpus",
              ...,
              prefix="other prefix")
other prefix: ~/corpora/Your Corpus/file_1.txt
...

or in the data function:

for file in corpus.data(..., prefix="still different"):
    ...
still different: ~/corpora/Your Corpus/file_1.txt
...

Controlling initialisation

You have full control over how the config is initialised. A call to config.init_config() without any arguments will load the default config, look for corpora.ini in ~/corpora and the current working directory and load them, too, if present. This is equivalent to calling

config.init_config(default=None, home=None, local=None)

For each of these parameters you may alternatively specify a value of True (meaning that you expect the respective config file to be loaded and otherwise an error is raised), or False (meaning the the respective config file is not loaded). Additionally, you may specify one or more files that should additionally be loaded

config.init_config("/path/to/file_1.ini", "/path/to/file_2.ini", ...)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

corpusinterface-0.1.0.tar.gz (20.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

corpusinterface-0.1.0-py3-none-any.whl (21.2 kB view details)

Uploaded Python 3

File details

Details for the file corpusinterface-0.1.0.tar.gz.

File metadata

  • Download URL: corpusinterface-0.1.0.tar.gz
  • Upload date:
  • Size: 20.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/49.2.1 requests-toolbelt/0.9.1 tqdm/4.53.0 CPython/3.9.0

File hashes

Hashes for corpusinterface-0.1.0.tar.gz
Algorithm Hash digest
SHA256 93d33e4f43fdb2aa7c0cd09264ad157dee4cd877335e1ebda624cb13d9c41757
MD5 e98ed8ecc394c57187d9870080f1fbed
BLAKE2b-256 388d9848285ed36f41feefc777c486b65f7d0df8789c34eec8d87df0a80a12e7

See more details on using hashes here.

File details

Details for the file corpusinterface-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: corpusinterface-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 21.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/49.2.1 requests-toolbelt/0.9.1 tqdm/4.53.0 CPython/3.9.0

File hashes

Hashes for corpusinterface-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c3ab56c260d2e04443adad60d591aa01bbeac838e9351a165e70c2c365df513a
MD5 bacc2d64a1f657324cd7f1329297e7a6
BLAKE2b-256 1974bb0e63b18d95aab3a696a371a846b3fcc1c7f400a063624a46f5499ba1c1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page