tools for loading corpora
Corpus Interface
Basic functionality to maintain and load corpora.
Installation
pip install corpusinterface
Managing Corpora
Adding your own corpus
Say you packaged a number of files into a corpus
your-corpus
|- file_1.txt
|- file_2.txt
|- dir_1
   |- file_3.txt
   |- file_4.txt
and let's assume you made it available as a zip archive at http://your-website.com/your-corpus.zip. Using the corpus interface, these files can be accessed as follows:
from corpusinterface import config, load
# initialise config
config.init_config()
# add your corpus
config.add_corpus("Your Corpus",
                  access="zip",
                  url="http://your-website.com/your-corpus.zip")
# load the corpus
corpus = load("Your Corpus", download=True)
# access the data (using a file_reader of your choice)
for file in corpus.data(file_reader=lambda file, **kwargs: f"reading: {file}"):
    print(file)
This will print
reading: ~/corpora/Your Corpus/file_1.txt
reading: ~/corpora/Your Corpus/file_2.txt
reading: ~/corpora/Your Corpus/dir_1/file_3.txt
reading: ~/corpora/Your Corpus/dir_1/file_4.txt
with ~ being replaced with your home directory (paths might be displayed differently, depending on your operating system).
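The lambda above only echoes each path. A reader that actually returns the file contents could look like the following minimal sketch. The name read_text is hypothetical and not part of corpusinterface; the only assumption taken from the example above is that the reader is called with the file path as its first argument plus arbitrary keyword arguments.

```python
from pathlib import Path


def read_text(file, **kwargs):
    """Return the content of `file` as a string.

    Hypothetical reader for illustration, not part of corpusinterface.
    Extra keyword arguments are accepted and ignored.
    """
    return Path(file).expanduser().read_text(encoding="utf-8")
```

Under that assumption it would be passed as corpus.data(file_reader=read_text).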
Config
Instead of specifying the necessary information from within Python, you can also put it in a config file:
[Your Corpus]
access: zip
url: http://your-website.com/your-corpus.zip
If you put this file at the default location ~/corpora/corpora.ini in your home directory or a file corpora.ini in the current working directory, it is automatically loaded when calling config.init_config(). Otherwise, you can load any config file by either providing it to init_config
config.init_config("your-config-file.ini")
or loading it manually later
config.load_config("your-config-file.ini")
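To check what a config file contains before handing it to the corpus interface, it can be inspected with Python's standard configparser module, which accepts the "key: value" syntax shown above. This is only a sketch for inspecting such a file; whether corpusinterface parses its config files this way internally is an assumption.

```python
import configparser

# The example config from above, inlined as a string for illustration.
CONFIG_TEXT = """
[Your Corpus]
access: zip
url: http://your-website.com/your-corpus.zip
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Each section is one corpus; its items are the corpus parameters.
for name in parser.sections():
    print(name, dict(parser[name]))
```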
Default config
A default config file is shipped with the corpusinterface package and automatically loaded by init_config. It defines some useful defaults that are used for newly added corpora if no corpus-specific values are specified. You can see all the config information associated with your corpus by printing a summary:
print(config.summary(corpus="Your Corpus"))
[Your Corpus]
access: zip
url: http://your-website.com/your-corpus.zip
info: None
root: ~/corpora
path: ~/corpora/Your Corpus
parent: None
loader: FileCorpus
In particular, the default root directory ~/corpora was added and the corpus is stored in a path that is a subdirectory ~/corpora/Your Corpus according to its name (more on root and path below). Moreover, by default we assume to have a FileCorpus consisting of a simple collection of files.
Special parameters
The parameters root, path, parent, download, loader, access, and url are special and their values are treated in a particular way.
root
Root directory to store the corpus in. This should be an absolute path; ~ is expanded to the user home. If a relative path is specified, a warning is issued and it is interpreted relative to the current working directory. If parent is non-empty, the value of root is ignored and the parent's path is used instead. A call to config.get(Name, 'root') returns the effective value.
path
Directory to store the corpus in. This can be
- an absolute path (~ is expanded to the user home), in which case root is ignored,
- a relative path, in which case it is appended to root, or
- empty, in which case the corpus name [Name] is appended to root.
A call to config.get(Name, 'path') returns the effective value. Note that for sub-corpora (with non-empty parent) the parent's path is used instead of root.
parent
A parent corpus name or empty. If non-empty, the parent corpus should be defined separately, and the value of root is ignored and replaced by the parent's path.
Initialisation (e.g. downloading from url with the access method) is delegated to the parent corpus when loading a sub-corpus.
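The rules for root, path, and parent described above can be summarised in a small sketch. The functions effective_root and effective_path are hypothetical helpers written for illustration; they are not the library's implementation, and they leave out the warning issued for a relative root.

```python
import os


def effective_root(root, parent_path=None):
    """Base directory for a corpus: the parent's path if there is a parent."""
    base = parent_path if parent_path else root
    return os.path.expanduser(base)


def effective_path(name, root, path=None, parent_path=None):
    """Hypothetical re-statement of the documented `path` rules."""
    base = effective_root(root, parent_path)
    if not path:
        # empty path: the corpus name is appended to the root
        return os.path.join(base, name)
    path = os.path.expanduser(path)
    if os.path.isabs(path):
        # absolute path: the root is ignored
        return path
    # relative path: appended to the root
    return os.path.join(base, path)
```

With root "~/corpora" and an empty path, effective_path("Your Corpus", "~/corpora") yields the "Your Corpus" subdirectory of the expanded root, matching the summary shown earlier.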
download
loader
access
url
Additional parameters
You can specify additional parameters that are handed over to the loader and (in case of the FileCorpus loader) further passed on to your file_reader function. For instance, you could specify
prefix: my prefix
in the config file or equivalently
config.add_corpus("Your Corpus",
...,
prefix="my prefix")
from within Python. Your file reader can then make use of this parameter (provided as a keyword argument, so you have to refer to it by the correct name)
file_reader=lambda file, prefix, **kwargs: f"{prefix}: {file}"
my prefix: ~/corpora/Your Corpus/file_1.txt
...
This is also the reason why we always need **kwargs in a reader function to accept all keyword arguments that are provided, even if we decide to not use them.
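The effect of omitting **kwargs can be reproduced without the library. In the sketch below, the parameter dictionary stands in for whatever keyword arguments get forwarded to the reader; the extra encoding entry is made up for illustration, and which parameters are actually forwarded is not specified here.

```python
def strict_reader(file, prefix):
    # no **kwargs: any additional keyword argument raises a TypeError
    return f"{prefix}: {file}"


def tolerant_reader(file, prefix, **kwargs):
    # **kwargs absorbs parameters this reader does not use
    return f"{prefix}: {file}"


# Parameters as they might arrive at the reader; `encoding` is a made-up extra.
params = {"prefix": "my prefix", "encoding": "utf-8"}

print(tolerant_reader("file_1.txt", **params))

try:
    strict_reader("file_1.txt", **params)
except TypeError as error:
    print(f"strict_reader failed: {error}")
```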
The config values can be dynamically overwritten in the load function
corpus = load("Your Corpus",
...,
prefix="other prefix")
other prefix: ~/corpora/Your Corpus/file_1.txt
...
or in the data function:
for file in corpus.data(..., prefix="still different"):
    ...
still different: ~/corpora/Your Corpus/file_1.txt
...
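The override order suggested by these examples (config values first, then arguments to load, then arguments to data) can be modelled with plain dictionary updates. This is a sketch of the observable behaviour only; effective_params is a hypothetical helper, not how the library is implemented.

```python
def effective_params(config_values, load_kwargs=None, data_kwargs=None):
    """Merge parameters; later sources override earlier ones."""
    merged = dict(config_values)
    merged.update(load_kwargs or {})
    merged.update(data_kwargs or {})
    return merged


config_values = {"prefix": "my prefix"}
print(effective_params(config_values))
print(effective_params(config_values, {"prefix": "other prefix"}))
print(effective_params(config_values, {"prefix": "other prefix"},
                       {"prefix": "still different"}))
```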
Controlling initialisation
You have full control over how the config is initialised. A call to config.init_config() without any arguments will load the default config, look for corpora.ini in ~/corpora and the current working directory and load them, too, if present. This is equivalent to calling
config.init_config(default=None, home=None, local=None)
For each of these parameters you may alternatively specify a value of True (meaning that you expect the respective config file to be loaded and an error is raised otherwise) or False (meaning that the respective config file is not loaded). Additionally, you may specify one or more further files to be loaded:
config.init_config("/path/to/file_1.ini", "/path/to/file_2.ini", ...)
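The three-valued logic of default, home, and local can be written down as a small decision function. The helper should_load below is hypothetical and only mirrors the description above; in particular, the exception type raised by the library for a missing file is not stated, so FileNotFoundError is an assumption.

```python
import os


def should_load(file, flag=None):
    """Decide whether a config file is loaded, following the None/True/False rules.

    None:  load the file if it exists, skip it silently otherwise.
    True:  the file must exist, otherwise an error is raised.
    False: never load the file.
    """
    if flag is False:
        return False
    exists = os.path.isfile(os.path.expanduser(file))
    if flag is True and not exists:
        # assumed exception type; the library may raise something else
        raise FileNotFoundError(f"expected config file not found: {file}")
    return exists
```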