Pipeline for the ACQDIV database

ACQDIV

This repository contains the code and configuration files for transforming the child language acquisition corpora into the ACQDIV database.

Resources

Download the ACQDIV database (only open-access corpora):

To request access to the full database, including the private corpora (for research purposes only), please contact Sabine Stoll. For technical questions, please open an issue on this repository.


Corpora

Our full database consists of the following corpora (open-access corpora are marked with *):


Running the pipeline

To run the pipeline yourself:

Install the package

Create a virtual environment [optional]:

python3 -m venv venv
source venv/bin/activate

You can install the package from PyPI or directly from source:

PyPI

pip install acqdiv

From source

# Clone Repository
git clone git@github.com:uzling/acqdiv.git
cd acqdiv

# Install package (for users!)
pip install .

# Developer mode (for developers!)
pip install -r requirements.txt

Download the corpora

Create a directory named corpora.

For the CHAT corpora:

  • Download the CHAT files from the CHILDES TalkBank website, where available (see the "Download transcripts" link).
  • Unzip the data.
  • Copy the Python script src/acqdiv/util/cha_extractor.py into the directory with the unzipped data.
  • Run the script: python cha_extractor.py. A directory cha/ will be created.
  • Place the cha/ directory in corpora/<corpus_name>/. The corresponding ini file in src/acqdiv/ini/<corpus_name> shows which corpus name to use as the directory name.

For the Toolbox corpora:

  • Download the Toolbox files and the metadata files (IMDI/CMDI).
  • Place the Toolbox files in corpora/<corpus_name>/toolbox/ and the IMDI files in corpora/<corpus_name>/imdi/.
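
Before creating the database, it can help to confirm that a corpus directory matches the layout described above. The following is a minimal sketch, not part of acqdiv: the function name missing_subdirs and the format labels "chat" and "toolbox" are hypothetical, and only the sub-directory names cha/, toolbox/ and imdi/ are taken from the steps above.

```python
from pathlib import Path

# Sub-directory names taken from the download steps above.
# Everything else in this sketch is a hypothetical helper, not acqdiv API.
CHAT_SUBDIRS = ("cha",)
TOOLBOX_SUBDIRS = ("toolbox", "imdi")


def missing_subdirs(corpora_dir, corpus_name, corpus_format):
    """Return the expected sub-directories that do not exist yet."""
    if corpus_format == "chat":
        expected = CHAT_SUBDIRS
    elif corpus_format == "toolbox":
        expected = TOOLBOX_SUBDIRS
    else:
        raise ValueError(f"unknown corpus format: {corpus_format!r}")
    corpus_dir = Path(corpora_dir) / corpus_name
    # Keep only the names for which no directory is found.
    return [name for name in expected if not (corpus_dir / name).is_dir()]
```

An empty list means that every expected sub-directory is present. The corpus name passed in should be the one used in the corresponding ini file.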

Create the database

Take the configuration file src/acqdiv/config.ini and set the paths of the corpora directory (corpora_dir) and of the directory the database should be written to (db_dir):

[.global]
# directory containing corpora
corpora_dir = corpora
# directory where the database is written to
db_dir = database
...
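
To double-check the two paths before starting a run, the settings can be read back with Python's standard configparser module. This is only a sketch: it assumes the complete config.ini can be parsed by configparser, and the helper name read_global_paths is hypothetical and not part of acqdiv.

```python
import configparser
from pathlib import Path


def read_global_paths(config_path):
    """Return (corpora_dir, db_dir) from the [.global] section."""
    parser = configparser.ConfigParser()
    with open(config_path, encoding="utf-8") as f:
        parser.read_file(f)
    # Section and key names as shown in the configuration excerpt above.
    section = parser[".global"]
    return Path(section["corpora_dir"]), Path(section["db_dir"])
```

Calling read_global_paths("path/to/config.ini") returns both paths as pathlib.Path objects, so their existence can be checked with is_dir() before the pipeline is started.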

Run the pipeline, specifying the path to the configuration file:

acqdiv load -c path/to/config.ini

Run the unit tests:

pytest tests/unittests

Run the integrity tests on the database:

pytest tests/systemtests

For more options:

acqdiv load -h
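
The commands above can also be chained from a small Python script that stops at the first failure. This is a sketch under assumptions: the helper names build_commands and run_all are hypothetical, the commands are exactly those listed above, and the pytest paths are relative, so the script presumably has to be started from a directory that contains tests/.

```python
import subprocess


def build_commands(config_path):
    """Return the commands from this section, in the order given above."""
    return [
        ["acqdiv", "load", "-c", str(config_path)],
        ["pytest", "tests/unittests"],
        ["pytest", "tests/systemtests"],
    ]


def run_all(config_path):
    """Run each command in turn; a failing command raises CalledProcessError."""
    for command in build_commands(config_path):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

With check=True, subprocess.run raises an exception as soon as a command exits with a non-zero status, so the tests are not run if loading fails.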

