Pipeline for the ACQDIV database
ACQDIV
This repository contains the code and configuration files for transforming the child language acquisition corpora into the ACQDIV database.
Resources
Download the ACQDIV database (only open-access corpora):
To request access to the full database, including the private corpora (for research purposes only), please contact Sabine Stoll. For technical questions, please open an issue on this repository.
Corpora
Our full database consists of the following corpora (open-access corpora are marked with *):
- Chintang Language Corpus (Chintang)
- Corpus of the Chisasibi Child Language Acquisition Study (Cree) *
- English Manchester Corpus (English) *
- MPI-EVA Jakarta Child Language Database (Indonesian) *
- Allen Inuktitut Child Language Corpus (Inuktitut)
- MiiPro Japanese Corpus (Japanese) *
- Miyata Japanese Corpus (Japanese) *
- Ku Waru Child Language Socialization Study (Ku Waru) *
- Sarvasy Nungon Corpus (Nungon) *
- Qaqet Child Language Documentation (Qaqet)
- Stoll Russian Corpus (Russian)
- Demuth Sesotho Corpus (Sesotho) *
- Tuatschin Corpus (Tuatschin)
- Koç University Longitudinal Language Development Database (Turkish)
- Pfeiler Yucatec Child Language Corpus (Yucatec)
Running the pipeline
To run the pipeline yourself:
Install the package
Create a virtual environment [optional]:
python3 -m venv venv
source venv/bin/activate
You can install the package from PyPI or directly from source:
PyPI
pip install acqdiv
From source
# Clone Repository
git clone git@github.com:uzling/acqdiv.git
cd acqdiv
# Install package (for users!)
pip install .
# Developer mode (for developers!)
pip install -r requirements.txt
Download the corpora
Create a directory named corpora.
For the CHAT corpora:
- Download the CHAT files from the CHILDES TalkBank website, where available (see the "Download transcripts" link).
- Unzip the data.
- Copy the Python script src/acqdiv/util/cha_extractor.py into the directory.
- Run the script: python cha_extractor.py. A directory cha/ will be created.
- Place the cha/ directory in corpora/<corpus_name>/ (see the corresponding ini file in src/acqdiv/ini/<corpus_name> for the corpus name to use as the directory name).
For the toolbox corpora:
- Download the toolbox and metadata files (IMDI/CMDI).
- Place the toolbox files in corpora/<corpus_name>/toolbox/ and the IMDI files in corpora/<corpus_name>/imdi/.
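Before building the database, it can help to confirm that each corpus directory has the layout described above. The following is a minimal sketch, not part of the acqdiv package: the function name, the "chat"/"toolbox" format labels, and the corpus name MyCorpus are hypothetical and used only for illustration.

```python
from pathlib import Path

# Sub-directories named in the download steps; which ones apply depends on the corpus format.
CHAT_SUBDIRS = ("cha",)
TOOLBOX_SUBDIRS = ("toolbox", "imdi")


def missing_subdirs(corpora_dir, corpus_name, corpus_format):
    """Return the expected sub-directories that do not exist yet.

    corpus_format is a hypothetical label ("chat" or "toolbox") used only by this sketch.
    """
    expected = CHAT_SUBDIRS if corpus_format == "chat" else TOOLBOX_SUBDIRS
    corpus_dir = Path(corpora_dir) / corpus_name
    return [name for name in expected if not (corpus_dir / name).is_dir()]
```

For example, missing_subdirs("corpora", "MyCorpus", "toolbox") returns an empty list once both corpora/MyCorpus/toolbox/ and corpora/MyCorpus/imdi/ exist.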
Create the database
Get the configuration file src/acqdiv/config.ini and specify the absolute paths (without trailing slashes) for the corpora directory (corpora_dir) and the directory where the database should be written to (db_dir):
[.global]
# directory containing corpora
corpora_dir = /absolute/path/to/corpora/dir
# directory where the database is written to
db_dir = /absolute/path/to/database/dir
...
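The two path requirements (absolute, no trailing slash) are easy to get wrong, so a quick check before running the pipeline may save time. The sketch below assumes the relevant part of config.ini can be read with Python's standard configparser; the function check_global_paths is hypothetical and not part of acqdiv, and the remaining sections of the real file may need different handling.

```python
import configparser
import os


def check_global_paths(config_text):
    """Return a list of problems found in corpora_dir and db_dir."""
    # Interpolation is disabled so that literal '%' characters are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    section = parser[".global"]
    problems = []
    for key in ("corpora_dir", "db_dir"):
        value = section.get(key, "")
        if not os.path.isabs(value):
            problems.append(f"{key} is not an absolute path: {value!r}")
        if value.endswith("/"):
            problems.append(f"{key} has a trailing slash: {value!r}")
    return problems
```

An empty result means both entries satisfy the two requirements; it does not check that the directories actually exist.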
Run the pipeline specifying the absolute path to the configuration file:
acqdiv load -c /absolute/path/to/config.ini
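If the pipeline is started from a Python script rather than the shell, the same command can be wrapped with subprocess. This is only a sketch under the assumption that the acqdiv command is on the PATH after installation; the helper names build_load_command and run_pipeline are hypothetical.

```python
import subprocess
from pathlib import Path


def build_load_command(config_path):
    """Build the `acqdiv load -c <config>` command as an argument list."""
    path = Path(config_path)
    if not path.is_absolute():
        raise ValueError(f"config path must be absolute: {config_path}")
    return ["acqdiv", "load", "-c", str(path)]


def run_pipeline(config_path):
    """Run the pipeline and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_load_command(config_path), check=True)
```

Rejecting relative paths up front mirrors the requirement that the configuration file be given as an absolute path.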
Run the unit tests:
pytest tests/unittests
Run the integrity tests on the database:
pytest tests/systemtests
For more options:
acqdiv load -h