Pipeline for the ACQDIV database
ACQDIV
This repository contains the code and configuration files for transforming the child language acquisition corpora into the ACQDIV database.
Publication
If you use the database in your research, please cite as follows:
Jancso, Anna, Steven Moran, and Sabine Stoll.
"The ACQDIV Corpus Database and Aggregation Pipeline."
Proceedings of The 12th Language Resources and Evaluation Conference. 2020.
Resources
Download the ACQDIV database (only public corpora):
To request access to the full database including the private corpora (for research purposes only!), please contact Sabine Stoll. For technical questions, please open an issue on this repository.
Corpora
Our full database consists of the following corpora:
Corpus | ISO | Public | # Words |
---|---|---|---|
Chintang Language Corpus | ctn | no | 987'673 |
Cree Child Language Acquisition Study (CCLAS) Corpus | cre | yes | 44'751 |
English Manchester Corpus | eng | yes | 2'016'043 |
MPI-EVA Jakarta Child Language Database | ind | yes | 2'489'329 |
Allen Inuktitut Child Language Corpus | ike | no | 71'191 |
MiiPro Japanese Corpus | jpn | yes | 1'011'670 |
Miyata Japanese Corpus | jpn | yes | 373'021 |
Ku Waru Child Language Socialization Study | mux | yes | 65'723 |
Sarvasy Nungon Corpus | yuw | yes | 19'659 |
Qaqet Child Language Documentation | byx | no | 56'239 |
Stoll Russian Corpus | rus | no | 2'029'704 |
Demuth Sesotho Corpus | sot | yes | 177'963 |
Tuatschin Corpus | roh | no | 118'310 |
Koç University Longitudinal Language Development Database | tur | no | 1'120'077 |
Pfeiler Yucatec Child Language Corpus | yua | no | 262'382 |
Total | | | 10'843'735 |
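The word counts in the table can be cross-checked with a few lines of Python. The sketch below copies the numbers from the table into a dictionary (the short corpus keys and the `total_words` helper are illustrative and not part of the pipeline) and sums all corpora as well as the public subset.

```python
# Word counts copied from the table above: key -> (ISO code, public?, words).
# The short keys are illustrative labels, not names used by the pipeline.
CORPORA = {
    "Chintang": ("ctn", False, 987_673),
    "Cree": ("cre", True, 44_751),
    "English Manchester": ("eng", True, 2_016_043),
    "Jakarta": ("ind", True, 2_489_329),
    "Inuktitut": ("ike", False, 71_191),
    "MiiPro Japanese": ("jpn", True, 1_011_670),
    "Miyata Japanese": ("jpn", True, 373_021),
    "Ku Waru": ("mux", True, 65_723),
    "Nungon": ("yuw", True, 19_659),
    "Qaqet": ("byx", False, 56_239),
    "Russian": ("rus", False, 2_029_704),
    "Sesotho": ("sot", True, 177_963),
    "Tuatschin": ("roh", False, 118_310),
    "Turkish": ("tur", False, 1_120_077),
    "Yucatec": ("yua", False, 262_382),
}


def total_words(corpora, public_only=False):
    """Sum word counts, optionally restricted to the public corpora."""
    return sum(
        words
        for _iso, public, words in corpora.values()
        if public or not public_only
    )


if __name__ == "__main__":
    print(total_words(CORPORA))                    # 10843735
    print(total_words(CORPORA, public_only=True))  # 6198159
```

By this count, the public download covers roughly 6.2 of the 10.8 million words.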
Running the pipeline
For Windows users, follow the installation/run instructions here: https://github.com/acqdiv/acqdiv/wiki/Installation-Run-instructions-for-Windows
For Mac and Linux users, continue here to run the pipeline yourself:
Install the package
Create a virtual environment [optional]:
python3 -m venv venv
source venv/bin/activate
You can install the package from PyPI or directly from source:
PyPI
pip install acqdiv
From source
# Clone Repository
git clone git@github.com:acqdiv/acqdiv.git
cd acqdiv
# Install package (for users!)
pip install .
# Developer mode (for developers!)
pip install -r requirements.txt
Get the corpora
Run the following script to download the public corpora:
python util/download_public_corpora.py
The downloaded corpora are in the folder corpora.
For the private corpora, either place the session files in corpora/<corpus_name>/{cha|toolbox}/ and the metadata files (only Toolbox corpora) in corpora/<corpus_name>/imdi/, or edit the paths to those files in the config.ini (also see below).
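Before running the pipeline on private corpora, it may help to check that the folders described above exist. The following is a minimal sketch, not part of the acqdiv package: `missing_corpus_dirs` is a hypothetical helper, and the corpus name in the example is a placeholder that may differ from the name the pipeline expects.

```python
from pathlib import Path


def missing_corpus_dirs(corpora_dir, corpus_name, fmt):
    """Return the expected sub-directories of a corpus that do not exist.

    fmt is "cha" or "toolbox"; only Toolbox corpora need an imdi/ folder.
    """
    if fmt not in ("cha", "toolbox"):
        raise ValueError(f"unknown format: {fmt!r}")
    base = Path(corpora_dir) / corpus_name
    expected = [base / fmt]
    if fmt == "toolbox":
        expected.append(base / "imdi")
    return [path for path in expected if not path.is_dir()]


if __name__ == "__main__":
    # "Russian" is a placeholder corpus name used for illustration only.
    for path in missing_corpus_dirs("corpora", "Russian", "toolbox"):
        print(f"missing: {path}")
```

An empty result means the session folder (and, for Toolbox corpora, the metadata folder) is in place; it does not check the files inside.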
Generate the database
Get the configuration file src/acqdiv/config.ini and specify the absolute paths (without trailing slashes) for the corpora directory (corpora_dir) and the directory where the database should be written to (db_dir):
[.global]
# directory containing corpora
corpora_dir = /absolute/path/to/corpora/dir
# directory where the database is written to
db_dir = /absolute/path/to/database/dir
...
Optionally adapt the paths for the individual corpora (sessions and metadata_dir).
Run the pipeline specifying the absolute path to the configuration file:
acqdiv load -c /absolute/path/to/config.ini
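The two path requirements (absolute, no trailing slash) are easy to get wrong, so a quick check before running the pipeline can save a failed run. The sketch below is an assumption-laden illustration: `check_global_paths` is a hypothetical helper, it only looks at the two keys shown above, and it assumes the relevant part of the file can be read with Python's standard `configparser`, which may not hold for the full config.ini.

```python
import configparser
from pathlib import PurePosixPath


def check_global_paths(config_text):
    """Return a list of problems found in the [.global] section."""
    # Interpolation is disabled so that '%' or '$' in values is left untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    problems = []
    for key in ("corpora_dir", "db_dir"):
        value = parser.get(".global", key, fallback="")
        if not value:
            problems.append(f"{key} is not set")
        elif not PurePosixPath(value).is_absolute():
            problems.append(f"{key} is not an absolute path: {value}")
        elif value != "/" and value.endswith("/"):
            problems.append(f"{key} has a trailing slash: {value}")
    return problems


if __name__ == "__main__":
    example = """
[.global]
# directory containing corpora
corpora_dir = /absolute/path/to/corpora/dir
# directory where the database is written to
db_dir = /absolute/path/to/database/dir/
"""
    for problem in check_global_paths(example):
        print(problem)
```

The function takes the file contents as a string so it can be tried without touching a real configuration; for an actual file, pass the result of `Path("/absolute/path/to/config.ini").read_text()`.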
Generate the R object
Install dependencies
$ R
> install.packages("RSQLite")
> install.packages("rlang")
Navigate to src/acqdiv/database and run:
Rscript sqlite_to_r.R /absolute/path/to/sqlite-DB
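Before converting, it can be useful to confirm that the SQLite file is readable and not empty. The sketch below uses only Python's built-in `sqlite3` module and makes no assumption about the ACQDIV schema: it reads the table names from the database itself. `table_row_counts` is a hypothetical helper, and the demo table is a throwaway created in memory.

```python
import sqlite3
from contextlib import closing


def table_row_counts(connection):
    """Map each table name in the database to its number of rows."""
    names = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # The names are read from the database itself; they are quoted as identifiers.
    return {
        name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


if __name__ == "__main__":
    # In-memory demo with a made-up table; not the ACQDIV schema.
    with closing(sqlite3.connect(":memory:")) as conn:
        conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, word TEXT)")
        conn.executemany("INSERT INTO demo (word) VALUES (?)", [("a",), ("b",)])
        print(table_row_counts(conn))
```

To inspect the generated database, open it with `sqlite3.connect("/absolute/path/to/sqlite-DB")` instead of the in-memory connection.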
Run tests
Run the unit tests:
pytest tests/unittests
Run the integrity tests on the database:
pytest tests/systemtests