Pipeline for the ACQDIV database

These details have not been verified by PyPI

Project links

Development Status
- 5 - Production/Stable
Intended Audience
- Developers
Programming Language
Topic
- Text Processing :: Linguistic

Project description

ACQDIV

This repository contains the code and configuration files for transforming the child language acquisition corpora into the ACQDIV database.

Publication

If you use the database in your research, please cite as follows:

Jancso, Anna, Steven Moran, and Sabine Stoll.
"The ACQDIV Corpus Database and Aggregation Pipeline."
Proceedings of The 12th Language Resources and Evaluation Conference. 2020.

Link to Paper

Resources

Download the ACQDIV database (only public corpora):

To request access to the full database, including the private corpora (for research purposes only), please contact Sabine Stoll. For technical questions, please open an issue on this repository.

Corpora

Our full database consists of the following corpora:

Corpus	ISO	Public	# Words
Chintang Language Corpus	ctn	no	987'673
Cree Child Language Acquisition Study (CCLAS) Corpus	cre	yes	44'751
English Manchester Corpus	eng	yes	2'016'043
MPI-EVA Jakarta Child Language Database	ind	yes	2'489'329
Allen Inuktitut Child Language Corpus	ike	no	71'191
MiiPro Japanese Corpus	jpn	yes	1'011'670
Miyata Japanese Corpus	jpn	yes	373'021
Ku Waru Child Language Socialization Study	mux	yes	65'723
Sarvasy Nungon Corpus	yuw	yes	19'659
Qaqet Child Language Documentation	byx	no	56'239
Stoll Russian Corpus	rus	no	2'029'704
Demuth Sesotho Corpus	sot	yes	177'963
Tuatschin Corpus	roh	no	118'310
Koç University Longitudinal Language Development Database	tur	no	1'120'077
Pfeiler Yucatec Child Language Corpus	yua	no	262'382
Total			10'843'735
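
The counts in the table use an apostrophe as the thousands separator, so they need a small conversion before they can be summed. The following is a minimal sketch, not part of the ACQDIV package, that copies the rows from the table and splits the total into public and private words. The names CORPORA, parse_count and totals are made up for this illustration.

```python
# Rows copied from the table above: (corpus, is_public, word count as printed).
CORPORA = [
    ("Chintang Language Corpus", False, "987'673"),
    ("Cree Child Language Acquisition Study (CCLAS) Corpus", True, "44'751"),
    ("English Manchester Corpus", True, "2'016'043"),
    ("MPI-EVA Jakarta Child Language Database", True, "2'489'329"),
    ("Allen Inuktitut Child Language Corpus", False, "71'191"),
    ("MiiPro Japanese Corpus", True, "1'011'670"),
    ("Miyata Japanese Corpus", True, "373'021"),
    ("Ku Waru Child Language Socialization Study", True, "65'723"),
    ("Sarvasy Nungon Corpus", True, "19'659"),
    ("Qaqet Child Language Documentation", False, "56'239"),
    ("Stoll Russian Corpus", False, "2'029'704"),
    ("Demuth Sesotho Corpus", True, "177'963"),
    ("Tuatschin Corpus", False, "118'310"),
    ("Koç University Longitudinal Language Development Database", False, "1'120'077"),
    ("Pfeiler Yucatec Child Language Corpus", False, "262'382"),
]


def parse_count(text: str) -> int:
    """Convert a count such as "2'016'043" to an integer."""
    return int(text.replace("'", ""))


def totals(corpora):
    """Return (public_words, private_words) for the given rows."""
    public = sum(parse_count(n) for _, is_public, n in corpora if is_public)
    private = sum(parse_count(n) for _, is_public, n in corpora if not is_public)
    return public, private


public_words, private_words = totals(CORPORA)
print("public: ", public_words)
print("private:", private_words)
print("total:  ", public_words + private_words)
```

The two partial sums add up to the total of 10'843'735 given in the last row of the table.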

Running the pipeline

For Windows users, follow the installation/run instructions here: https://github.com/acqdiv/acqdiv/wiki/Installation-Run-instructions-for-Windows

For Mac and Linux users, continue here to run the pipeline yourself:

Install the package

Create a virtual environment [optional]:

python3 -m venv venv
source venv/bin/activate

You can install the package from PyPI or directly from source:

PyPI

pip install acqdiv

From source

# Clone Repository
git clone git@github.com:acqdiv/acqdiv.git
cd acqdiv

# Install package (for users!)
pip install .

# Developer mode (for developers!)
pip install -r requirements.txt

Get the corpora

Run the following script to download the public corpora:

python util/download_public_corpora.py

The downloaded corpora are placed in the folder corpora.

For the private corpora, either place the session files in corpora/<corpus_name>/{cha|toolbox}/ and the metadata files (Toolbox corpora only) in corpora/<corpus_name>/imdi/, or edit the paths to those files in config.ini (see below).
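
The default layout described above can be sketched in a few lines of Python. This is only an illustration of the directory convention, not code from the package: the function name private_corpus_dirs and the corpus name "Example" are hypothetical.

```python
from pathlib import Path


def private_corpus_dirs(corpora_dir, corpus_name, fmt):
    """Return the default directories for one private corpus.

    fmt is "cha" or "toolbox" and names the session file folder.
    Only Toolbox corpora get an additional imdi folder for metadata.
    """
    if fmt not in ("cha", "toolbox"):
        raise ValueError(f"unknown session format: {fmt!r}")
    base = Path(corpora_dir) / corpus_name
    dirs = [base / fmt]
    if fmt == "toolbox":
        dirs.append(base / "imdi")
    return dirs


# "Example" is a placeholder corpus name, not one of the ACQDIV corpora.
for directory in private_corpus_dirs("corpora", "Example", "toolbox"):
    print(directory)
    # To create the folder, one could call:
    # directory.mkdir(parents=True, exist_ok=True)
```

If the files live elsewhere, no folders need to be created; the paths in config.ini can point to the existing locations instead.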

Generate the database

Get the configuration file src/acqdiv/config.ini and specify the absolute paths (without trailing slashes) for the corpora directory (corpora_dir) and the directory where the database should be written to (db_dir):

[.global]
# directory containing corpora
corpora_dir = /absolute/path/to/corpora/dir
# directory where the database is written to
db_dir = /absolute/path/to/database/dir
...

Optionally adapt the paths for the individual corpora (sessions and metadata_dir).
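
Because both paths must be absolute and must not end in a slash, a quick check before running the pipeline can catch typos. The sketch below uses Python's standard configparser and assumes that config.ini can be read as a plain INI file with interpolation switched off; the function name check_global_paths is hypothetical and not part of the acqdiv package.

```python
import configparser
from pathlib import PurePosixPath


def check_global_paths(config_text):
    """Return a list of problems found in the [.global] section."""
    # Assumption: the file parses as standard INI; interpolation is disabled
    # so that "%" characters in paths are taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    problems = []
    for key in ("corpora_dir", "db_dir"):
        value = parser.get(".global", key, fallback=None)
        if value is None:
            problems.append(f"{key}: missing")
        elif not PurePosixPath(value).is_absolute():
            problems.append(f"{key}: not an absolute path")
        elif len(value) > 1 and value.endswith("/"):
            problems.append(f"{key}: has a trailing slash")
    return problems


# Example input with one deliberate mistake (trailing slash on db_dir).
SAMPLE = """
[.global]
# directory containing corpora
corpora_dir = /absolute/path/to/corpora/dir
# directory where the database is written to
db_dir = /absolute/path/to/database/dir/
"""

for problem in check_global_paths(SAMPLE):
    print(problem)
```

For a real file, the text can be read with Path("/absolute/path/to/config.ini").read_text() and passed to the same function.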

Run the pipeline specifying the absolute path to the configuration file:
acqdiv load -c /absolute/path/to/config.ini
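
The pipeline writes an SQLite database to db_dir. To get a first impression of the result without assuming anything about its schema, one can list the tables and their row counts with Python's built-in sqlite3 module. This is a generic sketch: the function name table_row_counts and the table "example" in the demonstration are hypothetical and do not reflect the actual ACQDIV tables.

```python
import sqlite3
from contextlib import closing


def table_row_counts(conn):
    """Return {table name: number of rows} for every table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Quote the identifier so unusual table names cannot break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Demonstration on an in-memory database with a made-up table.
# For the generated file, connect to "/absolute/path/to/sqlite-DB" instead.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE example (id INTEGER PRIMARY KEY, word TEXT)")
    conn.executemany("INSERT INTO example (word) VALUES (?)", [("a",), ("b",)])
    for table, rows in table_row_counts(conn).items():
        print(table, rows)
```

Note that sqlite3.connect creates an empty database file if the given path does not exist, so an empty result usually means the path is wrong.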

Generate the R object

Install dependencies

$ R
> install.packages("RSQLite")
> install.packages("rlang")

Navigate to src/acqdiv/database and run:

Rscript sqlite_to_r.R /absolute/path/to/sqlite-DB

Run tests

Run the unit tests:
pytest tests/unittests

Run the integrity tests on the database:
pytest tests/systemtests

Release history

This version

1.1.0

Dec 1, 2020

1.0.0

Dec 1, 2019

0.2.1

Nov 24, 2019

0.2.0

Nov 24, 2019

0.1.0

Nov 11, 2019

Download files

Source Distribution

acqdiv-1.1.0.tar.gz (148.8 kB)

Uploaded Dec 1, 2020 Source

File details

Details for the file acqdiv-1.1.0.tar.gz.

File metadata

Download URL: acqdiv-1.1.0.tar.gz
Upload date: Dec 1, 2020
Size: 148.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.8.5

File hashes

Hashes for acqdiv-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8ca05d0058cc04e9fbae16b9df244c851dbf87b483d97d2e500874c8851d643a`
MD5	`ddd74c4f27ae54fd41705e53854d7b32`
BLAKE2b-256	`e62e50039684b6521d5a8aab314e7687feedd218ecfcb370a6ef0003185e83f2`