Library for downloading CHILDES, preprocessing it, and extracting phonetic transcriptions

CHILDES Processor

Scripts for processing the CHILDES dataset and converting it to a phonemic representation. Used to create the IPA-CHILDES dataset (see scripts/create_ipa_childes).

Installation

The simplest way is using pip:

pip install childes-processor

Or you can install from source:

git clone https://github.com/codebyzeb/childes-processor
cd childes-processor
pip install .

Dependencies

If using the process command to convert CHILDES to IPA, you may require additional dependencies for G2P+.

If you are using the download mode, make sure you have R installed.

Usage

CHILDES processor can be used as a command-line interface using childes-processor or by importing ChildesDownloader or ChildesProcessor in Python. The CLI has three modes: download, process and extract. Together they allow the user to download the CHILDES dataset, transcribe it, and extract individual columns from the result.

To bring up the help menu, simply type:

childes_processor -h

Or for each mode, there is also a help menu:

childes_processor extract -h

Download

The download mode allows for corpora to be downloaded from CHILDES. For example, to download the Warren corpus from the Eng-NA collection, run the following:

childes_processor download Eng-NA --corpus Warren -o downloaded

This will save the utterances to downloaded/Eng-NA/Warren.csv. If -s is used, the data will be separated by speaker. The command can also be run without the corpus provided, downloading all corpora available in the collection:

childes_processor download Eng-NA -o downloaded

Process

The process mode will process downloaded CSVs from CHILDES (those downloaded from the download tool) and provide a new CSV with additional columns and utterances sorted by child age. The additional columns are as follows:

Column Description
is_child Whether the utterance was spoken by a child or not. Note that unless the -k or --keep flag is set, all child utterances will be discarded, so this column will only contain False.
processed_gloss The pre-processed orthographic utterance. This includes lowercasing, fixing English spelling and adding punctuation marks. This is based on the AOChildes preprocessing.
ipa_transcription A phonemic transcription of the utterance in IPA, space-separated, with word boundaries marked by the WORD_BOUNDARY token. This is produced by G2P+ with specifically-configured backends and language codes.
character_split_utterance A space-separated transcription of the utterance, produced simply by splitting the processed gloss by character. This is intended to have a very similar format to ipa_transcription for studies comparing phonemic to orthographic transcriptions.
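To make the character_split_utterance format concrete, here is a minimal sketch of splitting a processed gloss by character. The helper name character_split is hypothetical, and placing a WORD_BOUNDARY token after every word is an assumption based on the column being described as similar in format to ipa_transcription; the library's actual output may differ in detail.

```python
WORD_BOUNDARY = "WORD_BOUNDARY"  # token name taken from the column description


def character_split(processed_gloss: str) -> str:
    """Hypothetical helper: split an utterance into space-separated characters."""
    tokens = []
    for word in processed_gloss.split():
        tokens.extend(word)           # one token per character
        tokens.append(WORD_BOUNDARY)  # assumption: a boundary token follows every word
    return " ".join(tokens)


print(character_split("the dog"))
# prints: t h e WORD_BOUNDARY d o g WORD_BOUNDARY
```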

The first required argument is the CSV or folder of CSVs to process. The second argument is the language that will be used for producing the phonemic transcription. To view supported languages, use -h.

The -k or --keep flag is used to keep child utterances. The -s or --split flag is used to split the resulting dataset into a training set and a validation set containing 10,000 utterances. The -m or --max_age flag is used to discard all utterances produced when the child's age is greater than the provided number of months.
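The effect of these three flags can be sketched on plain Python dictionaries. This is an illustration only, not the library's implementation: the helper filter_and_split is hypothetical, the column names is_child and target_child_age are taken from this description, and taking the validation set from the end of the age-sorted data is an assumption, since how the 10,000 validation utterances are chosen is not stated here.

```python
def filter_and_split(rows, keep_child=False, max_age=None, valid_size=10_000):
    """Hypothetical helper mirroring the -k, -m and -s flags on a list of dict rows."""
    if not keep_child:
        # Without -k, child utterances are discarded.
        rows = [r for r in rows if not r["is_child"]]
    if max_age is not None:
        # With -m, drop utterances where the child is older than max_age months.
        rows = [r for r in rows if r["target_child_age"] <= max_age]
    # The processed output is sorted by child age.
    rows = sorted(rows, key=lambda r: r["target_child_age"])
    # Assumption: the validation set is the last valid_size rows;
    # the real tool may select it differently.
    if valid_size:
        return rows[:-valid_size], rows[-valid_size:]
    return rows, []


rows = [
    {"is_child": False, "target_child_age": 30},
    {"is_child": False, "target_child_age": 10},
    {"is_child": True, "target_child_age": 5},
]
train, valid = filter_and_split(rows, max_age=24, valid_size=0)
print(train)
```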

For example, to process all downloaded Eng-NA corpora, run the following:

childes_processor process downloaded/Eng-NA EnglishNA -o processed/Eng-NA -s

This will take all the CSVs in the downloaded/Eng-NA folder and create two new CSVs, train.csv and valid.csv, in the processed/Eng-NA folder. Both contain the processed utterances and the additional columns described above, including phonemic transcriptions produced using the en-us language backend. If the path provided is a CSV instead of a folder, only that CSV will be processed.
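When working with the resulting ipa_transcription values, it is often useful to recover word-level units from the token stream. The sketch below groups phonemes into words at each WORD_BOUNDARY token. The helper name ipa_words is hypothetical, and the example transcription is illustrative rather than actual tool output.

```python
def ipa_words(ipa_transcription: str, boundary: str = "WORD_BOUNDARY"):
    """Hypothetical helper: group space-separated phonemes into words."""
    words, current = [], []
    for token in ipa_transcription.split():
        if token == boundary:
            if current:          # close the word collected so far
                words.append(current)
            current = []
        else:
            current.append(token)
    if current:                  # handle a final word with no trailing boundary
        words.append(current)
    return words


# Illustrative input, not real output from the tool.
print(ipa_words("ð ə WORD_BOUNDARY d ɔ ɡ WORD_BOUNDARY"))
```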

Extract

The extract mode will take a CSV dataset and produce a text file containing a single column from that dataset. As with the process mode, a maximum age cutoff can be set using -m or --max_age. The intended use is to gather all phonemic or orthographic utterances from the processed dataset, but it can also be used to extract other columns, or to extract from a downloaded CSV that hasn't been processed.

For example, to extract all IPA transcriptions from the train file produced by the previous example, only including utterances targeting children under the age of 2, run the following:

childes_processor extract processed/Eng-NA/train.csv ipa_transcription -o extracted/Eng-NA -m 24

This will create a file extracted/Eng-NA/utterances.txt containing the contents of the ipa_transcription column where target_child_age is less than 24 months.
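For readers who want to see what this step amounts to, the following sketch reproduces it with the standard library only. It is not the package's own code: the function extract_column is hypothetical, and it assumes that target_child_age is stored as a number of months and that one utterance is written per line.

```python
import csv
from pathlib import Path


def extract_column(csv_path, column, out_dir, max_age=None):
    """Hypothetical helper: write one CSV column to out_dir/utterances.txt."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "utterances.txt"
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(out_file, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # Assumption: target_child_age is a numeric age in months.
            if max_age is not None and float(row["target_child_age"]) >= max_age:
                continue
            dst.write(row[column] + "\n")
    return out_file


# Example call, mirroring the command above:
# extract_column("processed/Eng-NA/train.csv", "ipa_transcription",
#                "extracted/Eng-NA", max_age=24)
```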

Python Usage

The download and process modes can also be used within Python. For example:

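from pathlib import Path

from childes_processor import ChildesDownloader, ChildesProcessor

DOWNLOAD_PATH = Path('downloaded')
PROCESSED_PATH = Path('processed')

# Download the Warren corpus from the Eng-NA collection
downloader = ChildesDownloader()
downloader.download('Eng-NA',
                    'Warren',
                    DOWNLOAD_PATH,
                    separate_by_child=False)

# Process the downloaded CSVs, keeping child utterances and
# discarding utterances where the child is older than 120 months
processor = ChildesProcessor(DOWNLOAD_PATH / 'Eng-NA',
                             keep_child_utterances=True,
                             max_age=120)
# Add phonemic transcriptions using the EnglishNA language
processor.transcribe_utterances('EnglishNA')
# Add the character-split version of each utterance
processor.character_split_utterances()
processor.print_statistics()
# Save the processed dataset
processor.save_df(PROCESSED_PATH / 'Eng-NA')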
