Library for downloading CHILDES, preprocessing it, and extracting phonetic transcriptions

CHILDES Processor

Scripts for processing the CHILDES dataset and converting it to a phonemic representation. Used to create the IPA-CHILDES dataset (see scripts/create_ipa_childes).

Installation

The simplest way is using pip:

pip install childes-processor

Or you can install from source:

git clone https://github.com/codebyzeb/childes-processor
cd childes-processor
pip install .

Dependencies

If using the process command to convert CHILDES to IPA, you may require additional dependencies for G2P+.

If you are using the download mode, make sure you have R installed.
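Because the download mode depends on R, it can help to check that R is reachable before starting a long download. The following is a minimal sketch, not part of childes-processor: `r_is_available` is a hypothetical helper, and the executable name `Rscript` is an assumption, since this README does not say how R is invoked.

```python
import shutil


def r_is_available(executable: str = "Rscript") -> bool:
    """Return True if the given executable can be found on the PATH.

    Assumption: R is reached through an executable on the PATH. The default
    name "Rscript" is a guess, not something childes-processor documents.
    """
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if r_is_available():
        print("R found on PATH.")
    else:
        print("R not found on PATH; install R before using the download mode.")
```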

Usage

CHILDES Processor can be used as a command-line interface (invoked as childes_processor in the examples below) or by importing ChildesDownloader or ChildesProcessor in Python. The CLI has three modes, download, process and extract, which together allow the user to download and transcribe the CHILDES dataset.

To bring up the help menu, simply type:

childes_processor -h

Each mode also has its own help menu:

childes_processor extract -h

Download

The download mode allows for corpora to be downloaded from CHILDES. For example, to download the Warren corpus from the Eng-NA collection, run the following:

childes_processor download Eng-NA --corpus Warren -o childes/downloaded

This will save the utterances to childes/downloaded/Eng-NA/Warren.csv. If -s is used, the data will be separated by speaker. The command can also be run without specifying a corpus, in which case all corpora available in the collection are downloaded:

childes_processor download Eng-NA -o downloaded
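After a download, the output folder can be inspected from Python. This sketch assumes the default layout described above (one CSV per corpus inside a folder named after the collection); `list_corpus_csvs` is a hypothetical helper, not a function provided by the package.

```python
from pathlib import Path


def list_corpus_csvs(output_dir: Path, collection: str) -> list[Path]:
    """Return the CSV files found for one collection, sorted by name.

    Assumption: files are laid out as <output_dir>/<collection>/<corpus>.csv.
    """
    collection_dir = Path(output_dir) / collection
    if not collection_dir.is_dir():
        return []
    return sorted(collection_dir.glob("*.csv"))


if __name__ == "__main__":
    for csv_path in list_corpus_csvs(Path("downloaded"), "Eng-NA"):
        print(csv_path.stem)  # the corpus name, taken from the file name
```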

Process

The process mode takes CSVs downloaded from CHILDES (those produced by the download mode) and produces a new CSV with additional columns and with utterances sorted by child age. The additional columns are as follows:

| Column | Description |
|---|---|
| `is_child` | Whether the utterance was spoken by a child or not. Note that unless the `-k` or `--keep` flag is set, all child utterances will be discarded, so this column will only contain `False`. |
| `processed_gloss` | The pre-processed orthographic utterance. This includes lowercasing, fixing English spelling and adding punctuation marks. This is based on the AOChildes preprocessing. |
| `ipa_transcription` | A phonemic transcription of the utterance in IPA, space-separated, with word boundaries marked with the `WORD_BOUNDARY` token. This uses G2P+ with specifically configured backends and language codes. |
| `character_split_utterance` | A space-separated transcription of the utterance, produced simply by splitting the processed gloss by character. This is intended to have a very similar format to `ipa_transcription` for studies comparing phonemic to orthographic transcriptions. |
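To make the `character_split_utterance` format concrete, here is a minimal sketch of a character split. It is an illustration, not the package's implementation: `character_split` is a hypothetical function, and placing a `WORD_BOUNDARY` token after every word is an assumption based on the column being described as similar in format to `ipa_transcription`.

```python
WORD_BOUNDARY = "WORD_BOUNDARY"


def character_split(processed_gloss: str) -> str:
    """Split a gloss into space-separated characters.

    Assumption: each word is followed by the WORD_BOUNDARY token so that the
    result lines up with the ipa_transcription format.
    """
    tokens: list[str] = []
    for word in processed_gloss.split():
        tokens.extend(word)           # one token per character
        tokens.append(WORD_BOUNDARY)  # mark the end of the word
    return " ".join(tokens)


if __name__ == "__main__":
    print(character_split("hi there"))
```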

The first required argument is the CSV or folder of CSVs to process. The second argument is the language that will be used for producing the phonemic transcription. To view supported languages, use -h.

The -k or --keep flag is used to keep child utterances. The -s or --split flag is used to split the resulting dataset into a training set and a validation set containing 10,000 utterances. The -m or --max_age flag is used to discard all utterances produced when the child's age is greater than the provided number of months.

For example, to process all downloaded Eng-NA corpora, run the following:

childes_processor process downloaded/Eng-NA EnglishNA -o processed/Eng-NA -s

This will take all the CSVs in the downloaded/Eng-NA folder and create two new CSVs, train.csv and valid.csv, in the processed/Eng-NA folder, containing the processed utterances and the additional columns described above. These datasets contain phonemic transcriptions of each utterance that have been produced using the en-us language backend. If the path provided is a CSV instead of a folder, just that CSV will be processed.
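The effect of the -k, -m and -s flags can be pictured with a small sketch over plain dictionaries. This is not the package's code: `filter_and_split` is a hypothetical function, the rows are assumed to carry `is_child` and `target_child_age` (in months) fields, and taking the validation set from the end of the sorted rows is an assumption, since the README does not say how the 10,000 validation utterances are chosen.

```python
from typing import Optional


def filter_and_split(
    rows: list[dict],
    keep_child: bool = False,
    max_age: Optional[float] = None,
    valid_size: int = 10_000,
) -> tuple[list[dict], list[dict]]:
    """Filter utterance rows, sort them by child age, and split off a validation set.

    Assumption: the validation set is the last `valid_size` rows after
    sorting; the real tool may select them differently.
    """
    kept = [
        row
        for row in rows
        if (keep_child or not row["is_child"])  # -k keeps child utterances
        and (max_age is None or row["target_child_age"] <= max_age)  # -m cutoff
    ]
    kept.sort(key=lambda row: row["target_child_age"])
    if valid_size <= 0:
        return kept, []
    return kept[:-valid_size], kept[-valid_size:]
```

For instance, calling `filter_and_split(rows, max_age=24, valid_size=1)` drops child utterances and anything above 24 months, then holds out a single row for validation.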

Extract

The extract mode takes a CSV dataset and produces a text file containing one column from that dataset. It has the option to apply a maximum age cutoff, as with the process mode, using -m or --max_age. The intended use is to gather all phonemic or orthographic utterances from the processed dataset, but it can also be used to extract other columns, or to extract from a downloaded CSV that hasn't been processed.

For example, to extract all IPA transcriptions from the train file produced by the previous example, only including utterances targeting children under the age of 2, run the following:

childes_processor extract processed/Eng-NA/train.csv ipa_transcription -o extracted/Eng-NA -m 24

This will create a file extracted/Eng-NA/utterances.txt containing the contents of the ipa_transcription column where target_child_age is less than 24 months.
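The same kind of extraction can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the tool's implementation: `extract_column` is a hypothetical function, the output file name utterances.txt follows the example above, and rows are kept only when `target_child_age` is strictly below the cutoff, matching the "less than 24 months" wording.

```python
import csv
from pathlib import Path
from typing import Optional


def extract_column(
    csv_path: Path, column: str, output_dir: Path, max_age: Optional[float] = None
) -> Path:
    """Write one CSV column to <output_dir>/utterances.txt, one value per line.

    Assumption: target_child_age holds the age in months, and rows at or
    above max_age are skipped.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / "utterances.txt"
    with open(csv_path, newline="", encoding="utf-8") as src:
        with open(out_path, "w", encoding="utf-8") as dst:
            for row in csv.DictReader(src):
                if max_age is not None and float(row["target_child_age"]) >= max_age:
                    continue
                dst.write(row[column] + "\n")
    return out_path
```

A row with an empty `target_child_age` would raise a `ValueError` here; a fuller version would need to decide how to treat missing ages.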

Python Usage

The download and process modes can also be used within Python. For example:

from pathlib import Path

from childes_processor import ChildesProcessor, ChildesDownloader

DOWNLOAD_PATH = Path('downloaded')
PROCESSED_PATH = Path('processed')

# Download the Warren corpus from the Eng-NA collection
downloader = ChildesDownloader()
downloader.download('Eng-NA',
                    'Warren',
                    DOWNLOAD_PATH,
                    separate_by_child=False)

# Load the downloaded CSVs, keeping child utterances
processor = ChildesProcessor(DOWNLOAD_PATH / 'Eng-NA',
                             keep_child_utterances=True,
                             max_age=120)
# Add the phonemic and character-split transcriptions, then save
processor.transcribe_utterances('EnglishNA')
processor.character_split_utterances()
processor.print_statistics()
processor.save_df(PROCESSED_PATH / 'Eng-NA')
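Once a processed CSV has been saved, the `ipa_transcription` strings can be grouped back into words by splitting on the `WORD_BOUNDARY` token. The sketch below is a hypothetical helper, not part of the package; it accepts transcriptions with or without a trailing boundary token, since the README does not specify which is produced, and the sample string in it is invented for illustration rather than real G2P+ output.

```python
WORD_BOUNDARY = "WORD_BOUNDARY"


def split_into_words(ipa_transcription: str) -> list[list[str]]:
    """Group space-separated phonemes into words at each WORD_BOUNDARY token."""
    words: list[list[str]] = []
    current: list[str] = []
    for token in ipa_transcription.split():
        if token == WORD_BOUNDARY:
            if current:  # ignore empty words from repeated boundary tokens
                words.append(current)
                current = []
        else:
            current.append(token)
    if current:  # transcription without a trailing boundary token
        words.append(current)
    return words


if __name__ == "__main__":
    # Invented sample string, used only to show the expected shape of the input.
    sample = "h ə l oʊ WORD_BOUNDARY w ɜː l d WORD_BOUNDARY"
    print(split_into_words(sample))
```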


Version 0.1.0, released Apr 2, 2025
