Downloader and tools for Spraakbanken data

SCRIBE data downloader

This project is a work in progress. Its core functionality is interactive, version-checked downloading, with user prompts, of the following datasets:

NST
NPSC
NB tale
more to come...

System components

Metadata handling
- Web scraper (online information)
- PDF
- XML
Downloader
- Downloading the files
Data extraction (TODO)
- Reformat the downloaded data into a unified format, as described in the combined_dataset repo
Parsing (TODO)

Installation

pip install spraakbanken-downloader

Usage (from pip installation)

python -m spraakbanken

Usage (Local)

Install the dependencies first: make install
Subsequent runs: make, which in turn runs linting and the main file, prompting you for which dataset to download.

Usage (Local: Accessing python files directly)

Run the main file with a --dataset (-d) argument:

python src/main.py --dataset *somedataset* (where dataset is in {nst, storting, nbtale})

e.g.

python src/main.py --dataset nst

An optional --verbose (-v) flag prints the metadata to the console.
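
The flags described above could be wired up roughly as follows. This is a minimal sketch using argparse, not the project's actual src/main.py: the function name build_parser and the help texts are assumptions, and only the flag names and dataset keys come from this README.

```python
import argparse

# Dataset keys taken from the usage line in this README.
DATASETS = ("nst", "storting", "nbtale")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="spraakbanken")
    parser.add_argument("-d", "--dataset", choices=DATASETS,
                        help="which dataset to fetch")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print the fetched metadata to the console")
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args(["--dataset", "nst", "-v"])
print(f"dataset={args.dataset!r} verbose={args.verbose}")
```

Restricting --dataset with choices means an unknown dataset name is rejected by argparse itself, before any network access happens.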

This guides you through the download process, based on information from the Spraakbanken websites, as follows:

New datasets:

>>> python src/main.py --dataset nst

Fetching data for dataset: 'nst'
----------------------------------------
Last updated at 2020-07-31
Accessed URL 'https://www.nb.no/SPRAKBANKEN/ressurskatalog//oai-nb-no-sbr-54/' at 03-08-2022_16-10-09
Found the following files:
1. ADB_NOR_0463.tar.gz
2. ADB_NOR_0464.tar.gz
3. ADB_OD_Nor.NOR.tar.gz
4. lydfiler_16_1_a.tar.gz
5. lydfiler_16_1_b.tar.gz
6. lydfiler_16_1_c.tar.gz
7. lydfiler_16_1_d.tar.gz
8. lydfiler_16_2_a.tar.gz
9. lydfiler_16_2_b.tar.gz
10. lydfiler_16_2_c.tar.gz
11. lydfiler_16_2_d.tar.gz
12. lydfiler_16_begge_a.tar.gz
13. lydfiler_16_begge_b.tar.gz
14. lydfiler_16_begge_c.tar.gz
15. lydfiler_16_begge_d.tar.gz
Download? [yes (Y) / no (N)]
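
A prompt like the one above needs to accept both the short and long answers and to ask again on anything else. The sketch below shows one way to do that. The names parse_yes_no and confirm are hypothetical, and re-asking on unrecognised input is an assumption about the desired behaviour, not something this README states.

```python
from typing import Callable, Optional


def parse_yes_no(answer: str) -> Optional[bool]:
    """Map a raw answer to True/False, or None if it is not recognised."""
    normalised = answer.strip().lower()
    if normalised in ("y", "yes"):
        return True
    if normalised in ("n", "no"):
        return False
    return None


def confirm(question: str, ask: Callable[[str], str] = input) -> bool:
    """Repeat the question until a recognisable answer is given."""
    while True:
        decision = parse_yes_no(ask(f"{question} [yes (Y) / no (N)] "))
        if decision is not None:
            return decision
```

Passing the input function in as ask keeps the prompt logic testable without a terminal.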

Existing local datasets

The metadata fetching process stores the fetched metadata locally, in a file named after its checksum, so that later runs can detect whether anything has changed. An example file name is nbtale/3676375100_02-08-2022_20-14-59.json: the first part is the checksum, and the rest is the date and time of access.

This file contains the metadata for the corresponding dataset. The available fields vary between datasets, as Spraakbanken does not standardize them.
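
The naming scheme can be sketched as below. Only the file name layout (checksum, underscore, DD-MM-YYYY_HH-MM-SS, .json) is taken from the example above. The checksum algorithm is not documented here, so CRC32 over a key-sorted JSON dump is purely an assumption for illustration, and all function names are hypothetical.

```python
import json
import zlib
from datetime import datetime
from pathlib import PurePosixPath
from typing import Tuple

TIMESTAMP_FORMAT = "%d-%m-%Y_%H-%M-%S"  # matches 02-08-2022_20-14-59


def metadata_checksum(metadata: dict) -> int:
    """CRC32 of the metadata serialised with sorted keys (assumed algorithm)."""
    canonical = json.dumps(metadata, sort_keys=True, ensure_ascii=False)
    return zlib.crc32(canonical.encode("utf-8"))


def meta_file_name(checksum: int, accessed: datetime) -> str:
    """Build '<checksum>_<timestamp>.json' for a metadata snapshot."""
    return f"{checksum}_{accessed.strftime(TIMESTAMP_FORMAT)}.json"


def split_meta_file_name(name: str) -> Tuple[int, datetime]:
    """Recover the checksum and access time from a stored file name."""
    stem = PurePosixPath(name).stem          # drops the directory and '.json'
    checksum, timestamp = stem.split("_", 1)  # split at the first underscore only
    return int(checksum), datetime.strptime(timestamp, TIMESTAMP_FORMAT)


checksum, accessed = split_meta_file_name("nbtale/3676375100_02-08-2022_20-14-59.json")
print(checksum, accessed.isoformat())
```

Sorting the keys before hashing makes the checksum independent of the order in which fields were scraped.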

If a stored checksum file matches the newly fetched metadata, the user is prompted as follows:

>>> python src/main.py --dataset nbtale

Last updated at 2015-12-22
Accessed URL 'https://www.nb.no/SPRAKBANKEN/ressurskatalog//oai-nb-no-sbr-31/' at 03-08-2022_16-14-45
Dataset 'nbtale' already downloaded
Continue to download regardless? [yes (Y) / no (N)]

If the answer is "y", the user continues through the same pipeline as above:

Continue to download regardless? [yes (Y) / no (N)] y
Found the following files:
1. sennheiser_1.tar.gz
2. sennheiser_2.tar.gz
3. sennheiser_3.tar.gz
4. shure_1.tar.gz
5. shure_2.tar.gz
6. shure_3.tar.gz
Download? [yes (Y) / no (N)]
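
The "already downloaded" check in this flow amounts to looking for a stored file whose name starts with the freshly computed checksum. The sketch below illustrates that idea under the assumption that meta files live in one directory per dataset. The function names and the confirm callback are hypothetical and are not taken from the project's code.

```python
from pathlib import Path
from typing import Callable, Optional


def find_matching_meta_file(dataset_dir: Path, checksum: int) -> Optional[Path]:
    """Return a stored meta file whose name starts with the checksum, else None."""
    if not dataset_dir.is_dir():
        return None
    matches = sorted(dataset_dir.glob(f"{checksum}_*.json"))
    return matches[0] if matches else None


def proceed_to_file_listing(dataset_dir: Path, checksum: int,
                            confirm: Callable[[str], bool]) -> bool:
    """Ask the user only when the same metadata was already seen."""
    if find_matching_meta_file(dataset_dir, checksum) is None:
        return True  # new or changed metadata: go straight to the file listing
    print(f"Dataset '{dataset_dir.name}' already downloaded")
    return confirm("Continue to download regardless?")
```

Because the underscore is part of the glob pattern, a checksum that is merely a prefix of another one does not produce a false match.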

Demo mode

You can append the --only-meta argument to create the checksum files as if you had downloaded the dataset:

>>> python src/main.py --dataset storting --only-meta

Meta file example (Storting corpus):

Some work remains before the data fields are stored properly, since the Spraakbanken websites are not scraper-friendly. In the example, several sections repeat the same values.

{
    "corpus audio info": {
        "size": "140",
        "size unit": "files",
        "duration unit": "hours",
        "mime type": "audio/wav",
        "signal encoding": "linearpcm",
        "sampling rate": "48000",
        "quantization": "16",
        "byte order": "littleendian",
        "sign convention": "signedinteger",
        "number of tracks": "2",
        "recording quality": "medium"
    },
    "audio size info": {
        "size": "140",
        "size unit": "files",
        "duration unit": "hours"
    },
    "size info": {
        "size": "1198590",
        "size unit": "words"
    },
    "duration of effective speech info": {
        "size": "126",
        "duration unit": "hours"
    },
    "duration of audio info": {
        "size": "140",
        "duration unit": "hours"
    },
    "audio format info": {
        "mime type": "audio/wav",
        "signal encoding": "linearpcm",
        "sampling rate": "48000",
        "quantization": "16",
        "byte order": "littleendian",
        "sign convention": "signedinteger",
        "number of tracks": "2",
        "recording quality": "medium"
    },
    "corpus text info": {
        "size": "1198590",
        "size unit": "words",
        "character encoding": "utf-8"
    },
    "text format info": {
        "size": "1198590",
        "size unit": "words"
    },
    "size per text format": {
        "size": "1198590",
        "size unit": "words"
    },
    "character encoding info": {
        "character encoding": "utf-8"
    }
}
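
Every value in the meta file is stored as a string, so a consumer has to convert numeric fields before using them. The sketch below does that for a fragment of the example above. The helper coerce_numbers is hypothetical, and the bytes-per-second figure is simply the standard calculation for uncompressed linear PCM.

```python
import json

# A fragment of the Storting meta file shown above.
META_JSON = """
{
    "audio format info": {
        "mime type": "audio/wav",
        "sampling rate": "48000",
        "quantization": "16",
        "number of tracks": "2"
    },
    "duration of effective speech info": {
        "size": "126",
        "duration unit": "hours"
    }
}
"""


def coerce_numbers(section: dict) -> dict:
    """Return a copy of one section with digit-only strings converted to int."""
    return {key: int(value) if value.isdigit() else value
            for key, value in section.items()}


meta = {name: coerce_numbers(fields)
        for name, fields in json.loads(META_JSON).items()}

audio = meta["audio format info"]
# Uncompressed PCM: samples per second * bytes per sample * tracks.
bytes_per_second = (audio["sampling rate"] * audio["quantization"] // 8
                    * audio["number of tracks"])
print(bytes_per_second)  # 192000
```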

Release history

- 0.2.0 (Sep 11, 2023)
- 0.1.4 (Oct 3, 2022)
- 0.1.3 (Aug 17, 2022)
- 0.1.2 (Aug 17, 2022)
- 0.1.1 (Aug 16, 2022)
- 0.1.0 (Aug 16, 2022)
- 0.0.5 (Aug 10, 2022)
- 0.0.4 (Aug 8, 2022)
- 0.0.3 (Aug 8, 2022), this version
- 0.0.2 (Aug 8, 2022)

Download files

Source Distribution

spraakbanken_downloader-0.0.3.tar.gz (6.7 kB), uploaded Aug 8, 2022

Built Distribution

spraakbanken_downloader-0.0.3-py3-none-any.whl (7.5 kB), uploaded Aug 8, 2022, Python 3

Hashes for spraakbanken_downloader-0.0.3.tar.gz

Algorithm	Hash digest
SHA256	`1340502d08552ec01713df68faef773477aadd6a183261ccdfb2bed6b586322b`
MD5	`13027f42bf30b0313607809a4eb81dcb`
BLAKE2b-256	`a857e9d25f4b55e6bdd02350199f434e5a1aa5a01023fb64326cf780089499d9`

Hashes for spraakbanken_downloader-0.0.3-py3-none-any.whl

Algorithm	Hash digest
SHA256	`12b5a8a30945f1e8b84946de1dbc8875750f471b730a01585357af89a02174f3`
MD5	`1daed8aef9836c58a60631c23060b35b`
BLAKE2b-256	`03f796a1e337522f0881f4f639395c984b461c997a1886a6f1dd97ebbfe2eeb2`