Downloader and tools for Spraakbanken data

SCRIBE data downloader

This is a work in progress. The core functionality is an interactive, version-checked download, with user prompts, for the following datasets: nst, storting, and nbtale.

System components

  • Metadata handling
    • Web scraper (online information)
    • PDF
    • XML
  • Downloader
    • Downloading the files
  • Data extraction (TODO)
  • Parsing (TODO)

Installation

pip install spraakbanken-downloader

Usage (from pip installation)

python -m spraakbanken

Usage (Local)

  1. Install the dependencies first: make install
  2. On subsequent runs: make, which in turn runs linting and the main file, prompting you for which dataset to download.

Usage (Local: Accessing python files directly)

Run the main file with a --dataset (-d) argument:

python src/main.py --dataset *somedataset* (where dataset is in {nst, storting, nbtale})

e.g.

python src/main.py --dataset nst

(The optional --verbose (-v) flag prints the metadata to the console.)

This guides you through the download process, based on information from the Spraakbanken websites, as follows:

New datasets:

>>> python src/main.py --dataset nst

Fetching data for dataset: 'nst'
----------------------------------------
Last updated at 2020-07-31
Accessed URL 'https://www.nb.no/SPRAKBANKEN/ressurskatalog//oai-nb-no-sbr-54/' at 03-08-2022_16-10-09
Found the following files:
1. ADB_NOR_0463.tar.gz
2. ADB_NOR_0464.tar.gz
3. ADB_OD_Nor.NOR.tar.gz
4. lydfiler_16_1_a.tar.gz
5. lydfiler_16_1_b.tar.gz
6. lydfiler_16_1_c.tar.gz
7. lydfiler_16_1_d.tar.gz
8. lydfiler_16_2_a.tar.gz
9. lydfiler_16_2_b.tar.gz
10. lydfiler_16_2_c.tar.gz
11. lydfiler_16_2_d.tar.gz
12. lydfiler_16_begge_a.tar.gz
13. lydfiler_16_begge_b.tar.gz
14. lydfiler_16_begge_c.tar.gz
15. lydfiler_16_begge_d.tar.gz
Download? [yes (Y) / no (N)]
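
The numbered listing and the yes/no prompt in the transcript above can be sketched in a few lines. This is only an illustration of the interaction, not the project's actual code: the function names format_file_listing, parse_yes_no and confirm are hypothetical, and treating "y"/"yes" and "n"/"no" case-insensitively is an assumption.

```python
def format_file_listing(filenames):
    """Return the numbered listing shown before the download prompt."""
    lines = ["Found the following files:"]
    lines += [f"{i}. {name}" for i, name in enumerate(filenames, start=1)]
    return "\n".join(lines)


def parse_yes_no(answer):
    """Map a typed answer to True/False, or None if it is not recognized."""
    normalized = answer.strip().lower()
    if normalized in {"y", "yes"}:
        return True
    if normalized in {"n", "no"}:
        return False
    return None


def confirm(question, ask=input):
    """Repeat the question until the answer is a recognizable yes or no.

    `ask` defaults to the built-in input() and can be swapped out in tests.
    """
    while True:
        decision = parse_yes_no(ask(f"{question} [yes (Y) / no (N)] "))
        if decision is not None:
            return decision
```

Calling confirm("Download?") would then block until the user types a recognizable answer.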

Existing local datasets

The metadata fetching process stores the metadata locally, along with a checksum used to match it against later fetches. An example file name is nbtale/3676375100_02-08-2022_20-14-59.json: the first part is the checksum, and the rest is the date and time of access.
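
As a rough sketch of this naming scheme, the file name can be built from, and split back into, a checksum and a timestamp. The timestamp format is read off the example name; the use of CRC32 over a sorted JSON dump is purely an assumption (the README does not say which checksum is used), and all function names here are hypothetical.

```python
import json
import zlib
from datetime import datetime
from pathlib import PurePosixPath

TIMESTAMP_FORMAT = "%d-%m-%Y_%H-%M-%S"  # e.g. 02-08-2022_20-14-59


def metadata_checksum(metadata):
    """Checksum of a metadata dict. CRC32 is an assumption, not the project's confirmed choice."""
    canonical = json.dumps(metadata, sort_keys=True, ensure_ascii=False)
    return zlib.crc32(canonical.encode("utf-8"))


def meta_filename(dataset, metadata, accessed_at):
    """Build '<dataset>/<checksum>_<timestamp>.json'."""
    stamp = accessed_at.strftime(TIMESTAMP_FORMAT)
    return f"{dataset}/{metadata_checksum(metadata)}_{stamp}.json"


def parse_meta_filename(path):
    """Split '<dataset>/<checksum>_<timestamp>.json' into its three parts."""
    parsed = PurePosixPath(path)
    checksum, _, stamp = parsed.stem.partition("_")
    return parsed.parent.name, int(checksum), datetime.strptime(stamp, TIMESTAMP_FORMAT)
```

For the example above, parse_meta_filename("nbtale/3676375100_02-08-2022_20-14-59.json") yields the dataset name "nbtale", the checksum 3676375100, and an access time of 2 August 2022, 20:14:59.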

Within this file is the metadata for the corresponding dataset. The data points fetched vary between datasets, as they are not standardized by Spraakbanken.

If a checksum file matches the newly fetched metadata, the user is prompted as follows:

>>> python src/main.py --dataset nbtale

Last updated at 2015-12-22
Accessed URL 'https://www.nb.no/SPRAKBANKEN/ressurskatalog//oai-nb-no-sbr-31/' at 03-08-2022_16-14-45
Dataset 'nbtale' already downloaded
Continue to download regardless? [yes (Y) / no (N)] 

If the answer is "y", the user continues through the same pipeline as above:

Continue to download regardless? [yes (Y) / no (N)] y
Found the following files:
1. sennheiser_1.tar.gz
2. sennheiser_2.tar.gz
3. sennheiser_3.tar.gz
4. shure_1.tar.gz
5. shure_2.tar.gz
6. shure_3.tar.gz
Download? [yes (Y) / no (N)]  
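
The "already downloaded" check shown above could look roughly like this: search the dataset's folder for a stored metadata file whose name starts with the freshly computed checksum, and ask the user only when one exists. This is a minimal sketch under assumptions; find_matching_meta and should_download are hypothetical names, and the confirm parameter stands in for whatever yes/no prompt the tool uses.

```python
from pathlib import Path


def find_matching_meta(dataset_dir, checksum):
    """Return stored metadata files whose name starts with the given checksum."""
    return sorted(Path(dataset_dir).glob(f"{checksum}_*.json"))


def should_download(dataset_dir, checksum, confirm):
    """Decide whether to continue to the file listing.

    `confirm` is any callable that takes a question and returns True or False.
    """
    if not find_matching_meta(dataset_dir, checksum):
        return True  # no stored file for this checksum: treat it as a new dataset
    print(f"Dataset '{Path(dataset_dir).name}' already downloaded")
    return confirm("Continue to download regardless?")
```

Passing the prompt in as a callable keeps the check testable without real user input.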

Demo mode

You can append the --only-meta argument to create the checksum files as if you had downloaded the dataset:

>>> python src/main.py --dataset storting --only-meta
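
For orientation, the three command-line options described in this README (--dataset/-d, --verbose/-v and --only-meta) could be declared with argparse along these lines. This is an illustrative sketch, not the contents of src/main.py; build_parser is a hypothetical name, and leaving --dataset optional is an assumption.

```python
import argparse

DATASETS = ("nst", "storting", "nbtale")


def build_parser():
    """Declare the options mentioned in the README (sketch, not the real main.py)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Left optional here (assumption): the tool may prompt when no dataset is given.
    parser.add_argument("-d", "--dataset", choices=DATASETS,
                        help="which dataset to fetch")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print the fetched metadata to the console")
    parser.add_argument("--only-meta", action="store_true",
                        help="store the metadata/checksum file without downloading")
    return parser
```

With this declaration, parsing ["--dataset", "storting", "--only-meta"] gives dataset "storting", only_meta True and verbose False.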

Meta file example (Storting corpus):

There is still some work left to store proper data fields. The Spraakbanken websites are not web-scraper friendly.

{
    "corpus audio info": {
        "size": "140",
        "size unit": "files",
        "duration unit": "hours",
        "mime type": "audio/wav",
        "signal encoding": "linearpcm",
        "sampling rate": "48000",
        "quantization": "16",
        "byte order": "littleendian",
        "sign convention": "signedinteger",
        "number of tracks": "2",
        "recording quality": "medium"
    },
    "audio size info": {
        "size": "140",
        "size unit": "files",
        "duration unit": "hours"
    },
    "size info": {
        "size": "1198590",
        "size unit": "words"
    },
    "duration of effective speech info": {
        "size": "126",
        "duration unit": "hours"
    },
    "duration of audio info": {
        "size": "140",
        "duration unit": "hours"
    },
    "audio format info": {
        "mime type": "audio/wav",
        "signal encoding": "linearpcm",
        "sampling rate": "48000",
        "quantization": "16",
        "byte order": "littleendian",
        "sign convention": "signedinteger",
        "number of tracks": "2",
        "recording quality": "medium"
    },
    "corpus text info": {
        "size": "1198590",
        "size unit": "words",
        "character encoding": "utf-8"
    },
    "text format info": {
        "size": "1198590",
        "size unit": "words"
    },
    "size per text format": {
        "size": "1198590",
        "size unit": "words"
    },
    "character encoding info": {
        "character encoding": "utf-8"
    }
}
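
Since every value in the meta file is stored as a string, numeric fields need converting before use. The sketch below reads two fields from an excerpt of the example above and computes the share of effective speech; the helper names numeric_field and effective_speech_ratio are hypothetical, and the excerpt is inlined only to keep the example self-contained.

```python
import json

# Excerpt of the Storting meta file shown above.
EXAMPLE_META = json.loads("""
{
    "duration of effective speech info": {"size": "126", "duration unit": "hours"},
    "duration of audio info": {"size": "140", "duration unit": "hours"},
    "corpus audio info": {"sampling rate": "48000", "number of tracks": "2"}
}
""")


def numeric_field(meta, section, key):
    """Read a string-valued field and convert it to an integer."""
    return int(meta[section][key])


def effective_speech_ratio(meta):
    """Hours of effective speech divided by hours of audio."""
    speech = numeric_field(meta, "duration of effective speech info", "size")
    audio = numeric_field(meta, "duration of audio info", "size")
    return speech / audio
```

For the Storting example this gives 126 / 140, i.e. about 90% effective speech, with a sampling rate of 48000 read from "corpus audio info".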
