Downloader and tools for Spraakbanken data

SCRIBE/Spraakbanken data downloader

This is a work in progress. The core functionality is interactive, version-checked downloading, with user prompts, of the following datasets: nst, storting and nbtale.

System components

  • Metadata handling
    • Web scraper (online information)
    • PDF
    • XML
  • Downloader
    • Downloading the files
  • Data extraction (TODO)
  • Parsing (TODO)

Installation

pip install spraakbanken-downloader

Usage (from pip installation)

  • For the interactive downloader:
    • python -m spraakbanken
  • With specific arguments:
    • python -m spraakbanken --dataset nst
    • See below for supported arguments

Arguments


Option String    Required  Default           Option Summary
-d, --dataset    False     None              Optional specific dataset
-o, --outdir     False     current_dir/data  Output directory
-v, --verbose    False     False             Print metadata to the console
--meta           False     False             Mock run to test the metadata download. Metadata is still written to outdir
--unpack         False     False             Unpack downloaded archives
--cleanup        False     False             Delete downloaded archives after unpacking
--language       False     no                Language. Defaults to Norwegian. Only supported for certain datasets
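
To make the table concrete, here is a minimal sketch of how such options could be declared with argparse. It is not the package's own parser: the function name build_parser and the help texts are assumptions for illustration, and only the flag names and defaults are taken from the table above.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table (not the package's code)."""
    parser = argparse.ArgumentParser(prog="spraakbanken")
    parser.add_argument("-d", "--dataset", default=None,
                        help="Optional specific dataset")
    # The table lists the default as current_dir/data.
    parser.add_argument("-o", "--outdir", default=str(Path.cwd() / "data"),
                        help="Output directory")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print metadata to the console")
    parser.add_argument("--meta", action="store_true",
                        help="Mock run: only fetch metadata")
    parser.add_argument("--unpack", action="store_true",
                        help="Unpack downloaded archives")
    parser.add_argument("--cleanup", action="store_true",
                        help="Delete archives after unpacking")
    parser.add_argument("--language", default="no",
                        help="Language, defaults to Norwegian")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--dataset", "nst", "--unpack"])
print(args.dataset, args.unpack, args.cleanup, args.language)
```

All boolean options are modelled as store_true flags here, which matches their False defaults in the table.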

Local usage

  1. Install dependencies first: make install
  2. On subsequent runs: make, which in turn runs linting and the main file, prompting you for which dataset to download.

Usage (Local: Accessing python files directly)

Run the main file with a --dataset (-d) argument, where the dataset is one of nst, storting or nbtale:

python src/spraakbanken/__main__.py --dataset *somedataset*

For example:

python src/spraakbanken/__main__.py --dataset nst

This guides you through the download process, based on information from the Spraakbanken websites. The output looks like this:

New datasets:

Fetching data for dataset: 'nst'
----------------------------------------
Last updated at 2020-07-31
Accessed URL 'https://www.nb.no/SPRAKBANKEN/ressurskatalog//oai-nb-no-sbr-54/' at 03-08-2022_16-10-09
Found the following files:
1. ADB_NOR_0463.tar.gz
2. ADB_NOR_0464.tar.gz
3. ADB_OD_Nor.NOR.tar.gz
4. lydfiler_16_1_a.tar.gz
5. lydfiler_16_1_b.tar.gz
6. lydfiler_16_1_c.tar.gz
7. lydfiler_16_1_d.tar.gz
8. lydfiler_16_2_a.tar.gz
9. lydfiler_16_2_b.tar.gz
10. lydfiler_16_2_c.tar.gz
11. lydfiler_16_2_d.tar.gz
12. lydfiler_16_begge_a.tar.gz
13. lydfiler_16_begge_b.tar.gz
14. lydfiler_16_begge_c.tar.gz
15. lydfiler_16_begge_d.tar.gz
Download? [yes (Y) / no (N)]
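
The yes/no prompt shown above could be handled with a small helper like the following. This is only a sketch: the function name confirm and the injectable ask parameter are hypothetical, and the set of accepted answers is an assumption, since the README only shows "Y", "N" and "y".

```python
from typing import Callable


def confirm(prompt: str = "Download? [yes (Y) / no (N)] ",
            ask: Callable[[str], str] = input) -> bool:
    """Keep asking until the answer is a recognizable yes or no."""
    while True:
        answer = ask(prompt).strip().lower()
        if answer in {"y", "yes"}:
            return True
        if answer in {"n", "no"}:
            return False
        # Any other answer falls through and the question is repeated.
```

Passing the input function as a parameter keeps the helper testable without a terminal, for example confirm(ask=lambda _: "y").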

Existing local datasets

The metadata fetching process stores the metadata locally, along with a checksum used to detect whether it has changed. An example file name is nbtale/3676375100_02-08-2022_20-14-59.json, where the first part is the checksum and the rest is the date and time of access.

Each file contains the metadata for the corresponding dataset. The fields that are fetched vary between datasets, as they are not standardized by Spraakbanken.
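
As an illustration of the file naming scheme, the sketch below splits such a file name into its checksum and access time. The helper name parse_meta_filename is hypothetical, and the day-month-year ordering of the timestamp is an assumption; the README does not state it.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import Tuple

# Assumption: timestamps are day-month-year, then hour-minute-second.
TIMESTAMP_FORMAT = "%d-%m-%Y_%H-%M-%S"


def parse_meta_filename(path: str) -> Tuple[str, datetime]:
    """Split '<checksum>_<timestamp>.json' into its two parts."""
    stem = PurePosixPath(path).stem  # drops the folder and the .json suffix
    checksum, timestamp = stem.split("_", 1)  # split on the first underscore only
    return checksum, datetime.strptime(timestamp, TIMESTAMP_FORMAT)


checksum, accessed = parse_meta_filename("nbtale/3676375100_02-08-2022_20-14-59.json")
print(checksum, accessed.isoformat())
```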

If an existing checksum file matches the newly fetched metadata, the user is prompted as follows:

Last updated at 2015-12-22
Accessed URL 'https://www.nb.no/SPRAKBANKEN/ressurskatalog//oai-nb-no-sbr-31/' at 03-08-2022_16-14-45
Dataset 'nbtale' already downloaded
Continue to download regardless? [yes (Y) / no (N)] 

If the user answers "y", the download continues through the same pipeline as above:

Continue to download regardless? [yes (Y) / no (N)] y
Found the following files:
1. sennheiser_1.tar.gz
2. sennheiser_2.tar.gz
3. sennheiser_3.tar.gz
4. shure_1.tar.gz
5. shure_2.tar.gz
6. shure_3.tar.gz
Download? [yes (Y) / no (N)]  
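
The "already downloaded" check described above can be sketched as follows. The README does not say which checksum algorithm is used, so CRC32 over sorted JSON is purely an assumption here, and the names metadata_checksum and already_downloaded are hypothetical.

```python
import json
import tempfile
import zlib
from pathlib import Path


def metadata_checksum(meta: dict) -> str:
    """CRC32 of canonical JSON (assumption: the project may hash differently)."""
    payload = json.dumps(meta, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return str(zlib.crc32(payload))


def already_downloaded(meta: dict, dataset_dir: Path) -> bool:
    """True if a stored meta file name starts with the checksum of `meta`."""
    checksum = metadata_checksum(meta)
    return any(dataset_dir.glob(f"{checksum}_*.json"))


# Demonstration in a throwaway folder standing in for e.g. data/nbtale.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    meta = {"last updated": "2015-12-22"}
    before = already_downloaded(meta, folder)
    stored = folder / f"{metadata_checksum(meta)}_03-08-2022_16-14-45.json"
    stored.write_text(json.dumps(meta))
    after = already_downloaded(meta, folder)

print(before, after)
```

Because the checksum is part of the file name, the check only needs a directory listing and never has to open the stored files.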

Meta file example (Storting corpus):

Some work remains before the data fields are stored properly, as the Spraakbanken websites are not easy to scrape.

{
    "corpus audio info": {
        "size": "140",
        "size unit": "files",
        "duration unit": "hours",
        "mime type": "audio/wav",
        "signal encoding": "linearpcm",
        "sampling rate": "48000",
        "quantization": "16",
        "byte order": "littleendian",
        "sign convention": "signedinteger",
        "number of tracks": "2",
        "recording quality": "medium"
    },
    "audio size info": {
        "size": "140",
        "size unit": "files",
        "duration unit": "hours"
    },
    "size info": {
        "size": "1198590",
        "size unit": "words"
    },
    "duration of effective speech info": {
        "size": "126",
        "duration unit": "hours"
    },
    "duration of audio info": {
        "size": "140",
        "duration unit": "hours"
    },
    "audio format info": {
        "mime type": "audio/wav",
        "signal encoding": "linearpcm",
        "sampling rate": "48000",
        "quantization": "16",
        "byte order": "littleendian",
        "sign convention": "signedinteger",
        "number of tracks": "2",
        "recording quality": "medium"
    },
    "corpus text info": {
        "size": "1198590",
        "size unit": "words",
        "character encoding": "utf-8"
    },
    "text format info": {
        "size": "1198590",
        "size unit": "words"
    },
    "size per text format": {
        "size": "1198590",
        "size unit": "words"
    },
    "character encoding info": {
        "character encoding": "utf-8"
    }
}
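
Since every value in the meta file is stored as a string, a consumer may want to convert the numeric ones before using them. The sketch below shows one way to do that on a small excerpt of the example above; the helper name numeric_fields is hypothetical and not part of the package.

```python
import json


def numeric_fields(meta: dict) -> dict:
    """Convert digit-only string values to int, leaving everything else as is."""
    return {
        section: {
            key: int(value) if isinstance(value, str) and value.isdigit() else value
            for key, value in fields.items()
        }
        for section, fields in meta.items()
    }


# Excerpt of the Storting example; a real file would be read with json.load.
sample = json.loads(
    '{"audio format info": {"mime type": "audio/wav", '
    '"sampling rate": "48000", "quantization": "16"}}'
)
converted = numeric_fields(sample)
print(converted["audio format info"])
```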
