
danegovpl

Tool for getting data from dane.gov.pl

Installation

pip install danegovpl

Usage

CLI

usage: __main__.py [-h] [-v] [-d DIR] [-t NUM] [-l LVL] [-f FORMAT] [-w TIME]
                   [-W TIME] [-r NUM] [--retry-delay TIME]
                   [--retry-all-errors] [-m TIMEOUT] [-k] [-L]
                   [--max-redirs NUM] [-A UA] [-x PROXY] [-H HEADER]
                   [-b COOKIE] [-B BROWSER]
                   [RESOURCE ...]

Tool for getting data from dane.gov.pl

positional arguments:
  RESOURCE              starting point for getting resources i.e.
                        institutions, institution.{ID}, datasets,
                        dataset.{ID}, resources, resource.{ID}

General:
  -h, --help            Show this help message and exit
  -v, --version         Print program version and exit

Files:
  -d, --directory DIR   Change directory to DIR

Settings:
  -t, --threads NUM     use NUM of threads
  -l, --lvl LVL         Get resource metadata up to level LVL
  -f, --format FORMAT   Download files in the given format preference, e.g.
                        all; jsonld; csv; xlsx; csv,jsonld,xls (if not set,
                        files are not downloaded)

Request settings:
  -w, --wait TIME       Set waiting time for each request
  -W, --wait-random TIME
                        Set random waiting time for each request to be from 0
                        to TIME
  -r, --retry NUM       Set number of retries for failed request to NUM
  --retry-delay TIME    Set interval between each retry
  --retry-all-errors    Retry no matter the error
  -m, --timeout TIMEOUT
                        Set request timeout. In TIME format it applies to
                        the whole request; in TIME,TIME format the first
                        TIME is the connection timeout and the second the
                        read timeout. If set to '-' the timeout is disabled
  -k, --insecure        Ignore ssl errors
  -L, --location        Allow for redirections, can be dangerous if
                        credentials are passed in headers
  --max-redirs NUM      Set the maximum number of redirections to follow
  -A, --user-agent UA   Sets custom user agent
  -x, --proxy PROXY     Use the specified proxy; can be used multiple times.
                        If set to URL it is used for all protocols, in
                        PROTOCOL URL format only for the given protocol, and
                        in URL URL format only for the given path. If the
                        first character is '@' proxies are read from a file
  -H, --header HEADER   Set curl style header, can be used multiple times e.g.
                        -H 'User: Admin' -H 'Pass: 12345', if first character
                        is '@' then headers are read from file e.g. -H @file
  -b, --cookie COOKIE   Set curl style cookie, can be used multiple times e.g.
                        -b 'auth=8f82ab' -b 'PHPSESSID=qw3r8an829', without
                        '=' character argument is read as a file
  -B, --browser BROWSER
                        Get cookies from specified browser e.g. -B firefox
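The --timeout option's three forms (TIME, TIME,TIME, '-') map naturally onto the connect/read timeout tuple used by requests-style sessions. A minimal sketch of such parsing (an illustration of the documented behavior, not the tool's actual implementation):

```python
def parse_timeout(arg: str):
    """Parse a --timeout argument into a requests-style timeout value.

    "-"    -> None          (timeout disabled)
    "30"   -> 30.0          (applies to the whole request)
    "5,30" -> (5.0, 30.0)   (connect timeout, read timeout)
    """
    if arg == "-":
        return None
    if "," in arg:
        connect, read = arg.split(",", 1)
        return (float(connect), float(read))
    return float(arg)
```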

dane.gov.pl groups its data as a tree whose nodes at successive levels are: institution, dataset, resource.

Get metadata for all institutions along with the datasets and resources they publish

danegovpl institutions

This is equivalent to

danegovpl institutions --lvl 3

Get metadata using 8 threads

danegovpl institutions -t 8

Get metadata for all institutions

danegovpl institutions --lvl 1

Get metadata for all institutions and the datasets they publish

danegovpl institutions --lvl 2

Get metadata for a specific institution and the datasets and resources it publishes

danegovpl institution.2522

Get metadata for all datasets and the resources under them

danegovpl datasets

Get metadata for a specific dataset

danegovpl dataset.6935

Get metadata for all datasets

danegovpl datasets --lvl 1

Get metadata for all resources

danegovpl resources

Get metadata for a specific resource

danegovpl resource.3814

Get all metadata and download all resource files using 8 threads

danegovpl institutions -t 8 -f all

Get metadata for all resources and download only csv files using 8 threads

danegovpl institutions -t 8 -f csv

Get metadata for all resources and download csv files or jsonld files if csv files aren't available

danegovpl institutions -t 8 -f csv,jsonld

Get metadata for all resources and download csv files or jsonld files or xlsx files, while compressing csv and jsonld files with zstd

danegovpl institutions -t 8 -f csv,jsonld,xlsx
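The -f option expresses a preference order: for each resource, the first format in the list that the resource actually provides gets downloaded. A rough model of that selection logic (a hypothetical helper for illustration, not the tool's code):

```python
def pick_format(preference: str, available: list) -> "str | None":
    """Return the first format from the -f preference list that the
    resource offers, or None if nothing matches (the file is skipped)."""
    for fmt in preference.split(","):
        if fmt == "all":
            # "all" wants every available format; here we just report
            # the first one to keep the sketch simple
            return available[0] if available else None
        if fmt in available:
            return fmt
    return None
```

So with -f csv,jsonld,xlsx a resource published only as jsonld and xlsx would be downloaded as jsonld.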

Output example

Output examples can be found in the examples directory; they are excerpts taken from running

danegovpl institutions

This run illustrates all provided formats; starting from datasets or resources would create a single directory with thousands of subdirectories in it.

Library

Code

from danegovpl import Api, Error, ArgError, RequestError

api = Api(timeout=30)  # arguments for treerequests can be passed

try:
    for datasets in api.datasets(page=2, params=[("title[prefix]", "imiona")]):
        for dataset in datasets['data']:
            print(dataset['id'])
except RequestError as e:
    print(repr(e))

Exceptions

All exceptions raised by this library derive from Error. ArgError is raised when a function is called with incorrect arguments, and RequestError is raised for errors while handling requests.

Api

The Api class provides methods for interacting with dane.gov.pl; at initialization it accepts parameters for the treerequests session.

Methods

Methods are named after the endpoints they call; some names were changed from plural to singular to denote an operation on a single item.

All of them accept an optional argument params: List[Tuple[str, str]], representing parameters passed in the URL query string. It is done this way because the parameters aren't always consistent and allow expressions not easily representable in Python code. If you know what you need you can add them manually (protip: the https://dane.gov.pl/ site uses its own API for its requests, so the params can be taken from requests made by it, e.g. in searches).
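Passing params as a list of (key, value) tuples keeps repeated keys and bracketed filter expressions like title[prefix] intact, which a plain dict would represent awkwardly. For example, the filter from the usage example above encodes to a query string with the standard library (any other filter keys would have to be taken from requests the site itself makes):

```python
from urllib.parse import urlencode

# Filter copied from the usage example above
params = [("title[prefix]", "imiona")]

query = urlencode(params)
# query == "title%5Bprefix%5D=imiona"
```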

dga_aggregated(self, i_id: int, params: List[Tuple[str, str]] = []) -> dict

Returns data about Aggregated DGA resource - especially resource_id and dataset_id

Methods for items

The following take i_id: int, denoting the ID of the element.

institution(self, i_id: int, params: List[Tuple[str, str]] = []) -> dict

Returns institution with given ID

dataset(self, i_id: int, params: List[Tuple[str, str]] = []) -> dict

Returns dataset with given ID

resource(self, i_id: int, params: List[Tuple[str, str]] = []) -> dict

Returns resource with given ID

resource_data_row(self, i_id: int, row_id: int, params: List[Tuple[str, str]] = []) -> str

Returns a single row

showcase(self, i_id: int, params: List[Tuple[str, str]] = []) -> dict

Returns showcase with given ID

history(self, i_id: int, params: List[Tuple[str, str]] = []) -> dict

Returns history item with given ID

Methods for pages

The following take page: int = 1 and per_page: int = 100, denoting the starting page and the number of results per page, and return an iterator yielding pages starting from page.
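These methods behave like a generator that fetches one page per request and yields it, so consuming only the pages you need avoids fetching the rest. A toy model of that behavior over an in-memory list (illustration only; the real methods issue HTTP requests):

```python
from typing import Iterator, List


def paged(items: List[int], page: int = 1, per_page: int = 100) -> Iterator[List[int]]:
    """Yield successive slices of items, mimicking how the paging
    methods yield one API page per iteration, starting at `page`."""
    start = (page - 1) * per_page
    for i in range(start, len(items), per_page):
        yield items[i:i + per_page]


# Taking just the first yielded page, starting from page 2
first = next(paged(list(range(250)), page=2, per_page=100))
# first == [100, 101, ..., 199]
```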

institutions(self, params: List[Tuple[str, str]] = [], page: int = 1, per_page: int = 100) -> Iterator[dict]

Gives the ability to browse, filter and search for institutions

institution_datasets(self, i_id: int, params: List[Tuple[str, str]] = [], page: int = 1, per_page: int = 100) -> Iterator[dict]

Gives the ability to browse, filter and search for datasets of given institution

datasets(self, params: List[Tuple[str, str]] = [], page: int = 1, per_page: int = 100) -> Iterator[dict]

Gives the ability to browse, filter and search datasets

dataset_resources(self, i_id: int, params: List[Tuple[str, str]] = [], page: int = 1, per_page: int = 100) -> Iterator[dict]

Gives the ability to browse, filter and search for resources of given dataset

dataset_showcases(self, i_id: int, params: List[Tuple[str, str]] = [], page: int = 1, per_page: int = 100) -> Iterator[dict]

Gives the ability to browse, filter and search for showcases of given dataset

resources(self, params: List[Tuple[str, str]] = [], page: int = 1, per_page: int = 100) -> Iterator[dict]

Gives the ability to browse, filter and search resources

resource_data(self, i_id: int, params: List[Tuple[str, str]] = [], page: int = 1, per_page: int = 100) -> Iterator[dict]

Returns pages of rows

search(self, params: List[Tuple[str, str]] = [], page: int = 1, per_page: int = 100) -> Iterator[dict]

Gives the ability to filter and search objects of various types: articles, datasets, institutions, resources, showcases

showcases(self, params: List[Tuple[str, str]] = [], page: int = 1, per_page: int = 100) -> Iterator[dict]

Gives the ability to browse, filter and search showcases

histories(self, params: List[Tuple[str, str]] = [], page: int = 1, per_page: int = 100) -> Iterator[dict]

Gives the ability to browse, filter and search histories
