A package for downloading datacubes into pandas DataFrames
Pandas Datacube
About
pandas-datacube is a Python package for downloading a datacube from a remote source using SPARQL queries and converting it into a pandas DataFrame.
The module can detect the datasets exposed by an endpoint, along with their dimensions and measures, use the metadata present in the ontology to order the dimensions, and download the data.
This project was developed during an internship at LIG, in the GETALP team, under the supervision of Gilles Sérasset (Gilles.Serasset@imag.fr).
Installation
You can install pandas-datacube from PyPI:

```shell
$ pip install pandas-datacube
```
How to use
The module is quite simple to use:
-
Get all the available datasets:

```python
from pandasdatacube import get_datasets
import pandas as pd

ENDPOINT: str = "https://statistics.gov.scot/sparql"

datasets: pd.DataFrame = get_datasets(ENDPOINT)
datasets.head()
```

| | dataset | commentaire |
|---|---|---|
| 0 | http://statistics.gov.scot/data/pupil-attainment | Number of pupils who attained a given number of qualifications by level and stage. |
| 1 | http://statistics.gov.scot/data/alcohol-related-discharge | Number and European Age-sex Standardised Rates (EASRs) of general acute inpatient and day case discharges with an alcohol-related diagnosis. |
| 2 | http://statistics.gov.scot/data/business-births-deaths-and-survival-rates | Number and rate (per 10,000 adults) of VAT/PAYE registrations, de-registrations and business survival rates |
| 3 | http://statistics.gov.scot/data/earnings | Mean and median gross weekly earnings (£s) by gender, working pattern and workplace/residence measure. |
| 4 | http://statistics.gov.scot/data/economic-inactivity | Economic inactivity level and rate by gender |

-
Get and transform the features of a dataset:

```python
from pandasdatacube import get_features, transform_features
import pandas as pd

ENDPOINT: str = "https://statistics.gov.scot/sparql"
DATASET_NAME: str = "http://statistics.gov.scot/data/earnings"

features: pd.DataFrame = get_features(ENDPOINT, DATASET_NAME)
features.head()
```

```python
transformed_features: tuple[list[str], list[str]] = transform_features(features)
print(transformed_features)
```

Output:

```
(['http://purl.org/linked-data/sdmx/2009/dimension#refArea',
  'http://purl.org/linked-data/sdmx/2009/dimension#refPeriod',
  'http://purl.org/linked-data/cube#measureType',
  'http://statistics.gov.scot/def/dimension/gender',
  'http://statistics.gov.scot/def/dimension/workingPattern',
  'http://statistics.gov.scot/def/dimension/populationGroup'],
 ['http://statistics.gov.scot/def/measure-properties/median',
  'http://statistics.gov.scot/def/measure-properties/mean'])
```
-
Download a dataset:

```python
from pandasdatacube import download_dataset
import pandas as pd

ENDPOINT: str = "https://statistics.gov.scot/sparql"
DATASET_NAME: str = "http://statistics.gov.scot/data/earnings"
DIMENSIONS: list[str] = [
    'http://purl.org/linked-data/sdmx/2009/dimension#refArea',
    'http://purl.org/linked-data/sdmx/2009/dimension#refPeriod',
    'http://purl.org/linked-data/cube#measureType',
    'http://statistics.gov.scot/def/dimension/gender',
    'http://statistics.gov.scot/def/dimension/workingPattern',
    'http://statistics.gov.scot/def/dimension/populationGroup',
]
MEASURES: list[str] = [
    'http://statistics.gov.scot/def/measure-properties/median',
    'http://statistics.gov.scot/def/measure-properties/mean',
]

data: pd.DataFrame = download_dataset(
    endpoint=ENDPOINT,
    dataset_name=DATASET_NAME,
    dimensions=DIMENSIONS,
    measures=MEASURES,
)
data.head().reset_index()
```
-
Do all the steps in one call:

```python
from pandasdatacube import get_datacube
import pandas as pd

ENDPOINT: str = "http://kaiko.getalp.org/sparql"
PREFIXES: dict[str, str] = {
    'dbnary': 'http://kaiko.getalp.org/dbnary#',
    'dbnstats': 'http://kaiko.getalp.org/dbnary/statistics/',
    'lime': 'http://www.w3.org/ns/lemon/lime#',
}
dataset: str = "dbnstats:dbnaryStatisticsCube"
dimensions: list[str] = ['dbnary:observationLanguage', 'dbnary:wiktionaryDumpVersion']
measures: list[str] = [
    'dbnary:lexicalEntryCount',
    'dbnary:lexicalSenseCount',
    'dbnary:pageCount',
    'dbnary:translationsCount',
]
dtypes: dict[str, type] = {
    "lexicalEntryCount": int,
    "translationsCount": int,
    "lexicalSenseCount": int,
    "pageCount": int,
}

data: pd.DataFrame = get_datacube(ENDPOINT, dataset, dimensions, measures, dtypes, PREFIXES)
data.head().reset_index()
```

| | observationLanguage | wiktionaryDumpVersion | lexicalEntryCount | lexicalSenseCount | pageCount | translationsCount |
|---|---|---|---|---|---|---|
| 0 | bg | 20210701 | 18626 | 18420 | 27050 | 18086 |
| 1 | bg | 20140224 | 18831 | 18798 | 27071 | 13888 |
| 2 | bg | 20140312 | 18829 | 18796 | 27068 | 13895 |
| 3 | bg | 20140328 | 18828 | 18795 | 27072 | 13909 |
| 4 | bg | 20140415 | 18822 | 18294 | 27068 | 13920 |
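A note on the `dtypes` argument: SPARQL result bindings typically arrive as strings, so the mapping presumably tells `get_datacube` which columns to cast to numeric types. The sketch below illustrates that kind of conversion with plain pandas on a couple of the sample rows above; it is an illustration of the idea, not the package's internal code:

```python
import pandas as pd

# Raw SPARQL bindings arrive as strings.
raw = pd.DataFrame({
    "observationLanguage": ["bg", "bg"],
    "lexicalEntryCount": ["18626", "18831"],
    "pageCount": ["27050", "27071"],
})

# A column -> type mapping, in the same shape as the dtypes dict above,
# applied with pandas' own astype.
dtypes = {"lexicalEntryCount": int, "pageCount": int}
typed = raw.astype(dtypes)

print(typed["lexicalEntryCount"].sum())  # 37457
```

Without such a cast, numeric operations on the downloaded columns would concatenate strings instead of summing values.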