Skip to main content

A Python- and pandas-powered client for statistical data and metadata exchange

Project description

1. Overview

pandaSDMX is an endeavor to create an SDMX client that facilitates the management, acquisition and analysis of large datasets disseminated according to the SDMX standard by national statistics offices, central banks and international organisations. Notable SDMX data providers are Eurostat, the European Central Bank, the Bank for International Settlements, the International Monetary Fund, and the OECD, to name but a few. pandaSDMX reads datasets and converts them into pandas time series or DataFrames with hierarchical indexes created from structural metadata. Metadata on the content of datasets (so-called dataflows) is stored locally using SQLite.

2. Installation

pandaSDMX contains the pure Python module ‘pandasdmx’. You can install it with “pip install pandasdmx” or manually by downloading and extracting the source distribution, and issuing “python setup.py install” from the command line.

pandaSDMX has the following dependencies:

  • the data analysis library pandas which itself depends on a number of packages, and

  • requests

It is highly recommended to use one of the pre-packaged Python distributions for scientific computing and data analysis rather than installing those dependencies separately. Scientific Python distributions include, among many other useful things, the interactive, web-based Python shell IPython which is a must-have when working with data. The author uses Anaconda. For other Python distributions (not only scientific) see here.

3. Usage

Suppose we wanted to conduct some research on the European dairy industry. As pandaSDMX’s support for FAO is still experimental (or indeed not working), we will look for relevant data from Eurostat, the European statistics office. It provides data from national statistics offices of the 28 EU countries and more.

Step 1: Instantiate a ‘Client’ for it

>>> from pandasdmx import client
>>> estat = client('Eurostat', 'milk.db')

Here, we have used the factory function ‘client’. It instantiates the ‘Client’ class using the values hard-coded in pandasdmx.providers.

Step 2: Get the available dataflows and download interesting datasets

Now that we have an SDMX client for Eurostat, we call its ‘get_dataflows’ method to download the complete list of dataflows. A dataflow is essentially a table of all available datasets referenced by their flow references plus a human-readable description. Eurostat offers about 4500 datasets. Downloading the complete list may take a while.

>>> db = estat.get_dataflows()
>>> str(estat)
"<class 'pandasdmx.Client'>('http://www.ec.europa.eu/eurostat/SDMX/diss-web/rest', 'ESTAT', db_filename = 'milk.db') Database: <sqlite3.Connection object at 0x0501A130> ['table: ESTAT_dataflows SQL: CREATE TABLE ESTAT_dataflows \\n            (id INTEGER PRIMARY KEY, agencyID text, flowref text, version text, title text); ']"

The string representation of ‘Client’ shows the attached SQLite database and the tables. Note that the get_dataflows() method has just created the ‘ESTAT_dataflows’ table.

Next, we select dataflows whose title (description) contains the word ‘milk’.

>>> milk_table = db.execute('CREATE TABLE milk AS SELECT * FROM ESTAT_dataflows WHERE title LIKE "%milk%"')
>>> milk_list = milk_table.fetchall()

‘milk_list’ is a list of sqlite3.Row instances. They allow dict-like access using column names:

>>> milk_list[1]['title']
"Cows'milk collection and products obtained - annual data"
>>> milk_list[1]['flowref']
'apro_mk_cola'

Next, we use the get_data() method to actually download a dataset referenced by a flowref or a Row instance containing a key named ‘flowref’ as shown above. get_data() returns a 2-tuple: its first element is either a list of pandas timeseries (concat = False) or a DataFrame (if concat = True). The structural metadata attached to the data is used to create a multi-level column index for the DataFrame. When returning a list of timeseries, their ‘name’ attributes contain the non-global metadata as hashable NamedTuples (dicts would cause problems when concatenating the series later). The second element of the 2-tuple is a dict containing global metadata describing the entire dataset. As each key takes on only one value, it is unsuitable to structure the data. Hence, it is disregarded when creating the column index.

>>> df, md = estat.get_data(milk_list[1], '', concat = True)
>>> md
{'FREQ': 'A', 'UNIT': 'THS_T'}
>>> df.shape
(46, 492)
>>> df.columns.names
FrozenList(['GEO', 'PRODMILK'])

The second argument is used to narrow down the datasets using structural metadata. E.g., ‘…NL’ would yield data solely on the Netherlands.

Step 3: Get human-readable descriptions of the content metadata

From ‘df.columns.levels’ we can see the values of the structural metadata. Their meanings are explained in so-called code-lists. You can download them as follows:

>>> milk_codes = estat.get_codes(milk_list[1])
>>> milk_codes
Out[14]: OrderedDict([('FREQ', {'Q': 'Quarterly', 'W': 'Weekly', 'H': 'Semi-annu
al', 'M': 'Monthly', 'A': 'Annual', 'D': 'Daily'}), ('GEO', {'FI': 'Finland', 'E
S': 'Spain', 'DK': 'Denmark', 'BG': 'Bulgaria', 'FR': 'France', 'MT': 'Malta', ' [omitted]

4. Next steps, known issues, ToDo’s

While pandasdmx works well with Eurostat data, other institutions cause problems. Moreover, content metadata such as on data quality is currently ignored. So are categories, i.e. folders of dataflows; they are considered as a flat list. For other features such as writing data to a local file, see the doc strings of the get_data and get_dataflows methods.

For a more detailed ToDo list consider the ToDo.rst file in the source distribution. Any help is much appreciated.

5. Change log

Version 0.1 (2014-09-07)

initial release

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pandaSDMX-0.1.tar.gz (29.5 kB view details)

Uploaded Source

File details

Details for the file pandaSDMX-0.1.tar.gz.

File metadata

  • Download URL: pandaSDMX-0.1.tar.gz
  • Upload date:
  • Size: 29.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for pandaSDMX-0.1.tar.gz
Algorithm Hash digest
SHA256 8bf24b6459e3595424910416df8f483c0f012e02a84eb446f6481cba088a4723
MD5 8310fec181e29b05e36acd9a515b6909
BLAKE2b-256 3a2beedc222152ccaf200f7414ea11fb2728690548b20ba731d6d0bb23c28b7d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page