Core data types used by OWID for managing data.
owid-catalog
A Pythonic API for working with OWID's data catalog.
Status: experimental, APIs likely to change
Quickstart
Install with pip install owid-catalog. Then you can begin exploring the experimental data catalog:
from owid import catalog
# look for Covid-19 data, return a data frame of matches
catalog.find('covid')
# load Covid-19 data from the Our World In Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()
# load data from a channel other than the default `garden`
lung_cancer_tables = catalog.find('lung_cancer_deaths_per_100000_men', channels=['open_numbers'])
df = lung_cancer_tables.iloc[0].load()
Development
You need Python 3.8+, poetry and make installed. Clone the repo, then you can simply run:
# run all unit tests and CI checks
make test
# watch for changes, then run all checks
make watch
Data types
Catalog
A catalog is an arbitrarily deep folder structure containing datasets inside. It can be local on disk, or remote.
Load the remote catalog
# find the default OWID catalog and fetch the catalog index over HTTPS
cat = RemoteCatalog()
# get a list of matching tables in different datasets
matches = cat.find('population')
# fetch a data frame for a specific match over HTTPS
t = cat.find_one('population', namespace='gapminder')
# load channels in addition to the default `garden`
cat = RemoteCatalog(channels=('garden', 'meadow', 'open_numbers'))
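The way the namespace and channels arguments narrow a search can be pictured with a small, library-free sketch. It is not how owid-catalog implements find: the in-memory INDEX, its field names, and the substring match on the table name are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical index records; the field names are assumptions for this sketch.
INDEX: List[Dict[str, str]] = [
    {"table": "population", "namespace": "gapminder", "channel": "garden"},
    {"table": "population", "namespace": "owid", "channel": "garden"},
    {"table": "population_density", "namespace": "owid", "channel": "meadow"},
]


def find(
    table: str,
    namespace: Optional[str] = None,
    channels: Iterable[str] = ("garden",),
) -> List[Dict[str, str]]:
    """Return index records whose table name contains `table`.

    Only records in one of `channels` are considered; `garden` is the
    default, mirroring the default channel described above.
    """
    allowed = set(channels)
    return [
        record
        for record in INDEX
        if table in record["table"]
        and (namespace is None or record["namespace"] == namespace)
        and record["channel"] in allowed
    ]


default_matches = find("population")
gapminder_matches = find("population", namespace="gapminder")
wider_matches = find("population", channels=("garden", "meadow"))
```

With the default channel only the two garden records match; widening the channels also picks up the meadow record.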
Datasets
A dataset is a folder of tables containing metadata about the overall collection.
- Metadata about the dataset lives in index.json
- All tables in the folder must share a common format (CSV or Feather)
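To make the on-disk layout concrete, here is a minimal sketch that uses only the standard library, not the owid-catalog API. It assumes a dataset can be recognised by the presence of index.json and that table files end in .csv or .feather; the folder names and the helpers find_datasets and list_tables are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List

TABLE_SUFFIXES = {".csv", ".feather"}  # the two formats named above


def find_datasets(root: Path) -> List[Path]:
    """Return every folder under `root` that contains an index.json file."""
    return sorted(p.parent for p in root.rglob("index.json"))


def list_tables(dataset_dir: Path) -> List[str]:
    """Return the table names (file stems) found directly in a dataset folder."""
    return sorted(
        f.stem
        for f in dataset_dir.iterdir()
        if f.is_file() and f.suffix in TABLE_SUFFIXES
    )


# Build a throwaway catalog with one dataset nested a few folders deep.
root = Path(tempfile.mkdtemp())
ds_dir = root / "garden" / "demo" / "2022" / "my_data"
ds_dir.mkdir(parents=True)
(ds_dir / "index.json").write_text(json.dumps({"title": "Demo dataset"}))
(ds_dir / "my_table.csv").write_text("country,gdp\nAU,1\n")
(ds_dir / "my_table.meta.json").write_text("{}")

datasets = find_datasets(root)
tables = list_tables(datasets[0])
```

The metadata files (index.json and my_table.meta.json) are skipped by list_tables because their suffix is .json, so only my_table is reported.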
Create a new dataset
# make a folder and an empty index.json file
ds = Dataset.create('/tmp/my_data')
# choose CSV instead of feather for files
ds = Dataset.create('/tmp/my_data', format='csv')
Add a table to a dataset
# serialize a table using the table's name and the dataset's default format (feather)
# (e.g. /tmp/my_data/my_table.feather)
ds.add(table)
Remove a table from a dataset
ds.remove('table_name')
Access a table
# load a table including metadata into memory
t = ds['my_table']
List tables
# the length is the number of tables discovered on disk
assert len(ds) > 0
# iterate over the tables discovered on disk
for table in ds:
    do_something(table)
Add metadata
# you need to manually save your changes
ds.title = "Very Important Dataset"
ds.description = "This dataset is a composite of blah blah blah..."
ds.save()
Copy a dataset
# copying a dataset copies all its files to a new location
ds_new = ds.copy('/tmp/new_data_path')
# copying a dataset is identical to copying its folder, so this works too
shutil.copytree('/tmp/old_data', '/tmp/new_data_path')
ds_new = Dataset('/tmp/new_data_path')
Tables
Tables are essentially pandas DataFrames but with metadata. All operations on them occur in-memory, except for loading from and saving to disk. On disk, they are represented by a tabular file (feather or CSV) and a JSON metadata file.
Columns of Table have attribute VariableMeta, including their type, description, and unit. Be careful when manipulating them: not all operations are currently supported. Supported: adding a column, renaming columns. Not supported: direct assignment to t.columns = ... or to index names t.columns.index = ...
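As a rough illustration of metadata staying attached to a column when it is added or renamed, consider the toy model below. It is not the library's implementation: ColumnMeta is a hypothetical stand-in for VariableMeta, and TinyTable is an invented class that stores metadata in a plain dict keyed by column name.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass
class ColumnMeta:
    """Hypothetical stand-in for VariableMeta."""
    description: Optional[str] = None
    unit: Optional[str] = None


class TinyTable:
    """Toy table: column data plus per-column metadata keyed by column name."""

    def __init__(self, data: Dict[str, List]):
        self.data = {name: list(values) for name, values in data.items()}
        self.meta = {name: ColumnMeta() for name in data}

    def add_column(self, name: str, values: List, meta: Optional[ColumnMeta] = None) -> None:
        """Add a column and register metadata for it (empty if none is given)."""
        self.data[name] = list(values)
        self.meta[name] = meta or ColumnMeta()

    def rename(self, mapping: Dict[str, str]) -> "TinyTable":
        """Return a new table where both data and metadata follow the old-to-new mapping."""
        out = TinyTable({mapping.get(k, k): v for k, v in self.data.items()})
        out.meta = {mapping.get(k, k): replace(m) for k, m in self.meta.items()}
        return out


t = TinyTable({"gdp": [1, 2, 3]})
t.meta["gdp"].unit = "2011 international $"
t.add_column("population", [10, 20, 30], ColumnMeta(unit="people"))

renamed = t.rename({"gdp": "gdp_ppp"})
```

Because rename receives an explicit old-to-new mapping, the metadata for gdp can be moved to gdp_ppp in the same step.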
Make a new table
# same API as DataFrames
t = Table({
    'gdp': [1, 2, 3],
    'country': ['AU', 'SE', 'CH']
}).set_index('country')
Add metadata about the whole table
t.title = 'Very important data'
Add metadata about a field
t.gdp.description = 'GDP measured in 2011 international $'
t.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]
Add metadata about all fields at once
# sources and licenses are actually stored at the field level
t.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]
t.licenses = [
    License('CC-BY-SA-4.0', url='https://creativecommons.org/licenses/by-sa/4.0/')
]
Save a table to disk
# save to /tmp/my_table.feather + /tmp/my_table.meta.json
t.to_feather('/tmp/my_table.feather')
# save to /tmp/my_table.csv + /tmp/my_table.meta.json
t.to_csv('/tmp/my_table.csv')
Load a table from disk
These work like normal pandas DataFrames, but if there is also a my_table.meta.json file, then metadata will also get read. Otherwise it will be assumed that the data has no metadata:
t = Table.read_feather('/tmp/my_table.feather')
t = Table.read_csv('/tmp/my_table.csv')
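The sidecar convention described above (a data file plus an optional .meta.json file next to it) can be sketched with the standard library alone. This is an illustration of the file layout, not the code behind Table.read_csv; the helper read_csv_with_meta is hypothetical, and falling back to an empty dict when no sidecar exists is an assumption.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def read_csv_with_meta(path: Path) -> Tuple[List[Dict[str, str]], dict]:
    """Read a CSV file and, if present, the <name>.meta.json file beside it."""
    # /tmp/my_table.csv -> /tmp/my_table.meta.json
    meta_path = path.with_suffix(".meta.json")
    with path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    # No sidecar file means the data is treated as having no metadata.
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    return rows, meta


tmp = Path(tempfile.mkdtemp())

# A table with a metadata sidecar.
(tmp / "my_table.csv").write_text("country,gdp\nAU,1\nSE,2\n")
(tmp / "my_table.meta.json").write_text(json.dumps({"title": "Very important data"}))

# A table without one.
(tmp / "bare_table.csv").write_text("country,gdp\nCH,3\n")

rows, meta = read_csv_with_meta(tmp / "my_table.csv")
bare_rows, bare_meta = read_csv_with_meta(tmp / "bare_table.csv")
```

Note that the csv module returns every value as a string, so type information would have to come from the metadata or a separate conversion step.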
Changelog
v0.2.9
- Allow multiple channels in catalog.find function
v0.2.8
- Update OWID_CATALOG_VERSION to 2
v0.2.7
- Split datasets into channels (garden, meadow, open_numbers, ...) and make garden default one
- Add .find_latest method to Catalog
v0.2.6
- Add flag is_public for public/private datasets
- Enforce snake_case for table, dataset and variable short names
- Add fields published_by and published_at to Source
- Added a list of supported and unsupported operations on columns
- Updated pyarrow
v0.2.5
- Fix ability to load remote CSV tables
v0.2.4
- Update the default catalog URL to use a CDN
v0.2.3
- Fix methods for finding and loading data from a LocalCatalog
v0.2.2
- Repack frames to compact dtypes on Table.to_feather()
v0.2.1
- Fix key typo used in version check
v0.2.0
- Copy dataset metadata into tables, to make tables more traceable
- Add API versioning, and a requirement to update if your version of this library is too old
v0.1.1
- Add support for Python 3.8
v0.1.0
- Initial release, including searching and fetching data from a remote catalog