Core data types used by OWID for managing data.
owid-catalog
A Pythonic API for working with OWID's data catalog.
Status: experimental, APIs likely to change
Quickstart
Install with pip install owid-catalog. Then you can begin exploring the experimental data catalog:
from owid import catalog
# look for Covid-19 data, return a data frame of matches
catalog.find('covid')
# load Covid-19 data from the Our World In Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()
Development
You need Python 3.9+, poetry and make installed. Clone the repo, then you can simply run:
# run all unit tests and CI checks
make test
# watch for changes, then run all checks
make watch
Data types
Catalog
A catalog is an arbitrarily deep folder structure containing datasets inside. It can be local on disk, or remote.
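To make the "arbitrarily deep folder structure" concrete, here is a minimal standard-library sketch that walks a local directory tree and treats any folder holding an index.json file as a dataset. The namespace/version/dataset layout and the discover_datasets helper are assumptions made for illustration; they are not part of the owid-catalog API.

```python
import json
import tempfile
from pathlib import Path


def discover_datasets(root: Path) -> list[Path]:
    """Return every folder under root that contains an index.json file."""
    return sorted(p.parent for p in root.rglob("index.json"))


# Build a small throwaway catalog on disk (hypothetical layout:
# <root>/<namespace>/<version>/<dataset>/index.json).
root = Path(tempfile.mkdtemp())
for parts in [("owid", "2021", "covid"), ("gapminder", "2019", "population")]:
    folder = root.joinpath(*parts)
    folder.mkdir(parents=True)
    (folder / "index.json").write_text(json.dumps({"title": parts[-1]}))

# Dataset folders are found at any depth below the root.
found = [p.relative_to(root).as_posix() for p in discover_datasets(root)]
print(found)
```

Because the search is recursive, the datasets can sit at any depth below the catalog root.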
Load the remote catalog
from owid.catalog import RemoteCatalog
# find the default OWID catalog and fetch the catalog index over HTTPS
cat = RemoteCatalog()
# get a list of matching tables in different datasets
matches = cat.find('population')
# fetch a data frame for a specific match over HTTPS
t = cat.find_one('population', namespace='gapminder')
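The difference between find (many matches) and find_one (exactly one match) can be sketched against a small in-memory index. Everything here is hypothetical: the INDEX rows, the substring matching rule, and the choice to raise ValueError are assumptions for illustration, not the library's actual behaviour.

```python
from typing import Dict, List, Optional

Row = Dict[str, str]

# Hypothetical catalog index: one row per table.
INDEX: List[Row] = [
    {"namespace": "owid", "dataset": "covid", "table": "covid_cases"},
    {"namespace": "gapminder", "dataset": "population", "table": "population"},
    {"namespace": "un", "dataset": "wpp", "table": "population_by_age"},
]


def find(index: List[Row], table: str, namespace: Optional[str] = None) -> List[Row]:
    """Return rows whose table name contains `table` (assumed matching rule)."""
    return [
        row
        for row in index
        if table in row["table"]
        and (namespace is None or row["namespace"] == namespace)
    ]


def find_one(index: List[Row], table: str, namespace: Optional[str] = None) -> Row:
    """Return the single matching row, or raise if the match is ambiguous."""
    matches = find(index, table, namespace)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, got {len(matches)}")
    return matches[0]


matches = find(INDEX, "population")
match = find_one(INDEX, "population", namespace="gapminder")
print(len(matches), match["dataset"])
```

Narrowing by namespace is what turns an ambiguous search into a single result in this sketch.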
Datasets
A dataset is a folder of tables, together with metadata about the overall collection.
- Metadata about the dataset lives in index.json
- All tables in the folder must share a common format (CSV or Feather)
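The two rules above can be checked with a short standard-library sketch: build a folder with an index.json, add some table files, and confirm they all use one format. The table_formats helper and the file names are assumptions for illustration and do not come from owid-catalog.

```python
import json
import tempfile
from pathlib import Path

TABLE_SUFFIXES = {".csv", ".feather"}


def table_formats(dataset_dir: Path) -> set[str]:
    """Collect the file suffixes of the table files in a dataset folder."""
    return {p.suffix for p in dataset_dir.iterdir() if p.suffix in TABLE_SUFFIXES}


# A throwaway dataset folder: index.json plus two CSV tables.
dataset_dir = Path(tempfile.mkdtemp())
(dataset_dir / "index.json").write_text(json.dumps({"title": "Example"}))
(dataset_dir / "gdp.csv").write_text("country,gdp\nAU,1\n")
(dataset_dir / "population.csv").write_text("country,population\nAU,25\n")

formats = table_formats(dataset_dir)
is_consistent = len(formats) <= 1  # at most one table format per dataset
print(formats, is_consistent)
```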
Create a new dataset
from owid.catalog import Dataset
# make a folder and an empty index.json file
ds = Dataset.create('/tmp/my_data')
# choose CSV instead of feather for files
ds = Dataset.create('/tmp/my_data', format='csv')
Add a table to a dataset
# serialize a table using the table's name and the dataset's default format (feather)
# (e.g. /tmp/my_data/my_table.feather)
ds.add(table)
Remove a table from a dataset
ds.remove('table_name')
Access a table
# load a table including metadata into memory
t = ds['my_table']
List tables
# the length is the number of tables discovered on disk
assert len(ds) > 0
# iterate over the tables discovered on disk
for table in ds:
    do_something(table)
Add metadata
# you need to manually save your changes
ds.title = "Very Important Dataset"
ds.description = "This dataset is a composite of blah blah blah..."
ds.save()
Copy a dataset
# copying a dataset copies all its files to a new location
ds_new = ds.copy('/tmp/new_data_path')
# copying a dataset is identical to copying its folder, so this works too
import shutil
shutil.copytree('/tmp/old_data', '/tmp/new_data_path')
ds_new = Dataset('/tmp/new_data_path')
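Since a dataset is just a folder, the claim that copying the folder copies the dataset can be verified without the library. The sketch below uses only the standard library and throwaway paths under a temporary directory (the file names are illustrative assumptions).

```python
import filecmp
import json
import shutil
import tempfile
from pathlib import Path

# Build a small source dataset folder.
base = Path(tempfile.mkdtemp())
src = base / "old_data"
src.mkdir()
(src / "index.json").write_text(json.dumps({"title": "Very Important Dataset"}))
(src / "my_table.csv").write_text("country,gdp\nAU,1\n")

# Copy the whole folder; the destination must not exist yet.
dst = base / "new_data_path"
shutil.copytree(src, dst)

# Compare the two folders file by file.
comparison = filecmp.dircmp(src, dst)
same_files = sorted(comparison.common_files)
only_in_one = comparison.left_only + comparison.right_only
print(same_files, only_in_one)
```

Note that shutil.copytree raises FileExistsError by default when the destination already exists, so the two copy approaches shown above are alternatives rather than steps to run one after the other against the same path.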
Tables
Tables are essentially pandas DataFrames but with metadata. All operations on them occur in-memory, except for loading from and saving to disk. On disk, they are represented by a tabular file (feather or CSV) and a JSON metadata file.
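The on-disk layout (one tabular file plus one JSON metadata file) can be sketched with the standard library alone. The save_table helper below is hypothetical and writes CSV only; it mirrors the my_table.csv + my_table.meta.json naming used in this document, but it is not how owid-catalog itself is implemented.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_table(rows, metadata, csv_path: Path) -> Path:
    """Write rows to csv_path and metadata to a sibling .meta.json file."""
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    # my_table.csv -> my_table.meta.json
    meta_path = csv_path.with_suffix(".meta.json")
    meta_path.write_text(json.dumps(metadata, indent=2))
    return meta_path


out_dir = Path(tempfile.mkdtemp())
rows = [{"country": "AU", "gdp": 1}, {"country": "SE", "gdp": 2}]
meta_path = save_table(rows, {"title": "Very important data"}, out_dir / "my_table.csv")
written = sorted(p.name for p in out_dir.iterdir())
print(written)
```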
Make a new table
from owid.catalog import Table
# same API as DataFrames
t = Table({
    'gdp': [1, 2, 3],
    'country': ['AU', 'SE', 'CH']
}).set_index('country')
Add metadata about the whole table
t.title = 'Very important data'
Add metadata about a field
t.gdp.description = 'GDP measured in 2011 international $'
t.gdp.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]
Add metadata about all fields at once
# sources and licenses are actually stored at the field level
t.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]
t.licenses = [
    License('CC-BY-SA-4.0', url='https://creativecommons.org/licenses/by-sa/4.0/')
]
Save a table to disk
# save to /tmp/my_table.feather + /tmp/my_table.meta.json
t.to_feather('/tmp/my_table.feather')
# save to /tmp/my_table.csv + /tmp/my_table.meta.json
t.to_csv('/tmp/my_table.csv')
Load a table from disk
These work like normal pandas DataFrames, but if there is also a my_table.meta.json file, then its metadata is read as well. Otherwise the data is assumed to have no metadata:
t = Table.read_feather('/tmp/my_table.feather')
t = Table.read_csv('/tmp/my_table.csv')
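The "read the metadata file if it exists, otherwise assume no metadata" rule can be illustrated with a small standard-library sketch. The load_table helper is hypothetical, handles CSV only, and returns plain dictionaries rather than a Table.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_table(csv_path: Path):
    """Return (rows, metadata); metadata is {} when no sidecar file exists."""
    with csv_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    meta_path = csv_path.with_suffix(".meta.json")
    metadata = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    return rows, metadata


# One table with a metadata sidecar, one without.
tmp = Path(tempfile.mkdtemp())
(tmp / "with_meta.csv").write_text("country,gdp\nAU,1\n")
(tmp / "with_meta.meta.json").write_text(json.dumps({"title": "GDP"}))
(tmp / "bare.csv").write_text("country,gdp\nSE,2\n")

rows_a, meta_a = load_table(tmp / "with_meta.csv")
rows_b, meta_b = load_table(tmp / "bare.csv")
print(meta_a, meta_b)
```

In this sketch a missing sidecar is not an error; the table simply loads with empty metadata.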
Changelog
v0.2.5
- Fix ability to load remote CSV tables
v0.2.4
- Update the default catalog URL to use a CDN
v0.2.3
- Fix methods for finding and loading data from a LocalCatalog
v0.2.2
- Repack frames to compact dtypes on Table.to_feather()
v0.2.1
- Fix key typo used in version check
v0.2.0
- Copy dataset metadata into tables, to make tables more traceable
- Add API versioning, and a requirement to update if your version of this library is too old
v0.1.1
- Add support for Python 3.8
v0.1.0
- Initial release, including searching and fetching data from a remote catalog