The fastest way to get data from the Open Data Blend Dataset API

Open Data Blend for Python

Open Data Blend for Python is the fastest way to get data from the Open Data Blend Dataset API. It is a lightweight, easy-to-use extract and load (EL) tool.

You can use the get_data function to download any data file belonging to an Open Data Blend dataset. Alternatively, you can use the get_data_files function to download a collection of data files from an Open Data Blend dataset. The functions transparently download and cache the data locally or in cloud storage, mirroring the folder hierarchy of the remote server. They also cache a copy of the dataset metadata file (datapackage.json) as it exists at the point that they are called. The cache is persistent, which means the files are kept until they are deleted.
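Because the cache is persistent, cleaning it up is left to you. The following is a minimal sketch, assuming the files were cached on the local file system; remove_cached_files is a hypothetical helper and is not part of opendatablend.

```python
from pathlib import Path


def remove_cached_files(file_names):
    """Delete locally cached files and return the paths that were removed."""
    removed = []
    for file_name in file_names:
        path = Path(file_name)
        # Skip anything that is missing or is not a regular file
        if path.is_file():
            path.unlink()
            removed.append(str(path))
    return removed
```

The helper can be passed the file locations that get_data or get_data_files report, for example remove_cached_files([output.data_file_name, output.metadata_file_name]).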

The versioned dataset metadata can be used to re-download a specific version of a data file (sometimes referred to as 'time travel'). You can learn more about how we version our datasets in the Open Data Blend Docs.
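To see which resources a cached metadata file describes, you can read it with the standard json module. This is only a sketch: it assumes that datapackage.json has a top-level "resources" array whose entries carry "name" and "path" keys, as in the Frictionless Data Package layout, and list_resources is a hypothetical helper rather than a function of opendatablend. It inspects the file only and does not download anything.

```python
import json


def list_resources(metadata_file_name):
    """Map each resource name in a datapackage.json file to its path."""
    with open(metadata_file_name, encoding="utf-8") as metadata_file:
        metadata = json.load(metadata_file)
    # Assumption: the file has a top-level "resources" array whose entries
    # carry "name" and "path" keys (Frictionless Data Package layout)
    return {
        resource["name"]: resource.get("path")
        for resource in metadata.get("resources", [])
    }
```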

In addition to downloading the data and metadata files, get_data returns an object called Output which includes the locations of the downloaded files. Similarly, get_data_files returns an object called OutputSet which includes the locations of the downloaded data files and the associated metadata file. From there, you can query and analyse the data directly using something light like Pandas or, for more resource-intensive processing, a data lakehouse platform like Databricks, or a scalable in-memory OLAP library like Polars.

Installation

Install the latest version of opendatablend from PyPI:

pip install opendatablend

Usage Examples

NOTE

If you want to run the examples, be sure to replace placeholder values such as <ACCESS_KEY> with appropriate string literals or variables.
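One way to avoid hard-coding a value such as <ACCESS_KEY> is to read it from an environment variable. The sketch below assumes a variable named ODB_ACCESS_KEY, which is a name chosen for illustration and not something opendatablend looks for itself.

```python
import os


def read_access_key(variable_name="ODB_ACCESS_KEY"):
    """Return the access key stored in an environment variable, or '' if unset."""
    # ODB_ACCESS_KEY is a hypothetical variable name used for this example
    return os.environ.get(variable_name, "").strip()
```

The result can be passed as access_key=read_access_key(). When the variable is not set, the helper returns an empty string, which the examples further down describe as acceptable for public API requests.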

Some of the following examples require the pandas and pyarrow packages to be installed:

pip install pandas
pip install pyarrow

Making Public API Requests

NOTE

Public API requests have a monthly limit.

Get the Data

import opendatablend as odb
import pandas as pd

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'

# Specify the resource name of the data file. In this example, the 'date' data file will be requested in Parquet format.
resource_name = 'date-parquet'

# Get the data and store the output object
output = odb.get_data(dataset_path, resource_name)

# Print the file locations
print(output.data_file_name)
print(output.metadata_file_name)

Use the Data

# Read a subset of the columns into a dataframe
df_date = pd.read_parquet(output.data_file_name, columns=['drv_date_key', 'drv_date', 'drv_month_name', 'drv_month_number', 'drv_quarter_name', 'drv_quarter_number', 'drv_year'])

# Check the contents of the dataframe
df_date

Making Authenticated API Requests

Get the Data

import opendatablend as odb
import pandas as pd

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'
access_key = '<ACCESS_KEY>'

# Specify the resource name of the data file. In this example, the 'date' data file will be requested in Parquet format.
resource_name = 'date-parquet'

# Get the data and store the output object
output = odb.get_data(dataset_path, resource_name, access_key=access_key)

# Print the file locations
print(output.data_file_name)
print(output.metadata_file_name)

Use the Data

# Read a subset of the columns into a dataframe
df_date = pd.read_parquet(output.data_file_name, columns=['drv_date_key', 'drv_date', 'drv_month_name', 'drv_month_number', 'drv_quarter_name', 'drv_quarter_number', 'drv_year'])

# Check the contents of the dataframe
df_date

Downloading Multiple Data Files

The get_data_files function can be used to download a set of data files by providing their resource names as a list.

Get the Data

import opendatablend as odb
import pandas as pd

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'
access_key = '<ACCESS_KEY>'

# Specify the resource names of the data files. In this example, a subset of the available data files will be requested in Parquet format.
resource_names = [
    'date-parquet',
    'time-of-day-parquet',
    'geolocation-parquet',
    'road-safety-accident-info-parquet',
    'road-safety-accident-location-parquet',
    'road-safety-accident-2021-parquet'
    ]

# Get the data files and store the output object
output = odb.get_data_files(dataset_path, resource_names, access_key=access_key)

# Print the file locations
print(output.data_file_names)
print(output.metadata_file_name)
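Once several files have been downloaded, it can be convenient to load them in one pass. The sketch below assumes that output.data_file_names is an iterable of local file paths and that the files are in Parquet format; read_data_files is a hypothetical helper, and the dictionary keys are simply the file names without their extension.

```python
from pathlib import Path


def read_data_files(data_file_names, reader=None):
    """Read each file with `reader` and key the results by file stem."""
    if reader is None:
        # Imported lazily so the helper can also be used with other readers
        import pandas as pd

        reader = pd.read_parquet
    return {Path(name).stem: reader(name) for name in data_file_names}
```

For example, frames = read_data_files(output.data_file_names) returns one dataframe per downloaded file. A different reader, such as a Polars function, can be supplied through the reader argument.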

Ingesting Data Directly into Cloud Storage Services

Azure Blob Storage

Using `get_data`

import opendatablend as odb

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'
access_key = '<ACCESS_KEY>' # The access key can be set to an empty string if you are making a public API request

# Specify the resource name of the data file. In this example, the 'date' data file will be requested in Parquet format.
resource_name = 'date-parquet'

# Get the data and store the output object using the Azure Blob Storage file system
configuration = {
    "connection_string" : "DefaultEndpointsProtocol=https;AccountName=<AZURE_BLOB_STORAGE_ACCOUNT_NAME>;AccountKey=<AZURE_BLOB_STORAGE_ACCOUNT_KEY>;EndpointSuffix=core.windows.net",
    "container_name" : "<AZURE_BLOB_STORAGE_CONTAINER_NAME>" # e.g. odbp-integration
    }
output = odb.get_data(dataset_path, resource_name, access_key=access_key, file_system="azure_blob_storage", configuration=configuration)

# Print the file locations
print(output.data_file_name)
print(output.metadata_file_name)
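The connection string contains an account key, so you may prefer to keep it out of source code. The following sketch builds the same configuration dictionary from environment variables. The variable names AZURE_STORAGE_CONNECTION_STRING and AZURE_STORAGE_CONTAINER_NAME are assumptions made for this example, and azure_configuration_from_env is a hypothetical helper, not part of opendatablend.

```python
import os


def azure_configuration_from_env(
    connection_string_variable="AZURE_STORAGE_CONNECTION_STRING",
    container_name_variable="AZURE_STORAGE_CONTAINER_NAME",
):
    """Build the Azure configuration dictionary from environment variables."""
    # os.environ[...] raises KeyError when a variable is missing, which
    # surfaces a misconfiguration before any request is made
    return {
        "connection_string": os.environ[connection_string_variable],
        "container_name": os.environ[container_name_variable],
    }
```

The returned dictionary can be passed as configuration=azure_configuration_from_env() in the call shown above.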

Using `get_data_files`

import opendatablend as odb

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'
access_key = '<ACCESS_KEY>' # The access key can be set to an empty string if you are making a public API request

# Specify the resource names of the data files. In this example, a subset of the available data files will be requested in Parquet format.
resource_names = [
    'date-parquet',
    'time-of-day-parquet',
    'geolocation-parquet',
    'road-safety-accident-info-parquet',
    'road-safety-accident-location-parquet',
    'road-safety-accident-2021-parquet'
    ]

# Get the data and store the output object using the Azure Blob Storage file system
configuration = {
    "connection_string" : "DefaultEndpointsProtocol=https;AccountName=<AZURE_BLOB_STORAGE_ACCOUNT_NAME>;AccountKey=<AZURE_BLOB_STORAGE_ACCOUNT_KEY>;EndpointSuffix=core.windows.net",
    "container_name" : "<AZURE_BLOB_STORAGE_CONTAINER_NAME>" # e.g. odbp-integration
    }
output = odb.get_data_files(dataset_path, resource_names, access_key=access_key, file_system="azure_blob_storage", configuration=configuration)

# Print the file locations
print(output.data_file_names)
print(output.metadata_file_name)

Azure Data Lake Storage (ADLS) Gen2

Using `get_data`

import opendatablend as odb

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'
access_key = '<ACCESS_KEY>' # The access key can be set to an empty string if you are making a public API request

# Specify the resource name of the data file. In this example, the 'date' data file will be requested in Parquet format.
resource_name = 'date-parquet'

# Get the data and store the output object using the Azure Data Lake Storage Gen2 file system
configuration = {
    "connection_string" : "DefaultEndpointsProtocol=https;AccountName=<ADLS_GEN2_ACCOUNT_NAME>;AccountKey=<ADLS_GEN2_ACCOUNT_KEY>;EndpointSuffix=core.windows.net",
    "container_name" : "<ADLS_GEN2_CONTAINER_NAME>" # e.g. odbp-integration
    }
output = odb.get_data(dataset_path, resource_name, access_key=access_key, file_system="azure_blob_storage", configuration=configuration)

# Print the file locations
print(output.data_file_name)
print(output.metadata_file_name)

Using `get_data_files`

import opendatablend as odb

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'
access_key = '<ACCESS_KEY>' # The access key can be set to an empty string if you are making a public API request

# Specify the resource names of the data files. In this example, a subset of the available data files will be requested in Parquet format.
resource_names = [
    'date-parquet',
    'time-of-day-parquet',
    'geolocation-parquet',
    'road-safety-accident-info-parquet',
    'road-safety-accident-location-parquet',
    'road-safety-accident-2021-parquet'
    ]

# Get the data and store the output object using the Azure Data Lake Storage Gen2 file system
configuration = {
    "connection_string" : "DefaultEndpointsProtocol=https;AccountName=<ADLS_GEN2_ACCOUNT_NAME>;AccountKey=<ADLS_GEN2_ACCOUNT_KEY>;EndpointSuffix=core.windows.net",
    "container_name" : "<ADLS_GEN2_CONTAINER_NAME>" # e.g. odbp-integration
    }
output = odb.get_data_files(dataset_path, resource_names, access_key=access_key, file_system="azure_blob_storage", configuration=configuration)

# Print the file locations
print(output.data_file_names)
print(output.metadata_file_name)

Amazon S3

Using `get_data`

import opendatablend as odb

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'
access_key = '<ACCESS_KEY>' # The access key can be set to an empty string if you are making a public API request

# Specify the resource name of the data file. In this example, the 'date' data file will be requested in Parquet format.
resource_name = 'date-parquet'

# Get the data and store the output object using the Amazon S3 file system
configuration = {
    "aws_access_key_id" : "<AWS_ACCESS_KEY_ID>",
    "aws_secret_access_key" : "<AWS_SECRET_ACCESS_KEY>",
    "bucket_name" : "<BUCKET_NAME>", # e.g. odbp-integration
    "bucket_region" : "<BUCKET_REGION>" # e.g. eu-west-2
    }

output = odb.get_data(dataset_path, resource_name, access_key=access_key, file_system="amazon_s3", configuration=configuration)

# Print the file locations
print(output.data_file_name)
print(output.metadata_file_name)

Using `get_data_files`

import opendatablend as odb

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'
access_key = '<ACCESS_KEY>' # The access key can be set to an empty string if you are making a public API request

# Specify the resource names of the data files. In this example, a subset of the available data files will be requested in Parquet format.
resource_names = [
    'date-parquet',
    'time-of-day-parquet',
    'geolocation-parquet',
    'road-safety-accident-info-parquet',
    'road-safety-accident-location-parquet',
    'road-safety-accident-2021-parquet'
    ]

# Get the data and store the output object using the Amazon S3 file system
configuration = {
    "aws_access_key_id" : "<AWS_ACCESS_KEY_ID>",
    "aws_secret_access_key" : "<AWS_SECRET_ACCESS_KEY>",
    "bucket_name" : "<BUCKET_NAME>", # e.g. odbp-integration
    "bucket_region" : "<BUCKET_REGION>" # e.g. eu-west-2
    }

output = odb.get_data_files(dataset_path, resource_names, access_key=access_key, file_system="amazon_s3", configuration=configuration)

# Print the file locations
print(output.data_file_names)
print(output.metadata_file_name)

Google Cloud Storage

Using `get_data`

import opendatablend as odb

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'
access_key = '<ACCESS_KEY>' # The access key can be set to an empty string if you are making a public API request

# Specify the resource name of the data file. In this example, the 'date' data file will be requested in Parquet format.
resource_name = 'date-parquet'

# Get the data and store the output object using the Google Cloud Storage file system
configuration = {
    "service_account_private_key_file" : "<PATH_TO_SERVICE_ACCOUNT_PRIVATE_KEY_FILE>",
    "bucket_name" : "<BUCKET_NAME>", # e.g. odbp-integration
    "bucket_location" : "<BUCKET_LOCATION>" # e.g. europe-west2
    }

output = odb.get_data(dataset_path, resource_name, access_key=access_key, file_system="google_cloud_storage", configuration=configuration)

# Print the file locations
print(output.data_file_name)
print(output.metadata_file_name)

Using `get_data_files`

import opendatablend as odb

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'
access_key = '<ACCESS_KEY>' # The access key can be set to an empty string if you are making a public API request

# Specify the resource names of the data files. In this example, a subset of the available data files will be requested in Parquet format.
resource_names = [
    'date-parquet',
    'time-of-day-parquet',
    'geolocation-parquet',
    'road-safety-accident-info-parquet',
    'road-safety-accident-location-parquet',
    'road-safety-accident-2021-parquet'
    ]

# Get the data and store the output object using the Google Cloud Storage file system
configuration = {
    "service_account_private_key_file" : "<PATH_TO_SERVICE_ACCOUNT_PRIVATE_KEY_FILE>",
    "bucket_name" : "<BUCKET_NAME>", # e.g. odbp-integration
    "bucket_location" : "<BUCKET_LOCATION>" # e.g. europe-west2
    }

output = odb.get_data_files(dataset_path, resource_names, access_key=access_key, file_system="google_cloud_storage", configuration=configuration)

# Print the file locations
print(output.data_file_names)
print(output.metadata_file_name)

Additional Examples

For more in-depth examples, see the examples folder.

Release history

1.4.2 (Jan 3, 2024)
1.4.1 (Jan 3, 2024)
1.4.0 (Nov 2, 2023)
1.3.0 (Jul 11, 2023), this version
1.3.0rc1 (Jul 10, 2023), pre-release
1.2.2 (Jul 5, 2023)
1.2.1 (Apr 15, 2022)
1.2.0 (Apr 15, 2022)
1.1.0 (Mar 25, 2022)
1.0.2 (Mar 19, 2022)
1.0.1 (Mar 18, 2022)
1.0.0 (Mar 18, 2022)
0.3.2 (Aug 6, 2021)
0.3.1 (Jul 6, 2021)
0.3.0 (Jul 6, 2021)
0.3.0rc3 (Jul 6, 2021), pre-release
0.3.0rc2 (Jul 6, 2021), pre-release
0.3.0rc1 (Jul 6, 2021), pre-release

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

opendatablend-1.3.0.tar.gz (9.8 kB)

Uploaded Jul 11, 2023 Source

Built Distribution

opendatablend-1.3.0-py3-none-any.whl (7.7 kB)

Uploaded Jul 11, 2023 Python 3

Hashes for opendatablend-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`2d79a78820f35438e27fc9fd08a0386d6478de6e18905410c460f35b81472c0c`
MD5	`25be0d6575fb43838a7bdf34772ad237`
BLAKE2b-256	`27af5f44ca08c5b7ba7ac5f202423a8373853fdc05ef976ba5d993a49ecb9c55`

Hashes for opendatablend-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`86a9a1179d098c7a2caa10312f46b6a8fab9baf8d280aca8cc69dd53089f09e4`
MD5	`a427737e99a00fdf1b8648070e53ed81`
BLAKE2b-256	`6e69c2b7258e1645c9e357ed035cdb7bd5ac50b29a3aea136bd177780e274857`
