Skip to main content

Get clean datasets from DataHerb to boost your data science and data analysis projects

Project description


Markdownify
The Python Package for DataHerb

A DataHerb Core Service to Create and Load Datasets.

Install

pip install dataherb

Documentation: dataherb.github.io/dataherb-python

The DataHerb Command-Line Tool

Requires Python 3

The DataHerb cli provides tools to create dataset metadata, validate metadata, search dataset in flora, and download dataset.

Search and Download

Search by keyword

dataherb search covid19
# Shows the minimal metadata

Search by dataherb id

dataherb search -i covid19_eu_data
# Shows the full metadata

Download dataset by dataherb id

dataherb download covid19_eu_data
# Downloads this dataset: http://dataherb.io/flora/covid19_eu_data

Create Dataset Using Command Line Tool

We provide a template for dataset creation.

Within a dataset folder where the data files are located, use the following command line tool to create the metadata template.

dataherb create

Upload dataset to remote

Within the dataset folder, run

dataherb upload

UI for all the datasets in a flora

dataherb serve

Use DataHerb in Your Code

Load Data into DataFrame

# Load the package
from dataherb.flora import Flora

# Initialize Flora service
# The Flora service holds all the dataset metadata
use_flora = "path/to/my/flora.json"
dataherb = Flora(flora=use_flora)

# Search datasets with keyword(s)
geo_datasets = dataherb.search("geo")
print(geo_datasets)

# Get a specific file from a dataset and load as DataFrame
tz_df = pd.read_csv(
  dataherb.herb(
      "geonames_timezone"
  ).get_resource(
      "dataset/geonames_timezone.csv"
  )
)
print(tz_df)

The DataHerb Project

What is DataHerb

DataHerb is an open-source data discovery and management tool.

  • A DataHerb or Herb is a dataset. A dataset comes with the data files, and the metadata of the data files.
  • A Herb Resource or Resource is a data file in the DataHerb.
  • A Flora is the combination of all the DataHerbs.

In many data projects, finding the right datasets to enhance your data is one of the most time consuming part. DataHerb adds flavor to your data project. By creating metadata and manage the datasets systematically, locating an dataset is much easier.

Currently, dataherb supports sync dataset between local and S3/git. Each dataset can have its own remote location.

What is DataHerb Flora

We desigined the following workflow to share and index open datasets.

DataHerb Workflow

The repo dataherb-flora is a demo flora that lists some datasets and demonstrated on the website https://dataherb.github.io. At this moment, the whole system is being renovated.

Development

  1. Create a conda environment.
  2. Install requirements: pip install -r requirements.txt

Documentation

The source of the documentation for this package is located at docs.

References and Acknolwedgement

  • dataherb uses datapackage in the core. datapackage is a python library for the data-package standard. The core schema of the dataset is essentially the data-package standard.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dataherb-0.1.3.tar.gz (22.7 kB view details)

Uploaded Source

Built Distribution

dataherb-0.1.3-py3-none-any.whl (30.6 kB view details)

Uploaded Python 3

File details

Details for the file dataherb-0.1.3.tar.gz.

File metadata

  • Download URL: dataherb-0.1.3.tar.gz
  • Upload date:
  • Size: 22.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.5.0 pkginfo/1.6.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.7.9

File hashes

Hashes for dataherb-0.1.3.tar.gz
Algorithm Hash digest
SHA256 9e129ed4dfb76d983903c4a9e27797369e87200ff2934632c24281d1b31c337b
MD5 2e4a5d8d29d8d66c20eafd3d8a1d0fc1
BLAKE2b-256 d4a45800d7ed55b8e2146ccef6b535e334ddb3004cd57c0dd809b9cbdd0b2979

See more details on using hashes here.

File details

Details for the file dataherb-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: dataherb-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 30.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.5.0 pkginfo/1.6.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.7.9

File hashes

Hashes for dataherb-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 9b07cd6779af59cabe82c61db4c5341b8a657d6ac9afc39aa31db20dfabb1108
MD5 6d1113db802a69304d202b3706d38805
BLAKE2b-256 abf0b065128a00dfebbfc091a83cba19340a41c1767b4c3e30234b2b26c82275

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page