Python Datasets on top of Pandas

Project description

Pandas Dataset library

A wrapper on top of pandas that supports nested datasets built from pandas DataFrames provided as a dict.

Readers for CSV and parquet.

Minimal support for adding new features.

Usage

Copied from the examples:

#!/usr/bin/env python3
from pprint import pprint
import pandas as pd
import numpy as np
from pandas_dataset import Dataset

n_root, n_nested = 100, 300  # 100 rows in the root df, 300 in the nested df that joins to it
df_root = pd.DataFrame({"column": np.random.randn(n_root),  # random floats
                        "column2": ["".join(chr(c) for c in row)  # random 10-character text
                                    for row in np.random.randint(ord("A"), ord("z"), size=(n_root, 10))]
                        }).set_index(pd.Index(range(n_root), name="root_index"))
df_nested = pd.DataFrame({"column3": [x.astype(object) for x in np.random.randn(n_nested, 20)],  # vector column (embeddings)
                          "root_index": np.random.randint(0, n_root, size=(n_nested,))  # join key with df_root
                          })
dataset = Dataset({"root": df_root, "nested": df_nested})
print(dataset)
pprint(dataset.dtypes)

Outputs:

[20240311 12:06-WARNING] Data group 'nested' has empty index name. Defaulting to 'nested' (internal.py:75)
Dataset: 'root ('root_index')':  (100, 2) 'nested ('nested')':  (300, 2)
{'nested': {'column3': 'vector', 'nested': 'int', 'root_index': 'int'},
 'root': {'column': 'float', 'column2': 'object', 'root_index': 'int'}}
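
The warning in the output above is triggered because df_nested has an unnamed index. The following is a minimal sketch, using plain pandas only, of naming the index before the DataFrame is handed to the constructor. That a named index avoids the warning is an assumption based on the message text, and the name "nested_index" is hypothetical.

```python
import numpy as np
import pandas as pd

# Same shape as the nested frame in the example, reduced to the join key.
df_nested = pd.DataFrame({"root_index": np.random.randint(0, 100, size=(300,))})
print(df_nested.index.name)  # a default RangeIndex has no name

# rename_axis returns a new frame whose index carries the given name.
# "nested_index" is a hypothetical name chosen for illustration.
df_named = df_nested.rename_axis("nested_index")
print(df_named.index.name)
```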

Explanation:

  • The dtypes output also lists each index. root_index is the index of df_root. nested is the index of df_nested; it is added automatically because df_nested has no index name (set_index was not called), so the name is taken from the table name (the key in the dict passed to the constructor).
  • column is a float column with random numbers.
  • column2 is an object column (text).
  • column3 is a vector column (arrays), which can be used for embeddings (e.g. k-NN over embeddings).
  • root_index in df_nested is the join key from df_nested to df_root, so each entry in the nested df maps directly to an entry in the root df.
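
The relationship between the two tables can be reproduced with plain pandas. The sketch below is not the library's internals, only an illustration of the idea under small, made-up sizes: each nested row is joined to its root row through root_index, and the vector column is stacked into a 2-D array of the kind a k-NN search would consume.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_root, n_nested, dim = 5, 12, 4  # small illustrative sizes

df_root = pd.DataFrame({"column": rng.standard_normal(n_root)},
                       index=pd.Index(range(n_root), name="root_index"))
df_nested = pd.DataFrame({
    "column3": list(rng.standard_normal((n_nested, dim))),  # one vector per row
    "root_index": rng.integers(0, n_root, size=n_nested),   # join key into df_root
})

# Attach the root row's columns to every nested row (many-to-one join).
joined = df_nested.join(df_root, on="root_index")

# Count nested rows per root row; root rows without nested rows get 0.
children = df_nested.groupby("root_index").size().reindex(df_root.index, fill_value=0)

# Stack the per-row vectors into an (n_nested, dim) matrix.
embeddings = np.stack(df_nested["column3"].to_numpy())

print(joined.head())
print(children)
print(embeddings.shape)
```

The join is many-to-one: several nested rows may point at the same root row, and some root rows may have no nested rows at all.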


Source Distribution

pandas-dataset-0.1.4.tar.gz (21.3 kB view details)

Uploaded Source

File details

Details for the file pandas-dataset-0.1.4.tar.gz.

File metadata

  • Download URL: pandas-dataset-0.1.4.tar.gz
  • Upload date:
  • Size: 21.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for pandas-dataset-0.1.4.tar.gz

Algorithm     Hash digest
SHA256        0dae7751ccf2d258de431d8f95c6f3beaff170017868bf53118ea64f5eb01565
MD5           5a22fd9c5ead58ce5b0c93625cdc1877
BLAKE2b-256   839e014c9575fb7eeb0a651d2ea620bc65d8a01936b7e39faadb2424646a0412
