Python Datasets on top of Pandas
Project description
Pandas Dataset library
Wrapper on top of pandas that builds nested datasets from pandas DataFrames provided as a dict (table name -> DataFrame).
Readers for CSV and parquet.
Minimal support for adding new features.
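As a rough illustration of the "dict of DataFrames" input shape, the sketch below builds such a dict from CSV text using plain pandas only. It does not use the library's own CSV or parquet readers, whose API is not shown on this page; the CSV contents and the `frames` variable are made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV contents, inlined so the example is self-contained.
root_csv = io.StringIO("root_index,column\n0,0.5\n1,-1.2\n")
nested_csv = io.StringIO("root_index,value\n0,1.0\n0,2.0\n1,3.0\n")

# One DataFrame per table, keyed by table name. This is the same shape of
# input that the Usage example passes to the Dataset constructor.
frames = {
    "root": pd.read_csv(root_csv, index_col="root_index"),  # named index
    "nested": pd.read_csv(nested_csv),  # unnamed default index
}

for name, df in frames.items():
    print(name, df.shape, "index name:", df.index.name)
```

The root table gets a named index through `index_col`, while the nested table keeps pandas' default unnamed index, which is the situation the warning in the Usage output refers to.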
Usage
Copied from the examples:
#!/usr/bin/env python3
from pprint import pprint
import pandas as pd
import numpy as np
from pandas_dataset import Dataset
n_root, n_nested = 100, 300  # 100 rows in the root df, 300 in the nested one, which joins the root one
df_root = pd.DataFrame({
    "column": np.random.randn(n_root),  # random floats
    "column2": ["".join(chr(_y) for _y in y)  # random text
                for y in np.random.randint(ord("A"), ord("z"), size=(n_root, 10))],
}).set_index(pd.Index(range(n_root), name="root_index"))
df_nested = pd.DataFrame({
    "column3": [x.astype(object) for x in np.random.randn(n_nested, 20)],  # vector column (embeddings)
    "root_index": np.random.randint(0, n_root, size=(n_nested,)),  # join key with df_root
})
dataset = Dataset({"root": df_root, "nested": df_nested})
print(dataset)
pprint(dataset.dtypes)
Outputs:
[20240311 12:06-WARNING] Data group 'nested' has empty index name. Defaulting to 'nested' (internal.py:75)
Dataset: 'root ('root_index')': (100, 2) 'nested ('nested')': (300, 2)
{'nested': {'column3': 'vector', 'nested': 'int', 'root_index': 'int'},
'root': {'column': 'float', 'column2': 'object', 'root_index': 'int'}}
Explanation:
- The index is printed as well in the dtypes.
- root_index is the index of df_root.
- nested is the index of df_nested. It is added automatically because df_nested has no index name (no set_index), so the name is taken from the table name (the key of the dict passed to the constructor).
- column is a float column with random numbers.
- column2 is an object column (text).
- column3 is a vector column (arrays), which can be used for embeddings (e.g. kNN over embeddings).
- root_index in nested is the join key from df_nested to df_root, so each entry in the nested df maps directly to an entry in the root df.
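The relation between the two tables can be reproduced with plain pandas. The sketch below does not use pandas_dataset at all; it only assumes the same column names as the example above, names the nested index by hand to mirror the default described in the explanation, and uses a seeded generator for reproducibility.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_root, n_nested = 100, 300

df_root = pd.DataFrame({"column": rng.standard_normal(n_root)}).set_index(
    pd.Index(range(n_root), name="root_index")
)
df_nested = pd.DataFrame({
    "column3": [x.astype(object) for x in rng.standard_normal((n_nested, 20))],
    "root_index": rng.integers(0, n_root, size=n_nested),  # join key into df_root
})
# Done by hand here; the library is described as defaulting to the table name.
df_nested.index.name = "nested"

# Each nested row picks up the columns of the root row it points to.
joined = df_nested.join(df_root, on="root_index")

# Number of nested rows attached to each root row (0 for roots with none).
nested_per_root = (
    df_nested.groupby("root_index").size().reindex(df_root.index, fill_value=0)
)

print(joined.shape)
print(nested_per_root.head())
```

Because every value of root_index in df_nested is drawn from the root index range, the join leaves no missing values in the root columns.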
Download files
Source Distribution
pandas-dataset-0.1.4.tar.gz (21.3 kB)
File details
Details for the file pandas-dataset-0.1.4.tar.gz.
File metadata
- Download URL: pandas-dataset-0.1.4.tar.gz
- Upload date:
- Size: 21.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0dae7751ccf2d258de431d8f95c6f3beaff170017868bf53118ea64f5eb01565
MD5 | 5a22fd9c5ead58ce5b0c93625cdc1877
BLAKE2b-256 | 839e014c9575fb7eeb0a651d2ea620bc65d8a01936b7e39faadb2424646a0412