cellarr-frame
A high-level Python package for managing DataFrames using TileDB as a backing store. This package provides two distinct storage strategies for your data:
- DenseCellArrayFrame: For standard DataFrames. Uses TileDB's native 1D array, multi-attribute storage. This is highly efficient for DataFrames with columns of mixed types (e.g., numbers, strings, dates).
- SparseCellArrayFrame: For sparse DataFrames. Uses a 2D sparse cellarr-array to store data in a "coordinate" (COO) format. This is ideal for very large DataFrames where most values are NaN or 0 (e.g., gene-cell matrices).
Installation
To get started, install the package from PyPI:
pip install cellarr-frame
Factory Function: create_cellarr_frame
The easiest way to get started is with the create_cellarr_frame factory. It automatically builds the correct TileDB array schema based on an initial DataFrame or specified dim_dtypes.
import numpy as np
import pandas as pd
from cellarr_frame import create_cellarr_frame
# Example 1: Create a DENSE frame by providing an initial DataFrame
df = pd.DataFrame({'A': np.arange(5), 'B': [f'val_{i}' for i in range(5)]})
create_cellarr_frame("my_dense_frame.tdb", sparse=False, df=df)
# Example 2: Create an EMPTY SPARSE frame with integer-based dimensions
create_cellarr_frame("my_sparse_frame_int.tdb", sparse=True, dim_dtypes=[np.uint64, np.uint64])
# Example 3: Create an EMPTY SPARSE frame with string-based dimensions
create_cellarr_frame("my_sparse_frame_str.tdb", sparse=True, dim_dtypes=[str, str])
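When it is unclear whether a DataFrame is sparse enough to justify `sparse=True`, measuring the share of missing cells can help. The sketch below uses only pandas and NumPy; `nan_fraction`, `suggest_sparse`, and the 0.9 threshold are hypothetical illustrations, not part of cellarr-frame.

```python
import numpy as np
import pandas as pd


def nan_fraction(df: pd.DataFrame) -> float:
    """Return the fraction of cells in df that are missing (NaN/None)."""
    if df.size == 0:
        return 0.0
    return float(df.isna().to_numpy().mean())


def suggest_sparse(df: pd.DataFrame, threshold: float = 0.9) -> bool:
    """Hypothetical rule of thumb: prefer sparse storage when most cells are missing."""
    return nan_fraction(df) >= threshold


dense_df = pd.DataFrame({"A": np.arange(5), "B": [f"val_{i}" for i in range(5)]})
sparse_df = pd.DataFrame({0: [1.0, np.nan], 1: [np.nan, 2.0]})

print(nan_fraction(dense_df), suggest_sparse(dense_df))
print(nan_fraction(sparse_df), suggest_sparse(sparse_df, threshold=0.5))
```

The threshold is arbitrary; a suitable cut-off depends on the data and on how the frame will be queried.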
1. DenseCellArrayFrame (Native DataFrames)
This is the standard choice for typical, dense DataFrames.
Writing and Appending
This class is designed for efficient appends. The create_cellarr_frame function (or write_dataframe) writes the first chunk, and append_dataframe adds new rows to the end.
import pandas as pd
import numpy as np
from cellarr_frame import create_cellarr_frame, DenseCellArrayFrame
# 1. Create and write the first DataFrame
df1 = pd.DataFrame({
'A': np.arange(5, dtype=np.int32),
'B': np.random.rand(5),
'C': ['foo' + str(i) for i in range(5)]
})
create_cellarr_frame("dense.tdb", sparse=False, df=df1)
# 2. Open the frame and append a second DataFrame
cdf = DenseCellArrayFrame("dense.tdb")
print(f"Shape before append: {cdf.shape}")
df2 = pd.DataFrame({
'A': np.arange(5, 10, dtype=np.int32),
'B': np.random.rand(5),
'C': ['bar' + str(i) for i in range(5)]
})
cdf.append_dataframe(df2)
print(f"Shape after append: {cdf.shape}")
# Shape before append: (5, 3)
# Shape after append: (10, 3)
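For comparison, the append described above has a simple in-memory analogue in plain pandas. This sketch does not touch TileDB or cellarr-frame; it only builds the DataFrame that the two writes are expected to add up to, assuming rows are renumbered consecutively as the shapes above suggest.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "A": np.arange(5, dtype=np.int32),
    "C": [f"foo{i}" for i in range(5)],
})
df2 = pd.DataFrame({
    "A": np.arange(5, 10, dtype=np.int32),
    "C": [f"bar{i}" for i in range(5)],
})

# ignore_index=True renumbers rows 0..9, mirroring "adds new rows to the end"
expected = pd.concat([df1, df2], ignore_index=True)
print(expected.shape)
print(expected.iloc[[4, 5]])
```

A frame built this way can serve as a reference when checking the result of `read_dataframe()` after an append, although stored dtypes may differ from the in-memory ones.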
Reading and Querying
You can read the full DataFrame or query it using standard Python slicing.
# 1. Read the full DataFrame
full_df = cdf.read_dataframe()
print(full_df)
# A B C
# 0 0 0.123456 foo0
# 1 1 0.234567 foo1
# ...
# 8 8 0.456789 bar3
# 9 9 0.567890 bar4
# 2. Querying with __getitem__
# Get specific rows (end-exclusive slice, like positional slicing in pandas)
row_subset = cdf[5:8]
# A B C
# 5 5 0.345678 bar0
# 6 6 0.456789 bar1
# 7 7 0.567890 bar2
# Get a single column
col_A = cdf['A']
# A
# 0 0
# 1 1
# ...
# Get multiple columns
cols_AC = cdf[['A', 'C']]
# A C
# 0 0 foo0
# 1 1 foo1
# ...
# Get specific rows and columns
subset = cdf[1:3, ['A', 'C']]
# A C
# 1 1 foo1
# 2 2 foo2
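Judging by the outputs shown above, these queries line up with positional indexing in pandas. The sketch below shows the in-memory equivalents on an ordinary DataFrame; it is an analogy based on the example outputs, not a statement about how the package resolves queries internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": np.arange(10),
    "B": np.linspace(0.0, 0.9, 10),
    "C": [f"v{i}" for i in range(10)],
})

rows = df.iloc[5:8]              # analogue of cdf[5:8]: rows 5, 6, 7
cols = df[["A", "C"]]            # analogue of cdf[['A', 'C']]
both = df.iloc[1:3][["A", "C"]]  # analogue of cdf[1:3, ['A', 'C']]

print(rows.index.tolist())
print(both)
```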
Properties
print(f"Shape: {cdf.shape}") # (10, 3)
print(f"Columns: {cdf.columns}") # Index(['A', 'B', 'C'], dtype='object')
print(f"Index: {cdf.index}") # RangeIndex(start=0, stop=10, step=1)
2. SparseCellArrayFrame (Sparse DataFrames)
This is the best choice for data that is mostly empty (NaN). It only stores the values that exist, saving significant space.
Writing and Appending
Writing to a sparse frame involves stack()-ing the DataFrame to find all non-NaN values and writing them to the 2D array.
import pandas as pd
import numpy as np
from cellarr_frame import create_cellarr_frame, SparseCellArrayFrame
# 1. Create a sparse DataFrame (most values are NaN)
df1 = pd.DataFrame({
0: [1.0, np.nan], # Index 0, 1
1: [np.nan, 2.0]
})
# Create the array and write the data
# We specify integer dtypes for the dimensions (row/col labels)
create_cellarr_frame("sparse.tdb", sparse=True, df=df1, dim_dtypes=[np.uint64, np.uint64])
# 2. Open the frame and append new data
sdf = SparseCellArrayFrame("sparse.tdb")
print(f"Shape before append: {sdf.shape}")
# This new DataFrame will be appended starting at the next available row index
df2 = pd.DataFrame({
1: [3.0, np.nan], # Relative index 0, 1
2: [np.nan, 4.0]
})
sdf.append_dataframe(df2) # Automatically appends at rows 2 and 3
print(f"Shape after append: {sdf.shape}")
# Shape before append: (2, 2)
# Shape after append: (4, 3)
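The stacking step described at the start of this section can be pictured with plain pandas and NumPy. The sketch below is not the package's implementation; `to_coo` is a hypothetical helper that produces the same kind of (row, column, value) records that `stack()` followed by dropping NaN values would give, written with a boolean mask so it does not depend on version-specific NaN handling in `stack()`.

```python
import numpy as np
import pandas as pd


def to_coo(df: pd.DataFrame) -> pd.DataFrame:
    """Return one (row, col, value) record per non-NaN cell of df."""
    mask = df.notna().to_numpy()
    row_pos, col_pos = np.nonzero(mask)  # positions of the stored cells
    return pd.DataFrame({
        "row": df.index.to_numpy()[row_pos],
        "col": df.columns.to_numpy()[col_pos],
        "value": df.to_numpy()[row_pos, col_pos],
    })


df1 = pd.DataFrame({0: [1.0, np.nan], 1: [np.nan, 2.0]})
coo = to_coo(df1)
print(coo)
```

Only two of the four cells produce a record, which is where the space saving of the sparse layout comes from.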
Reading and Querying
Reading reconstructs the DataFrame from the sparse coordinates.
# 1. Read the full DataFrame
full_df = sdf.read_dataframe()
print(full_df)
# 0 1 2
# 0 1.0 NaN NaN
# 1 NaN 2.0 NaN
# 2 NaN 3.0 NaN
# 3 NaN NaN 4.0
# 2. Querying with __getitem__
# Get specific rows
row_subset = sdf[1:3]
# 0 1 2
# 1 NaN 2.0 NaN
# 2 NaN 3.0 NaN
# Get specific columns (by label)
col_subset = sdf[[0, 2]]
# 0 2
# 0 1.0 NaN
# 1 NaN NaN
# 2 NaN NaN
# 3 NaN 4.0
# Get specific rows and columns
subset = sdf[0:2, [1]]
# 1
# 0 NaN
# 1 2.0
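Reconstruction from coordinates can be illustrated the same way. The sketch below pivots a small table of (row, col, value) records back into a wide DataFrame with pandas; `from_coo` is a hypothetical helper and the records are typed in by hand to match the four values written in the example above, so nothing is read from TileDB.

```python
import numpy as np
import pandas as pd

# Hand-written COO records matching the four stored values in the example
coo = pd.DataFrame({
    "row": [0, 1, 2, 3],
    "col": [0, 1, 1, 2],
    "value": [1.0, 2.0, 3.0, 4.0],
})


def from_coo(records: pd.DataFrame) -> pd.DataFrame:
    """Pivot (row, col, value) records into a wide DataFrame; absent cells become NaN."""
    wide = records.pivot(index="row", columns="col", values="value")
    wide.index.name = None
    wide.columns.name = None
    return wide


full = from_coo(coo)
print(full)
```

Cells without a record come back as NaN, which matches the layout of the full DataFrame printed earlier in this section.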
String Dimensions
SparseCellArrayFrame also fully supports string-based row and column labels.
# Create with string dimensions
create_cellarr_frame("sparse_str.tdb", sparse=True, dim_dtypes=[str, str])
sdf_str = SparseCellArrayFrame("sparse_str.tdb")
# Write DataFrame with string index/columns
df_str1 = pd.DataFrame({'col_A': [1.0, np.nan]}, index=['row_A', 'row_B'])
sdf_str.write_dataframe(df_str1)
# Appending with string dimensions just adds the new coordinates
df_str2 = pd.DataFrame({'col_B': [3.0]}, index=['row_C'])
sdf_str.append_dataframe(df_str2)
print(sdf_str.read_dataframe())
# col_A col_B
# row_A 1.0 NaN
# row_C NaN 3.0
Note: row_B is missing from the output because all of its values are NaN, so no cells are stored for that row.
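The disappearance of an all-NaN row follows directly from coordinate storage and can be reproduced without TileDB. The sketch below is a pandas-only round trip under the assumption that only non-NaN cells are kept; it is an illustration, not the package's code path.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_A": [1.0, np.nan]}, index=["row_A", "row_B"])

# Keep only the non-NaN cells as (row, col, value) records
row_pos, col_pos = np.nonzero(df.notna().to_numpy())
records = pd.DataFrame({
    "row": df.index.to_numpy()[row_pos],
    "col": df.columns.to_numpy()[col_pos],
    "value": df.to_numpy()[row_pos, col_pos],
})

# Rebuild the wide layout from the stored records alone
restored = records.pivot(index="row", columns="col", values="value")
print(restored.index.tolist())
```

Because row_B contributes no record, nothing remains from which its label could be recovered.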
Properties
Properties on sparse frames query the array to find the unique dimension labels.
print(f"Shape: {sdf_str.shape}") # (2, 2)
print(f"Columns: {sdf_str.columns}") # Index(['col_A', 'col_B'], dtype='object')
print(f"Index: {sdf_str.index}") # Index(['row_A', 'row_C'], dtype='object')
Note
This project has been set up using BiocSetup and PyScaffold.