A high-level Python package for managing DataFrames using TileDB as a backing store

cellarr-frame

A high-level Python package for managing DataFrames using TileDB as a backing store. This package provides two distinct storage strategies for your data:

  • DenseCellArrayFrame: For standard DataFrames. Uses TileDB's native 1D, multi-attribute array storage. This is efficient for DataFrames with columns of mixed types (e.g., numbers, strings, dates).
  • SparseCellArrayFrame: For sparse DataFrames. Uses a 2D sparse cellarr-array to store data in "coordinate" (COO) format. This is ideal for very large DataFrames where most values are NaN or 0 (e.g., gene-cell matrices).
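
As a rough way to decide between the two strategies, the sketch below measures how much of a table is missing. It is plain Python and not part of cellarr-frame: the helper names `missing_fraction` and `suggest_sparse` and the 0.9 threshold are assumptions made for illustration only.

```python
import math


def missing_fraction(rows):
    """Return the fraction of cells that are None or NaN in a list of row lists."""
    total = 0
    missing = 0
    for row in rows:
        for value in row:
            total += 1
            if value is None or (isinstance(value, float) and math.isnan(value)):
                missing += 1
    return missing / total if total else 0.0


def suggest_sparse(rows, threshold=0.9):
    """Hypothetical rule of thumb: prefer sparse storage when most cells are missing."""
    return missing_fraction(rows) >= threshold


nan = float("nan")
mostly_empty = [
    [1.0, nan, nan, nan],
    [nan, nan, nan, nan],
    [nan, nan, nan, nan],
]  # 11 of 12 cells are missing
mixed_dense = [[1, "a"], [2, "b"]]  # no missing cells

print(suggest_sparse(mostly_empty))  # True
print(suggest_sparse(mixed_dense))   # False
```

The threshold is arbitrary; the right cut-off depends on the data and access pattern.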

Installation

To get started, install the package from PyPI:

pip install cellarr-frame

Factory Function: create_cellarr_frame

The easiest way to get started is with the create_cellarr_frame factory. It automatically builds the correct TileDB array schema based on an initial DataFrame or specified dim_dtypes.

import numpy as np
import pandas as pd
from cellarr_frame import create_cellarr_frame

# Example 1: Create a DENSE frame by providing an initial DataFrame
df = pd.DataFrame({'A': np.arange(5), 'B': [f'val_{i}' for i in range(5)]})
create_cellarr_frame("my_dense_frame.tdb", sparse=False, df=df)

# Example 2: Create an EMPTY SPARSE frame with integer-based dimensions
create_cellarr_frame("my_sparse_frame_int.tdb", sparse=True, dim_dtypes=[np.uint64, np.uint64])

# Example 3: Create an EMPTY SPARSE frame with string-based dimensions
create_cellarr_frame("my_sparse_frame_str.tdb", sparse=True, dim_dtypes=[str, str])

1. DenseCellArrayFrame (Native DataFrames)

This is the standard choice for typical, dense DataFrames.

Writing and Appending

This class is designed for efficient appends. The create_cellarr_frame function (or write_dataframe) writes the first chunk, and append_dataframe adds new rows to the end.

import pandas as pd
import numpy as np
from cellarr_frame import create_cellarr_frame, DenseCellArrayFrame

# 1. Create and write the first DataFrame
df1 = pd.DataFrame({
    'A': np.arange(5, dtype=np.int32),
    'B': np.random.rand(5),
    'C': ['foo' + str(i) for i in range(5)]
})
create_cellarr_frame("dense.tdb", sparse=False, df=df1)

# 2. Open the frame and append a second DataFrame
cdf = DenseCellArrayFrame("dense.tdb")
print(f"Shape before append: {cdf.shape}")

df2 = pd.DataFrame({
    'A': np.arange(5, 10, dtype=np.int32),
    'B': np.random.rand(5),
    'C': ['bar' + str(i) for i in range(5)]
})
cdf.append_dataframe(df2)

print(f"Shape after append: {cdf.shape}")

# Shape before append: (5, 3)
# Shape after append: (10, 3)
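
To make the append semantics concrete, here is a minimal pure-Python sketch of a column store where each chunk is added to the end of every column. It only illustrates the idea; `append_chunk` is a hypothetical helper, and the checks it performs are assumptions rather than a description of what cellarr-frame does internally.

```python
def append_chunk(store, chunk):
    """Append equal-length column lists from `chunk` to `store`; return the new row count."""
    if list(store) != list(chunk):
        raise ValueError("chunk columns must match the stored columns")
    lengths = {len(values) for values in chunk.values()}
    if len(lengths) > 1:
        raise ValueError("all columns in a chunk must have the same length")
    for name, values in chunk.items():
        store[name].extend(values)
    return len(next(iter(store.values()), []))


# First chunk: rows 0..4
store = {"A": list(range(5)), "C": [f"foo{i}" for i in range(5)]}
# Second chunk: lands at rows 5..9
chunk = {"A": list(range(5, 10)), "C": [f"bar{i}" for i in range(5)]}

n_rows = append_chunk(store, chunk)
print(n_rows, store["C"][5])  # 10 bar0
```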

Reading and Querying

You can read the full DataFrame or query it using standard Python slicing.

# 1. Read the full DataFrame
full_df = cdf.read_dataframe()
print(full_df)

#     A         B      C
# 0   0  0.123456   foo0
# 1   1  0.234567   foo1
# ...
# 8   8  0.456789   bar3
# 9   9  0.567890   bar4

# 2. Querying with __getitem__

# Get specific rows (the end of the slice is exclusive, like positional slicing in pandas)
row_subset = cdf[5:8]
#    A         B      C
# 5  5  0.345678   bar0
# 6  6  0.456789   bar1
# 7  7  0.567890   bar2

# Get a single column
col_A = cdf['A']
#    A
# 0  0
# 1  1
# ...

# Get multiple columns
cols_AC = cdf[['A', 'C']]
#    A      C
# 0  0   foo0
# 1  1   foo1
# ...

# Get specific rows and columns
subset = cdf[1:3, ['A', 'C']]
#    A      C
# 1  1   foo1
# 2  2   foo2
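
The four query forms above (row slice, single column, column list, rows plus columns) can all be reduced to one pair of "which rows" and "which columns". The sketch below shows one way such a key could be normalized. It is an assumption about how a `__getitem__` like this might be organized, not the library's code, and `split_key` is a hypothetical name.

```python
def split_key(key, all_columns):
    """Normalize an indexing key into (row_slice, column_list)."""
    if isinstance(key, tuple):
        rows, cols = key              # cdf[1:3, ['A', 'C']]
    elif isinstance(key, slice):
        rows, cols = key, None        # cdf[5:8]
    else:
        rows, cols = slice(None), key  # cdf['A'] or cdf[['A', 'C']]

    if cols is None:
        cols = list(all_columns)
    elif isinstance(cols, (str, int)):
        cols = [cols]                 # a single label becomes a one-element list
    else:
        cols = list(cols)
    return rows, cols


columns = ["A", "B", "C"]
print(split_key(slice(5, 8), columns))
# (slice(5, 8, None), ['A', 'B', 'C'])
print(split_key("A", columns))
# (slice(None, None, None), ['A'])
print(split_key((slice(1, 3), ["A", "C"]), columns))
# (slice(1, 3, None), ['A', 'C'])
```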

Properties

print(f"Shape: {cdf.shape}")       # (10, 3)
print(f"Columns: {cdf.columns}")   # Index(['A', 'B', 'C'], dtype='object')
print(f"Index: {cdf.index}")       # RangeIndex(start=0, stop=10, step=1)

2. SparseCellArrayFrame (Sparse DataFrames)

This is the best choice for data that is mostly empty (NaN). It stores only the values that exist, which saves significant space.

Writing and Appending

Writing to a sparse frame involves stack()-ing the DataFrame to find all non-NaN values and writing them to the 2D array.
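
The idea behind the coordinate format can be sketched without pandas or TileDB: walk the table and keep a (row, column, value) triple for every cell that is not NaN. This is an illustration of the concept only; `to_coo` is a hypothetical helper, not a cellarr-frame function.

```python
import math


def to_coo(row_labels, col_labels, values):
    """Return (row, col, value) triples for every non-NaN cell of a 2D table."""
    triples = []
    for r, row in zip(row_labels, values):
        for c, v in zip(col_labels, row):
            if not (isinstance(v, float) and math.isnan(v)):
                triples.append((r, c, v))
    return triples


nan = float("nan")
# A 2x2 table with values only on the diagonal
coo = to_coo([0, 1], [0, 1], [[1.0, nan], [nan, 2.0]])
print(coo)  # [(0, 0, 1.0), (1, 1, 2.0)]
```

Only two of the four cells are kept, which is where the space saving for mostly-empty data comes from.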

import pandas as pd
import numpy as np
from cellarr_frame import create_cellarr_frame, SparseCellArrayFrame

# 1. Create a sparse DataFrame (most values are NaN)
df1 = pd.DataFrame({
    0: [1.0, np.nan],  # Index 0, 1
    1: [np.nan, 2.0]
})

# Create the array and write the data
# We specify integer dtypes for the dimensions (row/col labels)
create_cellarr_frame("sparse.tdb", sparse=True, df=df1, dim_dtypes=[np.uint64, np.uint64])

# 2. Open the frame and append new data
sdf = SparseCellArrayFrame("sparse.tdb")
print(f"Shape before append: {sdf.shape}")

# This new DataFrame will be appended starting at the next available row index
df2 = pd.DataFrame({
    1: [3.0, np.nan],  # Relative index 0, 1
    2: [np.nan, 4.0]
})
sdf.append_dataframe(df2) # Automatically appends at rows 2 and 3

print(f"Shape after append: {sdf.shape}")

# Shape before append: (2, 2)
# Shape after append: (4, 3)
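
The row offset applied on append can be pictured with coordinate triples. In the sketch below, the incoming rows are numbered relative to their own DataFrame and are shifted to start after the largest stored row. How cellarr-frame actually tracks the next available row is not stated here, so "largest stored row plus one" is an assumption, and `append_rows` is a hypothetical helper.

```python
def append_rows(existing, new_triples):
    """Shift relative row indices so new rows start after the current maximum row."""
    next_row = max((r for r, _, _ in existing), default=-1) + 1
    return existing + [(r + next_row, c, v) for r, c, v in new_triples]


stored = [(0, 0, 1.0), (1, 1, 2.0)]    # rows 0 and 1 already written
incoming = [(0, 1, 3.0), (1, 2, 4.0)]  # relative rows 0 and 1 of the new data

combined = append_rows(stored, incoming)
print(combined)
# [(0, 0, 1.0), (1, 1, 2.0), (2, 1, 3.0), (3, 2, 4.0)]
```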

Reading and Querying

Reading reconstructs the DataFrame from the sparse coordinates.

# 1. Read the full DataFrame
full_df = sdf.read_dataframe()
print(full_df)

#      0    1    2
# 0  1.0  NaN  NaN
# 1  NaN  2.0  NaN
# 2  NaN  3.0  NaN
# 3  NaN  NaN  4.0

# 2. Querying with __getitem__

# Get specific rows
row_subset = sdf[1:3]
#      0    1    2
# 1  NaN  2.0  NaN
# 2  NaN  3.0  NaN

# Get specific columns (by label)
col_subset = sdf[[0, 2]]
#      0    2
# 0  1.0  NaN
# 1  NaN  NaN
# 2  NaN  NaN
# 3  NaN  4.0

# Get specific rows and columns
subset = sdf[0:2, [1]]
#      1
# 0  NaN
# 1  2.0
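
Going the other way, a table can be rebuilt from coordinate triples by collecting the unique row and column labels and filling every missing cell. The sketch below is plain Python for illustration, with `None` standing in for NaN; `from_coo` is a hypothetical helper and not the library's reader.

```python
def from_coo(triples, fill=None):
    """Rebuild (row_labels, col_labels, table) from (row, col, value) triples."""
    row_labels = sorted({r for r, _, _ in triples})
    col_labels = sorted({c for _, c, _ in triples})
    cells = {(r, c): v for r, c, v in triples}
    table = [[cells.get((r, c), fill) for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


triples = [(0, 0, 1.0), (1, 1, 2.0), (2, 1, 3.0), (3, 2, 4.0)]
rows, cols, table = from_coo(triples)
print(rows, cols)  # [0, 1, 2, 3] [0, 1, 2]
for line in table:
    print(line)
# [1.0, None, None]
# [None, 2.0, None]
# [None, 3.0, None]
# [None, None, 4.0]
```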

String Dimensions

SparseCellArrayFrame also fully supports string-based row and column labels.

# Create with string dimensions
create_cellarr_frame("sparse_str.tdb", sparse=True, dim_dtypes=[str, str])
sdf_str = SparseCellArrayFrame("sparse_str.tdb")

# Write DataFrame with string index/columns
df_str1 = pd.DataFrame({'col_A': [1.0, np.nan]}, index=['row_A', 'row_B'])
sdf_str.write_dataframe(df_str1)

# Appending with string dimensions just adds the new coordinates
df_str2 = pd.DataFrame({'col_B': [3.0]}, index=['row_C'])
sdf_str.append_dataframe(df_str2)

print(sdf_str.read_dataframe())
#        col_A  col_B
# row_A    1.0    NaN
# row_C    NaN    3.0

Note: row_B is missing because all of its values are NaN, so no cells are stored for that row.
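
The disappearance of row_B follows directly from storing only non-NaN cells. The short sketch below reproduces it in plain Python: once the NaN cell is dropped, nothing with the label row_B remains to be read back. The nested dictionary is just a stand-in for the two DataFrames written above.

```python
import math

nan = float("nan")
# Stand-in for the written data: row label -> {column label: value}
frame = {
    "row_A": {"col_A": 1.0},
    "row_B": {"col_A": nan},
    "row_C": {"col_B": 3.0},
}

# Keep only the cells that hold a real value
stored = [
    (r, c, v)
    for r, cols in frame.items()
    for c, v in cols.items()
    if not math.isnan(v)
]

row_labels = sorted({r for r, _, _ in stored})
print(row_labels)  # ['row_A', 'row_C']
```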

Properties

Properties on sparse frames query the array to find the unique dimension labels.

print(f"Shape: {sdf_str.shape}")       # (2, 2)
print(f"Columns: {sdf_str.columns}")   # Index(['col_A', 'col_B'], dtype='object')
print(f"Index: {sdf_str.index}")       # Index(['row_A', 'row_C'], dtype='object')

Note

This project has been set up using BiocSetup and PyScaffold.
