
Sparse Lazy Array Format - MVP for single-cell data


SLAF (Sparse Lazy Array Format)



SLAF is a high-performance format for single-cell data that combines the power of SQL with lazy evaluation. Built for large-scale single-cell analysis with memory efficiency and production-ready ML capabilities.

Be Lazy (lazy APIs for AnnData and Scanpy) • Write SQL (arbitrary SQL to query the tables) • Train Foundation Models (with tokenizers and dataloaders)

🚀 Key Features

  • ⚡ Fast: SQL-level performance for data operations
  • 💾 Memory Efficient: Lazy evaluation, only load what you need
  • 🔍 SQL Native: Direct SQL queries on your data
  • 🧬 Scanpy Compatible: Drop-in replacement for AnnData workflows
  • ⚙️ ML Ready: Efficient tokenization and dataloaders for ML training
  • 🔧 Production Ready: Built for large-scale single-cell analysis

📦 Installation

Default Installation (Batteries Included)

The default installation includes core functionality, CLI tools, and data conversion capabilities:

# Using uv (recommended)
uv add slafdb

# Or pip
pip install slafdb

What's included by default:

  • ✅ Core SLAF functionality (SQL queries, data structures)
  • ✅ CLI tools (slaf convert, slaf query, etc.)
  • ✅ Data conversion tools (scanpy, h5py for h5ad files)
  • ✅ Rich console output and progress bars
  • ✅ Cross-platform compatibility

What's NOT included by default:

Dependencies for:

  • ❌ Machine learning features (PyTorch tokenizers)
  • ❌ Advanced single-cell tools (igraph, leidenalg)

Platform-Specific Notes

Polars Compatibility:

  • Linux/Windows: Works with standard polars
  • macOS (Apple Silicon): May require polars-lts-cpu for compatibility

If you encounter polars-related issues on macOS, you have several options:

Option 1: Manual platform-specific installation

# For macOS Apple Silicon
pip install "polars-lts-cpu>=1.31.0"
pip install slafdb

# For Linux/Windows
pip install slafdb

Option 2: Use uv with manual polars specification

# For macOS Apple Silicon
uv add "polars-lts-cpu>=1.31.0"
uv add slafdb

# For Linux/Windows
uv add slafdb

Note: Package managers don't automatically choose between polars and polars-lts-cpu, so you may need to specify the correct package for your platform.
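If you script your installs, a small helper like the one below (a hypothetical sketch, not part of SLAF) can pick the right requirement string using only the standard library:

```python
import platform

def polars_requirement() -> str:
    """Return the polars requirement for the current platform.

    Assumption: only Apple Silicon macOS needs polars-lts-cpu;
    everything else uses standard polars.
    """
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "polars-lts-cpu>=1.31.0"
    return "polars"

# Feed the result to pip/uv, e.g.: pip install "$(python pick_polars.py)"
print(polars_requirement())
```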

Optional Dependencies

Add specific features as needed:

Using uv:

uv add "slafdb[ml]"
uv add "slafdb[advanced]"
uv add "slafdb[full]"
uv add "slafdb[dev]"

Using pip:

pip install "slafdb[ml]"
pip install "slafdb[advanced]"
pip install "slafdb[full]"
pip install "slafdb[dev]"

Development Installation

git clone https://github.com/slaf-project/slaf.git
cd slaf
uv sync --extra dev --extra test --extra docs

🚀 Quick Start

Converting Your Data

Convert your existing single-cell data to SLAF format - no extra dependencies required!

# Convert AnnData (.h5ad) to SLAF
slaf convert input.h5ad output.slaf

# Convert HDF5 to SLAF
slaf convert input.h5 output.slaf

# Convert 10x Genomics data
slaf convert path/to/10x/filtered_feature_bc_matrix output.slaf

Basic Usage

from slaf import SLAFArray

# Load a SLAF dataset
slaf = SLAFArray("path/to/dataset.slaf")

# Describe the dataset
print(slaf.info())

# Execute SQL queries directly
results = slaf.query("""
    SELECT batch, COUNT(*) as count
    FROM cells
    GROUP BY batch
    ORDER BY count DESC
""")
print(results)

Filtering Data

# Filter cells by metadata
filtered_cells = slaf.filter_cells(
    batch="batch1",
    total_counts=">1000"
)

# Filter genes
filtered_genes = slaf.filter_genes(
    highly_variable=True
)

# Get expression submatrix
expression = slaf.get_submatrix(
    cell_selector=filtered_cells,
    gene_selector=filtered_genes
)

🦥 Be Lazy - Lazy AnnData & Scanpy Integration

SLAF provides lazy versions of AnnData and Scanpy operations that only compute when needed:

from slaf.integrations.anndata import read_slaf
import scanpy as sc

# Load as lazy AnnData
adata = read_slaf("path/to/dataset.slaf")
print(f"Type: {type(adata)}")  # LazyAnnData
print(f"Expression matrix type: {type(adata.X)}")  # LazyExpressionMatrix

# Apply scanpy operations (lazy)
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)

# Still lazy - no computation yet
print(f"Still lazy: {type(adata.X)}")

# Compute when needed
adata.compute()  # Now it's a real AnnData object

Lazy Computation Control

# Compute specific parts
expression_matrix = adata.X.compute()  # Just the expression matrix
cell_metadata = adata.obs              # Cell metadata
gene_metadata = adata.var              # Gene metadata

# Or compute everything at once
real_adata = adata.compute()

Lazy Slicing

# All slicing operations are lazy
subset = adata[:100, :50]  # Lazy slice
filtered = adata[adata.obs['n_genes_by_counts'] > 1000]  # Lazy filtering

🔍 Write SQL - Direct Database Access

SLAF stores data in three main tables that you can query directly with SQL:

Database Schema

  • cells: Cell metadata and QC metrics
  • genes: Gene metadata and annotations
  • expression: Sparse expression matrix data
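To make the layout concrete, here is a minimal sketch of the same three-table design using SQLite purely for illustration (SLAF itself is backed by Lance, not SQLite; column names follow the schema above):

```python
import sqlite3

# Illustrative three-table layout: cells, genes, and a sparse
# COO-style expression table keyed by integer IDs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cells (cell_integer_id INTEGER, cell_id TEXT, batch TEXT, total_counts REAL);
    CREATE TABLE genes (gene_integer_id INTEGER, gene_id TEXT);
    CREATE TABLE expression (cell_integer_id INTEGER, gene_integer_id INTEGER, value REAL);

    INSERT INTO cells VALUES (0, 'AAAC-1', 'batch1', 1200), (1, 'AAAG-1', 'batch2', 800);
    INSERT INTO genes VALUES (0, 'GeneA'), (1, 'GeneB');
    -- Only nonzero entries are stored (sparse representation)
    INSERT INTO expression VALUES (0, 0, 3.0), (0, 1, 1.0), (1, 0, 2.0);
""")

# Per-cell summary via a join, analogous to the SLAF queries below
rows = conn.execute("""
    SELECT c.cell_id, COUNT(e.gene_integer_id) AS genes_expressed
    FROM cells c
    JOIN expression e ON c.cell_integer_id = e.cell_integer_id
    GROUP BY c.cell_id
    ORDER BY genes_expressed DESC
""").fetchall()
print(rows)  # [('AAAC-1', 2), ('AAAG-1', 1)]
```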

SQL Queries

# Get expression data for specific cells
cell_expression = slaf.query("""
    SELECT
        c.cell_id,
        c.total_counts,
        COUNT(e.gene_id) as genes_expressed,
        AVG(e.value) as avg_expression
    FROM cells c
    JOIN expression e ON c.cell_integer_id = e.cell_integer_id
    WHERE c.batch = 'batch1'
    GROUP BY c.cell_id, c.total_counts
    ORDER BY genes_expressed DESC
    LIMIT 10
""")

# Find highly expressed genes
high_expr_genes = slaf.query("""
    SELECT
        g.gene_id,
        COUNT(e.cell_id) as cells_expressing,
        AVG(e.value) as avg_expression
    FROM genes g
    JOIN expression e ON g.gene_integer_id = e.gene_integer_id
    GROUP BY g.gene_id
    HAVING COUNT(e.cell_id) > 100
    ORDER BY avg_expression DESC
    LIMIT 10
""")

🧠 Train Foundation Models - ML Training

SLAF provides efficient tokenization and dataloaders for training foundation models:

Tokenization

from slaf.ml import SLAFTokenizer

# Create tokenizer for Geneformer-style tokenization
tokenizer = SLAFTokenizer(
    slaf_array=slaf,
    tokenizer_type="geneformer",
    vocab_size=50000,
    n_expression_bins=10
)

# Geneformer tokenization (gene sequence only)
gene_sequences = [[1, 2, 3], [4, 5, 6]]  # Example gene IDs
input_ids, attention_mask = tokenizer.tokenize(
    gene_sequences,
    max_genes=2048
)

# Create tokenizer for scGPT style tokenization
tokenizer = SLAFTokenizer(
    slaf_array=slaf,
    tokenizer_type="scgpt",
    vocab_size=50000,
    n_expression_bins=10
)

# scGPT tokenization (gene-expression pairs)
gene_sequences = [[1, 2, 3], [4, 5, 6]]  # Gene IDs
expr_sequences = [[0.5, 0.8, 0.2], [0.9, 0.1, 0.7]]  # Expression values
input_ids, attention_mask = tokenizer.tokenize(
    gene_sequences,
    expr_sequences=expr_sequences,
    max_genes=1024
)
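The n_expression_bins parameter implies expression values are discretized before tokenization. The helper below is a hypothetical illustration of that idea, not the actual SLAFTokenizer internals:

```python
# Conceptual sketch of expression binning (hypothetical helper):
# map continuous expression values to integer bins in [0, n_bins - 1].
def bin_expression(values, n_bins=10, max_value=1.0):
    """Clamp each value to [0, max_value], then scale into a bin index."""
    bins = []
    for v in values:
        clamped = min(max(v, 0.0), max_value)
        bins.append(min(int(clamped / max_value * n_bins), n_bins - 1))
    return bins

print(bin_expression([0.5, 0.8, 0.2], n_bins=10))  # [5, 8, 2]
```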

DataLoader for Training

from slaf.ml import SLAFDataLoader

# Create DataLoader
dataloader = SLAFDataLoader(
    slaf_array=slaf,
    tokenizer_type="geneformer",  # or "scgpt"
    batch_size=32,
    max_genes=2048
)

# Use with PyTorch training
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    cell_ids = batch["cell_ids"]

    # Your training loop here (model and optimizer defined elsewhere)
    optimizer.zero_grad()
    loss = model(input_ids, attention_mask=attention_mask)
    loss.backward()
    optimizer.step()

🛠️ Command Line Interface

Data Conversion

# Convert AnnData to SLAF (included by default)
slaf convert input.h5ad output.slaf

# Convert HDF5 to SLAF
slaf convert input.h5 output.slaf --format hdf5

Data Querying

# Execute SQL query
slaf query dataset.slaf "SELECT * FROM cells LIMIT 10"

# Save results to CSV
slaf query dataset.slaf "SELECT * FROM cells" --output cells.csv

Dataset Information

slaf info dataset.slaf

📚 Documentation

🙏 Acknowledgments

Built on top of

  • Lance for cloud-native, efficient columnar storage
  • Polars for lazy, composable, in-memory, zero-copy data processing

Download files

Download the file for your platform.

Source Distribution

slafdb-0.4.2.tar.gz (316.3 kB)

Built Distribution


slafdb-0.4.2-py3-none-any.whl (219.1 kB)

File details

Details for the file slafdb-0.4.2.tar.gz.

File metadata

  • Download URL: slafdb-0.4.2.tar.gz
  • Upload date:
  • Size: 316.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for slafdb-0.4.2.tar.gz

  • SHA256: 3ef878a02b0e0e2332440a4879f8731a2cd0d208a6984295e3fc93e38f8a8d77
  • MD5: e8d1bbb2e9fe1d9c5c97be05b9bb8a2f
  • BLAKE2b-256: 172d4b0fe0d5279c845e96dede73e57f4233d147e21a730a6ce1f46f326d4742


File details

Details for the file slafdb-0.4.2-py3-none-any.whl.

File metadata

  • Download URL: slafdb-0.4.2-py3-none-any.whl
  • Upload date:
  • Size: 219.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for slafdb-0.4.2-py3-none-any.whl

  • SHA256: 31510e9b83ab1ddddcfa6c026344e21d968482a57ab1e9de6d7d792affa50f1c
  • MD5: a0bd43e8488206eae6c79cfb4004496a
  • BLAKE2b-256: 78b8a731a03cabe3174592ea5353f9b098f42423c3fab327aa233c37825ef961

