Sparse Lazy Array Format - MVP for single-cell data

SLAF (Sparse Lazy Array Format)

SLAF is a high-performance format for single-cell data that combines the power of SQL with lazy evaluation. It is built for large-scale single-cell analysis, with a focus on memory efficiency and production-ready ML capabilities.

Be Lazy (lazy APIs for AnnData and Scanpy) • Write SQL (arbitrary SQL to query the tables) • Train Foundation Models (with tokenizers and dataloaders)

🚀 Key Features

  • ⚡ Fast: SQL-level performance for data operations
  • 💾 Memory Efficient: Lazy evaluation loads only what you need
  • 🔍 SQL Native: Direct SQL queries on your data
  • 🧬 Scanpy Compatible: Drop-in replacement for AnnData workflows
  • ⚙️ ML Ready: Efficient tokenizers and dataloaders for model training
  • 🔧 Production Ready: Built for large-scale single-cell analysis

📦 Installation

Default Installation (Batteries Included)

The default installation includes core functionality, CLI tools, and data conversion capabilities:

# Using uv (recommended)
uv add slafdb

# Or pip
pip install slafdb

What's included by default:

  • ✅ Core SLAF functionality (SQL queries, data structures)
  • ✅ CLI tools (slaf convert, slaf query, etc.)
  • ✅ Data conversion tools (scanpy, h5py for h5ad files)
  • ✅ Rich console output and progress bars
  • ✅ Cross-platform compatibility

What's NOT included by default:

Dependencies for:

  • ❌ Machine learning features (PyTorch tokenizers)
  • ❌ Advanced single-cell tools (igraph, leidenalg)

Platform-Specific Notes

Polars Compatibility:

  • Linux/Windows: Works with standard polars
  • macOS (Apple Silicon): May require polars-lts-cpu for compatibility
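
The platform rule above can be expressed as a small check to run before installing. This is a minimal sketch using only the standard library; `recommended_polars_package` is a hypothetical helper, not part of slafdb, and it only encodes the guidance in this section.

```python
import platform


def recommended_polars_package(system: str, machine: str) -> str:
    """Return the polars distribution suggested by the notes above.

    Hypothetical helper for illustration; not part of slafdb.
    """
    # macOS reports "Darwin"; Apple Silicon machines report "arm64".
    if system == "Darwin" and machine == "arm64":
        return "polars-lts-cpu"
    return "polars"


# Check the machine this script is running on.
print(recommended_polars_package(platform.system(), platform.machine()))
```

Since the notes only say that Apple Silicon may require polars-lts-cpu, treat the result as a first thing to try rather than a hard requirement.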

If you encounter polars-related issues on macOS, you have two options:

Option 1: Manual platform-specific installation

# For macOS Apple Silicon
pip install "polars-lts-cpu>=1.31.0"
pip install slafdb

# For Linux/Windows
pip install slafdb

Option 2: Use uv with manual polars specification

# For macOS Apple Silicon
uv add "polars-lts-cpu>=1.31.0"
uv add slafdb

# For Linux/Windows
uv add slafdb

Note: Package managers don't automatically choose between polars and polars-lts-cpu, so you may need to install the correct package for your platform yourself.

Optional Dependencies

Add specific features as needed:

Using uv:

uv add "slafdb[ml]"
uv add "slafdb[advanced]"
uv add "slafdb[full]"
uv add "slafdb[dev]"

Using pip:

# Quote the extras so shells such as zsh don't treat the brackets as a glob
pip install "slafdb[ml]"
pip install "slafdb[advanced]"
pip install "slafdb[full]"
pip install "slafdb[dev]"

Development Installation

git clone https://github.com/slaf-project/slaf.git
cd slaf
uv sync --extra dev --extra test --extra docs

🚀 Quick Start

Converting Your Data

Convert your existing single-cell data to SLAF format - no extra dependencies required!

# Convert AnnData (.h5ad) to SLAF
slaf convert input.h5ad output.slaf

# Convert HDF5 to SLAF
slaf convert input.h5 output.slaf

# Convert 10x Genomics data
slaf convert path/to/10x/filtered_feature_bc_matrix output.slaf

Basic Usage

from slaf import SLAFArray

# Load a SLAF dataset
slaf = SLAFArray("path/to/dataset.slaf")

# Describe the dataset
print(slaf.info())

# Execute SQL queries directly
results = slaf.query("""
    SELECT batch, COUNT(*) as count
    FROM cells
    GROUP BY batch
    ORDER BY count DESC
""")
print(results)

Filtering Data

# Filter cells by metadata
filtered_cells = slaf.filter_cells(
    batch="batch1",
    total_counts=">1000"
)

# Filter genes
filtered_genes = slaf.filter_genes(
    highly_variable=True
)

# Get expression submatrix
expression = slaf.get_submatrix(
    cell_selector=filtered_cells,
    gene_selector=filtered_genes
)
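
One way to picture keyword filters such as `total_counts=">1000"` is as a comparison operator plus a value, applied to every row. The sketch below illustrates that idea on plain dictionaries. It is not SLAF's implementation: `parse_condition` and `filter_rows` are hypothetical names, and the supported operator set is an assumption.

```python
import operator
import re

# Assumed operator set for this sketch.
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def parse_condition(spec):
    """Turn ">1000" into (operator.gt, 1000.0); any other value means equality."""
    if isinstance(spec, str):
        match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*", spec)
        if match:
            return _OPS[match.group(1)], float(match.group(2))
    return operator.eq, spec


def filter_rows(rows, **conditions):
    """Keep the rows that satisfy every keyword condition."""
    parsed = {key: parse_condition(value) for key, value in conditions.items()}
    return [
        row for row in rows
        if all(op(row[key], value) for key, (op, value) in parsed.items())
    ]


# Made-up cell metadata for illustration.
cells = [
    {"cell_id": "c1", "batch": "batch1", "total_counts": 1500},
    {"cell_id": "c2", "batch": "batch1", "total_counts": 800},
    {"cell_id": "c3", "batch": "batch2", "total_counts": 2000},
]
selected = filter_rows(cells, batch="batch1", total_counts=">1000")
print([row["cell_id"] for row in selected])
```

Only `c1` passes both conditions here. A lazy backend would more plausibly translate the same conditions into a query predicate than loop over rows in Python.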

🦥 Be Lazy - Lazy AnnData & Scanpy Integration

SLAF provides lazy versions of AnnData and Scanpy operations that only compute when needed:

from slaf.integrations.anndata import read_slaf
import scanpy as sc

# Load as lazy AnnData
adata = read_slaf("path/to/dataset.slaf")
print(f"Type: {type(adata)}")  # LazyAnnData
print(f"Expression matrix type: {type(adata.X)}")  # LazyExpressionMatrix

# Apply scanpy operations (lazy)
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)

# Still lazy - no computation yet
print(f"Still lazy: {type(adata.X)}")

# Compute when needed
real_adata = adata.compute()  # Returns a real AnnData object

Lazy Computation Control

# Compute specific parts
expression_matrix = adata.X.compute()  # Just the expression matrix
cell_metadata = adata.obs              # Cell metadata
gene_metadata = adata.var              # Gene metadata

# Or compute everything at once
real_adata = adata.compute()

Lazy Slicing

# All slicing operations are lazy
subset = adata[:100, :50]  # Lazy slice
filtered = adata[adata.obs['n_genes_by_counts'] > 1000]  # Lazy filtering
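
The deferred-execution idea behind these lazy objects can be shown with a toy container that queues operations and runs them only when `compute()` is called. This is a conceptual sketch, not SLAF's implementation; `LazyVector` and its methods are hypothetical.

```python
import math


class LazyVector:
    """Toy lazy container: records operations and runs them only in compute()."""

    def __init__(self, values, ops=()):
        self._values = list(values)
        self._ops = tuple(ops)

    def map(self, fn):
        # Return a new lazy object with one more pending operation.
        return LazyVector(self._values, self._ops + (fn,))

    @property
    def pending(self):
        """Number of queued operations that have not been executed yet."""
        return len(self._ops)

    def compute(self):
        # Apply the queued operations in order and return a concrete list.
        out = self._values
        for fn in self._ops:
            out = [fn(value) for value in out]
        return out


counts = LazyVector([0.0, 9.0, 99.0])
scaled = counts.map(lambda v: v / 10.0).map(math.log1p)
print(scaled.pending)      # operations are queued, nothing is evaluated yet
result = scaled.compute()  # the queued operations run here
print(result)
```

Because each `map` returns a new object, the original `counts` stays untouched, which mirrors how `adata.compute()` returns a separate materialized object in the examples above.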

🔍 Write SQL - Direct Database Access

SLAF stores data in three main tables that you can query directly with SQL:

Database Schema

  • cells: Cell metadata and QC metrics
  • genes: Gene metadata and annotations
  • expression: Sparse expression matrix data
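
To make the three-table layout concrete, the sketch below rebuilds a tiny version of it in an in-memory SQLite database. SQLite is only a stand-in here (SLAF itself is built on Lance and Polars), the column names are inferred from the join keys and columns used in the queries in this section, and the column types and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cells (
        cell_integer_id INTEGER PRIMARY KEY,
        cell_id TEXT,
        batch TEXT,
        total_counts REAL
    );
    CREATE TABLE genes (
        gene_integer_id INTEGER PRIMARY KEY,
        gene_id TEXT
    );
    CREATE TABLE expression (
        cell_integer_id INTEGER,
        gene_integer_id INTEGER,
        value REAL
    );
    """
)
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?, ?)",
    [(0, "cell_0", "batch1", 7.0), (1, "cell_1", "batch2", 3.0)],
)
conn.executemany(
    "INSERT INTO genes VALUES (?, ?)",
    [(0, "gene_a"), (1, "gene_b"), (2, "gene_c")],
)
# Sparse storage: only the nonzero (cell, gene, value) entries get a row.
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [(0, 0, 5.0), (0, 2, 2.0), (1, 1, 3.0)],
)

rows = conn.execute(
    """
    SELECT c.cell_id, COUNT(e.gene_integer_id) AS genes_expressed
    FROM cells c
    JOIN expression e ON c.cell_integer_id = e.cell_integer_id
    GROUP BY c.cell_id
    ORDER BY genes_expressed DESC
    """
).fetchall()
print(rows)
```

The two cells and three genes would form a 2 x 3 dense matrix with six entries, but only the three nonzero values are stored, each keyed by its integer cell and gene IDs.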

SQL Queries

# Get expression data for specific cells
cell_expression = slaf.query("""
    SELECT
        c.cell_id,
        c.total_counts,
        COUNT(e.gene_integer_id) as genes_expressed,
        AVG(e.value) as avg_expression
    FROM cells c
    JOIN expression e ON c.cell_integer_id = e.cell_integer_id
    WHERE c.batch = 'batch1'
    GROUP BY c.cell_id, c.total_counts
    ORDER BY genes_expressed DESC
    LIMIT 10
""")

# Find highly expressed genes
high_expr_genes = slaf.query("""
    SELECT
        g.gene_id,
        COUNT(e.cell_integer_id) as cells_expressing,
        AVG(e.value) as avg_expression
    FROM genes g
    JOIN expression e ON g.gene_integer_id = e.gene_integer_id
    GROUP BY g.gene_id
    HAVING COUNT(e.cell_integer_id) > 100
    ORDER BY avg_expression DESC
    LIMIT 10
""")

🧠 Train Foundation Models - ML Training

SLAF provides efficient tokenization and dataloaders for training foundation models:

Tokenization

from slaf.ml import SLAFTokenizer

# Create tokenizer for Geneformer-style tokenization
tokenizer = SLAFTokenizer(
    slaf_array=slaf,
    tokenizer_type="geneformer",
    vocab_size=50000,
    n_expression_bins=10
)

# Geneformer tokenization (gene sequence only)
gene_sequences = [[1, 2, 3], [4, 5, 6]]  # Example gene IDs
input_ids, attention_mask = tokenizer.tokenize(
    gene_sequences,
    max_genes=2048
)

# Create tokenizer for scGPT style tokenization
tokenizer = SLAFTokenizer(
    slaf_array=slaf,
    tokenizer_type="scgpt",
    vocab_size=50000,
    n_expression_bins=10
)

# scGPT tokenization (gene-expression pairs)
gene_sequences = [[1, 2, 3], [4, 5, 6]]  # Gene IDs
expr_sequences = [[0.5, 0.8, 0.2], [0.9, 0.1, 0.7]]  # Expression values
input_ids, attention_mask = tokenizer.tokenize(
    gene_sequences,
    expr_sequences=expr_sequences,
    max_genes=1024
)
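
The two ingredients these calls rely on, fixed-length sequences with an attention mask and discretized expression values, can be sketched in plain Python. This is an illustration under stated assumptions, not the SLAFTokenizer implementation: `PAD_ID`, `pad_and_mask`, and `bin_expression` are hypothetical, equal-width binning is an assumed scheme, and the real tokenizer may rank genes, bin values, and add special tokens differently.

```python
PAD_ID = 0  # hypothetical padding token id, an assumption for this sketch


def pad_and_mask(gene_sequences, max_genes):
    """Truncate or right-pad each sequence to max_genes; the mask marks real tokens."""
    input_ids, attention_mask = [], []
    for seq in gene_sequences:
        kept = list(seq[:max_genes])
        n_pad = max_genes - len(kept)
        input_ids.append(kept + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(kept) + [0] * n_pad)
    return input_ids, attention_mask


def bin_expression(values, n_bins, max_value):
    """Map values in [0, max_value] to equal-width integer bins 0..n_bins-1."""
    return [min(int(v / max_value * n_bins), n_bins - 1) for v in values]


ids, mask = pad_and_mask([[1, 2, 3], [4, 5]], max_genes=4)
bins = bin_expression([0.5, 0.8, 0.2], n_bins=10, max_value=1.0)
print(ids)
print(mask)
print(bins)
```

Padding every cell to the same length is what lets a batch be stacked into one rectangular array, and the mask tells the model which positions to ignore.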

DataLoader for Training

from slaf.ml import SLAFDataLoader

# Create DataLoader
dataloader = SLAFDataLoader(
    slaf_array=slaf,
    tokenizer_type="geneformer",  # or "scgpt"
    batch_size=32,
    max_genes=2048
)

# Use with PyTorch training
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    cell_ids = batch["cell_ids"]

    # Your training loop here; `model` is your own PyTorch module that returns a scalar loss
    loss = model(input_ids, attention_mask=attention_mask)
    loss.backward()
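
The loop above reads each batch as a dictionary with three keys. The sketch below mimics that contract with plain lists so the batch shape is easy to inspect without PyTorch. It is a stand-in, not SLAFDataLoader: `iter_batches` and `PAD_ID` are hypothetical, and the real loader presumably yields tensors rather than lists.

```python
PAD_ID = 0  # hypothetical padding token id, an assumption for this sketch


def iter_batches(cell_ids, token_rows, batch_size):
    """Yield dict batches with the three keys read in the training loop above."""
    for start in range(0, len(cell_ids), batch_size):
        rows = token_rows[start:start + batch_size]
        yield {
            "input_ids": rows,
            # 1 for real tokens, 0 for padding positions.
            "attention_mask": [[int(tok != PAD_ID) for tok in row] for row in rows],
            "cell_ids": cell_ids[start:start + batch_size],
        }


# Made-up, already padded token rows for three cells.
cell_ids = ["cell_0", "cell_1", "cell_2"]
token_rows = [[5, 3, 0], [7, 0, 0], [2, 4, 6]]
batches = list(iter_batches(cell_ids, token_rows, batch_size=2))
print(len(batches))
print(batches[0]["attention_mask"])
```

With three cells and a batch size of 2, the last batch is smaller than the rest, which is worth handling explicitly in any training loop that assumes a fixed batch dimension.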

🛠️ Command Line Interface

Data Conversion

# Convert AnnData to SLAF (included by default)
slaf convert input.h5ad output.slaf

# Convert HDF5 to SLAF
slaf convert input.h5 output.slaf --format hdf5

Data Querying

# Execute SQL query
slaf query dataset.slaf "SELECT * FROM cells LIMIT 10"

# Save results to CSV
slaf query dataset.slaf "SELECT * FROM cells" --output cells.csv

Dataset Information

slaf info dataset.slaf

📚 Documentation

💬 Community

  • Discord — chat, questions, and updates

🙏 Acknowledgments

Built on top of

  • Lance for cloud-native, efficient columnar storage
  • Polars for lazy, composable, in-memory, zero-copy data processing
