
Sparse Lazy Array Format - MVP for single-cell data


SLAF (Sparse Lazy Array Format)



SLAF is a high-performance format for single-cell data that combines the power of SQL with lazy evaluation. Built for large-scale single-cell analysis with memory efficiency and production-ready ML capabilities.

Be Lazy (lazy APIs for AnnData and Scanpy) • Write SQL (arbitrary SQL to query the tables) • Train Foundation Models (with tokenizers and dataloaders)

🚀 Key Features

  • ⚡ Fast: SQL-level performance for data operations
  • 💾 Memory Efficient: Lazy evaluation, only load what you need
  • 🔍 SQL Native: Direct SQL queries on your data
  • 🧬 Scanpy Compatible: Drop-in replacement for AnnData workflows
  • ⚙️ ML Ready: Efficient tokenization and dataloaders for ML training
  • 🔧 Production Ready: Built for large-scale single-cell analysis

📦 Installation

Default Installation (Batteries Included)

The default installation includes core functionality, CLI tools, and data conversion capabilities:

# Using uv (recommended)
uv add slafdb

# Or pip
pip install slafdb

What's included by default:

  • ✅ Core SLAF functionality (SQL queries, data structures)
  • ✅ CLI tools (slaf convert, slaf query, etc.)
  • ✅ Data conversion tools (scanpy, h5py for h5ad files)
  • ✅ Rich console output and progress bars
  • ✅ Cross-platform compatibility

What's NOT included by default:

Dependencies for:

  • ❌ Machine learning features (PyTorch tokenizers)
  • ❌ Advanced single-cell tools (igraph, leidenalg)

Platform-Specific Notes

Polars Compatibility:

  • Linux/Windows: Works with standard polars
  • macOS (Apple Silicon): May require polars-lts-cpu for compatibility

If you encounter polars-related issues on macOS, you have several options:

Option 1: Manual platform-specific installation

# For macOS Apple Silicon
pip install "polars-lts-cpu>=1.31.0"
pip install slafdb

# For Linux/Windows
pip install slafdb

Option 2: Use uv with manual polars specification

# For macOS Apple Silicon
uv add "polars-lts-cpu>=1.31.0"
uv add slafdb

# For Linux/Windows
uv add slafdb

Note: Package managers don't automatically choose between polars and polars-lts-cpu, so you may need to specify the correct package for your platform.
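If you script your installs, a small helper like the one below (a hypothetical sketch, not part of SLAF) can pick the right requirement string using only the standard library:

```python
import platform

def polars_requirement() -> str:
    """Return the polars requirement for the current platform.

    Assumption: only Apple Silicon macOS needs polars-lts-cpu;
    everything else uses standard polars.
    """
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "polars-lts-cpu>=1.31.0"
    return "polars"

# Feed the result to pip/uv, e.g.: pip install "$(python pick_polars.py)"
print(polars_requirement())
```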

Optional Dependencies

Add specific features as needed:

Using uv:

uv add "slafdb[ml]"
uv add "slafdb[advanced]"
uv add "slafdb[full]"
uv add "slafdb[dev]"

Using pip:

pip install "slafdb[ml]"
pip install "slafdb[advanced]"
pip install "slafdb[full]"
pip install "slafdb[dev]"

Development Installation

git clone https://github.com/slaf-project/slaf.git
cd slaf
uv sync --extra dev --extra test --extra docs

🚀 Quick Start

Converting Your Data

Convert your existing single-cell data to SLAF format - no extra dependencies required!

# Convert AnnData (.h5ad) to SLAF
slaf convert input.h5ad output.slaf

# Convert HDF5 to SLAF
slaf convert input.h5 output.slaf

# Convert 10x Genomics data
slaf convert path/to/10x/filtered_feature_bc_matrix output.slaf

Basic Usage

from slaf import SLAFArray

# Load a SLAF dataset
slaf = SLAFArray("path/to/dataset.slaf")

# Describe the dataset
print(slaf.info())

# Execute SQL queries directly
results = slaf.query("""
    SELECT batch, COUNT(*) as count
    FROM cells
    GROUP BY batch
    ORDER BY count DESC
""")
print(results)

Filtering Data

# Filter cells by metadata
filtered_cells = slaf.filter_cells(
    batch="batch1",
    total_counts=">1000"
)

# Filter genes
filtered_genes = slaf.filter_genes(
    highly_variable=True
)

# Get expression submatrix
expression = slaf.get_submatrix(
    cell_selector=filtered_cells,
    gene_selector=filtered_genes
)

🦥 Be Lazy - Lazy AnnData & Scanpy Integration

SLAF provides lazy versions of AnnData and Scanpy operations that only compute when needed:

from slaf.integrations.anndata import read_slaf
import scanpy as sc

# Load as lazy AnnData
adata = read_slaf("path/to/dataset.slaf")
print(f"Type: {type(adata)}")  # LazyAnnData
print(f"Expression matrix type: {type(adata.X)}")  # LazyExpressionMatrix

# Apply scanpy operations (lazy)
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)

# Still lazy - no computation yet
print(f"Still lazy: {type(adata.X)}")

# Compute when needed
adata.compute()  # Now it's a real AnnData object

Lazy Computation Control

# Compute specific parts
expression_matrix = adata.X.compute()  # Just the expression matrix
cell_metadata = adata.obs              # Cell metadata
gene_metadata = adata.var              # Gene metadata

# Or compute everything at once
real_adata = adata.compute()

Lazy Slicing

# All slicing operations are lazy
subset = adata[:100, :50]  # Lazy slice
filtered = adata[adata.obs['n_genes_by_counts'] > 1000]  # Lazy filtering

🔍 Write SQL - Direct Database Access

SLAF stores data in three main tables that you can query directly with SQL:

Database Schema

  • cells: Cell metadata and QC metrics
  • genes: Gene metadata and annotations
  • expression: Sparse expression matrix data
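To make the layout concrete, here is a minimal sketch of the same three-table design using SQLite purely for illustration (SLAF itself is backed by Lance, not SQLite; column names follow the schema above):

```python
import sqlite3

# Illustrative three-table layout: cells, genes, and a sparse
# COO-style expression table keyed by integer IDs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cells (cell_integer_id INTEGER, cell_id TEXT, batch TEXT, total_counts REAL);
    CREATE TABLE genes (gene_integer_id INTEGER, gene_id TEXT);
    CREATE TABLE expression (cell_integer_id INTEGER, gene_integer_id INTEGER, value REAL);

    INSERT INTO cells VALUES (0, 'AAAC-1', 'batch1', 1200), (1, 'AAAG-1', 'batch2', 800);
    INSERT INTO genes VALUES (0, 'GeneA'), (1, 'GeneB');
    -- Only nonzero entries are stored (sparse representation)
    INSERT INTO expression VALUES (0, 0, 3.0), (0, 1, 1.0), (1, 0, 2.0);
""")

# Per-cell summary via a join, analogous to the SLAF queries below
rows = conn.execute("""
    SELECT c.cell_id, COUNT(e.gene_integer_id) AS genes_expressed
    FROM cells c
    JOIN expression e ON c.cell_integer_id = e.cell_integer_id
    GROUP BY c.cell_id
    ORDER BY genes_expressed DESC
""").fetchall()
print(rows)  # [('AAAC-1', 2), ('AAAG-1', 1)]
```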

SQL Queries

# Get expression data for specific cells
cell_expression = slaf.query("""
    SELECT
        c.cell_id,
        c.total_counts,
        COUNT(e.gene_id) as genes_expressed,
        AVG(e.value) as avg_expression
    FROM cells c
    JOIN expression e ON c.cell_integer_id = e.cell_integer_id
    WHERE c.batch = 'batch1'
    GROUP BY c.cell_id, c.total_counts
    ORDER BY genes_expressed DESC
    LIMIT 10
""")

# Find highly expressed genes
high_expr_genes = slaf.query("""
    SELECT
        g.gene_id,
        COUNT(e.cell_id) as cells_expressing,
        AVG(e.value) as avg_expression
    FROM genes g
    JOIN expression e ON g.gene_integer_id = e.gene_integer_id
    GROUP BY g.gene_id
    HAVING COUNT(e.cell_id) > 100
    ORDER BY avg_expression DESC
    LIMIT 10
""")

🧠 Train Foundation Models - ML Training

SLAF provides efficient tokenization and dataloaders for training foundation models:

Tokenization

from slaf.ml import SLAFTokenizer

# Create tokenizer for Geneformer-style tokenization
tokenizer = SLAFTokenizer(
    slaf_array=slaf,
    tokenizer_type="geneformer",
    vocab_size=50000,
    n_expression_bins=10
)

# Geneformer tokenization (gene sequence only)
gene_sequences = [[1, 2, 3], [4, 5, 6]]  # Example gene IDs
input_ids, attention_mask = tokenizer.tokenize(
    gene_sequences,
    max_genes=2048
)

# Create tokenizer for scGPT style tokenization
tokenizer = SLAFTokenizer(
    slaf_array=slaf,
    tokenizer_type="scgpt",
    vocab_size=50000,
    n_expression_bins=10
)

# scGPT tokenization (gene-expression pairs)
gene_sequences = [[1, 2, 3], [4, 5, 6]]  # Gene IDs
expr_sequences = [[0.5, 0.8, 0.2], [0.9, 0.1, 0.7]]  # Expression values
input_ids, attention_mask = tokenizer.tokenize(
    gene_sequences,
    expr_sequences=expr_sequences,
    max_genes=1024
)
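The n_expression_bins parameter implies expression values are discretized before tokenization. The helper below is a hypothetical illustration of that idea, not the actual SLAFTokenizer internals:

```python
# Conceptual sketch of expression binning (hypothetical helper):
# map continuous expression values to integer bins in [0, n_bins - 1].
def bin_expression(values, n_bins=10, max_value=1.0):
    """Clamp each value to [0, max_value], then scale into a bin index."""
    bins = []
    for v in values:
        clamped = min(max(v, 0.0), max_value)
        bins.append(min(int(clamped / max_value * n_bins), n_bins - 1))
    return bins

print(bin_expression([0.5, 0.8, 0.2], n_bins=10))  # [5, 8, 2]
```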

DataLoader for Training

from slaf.ml import SLAFDataLoader

# Create DataLoader
dataloader = SLAFDataLoader(
    slaf_array=slaf,
    tokenizer_type="geneformer",  # or "scgpt"
    batch_size=32,
    max_genes=2048
)

# Use with PyTorch training
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    cell_ids = batch["cell_ids"]

    # Your training loop here (model and optimizer defined elsewhere)
    optimizer.zero_grad()
    loss = model(input_ids, attention_mask=attention_mask)
    loss.backward()
    optimizer.step()

🛠️ Command Line Interface

Data Conversion

# Convert AnnData to SLAF (included by default)
slaf convert input.h5ad output.slaf

# Convert HDF5 to SLAF
slaf convert input.h5 output.slaf --format hdf5

Data Querying

# Execute SQL query
slaf query dataset.slaf "SELECT * FROM cells LIMIT 10"

# Save results to CSV
slaf query dataset.slaf "SELECT * FROM cells" --output cells.csv

Dataset Information

slaf info dataset.slaf

📚 Documentation

🙏 Acknowledgments

Built on top of

  • Lance for cloud-native, efficient columnar storage
  • Polars for lazy, composable, in-memory, zero-copy data processing

Download files

Download the file for your platform.

Source Distribution

slafdb-0.4.2.tar.gz (316.3 kB)

Built Distribution


slafdb-0.4.2-py3-none-any.whl (219.1 kB)

File details

Details for the file slafdb-0.4.2.tar.gz.

File metadata

  • Download URL: slafdb-0.4.2.tar.gz
  • Upload date:
  • Size: 316.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for slafdb-0.4.2.tar.gz

  • SHA256: 3ef878a02b0e0e2332440a4879f8731a2cd0d208a6984295e3fc93e38f8a8d77
  • MD5: e8d1bbb2e9fe1d9c5c97be05b9bb8a2f
  • BLAKE2b-256: 172d4b0fe0d5279c845e96dede73e57f4233d147e21a730a6ce1f46f326d4742


File details

Details for the file slafdb-0.4.2-py3-none-any.whl.

File metadata

  • Download URL: slafdb-0.4.2-py3-none-any.whl
  • Upload date:
  • Size: 219.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for slafdb-0.4.2-py3-none-any.whl

  • SHA256: 31510e9b83ab1ddddcfa6c026344e21d968482a57ab1e9de6d7d792affa50f1c
  • MD5: a0bd43e8488206eae6c79cfb4004496a
  • BLAKE2b-256: 78b8a731a03cabe3174592ea5353f9b098f42423c3fab327aa233c37825ef961

