
A Python SQL tool for converting AnnData objects to a relational DuckDB database. Methods are included for querying and basic single-cell preprocessing (experimental).




Query AnnData Objects with SQL

The Python-based AnnSQL package enables SQL queries on AnnData objects, returning results as a Pandas DataFrame, an AnnData object, or a Parquet file that can be imported into a variety of data analysis tools. Behind the scenes, AnnSQL converts the layers of an AnnData object into a relational DuckDB database. Each layer is stored as an individual table, allowing for simple or complex SQL queries, including table joins.
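The layer-per-table layout can be pictured with a toy database. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for DuckDB, the rows are made up, and it assumes that the `X` and `obs` tables share a `cell_id` key (the queries later in this document reference `cell_id` in `X`, but the key in `obs` is an assumption).

```python
import sqlite3

# In-memory stand-in for the relational layout (sqlite3 here, DuckDB in AnnSQL).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE X   (cell_id TEXT PRIMARY KEY, HES4 REAL, ITGB2 REAL);
CREATE TABLE obs (cell_id TEXT PRIMARY KEY, bulk_labels TEXT);
-- Made-up toy rows: one row per cell, one column per gene in X.
INSERT INTO X   VALUES ('cell_1', 0.0, 2.0), ('cell_2', 1.0, 0.0), ('cell_3', 3.0, 4.0);
INSERT INTO obs VALUES ('cell_1', 'Dendritic'), ('cell_2', 'CD19+ B'), ('cell_3', 'Dendritic');
""")

# Join expression values to cell annotations, then aggregate per label.
rows = con.execute("""
SELECT obs.bulk_labels, COUNT(*) AS n_cells, AVG(X.ITGB2) AS mean_itgb2
FROM X JOIN obs ON X.cell_id = obs.cell_id
GROUP BY obs.bulk_labels
ORDER BY obs.bulk_labels
""").fetchall()
print(rows)
```

Storing expression values and annotations in separate tables is what makes joins like this possible in plain SQL.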

Features

  • Query AnnData with SQL.
  • Fast for complex queries and aggregate functions.
  • Return query results as Pandas DataFrames, Parquet files, or AnnData objects.
  • Create in-memory or on-disk databases directly from AnnData objects.
  • Open AnnSQL databases in R with no conversion necessary.

Full Documentation

docs.annsql.com


Quick Setup

pip install annsql

Basic Usage (In-Memory)

Ideal for smaller datasets.

from AnnSQL import AnnSQL
import scanpy as sc

#read sample data
adata = sc.datasets.pbmc68k_reduced()

#instantiate the AnnSQL object (you may also pass an h5ad file to the adata parameter)
asql = AnnSQL(adata=adata)

#query the expression table. Returns a Pandas DataFrame by default
asql.query("SELECT * FROM adata LIMIT 10")

Basic Usage (On-Disk)

For larger datasets, AnnSQL can build a local database (an .asql file) from the AnnData object. This database is stored on disk, persists between sessions, and can be reopened and queried without rebuilding it.

import scanpy as sc
from AnnSQL import AnnSQL
from AnnSQL.MakeDb import MakeDb

#read sample data
adata = sc.datasets.pbmc68k_reduced()

#build the AnnSQL database (written to db/pbmc68k_reduced.asql)
MakeDb(adata=adata, db_name="pbmc68k_reduced", db_path="db/")

#open the AnnSQL database
asql = AnnSQL(db="db/pbmc68k_reduced.asql")

#query the expression table
asql.query("SELECT * FROM adata LIMIT 5")
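Because the database persists, the build step only needs to run once. A minimal sketch of that pattern follows, assuming the file is written to `db_path` + `db_name` + `.asql` as the example above suggests. The helper names `asql_path` and `open_or_build` are hypothetical and not part of AnnSQL, and what `MakeDb` does when the file already exists is not covered here, so the sketch simply avoids calling it twice.

```python
from pathlib import Path


def asql_path(db_name: str, db_path: str = "db/") -> Path:
    """Expected location of the database file (assumed naming scheme)."""
    return Path(db_path) / f"{db_name}.asql"


def open_or_build(adata, db_name: str, db_path: str = "db/"):
    """Open an existing AnnSQL database, building it first only if it is missing."""
    # Imported lazily so the path helper works without AnnSQL installed.
    from AnnSQL import AnnSQL
    from AnnSQL.MakeDb import MakeDb

    target = asql_path(db_name, db_path)
    if not target.exists():
        # Creating the folder up front is a precaution, not a documented requirement.
        target.parent.mkdir(parents=True, exist_ok=True)
        # db_path is passed through unchanged, trailing slash included.
        MakeDb(adata=adata, db_name=db_name, db_path=db_path)
    return AnnSQL(db=str(target))
```

A call such as `asql = open_or_build(adata, "my_dataset")` would then return a queryable object on both the first and later runs.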

Advanced Queries and Usage

from AnnSQL import AnnSQL
import scanpy as sc

#read sample data
adata = sc.datasets.pbmc68k_reduced()

#pass the AnnData object to the AnnSQL class
asql = AnnSQL(adata=adata)

#group and count all labels
asql.query("SELECT obs.bulk_labels, COUNT(*) FROM obs GROUP BY obs.bulk_labels")

#take the log10 of a value
asql.query("SELECT LOG10(HES4) FROM X WHERE HES4 > 0")

#sum all gene counts | Memory intensive | See method calculate_gene_counts for chunked approach.
asql.query("SELECT SUM(COLUMNS(*)) FROM (SELECT * EXCLUDE (cell_id) FROM X)")

#correlate genes ITGB2 and SSU72 in dendritic cells where either gene is expressed (> 0)
asql.query("SELECT corr(ITGB2,SSU72) as correlation FROM adata WHERE bulk_labels = 'Dendritic' AND (ITGB2 > 0 OR SSU72 > 0)")

############################################################################
# Extended AnnSQL methods (See: https://docs.annsql.com/preprocessing)
# These methods are either SQL based or Python/SQL hybrid implementations.
############################################################################

#basic QC on the dataset
asql.calculate_total_counts()
asql.filter_by_cell_counts(min_cell_count=1000, max_cell_count=50000)
asql.filter_by_gene_counts(min_gene_counts=100, max_gene_counts=10000)

#normalize and log-transform UMI counts
asql.expression_normalize(total_counts_per_cell=10000)
asql.expression_log(log_type="LN")

#select highly variable genes
asql.calculate_variable_genes(save_var_names=True, top_variable_genes=1000)

#run pca
asql.calculate_pca(n_pcs=50, top_variable_genes=1000, zero_center=False)

#umap, cluster, then plot
asql.calculate_umap()
asql.calculate_leiden_clusters(resolution=0.25, n_neighbors=5)
asql.plot_umap(color_by="leiden_clusters", annotate=True)
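To make the normalize and log steps concrete, the arithmetic for a single cell can be sketched in plain Python. This is not AnnSQL's implementation, which runs inside the database. The function name `normalize_and_log` is hypothetical, and the sketch assumes a log(x + 1) transform, which is common in single-cell workflows; whether AnnSQL adds the pseudocount is not stated here.

```python
import math


def normalize_and_log(counts, target_total=10_000):
    """Scale one cell's counts to sum to target_total, then take ln(x + 1)."""
    total = sum(counts.values())
    if total == 0:
        # A cell with no counts stays at zero after both steps.
        return {gene: 0.0 for gene in counts}
    return {
        gene: math.log1p(value * target_total / total)
        for gene, value in counts.items()
    }


# Made-up counts for one cell; gene names borrowed from the queries above.
cell = {"HES4": 1, "ITGB2": 3, "SSU72": 0}
result = normalize_and_log(cell)
print(result)
```

After scaling, every cell has the same total, so the log-transformed values are comparable across cells with different sequencing depth.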


Reference

Kenny Pavan, Arpiar Saunders. AnnSQL: A Python SQL-based package for fast large-scale single-cell genomics analysis using minimal computational resources. Bioinformatics Advances, 2025; vbaf105. https://doi.org/10.1093/bioadv/vbaf105

