A Python SQL tool for converting AnnData objects to a relational DuckDB database. Methods are included for querying and basic single-cell preprocessing (experimental).
Query AnnData Objects with SQL
The Python-based AnnSQL package enables SQL queries on AnnData objects, returning results as a Pandas DataFrame, an AnnData object, or a Parquet file that can be imported into a variety of data analysis tools. Behind the scenes, AnnSQL converts the layers of an AnnData object into a relational DuckDB database. Each layer is stored as an individual table, allowing for simple or complex SQL queries, including table joins.
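To make the table-per-layer idea concrete, here is a minimal sketch of the relational layout. It uses Python's built-in `sqlite3` as a stand-in for DuckDB so that it runs without extra dependencies. The `X` and `obs` table names and the `cell_id` key mirror the queries shown later in this README, while the rows and the exact schema are made up for illustration and are not taken from AnnSQL itself.

```python
import sqlite3

# sqlite3 stands in for DuckDB here; the rows below are invented toy data.
conn = sqlite3.connect(":memory:")

# One table for expression values, one for per-cell metadata, keyed by cell_id.
conn.execute("CREATE TABLE X (cell_id TEXT PRIMARY KEY, HES4 REAL, ITGB2 REAL)")
conn.execute("CREATE TABLE obs (cell_id TEXT PRIMARY KEY, bulk_labels TEXT)")
conn.executemany(
    "INSERT INTO X VALUES (?, ?, ?)",
    [("cell_1", 0.0, 2.0), ("cell_2", 1.0, 4.0), ("cell_3", 3.0, 0.0)],
)
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [("cell_1", "Dendritic"), ("cell_2", "Dendritic"), ("cell_3", "CD19+ B")],
)

# Join expression values to cell metadata and aggregate per label.
rows = conn.execute(
    """
    SELECT obs.bulk_labels, COUNT(*) AS n_cells, AVG(X.ITGB2) AS mean_itgb2
    FROM X
    JOIN obs ON X.cell_id = obs.cell_id
    GROUP BY obs.bulk_labels
    ORDER BY obs.bulk_labels
    """
).fetchall()
print(rows)
```

Because every table shares the same cell identifier, metadata filters and expression aggregates can be combined in a single query.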
Features
- Query AnnData with SQL.
- Fast for complex queries and aggregative functions.
- Return query results as Pandas DataFrames, Parquet files, or AnnData objects.
- Create in-memory or on-disk databases directly from AnnData objects.
- Open AnnSQL databases in R. No conversions necessary.
Full Documentation
docs.annsql.com
Quick Setup
pip install annsql
Basic Usage (In-Memory)
Ideal for smaller datasets.
from AnnSQL import AnnSQL
import scanpy as sc
#read sample data
adata = sc.datasets.pbmc68k_reduced()
#instantiate the AnnSQL object (you may also pass an h5ad file to the adata parameter)
asql = AnnSQL(adata=adata)
#query the expression table. Returns a Pandas DataFrame by default
asql.query("SELECT * FROM adata LIMIT 10")
Basic Usage (On-Disk)
For larger datasets, AnnSQL can build a local database (an .asql file) from the AnnData object. This database is stored on disk, persists between sessions, and can be reopened and queried without rebuilding.
import scanpy as sc
from AnnSQL import AnnSQL
from AnnSQL.MakeDb import MakeDb
#read sample data
adata = sc.datasets.pbmc68k_reduced()
#build the AnnSQL database
MakeDb(adata=adata, db_name="pbmc68k_reduced", db_path="db/")
#open the AnnSQL database
asql = AnnSQL(db="db/pbmc68k_reduced.asql")
#query the expression table
asql.query("SELECT * FROM adata LIMIT 5")
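The build-once, reopen-later pattern above can be sketched independently of AnnSQL. The example below again uses `sqlite3` as a stand-in for DuckDB. The temporary directory and the `example.db` file name are hypothetical and only serve the illustration.

```python
import os
import sqlite3
import tempfile

# sqlite3 stands in for DuckDB; the file location and name are hypothetical.
db_file = os.path.join(tempfile.mkdtemp(), "example.db")

# Build step: write a table to disk, commit, and close the connection.
conn = sqlite3.connect(db_file)
conn.execute("CREATE TABLE obs (cell_id TEXT, bulk_labels TEXT)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [("cell_1", "Dendritic"), ("cell_2", "CD19+ B")],
)
conn.commit()
conn.close()

# Later session: reopen the same file and query it without rebuilding.
conn = sqlite3.connect(db_file)
n_cells = conn.execute("SELECT COUNT(*) FROM obs").fetchone()[0]
conn.close()
print(n_cells)
```

The data lives in the file rather than in the Python process, so the second connection sees everything the first one committed.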
Advanced Queries and Usage
from AnnSQL import AnnSQL
import scanpy as sc
#read sample data
adata = sc.datasets.pbmc68k_reduced()
#pass the AnnData object to the AnnSQL class
asql = AnnSQL(adata=adata)
#group and count all labels
asql.query("SELECT obs.bulk_labels, COUNT(*) FROM obs GROUP BY obs.bulk_labels")
#take the log10 of a value
asql.query("SELECT LOG10(HES4) FROM X WHERE HES4 > 0")
#sum all gene counts | Memory intensive | See method calculate_gene_counts for chunked approach.
asql.query("SELECT SUM(COLUMNS(*)) FROM (SELECT * EXCLUDE (cell_id) FROM X)")
#correlate genes ITGB2 and SSU72 in dendritic cells that express at least one of the two genes
asql.query("SELECT corr(ITGB2,SSU72) as correlation FROM adata WHERE bulk_labels = 'Dendritic' AND (ITGB2 > 0 OR SSU72 > 0)")
############################################################################
# Extended AnnSQL methods (See: https://docs.annsql.com/preprocessing)
# These methods are either SQL based or Python/SQL hybrid implementations.
############################################################################
#basic QC on the dataset
asql.calculate_total_counts()
asql.filter_by_cell_counts(min_cell_count=1000, max_cell_count=50000)
asql.filter_by_gene_counts(min_gene_counts=100, max_gene_counts=10000)
#normalize & log umi counts
asql.expression_normalize(total_counts_per_cell=10000)
asql.expression_log(log_type="LN")
#select highly variable genes
asql.calculate_variable_genes(save_var_names=True, top_variable_genes=1000)
#run pca
asql.calculate_pca(n_pcs=50, top_variable_genes=1000, zero_center=False)
#umap, cluster, then plot
asql.calculate_umap()
asql.calculate_leiden_clusters(resolution=0.25, n_neighbors=5)
asql.plot_umap(color_by="leiden_clusters", annotate=True)
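As a rough illustration of what the normalize and log steps compute, the sketch below scales one cell's counts to a fixed total and then takes the natural log. It is plain Python, not AnnSQL's implementation. The helper name `normalize_and_log` is hypothetical, and the `ln(1 + x)` pseudocount is an assumption based on common single-cell practice that this README does not confirm.

```python
import math


def normalize_and_log(counts, target_total=10000):
    """Scale a cell's counts to sum to target_total, then apply ln(1 + x).

    Hypothetical helper; the +1 pseudocount is an assumed convention.
    """
    total = sum(counts.values())
    if total == 0:
        # A cell with no counts stays at zero for every gene.
        return {gene: 0.0 for gene in counts}
    return {
        gene: math.log1p(value * target_total / total)
        for gene, value in counts.items()
    }


# Toy raw counts for a single cell (invented values).
cell = {"HES4": 0, "ITGB2": 3, "SSU72": 1}
normalized = normalize_and_log(cell)
print(normalized)
```

Scaling every cell to the same total makes expression values comparable across cells with different sequencing depth, and the log compresses the range of highly expressed genes.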
Reference
Kenny Pavan, Arpiar Saunders. AnnSQL: A Python SQL-based package for fast large-scale single-cell genomics analysis using minimal computational resources. Bioinformatics Advances, 2025; vbaf105. https://doi.org/10.1093/bioadv/vbaf105
File details
Details for the file annsql-1.0.3-py3-none-any.whl.
File metadata
- Download URL: annsql-1.0.3-py3-none-any.whl
- Upload date:
- Size: 27.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 24ffc7351080ffd31d50719a22f8714e240a4ee346b9ab8f3b37568e00f4ac41 |
| MD5 | a9cf99c8cc9d8cea90cd47b346099f41 |
| BLAKE2b-256 | 57890a08379724feaed1453a17603104716a15c59aceb521117e750dd66c593c |