A Python SQL tool for converting Anndata objects to a relational DuckDb database. Methods are included for querying and basic single-cell preprocessing (experimental).
Project description
Query AnnData Objects with SQL
The AnnSQL package enables SQL-based queries on AnnData objects, returning results as either a Pandas DataFrame, an AnnData object, or a Parquet file that can easily be imported into a variety of data analysis tools. Behind the scenes, AnnSQL converts the layers of an AnnData object into a relational DuckDB database. Each layer is stored as an individual table, allowing for simple or complex SQL queries, including table joins.
To get started, see the usage notes below or refer to the example scripts in the examples
directory.
Features
- Query AnnData with SQL.
- Return query results as Pandas Dataframes, Parquet files, or AnnData objects.
- Create in-memory or on-disk databases directly from AnnData objects.
- Fast for complex queries and aggregative functions.
Installation (Python >= 3.12)
Warning: Higher memory consumption using Apple M-Series is expected when building AnnSQL databases.
pip install annsql
Basic Usage (In-Memory)
Ideal for smaller datasets.
from AnnSQL import AnnSQL
import scanpy as sc
#read sample data
adata = sc.datasets.pbmc68k_reduced()
#instantiate the AnnData object (you may also pass a h5ad file to the adata parameter)
asql = AnnSQL(adata=adata)
#query the expression table. Returns Pandas Dataframe by Default
asql.query("SELECT * FROM X")
#query the observation table. Returns adata object.
asql.query("SELECT * FROM obs", return_type="adata")
#query the join of 'X' and 'obs' table
asql.query("SELECT * FROM adata", return_type="parquet")
Basic Usage (On-Disk)
For larger datasets, AnnSQL can create a local database (asql) from the AnnData object. This database is stored on-disk and queried. Storage requirements are similar to the original AnnData h5ad filesize; however, complex aggregative functions can now benefit from the DuckDb engine with full SQL support. Please see manuscript for benchmarks.
import scanpy as sc
from AnnSQL import AnnSQL
from AnnSQL.MakeDb import MakeDb
#read sample data
adata = sc.datasets.pbmc68k_reduced()
#build the AnnSQL database
MakeDb(adata=adata, db_name="pbmc3k_reduced", db_path="db/")
#open the AnnSQL database
asql = AnnSQL(db="db/pbmc3k_reduced.asql")
#query the expression table
asql.query("SELECT * FROM adata LIMIT 5")
Entity Relationship Diagram
Using the Scanpy sample dataset, "Processed 3k PBMCs from 10x Genomics," the following ERD was generated from the DuckDB database constructed via AnnSQL, based on the corresponding AnnData object. Note: The database structure is not optimized for performance. Instead, the tables are designed to closely mirror the familiar structure of the AnnData object for ease of use.
AnnData object with n_obs × n_vars = 700 × 765
obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'phase', 'louvain'
var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
Advanced Queries and usage
from AnnSQL import AnnSQL
import scanpy as sc
#read sample data
adata = sc.datasets.pbmc68k_reduced()
#pass the AnnData object to the AnnSQL class
asql = AnnSQL(adata=adata)
#group and count all labels
asql.query("SELECT obs.bulk_labels, COUNT(*) FROM obs GROUP BY obs.bulk_labels")
#take the log10 of a value
asql.query("SELECT LOG10(HES4) FROM X WHERE HES4 > 0")
#sum all gene counts | Memory intensive | See method calculate_gene_counts for chunked approach.
asql.query("SELECT SUM(COLUMNS(*)) FROM (SELECT * EXCLUDE (cell_id) FROM X)")
#taking the correlation of genes ITGB2 and SSU72 in dendritic cells that express either gene > 0
asql.query("SELECT corr(ITGB2,SSU72) as correlation FROM adata WHERE bulk_labels = 'Dendritic' AND (ITGB2 > 0 OR SSU72 >0)")
AnnSQL Class
Method | Parameters | Description |
---|---|---|
__init__(adata, db, layers, create_all_indexes) |
adata : AnnData object or h5ad filepath (optional)db : Path to DuckDB database (optional)layers : List (optional. default: ["X", "obs", "var", "var_names", "obsm", "varm", "obsp", "uns"]).The layers of the Anndata object to build into the database. For larger datasets, it may be beneficial to only include the layers you're interested in querying.create_basic_indexes : Build indexed on cell_id (optional. default: False)create_all_indexes : Boolean (optional. default: False). Warning: Runtime can be significant when building. |
Initializes the AnnSQL object. Requires either a AnnData object (adata ) or a DuckDB database path (db ). |
query(query, return_type) |
query : SQL query stringreturn_type : 'pandas', 'adata', or 'parquet' (default: 'pandas') |
Executes a SELECT SQL query. Returns results as a pandas DataFrame, AnnData object, or parquet file. |
query_raw(query) |
query : SQL query string |
Executes a raw SQL query without restrictions on the type of query. Returns the raw result of the query. |
update_query(query) |
query : SQL update, delete, or insert query |
Executes an UPDATE, DELETE, or INSERT SQL query. Raises an error if a SELECT query is detected. |
show_tables() |
Displays the list of all tables in the DuckDB instance. | |
show_settings() |
Returns the current DuckDB settings in a pandas DataFrame format. | |
export_parquet() |
Exports all tables in the DuckDB database to individual Parquet files, saved in the parquet_files folder. |
|
calculate_total_counts(chunk_size, print_progress) |
chunk_size : The amount of columns to perform the calculations on concurrently. DuckDb has a limit of 1000. Adjust this value to match the resources available. A higher number will decrease runtime, but requires more resources. Integer (optional. default: 200)print_progress : Boolean (optional. default: False) |
Calculates total library size for each cell. Adds the column 'total_counts' directly to the `obs` and `X` table for convience. |
expression_normalize(total_counts_per_cell, chunk_size, print_progress) |
total_counts_per_cell : Integer (optional. default: 1e4)chunk_size : The amount of columns to perform the calculations on concurrently. DuckDb has a limit of 1000. Adjust this value to match the resources available. A higher number will decrease runtime, but requires more resources. Integer (optional. default: 200)print_progress : Boolean (optional. default: False) |
Updates the cell UMI counts to proportionally add to the total_counts_per_cell value. Directly updates the `X` table. |
expression_log(log_type, chunk_size, print_progress) |
log_type : Accepts either LN, LOG2 or LOG10. String (optional. default: LN)chunk_size : The amount of columns to perform the calculations on concurrently. DuckDb has a limit of 1000. Adjust this value to match the resources available. A higher number will decrease runtime, but requires more resources. Integer (optional. default: 200)print_progress : Boolean (optional. default: False) |
Log transforms the expression and directly updates the `X` table. |
calculate_gene_counts(chunk_size, print_progress) |
chunk_size : The amount of columns to perform the calculations on concurrently. DuckDb has a limit of 1000. Adjust this value to match the resources available. A higher number will decrease runtime, but requires more resources. Integer (optional. default: 200)print_progress : Boolean (optional. default: False) |
Sums all genes (columns) in the `X` table and adds the results to the column, 'gene_counts' in the `var` table. |
calculate_variable_genes(chunk_size, print_progress) |
chunk_size : The amount of columns to perform the calculations on concurrently. DuckDb has a limit of 1000. Adjust this value to match the resources available. A higher number will decrease runtime, but requires more resources. Integer (optional. default: 200)print_progress : Boolean (optional. default: False) |
Takes the sample variance of each gene in the `X` table and adds the results to the 'variance' column in the `var` table. Includes Bessel's bias correction. |
MakeDb Class
Method | Parameters | Description |
---|---|---|
__init__(adata, db_name, db_path, layers, create_all_indexes) |
adata : AnnData object (required)db_name : Name for the database (required)db_path : Path to store the database (default: 'db/')chunk_size : When opening AnnData in backed mode, the amount of chunks to insert at one time. Lower the value for low-memory systems (default: 5000)make_buffer_file : For memory errors on very low memory system, set this flag to true. It will mitigate the DuckDb high memory leak while inserting chunks by creating a buffer file. Be sure to have hard drive space x2 the size of your AnnData object available. (default: False)layers : List (optional. default: ["X", "obs", "var", "var_names", "obsm", "varm", "obsp", "uns"]).The layers of the Anndata object to build into the database. For larger datasets, it may be beneficial to only include the layers you're interested in querying.create_basic_indexes : Build indexed on cell_id (optional. default: False)create_all_indexes : Boolean (optional. default: False). Warning: Runtime can be significant when building.convenience_view : Boolean (optional. default: True). Creates the view 'adata' by joining the X and obs tables. For larger datasets, consider setting this flag to False to save resources. |
Initializes the MakeDb object and validates parameters, then proceeds to build the DuckDB database. |
Usage & Runtime Notes
There are two key reasons to use AnnSQL: (1) if you prefer SQL's expressive syntax for filtering and querying your data, or (2) if you're working with datasets that exceed memory limits and require loading AnnData in backed mode. Using backed mode in AnnData can limit available functions, especially aggregate operations, and slow down data access. AnnSQL offers a solution by enabling SQL-style queries that may perform more efficiently in these scenarios. Below are rough runtime comparisons between AnnData and AnnSQL after a database has been built. Running AnnSQL locally for datasets that are larger than memory, that would typically require AnnData in backed mode see substantial runtime improvements for a variety of filtering operations.
Accessing and processing 4.4 million cells on a laptop
To illustrate how AnnSQL can be used to access atlas sized datasets on a local computer, we examine the single nuclei dataset presented in "The molecular cytoarchitecture of the adult mouse brain" by Langlieb et al 2023. First, we opened the atlas AnnData object in backed mode and created a asql
database using the MakeDb
class provided with AnnSQL. Next, we performed some basic querying of the data to return subsets. We then calculated total counts per gene which we accomplished entirely in SQL; even with the non-optimized schema. Next, we blended Python and SQL to normalize and log the counts per library. Lastly, we calculated highly variable genes in the entire dataset using two SQL queries which: (1) provide a list of all gene names in the X table, then (2) use those gene names to calculate the variance for each gene and return a list of the top 2000. Our results demonstrate AnnSQL is a capable tool for basic (and possibly more advanced) analyses of atlas scale datasets.
#import libraries
from MakeDb import MakeDb
from AnnSQL import AnnSQL
#load the atlas dataset in backed mode
adata = sc.read_h5ad("Macosko_Mouse_Atlas_Single_Nuclei.Use_Backed.h5ad", backed="r+")
#build the asql database | Runtime 7hr 10min
MakeDb(adata=adata, db_name="Macosko_Mouse_Atlas", db_path="../db/", layers=["X", "obs"])
#query example | Runtime: 0.24sec
asql.query("SELECT ENSMUSG00000070880 FROM X WHERE ENSMUSG00000070880 > 0")
#count the number of cells in each cluster | Runtime: 0.35sec
asql.query("SELECT ClusterNm, COUNT(cell_id) AS num_cells FROM obs GROUP BY ClusterNm ORDER BY num_cells DESC")
#determine the total counts per cell library | Runtime: 4min 30sec
asql.calculate_total_counts(chunk_size=950)
#normalize umi counts to 10k per cell | Runtime: 1hr 48mins
asql.expression_normalize(total_counts_per_cell=1e4, chunk_size=300)
#log scale the normalized counts | Runtime: 59mins 13sec
asql.expression_log(log_type="LN", chunk_size=250)
Laptop system details for both runtime analyses displayed above.
- Memory: 40.0 GiB
- Processor: 12th Gen Intel® Core™ i7-1255U × 12
- Disk Capacity: 1.0 TB
- OS: Ubuntu 24.04.1 LTS
- Python Version: 3.12
Citation
Coming soon...
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file annsql-0.9.8.tar.gz
.
File metadata
- Download URL: annsql-0.9.8.tar.gz
- Upload date:
- Size: 19.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | a3b37b2bb11b2fedafff72971d7b37163c28e4c97835c945571055efcc5e3f4b |
|
MD5 | b8028809b7bd510d9ad3bd3821b5915c |
|
BLAKE2b-256 | b9d43e3c4e7ddf600929b89e0544ad243fb34b141d53cc0f113c7239f3ab0d7d |
File details
Details for the file AnnSQL-0.9.8-py3-none-any.whl
.
File metadata
- Download URL: AnnSQL-0.9.8-py3-none-any.whl
- Upload date:
- Size: 15.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | fb9840afd1e1f961c1ac2819b4e89394021da898bfb8c5c18e753d7b52b23d36 |
|
MD5 | c98b450e87aab7da2e7ed986db14614c |
|
BLAKE2b-256 | 4c19a6fc348b51359da370feeb023e457d2950d59dc73494a8615e7adfec663e |