A Python SQL tool for converting AnnData objects to a relational DuckDB database. Methods are included for querying and basic single-cell preprocessing (experimental).



Query AnnData Objects with SQL

The AnnSQL package enables SQL-based queries on AnnData objects, returning results as either a Pandas DataFrame, an AnnData object, or a Parquet file that can easily be imported into a variety of data analysis tools. Behind the scenes, AnnSQL converts the layers of an AnnData object into a relational DuckDB database. Each layer is stored as an individual table, allowing for simple or complex SQL queries, including table joins.

To get started, see the usage notes below or refer to the example scripts in the examples directory.

Features

  • Query AnnData with SQL.
  • Return query results as Pandas DataFrames, Parquet files, or AnnData objects.
  • Create in-memory or on-disk databases directly from AnnData objects.
  • Fast for complex queries and aggregative functions.

Installation (Python >= 3.12)

Warning: Expect higher memory consumption when building AnnSQL databases on Apple M-series hardware.

pip install annsql

Basic Usage (In-Memory)

Ideal for smaller datasets.

from AnnSQL import AnnSQL
import scanpy as sc

#read sample data
adata = sc.datasets.pbmc68k_reduced()

#instantiate the AnnSQL object (you may also pass an h5ad filepath to the adata parameter)
adata_sql = AnnSQL(adata=adata)

#query the expression table. Returns a Pandas DataFrame by default
adata_sql.query("SELECT * FROM X")

#query the observation table. Returns an AnnData object.
adata_sql.query("SELECT * FROM obs", return_type="adata")

#query the join of the 'X' and 'obs' tables. Returns a Parquet file.
adata_sql.query("SELECT * FROM adata", return_type="parquet")

Basic Usage (On-Disk)

For larger datasets, AnnSQL can create a local database (asql) from the AnnData object. The database is stored on disk and queried in place. Storage requirements are similar to the original AnnData h5ad file size; however, complex aggregative functions can now benefit from the DuckDB engine with full SQL support. Please see the manuscript for benchmarks.

import scanpy as sc
from AnnSQL import AnnSQL
from AnnSQL.MakeDb import MakeDb

#read sample data
adata = sc.datasets.pbmc68k_reduced()

#build the AnnSQL database
MakeDb(adata=adata, db_name="pbmc68k_reduced", db_path="db/")

#open the AnnSQL database
adata_sql = AnnSQL(db="db/pbmc68k_reduced.asql")

#query the expression table
adata_sql.query("SELECT * FROM adata")

Entity Relationship Diagram

Using the Scanpy sample dataset from the examples above (sc.datasets.pbmc68k_reduced), the following ERD was generated from the DuckDB database constructed via AnnSQL, based on the corresponding AnnData object. Note: the database structure is not optimized for performance. Instead, the tables are designed to closely mirror the familiar structure of the AnnData object for ease of use.

AnnData object with n_obs × n_vars = 700 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'phase', 'louvain'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Advanced Queries and Usage

from AnnSQL import AnnSQL
import scanpy as sc

#read sample data
adata = sc.datasets.pbmc68k_reduced()

#pass the AnnData object to the AnnSQL class
adata_sql = AnnSQL(adata=adata)

#group and count all labels
adata_sql.query("SELECT obs.bulk_labels, COUNT(*) FROM obs GROUP BY obs.bulk_labels")

#take the log10 of a gene's nonzero expression values
adata_sql.query("SELECT LOG10(HES4) FROM X WHERE HES4 > 0")

#sum all gene counts
adata_sql.query("SELECT SUM(COLUMNS(*)) FROM (SELECT * EXCLUDE (cell_id) FROM X)")

#correlation of genes ITGB2 and SSU72 in dendritic cells expressing either gene
adata_sql.query("SELECT corr(ITGB2,SSU72) AS correlation FROM adata WHERE bulk_labels = 'Dendritic' AND (ITGB2 > 0 OR SSU72 > 0)")

AnnSQL Class

__init__(adata, db, layers, create_basic_indexes, create_all_indexes)
  • adata: AnnData object or h5ad filepath (optional)
  • db: Path to a DuckDB database (optional)
  • layers: List (optional. default: ["X", "obs", "var", "var_names", "obsm", "varm", "obsp", "uns"]). The layers of the AnnData object to build into the database. For larger datasets, it may be beneficial to include only the layers you're interested in querying.
  • create_basic_indexes: Boolean (optional. default: False). Builds an index on cell_id.
  • create_all_indexes: Boolean (optional. default: False). Warning: runtime can be significant when building.
  • Initializes the AnnSQL object. Requires either an AnnData object (adata) or a DuckDB database path (db).

query(query, return_type)
  • query: SQL query string
  • return_type: 'pandas', 'adata', or 'parquet' (default: 'pandas')
  • Executes a SELECT SQL query. Returns results as a pandas DataFrame, an AnnData object, or a Parquet file.

query_raw(query)
  • query: SQL query string
  • Executes a raw SQL query without restrictions on the type of query. Returns the raw result of the query.

update_query(query)
  • query: SQL UPDATE, DELETE, or INSERT query string
  • Executes an UPDATE, DELETE, or INSERT SQL query. Raises an error if a SELECT query is detected.

show_tables()
  • None
  • Displays the list of all tables in the DuckDB instance.

show_settings()
  • None
  • Returns the current DuckDB settings as a pandas DataFrame.

export_parquet()
  • None
  • Exports all tables in the DuckDB database to individual Parquet files, saved in the parquet_files folder.

calculate_total_counts(chunk_size, print_progress)
  • chunk_size: Integer (optional. default: 200). The number of columns to process at one time. DuckDB has a limit of 1000; adjust this value to match the resources available. A higher number decreases runtime but requires more resources.
  • print_progress: Boolean (optional. default: False)
  • Calculates the total library size of each cell. Adds the column 'total_counts' directly to the `obs` and `X` tables for convenience.

expression_normalize(total_counts_per_cell, chunk_size, print_progress)
  • total_counts_per_cell: Integer (optional. default: 1e4)
  • chunk_size: Integer (optional. default: 200). The number of columns to process at one time. DuckDB has a limit of 1000; adjust this value to match the resources available. A higher number decreases runtime but requires more resources.
  • print_progress: Boolean (optional. default: False)
  • Scales each cell's UMI counts so they sum proportionally to the total_counts_per_cell value. Directly updates the `X` table.

expression_log(log_type, chunk_size, print_progress)
  • log_type: String (optional. default: LN). Accepts LN, LOG2, or LOG10.
  • chunk_size: Integer (optional. default: 200). The number of columns to process at one time. DuckDB has a limit of 1000; adjust this value to match the resources available. A higher number decreases runtime but requires more resources.
  • print_progress: Boolean (optional. default: False)
  • Log-transforms the expression values and directly updates the `X` table.

calculate_gene_counts(chunk_size, print_progress)
  • chunk_size: Integer (optional. default: 200). The number of columns to process at one time. DuckDB has a limit of 1000; adjust this value to match the resources available. A higher number decreases runtime but requires more resources.
  • print_progress: Boolean (optional. default: False)
  • Sums all genes (columns) in the `X` table and adds the results to the 'gene_counts' column in the `var` table.

calculate_variable_genes(chunk_size, print_progress)
  • Experimental
  • chunk_size: Integer (optional. default: 200). The number of columns to process at one time. DuckDB has a limit of 1000; adjust this value to match the resources available. A higher number decreases runtime but requires more resources.
  • print_progress: Boolean (optional. default: False)
  • Takes the sample variance of each gene in the `X` table and adds the results to the 'variance' column in the `var` table. Includes Bessel's bias correction.
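
Chained together, the preprocessing methods above form a basic normalization pipeline. A minimal sketch on the in-memory example from earlier (the chunk_size values are illustrative, not tuned):

from AnnSQL import AnnSQL
import scanpy as sc

#build an in-memory database from the sample data
adata_sql = AnnSQL(adata=sc.datasets.pbmc68k_reduced())

#add 'total_counts' to the obs and X tables
adata_sql.calculate_total_counts(chunk_size=200)

#scale each cell to 10k counts, then log-transform X in place
adata_sql.expression_normalize(total_counts_per_cell=1e4, chunk_size=200)
adata_sql.expression_log(log_type="LOG2", chunk_size=200)

#write per-gene totals and variances to the var table
adata_sql.calculate_gene_counts(chunk_size=200)
adata_sql.calculate_variable_genes(chunk_size=200)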

MakeDb Class

__init__(adata, db_name, db_path, chunk_size, make_buffer_file, layers, create_basic_indexes, create_all_indexes, convenience_view)
  • adata: AnnData object (required)
  • db_name: Name for the database (required)
  • db_path: Path to store the database (default: 'db/')
  • chunk_size: Integer (optional. default: 5000). When opening AnnData in backed mode, the size of each chunk inserted at one time. Lower the value for low-memory systems.
  • make_buffer_file: Boolean (optional. default: False). For memory errors on very low-memory systems, set this flag to True; it mitigates DuckDB's high memory usage while inserting chunks by creating a buffer file. Be sure to have free disk space of at least 2× the size of your AnnData object.
  • layers: List (optional. default: ["X", "obs", "var", "var_names", "obsm", "varm", "obsp", "uns"]). The layers of the AnnData object to build into the database. For larger datasets, it may be beneficial to include only the layers you're interested in querying.
  • create_basic_indexes: Boolean (optional. default: False). Builds an index on cell_id.
  • create_all_indexes: Boolean (optional. default: False). Warning: runtime can be significant when building.
  • convenience_view: Boolean (optional. default: True). Creates the view 'adata' by joining the X and obs tables. For larger datasets, consider setting this flag to False to save resources.
  • Initializes the MakeDb object, validates parameters, then builds the DuckDB database.
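
For large builds, the optional parameters above are the main levers: restrict layers to what you will query, index cell_id, and skip the joined view. A sketch under those assumptions (the filename is hypothetical):

import scanpy as sc
from AnnSQL.MakeDb import MakeDb

#open in backed mode so the full matrix never loads into memory
adata = sc.read_h5ad("large_dataset.h5ad", backed="r+") #hypothetical file

#build only the layers to be queried; skip the convenience 'adata' view
MakeDb(adata=adata, db_name="large_dataset", db_path="db/",
	layers=["X", "obs"], create_basic_indexes=True, convenience_view=False)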


    Usage & Runtime Notes

    There are two key reasons to use AnnSQL: (1) if you prefer SQL's expressive syntax for filtering and querying your data, or (2) if you're working with datasets that exceed memory limits and require loading AnnData in backed mode. Using backed mode in AnnData can limit available functions, especially aggregate operations, and slow down data access. AnnSQL offers a solution by enabling SQL-style queries that may perform more efficiently in these scenarios. Below are rough runtime comparisons between AnnData and AnnSQL after a database has been built. Running AnnSQL locally for datasets that are larger than memory, that would typically require AnnData in backed mode see substantial runtime improvements for a variety of filtering operations.



Accessing and processing 4.4 million cells on a laptop

To illustrate how AnnSQL can be used to access atlas-sized datasets on a local computer, we examine the single-nuclei dataset presented in "The molecular cytoarchitecture of the adult mouse brain" by Langlieb et al. (2023). First, we opened the atlas AnnData object in backed mode and created an asql database using the MakeDb class provided with AnnSQL. Next, we performed some basic querying of the data to return subsets. We then calculated total counts per gene, accomplished entirely in SQL, even with the non-optimized schema. Next, we blended Python and SQL to normalize and log-transform the counts per library. Lastly, we calculated highly variable genes in the entire dataset using two SQL queries: (1) retrieve a list of all gene names in the X table, then (2) use those gene names to calculate the variance of each gene and return the top 2000 (a sketch of this approach follows the code below). Our results demonstrate that AnnSQL is a capable tool for basic, and possibly more advanced, analyses of atlas-scale datasets.

#import libraries
import scanpy as sc
from AnnSQL import AnnSQL
from AnnSQL.MakeDb import MakeDb

#load the atlas dataset in backed mode
adata = sc.read_h5ad("Macosko_Mouse_Atlas_Single_Nuclei.Use_Backed.h5ad", backed="r+")

#build the asql database | Runtime 7hr 10min
MakeDb(adata=adata, db_name="Macosko_Mouse_Atlas", db_path="../db/", layers=["X", "obs"])

#open the asql database
adata_sql = AnnSQL(db="../db/Macosko_Mouse_Atlas.asql")

#query example | Runtime: 0.24sec
adata_sql.query("SELECT ENSMUSG00000070880 FROM X WHERE ENSMUSG00000070880 > 0")

#count the number of cells in each cluster | Runtime: 0.35sec
adata_sql.query("SELECT ClusterNm, COUNT(cell_id) AS num_cells FROM obs GROUP BY ClusterNm ORDER BY num_cells DESC")

#determine the total counts per cell library | Runtime: 4min 30sec
adata_sql.calculate_total_counts(chunk_size=950)

#normalize umi counts to 10k per cell | Runtime: 1hr 48mins
adata_sql.expression_normalize(total_counts_per_cell=1e4, chunk_size=300)

#log scale the normalized counts | Runtime: 59mins 13sec
adata_sql.expression_log(log_type="LOG2", chunk_size=250)
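
The two-query approach to highly variable genes described above can be sketched as follows. This is a minimal illustration, not the package's internal implementation: it assumes all gene columns fit in a single SELECT, whereas at atlas scale the variance query would need the same chunking the preprocessing methods use. var_samp is DuckDB's Bessel-corrected sample variance.

#(1) list all gene columns in the X table
genes = adata_sql.query("SELECT column_name FROM information_schema.columns WHERE table_name = 'X' AND column_name != 'cell_id'")["column_name"].tolist()

#(2) compute the sample variance of each gene and keep the top 2000
variance_sql = ", ".join([f"var_samp({g}) AS {g}" for g in genes])
variances = adata_sql.query(f"SELECT {variance_sql} FROM X")
top_2000 = variances.iloc[0].sort_values(ascending=False).head(2000)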
    

Laptop system details for runtime analyses displayed above.

  • Memory: 40.0 GiB
  • Processor: 12th Gen Intel® Core™ i7-1255U × 12
  • Disk Capacity: 1.0 TB
  • OS: Ubuntu 24.04.1 LTS
  • Python Version: 3.12


Citation

Coming soon...


