A Python SQL tool for converting AnnData objects to a relational DuckDB database. Methods are included for querying and basic single-cell preprocessing (experimental).

These details have not been verified by PyPI

Project links

Homepage

Project description

Query AnnData Objects with SQL

The AnnSQL package enables SQL-based queries on AnnData objects, returning results as either a Pandas DataFrame, an AnnData object, or a Parquet file that can easily be imported into a variety of data analysis tools. Behind the scenes, AnnSQL converts the layers of an AnnData object into a relational DuckDB database. Each layer is stored as an individual table, allowing for simple or complex SQL queries, including table joins.
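
As a rough illustration of that layout, the sketch below builds two tiny tables keyed by cell_id and joins them. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs without extra dependencies; AnnSQL itself builds a DuckDB database. The gene names, labels, and values are made up, and the exact schema AnnSQL creates may differ.

```python
import sqlite3

# Stand-in for the DuckDB database AnnSQL builds (sqlite3 is used only so the
# sketch runs with the standard library). All table contents are made up.
con = sqlite3.connect(":memory:")

# One table per AnnData layer: X holds expression values (one column per gene),
# obs holds per-cell annotations. Both carry a cell_id column.
con.execute("CREATE TABLE X (cell_id TEXT PRIMARY KEY, GENE_A REAL, GENE_B REAL)")
con.execute("CREATE TABLE obs (cell_id TEXT PRIMARY KEY, bulk_labels TEXT)")
con.executemany(
    "INSERT INTO X VALUES (?, ?, ?)",
    [("cell_1", 0.0, 2.0), ("cell_2", 1.5, 0.0), ("cell_3", 3.0, 1.0)],
)
con.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [("cell_1", "Dendritic"), ("cell_2", "CD19+ B"), ("cell_3", "Dendritic")],
)

# A join across layers: expression of GENE_A restricted to one cell label.
rows = con.execute(
    """
    SELECT X.cell_id, X.GENE_A
    FROM X
    JOIN obs ON obs.cell_id = X.cell_id
    WHERE obs.bulk_labels = 'Dendritic'
    ORDER BY X.cell_id
    """
).fetchall()
print(rows)
```

Because every layer is an ordinary table, the same join pattern extends to any other layer that shares the cell_id key.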

To get started, see the usage notes below or refer to the example scripts in the examples directory.

Features

Query AnnData with SQL.
Return query results as Pandas DataFrames, Parquet files, or AnnData objects.
Create in-memory or on-disk databases directly from AnnData objects.
Fast for complex queries and aggregate functions.

Installation (Python >= 3.12)

Warning: Expect higher memory consumption on Apple M-series machines when building AnnSQL databases.

pip install annsql

Basic Usage (In-Memory)

Ideal for smaller datasets.

from AnnSQL import AnnSQL
import scanpy as sc

#read sample data
adata = sc.datasets.pbmc68k_reduced()

#instantiate the AnnSQL object (you may also pass an h5ad file path to the adata parameter)
asql = AnnSQL(adata=adata)

#query the expression table. Returns a Pandas DataFrame by default
asql.query("SELECT * FROM X")

#query the observation table. Returns an AnnData object
asql.query("SELECT * FROM obs", return_type="adata")

#query the 'adata' view (the join of the 'X' and 'obs' tables). Returns a Parquet file
asql.query("SELECT * FROM adata", return_type="parquet")

Basic Usage (On-Disk)

For larger datasets, AnnSQL can create a local database (asql) from the AnnData object. The database is stored on disk and queried from there. Storage requirements are similar to the size of the original AnnData h5ad file; however, complex aggregate functions can now benefit from the DuckDB engine with full SQL support. Please see the manuscript for benchmarks.

import scanpy as sc
from AnnSQL import AnnSQL
from AnnSQL.MakeDb import MakeDb

#read sample data
adata = sc.datasets.pbmc68k_reduced()

#build the AnnSQL database
MakeDb(adata=adata, db_name="pbmc3k_reduced", db_path="db/")

#open the AnnSQL database
asql = AnnSQL(db="db/pbmc3k_reduced.asql")

#query the 'adata' view (the join of the 'X' and 'obs' tables)
asql.query("SELECT * FROM adata LIMIT 5")

Entity Relationship Diagram

Using the Scanpy sample dataset pbmc68k_reduced (a subsampled and processed 68k PBMC dataset), the following ERD was generated from the DuckDB database constructed via AnnSQL, based on the corresponding AnnData object. Note: The database structure is not optimized for performance. Instead, the tables are designed to closely mirror the familiar structure of the AnnData object for ease of use.

AnnData object with n_obs × n_vars = 700 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'phase', 'louvain'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Advanced Queries and Usage

from AnnSQL import AnnSQL
import scanpy as sc

#read sample data
adata = sc.datasets.pbmc68k_reduced()

#pass the AnnData object to the AnnSQL class
asql = AnnSQL(adata=adata)

#group and count all labels
asql.query("SELECT obs.bulk_labels, COUNT(*) FROM obs GROUP BY obs.bulk_labels")

#take the log10 of a value
asql.query("SELECT LOG10(HES4) FROM X WHERE HES4 > 0")

#sum all gene counts | Memory intensive | See method calculate_gene_counts for chunked approach.
asql.query("SELECT SUM(COLUMNS(*)) FROM (SELECT * EXCLUDE (cell_id) FROM X)")

#correlation of genes ITGB2 and SSU72 in dendritic cells that express either gene (> 0)
asql.query("SELECT corr(ITGB2, SSU72) AS correlation FROM adata WHERE bulk_labels = 'Dendritic' AND (ITGB2 > 0 OR SSU72 > 0)")
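
For readers less familiar with SQL aggregates, the sketch below spells out in plain Python what the last query computes: the rows are filtered by label and expression, and the Pearson correlation coefficient is taken over the surviving gene pairs. The input rows are made up and the helper `pearson` is hypothetical; it is meant as a reading aid, not as AnnSQL or DuckDB code.

```python
import math

# Hypothetical (ITGB2, SSU72, bulk_labels) rows standing in for the 'adata' view.
cells = [
    (1.0, 2.0, "Dendritic"),
    (2.0, 4.0, "Dendritic"),
    (3.0, 6.0, "Dendritic"),
    (0.0, 0.0, "Dendritic"),  # dropped: neither gene is expressed
    (5.0, 1.0, "CD19+ B"),    # dropped: different label
]

# Equivalent of: WHERE bulk_labels = 'Dendritic' AND (ITGB2 > 0 OR SSU72 > 0)
kept = [(a, b) for a, b, label in cells if label == "Dendritic" and (a > 0 or b > 0)]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


correlation = pearson(kept)
print(len(kept), correlation)
```

Note that the OR condition keeps cells where only one of the two genes is expressed, so zeros in the other gene still contribute to the coefficient.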

AnnSQL Class

Method	Parameters	Description
`__init__(adata, db, layers, create_all_indexes)`	`adata`: AnnData object or h5ad filepath (optional) `db`: Path to DuckDB database (optional) `layers`: List (optional. default: ["X", "obs", "var", "var_names", "obsm", "varm", "obsp", "uns"]). The layers of the AnnData object to build into the database. For larger datasets, it may be beneficial to include only the layers you're interested in querying. `create_basic_indexes`: Build indexes on cell_id (optional. default: False) `create_all_indexes`: Boolean (optional. default: False). Warning: Runtime can be significant when building.	Initializes the AnnSQL object. Requires either an AnnData object (`adata`) or a DuckDB database path (`db`).
`query(query, return_type)`	`query`: SQL query string `return_type`: 'pandas', 'adata', or 'parquet' (default: 'pandas')	Executes a SELECT SQL query. Returns results as a pandas DataFrame, AnnData object, or parquet file.
`query_raw(query)`	`query`: SQL query string	Executes a raw SQL query without restrictions on the type of query. Returns the raw result of the query.
`update_query(query)`	`query`: SQL update, delete, or insert query	Executes an UPDATE, DELETE, or INSERT SQL query. Raises an error if a SELECT query is detected.
`show_tables()`	None	Displays the list of all tables in the DuckDB instance.
`show_settings()`	None	Returns the current DuckDB settings in a pandas DataFrame format.
`export_parquet()`	None	Exports all tables in the DuckDB database to individual Parquet files, saved in the `parquet_files` folder.
`calculate_total_counts(chunk_size, print_progress)`	`chunk_size`: The number of columns to process concurrently. DuckDB has a limit of 1000. Adjust this value to match the resources available. A higher number will decrease runtime but requires more resources. Integer (optional. default: 200) `print_progress`: Boolean (optional. default: False)	Calculates the total library size for each cell. Adds the column 'total_counts' directly to the `obs` and `X` tables for convenience.
`expression_normalize(total_counts_per_cell, chunk_size, print_progress)`	`total_counts_per_cell`: Integer (optional. default: 1e4) `chunk_size`: The number of columns to process concurrently. DuckDB has a limit of 1000. Adjust this value to match the resources available. A higher number will decrease runtime but requires more resources. Integer (optional. default: 200) `print_progress`: Boolean (optional. default: False)	Rescales each cell's UMI counts proportionally so that they sum to the total_counts_per_cell value. Directly updates the `X` table.
`expression_log(log_type, chunk_size, print_progress)`	`log_type`: Accepts either LN, LOG2 or LOG10. String (optional. default: LN) `chunk_size`: The number of columns to process concurrently. DuckDB has a limit of 1000. Adjust this value to match the resources available. A higher number will decrease runtime but requires more resources. Integer (optional. default: 200) `print_progress`: Boolean (optional. default: False)	Log transforms the expression and directly updates the `X` table.
`calculate_gene_counts(chunk_size, print_progress)`	`chunk_size`: The number of columns to process concurrently. DuckDB has a limit of 1000. Adjust this value to match the resources available. A higher number will decrease runtime but requires more resources. Integer (optional. default: 200) `print_progress`: Boolean (optional. default: False)	Sums each gene (column) in the `X` table and adds the results to the 'gene_counts' column in the `var` table.
`calculate_variable_genes(chunk_size, print_progress)`	Experimental `chunk_size`: The number of columns to process concurrently. DuckDB has a limit of 1000. Adjust this value to match the resources available. A higher number will decrease runtime but requires more resources. Integer (optional. default: 200) `print_progress`: Boolean (optional. default: False)	Takes the sample variance of each gene in the `X` table and adds the results to the 'variance' column in the `var` table. Includes Bessel's bias correction.
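
The `chunk_size` parameter shared by these methods can be pictured as splitting the gene columns into groups and issuing one SQL statement per group. The sketch below shows that idea for library size and normalization. It is not AnnSQL's implementation: the helper `chunked`, the table contents, and the tiny chunk size are hypothetical, and sqlite3 again stands in for DuckDB so the code runs with the standard library alone.

```python
import sqlite3


def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# sqlite3 is a dependency-free stand-in for DuckDB; genes and counts are made up.
con = sqlite3.connect(":memory:")
genes = ["g1", "g2", "g3", "g4", "g5"]
gene_columns = ", ".join(f"{g} REAL" for g in genes)
con.execute(f"CREATE TABLE X (cell_id TEXT, {gene_columns}, total_counts REAL DEFAULT 0)")
con.executemany(
    "INSERT INTO X (cell_id, g1, g2, g3, g4, g5) VALUES (?, ?, ?, ?, ?, ?)",
    [("cell_1", 1, 0, 2, 0, 1), ("cell_2", 0, 3, 0, 1, 0)],
)

chunk_size = 2   # hypothetical; the documented default is 200
target = 10000   # plays the role of total_counts_per_cell

# Pass 1: accumulate each cell's library size a few gene columns at a time.
for chunk in chunked(genes, chunk_size):
    con.execute(f"UPDATE X SET total_counts = total_counts + ({' + '.join(chunk)})")

# Pass 2: rescale every gene column so that each cell sums to `target`.
for chunk in chunked(genes, chunk_size):
    assignments = ", ".join(f"{g} = {g} / total_counts * {target}" for g in chunk)
    con.execute(f"UPDATE X SET {assignments} WHERE total_counts > 0")

totals = dict(con.execute("SELECT cell_id, total_counts FROM X"))
row_sums = dict(con.execute(f"SELECT cell_id, {' + '.join(genes)} FROM X"))
print(totals, row_sums)
```

Larger chunks mean fewer statements but wider ones, which is the runtime versus resources trade-off described in the parameter notes.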

MakeDb Class

Method	Parameters	Description
`__init__(adata, db_name, db_path, layers, create_all_indexes)`	`adata`: AnnData object (required) `db_name`: Name for the database (required) `db_path`: Path to store the database (default: 'db/') `chunk_size`: When opening AnnData in backed mode, the number of chunks to insert at one time. Lower the value for low-memory systems (default: 5000) `make_buffer_file`: If you hit memory errors on a very low-memory system, set this flag to True. It mitigates the DuckDB memory leak while inserting chunks by creating a buffer file. Be sure to have free disk space of at least twice the size of your AnnData object. (default: False) `layers`: List (optional. default: ["X", "obs", "var", "var_names", "obsm", "varm", "obsp", "uns"]). The layers of the AnnData object to build into the database. For larger datasets, it may be beneficial to include only the layers you're interested in querying. `create_basic_indexes`: Build indexes on cell_id (optional. default: False) `create_all_indexes`: Boolean (optional. default: False). Warning: Runtime can be significant when building. `convenience_view`: Boolean (optional. default: True). Creates the view 'adata' by joining the X and obs tables. For larger datasets, consider setting this flag to False to save resources.	Initializes the MakeDb object and validates parameters, then proceeds to build the DuckDB database.

Usage & Runtime Notes

There are two key reasons to use AnnSQL: (1) if you prefer SQL's expressive syntax for filtering and querying your data, or (2) if you're working with datasets that exceed memory limits and require loading AnnData in backed mode. Using backed mode in AnnData can limit available functions, especially aggregate operations, and slow down data access. AnnSQL offers a solution by enabling SQL-style queries that may perform more efficiently in these scenarios. Below are rough runtime comparisons between AnnData and AnnSQL after a database has been built. When run locally on datasets that are larger than memory, which would typically require AnnData in backed mode, AnnSQL shows substantial runtime improvements for a variety of filtering operations.

Accessing and processing 4.4 million cells on a laptop

To illustrate how AnnSQL can be used to access atlas-sized datasets on a local computer, we examine the single-nuclei dataset presented in "The molecular cytoarchitecture of the adult mouse brain" by Langlieb et al. (2023). First, we opened the atlas AnnData object in backed mode and created an asql database using the MakeDb class provided with AnnSQL. Next, we ran some basic queries to return subsets of the data. We then calculated total counts per gene entirely in SQL, even with the non-optimized schema. Next, we blended Python and SQL to normalize and log-transform the counts per library. Lastly, we calculated highly variable genes across the entire dataset using two SQL queries, which (1) provide a list of all gene names in the X table, then (2) use those gene names to calculate the variance for each gene and return a list of the top 2000. Our results demonstrate that AnnSQL is a capable tool for basic (and possibly more advanced) analyses of atlas-scale datasets.

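#import libraries
import scanpy as sc
from AnnSQL import AnnSQL
from AnnSQL.MakeDb import MakeDb

#load the atlas dataset in backed mode
adata = sc.read_h5ad("Macosko_Mouse_Atlas_Single_Nuclei.Use_Backed.h5ad", backed="r+")

#build the asql database | Runtime 7hr 10min
MakeDb(adata=adata, db_name="Macosko_Mouse_Atlas", db_path="../db/", layers=["X", "obs"])

#open the asql database (path assumed to follow the db_path + db_name + ".asql" pattern of the on-disk example)
asql = AnnSQL(db="../db/Macosko_Mouse_Atlas.asql")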

#query example | Runtime: 0.24sec
asql.query("SELECT ENSMUSG00000070880 FROM X WHERE ENSMUSG00000070880 > 0")

#count the number of cells in each cluster | Runtime: 0.35sec
asql.query("SELECT ClusterNm, COUNT(cell_id) AS num_cells FROM obs GROUP BY ClusterNm ORDER BY num_cells DESC")

#determine the total counts per cell library | Runtime: 4min 30sec
asql.calculate_total_counts(chunk_size=950)

#normalize UMI counts to 10k per cell | Runtime: 1hr 48mins
asql.expression_normalize(total_counts_per_cell=1e4, chunk_size=300)

#log scale the normalized counts | Runtime: 59mins 13sec
asql.expression_log(log_type="LN", chunk_size=250)
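
The two-query variable gene step described above is not shown in the listing. A minimal sketch of the idea follows, under stated assumptions: sqlite3 stands in for DuckDB, the table is made up, and the sample variance (with Bessel's correction) is written out from sums because SQLite has no built-in variance aggregate. The real AnnSQL queries are likely different, and the top 2 here stands in for the top 2000.

```python
import sqlite3

# sqlite3 stands in for DuckDB; the expression values are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE X (cell_id TEXT, g1 REAL, g2 REAL, g3 REAL)")
con.executemany(
    "INSERT INTO X VALUES (?, ?, ?, ?)",
    [
        ("cell_1", 1.0, 5.0, 2.0),
        ("cell_2", 2.0, 5.0, 4.0),
        ("cell_3", 3.0, 5.0, 9.0),
    ],
)

# Query 1: list the gene columns of X (every column except cell_id).
genes = [row[1] for row in con.execute("PRAGMA table_info('X')") if row[1] != "cell_id"]

# Query 2: one pass over X computing the sample variance of every gene,
# var = (sum(x^2) - sum(x)^2 / n) / (n - 1), i.e. with Bessel's correction.
expressions = ", ".join(
    f'(SUM("{g}" * "{g}") - SUM("{g}") * SUM("{g}") / COUNT(*)) / (COUNT(*) - 1)'
    for g in genes
)
values = con.execute(f"SELECT {expressions} FROM X").fetchone()
variances = dict(zip(genes, values))

# Keep the most variable genes.
top_genes = sorted(variances, key=variances.get, reverse=True)[:2]
print(variances, top_genes)
```

On a wide table, the second query would typically be split into column chunks, in the same spirit as the `chunk_size` parameter of the preprocessing methods.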

Laptop system details for both runtime analyses displayed above.

Memory: 40.0 GiB
Processor: 12th Gen Intel® Core™ i7-1255U × 12
Disk Capacity: 1.0 TB
OS: Ubuntu 24.04.1 LTS
Python Version: 3.12

Citation

Coming soon...

