anngtf - lift annotations from a `.gtf` file to your AnnData object.
anngtf
Lift annotations from a `gtf` to your `adata` object.
Installation
To install via pip:
pip install anngtf
To install the development version:
git clone https://github.com/mvinyard/anngtf.git
cd anngtf; pip install -e .
Example usage
Parsing a `.gtf` file
import anngtf
gtf_filepath = "/path/to/ref/hg38/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/genes/genes.gtf"
If this is your first time using `anngtf`, run:
gtf = anngtf.parse(path=gtf_filepath, genes=False, force=False, return_gtf=True)
Running this function creates two `.csv` files from the given `.gtf` file: one containing all feature types and one containing only genes. Both of these files are smaller than a `.gtf` and can be loaded into memory much faster using `pandas.read_csv()` (a shortcut is implemented in the next function). The function also leaves a paper trail so that `anngtf` can find the newly created `.csv` files again in the future, without you needing to pass a path to the `.gtf`.
In the scenario in which you've already run the above function, run:
gtf = anngtf.load() # no path necessary!
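The caching idea described above can be sketched with the standard library alone. This is a minimal illustration and not the `anngtf` implementation: the function names (`parse_gtf_line`, `cache_gtf`, `load_cached`), the output file names, and the `cache.json` pointer file are all hypothetical, and the two GTF records are toy data. The only fact assumed about the format is that a GTF record has eight fixed tab-separated fields followed by an attribute column.

```python
import csv
import json
import tempfile
from pathlib import Path

# The eight fixed GTF fields; the ninth column holds key "value"; attributes.
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame"]


def parse_gtf_line(line):
    """Split one GTF record into its fixed fields plus attribute key/values.

    Simplification: assumes attribute values contain no ';' characters.
    """
    fields = line.rstrip("\n").split("\t")
    row = dict(zip(GTF_COLUMNS, fields[:8]))
    for attr in fields[8].split(";"):
        attr = attr.strip()
        if attr:
            key, _, value = attr.partition(" ")
            row[key] = value.strip('"')
    return row


def cache_gtf(gtf_text, out_dir):
    """Write an all-features CSV and a genes-only CSV, plus a pointer file."""
    rows = [parse_gtf_line(line) for line in gtf_text.splitlines()
            if line and not line.startswith("#")]
    extra = sorted({key for row in rows for key in row} - set(GTF_COLUMNS))
    columns = GTF_COLUMNS + extra
    out_dir = Path(out_dir)
    paths = {"all": out_dir / "all_features.csv",
             "genes": out_dir / "genes.csv"}
    subsets = {"all": rows,
               "genes": [row for row in rows if row["feature"] == "gene"]}
    for name, subset in subsets.items():
        with open(paths[name], "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=columns, restval="")
            writer.writeheader()
            writer.writerows(subset)
    # The "paper trail": remember where the CSVs live for later loads.
    pointer = out_dir / "cache.json"
    pointer.write_text(json.dumps({k: str(v) for k, v in paths.items()}))
    return pointer


def load_cached(pointer, genes=False):
    """Read a cached CSV back, located via the pointer file."""
    paths = json.loads(Path(pointer).read_text())
    with open(paths["genes" if genes else "all"], newline="") as fh:
        return list(csv.DictReader(fh))


# Toy records for illustration only (not from a real annotation).
EXAMPLE_GTF = (
    'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";\n'
    'chr1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";\n'
)

with tempfile.TemporaryDirectory() as tmp:
    pointer = cache_gtf(EXAMPLE_GTF, tmp)
    all_rows = load_cached(pointer)
    gene_rows = load_cached(pointer, genes=True)
```

Once the pointer file exists, later loads need only the pointer and never the original `.gtf` path, which is the same convenience `anngtf.load()` offers.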
Updating the `adata.var` table
import anndata
import anngtf
adata = anndata.read_h5ad("/path/to/singlecell/data/adata.h5ad")
gtf = anngtf.load(genes=True)
anngtf.add(adata, gtf)
Since the `anngtf` distribution already knows where the `.csv` / `.gtf` files are, we could annotate `adata` directly without first specifying `gtf` as a DataFrame. That would save a step, but I think it's more user-friendly to see what each one looks like first.
Working advantage
Let's take a look at the time difference of loading a `.gtf` into memory as a `pandas.DataFrame`:
import anngtf
import gtfparse
import time
start = time.time()
gtf = gtfparse.read_gtf("/home/mvinyard/ref/hg38/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/genes/genes.gtf")
stop = time.time()
print("baseline loading time: {:.2f}s".format(stop - start), end='\n\n')
start = time.time()
gtf = anngtf.load()
stop = time.time()
print("anngtf loading time: {:.2f}s".format(stop - start))
baseline loading time: 87.54s
anngtf loading time: 12.46s
~ 7x speed improvement.
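For repeated comparisons, the start/stop bookkeeping can be wrapped in a small helper. The sketch below is one possible way to do that, not part of `anngtf`: `timed` is a hypothetical helper, and the `time.sleep` calls are stand-ins for the two loaders so the block runs without a reference file. The last line recomputes the speedup from the two timings reported above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}

with timed("baseline", results):
    time.sleep(0.02)   # stand-in for gtfparse.read_gtf(...)

with timed("anngtf", results):
    time.sleep(0.005)  # stand-in for anngtf.load()

for label, seconds in results.items():
    print("{} loading time: {:.2f}s".format(label, seconds))

# Speedup implied by the timings reported above (87.54s vs 12.46s).
reported_speedup = 87.54 / 12.46
print("reported speedup: {:.2f}x".format(reported_speedup))
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring intervals.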
- Note: This is not meant to criticize or comment on anything related to `gtfparse`. In fact, this library relies solely on `gtfparse` for the actual parsing of a `.gtf` file into memory as a `pandas.DataFrame`.