Skip to main content

sPYce is a Python package for the cross-species integration of single-nucleus ATAC-seq data (snATAC-seq)

Project description

(Credits avatar: Created with https://deepai.org/machine-learning-model/pop-art-generator)

Single Cell Integration for Single Cell and Single Nucleus ATAC-seq Across Evolution (sPYce)

We present sPYce, a since cell and single nucleus ATAC-seq integration method across different species implemented in Python. sPYce takes single cell data (ie. a cell x feature matrix) and transforms it to single cell k-mer histograms. It was developed for single cell and single nucleus ATAC-seq data, but it can be similarly applied to any single cell data that follows an on-off relationship. In the case of ATAC-seq, the k-mer histogram represents the sequence composition of accessible regions. sPYce allows the straightforward combination of those k-mer histogram over several species, performs appropriate normalization, and defines an easy-to-use interface to run dimensionality reduction, clustering, embedding, visualisation, and automated label transfer.

For more information, please see our documentation with tutorials, instructions, and detailed API explanations.

Installation

The code was implemented and tested using Python=3.9, but any Python version greater than or equal to 3.6 should work. Make sure you have pip installed.

pip

The easiest way to install our package is using pip. Simply run

python3 -m pip install spyceATAC

Using the packaging service also allows to use the entry points (see below).

Git

If you want to contribute or customize the code, you can also clone the repository. Build and install the package from the code by navigating to the folder to which you cloned the repository. Then run

python3 -m build
python3 -m pip install .

This will install sPYce from the files in the repository. Note that if you have changed the code yourself, they'll be reflected in the global installation.

If you only want to install the requirements, use `pip`` via

python3 -m pip install -r requirements.txt

or, if you're a developer, we recommend running

python3 -m pip install -r requirements_dev.txt

Data

sPYce requires as data input the following files:

  • a peak matrix per sample and species, rows representing cells, columns representing peaks
  • a bed file with peaks in the same order as the peaks in the peak matrix. You need to have either one per peak matrix or one per species
  • a reference genome as fasta .fa file per species

If you follow our tutorial on our website, you can download the example data by running:

sh shell/get_testdata.sh /path/to/spyce/dir /output/dir n_cpus

Please replace the paths and the number of used processes with the desired values.

Example

sPYce's use is dependent on your data. Please follow the tutorial for more information. Generally, sPYce follows the workflow

  1. KMer matrix creation sPYce requires a setup file in which file paths per species are saved. This allows easy modification and archiving if you want to test different data versions. We developed the .embed file format to represent hierarchical data easily without knowing any json or xml. It roughly follows a pythonic syntax, and hierarchies are represented as tab indents. Follow the tutorial for more information.

Create your own KMer matrices based on your data using our command-line interface

spyce-create --setup_file /path/to/setup/file --k k [optional parameters]

where you should replace /path/to/setup/file with the path to your own setup file, k with the desired k-mer length, and [optional parameters] with any additional parameter that you want to add. See full description by typing

spyce-create --help
  1. Analysis Once the snATAC-seq peak matrices from all species were converted to a KMer matrix, you can analyse and treat the data (meaning the cell-by-k-mer histogram) as you want. Our interface tries to provide NumPy like interface.
# spyce libraries
from spyce.kmerMatrix import KMerCollection
from spyce.plotting import plot_dr, plot_umap

species_c_vec = ... # define your species color vector

# Load kmer collection
kmer_collect = KMerCollection.load("../data/kmer/tutorial/tutorial_kmer_collection_obj.pkl")
# centered unit-sum normalization
kmer_collect.set_normalization("centered_sum") 

# Perform PCA and get explained variance ratio of the first 3 PCs
explained_var_ratio = kmer_collect.reduce_dimensionality(
    algorithm="pca",
    save_name="pca",
    n_pca_components=n_pca_components
)[:3]

# Remove non-linear species effects that can occur due to unequal cell type distribution
kmer_collect.remove_species_effect(
    batch_vec=species_vec,  # indicate along which values a batch/species effect can occur
    dr_key="pca",  # Use the dimensionality reduction saved under the pca key
    save_name="harmony_pca",  # save the Harmony corrected PCs under harmony_pca 
    algorithm="harmony"  # use Harmony batch correction algorithm
)

# plot PCA
fig, ax = plot_dr(
    kmer_collect,
    ck=species_c_vec,
    dr_key="harmony_pca",
    cmap=None,
    randomize=True,
    title="PCA Species (centered-sum)",
)
handles = [
    Line2D(
        [0], [0], marker="o", color="w", 
        markerfacecolor=c, markersize=5, label=s
    )
    for s, c in species_colors.items()
] 
fig.legend(handles=handles, loc=7)
fig.tight_layout()
fig.subplots_adjust(right=0.7)
plt.show()

# compute UMAP
kmer_collect.umap(
    dr_name="harmony_pca",  # use the Harmony-corrected PCs for UMAP
    n_neighbors=n_umap_neighbors, 
    min_dist=min_dist
)

fig, ax = plot_umap(
    kmer_collect,
    ck=species_c_vec,
    randomize=True,
    title="UMAP Species (centered-sum)",
    cmap=None
)
handles = [
    Line2D(
        [0], [0], marker="o", color="w", 
        markerfacecolor=c, markersize=5, label=s
    )
    for s, c in species_colors.items()
] 
fig.legend(handles=handles, loc=7)
fig.tight_layout()
fig.subplots_adjust(right=0.7)
plt.show()

# Calculate nearest neighbor adjacency matrix to avoid re-calculating it for different leiden parameters.
# However, this can also be done directly by running `run_clustering`.
nn_mat = get_nn_mat(
    kmer_collect.dr["harmony_pca"], 
    n_neighbors=15, 
    verbosity=1, 
    return_distance=True
)

# calculate leiden clustering and test several parameters
kmer_collect.run_clustering(
    algorithm="leiden",  # set algorithm
    save_name="leiden_0.6",  # save result under this name
    resolution=.6,  # additional leiden parameter - resolution
    leiden_beta=0.,  # additional leiden parameter - beta
    adj_mat=nn_mat  # additional leiden parameter - adjacency matrix
)
kmer_collect.run_clustering(
    algorithm="leiden",
    save_name="leiden_0.4",
    resolution=.4,
    leiden_beta=0.,
    adj_mat=nn_mat
)
kmer_collect.run_clustering(
    algorithm="leiden",
    save_name="leiden_0.2",
    resolution=.2,
    leiden_beta=0.,
    adj_mat=nn_mat
)
kmer_collect.run_clustering(
    algorithm="leiden",
    save_name="leiden_0.1",
    resolution=.1,
    leiden_beta=0.,
    adj_mat=nn_mat
)

# plot Leiden embedding
fig, ax = plt.subplots(2, 2, figsize=(8, 8))
x_umap = kmer_collect.x_umap["umap"]  # get UMAP coordinates
idc = np.arange(x_umap.shape[0])
np.random.shuffle(idc)  # randomly shuffle indices

scat = ax.reshape(-1)[0].scatter(
    x_umap[idc, 0],
    x_umap[idc, 1],
    marker=".",
    s=1.,
    cmap="jet",
    c=kmer_collect.clustering["leiden_0.1"][idc],  # set Leiden clustering as color vector
)
ax.reshape(-1)[0].set_title(r"$\sigma=0.1$")

scat = ax.reshape(-1)[1].scatter(
    x_umap[idc, 0],
    x_umap[idc, 1],
    marker=".",
    s=1.,
    cmap="jet",
    c=kmer_collect.clustering["leiden_0.2"][idc],
)
ax.reshape(-1)[1].set_title(r"$\sigma=0.2$")

scat = ax.reshape(-1)[2].scatter(
    x_umap[idc, 0],
    x_umap[idc, 1],
    marker=".",
    s=1.,
    cmap="jet",
    c=kmer_collect.clustering["leiden_0.4"][idc],
)
ax.reshape(-1)[2].set_title(r"$\sigma=0.4$")

scat = ax.reshape(-1)[3].scatter(
    x_umap[idc, 0],
    x_umap[idc, 1],
    marker=".",
    s=1.,
    cmap="jet",
    c=kmer_collect.clustering["leiden_0.6"][idc],
)
ax.reshape(-1)[3].set_title(r"$\sigma=0.6$")

fig.suptitle("Leiden clusters")
fig.tight_layout()
plt.show()

Note that this example is not exhaustive. You can equally perform label transfer and TFBS enrichment analysis. But these steps depend more on your research question in mind. Please see our documentation and tutorials for more information.

Reference

If you found sPYce helpful for your own work, please consider citing. Pre-print coming soon!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spyceatac-0.0.2.tar.gz (68.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

spyceatac-0.0.2-py3-none-any.whl (149.2 kB view details)

Uploaded Python 3

File details

Details for the file spyceatac-0.0.2.tar.gz.

File metadata

  • Download URL: spyceatac-0.0.2.tar.gz
  • Upload date:
  • Size: 68.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.18

File hashes

Hashes for spyceatac-0.0.2.tar.gz
Algorithm Hash digest
SHA256 c38dd7ae24f55effedadfcf1bc19e238f3bc2dd22d061467400aac5de329f6c6
MD5 cd00d6a746c240fd46530d10738a9da4
BLAKE2b-256 98e97fac2939dc5d7fff6699bf822491454c76ab76849f5bfe757bdc7492d19b

See more details on using hashes here.

File details

Details for the file spyceatac-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: spyceatac-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 149.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.18

File hashes

Hashes for spyceatac-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 e84066ac822175b965de86cd885219e5ba8a8f37042b76d5460967c89ce9f218
MD5 bd3a459a20a87ca04d39fad7f4044e79
BLAKE2b-256 204678f53c42005cab40ef27e0416c7e2cd19467da06b7400b60b136171728ab

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page