The Transferase Python API
Project description
pyxfr
The pyxfr package is an API for transferase. Learn more about transferase here. This package allows the same queries to be done within Python as with the transferase command line app. Almost all other utilities for manipulating transferase data are available through pyxfr.
Although documentation is still sparse for pyxfr, each class and function in pyxfr has built-in documentation:
from pyxfr import pyxfr, pyxfr_utils
help(pyxfr)
help(pyxfr_utils)
Requirements
- Linux: Python >= 3.12
- macOS: Python >= 3.12 and macOS >= 13 (at least Ventura)
Usage examples
First we import the pyxfr module so we can set our preferred log level for the session or in your Python scripts. Setting it to "debug" let's us see everything. It will be a lot of information, most of it actually for debugging.
import pyxfr
pyxfr.set_log_level("debug")
Next we want to set up transferase for the user (i.e., your login account) on the host system (e.g., your laptop). The following will do a default setup, and might take up to a minute:
from pyxfr import MConfig
config = MConfig()
config.install(["hg38"])
This will create files in ~/.config/transferase which are safe to delete
anytime because you can just run the same command again. The reason this is
done in two steps is because you might want to change something in the
config before doing the installation. By typing the name of the variable
config you will see a dump of its values. For now, leave them as their
default values -- they only need to be changed if you are using local data.
You can select other genomes in the install step (e.g., mm39, rn7, bosTau9,
etc.). If the genomes don't exist or are not on the server, you should see a
RuntimeError exception in Python, indicating a problem downloading. The
server can't tell the difference between an invalid genome assembly name, one
that is misspelled, and a real one that simply isn't on the server. You can
find the list of available genomes by checking out MethBase2 through the UCSC
Genome Browser, or using a command I will show below.
With the setup has completed, we can get a client object:
from pyxfr import MClient
client = MClient()
The client object is what makes the queries. Our query will be based on a set of genomic intervals, which you would get from a BED format file. However, before working with the genomic intervals we need to first load a genome index, which guarantees that we are working with the exact reference genome that the transferase server expects.
from pyxfr import GenomeIndex
genome_index = GenomeIndex.read(client.get_index_dir(), "hg38")
We will now read genomic intervals. If you have a BED format file for hg38, for example around 100k intervals, you can use it. Otherwise you can find the intervals.bed.gz in the docs directory of the repo (likely alongside this file), gunzip it and put it in your working directory.
from pyxfr import GenomicInterval
intervals = GenomicInterval.read(genome_index, "intervals.bed")
At this point we can do a query:
genome_name = "hg38"
methylome_names = ["ERX9474770","ERX9474769"]
levels = client.get_levels(genome_name, methylome_names, intervals)
The levels is an object of class MLevels, which is a matrix where rows
correspond to intervals and columns correspond to methylomes in the query.
The following loop will allow us to see the first 10 methylation levels that
were retrieved for the second (ERX9474769) of the two methylomes in our
query:
print("\n".join([str(levels.at(i, 1)) for i in range(10)]))
It should look like this:
(4, 1)
(1, 471)
(0, 0)
(45, 29)
(0, 0)
(334, 346)
(62, 1581)
(51, 1755)
(74, 664)
(199, 1753)
If you are doing multiple queries that involve the same set of genomic
intervals, it's more efficient to make a MQuery object out of them. This is
done internally by the query above, but the work to do it can be skipped if
they you already have them. Here is an example:
genome_name = "hg38"
methylome_names = ["ERX9474770","ERX9474769"]
query = genome_index.make_query(intervals)
levels = client.get_levels_covered(genome_name, methylome_names, query)
If you want to see the methylation levels alongside the original genomic intervals, including printing the chromosome names for each genomic interval, you can do it as follows:
for col in range(levels.n_methylomes):
for row in range(10):
print(intervals[row].to_string(genome_index), levels.at(row, col))
print()
The at function for class MLevels will return a tuple of
(n_meth,n_unmeth) and if the _covered version of the query was used, it
will return a tuple of (n_meth,n_unmeth,n_covered). These create the tuple
objects that they return Python. If you want the corresponding values without
creating the tuple, you can get them with the get_n_meth(i,j),
get_n_unmeth(i,j) and get_n_covered(i,j) functions, where i is the
row/interval, and j is the column/methylome. Here is an example:
for i in range(10): # 10 for brevity
for j in range(levels.n_methylomes):
print(levels.get_n_covered(i, j), end=' ')
print()
The get_n_covered(i,j) will raise an exception if the information about
covered sites was not requested in the query. There is also a get_wmean(i,j)
function that can be applied similarly, which gives the weighted mean
methylation level for the corresponding interval. This is likely the most
useful summary statistic in all of methylome analysis.
You can convert an MLevels object into a numpy array, which is already
familiar to many Python users. The following commands assume that the levels
object was obtained from a get_levels_covered query, hence the 3 as the
final dimension in a.shape:
a = levels.view_nparray()
methylome_names = ["ERX9474770","ERX9474769"]
assert a.shape == (len(methylome_names), len(intervals), 3)
The intervals in that command came from an earlier command above. The full
set of weighted mean methylation levels can be obtained as a numpy array
(matrix) using the all_wmeans(min_reads) function, applied to the entire
levels object. The min_reads parameter indicates a value below which the
fraction is not interpretable. This is needed, because the information about
whether there are even any reads at all for a given interval would be
lost. Entries without enough reads are assigned a value of -1.0. Here is an
example (the output is a numpy array):
min_reads = 2
means = levels.all_wmeans(min_reads)
print(means[0:10])
As we have seen, the queries named with _covered return, along with
methylation levels, the number of CpG sites covered by reads in each query
interval. It's also helpful to know how many total CpG sites are in the
interval. This is a property of the reference genome, not a particular
methylome. If you have a set of intervals and a GenomeIndex, you can get the
total number of CpG sites in each interval like this:
n_cpgs_intervals = genome_index.get_n_cpgs(intervals)
print("\n".join([str(i) for i in n_cpgs_intervals[0:10]]))
Two other kinds of queries are supported: bins and windows. Bins are fixed-width intervals that tile the genome -- no CpG is in more than one bin. Windows slide, so each CpG can fall within multiple windows. These are based on genome coordinates and not the CpG identify. Here is an example of a bins query:
bin_size = 10000
genome_name = "hg38"
methylome_names = ["ERX9474770","ERX9474769"]
levels = client.get_levels(genome_name, methylome_names, bin_size)
Since the human genome is a bit more than 3G in size, this query will return a bit more than 300,000 level values for each of the two query methylomes. Be careful with a query like this, because you can easily create massive output files if your bins are too small. The public transferase server will also restrict queries for very small bin sizes -- 100bp bins could mean many GB of data transmitted.
A windows query is like a bins query but the windows slide and are therefore overlapping. A step size is needed in addition to a window size:
window_size = 30000
window_step = 10000
genome_name = "hg38"
methylome_names = ["ERX9474770","ERX9474769"]
levels = client.get_levels(genome_name, methylome_names, window_size, window_step)
The above command uses a window size of 30Kbp, and a step size of 10Kbp. The number of windows will be just over 300,000, because one window begins at each 10Kbp offset. But in contrast with the bins query, each of the windows will span 30Kbp.
The output format for bins and windows queries is the same as for queries
based on a given set of genomic intervals. This means you can use the same
functions to manipulate the levels, for example get_wmean(i,j) or
get_n_meth(i, j), and the conversion to numpy arrays.
You can get the metadata for MethBase2 methylomes available on the public transferase server. Here is the command and along with what you might expect to see:
methbase_metadata = pyxfr_utils.load_methbase_metadata("hg38")
sum(methbase_metadata.sample_name == "Sperm")
# 174
list(methbase_metadata.columns.values)
# ['study', 'experiment', 'bsrate', 'meth', 'depth', 'mapping', 'uniq',...
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pyxfr-0.6.4-cp313-none-manylinux_2_17_x86_64.whl.
File metadata
- Download URL: pyxfr-0.6.4-cp313-none-manylinux_2_17_x86_64.whl
- Upload date:
- Size: 2.2 MB
- Tags: CPython 3.13, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
292ceb741a50cee52336ead964a1d1cf8cf5ab0cfdbc870f9a9f46170e0932fe
|
|
| MD5 |
5e19240347bf7ee013fdc70ae0a11f50
|
|
| BLAKE2b-256 |
3bdc81b79aba05ee2623cf350d7a880553e55f4f4d60b19dccda48648409f340
|
File details
Details for the file pyxfr-0.6.4-cp313-none-macosx_13_0_x86_64.whl.
File metadata
- Download URL: pyxfr-0.6.4-cp313-none-macosx_13_0_x86_64.whl
- Upload date:
- Size: 2.4 MB
- Tags: CPython 3.13, macOS 13.0+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7c94772bc99e951e61072cb5b971fea4c6e24bf8d8791b38f7e220a717b1fabb
|
|
| MD5 |
4fc1dc18d57dd0043d655546e50da70f
|
|
| BLAKE2b-256 |
8beb1c525f5e8d19838c695f248bb0e3b25adc0745bdb82fbc80c8870f3b6f38
|
File details
Details for the file pyxfr-0.6.4-cp313-none-macosx_13_0_arm64.whl.
File metadata
- Download URL: pyxfr-0.6.4-cp313-none-macosx_13_0_arm64.whl
- Upload date:
- Size: 2.3 MB
- Tags: CPython 3.13, macOS 13.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
da9c7e1d704ea8fffa752d54e8a590e21a6e6244ee06ddc273719eb7fbecd324
|
|
| MD5 |
04b63f7fb99bbbd892552a30c544f79a
|
|
| BLAKE2b-256 |
d2c496d14c4052355d7265d70f113ca3896fb6e3e1ae8cdff6fba7dde0c2417a
|
File details
Details for the file pyxfr-0.6.4-cp312-none-manylinux_2_17_x86_64.whl.
File metadata
- Download URL: pyxfr-0.6.4-cp312-none-manylinux_2_17_x86_64.whl
- Upload date:
- Size: 2.2 MB
- Tags: CPython 3.12, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5a538279b4fa96da9b98ed324c19658948423bea3c553b0fe3e926d2d30018fc
|
|
| MD5 |
d871aee1afec3c6985c765299ad4bfca
|
|
| BLAKE2b-256 |
4c1a445f4e4744c4c19bda4fa1572b999cb274b5a0ceb40122ed0cb59631fa18
|
File details
Details for the file pyxfr-0.6.4-cp312-none-macosx_13_0_x86_64.whl.
File metadata
- Download URL: pyxfr-0.6.4-cp312-none-macosx_13_0_x86_64.whl
- Upload date:
- Size: 2.4 MB
- Tags: CPython 3.12, macOS 13.0+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
db4251b186d59447815f018e53a6adec7be31be93c91209a0de9786583754c71
|
|
| MD5 |
3e8f0721b4ffc199befbf586c841f408
|
|
| BLAKE2b-256 |
7f7a605ff324a800753aa2a82585e5553e3b17aed1e99848b5fe68959f0ea790
|
File details
Details for the file pyxfr-0.6.4-cp312-none-macosx_13_0_arm64.whl.
File metadata
- Download URL: pyxfr-0.6.4-cp312-none-macosx_13_0_arm64.whl
- Upload date:
- Size: 2.3 MB
- Tags: CPython 3.12, macOS 13.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c43d85a6dae6eda71f8c4dfd8b1bb5b8dce87c8dfbb118b2a5a61db4a5672f5f
|
|
| MD5 |
c26cee04f6582659297769a2a1a54ef3
|
|
| BLAKE2b-256 |
bb15c6a6b2f32ecead74ff6a023efd8426b9e9b9d87b41ea7a344fe4a39a3012
|