A Python library to fetch and process homolog data from Phytozome.
PhytoMiner
PhytoMiner is a Python library for efficiently fetching and processing gene homolog data from the Phytozome database via its InterMine API.
This library is designed to simplify complex, iterative bioinformatic queries, allowing researchers to trace gene homology across multiple species with ease.
Features
- Initial Fetch: Start a search with a list of genes from a source organism.
- Iterative Search: Perform chained or subsequent searches using homologs found in previous steps.
- Parallel Processing: Uses multithreading to fetch data in parallel, which speeds up large queries.
- Data Processing: Includes functions to clean, de-duplicate, and enrich the fetched data by calculating occurrence counts and aggregating source information.
- Visualisation: Comes with a utility function to quickly generate a heatmap of homolog distribution across species and subunits.
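To make the Data Processing feature more concrete, here is a minimal sketch of de-duplicating homolog hits while counting occurrences and aggregating source information. It does not use PhytoMiner itself: the record fields (`gene`, `organism`, `subunit`, `source`), the sample values, and the function `summarize_homologs` are all hypothetical, and plain Python records stand in for the DataFrames the library works with.

```python
from collections import defaultdict

# Hypothetical homolog hits; field names and values are illustrative only.
records = [
    {"gene": "homolog_1", "organism": "S. bicolor", "subunit": "CRR1", "source": "A. thaliana"},
    {"gene": "homolog_1", "organism": "S. bicolor", "subunit": "CRR1", "source": "O. sativa"},
    {"gene": "homolog_1", "organism": "S. bicolor", "subunit": "CRR1", "source": "A. thaliana"},
    {"gene": "homolog_2", "organism": "O. sativa", "subunit": "CRR2", "source": "A. thaliana"},
]


def summarize_homologs(rows):
    """Collapse duplicate hits, counting occurrences and collecting sources."""
    grouped = defaultdict(lambda: {"count": 0, "sources": set()})
    for row in rows:
        key = (row["gene"], row["organism"], row["subunit"])
        grouped[key]["count"] += 1
        grouped[key]["sources"].add(row["source"])
    # One output row per unique (gene, organism, subunit) combination.
    return [
        {
            "gene": gene,
            "organism": organism,
            "subunit": subunit,
            "occurrences": info["count"],
            "sources": sorted(info["sources"]),
        }
        for (gene, organism, subunit), info in grouped.items()
    ]


summary = summarize_homologs(records)
for row in summary:
    print(row)
```

A high occurrence count backed by several distinct sources suggests that the same homolog was reached from more than one starting point, which is the kind of signal the enrichment step is meant to surface.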
Installation
You can install the latest PhytoMiner release directly from PyPI:
pip install phytominer
Usage
Here is a complete example of a common workflow:
- Start with a set of known genes in a source organism (e.g. A. thaliana TAIR10).
- Perform an initial_fetch to find homologs in other species.
- Use the results to perform a subsequent_fetch for a specific target organism (S. bicolor v3.1.1).
- Combine and process the data.
- Visualize the homolog distribution with pivotmap.
import pandas as pd
from phytominer import (
    initial_fetch,
    subsequent_fetch,
    process_homolog_data,
    pivotmap,
    print_summary,
)
# 1. Define initial query genes for Arabidopsis thaliana
# (Using a small subset for this example)
athaliana_genes = {
    'AT5G52100': 'CRR1',
    'AT3G46790': 'CRR2',
    'AT2G01590': 'CRR3',
}
# 2. Run the whole pipeline
from phytominer.workflow import run_homologs_pipeline
from phytominer.config import DEFAULT_MAX_WORKERS
results = run_homologs_pipeline(
    initial_organism="athaliana",
    initial_genes_dict=athaliana_genes,
    subsequent_organisms=["osativa", "slycopersicum"],
    max_workers=DEFAULT_MAX_WORKERS,
    checkpoint_dir="homolog_checkpoints"
)
# 3. Expression Data Fetch Workflow
from phytominer.workflow import run_expressions_workflow
from phytominer.config import (
    JOIN2_OUTPUT_FILE,
    EXPRESSION_CHECKPOINT_DIR,
    EXPRESSIONS_OUTPUT_FILE
)
from phytominer.processing import load_master_df, fetch_expression_data
run_expressions_workflow(
    master_file=JOIN2_OUTPUT_FILE,
    checkpoint_dir=EXPRESSION_CHECKPOINT_DIR,
    output_file=EXPRESSIONS_OUTPUT_FILE,
    fetch_expression_data=fetch_expression_data,
    load_master_df=load_master_df
)
# 4. Alternatively perform the initial fetch from Arabidopsis thaliana
print("--- Starting Initial Fetch ---")
initial_df = initial_fetch(
    source_organism_name="A. thaliana TAIR10",
    transcript_names=list(athaliana_genes.keys()),
    subunit_dict=athaliana_genes,
    max_workers=4
)
print_summary(initial_df, "Initial Fetch Results")
# 5. Perform a subsequent fetch using homologs found in Sorghum bicolor
print("\n--- Starting Subsequent Fetch for Sorghum bicolor ---")
subsequent_df = subsequent_fetch(
    current_master_df=initial_df,
    target_organism_name="S. bicolor v3.1.1",
    max_workers=4
)
print_summary(subsequent_df, "Subsequent Fetch Results for Sorghum")
# 6. Combine and process the data
print("\n--- Combining and Processing Data ---")
master_df = pd.concat([initial_df, subsequent_df], ignore_index=True)
processed_df = process_homolog_data(master_df)
print_summary(processed_df, "Final Processed DataFrame")
# 7. Visualize the homolog distribution (empty cells indicate missing genes)
print("\n--- Generating Heatmap ---")
# For a cleaner plot, let's display the top 15 organisms by homolog count
top_organisms = processed_df['organism.shortName'].value_counts().nlargest(15).index
filtered_df = processed_df[processed_df['organism.shortName'].isin(top_organisms)]
pivot_table = pivotmap(filtered_df)
print("\nPivot Table Head:")
print(pivot_table.head())
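The pivot step at the end of the example can be pictured as a count matrix of organisms against subunits. The sketch below builds such a matrix in plain Python so it runs without PhytoMiner or pandas. It is an illustration, not the implementation of pivotmap: the helper `count_matrix`, the `subunit` field name, the sample rows, and the choice of organisms as rows are all assumptions.

```python
from collections import Counter


def count_matrix(rows, index_key, column_key):
    """Count rows per (index, column) pair and fill missing pairs with 0."""
    counts = Counter((row[index_key], row[column_key]) for row in rows)
    index = sorted({r for r, _ in counts})
    columns = sorted({c for _, c in counts})
    return {r: {c: counts.get((r, c), 0) for c in columns} for r in index}


# Hypothetical homolog rows; only 'organism.shortName' appears in the example above.
rows = [
    {"organism.shortName": "S. bicolor", "subunit": "CRR1"},
    {"organism.shortName": "S. bicolor", "subunit": "CRR1"},
    {"organism.shortName": "S. bicolor", "subunit": "CRR2"},
    {"organism.shortName": "O. sativa", "subunit": "CRR1"},
]

matrix = count_matrix(rows, "organism.shortName", "subunit")
for organism, row in matrix.items():
    print(organism, row)
```

A zero in this matrix marks an organism for which no homolog of that subunit was found, which is what a heatmap of the table makes easy to spot.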
API Overview
Core Functions
- initial_fetch(source_organism_name, transcript_names, subunit_dict, max_workers): Kicks off the homolog search with a defined set of genes.
- subsequent_fetch(current_master_df, target_organism_name, max_workers): Expands the search by using the results from a previous fetch as input for a new target organism.
- run_homologs_pipeline(initial_organism, initial_genes_dict, subsequent_organisms, max_workers, checkpoint_dir): Run the full homolog search pipeline.
- run_expressions_workflow(master_file, checkpoint_dir, output_file, fetch_expression_data_for_gene_chunk, load_master_df, ...): Fetch and merge expression data for all subunits.
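Several of the core functions take a max_workers argument. As a rough illustration of what such a parameter usually controls, the sketch below fans a list of gene IDs out over a thread pool using the standard library. It makes no network calls and is not PhytoMiner's code: `fake_fetch` and `fetch_all` are hypothetical stand-ins, and the returned records are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(gene_id):
    """Stand-in for a network request; returns a placeholder record."""
    return {"query": gene_id, "homologs": [f"{gene_id}_homolog"]}


def fetch_all(gene_ids, max_workers=4):
    """Run fake_fetch for every gene ID on a pool of worker threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the gene ID it was submitted for.
        futures = {pool.submit(fake_fetch, gene_id): gene_id for gene_id in gene_ids}
        for future in as_completed(futures):
            gene_id = futures[future]
            try:
                results[gene_id] = future.result()
            except Exception as exc:
                # Record the failure instead of aborting the whole batch.
                results[gene_id] = {"query": gene_id, "error": str(exc)}
    return results


results = fetch_all(["AT5G52100", "AT3G46790", "AT2G01590"], max_workers=2)
print(sorted(results))
```

Threads suit this kind of workload because each request spends most of its time waiting on the remote server, so a small pool can keep several requests in flight at once.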
Utility Functions
- pivotmap(dataframe, index, columns, values): Generates a pivot table and a corresponding heatmap to visualize the count of homologs.
- print_summary(df, stage_message): Prints a quick summary of a DataFrame's shape and contents.
- load_master_df(filepath): Load and validate the master homolog DataFrame.
- fetch_expression_data(gene_id_chunk, subunit_name_for_context, chunk_num, total_chunks): Fetch expression data for chunks of gene IDs.
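The signature of fetch_expression_data suggests that gene IDs are processed in numbered chunks. The sketch below shows one way to split a list into chunks while tracking chunk_num and total_chunks. The helper `iter_chunks`, the chunk size, and the 1-based numbering are assumptions made for illustration; the library may chunk its input differently.

```python
def iter_chunks(items, chunk_size):
    """Yield (chunk_num, total_chunks, chunk) with 1-based chunk numbers."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Ceiling division so a final partial chunk is still counted.
    total_chunks = (len(items) + chunk_size - 1) // chunk_size
    for chunk_num in range(total_chunks):
        start = chunk_num * chunk_size
        yield chunk_num + 1, total_chunks, items[start:start + chunk_size]


# Hypothetical gene IDs, used only to show the chunk boundaries.
gene_ids = [f"gene_{i}" for i in range(7)]
chunks = list(iter_chunks(gene_ids, 3))

for chunk_num, total_chunks, chunk in chunks:
    print(f"chunk {chunk_num}/{total_chunks}: {chunk}")
```

Passing the chunk number and total alongside each chunk makes it straightforward to log progress or to resume from the last completed chunk.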
Continuous Integration & Deployment
This project uses GitHub Actions for automated testing and publishing.
- Automated Testing: Every push to the main branch triggers the test suite using Python 3.9.
- Automated Publishing: When a new release is published on GitHub, the package is automatically built and uploaded to PyPI.
You can find the workflow configuration in .github/workflows/python-publish.yml.
Contributing
Contributions are welcome! If you have a suggestion or find a bug, please open an issue. Pull requests are also encouraged.
- Fork the repository.
- Create your feature branch (git checkout -b feature/AmazingFeature).
- Commit your changes (git commit -m 'Add some AmazingFeature').
- Push to the branch (git push origin feature/AmazingFeature).
- Open a Pull Request.
Running Tests Locally
To run the test suite locally:
pip install -e .[dev]
pytest
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contact
Author: Kris Kari
Email: toffe.kari@gmail.com