A Python library to fetch and process homolog data from Phytozome.

PhytoMiner

This is a package for fetching Phytozome data

A Python library for efficiently fetching and processing gene homolog data from the Phytozome database via its InterMine API.

The library simplifies iterative bioinformatic queries, so researchers can trace gene homology across multiple species without writing the InterMine queries by hand.

Features

  • Initial Fetch: Start a search with a list of genes from a source organism.
  • Iterative Search: Perform chained or subsequent searches using homologs found in previous steps.
  • Parallel Processing: Uses multithreading to fetch data in parallel, which speeds up large queries.
  • Data Processing: Includes functions to clean, de-duplicate, and enrich the fetched data by calculating occurrence counts and aggregating source information.
  • Visualization: Includes a utility function that generates a heatmap of homolog distribution across species and subunits.
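The parallel fetching mentioned above can be pictured with a thread pool from the standard library. This is only a minimal sketch of the general pattern, not PhytoMiner's actual implementation: `fetch_homologs_stub` and `fetch_all` are hypothetical names, and the stub returns a fake record instead of querying Phytozome.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_homologs_stub(gene_id: str) -> dict:
    """Hypothetical stand-in for one network query; returns a fake record."""
    return {"gene": gene_id, "homolog_count": len(gene_id)}


def fetch_all(gene_ids, fetch_one=fetch_homologs_stub, max_workers=4):
    """Run fetch_one for every gene ID on a pool of worker threads.

    pool.map yields results in the same order as the input IDs,
    regardless of which thread finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, gene_ids))


records = fetch_all(["AT1G74840", "ATCG00520", "AT4G05320"])
print(records)
```

Threads suit this kind of workload because each query spends most of its time waiting on the network rather than computing.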

Installation

You can install PhytoMiner directly from PyPI:

pip install phytominer
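To confirm that the installation succeeded, one option is to look up the installed distribution from Python. The helper below is a hypothetical convenience function, not part of PhytoMiner; it relies only on the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "phytominer"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version())
```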

Usage

Here is a complete example of a common workflow:

  1. Start with a set of known genes in a source organism (e.g. A. thaliana TAIR10).
  2. Perform an initial_fetch to find homologs in other species.
  3. Use the results to perform a subsequent_fetch for a specific target organism (S. bicolor v3.1.1).
  4. Combine and process the data.
  5. Visualize the homolog distribution with pivotmap.

import pandas as pd
from phytozome_fetcher import (
    initial_fetch,
    subsequent_fetch,
    process_homolog_data,
    pivotmap,
    print_summary
)

# 1. Define initial query genes for Arabidopsis thaliana
# (Using a small, representative subset for this example)
athaliana_genes = {
    'AT1G74840': 'NdhB',
    'ATCG00520': 'NdhD',
    'AT4G05320': 'PnsL4',
}

# 2. Perform the initial fetch from Arabidopsis thaliana
print("--- Starting Initial Fetch ---")
initial_df = initial_fetch(
    source_organism_name="A. thaliana TAIR10",
    transcript_names=list(athaliana_genes.keys()),
    subunit_dict=athaliana_genes,
    max_workers=4
)
print_summary(initial_df, "Initial Fetch Results")

# 3. Perform a subsequent fetch using homologs found in Sorghum bicolor
print("\n--- Starting Subsequent Fetch for Sorghum bicolor ---")
sorghum_df = subsequent_fetch(
    current_master_df=initial_df,
    target_organism_name="S. bicolor v3.1.1",
    max_workers=4
)
print_summary(sorghum_df, "Subsequent Fetch Results for Sorghum")

# 4. Combine and process the data
print("\n--- Combining and Processing Data ---")
master_df = pd.concat([initial_df, sorghum_df], ignore_index=True)
processed_df = process_homolog_data(master_df)
print_summary(processed_df, "Final Processed DataFrame")

# 5. Visualize the results
print("\n--- Generating Heatmap ---")
# For a cleaner plot, let's display the top 15 organisms by homolog count
top_organisms = processed_df['organism.shortName'].value_counts().nlargest(15).index
filtered_df = processed_df[processed_df['organism.shortName'].isin(top_organisms)]

pivot_table = pivotmap(filtered_df)
print("\nPivot Table Head:")
print(pivot_table.head())
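The example above runs a single subsequent fetch. To chain searches over several target organisms, the same step can be wrapped in a loop. The sketch below is an assumption about how such a loop might look: `chain_fetches` is a hypothetical helper, and the fetch function is passed in as a parameter so that the loop can be tried with a stub before going to the network.

```python
from typing import Callable, Iterable

import pandas as pd

# A fetch function takes the accumulated results and a target organism name
# and returns the newly found rows.
FetchFn = Callable[[pd.DataFrame, str], pd.DataFrame]


def chain_fetches(
    initial_df: pd.DataFrame, targets: Iterable[str], fetch_fn: FetchFn
) -> pd.DataFrame:
    """Fetch each target in turn, feeding all earlier results into the next step."""
    master = initial_df
    for target in targets:
        new_rows = fetch_fn(master, target)
        master = pd.concat([master, new_rows], ignore_index=True)
    return master


# Stub fetch function for a dry run; it records how many rows it was given.
def fake_fetch(df: pd.DataFrame, target: str) -> pd.DataFrame:
    return pd.DataFrame({"organism.shortName": [target], "rows_seen": [len(df)]})


start = pd.DataFrame({"organism.shortName": ["A. thaliana"], "rows_seen": [0]})
print(chain_fetches(start, ["S. bicolor", "Z. mays"], fake_fetch))
```

With the library installed, the stub could be swapped for a small wrapper such as `lambda df, t: subsequent_fetch(current_master_df=df, target_organism_name=t, max_workers=4)`, using the parameter names from the example above.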

API Overview

Core Functions

  • initial_fetch(source_organism_name, transcript_names, subunit_dict, max_workers): Kicks off the homolog search with a defined set of genes.
  • subsequent_fetch(current_master_df, target_organism_name, max_workers): Expands the search by using the results from a previous fetch as input for a new target organism.
  • process_homolog_data(df_combined): Takes a raw DataFrame of combined fetch results and performs cleaning, aggregation, and de-duplication. It adds homolog.occurrences and origin.source.organisms columns.
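The kind of aggregation that process_homolog_data is described as performing can be sketched with a pandas group-by. This is not the library's code. The output column names `homolog.occurrences` and `origin.source.organisms` come from the description above, while the input columns `homolog.id` and `source.organism`, the function name `summarize_homologs`, and the choice to count raw rows before de-duplication are all assumptions made for illustration.

```python
import pandas as pd


def summarize_homologs(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse repeated hits into one row per homolog and organism.

    Assumed input columns (hypothetical): 'homolog.id', 'organism.shortName',
    'source.organism'.
    """
    keys = ["homolog.id", "organism.shortName"]
    return df.groupby(keys, as_index=False).agg(
        **{
            # Number of raw rows in which this homolog appeared.
            "homolog.occurrences": ("source.organism", "size"),
            # Distinct source organisms, sorted and joined into one string.
            "origin.source.organisms": (
                "source.organism",
                lambda s: ", ".join(sorted(set(s))),
            ),
        }
    )


raw = pd.DataFrame(
    {
        "homolog.id": ["H1", "H1", "H1", "H2"],
        "organism.shortName": ["S. bicolor", "S. bicolor", "S. bicolor", "Z. mays"],
        "source.organism": ["A. thaliana", "O. sativa", "A. thaliana", "A. thaliana"],
    }
)
print(summarize_homologs(raw))
```

The dictionary unpacking is needed because the output column names contain dots and so cannot be written as ordinary keyword arguments.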

Utility Functions

  • pivotmap(dataframe, index, columns, values): Generates a pivot table and a corresponding heatmap to visualize the count of homologs.
  • print_summary(df, stage_message): Prints a quick summary of a DataFrame's shape and contents.
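A count table like the one pivotmap is described as producing can be approximated with `pandas.crosstab`. This sketch covers only the tabulation step and leaves out the heatmap. The helper name `count_table` and the `subunit` column are assumptions, whereas `organism.shortName` appears in the usage example above.

```python
import pandas as pd


def count_table(
    df: pd.DataFrame,
    index: str = "organism.shortName",
    columns: str = "subunit",  # hypothetical column name
) -> pd.DataFrame:
    """Count rows for every (organism, subunit) pair; missing pairs become 0."""
    return pd.crosstab(df[index], df[columns])


hits = pd.DataFrame(
    {
        "organism.shortName": ["S. bicolor", "S. bicolor", "S. bicolor", "Z. mays"],
        "subunit": ["NdhB", "NdhB", "NdhD", "NdhB"],
    }
)
table = count_table(hits)
print(table)
```

The resulting table has one row per organism and one column per subunit, which is the shape a heatmap function would typically expect.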

Contributing

Contributions are welcome! If you have a suggestion or find a bug, please open an issue. Pull requests are also encouraged.

  1. Fork the repository.
  2. Create your feature branch (git checkout -b feature/AmazingFeature).
  3. Commit your changes (git commit -m 'Add some AmazingFeature').
  4. Push to the branch (git push origin feature/AmazingFeature).
  5. Open a Pull Request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
