Deep learning software package to predict geographic coordinates from genome-wide SNP data.

These details have not been verified by PyPI

Project links

Project description

GeoGenIE User Manual

GeoGenIE Logo {width=40%}

Introduction

GeoGenIE (Geographic-Genetic Inference Engine) is a comprehensive software tool designed to predict geographic coordinates (longitude and latitude) from genetic SNP data. GeoGenIE utilizes deep learning models and offers several advanced features such as automated parameter tuning with Optuna [1], outlier detection to remove individuals that do not conform to expected geographic and genetic patterns (e.g., isolation by distance) [3], handling of imbalanced sampling using a custom oversampling algorithm adapted from SMOTE [2], and weighting the loss function by inverse sampling densities. The software is user-friendly and provides extensive visualizations and metrics for evaluating predictions, making it a robust and accurate solution for geographic-genetic inference.

Installation

To install GeoGenIE, it is recommended to use a virtual environment or conda environment. From the root project directory, enter the following command while in the root project directory.

pip install GeoGenIE

Dependencies

The following packages will be installed when running pip install GeoGenIE:

geopandas
geopy
imblearn
jenkspy
kaleido
kneed
matplotlib
numba
numpy
optuna
pandas
plotly
pykrige
pynndescent
pysam
pyyaml
requests
scikit-learn
scipy
seaborn
statsmodels
torch
xgboost

Usage

Running GeoGenIE

GeoGenIE can be run with individual command-line arguments, but using a YAML config file is recommended. See config_files/config.yaml for an example YAML file. Assuming GeoGenIE is installed in your environment, you can run it like this:

geogenie --config config_files/config.yaml

Command-line Options

You can see all the command-line options by running the help flag:

geogenie -h

Note: If you don't want to use the configuration file, you can specify each argument individually on the command line. For example:

geogenie --output_dir "my_analysis" --prefix "my_prefix" <other arguments>

Configuration File

You can set all the options for input files, model parameters, etc., in the config_files/config.yaml file. Using a configuration file allows tracking of parameters across multiple runs and ensures better reproducibility.

Python None values are represented by null (without quotes).
Python True values are represented by true (all lowercase, no quotes).
Python False values are represented by false (all lowercase, no quotes).

You can also leave comments with # my_comment. The arguments can be in any order in the config_files/config.yaml file.

Running the Software

geogenie --config config_files/config.yaml

Option	Description	Default	Notes
`--vcf`	Path to the VCF file with SNPs.	None	Required input file.
`--sample_data`	Path to the coordinates file.	None	Required input file.
`--known_sample_data`	Path to the known coordinates file.	None	Used for per-sample bootstrapped output plots.
`--prop_unknowns`	Proportion of samples set to "unknown."	0.1	Useful for testing model prediction.
`--min_mac`	Minimum minor allele count to retain SNPs.	2	Filters rare variants.
`--max_SNPs`	Maximum number of SNPs to randomly subset.	None	Reduces computational load.
`--embedding_type`	Embedding type for input SNP dataset.	"none"	Options: 'pca', 'kernelpca', 'nmf', 'lle', 'mca', 'mds', 'polynomial', 'tsne', and 'none'.
`--n_components`	Number of components for PCA/tSNE embeddings.	None	Auto-detected for optimal performance.
`--embedding_sensitivity`	Sensitivity for optimal component selection.	1.0	Adjust for balance between overfitting and underfitting.
`--tsne_perplexity`	Perplexity setting for T-SNE embedding.	30	Lower values for finer details.
`--polynomial_degree`	Polynomial degree for 'polynomial' embedding.	2	Higher degrees increase complexity and computational overhead.
`--n_init`	Initialization runs for Multi-Dimensional Scaling embedding.	4	More initialization runs provide stable results but increase computation time.
`--nlayers`	Number of hidden layers in the neural network.	10	Adjust for model complexity.
`--width`	Number of neurons per layer.	256	Controls model capacity.
`--dropout_prop`	Dropout rate to prevent overfitting.	0.25	Regularization technique.
`--criterion`	Model loss criterion.	"rmse"	Options: 'rmse', 'huber', 'drms'.
`--load_best_params`	Load best parameters from previous Optuna search.	None	Save time by reusing optimized parameters.
`--use_gradient_boosting`	Use Gradient Boosting model instead of deep learning model.	False	Deprecated; may be removed in future versions.
`--dtype`	PyTorch data type.	"float32"	Options: 'float32', 'float64'.
`--batch_size`	Training batch size.	32	Affects training stability and memory usage.
`--max_epochs`	Maximum number of training epochs.	5000	Early stopping will terminate training earlier if conditions are met.
`--learning_rate`	Learning rate for the optimizer.	0.001	Adjust to control training speed.
`--l2_reg`	L2 regularization weight.	0.0	Reduces overfitting.
`--early_stop_patience`	Epochs to wait after no improvement before stopping.	48	Prevents excessive training time when no further improvements are observed.
`--train_split`	Proportion of data used for training.	0.8	Validation and test splits must sum to 1.0.
`--val_split`	Proportion of data used for validation.	0.2	Adjust based on the dataset size.
`--do_bootstrap`	Enable bootstrap replicates.	False	Generates confidence intervals for predictions.
`--nboots`	Number of bootstrap replicates.	100	Increase for more accurate confidence intervals.
`--do_gridsearch`	Perform Optuna parameter search.	False	Optimizes model parameters for better performance.
`--n_iter`	Iterations for parameter optimization using Optuna.	100	Higher values result in more thorough parameter searches.
`--lr_scheduler_patience`	Learning rate scheduler patience.	16	Reduce learning rate after no improvement over specified epochs.
`--lr_scheduler_factor`	Factor to reduce learning rate when scheduler is triggered.	0.5	Lower values make finer adjustments to the learning rate.
`--factor`	Scale factor for neural network widths in successive layers.	1.0	Controls width reduction of hidden layers.
`--grad_clip`	Enable gradient clipping.	False	Prevents gradient explosion in deep networks.
`--use_weighted`	Use inverse-weighted probability sampling.	"none"	Options: 'loss', 'sampler', 'both', or 'none'.
`--oversample_method`	Synthetic oversampling method.	"none"	Options: 'kmeans', 'none'.
`--oversample_neighbors`	Number of nearest neighbors for oversampling.	5	Controls synthetic sample generation.
`--n_bins`	Number of bins for synthetic resampling.	8	Higher values enable finer-grained sampling density adjustments.
`--use_kmeans`	Use KMeans clustering for weighted loss.	False	Weighted loss focuses on undersampled geographic regions.
`--use_kde`	Use Kernel Density Estimation to obtain sample weights.	False	Experimental feature.
`--use_dbscan`	Use DBSCAN clustering to obtain sample weights.	False	Experimental feature.
`--w_power`	Exponential power for inverse density weighting.	1.0	Controls the emphasis on undersampled regions.
`--max_clusters`	Maximum number of clusters for KMeans.	10	Higher values allow more detailed clustering.
`--focus_regions`	Geographic regions of interest for sampling density weights.	None	Format: [(lon_min, lon_max, lat_min, lat_max), ...].
`--normalize_sample_weights`	Normalize density-based sample weights from 0 to 1.	False	Ensures all sample weights are on a comparable scale.
`--detect_outliers`	Enable outlier detection to remove translocated individuals.	False	Helps to clean the dataset.
`--min_nn_dist`	Minimum distance between nearest neighbors for outlier detection (meters).	1000	Adjust to control outlier detection stringency.
`--scale_factor`	Geographic distance scaling factor for outlier detection.	100	Controls how geographic distances are evaluated.
`--significance_level`	Significance level (alpha) for outlier detection.	0.05	P-values ≤ alpha are removed as outliers.
`--maxk`	Maximum number of nearest neighbors for outlier detection.	50	Controls the scope of neighbor comparisons.
`--show_plots`	Display plots inline (e.g., in Jupyter Notebooks).	False	Visualization aid.
`--fontsize`	Font size for plot labels, ticks, and titles.	24	Adjust to improve readability of plots.
`--filetype`	File type for saving plots.	"png"	Options: 'png', 'pdf', 'jpg'.
`--plot_dpi`	DPI for image format plots.	300	Higher DPI produces clearer images but increases file size.
`--remove_splines`	Remove bottom and left axis splines from map plots.	False	Use for cleaner map visualizations.
`--shapefile`	URL or file path for shapefile used in plotting.	Default USA map	Specify a custom shapefile for regions outside the USA.
`--basemap_fips`	FIPS code for basemap.	None	Specify to focus on specific states or regions (e.g., "05" for Arkansas).
`--highlight_basemap_counties`	Highlight specified counties on the base map in gray.	None	Useful for emphasizing areas of interest.
`--samples_to_plot`	Comma-separated sample IDs for per-sample bootstrap plots.	None	Defaults to plotting all samples.
`--n_contour_levels`	Number of contour levels for the Kriging plot.	20	Adjust for more detailed or simplified contour plots.
`--min_colorscale`	Minimum colorbar value for the Kriging plot.	0	Set to match the range of your data.
`--max_colorscale`	Maximum colorbar value for the Kriging plot.	300	Increase if your data range exceeds the default.
`--sample_point_scale`	Scale factor for sample point size on Kriging plot.	2	Adjust to ensure points are visible but not overwhelming.
`--bbox_buffer`	Buffer for the sampling bounding box on map visualizations.	0.1	Increase to add more context around the sampling area.
`--prefix`	Output file prefix.	"output"	Helps organize output files for multiple runs.
`--sqldb`	SQLite3 database directory for Optuna optimization.	None	Saves optimization results for future use.
`--output_dir`	Directory to store output files.	"./output"	Organize files by specifying unique output directories.
`--seed`	Random seed for reproducibility.	None	Ensures consistent results across runs.
`--gpu_number`	GPU number for computation.	None (CPU)	Specify to enable GPU acceleration. Requires CUDA-compatible hardware.
`--n_jobs`	Number of CPU jobs to use.	-1	Use all CPUs by default.
`--verbose`	Verbosity level for logging.	1	Options: 0 (silent) to 3 (most verbose).
`--debug`	Enable debug mode for verbose logging.	False	Useful for troubleshooting.

Output Files and File Structure

Outputs are saved to the directory specified by --output_dir <my_output_dir>/<prefix_>_*. The prefix is specified with --prefix <prefix>. The directory structure of <output_dir> includes:

Directory	Description
`benchmarking`	Execution times for model training and prediction, with one line per bootstrap replicate if using bootstrapping.
`bootstrapped_sample_ci`	One plot per sample showing confidence intervals on a map.
`bootstrap_metrics`	JSON files with statistical metrics for each bootstrap replicate, in `test` and `val` subdirectories.
`bootstrap_predictions`	CSV files containing predictions for each bootstrap replicate, in `test`, `val`, and `unknown` subdirectories.
`bootstrap_summaries`	Mean, median, and standard deviation representations of bootstrap replicates for the test, val, and unknown ("pred") datasets.
`data`	Text files with sample IDs detected as outliers if `--detect_outliers` is enabled.
`logfiles`	Logs with INFO, WARNING, and ERROR messages, including timestamps and GeoGenIE modules.
`models`	Trained PyTorch models saved as ".pt" files, one per bootstrap if `--do_bootstrap` is enabled.
`optimize`	Optuna results, including the best-found parameters as a JSON file.
`plots`	All plots and visualizations, including model prediction error visualizations and the basemap shapefile specified with `--shapefile <url>`. Per-sample plots visualizing bootstrapped prediction error are saved in the `plots/bootstrapped_sample_ci` subdirectory.

Warning: Re-running GeoGenIE with the same output_dir and prefix will overwrite all outputs except the Optuna SQL database.

References

[1] Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/3292500.3330701

[2] Lemaître, G., Nogueira, F., & Aridas, C. K. (2017). Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17), 1-5. http://jmlr.org/papers/v18/16-365

[3] Chang et al., (2023). GGoutlieR: an R package to identify and visualize unusual geo-genetic patterns of biological samples. Journal of Open Source Software, 8(91), 5687. https://doi.org/10.21105/joss.05687

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.0.4

Apr 10, 2025

1.0.2

Apr 10, 2025

1.0.1

Apr 10, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

geogenie-1.0.4.tar.gz (124.6 kB view details)

Uploaded Apr 10, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

geogenie-1.0.4-py3-none-any.whl (127.1 kB view details)

Uploaded Apr 10, 2025 Python 3

File details

Details for the file geogenie-1.0.4.tar.gz.

File metadata

Download URL: geogenie-1.0.4.tar.gz
Upload date: Apr 10, 2025
Size: 124.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for geogenie-1.0.4.tar.gz
Algorithm	Hash digest
SHA256	`a1b852390548e003ceff3d16059cd5f2764fab830682cc2921213d7ec9b2113e`
MD5	`3a1c6031b56c2d96ba16b163abf5b5c0`
BLAKE2b-256	`c0ddc97ec0145e75396c375696671e8c9b8589d4aff6d3b2e530fb38756a03c7`

See more details on using hashes here.

File details

Details for the file geogenie-1.0.4-py3-none-any.whl.

File metadata

Download URL: geogenie-1.0.4-py3-none-any.whl
Upload date: Apr 10, 2025
Size: 127.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for geogenie-1.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cf9301bd9b9fd3eb84614f200a0e5c26142e74e684327265496c8460cb621b33`
MD5	`a281808d7d8e818ce5988d6464c3e2e8`
BLAKE2b-256	`cd55c6a9c3f06c1a098d68dc877aec08d705d1b3bbf004c80bcaecd2d5ba5fff`

See more details on using hashes here.

GeoGenIE 1.0.4

Navigation

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Project description

GeoGenIE User Manual

Introduction

Installation

Dependencies

Usage

Running GeoGenIE

Command-line Options

Configuration File

Running the Software

Output Files and File Structure

References

Project details

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes