Skip to main content

Simulation analysis package for working with disordered proteins

Project description

SOURSOP

Build Status codecov Documentation Status

ABOUT

SOURSOP is a Python-based simulation analysis package for working with intrinsically disordered and unfolded proteins. It is built on top of mdtraj, and was developed by Jared Lalmansingh and Alex Holehouse.

The current stable release on PyPI is 0.2.7 (May 2026).

INSTALLATION

SOURSOP can be installed from PyPI with either pip or uv:

pip install soursop
# or
uv pip install soursop

To install the latest development version directly from GitHub:

pip install "git+https://github.com/holehouse-lab/soursop.git"

Verify the install with:

python -c "import soursop; print(soursop.__version__)"

Full installation instructions (conda, editable/source installs, running the tests) are in the documentation.

DOCUMENTATION

All documentation, including installation information can be found here.

ERRORS, FEATURES, REQUESTS

If you find a bug, typo, or error please raise an issue on GitHub.

If you wish to add a new feature, please see our Development information in the docs (especially for adding plugins).

PUBLICATION

To read about SOURSOP please see our paper:

Lalmansingh, J. M., Keeley, A. T., Ruff, K. M., Pappu, R. V. & Holehouse, A. S. SOURSOP: A Python Package for the Analysis of Simulations of Intrinsically Disordered Proteins. J. Chem. Theory Comput. (2023). doi:10.1021/acs.jctc.3c00190

Copyright

Copyright (c) 2014-2026 under the GNU LESSER GENERAL PUBLIC LICENSE

Changelog

Update May 2026 (0.2.7)

Bug fixes

  • ssnmr: Corrected three copy-paste errors in the propagation of Ser/Thr/Tyr neighbour-correction values to pSer/pThr/pTyr slots in the glycine-specific random-coil correction tables (gly_ca_c, gly_co_c, gly_n_c). Phospho-residues at the i+1 neighbour position of glycine now correctly apply the intended Kjaergaard correction; previously these slots were silently zeroed or used the wrong row, causing errors of up to ~2.7 ppm in rare Gly–phospho-Gly contexts.
  • ssprotein.get_Q: Replaced manual sigmoid with scipy.special.expit to avoid np.exp overflow for large positive arguments.
  • ssprotein.get_Q: Fixed a division-by-zero in residue–residue contact matrix normalisation when a pair had zero atomic contacts in all frames; affected entries are now safely set to zero rather than producing NaN.
  • ssprotein.get_contact_map: Added an explicit check for glycine residues before computing sidechain-heavy contact maps; an SSException is now raised consistently rather than producing mdtraj-version-dependent behaviour (this resolves the historical edge case that previously caused the continuous-integration failure noted above).
  • ssprotein.get_clusters: Fixed a division-by-zero in RMSD centroid selection when all pairwise distances within a cluster are identical (std == 0); the first frame is returned as the centroid in that degenerate case.
  • ssutils.set_numpy_threads: Raises a descriptive SSException when neither MKL nor OpenBLAS is detected (e.g. macOS Apple Accelerate) rather than silently failing.
  • sstrajectory.get_interchain_contact_map / get_interchain_distance: Fixed IndexError raised for terminal cap residues that lack a CA atom; added eager protein-ID and mode validation; wrapped per-residue computations in try/except so a single problematic residue no longer aborts the entire map.
  • sstrajectory.parallel_load_trjs: Fixed a broken topology-broadcast condition that prevented a single shared topology (or a one-element list) from being applied to multiple trajectories; the argument now accepts either a str or a per-trajectory list and raises a descriptive SSException on a topology/trajectory count mismatch. This previously broke PENGUIN / sssampling parallel loading.
  • sstrajectory (__readTrajectory): Fixed the PDB-vs-trajectory unit-cell comparison, which compared the z-dimension three times and never checked the y-dimension; the warning is now an f-string that reports the actual differing cell dimensions (and several message typos were corrected).
  • ssprotein (initialization, __check_cg_onebead): No longer builds the one-letter sequence purely to obtain a residue count - that path raised KeyError on coarse-grained or non-standard residue names and could crash construction. It now compares the atom and residue counts directly.
  • ssprotein (reset_cache / __init__): Consolidated the duplicated cache-initialisation logic into a single private helper so a freshly constructed object and one after reset_cache() are identical. Previously reset_cache() forced __cg_onechain = False, silently disabling coarse-grained handling and forcing the slow per-residue path after a reset.

New features and improvements

  • sstrajectory.get_interchain_contact_map and sstrajectory.get_interchain_distance: Added a stride parameter that subsamples frames before the mdtraj distance computation, giving substantial speed-ups on large trajectories without any post-hoc loss of information.
  • sstrajectory: Applied functools.wraps to the lazy_loading_single_protein_trajectory decorator so that get_overall_radius_of_gyration, get_overall_hydrodynamic_radius, and get_overall_asphericity preserve their docstrings and signatures at runtime (this also fixes Sphinx autodoc rendering for these three methods).
  • ssprotein: Converted all legacy %-format strings to f-strings for consistency and readability.
  • ssprotein (initialization): Replaced the per-residue topology.select() loop in __get_resid_with_CA (the documented initialization bottleneck) with a single topology pass, faithfully reproducing the original memoisation side effects so resid_with_CA, cap flags and get_CA_index results (and their dtypes) are unchanged.
  • ssprotein: Replaced == None checks with is None; __get_subtrajectory now uses native mdtraj slicing (traj[::stride]).
  • sstrajectory: Removed dead code and unused imports (time, copy) and the unused single_chain_sim value threaded through __get_proteins; simplified per-chain atom collection in __get_all_proteins.
  • sstrajectory.get_interchain_distance_map: Hoisted the per-residue centre-of-mass computation out of the O(n1n2) nested loop (chain-2 COMs were previously recomputed once per chain-1 residue). The non-periodic distance is now vectorised per row; the periodic path is preserved per-pair. Byte-for-byte identical output, with the redundant O(n1n2) COM computations reduced to O(n1+n2).
  • sstrajectory.get_interchain_contact_map: For the default mode='atom' (non-periodic), each residue's named-atom position is now computed once instead of via an O(n1*n2) loop of get_interchain_distance calls. Every other mode and the periodic path are unchanged. Output is byte-for-byte identical (verified against a pre-refactor baseline across CA/COM/atom and the untouched ca/closest-heavy/sidechain modes).
  • ssprotein (get_internal_scaling, get_scaling_exponent, get_local_to_global_correlation): For mode='CA', hoisted the per-residue CA centre-of-mass out of the O(n^2) sequence-separation loops via a new shared __ca_position_cache helper. Previously get_inter_residue_atomic_distance recomputed each residue's CA position from scratch on every pair (this path was not memoised, unlike the COM path). The helper reproduces the exact __residue_atom_lookup -> atom_slice -> __get_subtrajectory -> 10*compute_center_of_mass sequence, so results - including the seeded error-bootstrap / pair-resampling - are byte-for-byte identical, while the CA-position work drops from O(n^2) to O(n) (≈20-40x faster on the affected calls).

Consistent ensemble reweighting (frame weights)

  • Every function that returns an ensemble-average value now accepts an optional per-frame weights vector, applied consistently and deterministically (closed-form weighted statistics; no stochastic resampling, fully reproducible, seed-independent). weights=False (the default) is an exact no-op, so all existing behaviour and the reference-observable suite are numerically unchanged.
  • New shared validation/reduction helpers in ssutils: validate_weights (the single source of truth - raises SSException unless the vector has one finite entry per frame, every entry in [0, 1], and sums to 1 within etol), plus weighted_mean, weighted_rms, weighted_std (reliability-weighted population estimator) and weighted_corr. SSProtein.__check_weights is now a thin wrapper around validate_weights.
  • stride + weights is now correct: the weight vector is subsampled and renormalised to sum to 1 over the retained frames (previously it was silently left un-normalised).
  • Converted the previously stochastic numpy.random.choice resampling in get_internal_scaling, get_internal_scaling_RMS, get_scaling_exponent (including its seeded error-bootstrap, now deterministically re-weighted per chunk) and get_local_to_global_correlation to deterministic weighted statistics, so weighting is uniform across the whole package. This also fixes a latent bug where get_local_to_global_correlation silently discarded user-supplied weights.
  • Added the opt-in weights (and etol) parameter to the per-frame getters (get_radius_of_gyration, get_hydrodynamic_radius, get_asphericity, get_end_to_end_distance, get_gyration_tensor, get_all_SASA), to SSTrajectory.get_overall_radius_of_gyration / _asphericity / _hydrodynamic_radius and get_interchain_distance_map, and to the collapsing observables (get_site_accessibility, get_regional_SASA, get_end_to_end_vs_rg_correlation, get_angle_decay, get_local_collapse). When weights is supplied the frame axis is collapsed and the deterministic weighted statistic is returned.
  • get_internal_scaling(mean_vals=False, weights=...) and get_local_heterogeneity(weights=...) raise a descriptive SSException: these describe a distribution / pairwise frame-vs-frame quantity for which a single per-frame probability vector is not well-defined (rather than silently returning a questionable number).
  • ssmutualinformation.calc_MI: Fixed a latent bug (if weights: on a numpy array raised ValueError: truth value ... ambiguous) that made the weighted path of get_dihedral_mutual_information unusable for any real weight vector; now correctly guarded with is not False.
  • Added the opt-in weights (+etol) parameter to five more methods so they collapse to a deterministic re-weighted ensemble value when supplied: get_inter_residue_atomic_distance, get_inter_residue_COM_distance and get_t (per-frame array -> scalar weighted mean; stride-aware), and get_secondary_structure_DSSP / get_secondary_structure_BBSEG only when return_per_frame=False (the per-residue fractional occupancies become frame-weighted; supplying weights with return_per_frame=True raises SSException, since a per-frame binary mask is not an ensemble average). Defaults (weights=False) are exact no-ops.
  • ssprotein.get_contact_map: Removed the explicit "stride must be 1 if weights given" hard-block - stride != 1 and weights now work together. The weight vector is validated by ssutils.validate_weights (one entry per trajectory frame, len == n_frames) and stride-subsampled/renormalised so it aligns 1:1 with the strided frames; the weighted map is the deterministic weighted mean of the per-frame contact maps. Default/unweighted behaviour unchanged.
  • ssprotein.get_Q: Restructured into two explicit phases - (1) extract the reference frame and define the native contacts / reference distances r0 from the full trajectory first, completely independent of stride/weights; (2) then resolve the analysis ensemble (apply the stride, validate the weights). This decouples native_state_reference_frame from the strided/re-weighted analysis and fixes a latent bug where the alignment (superpose) indexed the reference into the strided trajectory. stride != 1 and weights can now be used together (previously hard-blocked). The weight contract is consistent with the rest of the package - one entry per trajectory frame (len == n_frames), validated by ssutils.validate_weights and stride-subsampled/renormalised to align 1:1 with the strided frames; the weighted protein_average=False breakdown is now averaged over exactly the same frame set as the unweighted path (the old code silently dropped the first frame only in the weighted branch). protein_average=True still (intentionally) refuses weights. Default/unweighted behaviour and the reference-observable suite are numerically unchanged.

Testing

  • Added tests/test_weights.py (~68 tests) giving every weights-bearing entry point dedicated coverage: the ssutils validator/reduction helpers, the extreme cases (uniform weights == unweighted mean, one-hot weights == single-frame value, invalid vectors raise, stride+weights renormalisation), the converted stochastic functions, the per-frame getters, the SASA / scaling / correlation / secondary-structure / get_Q / get_contact_map / get_polymer_scaled_distance_map weighted paths, and a direct ssmutualinformation.calc_MI weighted test.
  • Renamed all test-data trajectory files to carry an explicit resolution suffix (_AA for all-atom, _CG for coarse-grained), e.g. ctl9.pdbctl9_AA.pdb. This makes the resolution unambiguous for both tests and build scripts.
  • Added two coarse-grained test trajectories (sigA_CG, synth_1_CG) to extend coverage of CG loading paths.
  • Added a pickle-based regression test suite (test_reference_observables.py, ~600+ parametrized tests) that recomputes every public SSProtein observable against stored reference values and asserts numerical agreement. Reference pickles are built by tests/build_reference/build_references.py.
  • Extended test_sstrajectory.py with systematic parametrized tests covering properties, consistency checks, contact-map correctness, and stride behaviour across all test trajectories.
  • Updated conftest.py with new module-scoped fixtures for the additional trajectories.
  • Fixed test_trajectory_repr_string which was inadvertently returning a tuple rather than asserting, so the test never validated the repr string.
  • test_set_numpy_threads now skips gracefully on platforms without a controllable BLAS backend rather than failing.

Documentation

  • Rewrote all public docstrings in ssprotein.py, sstrajectory.py, sssampling.py, ssutils.py, ssmutualinformation.py, sspre.py, ssnmr.py, and sstools.py to follow the numpy docstring standard (one-line summary, extended description, typed Parameters, Returns/Yields with shapes, Raises, short Example), and completed the previously empty SSPRE.__repr__ docstring.
  • Added detailed overview preambles to all five module RST pages (ssprotein, sstrajectory, ssnmr, sspre, sssampling), describing what each module does, how its classes relate, and what categories of analysis are available.
  • Rewrote examples.rst with nine worked IDP analysis examples covering trajectory loading, global dimensions, polymer scaling, secondary structure, contact maps, solvent accessibility, multi-chain systems, NMR comparison (PRE and random-coil chemical shifts), and sampling quality assessment with PENGUIN.
  • Expanded the narrative documentation pages: installation.rst now has separate pip and uv instructions (PyPI and direct-from-GitHub, plus conda, editable/source installs and test instructions); overview.rst now covers the core SSTrajectory/SSProtein concepts and a module map; and development.rst now documents dev-environment setup, repository layout, running the tests and coding/docstring conventions.
  • Corrected the version-check snippet (soursop.version() does not exist) to python -c "import soursop; print(soursop.__version__)" across the docs and README, refreshed the README (CI/codecov badge links now point to holehouse-lab/soursop, updated version/copyright lines, added a concise installation section), and fixed Sphinx build errors caused by duplicate SSProtein object descriptions and RST citation-label collisions (numeric [1]_ / .. [1] footnotes shared across methods; resolved by adding :no-index: to the second autoclass block and converting all numeric footnotes to inline prose citations).
  • Added a new "Ensemble reweighting (frame weights)" documentation page (docs/usage/weights.rst, linked from the toctree) covering the weight-vector contract, the stride interaction, the behaviour at the extremes, the full list of reweighting-capable functions, and the ssutils validation/reduction API.

Packaging and repository maintenance

  • Bumped the GitHub Actions CI Python matrix from 3.7-3.9 to 3.9-3.12 (3.7 and 3.8 are end-of-life).
  • Removed dead CI configuration: .travis.yml (Travis CI retired; still referenced the old camparitraj name) and .lgtm.yml (LGTM.com was shut down).
  • Reconciled requirements.txt and anaconda_requirements.txt with pyproject.toml: removed non-dependencies (cx_Freeze, PyYAML, ruamel_yaml), added the missing real dependencies (natsort, matplotlib, cython), and unified the mdtraj>=1.9.5 specifier.
  • Fixed a malformed MANIFEST.in data-file line and a stale metapredict/_version.py path in the setup.cfg coverage-omit list (now soursop/_version.py).

Update November 2024 (0.2.6)

  • Major update to SOURSOP.
  • Explicit support for single-chain one-bead-per-residue trajectory loading that dramatically improves load time (~30x improvement)
  • Officially added the SSSampling class and support for PENGUIN (see Lotthammer & Holehouse, bioRxiv
  • Added support for NH3 and FOR residue types in get_amino_acid_sequence()
  • Switched over packaging to use pyproject.toml instead of setup.py and switched versioning to use versioningit

Update March 2024 (0.2.5 [patch])

  • Added return_instantaneous_maps=False keyword to get_distance_maps() function so we can return a [t,n,n] matrix where t = number of frames and n=number of residues for the instantaneous conformer-specific distance maps.

Update Feb 2024 (0.2.5)

  • Removed periodic correction in SSProtein functions. We could provide periodic boundary condition checks for some but not all functions, and ensuring this flag was possible everywhere is not feasable. With this in mind, we opted to make a design decision to remove the periodic flag from the small number of functions that had it, such that there's no risk of a user forgetting and analyzing two distinct properties for a trajectory that requires PBC fixing where one analysis used periodic=True whereas another did not have this option. Now, the user must ensure their protein trajectories are corrected ahead of time. Note that we also added support for periodic=True for some of the SSTrajectory analysis functions, because in cases where multiple chains are present reconstructing a non-PBC trajectory becomes much more difficult. As such, intermolecular analysis does provide intrinsic PBC correction, whereas intramolecular analysis does not.

Update July 2023

  • Added in explicit_residue_checking flag into SSTrajectory constructor, which makes it possible to use a solvated .gro file as an input file.

Update February 2023

  • Added plugins example
  • Added additional tests and finalized documentation

Update July 2022

For version 0.2.1 we introduce potentially breaking changes into how COM distances are reported.

Details:

All center-of-mass (COM)-based functions now return distances (as before) and relative positions in x/y/z (this is new) in Angstroms, not nanometers. Previously, the various center-of-mass based functions that returned absolute positions (i.e. 3xn matrices of x/y/z vs frame number) returned them where x/y/z was in units nm. This is fine if you know this, but means if you manually calculate distance between two COM vectors you'd get a distance in nanometers and not Angstroms. This is an unhelpful and unexpected behavior given all other distances are in Angstroms, so we have made the decision to fully update to Angstroms even for vector positions. This DOES NOT break any code internal to soursop, but if you were using COM positions to manually calculate distances these may need to be recalculated.

Update April 2022

For version 0.1.9 the documentation has been extensively extended

Update July 2021

CAMPARITraj is SOURSOP! For the final release we have re-named and re-branded CAMPARITraj as SOURSOP. This change in the name serves two important purposes.

Firstly, CAMPARITraj was borne out of a collection of scripts and code built to work with CAMPARI. However, it has evolved into a stand-alone package for the analysis of all-atom simulations of IDPs and IDRs, and importantly, much of the analysis it performs can also be done in CAMPARI. As such, we felt it was important to decoupled SOURSOP from CAMPARI, both to avoid the implication that CAMPARI cannot perform analyses itself, and to avoid a scenario in which it may appear that this package only works with CAMPARI simulations.

Secondly, there is a long, rich tradition of naming software tools after drinks in the Pappu lab (CAMPARI, ABSINTH, CIDER, LASSI etc.). As such SOURSOP is first-author Jared's Caribbean twist on this theme!

WARNING

We are currently and systematically updating all of the CAMPARITraj codebase to SOURSOP. As such, for now, we recommend using a previous version. Note the SOURSOP change breaks all backwards compatibility. Sorry about that.

Update December 2020

Release 0.1.2 includes updated support to ensure CAMPARITraj will continue to work with MDTraj 1.9.5, as well as numerous additional updated.

Update May 2019

This is the development repository of CAMPARITraj and SHOULD NOT be used for production. Seriously, it is being modified constantly and with no building requirements during code pushes. If you want a building copy PLEASE contact Alex directly! [last touched June 24th 2019].

Acknowledgements

Project based on the Computational Molecular Science Python Cookiecutter version 1.0.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

soursop-0.2.7.tar.gz (11.0 MB view details)

Uploaded Source

File details

Details for the file soursop-0.2.7.tar.gz.

File metadata

  • Download URL: soursop-0.2.7.tar.gz
  • Upload date:
  • Size: 11.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.30 {"installer":{"name":"uv","version":"0.9.30","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for soursop-0.2.7.tar.gz
Algorithm Hash digest
SHA256 101bc0f49be2a1f47c591a7dd1ec9f6088f403b286d3e97244ca781a23753f70
MD5 6c51fe15de708707bc9a9dd27d42142e
BLAKE2b-256 58dfc33e777ce361d71d3113401f96c5122f3ac82a368a06471eea8992a670b0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page