Multiscale Gauss Linking Integral Library for Biomolecular 3D Topology

GaussBio3D: Multiscale Gauss Linking Integral Library

A Python library for multiscale Gauss linking integral (mGLI)–based 3D topological descriptors for small molecules, proteins and nucleic acids.

It is designed to be a unified 3D representation framework for biomolecular interaction tasks such as:

  • Drug–Target Interaction (DTI)
  • Protein–Protein Interaction (PPI)
  • Drug–Drug Interaction (DDI)
  • miRNA/Nucleic acid–Target Interaction (MTI)
  • Protein–DNA/RNA complexes

1. Mathematical Background

1.1 Gauss Linking Integral (Continuous)

Given two smooth space curves C₁ and C₂, traced by position vectors r₁ and r₂, the Gauss linking integral is

GLI(C₁, C₂) = (1/4π) ∫_{C₁} ∫_{C₂} [(dr₁ × dr₂) · (r₁ - r₂)] / ||r₁ - r₂||³

It measures the topological linking / winding between two curves. For disjoint closed curves it is an integer (the linking number); for open curves (e.g. biomolecular fragments) it is a real-valued "linking strength".
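As a quick numerical sanity check of the closed-curve statement, the sketch below evaluates the double integral with a simple midpoint rule in NumPy. The helper names gauss_linking_numeric and circle, and the test curves, are illustrative assumptions and not part of GaussBio3D. Two interlocked unit circles (a Hopf link) should give a value of magnitude close to 1, and two well-separated circles a value close to 0.

```python
import numpy as np

def gauss_linking_numeric(curve1, curve2):
    """Midpoint-rule estimate of the Gauss linking integral.

    curve1, curve2: (N, 3) arrays of points sampled along two CLOSED curves.
    """
    c1 = np.asarray(curve1, dtype=float)
    c2 = np.asarray(curve2, dtype=float)
    # Chord vectors (dr) and chord midpoints (r), wrapping around to close each curve.
    d1 = np.roll(c1, -1, axis=0) - c1
    d2 = np.roll(c2, -1, axis=0) - c2
    m1 = c1 + 0.5 * d1
    m2 = c2 + 0.5 * d2
    diff = m1[:, None, :] - m2[None, :, :]             # r1 - r2 for every chord pair
    cross = np.cross(d1[:, None, :], d2[None, :, :])   # dr1 x dr2 for every chord pair
    integrand = np.sum(cross * diff, axis=-1) / np.linalg.norm(diff, axis=-1) ** 3
    return float(integrand.sum() / (4.0 * np.pi))

def circle(center, u, v, n=200):
    """Sample n points on the unit circle spanned by orthonormal vectors u, v."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return center + np.outer(np.cos(t), u) + np.outer(np.sin(t), v)

ex, ey, ez = np.eye(3)
ring_a = circle(np.zeros(3), ex, ey)       # unit circle in the xy-plane
ring_b_linked = circle(ex, ex, ez)         # passes through the centre of ring_a
ring_b_far = circle(5.0 * ex, ex, ez)      # same shape, moved out of reach

linked = gauss_linking_numeric(ring_a, ring_b_linked)
unlinked = gauss_linking_numeric(ring_a, ring_b_far)
print(f"linked: {linked:+.3f}, unlinked: {unlinked:+.3f}")
```

The sign of the linked value depends on the orientation chosen for each curve; reversing one curve flips it.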

1.2 Discrete Segment Approximation

We approximate each curve by a set of straight segments:

  • C₁ = {Lᵢ}, where Lᵢ = [a₀, a₁]
  • C₂ = {Mⱼ}, where Mⱼ = [b₀, b₁]

Then:

GLI(C₁, C₂) ≈ Σᵢⱼ GLI(Lᵢ, Mⱼ)

For line segments L=[a₀,a₁] and M=[b₀,b₁], we use a standard spherical geometry–based approximation:

  1. Define:
r₀₀ = b₀ - a₀,  r₀₁ = b₁ - a₀
r₁₀ = b₀ - a₁,  r₁₁ = b₁ - a₁
  2. Normalize these vectors to get four unit vectors on the unit sphere.
  3. Construct four oriented spherical triangles and sum their signed areas, using the arcsin of the dot products between successive normalized cross products.
  4. Multiply by a sign determined by the orientation of the two segments.

The library exposes gli_segment(seg1, seg2, signed=True/False), which computes this value. With signed=False, we use the absolute value |GLI| to measure linking strength independent of chirality.
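The four steps above can be sketched as follows. This is a minimal illustration that assumes the widely used solid-angle form (four arcsin terms times an orientation sign); the function name gli_segment_sketch and the eps tolerance are hypothetical, and the library's own gli_segment may differ in details such as degenerate-case handling. Because the cross products are normalized, the endpoint vectors themselves do not need to be normalized first.

```python
import numpy as np

def gli_segment_sketch(seg1, seg2, signed=True, eps=1e-12):
    """Gauss linking integral between two straight segments (illustrative).

    seg1 = (a0, a1), seg2 = (b0, b1); each endpoint is a length-3 array-like.
    Returns 0.0 for degenerate configurations such as coplanar segments.
    """
    a0, a1 = (np.asarray(p, dtype=float) for p in seg1)
    b0, b1 = (np.asarray(p, dtype=float) for p in seg2)
    r00, r01 = b0 - a0, b1 - a0
    r10, r11 = b0 - a1, b1 - a1

    # Orientation of the segment pair; a zero triple product means coplanar.
    triple = np.dot(np.cross(b1 - b0, a1 - a0), r00)
    if abs(triple) < eps:
        return 0.0

    # Unit normals of the four planes spanned by successive endpoint vectors.
    normals = [np.cross(r00, r01), np.cross(r01, r11),
               np.cross(r11, r10), np.cross(r10, r00)]
    units = []
    for n in normals:
        length = np.linalg.norm(n)
        if length < eps:
            return 0.0
        units.append(n / length)

    # Area of the spherical quadrilateral as a sum of four arcsin terms.
    area = sum(
        np.arcsin(np.clip(np.dot(units[k], units[(k + 1) % 4]), -1.0, 1.0))
        for k in range(4)
    )
    value = float(area / (4.0 * np.pi))
    return float(np.sign(triple)) * value if signed else abs(value)

# Two perpendicular segments of length 2, one unit apart.
seg_x = ((-1.0, 0.0, 0.0), (1.0, 0.0, 0.0))
seg_y = ((0.0, -1.0, 1.0), (0.0, 1.0, 1.0))
print(round(gli_segment_sketch(seg_x, seg_y), 4))                 # -0.1667
print(round(gli_segment_sketch(seg_x, seg_y, signed=False), 4))   # 0.1667
```

In this example the four endpoint directions are (±1, ±1, 1)/√3, which span one face of a cube projected onto the sphere, so the solid angle is 4π/6 and the magnitude is 1/6. The negative sign agrees with the continuous formula in 1.1 when (r₁ - r₂) is used in the numerator.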


2. Multiscale & Grouped mGLI Features

We want features that capture how strongly, and at what distance scales, parts of molecules A and B are topologically linked.

2.1 Node Pair Quantities

For nodes (atoms / residues / bases) i ∈ A, j ∈ B:

  • Position: xᵢ, xⱼ
  • Distance: rᵢⱼ = ||xᵢ - xⱼ||
  • Local GLI: gᵢⱼ = aggregated GLI between the segments incident to node i and those incident to node j (sum or median over each node's local segments)
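For the sum aggregation, the node-pair quantities can be written as two matrix operations. The sketch below assumes a precomputed matrix of segment-pair GLI values and binary node-to-segment incidence matrices; the function name node_pair_quantities and this array layout are illustrative, not the library's internal representation.

```python
import numpy as np

def node_pair_quantities(pos_a, pos_b, seg_gli, inc_a, inc_b):
    """Return (r, g) for all node pairs (i in A, j in B).

    pos_a: (N_A, 3), pos_b: (N_B, 3) node coordinates.
    seg_gli: (S_A, S_B) GLI between every segment of A and every segment of B.
    inc_a: (N_A, S_A), inc_b: (N_B, S_B) binary incidence matrices,
           inc[i, s] = 1 if segment s is incident to node i.
    """
    r = np.linalg.norm(pos_a[:, None, :] - pos_b[None, :, :], axis=-1)
    # Sum of GLI over all (segment of i, segment of j) pairs.
    g = inc_a @ seg_gli @ inc_b.T
    return r, g

pos_a = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
pos_b = np.array([[0.0, 4.0, 0.0]])
seg_gli = np.array([[0.1, -0.2], [0.3, 0.0]])
inc_a = np.array([[1, 0], [1, 1]])   # node 0 touches segment 0; node 1 touches both
inc_b = np.array([[1, 1]])           # the single B node touches both B segments
r, g = node_pair_quantities(pos_a, pos_b, seg_gli, inc_a, inc_b)
print(r.ravel(), g.ravel())
```

A median aggregation cannot be expressed as a matrix product and would need an explicit loop over node pairs.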

2.2 Radial Weighting (Multi-scale)

We define radial weighting functions φₖ(r), k=1..K, as either hard bins or Gaussian radial basis functions (RBF):

  • Hard bins:
φₖ(r) = 𝟙[r ∈ [Rₖ, Rₖ₊₁)]
  • RBF:
φₖ(r) = exp(-(r-μₖ)²/(2σₖ²))

Then the multi-scale aggregated features are:

hₖ = Σᵢⱼ φₖ(rᵢⱼ) · f(gᵢⱼ)

where f can be gᵢⱼ or |gᵢⱼ|. The sum can also be replaced by another statistic (mean/max/min/median) over the node pairs in that scale.
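A minimal NumPy sketch of both weighting schemes and the resulting hₖ follows. The function names, bin edges, and RBF centres and widths are illustrative assumptions chosen for the example, not library defaults.

```python
import numpy as np

def hard_bin_weights(r, edges):
    """phi[k, i, j] = 1 where edges[k] <= r[i, j] < edges[k + 1], else 0."""
    edges = np.asarray(edges, dtype=float)
    lo = edges[:-1, None, None]
    hi = edges[1:, None, None]
    return ((r[None] >= lo) & (r[None] < hi)).astype(float)

def rbf_weights(r, centers, sigmas):
    """phi[k, i, j] = exp(-(r[i, j] - mu_k)^2 / (2 sigma_k^2))."""
    mu = np.asarray(centers, dtype=float)[:, None, None]
    sigma = np.asarray(sigmas, dtype=float)[:, None, None]
    return np.exp(-((r[None] - mu) ** 2) / (2.0 * sigma ** 2))

def multiscale_sum(phi, g, absolute=True):
    """h[k] = sum over (i, j) of phi[k, i, j] * f(g[i, j]), with f = |g| or g."""
    f = np.abs(g) if absolute else g
    return (phi * f[None]).sum(axis=(1, 2))

r = np.array([[1.0, 4.0], [7.0, 25.0]])      # node-pair distances
g = np.array([[0.2, -0.1], [0.05, 0.3]])     # node-pair local GLI
edges = [0.0, 3.0, 6.0, 10.0, 20.0]          # K + 1 edges give K = 4 bins

h_hard = multiscale_sum(hard_bin_weights(r, edges), g)
h_rbf = multiscale_sum(rbf_weights(r, [1.5, 4.5, 8.0, 15.0], [1.5] * 4), g)
print(h_hard)   # the pair at r = 25 lies outside every bin and contributes nothing
print(h_rbf)
```

With hard bins each pair contributes to at most one scale, whereas RBF weights let a pair contribute smoothly to neighbouring scales.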

2.3 Grouping: Elements / Residues / Bases

We further group nodes by discrete categories:

  • small molecule: element / functional group
  • protein: residue type or residue class (hydrophobic/aromatic/etc.)
  • nucleic acid: base type (A/C/G/T/U) or backbone vs base

Define:

c_A(i) ∈ {1,...,C_A},  c_B(j) ∈ {1,...,C_B}

Then, for each pair of categories (a, b):

h_{a,b,k} = Σ_{i,j: c_A(i)=a, c_B(j)=b} φₖ(rᵢⱼ) · f(gᵢⱼ)

Stacking these h_{a,b,k} (and possibly their min/max/mean/median counterparts) gives a global mGLI descriptor for a structure pair.
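The grouped sum above can be computed in one step with one-hot category matrices. This is a sketch under stated assumptions: the function name grouped_mgli is hypothetical, and the category labels are 0-based integers here, whereas the text numbers them from 1.

```python
import numpy as np

def grouped_mgli(phi, f, cat_a, cat_b, n_cat_a, n_cat_b):
    """h[a, b, k] = sum of phi[k, i, j] * f[i, j] over pairs with
    cat_a[i] == a and cat_b[j] == b.

    phi: (K, N_A, N_B) radial weights; f: (N_A, N_B) values f(g_ij);
    cat_a: (N_A,), cat_b: (N_B,) integer category labels starting at 0.
    """
    onehot_a = np.eye(n_cat_a)[cat_a]    # (N_A, C_A)
    onehot_b = np.eye(n_cat_b)[cat_b]    # (N_B, C_B)
    return np.einsum("ia,jb,kij->abk", onehot_a, onehot_b, phi * f[None])

phi = np.ones((2, 3, 2))                          # K = 2 scales, every pair has weight 1
f = np.arange(6, dtype=float).reshape(3, 2)       # [[0, 1], [2, 3], [4, 5]]
cat_a = np.array([0, 1, 0])                       # nodes 0 and 2 of A share a category
cat_b = np.array([0, 1])

h = grouped_mgli(phi, f, cat_a, cat_b, n_cat_a=2, n_cat_b=2)
descriptor = h.reshape(-1)                        # stacked global descriptor
print(h[:, :, 0])
print(descriptor.shape)
```

The descriptor length is C_A · C_B · K, so coarse groupings (residue classes rather than the 20 residue types) keep it compact.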


3. Unified Geometry Representation

We represent each biomolecule as:

  • Node: atom / residue / base
  • Segment: oriented segment between two 3D points, optionally attached to nodes
  • Curve: a polyline made of segments, e.g. backbone, side-chain, ring
  • Structure: collection of nodes + curves + mapping from nodes to their local segments

This supports:

  • small molecule:
    • backbone curves (bond chains)
    • ring curves (aromatic / aliphatic rings)
  • protein:
    • backbone curve (Cα trace)
    • sidechain curves per residue
  • nucleic acid:
    • backbone curve (phosphate or sugar-phosphate)
    • base ring curves
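One way to picture this representation is with plain dataclasses. The class and field names below are hypothetical and chosen for illustration; they are not the classes defined in gaussbio3d.core.geometry.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]

@dataclass
class Node:
    index: int
    position: Point
    group: str                              # e.g. element, residue type or base type

@dataclass
class Segment:
    start: Point
    end: Point
    node_indices: Tuple[int, ...] = ()      # nodes this segment is attached to

@dataclass
class Curve:
    kind: str                               # e.g. "backbone", "sidechain", "ring"
    segments: List[Segment] = field(default_factory=list)

@dataclass
class Structure:
    nodes: List[Node] = field(default_factory=list)
    curves: List[Curve] = field(default_factory=list)

    def node_segments(self) -> Dict[int, List[Segment]]:
        """Map each node index to the segments attached to it."""
        mapping: Dict[int, List[Segment]] = {n.index: [] for n in self.nodes}
        for curve in self.curves:
            for seg in curve.segments:
                for idx in seg.node_indices:
                    mapping.setdefault(idx, []).append(seg)
        return mapping

def polyline_curve(nodes: List[Node], kind: str) -> Curve:
    """Join consecutive nodes with oriented segments (e.g. a C-alpha trace)."""
    segs = [Segment(a.position, b.position, (a.index, b.index))
            for a, b in zip(nodes, nodes[1:])]
    return Curve(kind, segs)

residues = [Node(0, (0.0, 0.0, 0.0), "GLY"),
            Node(1, (3.8, 0.0, 0.0), "ALA"),
            Node(2, (7.6, 0.0, 0.0), "SER")]
structure = Structure(residues, [polyline_curve(residues, "backbone")])
local = structure.node_segments()
print({idx: len(segs) for idx, segs in local.items()})
```

The node-to-segment mapping is what lets a local GLI value be attributed to a pair of nodes rather than to a pair of anonymous segments.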

4. Installation & Dependencies

GaussBio3D requires RDKit for small-molecule I/O (SDF/MOL2/SMILES) and Biopython for PDB/mmCIF parsing.

Required:

  • Python 3.9+
  • numpy
  • rdkit
  • biopython

Recommended installation on Windows/macOS/Linux via Conda:

conda install -c conda-forge rdkit
pip install gaussbio3d

If you prefer pip-only and an RDKit wheel is available for your platform:

pip install rdkit
pip install gaussbio3d

From source:

git clone https://github.com/yourusername/GaussBio3D
cd GaussBio3D
pip install -e .

5. Basic Usage

5.1 Compute a Protein–Ligand Global mGLI Descriptor

from gaussbio3d.molecules import Protein, Ligand
from gaussbio3d.config import MgliConfig
from gaussbio3d.features.descriptor import global_mgli_descriptor

# Load protein and ligand
prot = Protein.from_pdb("examples/target.pdb", chain_id="A")
lig = Ligand.from_sdf("examples/drug.sdf")

# Configure mGLI parameters
config = MgliConfig(
    distance_bins=[0.0, 3.0, 6.0, 10.0, 20.0],
    use_rbf=False,
    signed=False,
    group_mode_A="residue_class",
    group_mode_B="element",
)

# Compute global descriptor
feat = global_mgli_descriptor(prot, lig, config)
print("Feature shape:", feat.shape)

Quick DTI example:

from gaussbio3d.tasks.dti import compute_dti_features
from gaussbio3d.config import MgliConfig

cfg = MgliConfig()
feats = compute_dti_features(
    pdb_path="examples/target.pdb",  # supports .pdb or .cif
    sdf_path="examples/drug.sdf",
    chain_id="A",
    config=cfg,
)
print({k: v.shape for k, v in feats.items()})

5.2 Node-level mGLI Features for a DTI Model

from gaussbio3d.features.node_features import node_mgli_features

# Compute node-level features (prot, lig and config as defined in 5.1)
node_feat_prot = node_mgli_features(prot, lig, config)
node_feat_lig  = node_mgli_features(lig, prot, config)

These can be concatenated with PLM embeddings / GeoGNN embeddings as 3D topological channels.

5.3 Pairwise mGLI Matrix for Cross-attention

from gaussbio3d.features.pairwise import pairwise_mgli_matrix

# Compute pairwise matrix
M = pairwise_mgli_matrix(prot, lig, config)
# M.shape = (N_prot_nodes, N_lig_nodes)

Use M as a bias term or edge feature in a DTI cross-attention GNN.
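To make the bias-term idea concrete, the sketch below adds a pairwise matrix to the attention logits of a single-head cross-attention layer written in plain NumPy. The function name, the scaling factor beta, and the random stand-in for M are assumptions made for illustration; in a real model M would come from pairwise_mgli_matrix, and beta would typically be a learned parameter.

```python
import numpy as np

def cross_attention_with_bias(queries, keys, values, bias, beta=1.0):
    """Single-head cross-attention with an additive pairwise bias.

    queries: (N_q, d), keys: (N_k, d), values: (N_k, d_v), bias: (N_q, N_k).
    Returns (output, attention_weights).
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d) + beta * bias
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
n_prot, n_lig, dim = 5, 3, 8
prot_emb = rng.normal(size=(n_prot, dim))    # stand-in protein node embeddings
lig_emb = rng.normal(size=(n_lig, dim))      # stand-in ligand node embeddings
M_demo = rng.uniform(size=(n_prot, n_lig))   # stand-in for the pairwise mGLI matrix

out, attn = cross_attention_with_bias(prot_emb, lig_emb, lig_emb, M_demo)
print(out.shape, attn.sum(axis=-1))
```

Adding the bias before the softmax raises the attention weight of strongly linked node pairs without changing the embeddings themselves.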


6. Task Helpers (DTI / PPI / MTI)

We provide thin convenience wrappers in gaussbio3d.tasks to integrate easily with your existing pipelines.

Example:

from gaussbio3d.tasks.dti import compute_dti_features

# Compute all DTI features at once
dti_feats = compute_dti_features(
    pdb_path="examples/target.pdb",
    sdf_path="examples/drug.sdf",
)

7. Caveats & TODO

  • This library is intended as a research prototype:

    • efficiency is not highly optimized yet (the GLI computation evaluates all segment pairs, O(#segments²) in the worst case)
    • some geometric heuristics (ring detection, nucleic acid parsing) are simplified and should be refined for production use
  • You are encouraged to:

    • adjust distance bins / RBF parameters to your task
    • design more nuanced groupings (e.g. binding pocket residues vs non-pocket)
    • integrate with your causal / adversarial training pipeline to debias abundance
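One simple way to reduce the quadratic cost when hard bins are used: a node pair whose distance lies beyond the last bin edge has zero weight in every bin, so its local GLI never needs to be evaluated. The sketch below only selects the pairs worth evaluating; the function name is hypothetical, and with RBF weights such a cutoff would be a truncation rather than an exact shortcut.

```python
import numpy as np

def pairs_within_cutoff(pos_a, pos_b, r_max):
    """Return indices (i, j) and distances of node pairs with r_ij < r_max."""
    pos_a = np.asarray(pos_a, dtype=float)
    pos_b = np.asarray(pos_b, dtype=float)
    r = np.linalg.norm(pos_a[:, None, :] - pos_b[None, :, :], axis=-1)
    i_idx, j_idx = np.nonzero(r < r_max)
    return i_idx, j_idx, r[i_idx, j_idx]

pos_a = [[0.0, 0.0, 0.0], [50.0, 0.0, 0.0]]
pos_b = [[0.0, 0.0, 5.0], [0.0, 0.0, 30.0]]
i_idx, j_idx, r_kept = pairs_within_cutoff(pos_a, pos_b, r_max=20.0)
print(list(zip(i_idx.tolist(), j_idx.tolist())), r_kept)
```

The distance matrix is still quadratic in the number of nodes, but it is far cheaper per pair than the arcsin-based segment computation it lets you skip.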

8. Project Structure

GaussBio3D/
├── gaussbio3d/
│   ├── __init__.py
│   ├── config.py              # Configuration
│   ├── core/                  # Core algorithms
│   │   ├── geometry.py        # Geometric primitives
│   │   └── gli.py             # GLI computation
│   ├── features/              # Feature extraction
│   │   ├── descriptor.py      # Global descriptors
│   │   ├── node_features.py   # Node-level features
│   │   └── pairwise.py        # Pairwise features
│   ├── io/                    # Input/Output
│   │   ├── mol.py             # Molecule file I/O
│   │   └── pdb.py             # PDB file I/O
│   ├── molecules/             # Molecule representations
│   │   ├── ligand.py          # Small molecules
│   │   ├── protein.py         # Proteins
│   │   └── nucleic_acid.py    # Nucleic acids
│   └── tasks/                 # Task-specific helpers
│       ├── dti.py             # Drug-Target Interaction
│       ├── ppi.py             # Protein-Protein Interaction
│       └── mti.py             # Molecule-Target Interaction
├── examples/                  # Example scripts
├── tests/                     # Unit tests
├── README.md
├── setup.py
└── requirements.txt

License

MIT License


Citation

If you use GaussBio3D in your research, please cite:

@software{gaussbio3d,
  title={GaussBio3D: Multiscale Gauss Linking Integral Library for Biomolecular 3D Topology},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/GaussBio3D}
}
