Easy-2-use long text NLP toolkit.

Deep Long Text Learning Kit

Author: 吴子豪

An out-of-the-box framework for long-text semantic modeling

Installation

  • Using pip

    pip install -U deeplotx
    
  • Using uv (recommended)

    uv add -U deeplotx
    
  • Install the latest features from GitHub

    pip install -U git+https://github.com/vortezwohl/DeepLoTX.git
    

Core features

  • Long-text embedding

    • Long-text embedding based on a general-purpose BERT model (no limit on the maximum supported length; use max_length to cap it)

      from deeplotx import LongTextEncoder
      
      # Chunk size is 448 tokens; adjacent chunks overlap by 32 tokens.
      encoder = LongTextEncoder(
          chunk_size=448,
          overlapping=32
      )
      # Compute embeddings for the text below and stack them (flatten=False).
      encoder.encode('我是吴子豪, 这是一个测试文本.', flatten=False)
      

      Output:

      tensor([ 2.2316e-01,  2.0300e-01,  ...,  1.5578e-01, -6.6735e-02])
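
To make the chunk_size and overlapping parameters concrete, here is a minimal sketch of overlapping-window chunking over a token list. It is an illustration only, not DeepLoTX's actual implementation: the helper name `chunk_tokens` and the stride rule (chunk_size minus overlapping) are assumptions.

```python
def chunk_tokens(tokens, chunk_size=448, overlapping=32):
    """Split a token list into windows of chunk_size tokens.

    Hypothetical helper: consecutive windows share `overlapping` tokens.
    """
    if chunk_size <= overlapping:
        raise ValueError('chunk_size must be greater than overlapping.')
    stride = chunk_size - overlapping  # assumed step between window starts
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(1000))  # stand-in for 1000 token ids
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [448, 448, 168]
```

Under these assumptions, the last 32 tokens of each chunk reappear at the start of the next one, so context at chunk boundaries is seen twice.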
      
    • Longformer-based long-text embedding (supports up to 4096 tokens)

      from deeplotx import LongformerEncoder
      
      encoder = LongformerEncoder()
      encoder.encode('我是吴子豪, 这是一个测试文本.')
      

      Output:

      tensor([-2.7490e-02,  6.6503e-02, ..., -6.5937e-02,  6.7802e-03])
      
  • Similarity computation

    • Vector-based similarity

      import deeplotx.similarity as sim
      
      vector_0, vector_1 = [1, 2, 3, 4], [4, 3, 2, 1]
      # Euclidean distance
      distance_0 = sim.euclidean_similarity(vector_0, vector_1)
      print(distance_0)
      # Cosine distance
      distance_1 = sim.cosine_similarity(vector_0, vector_1)
      print(distance_1)
      # Chebyshev distance
      distance_2 = sim.chebyshev_similarity(vector_0, vector_1)
      print(distance_2)
      

      Output:

      4.47213595499958
      0.33333333333333337
      3
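
For reference, the three values above agree with the textbook definitions, sketched here with the standard library only. This is not the library's code; in particular, reading the cosine result as 1 minus the cosine similarity is an inference from the printed value 0.333..., and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # 1 - cos(angle between a and b); assumed to be what the library prints.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm


def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))


v0, v1 = [1, 2, 3, 4], [4, 3, 2, 1]
print(euclidean(v0, v1), cosine_distance(v0, v1), chebyshev(v0, v1))
```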
      
    • Set-based similarity

      import deeplotx.similarity as sim
      
      set_0, set_1 = {1, 2, 3, 4}, {4, 5, 6, 7}
      # Jaccard similarity
      distance_0 = sim.jaccard_similarity(set_0, set_1)
      print(distance_0)
      # Ochiai coefficient
      distance_1 = sim.ochiai_similarity(set_0, set_1)
      print(distance_1)
      # Dice coefficient
      distance_2 = sim.dice_coefficient(set_0, set_1)
      print(distance_2)
      # Overlap coefficient
      distance_3 = sim.overlap_coefficient(set_0, set_1)
      print(distance_3)
      

      Output:

      0.1428571428572653
      0.2500000000001875
      0.25000000000009376
      0.2500000000001875
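
The printed values sit very close to the textbook set-overlap measures (1/7 and 1/4). A minimal standard-library sketch of those definitions follows. The tiny deviations in the output above (for example 0.1428571428572653 rather than 0.142857142857142...) suggest the library adds a small smoothing term, but that is a guess and is not reproduced here. The function names are hypothetical.

```python
import math


def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)


def ochiai(a, b):
    # |A ∩ B| / sqrt(|A| * |B|)
    return len(a & b) / math.sqrt(len(a) * len(b))


def dice(a, b):
    # 2 * |A ∩ B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(a, b):
    # |A ∩ B| / min(|A|, |B|)
    return len(a & b) / min(len(a), len(b))


s0, s1 = {1, 2, 3, 4}, {4, 5, 6, 7}
print(jaccard(s0, s1), ochiai(s0, s1), dice(s0, s1), overlap(s0, s1))
```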
      
    • Probability-distribution-based similarity

      import deeplotx.similarity as sim
      
      dist_0, dist_1 = [0.3, 0.2, 0.1, 0.4], [0.2, 0.1, 0.3, 0.4]
      # Cross entropy
      distance_0 = sim.cross_entropy(dist_0, dist_1)
      print(distance_0)
      # KL divergence
      distance_1 = sim.kl_divergence(dist_0, dist_1)
      print(distance_1)
      # JS divergence
      distance_2 = sim.js_divergence(dist_0, dist_1)
      print(distance_2)
      # Hellinger distance
      distance_3 = sim.hellinger_distance(dist_0, dist_1)
      print(distance_3)
      

      Output:

      0.3575654913778237
      0.15040773967762736
      0.03969123741566945
      0.20105866986400994
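
As a cross-check, the numbers above can be reproduced from the standard definitions using natural logarithms. The sketch below is not the library's code and its function names are hypothetical. One inferred detail: the printed cross-entropy (0.3575...) matches the usual sum divided by the number of elements, so the `mean` flag is included to show both readings.

```python
import math


def cross_entropy(p, q, mean=False):
    # -sum(p_i * ln q_i); mean=True divides by the number of elements,
    # which is what the printed value appears to correspond to (inferred).
    terms = [-pi * math.log(qi) for pi, qi in zip(p, q)]
    return sum(terms) / len(terms) if mean else sum(terms)


def kl_divergence(p, q):
    # sum(p_i * ln(p_i / q_i)); terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    # Symmetrized KL against the midpoint distribution m = (p + q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hellinger_distance(p, q):
    # Euclidean distance between sqrt-distributions, scaled by 1 / sqrt(2).
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)


d0, d1 = [0.3, 0.2, 0.1, 0.4], [0.2, 0.1, 0.3, 0.4]
print(cross_entropy(d0, d1, mean=True), kl_divergence(d0, d1),
      js_divergence(d0, d1), hellinger_distance(d0, d1))
```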
      
  • Predefined deep neural networks

    from deeplotx import (
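        FeedForward,  # Feed-forward neural network
        MultiHeadFeedForward,  # Multi-head feed-forward neural network
        LinearRegression,  # Linear regression
        LogisticRegression,  # Logistic regression / binary classification / multi-label classification
        SoftmaxRegression,  # Softmax regression / multi-class classification
        RecursiveSequential,  # Sequence model / recurrent neural network
        LongContextRecursiveSequential,  # Long-context sequence model / recurrent neural network fused with self-attention
        RoPE,  # RoPE positional encoding
        Attention,  # Self-attention / cross-attention
        MultiHeadAttention,  # Parallel multi-head attention
        RoFormerEncoder,  # RoFormer (Transformer + RoPE) encoder model
        AutoRegression,  # Autoregressive model / recurrent neural network
        LongContextAutoRegression  # Long-context autoregressive model / recurrent neural network fused with self-attention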
    )
    

    Basic network structure:

    from typing_extensions import override
    
    import torch
    from torch import nn
    
    from deeplotx.nn.base_neural_network import BaseNeuralNetwork
    
    
    class FeedForwardUnit(BaseNeuralNetwork):
        def __init__(self, feature_dim: int, expansion_factor: int | float = 2,
                    bias: bool = True, dropout_rate: float = 0.05, model_name: str | None = None,
                    device: str | None = None, dtype: torch.dtype | None = None):
            super().__init__(in_features=feature_dim, out_features=feature_dim, model_name=model_name, device=device, dtype=dtype)
            self._dropout_rate = dropout_rate
            self.up_proj = nn.Linear(in_features=feature_dim, out_features=int(feature_dim * expansion_factor),
                                    bias=bias, device=self.device, dtype=self.dtype)
            self.down_proj = nn.Linear(in_features=int(feature_dim * expansion_factor), out_features=feature_dim,
                                    bias=bias, device=self.device, dtype=self.dtype)
            self.parametric_relu = nn.PReLU(num_parameters=1, init=5e-3,
                                            device=self.device, dtype=self.dtype)
            self.layer_norm = nn.LayerNorm(normalized_shape=self.up_proj.in_features, eps=1e-9,
                                        device=self.device, dtype=self.dtype)
    
        @override
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.ensure_device_and_dtype(x, device=self.device, dtype=self.dtype)
            residual = x
            x = self.layer_norm(x)
            x = self.up_proj(x)
            x = self.parametric_relu(x)
            if self._dropout_rate > .0:
                x = torch.dropout(x, p=self._dropout_rate, train=self.training)
            return self.down_proj(x) + residual
    
    
    class FeedForward(BaseNeuralNetwork):
        def __init__(self, feature_dim: int, num_layers: int = 1, expansion_factor: int | float = 2,
                    bias: bool = True, dropout_rate: float = 0.05, model_name: str | None = None,
                    device: str | None = None, dtype: torch.dtype | None = None):
            if num_layers < 1:
                raise ValueError('num_layers cannot be less than 1.')
            super().__init__(in_features=feature_dim, out_features=feature_dim, model_name=model_name, device=device, dtype=dtype)
            self.ffn_layers = nn.ModuleList([FeedForwardUnit(feature_dim=feature_dim,
                                                            expansion_factor=expansion_factor, bias=bias,
                                                            dropout_rate=dropout_rate,
                                                            device=self.device, dtype=self.dtype) for _ in range(num_layers)])
    
        @override
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.ensure_device_and_dtype(x, device=self.device, dtype=self.dtype)
            for ffn in self.ffn_layers:
                x = ffn(x)
            return x
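
Read directly from the `forward` method of `FeedForwardUnit` above, each unit is a pre-norm residual block. Written out for a single input vector x of dimension d with expansion factor e (these symbols are notation introduced here, not names from the source):

```latex
\begin{aligned}
h &= \mathrm{LayerNorm}(x), \\
u &= \mathrm{PReLU}\left(W_{\text{up}}\, h + b_{\text{up}}\right),
    \qquad W_{\text{up}} \in \mathbb{R}^{\lfloor d e \rfloor \times d}, \\
y &= W_{\text{down}}\, \mathrm{Dropout}(u) + b_{\text{down}} + x,
    \qquad W_{\text{down}} \in \mathbb{R}^{d \times \lfloor d e \rfloor}.
\end{aligned}
```

`FeedForward` then applies `num_layers` such units in sequence, so the input and output dimensions are both d.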
    

    Attention module:

    from typing_extensions import override
    
    import torch
    
    from deeplotx.nn.base_neural_network import BaseNeuralNetwork
    from deeplotx.nn.feed_forward import FeedForward
    from deeplotx.nn.rope import RoPE, DEFAULT_THETA
    
    
    class Attention(BaseNeuralNetwork):
        def __init__(self, feature_dim: int, bias: bool = True, positional: bool = True,
                    proj_layers: int = 1, proj_expansion_factor: int | float = 1.5, dropout_rate: float = 0.02,
                    model_name: str | None = None, device: str | None = None, dtype: torch.dtype | None = None,
                    **kwargs):
            super().__init__(in_features=feature_dim, out_features=feature_dim, model_name=model_name,
                            device=device, dtype=dtype)
            self._positional = positional
            self._feature_dim = feature_dim
            self.q_proj = FeedForward(feature_dim=self._feature_dim, num_layers=proj_layers,
                                    expansion_factor=proj_expansion_factor,
                                    bias=bias, dropout_rate=dropout_rate, device=self.device, dtype=self.dtype)
            self.k_proj = FeedForward(feature_dim=self._feature_dim, num_layers=proj_layers,
                                    expansion_factor=proj_expansion_factor,
                                    bias=bias, dropout_rate=dropout_rate, device=self.device, dtype=self.dtype)
            self.v_proj = FeedForward(feature_dim=self._feature_dim, num_layers=proj_layers,
                                    expansion_factor=proj_expansion_factor,
                                    bias=bias, dropout_rate=dropout_rate, device=self.device, dtype=self.dtype)
            if self._positional:
                self.rope = RoPE(feature_dim=self._feature_dim, theta=kwargs.get('theta', DEFAULT_THETA),
                                device=self.device, dtype=self.dtype)
    
        def _attention(self, x: torch.Tensor, y: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
            q, k = self.q_proj(x), self.k_proj(y)
            if self._positional:
                q, k = self.rope(q), self.rope(k)
            attn = torch.matmul(q, k.transpose(-2, -1))
            attn = attn / (self._feature_dim ** 0.5)
            attn = attn.masked_fill(mask == 0, -1e9) if mask is not None else attn
            return torch.softmax(attn, dtype=self.dtype, dim=-1)
    
        @override
        def forward(self, x: torch.Tensor, y: torch.Tensor | None = None, mask: torch.Tensor | None = None) -> torch.Tensor:
            x = self.ensure_device_and_dtype(x, device=self.device, dtype=self.dtype)
            y = x if y is None else self.ensure_device_and_dtype(y, device=self.device, dtype=self.dtype)
            if mask is not None:
                mask = self.ensure_device_and_dtype(mask, device=self.device, dtype=self.dtype)
            v = self.v_proj(y)
            return torch.matmul(self._attention(x, y, mask), v)
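
The score, scale, mask, and softmax steps in `_attention` and `forward` can be traced on small nested lists. The sketch below uses only the standard library and deliberately leaves out the learned q/k/v projections and RoPE, so it illustrates the arithmetic rather than standing in for the module. The function names are hypothetical.

```python
import math


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, mask=None):
    """q: [n][d], k: [m][d], v: [m][d_v]; mask: [n][m], where 0 blocks a position."""
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Dot-product scores, scaled by sqrt(feature dimension) as in the module.
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in k]
        if mask is not None:
            # Same convention as masked_fill(mask == 0, -1e9).
            scores = [-1e9 if mask[i][j] == 0 else s for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
causal_mask = [[1, 0], [1, 1]]
print(attention(q, k, v, mask=causal_mask)[0])  # [1.0, 2.0]
```

With the causal mask, the first query can only attend to the first key, so its output is exactly the first value row.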
    
  • Binary text classification with the predefined trainer

    from deeplotx import TextBinaryClassifierTrainer, LongTextEncoder
    from deeplotx.util import get_files, read_file
    
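    # Define the embedding strategy (FacebookAI/xlm-roberta-base is the default embedding model)
    long_text_encoder = LongTextEncoder(
        max_length=2048,  # Maximum text size; anything beyond it is truncated
        chunk_size=448,  # Chunk size (in tokens)
        overlapping=32,  # Overlap between chunks (in tokens)
        cache_capacity=512  # Cache capacity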
    )
    
    trainer = TextBinaryClassifierTrainer(
        long_text_encoder=long_text_encoder,
        batch_size=2,
        train_ratio=0.9  # Split ratio between the training and validation sets
    )
    
    # Load the data
    pos_data_path = 'path/to/pos_dir'
    neg_data_path = 'path/to/neg_dir'
    pos_data = [read_file(x) for x in get_files(pos_data_path)]
    neg_data = [read_file(x) for x in get_files(neg_data_path)]
    
    # Start training
    model = trainer.train(pos_data, neg_data, 
                        num_epochs=36, learning_rate=2e-5, 
                        balancing_dataset=True, alpha=1e-4, 
                        rho=.2, encoder_layers=2,  # 2 RoFormer encoder layers
                        attn_heads=8,  # 8 attention heads
                        recursive_layers=2)  # 2 Bi-LSTM layers
    
    # Save the model weights
    model.save(model_name='test_model', model_dir='model')
    
    # Load the saved model
    model = model.load(model_name='test_model', model_dir='model')
    
    # Predict with the trained model
    model.predict(long_text_encoder.encode('这是一个测试文本.', flatten=False))
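
To illustrate what a setting like train_ratio=0.9 typically means, here is a minimal sketch of a shuffled train/validation split. How `TextBinaryClassifierTrainer` actually partitions its data is not shown in this document, so the helper `split_dataset`, its `seed` parameter, and the shuffle-then-cut rule are all assumptions.

```python
import random


def split_dataset(samples, train_ratio=0.9, seed=0):
    """Hypothetical helper: shuffle, then cut at train_ratio."""
    rng = random.Random(seed)  # local RNG so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, valid = split_dataset(range(20))
print(len(train), len(valid))  # 18 2
```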
    
