
Easy-to-use long-text NLP toolkit.

Deep Long Text Learning Kit

Author: 吴子豪

An out-of-the-box framework for long-text semantic modeling

Installation

  • Using pip

    pip install -U deeplotx
    
  • Using uv (recommended)

    uv add -U deeplotx
    
  • Install the latest features from GitHub

    pip install -U git+https://github.com/vortezwohl/DeepLoTX.git
    

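Core features

  • Long-text embedding

    • Long-text embedding based on a general-purpose BERT model (no built-in length limit; the maximum supported length is set via max_length)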

      from deeplotx import LongTextEncoder
      
      # Maximum text length is 2048 tokens, chunk size is 512 tokens, and adjacent chunks overlap by 64 tokens.
      encoder = LongTextEncoder(
          max_length=2048,
          chunk_size=512,
          overlapping=64
      )
      # Compute the embedding of "我是吴子豪, 这是一个测试文本." and flatten it.
      encoder.encode('我是吴子豪, 这是一个测试文本.', flatten=True, use_cache=True)
      

      Output:

      tensor([ 0.5163,  0.2497,  0.5896,  ..., -0.9815, -0.3095,  0.4232])
      
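The chunking parameters above can be pictured with a small sketch. It assumes that consecutive chunks start `chunk_size - overlapping` tokens apart and that tokens beyond `max_length` are truncated. The helper name `chunk_token_ids` is hypothetical, and DeepLoTX's internal chunking may differ.

```python
def chunk_token_ids(token_ids, max_length=2048, chunk_size=512, overlapping=64):
    """Split a token-id sequence into overlapping chunks (illustrative only)."""
    if not 0 <= overlapping < chunk_size:
        raise ValueError("overlapping must satisfy 0 <= overlapping < chunk_size")
    # Assumption: anything past max_length is simply dropped.
    token_ids = list(token_ids)[:max_length]
    # Assumption: each chunk starts this many tokens after the previous one.
    stride = chunk_size - overlapping
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last chunk already reaches the end of the text
        start += stride
    return chunks


if __name__ == "__main__":
    chunks = chunk_token_ids(range(2048))
    print([len(c) for c in chunks])  # [512, 512, 512, 512, 256]
```

Under these assumptions, a 2048-token input yields five chunks, and the last 64 tokens of each chunk reappear at the start of the next one.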
    • Longformer-based long-text embedding (supports up to 4096 tokens)

      from deeplotx import LongformerEncoder
      
      encoder = LongformerEncoder()
      encoder.encode('我是吴子豪, 这是一个测试文本.')
      
  • Similarity computation

    • Vector-based similarity

      import deeplotx.similarity as sim
      
      vector_0, vector_1 = [1, 2, 3, 4], [4, 3, 2, 1]
      # Euclidean distance
      distance_0 = sim.euclidean_similarity(vector_0, vector_1)
      print(distance_0)
      # Cosine distance
      distance_1 = sim.cosine_similarity(vector_0, vector_1)
      print(distance_1)
      # Chebyshev distance
      distance_2 = sim.chebyshev_similarity(vector_0, vector_1)
      print(distance_2)
      

      Output:

      4.47213595499958
      0.33333333333333337
      3
      
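The three printed values agree with the textbook distance formulas. Note that 0.333 equals one minus the cosine of the angle between the vectors, so it is a cosine distance. The functions below are a standard-library sketch of those formulas for cross-checking, not the library's implementation, and their names are illustrative.

```python
import math


def euclidean_distance(a, b):
    # Square root of the summed squared component differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # One minus the cosine of the angle between a and b.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def chebyshev_distance(a, b):
    # Largest absolute component difference.
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    v0, v1 = [1, 2, 3, 4], [4, 3, 2, 1]
    print(euclidean_distance(v0, v1))  # about 4.4721 (square root of 20)
    print(cosine_distance(v0, v1))     # about 0.3333 (1 - 20/30)
    print(chebyshev_distance(v0, v1))  # 3
```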
    • Set-based similarity

      import deeplotx.similarity as sim
      
      set_0, set_1 = {1, 2, 3, 4}, {4, 5, 6, 7}
      # Jaccard similarity
      distance_0 = sim.jaccard_similarity(set_0, set_1)
      print(distance_0)
      # Ochiai similarity
      distance_1 = sim.ochiai_similarity(set_0, set_1)
      print(distance_1)
      # Dice coefficient
      distance_2 = sim.dice_coefficient(set_0, set_1)
      print(distance_2)
      # Overlap coefficient
      distance_3 = sim.overlap_coefficient(set_0, set_1)
      print(distance_3)
      

      Output:

      0.1428571428572653
      0.2500000000001875
      0.25000000000009376
      0.2500000000001875
      
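The printed values are consistent with the usual set-overlap definitions: the two sets share one element out of seven distinct ones, giving 1/7 for Jaccard and 0.25 for the other three. The tiny excess in the trailing digits presumably comes from a smoothing term inside the library, which is an assumption on our part. The sketch below uses illustrative function names, is not the library's code, and expects non-empty sets.

```python
import math


def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)


def ochiai(a, b):
    # |A ∩ B| / sqrt(|A| * |B|)
    return len(a & b) / math.sqrt(len(a) * len(b))


def dice(a, b):
    # 2 * |A ∩ B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(a, b):
    # |A ∩ B| / min(|A|, |B|)
    return len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    s0, s1 = {1, 2, 3, 4}, {4, 5, 6, 7}
    print(jaccard(s0, s1))  # 1/7, about 0.142857
    print(ochiai(s0, s1))   # 0.25
    print(dice(s0, s1))     # 0.25
    print(overlap(s0, s1))  # 0.25
```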
    • Probability-distribution-based similarity

      import deeplotx.similarity as sim
      
      dist_0, dist_1 = [0.3, 0.2, 0.1, 0.4], [0.2, 0.1, 0.3, 0.4]
      # Cross entropy
      distance_0 = sim.cross_entropy(dist_0, dist_1)
      print(distance_0)
      # KL divergence
      distance_1 = sim.kl_divergence(dist_0, dist_1)
      print(distance_1)
      # JS divergence
      distance_2 = sim.js_divergence(dist_0, dist_1)
      print(distance_2)
      # Hellinger distance
      distance_3 = sim.hellinger_distance(dist_0, dist_1)
      print(distance_3)
      

      Output:

      0.3575654913778237
      0.15040773967762736
      0.03969123741566945
      0.20105866986400994
      
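For cross-checking, the sketch below evaluates the textbook definitions with natural logarithms. The KL, JS and Hellinger values printed above match these formulas directly. The printed cross-entropy (0.3576) matches the summed cross-entropy divided by the number of components (1.4303 / 4), so the library appears to average rather than sum. That reading is inferred from the numbers only, and the function names here are illustrative.

```python
import math


def cross_entropy(p, q):
    # Summed cross-entropy: -sum_i p_i * ln(q_i)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    # sum_i p_i * ln(p_i / q_i); terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    # Symmetrised KL against the midpoint distribution m = (p + q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hellinger_distance(p, q):
    # sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2)
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                               for pi, qi in zip(p, q)))


if __name__ == "__main__":
    p, q = [0.3, 0.2, 0.1, 0.4], [0.2, 0.1, 0.3, 0.4]
    print(cross_entropy(p, q) / len(p))  # mean over components, about 0.3576
    print(kl_divergence(p, q))           # about 0.1504
    print(js_divergence(p, q))           # about 0.0397
    print(hellinger_distance(p, q))      # about 0.2011
```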
  • Predefined deep neural networks

    from deeplotx import (
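        LinearRegression,  # Linear regression
        LogisticRegression,  # Logistic regression / binary classification / multi-label classification
        SoftmaxRegression,  # Softmax regression / multi-class classification
        RecursiveSequential,  # Sequence model / recurrent neural network
        AutoRegression  # Autoregressive model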
    )
    

    Basic network structure:

    from typing_extensions import override
    
    import torch
    from torch import nn
    
    from deeplotx.nn.base_neural_network import BaseNeuralNetwork
    
    
    class LinearRegression(BaseNeuralNetwork):
        def __init__(self, input_dim: int, output_dim: int, model_name: str | None = None):
            super().__init__(model_name=model_name)
            self.fc1 = nn.Linear(input_dim, 1024)
            self.fc1_to_fc4_res = nn.Linear(1024, 64)
            self.fc2 = nn.Linear(1024, 768)
            self.fc3 = nn.Linear(768, 128)
            self.fc4 = nn.Linear(128, 64)
            self.fc5 = nn.Linear(64, output_dim)
            self.parametric_relu_1 = nn.PReLU(num_parameters=1, init=5e-3)
            self.parametric_relu_2 = nn.PReLU(num_parameters=1, init=5e-3)
            self.parametric_relu_3 = nn.PReLU(num_parameters=1, init=5e-3)
            self.parametric_relu_4 = nn.PReLU(num_parameters=1, init=5e-3)
    
        @override
        def forward(self, x) -> torch.Tensor:
            fc1_out = self.parametric_relu_1(self.fc1(x))
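            # Parameter-free layer norm: same output as a freshly built nn.LayerNorm,
            # without allocating a new module on every call.
            x = nn.functional.layer_norm(fc1_out, normalized_shape=(1024,), eps=1e-9)
            x = torch.dropout(x, p=0.2, train=self.training)
            x = self.parametric_relu_2(self.fc2(x))
            x = nn.functional.layer_norm(x, normalized_shape=(768,), eps=1e-9)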
            x = torch.dropout(x, p=0.2, train=self.training)
            x = self.parametric_relu_3(self.fc3(x))
            x = torch.dropout(x, p=0.2, train=self.training)
            x = self.parametric_relu_4(self.fc4(x)) + self.fc1_to_fc4_res(fc1_out)
            x = self.fc5(x)
            return x
    
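To get a feel for the size of this network, the layer shapes in `__init__` can be tallied by hand. The sketch below does this in plain Python using the fact that `nn.Linear(in, out)` holds `in * out + out` parameters and each `nn.PReLU(num_parameters=1)` holds one. The layer normalization applied in `forward` is not registered on the module, so it adds no trainable parameters. The helper names and the example `input_dim=768` are illustrative assumptions, not part of the library.

```python
def linear_params(in_features, out_features):
    # Weight matrix plus bias vector of a fully connected layer.
    return in_features * out_features + out_features


def count_parameters(input_dim, output_dim):
    """Trainable parameters of the LinearRegression structure shown above."""
    layers = {
        "fc1": linear_params(input_dim, 1024),
        "fc1_to_fc4_res": linear_params(1024, 64),  # residual shortcut
        "fc2": linear_params(1024, 768),
        "fc3": linear_params(768, 128),
        "fc4": linear_params(128, 64),
        "fc5": linear_params(64, output_dim),
    }
    prelu = 4  # four PReLU activations with one learnable slope each
    return sum(layers.values()) + prelu


if __name__ == "__main__":
    # Assumed example: 768-dimensional input, scalar output.
    print(count_parameters(input_dim=768, output_dim=1))  # 1747013
```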
