Easy-to-use long-text NLP toolkit.
Deep Long Text Learning Kit
Author: 吴子豪
An out-of-the-box framework for long-text semantic modeling.
Installation
Using pip
    pip install -U deeplotx
Using uv (recommended)
    uv add -U deeplotx
Install the latest features from GitHub
    pip install -U git+https://github.com/vortezwohl/DeepLoTX.git
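After running any of the install commands above, one way to confirm that the package is visible to the current interpreter is to query its distribution metadata. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of deeplotx.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if deeplotx is installed in this environment, otherwise None.
print(installed_version("deeplotx"))
```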
Core features
Long-text embedding
Long-text embedding based on a generic BERT model (no fixed limit on input length; the maximum is set with max_length)
    from deeplotx import LongTextEncoder

    # Maximum text length: 2048 tokens; chunk size: 512 tokens;
    # overlap between adjacent chunks: 64 tokens.
    encoder = LongTextEncoder(
        max_length=2048,
        chunk_size=512,
        overlapping=64
    )
    # Compute the embedding of the sample text '我是吴子豪, 这是一个测试文本.' and flatten it.
    encoder.encode('我是吴子豪, 这是一个测试文本.', flatten=True, use_cache=True)
Output:
    tensor([ 0.5163, 0.2497, 0.5896, ..., -0.9815, -0.3095, 0.4232])
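The max_length, chunk_size and overlapping parameters above suggest a sliding-window split of the token sequence. The sketch below illustrates one way such a split could work. `chunk_token_ids` is a hypothetical helper, not part of deeplotx, and how the library actually truncates, pads or pools chunks is not stated here, so treat this as an assumption.

```python
from typing import Iterable, List


def chunk_token_ids(token_ids: Iterable[int],
                    max_length: int = 2048,
                    chunk_size: int = 512,
                    overlapping: int = 64) -> List[List[int]]:
    """Split token ids into windows of chunk_size that share `overlapping` tokens.

    Hypothetical illustration only; not the DeepLoTX implementation.
    """
    if chunk_size <= 0 or not 0 <= overlapping < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlapping < chunk_size")
    ids = list(token_ids)[:max_length]   # assumed: drop tokens beyond max_length
    step = chunk_size - overlapping      # distance between consecutive window starts
    chunks: List[List[int]] = []
    start = 0
    while start < len(ids):
        chunks.append(ids[start:start + chunk_size])
        if start + chunk_size >= len(ids):
            break                        # this window already reaches the end
        start += step
    return chunks


chunks = chunk_token_ids(range(1000))
print([len(c) for c in chunks])  # → [512, 512, 104]
```

With these defaults, each window starts 448 tokens after the previous one, so the last 64 tokens of one chunk are repeated at the start of the next.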
Long-text embedding based on Longformer (supports up to 4096 tokens)
    from deeplotx import LongformerEncoder

    encoder = LongformerEncoder()
    encoder.encode('我是吴子豪, 这是一个测试文本.')
Similarity computation
Vector-based similarity
    import deeplotx.similarity as sim

    vector_0, vector_1 = [1, 2, 3, 4], [4, 3, 2, 1]
    # Euclidean distance
    distance_0 = sim.euclidean_similarity(vector_0, vector_1)
    print(distance_0)
    # Cosine distance (the value printed below equals 1 - cosine similarity)
    distance_1 = sim.cosine_similarity(vector_0, vector_1)
    print(distance_1)
    # Chebyshev distance
    distance_2 = sim.chebyshev_similarity(vector_0, vector_1)
    print(distance_2)
Output:
    4.47213595499958
    0.33333333333333337
    3
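For reference, the three values above follow from the textbook definitions of these measures. The sketch below recomputes them with the standard library only. The function names are illustrative and are not deeplotx APIs, and the cosine value is computed as 1 minus the cosine similarity, which is what the printed output corresponds to.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # 1 - cos(angle between a and b); assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def chebyshev_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))


vector_0, vector_1 = [1, 2, 3, 4], [4, 3, 2, 1]
print(euclidean_distance(vector_0, vector_1))  # sqrt(9 + 1 + 1 + 9) = sqrt(20)
print(cosine_distance(vector_0, vector_1))     # 1 - 20 / 30
print(chebyshev_distance(vector_0, vector_1))  # max(3, 1, 1, 3)
```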
Set-based similarity
    import deeplotx.similarity as sim

    set_0, set_1 = {1, 2, 3, 4}, {4, 5, 6, 7}
    # Jaccard similarity
    distance_0 = sim.jaccard_similarity(set_0, set_1)
    print(distance_0)
    # Ochiai similarity
    distance_1 = sim.ochiai_similarity(set_0, set_1)
    print(distance_1)
    # Dice coefficient
    distance_2 = sim.dice_coefficient(set_0, set_1)
    print(distance_2)
    # Overlap coefficient
    distance_3 = sim.overlap_coefficient(set_0, set_1)
    print(distance_3)
Output:
    0.1428571428572653
    0.2500000000001875
    0.25000000000009376
    0.2500000000001875
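The set measures above can be reproduced from their standard definitions, sketched below with plain Python sets. The function names are illustrative, not deeplotx APIs, and non-empty sets are assumed. The exact results are 1/7, 0.25, 0.25 and 0.25. The printed values differ from these after roughly the twelfth decimal place, which may indicate a small numerical stabilizer inside the library, though that is an assumption.

```python
import math
from typing import AbstractSet


def jaccard(a: AbstractSet, b: AbstractSet) -> float:
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)


def ochiai(a: AbstractSet, b: AbstractSet) -> float:
    # |A ∩ B| / sqrt(|A| * |B|)
    return len(a & b) / math.sqrt(len(a) * len(b))


def dice(a: AbstractSet, b: AbstractSet) -> float:
    # 2 * |A ∩ B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(a: AbstractSet, b: AbstractSet) -> float:
    # |A ∩ B| / min(|A|, |B|)
    return len(a & b) / min(len(a), len(b))


set_0, set_1 = {1, 2, 3, 4}, {4, 5, 6, 7}
for measure in (jaccard, ochiai, dice, overlap):
    print(measure.__name__, measure(set_0, set_1))
```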
Probability-distribution-based similarity
    import deeplotx.similarity as sim

    dist_0, dist_1 = [0.3, 0.2, 0.1, 0.4], [0.2, 0.1, 0.3, 0.4]
    # Cross-entropy
    distance_0 = sim.cross_entropy(dist_0, dist_1)
    print(distance_0)
    # KL divergence
    distance_1 = sim.kl_divergence(dist_0, dist_1)
    print(distance_1)
    # JS divergence
    distance_2 = sim.js_divergence(dist_0, dist_1)
    print(distance_2)
    # Hellinger distance
    distance_3 = sim.hellinger_distance(dist_0, dist_1)
    print(distance_3)
Output:
    0.3575654913778237
    0.15040773967762736
    0.03969123741566945
    0.20105866986400994
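The four values above are consistent with the usual definitions when natural logarithms are used and the cross-entropy terms are averaged over the elements rather than summed. Both choices are inferred from the printed numbers, not from the library's source, so the sketch below is an assumption-laden illustration with hypothetical function names, not the deeplotx implementation.

```python
import math
from typing import Sequence


def cross_entropy_mean(p: Sequence[float], q: Sequence[float]) -> float:
    # -(1/n) * sum(p_i * ln q_i); averaging is an assumption inferred from the output.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q)) / len(p)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    # sum(p_i * ln(p_i / q_i)); terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    # Symmetrized KL against the midpoint distribution m = (p + q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hellinger_distance(p: Sequence[float], q: Sequence[float]) -> float:
    # sqrt(sum((sqrt(p_i) - sqrt(q_i))^2) / 2)
    total = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(total / 2)


dist_0, dist_1 = [0.3, 0.2, 0.1, 0.4], [0.2, 0.1, 0.3, 0.4]
print(cross_entropy_mean(dist_0, dist_1))
print(kl_divergence(dist_0, dist_1))
print(js_divergence(dist_0, dist_1))
print(hellinger_distance(dist_0, dist_1))
```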
Predefined deep neural networks
    from deeplotx import (
        LinearRegression,     # linear regression
        LogisticRegression,   # logistic regression / binary classification / multi-label classification
        SoftmaxRegression,    # softmax regression / multi-class classification
        RecursiveSequential,  # sequence model / recurrent neural network
        AutoRegression        # autoregressive model
    )
Basic network structure:
    from typing_extensions import override

    import torch
    from torch import nn

    from deeplotx.nn.base_neural_network import BaseNeuralNetwork


    class LinearRegression(BaseNeuralNetwork):
        def __init__(self, input_dim: int, output_dim: int, model_name: str | None = None):
            super().__init__(model_name=model_name)
            self.fc1 = nn.Linear(input_dim, 1024)
            # Residual projection from the fc1 output (1024) to the fc4 output (64).
            self.fc1_to_fc4_res = nn.Linear(1024, 64)
            self.fc2 = nn.Linear(1024, 768)
            self.fc3 = nn.Linear(768, 128)
            self.fc4 = nn.Linear(128, 64)
            self.fc5 = nn.Linear(64, output_dim)
            self.parametric_relu_1 = nn.PReLU(num_parameters=1, init=5e-3)
            self.parametric_relu_2 = nn.PReLU(num_parameters=1, init=5e-3)
            self.parametric_relu_3 = nn.PReLU(num_parameters=1, init=5e-3)
            self.parametric_relu_4 = nn.PReLU(num_parameters=1, init=5e-3)

        @override
        def forward(self, x) -> torch.Tensor:
            fc1_out = self.parametric_relu_1(self.fc1(x))
            # LayerNorm is constructed inline here, so its scale and shift are never trained.
            x = nn.LayerNorm(normalized_shape=1024, eps=1e-9)(fc1_out)
            x = torch.dropout(x, p=0.2, train=self.training)
            x = self.parametric_relu_2(self.fc2(x))
            x = nn.LayerNorm(normalized_shape=768, eps=1e-9)(x)
            x = torch.dropout(x, p=0.2, train=self.training)
            x = self.parametric_relu_3(self.fc3(x))
            x = torch.dropout(x, p=0.2, train=self.training)
            # Residual connection: add the projected fc1 output to the fc4 output.
            x = self.parametric_relu_4(self.fc4(x)) + self.fc1_to_fc4_res(fc1_out)
            x = self.fc5(x)
            return x
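The forward pass of the LinearRegression class above can be written compactly as equations. The symbols are notation introduced here for illustration only: W_i and b_i stand for the weight and bias of fc i, W_r and b_r for those of fc1_to_fc4_res, LN for layer normalization, and Drop for dropout with p = 0.2, which is active only in training mode.

```latex
\begin{aligned}
h_1 &= \mathrm{PReLU}_1(W_1 x + b_1), & h_1 &\in \mathbb{R}^{1024} \\
h_2 &= \mathrm{PReLU}_2\bigl(W_2\,\mathrm{Drop}(\mathrm{LN}(h_1)) + b_2\bigr), & h_2 &\in \mathbb{R}^{768} \\
h_3 &= \mathrm{PReLU}_3\bigl(W_3\,\mathrm{Drop}(\mathrm{LN}(h_2)) + b_3\bigr), & h_3 &\in \mathbb{R}^{128} \\
h_4 &= \mathrm{PReLU}_4\bigl(W_4\,\mathrm{Drop}(h_3) + b_4\bigr) + (W_r h_1 + b_r), & h_4 &\in \mathbb{R}^{64} \\
y &= W_5 h_4 + b_5, & y &\in \mathbb{R}^{\text{output\_dim}}
\end{aligned}
```

The skip term W_r h_1 + b_r carries the first hidden representation directly to the 64-dimensional layer, bypassing fc2 through fc4.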