Skip to main content

An out-of-the-box long-text NLP framework.

Project description

Ask DeepWiki

Deep Long Text Learning

An out-of-the-box long-text NLP framework.

Author: vortezwohl

Citation

If you are incorporating the DeepLoTX framework into your research, please remember to properly cite it to acknowledge its contribution to your work.

Если вы интегрируете фреймворк DeepLoTX в своё исследование, пожалуйста, не забудьте правильно сослаться на него, указывая его вклад в вашу работу.

もしあなたが研究に DeepLoTX フレームワークを組み入れているなら、その貢献を認めるために適切に引用することを忘れないでください.

如果您正在將 DeepLoTX 框架整合到您的研究中,請務必正確引用它,以聲明它對您工作的貢獻.

@software{Wu_DeepLoTX_2025,
author = {Wu, Zihao},
license = {GPL-3.0},
month = aug,
title = {{DeepLoTX}},
url = {https://github.com/vortezwohl/DeepLoTX},
version = {0.9.5},
year = {2025}
}

Installation

  • With pip

    pip install -U deeplotx
    
  • With uv (recommended)

    uv add -U deeplotx
    
  • Get the latest features from GitHub

    pip install -U git+https://github.com/vortezwohl/DeepLoTX.git
    

Quick start

  • Named entity recognition

    Multilingual is supported.

    Gender recognition is supported.

    Import dependencies

    from deeplotx import BertNER
    
    ner = BertNER()
    
    ner('你好, 我的名字是吴子豪, 来自福建福州.')
    

    stdout:

    [NamedPerson(text='吴子豪', type='PER', base_probability=0.9995428418719051, gender=<Gender.Male: 'male'>, gender_probability=0.9970703125),
    NamedEntity(text='福建', type='LOC', base_probability=0.9986373782157898),
    NamedEntity(text='福州', type='LOC', base_probability=0.9993632435798645)]
    
    ner("Hi, i'm Vortez Wohl, author of DeeploTX.")
    

    stdout:

    [NamedPerson(text='Vortez Wohl', type='PER', base_probability=0.9991965342072855, gender=<Gender.Male: 'male'>, gender_probability=0.87255859375)]
    
  • Gender recognition

    Multilingual is supported.

    Integrated from Name2Gender

    Import dependencies

    from deeplotx import Name2Gender
    
    n2g = Name2Gender()
    

    Recognize gender of "Elon Musk":

    n2g('Elon Musk')
    

    stdout:

    <Gender.Male: 'male'>
    

    Recognize gender of "Anne Hathaway":

    n2g('Anne Hathaway')
    

    stdout:

    <Gender.Female: 'female'>
    

    Recognize gender of "吴彦祖":

    n2g('吴彦祖', return_probability=True)
    

    stdout:

    (<Gender.Male: 'male'>, 1.0)
    
  • Long text embedding

    • BERT based long text embedding

      from deeplotx import LongTextEncoder
      
      encoder = LongTextEncoder(
          chunk_size=448,
          overlapping=32
      )
      encoder.encode('我是吴子豪, 这是一个测试文本.', flatten=False)
      

      stdout:

      tensor([ 2.2316e-01,  2.0300e-01,  ...,  1.5578e-01, -6.6735e-02])
      
    • Longformer based long text embedding

      from deeplotx import LongformerEncoder
      
      encoder = LongformerEncoder()
      encoder.encode('Thank you for using DeepLoTX.')
      

      stdout:

      tensor([-2.7490e-02,  6.6503e-02, ..., -6.5937e-02,  6.7802e-03])
      
  • Similarities calculation

    • Vector based

      import deeplotx.similarity as sim
      
      vector_0, vector_1 = [1, 2, 3, 4], [4, 3, 2, 1]
      distance_0 = sim.euclidean_similarity(vector_0, vector_1)
      print(distance_0)
      distance_1 = sim.cosine_similarity(vector_0, vector_1)
      print(distance_1)
      distance_2 = sim.chebyshev_similarity(vector_0, vector_1)
      print(distance_2)
      

      stdout:

      4.47213595499958
      0.33333333333333337
      3
      
    • Set based

      import deeplotx.similarity as sim
      
      set_0, set_1 = {1, 2, 3, 4}, {4, 5, 6, 7}
      distance_0 = sim.jaccard_similarity(set_0, set_1)
      print(distance_0)
      distance_1 = sim.ochiai_similarity(set_0, set_1)
      print(distance_1)
      distance_2 = sim.dice_coefficient(set_0, set_1)
      print(distance_2)
      distance_3 = sim.overlap_coefficient(set_0, set_1)
      print(distance_3)
      

      stdout:

      0.1428571428572653
      0.2500000000001875
      0.25000000000009376
      0.2500000000001875
      
    • Distribution based

      import deeplotx.similarity as sim
      
      dist_0, dist_1 = [0.3, 0.2, 0.1, 0.4], [0.2, 0.1, 0.3, 0.4]
      distance_0 = sim.cross_entropy(dist_0, dist_1)
      print(distance_0)
      distance_1 = sim.kl_divergence(dist_0, dist_1)
      print(distance_1)
      distance_2 = sim.js_divergence(dist_0, dist_1)
      print(distance_2)
      distance_3 = sim.hellinger_distance(dist_0, dist_1)
      print(distance_3)
      

      stdout:

      0.3575654913778237
      0.15040773967762736
      0.03969123741566945
      0.20105866986400994
      
  • Pre-defined neural networks

    from deeplotx import (
        FeedForward, 
        MultiHeadFeedForward, 
        LinearRegression, 
        LogisticRegression, 
        SoftmaxRegression, 
        RecursiveSequential, 
        LongContextRecursiveSequential, 
        RoPE, 
        Attention, 
        MultiHeadAttention, 
        RoFormerEncoder, 
        AutoRegression, 
        LongContextAutoRegression 
    )
    

    The fundamental FFN (MLPs):

    from typing_extensions import override
    
    import torch
    from torch import nn
    
    from deeplotx.nn.base_neural_network import BaseNeuralNetwork
    
    
    class FeedForwardUnit(BaseNeuralNetwork):
        def __init__(self, feature_dim: int, expansion_factor: int | float = 2,
                    bias: bool = True, dropout_rate: float = 0.05, model_name: str | None = None,
                    device: str | None = None, dtype: torch.dtype | None = None):
            super().__init__(in_features=feature_dim, out_features=feature_dim, model_name=model_name, device=device, dtype=dtype)
            self._dropout_rate = dropout_rate
            self.up_proj = nn.Linear(in_features=feature_dim, out_features=int(feature_dim * expansion_factor),
                                    bias=bias, device=self.device, dtype=self.dtype)
            self.down_proj = nn.Linear(in_features=int(feature_dim * expansion_factor), out_features=feature_dim,
                                    bias=bias, device=self.device, dtype=self.dtype)
            self.parametric_relu = nn.PReLU(num_parameters=1, init=5e-3,
                                            device=self.device, dtype=self.dtype)
            self.layer_norm = nn.LayerNorm(normalized_shape=self.up_proj.in_features, eps=1e-9,
                                        device=self.device, dtype=self.dtype)
    
        @override
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.ensure_device_and_dtype(x, device=self.device, dtype=self.dtype)
            residual = x
            x = self.layer_norm(x)
            x = self.up_proj(x)
            x = self.parametric_relu(x)
            if self._dropout_rate > .0:
                x = torch.dropout(x, p=self._dropout_rate, train=self.training)
            return self.down_proj(x) + residual
    
    
    class FeedForward(BaseNeuralNetwork):
        def __init__(self, feature_dim: int, num_layers: int = 1, expansion_factor: int | float = 2,
                    bias: bool = True, dropout_rate: float = 0.05, model_name: str | None = None,
                    device: str | None = None, dtype: torch.dtype | None = None):
            if num_layers < 1:
                raise ValueError('num_layers cannot be less than 1.')
            super().__init__(in_features=feature_dim, out_features=feature_dim, model_name=model_name, device=device, dtype=dtype)
            self.ffn_layers = nn.ModuleList([FeedForwardUnit(feature_dim=feature_dim,
                                                            expansion_factor=expansion_factor, bias=bias,
                                                            dropout_rate=dropout_rate,
                                                            device=self.device, dtype=self.dtype) for _ in range(num_layers)])
    
        @override
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.ensure_device_and_dtype(x, device=self.device, dtype=self.dtype)
            for ffn in self.ffn_layers:
                x = ffn(x)
            return x
    

    Attention:

    from typing_extensions import override
    
    import torch
    
    from deeplotx.nn.base_neural_network import BaseNeuralNetwork
    from deeplotx.nn.feed_forward import FeedForward
    from deeplotx.nn.rope import RoPE, DEFAULT_THETA
    
    
    class Attention(BaseNeuralNetwork):
        def __init__(self, feature_dim: int, bias: bool = True, positional: bool = True,
                    proj_layers: int = 1, proj_expansion_factor: int | float = 1.5, dropout_rate: float = 0.02,
                    model_name: str | None = None, device: str | None = None, dtype: torch.dtype | None = None,
                    **kwargs):
            super().__init__(in_features=feature_dim, out_features=feature_dim, model_name=model_name,
                            device=device, dtype=dtype)
            self._positional = positional
            self._feature_dim = feature_dim
            self.q_proj = FeedForward(feature_dim=self._feature_dim, num_layers=proj_layers,
                                    expansion_factor=proj_expansion_factor,
                                    bias=bias, dropout_rate=dropout_rate, device=self.device, dtype=self.dtype)
            self.k_proj = FeedForward(feature_dim=self._feature_dim, num_layers=proj_layers,
                                    expansion_factor=proj_expansion_factor,
                                    bias=bias, dropout_rate=dropout_rate, device=self.device, dtype=self.dtype)
            self.v_proj = FeedForward(feature_dim=self._feature_dim, num_layers=proj_layers,
                                    expansion_factor=proj_expansion_factor,
                                    bias=bias, dropout_rate=dropout_rate, device=self.device, dtype=self.dtype)
            if self._positional:
                self.rope = RoPE(feature_dim=self._feature_dim, theta=kwargs.get('theta', DEFAULT_THETA),
                                device=self.device, dtype=self.dtype)
    
        def _attention(self, x: torch.Tensor, y: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
            q, k = self.q_proj(x), self.k_proj(y)
            if self._positional:
                q, k = self.rope(q), self.rope(k)
            attn = torch.matmul(q, k.transpose(-2, -1))
            attn = attn / (self._feature_dim ** 0.5)
            attn = attn.masked_fill(mask == 0, -1e9) if mask is not None else attn
            return torch.softmax(attn, dtype=self.dtype, dim=-1)
    
        @override
        def forward(self, x: torch.Tensor, y: torch.Tensor | None = None, mask: torch.Tensor | None = None) -> torch.Tensor:
            x = self.ensure_device_and_dtype(x, device=self.device, dtype=self.dtype)
            y = x if y is None else self.ensure_device_and_dtype(y, device=self.device, dtype=self.dtype)
            if mask is not None:
                mask = self.ensure_device_and_dtype(mask, device=self.device, dtype=self.dtype)
            v = self.v_proj(y)
            return torch.matmul(self._attention(x, y, mask), v)
    
  • Text binary classification task with predefined trainer

    from deeplotx import TextBinaryClassifierTrainer, LongTextEncoder
    from deeplotx.util import get_files, read_file
    
    long_text_encoder = LongTextEncoder(
        max_length=2048, 
        chunk_size=448, 
        overlapping=32, 
        cache_capacity=512 
    )
    trainer = TextBinaryClassifierTrainer(
        long_text_encoder=long_text_encoder,
        batch_size=2,
        train_ratio=0.9 
    )
    pos_data_path = 'path/to/pos_dir'
    neg_data_path = 'path/to/neg_dir'
    pos_data = [read_file(x) for x in get_files(pos_data_path)]
    neg_data = [read_file(x) for x in get_files(neg_data_path)]
    model = trainer.train(pos_data, neg_data, 
                        num_epochs=36, learning_rate=2e-5, 
                        balancing_dataset=True, alpha=1e-4, 
                        rho=.2, encoder_layers=2, 
                        attn_heads=8, 
                        recursive_layers=2) 
    model.save(model_name='test_model', model_dir='model')
    model = model.load(model_name='test_model', model_dir='model')
    model.predict(long_text_encoder.encode('这是一个测试文本.', flatten=False))
    

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

deeplotx-0.9.7.tar.gz (35.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

deeplotx-0.9.7-py3-none-any.whl (44.9 kB view details)

Uploaded Python 3

File details

Details for the file deeplotx-0.9.7.tar.gz.

File metadata

  • Download URL: deeplotx-0.9.7.tar.gz
  • Upload date:
  • Size: 35.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.14

File hashes

Hashes for deeplotx-0.9.7.tar.gz
Algorithm Hash digest
SHA256 9f1f7add7f277ba4b29b7610a13e1f393dff8648d07138fb3c4d2a82b4645174
MD5 f2e6394af51572e6fe16e50825b96a68
BLAKE2b-256 535203aab3fc3cbf562380d48ec8d0fc4ff4e23248476a7e691d288028bb6360

See more details on using hashes here.

File details

Details for the file deeplotx-0.9.7-py3-none-any.whl.

File metadata

  • Download URL: deeplotx-0.9.7-py3-none-any.whl
  • Upload date:
  • Size: 44.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.14

File hashes

Hashes for deeplotx-0.9.7-py3-none-any.whl
Algorithm Hash digest
SHA256 2ba2b31f0c7fdb39dd618c897fb519c60f525158d218d2883f094590ab0bc1b2
MD5 f99904db416c98bb4618fed32a29eb5d
BLAKE2b-256 5aea19b4426d052da34edf1c505fc4e1135804e99b86306737f95d3d63e7a687

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page