Skip to main content

An out-of-the-box long-text NLP framework.

Project description

Ask DeepWiki

Deep Long Text Learning

An out-of-the-box long-text NLP framework.

Author: vortezwohl

Installation

  • With pip

    pip install -U deeplotx
    
  • With uv (recommended)

    uv add -U deeplotx
    
  • Get the latest features from GitHub

    pip install -U git+https://github.com/vortezwohl/DeepLoTX.git
    

Quick start

  • Named entity recognition

    Multilingual is supported.

    Gender recognition is supported.

    Import dependencies

    from deeplotx import BertNER
    
    ner = BertNER()
    
    ner('你好, 我的名字是吴子豪, 来自福建福州.')
    

    stdout:

    [NamedPerson(text='吴子豪', type='PER', base_probability=0.9995428418719051, gender=<Gender.Male: 'male'>, gender_probability=0.9970703125),
    NamedEntity(text='福建', type='LOC', base_probability=0.9986373782157898),
    NamedEntity(text='福州', type='LOC', base_probability=0.9993632435798645)]
    
    ner("Hi, i'm Vortez Wohl, author of DeeploTX.")
    

    stdout:

    [NamedPerson(text='Vortez Wohl', type='PER', base_probability=0.9991965342072855, gender=<Gender.Male: 'male'>, gender_probability=0.87255859375)]
    
  • Gender recognition

    Multilingual is supported.

    Integrated from Name2Gender

    Import dependencies

    from deeplotx import Name2Gender
    
    n2g = Name2Gender()
    

    Recognize gender of "Elon Musk":

    n2g('Elon Musk')
    

    stdout:

    <Gender.Male: 'male'>
    

    Recognize gender of "Anne Hathaway":

    n2g('Anne Hathaway')
    

    stdout:

    <Gender.Female: 'female'>
    

    Recognize gender of "吴彦祖":

    n2g('吴彦祖', return_probability=True)
    

    stdout:

    (<Gender.Male: 'male'>, 1.0)
    
  • Long text embedding

    • BERT based long text embedding

      from deeplotx import LongTextEncoder
      
      encoder = LongTextEncoder(
          chunk_size=448,
          overlapping=32
      )
      encoder.encode('我是吴子豪, 这是一个测试文本.', flatten=False)
      

      stdout:

      tensor([ 2.2316e-01,  2.0300e-01,  ...,  1.5578e-01, -6.6735e-02])
      
    • Longformer based long text embedding

      from deeplotx import LongformerEncoder
      
      encoder = LongformerEncoder()
      encoder.encode('Thank you for using DeepLoTX.')
      

      stdout:

      tensor([-2.7490e-02,  6.6503e-02, ..., -6.5937e-02,  6.7802e-03])
      
  • Similarities calculation

    • Vector based

      import deeplotx.similarity as sim
      
      vector_0, vector_1 = [1, 2, 3, 4], [4, 3, 2, 1]
      distance_0 = sim.euclidean_similarity(vector_0, vector_1)
      print(distance_0)
      distance_1 = sim.cosine_similarity(vector_0, vector_1)
      print(distance_1)
      distance_2 = sim.chebyshev_similarity(vector_0, vector_1)
      print(distance_2)
      

      stdout:

      4.47213595499958
      0.33333333333333337
      3
      
    • Set based

      import deeplotx.similarity as sim
      
      set_0, set_1 = {1, 2, 3, 4}, {4, 5, 6, 7}
      distance_0 = sim.jaccard_similarity(set_0, set_1)
      print(distance_0)
      distance_1 = sim.ochiai_similarity(set_0, set_1)
      print(distance_1)
      distance_2 = sim.dice_coefficient(set_0, set_1)
      print(distance_2)
      distance_3 = sim.overlap_coefficient(set_0, set_1)
      print(distance_3)
      

      stdout:

      0.1428571428572653
      0.2500000000001875
      0.25000000000009376
      0.2500000000001875
      
    • Distribution based

      import deeplotx.similarity as sim
      
      dist_0, dist_1 = [0.3, 0.2, 0.1, 0.4], [0.2, 0.1, 0.3, 0.4]
      distance_0 = sim.cross_entropy(dist_0, dist_1)
      print(distance_0)
      distance_1 = sim.kl_divergence(dist_0, dist_1)
      print(distance_1)
      distance_2 = sim.js_divergence(dist_0, dist_1)
      print(distance_2)
      distance_3 = sim.hellinger_distance(dist_0, dist_1)
      print(distance_3)
      

      stdout:

      0.3575654913778237
      0.15040773967762736
      0.03969123741566945
      0.20105866986400994
      
  • Pre-defined neural networks

    from deeplotx import (
        FeedForward, 
        MultiHeadFeedForward, 
        LinearRegression, 
        LogisticRegression, 
        SoftmaxRegression, 
        RecursiveSequential, 
        LongContextRecursiveSequential, 
        RoPE, 
        Attention, 
        MultiHeadAttention, 
        RoFormerEncoder, 
        AutoRegression, 
        LongContextAutoRegression 
    )
    

    The fundamental FFN (MLPs):

    from typing_extensions import override
    
    import torch
    from torch import nn
    
    from deeplotx.nn.base_neural_network import BaseNeuralNetwork
    
    
    class FeedForwardUnit(BaseNeuralNetwork):
        def __init__(self, feature_dim: int, expansion_factor: int | float = 2,
                    bias: bool = True, dropout_rate: float = 0.05, model_name: str | None = None,
                    device: str | None = None, dtype: torch.dtype | None = None):
            super().__init__(in_features=feature_dim, out_features=feature_dim, model_name=model_name, device=device, dtype=dtype)
            self._dropout_rate = dropout_rate
            self.up_proj = nn.Linear(in_features=feature_dim, out_features=int(feature_dim * expansion_factor),
                                    bias=bias, device=self.device, dtype=self.dtype)
            self.down_proj = nn.Linear(in_features=int(feature_dim * expansion_factor), out_features=feature_dim,
                                    bias=bias, device=self.device, dtype=self.dtype)
            self.parametric_relu = nn.PReLU(num_parameters=1, init=5e-3,
                                            device=self.device, dtype=self.dtype)
            self.layer_norm = nn.LayerNorm(normalized_shape=self.up_proj.in_features, eps=1e-9,
                                        device=self.device, dtype=self.dtype)
    
        @override
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.ensure_device_and_dtype(x, device=self.device, dtype=self.dtype)
            residual = x
            x = self.layer_norm(x)
            x = self.up_proj(x)
            x = self.parametric_relu(x)
            if self._dropout_rate > .0:
                x = torch.dropout(x, p=self._dropout_rate, train=self.training)
            return self.down_proj(x) + residual
    
    
    class FeedForward(BaseNeuralNetwork):
        def __init__(self, feature_dim: int, num_layers: int = 1, expansion_factor: int | float = 2,
                    bias: bool = True, dropout_rate: float = 0.05, model_name: str | None = None,
                    device: str | None = None, dtype: torch.dtype | None = None):
            if num_layers < 1:
                raise ValueError('num_layers cannot be less than 1.')
            super().__init__(in_features=feature_dim, out_features=feature_dim, model_name=model_name, device=device, dtype=dtype)
            self.ffn_layers = nn.ModuleList([FeedForwardUnit(feature_dim=feature_dim,
                                                            expansion_factor=expansion_factor, bias=bias,
                                                            dropout_rate=dropout_rate,
                                                            device=self.device, dtype=self.dtype) for _ in range(num_layers)])
    
        @override
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.ensure_device_and_dtype(x, device=self.device, dtype=self.dtype)
            for ffn in self.ffn_layers:
                x = ffn(x)
            return x
    

    Attention:

    from typing_extensions import override
    
    import torch
    
    from deeplotx.nn.base_neural_network import BaseNeuralNetwork
    from deeplotx.nn.feed_forward import FeedForward
    from deeplotx.nn.rope import RoPE, DEFAULT_THETA
    
    
    class Attention(BaseNeuralNetwork):
        def __init__(self, feature_dim: int, bias: bool = True, positional: bool = True,
                    proj_layers: int = 1, proj_expansion_factor: int | float = 1.5, dropout_rate: float = 0.02,
                    model_name: str | None = None, device: str | None = None, dtype: torch.dtype | None = None,
                    **kwargs):
            super().__init__(in_features=feature_dim, out_features=feature_dim, model_name=model_name,
                            device=device, dtype=dtype)
            self._positional = positional
            self._feature_dim = feature_dim
            self.q_proj = FeedForward(feature_dim=self._feature_dim, num_layers=proj_layers,
                                    expansion_factor=proj_expansion_factor,
                                    bias=bias, dropout_rate=dropout_rate, device=self.device, dtype=self.dtype)
            self.k_proj = FeedForward(feature_dim=self._feature_dim, num_layers=proj_layers,
                                    expansion_factor=proj_expansion_factor,
                                    bias=bias, dropout_rate=dropout_rate, device=self.device, dtype=self.dtype)
            self.v_proj = FeedForward(feature_dim=self._feature_dim, num_layers=proj_layers,
                                    expansion_factor=proj_expansion_factor,
                                    bias=bias, dropout_rate=dropout_rate, device=self.device, dtype=self.dtype)
            if self._positional:
                self.rope = RoPE(feature_dim=self._feature_dim, theta=kwargs.get('theta', DEFAULT_THETA),
                                device=self.device, dtype=self.dtype)
    
        def _attention(self, x: torch.Tensor, y: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
            q, k = self.q_proj(x), self.k_proj(y)
            if self._positional:
                q, k = self.rope(q), self.rope(k)
            attn = torch.matmul(q, k.transpose(-2, -1))
            attn = attn / (self._feature_dim ** 0.5)
            attn = attn.masked_fill(mask == 0, -1e9) if mask is not None else attn
            return torch.softmax(attn, dtype=self.dtype, dim=-1)
    
        @override
        def forward(self, x: torch.Tensor, y: torch.Tensor | None = None, mask: torch.Tensor | None = None) -> torch.Tensor:
            x = self.ensure_device_and_dtype(x, device=self.device, dtype=self.dtype)
            y = x if y is None else self.ensure_device_and_dtype(y, device=self.device, dtype=self.dtype)
            if mask is not None:
                mask = self.ensure_device_and_dtype(mask, device=self.device, dtype=self.dtype)
            v = self.v_proj(y)
            return torch.matmul(self._attention(x, y, mask), v)
    
  • Text binary classification task with predefined trainer

    from deeplotx import TextBinaryClassifierTrainer, LongTextEncoder
    from deeplotx.util import get_files, read_file
    
    long_text_encoder = LongTextEncoder(
        max_length=2048, 
        chunk_size=448, 
        overlapping=32, 
        cache_capacity=512 
    )
    trainer = TextBinaryClassifierTrainer(
        long_text_encoder=long_text_encoder,
        batch_size=2,
        train_ratio=0.9 
    )
    pos_data_path = 'path/to/pos_dir'
    neg_data_path = 'path/to/neg_dir'
    pos_data = [read_file(x) for x in get_files(pos_data_path)]
    neg_data = [read_file(x) for x in get_files(neg_data_path)]
    model = trainer.train(pos_data, neg_data, 
                        num_epochs=36, learning_rate=2e-5, 
                        balancing_dataset=True, alpha=1e-4, 
                        rho=.2, encoder_layers=2, 
                        attn_heads=8, 
                        recursive_layers=2) 
    model.save(model_name='test_model', model_dir='model')
    model = model.load(model_name='test_model', model_dir='model')
    model.predict(long_text_encoder.encode('这是一个测试文本.', flatten=False))
    

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

deeplotx-0.9.5.tar.gz (34.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

deeplotx-0.9.5-py3-none-any.whl (44.1 kB view details)

Uploaded Python 3

File details

Details for the file deeplotx-0.9.5.tar.gz.

File metadata

  • Download URL: deeplotx-0.9.5.tar.gz
  • Upload date:
  • Size: 34.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.14

File hashes

Hashes for deeplotx-0.9.5.tar.gz
Algorithm Hash digest
SHA256 98d9eb6c2bb1b6f38d9e95c9a99fafb9574955d79d30fe3a0b9cb61b722849e7
MD5 22ec9b43267ad7366375a6d9a9b819ab
BLAKE2b-256 d1f244203219bf99eeb450d989d489a007d5cb53e51980defd16a99d2f8e04b0

See more details on using hashes here.

File details

Details for the file deeplotx-0.9.5-py3-none-any.whl.

File metadata

  • Download URL: deeplotx-0.9.5-py3-none-any.whl
  • Upload date:
  • Size: 44.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.14

File hashes

Hashes for deeplotx-0.9.5-py3-none-any.whl
Algorithm Hash digest
SHA256 01eac329682d3a0461e02f6f464ce4bc525edb2e2a1d028c8455e59363c738e7
MD5 36ecaa743830cc1956b97b9da0173302
BLAKE2b-256 828b8b953126393a1060bd824f773accd112fe1275c7bf41e8589c42084a1b29

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page