Skip to main content

An out-of-the-box long-text NLP framework.

Project description

Ask DeepWiki

Deep Long Text Learning

An out-of-the-box long-text NLP framework.

Author: vortezwohl

Installation

  • With pip

    pip install -U deeplotx
    
  • With uv (recommended)

    uv add -U deeplotx
    
  • Get the latest features from GitHub

    pip install -U git+https://github.com/vortezwohl/DeepLoTX.git
    

Quick start

  • Named entity recognition

    Multilingual is supported.

    Gender recognition is supported.

    Import dependencies

    from deeplotx import BertNER
    
    ner = BertNER()
    
    ner('你好, 我的名字是吴子豪, 来自福建福州.')
    

    stdout:

    [NamedPerson(text='吴子豪', type='PER', base_probability=0.9995428418719051, gender=<Gender.Male: 'male'>, gender_probability=0.9970703125),
    NamedEntity(text='福建', type='LOC', base_probability=0.9986373782157898),
    NamedEntity(text='福州', type='LOC', base_probability=0.9993632435798645)]
    
    ner("Hi, i'm Vortez Wohl, author of DeeploTX.")
    

    stdout:

    [NamedPerson(text='Vortez Wohl', type='PER', base_probability=0.9991965342072855, gender=<Gender.Male: 'male'>, gender_probability=0.87255859375)]
    
  • Gender recognition

    Multilingual is supported.

    Integrated from Name2Gender

    Import dependencies

    from deeplotx import Name2Gender
    
    n2g = Name2Gender()
    

    Recognize gender of "Elon Musk":

    n2g('Elon Musk')
    

    stdout:

    <Gender.Male: 'male'>
    

    Recognize gender of "Anne Hathaway":

    n2g('Anne Hathaway')
    

    stdout:

    <Gender.Female: 'female'>
    

    Recognize gender of "吴彦祖":

    n2g('吴彦祖', return_probability=True)
    

    stdout:

    (<Gender.Male: 'male'>, 1.0)
    
  • Long text embedding

    • BERT based long text embedding

      from deeplotx import LongTextEncoder
      
      encoder = LongTextEncoder(
          chunk_size=448,
          overlapping=32
      )
      encoder.encode('我是吴子豪, 这是一个测试文本.', flatten=False)
      

      stdout:

      tensor([ 2.2316e-01,  2.0300e-01,  ...,  1.5578e-01, -6.6735e-02])
      
    • Longformer based long text embedding

      from deeplotx import LongformerEncoder
      
      encoder = LongformerEncoder()
      encoder.encode('Thank you for using DeepLoTX.')
      

      stdout:

      tensor([-2.7490e-02,  6.6503e-02, ..., -6.5937e-02,  6.7802e-03])
      
  • Similarities calculation

    • Vector based

      import deeplotx.similarity as sim
      
      vector_0, vector_1 = [1, 2, 3, 4], [4, 3, 2, 1]
      distance_0 = sim.euclidean_similarity(vector_0, vector_1)
      print(distance_0)
      distance_1 = sim.cosine_similarity(vector_0, vector_1)
      print(distance_1)
      distance_2 = sim.chebyshev_similarity(vector_0, vector_1)
      print(distance_2)
      

      stdout:

      4.47213595499958
      0.33333333333333337
      3
      
    • Set based

      import deeplotx.similarity as sim
      
      set_0, set_1 = {1, 2, 3, 4}, {4, 5, 6, 7}
      distance_0 = sim.jaccard_similarity(set_0, set_1)
      print(distance_0)
      distance_1 = sim.ochiai_similarity(set_0, set_1)
      print(distance_1)
      distance_2 = sim.dice_coefficient(set_0, set_1)
      print(distance_2)
      distance_3 = sim.overlap_coefficient(set_0, set_1)
      print(distance_3)
      

      stdout:

      0.1428571428572653
      0.2500000000001875
      0.25000000000009376
      0.2500000000001875
      
    • Distribution based

      import deeplotx.similarity as sim
      
      dist_0, dist_1 = [0.3, 0.2, 0.1, 0.4], [0.2, 0.1, 0.3, 0.4]
      distance_0 = sim.cross_entropy(dist_0, dist_1)
      print(distance_0)
      distance_1 = sim.kl_divergence(dist_0, dist_1)
      print(distance_1)
      distance_2 = sim.js_divergence(dist_0, dist_1)
      print(distance_2)
      distance_3 = sim.hellinger_distance(dist_0, dist_1)
      print(distance_3)
      

      stdout:

      0.3575654913778237
      0.15040773967762736
      0.03969123741566945
      0.20105866986400994
      
  • Pre-defined neural networks

    from deeplotx import (
        FeedForward, 
        MultiHeadFeedForward, 
        LinearRegression, 
        LogisticRegression, 
        SoftmaxRegression, 
        RecursiveSequential, 
        LongContextRecursiveSequential, 
        RoPE, 
        Attention, 
        MultiHeadAttention, 
        RoFormerEncoder, 
        AutoRegression, 
        LongContextAutoRegression 
    )
    

    The fundamental FFN (MLPs):

    from typing_extensions import override
    
    import torch
    from torch import nn
    
    from deeplotx.nn.base_neural_network import BaseNeuralNetwork
    
    
    class FeedForwardUnit(BaseNeuralNetwork):
        def __init__(self, feature_dim: int, expansion_factor: int | float = 2,
                    bias: bool = True, dropout_rate: float = 0.05, model_name: str | None = None,
                    device: str | None = None, dtype: torch.dtype | None = None):
            super().__init__(in_features=feature_dim, out_features=feature_dim, model_name=model_name, device=device, dtype=dtype)
            self._dropout_rate = dropout_rate
            self.up_proj = nn.Linear(in_features=feature_dim, out_features=int(feature_dim * expansion_factor),
                                    bias=bias, device=self.device, dtype=self.dtype)
            self.down_proj = nn.Linear(in_features=int(feature_dim * expansion_factor), out_features=feature_dim,
                                    bias=bias, device=self.device, dtype=self.dtype)
            self.parametric_relu = nn.PReLU(num_parameters=1, init=5e-3,
                                            device=self.device, dtype=self.dtype)
            self.layer_norm = nn.LayerNorm(normalized_shape=self.up_proj.in_features, eps=1e-9,
                                        device=self.device, dtype=self.dtype)
    
        @override
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.ensure_device_and_dtype(x, device=self.device, dtype=self.dtype)
            residual = x
            x = self.layer_norm(x)
            x = self.up_proj(x)
            x = self.parametric_relu(x)
            if self._dropout_rate > .0:
                x = torch.dropout(x, p=self._dropout_rate, train=self.training)
            return self.down_proj(x) + residual
    
    
    class FeedForward(BaseNeuralNetwork):
        def __init__(self, feature_dim: int, num_layers: int = 1, expansion_factor: int | float = 2,
                    bias: bool = True, dropout_rate: float = 0.05, model_name: str | None = None,
                    device: str | None = None, dtype: torch.dtype | None = None):
            if num_layers < 1:
                raise ValueError('num_layers cannot be less than 1.')
            super().__init__(in_features=feature_dim, out_features=feature_dim, model_name=model_name, device=device, dtype=dtype)
            self.ffn_layers = nn.ModuleList([FeedForwardUnit(feature_dim=feature_dim,
                                                            expansion_factor=expansion_factor, bias=bias,
                                                            dropout_rate=dropout_rate,
                                                            device=self.device, dtype=self.dtype) for _ in range(num_layers)])
    
        @override
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.ensure_device_and_dtype(x, device=self.device, dtype=self.dtype)
            for ffn in self.ffn_layers:
                x = ffn(x)
            return x
    

    Attention:

    from typing_extensions import override
    
    import torch
    
    from deeplotx.nn.base_neural_network import BaseNeuralNetwork
    from deeplotx.nn.feed_forward import FeedForward
    from deeplotx.nn.rope import RoPE, DEFAULT_THETA
    
    
    class Attention(BaseNeuralNetwork):
        def __init__(self, feature_dim: int, bias: bool = True, positional: bool = True,
                    proj_layers: int = 1, proj_expansion_factor: int | float = 1.5, dropout_rate: float = 0.02,
                    model_name: str | None = None, device: str | None = None, dtype: torch.dtype | None = None,
                    **kwargs):
            super().__init__(in_features=feature_dim, out_features=feature_dim, model_name=model_name,
                            device=device, dtype=dtype)
            self._positional = positional
            self._feature_dim = feature_dim
            self.q_proj = FeedForward(feature_dim=self._feature_dim, num_layers=proj_layers,
                                    expansion_factor=proj_expansion_factor,
                                    bias=bias, dropout_rate=dropout_rate, device=self.device, dtype=self.dtype)
            self.k_proj = FeedForward(feature_dim=self._feature_dim, num_layers=proj_layers,
                                    expansion_factor=proj_expansion_factor,
                                    bias=bias, dropout_rate=dropout_rate, device=self.device, dtype=self.dtype)
            self.v_proj = FeedForward(feature_dim=self._feature_dim, num_layers=proj_layers,
                                    expansion_factor=proj_expansion_factor,
                                    bias=bias, dropout_rate=dropout_rate, device=self.device, dtype=self.dtype)
            if self._positional:
                self.rope = RoPE(feature_dim=self._feature_dim, theta=kwargs.get('theta', DEFAULT_THETA),
                                device=self.device, dtype=self.dtype)
    
        def _attention(self, x: torch.Tensor, y: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
            q, k = self.q_proj(x), self.k_proj(y)
            if self._positional:
                q, k = self.rope(q), self.rope(k)
            attn = torch.matmul(q, k.transpose(-2, -1))
            attn = attn / (self._feature_dim ** 0.5)
            attn = attn.masked_fill(mask == 0, -1e9) if mask is not None else attn
            return torch.softmax(attn, dtype=self.dtype, dim=-1)
    
        @override
        def forward(self, x: torch.Tensor, y: torch.Tensor | None = None, mask: torch.Tensor | None = None) -> torch.Tensor:
            x = self.ensure_device_and_dtype(x, device=self.device, dtype=self.dtype)
            y = x if y is None else self.ensure_device_and_dtype(y, device=self.device, dtype=self.dtype)
            if mask is not None:
                mask = self.ensure_device_and_dtype(mask, device=self.device, dtype=self.dtype)
            v = self.v_proj(y)
            return torch.matmul(self._attention(x, y, mask), v)
    
  • Text binary classification task with predefined trainer

    from deeplotx import TextBinaryClassifierTrainer, LongTextEncoder
    from deeplotx.util import get_files, read_file
    
    long_text_encoder = LongTextEncoder(
        max_length=2048, 
        chunk_size=448, 
        overlapping=32, 
        cache_capacity=512 
    )
    trainer = TextBinaryClassifierTrainer(
        long_text_encoder=long_text_encoder,
        batch_size=2,
        train_ratio=0.9 
    )
    pos_data_path = 'path/to/pos_dir'
    neg_data_path = 'path/to/neg_dir'
    pos_data = [read_file(x) for x in get_files(pos_data_path)]
    neg_data = [read_file(x) for x in get_files(neg_data_path)]
    model = trainer.train(pos_data, neg_data, 
                        num_epochs=36, learning_rate=2e-5, 
                        balancing_dataset=True, alpha=1e-4, 
                        rho=.2, encoder_layers=2, 
                        attn_heads=8, 
                        recursive_layers=2) 
    model.save(model_name='test_model', model_dir='model')
    model = model.load(model_name='test_model', model_dir='model')
    model.predict(long_text_encoder.encode('这是一个测试文本.', flatten=False))
    

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

deeplotx-0.9.3.tar.gz (33.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

deeplotx-0.9.3-py3-none-any.whl (43.8 kB view details)

Uploaded Python 3

File details

Details for the file deeplotx-0.9.3.tar.gz.

File metadata

  • Download URL: deeplotx-0.9.3.tar.gz
  • Upload date:
  • Size: 33.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.14

File hashes

Hashes for deeplotx-0.9.3.tar.gz
Algorithm Hash digest
SHA256 329f576a10cd08f1c1bd863c54afed03fe8300ffea9dc12bf2adad375685bb63
MD5 51d37d31c73b0af1c82f71d1679c12de
BLAKE2b-256 01d8a21289d13b701767567f6f6641d88de7df5a1ed8de9d1a4a4fd3ad9a0674

See more details on using hashes here.

File details

Details for the file deeplotx-0.9.3-py3-none-any.whl.

File metadata

  • Download URL: deeplotx-0.9.3-py3-none-any.whl
  • Upload date:
  • Size: 43.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.14

File hashes

Hashes for deeplotx-0.9.3-py3-none-any.whl
Algorithm Hash digest
SHA256 100e0cce2e0d1605588e3c24d2ec8a8174afe5d08090366c4ba0d93e876345a4
MD5 4c0ae86a20a4848467a635579e9bf403
BLAKE2b-256 db79ec1c7d4a82445abd79dde567e6aed2760d0ad9521feef0d147b9e56d0ac8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page