
A package for performing advanced regression on text data using a unified deep learning framework.


TextRegress v1.3.0

TextRegress is a Python package designed to help researchers perform advanced regression analysis on text data. It provides a unified deep learning framework to handle long-text data and supports:

  • Modular Architecture: Clean, extensible package structure with registry systems for models, encoders, and losses
  • Multiple Model Types: LSTM and GRU models with full feature parity including cross-attention and feature mixing
  • Configurable text encoding using SentenceTransformer, TFIDF, or any pretrained Hugging Face model
  • Automatic text chunking for long documents
  • Deep learning backend based on PyTorch Lightning with RNN layers
  • Integration of exogenous features through normalization and attention mechanisms
  • Explainability features including gradient-based importance and integrated gradients
  • Model persistence with save/load functionality
  • Sklearn-like API with fit, predict, and fit_predict methods

Citation

If this package was helpful in your work, please cite it as:

  • Jiang, Jinhang and Liu, Ben and Peng, Weiyao and Srinivasan, Karthik, TextRegress: A Python package for advanced regression analysis on long-form text data (May 5, 2025). Software Impacts. https://doi.org/10.1016/j.simpa.2025.100760
@article{jiang2025textregress,
  title={TextRegress: A Python package for advanced regression analysis on long-form text data},
  author={Jiang, Jinhang and Liu, Ben and Peng, Weiyao and Srinivasan, Karthik},
  journal={Software Impacts},
  pages={100760},
  year={2025},
  publisher={Elsevier}
}

Installation

Note: Version 1.3.0 includes comprehensive feature importance analysis, robust device handling, and enhanced explainability. Use get_feature_importance() for model interpretability and device argument for GPU/CPU control.

pip install textregress==1.3.0

Or install from the repository:

git clone https://github.com/jinhangjiang/textregress.git
cd textregress
pip install -e .

Quick Start

import pandas as pd
from textregress import TextRegressor

# Create sample data
data = {
    'text': [
        "This is a positive review about the product.",
        "The quality is excellent and I recommend it.",
        "Not satisfied with the purchase.",
        "Great value for money."
    ],
    'y': [4.5, 4.8, 2.1, 4.2],
    'feature1': [1.0, 1.2, 0.8, 1.1],
    'feature2': [0.5, 0.6, 0.3, 0.7]
}
df = pd.DataFrame(data)

# Create and train the model
regressor = TextRegressor(
    model_name="lstm",  # or "gru"
    encoder_model="sentence-transformers/all-MiniLM-L6-v2",
    exogenous_features=["feature1", "feature2"],
    max_steps=100
)

# Fit and predict
predictions = regressor.fit_predict(df)
print(f"Predictions: {predictions}")

# Device handling (optional)
print(f"Current device: {regressor.get_device()}")
regressor.set_device("cuda")  # Move to GPU if available
regressor.set_device("cpu")   # Move back to CPU

# Get feature importance for explainability
importance = regressor.get_feature_importance()
print(f"Text importance shape: {importance['text_importance'].shape}")
if 'exogenous_importance' in importance:
    print(f"Exogenous importance shape: {importance['exogenous_importance'].shape}")

# Analyze with different modes
gradient_importance = regressor.get_feature_importance(mode="gradient")
attention_importance = regressor.get_feature_importance(mode="attention")  # Requires cross-attention

# Analyze custom data
custom_data = pd.DataFrame({
    'text': ["New text to analyze"],
    'feature1': [1.5],
    'feature2': [0.8]
})
custom_importance = regressor.get_feature_importance(df=custom_data)

Implementation

TextRegressor(
    encoder_model: str,
    encoder_params: Optional[dict] = None,
    rnn_type: str,
    rnn_layers: int,
    hidden_size: int,
    bidirectional: bool,
    inference_layer_units: int,
    exogenous_features: Optional[List[str]] = None,
    feature_mixer: bool = False,
    learning_rate: float,
    loss_function: Union[str, Callable],
    encoder_output_dim: int,
    optimizer_name: str,
    optimizer_params: dict = None,
    cross_attention_enabled: bool = False,
    cross_attention_layer: Optional[nn.Module] = None,
    dropout_rate: float = 0.0,
    se_layer: bool = True,
    random_seed: int = 1
)

Parameters:

  • encoder_model: str
    Specifies the pretrained encoder model to use. This can be a HuggingFace model identifier (e.g., "sentence-transformers/all-MiniLM-L6-v2") or "tfidf" for a TFIDF-based encoder.

  • encoder_params: Optional[dict]
    A dictionary of additional parameters for configuring the encoder. For example, when using a TFIDF encoder, users can supply parameters such as {"max_features": 1000, "ngram_range": (1, 2)}. These parameters are passed directly to the underlying encoder.

  • rnn_type: str
    Specifies the type of recurrent unit to use. Acceptable values include "LSTM" and "GRU". This choice determines the basic building block of the temporal processing module.

  • rnn_layers: int
    The number of stacked RNN layers in the model. More layers can capture higher-order temporal features but may require more data and computation.

  • hidden_size: int
    The number of hidden units in each RNN layer. This parameter defines the dimensionality of the hidden state and directly influences the model's capacity.

  • bidirectional: bool
    When set to True, the RNN operates in a bidirectional manner, processing the sequence in both forward and backward directions. This effectively doubles the output dimension of the RNN.

  • inference_layer_units: int
    The number of units in the intermediate inference (fully connected) layer. This layer transforms the processed features into a representation suitable for the final regression output.

  • exogenous_features: Optional[List[str]]
    A list of column names representing additional (exogenous) features to be incorporated into the model.

    • When cross_attention_enabled is True, these features are projected to match the RNN output dimension and integrated via a cross-attention mechanism.
    • When cross_attention_enabled is False and feature_mixer is also False, the normalized exogenous features are directly concatenated with the document embedding.
    • When feature_mixer is True, the model first computes an inference output from the document embedding and then mixes in the normalized exogenous features via an additional mixing layer before making predictions.
  • feature_mixer: bool
    A flag to enable additional mixing of exogenous features. When set to True, the model mixes normalized exogenous features with the inference output of the document embedding via a dedicated linear layer. When False, the exogenous features are concatenated directly with the document embedding.

  • learning_rate: float
    The learning rate used by the optimizer during training. This controls how quickly the model weights are updated.

  • loss_function: Union[str, Callable]
    Specifies the loss function for training. Supported string options include "mae", "mse", "rmse", "smape", "wmape", and "mape". Alternatively, users can provide a custom loss function as a callable. Custom loss functions must accept keyword arguments pred and target.

  • encoder_output_dim: int
    The dimensionality of the vector output from the encoder module. This value is used to configure the input size of the RNN. For instance, when using a TFIDF encoder, this is automatically set based on the size of the fitted vocabulary.

  • optimizer_name: str
    The name of the optimizer to be used (e.g., "adam", "sgd", etc.). The model dynamically searches within PyTorch's optimizers to instantiate the specified optimizer.

  • optimizer_params: dict
    A dictionary containing additional keyword parameters to pass to the optimizer upon instantiation (for example, momentum for SGD).

  • cross_attention_enabled: bool
    A flag indicating whether to enable a cross-attention mechanism. When True, the model generates a global token (by averaging the RNN outputs) and uses it as the query to attend over the projected exogenous features. The output of this attention is concatenated with the RNN's last time-step output before further processing.

  • cross_attention_layer: Optional[nn.Module]
    An optional custom cross-attention layer. If not provided and cross attention is enabled, a default single-head MultiheadAttention layer (from nn.MultiheadAttention) is used.

  • dropout_rate: float
    The dropout rate applied after each major component (e.g., after the RNN output, global token generation, inference layers, cross-attention, and squeeze-and-excitation block). A value of 0.0 means no dropout is applied.

  • se_layer: bool
    Specifies whether to enable the squeeze-and-excitation (SE) block on the output of the inference layer. When enabled, the SE block recalibrates channel-wise feature responses, potentially enhancing model performance.

  • random_seed: int
    Sets the random seed for reproducibility. This value is used to initialize PyTorch (via torch.manual_seed), ensuring that training results are consistent across runs.
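
As noted under loss_function, a custom loss must be a callable that accepts the keyword arguments pred and target. A minimal SMAPE-style sketch follows (the epsilon guard and exact formula are illustrative assumptions; the package's built-in "smape" may differ in detail):

```python
import torch

def smape_loss(pred, target):
    """Symmetric MAPE: mean of 2|pred - target| / (|pred| + |target|).
    The signature must use the keyword names `pred` and `target`,
    as required for custom losses passed to TextRegressor."""
    eps = 1e-8  # guard against division by zero when pred == target == 0
    return torch.mean(
        2.0 * torch.abs(pred - target)
        / (torch.abs(pred) + torch.abs(target) + eps)
    )

# Standalone check on toy tensors
loss = smape_loss(pred=torch.tensor([2.0, 4.0]), target=torch.tensor([2.0, 2.0]))
```

The callable would then be passed directly, e.g. loss_function=smape_loss, in place of a string option.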


Usage Example:

from textregress import TextRegressor

# Instantiate the TextRegressor with custom encoder parameters and feature mixing:
regressor = TextRegressor(
    encoder_model="tfidf",  # Use the TFIDF encoder
    encoder_params={"max_features": 1000, "ngram_range": (1, 2)},  # Custom TFIDF parameters
    rnn_type="GRU",                     # Use GRU instead of LSTM
    rnn_layers=2,                       # Use 2 RNN layers
    hidden_size=100,                    # Hidden size set to 100
    bidirectional=False,                # Unidirectional RNN
    inference_layer_units=50,           # Inference layer with 50 units
    chunk_info=(100, 25),               # Chunk text into segments of 100 words with an overlap of 25 words
    padding_value=0,                    # Padding value for chunks
    exogenous_features=["ex1", "ex2"],  # Include two exogenous features
    feature_mixer=True,                 # Enable feature mixer to combine document embedding with exogenous features
    learning_rate=0.001,                # Learning rate of 0.001
    loss_function="mae",                # MAE loss (or a custom callable loss function)
    encoder_output_dim=1000,            # For TFIDF, this is set to the number of features (e.g., 1000)
    optimizer_name="adam",              # Use Adam optimizer
    cross_attention_enabled=True,       # Enable cross attention between a global token and exogenous features
    cross_attention_layer=None,         # Use default cross attention layer
    dropout_rate=0.1,                   # Apply dropout with a rate of 0.1
    se_layer=True,                      # Enable the squeeze-and-excitation block
    random_seed=42                      # Set a random seed for reproducibility
)

# Fit the model on a DataFrame.
regressor.fit(df, val_size=0.2)

# Predict on the same DataFrame.
predictions = regressor.predict(df)
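
The se_layer=True option above enables a squeeze-and-excitation block on the inference output. As a rough illustration of what such a block does, here is a generic SE sketch (not the package's exact implementation; the reduction factor of 4 is an assumption):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation over a feature vector: learn per-channel
    gates in (0, 1) and recalibrate the input by element-wise scaling."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels); scale each channel by its learned gate
        return x * self.fc(x)

se = SEBlock(channels=8)
out = se(torch.randn(2, 8))  # output keeps the input shape
```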

New in v1.3.0

  • 🎯 Comprehensive Feature Importance: Advanced get_feature_importance() method with gradient-based and attention-based analysis
  • 🖥️ Robust Device Handling: Automatic CPU/GPU alignment with manual override via device argument and set_device() method
  • 🔍 Enhanced Explainability: Support for both text and exogenous feature importance analysis
  • 🧪 Extensive Testing: Comprehensive test suite covering all feature importance functionality
  • 🔄 Modular Architecture: Complete modular packages (models/, encoders/, losses/, utils/) with registry systems
  • 🧠 GRU Model: Full parity with LSTM including cross-attention and feature mixing
  • 💾 Model Persistence: Save and load models with full PyTorch parameter exposure
  • 📊 Embedding Extraction: Extract document and sequence embeddings for transfer learning

Features

  • Unified DataFrame Interface
    The estimator methods (fit, predict, fit_predict) accept a single pandas DataFrame with:

    • text: Input text data (can be long-form text).
    • y: Continuous target variable.
    • Additional columns can be provided as exogenous features.
  • Configurable Text Encoding
    Choose from multiple encoding methods:

    • TFIDF Encoder: Activated when the model identifier contains "tfidf".
    • SentenceTransformer Encoder: Activated when the model identifier contains "sentence-transformers".
    • Generic Hugging Face Encoder: Supports any pre-trained Hugging Face model using AutoTokenizer/AutoModel with a mean-pooling strategy.
  • Text Chunking
    Automatically splits long texts into overlapping, fixed-size chunks (only full chunks are processed) to ensure consistent input size.

  • Deep Learning Regression Model
    Utilizes an RNN-based (LSTM/GRU) network implemented with PyTorch Lightning:

    • Customizable number of layers, hidden size, and bidirectionality.
    • Optionally integrates exogenous features into the regression process.
  • Custom Loss Functions
    Multiple loss functions are available via loss.py:

    • MAE (default)
    • SMAPE
    • MSE
    • RMSE
    • wMAPE
    • MAPE
  • Training Customization
    Fine-tune training behavior with parameters such as:

    • max_steps: Maximum training steps (default: 500).
    • early_stop_enabled: Enable early stopping (default: False).
    • patience_steps: Steps with no improvement before stopping (default: 10 when early stopping is enabled).
    • val_check_steps: Validation check interval (default: 50, automatically adjusted if needed).
    • val_size: Proportion of data reserved for validation when early stopping is enabled.
  • GPU Auto-Detection
    Automatically leverages available GPUs via PyTorch Lightning (using accelerator="auto" and devices="auto").

  • Feature Importance Analysis
    Comprehensive explainability with the get_feature_importance() method:

    • Gradient-based importance: Default mode for analyzing feature contributions using saliency maps
    • Attention weights: For models with cross-attention (requires exogenous features)
    • Flexible data: Analyze training data or new data with custom DataFrames
    • Clean output: Returns numpy arrays ready for visualization and analysis
    • Device handling: Automatic CPU/GPU alignment with state preservation
    • Multiple modes: Support for both "gradient" and "attention" analysis modes
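
The chunking behavior described above (overlapping, fixed-size chunks, with only full chunks kept) can be sketched at the word level as follows. This is an illustrative sketch mirroring the documented chunk_info=(size, overlap) semantics, not the package's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 25) -> list:
    """Split text into overlapping word-level chunks of exactly
    `chunk_size` words. Partial trailing chunks are dropped so every
    chunk has a consistent input size."""
    assert overlap < chunk_size, "overlap must be smaller than chunk size"
    words = text.split()
    step = chunk_size - overlap  # advance by size minus overlap each time
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words) - chunk_size + 1, step)
    ]

# 10 words, chunks of 4 with overlap 2 -> starts at 0, 2, 4, 6
doc = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(doc, chunk_size=4, overlap=2)
```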

