Skip to main content

A toolkit for analyzing group appeals in text using fine-tuned language models optimized for direct plain text processing

Project description

GroupAppeals

Python 3.8+ License: MIT

A Python package for analyzing social group appeals in political text using fine-tuned multilingual language models.

Overview

GroupAppeals provides a comprehensive toolkit for identifying and analyzing how political parties reference and appeal to social groups in their communications. The package uses state-of-the-art NLP models trained on political manifestos to:

  • Extract group references from text (tokens like "workers", "immigrants", "families with young children")
  • Detect stance toward the groups identified in the text (positive, negative, or neutral)
  • Identify whether policies are directed at the specific groups identified in the text (no thematic policy classification)
  • Classify group tokens identified in the text into meaningful group categories

Key Features

  • 🌍 Multilingual support - Works with English, German, Spanish, Dutch, Danish, French, Italian and Swedish text
  • 🔧 Modular design - Use individual components or run the complete pipeline
  • 📊 Batch processing - Efficiently analyze large datasets from CSV files
  • 🎯 High accuracy - Models achieve 81%+ accuracy across tasks
  • 📝 Detailed output - Includes confidence scores, text positions, and semantic categories
  • 🚀 Automatic hardware optimization - CUDA → MPS → CPU fallback

Research Background

The conceptual and operational basis of these models as well as the methodology used to construct them are described in:

  • Dolinsky, A. O., Huber, L. M., & Horne, W. (Accepted for Publication 2026). Who do Parties Speak To? Introducing the PSoGA: A New Comprehensive Database of Parties' Social Group Appeals. British Journal of Political Science.
  • Horne, W., Dolinsky, A. O., & Huber, L. M. (2025). Using LLMs to Detect Group Appeals in Parties’ Election Manifestos. Working Paper. https://osf.io/fp2h3_v3
  • Huber, L.M., & Dolinsky, A.O. (2023).How parties shape their relationship with social groups: A roadmap to the study of group-based appeals. Working Paper. https://osf.io/preprints/osf/szaqw_v1

Model Training Details:

  • Models were trained and validated using political parties' general election manifestos.
  • Models were also validated for performance in processing parliamentary speeches (English only)
  • Token classification uses plain text format without special formatting requirements
  • Stance and policy models use Natural Language Inference (NLI) approaches
  • Natural sentences were used as the unit of analysis for the token classifier, stance and policy detection; tokens were used as the unit of analysis for the multi-label meaningful group classifier
  • Training data includes English, German text. The four models were further validated on held-out samples English, German, Dutch, Danish, Spanish, French, Italian and Swedish

Current Implementation

The package provides a streamlined approach:

  • Token extraction: Plain text processing with transformer token classification
  • Stance detection: NLI-based approach for stance detection
  • Policy detection: NLI-based approach for policy identification (binary detector)
  • Group classification: Multi-label classification of social group categories

We thank Josh Allen (joshuafayallen), Dylan Paltra (dpltr22) and Marvin Stecker (vestedinterests) for their support and contributions to the release of this package.

Performance Considerations

  • Batch processing optimizes performance for multiple texts
  • Pipeline integration handles data flow between components
  • Hardware acceleration - GPU is auto-detected (CUDA → MPS → CPU). Override with device="cuda" in Python or --device cuda in the CLI for any command (extract, stance, policy, classify, pipeline)

Best Practices

  1. Use meaningful IDs (party_date_sentence) for traceability through the pipeline
  2. Prepare data properly using the pre-processing functions
  3. Test with small datasets first to validate format and performance

Pipeline Data Flow

The package follows a 6-step workflow:

1. Data Preparation

  • Input: Raw CSV with party (or any other political actor), date, sentence_id, text columns
  • Processing: Create composite IDs for traceability
  • Output: Text with meaningful text_id identifiers

2. Token Classification

  • Input: Text with text_id
  • Processing: Extract social group references using transformer model
  • Output: Entities with positions and confidence scores

3. Data Filtering

  • Processing: Separate texts with/without social group mentions
  • Purpose: Optimize downstream processing and enable complete dataset reconstruction

4. Analysis Steps

  • Stance Detection: Determine stance toward identified groups (positive, negative, neutral)
  • Policy Detection: Identify whether policy content directed at groups is included in the text
  • Meaningful Groups: Classify groups into meaningful categories

5. Post-processing

  • Data cleaning: Process model outputs into clean, usable formats
    • Create Stance_Clean and Policy_Clean columns with simplified labels ('positive'/'negative'/'neutral', 'policy'/'no policy')
    • Split meaningful groups into separate Group1, Group2, etc. columns when requested
  • Output: Both raw model predictions and clean processed labels

6. Final Dataset Assembly

  • Merging: Combine processed texts (with groups) and unprocessed texts (without groups) into complete dataset
  • Validation: Ensure all original texts are preserved with appropriate analysis results

Documentation

For detailed usage examples, API reference, and advanced features, see our complete documentation.

Requirements

  • Python 3.8+
  • PyTorch 2.3.1+ (for NumPy 2.x compatibility and Apple Silicon MPS support)
  • Transformers 4.20.0+
  • Pandas 1.3.0+
  • NumPy 1.21.0+
  • tqdm 0.62.0+
  • openpyxl 3.0.0+

System Requirements

Note: This package requires PyTorch 2.3.1+ for NumPy 2.x compatibility. Intel-based Macs may experience installation issues due to limited PyTorch wheel availability for older architectures. The package is fully supported on Apple Silicon Macs, Linux, and Windows systems.

Citation

If you use GroupAppeals in your research, please cite:

@misc{groupappeals2026,
  title={GroupAppeals: A Python Package for Analyzing Social Group Appeals in Political Texts},
  author={Dolinsky, Alona O. and Horne, Will and Huber, Lena Maria},
  year={2026},
  url={https://github.com/alonadoli/GroupAppeals}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

Acknowledgments

This work was supported by funds from the EU's Horizon Europe MSCA Postdoctoral Fellowships under grant agreement no. 101107835.

Package Details:

Input Data Requirements

GroupAppeals processes plain text data and handles ID management for pipeline traceability:

Data Preparation

Step 1: Raw Data Format

Your CSV should contain these columns:

  • party (or political actor)
  • date (or election year/time identifier - converted to string, does not require date formatting)
  • sentence_id (row within text)
  • text (the actual text content)

Step 2: ID Creation

Use the pre-processing functions to create composite IDs:

from groupappeals.pre_and_post_processing import create_composite_id

# Create meaningful composite IDs for traceability
df['text_id'] = create_composite_id(df, 
                                  party_col="party", 
                                  date_col="date", 
                                  sentence_col="sentence_id")

Step 3: Pipeline Processing

The package processes plain text through transformer models without requiring special formatting.

Benefits:

  • ID traceability: Track results back to original party/date/sentence
  • Flexible input: Works with any political actor type (parties, candidates, organizations)
  • Pipeline integration: Seamless data flow between analysis steps
  • Complete reconstruction: Merge processed and unprocessed texts

Installation

pip install groupappeals

Quick Start

Complete Pipeline (Recommended for Most Users)

Use the full pipeline when you want complete analysis. The pipeline handles composite ID creation, group extraction, stance detection, policy detection, and group classification.

from groupappeals.fullpipeline import run_full_pipeline

# Complete analysis starting from raw political text
results = run_full_pipeline(
    input_file="raw_political_text.csv",
    output_file="complete_analysis.csv",
    create_composite_id=["party", "date", "sentence_id"]
)

print(f"Analyzed {len(results)} sentences")
print("Sample results:")
print(results[['text_id', 'Exact.Group.Text', 'Stance', 'Policy']].head())

Step-by-Step Processing (For Custom Control)

Two ways to call each module: Each module provides a CSV function (e.g. process_csv) that reads from a file path, and an equivalent Python list function (e.g. extract_entities) that works directly with data already in memory. Both produce identical output. The CSV functions are shown in the examples below; the equivalent Python list function is noted under each one.

Use individual modules when you need:

  • Intermediate result inspection between steps
  • Selective processing (skip certain analyses)
  • Different parameters for each step
from groupappeals.pre_and_post_processing import create_composite_id
from groupappeals.extractgrouptoken import process_csv
import pandas as pd

# Step 1: Prepare data with meaningful IDs
df = pd.read_csv("raw_political_text.csv")
df['text_id'] = create_composite_id(df, 
                                  party_col="party", 
                                  date_col="election_year", 
                                  sentence_col="sentence_id")

# Select required columns and save
prepared_df = df[['text', 'text_id']]
prepared_df.to_csv("prepared_data.csv", index=False)

# Step 2: Extract group references
token_results = process_csv(
    input_file="prepared_data.csv",
    text_column="text",
    id_column="text_id",
    output_file="extracted_groups.csv"
)

# Step 3: Filter for entities (for downstream processing)
entities_only = token_results[token_results['Exact.Group.Text'].notna()]

print(f"Found {len(entities_only)} social group mentions")
print("Sample extractions:")
print(entities_only[['text_id', 'Exact.Group.Text', 'Average Score']].head())

Standalone Stance Detection

Use this for stance analysis without the full pipeline. The CSV function is process_stance_csv; the equivalent Python list function is detect_stance.

from groupappeals.stancedetection import process_stance_csv
import pandas as pd

# Create sample data with text and group columns
data = {
    'text': [
        "We will support working families with new policies.",
        "Small business owners deserve better treatment.",
        "Students need more affordable education."
    ],
    'group': ["working families", "small business owners", "students"],
    'text_id': ["example_1", "example_2", "example_3"]
}

df = pd.DataFrame(data)
df.to_csv("stance_input.csv", index=False)

# Process CSV file for stance detection
stance_results = process_stance_csv(
    input_file="stance_input.csv",
    text_column="text",
    group_column="group",
    output_file="stance_results.csv",
    clean_labels=True  # Creates Stance_Clean column with simplified labels
)

print(f"Processed {len(stance_results)} text-group pairs")

# Display results
for _, row in stance_results.iterrows():
    print(f"Text: {row['text'][:50]}...")
    print(f"Group: {row['group']}")
    print(f"Raw Stance: {row['Stance']}")
    if 'Stance_Clean' in stance_results.columns:
        print(f"Clean Stance: {row['Stance_Clean']} (confidence: {row['Stance_Confidence']:.3f})")
    print("---")

Standalone Policy Detection

Use this for policy analysis without the full pipeline. The CSV function is process_policy_csv; the equivalent Python list function is detect_policy.

from groupappeals.policydetection import process_policy_csv
import pandas as pd

# Create sample data with text and group columns
data = {
    'text': [
        "We will implement new childcare support for working families.",
        "Small business owners face challenges in our economy.",
        "Students deserve access to quality education."
    ],
    'group': ["working families", "small business owners", "students"],
    'text_id': ["example_1", "example_2", "example_3"]
}

df = pd.DataFrame(data)
df.to_csv("policy_input.csv", index=False)

# Process CSV file for policy detection
policy_results = process_policy_csv(
    input_file="policy_input.csv",
    text_column="text",
    group_column="group",
    output_file="policy_results.csv",
    clean_labels=True  # Creates Policy_Clean column with 'policy'/'no policy' labels
)

print(f"Processed {len(policy_results)} text-group pairs")

# Display results
for _, row in policy_results.iterrows():
    print(f"Text: {row['text'][:50]}...")
    print(f"Group: {row['group']}")
    print(f"Raw Policy: {row['Policy']}")
    if 'Policy_Clean' in policy_results.columns:
        print(f"Clean Policy: {row['Policy_Clean']} (confidence: {row['Policy_Confidence']:.3f})")
    print("---")

print("\nPolicy distribution:")
if 'Policy_Clean' in policy_results.columns:
    print(policy_results['Policy_Clean'].value_counts())
else:
    print(policy_results['Policy'].value_counts())

Standalone Meaningful Groups Classification

Use this for categorizing social group references without the full pipeline. The CSV function is process_groups_csv; the equivalent Python list function is classify_groups.

from groupappeals.classifymeaningfulgroups import process_groups_csv
import pandas as pd

# Create sample data with group references
data = {
    'group_text': [
        "working families",
        "small business owners", 
        "students",
        "elderly citizens",
        "immigrants"
    ],
    'text_id': ["group_1", "group_2", "group_3", "group_4", "group_5"]
}

df = pd.DataFrame(data)
df.to_csv("groups_input.csv", index=False)

# Process CSV file for group classification
classification_results = process_groups_csv(
    input_file="groups_input.csv",
    group_column="group_text",
    output_file="classified_groups.csv",
    score_threshold=0.5,
    split_groups=True  # Create separate Group1, Group2, etc. columns
)

print(f"Processed {len(classification_results)} group references")

# Display results
print("Sample classifications:")
for _, row in classification_results.iterrows():
    print(f"Group: {row['group_text']}")
    print(f"Categories: {row['Meaningful Group']}")
    
    # Show individual Group columns if split_groups=True
    group_cols = [col for col in row.index if col.startswith('Group') and pd.notna(row[col])]
    if group_cols:
        group_values = [f"{col}: '{row[col]}'" for col in group_cols]
        print(f"Split Categories: {', '.join(group_values)}")
    print("---")

print("\nOverall statistics:")
print(classification_results[['group_text', 'Meaningful Group']].head())

Models and Performance

The package uses four specialized models for comprehensive social group analysis:

Model Task Languages Model Architecture
Token Classifier Group extraction EN, DE, ES, NL, DA, FR, IT, SV Transformer token classification
Stance NLI Positive/negative/neutral stance EN, DE, ES, NL, DA, FR, IT, SV Natural Language Inference
Policy NLI Policy detection EN, DE, ES, NL, DA, FR, IT, SV Natural Language Inference
Group Classifier Semantic categorization EN, DE, ES, NL, DA, FR, IT, SV Multi-label classification

Input Data Format

Raw Data Format (Recommended Starting Point)

Your CSV file should contain these columns for optimal traceability:

party,date,sentence_id,text
PartyA,2023,1,"We will support working families with new childcare policies."
PartyA,2023,2,"Small businesses are the backbone of our economy."
PartyB,2023,1,"Students deserve access to affordable education."

Prepared Data Format (After Processing)

After using create_composite_id(), your data will have meaningful unit IDs:

text,text_id
"We will support working families with new childcare policies.","PartyA_2023_1"
"Small businesses are the backbone of our economy.","PartyA_2023_2"
"Students deserve access to affordable education.","PartyB_2023_1"

Token Classification Output Format

The extract_entities() function produces this structure:

text_id,text,Entity,Average Score,Start,End,Exact.Group.Text
"PartyA_2023_1.1","We will support working families...","working families",0.95,16,31,"working families"
"PartyA_2023_2.0","Small businesses are the backbone...","",,,""
"PartyB_2023_1.1","Students deserve access...","Students",0.87,0,8,"Students"

Note: The .0, .1, .2 numbering indicates:

  • .0 = No entities found
  • .1 = First entity found
  • .2 = Second entity found (if multiple entities in same text)

Example Output

Complete Pipeline Output Format

The full pipeline produces comprehensive results with both raw model outputs and clean processed labels:

text_id,text,Exact.Group.Text,Average Score,Stance,Stance_Confidence,Stance_Clean,Policy,Policy_Confidence,Policy_Clean,Meaningful Group,Group1,Group2
"PartyA_2023_1.1","We will support working families with new policies.","working families",0.95,"The text is positive towards working families.",0.95,"positive","The text contains a policy directed towards working families.",0.89,"policy","['Families', 'Workers']","Families","Workers"
"PartyA_2023_2.1","Small businesses are the backbone of our economy.","small businesses",0.89,"The text is positive towards small businesses.",0.89,"positive","The text does not contain a policy directed towards small businesses.",0.67,"no policy","['Economic Groups']","Economic Groups",""
"PartyB_2023_1.0","This is a general statement about the economy.","","","","","","","","","",""

Key Output Features:

  • Raw Model Outputs: Complete verbose predictions from each model
  • Clean Labels: Simplified labels (positive/negative/neutral, policy/no policy)
  • Group Categories: Both list format (Meaningful Group) and split columns (Group1, Group2, etc.)
  • Confidence Scores: Model confidence for stance and policy predictions
  • Token Positions: Character positions of extracted groups (Start, End columns)
  • Complete Coverage: All input texts preserved, even those without group references

Command Line Interface

Complete Pipeline

# Complete pipeline WITH composite ID creation (recommended)
groupappeals pipeline --input manifestos.csv --output results.csv \
  --create-composite-id party,year,sentence_id --clean-labels --split-groups

# Complete pipeline with EXISTING text_id column
groupappeals pipeline --input texts.csv --output results.csv \
  --clean-labels --split-groups

Individual Module Usage

# Individual modules (for step-by-step processing)
groupappeals extract --input texts.csv --output groups.csv
groupappeals stance --input groups.csv --output stance.csv --clean-labels
groupappeals policy --input stance.csv --output policy.csv --clean-labels
groupappeals classify --input policy.csv --output final.csv --split-groups

Advanced Options

# Specify custom columns
groupappeals extract --input data.csv --output groups.csv \
  --text-column my_text --id-column my_id

# Custom batch size
groupappeals stance --input groups.csv --output stance.csv --batch-size 16

# Device selection (optional - auto-detected by default)
groupappeals pipeline --input data.csv --output results.csv --device cuda
groupappeals pipeline --input data.csv --output results.csv --device mps  # Apple Silicon
groupappeals pipeline --input data.csv --output results.csv --device cpu

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

groupappeals-1.0.0.tar.gz (52.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

groupappeals-1.0.0-py3-none-any.whl (34.1 kB view details)

Uploaded Python 3

File details

Details for the file groupappeals-1.0.0.tar.gz.

File metadata

  • Download URL: groupappeals-1.0.0.tar.gz
  • Upload date:
  • Size: 52.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for groupappeals-1.0.0.tar.gz
Algorithm Hash digest
SHA256 fa588605d50e8b13db1b08473e1e932a90f8bbc6a6181f6d3c3a588d26f8263f
MD5 20779ef6d15830779bfc48a67cc634f0
BLAKE2b-256 4a2b1304084eda5cd3394b525578c57058a7431b7d3d4d2c6732d568a8572947

See more details on using hashes here.

File details

Details for the file groupappeals-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: groupappeals-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 34.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for groupappeals-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 cc23640c3ba36168d15883d9437eb99632d8479015002000fcba6fc5d8687865
MD5 995ccd44024d77be4e893e57c34851ea
BLAKE2b-256 d4c971a7ef44bd0ea3bb77427d7035cc6e99d0d11a1ae848003d06b00b69fba0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page