Water Conflict Classifier

SetFit-based multi-label text classifier for identifying water-related conflict events in news headlines.

This folder contains the package source code. For usage instructions with the published package, see the PyPI page.

Project: Experimental research supporting the Pacific Institute's Water Conflict Chronology
Developer: Baobab Tech
License: CC BY-NC 4.0 (Non-Commercial)
PyPI Package: water-conflict-classifier


Package Installation & Usage

Install from PyPI:

pip install water-conflict-classifier
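
To confirm the install worked, the standard library can report the installed version. This is a minimal sketch; the helper name `installed_version` is illustrative and not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="water-conflict-classifier"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "water-conflict-classifier is not installed")
```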

Use the trained model:

from setfit import SetFitModel

model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")
predictions = model.predict(["Taliban attack workers at dam"])
# Returns: [[1, 1, 1]]  # [Trigger, Casualty, Weapon]
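
The raw output is one row of 0/1 flags per headline. A small helper can turn each row into label names, assuming the `[Trigger, Casualty, Weapon]` order given in the comment above. The helper `decode_predictions` is hypothetical and not part of the package or of SetFit.

```python
# Label order as stated in the README comment; treat it as an assumption
# and verify it against the model card before relying on it.
LABELS = ("Trigger", "Casualty", "Weapon")


def decode_predictions(rows, labels=LABELS):
    """Map rows of 0/1 flags to lists of active label names."""
    decoded = []
    for row in rows:
        flags = [int(value) for value in row]
        if len(flags) != len(labels):
            raise ValueError(
                f"expected {len(labels)} flags per row, got {len(flags)}"
            )
        decoded.append([name for name, flag in zip(labels, flags) if flag == 1])
    return decoded


if __name__ == "__main__":
    # Plain lists stand in for the model output here.
    print(decode_predictions([[1, 1, 1], [0, 0, 0], [1, 0, 0]]))
```

With the model loaded as shown earlier, `decode_predictions(model.predict(headlines))` should work the same way, since each row is iterated and cast to `int`.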

The rest of this README is for developers who want to train their own model or modify the package.

Frugal AI: Training with Limited Data

This classifier demonstrates an intentional approach to building AI systems with limited data using SetFit - a framework for few-shot learning with sentence transformers. Rather than defaulting to massive language models (GPT, Claude, or 100B+ parameter models) for simple classification tasks, we fine-tune a small, efficient model (~33M parameters) on a focused dataset.

Why this matters: The industry has normalized using trillion-parameter models to classify headlines, answer simple questions, or categorize text - tasks that don't require world knowledge, reasoning, or generative capabilities. This is computationally wasteful and environmentally costly. A properly fine-tuned small model can achieve comparable or better accuracy while using a fraction of the compute resources.

Our approach:

  • Train on ~600 examples (few-shot learning with SetFit)
  • Deploy a 33M parameter model vs. 100B-1T parameter alternatives
  • Achieve specialized task performance without the overhead of general-purpose LLMs
  • Reduce inference costs and latency by orders of magnitude
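
To put the size gap in numbers, the sketch below compares the parameter counts quoted above. Parameter count is only a rough proxy for compute. This is arithmetic on the stated figures, not a measurement of inference cost or latency.

```python
import math

SMALL_MODEL_PARAMS = 33e6            # ~33M parameters, as stated above
LARGE_MODEL_PARAMS = (100e9, 1e12)   # the 100B-1T range quoted above


def param_ratio(large, small=SMALL_MODEL_PARAMS):
    """How many times more parameters the large model has."""
    return large / small


if __name__ == "__main__":
    for large in LARGE_MODEL_PARAMS:
        ratio = param_ratio(large)
        print(
            f"{large:.0e} parameters is about {ratio:,.0f}x the small model "
            f"(~{math.log10(ratio):.1f} orders of magnitude)"
        )
```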

This is not about avoiding large models altogether - they're invaluable for complex reasoning tasks. But for targeted classification problems with labeled data, fine-tuning remains the professional, responsible choice.

Package Structure

This is the source code for the water-conflict-classifier Python package, published to PyPI.

classifier/
├── __init__.py                         # Package marker
├── data_prep.py                        # Data loading & preprocessing
├── training_logic.py                   # Core training logic
├── evaluation.py                       # Model evaluation & metrics
├── model_card.py                       # Model card generation
├── train_setfit_headline_classifier.py # Local training script
├── pyproject.toml                      # Package configuration
├── setup.py                            # Build configuration
└── README.md                           # This file

Note: Scripts that use this package (like cloud training with HF Jobs) are in the ../scripts/ folder.
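
The contents of evaluation.py are not shown here. As a general illustration of what multi-label evaluation involves, the sketch below computes precision, recall, and F1 separately for each label. The function `per_label_metrics` is hypothetical and is not taken from the package.

```python
def per_label_metrics(y_true, y_pred, labels=("Trigger", "Casualty", "Weapon")):
    """Compute precision, recall, and F1 for each label column independently."""
    results = {}
    for i, name in enumerate(labels):
        pairs = [(int(t[i]), int(p[i])) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        # Fall back to 0.0 when a denominator would be zero.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[name] = {"precision": precision, "recall": recall, "f1": f1}
    return results


if __name__ == "__main__":
    truth = [[1, 0, 1], [0, 1, 0]]
    preds = [[1, 0, 0], [0, 1, 1]]
    for label, scores in per_label_metrics(truth, preds).items():
        print(label, scores)
```

Reporting each label separately makes it visible when one category (for example Weapon) lags behind the others, which a single averaged score can hide.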


Local Training

Train on your own hardware with local data files.

Setup

  1. Install the package in development mode:

uv pip install -e .
# or with regular pip:
pip install -e .

This installs the modules (data_prep, training_logic, etc.) so they can be imported.

  2. Prepare training data:

Training data should be in ../data/:

  • ../data/positives.csv - Water conflict headlines with labels
  • ../data/negatives.csv - Non-water conflict headlines
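
The CSV schema is not documented in this README. The sketch below shows one way to sanity-check such a file before training, assuming a text column named `Headline` and one 0/1 column per label. Those column names, the sample rows, and the helper `read_examples` are all hypothetical; adjust them to the actual files.

```python
import csv
import io

LABELS = ("Trigger", "Casualty", "Weapon")
TEXT_COLUMN = "Headline"  # hypothetical column name, not confirmed by the README


def read_examples(handle, text_column=TEXT_COLUMN, labels=LABELS):
    """Read (headline, [flags]) pairs from an open CSV file, skipping empty headlines."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [col for col in (text_column, *labels) if col not in fieldnames]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    examples = []
    for row in reader:
        text = row[text_column].strip()
        if not text:
            continue  # drop rows without a headline
        # Treat an empty label cell as 0.
        examples.append((text, [int(row[name] or 0) for name in labels]))
    return examples


if __name__ == "__main__":
    # Made-up rows for illustration only.
    sample = io.StringIO(
        "Headline,Trigger,Casualty,Weapon\n"
        "Dam attacked,1,0,1\n"
        ",0,0,0\n"
    )
    print(read_examples(sample))
```

For a real file, pass an open handle, for example `open("../data/positives.csv", newline="", encoding="utf-8")`.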

Generate negatives from ACLED (if needed):

cd ../scripts
python transform_prep_negatives.py

  3. Train (run from this package folder, where the training script lives):

python train_setfit_headline_classifier.py

The trained model is saved to ./water-conflict-classifier/
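
A locally trained model can be loaded from that output directory in the same way as the published one. This is a minimal sketch, assuming the directory was written in the format `SetFitModel.from_pretrained` expects. The wrapper `load_local_model` is hypothetical and only adds a clearer error when the directory is missing.

```python
from pathlib import Path


def load_local_model(model_dir="./water-conflict-classifier"):
    """Load a SetFit model from a local directory, failing early if it does not exist."""
    path = Path(model_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"no model directory at {path}; run training first")
    # Imported lazily so the directory check works even without setfit installed.
    from setfit import SetFitModel

    return SetFitModel.from_pretrained(str(path))


if __name__ == "__main__":
    model = load_local_model()
    print(model.predict(["Taliban attack workers at dam"]))
```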


Cloud Training

For training on HuggingFace Jobs (managed GPUs), see ../scripts/train_on_hf.py and the scripts README.


Publishing to PyPI

See PUBLISHING.md for complete instructions on building and publishing the package.


Data Sources

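The training data combines water conflict headlines (positives) with non-water conflict headlines (negatives, which can be generated from ACLED).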

Both positive and negative examples are labeled for three categories: Trigger, Casualty, and Weapon.

License

Copyright © 2025 Baobab Tech

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

You are free to use, share, and adapt this work for non-commercial purposes with appropriate attribution to Baobab Tech. For commercial licensing inquiries, please contact Baobab Tech.
