SetFit-based multi-label classifier for water-related conflict events

Water Conflict Classifier

SetFit-based multi-label text classifier for identifying water-related conflict events in news headlines.

This folder contains the package source code. For usage instructions with the published package, see the PyPI page.

Project: Experimental research supporting the Pacific Institute's Water Conflict Chronology
Developer: Baobab Tech
License: CC BY-NC 4.0 (Non-Commercial)
PyPI Package: water-conflict-classifier

Package Installation & Usage

Install from PyPI:

pip install water-conflict-classifier

Use the trained model:

from setfit import SetFitModel

model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")
predictions = model.predict(["Military group attack workers at dam"])
# Returns: [[1, 1, 1]]  # [Trigger, Casualty, Weapon]
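Each prediction row is a multi-hot vector in the order given in the comment above. A minimal sketch of turning a row into label names, assuming that [Trigger, Casualty, Weapon] order holds for the published model; `decode_labels` is a hypothetical helper and not part of the package:

```python
# Label order taken from the usage comment: [Trigger, Casualty, Weapon].
LABELS = ("Trigger", "Casualty", "Weapon")


def decode_labels(row):
    """Map one multi-hot prediction row to the names of the active labels."""
    values = [int(v) for v in row]  # accepts lists, tuples, or array/tensor rows
    if len(values) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} values, got {len(values)}")
    return [name for name, flag in zip(LABELS, values) if flag == 1]


if __name__ == "__main__":
    print(decode_labels([1, 0, 1]))  # ['Trigger', 'Weapon']
```

Applied to the output of `model.predict(...)`, this would be called once per row, for example `[decode_labels(row) for row in predictions]`.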

The rest of this README is for developers who want to train their own model or modify the package.

Frugal AI: Training with Limited Data

This classifier demonstrates an intentional approach to building AI systems with limited data using SetFit - a framework for few-shot learning with sentence transformers. Rather than defaulting to massive language models (GPT, Claude, or 100B+ parameter models) for simple classification tasks, we fine-tune small, efficient models (e.g., BAAI/bge-small-en-v1.5 with ~33M parameters) on a focused dataset.

Why this matters: The industry has normalized using trillion-parameter models to classify headlines, answer simple questions, or categorize text - tasks that don't require world knowledge, reasoning, or generative capabilities. This is computationally wasteful and environmentally costly. A properly fine-tuned small model can achieve comparable or better accuracy while using a fraction of the compute resources.

Our approach:

Train on ~600 examples (few-shot learning with SetFit)
Deploy small parameter models (e.g., ~33M params) vs. 100B-1T parameter alternatives
Achieve specialized task performance without the overhead of general-purpose LLMs
Reduce inference costs and latency by orders of magnitude

This is not about avoiding large models altogether - they're invaluable for complex reasoning tasks. But for targeted classification problems with labeled data, fine-tuning remains the professional, responsible choice.
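To put the size gap in numbers, the short calculation below compares the parameter counts quoted above. It is only a back-of-the-envelope sketch: parameter count is used as a rough proxy for compute, and actual inference cost and latency also depend on hardware, batching, and sequence length, none of which are measured here.

```python
import math

SMALL_PARAMS = 33e6  # ~33M, the size quoted for BAAI/bge-small-en-v1.5


def orders_of_magnitude(large_params, small_params=SMALL_PARAMS):
    """Return log10 of the parameter-count ratio between two models."""
    return math.log10(large_params / small_params)


if __name__ == "__main__":
    for large in (100e9, 1e12):  # the 100B and 1T endpoints mentioned above
        ratio = large / SMALL_PARAMS
        print(f"{large:.0e} params: {ratio:,.0f}x larger, "
              f"{orders_of_magnitude(large):.1f} orders of magnitude")
```

By parameter count alone, the gap is roughly 3.5 to 4.5 orders of magnitude.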

Package Structure

This is the source code for the water-conflict-classifier Python package, published to PyPI.

classifier/
├── __init__.py                         # Package marker
├── data_prep.py                        # Data loading (for training-ready datasets)
├── training_logic.py                   # Core training logic
├── evaluation.py                       # Model evaluation & metrics
├── model_card.py                       # Model card generation
├── versioning.py                       # Experiment tracking & versioning
├── evals_upload.py                     # Upload evaluation results to HF
├── train_setfit_headline_classifier.py # Local training script
├── pyproject.toml                      # Package configuration
├── setup.py                            # Build configuration
└── README.md                           # This file

Note: Scripts that use this package (like cloud training with HF Jobs and dataset preparation) are in the ../scripts/ folder.

Local Training

Train on your own hardware with local data files.

Setup

Install the package in development mode:

uv pip install -e .
# or with regular pip:
pip install -e .

This installs the modules (data_prep, training_logic, etc.) so they can be imported.
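A quick way to confirm the editable install worked is to check that the modules can be found. The sketch below assumes the module names from the package tree are importable as top-level names; if the package exposes them under a different import path, adjust the names accordingly. `check_importable` is a hypothetical helper.

```python
import importlib.util

# Module names from the package tree; top-level importability is an assumption.
MODULES = ("data_prep", "training_logic", "evaluation", "model_card", "versioning")


def check_importable(names=MODULES):
    """Return {module_name: True/False} without importing the modules."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, ok in check_importable().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```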

Prepare training data:

Training data should be in ../data/:

../data/positives.csv - Water conflict headlines with labels
../data/negatives.csv - Non-water conflict headlines
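The README does not document the CSV schema, so the following is only a sketch of how such files could be read into headlines and multi-hot label vectors. The column names (`Headline`, `Trigger`, `Casualty`, `Weapon`) and the helper `read_headlines` are assumptions for illustration; check the actual files before relying on them.

```python
import csv
import io

LABELS = ("Trigger", "Casualty", "Weapon")  # assumed label column names
TRUTHY = {"1", "true", "yes"}


def read_headlines(handle, text_col="Headline", label_cols=LABELS):
    """Read CSV rows into (texts, label_vectors); missing label columns count as 0."""
    texts, labels = [], []
    for row in csv.DictReader(handle):
        text = (row.get(text_col) or "").strip()
        if not text:
            continue  # skip rows without a headline
        texts.append(text)
        labels.append(
            [int((row.get(col) or "").strip().lower() in TRUTHY) for col in label_cols]
        )
    return texts, labels


if __name__ == "__main__":
    sample = io.StringIO("Headline,Trigger,Casualty,Weapon\nDam attacked by militia,1,0,1\n")
    print(read_headlines(sample))
```

With this layout, a negatives file that has no label columns yields all-zero label vectors, which is how a non-water-conflict headline would be represented in a multi-label setup.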

Generate negatives from ACLED (if needed):

cd ../scripts
python transform_prep_negatives.py

Train:

python train_setfit_headline_classifier.py

The trained model is saved to ./water-conflict-classifier/.
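For orientation, a multi-label SetFit training run generally looks like the sketch below. This is not the contents of train_setfit_headline_classifier.py: the `one-vs-rest` strategy, batch size, and epoch count are assumptions chosen for illustration, and `to_records` and `train` are hypothetical names.

```python
def to_records(texts, label_vectors):
    """Pair headlines with multi-hot labels under SetFit's default column names."""
    if len(texts) != len(label_vectors):
        raise ValueError("texts and label_vectors must have the same length")
    return {"text": list(texts), "label": [list(v) for v in label_vectors]}


def train(texts, label_vectors, output_dir="./water-conflict-classifier"):
    """Fine-tune a small sentence-transformer with SetFit and save it locally."""
    # Imported lazily so to_records() works without the training dependencies.
    from datasets import Dataset
    from setfit import SetFitModel, Trainer, TrainingArguments

    train_ds = Dataset.from_dict(to_records(texts, label_vectors))
    model = SetFitModel.from_pretrained(
        "BAAI/bge-small-en-v1.5",
        multi_target_strategy="one-vs-rest",  # assumed: one binary head per label
    )
    args = TrainingArguments(batch_size=16, num_epochs=1)  # illustrative values only
    trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    trainer.train()
    model.save_pretrained(output_dir)
    return model
```

Calling `train(texts, label_vectors)` downloads the base model and requires the `setfit` and `datasets` packages, so expect the first run to need network access.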

Cloud Training

For training on HuggingFace Jobs (managed GPUs):

1. Prepare training dataset: Use ../scripts/prepare_training_dataset.py to preprocess, balance, and upload training-ready data to HF Hub
2. Train model: Use ../scripts/train_on_hf.py to train on HF Jobs infrastructure

The training script loads preprocessed data directly from HF Hub - no data preprocessing happens during training. This separation makes training reproducible and efficient.

See the scripts README for the complete workflow.

Publishing to PyPI

See PUBLISHING.md for complete instructions on building and publishing the package.

Data Sources

The training data combines:

Positive Examples: Water conflict headlines from Pacific Institute Water Conflict Chronology
Negative Examples: Two types for balanced training:
1. Hard Negatives (~120): Water-related peaceful news (infrastructure, research, conservation) to prevent false positives
2. ACLED Negatives (~600): Non-water conflict events from ACLED

Hard Negatives Strategy

Without hard negatives, the model learns "water mentioned → conflict" instead of "water + violence → conflict". Hard negatives are water-related headlines that lack violence or conflict:

Water infrastructure projects (dams, treatment plants)
Scientific water research and technology
Water conservation initiatives and conferences
Environmental water management

These are tagged with priority_sample=True in the dataset and are ALWAYS included in training (never diluted by sampling). This ensures the model correctly distinguishes peaceful water news from actual water conflicts.
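The "always included" rule can be sketched as follows. The package's actual sampling code is not shown in this README, so this is an assumed implementation: only the `priority_sample` flag comes from the text above, while `sample_negatives`, `n_total`, and `seed` are hypothetical names.

```python
import random


def sample_negatives(rows, n_total, seed=0):
    """Keep every priority_sample row, then fill up to n_total from the rest."""
    priority = [r for r in rows if r.get("priority_sample")]
    others = [r for r in rows if not r.get("priority_sample")]
    # Priority rows are never dropped, even if they alone exceed n_total.
    n_fill = max(0, min(len(others), n_total - len(priority)))
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    return priority + rng.sample(others, n_fill)


if __name__ == "__main__":
    rows = [{"text": f"hard {i}", "priority_sample": True} for i in range(3)]
    rows += [{"text": f"acled {i}", "priority_sample": False} for i in range(10)]
    picked = sample_negatives(rows, n_total=5)
    print(len(picked), sum(r["priority_sample"] for r in picked))
```

Sampling only the non-priority pool means the hard negatives keep their full weight regardless of how many ACLED negatives are drawn.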

Resources

HF Jobs Guide
UV Script Format (used in train_on_hf.py)
SetFit Documentation
Pacific Institute Water Conflict Chronology
ACLED Data

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

You are free to use, share, and adapt this work for non-commercial purposes with appropriate attribution to Baobab Tech. For commercial licensing inquiries, please contact Baobab Tech.

Release history

0.1.22 - Nov 30, 2025
0.1.21 - Nov 30, 2025
0.1.20 - Nov 30, 2025 (this version)
0.1.19 - Nov 30, 2025
0.1.18 - Nov 30, 2025
0.1.17 - Nov 30, 2025
0.1.16 - Nov 30, 2025
0.1.15 - Nov 29, 2025
0.1.14 - Nov 27, 2025
0.1.13 - Nov 27, 2025
0.1.12 - Nov 27, 2025
0.1.11 - Nov 27, 2025
0.1.10 - Nov 27, 2025
0.1.9 - Nov 27, 2025
0.1.8 - Nov 27, 2025
0.1.7 - Nov 27, 2025
0.1.6 - Nov 27, 2025
0.1.5 - Nov 27, 2025
0.1.4 - Nov 26, 2025
0.1.3 - Nov 26, 2025
0.1.2 - Nov 26, 2025
0.1.0 - Nov 26, 2025

Download files

Source Distribution

water_conflict_classifier-0.1.20.tar.gz (22.6 kB), uploaded Nov 30, 2025, Source

Built Distribution

water_conflict_classifier-0.1.20-py3-none-any.whl (21.7 kB), uploaded Nov 30, 2025, Python 3

File details

Details for the file water_conflict_classifier-0.1.20.tar.gz.

File metadata

Download URL: water_conflict_classifier-0.1.20.tar.gz
Upload date: Nov 30, 2025
Size: 22.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.13

File hashes

Hashes for water_conflict_classifier-0.1.20.tar.gz
Algorithm	Hash digest
SHA256	`10d5ac6f38e9b4c0b88daf6f352c4c5eef8925afca6757be25a4f2c12d238d9a`
MD5	`c044533ce1175a15f28169d09d4fbb3e`
BLAKE2b-256	`801ad5e821f3de98037079e83e83e499336e03f5d5496f25e9303ff63476b403`

File details

Details for the file water_conflict_classifier-0.1.20-py3-none-any.whl.

File metadata

Download URL: water_conflict_classifier-0.1.20-py3-none-any.whl
Upload date: Nov 30, 2025
Size: 21.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.13

File hashes

Hashes for water_conflict_classifier-0.1.20-py3-none-any.whl
Algorithm	Hash digest
SHA256	`47c98c21eb79b4488190dac581af4ee49888c74aefcb6e6a3e971431bca7b139`
MD5	`017fef9ac3f9674c5e076b5b30c7a549`
BLAKE2b-256	`c66adcebabc88408f8bcbff3dd995a71d74d36c94f5cf39550cc243d975fedcb`
