SetFit-based multi-label classifier for water-related conflict events

Water Conflict Classifier

SetFit-based multi-label text classifier for identifying water-related conflict events in news headlines.

Project: Experimental research supporting the Pacific Institute's Water Conflict Chronology
Developer: Baobab Tech
License: CC BY-NC 4.0 (Non-Commercial)

Frugal AI: Training with Limited Data

This classifier demonstrates an intentional approach to building AI systems with limited data using SetFit - a framework for few-shot learning with sentence transformers. Rather than defaulting to massive language models (GPT, Claude, or 100B+ parameter models) for simple classification tasks, we fine-tune a small, efficient model (~33M parameters) on a focused dataset.

Why this matters: The industry has normalized using trillion-parameter models to classify headlines, answer simple questions, or categorize text - tasks that don't require world knowledge, reasoning, or generative capabilities. This is computationally wasteful and environmentally costly. A properly fine-tuned small model can achieve comparable or better accuracy while using a fraction of the compute resources.

Our approach:

Train on ~600 examples (few-shot learning with SetFit)
Deploy a 33M parameter model vs. 100B-1T parameter alternatives
Achieve specialized task performance without the overhead of general-purpose LLMs
Reduce inference costs and latency by orders of magnitude

This is not about avoiding large models altogether - they're invaluable for complex reasoning tasks. But for targeted classification problems with labeled data, fine-tuning remains the professional, responsible choice.
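To put the size gap in numbers, here is a back-of-the-envelope sketch using the parameter counts quoted above. It compares only parameter counts and weight memory. The figure of 2 bytes per parameter (16-bit weights) is an assumption for illustration, and parameter count is only a rough proxy for inference cost.

```python
# Rough size comparison using the parameter counts quoted above.
# Assumption: 16-bit weights (2 bytes per parameter); real deployments vary.
SMALL_MODEL_PARAMS = 33_000_000                             # ~33M, the fine-tuned classifier
LARGE_MODEL_PARAMS = (100_000_000_000, 1_000_000_000_000)   # 100B and 1T
BYTES_PER_PARAM = 2


def param_ratio(large: int, small: int) -> float:
    """How many times more parameters the large model has."""
    return large / small


def weight_gigabytes(params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"small model weights: {weight_gigabytes(SMALL_MODEL_PARAMS):.3f} GB")
    for large in LARGE_MODEL_PARAMS:
        print(
            f"{large:,} params: {param_ratio(large, SMALL_MODEL_PARAMS):,.0f}x larger, "
            f"{weight_gigabytes(large):,.0f} GB of weights"
        )
```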

Project Structure

Simple, flat structure with shared modules:

classifier/
├── __init__.py                         # Package marker
├── data_prep.py                        # Data loading & preprocessing (shared)
├── training_logic.py                   # Core training logic (shared)
├── evaluation.py                       # Model evaluation & metrics (shared)
├── model_card.py                       # Model card generation (shared)
├── train_setfit_headline_classifier.py # Local training (uses shared modules)
├── train_on_hf.py                      # HF Jobs training (self-contained with UV)
├── upload_datasets.py                  # Upload data to HF Hub
├── transform_prep_negatives.py         # Generate negative examples from ACLED
├── classify_headline.py                # Local inference example
├── classify_headline_hub.py            # HF Hub inference example
└── README.md                           # This file

Package Structure

The classifier is a proper Python package that can be installed via pip/uv.

Local Training: train_setfit_headline_classifier.py imports from the installed package.

HF Jobs Training: train_on_hf.py uses UV with the package as a dependency - clean, no duplication!

Training Options

Option 1: Local Training

Train on your own hardware with local data files:

cd classifier
python train_setfit_headline_classifier.py

Pros: Full control, works offline, no HF account needed
Cons: Requires local GPU (or slow on CPU), manual model management

Option 2: HF Jobs (Cloud Training)

Train on managed GPUs with automatic model upload to HF Hub:

hf jobs uv run \
  --flavor a10g-large \
  --timeout 2h \
  --env HF_ORGANIZATION=your-org \
  --namespace your-org \
  --secrets HF_TOKEN \
  classifier/train_on_hf.py

Pros: Fast GPU training (~2-5 min), auto model upload, reproducible
Cons: Requires HF account, data must be on HF Hub

Note: The package must be published to PyPI or referenced by Git URL before HF Jobs can install it. See PUBLISHING.md for complete publishing instructions with UV.

Learn more:

Hugging Face Jobs Documentation
Publishing Guide - How to publish with UV

Setup

Important: This is a package within a mono repo. Unless a step says otherwise, commands assume you're in the classifier/ directory.

For Local Training

Navigate to package directory:

cd classifier  # Must be in this directory!

Install the package in development mode:

uv pip install -e .
# or with regular pip:
pip install -e .

This installs the modules (data_prep, training_logic, etc.) so they can be imported.

Prepare training data:

Training data should be in ../data/ (one level up from classifier folder):

../data/positives.csv - Water conflict headlines with labels
../data/negatives.csv - Non-water conflict headlines

Generate negatives from ACLED (if needed):

# This script is in the parent scripts folder
cd ../scripts
python transform_prep_negatives.py

Train:

cd ../classifier  # Works from classifier/ or scripts/; you must end up in classifier/
python train_setfit_headline_classifier.py

Model saved to ./water-conflict-classifier/

For HF Jobs (Cloud Training)

Prerequisites

# Install HF CLI (quotes keep shells such as zsh from expanding the brackets)
pip install "huggingface-hub[cli]"

# Authenticate
hf auth login

Get your token from: huggingface.co/settings/tokens

Step 1: Configure HuggingFace Repos

Copy the sample config to create your own:

cd /path/to/waterconflict
cp config.sample.py config.py

Edit config.py and set your organization or username:

HF_ORGANIZATION = "my-org-name"  # or "my-username"

The config.py file is gitignored so your credentials stay local.

Step 2: Upload Training Data

# Upload script is in the scripts folder at the repo root
cd /path/to/waterconflict/scripts
python upload_datasets.py

This creates a dataset repository at YOUR_ORG/water-conflict-training-data (or YOUR_USERNAME/... if using personal account).
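As a minimal sketch of how the repository ID is formed, the snippet below joins an organization or username with the repository name. The helper names (repo_id, resolve_namespace) are hypothetical and are not taken from upload_datasets.py; only the HF_ORGANIZATION setting and the two repository names come from this README.

```python
import os

DATASET_NAME = "water-conflict-training-data"  # dataset repo name from the text above
MODEL_NAME = "water-conflict-classifier"       # model repo name used later in this README


def repo_id(namespace: str, name: str) -> str:
    """Join an org/username and a repo name into a Hub repo ID."""
    namespace = namespace.strip().strip("/")
    if not namespace:
        raise ValueError("HF_ORGANIZATION is empty; set it in config.py or the environment")
    return f"{namespace}/{name}"


def resolve_namespace(default: str = "") -> str:
    """Hypothetical helper: prefer the HF_ORGANIZATION environment variable."""
    return os.environ.get("HF_ORGANIZATION", default)


if __name__ == "__main__":
    org = resolve_namespace(default="my-org-name")
    print(repo_id(org, DATASET_NAME))
    print(repo_id(org, MODEL_NAME))
```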

Step 3: Publish Package (First Time Only)

Before HF Jobs can use it, publish the package:

cd /path/to/waterconflict/classifier

# Build and publish to PyPI
uv build
uv publish

# Or use Git URL (see PUBLISHING.md for details)

Step 4: Run Training Job

# From mono repo root
hf jobs uv run \
  --flavor a10g-large \
  --timeout 2h \
  --env HF_ORGANIZATION=baobabtech \
  --secrets HF_TOKEN \
  --namespace baobabtech \
  classifier/train_on_hf.py

Replace baobabtech with your organization name from config.py.

Important: Package must be published to PyPI or available via Git URL. See PUBLISHING.md for details.

Configuration Options:

--secrets HF_TOKEN: Authentication (required for private repos/pushing models)
--env HF_ORGANIZATION: Your HF org/username (required - not in git due to .gitignore)
--namespace: Runs job under org account for billing/tracking (optional)
--timeout: Max runtime before auto-termination

Hardware options: See available flavors - recommend a10g-large for this task.

Dependencies: UV automatically handles all dependencies from inline script declarations.
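For reference, inline script declarations follow the PEP 723 format: a comment block at the top of the script that uv reads before running it. The skeleton below is a hypothetical sketch, not the actual contents of train_on_hf.py. The Python version bound and the missing_modules helper are assumptions for illustration, and the only dependency listed is the package name as published on PyPI.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "water-conflict-classifier",
# ]
# ///
"""Hypothetical skeleton of a uv-runnable training script (not the real train_on_hf.py)."""
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Module names taken from the project layout above; adjust to the real package.
    missing = missing_modules(["setfit", "data_prep", "training_logic"])
    if missing:
        print("Missing modules (run with `uv run` so dependencies get installed):", missing)
    else:
        print("All dependencies resolved.")
```

Running a file like this with uv run installs the declared dependencies into an isolated environment first, which is why the HF Jobs command needs no separate install step.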

Monitoring

# List jobs
hf jobs ps -a --namespace baobabtech

# Stream logs
hf jobs logs <job_id> --namespace baobabtech

# Cancel job
hf jobs cancel <job_id> --namespace baobabtech

Training Pipeline

train_on_hf.py follows the same pipeline as the local training script, with HF Hub integration added:

Authenticate with HF Hub (via HF_TOKEN)
Load data from dataset repo (downloads CSVs)
Preprocess into multi-label format (balances negatives to match positives count)
Split data (85% train pool / 15% held-out test set)
Sample training data (600 examples from train pool for efficient few-shot learning)
Train SetFit model (1 epoch, undersampling strategy)
Evaluate on held-out test set (F1, accuracy, per-label metrics)
Push to Hub (model + comprehensive model card with evaluation tables)

Expected runtime: ~2-5 minutes on A10G GPU
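The data-handling steps of the pipeline above (multi-label encoding, balancing, the 85/15 split, and the 600-example sample) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in data_prep.py: the comma-separated label format, the plain random split, the seed, and the function names are all hypothetical.

```python
import random

LABELS = ["Trigger", "Casualty", "Weapon"]


def to_multilabel(label_text):
    """Convert e.g. "Trigger, Weapon" into a 0/1 vector ordered like LABELS."""
    present = {part.strip() for part in label_text.split(",") if part.strip()}
    return [int(label in present) for label in LABELS]


def build_splits(positives, negatives, test_fraction=0.15, train_size=600, seed=42):
    """Balance, split, and sample rows; each row is a (headline, label_vector) tuple."""
    rng = random.Random(seed)
    # Balance: keep at most as many negatives as there are positives.
    negatives = rng.sample(negatives, min(len(negatives), len(positives)))
    rows = positives + negatives
    rng.shuffle(rows)
    # Hold out the test set first so it never overlaps the training sample.
    n_test = round(len(rows) * test_fraction)
    test_set, train_pool = rows[:n_test], rows[n_test:]
    # Draw the few-shot training sample from the remaining pool.
    train_set = rng.sample(train_pool, min(train_size, len(train_pool)))
    return train_set, test_set


if __name__ == "__main__":
    # Synthetic rows stand in for the real CSV contents.
    positives = [(f"water headline {i}", to_multilabel("Trigger, Weapon")) for i in range(500)]
    negatives = [(f"other headline {i}", to_multilabel("Casualty")) for i in range(2000)]
    train_set, test_set = build_splits(positives, negatives)
    print(len(train_set), len(test_set))
```

Splitting before sampling keeps the held-out test set fixed regardless of how many training examples are drawn from the pool.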

After Training

Your model will be at: https://huggingface.co/YOUR_ORG/water-conflict-classifier (or YOUR_USERNAME/... if using personal account)

Use it with the Hub inference script:

python classify_headline_hub.py

Or directly in Python:

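from setfit import SetFitModel

# Load the fine-tuned model from the Hub (replace YOUR_ORG with your org or username)
model = SetFitModel.from_pretrained("YOUR_ORG/water-conflict-classifier")

# One row per headline, one 0/1 flag per label: [Trigger, Casualty, Weapon]
predictions = model.predict(["Taliban attack dam workers in Afghanistan"])
# Example output: [[1, 1, 1]]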
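Since the raw prediction is a row of 0/1 flags, a small helper can turn it into label names. The sketch below assumes the label order given in the comment of the snippet above. decode_predictions is a hypothetical helper, not part of SetFit or this package.

```python
LABELS = ["Trigger", "Casualty", "Weapon"]  # order taken from the snippet above


def decode_predictions(predictions):
    """Map rows of 0/1 flags to the names of the labels that are set."""
    # model.predict may return a tensor or array; convert to nested lists if possible.
    rows = predictions.tolist() if hasattr(predictions, "tolist") else predictions
    return [[label for label, flag in zip(LABELS, row) if int(flag) == 1] for row in rows]


if __name__ == "__main__":
    print(decode_predictions([[1, 1, 1], [0, 0, 0], [1, 0, 1]]))
```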

Troubleshooting

"Not authenticated" → Run hf auth login

"Dataset not found" → Verify DATASET_REPO matches uploaded dataset name

Out of memory → Reduce BATCH_SIZE in the script or use a larger GPU flavor

Job timeout → Increase --timeout value

Local Testing of HF Jobs Script

Test the HF Jobs script locally before submitting:

cd classifier
uv pip install -e .  # Install package locally first
uv run train_on_hf.py

Note: Still requires dataset on HF Hub and proper authentication.

Configuration Options

Private Repositories

Set private=True in the upload and push methods (check upload_datasets.py and train_on_hf.py)
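As a rough sketch of where that flag goes, the snippet below uses the huggingface_hub client to create a private dataset repository and upload files to it. The function names and file layout are hypothetical and may differ from what upload_datasets.py does. The imports are deferred so the path helper can be used without huggingface_hub installed.

```python
import os


def upload_targets(csv_paths):
    """Pair each local CSV path with the file name it gets inside the repo."""
    return [(path, os.path.basename(path)) for path in csv_paths]


def upload_private_dataset(repo_id, csv_paths, private=True):
    """Create (or reuse) a dataset repo with the given visibility and upload the CSVs."""
    from huggingface_hub import HfApi  # imported lazily; requires huggingface_hub

    api = HfApi()
    # private=True only applies when the repo is first created.
    api.create_repo(repo_id=repo_id, repo_type="dataset", private=private, exist_ok=True)
    for local_path, path_in_repo in upload_targets(csv_paths):
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type="dataset",
        )


def push_private_model(model, repo_id):
    """Push a trained SetFit model to a private model repo."""
    model.push_to_hub(repo_id, private=True)
```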

Different Base Model

Edit the BASE_MODEL constant in either training script:

BASE_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # Smaller/faster
# or
BASE_MODEL = "BAAI/bge-base-en-v1.5"  # Larger/better quality
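To show where BASE_MODEL ends up, here is a minimal training sketch using the SetFit 1.x Trainer API. It is not the project's training_logic.py. The batch size and the "one-vs-rest" multi-target strategy are assumptions, while the single epoch and undersampling settings come from the pipeline description in this README. The setfit import is deferred so the settings helper works without it installed.

```python
BASE_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # one of the options listed above


def training_settings(batch_size=16):
    """Settings named in this README (1 epoch, undersampling); batch_size is an assumption."""
    return {"num_epochs": 1, "sampling_strategy": "undersampling", "batch_size": batch_size}


def train(train_dataset, eval_dataset=None, base_model=BASE_MODEL):
    """Fine-tune a multi-label SetFit model.

    Expects datasets.Dataset objects with a "text" column and a "label" column
    holding 0/1 vectors.
    """
    # Imported lazily so the helper above can be used without setfit installed.
    from setfit import SetFitModel, Trainer, TrainingArguments

    # "one-vs-rest" fits one binary head per label; the project may use another strategy.
    model = SetFitModel.from_pretrained(base_model, multi_target_strategy="one-vs-rest")
    args = TrainingArguments(**training_settings())
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()
    return model
```

Changing BASE_MODEL swaps only the sentence-transformer backbone; the rest of the training call stays the same.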

Additional Secrets

hf jobs uv run \
  --secrets HF_TOKEN \
  --secrets WANDB_API_KEY \
  --env HF_ORGANIZATION=baobabtech \
  --env WANDB_PROJECT=water-conflict \
  classifier/train_on_hf.py

Data Sources

The training data combines:

Positive Examples: Water conflict headlines from Pacific Institute Water Conflict Chronology
Negative Examples: Non-water conflict events from ACLED

Both positive and negative examples are labeled for three categories: Trigger, Casualty, and Weapon.

Resources

HF Jobs Guide
UV Script Format (used in train_on_hf.py)
SetFit Documentation
Pacific Institute Water Conflict Chronology
ACLED Data

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

You are free to use, share, and adapt this work for non-commercial purposes with appropriate attribution to Baobab Tech. For commercial licensing inquiries, please contact Baobab Tech.


