Sourcerers: Source-Selection Robustness in LLMs
A reproducible NLP experimentation framework for analyzing how LLMs choose among politically diverse news sources under controlled prompt conditions.
This repository supports:
- Offline incident preparation from real news JSON files.
- Condition-controlled prompt construction for source-selection experiments.
- Multi-model evaluation via local Ollama models.
- Analytics through Streamlit and FastAPI.
- Report-ready artifacts (plots + summary tables) generated from saved runs.
1) Why This Project Exists
Modern LLMs can appear neutral while still exhibiting selection bias, source-identity overreliance, or prompt-format sensitivity. This project tests those risks directly by asking models to choose one article from left/center/right candidates across multiple controlled conditions.
Core research intent:
- Measure robustness of model choices when source labels are manipulated.
- Compare inter-model behavior under identical candidate sets.
- Track reliability signals such as parse stability and latency.
2) Project Scope And Pipeline
Input Data
- Real incidents are built from JSON articles in data/jsons.
- The preparation step groups articles into topic-level incidents that include left, center, and right coverage.
Experimental Conditions
- headlines_only
- headlines_with_sources
- sources_only
- headlines_with_manipulated_sources
These conditions isolate content effects from source-identity effects.
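As an illustration, the sketch below shows how a single candidate might be rendered under each condition. The function and field names are hypothetical, not the project's actual prompt-construction code:

```python
# Illustrative sketch only: function and field names are hypothetical.
# It shows how the four conditions differ in what they reveal per candidate.
def render_candidate(article: dict, condition: str, shown_source: str) -> str:
    """Render one candidate line for the selection prompt.

    `shown_source` lets headlines_with_manipulated_sources display a swapped
    label while the underlying article is unchanged.
    """
    if condition == "headlines_only":
        return article["headline"]
    if condition == "sources_only":
        return shown_source
    # headlines_with_sources and headlines_with_manipulated_sources differ
    # only in which label is passed as shown_source.
    return f"{article['headline']} ({shown_source})"
```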
Output Artifacts Per Run
Each run folder in outputs/run_YYYYMMDD_HHMMSS includes:
- experiment_requests.jsonl
- model_decisions.jsonl
- raw_outputs.jsonl
These files are enough to fully reproduce downstream analytics and plots.
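Because each run folder is self-contained, analytics can be reloaded without re-querying any model. A minimal loading sketch (the record schema is an assumption inferred from the examples later in this README):

```python
import json
from pathlib import Path

# Minimal sketch: load one run's decisions for offline analysis. The exact
# record schema is an assumption inferred from examples in this README.
run_dir = Path("outputs/run_20260416_171758")  # an example run folder
with open(run_dir / "model_decisions.jsonl", encoding="utf-8") as f:
    decisions = [json.loads(line) for line in f if line.strip()]
print(f"loaded {len(decisions)} decisions")
```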
3) Repository Structure (Key Files)
- dashboard.py: Streamlit interface for analytics and experiment execution.
- app/cli/prepare_real_incidents.py: Converts raw article JSON data into experiment-ready incidents.
- app/cli/run_experiments.py: Runs condition/model combinations and writes run artifacts.
- app/cli/generate_report_assets.py: Builds report plots and summary tables from outputs.
- app/cli/generate_llm_dashboard_summary.py: Generates a saved LLM executive summary JSON for the dashboard (offline, via Ollama).
- app/api/engine_analytics.py: Ingestion + metrics engine used by API and dashboard.
- configs/models.example.yaml: Manifest of Ollama models and decoding params.
- docs/figures/: Generated report assets (plots and summary metrics).
4) Quickstart (Reproducible)
Prerequisites
- Python 3.10
- Ollama installed locally
- uv package manager
Setup
uv venv --python 3.10
uv sync
Pull Models (example)
ollama pull qwen2.5:7b
ollama pull qwen3:8b
ollama pull gemma3:4b
Start Ollama
ollama serve
Run Dashboard
uv run streamlit run dashboard.py
Run Tests
uv run pytest -q
5) End-To-End CLI Workflow
A. Prepare incidents from raw JSON
uv run python -m app.cli.prepare_real_incidents \
--json-dir data/jsons \
--output data/real_incidents_all.jsonl \
--min-per-leaning 3 \
--max-articles-per-leaning 8
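Conceptually, these flags enforce per-leaning coverage before an incident is kept. A simplified sketch of that filter follows; the field names and helper are assumptions, not the actual prepare_real_incidents internals:

```python
from collections import defaultdict

# Simplified sketch of the coverage filter implied by the flags above; the
# 'topic'/'leaning' fields and this helper are assumptions, not the real code.
def build_incidents(articles, min_per_leaning=3, max_per_leaning=8):
    by_topic = defaultdict(lambda: defaultdict(list))
    for art in articles:
        by_topic[art["topic"]][art["leaning"]].append(art)
    incidents = []
    for topic, groups in by_topic.items():
        leanings = ("left", "center", "right")
        if all(len(groups.get(l, [])) >= min_per_leaning for l in leanings):
            incidents.append({
                "topic": topic,
                "candidates": {l: groups[l][:max_per_leaning] for l in leanings},
            })
    return incidents
```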
B. Run experiments
uv run python -m app.cli.run_experiments \
--input data/real_incidents_all.jsonl \
--models-manifest configs/models.example.yaml \
--output-dir outputs \
--conditions headlines_only headlines_with_sources sources_only headlines_with_manipulated_sources \
--max-combinations 3 \
--seed 42
Optional runtime optimization flags (disabled by default for reproducibility with existing outputs):
uv run python -m app.cli.run_experiments \
--input data/real_incidents_all.jsonl \
--models-manifest configs/models.example.yaml \
--output-dir outputs \
--enable-flash-attention \
--enable-kv-cache \
--kv-cache-type q8_0
Notes:
- Default behavior is unchanged unless these flags are explicitly provided.
- Runtime options are recorded in each request row under runtime_options for traceability.
C. Generate report assets from saved outputs
uv run python -m app.cli.generate_report_assets \
--outputs-dir outputs \
--assets-dir docs/figures
D. Generate offline LLM executive summary for dashboard
uv run python -m app.cli.generate_llm_dashboard_summary \
--outputs-dir outputs \
--model gemma4:latest \
--summary-json outputs/llm_dashboard_summary.json
This writes a reusable summary file that the dashboard shows in the top "✨ LLM Summary" section.
E. Build technical documentation site
uv sync --group docs
uv run mkdocs serve
Static build check:
uv run mkdocs build --strict
The documentation site includes architecture, usage guides, and auto-generated API reference from source modules via mkdocstrings.
GitHub Pages publishing is automated by .github/workflows/docs.yml.
F. Build and publish as a pip package
Distribution name:
sourcerers
Local package build + validation:
uv sync
uv run python -m build
uv run twine check dist/*
Install locally from built artifacts:
pip install dist/*.whl
Import examples:
from app import (
OllamaClient,
parse_model_response,
build_condition_bundles,
build_selection_prompt,
)
Automated PyPI publishing workflow:
.github/workflows/publish-pypi.yml
To enable publishing in your repo:
- Create a GitHub environment named pypi.
- In PyPI project settings, configure Trusted Publisher for this repository/workflow.
- Publish a GitHub Release (or run the workflow manually).
CI validates package build health on every push/PR via .github/workflows/ci.yml.
6) Evaluation Protocol
Main Evaluation Signals
- Parse reliability: success/fallback/failure rates from strict-JSON structured response parsing.
- Latency: mean and p95 latency per model.
- Selection distribution: left/center/right choice ratios.
- Robustness proxy: sensitivity to manipulated source labels.
- Position effect signal: selected candidate index distribution.
- Counterfactual label sensitivity: choice change rate between real-source and swapped-source conditions.
- Cross-model agreement and instability: entropy-based disagreement across models on the same incident.
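To make the last signal concrete, one plausible formalization (not necessarily the exact implementation) is the Shannon entropy of the models' choices on an incident, normalized by the maximum entropy over the three leanings:

```python
import math
from collections import Counter

# One plausible formalization of entropy-based disagreement: 0 means all
# models picked the same leaning; 1 means choices were spread evenly
# across left/center/right.
def normalized_choice_entropy(choices: list[str]) -> float:
    counts = Counter(choices)
    total = len(choices)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(3)  # 3 possible leanings

print(normalized_choice_entropy(["left", "center", "right", "center"]))  # ~0.946
```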
Baseline Included
The report assets include a candidate-mix random baseline for center selection:
- Baseline center rate = mean proportion of center candidates offered to the model.
- Model center selection rates are compared against this baseline.
This baseline is simple but useful for detecting whether a model selects center above or below chance, given candidate availability.
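For example, if every incident offered 3 left, 3 center, and 2 right candidates, the baseline center rate would be 3/8 = 37.5%, so a model choosing center 50% of the time would be selecting it above chance.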
7) Current Empirical Snapshot (From outputs/)
Generated from existing run artifacts in this repository using app/cli/generate_report_assets.py.
Dataset coverage in current snapshot
- Decisions: 9312
- Runs: 6
- Models: 7
- Conditions: 4
Aggregate parser health
- Parse success: 84.91%
- Parse failure: 14.74%
Model summary table
| model | n | parse_success_rate | parse_fallback_rate | parse_failure_rate | avg_latency_ms | p95_latency_ms | center_selection_rate |
|---|---|---|---|---|---|---|---|
| qwen2.5:7b | 1344 | 99.93% | 0.00% | 0.07% | 11668 | 14992 | 38.35% |
| qwen3:8b | 1344 | 99.93% | 0.00% | 0.07% | 9494 | 11139 | 30.68% |
| mistral:latest | 1248 | 99.60% | 0.00% | 0.40% | 3036 | 3576 | 49.32% |
| gemma3:4b | 1344 | 99.33% | 0.00% | 0.67% | 5964 | 6831 | 40.97% |
| phi4-mini:3.8b | 1344 | 96.58% | 2.38% | 1.04% | 2189 | 2717 | 37.22% |
| llama3.2:3b | 1344 | 92.93% | 0.00% | 7.07% | 1572 | 1993 | 41.15% |
| gemma4:latest | 1344 | 7.14% | 0.00% | 92.86% | 2906 | 3490 | 50.00% |
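As a consistency check, the aggregate parse success above can be recomputed as the decision-weighted mean of the per-model rates in this table:

```python
# Consistency check: the 84.91% aggregate parse success equals the
# decision-weighted mean of the per-model success rates above.
rows = [  # (n, parse_success_rate %)
    (1344, 99.93), (1344, 99.93), (1248, 99.60), (1344, 99.33),
    (1344, 96.58), (1344, 92.93), (1344, 7.14),
]
weighted = sum(n * r for n, r in rows) / sum(n for n, _ in rows)
print(f"{weighted:.2f}%")  # -> 84.91%
```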
8) Generated Figures
- Parse reliability by model
- Latency by model (average and p95)
- Selection mix by condition
- Center delta heatmap (model × condition)
- Reliability-speed Pareto (bubble = instability)
- Parse reliability calibration
- Center selection vs baseline
- Condition to selected-leaning Sankey (interactive)
Sankey flow snapshot (counts extracted from the generated interactive figure):
| condition | to_left | to_center | to_right | total |
|---|---|---|---|---|
| headlines_only | 833 | 779 | 636 | 2248 |
| headlines_with_manipulated_sources | 1000 | 442 | 834 | 2276 |
| headlines_with_sources | 1054 | 919 | 303 | 2276 |
| sources_only | 648 | 1471 | 171 | 2290 |
Additional generated analysis assets
- Counterfactual effects table
- Cross-model agreement table
- Failure taxonomy table
- Model instability table
- Qualitative error examples
Counterfactual effects (inline)
| model | n_pairs | label_sensitivity_rate | ci95_low | ci95_high |
|---|---|---|---|---|
| llama3.2:3b | 298 | 0.527 | 0.473 | 0.584 |
| mistral:latest | 310 | 0.526 | 0.468 | 0.581 |
| gemma3:4b | 332 | 0.503 | 0.449 | 0.557 |
| qwen2.5:7b | 335 | 0.496 | 0.442 | 0.549 |
| gemma4:latest | 311 | 0.421 | 0.366 | 0.476 |
| phi4-mini:3.8b | 330 | 0.415 | 0.361 | 0.464 |
| qwen3:8b | 336 | 0.405 | 0.348 | 0.455 |
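The intervals above are consistent with a percentile bootstrap over per-pair flip indicators. The sketch below shows one way to produce such an interval; this is an assumption about the method, not a confirmed description of generate_report_assets:

```python
import random

# Percentile-bootstrap 95% CI for a label-sensitivity rate. This is an
# assumed method, not a confirmed description of the report generator.
def bootstrap_ci(flips: list[int], n_boot: int = 2000, seed: int = 42):
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.choices(flips, k=len(flips))) / len(flips)
        for _ in range(n_boot)
    )
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]

# flips[i] = 1 if the model changed its selection when source labels were
# swapped on pair i; 157/298 matches llama3.2:3b's 0.527 rate above.
flips = [1] * 157 + [0] * 141
low, high = bootstrap_ci(flips)
print(f"0.527 [{low:.3f}, {high:.3f}]")
```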
Cross-model agreement (inline)
| condition | n_groups | mean_agreement_rate | mean_normalized_entropy | instability_score |
|---|---|---|---|---|
| headlines_only | 424 | 0.641 | 0.633 | 0.633 |
| headlines_with_manipulated_sources | 424 | 0.624 | 0.654 | 0.654 |
| headlines_with_sources | 424 | 0.654 | 0.617 | 0.617 |
| sources_only | 424 | 0.744 | 0.465 | 0.465 |
Failure taxonomy (inline)
| model | parse_status | error_category | count | ratio_within_model |
|---|---|---|---|---|
| gemma3:4b | success | other | 1335 | 0.9933 |
| gemma3:4b | failed | invalid_or_missing_selected_article_id | 9 | 0.0067 |
| gemma4:latest | success | other | 1247 | 0.9992 |
| gemma4:latest | failed | invalid_or_missing_selected_article_id | 1 | 0.0008 |
| llama3.2:3b | success | other | 1249 | 0.9293 |
| llama3.2:3b | failed | invalid_or_missing_selected_article_id | 95 | 0.0707 |
| mistral:latest | success | other | 1243 | 0.9960 |
| mistral:latest | failed | invalid_or_missing_selected_article_id | 5 | 0.0040 |
| phi4-mini:3.8b | success | other | 1298 | 0.9658 |
| phi4-mini:3.8b | fallback | fallback_after_malformed_json | 32 | 0.0238 |
| phi4-mini:3.8b | failed | invalid_or_missing_selected_article_id | 14 | 0.0104 |
| qwen2.5:7b | success | other | 1343 | 0.9993 |
| qwen2.5:7b | failed | other | 1 | 0.0007 |
| qwen3:8b | success | other | 1343 | 0.9993 |
| qwen3:8b | failed | invalid_or_missing_selected_article_id | 1 | 0.0007 |
Model instability (inline)
| model | n_incidents | instability_score |
|---|---|---|
| qwen2.5:7b | 112 | 0.628 |
| llama3.2:3b | 112 | 0.625 |
| phi4-mini:3.8b | 112 | 0.619 |
| gemma3:4b | 112 | 0.613 |
| qwen3:8b | 112 | 0.610 |
| mistral:latest | 104 | 0.593 |
| gemma4:latest | 104 | 0.558 |
Qualitative error examples (inline)
Preview excerpt from generated qualitative errors:
run_20260416_171758 | qwen3:8b | headlines_only | topic_fake_news | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "7JFQGvJ0LKQOMe0t", "reason": "Offers a proactive approach to combating fake news."}
run_20260416_171758 | gemma3:4b | headlines_with_manipulated_sources | topic_technology | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "0xcOUPRRvmYf5mX1H", "reason": "This article discusses the potential role of big tech in radicalization ..."}
run_20260416_171758 | gemma3:4b | headlines_with_sources | topic_us_house | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "3", "reason": "The article from Vox provides a good overview of the situation ..."}
run_20260416_171758 | gemma3:4b | headlines_only | topic_epa | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "3", "reason": "The article detailing the Executive action to kill the Clean Power Plan ..."}
run_20260416_171758 | gemma3:4b | headlines_only | topic_business | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "1", "reason": "This article discusses a major leadership change at PepsiCo ..."}
9) FastAPI Analytics (Optional)
Run API locally:
uv run uvicorn app.api.engine_analytics:app --host 0.0.0.0 --port 8000 --reload
Useful endpoints:
- GET /metrics/inter-model
- GET /metrics/summary
- GET /metrics/conditions-by-model
- GET /metrics/compare-runs?run_a=...&run_b=...
- POST /ingest/run
- POST /ingest/runs
Docs: FastAPI serves interactive OpenAPI docs at /docs (and /redoc) by default once the server is running.
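A quick smoke test of the summary endpoint (assuming the default host/port from the uvicorn command above):

```python
import json
from urllib.request import urlopen

# Smoke test against a locally running instance; assumes the default
# host/port from the uvicorn command above.
with urlopen("http://localhost:8000/metrics/summary") as resp:
    summary = json.load(resp)
print(json.dumps(summary, indent=2)[:500])
```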
10) CI (GitHub Actions)
This repository includes a reliability-focused CI pipeline under .github/workflows.
CI workflow
File: .github/workflows/ci.yml
Runs on every push to main and on pull requests. It enforces reliability by:
- Installing dependencies with uv in a clean environment.
- Running the test suite.
- Regenerating analytics artifacts from outputs.
- Validating summary.json schema and metric ranges.
- Validating generated report artifacts (summary table, qualitative errors, limitations).
- Uploading report assets as a workflow artifact.
11) Streamlit Dashboard Publishing
If you publish dashboard.py via Streamlit Community Cloud:
- Keep dashboard.py as the app entrypoint.
- Point Streamlit Cloud to this repository.
- Use requirements.txt for dependency installation.
Live app:
12) Public FastAPI Deployment
The analytics API can be published separately so others can access your metrics endpoints.
Option A: Render (recommended quick path)
This repo includes render.yaml for one-click web service deployment.
Live API base URL:
Steps:
- Connect this repository in Render.
- Select Blueprint deploy (it will read render.yaml).
- After deploy, use:
- Optional safety defaults already set in render.yaml:
  - ENABLE_ANALYTICS_WRITE_ENDPOINTS=0
  - API_ALLOW_ORIGINS=*
Option B: Any container/PaaS
Run the same API command with platform port binding:
uvicorn app.api.engine_analytics:app --host 0.0.0.0 --port $PORT
Useful public endpoints:
- GET /metrics/summary
- GET /metrics/inter-model
- GET /metrics/conditions-by-model
13) Reproducibility Freeze
Final reporting now supports reproducibility metadata using:
- Frozen manifest: configs/models.final.yaml
- Frozen seed: 42
Generate enriched report assets (with confidence intervals, qualitative error samples, and limitations):
uv run python -m app.cli.generate_report_assets \
--outputs-dir outputs \
--assets-dir docs/figures \
--frozen-manifest configs/models.final.yaml \
--frozen-seed 42