Sourcerers: Source-Selection Robustness in LLMs

Local Ollama experiment pipeline for source-selection behavior analysis

A reproducible NLP experimentation framework for analyzing how LLMs choose among politically diverse news sources under controlled prompt conditions.

This repository supports:

  • Offline incident preparation from real news JSON files.
  • Condition-controlled prompt construction for source-selection experiments.
  • Multi-model evaluation via local Ollama models.
  • Analytics through Streamlit and FastAPI.
  • Report-ready artifacts (plots + summary tables) generated from saved runs.

1) Why This Project Exists

Modern LLMs can appear neutral while still exhibiting selection bias, source-identity overreliance, or prompt-format sensitivity. This project tests those risks directly by asking models to choose one article from left/center/right candidates across multiple controlled conditions.

Core research intent:

  • Measure robustness of model choices when source labels are manipulated.
  • Compare inter-model behavior under identical candidate sets.
  • Track reliability signals such as parse stability and latency.

2) Project Scope And Pipeline

Input Data

  • Real incidents are built from JSON articles in data/jsons.
  • The preparation step groups topic-level incidents that include left, center, and right coverage.

Experimental Conditions

  • headlines_only
  • headlines_with_sources
  • sources_only
  • headlines_with_manipulated_sources

These conditions separate content effects from source-identity effects.
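
As a rough sketch of how the conditions differ (illustrative only; the real construction lives in build_condition_bundles and build_selection_prompt, and the field names below are assumptions, not the repository's verified schema):

```python
# Illustrative sketch: what each condition exposes per candidate article.
# Field names (article_id, headline, source, swapped_source) are assumptions.
def render_candidate(article: dict, condition: str) -> str:
    if condition == "headlines_only":
        return f'{article["article_id"]}: {article["headline"]}'
    if condition == "sources_only":
        return f'{article["article_id"]}: source={article["source"]}'
    if condition == "headlines_with_sources":
        return f'{article["article_id"]}: {article["headline"]} (source: {article["source"]})'
    if condition == "headlines_with_manipulated_sources":
        # Same headline, but the source label is swapped to probe label sensitivity.
        return f'{article["article_id"]}: {article["headline"]} (source: {article["swapped_source"]})'
    raise ValueError(f"unknown condition: {condition}")
```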

Output Artifacts Per Run

Each run folder in outputs/run_YYYYMMDD_HHMMSS includes:

  • experiment_requests.jsonl
  • model_decisions.jsonl
  • raw_outputs.jsonl

These files are enough to fully reproduce downstream analytics and plots.
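
Because the artifacts are plain JSONL, downstream analysis can re-load them directly; a minimal sketch (the run folder name is taken from the qualitative examples later in this README):

```python
import json
from pathlib import Path

# Minimal sketch: re-load the decision rows from one saved run.
run_dir = Path("outputs/run_20260416_171758")
with (run_dir / "model_decisions.jsonl").open() as f:
    decisions = [json.loads(line) for line in f if line.strip()]
print(f"loaded {len(decisions)} decisions")
```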

3) Repository Structure (Key Files)

  • dashboard.py: Streamlit interface for analytics and experiment execution.
  • app/cli/prepare_real_incidents.py: Converts raw article JSON data into experiment-ready incidents.
  • app/cli/run_experiments.py: Runs condition/model combinations and writes run artifacts.
  • app/cli/generate_report_assets.py: Builds report plots and summary tables from outputs.
  • app/cli/generate_llm_dashboard_summary.py: Generates a saved LLM executive summary JSON for the dashboard (offline, via Ollama).
  • app/api/engine_analytics.py: Ingestion + metrics engine used by API and dashboard.
  • configs/models.example.yaml: Manifest of Ollama models and decoding params.
  • docs/figures/: Generated report assets (plots and summary metrics).

4) Quickstart (Reproducible)

Prerequisites

  • Python 3.10
  • Ollama installed locally
  • uv package manager

Setup

uv venv --python 3.10
uv sync

Pull Models (example)

ollama pull qwen2.5:7b
ollama pull qwen3:8b
ollama pull gemma3:4b

Start Ollama

ollama serve

Run Dashboard

uv run streamlit run dashboard.py

Run Tests

uv run pytest -q

5) End-To-End CLI Workflow

A. Prepare incidents from raw JSON

uv run python -m app.cli.prepare_real_incidents \
  --json-dir data/jsons \
  --output data/real_incidents_all.jsonl \
  --min-per-leaning 3 \
  --max-articles-per-leaning 8
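
A quick sanity check on the prepared file can be sketched as follows (the articles/leaning field names are assumptions; the authoritative schema is whatever prepare_real_incidents.py writes):

```python
import json
from collections import Counter

# Hypothetical sanity check: each incident should cover left/center/right
# with at least --min-per-leaning (here 3) articles. Field names are assumptions.
with open("data/real_incidents_all.jsonl") as f:
    for i, line in enumerate(f):
        incident = json.loads(line)
        counts = Counter(article["leaning"] for article in incident["articles"])
        missing = [l for l in ("left", "center", "right") if counts[l] < 3]
        if missing:
            print(f"incident {i} under-covered for: {missing} ({dict(counts)})")
```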

B. Run experiments

uv run python -m app.cli.run_experiments \
  --input data/real_incidents_all.jsonl \
  --models-manifest configs/models.example.yaml \
  --output-dir outputs \
  --conditions headlines_only headlines_with_sources sources_only headlines_with_manipulated_sources \
  --max-combinations 3 \
  --seed 42

Optional runtime optimization flags (disabled by default for reproducibility with existing outputs):

uv run python -m app.cli.run_experiments \
  --input data/real_incidents_all.jsonl \
  --models-manifest configs/models.example.yaml \
  --output-dir outputs \
  --enable-flash-attention \
  --enable-kv-cache \
  --kv-cache-type q8_0

Notes:

  • Default behavior is unchanged unless these flags are explicitly provided.
  • Runtime options are recorded in each request row under runtime_options for traceability.
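
To audit what a saved run was produced with, the recorded options can be read back from its request rows; a small sketch (using a run id that appears later in this README):

```python
import json

# Sketch: inspect the runtime options recorded with a saved run's requests.
with open("outputs/run_20260416_171758/experiment_requests.jsonl") as f:
    first_request = json.loads(f.readline())
print(first_request.get("runtime_options"))
```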

C. Generate report assets from saved outputs

uv run python -m app.cli.generate_report_assets \
  --outputs-dir outputs \
  --assets-dir docs/figures

D. Generate offline LLM executive summary for dashboard

uv run python -m app.cli.generate_llm_dashboard_summary \
  --outputs-dir outputs \
  --model gemma4:latest \
  --summary-json outputs/llm_dashboard_summary.json

This writes a reusable summary file that the dashboard shows in the top "✨ LLM Summary" section.

E. Build technical documentation site

uv sync --group docs
uv run mkdocs serve

Static build check:

uv run mkdocs build --strict

The documentation site includes architecture notes, usage guides, and an auto-generated API reference built from source modules via mkdocstrings. GitHub Pages publishing is automated by .github/workflows/docs.yml.

F. Build and publish as a pip package

Distribution name:

  • sourcerers

Local package build + validation:

uv sync
uv run python -m build
uv run twine check dist/*

Install locally from built artifacts:

pip install dist/*.whl

Import examples:

from app import (
  OllamaClient,
  parse_model_response,
  build_condition_bundles,
  build_selection_prompt,
)

Automated PyPI publishing workflow:

  • .github/workflows/publish-pypi.yml

To enable publishing in your repo:

  1. Create GitHub environment pypi.
  2. In PyPI project settings, configure Trusted Publisher for this repository/workflow.
  3. Publish a GitHub Release (or run the workflow manually).

CI validates package build health on every push and pull request via .github/workflows/ci.yml.

6) Evaluation Protocol

Main Evaluation Signals

  • Parse reliability: success/fallback/failure rates from strict-JSON structured response parsing.
  • Latency: mean and p95 latency per model.
  • Selection distribution: left/center/right choice ratios.
  • Robustness proxy: sensitivity to manipulated source labels.
  • Position effect signal: selected candidate index distribution.
  • Counterfactual label sensitivity: change rate between real-source vs swapped-source conditions.
  • Cross-model agreement and instability: entropy-based disagreement across models on the same incident.
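
For the last signal, one plausible formulation of instability is normalized Shannon entropy over the leanings the models select for the same incident; a minimal sketch (not necessarily the repository's exact normalization):

```python
import math
from collections import Counter

def normalized_entropy(choices: list[str]) -> float:
    """0.0 = all models agree; 1.0 = choices spread uniformly
    over the distinct options that were actually selected."""
    counts = Counter(choices)
    n = len(choices)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    k = len(counts)
    return entropy / math.log2(k) if k > 1 else 0.0

# Example: 5 of 7 models pick center, one picks left, one picks right.
print(round(normalized_entropy(["center"] * 5 + ["left", "right"]), 3))  # ~0.725
```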

Baseline Included

The report assets include a candidate-mix random baseline for center selection:

  • Baseline center rate = mean proportion of center candidates offered to the model.
  • Model center selection rates are compared against this baseline.

This baseline is simple but useful for detecting whether models select center above or below chance, given the candidate mix they were offered.
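
Concretely, the baseline is just the mean per-decision share of center candidates; a sketch with made-up candidate mixes:

```python
# Sketch: candidate-mix random baseline for center selection.
# Each inner list is the leanings offered to the model in one decision.
candidate_sets = [
    ["left", "center", "right"],
    ["left", "left", "center", "right"],  # an unbalanced mix
]
baseline = sum(s.count("center") / len(s) for s in candidate_sets) / len(candidate_sets)
print(f"baseline center rate: {baseline:.2%}")  # 29.17% for these two mixes
```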

7) Current Empirical Snapshot (From outputs/)

Generated from existing run artifacts in this repository using app/cli/generate_report_assets.py.

Dataset coverage in current snapshot

  • Decisions: 9312
  • Runs: 6
  • Models: 7
  • Conditions: 4

Aggregate parser health

  • Parse success: 84.91%
  • Parse failure: 14.74%

Model summary table

| model | n | parse_success_rate | parse_fallback_rate | parse_failure_rate | avg_latency_ms | p95_latency_ms | center_selection_rate |
|---|---|---|---|---|---|---|---|
| qwen2.5:7b | 1344 | 99.93% | 0.00% | 0.07% | 11668 | 14992 | 38.35% |
| qwen3:8b | 1344 | 99.93% | 0.00% | 0.07% | 9494 | 11139 | 30.68% |
| mistral:latest | 1248 | 99.60% | 0.00% | 0.40% | 3036 | 3576 | 49.32% |
| gemma3:4b | 1344 | 99.33% | 0.00% | 0.67% | 5964 | 6831 | 40.97% |
| phi4-mini:3.8b | 1344 | 96.58% | 2.38% | 1.04% | 2189 | 2717 | 37.22% |
| llama3.2:3b | 1344 | 92.93% | 0.00% | 7.07% | 1572 | 1993 | 41.15% |
| gemma4:latest | 1344 | 7.14% | 0.00% | 92.86% | 2906 | 3490 | 50.00% |

8) Generated Figures

The following figures are generated into docs/figures/:

  • Parse reliability by model
  • Latency by model (average and p95)
  • Selection mix by condition
  • Center delta heatmap (model × condition)
  • Reliability-speed Pareto (bubble = instability)
  • Parse reliability calibration
  • Center selection vs baseline

Condition to selected-leaning Sankey (interactive)

Sankey flow snapshot (counts extracted from the generated interactive figure):

| condition | to_left | to_center | to_right | total |
|---|---|---|---|---|
| headlines_only | 833 | 779 | 636 | 2248 |
| headlines_with_manipulated_sources | 1000 | 442 | 834 | 2276 |
| headlines_with_sources | 1054 | 919 | 303 | 2276 |
| sources_only | 648 | 1471 | 171 | 2290 |

Condition to bucket Sankey snapshot

Additional generated analysis assets

Counterfactual effects (inline)

| model | n_pairs | label_sensitivity_rate | ci95_low | ci95_high |
|---|---|---|---|---|
| llama3.2:3b | 298 | 0.527 | 0.473 | 0.584 |
| mistral:latest | 310 | 0.526 | 0.468 | 0.581 |
| gemma3:4b | 332 | 0.503 | 0.449 | 0.557 |
| qwen2.5:7b | 335 | 0.496 | 0.442 | 0.549 |
| gemma4:latest | 311 | 0.421 | 0.366 | 0.476 |
| phi4-mini:3.8b | 330 | 0.415 | 0.361 | 0.464 |
| qwen3:8b | 336 | 0.405 | 0.348 | 0.455 |

Cross-model agreement (inline)

| condition | n_groups | mean_agreement_rate | mean_normalized_entropy | instability_score |
|---|---|---|---|---|
| headlines_only | 424 | 0.641 | 0.633 | 0.633 |
| headlines_with_manipulated_sources | 424 | 0.624 | 0.654 | 0.654 |
| headlines_with_sources | 424 | 0.654 | 0.617 | 0.617 |
| sources_only | 424 | 0.744 | 0.465 | 0.465 |

Failure taxonomy (inline)

| model | parse_status | error_category | count | ratio_within_model |
|---|---|---|---|---|
| gemma3:4b | success | other | 1335 | 0.9933 |
| gemma3:4b | failed | invalid_or_missing_selected_article_id | 9 | 0.0067 |
| gemma4:latest | success | other | 1247 | 0.9992 |
| gemma4:latest | failed | invalid_or_missing_selected_article_id | 1 | 0.0008 |
| llama3.2:3b | success | other | 1249 | 0.9293 |
| llama3.2:3b | failed | invalid_or_missing_selected_article_id | 95 | 0.0707 |
| mistral:latest | success | other | 1243 | 0.9960 |
| mistral:latest | failed | invalid_or_missing_selected_article_id | 5 | 0.0040 |
| phi4-mini:3.8b | success | other | 1298 | 0.9658 |
| phi4-mini:3.8b | fallback | fallback_after_malformed_json | 32 | 0.0238 |
| phi4-mini:3.8b | failed | invalid_or_missing_selected_article_id | 14 | 0.0104 |
| qwen2.5:7b | success | other | 1343 | 0.9993 |
| qwen2.5:7b | failed | other | 1 | 0.0007 |
| qwen3:8b | success | other | 1343 | 0.9993 |
| qwen3:8b | failed | invalid_or_missing_selected_article_id | 1 | 0.0007 |

Model instability (inline)

| model | n_incidents | instability_score |
|---|---|---|
| qwen2.5:7b | 112 | 0.628 |
| llama3.2:3b | 112 | 0.625 |
| phi4-mini:3.8b | 112 | 0.619 |
| gemma3:4b | 112 | 0.613 |
| qwen3:8b | 112 | 0.610 |
| mistral:latest | 104 | 0.593 |
| gemma4:latest | 104 | 0.558 |

Qualitative error examples (inline)

Preview excerpt from generated qualitative errors:

run_20260416_171758 | qwen3:8b | headlines_only | topic_fake_news | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "7JFQGvJ0LKQOMe0t", "reason": "Offers a proactive approach to combating fake news."}

run_20260416_171758 | gemma3:4b | headlines_with_manipulated_sources | topic_technology | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "0xcOUPRRvmYf5mX1H", "reason": "This article discusses the potential role of big tech in radicalization ..."}

run_20260416_171758 | gemma3:4b | headlines_with_sources | topic_us_house | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "3", "reason": "The article from Vox provides a good overview of the situation ..."}

run_20260416_171758 | gemma3:4b | headlines_only | topic_epa | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "3", "reason": "The article detailing the Executive action to kill the Clean Power Plan ..."}

run_20260416_171758 | gemma3:4b | headlines_only | topic_business | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "1", "reason": "This article discusses a major leadership change at PepsiCo ..."}

9) FastAPI Analytics (Optional)

Run API locally:

uv run uvicorn app.api.engine_analytics:app --host 0.0.0.0 --port 8000 --reload

Useful endpoints:

  • GET /metrics/inter-model
  • GET /metrics/summary
  • GET /metrics/conditions-by-model
  • GET /metrics/compare-runs?run_a=...&run_b=...
  • POST /ingest/run
  • POST /ingest/runs
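
Any HTTP client works against these; for example, with only the standard library:

```python
import json
import urllib.request

# Fetch aggregate summary metrics from the locally running API.
with urllib.request.urlopen("http://localhost:8000/metrics/summary") as resp:
    summary = json.load(resp)
print(json.dumps(summary, indent=2))
```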

Docs: FastAPI serves interactive OpenAPI docs at /docs (and ReDoc at /redoc) while the server is running.

10) CI (GitHub Actions)

This repository includes a reliability-focused CI pipeline under .github/workflows.

CI workflow

File: .github/workflows/ci.yml

Runs on every push to main and on pull requests. It enforces reliability by:

  • Installing dependencies with uv in a clean environment.
  • Running the test suite.
  • Regenerating analytics artifacts from outputs.
  • Validating summary.json schema and metric ranges.
  • Validating generated report artifacts (summary table, qualitative errors, limitations).
  • Uploading report assets as a workflow artifact.

11) Streamlit Dashboard Publishing

If you publish dashboard.py via Streamlit Community Cloud:

  • Keep dashboard.py as the app entrypoint.
  • Point Streamlit Cloud to this repository.
  • Use requirements.txt for dependency installation.

Live app:

12) Public FastAPI Deployment

The analytics API can be published separately so others can access your metrics endpoints.

Option A: Render (recommended quick path)

This repo includes render.yaml for one-click web service deployment.

Live API base URL:

Steps:

  1. Connect this repository in Render.
  2. Select Blueprint deploy (it will read render.yaml).
  3. After deploy, use the live API base URL (listed above).
  4. Optional safety defaults already set in render.yaml:
  • ENABLE_ANALYTICS_WRITE_ENDPOINTS=0
  • API_ALLOW_ORIGINS=*

Option B: Any container/PaaS

Run the same API command with platform port binding:

uvicorn app.api.engine_analytics:app --host 0.0.0.0 --port $PORT

Useful public endpoints:

  • GET /metrics/summary
  • GET /metrics/inter-model
  • GET /metrics/conditions-by-model

13) Reproducibility Freeze

Final reporting now supports reproducibility metadata using:

  • Frozen manifest: configs/models.final.yaml
  • Frozen seed: 42

Generate enriched report assets (with confidence intervals, qualitative error samples, and limitations):

uv run python -m app.cli.generate_report_assets \
  --outputs-dir outputs \
  --assets-dir docs/figures \
  --frozen-manifest configs/models.final.yaml \
  --frozen-seed 42

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sourcerers-0.1.0.tar.gz (43.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sourcerers-0.1.0-py3-none-any.whl (51.7 kB view details)

Uploaded Python 3

File details

Details for the file sourcerers-0.1.0.tar.gz.

File metadata

  • Download URL: sourcerers-0.1.0.tar.gz
  • Upload date:
  • Size: 43.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sourcerers-0.1.0.tar.gz
Algorithm Hash digest
SHA256 c27bcd4810990de0f19cc3366a83170ee4e93ef6ae2ce134c056e92ce31125dc
MD5 065e144b2562dd88613bce559ccd84d8
BLAKE2b-256 3ba344e56d8559b3874f2551f6b7931265a252320ea8e342564399a08f1c1d1d

See more details on using hashes here.

Provenance

The following attestation bundles were made for sourcerers-0.1.0.tar.gz:

Publisher: publish-pypi.yml on amirhossein-razlighi/LLM-News-Bias-Analysis

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sourcerers-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: sourcerers-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 51.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sourcerers-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 864f97b9d2ecd2fa8e24bf301be9065c471407ec61a1112e6833382ce101a457
MD5 1439466b092dbcfea31dbfade0fd65fd
BLAKE2b-256 22fc4e005f789e6f3c92a776d38d91e5bda2c5456b653078ca4b86abbe8f2fde

See more details on using hashes here.

Provenance

The following attestation bundles were made for sourcerers-0.1.0-py3-none-any.whl:

Publisher: publish-pypi.yml on amirhossein-razlighi/LLM-News-Bias-Analysis

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page