Sourcerers: Source-Selection Robustness in LLMs
A reproducible NLP experimentation framework for analyzing how LLMs choose among politically diverse news sources under controlled prompt conditions.
This repository supports:
- Offline incident preparation from real news JSON files.
- Condition-controlled prompt construction for source-selection experiments.
- Multi-model evaluation via local Ollama models.
- Analytics through Streamlit and FastAPI.
- Report-ready artifacts (plots + summary tables) generated from saved runs.
1) Why This Project Exists
Modern LLMs can appear neutral while still exhibiting selection bias, source-identity overreliance, or prompt-format sensitivity. This project tests those risks directly by asking models to choose one article from left/center/right candidates across multiple controlled conditions.
Core research intent:
- Measure robustness of model choices when source labels are manipulated.
- Compare inter-model behavior under identical candidate sets.
- Track reliability signals such as parse stability and latency.
2) Project Scope And Pipeline
Input Data
- Real incidents are built from JSON articles in data/jsons.
- The preparation step groups articles into topic-level incidents that include left, center, and right coverage.
Experimental Conditions
- headlines_only
- headlines_with_sources
- sources_only
- headlines_with_manipulated_sources
These conditions isolate content effects from source-identity effects.
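As an illustration, the sketch below shows how a single candidate might be rendered under each condition. The function and field names are hypothetical, not the project's actual prompt-construction code:

```python
# Illustrative sketch only: function and field names are hypothetical.
# It shows how the four conditions differ in what they reveal per candidate.
def render_candidate(article: dict, condition: str, shown_source: str) -> str:
    """Render one candidate line for the selection prompt.

    `shown_source` lets headlines_with_manipulated_sources display a swapped
    label while the underlying article is unchanged.
    """
    if condition == "headlines_only":
        return article["headline"]
    if condition == "sources_only":
        return shown_source
    # headlines_with_sources and headlines_with_manipulated_sources differ
    # only in which label is passed as shown_source.
    return f"{article['headline']} ({shown_source})"
```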
Output Artifacts Per Run
Each run folder in outputs/run_YYYYMMDD_HHMMSS includes:
- experiment_requests.jsonl
- model_decisions.jsonl
- raw_outputs.jsonl
These files are enough to fully reproduce downstream analytics and plots.
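Because each run folder is self-contained, analytics can be reloaded without re-querying any model. A minimal loading sketch (the record schema is an assumption inferred from the examples later in this README):

```python
import json
from pathlib import Path

# Minimal sketch: load one run's decisions for offline analysis. The exact
# record schema is an assumption inferred from examples in this README.
run_dir = Path("outputs/run_20260416_171758")  # an example run folder
with open(run_dir / "model_decisions.jsonl", encoding="utf-8") as f:
    decisions = [json.loads(line) for line in f if line.strip()]
print(f"loaded {len(decisions)} decisions")
```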
3) Repository Structure (Key Files)
- dashboard.py: Streamlit interface for analytics and experiment execution.
- app/cli/prepare_real_incidents.py: Converts raw article JSON data into experiment-ready incidents.
- app/cli/run_experiments.py: Runs condition/model combinations and writes run artifacts.
- app/cli/generate_report_assets.py: Builds report plots and summary tables from outputs.
- app/cli/generate_llm_dashboard_summary.py: Generates a saved LLM executive summary JSON for the dashboard (offline, via Ollama).
- app/api/engine_analytics.py: Ingestion + metrics engine used by API and dashboard.
- configs/models.example.yaml: Manifest of Ollama models and decoding params.
- docs/figures/: Generated report assets (plots and summary metrics).
4) Quickstart (Reproducible)
Prerequisites
- Python 3.10
- Ollama installed locally
- uv package manager
Setup
uv venv --python 3.10
uv sync
Pull Models (example)
ollama pull qwen2.5:7b
ollama pull qwen3:8b
ollama pull gemma3:4b
Start Ollama
ollama serve
Run Dashboard
uv run streamlit run dashboard.py
Run Tests
uv run pytest -q
5) End-To-End CLI Workflow
A. Prepare incidents from raw JSON
uv run python -m app.cli.prepare_real_incidents \
--json-dir data/jsons \
--output data/real_incidents_all.jsonl \
--min-per-leaning 3 \
--max-articles-per-leaning 8
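Conceptually, these flags enforce per-leaning coverage before an incident is kept. A simplified sketch of that filter follows; the field names and helper are assumptions, not the actual prepare_real_incidents internals:

```python
from collections import defaultdict

# Simplified sketch of the coverage filter implied by the flags above; the
# 'topic'/'leaning' fields and this helper are assumptions, not the real code.
def build_incidents(articles, min_per_leaning=3, max_per_leaning=8):
    by_topic = defaultdict(lambda: defaultdict(list))
    for art in articles:
        by_topic[art["topic"]][art["leaning"]].append(art)
    incidents = []
    for topic, groups in by_topic.items():
        leanings = ("left", "center", "right")
        if all(len(groups.get(l, [])) >= min_per_leaning for l in leanings):
            incidents.append({
                "topic": topic,
                "candidates": {l: groups[l][:max_per_leaning] for l in leanings},
            })
    return incidents
```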
B. Run experiments
uv run python -m app.cli.run_experiments \
--input data/real_incidents_all.jsonl \
--models-manifest configs/models.example.yaml \
--output-dir outputs \
--conditions headlines_only headlines_with_sources sources_only headlines_with_manipulated_sources \
--max-combinations 3 \
--seed 42
Optional runtime optimization flags (disabled by default for reproducibility with existing outputs):
uv run python -m app.cli.run_experiments \
--input data/real_incidents_all.jsonl \
--models-manifest configs/models.example.yaml \
--output-dir outputs \
--enable-flash-attention \
--enable-kv-cache \
--kv-cache-type q8_0
Notes:
- Default behavior is unchanged unless these flags are explicitly provided.
- Runtime options are recorded in each request row under runtime_options for traceability.
C. Generate report assets from saved outputs
uv run python -m app.cli.generate_report_assets \
--outputs-dir outputs \
--assets-dir docs/figures
D. Generate offline LLM executive summary for dashboard
uv run python -m app.cli.generate_llm_dashboard_summary \
--outputs-dir outputs \
--model gemma4:latest \
--summary-json outputs/llm_dashboard_summary.json
This writes a reusable summary file that the dashboard shows in the top "✨ LLM Summary" section.
E. Build technical documentation site
uv sync --group docs
uv run mkdocs serve
Static build check:
uv run mkdocs build --strict
The documentation site includes architecture, usage guides, and auto-generated API reference from source modules via mkdocstrings.
GitHub Pages publishing is automated by .github/workflows/docs.yml.
F. Build and publish as a pip package
Distribution name:
sourcerers
Local package build + validation:
uv sync
uv run python -m build
uv run twine check dist/*
Install locally from built artifacts:
pip install dist/*.whl
Import examples:
from app import (
OllamaClient,
parse_model_response,
build_condition_bundles,
build_selection_prompt,
)
Automated PyPI publishing workflow:
.github/workflows/publish-pypi.yml
To enable publishing in your repo:
- Create a GitHub environment named pypi.
- In PyPI project settings, configure Trusted Publisher for this repository/workflow.
- Publish a GitHub Release (or run the workflow manually).
CI validates package build health on every push/PR via .github/workflows/ci.yml.
6) Evaluation Protocol
Main Evaluation Signals
- Parse reliability: success/fallback/failure rates from strict-JSON structured response parsing.
- Latency: mean and p95 latency per model.
- Selection distribution: left/center/right choice ratios.
- Robustness proxy: sensitivity to manipulated source labels.
- Position effect signal: selected candidate index distribution.
- Counterfactual label sensitivity: choice change rate between real-source and swapped-source conditions.
- Cross-model agreement and instability: entropy-based disagreement across models on the same incident.
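To make the last signal concrete, one plausible formalization (not necessarily the exact implementation) is the Shannon entropy of the models' choices on an incident, normalized by the maximum entropy over the three leanings:

```python
import math
from collections import Counter

# One plausible formalization of entropy-based disagreement: 0 means all
# models picked the same leaning; 1 means choices were spread evenly
# across left/center/right.
def normalized_choice_entropy(choices: list[str]) -> float:
    counts = Counter(choices)
    total = len(choices)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(3)  # 3 possible leanings

print(normalized_choice_entropy(["left", "center", "right", "center"]))  # ~0.946
```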
Baseline Included
The report assets include a candidate-mix random baseline for center selection:
- Baseline center rate = mean proportion of center candidates offered to the model.
- Model center selection rates are compared against this baseline.
This baseline is simple but useful for detecting whether a model selects center above or below chance, given candidate availability.
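For example, if every incident offered 3 left, 3 center, and 2 right candidates, the baseline center rate would be 3/8 = 37.5%, so a model choosing center 50% of the time would be selecting it above chance.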
7) Current Empirical Snapshot (From outputs/)
Generated from existing run artifacts in this repository using app/cli/generate_report_assets.py.
Dataset coverage in current snapshot
- Decisions: 9312
- Runs: 6
- Models: 7
- Conditions: 4
Aggregate parser health
- Parse success: 84.91%
- Parse failure: 14.74%
Model summary table
| model | n | parse_success_rate | parse_fallback_rate | parse_failure_rate | avg_latency_ms | p95_latency_ms | center_selection_rate |
|---|---|---|---|---|---|---|---|
| qwen2.5:7b | 1344 | 99.93% | 0.00% | 0.07% | 11668 | 14992 | 38.35% |
| qwen3:8b | 1344 | 99.93% | 0.00% | 0.07% | 9494 | 11139 | 30.68% |
| mistral:latest | 1248 | 99.60% | 0.00% | 0.40% | 3036 | 3576 | 49.32% |
| gemma3:4b | 1344 | 99.33% | 0.00% | 0.67% | 5964 | 6831 | 40.97% |
| phi4-mini:3.8b | 1344 | 96.58% | 2.38% | 1.04% | 2189 | 2717 | 37.22% |
| llama3.2:3b | 1344 | 92.93% | 0.00% | 7.07% | 1572 | 1993 | 41.15% |
| gemma4:latest | 1344 | 7.14% | 0.00% | 92.86% | 2906 | 3490 | 50.00% |
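As a consistency check, the aggregate parse success above can be recomputed as the decision-weighted mean of the per-model rates in this table:

```python
# Consistency check: the 84.91% aggregate parse success equals the
# decision-weighted mean of the per-model success rates above.
rows = [  # (n, parse_success_rate %)
    (1344, 99.93), (1344, 99.93), (1248, 99.60), (1344, 99.33),
    (1344, 96.58), (1344, 92.93), (1344, 7.14),
]
weighted = sum(n * r for n, r in rows) / sum(n for n, _ in rows)
print(f"{weighted:.2f}%")  # -> 84.91%
```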
8) Generated Figures
- Parse reliability by model
- Latency by model (average and p95)
- Selection mix by condition
- Center delta heatmap (model × condition)
- Reliability-speed Pareto (bubble = instability)
- Parse reliability calibration
- Center selection vs baseline
- Condition to selected-leaning Sankey (interactive)
Sankey flow snapshot (counts extracted from the generated interactive figure):
| condition | to_left | to_center | to_right | total |
|---|---|---|---|---|
| headlines_only | 833 | 779 | 636 | 2248 |
| headlines_with_manipulated_sources | 1000 | 442 | 834 | 2276 |
| headlines_with_sources | 1054 | 919 | 303 | 2276 |
| sources_only | 648 | 1471 | 171 | 2290 |
Additional generated analysis assets
- Counterfactual effects table
- Cross-model agreement table
- Failure taxonomy table
- Model instability table
- Qualitative error examples
Counterfactual effects (inline)
| model | n_pairs | label_sensitivity_rate | ci95_low | ci95_high |
|---|---|---|---|---|
| llama3.2:3b | 298 | 0.527 | 0.473 | 0.584 |
| mistral:latest | 310 | 0.526 | 0.468 | 0.581 |
| gemma3:4b | 332 | 0.503 | 0.449 | 0.557 |
| qwen2.5:7b | 335 | 0.496 | 0.442 | 0.549 |
| gemma4:latest | 311 | 0.421 | 0.366 | 0.476 |
| phi4-mini:3.8b | 330 | 0.415 | 0.361 | 0.464 |
| qwen3:8b | 336 | 0.405 | 0.348 | 0.455 |
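The intervals above are consistent with a percentile bootstrap over per-pair flip indicators. The sketch below shows one way to produce such an interval; this is an assumption about the method, not a confirmed description of generate_report_assets:

```python
import random

# Percentile-bootstrap 95% CI for a label-sensitivity rate. This is an
# assumed method, not a confirmed description of the report generator.
def bootstrap_ci(flips: list[int], n_boot: int = 2000, seed: int = 42):
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.choices(flips, k=len(flips))) / len(flips)
        for _ in range(n_boot)
    )
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]

# flips[i] = 1 if the model changed its selection when source labels were
# swapped on pair i; 157/298 matches llama3.2:3b's 0.527 rate above.
flips = [1] * 157 + [0] * 141
low, high = bootstrap_ci(flips)
print(f"0.527 [{low:.3f}, {high:.3f}]")
```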
Cross-model agreement (inline)
| condition | n_groups | mean_agreement_rate | mean_normalized_entropy | instability_score |
|---|---|---|---|---|
| headlines_only | 424 | 0.641 | 0.633 | 0.633 |
| headlines_with_manipulated_sources | 424 | 0.624 | 0.654 | 0.654 |
| headlines_with_sources | 424 | 0.654 | 0.617 | 0.617 |
| sources_only | 424 | 0.744 | 0.465 | 0.465 |
Failure taxonomy (inline)
| model | parse_status | error_category | count | ratio_within_model |
|---|---|---|---|---|
| gemma3:4b | success | other | 1335 | 0.9933 |
| gemma3:4b | failed | invalid_or_missing_selected_article_id | 9 | 0.0067 |
| gemma4:latest | success | other | 1247 | 0.9992 |
| gemma4:latest | failed | invalid_or_missing_selected_article_id | 1 | 0.0008 |
| llama3.2:3b | success | other | 1249 | 0.9293 |
| llama3.2:3b | failed | invalid_or_missing_selected_article_id | 95 | 0.0707 |
| mistral:latest | success | other | 1243 | 0.9960 |
| mistral:latest | failed | invalid_or_missing_selected_article_id | 5 | 0.0040 |
| phi4-mini:3.8b | success | other | 1298 | 0.9658 |
| phi4-mini:3.8b | fallback | fallback_after_malformed_json | 32 | 0.0238 |
| phi4-mini:3.8b | failed | invalid_or_missing_selected_article_id | 14 | 0.0104 |
| qwen2.5:7b | success | other | 1343 | 0.9993 |
| qwen2.5:7b | failed | other | 1 | 0.0007 |
| qwen3:8b | success | other | 1343 | 0.9993 |
| qwen3:8b | failed | invalid_or_missing_selected_article_id | 1 | 0.0007 |
Model instability (inline)
| model | n_incidents | instability_score |
|---|---|---|
| qwen2.5:7b | 112 | 0.628 |
| llama3.2:3b | 112 | 0.625 |
| phi4-mini:3.8b | 112 | 0.619 |
| gemma3:4b | 112 | 0.613 |
| qwen3:8b | 112 | 0.610 |
| mistral:latest | 104 | 0.593 |
| gemma4:latest | 104 | 0.558 |
Qualitative error examples (inline)
Preview excerpt from generated qualitative errors:
run_20260416_171758 | qwen3:8b | headlines_only | topic_fake_news | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "7JFQGvJ0LKQOMe0t", "reason": "Offers a proactive approach to combating fake news."}
run_20260416_171758 | gemma3:4b | headlines_with_manipulated_sources | topic_technology | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "0xcOUPRRvmYf5mX1H", "reason": "This article discusses the potential role of big tech in radicalization ..."}
run_20260416_171758 | gemma3:4b | headlines_with_sources | topic_us_house | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "3", "reason": "The article from Vox provides a good overview of the situation ..."}
run_20260416_171758 | gemma3:4b | headlines_only | topic_epa | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "3", "reason": "The article detailing the Executive action to kill the Clean Power Plan ..."}
run_20260416_171758 | gemma3:4b | headlines_only | topic_business | failed | selected_article_id missing or not in candidates
response: {"selected_article_id": "1", "reason": "This article discusses a major leadership change at PepsiCo ..."}
9) FastAPI Analytics (Optional)
Run API locally:
uv run uvicorn app.api.engine_analytics:app --host 0.0.0.0 --port 8000 --reload
Useful endpoints:
- GET /metrics/inter-model
- GET /metrics/summary
- GET /metrics/conditions-by-model
- GET /metrics/compare-runs?run_a=...&run_b=...
- POST /ingest/run
- POST /ingest/runs
Docs: FastAPI serves interactive OpenAPI docs at /docs (and /redoc) by default once the server is running.
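A quick smoke test of the summary endpoint (assuming the default host/port from the uvicorn command above):

```python
import json
from urllib.request import urlopen

# Smoke test against a locally running instance; assumes the default
# host/port from the uvicorn command above.
with urlopen("http://localhost:8000/metrics/summary") as resp:
    summary = json.load(resp)
print(json.dumps(summary, indent=2)[:500])
```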
10) CI (GitHub Actions)
This repository includes a reliability-focused CI pipeline under .github/workflows.
CI workflow
File: .github/workflows/ci.yml
Runs on every push to main and on pull requests. It enforces reliability by:
- Installing dependencies with uv in a clean environment.
- Running the test suite.
- Regenerating analytics artifacts from outputs.
- Validating summary.json schema and metric ranges.
- Validating generated report artifacts (summary table, qualitative errors, limitations).
- Uploading report assets as a workflow artifact.
11) Streamlit Dashboard Publishing
If you publish dashboard.py via Streamlit Community Cloud:
- Keep dashboard.py as the app entrypoint.
- Point Streamlit Cloud to this repository.
- Use requirements.txt for dependency installation.
Live app:
12) Public FastAPI Deployment
The analytics API can be published separately so others can access your metrics endpoints.
Option A: Render (recommended quick path)
This repo includes render.yaml for one-click web service deployment.
Live API base URL:
Steps:
- Connect this repository in Render.
- Select Blueprint deploy (it will read render.yaml).
- After deploy, use:
- Optional safety defaults already set in render.yaml:
  - ENABLE_ANALYTICS_WRITE_ENDPOINTS=0
  - API_ALLOW_ORIGINS=*
Option B: Any container/PaaS
Run the same API command with platform port binding:
uvicorn app.api.engine_analytics:app --host 0.0.0.0 --port $PORT
Useful public endpoints:
- GET /metrics/summary
- GET /metrics/inter-model
- GET /metrics/conditions-by-model
13) Reproducibility Freeze
Final reporting now supports reproducibility metadata using:
- Frozen manifest: configs/models.final.yaml
- Frozen seed: 42
Generate enriched report assets (with confidence intervals, qualitative error samples, and limitations):
uv run python -m app.cli.generate_report_assets \
--outputs-dir outputs \
--assets-dir docs/figures \
--frozen-manifest configs/models.final.yaml \
--frozen-seed 42