Sousveillance infrastructure for state mandatory-disclosure portals — parliamentary questions, committee reports, budget data, and state assembly records.
commoner-probe
Sousveillance infrastructure for the state's mandatory disclosure systems.
A commoner probes the state's own paperwork — parliamentary questions, committee
reports, state assembly records — and turns it into evidence. commoner-probe
automates the acquisition so you can focus on the analysis.
pip install "commoner-probe[all]"
import commoner_probe as probe # alias used throughout CommonerLLP toolchain
Why this exists
Parliamentary questions, committee reports, state assembly records, CSR exports, and public mining-district disclosures are mandatory or official public disclosures. The data exists. The problem is that it lives across undocumented portals with inconsistent APIs, no bulk export, and PDFs that require extraction to read programmatically.
commoner-probe handles the entire acquisition pipeline:
public disclosure portals → manifest.jsonl → files/PDFs   → extracted records → your analysis
                            (metadata)       (raw source)   (structured text)
Classification, topic modelling, and dossier generation are intentionally out of scope. This library does one thing: acquire public disclosure data into provenance-rich, schema-validated JSONL and source files.
Install
pip install "commoner-probe[all]" # requests + PDF extraction
pip install "commoner-probe[all,dev]" # + schema validation and tests
Five-minute quickstart
Step 1 — Write a topic profile
{
  "name": "climate",
  "description": "Climate change and environmental policy",
  "search_groups": {
    "climate": ["climate change", "global warming", "net zero"],
    "air_quality": ["air pollution", "AQI", "particulate matter"]
  },
  "lok_sabha_ministries": ["ENVIRONMENT", "POWER", "PETROLEUM"],
  "rajya_sabha_ministry_likes": ["ENVIRONMENT", "POWER", "PETROLEUM"]
}
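If you generate profiles from a script, a small local check can catch typos before a probe run. The following is a minimal sketch, not part of commoner-probe: the helper names `check_topic_profile` and `write_topic_profile` are hypothetical, and the idea that `name`, `description`, and `search_groups` are required is an assumption drawn only from the example profile above.

```python
import json
from pathlib import Path

# Keys taken from the example profile; treating them as required is an assumption.
REQUIRED_KEYS = ("name", "description", "search_groups")
LIST_KEYS = ("lok_sabha_ministries", "rajya_sabha_ministry_likes")


def _is_str_list(value) -> bool:
    """True if value is a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def check_topic_profile(profile: dict) -> list[str]:
    """Return a list of problems found in a topic profile (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in profile]
    groups = profile.get("search_groups", {})
    if not isinstance(groups, dict):
        problems.append("search_groups must be an object")
        groups = {}
    for group, keywords in groups.items():
        if not _is_str_list(keywords):
            problems.append(f"search_groups[{group!r}] must be a list of strings")
    for key in LIST_KEYS:
        if not _is_str_list(profile.get(key, [])):
            problems.append(f"{key} must be a list of strings")
    return problems


def write_topic_profile(profile: dict, path: str) -> None:
    """Write the profile as UTF-8 JSON once the local sanity check passes."""
    problems = check_topic_profile(profile)
    if problems:
        raise ValueError("; ".join(problems))
    text = json.dumps(profile, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")
```

Calling `write_topic_profile(profile, "topic.json")` would produce the file that Step 2 passes to `--topic`.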
Step 2 — Probe parliamentary questions
commoner-probe sansad \
--topic topic.json \
--out data/climate \
--house both \
--from-date 2019-01-01
Writes data/climate/manifest.jsonl — one record per question from both houses.
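Because the manifest is plain JSONL, it can also be inspected without importing the library. Here is a rough sketch using only the standard library; `iter_jsonl` and `count_by` are hypothetical helpers, and the assumption that the raw JSON uses a top-level `house` field (mirroring the `r.house` attribute in the Python API) should be checked against docs/SCHEMAS.md.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc


def count_by(path: str, field: str) -> Counter:
    """Count records by one top-level field; a missing field counts as None."""
    return Counter(record.get(field) for record in iter_jsonl(path))
```

For example, `count_by("data/climate/manifest.jsonl", "house")` would give a quick per-house record count after a probe run, if the field name assumption holds.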
Step 3 — Probe committee reports
commoner-probe committees \
--topic topic.json \
--out data/climate-committees \
--house both
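Writes one record per standing committee report, covering the Department-related Standing Committees (DRSCs) of both the Lok Sabha (LS) and the Rajya Sabha (RS).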
Step 4 — Extract text from PDFs
commoner-probe extract-answers --out data/climate
commoner-probe extract-answers --out data/climate-committees
Parses downloaded PDFs into answers.jsonl: Q/A pairs, committee
recommendations, and government responses.
Step 5 — Load in Python
import commoner_probe as probe

c = probe.Corpus("data/climate")

for r in c.manifest_qa():
    print(r.date, r.house, r.ministry, r.title)

for pair in c.join_qa():
    if pair.answers:
        print(pair.manifest.title)
        print(pair.answers[0].question_text[:200])
What you can study
Parliamentary questions (Lok Sabha + Rajya Sabha)
Each record carries who asked (MP name, party, state), which ministry answered,
question number, type (starred / unstarred), date, session, and the full PDF.
After running extract-answers, each record also carries the extracted question and answer text.
Typical research questions: ministry responsiveness rates, which MPs ask the most questions by topic, how the same policy question evolves across sessions, party-level questioning patterns.
import commoner_probe as probe
from collections import Counter

c = probe.Corpus("data/climate")

ministry_counts = Counter(r.ministry for r in c.manifest_qa())
for ministry, n in ministry_counts.most_common(10):
    print(f"{ministry}: {n}")
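The same counting pattern extends to the longitudinal questions mentioned above. The sketch below tallies questions per year and house; it runs on stand-in records so it does not need a downloaded corpus. It assumes that `str(r.date)` begins with a four-digit year (true for ISO date strings and for `datetime.date` objects) and that house values look like `ls` and `rs`, neither of which is confirmed here.

```python
from collections import Counter
from types import SimpleNamespace


def questions_per_year(records) -> Counter:
    """Count records by (year, house); assumes str(date) starts with YYYY."""
    counts = Counter()
    for r in records:
        year = str(r.date)[:4]
        counts[(year, r.house)] += 1
    return counts


# Stand-in records with only the attributes this sketch reads.
sample = [
    SimpleNamespace(date="2021-03-15", house="ls"),
    SimpleNamespace(date="2021-07-20", house="rs"),
    SimpleNamespace(date="2022-02-04", house="ls"),
    SimpleNamespace(date="2021-12-01", house="ls"),
]

for (year, house), n in sorted(questions_per_year(sample).items()):
    print(year, house, n)
```

With a real corpus, passing `c.manifest_qa()` in place of `sample` should work as long as the records expose `date` and `house` as in the quickstart.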
Standing committee reports (LS + RS DRSCs)
Committee reports come in four shapes:
| report_type | What it is |
|---|---|
| demands_for_grants | Annual budget scrutiny: the committee dissects ministry spending |
| bill | The committee's examination of a pending bill before it passes |
| subject | Own-initiative policy investigation; the deepest substantive record |
| action_taken | The government's formal response to the committee's recommendations |
Action Taken Reports (ATRs) are the government's formal written responses to
committee recommendations. The atr-linkage command connects each ATR back
to the original report, enabling lifecycle analysis:
recommendation → government rejection/acceptance → follow-up.
import commoner_probe as probe

c = probe.Corpus("data/climate-committees")

for chain in c.join_atr_chain():
    print(f"Report: {chain.original and chain.original.title}")
    print(f"  Recommendations: {len(chain.original_observations)}")
    print(f"  Government responses: {len(chain.atr_answers)}")
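One simple lifecycle measure built on these chains is the number of government responses per recommendation. The following is an illustrative sketch only: `response_coverage` is a hypothetical helper, the chains are stand-ins exposing the two attributes used above, and a ratio of counts says nothing about whether a recommendation was accepted or rejected.

```python
from types import SimpleNamespace
from typing import Optional


def response_coverage(chain) -> Optional[float]:
    """Responses per recommendation for one chain; None if there are none."""
    n_recommendations = len(chain.original_observations)
    if n_recommendations == 0:
        return None  # avoid dividing by zero for unlinked or empty reports
    return len(chain.atr_answers) / n_recommendations


# Stand-in chains; real ones would come from c.join_atr_chain().
chains = [
    SimpleNamespace(original_observations=["r1", "r2", "r3", "r4"],
                    atr_answers=["a1", "a2", "a3"]),
    SimpleNamespace(original_observations=[], atr_answers=[]),
]

for chain in chains:
    print(response_coverage(chain))
```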
State assembly records (NeVA portals)
From 2020, sub-national governments have been adopting NIC's NeVA (National
e-Vidhan Application) infrastructure under a centrally sponsored scheme run
by the Ministry of Parliamentary Affairs. Most state assemblies are onboarding,
though coverage varies. The state-assembly command probes any NeVA portal:
commoner-probe state-assembly \
--portal gujarat \
--state GJ \
--out data/gujarat-assembly \
--assemblies 15
MCA CSR company-spend exports
The Ministry of Corporate Affairs CDM CSR data page exposes downloadable CSV exports by financial year. These records compare reporting/spending companies and project-sector amounts. They do not identify CSR consultants or implementing agencies unless MCA publishes that in the source export.
commoner-probe mca-csr \
--out data/mca-csr \
--years 2022-23,2021-22
import commoner_probe as probe

c = probe.Corpus("data/mca-csr")

for r in c.manifest_mca_csr():
    print(r.financial_year, r.status, r.filename)
Mines DMFT / PMKKKY disclosures
mines-dmft acquires raw Ministry of Mines and Odisha DMFT public disclosure
files. Ministry CSVs are current cumulative snapshots timestamped by the
source; treat them as snapshots, not fiscal-year series.
commoner-probe mines-dmft \
--out data/mines-dmft \
--sources mines-gov-in,odisha
Pair the executive disclosure snapshots with Sansad oversight records without flattening the source families:
commoner-probe evidence dmft \
--mines-dmft-dir data/mines-dmft \
--sansad-dir data/sansad/mines-dmft-pmkkky \
--out data/evidence/dmft.json
All commands
commoner-probe sansad — parliamentary questions
commoner-probe sansad \
--topic topic.json \
--out data/climate \
--house both \
--from-date 2019-01-01 \
--to-date 2026-01-01
| Flag | Default | What it does |
|---|---|---|
| --topic | required | Path to topic profile JSON |
| --out | required | Output corpus directory |
| --house | both | ls, rs, or both |
| --from-date | (none) | Earliest question date (YYYY-MM-DD) |
| --to-date | (none) | Latest question date |
| --qtype | both | starred, unstarred, or both |
| --sessions | 1-267 | Rajya Sabha session range |
| --no-download | off | Skip PDF downloads; metadata only |
| --with-entities | off | Resolve asker names to stable entity IDs |
| --max-records N | (none) | Stop after N new records per house (smoke-test) |
| --max-buckets N | (none) | Only run the first N search/ministry combos |
| --reset | off | Wipe existing manifest and start fresh |
commoner-probe committees — standing committee reports
commoner-probe committees \
--topic topic.json \
--out data/committees \
--house both \
--committees finance,education
| Flag | Default | What it does |
|---|---|---|
| --committees | all | Comma-separated committee slugs |
| --lok-sabha-no | 18 | LS number for LS reports |
| --from-date / --to-date | (none) | Date range filter |
| --no-download | off | Skip PDF downloads |
Available LS committees (16 DRSCs):
agriculture, chemicals, coal, communications, consumer_affairs,
defence, energy, external_affairs, finance, housing, labour,
petroleum, railways, rural_development, social_justice, water_resources
Available RS committees (8 DRSCs):
commerce, education, health, home_affairs, industry, personnel,
science, transport
commoner-probe extract-answers — PDF text extraction
commoner-probe extract-answers --out data/climate
commoner-probe extract-answers --out data/climate --refresh
Reads manifest.jsonl and downloaded PDFs; writes answers.jsonl with:
- qa_response: (question_text, answer_text) pairs from Q/A PDFs
- atr_response: (recommendation_no, recommendation_text, response_text) triples from ATR PDFs
- dfg_recommendation: numbered observation paragraphs from DFG/Bill/Subject PDFs
Requires pip install "commoner-probe[pdf]".
commoner-probe atr-linkage — ATR → original report
commoner-probe atr-linkage --out data/committees
Writes atr_linkage.jsonl — each ATR linked back to the report it responds to.
Safe to re-run (idempotent overwrite).
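To illustrate what an idempotent overwrite can look like in practice, here is a generic sketch of the write-to-temp-then-rename pattern. It is not taken from commoner-probe and may differ from how atr-linkage actually writes its file; `write_jsonl_atomic` and the record fields in the demo are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def write_jsonl_atomic(records, path: str) -> None:
    """Rewrite `path` from scratch so readers never see a half-written file."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")
        os.replace(tmp_name, target)  # replaces any existing file in one step
    except BaseException:
        os.unlink(tmp_name)  # clean up the temp file if anything failed
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        out = os.path.join(workdir, "atr_linkage_demo.jsonl")
        links = [{"atr": "report-B", "responds_to": "report-A"}]  # hypothetical fields
        write_jsonl_atomic(links, out)
        write_jsonl_atomic(links, out)  # second run yields the same file
        print(Path(out).read_text(encoding="utf-8"), end="")
```

Running the writer twice with the same input leaves the same file contents, which is the property that makes a re-run safe.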
commoner-probe state-assembly — state legislature records
commoner-probe state-assembly \
--portal gujarat \
--state GJ \
--out data/gujarat \
--assemblies 15
commoner-probe mca-csr — MCA CSR company-spend exports
commoner-probe mca-csr \
--out data/mca-csr \
--years 2022-23
Downloads CSV exports from the MCA CDM CSR data page and writes one
manifest.jsonl record per financial year. Use --dry-run to print manifest
records without opening a network session.
commoner-probe mines-dmft — Ministry of Mines / DMFT files
commoner-probe mines-dmft \
--out data/mines-dmft \
--sources mines-gov-in,odisha
Downloads raw Ministry of Mines static CSV snapshots and Odisha DMFT public
JSON/report surfaces. Use --dry-run to print manifest records without opening
network sessions.
commoner-probe evidence dmft — cross-source evidence bundle
commoner-probe evidence dmft \
--mines-dmft-dir data/mines-dmft \
--sansad-dir data/sansad/mines-dmft-pmkkky \
--out data/evidence/dmft.json
Builds a JSON bundle with separate executive_disclosure and
parliamentary_oversight sections. It does not merge unlike source families
into one table.
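A downstream script can keep that separation when reading the bundle. The sketch below assumes the two section names are top-level keys of the JSON file; it makes no assumption about what is inside each section, and `load_sections` is a hypothetical helper rather than a library function.

```python
import json
from pathlib import Path

# Section names as given above; their position as top-level keys is assumed.
SECTIONS = ("executive_disclosure", "parliamentary_oversight")


def load_sections(path: str) -> dict:
    """Load the bundle and return its two source-family sections, kept apart."""
    bundle = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [name for name in SECTIONS if name not in bundle]
    if missing:
        raise KeyError(f"bundle is missing sections: {missing}")
    return {name: bundle[name] for name in SECTIONS}
```

For example, `load_sections("data/evidence/dmft.json")["parliamentary_oversight"]` would hand back only the Sansad-side material for analysis.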
commoner-probe stats — corpus health
commoner-probe stats --out data/climate
commoner-probe stats --out data/climate --json
commoner-probe validate — schema validation
commoner-probe validate --out data/climate
Validates every JSONL file against its JSON Schema. Exits 1 on errors.
Requires the [dev] extra.
Topic profile
Controls what the probe acquires:
{
  "name": "libraries",
  "description": "Public library infrastructure and policy",
  "search_groups": {
    "public_libraries": ["public library", "rural library"],
    "policy": ["National Mission on Libraries", "RRRLF"]
  },
  "lok_sabha_ministries": ["CULTURE", "EDUCATION"],
  "rajya_sabha_ministry_likes": ["CULTURE", "EDUCATION"]
}
- search_groups: keyword groups for LS full-text search. Each query runs independently; results are union-deduped on key.
- lok_sabha_ministries: exact ministry filter for LS (case-sensitive).
- rajya_sabha_ministry_likes: ministry LIKE filter for RS (prefix match).
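The three behaviours described for the profile fields can be illustrated in a few lines. This is a sketch of the semantics as stated, not the library's code: the function names are hypothetical, the record `key` values are made up, and whether the real RS prefix match is case-sensitive is not stated in this README (the sketch treats it as case-sensitive).

```python
from typing import Iterable


def union_dedupe(batches: Iterable[Iterable[dict]]) -> list[dict]:
    """Merge per-query result batches, keeping the first record seen per key."""
    seen = {}
    for batch in batches:
        for record in batch:
            seen.setdefault(record["key"], record)
    return list(seen.values())


def exact_match(ministry: str, allowed: list[str]) -> bool:
    """Case-sensitive exact match, as described for lok_sabha_ministries."""
    return ministry in allowed


def prefix_match(ministry: str, prefixes: list[str]) -> bool:
    """Prefix match, as described for rajya_sabha_ministry_likes."""
    return any(ministry.startswith(p) for p in prefixes)


# Two keyword queries returning overlapping results (made-up keys).
merged = union_dedupe([
    [{"key": "q1"}, {"key": "q2"}],
    [{"key": "q2"}, {"key": "q3"}],
])
print([r["key"] for r in merged])

name = "ENVIRONMENT, FOREST AND CLIMATE CHANGE"
print(exact_match(name, ["ENVIRONMENT"]), prefix_match(name, ["ENVIRONMENT"]))
```

The last line shows why the two lists can behave differently even with identical entries: a longer ministry name passes a prefix filter but fails an exact one.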
See examples/topics/ for working examples.
Output files
| File | Contents |
|---|---|
| manifest.jsonl | One record per question or committee report |
| _runs.jsonl | Audit log: scope, topic hash, errors, per-bucket counts |
| answers.jsonl | Extracted Q/A and recommendation/response pairs |
| atr_linkage.jsonl | ATR → original report linkages |
| source CSV/JSON/HTML files | Raw source files for source-specific probes such as MCA CSR and DMFT |
| pdfs/ls/ | Downloaded LS PDFs |
| pdfs/rs/ | Downloaded RS PDFs |
| probe.log | Human-readable probe progress log |
For complete field-level documentation see docs/SCHEMAS.md.
Entity resolution (--with-entities)
Pass --with-entities to commoner-probe sansad to resolve asker names to
stable entity_id values. On first run the entity store is populated from
the sansad.in MP roster; subsequent runs reuse the local cache.
Resolved entity IDs join across corpora and sessions — useful for studying the same MP's questioning behaviour over time or across houses.
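A cross-corpus join on entity IDs might look like the sketch below. It assumes each manifest record exposes a single `entity_id` attribute, which this README does not confirm (a question can have several askers, so the real schema may differ); `group_by_entity` and the sample values are hypothetical.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_entity(*record_streams) -> dict:
    """Map entity_id -> list of records across any number of record streams."""
    grouped = defaultdict(list)
    for stream in record_streams:
        for r in stream:
            entity_id = getattr(r, "entity_id", None)
            if entity_id is not None:  # skip records with no resolved asker
                grouped[entity_id].append(r)
    return dict(grouped)


# Stand-in records from two corpora (made-up IDs and titles).
climate = [
    SimpleNamespace(entity_id="mp-001", title="Net zero targets"),
    SimpleNamespace(entity_id=None, title="Unresolved asker"),
]
libraries = [
    SimpleNamespace(entity_id="mp-001", title="Rural library funding"),
    SimpleNamespace(entity_id="mp-002", title="RRRLF grants"),
]

for entity_id, records in group_by_entity(climate, libraries).items():
    print(entity_id, [r.title for r in records])
```

With real corpora, the streams would be the `manifest_qa()` iterators of two `Corpus` objects probed with `--with-entities`.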
Python API
import commoner_probe as probe

c = probe.Corpus("data/climate")

# Typed iterators
for r in c.manifest_qa():  # ManifestQaRecord
    ...
for r in c.manifest_committee_reports():  # ManifestCommitteeReportRecord
    ...
for r in c.answers_qa():  # AnswerQaResponse
    ...
for r in c.answers_atr():  # AnswerAtrResponse
    ...
for r in c.answers_dfg():  # AnswerDfgRecommendation
    ...
for r in c.atr_linkages():  # AtrLinkageRecord
    ...
for r in c.manifest_mca_csr():  # ManifestMcaCsrRecord
    ...
for r in c.manifest_mines_dmft():  # ManifestMinesDmftRecord
    ...
for r in c.runs():  # RunRecord
    ...

# Join helpers
for pair in c.join_qa():  # manifest + extracted answers
    ...
for chain in c.join_atr_chain():  # ATR + original report + observations
    ...

# pandas (pip install commoner-probe[pandas])
df = c.to_dataframe("manifest_committee_reports")
See examples/usage.py for a runnable walkthrough.
See docs/ENDPOINTS.md for source-family endpoint notes.
License
MIT License — see LICENSE.
commoner-probe is sousveillance infrastructure, built for the commons. It is
released under the permissive MIT license so it can serve as a shared
acquisition floor that any downstream project — including the other repos in the
CommonerLLP federation, whatever their own licenses — can build on without
copyleft friction.
Upcoming
Floor debates
sansad.in exposes full debate proceedings via api_ls/debate/text-of-debate
(structured JSON, 17th Lok Sabha onwards). Each record covers a single day:
type of business, member who spoke, and verbatim text. The richest longitudinal
record of what MPs say on the floor.
Bills and legislation
sansad.in/ls/legislation/bills lists every bill since independence with
introduction date, debate dates, and status at each stage. Enables tracking
legislative velocity, committee scrutiny rates, and private member bill outcomes.
MP profiles and career timelines
Structured biographical data for each member: constituency, state, party, terms served, educational background, declared profession. Pairs with the Q/A corpus for studies of how MP background predicts parliamentary participation.