kr-hearings-data

Speech-level dataset from Korean National Assembly committee proceedings (16th-22nd Assembly, 2000-2025).

Data

  • 9.9M speeches classified into 33 speaker roles
  • 7.4M legislator-witness dyads (consecutive Q&A pairs)
  • 20+ committees harmonized across 94+ raw names and 7 legislative terms
  • 6 hearing types: standing committee (상임위원회), national audit (국정감사), confirmation hearing (인사청문특별위원회), parliamentary investigation (국정조사), budget committee (예산결산특별위원회), plenary session (국회본회의)
  • 16,830 meetings covering all major National Assembly proceedings (16th-22nd Assembly)

Quick start

# Install from PyPI (run in a shell)
pip install kr-hearings-data

# Then, in Python
import kr_hearings_data as kh

# Load speeches
speeches = kh.load_speeches()

# Load dyads
dyads = kh.load_dyads()

# Filter by term and hearing type
audit_20 = kh.load_dyads(term=20, hearing_type="국정감사")

CLI

# Download data
kr-hearings download

# Summary statistics
kr-hearings info

# Export filtered subset
kr-hearings export --term 20 --hearing-type 국정감사 --format csv -o output.csv

Files

Data files are available under GitHub Releases.

| File | Rows | Columns | Description |
|---|---|---|---|
| all_speeches_16_22_v9.parquet | 9,906,444 | 28 | All speeches + minister panel metadata |
| dyads_16_22_v9.parquet | 7,429,413 | 25 | Dyads with legislator + minister metadata |
| all_speeches_16_22_v8.parquet | 9,906,444 | 24 | Previous version (no minister metadata) |

Columns

speeches

| Column | Type | Description |
|---|---|---|
| meeting_id | str | Meeting identifier |
| term | int | Assembly term (16-22) |
| committee | str | Original committee name |
| committee_key | str | Harmonized committee key (20 categories) |
| hearing_type | str | 6 types: 상임위원회, 국정감사, 인사청문특별위원회, 예산결산특별위원회, 국회본회의, 국정조사 |
| session | str | Session number (e.g., 제212회) |
| sub_session | str | Sub-session number (e.g., 제1차) |
| date | str | Meeting date (YYYY-MM-DD) |
| agenda | str | Agenda item |
| speaker | str | Raw speaker field (title + name) |
| member_id | str | Legislator ID from source data (null for non-legislators) |
| member_uid | str | Disambiguated legislator ID (resolves 4 homonymous member_ids) |
| speech_order | str | Speech sequence number within meeting |
| role | str | Classified speaker role (33 categories) |
| person_name | str | Extracted person name |
| person_title | str | Acting/deputy title if applicable (e.g., 대리, 직무대행) |
| affiliation_raw | str | Raw affiliation or institutional title |
| speech_text | str | Full speech text |
| name_clean | str | Legislator name (from National Assembly DB, legislators only) |
| party | str | Party affiliation (legislators only) |
| ruling_status | str | Ruling/opposition status (legislators only) |
| seniority | float | Number of terms served (legislators only) |
| gender | str | Gender (legislators only) |
| naas_cd | str | National Assembly unique code (legislators only) |
| ministry_normalized | str | Standardized ministry/agency name (v9, govt officials only) |
| dual_office | bool | Minister simultaneously held NA seat (v9, ministers only) |
| admin | str | Presidential administration name (v9, ministers only) |
| admin_ideology | str | Progressive or Conservative (v9, ministers only) |
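
The session, sub_session, and date columns are stored as strings, so sorting or comparing them usually needs a typed copy. The following is a minimal sketch, not part of the package: the helper names `parse_ordinal` and `parse_row` are hypothetical, the example row is made up, and it assumes the labels follow the `제212회` / `제1차` pattern shown in the table.

```python
import re
from datetime import date

# Matches labels like "제212회" (session) or "제1차" (sub-session).
_ORDINAL = re.compile(r"제\s*(\d+)\s*[회차]")


def parse_ordinal(label):
    """Return the integer inside a label such as '제212회', or None if absent."""
    if not label:
        return None
    match = _ORDINAL.search(label)
    return int(match.group(1)) if match else None


def parse_row(row):
    """Return a copy of a speech row with typed session, sub-session and date."""
    out = dict(row)
    out["session_no"] = parse_ordinal(row.get("session"))
    out["sub_session_no"] = parse_ordinal(row.get("sub_session"))
    # The date column is documented as YYYY-MM-DD, which fromisoformat accepts.
    out["date_parsed"] = date.fromisoformat(row["date"]) if row.get("date") else None
    return out


# Made-up row for illustration only.
example = {"session": "제212회", "sub_session": "제1차", "date": "2000-06-05"}
parsed = parse_row(example)
print(parsed["session_no"], parsed["sub_session_no"], parsed["date_parsed"])
```

Rows whose labels do not match the pattern come back with `None` rather than raising, which makes it easier to inspect irregular values separately.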

dyads

| Column | Type | Description |
|---|---|---|
| meeting_id | str | Meeting identifier |
| term | int | Assembly term |
| committee | str | Original committee name |
| committee_key | str | Harmonized committee key |
| hearing_type | str | 6 types: 상임위원회, 국정감사, 인사청문특별위원회, 예산결산특별위원회, 국회본회의, 국정조사 |
| date | str | Meeting date (YYYY-MM-DD) |
| agenda | str | Agenda item |
| leg_name | str | Legislator name |
| leg_speaker_raw | str | Legislator raw speaker field |
| leg_member_uid | str | Legislator disambiguated ID |
| leg_party | str | Legislator party (v9, 99.9% coverage) |
| leg_ruling_status | str | Ruling/opposition/independent (v9, 97.1%) |
| leg_seniority | float | Terms served (v9) |
| leg_gender | str | Gender (v9) |
| witness_name | str | Non-legislator name |
| witness_speaker_raw | str | Non-legislator raw speaker field |
| witness_role | str | Non-legislator classified role |
| witness_affiliation | str | Non-legislator raw affiliation |
| witness_ministry_normalized | str | Standardized ministry name (v9) |
| witness_dual_office | bool | Minister held NA seat simultaneously (v9) |
| witness_admin | str | Presidential administration (v9) |
| witness_admin_ideology | str | Progressive or Conservative (v9) |
| direction | str | question (legislator first) or answer (witness first) |
| leg_speech | str | Legislator speech text |
| witness_speech | str | Non-legislator speech text |
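
As an illustration of how the dyad columns combine, the sketch below counts question-direction dyads by legislator ruling status and witness role. It works on plain dictionaries keyed by the column names above, so it does not depend on what `load_dyads` returns. The function name `crosstab` is hypothetical, and the `"ruling"` / `"opposition"` strings are placeholders: the values actually stored in leg_ruling_status should be checked in the codebook.

```python
from collections import Counter


def crosstab(dyads, row_key="leg_ruling_status", col_key="witness_role"):
    """Count question-direction dyads for each (row_key, col_key) pair."""
    counts = Counter()
    for dyad in dyads:
        # Keep only exchanges in which the legislator speaks first.
        if dyad.get("direction") != "question":
            continue
        counts[(dyad.get(row_key), dyad.get(col_key))] += 1
    return counts


# Made-up rows; the ruling-status strings are placeholders.
sample = [
    {"direction": "question", "leg_ruling_status": "opposition", "witness_role": "minister"},
    {"direction": "question", "leg_ruling_status": "opposition", "witness_role": "minister"},
    {"direction": "answer", "leg_ruling_status": "opposition", "witness_role": "minister"},
    {"direction": "question", "leg_ruling_status": "ruling", "witness_role": "agency_head"},
]

table = crosstab(sample)
for (status, role), n in sorted(table.items()):
    print(status, role, n)
```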

Speaker roles

33 categories organized in 3 tiers:

Legislator (form one side of dyads): legislator, chair

Non-legislator (form the other side):

  • Executive: minister, vice_minister, prime_minister, agency_head, senior_bureaucrat, mid_bureaucrat, minister_acting
  • Hearing witnesses: witness, testifier, expert_witness, nominee, minister_nominee
  • Organizational: public_corp_head, org_head, financial_regulator, research_head, broadcasting, cooperative_head
  • Other: local_gov_head, military, police, audit_official, election_official, constitutional_court, assembly_official, independent_official, private_sector, cultural_institution_head, other_official

Excluded from dyads: committee_staff, other, unknown
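
The role tiers above suggest how a dyad can be read off a meeting transcript: two consecutive speeches, one from each side, with direction set by who speaks first. The sketch below is one possible reading under stated assumptions and is not the project's pipeline. It assumes speeches are already sorted by speech_order within a single meeting, that every adjacent cross-side pair counts (so pairs may overlap), and that a speech with an excluded role breaks adjacency. The names `side` and `build_dyads` are hypothetical.

```python
LEGISLATOR_ROLES = {"legislator", "chair"}
EXCLUDED_ROLES = {"committee_staff", "other", "unknown"}


def side(role):
    """Map a role to 'leg', 'witness', or None for roles excluded from dyads."""
    if role in LEGISLATOR_ROLES:
        return "leg"
    if role in EXCLUDED_ROLES:
        return None
    return "witness"


def build_dyads(speeches):
    """Pair adjacent speeches from opposite sides (assumed pairing rule)."""
    dyads = []
    for first, second in zip(speeches, speeches[1:]):
        a, b = side(first["role"]), side(second["role"])
        # Skip pairs that touch an excluded role or stay on one side.
        if a is None or b is None or a == b:
            continue
        leg, wit = (first, second) if a == "leg" else (second, first)
        dyads.append({
            "direction": "question" if a == "leg" else "answer",
            "leg_speech": leg["speech_text"],
            "witness_speech": wit["speech_text"],
            "witness_role": wit["role"],
        })
    return dyads


# Made-up transcript fragment for illustration only.
sample = [
    {"role": "legislator", "speech_text": "Q1"},
    {"role": "minister", "speech_text": "A1"},
    {"role": "legislator", "speech_text": "Q2"},
    {"role": "committee_staff", "speech_text": "X"},
    {"role": "minister", "speech_text": "A2"},
]

dyads = build_dyads(sample)
print([d["direction"] for d in dyads])
```

Whether excluded speeches should break adjacency or be skipped over is a design choice; docs/PIPELINE.md is the place to confirm the rule actually used.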

Documentation

  • docs/CODEBOOK.md - Full codebook with column definitions, role taxonomy, committee mapping, and value distributions
  • docs/PIPELINE.md - Data pipeline documentation (XLSX parsing through v5 integrity fixes)

Validation

52 automated checks across speeches, dyads, speaker classification, committee harmonization, and cross-dataset consistency. See validation/ for the test suite and report_v8.json for the latest results.

Known limitations

  • Short speeches (15.9% under 10 chars): Procedural utterances such as "예" ("yes") and "동의합니다" ("I agree"). These are valid speech acts in parliamentary proceedings and are kept in the data.
  • Self-pairing dyads (637): The same person name appears on both sides, but the speakers are confirmed to be different people (homonyms), e.g., legislator 김영환 and minister 김영환.
  • Empty witness names (919 dyads): The speaker field contains only a title without a personal name, e.g., "여성가족부 장관" (Minister of Gender Equality and Family) or "산업통상자원부 제1차관" (First Vice Minister of Trade, Industry and Energy).
  • Remaining other role (6,472 speeches, 0.07%): Speakers whose titles do not match any classification pattern. These are excluded from dyad formation.
  • member_id on non-legislators (29,182 speeches): Former legislators appearing as ministers or other officials retain their member_id from legislative service.
  • Homonymous member_ids: 4 member_ids (7407, 6182, 806, 878) each represent two different legislators with the same name across different Assembly terms. Use member_uid for disambiguation.
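
Depending on the analysis, some of these cases may need to be flagged or filtered before use. The sketch below tags dyad rows, represented as plain dictionaries keyed by the documented column names, with the limitations that apply to them. The function name `flag_dyad`, the flag labels, and the sample rows are hypothetical; the 10-character threshold mirrors the figure quoted above.

```python
def flag_dyad(dyad, min_chars=10):
    """Return a list of limitation flags that apply to one dyad row."""
    flags = []
    witness_name = (dyad.get("witness_name") or "").strip()
    if not witness_name:
        # Speaker field held only a title, so no personal name was extracted.
        flags.append("empty_witness_name")
    elif dyad.get("leg_name") == witness_name:
        # Same name on both sides; documented as homonyms, not one person.
        flags.append("same_name_both_sides")
    leg_len = len(dyad.get("leg_speech") or "")
    wit_len = len(dyad.get("witness_speech") or "")
    if leg_len < min_chars or wit_len < min_chars:
        flags.append("short_speech")
    return flags


# Made-up rows for illustration only.
rows = [
    {"leg_name": "김영환", "witness_name": "김영환",
     "leg_speech": "장관께 질의하겠습니다. 자료를 제출해 주시기 바랍니다.",
     "witness_speech": "예"},
    {"leg_name": "홍길동", "witness_name": "",
     "leg_speech": "예산 집행 현황을 설명해 주시기 바랍니다.",
     "witness_speech": "예산 집행 현황을 보고드리겠습니다."},
]

for row in rows:
    print(flag_dyad(row))
```

Flagging rather than dropping keeps short procedural utterances available for analyses where they matter, such as turn-taking counts.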

Version history

| Version | Speeches | Dyads | Changes |
|---|---|---|---|
| v9 | 9,906,444 | 7,429,413 | Minister panel enrichment (dual_office, admin, admin_ideology). Legislator metadata in dyads (party, ruling_status). ruling_status cleanup. Full dyad rebuild across 6 hearing types |
| v8 | 9,906,444 | 7,894,147 | +191 국정조사, 832 예산결산특별위원회, and 1,058 국회본회의 meetings (1.17M speeches). Hybrid XML viewer + PDF parsing. Dyads rebuilt for all 6 hearing types |
| v7 | 8,740,779 | - | +228 인사청문특별위원회 meetings from PDF parsing (111K speeches). Hanja name conversion, mp_metadata enrichment (99.9% legislator party coverage) |
| v6 | 8,629,431 | 7,225,737 | +42 인사청문특별위원회 meetings from HTML scraping (32K speeches). New hearing_type value: 인사청문특별위원회 |
| v5 | 8,597,178 | 7,225,737 | member_id null fix, person_title cleanup, member_uid disambiguation, minister 직무대리 (acting) reclassification, additional 'other' reclassification, non-legislator person_name cleanup |
| v4 | 8,597,178 | 7,221,024 | person_title extraction, person_name cleanup, 'other' reclassification, text normalization, date normalization |
| v3 | 8,597,178 | 7,185,949 | Speaker classification fix (소위원장, subcommittee chair), deduplication, dyad rebuild |

Source

Raw data: National Assembly proceeding XLSX datasets from 의안정보시스템 (the Bill Information System), PDF transcripts, and structured HTML from 국회회의록시스템 (the National Assembly Minutes System). Source sites: record.assembly.go.kr, likms.assembly.go.kr.

Author

Kyusik Yang, New York University

License

CC BY 4.0
