Japanese IDWR infectious disease database and analytics toolkit built on Polars.
jp-idwr-db
jp-idwr-db publishes Japan’s infectious disease surveillance data (NIID/JIHS IDWR) as a
versioned, language-agnostic data product: Parquet tables plus a machine-readable
manifest.json (and an optional DuckDB file with views).
The Python package adds a convenient API and local caching on top of those release assets. Internally, data wrangling is Polars-first for speed and consistent transforms.
The goal is to skip the usual work of chasing week-by-week files across changing archives and formats, so you can get straight to building time series and doing epidemiology instead of spending hours on data munging.
The package provides an easier interface to the data, but you can also query the Parquet files directly with any tool that supports them (DuckDB, Arrow, Spark, etc.) using the manifest.json for file locations and schema. Direct-access examples are included below.
Python Install
pip install jp-idwr-db
Quick Start
To fetch the full unified dataset with a single call:
import jp_idwr_db as jp
import polars as pl
df = (
jp.load("unified", version="latest")
.select(["date", "prefecture", "category", "disease", "count", "source"])
)
print(df)
shape: (5_370_477, 6)
┌────────────┬────────────┬──────────┬─────────────────────────────┬───────┬────────────────────┐
│ date ┆ prefecture ┆ category ┆ disease ┆ count ┆ source │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ str ┆ str ┆ str ┆ f64 ┆ str │
╞════════════╪════════════╪══════════╪═════════════════════════════╪═══════╪════════════════════╡
│ 1999-04-11 ┆ Aichi ┆ total ┆ AIDS ┆ 0.0 ┆ Confirmed cases │
│ 1999-04-11 ┆ Aichi ┆ total ┆ Acute poliomyelitis ┆ 0.0 ┆ Confirmed cases │
│ 1999-04-11 ┆ Aichi ┆ total ┆ Acute viral hepatitis ┆ 4.0 ┆ Confirmed cases │
│ 1999-04-11 ┆ Aichi ┆ total ┆ Amebiasis ┆ 0.0 ┆ Confirmed cases │
│ 1999-04-11 ┆ Aichi ┆ total ┆ Anthrax ┆ 0.0 ┆ Confirmed cases │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 2026-02-09 ┆ Yamanashi ┆ total ┆ Viral hepatitis(excluding ┆ 0.0 ┆ All-case reporting │
│ ┆ ┆ ┆ hepa… ┆ ┆ │
│ 2026-02-09 ┆ Yamanashi ┆ total ┆ West Nile fever ┆ 0.0 ┆ All-case reporting │
│ 2026-02-09 ┆ Yamanashi ┆ total ┆ Western equine encephalitis ┆ 0.0 ┆ All-case reporting │
│ 2026-02-09 ┆ Yamanashi ┆ total ┆ Yellow fever ┆ 0.0 ┆ All-case reporting │
│ 2026-02-09 ┆ Yamanashi ┆ total ┆ Zika virus infection ┆ 0.0 ┆ All-case reporting │
└────────────┴────────────┴──────────┴─────────────────────────────┴───────┴────────────────────┘
You can also filter at the source with jp.get_data(...):
# Fetch only tuberculosis data for 2024 in Tokyo, Osaka, and Hokkaido
tb = (
jp.get_data(
disease="Tuberculosis",
year=2024,
prefecture=["Tokyo", "Osaka", "Hokkaido"],
version="latest")
.select(["date", "prefecture", "disease", "count", "source"])
)
print(tb)
shape: (156, 5)
┌────────────┬────────────┬──────────────┬───────┬────────────────────┐
│ date ┆ prefecture ┆ disease ┆ count ┆ source │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ str ┆ str ┆ f64 ┆ str │
╞════════════╪════════════╪══════════════╪═══════╪════════════════════╡
│ 2024-01-01 ┆ Hokkaido ┆ Tuberculosis ┆ 2.0 ┆ All-case reporting │
│ 2024-01-01 ┆ Osaka ┆ Tuberculosis ┆ 3.0 ┆ All-case reporting │
│ 2024-01-01 ┆ Tokyo ┆ Tuberculosis ┆ 15.0 ┆ All-case reporting │
│ 2024-01-08 ┆ Hokkaido ┆ Tuberculosis ┆ 4.0 ┆ All-case reporting │
│ 2024-01-08 ┆ Osaka ┆ Tuberculosis ┆ 17.0 ┆ All-case reporting │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 2024-12-16 ┆ Osaka ┆ Tuberculosis ┆ 17.0 ┆ All-case reporting │
│ 2024-12-16 ┆ Tokyo ┆ Tuberculosis ┆ 41.0 ┆ All-case reporting │
│ 2024-12-23 ┆ Hokkaido ┆ Tuberculosis ┆ 5.0 ┆ All-case reporting │
│ 2024-12-23 ┆ Osaka ┆ Tuberculosis ┆ 16.0 ┆ All-case reporting │
│ 2024-12-23 ┆ Tokyo ┆ Tuberculosis ┆ 53.0 ┆ All-case reporting │
└────────────┴────────────┴──────────────┴───────┴────────────────────┘
# Sentinel-only diseases from recent years in Tokyo prefecture
sentinel_df = (
jp.get_data(
source="sentinel",
prefecture="Tokyo",
year=(2024, 2026),
version="latest")
.select(["date", "prefecture", "disease", "count", "per_sentinel"])
)
print(sentinel_df)
shape: (2_052, 5)
┌────────────┬────────────┬─────────────────────────────────┬─────────┬──────────────┐
│ date ┆ prefecture ┆ disease ┆ count ┆ per_sentinel │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ str ┆ str ┆ f64 ┆ f64 │
╞════════════╪════════════╪═════════════════════════════════╪═════════╪══════════════╡
│ 2024-01-07 ┆ Tokyo ┆ Acute hemorrhagic conjunctivit… ┆ null ┆ null │
│ 2024-01-07 ┆ Tokyo ┆ Aseptic meningitis ┆ null ┆ null │
│ 2024-01-07 ┆ Tokyo ┆ Bacterial meningitis ┆ null ┆ null │
│ 2024-01-07 ┆ Tokyo ┆ COVID-19 ┆ 1365.0 ┆ 3.38 │
│ 2024-01-07 ┆ Tokyo ┆ Chickenpox ┆ 31.0 ┆ 0.12 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 2026-01-25 ┆ Tokyo ┆ Influenza(excld. avian influen… ┆ 13082.0 ┆ 34.07 │
│ 2026-01-25 ┆ Tokyo ┆ Mumps ┆ 30.0 ┆ 0.12 │
│ 2026-01-25 ┆ Tokyo ┆ Mycoplasma pneumonia ┆ 32.0 ┆ 1.28 │
│ 2026-01-25 ┆ Tokyo ┆ Pharyngoconjunctival fever ┆ 115.0 ┆ 0.47 │
│ 2026-01-25 ┆ Tokyo ┆ Respiratory syncytial virus in… ┆ 242.0 ┆ 1.0 │
└────────────┴────────────┴─────────────────────────────────┴─────────┴──────────────┘
Data Download Model
- Package wheels do not ship the large parquet tables.
- On first call to jp.load(..., version="latest") (or jp.get_data(..., version="latest")), the package downloads the parquet assets listed in the latest published release manifest.json.
- By default, the package uses the packaged data version that matches the installed wheel. Use version="latest" when you want the freshest published snapshot.
- Cache path defaults to:
  - macOS: ~/Library/Caches/jp_idwr_db/data/<version>/
  - Linux: ~/.cache/jp_idwr_db/data/<version>/
  - Windows: %LOCALAPPDATA%\jp_idwr_db\Cache\data\<version>\
Prefetch explicitly:
python -m jp_idwr_db data download
python -m jp_idwr_db data download --version latest --force
Environment overrides:
- JPINFECT_DATA_VERSION: choose a specific release tag or latest (example: latest)
- JPINFECT_DATA_BASE_URL: override asset host base URL
- JPINFECT_CACHE_DIR: override local cache root
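To check where cached data should end up on a given machine, the default cache locations listed above can be reproduced with a small helper. This is only a sketch of the documented defaults: default_cache_dir is a hypothetical name, not part of the jp_idwr_db API, and the package may resolve paths differently (for example when JPINFECT_CACHE_DIR is set).

```python
import os
import sys
from pathlib import Path


def default_cache_dir(version: str, platform: str = sys.platform) -> str:
    """Return the documented default cache directory for a data version.

    Hypothetical helper: it mirrors the defaults listed in this README and
    does not call into jp_idwr_db.
    """
    home = Path.home()
    if platform == "darwin":
        return str(home / "Library" / "Caches" / "jp_idwr_db" / "data" / version)
    if platform.startswith("win"):
        # %LOCALAPPDATA% is normally only defined on Windows; fall back to the
        # literal placeholder so the function can still be called elsewhere.
        base = os.environ.get("LOCALAPPDATA", "%LOCALAPPDATA%")
        return "\\".join([base, "jp_idwr_db", "Cache", "data", version])
    # Treat every other platform as Linux-like.
    return str(home / ".cache" / "jp_idwr_db" / "data" / version)


if __name__ == "__main__":
    # "vYYYY.M.D" is the release tag placeholder used elsewhere in this README.
    print(default_cache_dir("vYYYY.M.D"))
```

Deleting the directory this prints is one way to inspect or clear a stale cache by hand, assuming no override is active.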
Language-independent data access
Release data assets are published as:
- manifest.json
- one or more .parquet tables (including unified.parquet)
- optional jp_idwr_db.duckdb (views over the parquet files)
Manifest schema reference: docs/manifest.schema.json.
Fetch the manifest:
curl -L "https://github.com/AlFontal/jp-idwr-db/releases/latest/download/manifest.json"
Query with DuckDB CLI (when jp_idwr_db.duckdb and parquet files are in the same directory):
duckdb jp_idwr_db.duckdb -c "SELECT year, week, COUNT(*) AS rows FROM unified GROUP BY 1,2 ORDER BY 1 DESC, 2 DESC LIMIT 5;"
Download assets for any language
BASE="https://github.com/AlFontal/jp-idwr-db/releases/latest/download"
mkdir -p jp-idwr-assets
cd jp-idwr-assets
curl -L -O "${BASE}/manifest.json"
curl -L -O "${BASE}/unified.parquet"
curl -L -O "${BASE}/jp_idwr_db.duckdb"
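The commands above always follow the latest release. For reproducible analyses it can help to pin a release tag instead. The sketch below builds both URL forms; asset_url is a hypothetical helper, and the tagged pattern is assumed from the --base-url shown in the Development section of this README.

```python
REPO_RELEASES = "https://github.com/AlFontal/jp-idwr-db/releases"


def asset_url(name: str, tag: str = "latest") -> str:
    """Build the download URL for a release asset.

    Hypothetical helper: "latest" follows the newest release, any other value
    is treated as a release tag (for example "vYYYY.M.D").
    """
    if tag == "latest":
        return f"{REPO_RELEASES}/latest/download/{name}"
    return f"{REPO_RELEASES}/download/{tag}/{name}"


if __name__ == "__main__":
    for asset in ("manifest.json", "unified.parquet", "jp_idwr_db.duckdb"):
        print(asset_url(asset))
```

Passing a concrete tag in place of the default gives URLs that keep pointing at the same snapshot after newer releases are published.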
R example (DuckDB, local)
This example opens the local jp_idwr_db.duckdb artifact (downloaded with the parquet files)
and queries the unified view. Run it from the directory where jp_idwr_db.duckdb
and the parquet files are located:
con <- DBI::dbConnect(duckdb::duckdb(), "jp_idwr_db.duckdb", read_only = TRUE)
tb <- DBI::dbGetQuery(
con,
"SELECT date, prefecture, disease, count, source
FROM unified
WHERE year = 2024 AND disease = 'Tuberculosis'
ORDER BY date, prefecture
LIMIT 20"
)
print(tb)
DBI::dbDisconnect(con, shutdown = TRUE)
date prefecture disease count source
1 2024-01-01 Aichi Tuberculosis 5 All-case reporting
2 2024-01-01 Akita Tuberculosis 1 All-case reporting
3 2024-01-01 Aomori Tuberculosis 0 All-case reporting
4 2024-01-01 Chiba Tuberculosis 7 All-case reporting
5 2024-01-01 Ehime Tuberculosis 1 All-case reporting
6 2024-01-01 Fukui Tuberculosis 1 All-case reporting
...
R example (Arrow, remote)
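You can also read the parquet file straight from the GitHub Release URL without saving it to disk first. Note that arrow::read_parquet() fetches the whole file into memory before the dplyr filters run: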
library(magrittr)
url <- "https://github.com/AlFontal/jp-idwr-db/releases/latest/download/unified.parquet"
tb <- arrow::read_parquet(url) %>%
dplyr::filter(year == 2024, disease == "Tuberculosis") %>%
dplyr::select(date, prefecture, disease, count, source) %>%
dplyr::arrange(date, prefecture)
print(as.data.frame(tb))
date prefecture disease count source
1 2024-01-01 Aichi Tuberculosis 5 All-case reporting
2 2024-01-01 Akita Tuberculosis 1 All-case reporting
3 2024-01-01 Aomori Tuberculosis 0 All-case reporting
4 2024-01-01 Chiba Tuberculosis 7 All-case reporting
5 2024-01-01 Ehime Tuberculosis 1 All-case reporting
6 2024-01-01 Fukui Tuberculosis 1 All-case reporting
...
Main API
Top-level API exported by jp_idwr_db:
- load(name)
- get_data(...)
- list_diseases(source="all")
- list_prefectures()
- get_latest_week()
- prefecture_map()
- attach_prefecture_id(df, prefecture_col="prefecture", id_col="prefecture_id")
- merge(...), pivot(...)
- configure(...), get_config()
Datasets
Use jp.load(...) with:
- "sex": historical sex-disaggregated surveillance
- "place": historical place-category surveillance
- "bullet": modern all-case weekly reports (rapid zensu)
- "sentinel": sentinel reports (teitenrui; 2012+ in release data assets)
- "unified": deduplicated combined dataset (sex-total + modern bullet/sentinel, recommended)
Note: teitenrui CSVs report year-to-date cumulative counts. jp-idwr-db converts these to
weekly incidence (count_t - count_{t-1} within year/prefecture/disease; first week kept as-is).
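The conversion described in the note can be illustrated in plain Python. This is a minimal sketch of the arithmetic only, not the package's Polars implementation: to_weekly_incidence and the sample rows are hypothetical, and the sketch does not handle missing weeks, nulls, or downward revisions in the cumulative series.

```python
from collections import defaultdict


def to_weekly_incidence(rows):
    """Convert year-to-date cumulative counts to weekly counts.

    rows: iterable of dicts with keys year, week, prefecture, disease, count.
    Differences are taken within each (year, prefecture, disease) group and
    the first week of each group is kept as-is.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["year"], row["prefecture"], row["disease"])].append(row)

    result = []
    for group_rows in groups.values():
        previous = None
        for row in sorted(group_rows, key=lambda r: r["week"]):
            weekly = row["count"] if previous is None else row["count"] - previous
            previous = row["count"]
            result.append({**row, "count": weekly})
    return result


if __name__ == "__main__":
    # Made-up cumulative counts for one prefecture and disease.
    cumulative = [
        {"year": 2024, "week": 1, "prefecture": "Tokyo", "disease": "Mumps", "count": 3},
        {"year": 2024, "week": 2, "prefecture": "Tokyo", "disease": "Mumps", "count": 5},
        {"year": 2024, "week": 3, "prefecture": "Tokyo", "disease": "Mumps", "count": 9},
    ]
    print([r["count"] for r in to_weekly_incidence(cumulative)])  # → [3, 2, 4]
```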
Detailed schema and coverage are documented in DATASETS.md.
Raw Download and Parsing
Raw file workflows are available in jp_idwr_db.io:
- jp_idwr_db.io.download(...)
- jp_idwr_db.io.download_recent(...)
- jp_idwr_db.io.read(...)
These are useful for refreshing local raw weekly files or debugging parser behavior.
Data Wrangling Examples
See EXAMPLES.md for data wrangling recipes (grouping, trends, regional slices, source-aware filtering).
Disease-by-disease temporal coverage is documented in DISEASES.md.
Data Source
NIID/JIHS infectious disease surveillance publications:
- Historical annual archive files (Syu_01_1, Syu_02_1)
- Rapid weekly CSV reports (zensuXX.csv, teitenruiXX.csv)
Development
uv sync --all-extras --dev
uv run ruff check .
uv run mypy src
uv run pytest
# Build release data assets (manifest + duckdb + parquet metadata)
uv run --with duckdb --with jsonschema jp-idwr-db-build-assets \
--data-dir data/parquet \
--release-tag vYYYY.M.D \
--base-url https://github.com/AlFontal/jp-idwr-db/releases/download/vYYYY.M.D \
--schema-path docs/manifest.schema.json
Security and Integrity
- Release assets include a manifest.json with SHA256 checksums and file sizes.
- ensure_data() verifies each downloaded parquet checksum and size before marking cache complete.
- For PyPI publishing, prefer Trusted Publishing (OIDC) over long-lived API tokens.
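If you download assets outside the Python package, the same kind of check can be done by hand. The sketch below compares a file against an expected SHA256 digest and size. verify_asset is a hypothetical helper, not the package's ensure_data(); it takes the expected values as arguments, so you would copy them from manifest.json yourself (see docs/manifest.schema.json for the field layout).

```python
import hashlib
import os


def verify_asset(path: str, expected_sha256: str, expected_size: int) -> bool:
    """Return True if the file matches the expected size and SHA256 digest.

    Hypothetical helper for manually downloaded assets.
    """
    # Cheap check first: a size mismatch means the download is incomplete or wrong.
    if os.path.getsize(path) != expected_size:
        return False

    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large parquet files are not loaded at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```

Checking the size before hashing avoids reading a large truncated file only to reject it afterwards.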
License
GPL-3.0-or-later. See LICENSE.