
Japanese IDWR infectious disease database and analytics toolkit built on Polars.


jp-idwr-db


jp-idwr-db publishes Japan’s infectious disease surveillance data (NIID/JIHS IDWR) as a versioned, language-agnostic data product: Parquet tables plus a machine-readable manifest.json (and an optional DuckDB file with views).

The Python package adds a convenient API and local caching on top of those release assets. Internally, data wrangling is Polars-first for speed and consistent transforms.

The goal is to skip the usual work of chasing week-by-week files across changing archives and formats, so you can get straight to building time series and doing epidemiology instead of spending hours on data munging.

The package provides an easier interface to the data, but you can also query the Parquet files directly with any tool that supports them (DuckDB, Arrow, Spark, etc.) using the manifest.json for file locations and schema. Direct-access examples are included below.

Python Install

pip install jp-idwr-db

Quick Start

To fetch the full unified dataset with a single call:

import jp_idwr_db as jp
import polars as pl

df = (
    jp.load("unified", version="latest")
    .select(["date", "prefecture", "category", "disease", "count", "source"])
)
print(df)
shape: (5_370_477, 6)
┌────────────┬────────────┬──────────┬─────────────────────────────┬───────┬────────────────────┐
│ date       ┆ prefecture ┆ category ┆ disease                     ┆ count ┆ source             │
│ ---        ┆ ---        ┆ ---      ┆ ---                         ┆ ---   ┆ ---                │
│ date       ┆ str        ┆ str      ┆ str                         ┆ f64   ┆ str                │
╞════════════╪════════════╪══════════╪═════════════════════════════╪═══════╪════════════════════╡
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ AIDS                        ┆ 0.0   ┆ Confirmed cases    │
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ Acute poliomyelitis         ┆ 0.0   ┆ Confirmed cases    │
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ Acute viral hepatitis       ┆ 4.0   ┆ Confirmed cases    │
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ Amebiasis                   ┆ 0.0   ┆ Confirmed cases    │
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ Anthrax                     ┆ 0.0   ┆ Confirmed cases    │
│ …          ┆ …          ┆ …        ┆ …                           ┆ …     ┆ …                  │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ Viral hepatitis(excluding   ┆ 0.0   ┆ All-case reporting │
│            ┆            ┆          ┆ hepa…                       ┆       ┆                    │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ West Nile fever             ┆ 0.0   ┆ All-case reporting │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ Western equine encephalitis ┆ 0.0   ┆ All-case reporting │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ Yellow fever                ┆ 0.0   ┆ All-case reporting │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ Zika virus infection        ┆ 0.0   ┆ All-case reporting │
└────────────┴────────────┴──────────┴─────────────────────────────┴───────┴────────────────────┘

You can also filter at the source with jp.get_data(...):

# Fetch only tuberculosis data for 2024 in Tokyo, Osaka, and Hokkaido
tb = (
    jp.get_data(
        disease="Tuberculosis",
        year=2024,
        prefecture=["Tokyo", "Osaka", "Hokkaido"],
        version="latest")
    .select(["date", "prefecture", "disease", "count", "source"])
)
print(tb)
shape: (156, 5)
┌────────────┬────────────┬──────────────┬───────┬────────────────────┐
│ date       ┆ prefecture ┆ disease      ┆ count ┆ source             │
│ ---        ┆ ---        ┆ ---          ┆ ---   ┆ ---                │
│ date       ┆ str        ┆ str          ┆ f64   ┆ str                │
╞════════════╪════════════╪══════════════╪═══════╪════════════════════╡
│ 2024-01-01 ┆ Hokkaido   ┆ Tuberculosis ┆ 2.0   ┆ All-case reporting │
│ 2024-01-01 ┆ Osaka      ┆ Tuberculosis ┆ 3.0   ┆ All-case reporting │
│ 2024-01-01 ┆ Tokyo      ┆ Tuberculosis ┆ 15.0  ┆ All-case reporting │
│ 2024-01-08 ┆ Hokkaido   ┆ Tuberculosis ┆ 4.0   ┆ All-case reporting │
│ 2024-01-08 ┆ Osaka      ┆ Tuberculosis ┆ 17.0  ┆ All-case reporting │
│ …          ┆ …          ┆ …            ┆ …     ┆ …                  │
│ 2024-12-16 ┆ Osaka      ┆ Tuberculosis ┆ 17.0  ┆ All-case reporting │
│ 2024-12-16 ┆ Tokyo      ┆ Tuberculosis ┆ 41.0  ┆ All-case reporting │
│ 2024-12-23 ┆ Hokkaido   ┆ Tuberculosis ┆ 5.0   ┆ All-case reporting │
│ 2024-12-23 ┆ Osaka      ┆ Tuberculosis ┆ 16.0  ┆ All-case reporting │
│ 2024-12-23 ┆ Tokyo      ┆ Tuberculosis ┆ 53.0  ┆ All-case reporting │
└────────────┴────────────┴──────────────┴───────┴────────────────────┘

# Sentinel-only diseases from recent years in Tokyo prefecture
sentinel_df = (
    jp.get_data(
        source="sentinel",
        prefecture="Tokyo",
        year=(2024, 2026),
        version="latest")
    .select(["date", "prefecture", "disease", "count", "per_sentinel"])
)
print(sentinel_df)
shape: (2_052, 5)
┌────────────┬────────────┬─────────────────────────────────┬─────────┬──────────────┐
│ date       ┆ prefecture ┆ disease                         ┆ count   ┆ per_sentinel │
│ ---        ┆ ---        ┆ ---                             ┆ ---     ┆ ---          │
│ date       ┆ str        ┆ str                             ┆ f64     ┆ f64          │
╞════════════╪════════════╪═════════════════════════════════╪═════════╪══════════════╡
│ 2024-01-07 ┆ Tokyo      ┆ Acute hemorrhagic conjunctivit… ┆ null    ┆ null         │
│ 2024-01-07 ┆ Tokyo      ┆ Aseptic meningitis              ┆ null    ┆ null         │
│ 2024-01-07 ┆ Tokyo      ┆ Bacterial meningitis            ┆ null    ┆ null         │
│ 2024-01-07 ┆ Tokyo      ┆ COVID-19                        ┆ 1365.0  ┆ 3.38         │
│ 2024-01-07 ┆ Tokyo      ┆ Chickenpox                      ┆ 31.0    ┆ 0.12         │
│ …          ┆ …          ┆ …                               ┆ …       ┆ …            │
│ 2026-01-25 ┆ Tokyo      ┆ Influenza(excld. avian influen… ┆ 13082.0 ┆ 34.07        │
│ 2026-01-25 ┆ Tokyo      ┆ Mumps                           ┆ 30.0    ┆ 0.12         │
│ 2026-01-25 ┆ Tokyo      ┆ Mycoplasma pneumonia            ┆ 32.0    ┆ 1.28         │
│ 2026-01-25 ┆ Tokyo      ┆ Pharyngoconjunctival fever      ┆ 115.0   ┆ 0.47         │
│ 2026-01-25 ┆ Tokyo      ┆ Respiratory syncytial virus in… ┆ 242.0   ┆ 1.0          │
└────────────┴────────────┴─────────────────────────────────┴─────────┴──────────────┘

Data Download Model

  • Package wheels do not ship the large parquet tables.
  • On the first call to jp.load(..., version="latest") (or jp.get_data(..., version="latest")), the package downloads the parquet assets listed in the manifest.json of the latest published release.
  • By default, the package uses the data release whose version matches the installed wheel. Pass version="latest" when you want the freshest published snapshot.
  • Cache path defaults to:
    • macOS: ~/Library/Caches/jp_idwr_db/data/<version>/
    • Linux: ~/.cache/jp_idwr_db/data/<version>/
    • Windows: %LOCALAPPDATA%\jp_idwr_db\Cache\data\<version>\
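
To check where cached files should land on a given machine, the documented defaults can be mirrored in a few lines. This is only a sketch: expected_cache_dir is a hypothetical helper, not part of the jp_idwr_db API. It reproduces the paths listed above and does not account for any override such as JPINFECT_CACHE_DIR.

```python
import os
import sys
from pathlib import Path


def expected_cache_dir(version, platform=sys.platform, home=None, env=None):
    """Return the documented default cache directory for one data version.

    Hypothetical helper: it mirrors the default paths listed in this README,
    not the package's actual resolution logic.
    """
    home = Path.home() if home is None else Path(home)
    env = os.environ if env is None else env
    if platform == "darwin":
        root = home / "Library" / "Caches" / "jp_idwr_db"
    elif platform.startswith("win"):
        # Documented as %LOCALAPPDATA%\jp_idwr_db\Cache
        root = Path(env["LOCALAPPDATA"]) / "jp_idwr_db" / "Cache"
    else:
        root = home / ".cache" / "jp_idwr_db"
    return root / "data" / version


# "vYYYY.M.D" is a placeholder release tag, as used elsewhere in this README.
print(expected_cache_dir("vYYYY.M.D"))
```

Listing that directory after a download is a quick way to confirm which tables were fetched.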

Prefetch explicitly:

python -m jp_idwr_db data download
python -m jp_idwr_db data download --version latest --force

Environment overrides:

  • JPINFECT_DATA_VERSION: choose a specific release tag or latest (example: latest)
  • JPINFECT_DATA_BASE_URL: override asset host base URL
  • JPINFECT_CACHE_DIR: override local cache root

Language-independent data access

Release data assets are published as:

  • manifest.json
  • one or more .parquet tables (including unified.parquet)
  • optional jp_idwr_db.duckdb (views over the parquet files)

Manifest schema reference: docs/manifest.schema.json.

Fetch the manifest:

curl -L "https://github.com/AlFontal/jp-idwr-db/releases/latest/download/manifest.json"

Query with DuckDB CLI (when jp_idwr_db.duckdb and parquet files are in the same directory):

duckdb jp_idwr_db.duckdb -c "SELECT year, week, COUNT(*) AS rows FROM unified GROUP BY 1,2 ORDER BY 1 DESC, 2 DESC LIMIT 5;"

Download assets for any language

BASE="https://github.com/AlFontal/jp-idwr-db/releases/latest/download"

mkdir -p jp-idwr-assets
cd jp-idwr-assets
curl -L -O "${BASE}/manifest.json"
curl -L -O "${BASE}/unified.parquet"
curl -L -O "${BASE}/jp_idwr_db.duckdb"
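
The same URLs can be built programmatically, for example to pin a specific release instead of following latest. The sketch below is an illustration only: asset_url is a hypothetical helper, and the pinned form assumes the releases/download/<tag> layout that appears in the --base-url example in the Development section.

```python
from typing import Optional

REPO_URL = "https://github.com/AlFontal/jp-idwr-db"


def asset_url(filename: str, tag: Optional[str] = None) -> str:
    """Build a release-asset URL (hypothetical helper, not a package function).

    With tag=None the URL follows the latest release; otherwise it points at
    the given release tag.
    """
    if tag is None:
        return f"{REPO_URL}/releases/latest/download/{filename}"
    return f"{REPO_URL}/releases/download/{tag}/{filename}"


# "vYYYY.M.D" is a placeholder tag; substitute a real release tag.
for name in ("manifest.json", "unified.parquet", "jp_idwr_db.duckdb"):
    print(asset_url(name))
    print(asset_url(name, tag="vYYYY.M.D"))
```

Pinning a tag keeps an analysis reproducible, since the latest URL changes whenever a new release is published.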

R example (DuckDB, local)

This example opens the local jp_idwr_db.duckdb artifact (downloaded with the parquet files) and queries the unified view. Run it from the directory where jp_idwr_db.duckdb and the parquet files are located:

con <- DBI::dbConnect(duckdb::duckdb(), "jp_idwr_db.duckdb", read_only = TRUE)

tb <- DBI::dbGetQuery(
  con,
  "SELECT date, prefecture, disease, count, source
   FROM unified
   WHERE year = 2024 AND disease = 'Tuberculosis'
   ORDER BY date, prefecture
   LIMIT 20"
)

print(tb)
DBI::dbDisconnect(con, shutdown = TRUE)
        date prefecture      disease count             source
1 2024-01-01      Aichi Tuberculosis     5 All-case reporting
2 2024-01-01      Akita Tuberculosis     1 All-case reporting
3 2024-01-01     Aomori Tuberculosis     0 All-case reporting
4 2024-01-01      Chiba Tuberculosis     7 All-case reporting
5 2024-01-01      Ehime Tuberculosis     1 All-case reporting
6 2024-01-01      Fukui Tuberculosis     1 All-case reporting
...

R example (Arrow, remote)

You can also read the parquet files straight from the GitHub Release URL, without a separate download step:

library(magrittr)

url <- "https://github.com/AlFontal/jp-idwr-db/releases/latest/download/unified.parquet"

tb <- arrow::read_parquet(url) %>%
  dplyr::filter(year == 2024, disease == "Tuberculosis") %>%
  dplyr::select(date, prefecture, disease, count, source) %>%
  dplyr::arrange(date, prefecture)

print(as.data.frame(tb))
        date prefecture      disease count             source
1 2024-01-01      Aichi Tuberculosis     5 All-case reporting
2 2024-01-01      Akita Tuberculosis     1 All-case reporting
3 2024-01-01     Aomori Tuberculosis     0 All-case reporting
4 2024-01-01      Chiba Tuberculosis     7 All-case reporting
5 2024-01-01      Ehime Tuberculosis     1 All-case reporting
6 2024-01-01      Fukui Tuberculosis     1 All-case reporting
...

Main API

Top-level API exported by jp_idwr_db:

  • load(name)
  • get_data(...)
  • list_diseases(source="all")
  • list_prefectures()
  • get_latest_week()
  • prefecture_map()
  • attach_prefecture_id(df, prefecture_col="prefecture", id_col="prefecture_id")
  • merge(...), pivot(...)
  • configure(...), get_config()

Datasets

Use jp.load(...) with:

  • "sex": historical sex-disaggregated surveillance
  • "place": historical place-category surveillance
  • "bullet": modern all-case weekly reports (rapid zensu)
  • "sentinel": sentinel reports (teitenrui; 2012+ in release data assets)
  • "unified": deduplicated combined dataset (sex-total + modern bullet/sentinel, recommended)

Note: teitenrui CSVs report year-to-date cumulative counts. jp-idwr-db converts these to weekly incidence (count_t - count_{t-1} within year/prefecture/disease; first week kept as-is).
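
The differencing rule in the note can be sketched in plain Python. This is an illustration of the stated formula, not the package's implementation (which is Polars-based). The function name and the "cumulative" field are hypothetical, and the sketch does not handle missing weeks or downward revisions, which would yield negative differences.

```python
from collections import defaultdict


def cumulative_to_weekly(rows):
    """Convert year-to-date cumulative counts to weekly incidence.

    rows: iterable of dicts with keys year, week, prefecture, disease and
    cumulative (hypothetical field names). Within each
    year/prefecture/disease group the first week is kept as-is and later
    weeks become count_t - count_{t-1}.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["year"], row["prefecture"], row["disease"])].append(row)

    out = []
    for group in groups.values():
        previous = None
        for row in sorted(group, key=lambda r: r["week"]):
            cumulative = row["cumulative"]
            weekly = cumulative if previous is None else cumulative - previous
            out.append({**row, "count": weekly})
            previous = cumulative
    return out


# Made-up numbers for illustration only.
rows = [
    {"year": 2024, "week": 1, "prefecture": "Tokyo", "disease": "Influenza", "cumulative": 3},
    {"year": 2024, "week": 2, "prefecture": "Tokyo", "disease": "Influenza", "cumulative": 5},
    {"year": 2024, "week": 3, "prefecture": "Tokyo", "disease": "Influenza", "cumulative": 9},
    {"year": 2025, "week": 1, "prefecture": "Tokyo", "disease": "Influenza", "cumulative": 2},
]
weekly = cumulative_to_weekly(rows)
print([r["count"] for r in weekly])  # [3, 2, 4, 2]
```

The counter restarts in 2025 because the year is part of the grouping key, so the first week of each year is never differenced against the previous December.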

Detailed schema and coverage are documented in DATASETS.md.

Raw Download and Parsing

Raw file workflows are available in jp_idwr_db.io:

  • jp_idwr_db.io.download(...)
  • jp_idwr_db.io.download_recent(...)
  • jp_idwr_db.io.read(...)

These are useful for refreshing local raw weekly files or debugging parser behavior.

Data Wrangling Examples

See EXAMPLES.md for data wrangling recipes (grouping, trends, regional slices, source-aware filtering).

Disease-by-disease temporal coverage is documented in DISEASES.md.

Data Source

NIID/JIHS infectious disease surveillance publications:

  • Historical annual archive files (Syu_01_1, Syu_02_1)
  • Rapid weekly CSV reports (zensuXX.csv, teitenruiXX.csv)

Development

uv sync --all-extras --dev
uv run ruff check .
uv run mypy src
uv run pytest

# Build release data assets (manifest + duckdb + parquet metadata)
uv run --with duckdb --with jsonschema jp-idwr-db-build-assets \
  --data-dir data/parquet \
  --release-tag vYYYY.M.D \
  --base-url https://github.com/AlFontal/jp-idwr-db/releases/download/vYYYY.M.D \
  --schema-path docs/manifest.schema.json

Security and Integrity

  • Release assets include a manifest.json with SHA256 checksums and file sizes.
  • ensure_data() verifies each downloaded parquet checksum and size before marking cache complete.
  • For PyPI publishing, prefer Trusted Publishing (OIDC) over long-lived API tokens.
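
If you download assets by hand, the same size and checksum check can be reproduced with the standard library. The sketch below is not the package's ensure_data() implementation, and verify_asset is a hypothetical name. It takes the expected values as arguments, because the manifest's field names are not shown here; see docs/manifest.schema.json for those.

```python
import hashlib
from pathlib import Path


def verify_asset(path, expected_sha256, expected_size):
    """Return True if the file matches the expected size and SHA256 digest.

    Hypothetical helper: copy the expected values from manifest.json.
    """
    path = Path(path)
    # Cheap check first: a size mismatch avoids hashing a large file.
    if path.stat().st_size != expected_size:
        return False
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in 1 MiB chunks so large parquet files are not loaded whole.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```

A call such as verify_asset("unified.parquet", sha256_from_manifest, size_from_manifest) would then gate any further processing, where both values are read from the downloaded manifest.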

License

GPL-3.0-or-later. See LICENSE.

