Japanese IDWR infectious disease database and analytics toolkit built on Polars.

jp-idwr-db

jp-idwr-db publishes Japan’s infectious disease surveillance data (NIID/JIHS IDWR) as a versioned, language-agnostic data product: Parquet tables plus a machine-readable manifest.json (and an optional DuckDB file with views).

The Python package adds a convenient API and local caching on top of those release assets. Internally, data wrangling is Polars-first for speed and consistent transforms.

The goal is to spare you from chasing week-by-week files across changing archives and formats, so you can go straight to building time series and doing epidemiology.

The package provides an easier interface to the data, but you can also query the Parquet files directly with any tool that supports them (DuckDB, Arrow, Spark, etc.) using the manifest.json for file locations and schema. Direct-access examples are included below.

Python Install

pip install jp-idwr-db

Quick Start

To fetch the full unified dataset with a single call:

import jp_idwr_db as jp
import polars as pl

df = (
    jp.load("unified", version="latest")
    .select(["date", "prefecture", "category", "disease", "count", "source"])
)
print(df)
shape: (5_370_477, 6)
┌────────────┬────────────┬──────────┬─────────────────────────────┬───────┬────────────────────┐
│ date       ┆ prefecture ┆ category ┆ disease                     ┆ count ┆ source             │
│ ---        ┆ ---        ┆ ---      ┆ ---                         ┆ ---   ┆ ---                │
│ date       ┆ str        ┆ str      ┆ str                         ┆ f64   ┆ str                │
╞════════════╪════════════╪══════════╪═════════════════════════════╪═══════╪════════════════════╡
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ AIDS                        ┆ 0.0   ┆ Confirmed cases    │
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ Acute poliomyelitis         ┆ 0.0   ┆ Confirmed cases    │
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ Acute viral hepatitis       ┆ 4.0   ┆ Confirmed cases    │
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ Amebiasis                   ┆ 0.0   ┆ Confirmed cases    │
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ Anthrax                     ┆ 0.0   ┆ Confirmed cases    │
│ …          ┆ …          ┆ …        ┆ …                           ┆ …     ┆ …                  │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ Viral hepatitis(excluding   ┆ 0.0   ┆ All-case reporting │
│            ┆            ┆          ┆ hepa…                       ┆       ┆                    │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ West Nile fever             ┆ 0.0   ┆ All-case reporting │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ Western equine encephalitis ┆ 0.0   ┆ All-case reporting │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ Yellow fever                ┆ 0.0   ┆ All-case reporting │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ Zika virus infection        ┆ 0.0   ┆ All-case reporting │
└────────────┴────────────┴──────────┴─────────────────────────────┴───────┴────────────────────┘

You can also filter at the source with jp.get_data(...):

# Fetch only tuberculosis data for 2024 in Tokyo, Osaka, and Hokkaido
tb = (
    jp.get_data(
        disease="Tuberculosis",
        year=2024,
        prefecture=["Tokyo", "Osaka", "Hokkaido"],
        version="latest",
    )
    .select(["date", "prefecture", "disease", "count", "source"])
)
print(tb)
shape: (156, 5)
┌────────────┬────────────┬──────────────┬───────┬────────────────────┐
│ date       ┆ prefecture ┆ disease      ┆ count ┆ source             │
│ ---        ┆ ---        ┆ ---          ┆ ---   ┆ ---                │
│ date       ┆ str        ┆ str          ┆ f64   ┆ str                │
╞════════════╪════════════╪══════════════╪═══════╪════════════════════╡
│ 2024-01-01 ┆ Hokkaido   ┆ Tuberculosis ┆ 2.0   ┆ All-case reporting │
│ 2024-01-01 ┆ Osaka      ┆ Tuberculosis ┆ 3.0   ┆ All-case reporting │
│ 2024-01-01 ┆ Tokyo      ┆ Tuberculosis ┆ 15.0  ┆ All-case reporting │
│ 2024-01-08 ┆ Hokkaido   ┆ Tuberculosis ┆ 4.0   ┆ All-case reporting │
│ 2024-01-08 ┆ Osaka      ┆ Tuberculosis ┆ 17.0  ┆ All-case reporting │
│ …          ┆ …          ┆ …            ┆ …     ┆ …                  │
│ 2024-12-16 ┆ Osaka      ┆ Tuberculosis ┆ 17.0  ┆ All-case reporting │
│ 2024-12-16 ┆ Tokyo      ┆ Tuberculosis ┆ 41.0  ┆ All-case reporting │
│ 2024-12-23 ┆ Hokkaido   ┆ Tuberculosis ┆ 5.0   ┆ All-case reporting │
│ 2024-12-23 ┆ Osaka      ┆ Tuberculosis ┆ 16.0  ┆ All-case reporting │
│ 2024-12-23 ┆ Tokyo      ┆ Tuberculosis ┆ 53.0  ┆ All-case reporting │
└────────────┴────────────┴──────────────┴───────┴────────────────────┘

# Sentinel-only diseases from recent years in Tokyo prefecture
sentinel_df = (
    jp.get_data(
        source="sentinel",
        prefecture="Tokyo",
        year=(2024, 2026),
        version="latest",
    )
    .select(["date", "prefecture", "disease", "count", "per_sentinel"])
)
print(sentinel_df)
shape: (2_052, 5)
┌────────────┬────────────┬─────────────────────────────────┬─────────┬──────────────┐
│ date       ┆ prefecture ┆ disease                         ┆ count   ┆ per_sentinel │
│ ---        ┆ ---        ┆ ---                             ┆ ---     ┆ ---          │
│ date       ┆ str        ┆ str                             ┆ f64     ┆ f64          │
╞════════════╪════════════╪═════════════════════════════════╪═════════╪══════════════╡
│ 2024-01-07 ┆ Tokyo      ┆ Acute hemorrhagic conjunctivit… ┆ null    ┆ null         │
│ 2024-01-07 ┆ Tokyo      ┆ Aseptic meningitis              ┆ null    ┆ null         │
│ 2024-01-07 ┆ Tokyo      ┆ Bacterial meningitis            ┆ null    ┆ null         │
│ 2024-01-07 ┆ Tokyo      ┆ COVID-19                        ┆ 1365.0  ┆ 3.38         │
│ 2024-01-07 ┆ Tokyo      ┆ Chickenpox                      ┆ 31.0    ┆ 0.12         │
│ …          ┆ …          ┆ …                               ┆ …       ┆ …            │
│ 2026-01-25 ┆ Tokyo      ┆ Influenza(excld. avian influen… ┆ 13082.0 ┆ 34.07        │
│ 2026-01-25 ┆ Tokyo      ┆ Mumps                           ┆ 30.0    ┆ 0.12         │
│ 2026-01-25 ┆ Tokyo      ┆ Mycoplasma pneumonia            ┆ 32.0    ┆ 1.28         │
│ 2026-01-25 ┆ Tokyo      ┆ Pharyngoconjunctival fever      ┆ 115.0   ┆ 0.47         │
│ 2026-01-25 ┆ Tokyo      ┆ Respiratory syncytial virus in… ┆ 242.0   ┆ 1.0          │
└────────────┴────────────┴─────────────────────────────────┴─────────┴──────────────┘

Data Download Model

  • Package wheels do not ship the large parquet tables.
  • On first call to jp.load(..., version="latest") (or jp.get_data(..., version="latest")), the package downloads parquet assets listed in the latest published release manifest.json.
  • By default, the package uses the packaged data version that matches the installed wheel. Use version="latest" when you want the freshest published snapshot.
  • Cache path defaults to:
    • macOS: ~/Library/Caches/jp_idwr_db/data/<version>/
    • Linux: ~/.cache/jp_idwr_db/data/<version>/
    • Windows: %LOCALAPPDATA%\jp_idwr_db\Cache\data\<version>\
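
The default locations listed above can be sketched as a small pure function. This is only an illustration of the documented paths: default_cache_dir is a hypothetical helper and not the package's own code, and the home and local_app_data parameters are assumptions added so the logic can be checked on any operating system.

```python
from pathlib import PurePosixPath, PureWindowsPath


def default_cache_dir(
    platform: str,
    version: str,
    home: str = "~",
    local_app_data: str = "%LOCALAPPDATA%",
) -> str:
    """Return the documented default cache directory for one data version.

    Hypothetical helper: it mirrors the paths listed in this README and is
    not taken from the jp_idwr_db source code.
    """
    if platform == "darwin":  # macOS
        base = PurePosixPath(home) / "Library" / "Caches" / "jp_idwr_db" / "data"
        return str(base / version)
    if platform == "win32":  # Windows
        base = PureWindowsPath(local_app_data) / "jp_idwr_db" / "Cache" / "data"
        return str(base / version)
    # Assumption: every other platform is treated like Linux.
    base = PurePosixPath(home) / ".cache" / "jp_idwr_db" / "data"
    return str(base / version)
```

In practice you would pass sys.platform and str(pathlib.Path.home()) to get a concrete path for the current machine.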

Prefetch explicitly:

python -m jp_idwr_db data download
python -m jp_idwr_db data download --version latest --force

Environment overrides:

  • JPINFECT_DATA_VERSION: select a specific release tag, or latest for the newest published release
  • JPINFECT_DATA_BASE_URL: override asset host base URL
  • JPINFECT_CACHE_DIR: override local cache root
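
As a rough sketch of how these three variables could be read in your own scripts, the snippet below collects them into a small settings object. DataSettings and settings_from_env are hypothetical names, and treating blank values as unset is an assumption; the package's actual resolution logic may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DataSettings:
    """Overrides read from the environment; None means 'not overridden'."""

    version: Optional[str]
    base_url: Optional[str]
    cache_dir: Optional[str]


def settings_from_env(environ: Mapping[str, str] = os.environ) -> DataSettings:
    """Read the documented JPINFECT_* variables from a mapping.

    Hypothetical helper: blank or whitespace-only values are treated as
    unset, which is an assumption rather than documented behaviour.
    """

    def read(name: str) -> Optional[str]:
        value = environ.get(name, "").strip()
        return value or None

    return DataSettings(
        version=read("JPINFECT_DATA_VERSION"),
        base_url=read("JPINFECT_DATA_BASE_URL"),
        cache_dir=read("JPINFECT_CACHE_DIR"),
    )
```

Passing a plain dict instead of os.environ makes the function easy to exercise in tests.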

Language-independent data access

Release data assets are published as:

  • manifest.json
  • one or more .parquet tables (including unified.parquet)
  • optional jp_idwr_db.duckdb (views over the parquet files)

Manifest schema reference: docs/manifest.schema.json.

Fetch the manifest:

curl -L "https://github.com/AlFontal/jp-idwr-db/releases/latest/download/manifest.json"

Query with DuckDB CLI (when jp_idwr_db.duckdb and parquet files are in the same directory):

duckdb jp_idwr_db.duckdb -c "SELECT year, week, COUNT(*) AS rows FROM unified GROUP BY 1,2 ORDER BY 1 DESC, 2 DESC LIMIT 5;"

Download assets for any language

BASE="https://github.com/AlFontal/jp-idwr-db/releases/latest/download"

mkdir -p jp-idwr-assets
cd jp-idwr-assets
curl -L -O "${BASE}/manifest.json"
curl -L -O "${BASE}/unified.parquet"
curl -L -O "${BASE}/jp_idwr_db.duckdb"
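
For readers who prefer Python over curl, the same download can be sketched with the standard library alone. The helper names (release_base_url, asset_urls, download_assets) are hypothetical. The tag-specific URL pattern is taken from the --base-url example in the Development section, so treat it as an assumption about how tagged releases are laid out.

```python
import urllib.request
from pathlib import Path
from typing import Dict, Iterable

REPO_RELEASES = "https://github.com/AlFontal/jp-idwr-db/releases"
DEFAULT_ASSETS = ("manifest.json", "unified.parquet", "jp_idwr_db.duckdb")


def release_base_url(tag: str = "latest") -> str:
    """Build the download base URL for 'latest' or for a specific release tag."""
    if tag == "latest":
        return f"{REPO_RELEASES}/latest/download"
    return f"{REPO_RELEASES}/download/{tag}"


def asset_urls(tag: str = "latest", names: Iterable[str] = DEFAULT_ASSETS) -> Dict[str, str]:
    """Map each asset file name to its full download URL."""
    base = release_base_url(tag)
    return {name: f"{base}/{name}" for name in names}


def download_assets(dest_dir: str, tag: str = "latest") -> None:
    """Download the default assets into dest_dir (performs network requests)."""
    target = Path(dest_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name, url in asset_urls(tag).items():
        # urlretrieve follows the redirect GitHub issues for release assets.
        urllib.request.urlretrieve(url, str(target / name))
```

Calling download_assets("jp-idwr-assets") would mirror the shell commands above. The optional jp_idwr_db.duckdb file may be absent from a given release, in which case that request fails.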

R example (DuckDB, local)

This example opens the local jp_idwr_db.duckdb artifact (downloaded with the parquet files) and queries the unified view. Run it from the directory where jp_idwr_db.duckdb and the parquet files are located:

con <- DBI::dbConnect(duckdb::duckdb(), "jp_idwr_db.duckdb", read_only = TRUE)

tb <- DBI::dbGetQuery(
  con,
  "SELECT date, prefecture, disease, count, source
   FROM unified
   WHERE year = 2024 AND disease = 'Tuberculosis'
   ORDER BY date, prefecture
   LIMIT 20"
)

print(tb)
DBI::dbDisconnect(con, shutdown = TRUE)
        date prefecture      disease count             source
1 2024-01-01      Aichi Tuberculosis     5 All-case reporting
2 2024-01-01      Akita Tuberculosis     1 All-case reporting
3 2024-01-01     Aomori Tuberculosis     0 All-case reporting
4 2024-01-01      Chiba Tuberculosis     7 All-case reporting
5 2024-01-01      Ehime Tuberculosis     1 All-case reporting
6 2024-01-01      Fukui Tuberculosis     1 All-case reporting
...

R example (Arrow, remote)

You can also read unified.parquet straight from the GitHub Release URL, with no separate download step. Note that arrow::read_parquet() loads the whole table into memory before the dplyr filter runs:

library(magrittr)

url <- "https://github.com/AlFontal/jp-idwr-db/releases/latest/download/unified.parquet"

tb <- arrow::read_parquet(url) %>%
  dplyr::filter(year == 2024, disease == "Tuberculosis") %>%
  dplyr::select(date, prefecture, disease, count, source) %>%
  dplyr::arrange(date, prefecture)

print(as.data.frame(tb))
        date prefecture      disease count             source
1 2024-01-01      Aichi Tuberculosis     5 All-case reporting
2 2024-01-01      Akita Tuberculosis     1 All-case reporting
3 2024-01-01     Aomori Tuberculosis     0 All-case reporting
4 2024-01-01      Chiba Tuberculosis     7 All-case reporting
5 2024-01-01      Ehime Tuberculosis     1 All-case reporting
6 2024-01-01      Fukui Tuberculosis     1 All-case reporting
...

Main API

Top-level API exported by jp_idwr_db:

  • load(name)
  • get_data(...)
  • list_diseases(source="all")
  • list_prefectures()
  • get_latest_week()
  • prefecture_map()
  • attach_prefecture_id(df, prefecture_col="prefecture", id_col="prefecture_id")
  • merge(...), pivot(...)
  • configure(...), get_config()

Datasets

Use jp.load(...) with:

  • "sex": historical sex-disaggregated surveillance
  • "place": historical place-category surveillance
  • "bullet": modern all-case weekly reports (rapid zensu)
  • "sentinel": sentinel reports (teitenrui; 2012+ in release data assets)
  • "unified": deduplicated combined dataset (sex-total + modern bullet/sentinel, recommended)

Note: teitenrui CSVs report year-to-date cumulative counts. jp-idwr-db converts these to weekly incidence (count_t - count_{t-1} within year/prefecture/disease; first week kept as-is).
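
The differencing rule in the note can be illustrated in plain Python. This is a minimal sketch and not the package's Polars implementation: cumulative_to_weekly is a hypothetical helper, the row keys (year, week, prefecture, disease, count) are assumed names, and missing weeks or null counts are not handled.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def cumulative_to_weekly(rows: Iterable[Dict]) -> List[Dict]:
    """Convert year-to-date cumulative counts into weekly incidence.

    Within each (year, prefecture, disease) group the rows are sorted by
    week; the first week keeps its value and every later week becomes
    count_t - count_{t-1}.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["year"], row["prefecture"], row["disease"])].append(row)

    weekly_rows = []
    for series in groups.values():
        previous = None
        for row in sorted(series, key=lambda r: r["week"]):
            weekly = row["count"] if previous is None else row["count"] - previous
            weekly_rows.append({**row, "count": weekly})
            previous = row["count"]  # keep the cumulative value for the next step
    return weekly_rows
```

Because the grouping key includes the year, the running total restarts at the first week of each year, as the note describes.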

Detailed schema and coverage are documented in DATASETS.md.

Raw Download and Parsing

Raw file workflows are available in jp_idwr_db.io:

  • jp_idwr_db.io.download(...)
  • jp_idwr_db.io.download_recent(...)
  • jp_idwr_db.io.read(...)

These are useful for refreshing local raw weekly files or debugging parser behavior.

Data Wrangling Examples

See EXAMPLES.md for data wrangling recipes (grouping, trends, regional slices, source-aware filtering).

Disease-by-disease temporal coverage is documented in DISEASES.md.

Data Source

NIID/JIHS infectious disease surveillance publications:

  • Historical annual archive files (Syu_01_1, Syu_02_1)
  • Rapid weekly CSV reports (zensuXX.csv, teitenruiXX.csv)

Development

uv sync --all-extras --dev
uv run ruff check .
uv run mypy src
uv run pytest

# Build release data assets (manifest + duckdb + parquet metadata)
uv run --with duckdb --with jsonschema jp-idwr-db-build-assets \
  --data-dir data/parquet \
  --release-tag vYYYY.M.D \
  --base-url https://github.com/AlFontal/jp-idwr-db/releases/download/vYYYY.M.D \
  --schema-path docs/manifest.schema.json

Security and Integrity

  • Release assets include a manifest.json with SHA256 checksums and file sizes.
  • ensure_data() verifies each downloaded parquet checksum and size before marking cache complete.
  • For PyPI publishing, prefer Trusted Publishing (OIDC) over long-lived API tokens.
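
If you download assets by hand, a similar size and checksum check is easy to reproduce with the standard library. The sketch below is not the package's ensure_data() code. verify_asset is a hypothetical helper, and you would take the expected SHA256 and size from whichever manifest.json fields hold them (the field names are not shown in this README).

```python
import hashlib
from pathlib import Path


def verify_asset(path: str, expected_sha256: str, expected_size: int) -> bool:
    """Return True only if the file's size and SHA256 digest both match."""
    file_path = Path(path)
    # Cheap check first: a size mismatch avoids hashing a large file.
    if file_path.stat().st_size != expected_size:
        return False
    digest = hashlib.sha256()
    with file_path.open("rb") as handle:
        # Hash in 1 MiB chunks so large parquet files are not read at once.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```

A False result means the local file should be deleted and downloaded again before use.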

License

GPL-3.0-or-later. See LICENSE.

