jpcorpreg
jpcorpreg is a Python library that downloads the corporate registry published on the National Tax Agency's Corporate Number Publication Site and returns it as a pandas DataFrame.
Installation
jpcorpreg is available on PyPI and can be installed with pip:
$ python -m pip install jpcorpreg
GitHub Install
Installing the latest version from GitHub:
$ git clone https://github.com/new-village/jpcorpreg
$ cd jpcorpreg
$ pip install -e .
Usage
This section demonstrates how to use this library to load and process data from the National Tax Agency's Corporate Number Publication Site.
Starting with version 2.0.0, jpcorpreg provides an object-oriented client (CorporateRegistryClient) designed for loading large datasets, with native support for Parquet partitioning.
Initializing the Client
First, import and initialize the client:
from jpcorpreg import CorporateRegistryClient
client = CorporateRegistryClient()
Direct Data Loading
To download the data for a specific prefecture as a pandas DataFrame, use the fetch method. Pass the prefecture name as an argument and the client streams the corresponding data from the National Tax Agency site:
>>> df = client.fetch("Shimane")
To download the data for all prefectures in Japan, leave the argument empty or pass "All":
>>> df = client.fetch()
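The result is a regular pandas DataFrame, so the usual pandas API applies. For example, to inspect it and save a local copy (the file name below is only an illustration):
>>> len(df)        # number of corporate records
>>> df.head()      # preview the first rows
>>> df.to_csv("corporate_registry.csv", index=False)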
Differential Data Loading
If you want to download only the daily differential updates, use the fetch_diff method. Pass a date in YYYYMMDD format to download the diff published on that date. If no date is provided, the latest available diff is returned.
>>> df = client.fetch_diff("20260220")
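For example, to keep a local copy of a specific day's diff as CSV (a minimal sketch; the file name is illustrative):
>>> diff = client.fetch_diff("20260220")
>>> diff.to_csv("diff_20260220.csv", index=False)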
Parquet Output and Partitioning
If you prefer to write the downloaded data to disk for a data lake, pass format="parquet". You can also supply the partition_cols argument so that the dataset is written as partitioned directories. In this mode the method returns the path of the output base directory.
Partitioning notes (see the examples below):
- For fetch() (the full dataset), use something like partition_cols=["prefecture_name"]. Avoid partitioning a full load by "update_date", which fragments queries.
- For fetch_diff() (daily diff data), use partition_cols=["update_date"] so that daily updates append seamlessly into your data lake structure.
>>> # Example: Output differential data partitioned by update_date
>>> out_dir = client.fetch_diff(format="parquet", partition_cols=["update_date"])
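The full dataset can be written the same way. A sketch, assuming fetch() accepts the same format and partition_cols keywords as fetch_diff(), partitioned by prefecture as recommended above:
>>> # Example: Output the full dataset partitioned by prefecture_name
>>> out_dir = client.fetch("All", format="parquet", partition_cols=["prefecture_name"])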
You can then read the generated Parquet dataset efficiently with pandas or PyArrow:
>>> import pandas as pd
>>> df = pd.read_parquet(out_dir)
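If you prefer PyArrow directly, here is a minimal sketch assuming the partitions are written as Hive-style key=value directories (the PyArrow default when partition_cols is used):
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(out_dir, format="parquet", partitioning="hive")
>>> df = dataset.to_table().to_pandas()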
Download files
Source Distribution: jpcorpreg-2.0.1.tar.gz
Built Distribution: jpcorpreg-2.0.1-py3-none-any.whl
File details
Details for the file jpcorpreg-2.0.1.tar.gz.
File metadata
- Download URL: jpcorpreg-2.0.1.tar.gz
- Upload date:
- Size: 13.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 634b8a27e792bdfc24b407c6ea941ec5e300686d97ed3c0ec92af79c13dbbb75 |
| MD5 | f044218c85e68fa5d6c00401640f29ef |
| BLAKE2b-256 | a5be9b8cc4184612f354ec6f9e3d5ee0c7a11fdc9c783f07cb168bf6320e7be8 |
File details
Details for the file jpcorpreg-2.0.1-py3-none-any.whl.
File metadata
- Download URL: jpcorpreg-2.0.1-py3-none-any.whl
- Upload date:
- Size: 13.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5a59c226d3e0ca15519ad74be635c41d9a49e08b234af49024c63983ca0a35f9 |
| MD5 | f288065d26e063f924041a16524add80 |
| BLAKE2b-256 | 03aee6b466bbd4b5027d2f401bcfc784b069098c7d57cacc14d36edfffb08a2e |