sagikoza

A Python library for automatically crawling and retrieving all public notices under Japan’s Furikome Sagi Relief Act. Supports both full and incremental data extraction, returning results as a list of dictionaries.

A Japanese version of this description is also available.


Features

  • Automatically retrieves public notices under the Furikome Sagi Relief Act
  • Supports fetching by year or for the latest 3 months
  • Incremental (diff) data retrieval
  • Returns data as a list of dictionaries

Supported Environments

  • Python 3.8 or later

Installation

Install from PyPI:

python -m pip install sagikoza

Latest from GitHub:

git clone https://github.com/new-village/sagikoza
cd sagikoza
python -m pip install .

Usage

Fetch notices for a specific year

Pass a four-digit year as a string (e.g., '2025') to retrieve the notices published in that year. Notices are available from 2008 onward.

import sagikoza
accounts = sagikoza.fetch('2025')
print(accounts)
# [{'doc_id': '12345', 'link': '/pubs_basic_frame.php?...', 'id': '...', ...}, ...]

Fetch notices for the last 3 months

Call without arguments to get notices from the latest 3 months.

import sagikoza
accounts = sagikoza.fetch()
print(accounts)
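
One way to keep a local dataset current is to merge each latest-3-months result into the records already stored. The sketch below assumes that the id field shown in the sample output uniquely identifies a record, which is not confirmed here. merge_records is a hypothetical helper, not part of sagikoza, and the sample records are stand-ins.

```python
from typing import Dict, List


def merge_records(existing: List[dict], fresh: List[dict], key: str = "id") -> List[dict]:
    """Merge fresh records into existing ones, keyed on `key` (assumed unique)."""
    merged: Dict[str, dict] = {rec[key]: rec for rec in existing}
    for rec in fresh:
        merged[rec[key]] = rec  # a newer record replaces the stored one
    return list(merged.values())


# Stand-in data; in practice `fresh` would come from sagikoza.fetch().
stored = [{"id": "a", "doc_id": "1"}, {"id": "b", "doc_id": "2"}]
fresh = [{"id": "b", "doc_id": "2-updated"}, {"id": "c", "doc_id": "3"}]
merged = merge_records(stored, fresh)
```

Keying on a dictionary keeps the merge linear in the number of records and preserves the order in which each id was first seen.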

Save data example

Save the retrieved data in Parquet format.

import pandas as pd
import sagikoza
accounts = sagikoza.fetch()
df = pd.DataFrame(accounts)
df.to_parquet('accounts.parquet', index=False)

Function Specification

  • fetch(year: str = "near3") -> list[dict]
    • Specify a year (YYYY) or "near3" for the latest 3 months
    • Raises an exception on failure
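
Callers may want to validate the argument before any network request is made. The following is a minimal sketch: normalize_year is a hypothetical helper (not part of sagikoza), and the 2008 lower bound is taken from the usage notes above rather than from the library's code.

```python
import re

FIRST_YEAR = 2008  # assumption: earliest year with notices, per the usage notes


def normalize_year(value: str = "near3") -> str:
    """Return a validated argument for fetch(): 'near3' or a YYYY string."""
    if value == "near3":
        return value
    if not re.fullmatch(r"\d{4}", value):
        raise ValueError(f"expected 'near3' or a four-digit year, got {value!r}")
    if int(value) < FIRST_YEAR:
        raise ValueError(f"year must be {FIRST_YEAR} or later, got {value!r}")
    return value
```

The result can then be passed on, for example sagikoza.fetch(normalize_year('2025')).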

Internal Workflow

  1. Fetch notice list (POST: sel_pubs.php)
  2. Fetch notice details (POST: pubs_dispatcher.php)
  3. Fetch basic info (GET: pubs_basic_frame.php)
  4. Fetch account details (POST: k_pubstype_00_detail.php, etc.)

Parameters required for each step are extracted from the HTML and used for subsequent page transitions.
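
The extraction step can be illustrated with the hidden doc_id field from the notice list. This sketch uses the standard library's html.parser so that it runs without extra dependencies; it is not the library's own implementation, which presumably relies on BeautifulSoup. HiddenInputCollector and extract_hidden_values are hypothetical names, and the sample HTML is a simplified stand-in for the real page.

```python
from html.parser import HTMLParser
from typing import List


class HiddenInputCollector(HTMLParser):
    """Collect the values of hidden <input> elements with a given name."""

    def __init__(self, name: str) -> None:
        super().__init__()
        self.name = name
        self.values: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (
            tag == "input"
            and attributes.get("type") == "hidden"
            and attributes.get("name") == self.name
        ):
            self.values.append(attributes.get("value") or "")


def extract_hidden_values(html: str, name: str) -> List[str]:
    """Return every value of hidden inputs called `name`, in document order."""
    parser = HiddenInputCollector(name)
    parser.feed(html)
    parser.close()
    return parser.values


# Simplified stand-in for the notice list returned by sel_pubs.php.
sample = '<table class="sel_pubs_list"><input type="hidden" name="doc_id" value="15362"></table>'
doc_ids = extract_hidden_values(sample, "doc_id")

# Payload for the next step (pubs_dispatcher.php), following the page flow reference.
payload = {"head_line": "", "doc_id": doc_ids[0]}
```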

Logging

Uses Python's standard logging module. For detailed logs:

import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(name)s %(message)s')
import sagikoza
sagikoza.fetch()

By default, only WARNING and above are shown. For more detail, set level=logging.DEBUG.
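
To get debug output from this library without making every other package verbose, the level can be raised on a single logger. The sketch assumes the library's loggers are named under "sagikoza" (the usual logging.getLogger(__name__) convention), which is not confirmed here.

```python
import logging

# Keep the root logger at WARNING so other libraries stay quiet.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

# Assumption: the library's loggers live under the "sagikoza" name.
sagikoza_logger = logging.getLogger("sagikoza")
sagikoza_logger.setLevel(logging.DEBUG)
```

Child loggers such as "sagikoza.something" inherit this level, and their records still reach the handler installed by basicConfig.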

Error Handling

  • Network, HTTP, and timeout errors raise a FetchError exception
  • If no records are found, a WARNING log is output
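
Because network failures are often transient, a caller might retry a few times before giving up. The following is a sketch under stated assumptions: fetch_with_retry is a hypothetical helper, and it takes the fetch callable and the exception types as arguments because the import path of FetchError is not stated here. The demonstration uses a stand-in function and ConnectionError instead of the real library.

```python
import time
from typing import Callable, List, Tuple, Type


def fetch_with_retry(
    fetch: Callable[[], List[dict]],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    delay: float = 1.0,
) -> List[dict]:
    """Call fetch() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")


# Stand-in for sagikoza.fetch that fails once, then succeeds.
calls = {"count": 0}


def flaky_fetch() -> List[dict]:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("temporary failure")
    return [{"doc_id": "12345"}]


records = fetch_with_retry(flaky_fetch, (ConnectionError,), attempts=3, delay=0.0)
```

With the real library, the call would look like fetch_with_retry(lambda: sagikoza.fetch('2025'), (FetchError,)), once FetchError has been imported from wherever the package exposes it.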

Notes

  • This library retrieves data from public sources. Changes to the source website may affect functionality
  • The accuracy and completeness of retrieved data are not guaranteed. Please verify results against official information

License

Apache License 2.0

Third-party dependency:

  • BeautifulSoup (MIT License)

Contribution

Bug reports, feature requests, and pull requests are welcome. Please use GitHub Issues or Pull Requests.

Reference

Page Flow

The pages to be scraped cannot be opened directly by URL. Each page is reached by sending a POST request whose parameters are taken from hidden fields in the preceding page. The one exception is pubs_basic_frame.php, which can be accessed via GET.

In the table below, the content of each page is obtained by requesting file with the given method and payload. The response contains the values needed to request the next page. They appear in the element shown under parameters, which can be located with selector.

| category | file | method | payload | selector | parameters |
| --- | --- | --- | --- | --- | --- |
| notices | sel_pubs.php | POST | {"search_term": "near3", "search_no": "none", "search_pubs_type": "none", "sort_id": "5"} | table.sel_pubs_list > tbody > input | <input type="hidden" name="doc_id" value="15362"> |
| submits | pubs_dispatcher.php | POST | {"head_line": "", "doc_id": "15362"} | table:nth-child(9) > tbody > tr > td.6 > a | <a href="./pubs_basic_frame.php?inst_code=0153&amp;p_id=05&amp;pn=365597&amp;re=0">(別添)</a> |
| subjects | pubs_basic_frame.php | GET | inst_code=0153&p_id=05&pn=365597&re=0 | table:nth-child(12) > tbody > tr > td:nth-child(1) > input[type=submit] | <form method="POST" name="list_form" action="./k_pubstype_04_detail.php" target="_blank"></form><br><input type="submit" name="r_no" value=" 2420-0153-0007 "> |
| accounts | k_pubstype_00_detail.php | POST | {"r_no":"+2420-0153-0007+", "pn": "365597", "r_no": "2420-0153-0007", "p_id": "05", "re": "0", "referer": "0"} | | |
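
The transition from the submits row to the accounts row can be sketched with the sample values listed above. This is an illustration under assumptions, not the library's code: the way the final payload is assembled is inferred from the sample payload, and only the stripped form of r_no is used because the form the server expects is not confirmed.

```python
from html import unescape
from urllib.parse import parse_qsl, urlsplit

# href as it appears in the raw HTML of the pubs_dispatcher.php response.
href = "./pubs_basic_frame.php?inst_code=0153&amp;p_id=05&amp;pn=365597&amp;re=0"

# Decode &amp; before splitting, since the value comes from raw HTML.
parts = urlsplit(unescape(href))
params = dict(parse_qsl(parts.query))

# Value of the submit button found on pubs_basic_frame.php (padded with spaces).
r_no = " 2420-0153-0007 ".strip()

# Assumed payload for the account detail page, modelled on the sample row.
accounts_payload = {
    "r_no": r_no,
    "pn": params["pn"],
    "p_id": params["p_id"],
    "re": params["re"],
    "referer": "0",
}
```

An HTML parser such as BeautifulSoup already decodes entities in attribute values, so the unescape call matters only when working with the raw markup.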


