A tool to scrape financial reports from TWSE
TW FinReport Scraper
An automated tool for scraping financial reports (Annual Reports, Quarterly Reports, and Shareholder Meeting documents) from the Taiwan Stock Exchange (TWSE) Market Observation Post System.
Features
- Automated Scraping: Supports automatic downloading of annual reports, quarterly reports, and shareholder meeting documents.
- Flexible Configuration: Specify years, quarters, and specific stock codes.
- Rate Limit Handling: Built-in detection and automatic retry mechanism to avoid being blocked by TWSE.
- Playwright Driven: Uses Playwright to simulate browser behavior for high stability.
Installation
1. Install Python Package
pip install tw-finreport-scraper
2. Install Playwright Browser
playwright install chromium
Usage
Library Usage
You can reference this package directly in your Python code.
from tw_finreport_scraper import run_scraper
# --- Scraper Settings ---
TARGET_YEAR = "114" # Minguo year
STOCK_CODES = ["2330"] # List of stock codes or ["ALL"]
TARGET_QUARTERS = [1, 2, 3]
DEFAULT_TYPE = "all" # "annual", "quarterly", or "all"
SLEEP_RANGE = (10.0, 15.0)
COOL_DOWN_BASE = 15
MAX_RETRIES = 10
if __name__ == "__main__":
    run_scraper(
        type=DEFAULT_TYPE,
        year=TARGET_YEAR,
        codes=STOCK_CODES,
        quarters=TARGET_QUARTERS,
        cooldown=COOL_DOWN_BASE,
        retries=MAX_RETRIES,
        sleep_range=SLEEP_RANGE,
        base_path="./data",  # Output directory
    )
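The year argument uses the Minguo (Republic of China) calendar, which is the Gregorian year minus 1911. The helper names below are hypothetical and not part of the package; this is only a small sketch for converting between the two calendars before calling run_scraper.

```python
MINGUO_OFFSET = 1911  # Minguo year 1 corresponds to 1912 CE


def to_minguo_year(gregorian_year: int) -> str:
    """Convert a Gregorian year to a Minguo year string, e.g. 2025 -> "114".

    Hypothetical helper: the string return type mirrors the example
    configuration above, which passes year as "114".
    """
    if gregorian_year <= MINGUO_OFFSET:
        raise ValueError("The Minguo calendar starts in 1912")
    return str(gregorian_year - MINGUO_OFFSET)


def to_gregorian_year(minguo_year: str) -> int:
    """Convert a Minguo year string back to a Gregorian year."""
    return int(minguo_year) + MINGUO_OFFSET


if __name__ == "__main__":
    print(to_minguo_year(2025))      # 114
    print(to_gregorian_year("114"))  # 2025
```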
Function Arguments (run_scraper)
- type: Scraping type. Default: "annual".
  - "annual": Annual reports and related documents (e.g., meeting minutes, top 10 shareholders).
  - "quarterly": Quarterly financial reports (AI1).
  - "all": Both annual and quarterly reports.
- year: Minguo year (e.g., 114).
- codes: Stock codes.
  - A list of strings, e.g., ["2330", "2498"].
  - Use ["ALL"] to automatically fetch all listed stocks.
- quarters: Quarters (1, 2, 3, 4).
  - Only valid when type is "quarterly" or "all".
- cooldown: Rate limit cooldown. Base seconds to wait when rate limited. Default is 15s.
- retries: Max retries. Maximum attempts for rate limiting or "file processing" states. Default is 10.
- base_path: Output root directory. The tool will create a twse_output folder inside this path.
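To illustrate how cooldown and retries might interact, here is a minimal sketch of a retry loop with a growing cooldown. It is not the package's actual implementation: the RateLimited exception, the retry_with_cooldown function, and the linear growth of the wait (cooldown multiplied by the attempt number) are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical signal that the server asked the client to slow down."""


def retry_with_cooldown(
    fetch: Callable[[], T],
    cooldown: float = 15,
    retries: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or the attempts are exhausted.

    Assumption: the wait grows linearly with the attempt number, so the
    "base seconds" are multiplied by 1, 2, 3, ... on successive failures.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except RateLimited:
            if attempt == retries:
                raise  # out of attempts: surface the error to the caller
            sleep(cooldown * attempt)
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the sketch testable without real waiting; a caller would normally leave it at the default time.sleep.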
Efficiency and Design Philosophy
1. Execution Time Estimate
Based on current TWSE rate limits, scraping all listed stocks for a full year's data takes approximately 2 to 4 days.
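As a rough sanity check on that estimate, the sketch below computes how many paced requests fit into a 2 to 4 day window. It assumes that SLEEP_RANGE is the pause between consecutive requests and that the pause averages the midpoint of the range; it ignores download time and rate-limit cooldowns, so the real request budget would be lower. The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_in_window(
    days: float, sleep_range: tuple[float, float] = (10.0, 15.0)
) -> int:
    """Number of requests that fit in `days` at the mean pause length.

    Assumption: one pause per request, uniformly drawn from sleep_range,
    so the expected pause is the midpoint of the range.
    """
    low, high = sleep_range
    mean_pause = (low + high) / 2
    return int(days * SECONDS_PER_DAY / mean_pause)


if __name__ == "__main__":
    for days in (2, 4):
        print(days, "days:", requests_in_window(days), "requests")
```

With the default range of 10 to 15 seconds, this gives 13,824 requests in 2 days and 27,648 in 4 days.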
2. Regarding Proxy Support
This tool does not include Proxy support for the following reasons:
- Low Frequency: Financial reports are updated infrequently (4 times a year for quarterly, once for annual).
- No Urgency: Stability is prioritized over speed.
3. Environment Recommendation (No Jupyter)
Running this tool in Jupyter Notebook is strongly discouraged because Playwright's async mechanism conflicts with Jupyter's event loop. Use a standard Python script (.py) instead.
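One way to fail early in such an environment is to check whether an asyncio event loop is already running before starting the scraper, since Jupyter kernels typically execute cells inside a running loop. The sketch below uses only the standard library; the function names are hypothetical and the package itself may not perform any such check.

```python
import asyncio


def running_inside_event_loop() -> bool:
    """Return True if the caller is executing inside a running asyncio loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


def ensure_plain_script() -> None:
    """Raise a clear error when called from an environment such as Jupyter."""
    if running_inside_event_loop():
        raise RuntimeError(
            "An asyncio event loop is already running (for example in Jupyter). "
            "Run the scraper from a standard .py script instead."
        )
```

Calling ensure_plain_script() at the top of the script, before run_scraper, would turn a confusing event-loop error into an explicit message.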
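TW FinReport Scraper (Taiwan Stock Exchange Financial Report Scraper)
A tool for automatically scraping financial reports (annual reports, quarterly reports, and shareholder meeting documents) of listed and OTC companies from the Taiwan Stock Exchange (TWSE) Market Observation Post System.
Features
- Automated Scraping: Supports automatic downloading of annual reports, quarterly reports, and shareholder meeting documents.
- Flexible Configuration: Specify years, quarters, and specific stock codes.
- Rate Limit Handling: Built-in rate limit detection and automatic retry mechanism to avoid being blocked by TWSE.
- Playwright Driven: Uses Playwright to simulate browser behavior for high stability.
Installation
1. Install Python Package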
pip install tw-finreport-scraper
2. Install Playwright Browser
playwright install chromium
Usage
Using as a Python Package (Library)
You can import this package directly in your Python code.
from tw_finreport_scraper import run_scraper
# --- Scraper Settings ---
TARGET_YEAR = "114" # Minguo year
STOCK_CODES = ["2330"] # List of stock codes or ["ALL"]
TARGET_QUARTERS = [1, 2, 3]
DEFAULT_TYPE = "all" # "annual", "quarterly", or "all"
SLEEP_RANGE = (10.0, 15.0)
COOL_DOWN_BASE = 15
MAX_RETRIES = 10
if __name__ == "__main__":
    run_scraper(
        type=DEFAULT_TYPE,
        year=TARGET_YEAR,
        codes=STOCK_CODES,
        quarters=TARGET_QUARTERS,
        cooldown=COOL_DOWN_BASE,
        retries=MAX_RETRIES,
        sleep_range=SLEEP_RANGE,
        base_path="./data",  # Output directory
    )
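Function Arguments (run_scraper)
- type: Scraping type. Default: "annual".
  - "annual": Only annual reports and shareholder meeting documents (e.g., meeting minutes, the relationship table of the top ten shareholders).
  - "quarterly": Only financial reports (quarterly reports).
  - "all": Both annual and quarterly reports.
- year: Minguo year (e.g., 114).
- codes: Stock codes.
  - Pass a list of strings, e.g., ["2330", "2498"].
  - Pass ["ALL"] to automatically fetch the list of all listed and OTC stocks in Taiwan and scrape them.
- quarters: Quarters (1, 2, 3, 4).
  - Only effective when type is "quarterly" or "all".
- cooldown: Rate limit cooldown.
  - Base number of seconds the program waits each time rate limiting is detected. Default is 15 seconds.
- retries: Max retries.
  - Number of retries for rate limiting or "file processing" states. Default is 10.
- base_path: Output root directory.
  - The program creates a twse_output folder under this directory.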
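Efficiency and Design Philosophy
1. Execution Time Estimate
Given the request frequency and rate limiting currently enforced by TWSE, scraping a full year of data for all listed and OTC stocks in Taiwan takes approximately 2 to 4 days.
2. Regarding Proxy Support
This tool does not include built-in proxy support, mainly for the following reasons:
- Low-Frequency Demand: Financial reports are updated infrequently and are not time-critical.
- No Urgency: Stable scraping is enough for most analysis needs.
3. Environment Recommendation (Jupyter Not Recommended)
Running this tool in Jupyter Notebook is strongly discouraged.
- Playwright's async mechanism frequently conflicts with Jupyter's event loop.
- Use a standard Python script (.py) and run it from the terminal.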
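Project Structure
- tw_finreport_scraper/: Core package directory.
  - main.py: Main program logic.
  - annual.py: Annual report scraping.
  - quarterly.py: Quarterly report scraping.
  - stocks.py: Fetches the stock code list.
  - common.py: Shared utility functions.
Disclaimer
This tool is for academic research and personal use only. Please comply with the stock exchange's terms of use.
License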
MIT License
Download files
File details
Details for the file tw_finreport_scraper-0.1.1.tar.gz.
File metadata
- Download URL: tw_finreport_scraper-0.1.1.tar.gz
- Upload date:
- Size: 13.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c3af87bfa4d5252cd87b3b35a1d7a7e25510a20d40050f1957d072b08c616577 |
| MD5 | 349b467b0d2fb03ccea5be12de40948a |
| BLAKE2b-256 | 2411e3a579df402f8a6932cf2650491ad500bcd3925eed212117656e0b39f15e |
File details
Details for the file tw_finreport_scraper-0.1.1-py3-none-any.whl.
File metadata
- Download URL: tw_finreport_scraper-0.1.1-py3-none-any.whl
- Upload date:
- Size: 35.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 43974328995767bd32f4820495e9dae689645c806b2995af93d40e0c7e56801f |
| MD5 | e879fff65cd6b4e32969409e8acb7564 |
| BLAKE2b-256 | 46b242eee641bdedb4a1393a009f5733d6d8088b58544daa3f003dcbf5e536e4 |