chakki Financial Report Corpus
chaFiC: chakki Financial Report Corpus
We organized Japanese financial reports to encourage applying NLP techniques to financial analytics.
Dataset
You can download the dataset with the command-line tool.
pip install chafic
To see the usage, pass -- to the command (the CLI is built with fire).
chafic --
Example commands:
# Download the raw file version of the 2014 dataset.
chafic download --kind F --year 2014
# Extract the business.overview_of_result part for TIS Inc. (sec code = 3626).
chafic parse business.overview_of_result --sec_code 3626
# Tokenize the text with Janome (Janome and Sudachi are supported).
pip install janome
chafic tokenize --tokenizer janome
# Show the tokenized result (words are separated by \t).
head -n 5 data/processed/2014/docs/S100552V_business_overview_of_result_tokenized.txt
1 【 業績 等 の 概要 】
( 1 ) 業績
当 連結 会計 年度 における 我が国 経済 は 、 消費 税率 引上げ に 伴う 駆け込み 需要 の 反動 や 海外 景気 動向 に対する 先行き 懸念 等 から 弱い 動き も 見 られ まし た が 、 企業 収益 の 改善 等 により 全体 ...
- For the parts that can be parsed, refer to edinet-python.
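As a minimal sketch of how the tokenized output could be consumed, the following reads a tokenized file into one token list per line and counts token frequencies. The function names `read_tokenized` and `token_counts` are hypothetical helpers, not part of chafic, and the UTF-8 encoding is an assumption.

```python
from collections import Counter
from pathlib import Path


def read_tokenized(path):
    """Return one list of tokens per non-empty line of a tokenized file."""
    sentences = []
    # Assumption: the file is UTF-8 encoded.
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                # Tokens are tab-separated, as stated in the example commands.
                sentences.append(line.split("\t"))
    return sentences


def token_counts(sentences):
    """Count how often each token occurs across all lines."""
    return Counter(token for sentence in sentences for token in sentence)
```

For instance, `token_counts(read_tokenized(path)).most_common(20)` would list the 20 most frequent tokens of one report section.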
Raw dataset file
The corpus is split by fiscal year.
fiscal_year | Raw file version (F) | Text extracted version (E) |
---|---|---|
2014 | .zip (9.3GB) | .zip (269.9MB) |
2015 | .zip (9.8GB) | .zip (291.1MB) |
2016 | .zip (10.2GB) | .zip (334.7MB) |
2017 | .zip (9.1GB) | .zip (309.4MB) |
2018 | .zip (10.5GB) | .zip (260.9MB) |
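To fetch several years in one go, the download command can be wrapped in a small script. This is only a sketch: it uses the `download`, `--kind`, and `--year` options shown in the example commands, while `build_download_command` and `download_all` are hypothetical helper names. It assumes `chafic` is installed and on the PATH.

```python
import subprocess

YEARS = range(2014, 2019)  # fiscal years listed in the table above


def build_download_command(kind, year):
    """Build the argument list for one `chafic download` call."""
    if kind not in ("F", "E"):
        raise ValueError("kind must be 'F' (raw files) or 'E' (extracted text)")
    return ["chafic", "download", "--kind", kind, "--year", str(year)]


def download_all(kind="E", years=YEARS):
    """Run `chafic download` once per year; stops at the first failure."""
    for year in years:
        subprocess.run(build_download_command(kind, year), check=True)
```

Calling `download_all("E")` would request the text extracted version for every year. The raw file version is roughly 9 to 10 GB per year, so it may be preferable to download it one year at a time.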
Statistics
fiscal_year | number_of_reports | has_csr_reports | has_financial_data | has_stock_data |
---|---|---|---|---|
2014 | 3,724 | 92 | 3,583 | 3,595 |
2015 | 3,870 | 96 | 3,725 | 3,751 |
2016 | 4,066 | 97 | 3,924 | 3,941 |
2017 | 3,578 | 89 | 3,441 | 3,472 |
2018 | 3,513 | 70 | 2,893 | 3,413 |
- Financial data is from 決算短信情報 (earnings summary information).
- We use non-consolidated data if it exists.
- Stock data is from 月間相場表(内国株式) (monthly quotation table, domestic stocks). close refers to the fiscal period end, and open to one year before it.
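The counts in the statistics table are easier to compare as shares of the total number of reports. The sketch below copies the table values into a dictionary and computes those shares; `STATS` and `coverage` are illustrative names only.

```python
# fiscal_year: (number_of_reports, has_csr_reports, has_financial_data, has_stock_data)
STATS = {
    2014: (3724, 92, 3583, 3595),
    2015: (3870, 96, 3725, 3751),
    2016: (4066, 97, 3924, 3941),
    2017: (3578, 89, 3441, 3472),
    2018: (3513, 70, 2893, 3413),
}


def coverage(stats):
    """Return, per year, the share of reports that have each kind of extra data."""
    result = {}
    for year, (total, csr, financial, stock) in stats.items():
        result[year] = {
            "csr": csr / total,
            "financial": financial / total,
            "stock": stock / total,
        }
    return result


for year, shares in coverage(STATS).items():
    print(year, {name: f"{share:.1%}" for name, share in shares.items()})
```

With these figures, financial data covers about 96% of reports in 2014 to 2017 but only about 82% in 2018, which is worth keeping in mind when joining text with financial data for that year.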
Content
Raw file version (--kind F)
The dataset is structured as follows.
chakki_esg_financial_{year}.zip
└── {year}
    ├── documents.csv
    └── docs/
docs includes XBRL and PDF files.
- XBRL files of annual reports (retrieved from EDINET).
- PDF files of CSR reports (additional content).
documents.csv contains metadata such as the following.
- edinet_code: E0000X
- filer_name: XXX株式会社
- fiscal_year: 201X
- fiscal_period: FY
- doc_path: docs/S000000X.xbrl
- csr_path: docs/E0000X_201X_JP_36.pdf
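The metadata can be loaded with the standard `csv` module. The sketch below makes several assumptions that the description above does not confirm: documents.csv has a header row using the field names listed, it is UTF-8 encoded, doc_path and csr_path are relative to the {year} directory, and csr_path is empty when a company has no CSR report. `load_documents` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def load_documents(year_dir):
    """Read documents.csv and resolve doc_path / csr_path against year_dir."""
    year_dir = Path(year_dir)
    records = []
    # Assumption: UTF-8 encoding and a header row with the listed field names.
    with (year_dir / "documents.csv").open(encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f):
            # Assumption: paths are relative (e.g. docs/...); empty means no file.
            row["doc_file"] = year_dir / row["doc_path"] if row.get("doc_path") else None
            row["csr_file"] = year_dir / row["csr_path"] if row.get("csr_path") else None
            records.append(row)
    return records
```

A call such as `[r for r in load_documents("2014") if r["csr_file"]]` would then select the filers that also have a CSR report, assuming the archive was unpacked into the current directory.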
Text extracted version (--kind E)
The text extracted version includes txt files that correspond to each part of an annual report.
The extracted parts are defined in edinet-python.
chakki_esg_financial_{year}_extracted.zip
└── {year}
    ├── documents.csv
    └── docs/
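Since each annual report is split into several txt files, it can help to group them by document. The sketch below assumes file names of the form {doc_id}_{part}.txt, inferred only from the tokenized example path (S100552V_business_overview_of_result_tokenized.txt); the actual naming may differ. `group_parts_by_document` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import Path


def group_parts_by_document(docs_dir):
    """Map each document ID to its part files (assumed naming: {doc_id}_{part}.txt)."""
    parts = defaultdict(dict)
    for path in sorted(Path(docs_dir).glob("*.txt")):
        # Split at the first underscore: left is the document ID, right is the part.
        doc_id, sep, part = path.stem.partition("_")
        if sep:  # skip files that do not follow the assumed pattern
            parts[doc_id][part] = path
    return dict(parts)
```

Under that naming assumption, `group_parts_by_document("2014/docs")["S100552V"]` would return a mapping from part names to the corresponding text files.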