
Russian corporate reports 2012-2017



boo

Python client to download annual corporate report data from the Rosstat website.

boo creates a local CSV file with column names, importable as a pandas dataframe.

The dataset contains balance sheet, profit and loss statement and cash flow statement variables.

Install

pip install boo

For development version:

pip install git+https://github.com/ru-corporate/boo.git@master

Usage

from boo import download, build, read_dataframe

download(2012)
build(2012)
df = read_dataframe(2012)
print(df.head())

Data model

CSV files are located in the ~/.boo folder. boo.locate(year) shows exactly where they are.

File name     | Description                                            | Column count | Created by
raw<year>.csv | Original CSV file from Rosstat website. No header row. | 266          | download(year)
<year>.csv    | CSV file with column names in header row.              | 58           | build(year)
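
As a rough sketch of the layout in the table above, the two file paths for a given year could be composed as follows. The helper name expected_paths is hypothetical and not part of boo; it only mirrors the naming scheme described here, and boo.locate(year) remains the authoritative way to find the files.

```python
from pathlib import Path


def expected_paths(year, folder=None):
    """Return (raw, processed) CSV paths for a year.

    Hypothetical helper, not part of boo: it only mirrors the naming
    scheme from the table (raw<year>.csv and <year>.csv in ~/.boo).
    """
    base = Path(folder) if folder is not None else Path.home() / ".boo"
    return base / f"raw{year}.csv", base / f"{year}.csv"


raw_path, csv_path = expected_paths(2012)
print(raw_path.name, csv_path.name)
```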

df = read_dataframe(year) returns the reference ("canonic") dataset. This function makes additional column transformations (e.g. extracts the region from inn) and applies error filters to <year>.csv.
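
To give a feel for the kind of column transformation mentioned above, here is a minimal sketch of region extraction. It assumes that the first two digits of a 10-digit legal-entity INN encode the region, and that leading zeros may have been lost if the column was parsed as an integer. The function name region_from_inn is hypothetical; the actual transformation inside boo may differ.

```python
def region_from_inn(inn):
    """Return the two-digit region code at the start of an INN.

    Illustrative only: assumes a 10-digit legal-entity INN and restores
    leading zeros that are lost when the value was parsed as an integer.
    """
    return str(inn).zfill(10)[:2]


print(region_from_inn("7708234640"))  # 77
print(region_from_inn(105012345))     # 01
```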

Hints

User

  • CSV files are quite big; start with year 2012 to experiment.
  • Use the Google Colab link on the project page to run the package remotely.
  • Use read_dataframe(year) to read the canonic dataset.

Developer

  • boo.path.default_data_folder shows where the CSV files are on a computer.
  • boo.columns controls CSV column selection and naming.
  • boo.dataframe.canonic makes the canonic CSV. By coincidence the output has the same number of columns as <year>.csv, but the columns are slightly different, as some columns are added and some are removed.
  • boo.year.TIMESTAMPS helps to find the proper URLs, which change along with Rosstat website updates.
  • A new annual dataset is released around September-October.

Script

Rosstat publishes CSV files without column headers. When preparing a readable CSV file we assign a name to columns with variables of interest and cut away the rest of the columns.

This way we get a much smaller file (~50% of the size) which we can read and manipulate with pandas or R.

For illustration, the batch script below creates a 2012.csv file with column names.

set url=http://www.gks.ru/opendata/storage/7708234640-bdboo2012/data-20190329t000000-structure-20121231t000000.csv
set index=1,2,3,4,5,6,7,8,17,18,27,28,37,38,41,42,43,44,57,58,59,60,67,68,69,70,79,80,81,82,83,84,93,94,99,100,105,106,117,118,204,205,209,210,211,212,213,214,215,216,222,223,228,229,235,240,241,266 
set colnames=name,okpo,okopf,okfs,okved,inn,unit,report_type,of,of_lag,ta_fix,ta_fix_lag,cash,cash_lag,ta_nonfix,ta_nonfix_lag,ta,ta_lag,tp_capital,tp_capital_lag,debt_long,debt_long_lag,tp_long,tp_long_lag,debt_short,debt_short_lag,tp_short,tp_short_lag,tp,tp_lag,sales,sales_lag,profit_oper,profit_oper_lag,exp_interest,exp_interest_lag,profit_before_tax,profit_before_tax_lag,profit_after_tax,profit_after_tax_lag,cf_oper_in,cf_oper_in_sales,cf_oper_out,paid_to_supplier,paid_to_worker,paid_interest,paid_profit_tax,paid_other_costs,cf_oper,cf_inv_in,cf_inv_out,paid_fa_investment,cf_inv,cf_fin_in,cf_fin_out,cf_fin,cf,date_published

curl %url% > raw2012.csv

echo %colnames% > 2012.csv
cat raw2012.csv | csvcut -d; -e ansi -c%index%  | iconv -f cp1251 -t utf-8 >> 2012.csv

csvclean 2012.csv

Note: this is a Windows batch file, but it relies on GNU utilities (e.g. via Cygwin, MinGW or GOW) and csvkit. A similar script can be adapted for pure Linux/bash. The Google Colab version allows mixing Python and shell code, similar to f-strings.

The result of the batch file is similar to running:

from boo import download, build
download(2012)
build(2012)
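
For readers without csvkit, the column-cutting step of the batch script can also be sketched with the Python standard library alone. This is an illustration of the idea, not the code boo runs internally: the function name cut_columns and the tiny input are made up, and the column positions in the sample do not match the real Rosstat layout.

```python
import csv
import io


def cut_columns(raw_lines, index, colnames, delimiter=";"):
    """Yield a header row, then the selected columns of each raw row.

    `index` holds 1-based column positions, as in the batch script.
    """
    if len(index) != len(colnames):
        raise ValueError("index and colnames must have the same length")
    yield list(colnames)
    for row in csv.reader(raw_lines, delimiter=delimiter):
        yield [row[i - 1] for i in index]


# Tiny made-up input standing in for raw<year>.csv (not real Rosstat data).
raw = io.StringIO("Firm A;001;x;7700000001\nFirm B;002;y;5000000002\n")
rows = list(cut_columns(raw, index=[1, 2, 4], colnames=["name", "okpo", "inn"]))
print(rows)
```

With a real raw file, the input would presumably be opened with encoding="cp1251" (the encoding the batch script converts from) and the rows written out with csv.writer.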

Limitations

  • No timeseries: we can access cross-section of all data by year, but not several years of data by each firm.
  • No database: we store files as plain CSV, not in a database.
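
Because each year is a separate cross-section, a firm-level series has to be assembled by the user. One possible workaround is sketched below, assuming that inn identifies the same firm across years. That assumption may not always hold, since firms appear, disappear or may file more than one report. The function stack_years and the sample rows are hypothetical and not part of boo.

```python
from collections import defaultdict


def stack_years(rows_by_year, key="inn"):
    """Group yearly cross-sections into {firm key: {year: row}}."""
    panel = defaultdict(dict)
    for year, rows in rows_by_year.items():
        for row in rows:
            # A later duplicate key within the same year overwrites the earlier row.
            panel[row[key]][year] = row
    return dict(panel)


# Made-up rows for illustration; real rows would come from <year>.csv.
rows_by_year = {
    2012: [{"inn": "7700000001", "sales": 100}, {"inn": "5000000002", "sales": 40}],
    2013: [{"inn": "7700000001", "sales": 120}],
}
panel = stack_years(rows_by_year)
print(sorted(panel["7700000001"]))  # [2012, 2013]
```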

Contributors

The package is maintained by Evgeniy Pogrebnyak.

Special thanks to Daniil Chizhevskij for PyPI collaboration. Without his support pip install boo would not be possible.
